Re: [Qemu-devel] [PATCH v3 5/6] target-i386: Don't enable nested VMX by default
Here I'm less certain what the best approach is. As you point out, there's an inconsistency that I agree should be fixed. I wonder however whether an approach similar to 3/6 for KVM only would be better? I.e., have VMX as a sometimes-KVM-supported feature be listed in the model and filter it out for accel=kvm so that -cpu enforce works, but let accel=tcg fail with features not implemented.

This would mean that -cpu coreduo,enforce doesn't work on TCG, but -cpu Nehalem,enforce works. This does not make much sense to me. In fact, I would even omit the x86_cpu_compat_set_features altogether. The inclusion of vmx in these models was a mistake, and nested VMX is not really useful with anything but -cpu host because there are too many capabilities communicated via MSRs rather than CPUID.

Paolo
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
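The asymmetry under discussion - filter unsupported features for accel=kvm so that `enforce` compares against the filtered set, but fail hard for accel=tcg - can be sketched abstractly. This is a toy model, not QEMU's actual code: the feature bits and the `effective_features()` helper are invented for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical feature bits -- not QEMU's real encoding. */
#define FEAT_SSE3 (1u << 0)
#define FEAT_VMX  (1u << 1)

enum accel { ACCEL_KVM, ACCEL_TCG };

/*
 * Toy model of the policy discussed above: with accel=kvm, model
 * features the kernel can't virtualize are silently filtered out;
 * with accel=tcg, a model feature TCG doesn't implement is fatal.
 */
static bool effective_features(uint32_t model, uint32_t supported,
                               enum accel a, uint32_t *out)
{
    if (a == ACCEL_KVM) {
        *out = model & supported;     /* drop what KVM can't provide */
        return true;
    }
    if (model & ~supported)
        return false;                 /* TCG: unimplemented feature fails */
    *out = model;
    return true;
}
```

Under this model, a host without nested VMX support silently drops FEAT_VMX for accel=kvm, while accel=tcg rejects the same CPU model outright - which is exactly the inconsistency the thread is weighing.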
Re: [Qemu-devel] [PATCH RFC 07/11] dataplane: allow virtio-1 devices
On Tue, 28 Oct 2014 16:22:54 +0100 Greg Kurz gk...@linux.vnet.ibm.com wrote:

On Tue, 7 Oct 2014 16:40:03 +0200 Cornelia Huck cornelia.h...@de.ibm.com wrote:

Handle endianness conversion for virtio-1 virtqueues correctly.

Note that dataplane now needs to be built per-target.

It also affects hw/virtio/virtio-pci.c:

In file included from include/hw/virtio/dataplane/vring.h:23:0,
                 from include/hw/virtio/virtio-scsi.h:21,
                 from hw/virtio/virtio-pci.c:24:
include/hw/virtio/virtio-access.h: In function ‘virtio_access_is_big_endian’:
include/hw/virtio/virtio-access.h:28:15: error: attempt to use poisoned TARGET_WORDS_BIGENDIAN
 #elif defined(TARGET_WORDS_BIGENDIAN)
               ^

FWIW when I added endian ambivalent support to virtio, I remember *some people* getting angry at the idea of turning common code into per-target... :)

Well, it probably can't be helped for something that is endian-sensitive like virtio :( (Although we should try to keep it as local as possible.)

See comment below.

Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 hw/block/dataplane/virtio-blk.c     |  3 +-
 hw/scsi/virtio-scsi-dataplane.c     |  2 +-
 hw/virtio/Makefile.objs             |  2 +-
 hw/virtio/dataplane/Makefile.objs   |  2 +-
 hw/virtio/dataplane/vring.c         | 85 +++
 include/hw/virtio/dataplane/vring.h | 64 --
 6 files changed, 113 insertions(+), 45 deletions(-)

diff --git a/include/hw/virtio/dataplane/vring.h b/include/hw/virtio/dataplane/vring.h
index d3e086a..fde15f3 100644
--- a/
+++ b/include/hw/virtio/dataplane/vring.h
@@ -20,6 +20,7 @@
 #include "qemu-common.h"
 #include "hw/virtio/virtio_ring.h"
 #include "hw/virtio/virtio.h"
+#include "hw/virtio/virtio-access.h"

Since the following commit:

commit 244e2898b7a7735b3da114c120abe206af56a167
Author: Fam Zheng f...@redhat.com
Date:   Wed Sep 24 15:21:41 2014 +0800

    virtio-scsi: Add VirtIOSCSIVring in VirtIOSCSIReq

The include/hw/virtio/dataplane/vring.h header is indirectly included by hw/virtio/virtio-pci.c. Why don't you move all these target-dependent helpers to another header?
Ah, this seems to have come in after I hacked on that code - I'll take a look at splitting off the accessors.
Re: Benchmarking for vhost polling patch
Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in micro seconds, and carried out experiments using varying sizes of this value. The setup for netperf consisted of 1 vm and 1 vhost, each running on their own dedicated core.

Could you provide your changing code?

Thanks, Zhang Haoyu
Re: [PATCH 00/17] RFC: userfault v2
On 2014/10/30 1:46, Andrea Arcangeli wrote:

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

Hi Andrea, Thanks for your hard work on userfault;) This is really a useful API.

I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page, or both trigger userfault. I think this will help support vhost-scsi, ivshmem for migration; we can trace dirty pages in userspace. Actually, I'm trying to realize live memory snapshot based on pre-copy and userfault, but reading memory from the migration thread will also trigger userfault. It will be easy to implement live memory snapshot if we support configuring userfault for writing memory only.

Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthy feature to have here.

After some chat during the KVM Forum I've been already thinking it could be beneficial for some usage to give userland the information about the fault being read or write, combined with the ability of mapping pages wrprotected to mcopy_atomic (that would work without false positives only with MADV_DONTFORK also set, but it's already set in qemu). That will require vma->vm_flags VM_USERFAULT to be checked also in the wrprotect faults, not just in the not present faults, but it's not a massive change. Returning the read/write information is also not a massive change. This will then pay off mostly if there's also a way to remove the memory atomically (kind of remap_anon_pages).

Would that be enough? I mean, are you still ok if non present read faults trap too (you'd be notified it's a read) and you get notification for both wrprotect and non present faults?

Hi Andrea, Thanks for your reply, and your patience;) Er, maybe I didn't describe it clearly.
What I really need for live memory snapshot is only the wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write actions*.

My initial solution scheme for live memory snapshot is:
(1) pause VM
(2) using userfaultfd to mark all memory of VM as wrprotect (readonly)
(3) save device state to snapshot file
(4) resume VM
(5) snapshot thread begins to save pages of memory to snapshot file
(6) VM is going to run, and it is OK for VM or other threads to read ram (no fault trap), but if VM tries to write a page (dirty the page), there will be a userfault trap notification.
(7) a fault-handle-thread reads the page request from userfaultfd; it will copy the content of the page to some buffers, and then remove the page's wrprotect limit (still using the userfaultfd to tell kernel).
(8) after step (7), VM can continue to write the page, which can now be written.
(9) snapshot thread saves the page cached in step (7)
(10) repeat steps (5)~(9) until all of the VM's memory is saved to the snapshot file.

So, what I need from userfault is support for only the wrprotect fault. I don't want to get notifications for non present reading faults; it would influence the VM's performance and the efficiency of doing the snapshot. Also, I think this feature will benefit migration of ivshmem and vhost-scsi, which have no dirty-page tracing now.

The question then is how you mark the memory readonly to let the wrprotect faults trap if the memory already existed and you didn't map it yourself in the guest with mcopy_atomic with a readonly flag.
My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the fast path check in the not-present and wrprotect page fault
- if VM_USERFAULT is set, find if there's a userfaultfd registered into that vma too; if yes engage userfaultfd protocol, otherwise raise SIGBUS (single threaded apps should be fine with SIGBUS and it'll avoid them to spawn a thread in order to talk the userfaultfd protocol)
- if userfaultfd protocol is engaged, return read|write fault + fault address to read(ufd) syscalls
- leave the userfault resolution mechanism independent of the userfaultfd protocol so we keep the two problems separated and we don't mix them in the same API which makes it even harder to finalize it.
- add mcopy_atomic (with a flag to map the page readonly too)

The alternative would be to hide mcopy_atomic (and even remap_anon_pages in order to remove the memory atomically for the externalization into the cloud) as userfaultfd commands to write into the fd. But then there would be no much point to keep MADV_USERFAULT around if I do so and I could just remove it too, or it doesn't look clean having to open the userfaultfd just to issue an hidden mcopy_atomic.

So it becomes a decision if the basic SIGBUS mode for single threaded apps should be supported or not. As long as we support SIGBUS too and we don't force to use userfaultfd as the only mechanism to be notified about userfaults, having a separate
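The 10-step wrprotect-based snapshot scheme described earlier in this thread can be modeled entirely in userspace to show the copy-before-write invariant it relies on: the snapshot always captures each page's content as of the moment the VM was paused, even while the guest keeps dirtying memory. Everything below (toy page size, `handle_wrprotect_fault`, the flag arrays) is an invented stand-in for the real userfaultfd machinery, which did not yet exist in this form at the time of the thread.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define PAGE   16                 /* toy page size */
#define NPAGES 4

static char ram[NPAGES][PAGE];    /* "guest" memory */
static char snap[NPAGES][PAGE];   /* snapshot destination buffers */
static bool wrprot[NPAGES];       /* per-page write protection */
static bool saved[NPAGES];

/* Steps (1)-(4): pause, write-protect everything, resume. */
static void snapshot_start(void)
{
    memset(saved, false, sizeof saved);
    memset(wrprot, true, sizeof wrprot);
}

/* Step (7): on a write fault, copy the old page, then lift protection. */
static void handle_wrprotect_fault(int pfn)
{
    memcpy(snap[pfn], ram[pfn], PAGE);
    saved[pfn] = true;
    wrprot[pfn] = false;          /* step (8): writes may now proceed */
}

/* Guest write path: only writes to protected pages trap; reads never do. */
static void guest_write(int pfn, int off, char c)
{
    if (wrprot[pfn])
        handle_wrprotect_fault(pfn);
    ram[pfn][off] = c;
}

/* Steps (5)/(9)/(10): background thread saves the remaining clean pages. */
static void snapshot_pass(void)
{
    for (int i = 0; i < NPAGES; i++)
        if (!saved[i])
            handle_wrprotect_fault(i);
}
```

A guest write that races ahead of the snapshot thread still leaves the pre-pause content in `snap`, which is the property the wrprotect-only userfault mode is meant to provide.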
Re: [PATCH v12 1/6] KVM: Add architecture-defined TLB flush support
On Wed, 22 Oct 2014 15:34:06 -0700 Mario Smarduch m.smard...@samsung.com wrote:

This patch adds support for architecture implemented VM TLB flush, currently ARMv7 defines HAVE_KVM_ARCH_TLB_FLUSH_ALL. This leaves other architectures unaffected using the generic version. In subsequent patch ARMv7 defines HAVE_KVM_ARCH_TLB_FLUSH_ALL and it's own TLB flush interface.

Can you reword this a bit? Allow architectures to override the generic kvm_flush_remote_tlbs() function via HAVE_KVM_ARCH_TLB_FLUSH_ALL. ARMv7 will need this to provide its own TLB flush interface. ?

Signed-off-by: Mario Smarduch m.smard...@samsung.com
---
 virt/kvm/Kconfig    | 3 +++
 virt/kvm/kvm_main.c | 2 ++
 2 files changed, 5 insertions(+)

Providing an override for the special cases looks sane to me.
Re: [PATCH v12 0/6] arm/KVM: dirty page logging support for ARMv7 (3.17.0-rc1)
Am 23.10.2014 00:34, schrieb Mario Smarduch:

This patch series introduces dirty page logging for ARMv7 and adds some degree of generic dirty logging support for x86, armv7 and later armv8.

I implemented Alex's suggestion after he took a look at the patches at KVM Forum to simplify the generic/arch split - leaving mips, powerpc, s390, (ia64 although broken) unchanged. x86/armv7 now share some dirty logging code. armv8 dirty log patches have been posted and tested but for the time being armv8 is non-generic as well.

I briefly spoke to most of you at KVM Forum, and this is the patch series I was referring to. Implementation changed from the previous version (patches 1, 2); those who acked the previous revision, please review again. The last 4 patches (ARM) have been rebased for a newer kernel, with no significant changes.

Testing:
- Generally live migration + checksumming of source/destination memory regions is used to validate correctness.
- qemu machvirt, VExpress - Exynos 5440, FastModels - lmbench + dirty guest memory cycling.
- ARMv8 Foundation Model/kvmtool - Due to slight overlap in 2nd stage handlers did a basic bringup using qemu.
- x86_64 qemu default machine model, tested migration on HP Z620, tested convergence for several dirty page rates

See https://github.com/mjsmar/arm-dirtylog-tests
- Dirtlogtest-setup.pdf for ARMv7
- https://github.com/mjsmar/arm-dirtylog-tests/tree/master/v7 - README

The patch affects armv7, armv8, mips, ia64, powerpc, s390, x86_64. Patch series has been compiled for affected architectures:
- x86_64 - defconfig
- ia64 - ia64-linux-gcc4.6.3 - defconfig; ia64 Kconfig defines BROKEN, worked around that to make sure new changes don't break the build. Eventually the build breaks due to other reasons.
- mips - mips64-linux-gcc4.6.3 - malta_kvm_defconfig
- ppc - powerpc64-linux-gcc4.6.3 - pseries_defconfig
- s390 - s390x-linux-gcc4.6.3 - defconfig
- armv8 - aarch64-linux-gnu-gcc4.8.1 - defconfig

ARMv7 dirty page logging implementation overview:
- initially write protects VM RAM memory region - 2nd stage page tables
- add support to read dirty page log and again write protect the dirty pages - second stage page table for next pass.
- second stage huge pages are dissolved into small page tables to keep track of dirty pages at page granularity. Tracking at huge page granularity limits migration to an almost idle system. Small page size logging supports higher memory dirty rates.
- In the event migration is canceled, normal behavior is resumed and huge pages are rebuilt over time.

Changes since v11:
- Implemented Alex's comments to simplify generic layer.

Changes since v10:
- addressed wanghaibin comments
- addressed Christoffers comments

Changes since v9:
- Split patches into generic and architecture specific variants for TLB Flushing and dirty log read (patches 1,2 3,4,5,6)
- rebased to 3.16.0-rc1
- Applied Christoffers comments.
Mario Smarduch (6):
  KVM: Add architecture-defined TLB flush support
  KVM: Add generic support for dirty page logging
  arm: KVM: Add ARMv7 API to flush TLBs
  arm: KVM: Add initial dirty page locking infrastructure
  arm: KVM: dirty log read write protect support
  arm: KVM: ARMv7 dirty page logging 2nd stage page fault

 arch/arm/include/asm/kvm_asm.h        |   1 +
 arch/arm/include/asm/kvm_host.h       |  14 +++
 arch/arm/include/asm/kvm_mmu.h        |  20
 arch/arm/include/asm/pgtable-3level.h |   1 +
 arch/arm/kvm/Kconfig                  |   2 +
 arch/arm/kvm/Makefile                 |   1 +
 arch/arm/kvm/arm.c                    |   2 +
 arch/arm/kvm/interrupts.S             |  11 ++
 arch/arm/kvm/mmu.c                    | 209 +++--
 arch/x86/include/asm/kvm_host.h       |   3 -
 arch/x86/kvm/Kconfig                  |   1 +
 arch/x86/kvm/Makefile                 |   1 +
 arch/x86/kvm/x86.c                    |  86 --
 include/linux/kvm_host.h              |   4 +
 virt/kvm/Kconfig                      |   6 +
 virt/kvm/dirtylog.c                   | 112 ++
 virt/kvm/kvm_main.c                   |   2 +
 17 files changed, 380 insertions(+), 96 deletions(-)
 create mode 100644 virt/kvm/dirtylog.c

Patches 1-3 seem to work fine on s390. The other patches are arm-only (well, can't find 5 and 6) so I guess it's ok for s390.
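The write-protect / log-read cycle described in the cover letter's overview (write-protect everything up front, record write faults, re-protect the dirty pages on each log read) can be sketched with a toy bitmap. Names like `get_dirty_log()` are illustrative stand-ins for the real KVM_GET_DIRTY_LOG path, and huge-page dissolution is omitted.

```c
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define NPAGES 8

static bool wp[NPAGES];     /* per-page stage-2 write protection */
static unsigned dirty;      /* dirty bitmap, bit n == page n */

/* Initial state for a migration: everything write-protected. */
static void write_protect_all(void)
{
    memset(wp, true, sizeof wp);
}

/* A write to a protected page faults exactly once: log it and drop
 * protection, so further writes in this pass are free. */
static void guest_write(int pfn)
{
    if (wp[pfn]) {
        dirty |= 1u << pfn;
        wp[pfn] = false;
    }
}

/* Log read: hand the bitmap to the caller and re-protect the dirty
 * pages so the next migration pass sees fresh faults. */
static unsigned get_dirty_log(void)
{
    unsigned log = dirty;
    dirty = 0;
    for (int i = 0; i < NPAGES; i++)
        if (log & (1u << i))
            wp[i] = true;
    return log;
}
```

The re-protection inside `get_dirty_log()` is what makes each migration pass see only the pages dirtied since the previous pass.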
Re: Benchmarking for vhost polling patch
Zhang Haoyu zhan...@sangfor.com wrote on 30/10/2014 01:30:08 PM:

From: Zhang Haoyu zhan...@sangfor.com
To: Razya Ladelsky/Haifa/IBM@IBMIL, mst m...@redhat.com
Cc: Razya Ladelsky/Haifa/IBM@IBMIL, kvm kvm@vger.kernel.org
Date: 30/10/2014 01:30 PM
Subject: Re: Benchmarking for vhost polling patch

Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in micro seconds, and carried out experiments using varying sizes of this value. The setup for netperf consisted of 1 vm and 1 vhost, each running on their own dedicated core.

Could you provide your changing code?

Thanks, Zhang Haoyu

Hi Zhang, Do you mean the change in code for poll_stop_idle?

Thanks, Razya
Re: [PATCH v12 2/6] KVM: Add generic support for dirty page logging
On Wed, 22 Oct 2014 15:34:07 -0700 Mario Smarduch m.smard...@samsung.com wrote:

This patch defines KVM_GENERIC_DIRTYLOG, and moves dirty log read function to it's own file virt/kvm/dirtylog.c. x86 is updated to use the generic dirty log interface, selecting KVM_GENERIC_DIRTYLOG in its Kconfig and makefile. No other architectures are affected, each uses it's own version. This changed from previous patch revision where non-generic architectures were modified. In subsequent patch armv7 does samething. All other architectures continue use architecture defined version.

Hm. The x86 specific version of dirty page logging is generic enough to be used by other architectures, notably ARMv7. So let's move the x86 code under virt/kvm/ and make it depend on KVM_GENERIC_DIRTYLOG. Other architectures continue to use their own implementations. ?

Signed-off-by: Mario Smarduch m.smard...@samsung.com
---
 arch/x86/include/asm/kvm_host.h |   3 --
 arch/x86/kvm/Kconfig            |   1 +
 arch/x86/kvm/Makefile           |   1 +
 arch/x86/kvm/x86.c              |  86 --
 include/linux/kvm_host.h        |   4 ++
 virt/kvm/Kconfig                |   3 ++
 virt/kvm/dirtylog.c             | 112 +++
 7 files changed, 121 insertions(+), 89 deletions(-)
 create mode 100644 virt/kvm/dirtylog.c

diff --git a/virt/kvm/dirtylog.c b/virt/kvm/dirtylog.c
new file mode 100644
index 000..67a
--- /dev/null
+++ b/virt/kvm/dirtylog.c
@@ -0,0 +1,112 @@
+/*
+ * kvm generic dirty logging support, used by architectures that share
+ * comman dirty page logging implementation.

s/comman/common/

The approach looks sane to me, especially as it does not change other architectures needlessly.
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On 29/10/14 05:19, Andy Lutomirski wrote:

CPUID leaf 4F02H: miscellaneous features [...]

### CommonHV RNG

If CPUID.4F02H.EAX is nonzero, then it contains an MSR index used to communicate with a hypervisor random number generator. This MSR is referred to as MSR_COMMONHV_RNG. rdmsr(MSR_COMMONHV_RNG) returns a 64-bit best-effort random number. If the hypervisor is able to generate a 64-bit cryptographically secure random number, it SHOULD return it. If not, then the hypervisor SHOULD do its best to return a random number suitable for seeding a cryptographic RNG. A guest is expected to read MSR_COMMONHV_RNG several times in a row. The hypervisor SHOULD return different values each time. rdmsr(MSR_COMMONHV_RNG) MUST NOT result in an exception, but guests MUST NOT assume that its return value is indeed secure. For example, a hypervisor is free to return zero in response to rdmsr(MSR_COMMONHV_RNG).

I would add: If the hypervisor's pool of random data is exhausted, it MAY return 0. The hypervisor MUST provide at least 4 (?) non-zero numbers to each guest.

Xen does not have a continual source of entropy and the only feasible way is for the toolstack to provide each guest with a fixed size pool of random data during guest creation. The fixed size pool could be refilled by the guest if further random data is needed (e.g., before an in-guest kexec).

wrmsr(MSR_COMMONHV_RNG) offers the hypervisor up to 64 bits of entropy. The hypervisor MAY use it as it sees fit to improve its own random number generator. A hypervisor SHOULD make a reasonable effort to avoid making values written to MSR_COMMONHV_RNG visible to untrusted parties, but guests SHOULD NOT write sensitive values to wrmsr(MSR_COMMONHV_RNG).

I don't think unprivileged guests should be able to influence the hypervisor's RNG. Unless the intention here is it only affects the numbers returned to this guest? But since the write is optional, I don't object to it.
David
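The spec's advice to read MSR_COMMONHV_RNG several times and treat every value as best-effort implies the guest should fold the samples into its entropy pool rather than trust any single read. A sketch of such folding follows; the mixing function is my own illustration, not part of the draft, and reading the MSR itself (a kernel-mode RDMSR) is not modeled here.

```c
#include <assert.h>
#include <stdint.h>

static uint64_t rotl64(uint64_t x, int k)
{
    return (x << k) | (x >> (64 - k));
}

/*
 * Fold n best-effort samples into one seed.  Each step is a bijection
 * of the running state (xor, rotate, multiply by an odd constant), so
 * a later all-zero sample cannot erase earlier contributions.
 */
static uint64_t mix_samples(const uint64_t *s, int n)
{
    uint64_t seed = 0;
    for (int i = 0; i < n; i++)
        seed = rotl64(seed ^ s[i], 7) * 0x9E3779B97F4A7C15ULL;
    return seed;
}
```

A guest that gets a few genuinely random values interleaved with zeros (the draft explicitly allows a hypervisor to return 0) still ends up with a seed that depends on every non-zero sample.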
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On 10/30/2014 01:21 PM, David Vrabel wrote:

I would add: If the hypervisor's pool of random data is exhausted, it MAY return 0. The hypervisor MUST provide at least 4 (?) non-zero numbers to each guest.

Mandating non-zero numbers sounds like a bad idea. Just use the RNG for what it was designed; returning non-random numbers will not be a problem.

Paolo
Re: [PATCH 00/17] RFC: userfault v2
* zhanghailiang (zhang.zhanghaili...@huawei.com) wrote:

On 2014/10/30 1:46, Andrea Arcangeli wrote:

Hi Zhanghailiang,

On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote:

Hi Andrea, Thanks for your hard work on userfault;) This is really a useful API.

I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page, or both trigger userfault. I think this will help support vhost-scsi, ivshmem for migration; we can trace dirty pages in userspace. Actually, I'm trying to realize live memory snapshot based on pre-copy and userfault, but reading memory from the migration thread will also trigger userfault. It will be easy to implement live memory snapshot if we support configuring userfault for writing memory only.

Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthy feature to have here.

After some chat during the KVM Forum I've been already thinking it could be beneficial for some usage to give userland the information about the fault being read or write, combined with the ability of mapping pages wrprotected to mcopy_atomic (that would work without false positives only with MADV_DONTFORK also set, but it's already set in qemu). That will require vma->vm_flags VM_USERFAULT to be checked also in the wrprotect faults, not just in the not present faults, but it's not a massive change. Returning the read/write information is also not a massive change. This will then pay off mostly if there's also a way to remove the memory atomically (kind of remap_anon_pages).

Would that be enough? I mean, are you still ok if non present read faults trap too (you'd be notified it's a read) and you get notification for both wrprotect and non present faults?

Hi Andrea, Thanks for your reply, and your patience;) Er, maybe I didn't describe it clearly.
What I really need for live memory snapshot is only the wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write actions*.

My initial solution scheme for live memory snapshot is:
(1) pause VM
(2) using userfaultfd to mark all memory of VM as wrprotect (readonly)
(3) save device state to snapshot file
(4) resume VM
(5) snapshot thread begins to save pages of memory to snapshot file
(6) VM is going to run, and it is OK for VM or other threads to read ram (no fault trap), but if VM tries to write a page (dirty the page), there will be a userfault trap notification.
(7) a fault-handle-thread reads the page request from userfaultfd; it will copy the content of the page to some buffers, and then remove the page's wrprotect limit (still using the userfaultfd to tell kernel).
(8) after step (7), VM can continue to write the page, which can now be written.
(9) snapshot thread saves the page cached in step (7)
(10) repeat steps (5)~(9) until all of the VM's memory is saved to the snapshot file.

Hmm, I can see the same process being useful for fault-tolerance schemes like COLO; it needs a memory state snapshot.

So, what I need from userfault is support for only the wrprotect fault. I don't want to get notifications for non present reading faults; it would influence the VM's performance and the efficiency of doing the snapshot.

What pages would be non-present at this point - just balloon?

Dave

Also, I think this feature will benefit migration of ivshmem and vhost-scsi, which have no dirty-page tracing now.

The question then is how you mark the memory readonly to let the wrprotect faults trap if the memory already existed and you didn't map it yourself in the guest with mcopy_atomic with a readonly flag.
My current plan would be:

- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the fast path check in the not-present and wrprotect page fault
- if VM_USERFAULT is set, find if there's a userfaultfd registered into that vma too; if yes engage userfaultfd protocol, otherwise raise SIGBUS (single threaded apps should be fine with SIGBUS and it'll avoid them to spawn a thread in order to talk the userfaultfd protocol)
- if userfaultfd protocol is engaged, return read|write fault + fault address to read(ufd) syscalls
- leave the userfault resolution mechanism independent of the userfaultfd protocol so we keep the two problems separated and we don't mix them in the same API which makes it even harder to finalize it.
- add mcopy_atomic (with a flag to map the page readonly too)

The alternative would be to hide mcopy_atomic (and even remap_anon_pages in order to remove the memory atomically for the externalization into the cloud) as userfaultfd commands to write into the fd. But then there would be no much point to keep MADV_USERFAULT around if I do so and I could just remove it too or it doesn't
[PATCH 3/3] KVM: x86: optimize some accesses to LVTT and SPIV
We mirror a subset of these registers in separate variables. Using them directly should be faster.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/lapic.c | 10 +++---
 arch/x86/kvm/lapic.h |  6 +++---
 2 files changed, 6 insertions(+), 10 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index d3a3a1c..67af5d2 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -239,21 +239,17 @@ static inline int apic_lvt_vector(struct kvm_lapic *apic, int lvt_type)
 
 static inline int apic_lvtt_oneshot(struct kvm_lapic *apic)
 {
-	return ((kvm_apic_get_reg(apic, APIC_LVTT) &
-		apic->lapic_timer.timer_mode_mask) == APIC_LVT_TIMER_ONESHOT);
+	return apic->lapic_timer.timer_mode == APIC_LVT_TIMER_ONESHOT;
 }
 
 static inline int apic_lvtt_period(struct kvm_lapic *apic)
 {
-	return ((kvm_apic_get_reg(apic, APIC_LVTT) &
-		apic->lapic_timer.timer_mode_mask) == APIC_LVT_TIMER_PERIODIC);
+	return apic->lapic_timer.timer_mode == APIC_LVT_TIMER_PERIODIC;
 }
 
 static inline int apic_lvtt_tscdeadline(struct kvm_lapic *apic)
 {
-	return ((kvm_apic_get_reg(apic, APIC_LVTT) &
-		apic->lapic_timer.timer_mode_mask) ==
-		APIC_LVT_TIMER_TSCDEADLINE);
+	return apic->lapic_timer.timer_mode == APIC_LVT_TIMER_TSCDEADLINE;
 }
 
 static inline int apic_lvt_nmi_mode(u32 lvt_val)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 755a954..2c56885 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -121,11 +121,11 @@ static inline int kvm_apic_hw_enabled(struct kvm_lapic *apic)
 
 extern struct static_key_deferred apic_sw_disabled;
 
-static inline int kvm_apic_sw_enabled(struct kvm_lapic *apic)
+static inline bool kvm_apic_sw_enabled(struct kvm_lapic *apic)
 {
 	if (static_key_false(apic_sw_disabled.key))
-		return kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_APIC_ENABLED;
-	return APIC_SPIV_APIC_ENABLED;
+		return apic->sw_enabled;
+	return true;
 }
 
 static inline bool kvm_apic_present(struct kvm_vcpu *vcpu)
-- 
2.1.0
[PATCH 2/3] KVM: x86: detect LVTT changes under APICv
APICv traps register writes, so we can't retrieve previous value and omit timer cancelation when mode changes. timer_mode_mask shouldn't be changing as it depends on cpuid.

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
#define assign(a, b) (a == b ? false : (a = b, true))

 arch/x86/kvm/lapic.c | 12
 arch/x86/kvm/lapic.h |  1 +
 2 files changed, 9 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index f538b14..d3a3a1c 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -1205,17 +1205,20 @@ static int apic_reg_write(struct kvm_lapic *apic, u32 reg, u32 val)
 		break;
 
-	case APIC_LVTT:
-		if ((kvm_apic_get_reg(apic, APIC_LVTT) &
-		    apic->lapic_timer.timer_mode_mask) !=
-		    (val & apic->lapic_timer.timer_mode_mask))
+	case APIC_LVTT: {
+		u32 timer_mode = val & apic->lapic_timer.timer_mode_mask;
+
+		if (apic->lapic_timer.timer_mode != timer_mode) {
+			apic->lapic_timer.timer_mode = timer_mode;
 			hrtimer_cancel(&apic->lapic_timer.timer);
+		}
 
 		if (!kvm_apic_sw_enabled(apic))
 			val |= APIC_LVT_MASKED;
 		val &= (apic_lvt_mask[0] | apic->lapic_timer.timer_mode_mask);
 		apic_set_reg(apic, APIC_LVTT, val);
 		break;
+	}
 
 	case APIC_TMICT:
 		if (apic_lvtt_tscdeadline(apic))
@@ -1449,6 +1452,7 @@ void kvm_lapic_reset(struct kvm_vcpu *vcpu)
 	for (i = 0; i < APIC_LVT_NUM; i++)
 		apic_set_reg(apic, APIC_LVTT + 0x10 * i, APIC_LVT_MASKED);
+	apic->lapic_timer.timer_mode = 0;
 	apic_set_reg(apic, APIC_LVT0,
 		     SET_APIC_DELIVERY_MODE(0, APIC_MODE_EXTINT));
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 5fcc3d3..755a954 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -11,6 +11,7 @@
 struct kvm_timer {
 	struct hrtimer timer;
 	s64 period;		/* unit: ns */
+	u32 timer_mode;
 	u32 timer_mode_mask;
 	u64 tscdeadline;
 	atomic_t pending;	/* accumulated triggered timers */
-- 
2.1.0
[PATCH 0/3] kvm: APICv register write workaround
APICv traps register writes, so we can't retrieve the previous value, but our code depends on detecting changes. Apart from disabling APIC register virtualization, we can detect the change by using extra memory. A one-value history is enough, but we still don't want to keep it for every APIC register, for performance reasons. This leaves us with either a new framework, or exceptions ... The latter option fits KVM's path better [1,2]. And when we already mirror a part of the registers, optimizing access is acceptable [3].

(Squashed to keep bisecters happy.)

---
Radim Krčmář (3):
  KVM: x86: detect SPIV changes under APICv
  KVM: x86: detect LVTT changes under APICv
  KVM: x86: optimize some accesses to LVTT and SPIV

 arch/x86/kvm/lapic.c | 32 +---
 arch/x86/kvm/lapic.h |  8 +---
 2 files changed, 22 insertions(+), 18 deletions(-)

-- 
2.1.0
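The "extra memory" idea in this cover letter - and the `assign()` helper quoted under the "---" in patch 2/3 - amounts to comparing a trapped write against a mirrored copy, since APICv has already clobbered the register by the time we see the write. A minimal illustration, in which the struct and function names are invented and only the `assign` macro is taken from the series:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Helper quoted in patch 2/3's notes: update a mirror, report change. */
#define assign(a, b) ((a) == (b) ? false : ((a) = (b), true))

/* Illustrative stand-in for the LVTT timer-mode bit field. */
#define TIMER_MODE_MASK (3u << 17)

struct lapic_mirror {
    uint32_t timer_mode;          /* mirrored subset of APIC_LVTT */
};

/*
 * On a trapped LVTT write the previous register value is gone; the
 * mirror tells us whether the mode actually changed (and thus whether
 * a pending timer would need cancelling).
 */
static bool lvtt_mode_changed(struct lapic_mirror *m, uint32_t val)
{
    return assign(m->timer_mode, val & TIMER_MODE_MASK);
}
```

Keeping a one-value history only for the handful of mirrored fields is the compromise the cover letter describes: change detection without shadowing every APIC register.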
[PATCH 1/3] KVM: x86: detect SPIV changes under APICv
APICv traps register writes, so we can't retrieve previous value. (A bit of blame on Intel.)

This caused a migration bug: LAPIC is enabled, so our restore code correctly lowers apic_sw_enabled, but doesn't increase it after APICv is disabled, so we get below zero when freeing it; resulting in this trace:

 WARNING: at kernel/jump_label.c:81 __static_key_slow_dec+0xa6/0xb0()
 jump label: negative count!
 [816bf898] dump_stack+0x19/0x1b
 [8107c6f1] warn_slowpath_common+0x61/0x80
 [8107c76c] warn_slowpath_fmt+0x5c/0x80
 [811931e6] __static_key_slow_dec+0xa6/0xb0
 [81193226] static_key_slow_dec_deferred+0x16/0x20
 [a0637698] kvm_free_lapic+0x88/0xa0 [kvm]
 [a061c63e] kvm_arch_vcpu_uninit+0x2e/0xe0 [kvm]
 [a05ff301] kvm_vcpu_uninit+0x21/0x40 [kvm]
 [a067cec7] vmx_free_vcpu+0x47/0x70 [kvm_intel]
 [a061bc50] kvm_arch_vcpu_free+0x50/0x60 [kvm]
 [a061ca22] kvm_arch_destroy_vm+0x102/0x260 [kvm]
 [810b68fd] ? synchronize_srcu+0x1d/0x20
 [a06030d1] kvm_put_kvm+0xe1/0x1c0 [kvm]
 [a06036f8] kvm_vcpu_release+0x18/0x20 [kvm]
 [81215c62] __fput+0x102/0x310
 [81215f4e] fput+0xe/0x10
 [810ab664] task_work_run+0xb4/0xe0
 [81083944] do_exit+0x304/0xc60
 [816c8dfc] ? _raw_spin_unlock_irq+0x2c/0x50
 [810fd22d] ? trace_hardirqs_on_caller+0xfd/0x1c0
 [8108432c] do_group_exit+0x4c/0xc0
 [810843b4] SyS_exit_group+0x14/0x20
 [816d33a9] system_call_fastpath+0x16/0x1b

Signed-off-by: Radim Krčmář rkrc...@redhat.com
---
 arch/x86/kvm/lapic.c | 10 ++
 arch/x86/kvm/lapic.h |  1 +
 2 files changed, 7 insertions(+), 4 deletions(-)

diff --git a/arch/x86/kvm/lapic.c b/arch/x86/kvm/lapic.c
index b8345dd..f538b14 100644
--- a/arch/x86/kvm/lapic.c
+++ b/arch/x86/kvm/lapic.c
@@ -201,11 +201,13 @@ out:
 
 static inline void apic_set_spiv(struct kvm_lapic *apic, u32 val)
 {
-	u32 prev = kvm_apic_get_reg(apic, APIC_SPIV);
+	bool enabled = val & APIC_SPIV_APIC_ENABLED;
 
 	apic_set_reg(apic, APIC_SPIV, val);
-	if ((prev ^ val) & APIC_SPIV_APIC_ENABLED) {
-		if (val & APIC_SPIV_APIC_ENABLED) {
+
+	if (enabled != apic->sw_enabled) {
+		apic->sw_enabled = enabled;
+		if (enabled) {
 			static_key_slow_dec_deferred(&apic_sw_disabled);
 			recalculate_apic_map(apic->vcpu->kvm);
 		} else
@@ -1320,7 +1322,7 @@ void kvm_free_lapic(struct kvm_vcpu *vcpu)
 	if (!(vcpu->arch.apic_base & MSR_IA32_APICBASE_ENABLE))
 		static_key_slow_dec_deferred(&apic_hw_disabled);
 
-	if (!(kvm_apic_get_reg(apic, APIC_SPIV) & APIC_SPIV_APIC_ENABLED))
+	if (!apic->sw_enabled)
 		static_key_slow_dec_deferred(&apic_sw_disabled);
 
 	if (apic->regs)
diff --git a/arch/x86/kvm/lapic.h b/arch/x86/kvm/lapic.h
index 6a11845..5fcc3d3 100644
--- a/arch/x86/kvm/lapic.h
+++ b/arch/x86/kvm/lapic.h
@@ -33,6 +33,7 @@ struct kvm_lapic {
 	 * Note: Only one register, the TPR, is used by the microcode.
 	 */
 	void *regs;
+	bool sw_enabled;
 	gpa_t vapic_addr;
 	struct gfn_to_hva_cache vapic_cache;
 	unsigned long pending_events;
-- 
2.1.0
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
Adding the bhyve guys. El 29/10/14 a les 6.19, Andy Lutomirski ha escrit: Here's a draft CommonHV spec. It's also on github: https://github.com/amluto/CommonHV So far, this provides a two-way RNG interface, a way to detect it, and a way to detect other hypervisor leaves. The latter is because, after both the enormous public thread and some private discussions, it seems that detection of existing CPUID paravirt leaves is annoying and inefficient. If we're going to define some cross-vendor CPUID leaves, it seems like it would be useful to offer a way to quickly enumerate other leaves. I've been told the AMD intends to update their manual to match Intel's so that hypervisors can use the entire 0x4F?? CPUID range. I have intentionally not fixed an MSR value for the RNG because the range of allowed MSRs is very small in both the Intel and AMD manuals. If any given hypervisor wants to ignore that small range and advertise a higher-numbered MSR, it is welcome to, but I don't want to codify something that doesn't comply with the manuals. Here's the draft. Comments? To the people who work on various hypervisors: Would you implement this? Do you like it? Is there anything, major or minor, that you'd like to see changed? Do you think that this is a good idea at all? I've tried to get good coverage of various hypervisors. There are Hyper-V, VMWare, KVM, and Xen people on the cc list. Thanks, Andy CommonHV, a common hypervisor interface === This is CommonHV draft 1. The CommonHV specification is Copyright (c) 2014 Andrew Lutomirski. Licensing will be determined soon. The license is expected to be extremely liberal. I am currently leaning towards CC-BY-SA for the specification and an explicit license permitting anyone to implement the specification with no restrictions whatsoever. I have not patented, nor do I intend to patent, anything required to implement this specification. 
I am not aware of any current or future intellectual property rights that would prevent a royalty-free implementation of this specification. I would like to find a stable, neutral steward of this specification going forward. Help with this would be much appreciated.

Scope
-----

CommonHV is a simple interface for communication between hypervisors and their guests. CommonHV is intended to be very simple and to avoid interfering with existing paravirtual interfaces. To that end, its scope is limited. CommonHV does only two types of things:

 * It provides a way to enumerate other paravirtual interfaces.
 * It provides a small, extensible set of paravirtual features that do not modify or replace standard system functionality.

For example, CommonHV does not and will not define anything related to interrupt handling or virtual CPU management. For now, CommonHV is only applicable to the x86 platform.

Discovery
---------

A CommonHV hypervisor MUST set the hypervisor bit (bit 31 in CPUID.1H.0H.ECX) and provide the CPUID leaf 4F00H, containing:

 * CPUID.4F00H.0H.EAX = max_commonhv_leaf
 * CPUID.4F00H.0H.EBX = 0x6D6D6F43
 * CPUID.4F00H.0H.ECX = 0x56486E6F
 * CPUID.4F00H.0H.EDX = 0x66746e49

EBX, ECX, and EDX form the string "CommonHVIntf" in little-endian ASCII.

max_commonhv_leaf MUST be a number between 0x4F00 and 0x4FFF. It indicates the largest leaf defined in this specification that is provided. Any leaves described in this specification with EAX values that exceed max_commonhv_leaf MUST be handled by guests as though they contain all zeros.

CPUID leaf 4F01H: hypervisor interface enumeration
--------------------------------------------------

If max_commonhv_leaf >= 0x4F01, CommonHV provides a list of tuples (location, signature). Each tuple indicates the presence of another paravirtual interface identified by the signature at the indicated CPUID location. It is expected that CPUID.location.0H will have (EBX, ECX, EDX) == signature, although whether this is required is left to the specification associated with the given signature.
If the list contains N tuples, then, for each 0 <= i < N:

 * CPUID.4F01H.i.EBX, CPUID.4F01H.i.ECX, and CPUID.4F01H.i.EDX are the signature.
 * CPUID.4F01H.i.EAX is the location.

CPUID with EAX = 0x4F01 and ECX >= N MUST return all zeros.

To the extent that the hypervisor prefers a given interface, it should specify that interface earlier in the list. For example, KVM might place its KVMKVMKVM signature first in the list to indicate that it should be used by guests in preference to other supported interfaces. Other hypervisors would likely use a different order. The exact semantics of the ordering of the list is beyond the scope of this specification.

CPUID leaf 4F02H: miscellaneous features
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On Thu, Oct 30, 2014 at 5:21 AM, David Vrabel david.vra...@citrix.com wrote: On 29/10/14 05:19, Andy Lutomirski wrote: CPUID leaf 4F02H: miscellaneous features [...] ### CommonHV RNG If CPUID.4F02H.EAX is nonzero, then it contains an MSR index used to communicate with a hypervisor random number generator. This MSR is referred to as MSR_COMMONHV_RNG. rdmsr(MSR_COMMONHV_RNG) returns a 64-bit best-effort random number. If the hypervisor is able to generate a 64-bit cryptographically secure random number, it SHOULD return it. If not, then the hypervisor SHOULD do its best to return a random number suitable for seeding a cryptographic RNG. A guest is expected to read MSR_COMMONHV_RNG several times in a row. The hypervisor SHOULD return different values each time. rdmsr(MSR_COMMONHV_RNG) MUST NOT result in an exception, but guests MUST NOT assume that its return value is indeed secure. For example, a hypervisor is free to return zero in response to rdmsr(MSR_COMMONHV_RNG). I would add: If the hypervisor's pool of random data is exhausted, it MAY return 0. The hypervisor MUST provide at least 4 (?) non-zero numbers to each guest. Xen does not have a continual source of entropy and the only feasible way is for the toolstack to provide each guest with a fixed size pool of random data during guest creation. Xen could seed a very simple per-guest DRBG at guest startup and then let the rdmsr call read from it. The fixed size pool could be refilled by the guest if further random data is needed (e.g., before an in-guest kexec). That gets complicated. Then you need an API to refill it. wrmsr(MSR_COMMONHV_RNG) offers the hypervisor up to 64 bits of entropy. The hypervisor MAY use it as it sees fit to improve its own random number generator. A hypervisor SHOULD make a reasonable effort to avoid making values written to MSR_COMMONHV_RNG visible to untrusted parties, but guests SHOULD NOT write sensitive values to wrmsr(MSR_COMMONHV_RNG). 
I don't think unprivileged guests should be able to influence the hypervisor's RNG. Unless the intention here is it only affects the numbers returned to this guest? An RNG can be designed to be secure even if malicious users can provide input. Linux has one of these, and I assume that Windows does, too. Xen doesn't for the entirely legitimate reason that Xen has no need for such a thing. (Xen dom0, on the other hand, has Linux's.) But since the write is optional, I don't object to it. Draft 2 has a bit that Xen could clear to ask the guest not to even try to use this feature. I'll send out draft 2 by email later today. It's on github now, though. --Andy -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Xen-devel] [RFC] Hypervisor RNG and enumeration
On Thu, 2014-10-30 at 07:45 -0700, Andy Lutomirski wrote: Xen does not have a continual source of entropy and the only feasible way is for the toolstack to provide each guest with a fixed size pool of random data during guest creation. Xen could seed a very simple per-guest DRBG at guest startup and then let the rdmsr call read from it.

I think I'm a bit confused by the intended scope of this facility. The original spec said: Note that the CommonHV RNG is not intended to replace stronger, asynchronous paravirtual random number generator interfaces. It is intended primarily for seeding guest RNGs early in boot.

Which to me reads that the guest should be using this facility to seed its own simple DRBG on boot (with some finite amount of seed data from the hv) and then using that until it can switch to something better. Is that not the intention?

I think it's important to nail down the intended scope of this interface, since it has quite an impact on what would be considered a reasonable common design. Post boot I would as you say expect most OSes to switch over to something more capable, not continue to rely on this facility for the duration.

Ian.
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[kvm-unit-tests PATCH 0/6] arm: enable MMU
The first patch of this series fixes a bug caused by attempting to use spinlocks without enabling the MMU. The next three do some prep for the fifth, and also fix arm's PAGE_ALIGN. The fifth is prep for the sixth, which finally turns the MMU on for arm unit tests.

Andrew Jones (6):
  arm: fix crash on cubietruck
  lib: add ALIGN() macro
  lib: steal const.h from kernel
  arm: apply ALIGN() and const.h to arm files
  arm: import some Linux page table API
  arm: turn on the MMU

 arm/cstart.S                | 33 +++
 config/config-arm.mak       |  3 ++-
 lib/alloc.c                 |  4 +--
 lib/arm/asm/mmu.h           | 43 ++
 lib/arm/asm/page.h          | 43 +++---
 lib/arm/asm/pgtable-hwdef.h | 65 +
 lib/arm/mmu.c               | 53
 lib/arm/processor.c         | 11
 lib/arm/setup.c             |  3 +++
 lib/arm/spinlock.c          |  7 +
 lib/asm-generic/page.h      | 17 ++--
 lib/const.h                 | 11
 lib/libcflat.h              |  4 +++
 13 files changed, 275 insertions(+), 22 deletions(-)
 create mode 100644 lib/arm/asm/mmu.h
 create mode 100644 lib/arm/asm/pgtable-hwdef.h
 create mode 100644 lib/arm/mmu.c
 create mode 100644 lib/const.h

--
1.9.3
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 1/6] arm: fix crash on cubietruck
Cubietruck seems to be more sensitive than my Midway when attempting to use [ldr|str]ex instructions without caches enabled (mmu disabled). Fix this by making the spinlock implementation (currently the only user of *ex instructions) conditional on the mmu being enabled.

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 lib/arm/asm/mmu.h  | 11 +++++++++++
 lib/arm/spinlock.c |  7 +++++++
 2 files changed, 18 insertions(+)
 create mode 100644 lib/arm/asm/mmu.h

diff --git a/lib/arm/asm/mmu.h b/lib/arm/asm/mmu.h
new file mode 100644
index 0000000000000..987928b2c432c
--- /dev/null
+++ b/lib/arm/asm/mmu.h
@@ -0,0 +1,11 @@
+#ifndef __ASMARM_MMU_H_
+#define __ASMARM_MMU_H_
+/*
+ * Copyright (C) 2014, Red Hat Inc, Andrew Jones drjo...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+
+#define mmu_enabled() (0)
+
+#endif /* __ASMARM_MMU_H_ */
diff --git a/lib/arm/spinlock.c b/lib/arm/spinlock.c
index d8a6d4c3383d6..e2bb1ace43c4e 100644
--- a/lib/arm/spinlock.c
+++ b/lib/arm/spinlock.c
@@ -1,12 +1,19 @@
 #include "libcflat.h"
 #include "asm/spinlock.h"
 #include "asm/barrier.h"
+#include "asm/mmu.h"
 
 void spin_lock(struct spinlock *lock)
 {
 	u32 val, fail;
 
 	dmb();
+
+	if (!mmu_enabled()) {
+		lock->v = 1;
+		return;
+	}
+
 	do {
 		asm volatile(
 		"1:	ldrex	%0, [%2]\n"
--
1.9.3
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 5/6] arm: import some Linux page table API
Signed-off-by: Andrew Jones drjo...@redhat.com
---
 lib/arm/asm/page.h          | 22 ++++++++++++++++++++++
 lib/arm/asm/pgtable-hwdef.h | 65 +++++++++++++++++++++++++++++++++++
 2 files changed, 87 insertions(+)
 create mode 100644 lib/arm/asm/pgtable-hwdef.h

diff --git a/lib/arm/asm/page.h b/lib/arm/asm/page.h
index 4602d735b7886..6ff849a0c0e3b 100644
--- a/lib/arm/asm/page.h
+++ b/lib/arm/asm/page.h
@@ -18,6 +18,28 @@
 
 #include "asm/setup.h"
 
+typedef u64 pteval_t;
+typedef u64 pmdval_t;
+typedef u64 pgdval_t;
+typedef struct { pteval_t pte; } pte_t;
+typedef struct { pmdval_t pmd; } pmd_t;
+typedef struct { pgdval_t pgd; } pgd_t;
+typedef struct { pteval_t pgprot; } pgprot_t;
+
+#define pte_val(x)		((x).pte)
+#define pmd_val(x)		((x).pmd)
+#define pgd_val(x)		((x).pgd)
+#define pgprot_val(x)		((x).pgprot)
+
+#define __pte(x)		((pte_t) { (x) } )
+#define __pmd(x)		((pmd_t) { (x) } )
+#define __pgd(x)		((pgd_t) { (x) } )
+#define __pgprot(x)		((pgprot_t) { (x) } )
+
+typedef struct { pgd_t pgd; } pud_t;
+#define pud_val(x)		(pgd_val((x).pgd))
+#define __pud(x)		((pud_t) { __pgd(x) } )
+
 #ifndef __virt_to_phys
 #define __phys_to_virt(x)	((unsigned long) (x))
 #define __virt_to_phys(x)	(x)
diff --git a/lib/arm/asm/pgtable-hwdef.h b/lib/arm/asm/pgtable-hwdef.h
new file mode 100644
index 0000000000000..a2564aaca05a3
--- /dev/null
+++ b/lib/arm/asm/pgtable-hwdef.h
@@ -0,0 +1,65 @@
+#ifndef _ASMARM_PGTABLE_HWDEF_H_
+#define _ASMARM_PGTABLE_HWDEF_H_
+/*
+ * From arch/arm/include/asm/pgtable-3level-hwdef.h
+ */
+
+/*
+ * Hardware page table definitions.
+ *
+ * + Level 1/2 descriptor
+ *   - common
+ */
+#define PMD_TYPE_MASK		(_AT(pmdval_t, 3) << 0)
+#define PMD_TYPE_FAULT		(_AT(pmdval_t, 0) << 0)
+#define PMD_TYPE_TABLE		(_AT(pmdval_t, 3) << 0)
+#define PMD_TYPE_SECT		(_AT(pmdval_t, 1) << 0)
+#define PMD_TABLE_BIT		(_AT(pmdval_t, 1) << 1)
+#define PMD_BIT4		(_AT(pmdval_t, 0))
+#define PMD_DOMAIN(x)		(_AT(pmdval_t, 0))
+#define PMD_APTABLE_SHIFT	(61)
+#define PMD_APTABLE		(_AT(pgdval_t, 3) << PGD_APTABLE_SHIFT)
+#define PMD_PXNTABLE		(_AT(pgdval_t, 1) << 59)
+
+/*
+ *   - section
+ */
+#define PMD_SECT_BUFFERABLE	(_AT(pmdval_t, 1) << 2)
+#define PMD_SECT_CACHEABLE	(_AT(pmdval_t, 1) << 3)
+#define PMD_SECT_USER		(_AT(pmdval_t, 1) << 6)		/* AP[1] */
+#define PMD_SECT_AP2		(_AT(pmdval_t, 1) << 7)		/* read only */
+#define PMD_SECT_S		(_AT(pmdval_t, 3) << 8)
+#define PMD_SECT_AF		(_AT(pmdval_t, 1) << 10)
+#define PMD_SECT_nG		(_AT(pmdval_t, 1) << 11)
+#define PMD_SECT_PXN		(_AT(pmdval_t, 1) << 53)
+#define PMD_SECT_XN		(_AT(pmdval_t, 1) << 54)
+#define PMD_SECT_AP_WRITE	(_AT(pmdval_t, 0))
+#define PMD_SECT_AP_READ	(_AT(pmdval_t, 0))
+#define PMD_SECT_AP1		(_AT(pmdval_t, 1) << 6)
+#define PMD_SECT_TEX(x)		(_AT(pmdval_t, 0))
+
+/*
+ * AttrIndx[2:0] encoding (mapping attributes defined in the MAIR* registers).
+ */
+#define PMD_SECT_UNCACHED	(_AT(pmdval_t, 0) << 2)	/* strongly ordered */
+#define PMD_SECT_BUFFERED	(_AT(pmdval_t, 1) << 2)	/* normal non-cacheable */
+#define PMD_SECT_WT		(_AT(pmdval_t, 2) << 2)	/* normal inner write-through */
+#define PMD_SECT_WB		(_AT(pmdval_t, 3) << 2)	/* normal inner write-back */
+#define PMD_SECT_WBWA		(_AT(pmdval_t, 7) << 2)	/* normal inner write-alloc */
+
+/*
+ * + Level 3 descriptor (PTE)
+ */
+#define PTE_TYPE_MASK		(_AT(pteval_t, 3) << 0)
+#define PTE_TYPE_FAULT		(_AT(pteval_t, 0) << 0)
+#define PTE_TYPE_PAGE		(_AT(pteval_t, 3) << 0)
+#define PTE_TABLE_BIT		(_AT(pteval_t, 1) << 1)
+#define PTE_BUFFERABLE		(_AT(pteval_t, 1) << 2)	/* AttrIndx[0] */
+#define PTE_CACHEABLE		(_AT(pteval_t, 1) << 3)	/* AttrIndx[1] */
+#define PTE_AP2			(_AT(pteval_t, 1) << 7)	/* AP[2] */
+#define PTE_EXT_SHARED		(_AT(pteval_t, 3) << 8)	/* SH[1:0], inner shareable */
+#define PTE_EXT_AF		(_AT(pteval_t, 1) << 10)	/* Access Flag */
+#define PTE_EXT_NG		(_AT(pteval_t, 1) << 11)	/* nG */
+#define PTE_EXT_XN		(_AT(pteval_t, 1) << 54)	/* XN */
+
+#endif /* _ASMARM_PGTABLE_HWDEF_H_ */
--
1.9.3
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 2/6] lib: add ALIGN() macro
Add a type-considerate ALIGN[_UP] macro to libcflat, and apply it to /lib code that can make use of it. This will be used to fix PAGE_ALIGN on arm, which can be used on phys_addr_t addresses, which may exceed 32 bits.

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 lib/alloc.c            | 4 +---
 lib/asm-generic/page.h | 9 ++---
 lib/libcflat.h         | 4 ++++
 3 files changed, 11 insertions(+), 6 deletions(-)

diff --git a/lib/alloc.c b/lib/alloc.c
index 5d55e285dcd1d..1abe4961ae9dd 100644
--- a/lib/alloc.c
+++ b/lib/alloc.c
@@ -7,8 +7,6 @@
 #include "asm/spinlock.h"
 #include "asm/io.h"
 
-#define ALIGN_UP_MASK(x, mask)	(((x) + (mask)) & ~(mask))
-#define ALIGN_UP(x, a)		ALIGN_UP_MASK(x, (typeof(x))(a) - 1)
 #define MIN(a, b)		((a) < (b) ? (a) : (b))
 #define MAX(a, b)		((a) > (b) ? (a) : (b))
 
@@ -70,7 +68,7 @@ static phys_addr_t phys_alloc_aligned_safe(phys_addr_t size,
 
 	spin_lock(&lock);
 
-	addr = ALIGN_UP(base, align);
+	addr = ALIGN(base, align);
 	size += addr - base;
 
 	if ((top_safe - base) < size) {
diff --git a/lib/asm-generic/page.h b/lib/asm-generic/page.h
index 559938fcf0b3f..8602752002f71 100644
--- a/lib/asm-generic/page.h
+++ b/lib/asm-generic/page.h
@@ -16,13 +16,16 @@
 #define PAGE_SIZE (1 << PAGE_SHIFT)
 #endif
 #define PAGE_MASK (~(PAGE_SIZE-1))
-#define PAGE_ALIGN(addr) (((addr) + (PAGE_SIZE-1)) & PAGE_MASK)
 
 #ifndef __ASSEMBLY__
+
+#define PAGE_ALIGN(addr) ALIGN(addr, PAGE_SIZE)
+
 #define __va(x) ((void *)((unsigned long) (x)))
 #define __pa(x) ((unsigned long) (x))
 
 #define virt_to_pfn(kaddr) (__pa(kaddr) >> PAGE_SHIFT)
 #define pfn_to_virt(pfn)   __va((pfn) << PAGE_SHIFT)
-#endif
-#endif
+
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASM_GENERIC_PAGE_H_ */
diff --git a/lib/libcflat.h b/lib/libcflat.h
index a43eba0329f8e..7db29a4f4f3cb 100644
--- a/lib/libcflat.h
+++ b/lib/libcflat.h
@@ -30,6 +30,10 @@
 #define xstr(s) xxstr(s)
 #define xxstr(s) #s
 
+#define __ALIGN_MASK(x, mask)	(((x) + (mask)) & ~(mask))
+#define __ALIGN(x, a)		__ALIGN_MASK(x, (typeof(x))(a) - 1)
+#define ALIGN(x, a)		__ALIGN((x), (a))
+
 typedef uint8_t  u8;
 typedef int8_t   s8;
 typedef uint16_t u16;
--
1.9.3
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 4/6] arm: apply ALIGN() and const.h to arm files
This fixes PAGE_ALIGN for greater than 32-bit addresses. Also fix up some whitespace in lib/arm/asm/page.h

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 lib/arm/asm/page.h | 21 +++++++++++----------
 1 file changed, 11 insertions(+), 10 deletions(-)

diff --git a/lib/arm/asm/page.h b/lib/arm/asm/page.h
index 606d76f5775cf..4602d735b7886 100644
--- a/lib/arm/asm/page.h
+++ b/lib/arm/asm/page.h
@@ -6,16 +6,16 @@
  * This work is licensed under the terms of the GNU LGPL, version 2.
  */
 
+#include "const.h"
+
 #define PAGE_SHIFT	12
-#ifndef __ASSEMBLY__
-#define PAGE_SIZE	(1UL << PAGE_SHIFT)
-#else
-#define PAGE_SIZE	(1 << PAGE_SHIFT)
-#endif
+#define PAGE_SIZE	(_AC(1,UL) << PAGE_SHIFT)
 #define PAGE_MASK	(~(PAGE_SIZE-1))
-#define PAGE_ALIGN(addr)	(((addr) + (PAGE_SIZE-1)) & PAGE_MASK)
 
 #ifndef __ASSEMBLY__
+
+#define PAGE_ALIGN(addr)	ALIGN(addr, PAGE_SIZE)
+
 #include "asm/setup.h"
 
 #ifndef __virt_to_phys
@@ -26,8 +26,9 @@
 #define __va(x)	((void *)__phys_to_virt((phys_addr_t)(x)))
 #define __pa(x)	__virt_to_phys((unsigned long)(x))
 
-#define virt_to_pfn(kaddr)	(__pa(kaddr) >> PAGE_SHIFT)
-#define pfn_to_virt(pfn)	__va((pfn) << PAGE_SHIFT)
-#endif
+#define virt_to_pfn(kaddr)	(__pa(kaddr) >> PAGE_SHIFT)
+#define pfn_to_virt(pfn)	__va((pfn) << PAGE_SHIFT)
 
-#endif
+#endif /* __ASSEMBLY__ */
+
+#endif /* _ASMARM_PAGE_H_ */
--
1.9.3
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 3/6] lib: steal const.h from kernel
And apply it to /lib files. This prepares for the import of kernel headers that make use of the const.h macros.

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 lib/asm-generic/page.h |  8 +++-----
 lib/const.h            | 11 +++++++++++
 2 files changed, 14 insertions(+), 5 deletions(-)
 create mode 100644 lib/const.h

diff --git a/lib/asm-generic/page.h b/lib/asm-generic/page.h
index 8602752002f71..66c72a62bb0f7 100644
--- a/lib/asm-generic/page.h
+++ b/lib/asm-generic/page.h
@@ -9,12 +9,10 @@
  * This work is licensed under the terms of the GNU LGPL, version 2.
  */
 
+#include "const.h"
+
 #define PAGE_SHIFT	12
-#ifndef __ASSEMBLY__
-#define PAGE_SIZE	(1UL << PAGE_SHIFT)
-#else
-#define PAGE_SIZE	(1 << PAGE_SHIFT)
-#endif
+#define PAGE_SIZE	(_AC(1,UL) << PAGE_SHIFT)
 #define PAGE_MASK	(~(PAGE_SIZE-1))
 
 #ifndef __ASSEMBLY__
diff --git a/lib/const.h b/lib/const.h
new file mode 100644
index 0000000000000..5cd94d7067541
--- /dev/null
+++ b/lib/const.h
@@ -0,0 +1,11 @@
+#ifndef _CONST_H_
+#define _CONST_H_
+#ifdef __ASSEMBLY__
+#define _AC(X,Y)	X
+#define _AT(T,X)	X
+#else
+#define __AC(X,Y)	(X##Y)
+#define _AC(X,Y)	__AC(X,Y)
+#define _AT(T,X)	((T)(X))
+#endif
+#endif
--
1.9.3
--
To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 6/6] arm: turn on the MMU
We should probably always run with the mmu on, so let's enable it from setup with an identity map.

Signed-off-by: Andrew Jones drjo...@redhat.com
---
 arm/cstart.S          | 33 ++++++++++++++++++++
 config/config-arm.mak |  3 ++-
 lib/arm/asm/mmu.h     | 34 ++++++++++++++++++-
 lib/arm/mmu.c         | 53 +++++++++++++++++++++++++++
 lib/arm/processor.c   | 11 +++++++
 lib/arm/setup.c       |  3 +++
 6 files changed, 135 insertions(+), 2 deletions(-)
 create mode 100644 lib/arm/mmu.c

diff --git a/arm/cstart.S b/arm/cstart.S
index cc87ece4b6b40..a1ccfb24bb4e0 100644
--- a/arm/cstart.S
+++ b/arm/cstart.S
@@ -72,6 +72,39 @@ halt:
 	b	1b
 
 /*
+ * asm_mmu_enable
+ * Inputs:
+ *   (r0 - lo, r1 - hi) is the base address of the translation table
+ * Outputs: none
+ */
+.equ	PRRR,	0xeeaa4400		@ MAIR0 (from Linux kernel)
+.equ	NMRR,	0xff000004		@ MAIR1 (from Linux kernel)
+.globl asm_mmu_enable
+asm_mmu_enable:
+	/* TTBCR */
+	mrc	p15, 0, r2, c2, c0, 2
+	orr	r2, #(1 << 31)		@ TTB_EAE
+	mcr	p15, 0, r2, c2, c0, 2
+
+	/* MAIR */
+	ldr	r2, =PRRR
+	mcr	p15, 0, r2, c10, c2, 0
+	ldr	r2, =NMRR
+	mcr	p15, 0, r2, c10, c2, 1
+
+	/* TTBR0 */
+	mcrr	p15, 0, r0, r1, c2
+
+	/* SCTLR */
+	mrc	p15, 0, r2, c1, c0, 0
+	orr	r2, #CR_C
+	orr	r2, #CR_I
+	orr	r2, #CR_M
+	mcr	p15, 0, r2, c1, c0, 0
+
+	mov	pc, lr
+
+/*
  * Vector stubs
  * Simplified version of the Linux kernel implementation
  *   arch/arm/kernel/entry-armv.S
diff --git a/config/config-arm.mak b/config/config-arm.mak
index 8a274c50332b0..86e1d75169b59 100644
--- a/config/config-arm.mak
+++ b/config/config-arm.mak
@@ -42,7 +42,8 @@ cflatobjs += \
 	lib/arm/io.o \
 	lib/arm/setup.o \
 	lib/arm/spinlock.o \
-	lib/arm/processor.o
+	lib/arm/processor.o \
+	lib/arm/mmu.o
 
 libeabi = lib/arm/libeabi.a
 eabiobjs = lib/arm/eabi_compat.o
diff --git a/lib/arm/asm/mmu.h b/lib/arm/asm/mmu.h
index 987928b2c432c..451c7493c2aba 100644
--- a/lib/arm/asm/mmu.h
+++ b/lib/arm/asm/mmu.h
@@ -5,7 +5,39 @@
  *
  * This work is licensed under the terms of the GNU LGPL, version 2.
  */
+#include "asm/page.h"
+#include "asm/barrier.h"
+#include "alloc.h"
 
-#define mmu_enabled() (0)
+#define PTRS_PER_PGD	4
+#define PGDIR_SHIFT	30
+#define PGDIR_SIZE	(1UL << PGDIR_SHIFT)
+#define PGDIR_MASK	(~((1UL << PGDIR_SHIFT) - 1))
+
+#define pgd_free(pgd) free(pgd)
+static inline pgd_t *pgd_alloc(void)
+{
+	pgd_t *pgd = memalign(L1_CACHE_BYTES, PTRS_PER_PGD * sizeof(pgd_t));
+	memset(pgd, 0, PTRS_PER_PGD * sizeof(pgd_t));
+	return pgd;
+}
+
+static inline void local_flush_tlb_all(void)
+{
+	asm volatile("mcr p15, 0, %0, c8, c7, 0" :: "r" (0));
+	dsb();
+	isb();
+}
+
+static inline void flush_tlb_all(void)
+{
+	//TODO
+	local_flush_tlb_all();
+}
+
+extern bool mmu_enabled(void);
+extern void mmu_enable(pgd_t *pgtable);
+extern void mmu_enable_idmap(void);
+extern void mmu_init_io_sect(pgd_t *pgtable);
 
 #endif /* __ASMARM_MMU_H_ */
diff --git a/lib/arm/mmu.c b/lib/arm/mmu.c
new file mode 100644
index 0000000000000..c9d39bf6464b8
--- /dev/null
+++ b/lib/arm/mmu.c
@@ -0,0 +1,53 @@
+/*
+ * MMU enable and page table manipulation functions
+ *
+ * Copyright (C) 2014, Red Hat Inc, Andrew Jones drjo...@redhat.com
+ *
+ * This work is licensed under the terms of the GNU LGPL, version 2.
+ */
+#include "asm/setup.h"
+#include "asm/mmu.h"
+#include "asm/pgtable-hwdef.h"
+
+static bool mmu_on;
+static pgd_t idmap[PTRS_PER_PGD] __attribute__((aligned(L1_CACHE_BYTES)));
+
+bool mmu_enabled(void)
+{
+	return mmu_on;
+}
+
+extern void asm_mmu_enable(phys_addr_t pgtable);
+void mmu_enable(pgd_t *pgtable)
+{
+	asm_mmu_enable(__pa(pgtable));
+	flush_tlb_all();
+	mmu_on = true;
+}
+
+void mmu_init_io_sect(pgd_t *pgtable)
+{
+	/*
+	 * mach-virt reserves the first 1G section for I/O
+	 */
+	pgd_val(pgtable[0]) = PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_USER;
+	pgd_val(pgtable[0]) |= PMD_SECT_UNCACHED;
+}
+
+void mmu_enable_idmap(void)
+{
+	unsigned long sect, end;
+
+	mmu_init_io_sect(idmap);
+
+	end = sizeof(long) == 8 || !(PHYS_END >> 32) ?
+					PHYS_END : 0xf0000000;
+
+	for (sect = PHYS_OFFSET & PGDIR_MASK; sect < end; sect += PGDIR_SIZE) {
+		int i = sect >> PGDIR_SHIFT;
+		pgd_val(idmap[i]) = sect;
+		pgd_val(idmap[i]) |= PMD_TYPE_SECT | PMD_SECT_AF | PMD_SECT_USER;
+		pgd_val(idmap[i]) |= PMD_SECT_S | PMD_SECT_WBWA;
+	}
+
+	mmu_enable(idmap);
+}
diff --git a/lib/arm/processor.c b/lib/arm/processor.c
index 382a128edd415..866d11975b23b 100644
--- a/lib/arm/processor.c
+++ b/lib/arm/processor.c
@@ -92,6 +92,17 @@ void do_handle_exception(enum vector v,
Re: [PATCH kvm-unit-tests] arm: fix crash when caches are off
On Fri, Sep 26, 2014 at 09:51:15AM +0200, Christoffer Dall wrote: On Tue, Sep 16, 2014 at 08:57:31AM -0400, Andrew Jones wrote: - Original Message - Il 16/09/2014 14:43, Andrew Jones ha scritto: I don't think we need to worry about this case. AFAIU, enabling the caches for a particular cpu shouldn't require any synchronization. So we should be able to do: enable caches; spin_lock; start other processors; spin_unlock. Ok, I'll test and apply your patch then. Once you change the code to enable caches, please consider hanging on spin_lock with caches disabled. Unfortunately I can't do that without changing spin_lock into a wrapper function. Early setup code calls functions that use spin_locks, e.g. puts(), and we won't want to move the cache enablement into early setup code, as that should be left for unit tests to turn on/off as they wish. Thus we either need to be able to change the spin_lock implementation dynamically, or just leave the test/return as is. My take on this whole thing is that we're doing something fundamentally wrong. I think what we should do is to always enable the MMU for running actual tests, bringing up multiple CPUs etc. We could have an early_printf() that doesn't use the spinlock. I think this will just be a more stable setup. Do we have clear ideas of which kinds of tests it would make sense to run without the MMU turned on? If we can be more concrete on this subject, perhaps a special path (or build) that doesn't enable the MMU for running the aforementioned test cases could be added. I finally carved out some kvm-unit-tests time again and fixed this properly. A series is on the list: [kvm-unit-tests PATCH 0/6] arm: enable MMU. drew -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
suspicious rcu_dereference_check() usage warning with 3.18-rc2
Hi, I've got a warning with the latest Linus tree like below: [ INFO: suspicious RCU usage. ] 3.18.0-rc2-test2+ #70 Not tainted --- include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 1 lock held by qemu-system-x86/2371: #0: (vcpu-mutex){+.+...}, at: [a037d800] vcpu_load+0x20/0xd0 [kvm] stack backtrace: CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 0001 880209983ca8 816f514f 8802099b8990 880209983cd8 810bd687 000fee00 880208a2c000 880208a1 88020ef50040 880209983d08 Call Trace: [816f514f] dump_stack+0x4e/0x71 [810bd687] lockdep_rcu_suspicious+0xe7/0x120 [a037d055] gfn_to_memslot+0xd5/0xe0 [kvm] [a03807d3] __gfn_to_pfn+0x33/0x60 [kvm] [a0380885] gfn_to_page+0x25/0x90 [kvm] [a038aeec] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm] [a08f0a9c] vmx_vcpu_reset+0x20c/0x460 [kvm_intel] [a039ab8e] kvm_vcpu_reset+0x15e/0x1b0 [kvm] [a039ac0c] kvm_arch_vcpu_setup+0x2c/0x50 [kvm] [a037f7e0] kvm_vm_ioctl+0x1d0/0x780 [kvm] [810bc664] ? __lock_is_held+0x54/0x80 [812231f0] do_vfs_ioctl+0x300/0x520 [8122ee45] ? __fget+0x5/0x250 [8122f0fa] ? __fget_light+0x2a/0xe0 [81223491] SyS_ioctl+0x81/0xa0 [816fed6d] system_call_fastpath+0x16/0x1b kvm: zapping shadow pages for mmio generation wraparound kvm [2369]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x The machine itself and KVM work fine even after this warning. I'm not sure whether this is new, maybe it's triggered now since I changed my Kconfig to cover more RCU testing recently. The warning is reproducible, I can see it at the first invocation of kvm after each fresh boot. Does this ring a bell to anyone? thanks, Takashi -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH RFC 00/11] qemu: towards virtio-1 host support
On Tue, 28 Oct 2014 06:43:29 +0200 Michael S. Tsirkin m...@redhat.com wrote: On Fri, Oct 24, 2014 at 10:38:39AM +0200, Cornelia Huck wrote: On Fri, 24 Oct 2014 00:42:20 +0300 Michael S. Tsirkin m...@redhat.com wrote: On Tue, Oct 07, 2014 at 04:39:56PM +0200, Cornelia Huck wrote: This patchset aims to get us some way to implement virtio-1 compliant and transitional devices in qemu. Branch available at git://github.com/cohuck/qemu virtio-1

I've mainly focused on:
- endianness handling
- extended feature bits
- virtio-ccw new/changed commands

So issues identified so far: Thanks for taking a look. - devices not converted yet should not advertise 1.0

Neither should an unconverted transport. So we can either:
- have the transport set the bit and rely on the devices' ->get_features callback to mask it out (virtio-ccw has to change the calling order for get_features, btw.)
- have the device set the bit and the transport mask it out later. Feels a bit weird, as virtio-1 is a transport feature bit.

I've thought more about it; I think the right thing would be for unconverted transports to clear high bits on ack and get features. This should work out of the box with my patches (virtio-pci and virtio-mmio return 0 for high feature bits). So bit 32 is set, but not exposed to guests. In fact, at least for PCI, we have a 32 bit field for features in 0.9 so it's automatic. Didn't check mmio yet.

We still need to make sure the bit is not set for unconverted devices, though. But you're probably right that having the device set the bit is less error-prone.
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [PATCH RFC 05/11] virtio: introduce legacy virtio devices
On Tue, 28 Oct 2014 16:40:18 +0100 Greg Kurz gk...@linux.vnet.ibm.com wrote: On Tue, 7 Oct 2014 16:40:01 +0200 Cornelia Huck cornelia.h...@de.ibm.com wrote: Introduce a helper function to indicate whether a virtio device is operating in legacy or virtio standard mode. It may be used to make decisions about the endianness of virtio accesses and other virtio-1 specific changes, enabling us to support transitional devices.

Reviewed-by: Thomas Huth th...@linux.vnet.ibm.com
Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com
---
 hw/virtio/virtio.c                |  6 +-
 include/hw/virtio/virtio-access.h |  4 ++++
 include/hw/virtio/virtio.h        | 13 ++++++++---
 3 files changed, 20 insertions(+), 3 deletions(-)

diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c
index 7aaa953..e6ae3a0 100644
--- a/hw/virtio/virtio.c
+++ b/hw/virtio/virtio.c
@@ -883,7 +883,11 @@ static bool virtio_device_endian_needed(void *opaque)
     VirtIODevice *vdev = opaque;
 
     assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN);
-    return vdev->device_endian != virtio_default_endian();
+    if (virtio_device_is_legacy(vdev)) {
+        return vdev->device_endian != virtio_default_endian();
+    }
+    /* Devices conforming to VIRTIO 1.0 or later are always LE. */
+    return vdev->device_endian != VIRTIO_DEVICE_ENDIAN_LITTLE;
 }

Shouldn't we have some code doing the following somewhere?

    if (!virtio_device_is_legacy(vdev)) {
        vdev->device_endian = VIRTIO_DEVICE_ENDIAN_LITTLE;
    }

Also, since virtio-1 is LE only, do we expect device_endian to be different from VIRTIO_DEVICE_ENDIAN_LITTLE?

device_endian should not depend on whether the device is legacy or not. virtio_is_big_endian always returns false for virtio-1 devices, though.
-- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v12 2/6] KVM: Add generic support for dirty page logging
On 10/30/2014 05:14 AM, Cornelia Huck wrote: On Wed, 22 Oct 2014 15:34:07 -0700 Mario Smarduch m.smard...@samsung.com wrote: This patch defines KVM_GENERIC_DIRTYLOG, and moves the dirty log read function to its own file virt/kvm/dirtylog.c. x86 is updated to use the generic dirty log interface, selecting KVM_GENERIC_DIRTYLOG in its Kconfig and makefile. No other architectures are affected; each uses its own version. This changed from the previous patch revision where non-generic architectures were modified. In a subsequent patch armv7 does the same thing. All other architectures continue to use their architecture-defined version. Hm. The x86-specific version of dirty page logging is generic enough to be used by other architectures, notably ARMv7. So let's move the x86 code under virt/kvm/ and make it depend on KVM_GENERIC_DIRTYLOG. Other architectures continue to use their own implementations. ? I'll update descriptions for both patches, with the more concise descriptions. Thanks. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/x86/include/asm/kvm_host.h |3 -- arch/x86/kvm/Kconfig|1 + arch/x86/kvm/Makefile |1 + arch/x86/kvm/x86.c | 86 -- include/linux/kvm_host.h|4 ++ virt/kvm/Kconfig|3 ++ virt/kvm/dirtylog.c | 112 +++ 7 files changed, 121 insertions(+), 89 deletions(-) create mode 100644 virt/kvm/dirtylog.c diff --git a/virt/kvm/dirtylog.c b/virt/kvm/dirtylog.c new file mode 100644 index 000..67a --- /dev/null +++ b/virt/kvm/dirtylog.c @@ -0,0 +1,112 @@ +/* + * kvm generic dirty logging support, used by architectures that share + * comman dirty page logging implementation. s/comman/common/ The approach looks sane to me, especially as it does not change other architectures needlessly.
Re: [PATCH v12 0/6] arm/KVM: dirty page logging support for ARMv7 (3.17.0-rc1)
On 10/30/2014 05:11 AM, Christian Borntraeger wrote: On 23.10.2014 00:34, Mario Smarduch wrote: This patch series introduces dirty page logging for ARMv7 and adds some degree of generic dirty logging support for x86, armv7 and later armv8. I implemented Alex's suggestion after he took a look at the patches at kvm forum to simplify the generic/arch split - leaving mips, powerpc, s390, (ia64 although broken) unchanged. x86/armv7 now share some dirty logging code. armv8 dirty log patches have been posted and tested but for the time being armv8 is non-generic as well. I briefly spoke to most of you at kvm forum, and this is the patch series I was referring to. Implementation changed from the previous version (patches 1 and 2); those who acked the previous revision, please review again. The last 4 patches (ARM) have been rebased for a newer kernel, with no significant changes. Testing: - Generally live migration + checksumming of source/destination memory regions is used to validate correctness. - qemu machvirt, VExpress - Exynos 5440, FastModels - lmbench + dirty guest memory cycling. - ARMv8 Foundation Model/kvmtool - Due to slight overlap in 2nd stage handlers did a basic bringup using qemu. - x86_64 qemu default machine model, tested migration on HP Z620, tested convergence for several dirty page rates See https://github.com/mjsmar/arm-dirtylog-tests - Dirtlogtest-setup.pdf for ARMv7 - https://github.com/mjsmar/arm-dirtylog-tests/tree/master/v7 - README The patch affects armv7, armv8, mips, ia64, powerpc, s390, x86_64. The patch series has been compiled for the affected architectures: - x86_64 - defconfig - ia64 - ia64-linux-gcc4.6.3 - defconfig; ia64 Kconfig defines BROKEN, worked around that to make sure new changes don't break the build. Eventually the build breaks due to other reasons.
- mips - mips64-linux-gcc4.6.3 - malta_kvm_defconfig - ppc - powerpc64-linux-gcc4.6.3 - pseries_defconfig - s390 - s390x-linux-gcc4.6.3 - defconfig - armv8 - aarch64-linux-gnu-gcc4.8.1 - defconfig
ARMv7 dirty page logging implementation overview:
- initially write protects the VM RAM memory region - 2nd stage page tables
- add support to read the dirty page log and again write protect the dirty pages - second stage page table for the next pass.
- second-stage huge pages are dissolved into small page tables to keep track of dirty pages at page granularity. Tracking at huge page granularity limits migration to an almost idle system. Small page size logging supports higher memory dirty rates.
- In the event migration is canceled, normal behavior is resumed and huge pages are rebuilt over time.
Changes since v11: - Implemented Alex's comments to simplify the generic layer.
Changes since v10: - addressed wanghaibin's comments - addressed Christoffer's comments
Changes since v9: - Split patches into generic and architecture-specific variants for TLB flushing and dirty log read (patches 1,2 and 3,4,5,6) - rebased to 3.16.0-rc1 - Applied Christoffer's comments.
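To see why dissolving huge pages matters for migration convergence, compare dirty tracking at 2 MB versus 4 KB granularity. A toy calculation (the write offsets are made up; it assumes one tracking bit per unit, with a whole unit re-sent when any byte in it is dirtied):

```python
PAGE = 4 * 1024            # small-page tracking granularity
HUGE = 2 * 1024 * 1024     # stage-2 huge page size (2 MB)

def bytes_to_resend(dirty_byte_offsets, granularity):
    # One dirtied byte marks its whole tracking unit dirty,
    # so the entire unit must be re-sent to the destination.
    dirty_units = {off // granularity for off in dirty_byte_offsets}
    return len(dirty_units) * granularity

writes = [0, 8192, 5 * 1024 * 1024]   # three scattered one-byte writes

print(bytes_to_resend(writes, PAGE))  # 12288 (3 small pages)
print(bytes_to_resend(writes, HUGE))  # 4194304 (2 huge pages)
```

A few scattered writes per iteration cost kilobytes at page granularity but megabytes at huge-page granularity, which is why huge-page tracking only converges on an almost idle guest.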
Mario Smarduch (6): KVM: Add architecture-defined TLB flush support KVM: Add generic support for dirty page logging arm: KVM: Add ARMv7 API to flush TLBs arm: KVM: Add initial dirty page locking infrastructure arm: KVM: dirty log read write protect support arm: KVM: ARMv7 dirty page logging 2nd stage page fault arch/arm/include/asm/kvm_asm.h|1 + arch/arm/include/asm/kvm_host.h | 14 +++ arch/arm/include/asm/kvm_mmu.h| 20 arch/arm/include/asm/pgtable-3level.h |1 + arch/arm/kvm/Kconfig |2 + arch/arm/kvm/Makefile |1 + arch/arm/kvm/arm.c|2 + arch/arm/kvm/interrupts.S | 11 ++ arch/arm/kvm/mmu.c| 209 +++-- arch/x86/include/asm/kvm_host.h |3 - arch/x86/kvm/Kconfig |1 + arch/x86/kvm/Makefile |1 + arch/x86/kvm/x86.c| 86 -- include/linux/kvm_host.h |4 + virt/kvm/Kconfig |6 + virt/kvm/dirtylog.c | 112 ++ virt/kvm/kvm_main.c |2 + 17 files changed, 380 insertions(+), 96 deletions(-) create mode 100644 virt/kvm/dirtylog.c Patches 1-3 seem to work fine on s390. The other patches are arm-only (well, can't find 5 and 6), so I guess it's ok for s390. The patches are there but threading is broken, due to the mail server message threshold rate. Just in case, links below: https://lists.cs.columbia.edu/pipermail/kvmarm/2014-October/011730.html https://lists.cs.columbia.edu/pipermail/kvmarm/2014-October/011731.html Thanks.
Re: suspicious rcu_dereference_check() usage warning with 3.18-rc2
On Thu, Oct 30, 2014 at 9:44 AM, Takashi Iwai ti...@suse.de wrote: Hi, I've got a warning with the latest Linus tree like below: [ INFO: suspicious RCU usage. ] 3.18.0-rc2-test2+ #70 Not tainted --- include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 1 lock held by qemu-system-x86/2371: #0: (vcpu-mutex){+.+...}, at: [a037d800] vcpu_load+0x20/0xd0 [kvm] stack backtrace: CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 0001 880209983ca8 816f514f 8802099b8990 880209983cd8 810bd687 000fee00 880208a2c000 880208a1 88020ef50040 880209983d08 Call Trace: [816f514f] dump_stack+0x4e/0x71 [810bd687] lockdep_rcu_suspicious+0xe7/0x120 [a037d055] gfn_to_memslot+0xd5/0xe0 [kvm] [a03807d3] __gfn_to_pfn+0x33/0x60 [kvm] [a0380885] gfn_to_page+0x25/0x90 [kvm] [a038aeec] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm] [a08f0a9c] vmx_vcpu_reset+0x20c/0x460 [kvm_intel] [a039ab8e] kvm_vcpu_reset+0x15e/0x1b0 [kvm] [a039ac0c] kvm_arch_vcpu_setup+0x2c/0x50 [kvm] [a037f7e0] kvm_vm_ioctl+0x1d0/0x780 [kvm] [810bc664] ? __lock_is_held+0x54/0x80 [812231f0] do_vfs_ioctl+0x300/0x520 [8122ee45] ? __fget+0x5/0x250 [8122f0fa] ? __fget_light+0x2a/0xe0 [81223491] SyS_ioctl+0x81/0xa0 [816fed6d] system_call_fastpath+0x16/0x1b kvm: zapping shadow pages for mmio generation wraparound kvm [2369]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x The machine itself and KVM work fine even after this warning. I'm not sure whether this is new, maybe it's triggered now since I changed my Kconfig to cover more RCU testing recently. The warning is reproducible, I can see it at the first invocation of kvm after each fresh boot. Does this ring a bell to anyone? see exactly the same trace when lockdep is on. 
Re: [Qemu-devel] [PATCH RFC 05/11] virtio: introduce legacy virtio devices
On Thu, 30 Oct 2014 19:02:01 +0100 Cornelia Huck cornelia.h...@de.ibm.com wrote: On Tue, 28 Oct 2014 16:40:18 +0100 Greg Kurz gk...@linux.vnet.ibm.com wrote: On Tue, 7 Oct 2014 16:40:01 +0200 Cornelia Huck cornelia.h...@de.ibm.com wrote: Introduce a helper function to indicate whether a virtio device is operating in legacy or virtio standard mode. It may be used to make decisions about the endianness of virtio accesses and other virtio-1 specific changes, enabling us to support transitional devices. Reviewed-by: Thomas Huth th...@linux.vnet.ibm.com Signed-off-by: Cornelia Huck cornelia.h...@de.ibm.com --- hw/virtio/virtio.c|6 +- include/hw/virtio/virtio-access.h |4 include/hw/virtio/virtio.h| 13 +++-- 3 files changed, 20 insertions(+), 3 deletions(-) diff --git a/hw/virtio/virtio.c b/hw/virtio/virtio.c index 7aaa953..e6ae3a0 100644 --- a/hw/virtio/virtio.c +++ b/hw/virtio/virtio.c @@ -883,7 +883,11 @@ static bool virtio_device_endian_needed(void *opaque) VirtIODevice *vdev = opaque; assert(vdev->device_endian != VIRTIO_DEVICE_ENDIAN_UNKNOWN); -return vdev->device_endian != virtio_default_endian(); +if (virtio_device_is_legacy(vdev)) { +return vdev->device_endian != virtio_default_endian(); +} +/* Devices conforming to VIRTIO 1.0 or later are always LE. */ +return vdev->device_endian != VIRTIO_DEVICE_ENDIAN_LITTLE; } Shouldn't we have some code doing the following somewhere ? if (!virtio_device_is_legacy(vdev)) { vdev->device_endian = VIRTIO_DEVICE_ENDIAN_LITTLE; } also, since virtio-1 is LE only, do we expect device_endian to be different from VIRTIO_DEVICE_ENDIAN_LITTLE ? device_endian should not depend on whether the device is legacy or not. virtio_is_big_endian always returns false for virtio-1 devices, though. Sorry, I had missed the virtio_is_big_endian() change: that makes device_endian a legacy-virtio-only matter. So why would we care to migrate the endian subsection when we have a virtio-1 device?
Shouldn't virtio_device_endian_needed() return false for virtio-1? -- Greg
Re: [PATCH v2 4/6] hw_random: fix unregister race.
Herbert Xu herb...@gondor.apana.org.au writes: On Thu, Sep 18, 2014 at 08:37:45PM +0800, Amos Kong wrote: From: Rusty Russell ru...@rustcorp.com.au The previous patch added one potential problem: we can still be reading from a hwrng when it's unregistered. Add a wait for zero in the hwrng_unregister path. Signed-off-by: Rusty Russell ru...@rustcorp.com.au You totally corrupted Rusty's patch. If you're going to repost his series, you'd better make sure that you've got the right patches. Just as well though, as it made me think a little more about this patch :) OK Amos, can you please repost the complete series? Thanks, Rusty.
Re: [PATCH 00/17] RFC: userfault v2
On 2014/10/30 20:49, Dr. David Alan Gilbert wrote: * zhanghailiang (zhang.zhanghaili...@huawei.com) wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: Hi Zhanghailiang, On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: Hi Andrea, Thanks for your hard work on userfault;) This is really a useful API. I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page or both trigger userfault. I think this will help support vhost-scsi, ivshmem for migration; we can trace dirty pages in userspace. Actually, I'm trying to realize live memory snapshot based on pre-copy and userfault, but reading memory from the migration thread will also trigger userfault. It will be easy to implement live memory snapshot if we support configuring userfault for writing memory only. Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthy feature to have here. After some chat during the KVM Forum I'd already been thinking it could be beneficial for some usage to give userland the information about the fault being read or write, combined with the ability of mapping pages wrprotected to mcopy_atomic (that would work without false positives only with MADV_DONTFORK also set, but it's already set in qemu). That will require vma->vm_flags VM_USERFAULT to be checked also in the wrprotect faults, not just in the not-present faults, but it's not a massive change. Returning the read/write information is also not a massive change. This will then pay off mostly if there's also a way to remove the memory atomically (kind of remap_anon_pages). Would that be enough? I mean, are you still ok if non-present read faults trap too (you'd be notified it's a read) and you get notification for both wrprotect and non-present faults? Hi Andrea, Thanks for your reply, and your patience;) Er, maybe I didn't describe clearly.
What I really need for live memory snapshot is only the wrprotect fault, like kvm's dirty tracing mechanism, *only tracing the write action*. My initial solution scheme for live memory snapshot is:
(1) pause the VM
(2) use userfaultfd to mark all memory of the VM wrprotect (readonly)
(3) save device state to the snapshot file
(4) resume the VM
(5) the snapshot thread begins to save pages of memory to the snapshot file
(6) the VM is going to run, and it is OK for the VM or another thread to read ram (no fault trap), but if the VM tries to write a page (dirty the page), there will be a userfault trap notification.
(7) a fault-handle-thread reads the page request from userfaultfd; it will copy the content of the page to some buffers, and then remove the page's wrprotect limit (still using userfaultfd to tell the kernel).
(8) after step (7), the VM can continue to write the page, which can now be written.
(9) the snapshot thread saves the page cached in step (7)
(10) repeat steps (5)-(9) until all of the VM's memory is saved to the snapshot file.
Hmm, I can see the same process being useful for fault-tolerance schemes like COLO; it needs a memory state snapshot. So, what I need for userfault is supporting only the wrprotect fault. I don't want to get notification for non-present reading faults; it will influence the VM's performance and the efficiency of doing the snapshot. What pages would be non-present at this point - just balloon? Er, sorry, it should be 'non-present page faults';) Dave Also, I think this feature will benefit the migration of ivshmem and vhost-scsi, which have no dirty-page tracing now. The question then is how you mark the memory readonly to let the wrprotect faults trap if the memory already existed and you didn't map it yourself in the guest with mcopy_atomic with a readonly flag.
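The write-protect cycle in steps (5)-(9) above can be modelled in plain userspace code. This is a toy simulation only — the real mechanism (userfaultfd wrprotect faults, the fault-handle-thread, the kernel side) is abstracted into ordinary method calls, and the class name is made up:

```python
# Toy model of copy-before-write snapshotting: all pages start
# write-protected; the first write to a page saves its pre-write
# contents to the snapshot, then lifts the protection.
class SnapshotRam:
    def __init__(self, pages):
        self.pages = list(pages)               # live page contents
        self.wrprotected = set(range(len(pages)))
        self.snapshot = {}                     # page index -> saved contents

    def write(self, idx, value):
        if idx in self.wrprotected:            # would be a wrprotect fault
            self.snapshot[idx] = self.pages[idx]  # step (7): save first...
            self.wrprotected.discard(idx)         # ...then unprotect
        self.pages[idx] = value                # step (8): the write proceeds

    def drain(self):
        # steps (5)/(9): the snapshot thread saves still-clean pages
        for idx in list(self.wrprotected):
            self.snapshot[idx] = self.pages[idx]
            self.wrprotected.discard(idx)
        return [self.snapshot[i] for i in range(len(self.pages))]

ram = SnapshotRam(["a", "b", "c"])
ram.write(1, "B")      # guest dirties page 1: "b" is saved before the write
print(ram.drain())     # ['a', 'b', 'c'] -- the pre-write contents
print(ram.pages)       # ['a', 'B', 'c'] -- the live contents moved on
```

The key property, and the reason reads must not trap, is that only the first write to each page pays the fault cost; reads and subsequent writes run at full speed.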
My current plan would be:
- keep MADV_USERFAULT|NOUSERFAULT just to set VM_USERFAULT for the fast path check in the not-present and wrprotect page fault
- if VM_USERFAULT is set, find if there's a userfaultfd registered into that vma too; if yes, engage the userfaultfd protocol, otherwise raise SIGBUS (single-threaded apps should be fine with SIGBUS and it'll avoid them having to spawn a thread in order to talk the userfaultfd protocol)
- if the userfaultfd protocol is engaged, return read|write fault + fault address to read(ufd) syscalls
- leave the userfault resolution mechanism independent of the userfaultfd protocol so we keep the two problems separated and we don't mix them in the same API, which makes it even harder to finalize it. add mcopy_atomic (with a flag to map the page readonly too)
The alternative would be to hide mcopy_atomic (and even remap_anon_pages in order to remove the memory atomically for the externalization into the cloud) as userfaultfd commands to write into the fd. But then there would be not much point to keep MADV_USERFAULT around if I do so and I could just remove it too or it
Re: suspicious rcu_dereference_check() usage warning with 3.18-rc2
On Thu, Oct 30, 2014 at 05:44:48PM +0100, Takashi Iwai wrote: Hi, I've got a warning with the latest Linus tree like below: [ INFO: suspicious RCU usage. ] 3.18.0-rc2-test2+ #70 Not tainted --- include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 1 lock held by qemu-system-x86/2371: #0: (vcpu->mutex){+.+...}, at: [a037d800] vcpu_load+0x20/0xd0 [kvm] stack backtrace: CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 0001 880209983ca8 816f514f 8802099b8990 880209983cd8 810bd687 000fee00 880208a2c000 880208a1 88020ef50040 880209983d08 Call Trace: [816f514f] dump_stack+0x4e/0x71 [810bd687] lockdep_rcu_suspicious+0xe7/0x120 [a037d055] gfn_to_memslot+0xd5/0xe0 [kvm] [a03807d3] __gfn_to_pfn+0x33/0x60 [kvm] [a0380885] gfn_to_page+0x25/0x90 [kvm] [a038aeec] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm] The srcu read lock must be held while accessing memslots (e.g. when using gfn_to_* functions); however, kvm_vcpu_reload_apic_access_page() doesn't do this. I will send a patch to fix it after reproducing it. Regards, Wanpeng Li [a08f0a9c] vmx_vcpu_reset+0x20c/0x460 [kvm_intel] [a039ab8e] kvm_vcpu_reset+0x15e/0x1b0 [kvm] [a039ac0c] kvm_arch_vcpu_setup+0x2c/0x50 [kvm] [a037f7e0] kvm_vm_ioctl+0x1d0/0x780 [kvm] [810bc664] ? __lock_is_held+0x54/0x80 [812231f0] do_vfs_ioctl+0x300/0x520 [8122ee45] ? __fget+0x5/0x250 [8122f0fa] ? __fget_light+0x2a/0xe0 [81223491] SyS_ioctl+0x81/0xa0 [816fed6d] system_call_fastpath+0x16/0x1b kvm: zapping shadow pages for mmio generation wraparound kvm [2369]: vcpu0 disabled perfctr wrmsr: 0xc1 data 0x The machine itself and KVM work fine even after this warning. I'm not sure whether this is new, maybe it's triggered now since I changed my Kconfig to cover more RCU testing recently.
The warning is reproducible, I can see it at the first invocation of kvm after each fresh boot. Does this ring a bell to anyone? thanks, Takashi
Re: Benchmarking for vhost polling patch
Hi Michael, Following the polling patch thread: http://marc.info/?l=kvm&m=140853271510179&w=2, I changed poll_stop_idle to be counted in microseconds, and carried out experiments using varying sizes of this value. The setup for netperf consisted of 1 VM and 1 vhost, each running on their own dedicated core. Could you provide your changed code? Thanks, Zhang Haoyu Hi Zhang, Do you mean the change in code for poll_stop_idle? Yes, it's better to provide the complete code, including the polling patch. Thanks, Zhang Haoyu Thanks, Razya
Re: [PATCH 00/17] RFC: userfault v2
On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page or both trigger userfault. Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is worthy feature to have here. I'll open that can of worms :-) [...] Er, maybe i didn't describe clearly. What i really need for live memory snapshot is only wrprotect fault, like kvm's dirty tracing mechanism, *only tracing write action*. So, what i need for userfault is supporting only wrprotect fault. i don't want to get notification for non present reading faults, it will influence VM's performance and the efficiency of doing snapshot. Given that you do care about performance Zhanghailiang, I don't think that a userfault handler is a good place to track dirty memory. Every dirtying write will block on the userfault handler, which is an expensively slow proposition compared to an in-kernel approach. Also, i think this feature will benefit for migration of ivshmem and vhost-scsi which have no dirty-page-tracing now. I do agree wholeheartedly with you here. Manually tracking non-guest writes adds to the complexity of device emulation code. A central fault-driven means for dirty tracking writes from the guest and host would be a welcome simplification to implementing pre-copy migration. Indeed, that's exactly what I'm working on! I'm using the softdirty bit, which was introduced recently for CRIU migration, to replace the use of KVM's dirty logging and manual dirty tracking by the VMM during pre-copy migration. See Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. 
To make softdirty usable for live migration, I've added an API to atomically test-and-clear the bit and write protect the page.
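For readers unfamiliar with the files Peter mentions: each 64-bit entry in /proc/<pid>/pagemap encodes per-page flags, with bit 55 being the soft-dirty flag described in soft-dirty.txt. A minimal decoder for such entries (bit positions per the kernel docs; this only decodes integers, so it runs even without a CONFIG_MEM_SOFT_DIRTY kernel):

```python
SOFT_DIRTY = 1 << 55   # set on the first write after clearing via clear_refs
SWAPPED    = 1 << 62
PRESENT    = 1 << 63
PFN_MASK   = (1 << 55) - 1  # bits 0-54: page frame number, if present

def decode_pagemap_entry(entry):
    """Decode one 64-bit /proc/<pid>/pagemap entry into named flags."""
    return {
        "present": bool(entry & PRESENT),
        "swapped": bool(entry & SWAPPED),
        "soft_dirty": bool(entry & SOFT_DIRTY),
        # bits 0-54 hold the PFN only for present pages (swap entries
        # encode type/offset there instead):
        "pfn": (entry & PFN_MASK) if entry & PRESENT else None,
    }

# A present, soft-dirty page with PFN 0x1234:
print(decode_pagemap_entry((1 << 63) | (1 << 55) | 0x1234))
```

The pre-copy loop Peter describes amounts to: clear soft-dirty bits (write "4" to /proc/<pid>/clear_refs), let the guest run, then re-read pagemap and re-send every page whose entry decodes with soft_dirty set.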
Re: [PATCH 00/17] RFC: userfault v2
On 2014/10/31 10:23, Peter Feiner wrote: On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page or both trigger userfault. Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthy feature to have here. I'll open that can of worms :-) [...] Er, maybe I didn't describe clearly. What I really need for live memory snapshot is only the wrprotect fault, like kvm's dirty tracing mechanism, *only tracing the write action*. So, what I need for userfault is supporting only the wrprotect fault. I don't want to get notification for non-present reading faults; it will influence the VM's performance and the efficiency of doing the snapshot. Given that you do care about performance Zhanghailiang, I don't think that a userfault handler is a good place to track dirty memory. Every dirtying write will block on the userfault handler, which is an expensively slow proposition compared to an in-kernel approach. Agreed, but for doing a live memory snapshot (the VM is running when doing the snapshot), we have to do this (block the write action), because we have to save the page before it is dirtied by the write. This is the difference, compared to pre-copy migration. Also, I think this feature will benefit the migration of ivshmem and vhost-scsi, which have no dirty-page tracing now. I do agree wholeheartedly with you here. Manually tracking non-guest writes adds to the complexity of device emulation code. A central fault-driven means for dirty tracking writes from the guest and host would be a welcome simplification to implementing pre-copy migration. Indeed, that's exactly what I'm working on!
I'm using the softdirty bit, which was introduced recently for CRIU migration, to replace the use of KVM's dirty logging and manual dirty tracking by the VMM during pre-copy migration. See Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To make softdirty usable for live migration, I've added an API to atomically test-and-clear the bit and write protect the page.

Great! Do you plan to issue your patches to the community? I mean, is your work based on qemu, or an independent tool (CRIU migration?) for live migration? Maybe I could fix the migration problem for ivshmem in qemu now, based on the softdirty mechanism.

I have read them cursorily; it is useful for pre-copy indeed. But it seems that it cannot meet my need for snapshot.

How can I find the API? Has it been merged into the kernel's master branch already? Thanks, zhanghailiang
[PATCH] KVM: x86: fix access memslots w/o hold srcu read lock
The srcu read lock must be held while accessing memslots (e.g. when using gfn_to_* functions); however, commit c24ae0dcd3e8 (kvm: x86: Unpin and remove kvm_arch->apic_access_page) calls gfn_to_page() in kvm_vcpu_reload_apic_access_page() without holding it, which leads to a suspicious rcu_dereference_check() usage warning. This patch fixes it by holding the srcu read lock when calling gfn_to_page() in kvm_vcpu_reload_apic_access_page(). [ INFO: suspicious RCU usage. ] 3.18.0-rc2-test2+ #70 Not tainted --- include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 1 lock held by qemu-system-x86/2371: #0: (vcpu->mutex){+.+...}, at: [a037d800] vcpu_load+0x20/0xd0 [kvm] stack backtrace: CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 0001 880209983ca8 816f514f 8802099b8990 880209983cd8 810bd687 000fee00 880208a2c000 880208a1 88020ef50040 880209983d08 Call Trace: [816f514f] dump_stack+0x4e/0x71 [810bd687] lockdep_rcu_suspicious+0xe7/0x120 [a037d055] gfn_to_memslot+0xd5/0xe0 [kvm] [a03807d3] __gfn_to_pfn+0x33/0x60 [kvm] [a0380885] gfn_to_page+0x25/0x90 [kvm] [a038aeec] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm] [a08f0a9c] vmx_vcpu_reset+0x20c/0x460 [kvm_intel] [a039ab8e] kvm_vcpu_reset+0x15e/0x1b0 [kvm] [a039ac0c] kvm_arch_vcpu_setup+0x2c/0x50 [kvm] [a037f7e0] kvm_vm_ioctl+0x1d0/0x780 [kvm] [810bc664] ? __lock_is_held+0x54/0x80 [812231f0] do_vfs_ioctl+0x300/0x520 [8122ee45] ? __fget+0x5/0x250 [8122f0fa] ?
__fget_light+0x2a/0xe0 [81223491] SyS_ioctl+0x81/0xa0 [816fed6d] system_call_fastpath+0x16/0x1b Reported-by: Takashi Iwai ti...@suse.de Reported-by: Alexei Starovoitov alexei.starovoi...@gmail.com Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com --- arch/x86/kvm/x86.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0033df3..2d97329 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6059,6 +6059,7 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu) void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) { struct page *page = NULL; + int idx; if (!irqchip_in_kernel(vcpu->kvm)) return; @@ -6066,7 +6067,9 @@ void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) if (!kvm_x86_ops->set_apic_access_page_addr) return; + idx = srcu_read_lock(vcpu->kvm->srcu); page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); + srcu_read_unlock(vcpu->kvm->srcu, idx); kvm_x86_ops->set_apic_access_page_addr(vcpu, page_to_phys(page)); /* -- 1.9.1
Re: [PATCH 00/17] RFC: userfault v2
On 2014/10/31 11:29, zhanghailiang wrote: On 2014/10/31 10:23, Peter Feiner wrote: On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page or both trigger userfault. Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthy feature to have here. I'll open that can of worms :-) [...] Er, maybe I didn't describe clearly. What I really need for live memory snapshot is only the wrprotect fault, like kvm's dirty tracing mechanism, *only tracing the write action*. So, what I need for userfault is supporting only the wrprotect fault. I don't want to get notification for non-present reading faults; it will influence the VM's performance and the efficiency of doing the snapshot. Given that you do care about performance Zhanghailiang, I don't think that a userfault handler is a good place to track dirty memory. Every dirtying write will block on the userfault handler, which is an expensively slow proposition compared to an in-kernel approach. Agreed, but for doing a live memory snapshot (the VM is running when doing the snapshot), we have to do this (block the write action), because we have to save the page before it is dirtied by the write. This is the difference, compared to pre-copy migration. Again;) For snapshot, I don't use its dirty tracing ability; I just use it to block the write action, save the page, and then remove the write protection. Also, I think this feature will benefit the migration of ivshmem and vhost-scsi, which have no dirty-page tracing now. I do agree wholeheartedly with you here. Manually tracking non-guest writes adds to the complexity of device emulation code.
A central fault-driven means for dirty tracking writes from the guest and host would be a welcome simplification to implementing pre-copy migration. Indeed, that's exactly what I'm working on!

I'm using the softdirty bit, which was introduced recently for CRIU migration, to replace the use of KVM's dirty logging and manual dirty tracking by the VMM during pre-copy migration. See Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To make softdirty usable for live migration, I've added an API to atomically test-and-clear the bit and write protect the page.

Great! Do you plan to issue your patches to the community? I mean, is your work based on qemu, or an independent tool (CRIU migration?) for live migration? Maybe I could fix the migration problem for ivshmem in qemu now, based on the softdirty mechanism.

I have read them cursorily; it is useful for pre-copy indeed. But it seems that it cannot meet my need for snapshot.

How can I find the API? Has it been merged into the kernel's master branch already? Thanks, zhanghailiang
Re: [PATCH 00/17] RFC: userfault v2
On Thu, Oct 30, 2014 at 9:38 PM, zhanghailiang zhang.zhanghaili...@huawei.com wrote: On 2014/10/31 11:29, zhanghailiang wrote: On 2014/10/31 10:23, Peter Feiner wrote: On Thu, Oct 30, 2014 at 07:31:48PM +0800, zhanghailiang wrote: On 2014/10/30 1:46, Andrea Arcangeli wrote: On Mon, Oct 27, 2014 at 05:32:51PM +0800, zhanghailiang wrote: I want to confirm a question: Can we support distinguishing between writing and reading memory for userfault? That is, we can decide whether writing a page, reading a page or both trigger userfault. Mail is going to be long enough already so I'll just assume tracking dirty memory in userland (instead of doing it in kernel) is a worthy feature to have here. I'll open that can of worms :-) [...] Er, maybe I didn't describe clearly. What I really need for live memory snapshot is only the wrprotect fault, like kvm's dirty tracing mechanism, *only tracing the write action*. So, what I need for userfault is supporting only the wrprotect fault. I don't want to get notification for non-present reading faults; it will influence the VM's performance and the efficiency of doing the snapshot. Given that you do care about performance Zhanghailiang, I don't think that a userfault handler is a good place to track dirty memory. Every dirtying write will block on the userfault handler, which is an expensively slow proposition compared to an in-kernel approach. Agreed, but for doing a live memory snapshot (the VM is running when doing the snapshot), we have to do this (block the write action), because we have to save the page before it is dirtied by the write. This is the difference, compared to pre-copy migration. Again;) For snapshot, I don't use its dirty tracing ability; I just use it to block the write action, save the page, and then remove the write protection. You could do a CoW in the kernel, post a notification, keep going, and expose an interface for user-space to mmap the preserved copy. Getting the life-cycle of the preserved page(s) right is tricky, but doable.
Anyway, it's easy to hand-wave without knowing your specific requirements. Opening the discussion a bit, this does look similar to the xen-access interface, in which a xen domain vcpu could be stopped in its tracks while user-space was notified of (and acknowledged) a variety of scenarios: page was written to, page was read from, vcpu is attempting to execute from page, etc. Very applicable to anti-viruses right away; for example you can enforce W^X properties on pages. I don't know that Andrea wants to open the game so broadly for userfault, and the code right now is very specific to triggering on pte_none(), but that's a nice reward down this road. Andres Also, I think this feature will benefit migration of ivshmem and vhost-scsi, which have no dirty-page tracking now. I do agree wholeheartedly with you here. Manually tracking non-guest writes adds to the complexity of device emulation code. A central fault-driven means for dirty tracking writes from the guest and host would be a welcome simplification to implementing pre-copy migration. Indeed, that's exactly what I'm working on! I'm using the softdirty bit, which was introduced recently for CRIU migration, to replace the use of KVM's dirty logging and manual dirty tracking by the VMM during pre-copy migration. See Great! Do you plan to submit your patches to the community? I mean, is your work based on qemu, or an independent tool (CRIU migration?) for live migration? Maybe I could fix the migration problem for ivshmem in qemu now, based on the softdirty mechanism. Documentation/vm/soft-dirty.txt and pagemap.txt in case you aren't familiar. To I have read them cursorily; they are useful for pre-copy indeed. But it seems they cannot meet my need for snapshot. make softdirty usable for live migration, I've added an API to atomically test-and-clear the bit and write protect the page. How can I find the API? Has it been merged in the kernel's master branch already?
Thanks, zhanghailiang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- Andres Lagar-Cavilla | Google Kernel Team | andre...@google.com
Re: [PATCH] KVM: x86: fix access memslots w/o hold srcu read lock
On 2014/10/31 12:33, Wanpeng Li wrote: The srcu read lock must be held while accessing memslots (e.g. when using gfn_to_* functions); however, commit c24ae0dcd3e8 (kvm: x86: Unpin and remove kvm_arch->apic_access_page) calls gfn_to_page() in kvm_vcpu_reload_apic_access_page() without holding it, which leads to a suspicious rcu_dereference_check() usage warning. This patch fixes it by holding the srcu read lock around the gfn_to_page() call in kvm_vcpu_reload_apic_access_page(). [ INFO: suspicious RCU usage. ] 3.18.0-rc2-test2+ #70 Not tainted --- include/linux/kvm_host.h:474 suspicious rcu_dereference_check() usage! other info that might help us debug this: rcu_scheduler_active = 1, debug_locks = 0 1 lock held by qemu-system-x86/2371: #0: (vcpu->mutex){+.+...}, at: [a037d800] vcpu_load+0x20/0xd0 [kvm] stack backtrace: CPU: 4 PID: 2371 Comm: qemu-system-x86 Not tainted 3.18.0-rc2-test2+ #70 Hardware name: Dell Inc. OptiPlex 9010/0M9KCM, BIOS A12 01/10/2013 0001 880209983ca8 816f514f 8802099b8990 880209983cd8 810bd687 000fee00 880208a2c000 880208a1 88020ef50040 880209983d08 Call Trace: [816f514f] dump_stack+0x4e/0x71 [810bd687] lockdep_rcu_suspicious+0xe7/0x120 [a037d055] gfn_to_memslot+0xd5/0xe0 [kvm] [a03807d3] __gfn_to_pfn+0x33/0x60 [kvm] [a0380885] gfn_to_page+0x25/0x90 [kvm] [a038aeec] kvm_vcpu_reload_apic_access_page+0x3c/0x80 [kvm] [a08f0a9c] vmx_vcpu_reset+0x20c/0x460 [kvm_intel] [a039ab8e] kvm_vcpu_reset+0x15e/0x1b0 [kvm] [a039ac0c] kvm_arch_vcpu_setup+0x2c/0x50 [kvm] [a037f7e0] kvm_vm_ioctl+0x1d0/0x780 [kvm] [810bc664] ? __lock_is_held+0x54/0x80 [812231f0] do_vfs_ioctl+0x300/0x520 [8122ee45] ? __fget+0x5/0x250 [8122f0fa] ?
__fget_light+0x2a/0xe0 [81223491] SyS_ioctl+0x81/0xa0 [816fed6d] system_call_fastpath+0x16/0x1b Reported-by: Takashi Iwai ti...@suse.de Reported-by: Alexei Starovoitov alexei.starovoi...@gmail.com Signed-off-by: Wanpeng Li wanpeng...@linux.intel.com --- arch/x86/kvm/x86.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/x86.c b/arch/x86/kvm/x86.c index 0033df3..2d97329 100644 --- a/arch/x86/kvm/x86.c +++ b/arch/x86/kvm/x86.c @@ -6059,6 +6059,7 @@ static void kvm_vcpu_flush_tlb(struct kvm_vcpu *vcpu) void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) { struct page *page = NULL; + int idx; if (!irqchip_in_kernel(vcpu->kvm)) return; @@ -6066,7 +6067,9 @@ void kvm_vcpu_reload_apic_access_page(struct kvm_vcpu *vcpu) if (!kvm_x86_ops->set_apic_access_page_addr) return; + idx = srcu_read_lock(vcpu->kvm->srcu); There's another scenario where we already hold srcu before calling kvm_vcpu_reload_apic_access_page(): __vcpu_run() | + vcpu->srcu_idx = srcu_read_lock(kvm->srcu); + r = vcpu_enter_guest(vcpu); | + kvm_vcpu_reload_apic_access_page(vcpu); So according to the backtrace I think we should fix it as follows: kvm: x86: vmx: hold kvm->srcu while reloading apic access page kvm_vcpu_reload_apic_access_page() needs to access memslots via gfn_to_page(), so it's necessary to hold kvm->srcu.
Signed-off-by: Tiejun Chen tiejun.c...@intel.com --- arch/x86/kvm/vmx.c | 3 +++ 1 file changed, 3 insertions(+) diff --git a/arch/x86/kvm/vmx.c b/arch/x86/kvm/vmx.c index b25a588..9fa1f46 100644 --- a/arch/x86/kvm/vmx.c +++ b/arch/x86/kvm/vmx.c @@ -4442,6 +4442,7 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu) { struct vcpu_vmx *vmx = to_vmx(vcpu); struct msr_data apic_base_msr; + int idx; vmx->rmode.vm86_active = 0; @@ -4509,7 +4510,9 @@ static void vmx_vcpu_reset(struct kvm_vcpu *vcpu) vmcs_write32(TPR_THRESHOLD, 0); } + idx = srcu_read_lock(vcpu->kvm->srcu); kvm_vcpu_reload_apic_access_page(vcpu); + srcu_read_unlock(vcpu->kvm->srcu, idx); if (vmx_vm_has_apicv(vcpu->kvm)) memset(vmx->pi_desc, 0, sizeof(struct pi_desc)); -- 1.9.1 Thanks Tiejun page = gfn_to_page(vcpu->kvm, APIC_DEFAULT_PHYS_BASE >> PAGE_SHIFT); + srcu_read_unlock(vcpu->kvm->srcu, idx); kvm_x86_ops->set_apic_access_page_addr(vcpu, page_to_phys(page)); /*
Re: [PATCH v12 1/6] KVM: Add architecture-defined TLB flush support
On Wed, 22 Oct 2014 15:34:06 -0700 Mario Smarduch m.smard...@samsung.com wrote: This patch adds support for architecture implemented VM TLB flush, currently ARMv7 defines HAVE_KVM_ARCH_TLB_FLUSH_ALL. This leaves other architectures unaffected using the generic version. In subsequent patch ARMv7 defines HAVE_KVM_ARCH_TLB_FLUSH_ALL and it's own TLB flush interface. Can you reword this a bit? Allow architectures to override the generic kvm_flush_remote_tlbs() function via HAVE_KVM_ARCH_TLB_FLUSH_ALL. ARMv7 will need this to provide its own TLB flush interface. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- virt/kvm/Kconfig|3 +++ virt/kvm/kvm_main.c |2 ++ 2 files changed, 5 insertions(+) Providing an override for the special cases looks sane to me.
Re: [PATCH v12 0/6] arm/KVM: dirty page logging support for ARMv7 (3.17.0-rc1)
Am 23.10.2014 00:34, schrieb Mario Smarduch: This patch series introduces dirty page logging for ARMv7 and adds some degree of generic dirty logging support for x86, armv7 and later armv8. I implemented Alex's suggestion after he took a look at the patches at kvm forum to simplify the generic/arch split - leaving mips, powerpc, s390, (ia64 although broken) unchanged. x86/armv7 now share some dirty logging code. armv8 dirty log patches have been posted and tested but for the time being armv8 is non-generic as well. I briefly spoke to most of you at kvm forum, and this is the patch series I was referring to. Implementation changed from the previous version (patches 1 and 2); those who acked the previous revision, please review again. The last 4 patches (ARM) have been rebased for a newer kernel, with no significant changes. Testing: - Generally live migration + checksumming of source/destination memory regions is used to validate correctness. - qemu machvirt, VExpress - Exynos 5440, FastModels - lmbench + dirty guest memory cycling. - ARMv8 Foundation Model/kvmtool - Due to slight overlap in 2nd stage handlers did a basic bringup using qemu. - x86_64 qemu default machine model, tested migration on HP Z620, tested convergence for several dirty page rates See https://github.com/mjsmar/arm-dirtylog-tests - Dirtlogtest-setup.pdf for ARMv7 - https://github.com/mjsmar/arm-dirtylog-tests/tree/master/v7 - README The patch affects armv7, armv8, mips, ia64, powerpc, s390, x86_64. The patch series has been compiled for the affected architectures: - x86_64 - defconfig - ia64 - ia64-linux-gcc4.6.3 - defconfig; ia64 Kconfig defines BROKEN, worked around that to make sure new changes don't break the build. Eventually the build breaks due to other reasons.
- mips - mips64-linux-gcc4.6.3 - malta_kvm_defconfig - ppc - powerpc64-linux-gcc4.6.3 - pseries_defconfig - s390 - s390x-linux-gcc4.6.3 - defconfig - armv8 - aarch64-linux-gnu-gcc4.8.1 - defconfig ARMv7 dirty page logging implementation overview: - initially write protects VM RAM memory region - 2nd stage page tables - add support to read dirty page log and again write protect the dirty pages - second stage page table for next pass. - second stage huge pages are dissolved into small page tables to keep track of dirty pages at page granularity. Tracking at huge page granularity limits migration to an almost idle system. Small page size logging supports higher memory dirty rates. - In the event migration is canceled, normal behavior is resumed; huge pages are rebuilt over time. Changes since v11: - Implemented Alex's comments to simplify the generic layer. Changes since v10: - addressed wanghaibin comments - addressed Christoffer's comments Changes since v9: - Split patches into generic and architecture specific variants for TLB flushing and dirty log read (patches 1,2 3,4,5,6) - rebased to 3.16.0-rc1 - Applied Christoffer's comments.
Mario Smarduch (6): KVM: Add architecture-defined TLB flush support KVM: Add generic support for dirty page logging arm: KVM: Add ARMv7 API to flush TLBs arm: KVM: Add initial dirty page locking infrastructure arm: KVM: dirty log read write protect support arm: KVM: ARMv7 dirty page logging 2nd stage page fault arch/arm/include/asm/kvm_asm.h|1 + arch/arm/include/asm/kvm_host.h | 14 +++ arch/arm/include/asm/kvm_mmu.h| 20 arch/arm/include/asm/pgtable-3level.h |1 + arch/arm/kvm/Kconfig |2 + arch/arm/kvm/Makefile |1 + arch/arm/kvm/arm.c|2 + arch/arm/kvm/interrupts.S | 11 ++ arch/arm/kvm/mmu.c| 209 +++-- arch/x86/include/asm/kvm_host.h |3 - arch/x86/kvm/Kconfig |1 + arch/x86/kvm/Makefile |1 + arch/x86/kvm/x86.c| 86 -- include/linux/kvm_host.h |4 + virt/kvm/Kconfig |6 + virt/kvm/dirtylog.c | 112 ++ virt/kvm/kvm_main.c |2 + 17 files changed, 380 insertions(+), 96 deletions(-) create mode 100644 virt/kvm/dirtylog.c Patches 1-3 seem to work fine on s390. The other patches are arm-only (well, can't find 5 and 6) so I guess it's ok for s390.
Re: [PATCH v12 2/6] KVM: Add generic support for dirty page logging
On Wed, 22 Oct 2014 15:34:07 -0700 Mario Smarduch m.smard...@samsung.com wrote: This patch defines KVM_GENERIC_DIRTYLOG, and moves the dirty log read function to its own file virt/kvm/dirtylog.c. x86 is updated to use the generic dirty log interface, selecting KVM_GENERIC_DIRTYLOG in its Kconfig and makefile. No other architectures are affected; each uses its own version. This changed from the previous patch revision where non-generic architectures were modified. In a subsequent patch armv7 does the same thing. All other architectures continue to use their architecture defined version. Hm. The x86 specific version of dirty page logging is generic enough to be used by other architectures, notably ARMv7. So let's move the x86 code under virt/kvm/ and make it depend on KVM_GENERIC_DIRTYLOG. Other architectures continue to use their own implementations. ? Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/x86/include/asm/kvm_host.h |3 -- arch/x86/kvm/Kconfig|1 + arch/x86/kvm/Makefile |1 + arch/x86/kvm/x86.c | 86 -- include/linux/kvm_host.h|4 ++ virt/kvm/Kconfig|3 ++ virt/kvm/dirtylog.c | 112 +++ 7 files changed, 121 insertions(+), 89 deletions(-) create mode 100644 virt/kvm/dirtylog.c diff --git a/virt/kvm/dirtylog.c b/virt/kvm/dirtylog.c new file mode 100644 index 000..67a --- /dev/null +++ b/virt/kvm/dirtylog.c @@ -0,0 +1,112 @@ +/* + * kvm generic dirty logging support, used by architectures that share + * comman dirty page logging implementation. s/comman/common/ The approach looks sane to me, especially as it does not change other architectures needlessly.
Re: [PATCH v12 2/6] KVM: Add generic support for dirty page logging
On 10/30/2014 05:14 AM, Cornelia Huck wrote: On Wed, 22 Oct 2014 15:34:07 -0700 Mario Smarduch m.smard...@samsung.com wrote: This patch defines KVM_GENERIC_DIRTYLOG, and moves the dirty log read function to its own file virt/kvm/dirtylog.c. x86 is updated to use the generic dirty log interface, selecting KVM_GENERIC_DIRTYLOG in its Kconfig and makefile. No other architectures are affected; each uses its own version. This changed from the previous patch revision where non-generic architectures were modified. In a subsequent patch armv7 does the same thing. All other architectures continue to use their architecture defined version. Hm. The x86 specific version of dirty page logging is generic enough to be used by other architectures, notably ARMv7. So let's move the x86 code under virt/kvm/ and make it depend on KVM_GENERIC_DIRTYLOG. Other architectures continue to use their own implementations. ? I'll update descriptions for both patches with the more concise descriptions. Thanks. Signed-off-by: Mario Smarduch m.smard...@samsung.com --- arch/x86/include/asm/kvm_host.h |3 -- arch/x86/kvm/Kconfig|1 + arch/x86/kvm/Makefile |1 + arch/x86/kvm/x86.c | 86 -- include/linux/kvm_host.h|4 ++ virt/kvm/Kconfig|3 ++ virt/kvm/dirtylog.c | 112 +++ 7 files changed, 121 insertions(+), 89 deletions(-) create mode 100644 virt/kvm/dirtylog.c diff --git a/virt/kvm/dirtylog.c b/virt/kvm/dirtylog.c new file mode 100644 index 000..67a --- /dev/null +++ b/virt/kvm/dirtylog.c @@ -0,0 +1,112 @@ +/* + * kvm generic dirty logging support, used by architectures that share + * comman dirty page logging implementation. s/comman/common/ The approach looks sane to me, especially as it does not change other architectures needlessly.
Re: [PATCH v12 0/6] arm/KVM: dirty page logging support for ARMv7 (3.17.0-rc1)
On 10/30/2014 05:11 AM, Christian Borntraeger wrote: Am 23.10.2014 00:34, schrieb Mario Smarduch: This patch series introduces dirty page logging for ARMv7 and adds some degree of generic dirty logging support for x86, armv7 and later armv8. I implemented Alex's suggestion after he took a look at the patches at kvm forum to simplify the generic/arch split - leaving mips, powerpc, s390, (ia64 although broken) unchanged. x86/armv7 now share some dirty logging code. armv8 dirty log patches have been posted and tested but for the time being armv8 is non-generic as well. I briefly spoke to most of you at kvm forum, and this is the patch series I was referring to. Implementation changed from the previous version (patches 1 and 2); those who acked the previous revision, please review again. The last 4 patches (ARM) have been rebased for a newer kernel, with no significant changes. Testing: - Generally live migration + checksumming of source/destination memory regions is used to validate correctness. - qemu machvirt, VExpress - Exynos 5440, FastModels - lmbench + dirty guest memory cycling. - ARMv8 Foundation Model/kvmtool - Due to slight overlap in 2nd stage handlers did a basic bringup using qemu. - x86_64 qemu default machine model, tested migration on HP Z620, tested convergence for several dirty page rates See https://github.com/mjsmar/arm-dirtylog-tests - Dirtlogtest-setup.pdf for ARMv7 - https://github.com/mjsmar/arm-dirtylog-tests/tree/master/v7 - README The patch affects armv7, armv8, mips, ia64, powerpc, s390, x86_64. The patch series has been compiled for the affected architectures: - x86_64 - defconfig - ia64 - ia64-linux-gcc4.6.3 - defconfig; ia64 Kconfig defines BROKEN, worked around that to make sure new changes don't break the build. Eventually the build breaks due to other reasons.
- mips - mips64-linux-gcc4.6.3 - malta_kvm_defconfig - ppc - powerpc64-linux-gcc4.6.3 - pseries_defconfig - s390 - s390x-linux-gcc4.6.3 - defconfig - armv8 - aarch64-linux-gnu-gcc4.8.1 - defconfig ARMv7 dirty page logging implementation overview: - initially write protects VM RAM memory region - 2nd stage page tables - add support to read dirty page log and again write protect the dirty pages - second stage page table for next pass. - second stage huge pages are dissolved into small page tables to keep track of dirty pages at page granularity. Tracking at huge page granularity limits migration to an almost idle system. Small page size logging supports higher memory dirty rates. - In the event migration is canceled, normal behavior is resumed; huge pages are rebuilt over time. Changes since v11: - Implemented Alex's comments to simplify the generic layer. Changes since v10: - addressed wanghaibin comments - addressed Christoffer's comments Changes since v9: - Split patches into generic and architecture specific variants for TLB flushing and dirty log read (patches 1,2 3,4,5,6) - rebased to 3.16.0-rc1 - Applied Christoffer's comments.
Mario Smarduch (6): KVM: Add architecture-defined TLB flush support KVM: Add generic support for dirty page logging arm: KVM: Add ARMv7 API to flush TLBs arm: KVM: Add initial dirty page locking infrastructure arm: KVM: dirty log read write protect support arm: KVM: ARMv7 dirty page logging 2nd stage page fault arch/arm/include/asm/kvm_asm.h|1 + arch/arm/include/asm/kvm_host.h | 14 +++ arch/arm/include/asm/kvm_mmu.h| 20 arch/arm/include/asm/pgtable-3level.h |1 + arch/arm/kvm/Kconfig |2 + arch/arm/kvm/Makefile |1 + arch/arm/kvm/arm.c|2 + arch/arm/kvm/interrupts.S | 11 ++ arch/arm/kvm/mmu.c| 209 +++-- arch/x86/include/asm/kvm_host.h |3 - arch/x86/kvm/Kconfig |1 + arch/x86/kvm/Makefile |1 + arch/x86/kvm/x86.c| 86 -- include/linux/kvm_host.h |4 + virt/kvm/Kconfig |6 + virt/kvm/dirtylog.c | 112 ++ virt/kvm/kvm_main.c |2 + 17 files changed, 380 insertions(+), 96 deletions(-) create mode 100644 virt/kvm/dirtylog.c Patches 1-3 seem to work fine on s390. The other patches are arm-only (well, can't find 5 and 6) so I guess it's ok for s390. The patches are there but threading is broken due to the mail server's message rate threshold. Just in case, links below: https://lists.cs.columbia.edu/pipermail/kvmarm/2014-October/011730.html https://lists.cs.columbia.edu/pipermail/kvmarm/2014-October/011731.html Thanks.