Re: [RFC PATCH 00/17] virtual-bus
On Thursday 02 April 2009 02:40:29 Anthony Liguori wrote:
> Rusty Russell wrote:
> > As you point out, 350-450 is possible, which is still bad, and it's at
> > least partially caused by the exit to userspace and two system calls.
> > If virtio_net had a backend in the kernel, we'd be able to compare
> > numbers properly.
>
> I doubt the userspace exit is the problem.  On a modern system, it takes
> about 1us to do a light-weight exit and about 2us to do a heavy-weight
> exit.  A transition to userspace is only about ~150ns, the bulk of the
> additional heavy-weight exit cost is from vcpu_put() within KVM.

Just to inject some facts, servicing a ping via tap (ie. host->guest then
guest->host response) takes 26 system calls from one qemu thread, 7 from
another (see strace below).  Judging by those futex calls, multiple context
switches, too.

> If you were to switch to another kernel thread, and I'm pretty sure you
> have to, you're going to still see about a 2us exit cost.

He switches to another thread, too, but with the right infrastructure
(ie. skb data destructors) we could skip this as well.  (It'd be
interesting to see how virtual-bus performed on a single-CPU host.)

Cheers,
Rusty.

Pid 10260:
12:37:40.245785 select(17, [4 6 8 14 16], [], [], {0, 996000}) = 1 (in [6], left {0, 992000}) <0.003995>
12:37:40.250226 read(6, "\0\0\0\0\0\0\0\0\0\0RT\0\0224V*\211\24\210`\304\10\0E\0"..., 69632) = 108 <0.51>
12:37:40.250462 write(1, "tap read: 108 bytes\n", 20) = 20 <0.000197>
12:37:40.250800 ioctl(7, 0x4008ae61, 0x7fff8cafb3a0) = 0 <0.000223>
12:37:40.251149 read(6, 0x115c6ac, 69632) = -1 EAGAIN (Resource temporarily unavailable) <0.19>
12:37:40.251292 write(1, "tap read: -1 bytes\n", 19) = 19 <0.85>
12:37:40.251488 clock_gettime(CLOCK_MONOTONIC, {1554, 633304282}) = 0 <0.20>
12:37:40.251604 clock_gettime(CLOCK_MONOTONIC, {1554, 633413793}) = 0 <0.19>
12:37:40.251717 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.001222>
12:37:40.253037 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {1, 0}) <0.26>
12:37:40.253196 read(16, "\16\0\0\0\0\0\0\0\376\377\377\377\0\0\0\0\0\0\0\0\0\0\0"..., 128) = 128 <0.22>
12:37:40.253324 rt_sigaction(SIGALRM, NULL, {0x406d50, ~[KILL STOP RTMIN RT_1], SA_RESTORER, 0x7f1a842430f0}, 8) = 0 <0.18>
12:37:40.253477 write(5, "\0", 1) = 1 <0.22>
12:37:40.253585 read(16, 0x7fff8cb09440, 128) = -1 EAGAIN (Resource temporarily unavailable) <0.20>
12:37:40.253687 clock_gettime(CLOCK_MONOTONIC, {1554, 635496181}) = 0 <0.19>
12:37:40.253798 writev(6, [{"\0\0\0\0\0\0\0\0\0\0", 10}, {"*\211\24\210`\304rt\0\0224v\10\0e\0\0t\255\262\...@\1g"..., 98}], 2) = 108 <0.62>
12:37:40.253993 ioctl(7, 0x4008ae61, 0x7fff8caff460) = 0 <0.000161>
12:37:40.254263 clock_gettime(CLOCK_MONOTONIC, {1554, 636077540}) = 0 <0.19>
12:37:40.254380 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 1 <0.000394>
12:37:40.254861 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [4], left {1, 0}) <0.22>
12:37:40.255001 read(4, "\0", 512) = 1 <0.21>
12:37:40.255109 read(4, 0x7fff8cb092d0, 512) = -1 EAGAIN (Resource temporarily unavailable) <0.18>
12:37:40.255211 clock_gettime(CLOCK_MONOTONIC, {1554, 637020677}) = 0 <0.19>
12:37:40.255314 clock_gettime(CLOCK_MONOTONIC, {1554, 637123483}) = 0 <0.19>
12:37:40.255416 timer_gettime(0, {it_interval={0, 0}, it_value={0, 0}}) = 0 <0.18>
12:37:40.255524 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 1400}}, NULL) = 0 <0.21>
12:37:40.255635 clock_gettime(CLOCK_MONOTONIC, {1554, 637443915}) = 0 <0.19>
12:37:40.255739 clock_gettime(CLOCK_MONOTONIC, {1554, 637547001}) = 0 <0.18>
12:37:40.255847 select(17, [4 6 8 14 16], [], [], {1, 0}) = 1 (in [16], left {0, 988000}) <0.014303>

Pid 10262:
12:37:40.252531 clock_gettime(CLOCK_MONOTONIC, {1554, 634339051}) = 0 <0.18>
12:37:40.252631 timer_gettime(0, {it_interval={0, 0}, it_value={0, 17549811}}) = 0 <0.21>
12:37:40.252750 timer_settime(0, 0, {it_interval={0, 0}, it_value={0, 25}}, NULL) = 0 <0.24>
12:37:40.252868 ioctl(11, 0xae80, 0) = 0 <0.001171>
12:37:40.254128 futex(0xb81360, 0x80 /* FUTEX_??? */, 2) = 0 <0.000270>
12:37:40.254490 ioctl(7, 0x4008ae61, 0x4134bee0) = 0 <0.19>
12:37:40.254598 futex(0xb81360, 0x81 /* FUTEX_??? */, 1) = 0 <0.17>
12:37:40.254693 ioctl(11, 0xae80

fd:
lrwx-- 1 root root 64 2009-04-05 12:31 0 -> /dev/pts/1
lrwx-- 1 root root 64 2009-04-05 12:31 1 -> /dev/pts/1
lrwx-- 1 root root 64 2009-04-05 12:35 10 -> /home/rusty/qemu-images/ubuntu-8.10
lrwx-- 1 root root 64 2009-04-05 12:35 11 -> anon_inode:kvm-vcpu
lrwx-- 1 root root 64 2009-04-05 12:35 12 -> socket:[31414]
lrwx-- 1 root root 64 2009-04-05 12:35 13 -> socket:[31416]
lrwx-- 1 root root 64 2009-04-05 12:35 14 -> anon_inode:[eventfd]
Re: virtio_net: MAC address releated breakage if there is no MAC area in config
On Thursday 02 April 2009 22:03:24 Christian Borntraeger wrote:
> I read this as the mac config field is optional (similar to all the
> optional fields we added in virtio_blk later).
>
> I see two options:
> 1. Change our sample userspace to always allocate the config (like
>    lguest and qemu)
> 2. Change the kernel code to not write into the config unless a specific
>    feature bit is set. (e.g. VIRTIO_NET_F_SETMAC)

I had assumed it would always exist.  Agreed it's unclear; I'm currently
writing down the actual standard to avoid such mistakes in the future.

Cheers,
Rusty.
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
AHCI?
Is there any plan to add AHCI support to kvm? It seems like it would be an ideal alternative to the LSI SCSI driver, since AHCI is supported by 64-bit Solaris as well as nearly every other modern OS.
Re: virtio_net: MAC address releated breakage if there is no MAC area in config
From: Christian Borntraeger
Date: Thu, 2 Apr 2009 19:23:48 +0200

> Am Thursday 02 April 2009 18:06:25 schrieb Alex Williamson:
> > virtio_net: Set the mac config only when VIRTIO_NET_F_MAC
> >
> > VIRTIO_NET_F_MAC indicates the presence of the mac field in config
> > space, not the validity of the value it contains.  Allow the mac to be
> > changed at runtime, but only push the change into config space with
> > the VIRTIO_NET_F_MAC feature present.
> >
> > Signed-off-by: Alex Williamson
>
> Acked-by: Christian Borntraeger

Applied, thanks!
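The rule the applied patch enforces — VIRTIO_NET_F_MAC means "the mac field exists in config space", so a runtime MAC change is only pushed into config space when that bit was negotiated — can be sketched in a few lines of C. This is an illustrative model only, not the actual virtio_net code: `struct vdev` and `vnet_set_mac()` are made-up names; the `VIRTIO_NET_F_MAC` bit number matches the virtio-net feature definition.

```c
#include <stdint.h>
#include <string.h>

#define VIRTIO_NET_F_MAC 5  /* feature bit: mac field is present in config */

struct vnet_config {
    uint8_t mac[6];
};

struct vdev {                   /* hypothetical device handle */
    uint32_t features;          /* feature bits negotiated with the host */
    struct vnet_config config;  /* device config space */
};

/* Push a new MAC into config space only if the device advertised the
 * mac field; otherwise we would scribble over config space the host
 * never allocated (the breakage this thread is about). */
static int vnet_set_mac(struct vdev *dev, const uint8_t mac[6])
{
    if (!(dev->features & (1u << VIRTIO_NET_F_MAC)))
        return -1;  /* no mac field in config space: remember it locally only */
    memcpy(dev->config.mac, mac, 6);
    return 0;
}
```

Without the feature bit the write is refused and config space stays untouched, which is exactly the behavior the commit message describes.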
Re: cr3 OOS optimisation breaks 32-bit GNU/kFreeBSD guest
On Fri, Apr 03, 2009 at 06:45:48PM -0300, Marcelo Tosatti wrote:
> On Tue, Mar 24, 2009 at 11:47:33AM +0200, Avi Kivity wrote:
> >> index 2ea8262..48169d7 100644
> >> --- a/arch/x86/kvm/x86.c
> >> +++ b/arch/x86/kvm/x86.c
> >> @@ -3109,6 +3109,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
> >>  		kvm_write_guest_time(vcpu);
> >>  	if (test_and_clear_bit(KVM_REQ_MMU_SYNC, &vcpu->requests))
> >>  		kvm_mmu_sync_roots(vcpu);
> >> +	if (test_and_clear_bit(KVM_REQ_MMU_GLOBAL_SYNC, &vcpu->requests))
> >> +		kvm_mmu_sync_global(vcpu);
> >>  	if (test_and_clear_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
> >>  		kvm_x86_ops->tlb_flush(vcpu);
> >>  	if (test_and_clear_bit(KVM_REQ_REPORT_TPR_ACCESS
> >
> > Windows will (I think) write a PDE on every context switch, so this
> > effectively disables global unsync for that guest.
> >
> > What about recursively syncing the newly linked page in FNAME(fetch)()?
> > If the page isn't global, this becomes a no-op, so no new overhead.  The
> > only question is the expense when linking a populated top-level page,
> > especially in long mode.
>
> How about this?
>
> KVM: MMU: sync global pages on fetch()
>
> If an unsync global page becomes unreachable via the shadow tree, which
> can happen if one of its parent pages is zapped, invlpg will fail to
> invalidate translations for gvas contained in such unreachable pages.
>
> So sync global pages in fetch().
>
> Signed-off-by: Marcelo Tosatti

I have tried this patch, and unfortunately it does not solve the original
problem, while the previous one did.

--
Aurelien Jarno	GPG: 1024D/F1BCDB73
aurel...@aurel32.net	http://www.aurel32.net
Re: KVM Port
OK, so these are a few steps to begin:
(a) add a QEMUMachine for my h/w in qemu
(b) add arch support in kvm

I have a few questions:
(a) qemu starts in user space; how would I configure my Linux?  Should
    Linux run in hypervisor state and the apps run in user state, with
    nothing running in guest state?  [there are 3 states in my processor]
(b) qemu starts the VM and somehow (I don't know yet how) starts my code
    in processor guest state

-thanks

On Thu, Mar 26, 2009 at 4:16 PM, Avi Kivity wrote:
> kvm port wrote:
>> AFAIK the KVM userspace app creates a VM using /dev/kvm.  Now if IO has
>> a MMU managed by the KVM arch module, do I still need qemu?
>
> qemu is needed to allocate memory, and to emulate devices.  For example
> the IDE disk controller is implemented in qemu.
>
> --
> error compiling committee.c: too many arguments to function
Re: IOMMU setting
On Sat, Apr 04, 2009 at 12:16:50AM +, Eric Liu wrote:
> Is there a quick way to check if the system has an IOMMU enabled in
> Linux?
>
> I saw the following messages in /var/log/messages:
>
> Apr 3 21:03:16 kernel: PCI-DMA: Disabling AGP.
> Apr 3 21:03:16 kernel: PCI-DMA: aperture base @ f400 size 65536 KB
> Apr 3 21:03:16 kernel: init_memory_mapping: f400-f800
> Apr 3 21:03:16 kernel: last_map_addr: f800 end: f800
> Apr 3 21:03:16 kernel: PCI-DMA: using GART IOMMU.
> Apr 3 21:03:16 kernel: PCI-DMA: Reserving 64MB of IOMMU area in the AGP aperture
>
> Does this mean IOMMU is enabled?  And I don't need anything like
> iommu=force in the boot options, right?

It means that you are running on an AMD system, and that this system has
a GART.  You need an isolation-capable IOMMU such as Intel's VT-d for KVM
in-tree device passthrough.

Cheers,
Muli
--
Muli Ben-Yehuda | m...@il.ibm.com | +972-4-8281080
Manager, Virtualization and Systems Architecture
Master Inventor, IBM Haifa Research Laboratory
SYSTOR 2009---The Israeli Experimental Systems Conference
http://www.haifa.il.ibm.com/conferences/systor2009/
Re: Can't boot guest with more than 3585MB when using large pages
On Fri, 2009-04-03 at 20:28 -0300, Marcelo Tosatti wrote:
> Can you please try the following

Thanks Marcelo, this seems to fix it.  I tested up to a 30G guest with
large pages.

Alex

> --
>
> qemu: kvm: fixup 4GB+ memslot large page alignment
>
> Need to align the 4GB+ memslot after we know its address, not before.
>
> Signed-off-by: Marcelo Tosatti

Tested-by: Alex Williamson

> diff --git a/qemu/hw/pc.c b/qemu/hw/pc.c
> index d4a4320..cc84772 100644
> --- a/qemu/hw/pc.c
> +++ b/qemu/hw/pc.c
> @@ -866,6 +866,7 @@ static void pc_init1(ram_addr_t ram_size, int vga_ram_size,
>
>      /* above 4giga memory allocation */
>      if (above_4g_mem_size > 0) {
> +        ram_addr = qemu_ram_alloc(above_4g_mem_size);
>          if (hpagesize) {
>              if (ram_addr & (hpagesize-1)) {
>                  unsigned long aligned_addr;
> @@ -874,7 +875,6 @@ static void pc_init1(ram_addr_t ram_size, int vga_ram_size,
>                  ram_addr = aligned_addr;
>              }
>          }
> -        ram_addr = qemu_ram_alloc(above_4g_mem_size);
>          cpu_register_physical_memory(0x1ULL, above_4g_mem_size, ram_addr);
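The bug fixed here is an ordering one: the alignment arithmetic was running before `ram_addr` held the slot's real address, so it aligned a stale value. The rounding itself is the standard align-up idiom for a power-of-two large-page size; a standalone sketch of that computation (illustrative helper, not qemu's actual code):

```c
#include <stdint.h>

/* Round addr up to the next multiple of a power-of-two alignment.
 * This is the computation the qemu hunk applies to ram_addr once the
 * memslot's real address is known. */
static uint64_t align_up(uint64_t addr, uint64_t align)
{
    /* Adding align-1 pushes any unaligned address past the next
     * boundary; masking the low bits then snaps down onto it. */
    return (addr + align - 1) & ~(align - 1);
}
```

The key point of the patch is simply that `align_up()` must be fed the address returned by `qemu_ram_alloc()`, not whatever `ram_addr` happened to contain beforehand.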
Re: cr3 OOS optimisation breaks 32-bit GNU/kFreeBSD guest
On Sat, Apr 04, 2009 at 01:37:39PM +0300, Avi Kivity wrote:
>> diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
>> index 09782a9..728be72 100644
>> --- a/arch/x86/kvm/paging_tmpl.h
>> +++ b/arch/x86/kvm/paging_tmpl.h
>> @@ -308,8 +308,14 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
>>  			break;
>>  		}
>> -		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
>> +		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
>> +			if (level-1 == PT_PAGE_TABLE_LEVEL) {
>> +				shadow_page = page_header(__pa(sptep));
>> +				if (shadow_page->unsync && shadow_page->global)
>> +					kvm_sync_page(vcpu, shadow_page);
>> +			}
>>  			continue;
>> +		}
>>  		if (is_large_pte(*sptep)) {
>>  			rmap_remove(vcpu->kvm, sptep);
>
> But here the shadow page is already linked?  Isn't the root cause that
> an invlpg was called when the page wasn't linked, so it wasn't seen by
> invlpg?
>
> So I thought the best place would be in fetch(), after
> kvm_mmu_get_page().  If we're linking a page which contains global ptes,
> they might be unsynced due to invlpgs that we've missed.
>
> Or am I missing something about the root cause?

The problem is when the page is unreachable due to a higher-level path
being unlinked.  Say:

level 4 -> level 3 .  level 2 -> level 1 (global unsync)

The dot there means level 3 is not linked to level 2, so invlpg can't
reach the global unsync at level 1.

kvm_mmu_get_page does sync pages when it finds them, so the code is
already safe for the "linking a page which contains global ptes" case
you mention above.
[PATCH 4/4] add ksm kernel shared memory driver.
KSM is a driver that allows merging identical pages between one or more
applications, in a way invisible to the applications that use it.  Pages
that are merged are marked read-only and are COWed when any application
tries to change them.

KSM is used for cases where using fork() is not suitable; one such case
is where the pages of the application keep changing dynamically and the
application cannot know in advance which pages are going to be identical.

KSM works by walking over the memory pages of the applications it scans,
looking for identical pages.  It uses two sorted data structures, called
the stable and unstable trees, to find identical pages in an effective
way.

When KSM finds two identical pages, it marks them as read-only and merges
them into a single page.  After that, Linux treats them as normal
copy-on-write pages and will copy them when a write access happens.

KSM scans just the memory areas that were registered with it.

KSM API:

KSM_GET_API_VERSION:
  Give userspace the API version of the module.

KSM_CREATE_SHARED_MEMORY_AREA:
  Create a shared memory region fd, which later allows the user to
  register memory regions to scan using KSM_REGISTER_MEMORY_REGION and
  KSM_REMOVE_MEMORY_REGION.

KSM_REGISTER_MEMORY_REGION:
  Register a userspace virtual address range to be scanned by KSM.
  This ioctl uses the ksm_memory_region structure:

  struct ksm_memory_region {
      __u32 npages;        /* number of pages to share inside this memory region */
      __u32 pad;
      __u64 addr;          /* the beginning of the virtual address of this region */
      __u64 reserved_bits; /* reserved bits for future usage */
  };

KSM_REMOVE_MEMORY_REGION:
  Remove a memory region from KSM.
Signed-off-by: Izik Eidus
---
 include/linux/ksm.h        |   48 ++
 include/linux/miscdevice.h |    1 +
 mm/Kconfig                 |    6 +
 mm/Makefile                |    1 +
 mm/ksm.c                   | 1668 ++++++++++++++++++++++++++++++++++++++++++++
 5 files changed, 1724 insertions(+), 0 deletions(-)
 create mode 100644 include/linux/ksm.h
 create mode 100644 mm/ksm.c

diff --git a/include/linux/ksm.h b/include/linux/ksm.h
new file mode 100644
index 000..2c11e9a
--- /dev/null
+++ b/include/linux/ksm.h
@@ -0,0 +1,48 @@
+#ifndef __LINUX_KSM_H
+#define __LINUX_KSM_H
+
+/*
+ * Userspace interface for /dev/ksm - kvm shared memory
+ */
+
+#include
+#include
+
+#include
+
+#define KSM_API_VERSION 1
+
+#define ksm_control_flags_run 1
+
+/* for KSM_REGISTER_MEMORY_REGION */
+struct ksm_memory_region {
+	__u32 npages; /* number of pages to share */
+	__u32 pad;
+	__u64 addr; /* the begining of the virtual address */
+	__u64 reserved_bits;
+};
+
+#define KSMIO 0xAB
+
+/* ioctls for /dev/ksm */
+
+#define KSM_GET_API_VERSION		_IO(KSMIO, 0x00)
+/*
+ * KSM_CREATE_SHARED_MEMORY_AREA - create the shared memory reagion fd
+ */
+#define KSM_CREATE_SHARED_MEMORY_AREA	_IO(KSMIO, 0x01) /* return SMA fd */
+
+/* ioctls for SMA fds */
+
+/*
+ * KSM_REGISTER_MEMORY_REGION - register virtual address memory area to be
+ * scanned by kvm.
+ */
+#define KSM_REGISTER_MEMORY_REGION	_IOW(KSMIO, 0x20,\
+					struct ksm_memory_region)
+/*
+ * KSM_REMOVE_MEMORY_REGION - remove virtual address memory area from ksm.
+ */
+#define KSM_REMOVE_MEMORY_REGION	_IO(KSMIO, 0x21)
+
+#endif
diff --git a/include/linux/miscdevice.h b/include/linux/miscdevice.h
index beb6ec9..297c0bb 100644
--- a/include/linux/miscdevice.h
+++ b/include/linux/miscdevice.h
@@ -30,6 +30,7 @@
 #define HPET_MINOR		228
 #define FUSE_MINOR		229
 #define KVM_MINOR		232
+#define KSM_MINOR		233
 #define MISC_DYNAMIC_MINOR	255

 struct device;
diff --git a/mm/Kconfig b/mm/Kconfig
index b53427a..3f3fd04 100644
--- a/mm/Kconfig
+++ b/mm/Kconfig
@@ -223,3 +223,9 @@ config HAVE_MLOCKED_PAGE_BIT

 config MMU_NOTIFIER
 	bool
+
+config KSM
+	tristate "Enable KSM for page sharing"
+	help
+	  Enable the KSM kernel module to allow page sharing of equal pages
+	  among different tasks.
diff --git a/mm/Makefile b/mm/Makefile
index ec73c68..b885513 100644
--- a/mm/Makefile
+++ b/mm/Makefile
@@ -24,6 +24,7 @@ obj-$(CONFIG_SPARSEMEM_VMEMMAP) += sparse-vmemmap.o
 obj-$(CONFIG_TMPFS_POSIX_ACL) += shmem_acl.o
 obj-$(CONFIG_SLOB) += slob.o
 obj-$(CONFIG_MMU_NOTIFIER) += mmu_notifier.o
+obj-$(CONFIG_KSM) += ksm.o
 obj-$(CONFIG_PAGE_POISONING) += debug-pagealloc.o
 obj-$(CONFIG_SLAB) += slab.o
 obj-$(CONFIG_SLUB) += slub.o
diff --git a/mm/ksm.c b/mm/ksm.c
new file mode 100644
index 000..fb59a08
--- /dev/null
+++ b/mm/ksm.c
@@ -0,0 +1,1668 @@
+/*
+ * Memory merging driver for Linux
+ *
+ * This module enables dynamic sharing of identical pages found in different
+ * memory areas, even if they are not shared by fork()
+ *
+ * Copyright (C) 2008 Red Hat, Inc.
+ * Authors:
+ *	Izik Eidus
+ *
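For orientation, a user-space caller of this proposed API would mirror the struct and ioctl numbers from ksm.h and issue the ioctls against `/dev/ksm`. The sketch below is hedged: it only redeclares the proposed interface and wraps one call; it can of course do nothing useful unless a kernel with this (RFC-stage) driver is running, and `ksm_register_region()` is an illustrative helper name, not part of the patch.

```c
#include <sys/ioctl.h>
#include <linux/ioctl.h>
#include <linux/types.h>

/* Mirrors the struct and ioctl numbers from the proposed ksm.h above. */
struct ksm_memory_region {
    __u32 npages;        /* number of pages to share */
    __u32 pad;
    __u64 addr;          /* start of the virtual address range */
    __u64 reserved_bits;
};

#define KSMIO 0xAB
#define KSM_GET_API_VERSION           _IO(KSMIO, 0x00)
#define KSM_CREATE_SHARED_MEMORY_AREA _IO(KSMIO, 0x01)
#define KSM_REGISTER_MEMORY_REGION    _IOW(KSMIO, 0x20, struct ksm_memory_region)
#define KSM_REMOVE_MEMORY_REGION      _IO(KSMIO, 0x21)

/* Register npages pages starting at addr with an SMA fd previously
 * obtained via KSM_CREATE_SHARED_MEMORY_AREA on /dev/ksm.
 * Returns the raw ioctl result (illustrative helper). */
static int ksm_register_region(int sma_fd, __u64 addr, __u32 npages)
{
    struct ksm_memory_region region = {
        .npages = npages,
        .addr   = addr,
    };
    return ioctl(sma_fd, KSM_REGISTER_MEMORY_REGION, &region);
}
```

Note the `_IOW` encoding bakes `sizeof(struct ksm_memory_region)` (24 bytes with the explicit `pad` field) into the ioctl number, so userspace and kernel disagree loudly if the struct layouts drift apart.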
[PATCH 1/4] MMU_NOTIFIERS: add set_pte_at_notify()
This macro allows setting the pte in the shadow page tables directly,
instead of flushing the shadow page table entry and then getting a
vmexit in order to set it.

This function is an optimization for kvm/users of mmu_notifiers for COW
pages.  It is useful for kvm when KSM is used, because it allows kvm not
to have to receive a VMEXIT and only then map the shared page into the
mmu shadow pages; instead, the page is mapped directly at the same time
Linux maps it into the host page table.

This mmu notifier macro works by calling a callback that maps the
physical page directly into the shadow page tables.

(Users of mmu_notifiers that did not implement the set_pte_at_notify()
callback will just receive the mmu_notifier_invalidate_page callback.)

Signed-off-by: Izik Eidus
---
 include/linux/mmu_notifier.h |   34 ++++++++++++++++++++++++++++++++++
 mm/memory.c                  |   10 ++++++++--
 mm/mmu_notifier.c            |   20 ++++++++++++++++++++
 3 files changed, 62 insertions(+), 2 deletions(-)

diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
index b77486d..8bb245f 100644
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -61,6 +61,15 @@ struct mmu_notifier_ops {
 			   struct mm_struct *mm,
 			   unsigned long address);

+	/*
+	 * change_pte is called in cases that pte mapping into page is changed
+	 * for example when ksm mapped pte to point into a new shared page.
+	 */
+	void (*change_pte)(struct mmu_notifier *mn,
+			   struct mm_struct *mm,
+			   unsigned long address,
+			   pte_t pte);
+
 	/*
 	 * Before this is invoked any secondary MMU is still ok to
 	 * read/write to the page previously pointed to by the Linux
@@ -154,6 +163,8 @@ extern void __mmu_notifier_mm_destroy(struct mm_struct *mm);
 extern void __mmu_notifier_release(struct mm_struct *mm);
 extern int __mmu_notifier_clear_flush_young(struct mm_struct *mm,
 					  unsigned long address);
+extern void __mmu_notifier_change_pte(struct mm_struct *mm,
+				      unsigned long address, pte_t pte);
 extern void __mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address);
 extern void __mmu_notifier_invalidate_range_start(struct mm_struct *mm,
@@ -175,6 +186,13 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return 0;
 }

+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+					   unsigned long address, pte_t pte)
+{
+	if (mm_has_notifiers(mm))
+		__mmu_notifier_change_pte(mm, address, pte);
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address)
 {
@@ -236,6 +254,16 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 	__young;							\
 })

+#define set_pte_at_notify(__mm, __address, __ptep, __pte)		\
+({									\
+	struct mm_struct *___mm = __mm;					\
+	unsigned long ___address = __address;				\
+	pte_t ___pte = __pte;						\
+									\
+	set_pte_at(__mm, __address, __ptep, ___pte);			\
+	mmu_notifier_change_pte(___mm, ___address, ___pte);		\
+})
+
 #else /* CONFIG_MMU_NOTIFIER */

 static inline void mmu_notifier_release(struct mm_struct *mm)
@@ -248,6 +276,11 @@ static inline int mmu_notifier_clear_flush_young(struct mm_struct *mm,
 	return 0;
 }

+static inline void mmu_notifier_change_pte(struct mm_struct *mm,
+					   unsigned long address, pte_t pte)
+{
+}
+
 static inline void mmu_notifier_invalidate_page(struct mm_struct *mm,
 					  unsigned long address)
 {
@@ -273,6 +306,7 @@ static inline void mmu_notifier_mm_destroy(struct mm_struct *mm)
 #define ptep_clear_flush_young_notify ptep_clear_flush_young
 #define ptep_clear_flush_notify ptep_clear_flush
+#define set_pte_at_notify set_pte_at

 #endif /* CONFIG_MMU_NOTIFIER */
diff --git a/mm/memory.c b/mm/memory.c
index cf6873e..1e1a14b 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -2051,9 +2051,15 @@ gotten:
 	 * seen in the presence of one thread doing SMC and another
 	 * thread doing COW.
 	 */
-	ptep_clear_flush_notify(vma, address, page_table);
+	ptep_clear_flush(vma, address, page_table);
 	page_add_new_anon_rmap(new_page, vma, address);
-	set_pte_at(mm, a
[PATCH 0/4] ksm - dynamic page sharing driver for linux v2
>From v1 to v2:

1) Fixed a security issue found by Chris Wright:
   KSM was checking whether a page is a shared page by testing !PageAnon.
   Because KSM scans only anonymous memory, all !PageAnon pages inside
   KSM's data structures are shared pages.  However, there is a case for
   do_wp_page() when VM_SHARED is used where do_wp_page(), instead of
   copying the page into a new anonymous page, would reuse the page.  It
   was fixed by adding a check on the dirty bit of the virtual addresses
   pointing into the shared page.
   I could not find any VM code that would clear the dirty bit of such a
   virtual address (due to the fact that we allocate the page using
   page_alloc() - kernel-allocated pages), but I still want confirmation
   about this from the VM guys - thanks.

2) Moved to sysfs to control KSM:
   It was requested as a better way to control the KSM scanning thread
   than ioctls.  The sysfs API, in the directory /sys/kernel/mm/ksm/:

   kernel_pages_allocated - how many kernel pages KSM has allocated.
     These pages are not swappable; each such page is used by KSM to
     share pages with identical content.
   pages_shared - how many pages were shared by KSM.
   run - set to 1 when you want KSM to run, 0 when not.
   max_kernel_pages - the maximum number of kernel pages to be allocated
     by KSM; set to 0 for unlimited.
   pages_to_scan - how many pages to scan before KSM will sleep.
   sleep - how many usecs KSM will sleep.

3) Added a sysfs parameter to control the maximum number of kernel pages
   to be allocated by KSM.

4) Added statistics about how many pages are really shared.

One issue still to be discussed:
There was a suggestion to use madvise(SHAREABLE) instead of ioctls to
register memory that needs to be scanned by KSM.  Such a change is
outside the area of ksm.c and would require adding a new madvise API and
changing some parts of the VM and kernel code, so the first thing to do
is decide whether we really want this.

I don't know of any other open issues.

Thanks.
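Driving the sysfs interface listed above from a program amounts to writing small strings into those attribute files. A hedged C sketch of that (the helper name is illustrative; the /sys/kernel/mm/ksm/ paths only exist with this patch applied, and writing them needs root):

```c
#include <stdio.h>

/* Write a single value string to a sysfs attribute, e.g.
 *   sysfs_write("/sys/kernel/mm/ksm/pages_to_scan", "512");
 *   sysfs_write("/sys/kernel/mm/ksm/run", "1");
 * Returns 0 on success, -1 on failure. */
static int sysfs_write(const char *path, const char *value)
{
    FILE *f = fopen(path, "w");
    if (!f)
        return -1;
    int ok = fputs(value, f) >= 0;
    ok = (fclose(f) == 0) && ok;  /* sysfs reports errors on close, too */
    return ok ? 0 : -1;
}
```

Checking the `fclose()` return matters here: sysfs stores may reject a value only when the write is flushed.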
This is from the first post:
(The kvm part, together with the kvm-userspace part, was posted with v1
about a week ago; whoever wants to test KSM may download the patch from
the lkml archive.)

KSM is a Linux driver that allows dynamically sharing identical memory
pages between one or more processes.

Unlike traditional page sharing that is set up at the allocation of the
memory, KSM does it dynamically after the memory has been created.
Memory is periodically scanned; identical pages are identified and
merged.  The sharing is unnoticeable by the processes that use the
memory.  (The shared pages are marked read-only, and in case of a write,
do_wp_page() takes care of creating a new copy of the page.)

To find identical pages, KSM uses an algorithm that is split into three
primary levels:

1) KSM starts scanning the memory and calculates a checksum for each
   page that is registered to be scanned.
   (In the first round of scanning, KSM only calculates this checksum
   for all the pages.)

2) KSM goes over the whole memory again and recalculates the checksum of
   the pages.  Pages that are found to have the same checksum value as
   before are considered "pages that most likely won't change".
   KSM inserts these pages into an RB-tree sorted by page content that
   is called the "unstable tree".  The reason this tree is called
   unstable is that the page contents might change while they are still
   inside the tree, and therefore the tree can become corrupted.
   Due to this problem, KSM takes two more steps in addition to the
   checksum calculation:
   a) KSM throws away and recreates the entire unstable tree on each
      round of memory scanning - so if there is corruption, it is fixed
      when the tree is rebuilt.
   b) KSM uses an RB-tree, whose balancing is determined by the node
      colors and not by the content, so even if a page gets corrupted,
      searching it still takes the same amount of time.
3) In addition to the unstable tree, KSM holds another tree called the
   "stable tree" - an RB-tree sorted by page content whose pages are all
   write-protected, so it cannot get corrupted.
   Each time KSM finds two identical pages using the unstable tree, it
   creates a new write-protected shared page, and this page is inserted
   into the stable tree and kept there.  The stable tree, unlike the
   unstable tree, is never thrown away, so every page we find is saved
   inside it.

Taking into account the three levels described above, the algorithm
works like this:

search primary tree (sorted by entire page contents, pages write protected)
- if match found, merge
- if no match found...
  - search secondary tree (sorted by entire page contents, pages not
    write protected)
  - if match found, merge
    - remove from secondary tree and insert merged page into primary tree
  - if no match found...
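The checksum gate plus the stable/unstable lookup above can be modeled in a few dozen lines. The sketch below is a toy, not the driver: linear lists stand in for the RB-trees, a byte sum stands in for the real checksum, and all names (`scan_round`, `struct page` here) are illustrative.

```c
#include <string.h>

/* Toy model of one KSM scan round as described above. */
#define PAGE_SZ   64
#define MAX_PAGES 16

struct page { unsigned char data[PAGE_SZ]; unsigned old_sum; int shared; };

static unsigned checksum(const struct page *p)
{
    unsigned s = 0;
    for (int i = 0; i < PAGE_SZ; i++)
        s += p->data[i];
    return s;
}

/* Pages whose checksum is unchanged since the last round are candidates.
 * A candidate matching a stable page is merged into it; two matching
 * candidates create a new stable (write-protected, shared) page.
 * Returns how many pages are now backed by a shared page. */
static int scan_round(struct page *pages, int n,
                      struct page *stable, int *n_stable)
{
    struct page *unstable[MAX_PAGES];
    int n_unstable = 0;  /* the unstable "tree" is rebuilt every round */
    int merged = 0;

    for (int i = 0; i < n; i++) {
        struct page *p = &pages[i];
        unsigned sum = checksum(p);
        int unchanged = (sum == p->old_sum);
        p->old_sum = sum;
        if (!unchanged || p->shared)
            continue;
        int found = 0;
        for (int j = 0; j < *n_stable; j++)             /* stable tree */
            if (!memcmp(stable[j].data, p->data, PAGE_SZ)) {
                p->shared = 1; found = 1; break;
            }
        for (int j = 0; !found && j < n_unstable; j++)  /* unstable tree */
            if (!memcmp(unstable[j]->data, p->data, PAGE_SZ)) {
                memcpy(stable[(*n_stable)++].data, p->data, PAGE_SZ);
                unstable[j]->shared = p->shared = 1;
                found = 1;
            }
        if (!found)
            unstable[n_unstable++] = p;
    }
    for (int i = 0; i < n; i++)
        merged += pages[i].shared;
    return merged;
}
```

Note how the first round merges nothing: it only records checksums, which is exactly the "first round only calculates" behavior described in level 1.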
[PATCH 2/4] add page_wrprotect(): write protecting page.
This patch adds a new function called page_wrprotect(); it is used to
take a page and mark all the ptes that point to it as read-only.

The function works by walking the rmap of the page and setting each pte
related to the page as read-only.

The odirect_sync parameter is used to protect against possible races
with O_DIRECT while we are marking the ptes read-only, as noted by
Andrea Arcangeli:

"While thinking at get_user_pages_fast I figured another worse way
things can go wrong with ksm and o_direct: think a thread writing
constantly to the last 512bytes of a page, while another thread read and
writes to/from the first 512bytes of the page. We can lose O_DIRECT
reads, the very moment we mark any pte wrprotected..."

Signed-off-by: Izik Eidus
---
 include/linux/rmap.h |   11 ++++
 mm/rmap.c            |  139 ++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 150 insertions(+), 0 deletions(-)

diff --git a/include/linux/rmap.h b/include/linux/rmap.h
index b35bc0e..469376d 100644
--- a/include/linux/rmap.h
+++ b/include/linux/rmap.h
@@ -118,6 +118,10 @@ static inline int try_to_munlock(struct page *page)
 }
 #endif

+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int page_wrprotect(struct page *page, int *odirect_sync, int count_offset);
+#endif
+
 #else	/* !CONFIG_MMU */

 #define anon_vma_init()		do {} while (0)
@@ -132,6 +136,13 @@ static inline int page_mkclean(struct page *page)
 	return 0;
 }

+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+static inline int page_wrprotect(struct page *page, int *odirect_sync,
+				 int count_offset)
+{
+	return 0;
+}
+#endif

 #endif	/* CONFIG_MMU */
diff --git a/mm/rmap.c b/mm/rmap.c
index 1652166..95c55ea 100644
--- a/mm/rmap.c
+++ b/mm/rmap.c
@@ -585,6 +585,145 @@ int page_mkclean(struct page *page)
 }
 EXPORT_SYMBOL_GPL(page_mkclean);

+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+static int page_wrprotect_one(struct page *page, struct vm_area_struct *vma,
+			      int *odirect_sync, int count_offset)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	unsigned long address;
+	pte_t *pte;
+	spinlock_t *ptl;
+	int ret = 0;
+
+	address = vma_address(page, vma);
+	if (address == -EFAULT)
+		goto out;
+
+	pte = page_check_address(page, mm, address, &ptl, 0);
+	if (!pte)
+		goto out;
+
+	if (pte_write(*pte)) {
+		pte_t entry;
+
+		flush_cache_page(vma, address, pte_pfn(*pte));
+		/*
+		 * Ok this is tricky, when get_user_pages_fast() run it doesnt
+		 * take any lock, therefore the check that we are going to make
+		 * with the pagecount against the mapcount is racey and
+		 * O_DIRECT can happen right after the check.
+		 * So we clear the pte and flush the tlb before the check
+		 * this assure us that no O_DIRECT can happen after the check
+		 * or in the middle of the check.
+		 */
+		entry = ptep_clear_flush(vma, address, pte);
+		/*
+		 * Check that no O_DIRECT or similar I/O is in progress on the
+		 * page
+		 */
+		if ((page_mapcount(page) + count_offset) != page_count(page)) {
+			*odirect_sync = 0;
+			set_pte_at_notify(mm, address, pte, entry);
+			goto out_unlock;
+		}
+		entry = pte_wrprotect(entry);
+		set_pte_at_notify(mm, address, pte, entry);
+	}
+	ret = 1;
+
+out_unlock:
+	pte_unmap_unlock(pte, ptl);
+out:
+	return ret;
+}
+
+static int page_wrprotect_file(struct page *page, int *odirect_sync,
+			       int count_offset)
+{
+	struct address_space *mapping;
+	struct prio_tree_iter iter;
+	struct vm_area_struct *vma;
+	pgoff_t pgoff = page->index << (PAGE_CACHE_SHIFT - PAGE_SHIFT);
+	int ret = 0;
+
+	mapping = page_mapping(page);
+	if (!mapping)
+		return ret;
+
+	spin_lock(&mapping->i_mmap_lock);
+
+	vma_prio_tree_foreach(vma, &iter, &mapping->i_mmap, pgoff, pgoff)
+		ret += page_wrprotect_one(page, vma, odirect_sync,
+					  count_offset);
+
+	spin_unlock(&mapping->i_mmap_lock);
+
+	return ret;
+}
+
+static int page_wrprotect_anon(struct page *page, int *odirect_sync,
+			       int count_offset)
+{
+	struct vm_area_struct *vma;
+	struct anon_vma *anon_vma;
+	int ret = 0;
+
+	anon_vma = page_lock_anon_vma(page);
+	if (!anon_vma)
+		return ret;
+
+	/*
+	 * If the page is inside the swap cache, its _count number was
+	 * increased by one, therefore we have to increase count
[PATCH 3/4] add replace_page(): change the page pte is pointing to.
replace_page() allows changing the mapping of a pte from one physical page to a different physical page. It works by removing oldpage from the rmap and calling put_page() on it, then setting the pte to point to newpage and inserting newpage into the rmap using page_add_file_rmap().

Note: newpage must be a non-anonymous page. The reason is that replace_page() is built to allow mapping one page at more than one virtual address; the mapping of this page can be at a different offset inside each vma, and therefore we cannot trust page->index anymore. The side effect of this is that newpage cannot be anything but a kernel-allocated page that is not swappable.

Signed-off-by: Izik Eidus
---
 include/linux/mm.h |  5 +++
 mm/memory.c        | 80 ++++++++++++++++++++++++++++++++++++++++++++++
 2 files changed, 85 insertions(+), 0 deletions(-)

diff --git a/include/linux/mm.h b/include/linux/mm.h
index bff1f0d..7a831ce 100644
--- a/include/linux/mm.h
+++ b/include/linux/mm.h
@@ -1240,6 +1240,11 @@ int vm_insert_pfn(struct vm_area_struct *vma, unsigned long addr,
 int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 			unsigned long pfn);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+		 struct page *newpage, pte_t orig_pte, pgprot_t prot);
+#endif
+
 struct page *follow_page(struct vm_area_struct *, unsigned long address,
 			unsigned int foll_flags);
 #define FOLL_WRITE	0x01	/* check pte is writable */
diff --git a/mm/memory.c b/mm/memory.c
index 1e1a14b..d6e53c2 100644
--- a/mm/memory.c
+++ b/mm/memory.c
@@ -1567,6 +1567,86 @@ int vm_insert_mixed(struct vm_area_struct *vma, unsigned long addr,
 }
 EXPORT_SYMBOL(vm_insert_mixed);
 
+#if defined(CONFIG_KSM) || defined(CONFIG_KSM_MODULE)
+
+/**
+ * replace_page - replace the page a pte maps with a new page
+ * @vma: vma that holds the pte pointing at @oldpage
+ * @oldpage: the page we are replacing with @newpage
+ * @newpage: the page we replace @oldpage with
+ * @orig_pte: the original value of the pte
+ * @prot: page protection bits
+ *
+ * Returns 0 on success, -EFAULT on failure.
+ *
+ * Note: @newpage must not be an anonymous page because replace_page() does
+ * not change the mapping of @newpage to have the same values as @oldpage.
+ * @newpage can be mapped in several vmas at different offsets (page->index).
+ */
+int replace_page(struct vm_area_struct *vma, struct page *oldpage,
+		 struct page *newpage, pte_t orig_pte, pgprot_t prot)
+{
+	struct mm_struct *mm = vma->vm_mm;
+	pgd_t *pgd;
+	pud_t *pud;
+	pmd_t *pmd;
+	pte_t *ptep;
+	spinlock_t *ptl;
+	unsigned long addr;
+	int ret;
+
+	BUG_ON(PageAnon(newpage));
+
+	ret = -EFAULT;
+	addr = page_address_in_vma(oldpage, vma);
+	if (addr == -EFAULT)
+		goto out;
+
+	pgd = pgd_offset(mm, addr);
+	if (!pgd_present(*pgd))
+		goto out;
+
+	pud = pud_offset(pgd, addr);
+	if (!pud_present(*pud))
+		goto out;
+
+	pmd = pmd_offset(pud, addr);
+	if (!pmd_present(*pmd))
+		goto out;
+
+	ptep = pte_offset_map_lock(mm, pmd, addr, &ptl);
+	if (!ptep)
+		goto out;
+
+	if (!pte_same(*ptep, orig_pte)) {
+		pte_unmap_unlock(ptep, ptl);
+		goto out;
+	}
+
+	ret = 0;
+	get_page(newpage);
+	page_add_file_rmap(newpage);
+
+	flush_cache_page(vma, addr, pte_pfn(*ptep));
+	ptep_clear_flush(vma, addr, ptep);
+	set_pte_at_notify(mm, addr, ptep, mk_pte(newpage, prot));
+
+	page_remove_rmap(oldpage);
+	if (PageAnon(oldpage)) {
+		dec_mm_counter(mm, anon_rss);
+		inc_mm_counter(mm, file_rss);
+	}
+	put_page(oldpage);
+
+	pte_unmap_unlock(ptep, ptl);
+
+out:
+	return ret;
+}
+EXPORT_SYMBOL_GPL(replace_page);
+
+#endif
+
 /*
  * maps a range of physical memory into the requested pages.  the old
  * mappings are removed.  any references to nonexistent pages results
--
1.5.6.5
--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] KVM: Expand on "help" info to specify kvm intel and amd module names.
Signed-off-by: Robert P. J. Day
---
diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 0a303c3..68c7e21 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -50,6 +50,9 @@ config KVM_INTEL
 	  Provides support for KVM on Intel processors equipped with the VT
 	  extensions.
 
+	  To compile this as a module, choose M here: the module
+	  will be called kvm-intel.
+
 config KVM_AMD
 	tristate "KVM for AMD processors support"
 	depends on KVM
@@ -57,6 +60,9 @@ config KVM_AMD
 	  Provides support for KVM on AMD processors equipped with the AMD-V
 	  (SVM) extensions.
 
+	  To compile this as a module, choose M here: the module
+	  will be called kvm-amd.
+
 config KVM_TRACE
 	bool "KVM trace support"
 	depends on KVM && MARKERS && SYSFS

Robert P. J. Day
Linux Consulting, Training and Annoying Kernel Pedantry:
  Have classroom, will lecture.
http://crashcourse.ca  Waterloo, Ontario, CANADA
Re: IO on guest is 20 times slower than host
Joerg Roedel wrote:

index 1fcbc17..d9774e9 100644
--- a/kernel/x86/kvm/svm.c
+++ b/kernel/x86/kvm/svm.c
@@ -575,7 +575,7 @@ static void init_vmcb(struct vcpu_svm *svm)
 					INTERCEPT_CR3_MASK);
 	control->intercept_cr_write &= ~(INTERCEPT_CR0_MASK|
 					 INTERCEPT_CR3_MASK);
-	save->g_pat = 0x0007040600070406ULL;
+	save->g_pat = 0x0606060606060606ULL;
 	/* enable caching because the QEMU Bios doesn't enable it */
 	save->cr0 = X86_CR0_ET;
 	save->cr3 = 0;

Yeah, that patch makes sense. But I think we need some more work on this, because the guest may change the PAT MSR afterwards. Best would be a simple shadow of the PAT MSR. The last question is how this will affect pci passthrough.

I've noticed that Windows (and likely Linux, I didn't test) maps the cirrus framebuffer with PWT=1, which slows down the emulated framebuffer, so this patch should speed things up. If a device is assigned, we must respect the guest PAT, so cirrus performance will be low. On Intel there's an 'ignore PAT' bit which can be set on an EPT pte for the framebuffer. Is there any trick we can do on AMD to achieve a similar result?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
[PATCH] Fix display breakage when resizing the screen
When the vga resolution changes, a new display surface is not allocated immediately; instead that is deferred until the next update. However, if we're running without a display client attached, that won't happen, and the next bitblt is likely to cause a segfault by overflowing the display surface. Fix by reallocating the display immediately when the resolution changes.

Tested with (Windows|Linux) x (cirrus|std) x (curses|sdl).

Signed-off-by: Avi Kivity
---
 hw/cirrus_vga.c |  11 ++-
 hw/vga.c        | 261 ++-
 hw/vga_int.h    |   4 +
 3 files changed, 156 insertions(+), 120 deletions(-)

diff --git a/hw/cirrus_vga.c b/hw/cirrus_vga.c
index 08fd4c2..223008e 100644
--- a/hw/cirrus_vga.c
+++ b/hw/cirrus_vga.c
@@ -1392,6 +1392,8 @@ cirrus_hook_write_sr(CirrusVGAState * s, unsigned reg_index, int reg_value)
         break;
     }
 
+    vga_update_resolution((VGAState *)s);
+
     return CIRRUS_HOOK_HANDLED;
 }
@@ -1419,6 +1421,7 @@ static void cirrus_write_hidden_dac(CirrusVGAState * s, int reg_value)
 #endif
     }
     s->cirrus_hidden_dac_lockindex = 0;
+    vga_update_resolution((VGAState *)s);
 }
 
 /***
@@ -1705,6 +1708,8 @@ cirrus_hook_write_cr(CirrusVGAState * s, unsigned reg_index, int reg_value)
         break;
     }
 
+    vga_update_resolution((VGAState *)s);
+
     return CIRRUS_HOOK_HANDLED;
 }
@@ -2830,6 +2835,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         if (s->ar_flip_flop == 0) {
             val &= 0x3f;
             s->ar_index = val;
+            vga_update_resolution((VGAState *)s);
         } else {
             index = s->ar_index & 0x1f;
             switch (index) {
@@ -2923,6 +2929,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
             /* can always write bit 4 of CR7 */
             if (s->cr_index == 7)
                 s->cr[7] = (s->cr[7] & ~0x10) | (val & 0x10);
+            vga_update_resolution((VGAState *)s);
             return;
         }
         switch (s->cr_index) {
@@ -2951,6 +2958,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
             s->update_retrace_info((VGAState *) s);
             break;
         }
+        vga_update_resolution((VGAState *)s);
         break;
     case 0x3ba:
     case 0x3da:
@@ -3157,7 +3165,8 @@ static int cirrus_vga_load(QEMUFile *f, void *opaque, int version_id)
     cirrus_update_memory_access(s);
 
     /* force refresh */
-    s->graphic_mode = -1;
+    vga_update_resolution((VGAState *)s);
+    s->want_full_update = 1;
 
     cirrus_update_bank_ptr(s, 0);
     cirrus_update_bank_ptr(s, 1);
     return 0;
diff --git a/hw/vga.c b/hw/vga.c
index b1e4373..404450f 100644
--- a/hw/vga.c
+++ b/hw/vga.c
@@ -36,6 +36,10 @@
 
 //#define DEBUG_BOCHS_VBE
 
+#define GMODE_TEXT     0
+#define GMODE_GRAPH    1
+#define GMODE_BLANK    2
+
 /* force some bits to zero */
 const uint8_t sr_mask[8] = {
     0x03,
@@ -393,6 +397,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         if (s->ar_flip_flop == 0) {
             val &= 0x3f;
             s->ar_index = val;
+            vga_update_resolution(s);
         } else {
             index = s->ar_index & 0x1f;
             switch(index) {
@@ -433,6 +438,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
 #endif
         s->sr[s->sr_index] = val & sr_mask[s->sr_index];
         if (s->sr_index == 1) s->update_retrace_info(s);
+        vga_update_resolution(s);
         break;
     case 0x3c7:
         s->dac_read_index = val;
@@ -460,6 +466,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         printf("vga: write GR%x = 0x%02x\n", s->gr_index, val);
 #endif
         s->gr[s->gr_index] = val & gr_mask[s->gr_index];
+        vga_update_resolution(s);
         break;
     case 0x3b4:
     case 0x3d4:
@@ -475,6 +482,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
         /* can always write bit 4 of CR7 */
         if (s->cr_index == 7)
             s->cr[7] = (s->cr[7] & ~0x10) | (val & 0x10);
+        vga_update_resolution(s);
         return;
     }
     switch(s->cr_index) {
@@ -502,6 +510,7 @@ static void vga_ioport_write(void *opaque, uint32_t addr, uint32_t val)
             s->update_retrace_info(s);
             break;
         }
+        vga_update_resolution(s);
         break;
     case 0x3ba:
     case 0x3da:
@@ -581,11 +590,13 @@ static void vbe_ioport_write_data(void *opaque, uint32_t addr, uint32_t val)
             if ((val <= VBE_DISPI_MAX_XRES) && ((val & 7) == 0)) {
                 s->vbe_regs[s->vbe_index] = val;
             }
+            vga_update_resolution(s);
             break;
         case VBE_DISPI_INDEX_YRES:
             if (val <= VBE_DISPI_MAX_YRES) {
                 s->vbe_regs[s->vbe_index] = val;
             }
+            vga_u
Re: cr3 OOS optimisation breaks 32-bit GNU/kFreeBSD guest
Marcelo Tosatti wrote:
On Tue, Mar 24, 2009 at 11:47:33AM +0200, Avi Kivity wrote:

index 2ea8262..48169d7 100644
--- a/arch/x86/kvm/x86.c
+++ b/arch/x86/kvm/x86.c
@@ -3109,6 +3109,8 @@ static int vcpu_enter_guest(struct kvm_vcpu *vcpu, struct kvm_run *kvm_run)
 		kvm_write_guest_time(vcpu);
 	if (test_and_clear_bit(KVM_REQ_MMU_SYNC, &vcpu->requests))
 		kvm_mmu_sync_roots(vcpu);
+	if (test_and_clear_bit(KVM_REQ_MMU_GLOBAL_SYNC, &vcpu->requests))
+		kvm_mmu_sync_global(vcpu);
 	if (test_and_clear_bit(KVM_REQ_TLB_FLUSH, &vcpu->requests))
 		kvm_x86_ops->tlb_flush(vcpu);
 	if (test_and_clear_bit(KVM_REQ_REPORT_TPR_ACCESS

Windows will (I think) write a PDE on every context switch, so this effectively disables global unsync for that guest.

What about recursively syncing the newly linked page in FNAME(fetch)()? If the page isn't global, this becomes a no-op, so no new overhead. The only question is the expense when linking a populated top-level page, especially in long mode.

How about this?

KVM: MMU: sync global pages on fetch()

If an unsync global page becomes unreachable via the shadow tree, which can happen if one of its parent pages is zapped, invlpg will fail to invalidate translations for gvas contained in such unreachable pages. So sync global pages in fetch().

Signed-off-by: Marcelo Tosatti

diff --git a/arch/x86/kvm/paging_tmpl.h b/arch/x86/kvm/paging_tmpl.h
index 09782a9..728be72 100644
--- a/arch/x86/kvm/paging_tmpl.h
+++ b/arch/x86/kvm/paging_tmpl.h
@@ -308,8 +308,14 @@ static u64 *FNAME(fetch)(struct kvm_vcpu *vcpu, gva_t addr,
 			break;
 		}
 
-		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep))
+		if (is_shadow_present_pte(*sptep) && !is_large_pte(*sptep)) {
+			if (level-1 == PT_PAGE_TABLE_LEVEL) {
+				shadow_page = page_header(__pa(sptep));
+				if (shadow_page->unsync && shadow_page->global)
+					kvm_sync_page(vcpu, shadow_page);
+			}
 			continue;
+		}
 
 		if (is_large_pte(*sptep)) {
 			rmap_remove(vcpu->kvm, sptep);

But here the shadow page is already linked? Isn't the root cause that an invlpg was called while the page wasn't linked, so the invlpg didn't see it?

That's why I thought the best place would be in fetch(), after kvm_mmu_get_page(): if we're linking a page which contains global ptes, they might be unsynced due to invlpgs that we've missed. Or am I missing something about the root cause?

--
I have a truly marvellous patch that fixes the bug which this
signature is too narrow to contain.
persistent tun & different virtual NICs & dead guest network
Hello.

Two days debugging an... issue here, and I finally got it. To make the long and painful (it was for me, anyway) story short:

kvm provides a way to control various offload settings on the "host side" of the tun network device (I mean the `-net tap' setup) from within the guest. I.e., the guest can set/clear various offload bits according to its capabilities/wishes.

The problem is that the different virtual NICs used by kvm/qemu expect and set different offload bits for the virtual NIC, and each sets only those bits which - as it "thinks" - differ from the default (all off). This means that when changing the virtual NIC model AND using a persistent tun device, it's very likely you'll end up with inconsistent flags.

For example, here's what the offload settings on the host look like after using the e1000 driver in the guest (freshly created persistent tun device):

rx-checksumming: on
tx-checksumming: on
scatter-gather: on
tcp segmentation offload: on
udp fragmentation offload: off
generic segmentation offload: off
large receive offload: off

Here's the same after using virtio_net instead:

rx-checksumming: on
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: off
large receive offload: off

I.e., only rx-checksumming. When using virtio_net from 2.6.29, which supports LRO, it also turns on large receive offload.

Now, say, I tried a guest with the e1000 driver, and it turned on the tx, sg and tso bits. And now I'm trying to run a guest with the new virtio-net NIC instead. It turns on the lro bit, but the network does not work anyway: almost every packet sent from host to guest has an incorrect checksum, because the NIC is marked as able to do tx-checksumming but does not actually do it. The network is dead.

Now, after trying this and that, not understanding what's going on etc., let's reboot back with the e1000 NIC which worked a few minutes ago... only to discover that it does not work anymore either! Because the previous attempt with virtio_net resulted in lro being on, and the e1000 driver does not support it! So now we have a non-working network again, and it no longer matters which driver we try: neither will work, because the offload settings are broken.

What's more: one can't control this stuff from the host side using standard ethtool: it says the operation is not supported (I wonder how kvm performs the settings changes).

The workaround is to re-create the tun device before changing the virtual NIC model. But that isn't always possible, especially when guests are run by a non-root user (where persistent tun devices are most useful).

Can this be fixed somehow, please? I think all the settings should be reset to 0 when opening the tun device.

Thanks.

/mjt, who lost 2 more days and had another sleepless night trying to understand what's going wrong...