Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset
Ryan Harper wrote:
> There is a race between when the vcpu thread issues a create ioctl and when
> apic_reset() gets called resulting in getting a badfd error.

The problem is indeed there, but the fix is wrong:

> main thread                              vcpu thread

> diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
> index 78127de..3513e8c 100644
> --- a/qemu/qemu-kvm.c
> +++ b/qemu/qemu-kvm.c
> @@ -31,7 +31,9 @@ extern int smp_cpus;
>  static int qemu_kvm_reset_requested;
>  
>  pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
> +pthread_mutex_t vcpu_mutex = PTHREAD_MUTEX_INITIALIZER;
>  pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER;
> +pthread_cond_t qemu_vcpuup_cond = PTHREAD_COND_INITIALIZER;
>  __thread struct vcpu_info *vcpu;
>  
>  struct qemu_kvm_signal_table {
> @@ -369,6 +371,11 @@ static void *ap_main_loop(void *_env)
>      sigfillset(&signals);
>      sigprocmask(SIG_BLOCK, &signals, NULL);
>      kvm_create_vcpu(kvm_context, env->cpu_index);
> +    /* block until cond_wait occurs */
> +    pthread_mutex_lock(&vcpu_mutex);
> +    /* now we can signal */
> +    pthread_cond_signal(&qemu_vcpuup_cond);
> +    pthread_mutex_unlock(&vcpu_mutex);
>      kvm_qemu_init_env(env);
>      kvm_main_loop_cpu(env);
>      return NULL;
> @@ -388,9 +395,10 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
>  
>  void kvm_init_new_ap(int cpu, CPUState *env)
> {
> +    pthread_mutex_lock(&vcpu_mutex);
>      pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
> -    /* FIXME: wait for thread to spin up */
> -    usleep(200);
> +    pthread_cond_wait(&qemu_vcpuup_cond, &vcpu_mutex);

pthread_cond_wait() is never correct outside a loop. The signal may arrive
before wait is called. The usual idiom is

    while (condition is not fulfilled)
        pthread_cond_wait();

I see you have something there to ensure we block, but please use the right
idiom.

> +    pthread_mutex_unlock(&vcpu_mutex);
>  }

Please reuse qemu_mutex for this, no need for a new one.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
-------------------------------------------------------------------------
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference
Don't miss this year's exciting event. There's still time to save $100.
Use priority code J8TL2D2.
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
_______________________________________________
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel
Re: [kvm-devel] WARNING: at /usr/src/modules/kvm/mmu.c:390 account_shadowed()
Marcelo Tosatti wrote:
> On Wed, Apr 23, 2008 at 09:30:06AM +0300, Avi Kivity wrote:
>>> as I got no reply, I guess it is a bad setup on my part. If that might
>>> help, this happenned while I was doing a "make -j" on webkit svn tree
>>> (ie. heavy c++ compilation workload).
>>
>> No this is not bad setup. No amount of bad setup should give this warning.
>>
>> You didn't get a reply because no one knows what to make of it, and
>> because it's much more fun to debate endianess or contemplete guests
>> with eighty thousand disks than to fix those impossible bugs. If you
>> can give clear instructions on how to reproduce this, we will try it
>> out. Please be sure to state OS name and versions for the guest as well
>> as the host.
>
> It is valid to have more than PAGES_PER_HPAGE in the largepage's
> shadowed count. If the gpte read races with a pte-update-from-guest (and
> the pte update results in a different sp->role), it might account twice
> for a single gfn.
>
> Such "zombie" shadow pages should eventually be removed through
> recycling, allowing for instantiation of a large page, unless references
> can be leaked. Can't spot such leakage problem though.

That strikes me as unlikely (though a valid scenario). An alternative
explanation is that we're seeing a nonpae guest, so each page can be
shadowed in two different roles (two quadrants for a pte page) or even
four (for a pgd page).

Thomas, are you running a 32-bit nonpae guest?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] Real Mode Improvement on Intel Hosts - GSoC Project
On Sat, Apr 26, 2008 at 9:49 AM, Avi Kivity <[EMAIL PROTECTED]> wrote:
> Anthony Liguori wrote:
>> The second stage is to use a loop of x86_emulate() to run all 16-bit code
>> (instead of using vm86 mode). This will allow us to support guests that
>> use big real mode.
>
> Why do that unconditionally, instead of only when in a big-real-mode
> state?

Is big-real-mode the only state where we have problems?
[kvm-devel] Kernel Oops with kvm 66 running WinXP
Hi,

I've experienced a kernel Oops on 2.6.24 with kvm 66 on AMD in 64bit mode
while starting up WinXP:

kvm: emulating exchange as write
Unable to handle kernel NULL pointer dereference at
RIP: [] :kvm:x86_emulate_insn+0x3fa/0x4240
PGD 7658d067 PUD 242a6067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: bridge llc reiserfs tun kvm_amd kvm nfs nfsd lockd nfs_acl auth_rpcgss sunrpc exportfs w83627ehf hwmon_vid autofs4 snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device iptable_filter ip_tables ip6table_filter ip6_tables x_tables af_packet ipv6 fuse ext2 loop snd_hda_intel snd_pcm snd_timer k8temp i2c_nforce2 i2c_core hwmon sr_mod button cdrom snd soundcore snd_page_alloc forcedeth sg floppy linear sd_mod ehci_hcd ohci_hcd usbcore dm_snapshot edd dm_mod fan sata_nv pata_amd libata scsi_mod thermal processor
Pid: 3139, comm: qemu-system-x86 Not tainted 2.6.24-mludvig #1
RIP: 0010:[] [] :kvm:x86_emulate_insn+0x3fa/0x4240
RSP: 0018:8100609fdc18 EFLAGS: 00010246
RAX: 8001003b RBX: RCX: RDX:
RSI: RDI: 8100609fe000 RBP: 8100609ff320
R08: 8100609ff3c0 R09: 0006 R10: 0002
R11: R12: 8100609ff368 R13: 8100609ff3c0
R14: 883be600 R15: 01971353
FS: 40804950(0063) GS:8053e000() knlGS:
CS: 0010 DS: ES: CR0: 8005003b
CR2: CR3: 6bb74000 CR4: 06e0
DR0: DR1: DR2: DR3:
DR6: 0ff0 DR7: 0400
Process qemu-system-x86 (pid: 3139, threadinfo 8100609fc000, task 8100794d5680)
Stack: 609fdc74 12187318 0ea5f068 8100609ff3c0 8100609fdc94 8839c9e0 8100609fe000 8100609ff320
Call Trace:
 [] :kvm:kvm_get_cs_db_l_bits+0x20/0x40
 [] :kvm:emulate_instruction+0x1bf/0x340
 [] :kvm_amd:emulate_on_interception+0x12/0x60
 [] :kvm:kvm_arch_vcpu_ioctl_run+0x169/0x6d0
 [] :kvm:kvm_vcpu_ioctl+0x41c/0x440
 [] __wake_up+0x43/0x70
 [] __up_read+0x21/0xb0
 [] futex_wake+0xcc/0xf0
 [] do_futex+0x129/0xbb0
 [] __dequeue_entity+0x3d/0x50
 [] :kvm:kvm_vm_ioctl+0x85/0x200
 [] do_ioctl+0x2f/0xa0
 [] vfs_ioctl+0x220/0x2d0
 [] sys_ioctl+0x91/0xb0
 [] system_call+0x7e/0x83
Code: 66 89 02 e9 ee fc ff ff 48 8b 95 88 00 00 00 48 8b 45 78 88
RIP [] :kvm:x86_emulate_insn+0x3fa/0x4240
RSP
CR2:
---[ end trace d358bab3f035112e ]---

The host is still alive but the XP guest is locked up in a boot screen.

Michal
[kvm-devel] Problem to dedicate 1 public IP address directly to guest system
Hello,

Is there a solution for assigning a public IP address directly to a guest
system? In my test case, the host system currently holds all the public
addresses and I set up NAT for the guests. Some applications, like VoIP, are
difficult to run behind NAT, so the problem is running this kind of
application in the guest systems.

If you have a public IP subnet, for example a /29, it is no problem to assign
one IP to the host and route to the guest. But if you have a few servers in
server housing, nobody gives you a /29; usually you have only a few IP
addresses, each from a different subnet.

I don't know if this is a question for KVM or qemu.

Thanks for any hints,
Tomas Rusnak

-- 
ICQ: 147137905 Skype: linker83
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Thu, Apr 24, 2008 at 07:41:45PM +0200, Andrea Arcangeli wrote:
> A full new update will soon become visible here:
>
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14-pre3/

I grabbed these and built them. The only change needed was another include.
After that, everything built fine and the xpmem regression tests ran through
the first four sets. The fifth is the oversubscription test, which trips my
xpmem bug. This is as good as the v12 runs from before.

Since this include and the one for mm_types.h both are build breakages for
ia64, I think you need to apply your ia64_cpumask and the following (possibly
as a single patch) first or in your patch 1. Without that, ia64 doing a
git-bisect could hit a build failure.

Index: mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h
===
--- mmu_v14_pre3_xpmem_v003_v1.orig/include/linux/srcu.h	2008-04-26 06:41:54.0 -0500
+++ mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h	2008-04-26 07:01:17.292071827 -0500
@@ -27,6 +27,8 @@
 #ifndef _LINUX_SRCU_H
 #define _LINUX_SRCU_H
 
+#include
+
 struct srcu_struct_array {
 	int c[2];
 };
Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote:
> Since this include and the one for mm_types.h both are build breakages
> for ia64, I think you need to apply your ia64_cpumask and the following
> (possibly as a single patch) first or in your patch 1. Without that,
> ia64 doing a git-bisect could hit a build failure.

Agreed, so it doesn't risk to break ia64 compilation. Thanks for the great
XPMEM feedback!

Also note, I figured out that mmu_notifier_release can actually run
concurrently against other mmu notifiers in case there's a vmtruncate
(->release could already run concurrently if invoked by _unregister; the only
guarantee is that ->release will be called one time and only one time, and
that no mmu notifier will ever run after _unregister returns).

In short, I can't keep the list_del_init in _release and I need a
list_del_init_rcu instead to fix this minor issue. So this won't really make
much difference after all. I'll release #v14 with all this after a bit of kvm
testing with it...

diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -755,6 +755,14 @@ static inline void hlist_del_init(struct
 	}
 }
 
+static inline void hlist_del_init_rcu(struct hlist_node *n)
+{
+	if (!hlist_unhashed(n)) {
+		__hlist_del(n);
+		n->pprev = NULL;
+	}
+}
+
 /**
  * hlist_replace_rcu - replace old entry by new one
  * @old : the element to be replaced
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -22,7 +22,10 @@ struct mmu_notifier_ops {
 	/*
 	 * Called either by mmu_notifier_unregister or when the mm is
 	 * being destroyed by exit_mmap, always before all pages are
-	 * freed. It's mandatory to implement this method.
+	 * freed. It's mandatory to implement this method. This can
+	 * run concurrently to other mmu notifier methods and it
+	 * should teardown all secondary mmu mappings and freeze the
+	 * secondary mmu.
 	 */
 	void (*release)(struct mmu_notifier *mn,
 			struct mm_struct *mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -19,12 +19,13 @@
 /*
  * This function can't run concurrently against mmu_notifier_register
- * or any other mmu notifier method. mmu_notifier_register can only
- * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is
- * zero). All other tasks of this mm already quit so they can't invoke
- * mmu notifiers anymore. This can run concurrently only against
- * mmu_notifier_unregister and it serializes against it with the
- * mmu_notifier_mm->lock in addition to RCU. struct mmu_notifier_mm
+ * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
+ * runs with mm_users == 0. Other tasks may still invoke mmu notifiers
+ * in parallel despite there's no task using this mm anymore, through
+ * the vmas outside of the exit_mmap context, like with
+ * vmtruncate. This serializes against mmu_notifier_unregister with
+ * the mmu_notifier_mm->lock in addition to SRCU and it serializes
+ * against the other mmu notifiers with SRCU. struct mmu_notifier_mm
  * can't go away from under us as exit_mmap holds a mm_count pin
  * itself.
  */
@@ -44,7 +45,7 @@ void __mmu_notifier_release(struct mm_st
 	 * to wait ->release to finish and
 	 * mmu_notifier_unregister to return.
 	 */
-	hlist_del_init(&mn->hlist);
+	hlist_del_init_rcu(&mn->hlist);
 	/*
 	 * SRCU here will block mmu_notifier_unregister until
 	 * ->release returns.
@@ -185,6 +186,8 @@ int mmu_notifier_register(struct mmu_not
 	 * side note: mmu_notifier_release can't run concurrently with
 	 * us because we hold the mm_users pin (either implicitly as
 	 * current->mm or explicitly with get_task_mm() or similar).
+	 * We can't race against any other mmu notifiers either thanks
+	 * to mm_lock().
 	 */
 	spin_lock(&mm->mmu_notifier_mm->lock);
 	hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);
[kvm-devel] fix external module compile
Hello,

after updating kvm-userland.git, kvm.git and linux-2.6-hg, and after make
distclean and a rebuild with a slightly reduced .config, I can't compile the
external module anymore. Looking into it with V=1, $(src) expands to "" and
including /external-module-compat.h clearly fails.

I fixed it like below, because it seems more consistent to enforce the
ordering of the "special" includes in the same place; notably $(src)/include
is already included as $LINUX at point 1 of the comment, so besides fixing my
compile this looks like a cleanup of a superfluous line in Kbuild: it moves
the external-module-compat include to the same place as the other includes,
where `pwd` works instead of $(src), which doesn't work anymore for whatever
reason.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

diff --git a/kernel/Kbuild b/kernel/Kbuild
index cabfc75..d9245eb 100644
--- a/kernel/Kbuild
+++ b/kernel/Kbuild
@@ -1,4 +1,3 @@
-EXTRA_CFLAGS := -I$(src)/include -include $(src)/external-module-compat.h
 obj-m := kvm.o kvm-intel.o kvm-amd.o
 kvm-objs := kvm_main.o x86.o mmu.o x86_emulate.o anon_inodes.o irq.o i8259.o \
 	lapic.o ioapic.o preempt.o i8254.o external-module-compat.o
diff --git a/kernel/Makefile b/kernel/Makefile
index 78ff923..e3fccbe 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -27,7 +27,8 @@ all::
 # include header priority 1) $LINUX 2) $KERNELDIR 3) include-compat
 	$(MAKE) -C $(KERNELDIR) M=`pwd` \
 		LINUXINCLUDE="-I`pwd`/include -Iinclude -I`pwd`/include-compat \
-		-include include/linux/autoconf.h" \
+		-include include/linux/autoconf.h \
+		-include `pwd`/external-module-compat.h" \
 		"$$@"
 
 sync: header-sync source-sync
Re: [kvm-devel] paravirt clock stil causing hangs in kvm-65
Hi Glauber,

sorry for the late reply. Well, I'm no longer able to reproduce the problem
in the same way (with backtraces etc.) as before, but anyway, enabling
paravirt_clock with or without your patches on an SMP guest still causes
trouble: on my Phenom machine, the kernel hangs after printing "PCI-GART -
No AMD northbridge found."; on an Intel machine, normal system boot seems to
be terribly slow, taking tens of seconds between steps, and later hangs.
Using init=/bin/sh seems to work though.

I'm wondering how I can get the backtraces I was getting before. I have soft
CPU lockup debugging enabled; what else could help?

cheers
n.

On Thu, 24 Apr 2008, Glauber Costa wrote:
> Just saw Gerd's patches flying around. Can anyone that is able to
> reproduce this problem (a subgroup of human beings that does not
> include me) test it with them applied?
>
> If it still fails, please let me know asap
>
> --
> Glauber Costa.
> "Free as in Freedom"
> http://glommer.net
>
> "The less confident you are, the more serious you have to act."
Re: [kvm-devel] Real Mode Improvement on Intel Hosts - GSoC Project
Avi Kivity wrote:
> Anthony Liguori wrote:
>> The second stage is to use a loop of x86_emulate() to run all 16-bit
>> code (instead of using vm86 mode). This will allow us to support
>> guests that use big real mode.
>
> Why do that unconditionally, instead of only when in a big-real-mode
> state?

It can be. It's probably easier from a development perspective to make it
unconditional and then to flush out all the necessary instructions.

Regards,

Anthony Liguori
[kvm-devel] mmu notifier #v14
Hello everyone,

here it is the mmu notifier #v14.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/

Please everyone involved review and (hopefully ;) ack that this is safe to
go in 2.6.26; the most important is to verify that this is a noop when
disarmed regardless of MMU_NOTIFIER=y or =n.

http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core

I'll be sending that patch to Andrew's inbox.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 8d45fab..ce3251c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
 	tristate "Kernel-based Virtual Machine (KVM) support"
 	depends on HAVE_KVM
 	select PREEMPT_NOTIFIERS
+	select MMU_NOTIFIER
 	select ANON_INODES
 	---help---
 	  Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2ad6f54..853087a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -663,6 +663,108 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
 	account_shadowed(kvm, gfn);
 }
 
+static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
+{
+	struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+	get_page(page);
+	rmap_remove(kvm, spte);
+	set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+	kvm_flush_remote_tlbs(kvm);
+	put_page(page);
+}
+
+static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *spte, *curr_spte;
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		BUG_ON(!(*spte & PT_PRESENT_MASK));
+		rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+		curr_spte = spte;
+		spte = rmap_next(kvm, rmapp, spte);
+		kvm_unmap_spte(kvm, curr_spte);
+	}
+}
+
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+	int i;
+
+	/*
+	 * If mmap_sem isn't taken, we can look the memslots with only
+	 * the mmu_lock by skipping over the slots with
+	 * userspace_addr == 0.
+	 */
+	for (i = 0; i < kvm->nmemslots; i++) {
+		struct kvm_memory_slot *memslot = &kvm->memslots[i];
+		unsigned long start = memslot->userspace_addr;
+		unsigned long end;
+
+		/* mmu_lock protects userspace_addr */
+		if (!start)
+			continue;
+
+		end = start + (memslot->npages << PAGE_SHIFT);
+		if (hva >= start && hva < end) {
+			gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+			kvm_unmap_rmapp(kvm, &memslot->rmap[gfn_offset]);
+		}
+	}
+}
+
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+	u64 *spte;
+	int young = 0;
+
+	spte = rmap_next(kvm, rmapp, NULL);
+	while (spte) {
+		int _young;
+		u64 _spte = *spte;
+		BUG_ON(!(_spte & PT_PRESENT_MASK));
+		_young = _spte & PT_ACCESSED_MASK;
+		if (_young) {
+			young = !!_young;
+			set_shadow_pte(spte, _spte & ~PT_ACCESSED_MASK);
+		}
+		spte = rmap_next(kvm, rmapp, spte);
+	}
+	return young;
+}
+
+int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+{
+	int i;
+	int young = 0;
+
+	/*
+	 * If mmap_sem isn't taken, we can look the memslots with only
+	 * the mmu_lock by skipping over the slots with userspace_addr == 0.
+	 */
+	spin_lock(&kvm->mmu_lock);
+	for (i = 0; i < kvm->nmemslots; i++) {
+		struct kvm_memory_slot *memslot = &kvm->memslots[i];
+		unsigned long start = memslot->userspace_addr;
+		unsigned long end;
+
+		/* mmu_lock protects userspace_addr */
+		if (!start)
+			continue;
+
+		end = start + (memslot->npages << PAGE_SHIFT);
+		if (hva >= start && hva < end) {
+			gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+			young |= kvm_age_rmapp(kvm, &memslot->rmap[gfn_offset]);
+		}
+	}
+	spin_unlock(&kvm->mmu_lock);
+
+	if (young)
+		kvm_flush_remote_tlbs(kvm);
+
+	return young;
+}
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
@@ -1200,6 +1302,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
 	int r;
 	int largepage = 0;
 	pfn_t pfn;
+	int mmu_seq;
 
 	down_read(&current->mm->mmap_sem);
 	if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))) {
@@ -1207,6 +1310,8 @@ static int nonpaging_map(struct kvm_vcpu
Re: [kvm-devel] Problem to dedicate 1 public IP address directly to guest system
I resend this, because it could be helpful for other users.

From David Mair <[EMAIL PROTECTED]>:

I saw your message on the kvm-devel list and I think I can advise you. I
don't yet have a functioning subscription to the list, so replying there is
not simple for me. If this works out for you, please post a message to the
kvm-devel list with the solution you use (no need to mention me).

To be honest, the problem isn't really a kvm or qemu one; it's an Ethernet
issue, I think. One solution is to create a bridge interface on the host,
add your public-facing NIC to the bridge, create a tap interface and attach
it to the same bridge, then use that tap interface for the guest you want to
assign a public IP to.

Assuming Linux:

* You'll need a bridge-utils package or you'll need to build it
* Assuming your host's public interface is eth0, then as root:

  # brctl addbr br0
  # brctl addif br0 eth0
  # tunctl -u <username> -t qtap0
  # brctl addif br0 qtap0

Replace eth0 with the real interface name of your host's public-facing
interface. Replace <username> with the username that will be starting the
guest. Replace qtap0 with a name you prefer for the tap interface.

Now you can start the guest with the following -net option:

  qemu -net [vlan=n,]ifname=qtap0,script=no

Replace qtap0 with whatever you used for the tunctl command above. If you
are using a vlan other than zero for the guest NIC, then include the vlan=n
option, replacing n with the vlan number.

You can use script=yes and modify /etc/qemu-ifup to perform the bridge setup
commands above, but I prefer to set up the bridge and tap at init time and
just have the guest use the preset bridge.

You can now use the relevant static IP address as-is on the guest, because
the guest's NIC is on the public-facing cabling.

Good luck. I think you should be able to solve this fairly easily.

-- 
ICQ: 147137905 Skype: linker83
Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ryan Harper wrote:
> +    /* block until cond_wait occurs */
> +    pthread_mutex_lock(&vcpu_mutex);
> +    /* now we can signal */
> +    pthread_cond_signal(&qemu_vcpuup_cond);
> +    pthread_mutex_unlock(&vcpu_mutex);

It is not necessary to take the mutex for a condvar if you're just waking a
waiter. This just unnecessarily slows things down.

- -- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIE19D2ijCOnn/RHQRAvLqAJ9j+7ICPHpB/gCthCelDgYn5poGMgCfaZVy
+tE5zOuxqlBMUR7Fgufw/wY=
=5j4s
-----END PGP SIGNATURE-----
Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset
-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Ryan Harper wrote:
> @@ -388,9 +395,10 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
>  
>  void kvm_init_new_ap(int cpu, CPUState *env)
>  {
> +    pthread_mutex_lock(&vcpu_mutex);
>      pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
> -    /* FIXME: wait for thread to spin up */
> -    usleep(200);
> +    pthread_cond_wait(&qemu_vcpuup_cond, &vcpu_mutex);
> +    pthread_mutex_unlock(&vcpu_mutex);

And something is very wrong here. The pattern for using a condvar is

1. take mutex
2. check condition
3. if condition is not fulfilled
   3a. call cond_wait
   3b. when returning, go back to step 2
4. unlock mutex

Anything else is buggy.

So, either your condvar use is wrong or you don't really want a condvar in
the first place. I haven't checked the code.

- -- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.7 (GNU/Linux)

iD8DBQFIE2CD2ijCOnn/RHQRAs4kAJ40kbWjNJAzj2gGdbo/sSxZTx5b0ACglbis
kw7ST4eJK9CXhNbjKphNsUo=
=ISaC
-----END PGP SIGNATURE-----
Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset
Ulrich Drepper wrote:
> Ryan Harper wrote:
>> @@ -388,9 +395,10 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
>>  
>>  void kvm_init_new_ap(int cpu, CPUState *env)
>>  {
>> +    pthread_mutex_lock(&vcpu_mutex);
>>      pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
>> -    /* FIXME: wait for thread to spin up */
>> -    usleep(200);
>> +    pthread_cond_wait(&qemu_vcpuup_cond, &vcpu_mutex);
>> +    pthread_mutex_unlock(&vcpu_mutex);
>
> And something is very wrong here. The pattern for using a condvar is
>
> 1. take mutex
> 2. check condition
> 3. if condition is not fulfilled
>    3a. call cond_wait
>    3b. when returning, go back to step 2
> 4. unlock mutex
>
> Anything else is buggy.
>
> So, either your condvar use is wrong or you don't really want a condvar
> in the first place. I haven't checked the code.

A flag is needed in the vcpu structure that indicates whether the vcpu spun
up or not. This is what the while () condition should be. This is needed
because the thread may spin up before it gets to the pthread_cond_wait(), in
which case the signal happens when no one is waiting on it.

The other reason a while () is usually needed is that cond_signal may not
wake up the right thread, so it's necessary to check whether you really have
something to do. That is not really a problem here, but the former race is.

A condvar is definitely the right thing to use here.

Regards,

Anthony Liguori
Re: [kvm-devel] mmu notifier #v14
Andrea Arcangeli wrote: > Hello everyone, > > here it is the mmu notifier #v14. > > > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/ > > Please everyone involved review and (hopefully ;) ack that this is > safe to go in 2.6.26, the most important is to verify that this is a > noop when disarmed regardless of MMU_NOTIFIER=y or =n. > > > http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core > > I'll be sending that patch to Andrew inbox. > > Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]> > > diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig > index 8d45fab..ce3251c 100644 > --- a/arch/x86/kvm/Kconfig > +++ b/arch/x86/kvm/Kconfig > @@ -21,6 +21,7 @@ config KVM > tristate "Kernel-based Virtual Machine (KVM) support" > depends on HAVE_KVM > select PREEMPT_NOTIFIERS > + select MMU_NOTIFIER > select ANON_INODES > ---help--- > Support hosting fully virtualized guest machines using hardware > diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c > index 2ad6f54..853087a 100644 > --- a/arch/x86/kvm/mmu.c > +++ b/arch/x86/kvm/mmu.c > @@ -663,6 +663,108 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn) > account_shadowed(kvm, gfn); > } > > +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte) > +{ > + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> > PAGE_SHIFT); > + get_page(page); > You should not assume a struct page exists for any given spte. Instead, use kvm_get_pfn() and kvm_release_pfn_clean(). 
> static void nonpaging_free(struct kvm_vcpu *vcpu) > @@ -1643,11 +1771,11 @@ static void mmu_guess_page_from_pte_write(struct > kvm_vcpu *vcpu, gpa_t gpa, > int r; > u64 gpte = 0; > pfn_t pfn; > - > - vcpu->arch.update_pte.largepage = 0; > + int mmu_seq; > + int largepage; > > if (bytes != 4 && bytes != 8) > - return; > + goto out_lock; > > /* > * Assume that the pte write on a page table of the same type > @@ -1660,7 +1788,7 @@ static void mmu_guess_page_from_pte_write(struct > kvm_vcpu *vcpu, gpa_t gpa, > if ((bytes == 4) && (gpa % 4 == 0)) { > r = kvm_read_guest(vcpu->kvm, gpa & ~(u64)7, &gpte, 8); > if (r) > - return; > + goto out_lock; > memcpy((void *)&gpte + (gpa % 8), new, 4); > } else if ((bytes == 8) && (gpa % 8 == 0)) { > memcpy((void *)&gpte, new, 8); > @@ -1670,23 +1798,35 @@ static void mmu_guess_page_from_pte_write(struct > kvm_vcpu *vcpu, gpa_t gpa, > memcpy((void *)&gpte, new, 4); > } > if (!is_present_pte(gpte)) > - return; > + goto out_lock; > gfn = (gpte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT; > > + largepage = 0; > down_read(&current->mm->mmap_sem); > if (is_large_pte(gpte) && is_largepage_backed(vcpu, gfn)) { > gfn &= ~(KVM_PAGES_PER_HPAGE-1); > - vcpu->arch.update_pte.largepage = 1; > + largepage = 1; > } > + mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq); > + /* implicit mb(), we'll read before PT lock is unlocked */ > pfn = gfn_to_pfn(vcpu->kvm, gfn); > up_read(&current->mm->mmap_sem); > > - if (is_error_pfn(pfn)) { > - kvm_release_pfn_clean(pfn); > - return; > - } > + if (is_error_pfn(pfn)) > + goto out_release_and_lock; > + > + spin_lock(&vcpu->kvm->mmu_lock); > + BUG_ON(!is_error_pfn(vcpu->arch.update_pte.pfn)); > vcpu->arch.update_pte.gfn = gfn; > vcpu->arch.update_pte.pfn = pfn; > + vcpu->arch.update_pte.largepage = largepage; > + vcpu->arch.update_pte.mmu_seq = mmu_seq; > + return; > + > +out_release_and_lock: > + kvm_release_pfn_clean(pfn); > +out_lock: > + spin_lock(&vcpu->kvm->mmu_lock); > } > Perhaps I just have a weak stomach but I 
am uneasy having a function that takes a lock on exit. I walked through the logic and it doesn't appear to be wrong but it also is pretty clear that you could defer the acquisition of the lock to the caller (in this case, kvm_mmu_pte_write) by moving the update_pte assignment into kvm_mmu_pte_write. > void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, > @@ -1711,7 +1851,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa, > > pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes); > mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes); > Worst case, you pass 4 more pointer arguments here, take the spin lock, and then depending on the result of mmu_guess_page_from_pte_write, update vcpu->arch.update_pte. > @@ -3899,13 +4037,12 @@ static void kvm_free_vcpus(struct kvm *kvm) > > void kvm_arch_destroy_vm(struct kvm *kvm) > { > -
Re: [kvm-devel] Real Mode Improvement on Intel Hosts - GSoC Project
Mohammed Gamal wrote: >> Why do that unconditionally, instead of only when in a big-real-mode state? >> > > Is big-real-mode the only state where we have problems? > In general, we need to emulate whenever we are in a VT-unfriendly state, where that is defined as guest state that fails the guest state checks defined by section 22.3.1 of volume 3B of the Intel software development manual, "checks on the guest state area", when that state is legal in a real processor. To date, we have encountered only two instances of such VT-unfriendly states:
- "big real mode", where segment limits are not exactly 0x
- protected mode transitions, where cs.rpl != ss.rpl for a brief while
There may well be more, as we remove the various hacks currently in place, and as we expand the envelope to support hybrid 16/32 bit guests like Windows 3.1 and Windows 95. -- Do not meddle in the internals of kernels, for they are subtle and quick to panic.
Re: [kvm-devel] mmu notifier #v14
On Sat, Apr 26, 2008 at 01:59:23PM -0500, Anthony Liguori wrote: >> +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte) >> +{ >> +struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> >> PAGE_SHIFT); >> +get_page(page); >> > > You should not assume a struct page exists for any given spte. Instead, use > kvm_get_pfn() and kvm_release_pfn_clean(). Last email from [EMAIL PROTECTED] in my inbox argues it's useless to build rmap on mmio regions, so the above is more efficient: put_page runs directly on the page without going back and forth between spte -> pfn -> page -> pfn -> page in a single function. Certainly if we start building rmap on mmio regions we'll have to change that. > Perhaps I just have a weak stomach but I am uneasy having a function that > takes a lock on exit. I walked through the logic and it doesn't appear to > be wrong but it also is pretty clear that you could defer the acquisition > of the lock to the caller (in this case, kvm_mmu_pte_write) by moving the > update_pte assignment into kvm_mmu_pte_write. I agree out_lock is an uncommon exit path; the problem is that the code was buggy, and I tried to fix it with the smallest possible change, and that resulted in an out_lock. That section likely needs refactoring: all those update_pte fields should at least be returned by the guess_ function, but I tried to reduce the changes to make the issue more readable; I didn't want to rewrite certain functions just to take a spinlock a few instructions ahead. > Worst case, you pass 4 more pointer arguments here and, take the spin lock, > and then depending on the result of mmu_guess_page_from_pte_write, update > vcpu->arch.update_pte. Yes, that was my idea as well, but that's left for a later patch. Fixing this bug mixed in with the mmu notifier patch was perhaps excessive already ;). > Why move the destruction of the vm to the MMU notifier unregister hook? 
> Does anything else ever call mmu_notifier_unregister that would implicitly > destroy the VM? mmu notifier ->release can run at any time before the filehandle is closed. ->release has to zap all sptes and freeze the mmu (hence all vcpus) to prevent any further page fault. After ->release returns all pages are freed (we'll never rely on the page pin, so the rmap_remove put_page is not a relevant unpin event). So the idea is that I wanted to maintain the same ordering of the current code in the vm destroy event; I didn't want to leave a partially shutdown VM on the vmlist. If the ordering is entirely irrelevant and kvm_arch_destroy_vm can run well before kvm_destroy_vm is called, then I can avoid changes to kvm_main.c, but I doubt it. I've done it in a way that archs not needing mmu notifiers, like s390, can simply add the kvm_destroy_common_vm at the top of their kvm_arch_destroy_vm. All others using mmu_notifiers have to invoke kvm_destroy_common_vm in the ->release of the mmu notifiers. This will ensure that everything will be ok regardless of whether exit_mmap is called before or after exit_files, and it won't make a whole lot of difference anymore if the driver fd is pinned through vmas->vm_file released in exit_mmap or through the task filedescriptors released in exit_files etc... In fact this allows calling mmu_notifier_unregister at any time later, after the task has already been killed, without any trouble (like if the mmu notifier owner isn't registering in current->mm but some other task's mm).
Re: [kvm-devel] mmu notifier #v14
Andrea Arcangeli wrote: > On Sat, Apr 26, 2008 at 01:59:23PM -0500, Anthony Liguori wrote: > >>> +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte) >>> +{ >>> + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> >>> PAGE_SHIFT); >>> + get_page(page); >>> >>> >> You should not assume a struct page exists for any given spte. Instead, use >> kvm_get_pfn() and kvm_release_pfn_clean(). >> > > Last email from [EMAIL PROTECTED] in my inbox argues it's useless to build > rmap > on mmio regions, so the above is more efficient so put_page runs > directly on the page without going back and forth between spte -> pfn > -> page -> pfn -> page in a single function. > Avi can correct me if I'm wrong, but I don't think the consensus of that discussion was that we're going to avoid putting mmio pages in the rmap. Practically speaking, replacing: + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT); + get_page(page); With: unsigned long pfn = (*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT; kvm_get_pfn(pfn); Results in exactly the same code except the latter allows mmio pfns in the rmap. So ignoring the whole mmio thing, using accessors that are already there and used elsewhere seems like a good idea :-) > Certainly if we start building rmap on mmio regions we'll have to > change that. > > >> Perhaps I just have a weak stomach but I am uneasy having a function that >> takes a lock on exit. I walked through the logic and it doesn't appear to >> be wrong but it also is pretty clear that you could defer the acquisition >> of the lock to the caller (in this case, kvm_mmu_pte_write) by moving the >> update_pte assignment into kvm_mmu_pte_write. >> > > I agree out_lock is an uncommon exit path, the problem is that the > code was buggy, and I tried to fix it with the smallest possible > change and that resulting in an out_lock. 
That section likely need a > refactoring, all those update_pte fields should be at least returned > by the function guess_ but I tried to reduce the changes to make > the issue more readable, I didn't want to rewrite certain functions > just to take a spinlock a few instructions ahead. > I appreciate the desire to minimize changes, but taking a lock on return seems to take that to a bit of an extreme. It seems like a simple thing to fix though, no? >> Why move the destruction of the vm to the MMU notifier unregister hook? >> Does anything else ever call mmu_notifier_unregister that would implicitly >> destroy the VM? >> > > mmu notifier ->release can run at anytime before the filehandle is > closed. ->release has to zap all sptes and freeze the mmu (hence all > vcpus) to prevent any further page fault. After ->release returns all > pages are freed (we'll never relay on the page pin to avoid the > rmap_remove put_page to be a relevant unpin event). So the idea is > that I wanted to maintain the same ordering of the current code in the > vm destroy event, I didn't want to leave a partially shutdown VM on > the vmlist. If the ordering is entirely irrelevant and the > kvm_arch_destroy_vm can run well before kvm_destroy_vm is called, then > I can avoid changes to kvm_main.c but I doubt. > > I've done it in a way that archs not needing mmu notifiers like s390 > can simply add the kvm_destroy_common_vm at the top of their > kvm_arch_destroy_vm. All others using mmu_notifiers have to invoke > kvm_destroy_common_vm in the ->release of the mmu notifiers. > > This will ensure that everything will be ok regardless if exit_mmap is > called before/after exit_files, and it won't make a whole lot of > difference anymore, if the driver fd is pinned through vmas->vm_file > released in exit_mmap or through the task filedescriptors relased in > exit_files etc... 
Infact this allows to call mmu_notifier_unregister > at anytime later after the task has already been killed, without any > trouble (like if the mmu notifier owner isn't registering in > current->mm but some other tasks mm). > I see. It seems a little strange to me as a KVM guest isn't really tied to the current mm. It seems like the net effect of this is that we are now tying a KVM guest to an mm. For instance, if you create a guest, but didn't assign any memory to it, you could transfer the fd to another process and then close the fd (without destroying the guest). The other process then could assign memory to it and presumably run the guest. With your change, as soon as the first process exits, the guest will be destroyed. I'm not sure this behavioral difference really matters but it is a behavioral difference. Regards, Anthony Liguori
Re: [kvm-devel] mmu notifier #v14
On Sat, Apr 26, 2008 at 08:54:23PM -0500, Anthony Liguori wrote: > Avi can correct me if I'm wrong, but I don't think the consensus of that > discussion was that we're going to avoid putting mmio pages in the rmap. My first impression on that discussion was that pci-passthrough mmio can't be swapped, can't require write throttling etc. ;). From a linux VM pagetable point of view rmap on mmio looks weird. However, thinking some more: it's not like the linux kernel, where write protect through rmap is needed only for write-throttling MAP_SHARED, which is strictly RAM; for sptes we need it on every cr3 touch too, to trap pagetable updates (think ioremap done by the guest kernel). So I think Avi's take that we need rmap for everything mapped by sptes is probably the only feasible way to go. > Practically speaking, replacing: > > + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> > PAGE_SHIFT); > + get_page(page); > > > With: > > unsigned long pfn = (*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT; > kvm_get_pfn(pfn); > > Results in exactly the same code except the later allows mmio pfns in the > rmap. So ignoring the whole mmio thing, using accessors that are already > there and used elsewhere seems like a good idea :-) Agreed, especially in light of the above. I didn't actually touch that function for a while (I clearly wrote it before we started moving the kvm mmu code from page to pfn), and it was still safe to use to test the locking of the mmu notifier methods. My current main focus in the last few days was to get the locking right against the last mmu notifier code #v14 ;). Now that I look into it more closely, the get_page/put_page are unnecessary by now (they were necessary with the older patches that didn't implement range_begin and that relied on page pinning). Not just in that function, but all reference counting inside kvm is now entirely useless and can be removed. 
NOTE: it is safe to flush the tlb outside the mmu_lock if done inside the mmu_notifier methods. But only mmu notifiers can defer the tlb flush until after releasing mmu_lock, because the page can't be freed by the VM until we return. All other kvm code must instead definitely flush the tlb inside the mmu_lock, otherwise when the mmu notifier code runs, it will see the spte nonpresent and so the mmu notifier code will do nothing (it will not wait for kvm to drop the mmu_lock before allowing the main linux VM to free the page). The tlb flush must happen before the page is freed, and doing it inside mmu_lock everywhere (except in mmu-notifier context where it can be done after releasing mmu_lock) guarantees it. The positive side of the tradeoff of having to do the tlb flush inside the mmu_lock is that KVM can now safely zap and unmap as many sptes as it wants and do a single tlb flush at the end. The pages can't be freed as long as the mmu_lock is held (this is why the tlb flush has to be done inside the mmu_lock). This model heavily reduces the tlb flush frequency for large spte-mangling, and tlb flushes here are quite expensive because of ipis. > I appreciate the desire to minimize changes, but taking a lock on return > seems to take that to a bit of an extreme. It seems like a simple thing to > fix though, no? I agree it needs to be rewritten as a cleaner fix, but probably in a separate patch (which has to be incremental as that code will reject on the mmu notifier patch). I didn't see it as a big issue, however, to apply my quick fix first and clean up with an incremental update. > I see. It seems a little strange to me as a KVM guest isn't really tied to > the current mm. It seems like the net effect of this is that we are now > tying a KVM guest to an mm. > > For instance, if you create a guest, but didn't assign any memory to it, > you could transfer the fd to another process and then close the fd (without > destroying the guest). 
The other process then could assign memory to it > and presumably run the guest. Passing the anon kvm vm fd through unix sockets to another task is exactly why we need things like ->release not dependent on the fd->release vs. vma->vm_file->release ordering in the do_exit path to tear down the VM. The guest itself is definitely tied to a "mm": the guest runs using get_user_pages, and get_user_pages is meaningless without an mm. But the fd where we run the ioctl isn't tied to the mm; it's just an fd that can be passed across tasks with unix sockets. > With your change, as soon as the first process exits, the guest will be > destroyed. I'm not sure this behavioral difference really matters but it > is a behavioral difference. The guest-mode of the cpu can't run safely on any task but the one with the "mm" tracked by the mmu notifiers and where the memory is allocated from. The sptes point to the memory allocated in that "mm". It's definitely memory-corrupting to leave any spte established when the last thread of that "mm" exits as the memory supposedly pointed by the orpha