Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset

2008-04-26 Thread Avi Kivity
Ryan Harper wrote:
> There is a race between when the vcpu thread issues a create ioctl and when
> apic_reset() gets called resulting in getting a badfd error.
>
>   

The problem is indeed there, but the fix is wrong:

> main thread                              vcpu thread
> diff --git a/qemu/qemu-kvm.c b/qemu/qemu-kvm.c
> index 78127de..3513e8c 100644
> --- a/qemu/qemu-kvm.c
> +++ b/qemu/qemu-kvm.c
> @@ -31,7 +31,9 @@ extern int smp_cpus;
>  static int qemu_kvm_reset_requested;
>  
>  pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
> +pthread_mutex_t vcpu_mutex = PTHREAD_MUTEX_INITIALIZER;
>  pthread_cond_t qemu_aio_cond = PTHREAD_COND_INITIALIZER;
> +pthread_cond_t qemu_vcpuup_cond = PTHREAD_COND_INITIALIZER;
>  __thread struct vcpu_info *vcpu;
>  
>  struct qemu_kvm_signal_table {
> @@ -369,6 +371,11 @@ static void *ap_main_loop(void *_env)
>  sigfillset(&signals);
>  sigprocmask(SIG_BLOCK, &signals, NULL);
>  kvm_create_vcpu(kvm_context, env->cpu_index);
> +/* block until cond_wait occurs */
> +pthread_mutex_lock(&vcpu_mutex);
> +/* now we can signal */
> +pthread_cond_signal(&qemu_vcpuup_cond);
> +pthread_mutex_unlock(&vcpu_mutex);
>  kvm_qemu_init_env(env);
>  kvm_main_loop_cpu(env);
>  return NULL;
> @@ -388,9 +395,10 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
>  
>  void kvm_init_new_ap(int cpu, CPUState *env)
>  {
> +pthread_mutex_lock(&vcpu_mutex);
>  pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
> -/* FIXME: wait for thread to spin up */
> -usleep(200);
> +pthread_cond_wait(&qemu_vcpuup_cond, &vcpu_mutex);
>   

pthread_cond_wait() is never correct outside a loop.  The signal may 
arrive before wait is called.

The usual idiom is

while (condition is not fulfilled)
 pthread_cond_wait();

I see you have something there to ensure we block, but please use the 
right idiom.

> +pthread_mutex_unlock(&vcpu_mutex);
>  }
>  

Please reuse qemu_mutex for this, no need for a new one.

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] WARNING: at /usr/src/modules/kvm/mmu.c:390 account_shadowed()

2008-04-26 Thread Avi Kivity
Marcelo Tosatti wrote:
> On Wed, Apr 23, 2008 at 09:30:06AM +0300, Avi Kivity wrote:
>   
>>> As I got no reply, I guess it is a bad setup on my part. If it might
>>> help, this happened while I was doing a "make -j" on the webkit svn tree
>>> (i.e. a heavy C++ compilation workload).
>> No, this is not a bad setup.  No amount of bad setup should give this warning.
>>
>> You didn't get a reply because no one knows what to make of it, and
>> because it's much more fun to debate endianness or contemplate guests
>> with eighty thousand disks than to fix those impossible bugs.  If you
>> can give clear instructions on how to reproduce this, we will try it
>> out.  Please be sure to state OS name and versions for the guest as well
>> as the host.
>> 
>
> It is valid to have more than PAGES_PER_HPAGE in the largepage's
> shadowed count. If the gpte read races with a pte-update-from-guest (and
> the pte update results in a different sp->role), it might account twice
> for a single gfn.
>
> Such "zombie" shadow pages should eventually be removed through
> recycling, allowing for instantiation of a large page, unless references
> can be leaked. I can't spot such a leak, though.
>
>   

That strikes me as unlikely (though a valid scenario).

An alternative explanation is that we're seeing a nonpae guest, so each 
page can be shadowed in two different roles (two quadrants for a pte 
page) or even four (for a pgd page).

Thomas, are you running a 32-bit nonpae guest?

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.




Re: [kvm-devel] Real Mode Improvement on Intel Hosts - GSoC Project

2008-04-26 Thread Mohammed Gamal
On Sat, Apr 26, 2008 at 9:49 AM, Avi Kivity <[EMAIL PROTECTED]> wrote:
> Anthony Liguori wrote:
>
> > The second stage is to use a loop of x86_emulate() to run all 16-bit code
> (instead of using vm86 mode).  This will allow us to support guests that use
> big real mode.
> >
> >
> >
>
>  Why do that unconditionally, instead of only when in a big-real-mode state?

Is big-real-mode the only state where we have problems?



[kvm-devel] Kernel Oops with kvm 66 running WinXP

2008-04-26 Thread Michal Ludvig
Hi,

I've experienced a kernel Oops on 2.6.24 with kvm 66 on AMD in 64bit 
mode while starting up WinXP:

kvm: emulating exchange as write
Unable to handle kernel NULL pointer dereference at  RIP:
  [] :kvm:x86_emulate_insn+0x3fa/0x4240
PGD 7658d067 PUD 242a6067 PMD 0
Oops: 0002 [1] SMP
CPU 0
Modules linked in: bridge llc reiserfs tun kvm_amd kvm nfs nfsd lockd 
nfs_acl auth_rpcgss sunrpc exportfs w83627ehf hwmon_vid autofs4 
snd_pcm_oss snd_mixer_oss snd_seq snd_seq_device iptable_filter 
ip_tables ip6table_filter ip6_tables x_tables af_packet ipv6 fuse ext2 
loop snd_hda_intel snd_pcm snd_timer k8temp i2c_nforce2 i2c_core hwmon 
sr_mod button cdrom snd soundcore snd_page_alloc forcedeth sg floppy 
linear sd_mod ehci_hcd ohci_hcd usbcore dm_snapshot edd dm_mod fan 
sata_nv pata_amd libata scsi_mod thermal processor
Pid: 3139, comm: qemu-system-x86 Not tainted 2.6.24-mludvig #1
RIP: 0010:[]  [] 
:kvm:x86_emulate_insn+0x3fa/0x4240
RSP: 0018:8100609fdc18  EFLAGS: 00010246
RAX: 8001003b RBX:  RCX: 
RDX:  RSI:  RDI: 8100609fe000
RBP: 8100609ff320 R08: 8100609ff3c0 R09: 0006
R10: 0002 R11:  R12: 8100609ff368
R13: 8100609ff3c0 R14: 883be600 R15: 01971353
FS:  40804950(0063) GS:8053e000() knlGS:
CS:  0010 DS:  ES:  CR0: 8005003b
CR2:  CR3: 6bb74000 CR4: 06e0
DR0:  DR1:  DR2: 
DR3:  DR6: 0ff0 DR7: 0400
Process qemu-system-x86 (pid: 3139, threadinfo 8100609fc000, task 
8100794d5680)
Stack:  609fdc74 12187318 0ea5f068 8100609ff3c0
  8100609fdc94 8839c9e0  8100609fe000
  8100609ff320   
Call Trace:
  [] :kvm:kvm_get_cs_db_l_bits+0x20/0x40
  [] :kvm:emulate_instruction+0x1bf/0x340
  [] :kvm_amd:emulate_on_interception+0x12/0x60
  [] :kvm:kvm_arch_vcpu_ioctl_run+0x169/0x6d0
  [] :kvm:kvm_vcpu_ioctl+0x41c/0x440
  [] __wake_up+0x43/0x70
  [] __up_read+0x21/0xb0
  [] futex_wake+0xcc/0xf0
  [] do_futex+0x129/0xbb0
  [] __dequeue_entity+0x3d/0x50
  [] :kvm:kvm_vm_ioctl+0x85/0x200
  [] do_ioctl+0x2f/0xa0
  [] vfs_ioctl+0x220/0x2d0
  [] sys_ioctl+0x91/0xb0
  [] system_call+0x7e/0x83


Code: 66 89 02 e9 ee fc ff ff 48 8b 95 88 00 00 00 48 8b 45 78 88
RIP  [] :kvm:x86_emulate_insn+0x3fa/0x4240
  RSP 
CR2: 
---[ end trace d358bab3f035112e ]---

The host is still alive but the XP guest is locked up in a boot screen.

Michal





[kvm-devel] Problem to dedicate 1 public IP address directly to guest system

2008-04-26 Thread LINkeR
Hello

Is there a solution for assigning a public IP address directly to a guest
system? In my test setup the host system holds all the public addresses and I
set up NAT to the guests. Some applications, like VoIP, are difficult to run
behind NAT, so the problem is running that kind of application in the guest
systems. If you have a public IP subnet, for example a /29, it is no problem to
assign one IP to the host and route the rest to the guests. But if you have a
few servers in colocation, nobody gives you a /29; usually you only get a few
IP addresses, each from a different subnet. I don't know if this is a question
for KVM or for QEMU.

Thanks for any hints

Tomas Rusnak
-- 
-
ICQ: 147137905 Skype: linker83



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-26 Thread Robin Holt
On Thu, Apr 24, 2008 at 07:41:45PM +0200, Andrea Arcangeli wrote:
> A full new update will soon become visible here:
> 
>   
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14-pre3/

I grabbed these and built them.  Only change needed was another include.
After that, everything built fine and xpmem regression tests ran through
the first four sets.  The fifth is the oversubscription test which trips
my xpmem bug.  This is as good as the v12 runs from before.

Since this include and the one for mm_types.h are both build breakages
on ia64, I think you need to apply your ia64_cpumask patch and the following
(possibly as a single patch) first, or in your patch 1.  Without that,
a git-bisect on ia64 could hit a build failure.


Index: mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h
===
--- mmu_v14_pre3_xpmem_v003_v1.orig/include/linux/srcu.h	2008-04-26 06:41:54.0 -0500
+++ mmu_v14_pre3_xpmem_v003_v1/include/linux/srcu.h	2008-04-26 07:01:17.292071827 -0500
@@ -27,6 +27,8 @@
 #ifndef _LINUX_SRCU_H
 #define _LINUX_SRCU_H
 
+#include 
+
 struct srcu_struct_array {
int c[2];
 };



Re: [kvm-devel] [PATCH 01 of 12] Core of mmu notifiers

2008-04-26 Thread Andrea Arcangeli
On Sat, Apr 26, 2008 at 08:17:34AM -0500, Robin Holt wrote:
> Since this include and the one for mm_types.h are both build breakages
> on ia64, I think you need to apply your ia64_cpumask patch and the following
> (possibly as a single patch) first, or in your patch 1.  Without that,
> a git-bisect on ia64 could hit a build failure.

Agreed, that way it doesn't risk breaking the ia64 compilation. Thanks for
the great XPMEM feedback!

Also note, I figured out that mmu_notifier_release can actually run
concurrently against other mmu notifiers in case there's a vmtruncate
(->release could already run concurrently if invoked by _unregister,
the only guarantee is that ->release will be called one time and only
one time and that no mmu notifier will ever run after _unregister
returns).

In short I can't keep the list_del_init in _release and I need a
list_del_init_rcu instead to fix this minor issue. So this won't
really make much difference after all.

I'll release #v14 with all this after a bit of kvm testing with it...

diff --git a/include/linux/list.h b/include/linux/list.h
--- a/include/linux/list.h
+++ b/include/linux/list.h
@@ -755,6 +755,14 @@ static inline void hlist_del_init(struct
}
 }
 
+static inline void hlist_del_init_rcu(struct hlist_node *n)
+{
+   if (!hlist_unhashed(n)) {
+   __hlist_del(n);
+   n->pprev = NULL;
+   }
+}
+
 /**
  * hlist_replace_rcu - replace old entry by new one
  * @old : the element to be replaced
diff --git a/include/linux/mmu_notifier.h b/include/linux/mmu_notifier.h
--- a/include/linux/mmu_notifier.h
+++ b/include/linux/mmu_notifier.h
@@ -22,7 +22,10 @@ struct mmu_notifier_ops {
/*
 * Called either by mmu_notifier_unregister or when the mm is
 * being destroyed by exit_mmap, always before all pages are
-* freed. It's mandatory to implement this method.
+* freed. It's mandatory to implement this method. This can
+* run concurrently to other mmu notifier methods and it
+* should teardown all secondary mmu mappings and freeze the
+* secondary mmu.
 */
void (*release)(struct mmu_notifier *mn,
struct mm_struct *mm);
diff --git a/mm/mmu_notifier.c b/mm/mmu_notifier.c
--- a/mm/mmu_notifier.c
+++ b/mm/mmu_notifier.c
@@ -19,12 +19,13 @@
 
 /*
  * This function can't run concurrently against mmu_notifier_register
- * or any other mmu notifier method. mmu_notifier_register can only
- * run with mm->mm_users > 0 (and exit_mmap runs only when mm_users is
- * zero). All other tasks of this mm already quit so they can't invoke
- * mmu notifiers anymore. This can run concurrently only against
- * mmu_notifier_unregister and it serializes against it with the
- * mmu_notifier_mm->lock in addition to RCU. struct mmu_notifier_mm
+ * because mm->mm_users > 0 during mmu_notifier_register and exit_mmap
+ * runs with mm_users == 0. Other tasks may still invoke mmu notifiers
+ * in parallel despite there's no task using this mm anymore, through
+ * the vmas outside of the exit_mmap context, like with
+ * vmtruncate. This serializes against mmu_notifier_unregister with
+ * the mmu_notifier_mm->lock in addition to SRCU and it serializes
+ * against the other mmu notifiers with SRCU. struct mmu_notifier_mm
  * can't go away from under us as exit_mmap holds a mm_count pin
  * itself.
  */
@@ -44,7 +45,7 @@ void __mmu_notifier_release(struct mm_st
 * to wait ->release to finish and
 * mmu_notifier_unregister to return.
 */
-   hlist_del_init(&mn->hlist);
+   hlist_del_init_rcu(&mn->hlist);
/*
 * SRCU here will block mmu_notifier_unregister until
 * ->release returns.
@@ -185,6 +186,8 @@ int mmu_notifier_register(struct mmu_not
 * side note: mmu_notifier_release can't run concurrently with
 * us because we hold the mm_users pin (either implicitly as
 * current->mm or explicitly with get_task_mm() or similar).
+* We can't race against any other mmu notifiers either thanks
+* to mm_lock().
 */
spin_lock(&mm->mmu_notifier_mm->lock);
hlist_add_head(&mn->hlist, &mm->mmu_notifier_mm->list);



[kvm-devel] fix external module compile

2008-04-26 Thread Andrea Arcangeli
Hello,

after updating kvm-userland.git, kvm.git and linux-2.6-hg, and after a
make distclean and a rebuild with a slightly reduced .config, I can't
compile the external module anymore. Looking into it with V=1, $(src)
expands to "", so including /external-module-compat.h clearly fails. I
fixed it as below, because it seems more consistent to enforce the
ordering of the "special" includes in the same place; notably
$(src)/include is already included as $LINUX at point 1 of the
comment. So this cleans up a superfluous line in Kbuild, besides
fixing my compile, by moving external-module-compat.h to the same
place as the other includes, where `pwd` works while $(src) doesn't
work anymore, for whatever reason.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

diff --git a/kernel/Kbuild b/kernel/Kbuild
index cabfc75..d9245eb 100644
--- a/kernel/Kbuild
+++ b/kernel/Kbuild
@@ -1,4 +1,3 @@
-EXTRA_CFLAGS := -I$(src)/include -include $(src)/external-module-compat.h
 obj-m := kvm.o kvm-intel.o kvm-amd.o
 kvm-objs := kvm_main.o x86.o mmu.o x86_emulate.o anon_inodes.o irq.o i8259.o \
 lapic.o ioapic.o preempt.o i8254.o external-module-compat.o
diff --git a/kernel/Makefile b/kernel/Makefile
index 78ff923..e3fccbe 100644
--- a/kernel/Makefile
+++ b/kernel/Makefile
@@ -27,7 +27,8 @@ all::
 #  include header priority 1) $LINUX 2) $KERNELDIR 3) include-compat
$(MAKE) -C $(KERNELDIR) M=`pwd` \
LINUXINCLUDE="-I`pwd`/include -Iinclude -I`pwd`/include-compat \
-   -include include/linux/autoconf.h" \
+   -include include/linux/autoconf.h \
+   -include `pwd`/external-module-compat.h"
"$$@"
 
 sync: header-sync source-sync



Re: [kvm-devel] paravirt clock stil causing hangs in kvm-65

2008-04-26 Thread extmaillist
Hi Glauber, sorry for the late reply.
I'm no longer able to reproduce the problem in the same way (with
backtraces etc.) as before, but anyway, enabling paravirt clock with or
without your patches on an SMP guest still causes trouble:
on my Phenom machine, the kernel hangs after printing
"PCI-GART - No AMD northbridge found."

On an Intel machine a normal system boot seems terribly slow, taking tens
of seconds between steps, and later hangs; using init=/bin/sh seems to
work, though.

I'm wondering how I can get the backtraces I was getting before. I have
soft CPU lockup debugging enabled; what else could help?

cheers

n.


On Thu, 24 Apr 2008, Glauber Costa wrote:

> Just saw Gerd's patches flying around. Can anyone that is able to
> reproduce this problem (a subgroup of human beings that does not
> include me) test it with them applied?
>
> If it still fails, please let me know asap
>
> -- 
> Glauber Costa.
> "Free as in Freedom"
> http://glommer.net
>
> "The less confident you are, the more serious you have to act."
>
>



Re: [kvm-devel] Real Mode Improvement on Intel Hosts - GSoC Project

2008-04-26 Thread Anthony Liguori
Avi Kivity wrote:
> Anthony Liguori wrote:
>> The second stage is to use a loop of x86_emulate() to run all 16-bit 
>> code (instead of using vm86 mode).  This will allow us to support 
>> guests that use big real mode.
>>
>>   
>
> Why do that unconditionally, instead of only when in a big-real-mode 
> state?

It can be.  It's probably easier from a development perspective to make 
it unconditional and then to flush out all the necessary instructions.

Regards,

Anthony Liguori




[kvm-devel] mmu notifier #v14

2008-04-26 Thread Andrea Arcangeli
Hello everyone,

here is mmu notifier #v14.


http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/

Please, everyone involved, review and (hopefully ;) ack that this is
safe to go into 2.6.26. Most important is to verify that this is a
no-op when disarmed, regardless of MMU_NOTIFIER=y or =n.


http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core

I'll be sending that patch to Andrew inbox.

Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>

diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
index 8d45fab..ce3251c 100644
--- a/arch/x86/kvm/Kconfig
+++ b/arch/x86/kvm/Kconfig
@@ -21,6 +21,7 @@ config KVM
tristate "Kernel-based Virtual Machine (KVM) support"
depends on HAVE_KVM
select PREEMPT_NOTIFIERS
+   select MMU_NOTIFIER
select ANON_INODES
---help---
  Support hosting fully virtualized guest machines using hardware
diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
index 2ad6f54..853087a 100644
--- a/arch/x86/kvm/mmu.c
+++ b/arch/x86/kvm/mmu.c
@@ -663,6 +663,108 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
account_shadowed(kvm, gfn);
 }
 
+static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
+{
+   struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT);
+   get_page(page);
+   rmap_remove(kvm, spte);
+   set_shadow_pte(spte, shadow_trap_nonpresent_pte);
+   kvm_flush_remote_tlbs(kvm);
+   put_page(page);
+}
+
+static void kvm_unmap_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+   u64 *spte, *curr_spte;
+
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   BUG_ON(!(*spte & PT_PRESENT_MASK));
+   rmap_printk("kvm_rmap_unmap_hva: spte %p %llx\n", spte, *spte);
+   curr_spte = spte;
+   spte = rmap_next(kvm, rmapp, spte);
+   kvm_unmap_spte(kvm, curr_spte);
+   }
+}
+
+void kvm_unmap_hva(struct kvm *kvm, unsigned long hva)
+{
+   int i;
+
+   /*
+* If mmap_sem isn't taken, we can look the memslots with only
+* the mmu_lock by skipping over the slots with userspace_addr == 0.
+*/
+   for (i = 0; i < kvm->nmemslots; i++) {
+   struct kvm_memory_slot *memslot = &kvm->memslots[i];
+   unsigned long start = memslot->userspace_addr;
+   unsigned long end;
+
+   /* mmu_lock protects userspace_addr */
+   if (!start)
+   continue;
+
+   end = start + (memslot->npages << PAGE_SHIFT);
+   if (hva >= start && hva < end) {
+   gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+   kvm_unmap_rmapp(kvm, &memslot->rmap[gfn_offset]);
+   }
+   }
+}
+
+static int kvm_age_rmapp(struct kvm *kvm, unsigned long *rmapp)
+{
+   u64 *spte;
+   int young = 0;
+
+   spte = rmap_next(kvm, rmapp, NULL);
+   while (spte) {
+   int _young;
+   u64 _spte = *spte;
+   BUG_ON(!(_spte & PT_PRESENT_MASK));
+   _young = _spte & PT_ACCESSED_MASK;
+   if (_young) {
+   young = !!_young;
+   set_shadow_pte(spte, _spte & ~PT_ACCESSED_MASK);
+   }
+   spte = rmap_next(kvm, rmapp, spte);
+   }
+   return young;
+}
+
+int kvm_age_hva(struct kvm *kvm, unsigned long hva)
+{
+   int i;
+   int young = 0;
+
+   /*
+* If mmap_sem isn't taken, we can look the memslots with only
+* the mmu_lock by skipping over the slots with userspace_addr == 0.
+*/
+   spin_lock(&kvm->mmu_lock);
+   for (i = 0; i < kvm->nmemslots; i++) {
+   struct kvm_memory_slot *memslot = &kvm->memslots[i];
+   unsigned long start = memslot->userspace_addr;
+   unsigned long end;
+
+   /* mmu_lock protects userspace_addr */
+   if (!start)
+   continue;
+
+   end = start + (memslot->npages << PAGE_SHIFT);
+   if (hva >= start && hva < end) {
+   gfn_t gfn_offset = (hva - start) >> PAGE_SHIFT;
+   young |= kvm_age_rmapp(kvm, &memslot->rmap[gfn_offset]);
+   }
+   }
+   spin_unlock(&kvm->mmu_lock);
+
+   if (young)
+   kvm_flush_remote_tlbs(kvm);
+
+   return young;
+}
+
 #ifdef MMU_DEBUG
 static int is_empty_shadow_page(u64 *spt)
 {
@@ -1200,6 +1302,7 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t v, int write, gfn_t gfn)
int r;
int largepage = 0;
pfn_t pfn;
+   int mmu_seq;
 
down_read(¤t->mm->mmap_sem);
if (is_largepage_backed(vcpu, gfn & ~(KVM_PAGES_PER_HPAGE-1))) {
@@ -1207,6 +1310,8 @@ static int nonpaging_map(struct kvm_vcpu *vcpu, gva_t 

Re: [kvm-devel] Problem to dedicate 1 public IP address directly to guest system

2008-04-26 Thread LINkeR
I'm resending this because it could be helpful for other users.

From David Mair <[EMAIL PROTECTED]>:

I saw your message on the kvm-devel list and I think I can advise you.
I don't yet have a functioning subscription to the list so replying
there is not simple for me. If this works out for you please post a
message to the kvm-devel list with the solution you use (no need to
mention me).

To be honest, the problem isn't really a kvm or qemu one, it's an
Ethernet issue I think. One solution is to create a bridge interface
on the host, add your public facing NIC to the bridge, create a tap
interface and attach it to the same bridge then use that tap interface
for the guest you want to assign a public IP to. Assuming Linux:

* You'll need a bridge-utils package or you'll need to build it
* Assuming your host's public interface is eth0, then as root:
 # brctl addbr br0
 # brctl addif br0 eth0
 # tunctl -u  -t qtap0
 # brctl addif br0 qtap0

Replace eth0 with the real interface name for your host's public
facing interface. Replace  with the username that will be
starting the guest. Replace qtap0 with a name you prefer to use for
the tap interface.

Now you can start the guest with the following -net option:

qemu  -net tap,[vlan=n,]ifname=qtap0,script=no

Replace qtap0 with whatever you used for the tunctl command above. If
you are using a vlan other than zero for the guest NIC then include
the vlan=n option replacing n with the vlan number. You can use
script=yes and modify /etc/qemu-ifup to perform the bridge setup
command above but I prefer to use my own and setup the bridge and tap
at init time then just have the guest use the preset bridge.
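For the script=yes route mentioned above, a minimal /etc/qemu-ifup could look
like the sketch below. This is an untested illustration, not a recommended
production script: it assumes the bridge (br0 here) was already created at
init time with the brctl commands shown earlier, and it relies on qemu passing
the tap interface name as the script's first argument.

```
#!/bin/sh
# Hypothetical /etc/qemu-ifup sketch: qemu invokes this with the tap
# interface name in $1.  Bring the tap up with no address and attach it
# to the pre-existing bridge so the guest sits on the public segment.
BRIDGE=br0
/sbin/ifconfig "$1" 0.0.0.0 up
/usr/sbin/brctl addif "$BRIDGE" "$1"
```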

You can now just use the relevant static IP address as-is on the guest
because the guest's NIC is on the public facing cabling.

Good luck, I think you should be able to solve this fairly easily.


-- 
-
ICQ: 147137905 Skype: linker83



Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset

2008-04-26 Thread Ulrich Drepper

Ryan Harper wrote:
> +/* block until cond_wait occurs */
> +pthread_mutex_lock(&vcpu_mutex);
> +/* now we can signal */
> +pthread_cond_signal(&qemu_vcpuup_cond);
> +pthread_mutex_unlock(&vcpu_mutex);

It is not necessary to take the mutex for a condvar if you're just
waking a waiter.  This just unnecessarily slows things down.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset

2008-04-26 Thread Ulrich Drepper

Ryan Harper wrote:
> @@ -388,9 +395,10 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
>  
>  void kvm_init_new_ap(int cpu, CPUState *env)
>  {
> +pthread_mutex_lock(&vcpu_mutex);
>  pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
> -/* FIXME: wait for thread to spin up */
> -usleep(200);
> +pthread_cond_wait(&qemu_vcpuup_cond, &vcpu_mutex);
> +pthread_mutex_unlock(&vcpu_mutex);

And something is very wrong here.  The pattern for using a condvar is

1 take mutex

2 check condition

3 if condition is not fulfilled
3a   call cond_wait
3b   when returning, go back to step 2

4 unlock mutex


Anything else is buggy.

So, either your condvar use is wrong or you don't really want a condvar
in the first place.  I haven't checked the code.

-- 
➧ Ulrich Drepper ➧ Red Hat, Inc. ➧ 444 Castro St ➧ Mountain View, CA ❖



Re: [kvm-devel] [PATCH] Fix QEMU vcpu thread race with apic_reset

2008-04-26 Thread Anthony Liguori
Ulrich Drepper wrote:
> Ryan Harper wrote:
>   
>> @@ -388,9 +395,10 @@ static void kvm_add_signal(struct qemu_kvm_signal_table *sigtab, int signum)
>>  
>>  void kvm_init_new_ap(int cpu, CPUState *env)
>>  {
>> +pthread_mutex_lock(&vcpu_mutex);
>>  pthread_create(&vcpu_info[cpu].thread, NULL, ap_main_loop, env);
>> -/* FIXME: wait for thread to spin up */
>> -usleep(200);
>> +pthread_cond_wait(&qemu_vcpuup_cond, &vcpu_mutex);
>> +pthread_mutex_unlock(&vcpu_mutex);
>> 
>
> And something is very wrong here.  The pattern for using a condvar is
>
> 1 take mutex
>
> 2 check condition
>
> 3 if condition is not fulfilled
> 3a   call cond_wait
> 3b   when returning, go back to step 2
>
> 4 unlock mutex
>
>
> Anything else is buggy.
>
> So, either your condvar use is wrong or you don't really want a condvar
> in the first place.  I haven't checked the code.
>   

A flag is needed in the vcpu structure that indicates whether the vcpu 
has spun up or not.  This is what the while () condition should test.  It is 
needed because the thread may spin up before the main thread gets to 
pthread_cond_wait(), in which case the signal happens while no one is 
waiting on it.  The other reason a while () is usually needed is that 
cond_signal may not wake up the right thread, so it's necessary to check 
whether you really have something to do.  That's not really a problem here, 
but the former race is.

A condvar is definitely the right thing to use here.
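The fix Anthony describes can be sketched as a small self-contained program.
The names (vcpu_info, ap_main_loop, kvm_init_new_ap, qemu_vcpuup_cond) mirror
the patch, but the bodies are simplified stand-ins, the `created` flag is a
hypothetical addition, and the join at the end is demo-only (the real vcpu
thread never exits). The while loop makes both races harmless: a signal sent
before the main thread reaches pthread_cond_wait() cannot be lost, and
spurious wakeups are re-checked against the predicate.

```c
#include <pthread.h>
#include <stddef.h>

/* Simplified stand-in for the qemu-kvm vcpu bookkeeping. */
struct vcpu_info {
    pthread_t thread;
    int created;                      /* predicate, protected by qemu_mutex */
};

static pthread_mutex_t qemu_mutex = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t qemu_vcpuup_cond = PTHREAD_COND_INITIALIZER;
static struct vcpu_info vcpu_info[4];

static void *ap_main_loop(void *arg)
{
    struct vcpu_info *vi = arg;

    /* kvm_create_vcpu() would run here, before announcing readiness. */
    pthread_mutex_lock(&qemu_mutex);
    vi->created = 1;                  /* set the predicate under the mutex */
    pthread_cond_signal(&qemu_vcpuup_cond);
    pthread_mutex_unlock(&qemu_mutex);
    return NULL;                      /* demo only; the real loop never returns */
}

int kvm_init_new_ap(int cpu)
{
    struct vcpu_info *vi = &vcpu_info[cpu];

    pthread_mutex_lock(&qemu_mutex);
    pthread_create(&vi->thread, NULL, ap_main_loop, vi);
    while (!vi->created)              /* loop handles signal-before-wait and
                                         spurious wakeups alike */
        pthread_cond_wait(&qemu_vcpuup_cond, &qemu_mutex);
    pthread_mutex_unlock(&qemu_mutex);
    pthread_join(vi->thread, NULL);   /* demo-only cleanup */
    return vi->created;
}
```

Note this also follows Avi's suggestion of reusing a single mutex (qemu_mutex
here) rather than introducing a dedicated vcpu_mutex.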

Regards,

Anthony Liguori





Re: [kvm-devel] mmu notifier #v14

2008-04-26 Thread Anthony Liguori
Andrea Arcangeli wrote:
> Hello everyone,
>
> here it is the mmu notifier #v14.
>
>   
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/
>
> Please everyone involved review and (hopefully ;) ack that this is
> safe to go in 2.6.26, the most important is to verify that this is a
> noop when disarmed regardless of MMU_NOTIFIER=y or =n.
>
>   
> http://www.kernel.org/pub/linux/kernel/people/andrea/patches/v2.6/2.6.25/mmu-notifier-v14/mmu-notifier-core
>
> I'll be sending that patch to Andrew inbox.
>
> Signed-off-by: Andrea Arcangeli <[EMAIL PROTECTED]>
>
> diff --git a/arch/x86/kvm/Kconfig b/arch/x86/kvm/Kconfig
> index 8d45fab..ce3251c 100644
> --- a/arch/x86/kvm/Kconfig
> +++ b/arch/x86/kvm/Kconfig
> @@ -21,6 +21,7 @@ config KVM
>   tristate "Kernel-based Virtual Machine (KVM) support"
>   depends on HAVE_KVM
>   select PREEMPT_NOTIFIERS
> + select MMU_NOTIFIER
>   select ANON_INODES
>   ---help---
> Support hosting fully virtualized guest machines using hardware
> diff --git a/arch/x86/kvm/mmu.c b/arch/x86/kvm/mmu.c
> index 2ad6f54..853087a 100644
> --- a/arch/x86/kvm/mmu.c
> +++ b/arch/x86/kvm/mmu.c
> @@ -663,6 +663,108 @@ static void rmap_write_protect(struct kvm *kvm, u64 gfn)
>   account_shadowed(kvm, gfn);
>  }
>
> +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
> +{
> + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> 
> PAGE_SHIFT);
> + get_page(page);
>   

You should not assume a struct page exists for any given spte. Instead, 
use kvm_get_pfn() and kvm_release_pfn_clean().

>  static void nonpaging_free(struct kvm_vcpu *vcpu)
> @@ -1643,11 +1771,11 @@ static void mmu_guess_page_from_pte_write(struct 
> kvm_vcpu *vcpu, gpa_t gpa,
>   int r;
>   u64 gpte = 0;
>   pfn_t pfn;
> -
> - vcpu->arch.update_pte.largepage = 0;
> + int mmu_seq;
> + int largepage;
>
>   if (bytes != 4 && bytes != 8)
> - return;
> + goto out_lock;
>
>   /*
>* Assume that the pte write on a page table of the same type
> @@ -1660,7 +1788,7 @@ static void mmu_guess_page_from_pte_write(struct 
> kvm_vcpu *vcpu, gpa_t gpa,
>   if ((bytes == 4) && (gpa % 4 == 0)) {
>   r = kvm_read_guest(vcpu->kvm, gpa & ~(u64)7, &gpte, 8);
>   if (r)
> - return;
> + goto out_lock;
>   memcpy((void *)&gpte + (gpa % 8), new, 4);
>   } else if ((bytes == 8) && (gpa % 8 == 0)) {
>   memcpy((void *)&gpte, new, 8);
> @@ -1670,23 +1798,35 @@ static void mmu_guess_page_from_pte_write(struct 
> kvm_vcpu *vcpu, gpa_t gpa,
>   memcpy((void *)&gpte, new, 4);
>   }
>   if (!is_present_pte(gpte))
> - return;
> + goto out_lock;
>   gfn = (gpte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
>
> + largepage = 0;
>   down_read(&current->mm->mmap_sem);
>   if (is_large_pte(gpte) && is_largepage_backed(vcpu, gfn)) {
>   gfn &= ~(KVM_PAGES_PER_HPAGE-1);
> - vcpu->arch.update_pte.largepage = 1;
> + largepage = 1;
>   }
> + mmu_seq = atomic_read(&vcpu->kvm->arch.mmu_notifier_seq);
> + /* implicit mb(), we'll read before PT lock is unlocked */
>   pfn = gfn_to_pfn(vcpu->kvm, gfn);
>   up_read(&current->mm->mmap_sem);
>
> - if (is_error_pfn(pfn)) {
> - kvm_release_pfn_clean(pfn);
> - return;
> - }
> + if (is_error_pfn(pfn))
> + goto out_release_and_lock;
> +
> + spin_lock(&vcpu->kvm->mmu_lock);
> + BUG_ON(!is_error_pfn(vcpu->arch.update_pte.pfn));
>   vcpu->arch.update_pte.gfn = gfn;
>   vcpu->arch.update_pte.pfn = pfn;
> + vcpu->arch.update_pte.largepage = largepage;
> + vcpu->arch.update_pte.mmu_seq = mmu_seq;
> + return;
> +
> +out_release_and_lock:
> + kvm_release_pfn_clean(pfn);
> +out_lock:
> + spin_lock(&vcpu->kvm->mmu_lock);
>  }
>   

Perhaps I just have a weak stomach, but I am uneasy about having a function 
that takes a lock on exit.  I walked through the logic and it doesn't 
appear to be wrong, but it is also pretty clear that you could defer the 
acquisition of the lock to the caller (in this case, kvm_mmu_pte_write) 
by moving the update_pte assignment into kvm_mmu_pte_write.

>  void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
> @@ -1711,7 +1851,6 @@ void kvm_mmu_pte_write(struct kvm_vcpu *vcpu, gpa_t gpa,
>
>   pgprintk("%s: gpa %llx bytes %d\n", __func__, gpa, bytes);
>   mmu_guess_page_from_pte_write(vcpu, gpa, new, bytes);
>   

Worst case, you pass 4 more pointer arguments here, take the spin 
lock, and then, depending on the result of mmu_guess_page_from_pte_write, 
update vcpu->arch.update_pte.

> @@ -3899,13 +4037,12 @@ static void kvm_free_vcpus(struct kvm *kvm)
>
>  void kvm_arch_destroy_vm(struct kvm *kvm)
>  {
> -

Re: [kvm-devel] Real Mode Improvement on Intel Hosts - GSoC Project

2008-04-26 Thread Avi Kivity
Mohammed Gamal wrote:
>>  Why do that unconditionally, instead of only when in a big-real-mode state?
>> 
>
> Is big-real-mode the only state where we have problems?
>   

In general, we need to emulate whenever we are in a VT-unfriendly state, 
defined as guest state that fails the guest-state checks laid out 
in section 22.3.1 of volume 3B of the Intel software developer's 
manual, "Checks on the Guest State Area", even though that state is legal 
on a real processor.  To date, we have encountered only two instances of 
such VT-unfriendly states:

- "big real mode", where segment limits are not exactly 0xffff
- protected mode transitions, where cs.rpl != ss.rpl for a brief while

There may well be more, as we remove the various hacks currently in 
place, and as we expand the envelope to support hybrid 16/32 bit guests 
like Windows 3.1 and Windows 95.
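The decision described above can be sketched as a predicate (a simplified illustration; the segment structure and the function are hypothetical, not KVM's real checks):

```c
#include <stdbool.h>
#include <stdint.h>

/* Simplified sketch of the two known VT-unfriendly conditions.  The seg
 * struct and the decision function are illustrative, not KVM's code. */
struct seg {
    uint32_t limit;
    unsigned int rpl;
};

/* Returns true when the guest state would pass VT's guest-state checks,
 * i.e. VMX can run it directly; false means fall back to emulation. */
static bool state_is_vmx_friendly(const struct seg *cs, const struct seg *ss,
                                  bool real_mode)
{
    if (real_mode && (cs->limit != 0xffff || ss->limit != 0xffff))
        return false;                   /* "big real mode" */
    if (!real_mode && cs->rpl != ss->rpl)
        return false;                   /* mid protected-mode transition */
    return true;
}
```

Any future VT-unfriendly state discovered while removing the existing hacks would add another clause to this predicate.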

-- 
Do not meddle in the internals of kernels, for they are subtle and quick to 
panic.




Re: [kvm-devel] mmu notifier #v14

2008-04-26 Thread Andrea Arcangeli
On Sat, Apr 26, 2008 at 01:59:23PM -0500, Anthony Liguori wrote:
>> +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
>> +{
>> +struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> 
>> PAGE_SHIFT);
>> +get_page(page);
>>   
>
> You should not assume a struct page exists for any given spte. Instead, use 
> kvm_get_pfn() and kvm_release_pfn_clean().

Last email from [EMAIL PROTECTED] in my inbox argues it's useless to build rmap
on mmio regions; given that, the above is more efficient because put_page
runs directly on the page, without going back and forth between
spte -> pfn -> page -> pfn -> page within a single function.

Certainly if we start building rmap on mmio regions we'll have to
change that.

> Perhaps I just have a weak stomach but I am uneasy having a function that 
> takes a lock on exit. I walked through the logic and it doesn't appear to 
> be wrong but it also is pretty clear that you could defer the acquisition 
> of the lock to the caller (in this case, kvm_mmu_pte_write) by moving the 
> update_pte assignment into kvm_mmu_pte_write.

I agree out_lock is an uncommon exit path; the problem is that the
code was buggy, and I tried to fix it with the smallest possible
change, and that resulted in an out_lock. That section likely needs
refactoring: all those update_pte fields should at least be returned
by the guess_ function, but I tried to reduce the changes to make
the issue more readable; I didn't want to rewrite certain functions
just to take a spinlock a few instructions earlier.

> Worst case, you pass 4 more pointer arguments here and, take the spin lock, 
> and then depending on the result of mmu_guess_page_from_pte_write, update 
> vcpu->arch.update_pte.

Yes that was my same idea, but that's left for a later patch. Fixing
this bug mixed with the mmu notifier patch was perhaps excessive
already ;).

> Why move the destruction of the vm to the MMU notifier unregister hook? 
> Does anything else ever call mmu_notifier_unregister that would implicitly 
> destroy the VM?

mmu notifier ->release can run at any time before the file handle is
closed. ->release has to zap all sptes and freeze the mmu (hence all
vcpus) to prevent any further page fault. After ->release returns, all
pages are freed (we'll never rely on the page pin to make the
rmap_remove put_page a relevant unpin event). So the idea is
that I wanted to maintain the same ordering as the current code in the
vm destroy event; I didn't want to leave a partially shut-down VM on
the vmlist. If the ordering is entirely irrelevant and
kvm_arch_destroy_vm can run well before kvm_destroy_vm is called, then
I can avoid changes to kvm_main.c, but I doubt it.

I've done it in a way that archs not needing mmu notifiers like s390
can simply add the kvm_destroy_common_vm at the top of their
kvm_arch_destroy_vm. All others using mmu_notifiers have to invoke
kvm_destroy_common_vm in the ->release of the mmu notifiers.

This will ensure that everything will be ok regardless of whether exit_mmap
is called before or after exit_files, and it won't make a whole lot of
difference anymore whether the driver fd is pinned through vma->vm_file
(released in exit_mmap) or through the task file descriptors (released in
exit_files), etc. In fact this allows calling mmu_notifier_unregister
at any time later, after the task has already been killed, without any
trouble (for example if the mmu notifier owner isn't registering in
current->mm but in some other task's mm).



Re: [kvm-devel] mmu notifier #v14

2008-04-26 Thread Anthony Liguori
Andrea Arcangeli wrote:
> On Sat, Apr 26, 2008 at 01:59:23PM -0500, Anthony Liguori wrote:
>   
>>> +static void kvm_unmap_spte(struct kvm *kvm, u64 *spte)
>>> +{
>>> +   struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> 
>>> PAGE_SHIFT);
>>> +   get_page(page);
>>>   
>>>   
>> You should not assume a struct page exists for any given spte. Instead, use 
>> kvm_get_pfn() and kvm_release_pfn_clean().
>> 
>
> Last email from [EMAIL PROTECTED] in my inbox argues it's useless to build 
> rmap
> on mmio regions, so the above is more efficient so put_page runs
> directly on the page without going back and forth between spte -> pfn
> -> page -> pfn -> page in a single function.
>   

Avi can correct me if I'm wrong, but I don't think the consensus of that 
discussion was that we're going to avoid putting mmio pages in the 
rmap.  Practically speaking, replacing:

+   struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> 
PAGE_SHIFT);
+   get_page(page);


With:

unsigned long pfn = (*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
kvm_get_pfn(pfn);

Results in exactly the same code, except the latter allows mmio pfns in 
the rmap.  So ignoring the whole mmio thing, using accessors that are 
already there and used elsewhere seems like a good idea :-)

> Certainly if we start building rmap on mmio regions we'll have to
> change that.
>
>   
>> Perhaps I just have a weak stomach but I am uneasy having a function that 
>> takes a lock on exit. I walked through the logic and it doesn't appear to 
>> be wrong but it also is pretty clear that you could defer the acquisition 
>> of the lock to the caller (in this case, kvm_mmu_pte_write) by moving the 
>> update_pte assignment into kvm_mmu_pte_write.
>> 
>
> I agree out_lock is an uncommon exit path, the problem is that the
> code was buggy, and I tried to fix it with the smallest possible
> change and that resulting in an out_lock. That section likely need a
> refactoring, all those update_pte fields should be at least returned
> by the function guess_  but I tried to reduce the changes to make
> the issue more readable, I didn't want to rewrite certain functions
> just to take a spinlock a few instructions ahead.
>   

I appreciate the desire to minimize changes, but taking a lock on return 
seems to take that to a bit of an extreme.  It seems like a simple thing 
to fix though, no?
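The refactor being suggested — have the helper only compute and return its result, and let the caller take the lock once and commit it — can be sketched in miniature (toy code; the names are illustrative, not the real KVM structures):

```c
#include <pthread.h>

/* Toy illustration of deferring lock acquisition to the caller: the helper
 * only computes and returns a result; the caller takes mmu_lock once and
 * commits it.  pte_guess and its fields are stand-ins, not KVM's code. */
struct pte_guess {
    int valid;
    unsigned long gfn;
    int largepage;
};

static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;
static struct pte_guess update_pte;     /* stand-in for vcpu->arch.update_pte */

/* Never touches mmu_lock, so it cannot "return locked". */
static struct pte_guess guess_page(unsigned long gpa, int bytes)
{
    struct pte_guess g = { 0 };

    if (bytes != 4 && bytes != 8)
        return g;                       /* invalid guess: caller commits nothing */
    g.valid = 1;
    g.gfn = gpa >> 12;
    g.largepage = 0;
    return g;
}

static void pte_write(unsigned long gpa, int bytes)
{
    struct pte_guess g = guess_page(gpa, bytes);

    pthread_mutex_lock(&mmu_lock);
    if (g.valid)
        update_pte = g;                 /* commit under the lock, in the caller */
    /* ... rest of the pte-write path runs under mmu_lock ... */
    pthread_mutex_unlock(&mmu_lock);
}
```

Every lock/unlock pair now lives in a single function, so no exit path leaves a lock held.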

>> Why move the destruction of the vm to the MMU notifier unregister hook? 
>> Does anything else ever call mmu_notifier_unregister that would implicitly 
>> destroy the VM?
>> 
>
> mmu notifier ->release can run at anytime before the filehandle is
> closed. ->release has to zap all sptes and freeze the mmu (hence all
> vcpus) to prevent any further page fault. After ->release returns all
> pages are freed (we'll never relay on the page pin to avoid the
> rmap_remove put_page to be a relevant unpin event). So the idea is
> that I wanted to maintain the same ordering of the current code in the
> vm destroy event, I didn't want to leave a partially shutdown VM on
> the vmlist. If the ordering is entirely irrelevant and the
> kvm_arch_destroy_vm can run well before kvm_destroy_vm is called, then
> I can avoid changes to kvm_main.c but I doubt.
>
> I've done it in a way that archs not needing mmu notifiers like s390
> can simply add the kvm_destroy_common_vm at the top of their
> kvm_arch_destroy_vm. All others using mmu_notifiers have to invoke
> kvm_destroy_common_vm in the ->release of the mmu notifiers.
>
> This will ensure that everything will be ok regardless if exit_mmap is
> called before/after exit_files, and it won't make a whole lot of
> difference anymore, if the driver fd is pinned through vmas->vm_file
> released in exit_mmap or through the task filedescriptors relased in
> exit_files etc... Infact this allows to call mmu_notifier_unregister
> at anytime later after the task has already been killed, without any
> trouble (like if the mmu notifier owner isn't registering in
> current->mm but some other tasks mm).
>   

I see.  It seems a little strange to me as a KVM guest isn't really tied 
to the current mm.  It seems like the net effect of this is that we are 
now tying a KVM guest to an mm.

For instance, if you create a guest, but didn't assign any memory to it, 
you could transfer the fd to another process and then close the fd 
(without destroying the guest).  The other process then could assign 
memory to it and presumably run the guest.

With your change, as soon as the first process exits, the guest will be 
destroyed.  I'm not sure this behavioral difference really matters but 
it is a behavioral difference.

Regards,

Anthony Liguori


Re: [kvm-devel] mmu notifier #v14

2008-04-26 Thread Andrea Arcangeli
On Sat, Apr 26, 2008 at 08:54:23PM -0500, Anthony Liguori wrote:
> Avi can correct me if I'm wrong, but I don't think the consensus of that 
> discussion was that we're going to avoid putting mmio pages in the rmap.  

My first impression of that discussion was that pci-passthrough mmio
can't be swapped, can't require write throttling, etc. ;). From a
Linux VM pagetable point of view, rmap on mmio looks weird. However,
thinking about it some more: in the Linux kernel, write protection
through rmap is needed only for write-throttling MAP_SHARED, which is
strictly RAM; for sptes we also need it on every cr3 touch to trap
pagetable updates (think of an ioremap done by the guest kernel). So I
think Avi's take that we need rmap for everything mapped by sptes is
probably the only feasible way to go.

> Practically speaking, replacing:
>
> + struct page *page = pfn_to_page((*spte & PT64_BASE_ADDR_MASK) >> 
> PAGE_SHIFT);
> + get_page(page);
>
>
> With:
>
> unsigned long pfn = (*spte & PT64_BASE_ADDR_MASK) >> PAGE_SHIFT;
> kvm_get_pfn(pfn);
>
> Results in exactly the same code except the latter allows mmio pfns in the 
> rmap.  So ignoring the whole mmio thing, using accessors that are already 
> there and used elsewhere seems like a good idea :-)

Agreed, especially in light of the above. I hadn't actually touched
that function for a while (I clearly wrote it before we started moving
the kvm mmu code from page to pfn), but it was still safe to use for
testing the locking of the mmu notifier methods. My main focus in
the last few days has been to get the locking right against the latest
mmu notifier code, #v14 ;).

Now that I look into it more closely, the get_page/put_page pair is
unnecessary by now (it was necessary with the older patches that
didn't implement range_begin and that relied on page pinning).

Not just in that function: all of that reference counting inside kvm is
now entirely useless and can be removed.

NOTE: it is safe to flush the tlb outside the mmu_lock if it is done
inside the mmu_notifier methods. But only mmu notifiers can defer the
tlb flush until after releasing mmu_lock, because the page can't be
freed by the VM until we return.

All other kvm code must instead definitely flush the tlb inside the
mmu_lock; otherwise, when the mmu notifier code runs, it will see the
spte non-present and so will do nothing (it will not wait for kvm to
drop the mmu_lock before allowing the main Linux VM to free the page).

The tlb flush must happen before the page is freed, and doing it
inside mmu_lock everywhere (except in mmu-notifier context, where it can
be done after releasing mmu_lock) guarantees it.

The positive side of the tradeoff of having to do the tlb flush inside
the mmu_lock is that KVM can now safely zap and unmap as many sptes
as it wants and do a single tlb flush at the end. The pages can't be
freed as long as the mmu_lock is held (this is why the tlb flush has
to be done inside the mmu_lock). This model heavily reduces the tlb
flush frequency for large-scale spte mangling, and tlb flushes here are
quite expensive because of ipis.
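The batching tradeoff can be illustrated with a toy sketch (the array and counter stand in for the real spte structures and the IPI-based flush; none of this is the actual KVM implementation):

```c
#include <pthread.h>

/* Toy sketch of the batching described above: many sptes zapped under one
 * mmu_lock hold, with a single tlb flush before the lock is dropped. */
#define NR_SPTES 64

static pthread_mutex_t mmu_lock = PTHREAD_MUTEX_INITIALIZER;
static unsigned long sptes[NR_SPTES];
static int tlb_flushes;                 /* each flush is an expensive IPI broadcast */

static void flush_tlb(void)
{
    tlb_flushes++;
}

static void zap_range(int start, int end)
{
    pthread_mutex_lock(&mmu_lock);
    for (int i = start; i < end; i++)
        sptes[i] = 0;                   /* zap many sptes... */
    flush_tlb();                        /* ...but flush only once, and inside the
                                         * lock, so a page can never be freed while
                                         * still reachable through a stale tlb entry */
    pthread_mutex_unlock(&mmu_lock);
}
```

One flush per batch instead of one per spte is where the win comes from; holding mmu_lock across the whole batch is what makes the single deferred flush safe.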

> I appreciate the desire to minimize changes, but taking a lock on return 
> seems to take that to a bit of an extreme.  It seems like a simple thing to 
> fix though, no?

I agree it needs to be rewritten as a cleaner fix, but probably in a
separate patch (which has to be incremental, as that code will reject
against the mmu notifier patch). I didn't see it as a big issue, however,
to apply my quick fix first and clean up with an incremental update.

> I see.  It seems a little strange to me as a KVM guest isn't really tied to 
> the current mm.  It seems like the net effect of this is that we are now 
> tying a KVM guest to an mm.
>
> For instance, if you create a guest, but didn't assign any memory to it, 
> you could transfer the fd to another process and then close the fd (without 
> destroying the guest).  The other process then could assign memory to it 
> and presumably run the guest.

Passing the anon kvm vm fd through unix sockets to another task is
exactly why we need things like ->release to not depend on the
fd->release / vma->vm_file->release ordering in the do_exit path to
tear down the VM.

The guest itself is definitely tied to an "mm": the guest runs using
get_user_pages, and get_user_pages is meaningless without an mm. But
the fd where we run the ioctls isn't tied to the mm; it's just an fd
that can be passed across tasks with unix sockets.
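The fd-passing mechanism referred to here is the standard SCM_RIGHTS ancillary data over a unix socket; a minimal sketch (the helper names are mine, not KVM's or qemu's):

```c
#include <assert.h>
#include <string.h>
#include <sys/socket.h>
#include <sys/uio.h>
#include <unistd.h>

/* Hand any file descriptor (a KVM vm fd included) to another process
 * over a unix socket using SCM_RIGHTS ancillary data. */
static int send_fd(int sock, int fd)
{
    char byte = 'F';
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    cmsg = CMSG_FIRSTHDR(&msg);
    cmsg->cmsg_level = SOL_SOCKET;
    cmsg->cmsg_type = SCM_RIGHTS;       /* the kernel installs a dup of fd for the peer */
    cmsg->cmsg_len = CMSG_LEN(sizeof(int));
    memcpy(CMSG_DATA(cmsg), &fd, sizeof(int));

    return sendmsg(sock, &msg, 0) == 1 ? 0 : -1;
}

static int recv_fd(int sock)
{
    char byte;
    struct iovec iov = { .iov_base = &byte, .iov_len = 1 };
    union { struct cmsghdr hdr; char buf[CMSG_SPACE(sizeof(int))]; } u;
    struct msghdr msg = { 0 };
    struct cmsghdr *cmsg;
    int fd = -1;

    msg.msg_iov = &iov;
    msg.msg_iovlen = 1;
    msg.msg_control = u.buf;
    msg.msg_controllen = sizeof(u.buf);

    if (recvmsg(sock, &msg, 0) != 1)
        return -1;
    cmsg = CMSG_FIRSTHDR(&msg);
    if (cmsg && cmsg->cmsg_level == SOL_SOCKET && cmsg->cmsg_type == SCM_RIGHTS)
        memcpy(&fd, CMSG_DATA(cmsg), sizeof(int));
    return fd;
}
```

The receiving process gets its own descriptor for the same open file, which is exactly why the fd's lifetime is independent of any particular task's mm.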

> With your change, as soon as the first process exits, the guest will be 
> destroyed.  I'm not sure this behavioral difference really matters but it 
> is a behavioral difference.

The guest-mode of the cpu can't safely run on any task but the one
with the "mm" tracked by the mmu notifiers and from which the memory is
allocated. The sptes point to the memory allocated in that
"mm". It's definitely memory-corrupting to leave any spte established
when the last thread of that "mm" exits, as the memory supposedly
pointed by the orpha