Re: [kvm-devel] [PATCH/RFC] stop_machine: make stop_machine_run more virtualization friendly

2008-05-08 Thread Jeremy Fitzhardinge
Christian Borntraeger wrote:
 On kvm I have seen some rare hangs in stop_machine when I used more guest
 cpus than hosts cpus. e.g. 32 guest cpus on 1 host cpu triggered the
 hang quite often. I could also reproduce the problem on a 4 way z/VM host 
 with 
 a 64 way guest.
   

I think that's one of those don't do that then cases ;)

 It turned out that the guest was consuming all available cpus mostly for
 spinning on scheduler locks like rq-lock. This is expected as the threads 
 are 
 calling yield all the time. 
 The problem is now, that the host scheduling decisings together with the 
 guest 
 scheduling decisions and spinlocks not being fair managed to create an 
 interesting scenario similar to a live lock. (Sometimes the hang resolved 
 itself after some minutes)
   

I think x86 (at least) is now using ticket locks, which is fair.  Which 
kernel are you seeing this problem on?

 Changing stop_machine to yield the cpu to the hypervisor when yielding inside 
 the guest fixed the problem for me. While I am not completely happy with this 
 patch, I think it causes no harm and it really improves the situation for me.

 I used cpu_relax for yielding to the hypervisor, does that work on all 
 architectures?
   

On x86, cpu_relax is just a pause instruction (rep;nop).  We don't 
hook it in paravirt_ops, and while VT/SVM can be used to fault into the 
hypervisor on this instruction, I don't know if kvm actually does so.  
Either way, it wouldn't work for VMI, Xen or lguest.

J

 p.s.: If you want to reproduce the problem, cpu hotplug and kprobes use 
 stop_machine_run and both triggered the problem after some retries. 


 Signed-off-by: Christian Borntraeger [EMAIL PROTECTED]
 CC: Ingo Molnar [EMAIL PROTECTED]
 CC: Rusty Russell [EMAIL PROTECTED]

 ---
  kernel/stop_machine.c |7 ---
  1 file changed, 4 insertions(+), 3 deletions(-)

 Index: kvm/kernel/stop_machine.c
 ===
 --- kvm.orig/kernel/stop_machine.c
 +++ kvm/kernel/stop_machine.c
 @@ -62,8 +62,7 @@ static int stopmachine(void *cpu)
* help our sisters onto their CPUs. */
   if (!prepared  !irqs_disabled)
   yield();
 - else
 - cpu_relax();
 + cpu_relax();
   }
  
   /* Ack: we are exiting. */
 @@ -106,8 +105,10 @@ static int stop_machine(void)
   }
  
   /* Wait for them all to come to life. */
 - while (atomic_read(stopmachine_thread_ack) != stopmachine_num_threads)
 + while (atomic_read(stopmachine_thread_ack) != stopmachine_num_threads) 
 {
   yield();
 + cpu_relax();
 + }
  
   /* If some failed, kill them all. */
   if (ret  0) {

 ___
 Virtualization mailing list
 [EMAIL PROTECTED]
 https://lists.linux-foundation.org/mailman/listinfo/virtualization
   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH/RFC] stop_machine: make stop_machine_run more virtualization friendly

2008-05-08 Thread Jeremy Fitzhardinge
Christian Borntraeger wrote:
 I really like 64 guest cpus as a good testcase for all kind of things. 
   

Sure, I do the same kind of thing.

 I think x86 (at least) is now using ticket locks, which is fair.  Which 
 kernel are you seeing this problem on?
 

 Sorry, forgot to mention. Its kvm.git from 2 days ago on s390.
   

And on s390 cpu_relax yields the vcpu?  That's not common behaviour 
across architectures.

J

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ RfC / patch ] kvmclock fixes

2008-04-21 Thread Jeremy Fitzhardinge
Gerd Hoffmann wrote:
   * Host: make kvm pv clock really compatible with xen pv clock.
   * Guest/xen: factor out some xen clock code into a separate
source file (pvclock.[ch]), so kvm can reuse it.
   * Guest/kvm: make kvm clock compatible with xen clock by using
the common code bits.
   

I guess saving on code duplication is good...

 +cycle_t pvclock_clocksource_read(struct kvm_vcpu_time_info *src)
 +{
 + struct pvclock_shadow_time *shadow = get_cpu_var(shadow_time);
 + cycle_t ret;
 +
 + pvclock_get_time_values(shadow, src);
 + ret = shadow-system_timestamp + pvclock_get_nsec_offset(shadow);
   

You need to put this in a loop in case the system clock parameters 
change between the pvclock_get_time_values() and pvclock_get_nsec_offset().

How does kvm deal with suspend/resume with respect to time?  Is the 
system timestamp guaranteed to remain monotonic?  For Xen, I think 
we'll need to maintain an offset between the initial system timestamp 
and whatever it is after resuming.

J

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] pv clock: kvm is incompatible with xen :-(

2008-04-21 Thread Jeremy Fitzhardinge
Gerd Hoffmann wrote:
 Hmm, I somehow fail to see a case where it could be non-atomic ...

 get_time_values() copies a consistent snapshot, thus
 xen_clocksource_read() doesn't race against xen updating the fields.
 The snapshot is in a per-cpu variable, thus it doesn't race against
 other guest vcpus running get_time_values() at the same time.
   

Xen could change the parameters in the instant after get_time_values().  
That change could be as a result of suspend-resume, so the parameters 
and the tsc could be wildly different.  It's definitely an edge-case, 
but it's easy enough to deal with.

 There could be a loopless
 __get_time_values() for use in this case, but given that it almost never
 loops, I don't think its worthwhile.
 

 in this case ???  I'm confused.  There is only a single user of
 get_nsec_offset(), which is xen_clocksource_read() ...
   

Sure, but get_time_values() has several other callers.  If 
xen_clocksource_read() had its own loop to make sure the read_tsc is 
atomic with respect to get_time_values, then get_time_values itself 
needn't loop.  But, given that it only loops in the very rare case that 
it races with Xen updating those parameters, it doesn't seem to make 
much difference either way.

J

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] pv clock: kvm is incompatible with xen :-(

2008-04-21 Thread Jeremy Fitzhardinge
Gerd Hoffmann wrote:
 Jeremy Fitzhardinge wrote:
   
 Xen could change the parameters in the instant after get_time_values(). 
 That change could be as a result of suspend-resume, so the parameters
 and the tsc could be wildly different.
 

 Ah, ok, forgot the rdtsc in the picture.  With that in mind I fully
 agree that the loop is needed.  I think kvm guests can even hit that one
 with the vcpu migrating to a different physical cpu, so we better handle
 it correctly ;)
   

Yes, same with Xen.

 Sure, but get_time_values() has several other callers.
 

 Not really.  There are only two calls, one in clocksource_read() and one
 in the init path.  The later is superfluous I think because
 clocksource_read() is the only user of the shadowed time info.
   

Hm.  It doesn't look like shadow_time needs to be a static percpu at 
all.  It could just be a local to clocksource_read, I think.

J

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [ RfC / patch ] kvmclock fixes

2008-04-21 Thread Jeremy Fitzhardinge
Gerd Hoffmann wrote:
 +cycle_t pvclock_clocksource_read(struct kvm_vcpu_time_info *src)
 +{
 + struct pvclock_shadow_time *shadow;
 + cycle_t ret;
 + unsigned version;
 +
 + shadow = get_cpu_var(shadow_time);
 + do {
 + version = pvclock_get_time_values(shadow, src);
 + barrier();
 + ret = shadow-system_timestamp + 
 pvclock_get_nsec_offset(shadow);
 + barrier();
   

Is barrier() strong enough?  Does kvm guarantee that the per-cpu time 
parameters are only ever updated by that cpu?  I'm pretty sure Xen does, 
so that's OK.

J

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] pv clock: kvm is incompatible with xen :-(

2008-04-18 Thread Jeremy Fitzhardinge
Gerd Hoffmann wrote:
 I'm looking at the guest side of the issue right now, trying to identify
 common code, and while doing so noticed that xen does the
 version-check-loop in both get_time_values_from_xen(void) and
 xen_clocksource_read(void), and I can't see any obvious reason for that.
  The loop in xen_clocksource_read(void) is not needed IMHO.  Can I drop it?
   

No.  The get_nsec_offset() needs to be atomic with respect to the 
get_time_values() parameters.  There could be a loopless 
__get_time_values() for use in this case, but given that it almost never 
loops, I don't think its worthwhile.

J

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] pv clock: kvm is incompatible with xen :-(

2008-04-11 Thread Jeremy Fitzhardinge
Gerd Hoffmann wrote:
 Wall clock is off a few hours though.  Oops.

 I think the way wall clock and system clock work together in xen (Jeremy
 correct me if I'm wrong) is that the wall clock specifies the point in
 time where the system clock started going.  As kvm fills in host system
 time into the guest system time fields the guest wall clock fields
 should be filled with the host boot time timestamp I'd say.
   

Yes.  The wallclock field in the shared info structure is the wallclock 
time at boot; you compute the current time by adding the system 
timestamp to it.   System time changes are effected by retroactively 
changing the boot time of the machine, though that can also change 
because of suspend/resume/migrate.

In general the kernel only reads the wallclock time at boot, and then 
maintains it for itself from then on.  I think.

J

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Don't miss this year's exciting event. There's still time to save $100. 
Use priority code J8TL2D2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch 01/10] emm: mm_lock: Lock a process against reclaim

2008-04-07 Thread Jeremy Fitzhardinge
Andrea Arcangeli wrote:
 On Fri, Apr 04, 2008 at 04:12:42PM -0700, Jeremy Fitzhardinge wrote:
   
 I think you can break this if() down a bit:

  if (!(vma-vm_file  vma-vm_file-f_mapping))
  continue;
 

 It makes no difference at runtime, coding style preferences are quite
 subjective.
   

Well, overall the formatting of that if statement is very hard to read.  
Separating out the logically distinct pieces in to different ifs at 
least shows the reader that they are distinct.
Aside from that, doing some manual CSE to remove all the casts and 
expose the actual thing you're testing for would help a lot (are the 
casts even necessary?).

 So this is an O(n^2) algorithm to take the i_mmap_locks from low to high 
 order?  A comment would be nice.  And O(n^2)?  Ouch.  How often is it 
 called?
 

 It's called a single time when the mmu notifier is registered. It's a
 very slow path of course. Any other approach to reduce the complexity
 would require memory allocations and it would require
 mmu_notifier_register to return -ENOMEM failure. It didn't seem worth
 it.
   

It's per-mm though.  How many processes would need to have notifiers?


 And is it necessary to mush lock and unlock together?  Unlock ordering 
 doesn't matter, so you should just be able to have a much simpler loop, no?
 

 That avoids duplicating .text. Originally they were separated. unlock
 can't be a simpler loop because I didn't reserve vm_flags bitflags to
 do a single O(N) loop for unlock. If you do malloc+fork+munmap two
 vmas will point to the same anon-vma lock, that's why the unlock isn't
 simpler unless I mark what I locked with a vm_flags bitflag.

Well, its definitely going to need more comments then.  I assumed it 
would end up locking everything, so unlocking everything would be 
sufficient.

J

-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Register now and save $200. Hurry, offer ends at 11:59 p.m., 
Monday, April 7! Use priority code J8TLD2. 
http://ad.doubleclick.net/clk;198757673;13503038;p?http://java.sun.com/javaone
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [patch 01/10] emm: mm_lock: Lock a process against reclaim

2008-04-04 Thread Jeremy Fitzhardinge
Christoph Lameter wrote:
 Provide a way to lock an mm_struct against reclaim (try_to_unmap
 etc). This is necessary for the invalidate notifier approaches so
 that they can reliably add and remove a notifier.

 Signed-off-by: Andrea Arcangeli [EMAIL PROTECTED]
 Signed-off-by: Christoph Lameter [EMAIL PROTECTED]

 ---
  include/linux/mm.h |   10 
  mm/mmap.c  |   66 
 +
  2 files changed, 76 insertions(+)

 Index: linux-2.6/include/linux/mm.h
 ===
 --- linux-2.6.orig/include/linux/mm.h 2008-04-02 11:41:47.741678873 -0700
 +++ linux-2.6/include/linux/mm.h  2008-04-04 15:02:17.660504756 -0700
 @@ -1050,6 +1050,16 @@ extern int install_special_mapping(struc
  unsigned long addr, unsigned long len,
  unsigned long flags, struct page **pages);
  
 +/*
 + * Locking and unlocking am mm against reclaim.
 + *
 + * mm_lock will take mmap_sem writably (to prevent additional vmas from being
 + * added) and then take all mapping locks of the existing vmas. With that
 + * reclaim is effectively stopped.
 + */
 +extern void mm_lock(struct mm_struct *mm);
 +extern void mm_unlock(struct mm_struct *mm);
 +
  extern unsigned long get_unmapped_area(struct file *, unsigned long, 
 unsigned long, unsigned long, unsigned long);
  
  extern unsigned long do_mmap_pgoff(struct file *file, unsigned long addr,
 Index: linux-2.6/mm/mmap.c
 ===
 --- linux-2.6.orig/mm/mmap.c  2008-04-04 14:55:03.477593980 -0700
 +++ linux-2.6/mm/mmap.c   2008-04-04 14:59:05.505395402 -0700
 @@ -2242,3 +2242,69 @@ int install_special_mapping(struct mm_st
  
   return 0;
  }
 +
 +static void mm_lock_unlock(struct mm_struct *mm, int lock)
 +{
 + struct vm_area_struct *vma;
 + spinlock_t *i_mmap_lock_last, *anon_vma_lock_last;
 +
 + i_mmap_lock_last = NULL;
 + for (;;) {
 + spinlock_t *i_mmap_lock = (spinlock_t *) -1UL;
 + for (vma = mm-mmap; vma; vma = vma-vm_next)
 + if (vma-vm_file  vma-vm_file-f_mapping 
   
I think you can break this if() down a bit:

if (!(vma-vm_file  vma-vm_file-f_mapping))
continue;


 + (unsigned long) i_mmap_lock 
 + (unsigned long)
 + vma-vm_file-f_mapping-i_mmap_lock 
 + (unsigned long)
 + vma-vm_file-f_mapping-i_mmap_lock 
 + (unsigned long) i_mmap_lock_last)
 + i_mmap_lock =
 + vma-vm_file-f_mapping-i_mmap_lock;
   

So this is an O(n^2) algorithm to take the i_mmap_locks from low to high 
order?  A comment would be nice.  And O(n^2)?  Ouch.  How often is it 
called?

And is it necessary to mush lock and unlock together?  Unlock ordering 
doesn't matter, so you should just be able to have a much simpler loop, no?


 + if (i_mmap_lock == (spinlock_t *) -1UL)
 + break;
 + i_mmap_lock_last = i_mmap_lock;
 + if (lock)
 + spin_lock(i_mmap_lock);
 + else
 + spin_unlock(i_mmap_lock);
 + }
 +
 + anon_vma_lock_last = NULL;
 + for (;;) {
 + spinlock_t *anon_vma_lock = (spinlock_t *) -1UL;
 + for (vma = mm-mmap; vma; vma = vma-vm_next)
 + if (vma-anon_vma 
 + (unsigned long) anon_vma_lock 
 + (unsigned long) vma-anon_vma-lock 
 + (unsigned long) vma-anon_vma-lock 
 + (unsigned long) anon_vma_lock_last)
 + anon_vma_lock = vma-anon_vma-lock;
 + if (anon_vma_lock == (spinlock_t *) -1UL)
 + break;
 + anon_vma_lock_last = anon_vma_lock;
 + if (lock)
 + spin_lock(anon_vma_lock);
 + else
 + spin_unlock(anon_vma_lock);
 + }
 +}
   


 +
 +/*
 + * This operation locks against the VM for all pte/vma/mm related
 + * operations that could ever happen on a certain mm. This includes
 + * vmtruncate, try_to_unmap, and all page faults. The holder
 + * must not hold any mm related lock. A single task can't take more
 + * than one mm lock in a row or it would deadlock.
 + */
 +void mm_lock(struct mm_struct * mm)
 +{
 + down_write(mm-mmap_sem);
 + mm_lock_unlock(mm, 1);
 +}
 +
 +void mm_unlock(struct mm_struct *mm)
 +{
 + mm_lock_unlock(mm, 0);
 + up_write(mm-mmap_sem);
 +}

   


-
This SF.net email is sponsored by the 2008 JavaOne(SM) Conference 
Register now and save $200. Hurry, offer 

Re: [kvm-devel] [02/17][PATCH] Implement smp_call_function_mask for ia64 - V8

2008-04-01 Thread Jeremy Fitzhardinge
Jes Sorensen wrote:
 Jeremy Fitzhardinge wrote:
   
 Jes Sorensen wrote:
 This change has been on the x86 side for ages, and not even Ingo made a 
 peep about it ;)
 

 Mmmm, last time I looked, x86 didn't scale to any interesting number
 of CPUs :-)
   

Well, I guess you need all those CPUs if scanning a 64-word bitvector 
takes anything like the time it takes to do an IPI...

 I wasn't suggesting we shouldn't have both interfaces, merely
 questioning why adding what to me seems like an unnecessary performance
 hit for the classic case of the call.

I don't mind how many interfaces there are, so long as there only needs 
to be one place to hook to plug in the Xen version of 
smp_call_function_whatever.  Perhaps the answer is to just hook the IPI 
mechanism itself rather than the whole of smp_call_function_mask...

Have you looked at Jens Axboe's patches to make all this stuff a lot 
more arch-common?

J

-
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [02/17][PATCH] Implement smp_call_function_mask for ia64 - V8

2008-03-31 Thread Jeremy Fitzhardinge
Jes Sorensen wrote:
 I'm a little wary of the performance impact of this change. Doing a
 cpumask compare on all smp_call_function calls seems a little expensive.
 Maybe it's just noise in the big picture compared to the actual cost of
 the IPIs, but I thought I'd bring it up.

 Keep in mind that a cpumask can be fairly big these days, max NR_CPUS
 is currently 4096. For those booting a kernel with NR_CPUS at 4096 on
 a dual CPU machine, it would be a bit expensive.
   

Unless your hardware has remarkably fast IPIs, I think really the cost 
of scanning 512 bytes is going to be in the noise...

This change has been on the x86 side for ages, and not even Ingo made a 
peep about it ;)

 Why not keep smp_call_function() the way it was before, rather than
 implementing it via the call to smp_call_function_mask()?
   

Because Xen needs a different core implementation (because of its 
different IPI implementation), and it would be better to just have to do 
one of them rather than N.

J

-
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC/PATCH 01/15] preparation: provide hook to enable pgstes in user pagetable

2008-03-20 Thread Jeremy Fitzhardinge
Carsten Otte wrote:
 +struct mm_struct *dup_mm(struct task_struct *tsk);
   

No prototypes in .c files.  Put this in an appropriate header.

J

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [EMAIL PROTECTED]: [patch 3/4] paravirt: set_access_flags/set_wrprotect should use paravirt interface]

2008-01-30 Thread Jeremy Fitzhardinge
Marcelo Tosatti wrote:
 Forgot to copy you... Ideally all pte updates should be done via the
 paravirt interface.
   

Hm, are you sure?

 +static inline void pte_clear_bit(unsigned int bit, pte_t *ptep)
 +{
 + pte_t pte = *ptep;
 + clear_bit(bit, (unsigned long *)pte.pte);
 + set_pte(ptep, pte);
 +}
   

This is not a safe transformation.  This will lose hardware A/D bit 
changes if the pte is part of an active pagetable.  Aside from that, 
clear_bit is atomic and so is the wrong thing to use on a local 
variable; a plain bitmask would do the job.

clear_bit on a PTE is fine, since everyone has to support some level of 
trap'n'emulate on pte-level entries (ptep_get_and_clear is very hard to 
implement otherwise).  Shadow pagetable implementations will do this 
naturally, and Xen must do it to allow atomic updates on ptes (also 
because it makes late-pinning/early-unpinning a performance win).

 +
  static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
 pte_t *ptep, pte_t pte)
  {
 Index: linux-2.6-x86-kvm/include/asm-x86/pgtable.h
 ===
 --- linux-2.6-x86-kvm.orig/include/asm-x86/pgtable.h
 +++ linux-2.6-x86-kvm/include/asm-x86/pgtable.h
 @@ -227,6 +227,8 @@ void native_pagetable_setup_done(pgd_t *
  #define pte_update(mm, addr, ptep)  do { } while (0)
  #define pte_update_defer(mm, addr, ptep)do { } while (0)
  
 +#define pte_clear_bit(bit, ptep) clear_bit(bit, (unsigned long 
 *)ptep-pte)
 +
  static inline void paravirt_pagetable_setup_start(pgd_t *base)
  {
   native_pagetable_setup_start(base);
 @@ -302,7 +304,7 @@ static inline void native_set_pte_at(str
  ({   \
   int __changed = !pte_same(*(ptep), entry);  \
   if (__changed  dirty) {   \
 - *ptep = entry;  \
 + set_pte(ptep, entry);   \
   

This change is OK, but doesn't really do anything.  All hypervisors 
allow direct access to ptes.

   pte_update_defer((vma)-vm_mm, (address), (ptep));  \
   flush_tlb_page(vma, address);   \
   }   \
 @@ -357,7 +359,7 @@ static inline pte_t ptep_get_and_clear_f
  #define __HAVE_ARCH_PTEP_SET_WRPROTECT
  static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long 
 addr, pte_t *ptep)
  {
 - clear_bit(_PAGE_BIT_RW, (unsigned long *)ptep-pte);
 + pte_clear_bit(_PAGE_BIT_RW, ptep);
   

This is only valid if ptep_set_wrprotect is never used on active ptes, 
which I don't think is the case.

J

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [EMAIL PROTECTED]: [patch 3/4] paravirt: set_access_flags/set_wrprotect should use paravirt interface]

2008-01-30 Thread Jeremy Fitzhardinge
Marcelo Tosatti wrote:
 On Wed, Jan 30, 2008 at 03:00:49PM -0800, Jeremy Fitzhardinge wrote:
   
 Marcelo Tosatti wrote:
 
 Forgot to copy you... Ideally all pte updates should be done via the
 paravirt interface.
  
   
 Hm, are you sure?

 
 +static inline void pte_clear_bit(unsigned int bit, pte_t *ptep)
 +{
 +   pte_t pte = *ptep;
 +   clear_bit(bit, (unsigned long *)pte.pte);
 +   set_pte(ptep, pte);
 +}
  
   
 This is not a safe transformation.  This will lose hardware A/D bit 
 changes if the pte is part of an active pagetable.  Aside from that, 
 clear_bit is atomic and so is the wrong thing to use on a local 
 variable; a plain bitmask would do the job.
 

 True this is racy on baremetal.
   

And under Xen (since the pagetable is the real hardware pagetable).

 clear_bit on a PTE is fine, since everyone has to support some level of 
 trap'n'emulate on pte-level entries (ptep_get_and_clear is very hard to 
 implement otherwise).  Shadow pagetable implementations will do this 
 naturally, and Xen must do it to allow atomic updates on ptes (also 
 because it makes late-pinning/early-unpinning a performance win).
 

 Sure clear_bit on a PTE is fine, but we wan't to virtualize for batching
 pte updates via hypercall.
   

Yes, mprotect needs batching, but that probably needs special handling.

   
 +
 static inline void set_pte_at(struct mm_struct *mm, unsigned long addr,
   pte_t *ptep, pte_t pte)
 {
 Index: linux-2.6-x86-kvm/include/asm-x86/pgtable.h
 ===
 --- linux-2.6-x86-kvm.orig/include/asm-x86/pgtable.h
 +++ linux-2.6-x86-kvm/include/asm-x86/pgtable.h
 @@ -227,6 +227,8 @@ void native_pagetable_setup_done(pgd_t *
 #define pte_update(mm, addr, ptep)  do { } while (0)
 #define pte_update_defer(mm, addr, ptep)do { } while (0)

 +#define pte_clear_bit(bit, ptep)   clear_bit(bit, (unsigned long 
 *)ptep-pte)
 +
 static inline void paravirt_pagetable_setup_start(pgd_t *base)
 {
 native_pagetable_setup_start(base);
 @@ -302,7 +304,7 @@ static inline void native_set_pte_at(str
 ({  \
 int __changed = !pte_same(*(ptep), entry);  \
 if (__changed  dirty) {   \
 -   *ptep = entry;  \
 +   set_pte(ptep, entry);   \
  
   
 This change is OK, but doesn't really do anything.  All hypervisors 
 allow direct access to ptes.
 

 We want the pte update to go through a hypercall instead of trap'n'emulate.
   

ptep_set_access_flags is only used in fault handling, so there's no 
scope for batching.

   
 pte_update_defer((vma)-vm_mm, (address), (ptep));  \
 flush_tlb_page(vma, address);   \
 }   \
 @@ -357,7 +359,7 @@ static inline pte_t ptep_get_and_clear_f
 #define __HAVE_ARCH_PTEP_SET_WRPROTECT
 static inline void ptep_set_wrprotect(struct mm_struct *mm, unsigned long 
 addr, pte_t *ptep)
 {
 -   clear_bit(_PAGE_BIT_RW, (unsigned long *)ptep-pte);
 +   pte_clear_bit(_PAGE_BIT_RW, ptep);
  
   
 This is only valid if ptep_set_wrprotect is never used on active ptes, 
 which I don't think is the case.
 

 Ok, pte_clear_bit() like above is broken. But the point is that we
 want to batch the write protection of the source pte along with the
 creation of the write protected destination pte on COW using the
 paravirt_enter_lazy/paravirt_leave_lazy.
   

Which we are we talking about here?  In Xen, the destination pte is 
unpinned during a fork(), so a plain memory write will work without any 
hypervisor interaction.  That's what I mean about the late pin/early 
unpin optimisation.  Almost all fork/exit pagetable manipulation 
happens on unpinned pagetables which are plain RW pages.  It's only 
updates on pinned pagetables which need hypervisor interaction.  The 
only significant batchable update is mprotect.

I don't know what the issues for KVM/VMI are.

J

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH][RFC] SVM: Add Support for Nested Paging in AMD Fam16 CPUs

2008-01-27 Thread Jeremy Fitzhardinge
Avi Kivity wrote:
  I find it non-descriptive, and it reminds me of another hypervisor.  
 I suggest 'tlp' for two-level paging.

That has its own ambiguity; without other context it reads like 
two-level pagetable.  Anyway, using the same term for the same thing 
is not a bad idea.

J

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] export notifier #1

2008-01-23 Thread Jeremy Fitzhardinge
Gerd Hoffmann wrote:
 Another maybe workable approach for Xen is to go through pv_ops
 (although pte_clear doesn't go through pv_ops right now, so this would
 be an additional hook too ...).
   

It does for 32-bit PAE.  Making pte_clear uniform across all pagetable 
modes would be a nice cleanup.

J

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 24/24] make vsmp a paravirt client

2007-11-13 Thread Jeremy Fitzhardinge
Glauber de Oliveira Costa wrote:
 the ifdef only exists because, as I said, the code itself will be always
 compiled in, to avoid an ifdef in setup_64.c. So it's just a taking it
 from here, putting it there issue. Kiran seem to prefer this way, but I
 don't really have a preference.

It would be better to have the ifdef in setup_64.c and just make the
compilation of vsmp_64.c depend on CONFIG_X86_VSMP.  If the ifdef really
rankles you could use a weak stub function somewhere, or define an
inline stub in vsmp.h.

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/24] paravirt_ops for unified x86 - that's me again!

2007-11-13 Thread Jeremy Fitzhardinge
Amit Shah wrote:
 Glauber, are you planning on consolidating the dma_ops structure for 32- and 
 64-bit? 32-bit doesn't currently have a dma_mapping_ops structure, which 
 makes paravirtualizing DMA access difficult on 32-bit.


I think it's a good idea.  While I haven't worked out the details yet,
it seems like something I'll need for Xen dom0 support.

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 1/24] mm/sparse-vmemmap.c: make sure init_mm is included

2007-11-09 Thread Jeremy Fitzhardinge
Glauber de Oliveira Costa wrote:
 mm/sparse-vmemmap.c uses init_mm in some places.  However, it is not
 present in any of the headers currently included in the file.

 init_mm is defined as extern in sched.h, so we add it to the headers list

 Up to now, this problem was masked by the fact that functions like
 set_pte_at() and pmd_populate_kernel() are usually macros that expand to
 simpler variants that does not use the first parameter at all.

 Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
 Signed-off-by: Andrew Morton [EMAIL PROTECTED]
 Signed-off-by: Linus Torvalds [EMAIL PROTECTED]
 ---
  mm/sparse-vmemmap.c |1 +
  1 files changed, 1 insertions(+), 0 deletions(-)

 diff --git a/mm/sparse-vmemmap.c b/mm/sparse-vmemmap.c
 index d3b718b..22620f6 100644
 --- a/mm/sparse-vmemmap.c
 +++ b/mm/sparse-vmemmap.c
 @@ -24,6 +24,7 @@
  #include linux/module.h
  #include linux/spinlock.h
  #include linux/vmalloc.h
 +#include linux/sched.h
   

This is already in git.

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 5/24] smp x86 consolidation

2007-11-09 Thread Jeremy Fitzhardinge
Glauber de Oliveira Costa wrote:
 This patch consolidates part of the pieces of smp for both architectures.
 (i386 and x86_64). It makes part the calls go through smp_ops, and shares
 code for those functions in smpcommon.c

 There's more room for code sharing here, but it is left as an exercise to
 the reader ;-)
   

I'm getting link errors in 32-bit:

arch/x86/kernel/built-in.o: In function `native_smp_send_reschedule':
/home/jeremy/hg/xen/paravirt/linux/arch/x86/kernel/smpcommon.c:262: undefined 
reference to `genapic'
arch/x86/kernel/built-in.o: In function `native_smp_call_function_mask':
/home/jeremy/hg/xen/paravirt/linux/arch/x86/kernel/smpcommon.c:113: undefined 
reference to `genapic'


J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 5/24] smp x86 consolidation

2007-11-09 Thread Jeremy Fitzhardinge
Glauber de Oliveira Costa wrote:
  arch/x86/kernel/built-in.o: In function `native_smp_send_reschedule':
  /home/jeremy/hg/xen/paravirt/linux/arch/x86/kernel/smpcommon.c:262:
 undefined reference to `genapic'
  arch/x86/kernel/built-in.o: In function `native_smp_call_function_mask':
  /home/jeremy/hg/xen/paravirt/linux/arch/x86/kernel/smpcommon.c:113:
 undefined reference to `genapic'

 Ok, it compiled just fine here. I bet it's due to  one of that i386 lots
 of variants.
 Which subarchitecture are you compiling for, jeremy ?

Default (CONFIG_X86_PC).  Config attached.

J
#
# Automatically generated make config: don't edit
# Linux kernel version: 2.6.24-rc2
# Fri Nov  9 15:19:56 2007
#
CONFIG_X86_32=y
CONFIG_GENERIC_TIME=y
CONFIG_GENERIC_CMOS_UPDATE=y
CONFIG_CLOCKSOURCE_WATCHDOG=y
CONFIG_GENERIC_CLOCKEVENTS=y
CONFIG_GENERIC_CLOCKEVENTS_BROADCAST=y
CONFIG_LOCKDEP_SUPPORT=y
CONFIG_STACKTRACE_SUPPORT=y
CONFIG_SEMAPHORE_SLEEPERS=y
CONFIG_X86=y
CONFIG_MMU=y
CONFIG_ZONE_DMA=y
CONFIG_QUICKLIST=y
CONFIG_GENERIC_ISA_DMA=y
CONFIG_GENERIC_IOMAP=y
CONFIG_GENERIC_BUG=y
CONFIG_GENERIC_HWEIGHT=y
CONFIG_ARCH_MAY_HAVE_PC_FDC=y
CONFIG_DMI=y
CONFIG_DEFCONFIG_LIST=/lib/modules/$UNAME_RELEASE/.config

#
# General setup
#
CONFIG_EXPERIMENTAL=y
CONFIG_LOCK_KERNEL=y
CONFIG_INIT_ENV_ARG_LIMIT=32
CONFIG_LOCALVERSION=
# CONFIG_LOCALVERSION_AUTO is not set
CONFIG_SWAP=y
CONFIG_SYSVIPC=y
CONFIG_SYSVIPC_SYSCTL=y
CONFIG_POSIX_MQUEUE=y
CONFIG_BSD_PROCESS_ACCT=y
# CONFIG_BSD_PROCESS_ACCT_V3 is not set
CONFIG_TASKSTATS=y
CONFIG_TASK_DELAY_ACCT=y
# CONFIG_TASK_XACCT is not set
# CONFIG_USER_NS is not set
# CONFIG_AUDIT is not set
# CONFIG_IKCONFIG is not set
CONFIG_LOG_BUF_SHIFT=17
# CONFIG_CGROUPS is not set
CONFIG_FAIR_GROUP_SCHED=y
CONFIG_FAIR_USER_SCHED=y
# CONFIG_FAIR_CGROUP_SCHED is not set
CONFIG_SYSFS_DEPRECATED=y
CONFIG_RELAY=y
CONFIG_BLK_DEV_INITRD=y
CONFIG_INITRAMFS_SOURCE=
CONFIG_CC_OPTIMIZE_FOR_SIZE=y
CONFIG_SYSCTL=y
# CONFIG_EMBEDDED is not set
CONFIG_UID16=y
CONFIG_SYSCTL_SYSCALL=y
CONFIG_KALLSYMS=y
# CONFIG_KALLSYMS_ALL is not set
CONFIG_KALLSYMS_EXTRA_PASS=y
CONFIG_HOTPLUG=y
CONFIG_PRINTK=y
CONFIG_BUG=y
CONFIG_ELF_CORE=y
CONFIG_BASE_FULL=y
CONFIG_FUTEX=y
CONFIG_ANON_INODES=y
CONFIG_EPOLL=y
CONFIG_SIGNALFD=y
CONFIG_EVENTFD=y
CONFIG_SHMEM=y
CONFIG_VM_EVENT_COUNTERS=y
CONFIG_SLUB_DEBUG=y
# CONFIG_SLAB is not set
CONFIG_SLUB=y
# CONFIG_SLOB is not set
CONFIG_RT_MUTEXES=y
# CONFIG_TINY_SHMEM is not set
CONFIG_BASE_SMALL=0
CONFIG_MODULES=y
CONFIG_MODULE_UNLOAD=y
# CONFIG_MODULE_FORCE_UNLOAD is not set
CONFIG_MODVERSIONS=y
CONFIG_MODULE_SRCVERSION_ALL=y
CONFIG_KMOD=y
CONFIG_STOP_MACHINE=y
CONFIG_BLOCK=y
# CONFIG_LBD is not set
# CONFIG_BLK_DEV_IO_TRACE is not set
# CONFIG_LSF is not set
CONFIG_BLK_DEV_BSG=y

#
# IO Schedulers
#
CONFIG_IOSCHED_NOOP=y
CONFIG_IOSCHED_AS=y
CONFIG_IOSCHED_DEADLINE=y
CONFIG_IOSCHED_CFQ=y
# CONFIG_DEFAULT_AS is not set
# CONFIG_DEFAULT_DEADLINE is not set
CONFIG_DEFAULT_CFQ=y
# CONFIG_DEFAULT_NOOP is not set
CONFIG_DEFAULT_IOSCHED=cfq
CONFIG_PREEMPT_NOTIFIERS=y

#
# Processor type and features
#
CONFIG_TICK_ONESHOT=y
CONFIG_NO_HZ=y
CONFIG_HIGH_RES_TIMERS=y
CONFIG_GENERIC_CLOCKEVENTS_BUILD=y
CONFIG_SMP=y
CONFIG_X86_PC=y
# CONFIG_X86_ELAN is not set
# CONFIG_X86_VOYAGER is not set
# CONFIG_X86_NUMAQ is not set
# CONFIG_X86_SUMMIT is not set
# CONFIG_X86_BIGSMP is not set
# CONFIG_X86_VISWS is not set
# CONFIG_X86_GENERICARCH is not set
# CONFIG_X86_ES7000 is not set
CONFIG_SCHED_NO_NO_OMIT_FRAME_POINTER=y
CONFIG_PARAVIRT=y
CONFIG_PARAVIRT_GUEST=y
CONFIG_XEN=y
CONFIG_VMI=y
# CONFIG_M386 is not set
# CONFIG_M486 is not set
# CONFIG_M586 is not set
# CONFIG_M586TSC is not set
# CONFIG_M586MMX is not set
# CONFIG_M686 is not set
# CONFIG_MPENTIUMII is not set
# CONFIG_MPENTIUMIII is not set
CONFIG_MPENTIUMM=y
# CONFIG_MCORE2 is not set
# CONFIG_MPENTIUM4 is not set
# CONFIG_MK6 is not set
# CONFIG_MK7 is not set
# CONFIG_MK8 is not set
# CONFIG_MCRUSOE is not set
# CONFIG_MEFFICEON is not set
# CONFIG_MWINCHIPC6 is not set
# CONFIG_MWINCHIP2 is not set
# CONFIG_MWINCHIP3D is not set
# CONFIG_MGEODEGX1 is not set
# CONFIG_MGEODE_LX is not set
# CONFIG_MCYRIXIII is not set
# CONFIG_MVIAC3_2 is not set
# CONFIG_MVIAC7 is not set
# CONFIG_X86_GENERIC is not set
CONFIG_X86_CMPXCHG=y
CONFIG_X86_L1_CACHE_SHIFT=6
CONFIG_X86_XADD=y
CONFIG_RWSEM_XCHGADD_ALGORITHM=y
# CONFIG_ARCH_HAS_ILOG2_U32 is not set
# CONFIG_ARCH_HAS_ILOG2_U64 is not set
CONFIG_GENERIC_CALIBRATE_DELAY=y
CONFIG_X86_WP_WORKS_OK=y
CONFIG_X86_INVLPG=y
CONFIG_X86_BSWAP=y
CONFIG_X86_POPAD_OK=y
CONFIG_X86_GOOD_APIC=y
CONFIG_X86_INTEL_USERCOPY=y
CONFIG_X86_USE_PPRO_CHECKSUM=y
CONFIG_X86_TSC=y
CONFIG_X86_CMOV=y
CONFIG_X86_MINIMUM_CPU_FAMILY=4
CONFIG_HPET_TIMER=y
CONFIG_NR_CPUS=4
# CONFIG_SCHED_SMT is not set
CONFIG_SCHED_MC=y
# CONFIG_PREEMPT_NONE is not set
# CONFIG_PREEMPT_VOLUNTARY is not set
CONFIG_PREEMPT=y
CONFIG_PREEMPT_BKL=y
CONFIG_X86_LOCAL_APIC=y
CONFIG_X86_IO_APIC=y
CONFIG_X86_MCE=y
CONFIG_X86_MCE_NONFATAL=y
CONFIG_X86_MCE_P4THERMAL=y

Re: [kvm-devel] include files for kvmclock

2007-11-06 Thread Jeremy Fitzhardinge
Glauber de Oliveira Costa wrote:
 This patch introduces the include files for kvm clock.
 They'll be needed for both guest and host part.

 Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
 ---
  include/asm-x86/kvm_para.h |   23 +++
  include/linux/kvm.h|1 +
  include/linux/kvm_para.h   |   20 
  3 files changed, 44 insertions(+), 0 deletions(-)

 diff --git a/include/asm-x86/kvm_para.h b/include/asm-x86/kvm_para.h
 index c6f3fd8..af9fb75 100644
 --- a/include/asm-x86/kvm_para.h
 +++ b/include/asm-x86/kvm_para.h
 @@ -10,15 +10,38 @@
   * paravirtualization, the appropriate feature bit should be checked.
   */
  #define KVM_CPUID_FEATURES   0x4001
 +#define KVM_FEATURE_CLOCKEVENTS 0
 +#define KVM_FEATURE_CLOCKSOURCE 1
 +
  
  #ifdef __KERNEL__
  #include asm/processor.h
 +extern void kvmclock_init(void);
 +
 +union kvm_hv_clock {
   

Why two copies of this structure?

 + struct {
 + u64 tsc_mult;
 + u64 now_ns;
 + /* That's the wall clock, not the water closet */
 + u64 wc_sec;
 + u64 wc_nsec;
 + u64 last_tsc;
 + /* At first, we could use the tsc value as a marker, but Jeremy
 +  * well noted that it will cause us locking problems in 32-bit
 +  * sys, so we have a special version field */
 + u32 version;
 + };
 + char page_align[PAGE_SIZE];
 +};
 +
   

[...]

 +
 +union kvm_hv_clock {
 + struct {
 + u64 tsc_mult;
 + u64 now_ns;
 + /* That's the wall clock, not the water closet */
 + u64 wc_sec;
 + u64 wc_nsec;
 + u64 last_tsc;
 + /* At first, we could use the tsc value as a marker, but Jeremy
 +  * well noted that it will cause us locking problems in 32-bit
 +  * sys, so we have a special version field */
 + u32 version;
 + };
 + char page_align[PAGE_SIZE];
 +};
 +
  /*
   * hypercalls use architecture specific
   */
   


-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] include files for kvmclock

2007-11-06 Thread Jeremy Fitzhardinge
Avi Kivity wrote:
 Glauber de Oliveira Costa wrote:
   
 +union kvm_hv_clock {
 +   struct {
 +   u64 tsc_mult;
 +   u64 now_ns;
 +   /* That's the wall clock, not the water closet */
 +   u64 wc_sec;
 +   u64 wc_nsec;
 
   

 Do we really need 128-bit time?  you must be planning to live forever.
   

Well, he's planning on having lots of very small nanoseconds.

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

2007-11-01 Thread Jeremy Fitzhardinge
Glauber de Oliveira Costa wrote:
 I in fact have seen bugs with mixed reads and writes to the same cr,
 (cr4), but adding the volatile
 flag to the read function seemed to fix it.

Well, volatile will make a read be repeated rather than caching the
previous value, but it has no effect on ordering.

 Yet, I agree with you that
 the theorectical problem exists for the reorder, and your proposed fix
 seems fine (although if we're really desperate about memory usage, we
 can use a char instead a int and save 3 bytes!)

Sure.  Ideally the compiler would never even generate a reference to it,
and it could just be extern, but in practice the compiler will generate
references sometimes.

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

2007-11-01 Thread Jeremy Fitzhardinge
Keir Fraser wrote:
 volatile prevents the asm from being 'moved significantly', according to the
 gcc manual. I take that to mean that reordering is not allowed.
   

That phrase doesn't appear in the gcc manual; in fact, it specifically
says that reordering can happen:

The `volatile' keyword indicates that the instruction has important
side-effects.  GCC will not delete a volatile `asm' if it is reachable.
(The instruction can still be deleted if GCC can prove that
control-flow will never reach the location of the instruction.)  Note
that even a volatile `asm' instruction can be moved relative to other
code, including across jump instructions.  For example, on many targets
there is a system register which can be set to control the rounding
mode of floating point operations.  You might try setting it with a
volatile `asm', like this PowerPC example:

asm volatile(mtfsf 255,%0 : : f (fpenv));
sum = x + y;

This will not work reliably, as the compiler may move the addition back
before the volatile `asm'.  To make it work you need to add an
artificial dependency to the `asm' referencing a variable in the code
you don't want moved, for example:

 asm volatile (mtfsf 255,%1 : =X(sum): f(fpenv));
 sum = x + y;

I take from this that it is not a good idea to assume asm volatile has
any ordering effects at all.

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 3/16] read/write_crX, clts and wbinvd for 64-bit paravirt

2007-10-31 Thread Jeremy Fitzhardinge
Glauber de Oliveira Costa wrote:
 This patch introduces, and patch callers when needed, native
 versions for read/write_crX functions, clts and wbinvd.

 Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
 Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
 Acked-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
 ---
  arch/x86/mm/pageattr_64.c   |3 +-
  include/asm-x86/system_64.h |   60 ++
  2 files changed, 45 insertions(+), 18 deletions(-)

 diff --git a/arch/x86/mm/pageattr_64.c b/arch/x86/mm/pageattr_64.c
 index c40afba..59a52b0 100644
 --- a/arch/x86/mm/pageattr_64.c
 +++ b/arch/x86/mm/pageattr_64.c
 @@ -12,6 +12,7 @@
  #include asm/processor.h
  #include asm/tlbflush.h
  #include asm/io.h
 +#include asm/paravirt.h
  
  pte_t *lookup_address(unsigned long address)
  { 
 @@ -77,7 +78,7 @@ static void flush_kernel_map(void *arg)
  much cheaper than WBINVD. */
   /* clflush is still broken. Disable for now. */
   if (1 || !cpu_has_clflush)
 - asm volatile(wbinvd ::: memory);
 + wbinvd();
   else list_for_each_entry(pg, l, lru) {
   void *adr = page_address(pg);
   clflush_cache_range(adr, PAGE_SIZE);
 diff --git a/include/asm-x86/system_64.h b/include/asm-x86/system_64.h
 index 4cb2384..b558cb2 100644
 --- a/include/asm-x86/system_64.h
 +++ b/include/asm-x86/system_64.h
 @@ -65,53 +65,62 @@ extern void load_gs_index(unsigned);
  /*
   * Clear and set 'TS' bit respectively
   */
 -#define clts() __asm__ __volatile__ (clts)
 +static inline void native_clts(void)
 +{
 + asm volatile (clts);
 +}
  
 -static inline unsigned long read_cr0(void)
 -{ 
 +static inline unsigned long native_read_cr0(void)
 +{
   unsigned long cr0;
   asm volatile(movq %%cr0,%0 : =r (cr0));
   return cr0;
  }
   

This is a pre-existing bug, but it seems to me that these read/write crX
asms should have a constraint to stop the compiler from reordering them
with respect to each other.  The brute-force approach would be to add
memory clobbers, but the subtle fix would be to add a variable which
is only used to sequence:

static int __cr_seq;

static inline unsigned long native_read_cr0(void)
{
unsigned long cr0;
asm volatile(mov %%cr0, %0 : =r (cr0), =m (__cr_seq));
}

static inline void native_write_cr0(unsigned long val)
{
asm volatile(mov %1, %%cr0 : +m (__cr_seq) : r (val));
}


J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 11/16] turn priviled operation into a macro in head_64.S

2007-10-31 Thread Jeremy Fitzhardinge
Glauber de Oliveira Costa wrote:
 under paravirt, read cr2 cannot be issued directly anymore.
 So wrap it in a macro, defined to the operation itself in case
 paravirt is off, but to something else if we have paravirt
 in the game
   

Is this actually needed?  It's only used in the early fault handler in
head_64.S.  Will we be taking that path in the paravirt case?  If so,
should we disable the fault handler altogether, since the hypervisor can
probably provide better diagnositcs.

J
 Signed-off-by: Glauber de Oliveira Costa [EMAIL PROTECTED]
 Signed-off-by: Steven Rostedt [EMAIL PROTECTED]
 Acked-by: Jeremy Fitzhardinge [EMAIL PROTECTED]
 ---
  arch/x86/kernel/head_64.S |9 -
  1 files changed, 8 insertions(+), 1 deletions(-)

 diff --git a/arch/x86/kernel/head_64.S b/arch/x86/kernel/head_64.S
 index b6167fe..c31b1c9 100644
 --- a/arch/x86/kernel/head_64.S
 +++ b/arch/x86/kernel/head_64.S
 @@ -19,6 +19,13 @@
  #include asm/msr.h
  #include asm/cache.h
  
 +#ifdef CONFIG_PARAVIRT
 +#include asm/asm-offsets.h
 +#include asm/paravirt.h
 +#else
 +#define GET_CR2_INTO_RCX movq %cr2, %rcx
 +#endif
 +
  /* we are not able to switch in one step to the final KERNEL ADRESS SPACE
   * because we need identity-mapped pages.
   *
 @@ -267,7 +274,7 @@ ENTRY(early_idt_handler)
   xorl %eax,%eax
   movq 8(%rsp),%rsi   # get rip
   movq (%rsp),%rdx
 - movq %cr2,%rcx
 + GET_CR2_INTO_RCX
   leaq early_idt_msg(%rip),%rdi
   call early_printk
   cmpl $2,early_recursion_flag(%rip)
   


-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] raise tsc clocksource rating

2007-10-29 Thread Jeremy Fitzhardinge
Ingo Molnar wrote:
 * Zachary Amsden [EMAIL PROTECTED] wrote:

   
 On Mon, 2007-10-29 at 20:10 -0300, Glauber de Oliveira Costa wrote:
 
 From: Glauber de Oliveira Costa [EMAIL PROTECTED]

 tsc is very good time source (when it does not have drifts, does not
 change it's frequency, i.e. when it works), so it should have its rating
 raised to a value greater than, or equal 400.

 Since it's being a tendency among paravirt clocksources to use values
 around 400, we should declare tsc as even better: So we use 500.
   
 Why is the TSC better than a paravirt clocksource?  In our case this 
 is definitely inaccurate.  Paravirt clocksources should be preferred 
 to TSC, and both must be made available in hardware for platforms 
 which do not support paravirt.
 

 if it's inaccurate why are you exposing it to the guest then? Native 
 only uses the TSC if it's safe and accurate to do so.
   

It is used as part of the Xen clocksource as a short term extrapolator,
with correction parameters supplied by the hypervisor.  It should never
be used directly.

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] raise tsc clocksource rating

2007-10-29 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
 On Mon, 2007-10-29 at 20:10 -0300, Glauber de Oliveira Costa wrote:
   
 From: Glauber de Oliveira Costa [EMAIL PROTECTED]

 tsc is very good time source (when it does not have drifts, does not
 change it's frequency, i.e. when it works), so it should have its rating
 raised to a value greater than, or equal 400.

 Since it's being a tendency among paravirt clocksources to use values
 around 400, we should declare tsc as even better: So we use 500.
 

 Why is the TSC better than a paravirt clocksource?  In our case this is
 definitely inaccurate.  Paravirt clocksources should be preferred to
 TSC, and both must be made available in hardware for platforms which do
 not support paravirt.

 Also, please cc all the paravirt developers on things related to
 paravirt, especially things with such broad effect.  I think 400 is a
 good value for a perfect native clocksource.  400 should be reserved
 for super-real (i.e. paravirt) sources that should always be chosen over
 a hardware realistic implementation in a virtual environment.
   

Yes, agreed.  The tsc is never the right thing to use if there's a
paravirt clocksource available.

What's wrong with rating it 300?  What inferior clocksource does it lose
out to?  Shouldn't that clocksource be lowered?  (Why don't we just use
1 to 10?)

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] raise tsc clocksource rating

2007-10-29 Thread Jeremy Fitzhardinge
Ingo Molnar wrote:
 that's totally broken then. You cannot create an SMP-safe monotonic 
 clocksource via interpolation - native does not do it either. Good thing 
 this problem got exposed, it needs to be fixed.
   

Sigh, I don't really want to have this fight again.

I don't really see what point there is in raising the tsc's rating (why
is 300 insufficient again?), but regardless of the value, the Xen
clocksource rating needs to be higher.

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] raise tsc clocksource rating

2007-10-29 Thread Jeremy Fitzhardinge
Ingo Molnar wrote:
 i dont remember us having discussed this before, ever. If there's any 
 fight about monotonicity and SMP then it would be a pretty onesided 
 affair, with you being beaten up seriously ;-)
   

This is part of Xen's ABI, so it isn't easily changed.  You're right
that getting monotonicity is a pretty ugly affair of doing something
like if (now  previous_return) return ++previous_return;, but its
better than using the tsc.

(Last time around, it was Xen can't use the tsc because it can't ever
be used as a stable timebase - though I don't think that was you
specifically.)

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC] Paravirt timer for KVM

2007-10-15 Thread Jeremy Fitzhardinge
Carsten Otte wrote:
 On s390, we've had an instruction...

Yes, quite ;)

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [RFC] Paravirt timer for KVM

2007-10-12 Thread Jeremy Fitzhardinge
Glauber de Oliveira Costa wrote:
 My next TODOs with it are:
 * Get SMP working
 * Try something for stolen time, as jeremy's last suggestion for anthony's 
 patch
 * Measure the time it takes for a hypercall, and subtract this time
 for calculating the expiry time for the timer event.
   

I don't think there's much point in trying to do stuff like this.  The
guest can be preempted at any time, so there's an arbitrary amount of
time between deciding to set a timeout, and the time the timeout
actually happens.

In theory you can mitigate this by using an absolute rather than
relative timeout value, but in practice I don't think it makes much
difference.

 +
 +/*
 + * This is our read_clock function. The host puts an tsc timestamp each time
 + * it updates a new time, and then we can use it to derive a slightly more
 + * precise notion of elapsed time, converted to nanoseconds.
 + *
 + * If the platform provides a stable tsc, we just use it, and there is no 
 need
 + * for the host to update anything.


How would you deal with suspend/resume/migrate?  Also, do you assume
that stable_tsc also means synchronized tsc on an SMP host?

 + */
 +static cycle_t kvm_clock_read(void) {
 +
 + u64 delta, last_tsc;
 + struct timespec *now;
 +
 + if (hv_clock.stable_tsc) {
 + rdtscll(last_tsc);
 + return last_tsc;
   

So this returns a tsc here? 

 + }
 +
 + do {
 + last_tsc = hv_clock.last_tsc;
 + rmb();
 + now = hv_clock.now;
   

Shouldn't this be taking a copy of now, rather than a pointer to it? 
Otherwise what's the point of this loop?

 + rmb();
 + } while (hv_clock.last_tsc != last_tsc);
   

This won't be an atomic compare on 32-bit; it could get confused by
seeing a half-updated tsc value.

 +
 + delta = native_read_tsc() - last_tsc;
 + delta = (delta * hv_clock.tsc_mult)  KVM_SCALE;
 +
 + return (cycle_t)now-tv_sec * NSEC_PER_SEC + now-tv_nsec + delta;
   

--- But returns ns here?

 +}
 +
 +static void kvm_timer_set_mode(enum clock_event_mode mode,
 + struct clock_event_device *evt)
 +{
 + WARN_ON(!irqs_disabled());
 +
 + switch (mode) {
 + case CLOCK_EVT_MODE_ONESHOT:
 + /* this is what we want */
 + break;
 + case CLOCK_EVT_MODE_RESUME:
 + break;
 + case CLOCK_EVT_MODE_PERIODIC:
 + WARN_ON(1);
 + break;
 + case CLOCK_EVT_MODE_UNUSED:
 + case CLOCK_EVT_MODE_SHUTDOWN:
 + kvm_hypercall0(KVM_HCALL_STOP_ONESHOT);
 + break;
 + default:
 + break;
 + }
 +}
 +
 +/*
 + * Programming the next event is just a matter of asking the host
 + * to generate us an interrupt when the time expires. We pass the
 + * delta on, and hypervisor will do all remaining tricks. For a more
 + * precise timing, we can just subtract the time spent by the hypercall
   

Not worthwhile.  It would be better to make the hypercall take an
absolute time, and pass it now+delta.  At least then if you get
preempted past the timeout period you can return -ETIME, and the clock
subsystem will know what to do.

 + */
 +static int kvm_timer_next_event(unsigned long delta,
 + struct clock_event_device *evt)
 +{
 + WARN_ON(evt-mode != CLOCK_EVT_MODE_ONESHOT);
 + kvm_hypercall1(KVM_HCALL_SET_ALARM, delta);
 + return 0;
 +}
 +
 diff --git a/arch/i386/kernel/setup.c b/arch/i386/kernel/setup.c
 index d474cd6..fd758f9 100644
 --- a/arch/i386/kernel/setup.c
 +++ b/arch/i386/kernel/setup.c
 @@ -46,6 +46,7 @@
  #include linux/crash_dump.h
  #include linux/dmi.h
  #include linux/pfn.h
 +#include linux/kvm_para.h
  
  #include video/edid.h
  
 @@ -579,6 +580,9 @@ void __init setup_arch(char **cmdline_p)
   vmi_init();
  #endif
  
 +#ifdef CONFIG_KVM_CLOCK
 + kvmclock_init();
 +#endif
   

Why is this necessary?  Can't you hook one of the existing pvops?


J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now  http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure

2007-09-17 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 Nakajima, Jun wrote:
 I don't understand the purpose of returning the max leaf.  Who is that
 information useful for?
 

 Well, this is the key info to the user of CPUID. It tells which leaves
 are valid to use. Otherwise, the user cannot tell whether the results of
 CPUID.0x400N are valid or not (i.e. junk). BTW, this is what we are
 doing on the native (for the leaf 0, 0x8000, for example). The fact
 that Xen returns 0x4002 means it only uses 3 leaves today.   

 Then it's just a version ID.  You pretty much have to treat it as a
 version id because if it returns 0x4000 0003 and you only know what
 0002 is, then you can't actually use it.

Yeah.  It's the way all the other cpuid leaf/level stuff works, so it's
reasonable to do the same thing here.  The question it helps answer is
I understand leaf 33, does the [v]CPU?.

 I much prefer the current use of CPUID in KVM.  If 1000 returns the
 KVM signature, then 1001 *must* be valid and contain a set of feature
 bits.  If we wish to use additional CPUID leaves in the future, then
 we can just use a feature bit.  The real benefit to us is that we can
 use a discontiguous set of leaves whereas the Xen approach is forced
 to use a linear set (at least for the result to be meaningful).

Well, its also what the CPU itself does.  The feature bits tend to
relate to specific CPU features rather than CPUID instruction leaves. 
The features themselves may also have corresponding leaves, but that's
secondary.  IOW, if feature bit X is set, it may use leaf 0x4000101f,
but that doesn't mean leaves 0x40001001-1f are necessarily defined.

 I'm starting to lean toward just using .  If for no other reason
 than the hypercall space is unsharable.

Well, it could be, but it would take affirmative action on the guest's
part.  If there's feature bits for each supported hypercall interface,
then you could have a magic MSR to select which interface you want to
use now.  That would allow a generic-interface-using guest to probe for
the generic interface at cpuid leaf 0x40001000, use 40001001 to
determine whether the hypercall interface is available, 4000100x to find
the base of the magic msrs, and write appropriate msr to set the desired
hypercall style (and all this can be done without using vmcall, so it
doesn't matter that hypercall interface is initially established).

J

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure

2007-09-14 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 This patch refactors the current hypercall infrastructure to better support 
 live
 migration and SMP.  It eliminates the hypercall page by trapping the UD
 exception that would occur if you used the wrong hypercall instruction for the
 underlying architecture and replacing it with the right one lazily.
   

I guess it would be pretty rude/unlikely for these opcodes to get reused
in other implementations...  But couldn't you make the page trap
instead, rather than relying on an instruction fault?

 It also introduces the infrastructure to probe for hypercall available via
 CPUID leaves 0x4002.  CPUID leaf 0x4003 should be filled out by
 userspace.
   

Is this compatible with Xen's (and other's) use of cpuid?  That is,
0x4000 returns a hypervisor-specific signature in e[bcd]x, and eax
has the max hypervisor leaf.

J

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure

2007-09-14 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 The whole point of using the instruction is to allow hypercalls to be
 used in many locations.  This has the nice side effect of not
 requiring a central hypercall initialization routine in the guest to
 fetch the hypercall page.  A PV driver can be completely independent
 of any other code provided that it restricts itself to it's hypercall
 namespace.

I see.  So you take the fault, disassemble the instruction, see that its
another CPU's vmcall instruction, and then replace it with the current
CPU's vmcall?

 Xen is currently using 0/1/2.  I had thought it was only using 0/1. 
 The intention was not to squash Xen's current CPUID usage so that it
 would still be possible for Xen to make use of the guest code.  Can we
 agree that Xen won't squash leaves 3/4 or is it not worth trying to be
 compatible at this point?

No, the point is that you're supposed to work out which hypervisor it is
from the signature in leaf 0, and then the hypervisor can put anything
it wants in the other leaves.

J

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure

2007-09-14 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 Yeah, see, the initial goal was to make it possible to use the KVM
 paravirtualizations on other hypervisors.  However, I don't think this
 is really going to be possible in general so maybe it's better to just
 use leaf 0.  I'll let others chime in before sending a new patch.

Hm.  Obviously you can just define a signature for kvm-compatible
hypercall interface and make it common that way, but it gets tricky if
the hypervisor supports multiple hypercall interfaces, including the kvm
one.  Start the kvm leaves at 0x40001000 or something?

J

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure

2007-09-14 Thread Jeremy Fitzhardinge
Nakajima, Jun wrote:
 Today, 3 CPUID leaves starting from 0x4000_ are defined in a generic
 fashion (hypervisor detection, version, and hypercall page), and those
 are the ones used by Xen today. We should extend those leaves (e.g.
 starting from 0x4000_0003) for the vmm-independent features as well.

 If Xen needs additional Xen-specific features, we need to allocate some
 leaves for those (e.g. 0x4000_1000)

But the signature is XenVMMXenVMM, which isn't very generic.  If we're
presenting a generic interface, it needs to have a generic signature,
otherwise guests will need to have a list of all hypervisor signatures
supporting their interface.  Since 0x4000 has already been
established as the base leaf of the hypervisor-specific interfaces, the
generic interface will have to be elsewhere.

J

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] Refactor hypercall infrastructure

2007-09-14 Thread Jeremy Fitzhardinge
Nakajima, Jun wrote:
 The hypervisor detection machanism is generic, and the signature
 returned is implentation specific. Having a list of all hypervisor
 signatures sounds fine to me as we are detecting vendor-specific
 processor(s) in the native. And I don't expect the list is large. 

   

I'm confused about what you're proposing.  I was thinking that a kernel
looking for the generic hypervisor interface would check for a specific
signature at some cpuid leaf, and then go about using it from there.  If
not, how does is it supposed to detect the generic hypervisor interface?

J

-
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse012070mrt/direct/01/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH][RFC]: pte notifiers -- support for external page tables

2007-09-06 Thread Jeremy Fitzhardinge
Avi Kivity wrote:
 It is, but the hooks are in much the same places.  It could be argued
 that you'd embed pte notifiers in paravirt_ops for a host kernel, but
 that's not doable because pte notifiers use higher-level data
 strutures (like vmas).

Also, I wouldn't like to preclude the possibility of having a kernel
that's both a guest and a host (ie, nested vmms).

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now   http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/4] Virtual Machine Time Accounting

2007-08-20 Thread Jeremy Fitzhardinge
Laurent Vivier wrote:
 functionnalities:

 - allow to measure time spent by a CPU in a virtual CPU.
 - allow to display in /proc/state this value by CPU
 - allow to display in /proc/pid/state this value by process
 - allow KVM to use these 3 previous functionnalities
   

So, currently time spent in a kvm guest is accumulated as qemu-kvm
usertime, right?  Given that qemu knows when its running in qemu vs
guest context, couldn't it provide the breakdown between user and guest
time (ditto lguest)?

J

-
This SF.net email is sponsored by: Splunk Inc.
Still grepping through log files to find problems?  Stop.
Now Search log events and configuration files using AJAX and a browser.
Download your FREE copy of Splunk now   http://get.splunk.com/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-20 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 I don't agree that having paravirt_ops within a normal module is all
 that useful.  By the time modules can be loaded, the kernel has
 completely booted.  There should only be a handful of paravirt_ops
 implementations and they aren't large so I don't think there's a big
 size savings either.

It doesn't seem terribly valuable to me either.  But Zach is talking
about something very similar to the kvm case, where you have a fully
virtualized environment (with hardware support), but then you load a
module containing paravirtualized helpers at some late stage which makes
things more efficient but isn't required for functional correctness.

 More importantly, now device drivers for virtual devices would have a
 way to inquire into which set of paravirt-ops was loaded by having an
 official registered interface rather than an ad-hoc (if xxx_running
 == 1) mess, and now the paravirt driver modules are nicely decoupled
 from the boot-strap code and can be loaded dynamically.

 I'm not familiar with the particular problem here, but I don't think
 that driver modules should be checking to see what paravirt_ops is
 active.  Each VMM has it's own discovery mechanism (KVM and Xen are
 both based on CPUID) so that seems like a much better method to use.

I think he's referring to the xen/kvm/vmi paravirt implementation as a
driver here.  I think.

I don't know of any if (xxx_running) tests at present.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-20 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
 For a VMM which supports both full emulation and para-virtualization, 
 testing CPUID leaves is not sufficient to determine applicability of a 
 paravirt device driver.  This only indicates the presence of the 
 functionality, not the fact that the functionality has been 
 activated.  For 32-bit Xen, this might be an already assumed fact - 
 but for VMI, KVM, and HV assisted Xen, which do support guests running 
 without paravirt, you need a way to test whether the particular family 
 of paravirt has been activated - for device drivers which assume 
 paravirt semantics might well require this activation to work, or need 
 to behave differently in an unactivated environment (emulate 
 hypercalls with port I/O, for example).

paravirt_ops-style paravirtualization and paravirt device drivers are 
more or less completely orthogonal.  An unmodified OS running under hvm 
Xen can still have paravirt drivers which can detect the presence of 
Xenbus and do all the appropriate things.  It would presumably be a 
matter of loading xenbus.ko, which would probe for the presence of the 
Xen device infrastructure, and that in turn would start pulling in the 
appropriate paravirt drivers.  The state (or existence) of struct 
paravirt_ops is immaterial.

 Thus, all the paravirt drivers as modules would need to test if 
 (xen_running) or (vmi_enabled) or (kvm_active), and then all these 
 symbols need to be exported, and now you have an ad-hoc activation 
 detection system for each brand of paravirt.

No, I think not.  Normal bus/device probing should be able to deal with it.

 Better to have a standard interface, IMHO, where a paravirt-ops 
 parent module gets registered and activated, and then well defined 
 symbols to query that activation.  You also have module dependencies 
 between the parent and child which are nicely modeled with the module 
 system (xenbus dependes on xen, vmitimer depends on vmi, etc..).

To a large extent that already exists in the device model.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-20 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
 Basically, it just makes it easier on distributors and allows any old 
 kernel with paravirt-ops module support to run on any modern, new 
 hypervisor - that might not have even existed at the time the distro 
 was created.

Hey, isn't that what VMI's for? ;)

I'd been thinking about the possibility of allowing the domain builder 
to provide a new paravirt_ops implementation to the booting kernel.  It 
would be akin to a kernel module, in that its built for a specific 
kernel, but obviously run a lot earlier.  But at this point I think the 
idea is too crack-ridden to be taken seriously.

 In the case of KVM, the paravirt_ops implementation is orthogonal to 
 paravirt device drivers.  A PV device driver can happily exist even 
 if the paravirt_ops backend isn't activated.  This is assuming that 
 hypercalls aren't used btw.  If hypercalls are desirable to use, then 
 the paravirt_ops backend would have to EXPORT_GPL the hypercall 
 interface.  I imagine returning a specific errno would suffice.

 I'm mostly in agreement on that - although making dual hypercall / I/O 
 emulated drivers is a bit more work. 

Semi-paravirtualized real-hardware drivers seems like a difficult 
mishmash.  I would hope we could deal with it with a virtio-like thing.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-20 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 I've been thinking about this wrt the hypercall page in KVM.  The 
 problem is that in a model like KVM (or presumably VMI), migration 
 gets really difficult if you have anything but a trivial hypercall 
 page since the hypercall page will change after migration.

 If you cannot guarantee the guest isn't executing code within the 
 hypercall page (or in your case, doing something with paravirt_ops), 
 then you cannot safely migrate. 

Hm, you need to quiesce the kernel in some way when you do a migrate, so 
making sure it isn't in a hypercall would be just part of that.  In 
general you'd make sure all but one CPU is parked somewhere, and the 
remaining CPU is doing the suspend, right?

The tricky part for Xen in all this is how to make sure all mfn 
references are visible to the hypervisor/toolstack so they can be 
remapped; hypercall page contents are not a concern by comparison.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-20 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
 Unless you also migrate the hypercall page itself and impose migration 
 restrictions on compatible hypercall pages.

Seems unreasonable, especially if you support migration between VT and 
SVM machines.  The whole point of a hypercall page is to give you a 
point of indirection in order to hide these kinds of hardware 
differences; migrating it would defeat the purpose.

 Although I favor the guarantee that execution within the hypercall 
 page is finished - it is important for protecting against 
 non-reentrancy as well.  Think about interrupts during batching / 
 queueing operations.

Not quite sure that's specifically relevant to migration, but yes, its 
important to disable interrupts while doing the setup for a batch of 
stuff unless you want to see some surprises in your queue (and not Oh, 
yay, a puppy! kind of surprises).

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-20 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
 You only need to quiesce if you have guest-visible data-structures 
 that have details about the underlying hardware.  So Xen needs to 
 quiesce, but I don't know of any other VMM that would.

 VMI, KVM and lhype should be capable of transparent migration without 
 guest involvement. 

Sure; Xen makes the explicit design decision that suspend/resume/migrate 
are things that the guest is likely to want to have some involvement in 
if its already doing all the paravirt games.  A 
semi-kinda-paravirtualized hvm Xen guest doesn't have to worry too much 
about those kinds of things.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-20 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
 Yes, but if we want to stay with that forward compatibility story, we 
 need a way to allow paravirt device probing to be completely 
 orthogonal to paravirt-ops probing.  Either the VMware hypervisor 
 needs to NOT implement a CPUID leaf, keeping the same ROM based 
 detection, or other VMI client drivers (say, as a wild example, a KVM 
 driver running on a VMI to KVM paravirt-ops backend) need not to check 
 CPUID leaf as a condition of execution.

Yes, this is something that keeps coming up.  hpa originally floated the 
idea of reserving some PCI bus namespace as a gateway for probing for 
virtual/paravirtual devices, and Jun Nakajima proposed it again in the 
context of smart hardware which is virtualization friendly (ie, how to 
represent PCI-IOV to guests).

I'm not wildly happy about the idea of using PCI for probing for 
otherwise completely non-PCI devices, but some kind of probing mechanism 
might be nice in the general case.  Xen deals with it with Xenbus, but I 
figure I'm unlikely to convince everyone to adopt that.

 We at least would like to use a CPUID leaf for the core paravirt-ops 
 on 64-bit and get rid of the need for ROM probing in that case, which 
 would mean we either need a CPUID sub-leaf for the device model, a 
 completely identical device model, or completely orthogonal device 
 probing.

Well, cpuid leaf 0x4000 seems to be gaining currency as a 
(semi-?)formal way for hypervisors to advertise themselves, so that 
seems completely doable.

   Since there hasn't been a formal specification for how the device 
 probing should work, or, at least, I don't know all the details of how 
 device probing works for all the various hypervisors, I worry that 
 weird ad-hoc tests could trample the compatibility effort.

Yes.  That's the thinking behind using PCI as a somewhat common 
mechanism for device discovery.  s390 folks hate it, of course.

 The completely identical device model is of course ideal, but the 
 implementation and consolidation of that is a long term prospect to 
 move towards, not something that will happen immediately.  We at least 
 emulate physical hardware devices already, and will continue to need 
 drivers compatible with those models for some time.

Well, physical devices and completely emulated physical devices are 
fairly straightforward - do it like real hardware.  Its the semi-virtual 
devices which pose problems.  Either device emulations with a bit of 
performance paravirtualization sprinkled over them, or virtualization 
friendly devices which allow safe direct guest access, but need some 
paravirtual management interfaces as well.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-20 Thread Jeremy Fitzhardinge
Zachary Amsden wrote:
 If I had a gentoo install,

Yes, but then you'd be a gentoo user. ;)

 I would probably go so far as to want to recompile everything after 
 migration across CPU vendors; things like NMIs, MSRs, thermal controls 
 and sleep states are also vendor dependent and either need to be 
 emulated both ways, re-invented in a new way entirely, or just dropped.

Many of those things are meaningless for a guest to see or control.

 I don't think cross-CPU vendor hot migration is particularly 
 compelling, although it certainly is possible, the payoff doesn't seem 
 worth the implementation cost and you will find a maze of brambly 
 thorns blocking your path.

We see people trying to do it, and for good reasons.  The selling point 
of VMs is that they can install their thingy once, and it becomes 
independent from its hardware environment, to the extent that they can 
update their hardware without having to worry about its effects on their 
guests.  With a stable OS and live migration, its reasonable to consider 
a guest VM undergoing multiple hardware upgrade transitions with no 
downtime.

I think the general approach is to have a compatible-performance 
slider which disables/enables non-portable features.  Migrating between 
Intel/AMD is just a slightly more extreme point on the continuum of 
allowing people to migrate between different models within one 
manufacturer's line.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-20 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 The real trick is doing it without the guest being involved at all.  
 Right now, it won't be a problem in KVM since the hypercall page only 
 differs by a single instruction across platforms.  In the future, 
 we'll have to be smarter and wait for all VCPUs to leave the hypercall 
 page.

Well, you could just fake the whole acpi suspend/resume ;)

 The tricky part for Xen in all this is how to make sure all mfn 
 references are visible to the hypervisor/toolstack so they can be 
 remapped; hypercall page contents are not a concern by comparison.

 I don't know HVM save/resume all that well but I think it's a similar 
 model where the guest isn't involved.

Yes, the guest has little to nothing to do.

 They may have a similar issue when using PV drivers.

Hm, not sure.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 5/5] KVM: paravirt time source

2007-06-19 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 I've updated this patch and switched to using a scale/shift like Xen
 is doing, but I must admit, I don't understand how it helps adjtime. 
 I poked around a bit and it wasn't obvious.

 Why is having {mult=122, shift=22} better for adjtime than {mult=1,
 shift=0}?

I don't fully understand it myself, but I think its because adjtime
plays with the mult factor to scale the rate at which time passes.  If
the scale is 1, then it can only scale time by integer amounts. By
setting it to 2^22, then it can adjust time down to 1 part in 4 million.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 3/5] KVM: Add paravirt MMU write support

2007-06-19 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 Perhaps my grep'ing skills are weak, but I don't seem to see any. 
 Were you thinking of something in particular? 

__pte(), of course.  Sheesh.   ;)

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-18 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 Perhaps we can just print the banner before batching occurs?  Then
 it's being printed at the last possible moment.

s/batching/patching/?  Yes, that would work.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 5/5] KVM: paravirt time source

2007-06-18 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 +static cycle_t read_hyper(void)
 +{
 + struct timespec now;
 + int ret;
 +
 + ret = kvm_hypercall(KVM_HYPERCALL_GET_KTIME, (u32)now, 0, 0, 0);
 + WARN_ON(ret);
 +
 + return now.tv_nsec + now.tv_sec * (cycles_t)1e9;
   

Hm, use of FP looks pretty odd.  I guess its OK to assume the compiler
will completely remove all the FP stuff at compile time.  Or you could
use NSEC_PER_SEC.

 +}
 +
 +static struct clocksource clocksource_hyper = {
 + .name   = hyper,
 + .rating = 200,
   

We should probably standardize on this.  I guess that if you're in a
paravirt environment, and there's a paravirt clocksource, that would
always be the best clocksource to use.

 + .read   = read_hyper,
 + .mask   = CLOCKSOURCE_MASK(64),
 + .mult   = 1,
 + .shift  = 0,
   

It would be better to use a scale and shift here, so that adjtime has
something to work with when warping time.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 5/5] KVM: paravirt time source

2007-06-18 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 Jeremy Fitzhardinge wrote:
 Anthony Liguori wrote:
  
 Okay.  I may remove this patch from the patch series and attempt to
 sit down next week and work out something more complete that also
 implements stolen time accounting.
 

 Well, that's a separate problem.  clocksource.read should always return
 real time passed, so stolen time doesn't come into it.

 Right.

  
 paravirt_ops.sched_clock should take stolen time into account, but
 that's almost completely orthogonal.
   

 Except that I wanted to change the hypercall to allow querying of real
 time or available time as VMI puts it.

I see.  You could have an interface like Xen's runstate interface, which
gives you a breakdown of how long each vcpu spends in each state
(running, runnable, paused or blocked).  The sum of all of them gives
you real time, or you can use a subset to work out stolen time
(runnable+paused), busyness (blocked / (running+runnable+blocked)), etc.

 Right now, I'm relying on the PIT but it would be nice to eliminate
 that.  I'd like to move to something PV so that I can make use of
 tickless guest kernels.  I'm very open to suggestion and even more
 open to reusing other people's code :-)

A simple hypercall interface which says raise irq X after N ns is
probably the easiest way to go.  Try to avoid the pitfalls of VMI's
interface in this area, which tries to recycle existing architectural
interrupt sources for hypervisor timers, and gets into a bit of mess. 
Xen is relatively clean, but I'm not sure how it would fit into the
mostly-emulated kvm model.  If you keep things simple, most of the Xen
clockevent code could be reused with little change.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-17 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 1) Not really sure what is needed for CONFIG_PREEMPT support.  I'm not
 sure which paravirt_ops calls are actually re-entrant.

I'm not sure that has specifically come up.  The main issue is whether a
particular call can be preempted and whether that matters.  I guess the
calls which affect a particular CPU's state will generally be called in
a non-preemptable context, but I guess we can't assume that; the best
approach is to assume that each call be atomic with respect to preemption.

Things like batching must be completed with preemption disabled over the
whole batch.  I check that with BUG_ON in the Xen code.

 2) The paravirt_ops implementation is registered with
 core_initcall().  However, the paravirt_ops banner is also printed
 with core_initcall() so that fact that this works now is just the luck
 of build order.  Need a better way to initialize the KVM paravirt_ops
 backend.

Hm.  We could make the banner printing later; obviously its purely
cosmetic.  I put it as a core_initcall on the assumption that pv-ops
would be set up very early as it is with Xen and lguest, and so the
banner should print relatively early.  But if your model is that you
boot fully virtualized for a while, and then become paravirtualized
later, then it would make sense to defer banner printing until then.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 4/5] KVM: Add hypercall queue for paravirt_ops implementation

2007-06-17 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 Regards,

 Anthony Liguori
 

 Subject: [PATCH] KVM: Add hypercall queue for paravirt_ops implementation
 Author: Anthony Liguori [EMAIL PROTECTED]

 Implemented a hypercall queue that can be used when paravirt_ops lazy mode
 is enabled.  This patch enables queueing of MMU write operations and CR
 updates.  This results in about a 50% bump in kernbench performance.

 Signed-off-by: Anthony Liguori [EMAIL PROTECTED]

 diff --git a/arch/i386/kernel/kvm.c b/arch/i386/kernel/kvm.c
 index 07ce38e..4b323f1 100644
 --- a/arch/i386/kernel/kvm.c
 +++ b/arch/i386/kernel/kvm.c
 @@ -33,8 +33,10 @@ struct kvm_paravirt_state
   unsigned long cached_cr[5];
   int cr_valid[5];
  
 - struct kvm_vmca *vmca;
 + enum paravirt_lazy_mode mode;
   struct kvm_hypercall_entry *queue;
 +
 + struct kvm_vmca *vmca;
   void (*hypercall)(void);
  
   u64 vmca_gpa;
 @@ -42,17 +44,17 @@ struct kvm_paravirt_state
  
  static DEFINE_PER_CPU(struct kvm_paravirt_state *, paravirt_state);
  
 +static int do_hypercall_batching;
  static int do_mmu_write;
  static int do_cr_read_caching;
  static int do_nop_io_delay;
  static u64 msr_set_vmca;
  
 -static long kvm_hypercall(unsigned int nr, unsigned long p1,
 -   unsigned long p2, unsigned long p3,
 -   unsigned long p4)
 +static long _kvm_hypercall(struct kvm_paravirt_state *state,
 +unsigned int nr, unsigned long p1,
 +unsigned long p2, unsigned long p3,
 +unsigned long p4)
  {
 - struct kvm_paravirt_state *state
 - = per_cpu(paravirt_state, smp_processor_id());
   long ret;
  
   asm volatile(call *(%6) \n\t
 @@ -69,6 +71,55 @@ static long kvm_hypercall(unsigned int nr, unsigned long 
 p1,
   return ret;
  }
  
 +static int can_defer_hypercall(struct kvm_paravirt_state *state,
 +unsigned int nr)
 +{
 + if (state-mode == PARAVIRT_LAZY_MMU) {
 + if (nr == KVM_HYPERCALL_MMU_WRITE)
 + return 1;
 + } else if (state-mode == PARAVIRT_LAZY_CPU) {
 + if (nr == KVM_HYPERCALL_SET_CR)
 + return 1;
 + }
 +
 + return 0;
 +}
 +
 +static void _kvm_hypercall_defer(struct kvm_paravirt_state *state,
 +  unsigned int nr,
 +  unsigned long p1, unsigned long p2,
 +  unsigned long p3, unsigned long p4)
 +{
 + struct kvm_hypercall_entry *entry;
 +
 + if (state-vmca-queue_index == state-vmca-max_queue_index)
 + _kvm_hypercall(state, KVM_HYPERCALL_FLUSH, 0, 0, 0, 0);
 +
 + /* FIXME: are we preempt safe here? */
   

BUG_ON(preemptible()) would be a reasonable thing to put here to be sure.

 + entry = state-queue[state-vmca-queue_index++];
 + entry-nr = nr;
 + entry-p1 = p1;
 + entry-p2 = p2;
 + entry-p3 = p3;
 + entry-p4 = p4;
 +}
 +
 +static long kvm_hypercall(unsigned int nr, unsigned long p1,
 +   unsigned long p2, unsigned long p3,
 +   unsigned long p4)
 +{
 + struct kvm_paravirt_state *state
 + = per_cpu(paravirt_state, smp_processor_id());
   

Rather than using this here and passing state around, you could use
either x86_read/write_percpu, or get/put_cpu_var (or __get_vpu_var if
you don't need the preempt-disable).

 + long ret = 0;
 +
 + if (can_defer_hypercall(state, nr))
 + _kvm_hypercall_defer(state, nr, p1, p2, p3, p4);
 + else
 + ret = _kvm_hypercall(state, nr, p1, p2, p3, p4);
 +
 + return ret;
 +}
 +
  /*
   * No need for any IO delay on KVM
   */
 @@ -107,7 +158,9 @@ static void kvm_write_cr(int reg, unsigned long value)
   state-cr_valid[reg] = 1;
   state-cached_cr[reg] = value;
  
 - if (reg == 0)
 + if (state-mode == PARAVIRT_LAZY_CPU)
 + kvm_hypercall(KVM_HYPERCALL_SET_CR, reg, value, 0, 0);
 + else if (reg == 0)
   native_write_cr0(value);
   else if (reg == 3)
   native_write_cr3(value);
 @@ -218,6 +271,18 @@ static void kvm_pmd_clear(pmd_t *pmdp)
   kvm_mmu_write(pmdp, pmd, sizeof(pmd));
  }
  
 +static void kvm_set_lazy_mode(enum paravirt_lazy_mode mode)
 +{
 + struct kvm_paravirt_state *state
 + = per_cpu(paravirt_state, smp_processor_id());
 +
 + if (mode == PARAVIRT_LAZY_FLUSH || mode == PARAVIRT_LAZY_NONE) {
 + if (state-vmca-queue_index)
 + _kvm_hypercall(state, KVM_HYPERCALL_FLUSH, 0, 0, 0, 0);
 + }
 + state-mode = mode;
   

No, you don't want to set state-mode to LAZY_FLUSH (its not a mode,
just a action which overloads the interface).

 +}
 +
  static void paravirt_ops_setup(void)
  {
   paravirt_ops.name = KVM;
 @@ -249,6 +314,9 @@ static void 

Re: [kvm-devel] [PATCH 4/5] KVM: Add hypercall queue for paravirt_ops implementation

2007-06-17 Thread Jeremy Fitzhardinge
Jeremy Fitzhardinge wrote:
 +static int kvm_hypercall_flush(struct kvm_vcpu *vcpu)
 +{
 +struct kvm_hypercall_entry *queue;
 +struct kvm_vmca *vmca;
 +int ret = 0;
 +int i;
 +
 +queue = kmap(vcpu-queue_page);
 +vmca = kmap(vcpu-para_state_page);
   
 

 kmap_atomic?  Or why not keep them mapped all the time?
   

Oh, right, this is on the kvm side.  Still, kmap_atomic?

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH 0/5] KVM paravirt_ops implementation

2007-06-17 Thread Jeremy Fitzhardinge
Anthony Liguori wrote:
 Hi Jeremy,

 Jeremy Fitzhardinge wrote:
 Anthony Liguori wrote:
  
 1) Not really sure what is needed for CONFIG_PREEMPT support.  I'm not
 sure which paravirt_ops calls are actually re-entrant.
 

 I'm not sure that has specifically come up.  The main issue is whether a
 particular call can be preempted and whether that matters.  I guess the
 calls which affect a particular CPU's state will generally be called in
 a non-preemptable context, but I guess we can't assume that; the best
 approach is to assume that each call be atomic with respect to
 preemption.
   

 So each call would need to disable preemption?  I'm not sure that
 makes a whole lot of sense for something like CR reads/writes.  In
 fact, without passing in a cpu parameter, I'm pretty sure that those
 operations *have* to require preemption to be disabled.

Yeah, its a little unclear to me.  If you're poking at a control
register, then one presumes you've got a specific CPU's CRx in mind. 
But in the Xen code I don't care about the preemption state for control
register updates - except for write_cr3, which never makes any sense
with preemption enabled.

 For something like MMU operations, preemption really doesn't have to
 be disabled.

Unless you're batching, since the lazy_mode is inherently per-cpu state.

 Things like batching must be completed with preemption disabled over the
 whole batch.  I check that with BUG_ON in the Xen code.
   

 Right now the KVM batching requires preemption to be disabled for
 batching.

I think that's probably overkill.  I had to put a few explicit
preempt_disable/enables in the Xen code, but mostly the preempt state is
reasonable for a given operation (ie, disabled for per-cpu state
updates, enabled for memory/global state updates).

   I don't think that's a hard requirement though since we could pass
 the current batch PA as part of the flush hypercalls.

PA?

 Things are a lot easier though if we can just assume preemption is
 disabled :-)

 Are you aware of any paravirt_ops calls that are probably being called
 in the kernel with preemption enabled?

Erm, I haven't made a breakdown, but many are.  The descriptor updates
generally are, for example.  Pagetable updates could be, but are
generally done under a pagetable lock, and so are not preemptible anyway.

 I don't see a compelling reason to paravirtualize earlier although I
 also don't see a compelling reason not too.  I noticed that VMI hooks
 setup.c.  It wasn't immediately obvious why it was hooking there but
 perhaps it worthwhile to have a common hook?  I suspect VMI and KVM
 will have a similar model for startup.

Well, I was suggesting we could print the banner later rather than
forcing an earlier init.

The important part is that you set your pv_ops before patching occurs,
since that will bake the function calls into the rest of the kernel, and
it will ignore any further changes to the paravirt_ops structure.

I think Zach was originally thinking of initializing VMI much later
(even as a module load), but the subtleties of inveigling its way into
the kernel at that late stage got too complex.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [Xen-devel] [PATCH RFC 1/3] virtio infrastructure

2007-06-04 Thread Jeremy Fitzhardinge
Santos, Jose Renato G wrote:
   It seems that we will still need specific devices drivers
   for each different virtualization flavor. For example,
   we will still need to have a specific Xen netfront 
   device that talks to a backend device in dom0, using
   page grants, and other Xen specific mechanisms.
   It looks like  will still need to maintain all the virtual device 
   drivers and in addition we will now have to maintain 
   another virtualization layer. 
   

The hope is that the Xen-specific elements of the driver will be
restricted to Xen-specific things like grant tables, and the bulk of the
driver and its logic can be common.  Whether that can be achieved and
still retain the full performance/features of the entirely Xen-specific
netfront driver remains to be seen.  I haven't had a chance to look at
doing any Xen-virtio glue yet, so I'm not really sure how it will work out.

J

-
This SF.net email is sponsored by DB2 Express
Download DB2 Express C - the FREE version of DB2 express and take
control of your XML. No limits. Just data. Click to get it now.
http://sourceforge.net/powerbar/db2/
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] make KVM conform to sucky rdmsr interface

2007-03-22 Thread Jeremy Fitzhardinge
Rusty Russell wrote:
 It was actually Jeremy's paravirt cleanup patch which changed the
 calling convention of rdmsr_safe() to match rdmsr().
   

Oops, my little mind hobgoblin is getting out of control...

J

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT  business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel


Re: [kvm-devel] [PATCH] make KVM conform to sucky rdmsr interface

2007-03-22 Thread Jeremy Fitzhardinge
Andrew Morton wrote:
 Which tree are you patching??
 -

It looks like its against the previously posted Cleanup: rationalize
paravirt wrappers patch.

J

-
Take Surveys. Earn Cash. Influence the Future of IT
Join SourceForge.net's Techsay panel and you'll get the chance to share your
opinions on IT  business topics through brief surveys-and earn cash
http://www.techsay.com/default.php?page=join.phpp=sourceforgeCID=DEVDEV
___
kvm-devel mailing list
kvm-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/kvm-devel