Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock

2012-02-15 Thread Avi Kivity
On 02/14/2012 09:43 PM, Marcelo Tosatti wrote:
 Also it should not be necessary for these flushes to be inside mmu_lock
 on EPT/NPT case (since there is no write protection there). 

We do write protect with TDP, if nested virt is active.  The question is
whether we have indirect pages or not, not whether TDP is active or not
(even without TDP, if you don't enable paging in the guest, you don't
have to write protect).

 But it would
 be awkward to differentiate the unlock position based on EPT/NPT.


I would really like to move the IPI back out of the lock.

How about something like a sequence lock:


spin_lock(mmu_lock)
need_flush = write_protect_stuff();
atomic_add(&kvm->want_flush_counter, need_flush);
spin_unlock(mmu_lock);

while ((done = atomic_read(&kvm->done_flush_counter)) < (want =
atomic_read(&kvm->want_flush_counter))) {
  kvm_make_request(flush)
  atomic_cmpxchg(&kvm->done_flush_counter, done, want)
}

This (or maybe a corrected and optimized version) ensures that any
need_flush cannot pass the while () barrier, no matter which thread
encounters it first.  However it violates the "do not invent new locking
techniques" commandment.  Can we map it to some existing method?

-- 
error compiling committee.c: too many arguments to function

--
To unsubscribe from this list: send the line unsubscribe kvm in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: qemu-kvm-1.0 regression with usb tablet after live migration

2012-02-15 Thread Peter Lieven
Anyone?

Peter Lieven wrote:
 Hi,

 I recently started updating our VMs to qemu-kvm 1.0. Since then I see
 that the usb tablet device (used as a pointer device for accurate
 mouse positioning) becomes unavailable after live migration.
 If I migrate a few times, a Windows 7 VM reliably stops using
 the USB tablet and falls back to the PS/2 mouse.
 If I do the same with qemu-kvm-0.12.5 with the very same VM it's working
 fine.

 Can anyone imagine what introduced this flaw?

 Thanks,
 Peter








[Bug 42755] KVM is being extremely slow on AMD Athlon64 4000+ Dual Core 2.1GHz Brisbane

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42755





--- Comment #30 from Avi Kivity a...@redhat.com  2012-02-15 09:28:12 ---
Disable ksm, and build with debug information so we get useful information
instead of hex addresses.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


Re: AESNI and guest hosts

2012-02-15 Thread Avi Kivity
On 02/14/2012 08:18 PM, Brian Jackson wrote:
 On Tuesday, February 14, 2012 03:31:10 AM Ryan Brown wrote:
  Sorry for being a noob here. Any clues on this, anyone?
  
  On Mon, Feb 13, 2012 at 2:05 AM, Ryan Brown mp3g...@gmail.com wrote:
    Host/KVM server is running linux 3.2.4 (Debian wheezy), and the guest
    kernel is running 3.2.5. The CPU is an E3-1230, but for some reason
    it's not able to supply the guest with AES-NI. Is there a config option
    or is there something we're missing?



 I don't think it's supported to pass that functionality to the guest.


Why not?  Perhaps a new libvirt or qemu is needed.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock

2012-02-15 Thread Avi Kivity
On 02/15/2012 11:18 AM, Avi Kivity wrote:
 On 02/14/2012 09:43 PM, Marcelo Tosatti wrote:
  Also it should not be necessary for these flushes to be inside mmu_lock
  on EPT/NPT case (since there is no write protection there). 

 We do write protect with TDP, if nested virt is active.  The question is
 whether we have indirect pages or not, not whether TDP is active or not
 (even without TDP, if you don't enable paging in the guest, you don't
 have to write protect).

  But it would
  be awkward to differentiate the unlock position based on EPT/NPT.
 

 I would really like to move the IPI back out of the lock.

 How about something like a sequence lock:


 spin_lock(mmu_lock)
 need_flush = write_protect_stuff();
 atomic_add(&kvm->want_flush_counter, need_flush);
 spin_unlock(mmu_lock);

 while ((done = atomic_read(&kvm->done_flush_counter)) < (want =
 atomic_read(&kvm->want_flush_counter))) {
   kvm_make_request(flush)
   atomic_cmpxchg(&kvm->done_flush_counter, done, want)
 }

 This (or maybe a corrected and optimized version) ensures that any
 need_flush cannot pass the while () barrier, no matter which thread
 encounters it first.  However it violates the do not invent new locking
 techniques commandment.  Can we map it to some existing method?

There is no need to advance 'want' in the loop.  So we could do

/* must call with mmu_lock held */
void kvm_mmu_defer_remote_flush(kvm, need_flush)
{
  if (need_flush)
    ++kvm->flush_counter.want;
}

/* may call without mmu_lock */
void kvm_mmu_commit_remote_flush(kvm)
{
  want = ACCESS_ONCE(kvm->flush_counter.want);
  while ((done = atomic_read(&kvm->flush_counter.done)) < want) {
    kvm_make_request(flush)
    atomic_cmpxchg(&kvm->flush_counter.done, done, want)
  }
}




-- 
error compiling committee.c: too many arguments to function



Re: The way of mapping BIOS into the guest's address space

2012-02-15 Thread Cyrill Gorcunov
On Tue, Feb 14, 2012 at 11:07:08PM -0500, Kevin O'Connor wrote:
...
  hardware. Maybe we could poke someone from KVM camp for a hint?
 
 SeaBIOS has two ways to be deployed - first is to copy the image to
 the top of the first 1MB (eg, 0xe-0xf) and jump to
 0xf000:0xfff0 in 16bit mode.  The second way is to use the SeaBIOS elf
 and deploy into memory (according to the elf memory map) and jump to
 SeaBIOS in 32bit mode (according to the elf entry point).
 
 SeaBIOS doesn't really need to be in the top 4G of ram.  SeaBIOS does
 expect to have normal PC hardware devices (eg, a PIC), though many
 hardware devices can be compiled out via its kconfig interface.  The
 more interesting challenge will likely be in communicating critical
 pieces of information (eg, total memory size) into SeaBIOS.
 
 The SeaBIOS mailing list (seab...@seabios.org) is probably a better
 location for technical seabios questions.
 

Hi Kevin, thanks for the pointers. Yes, providing info back to SeaBIOS
to set up MTRRs and such (so SeaBIOS would recognize them) is
the most challenging part, I think.

Cyrill


Re: x86: kvmclock: abstract save/restore sched_clock_state (v2)

2012-02-15 Thread Avi Kivity
On 02/13/2012 05:52 PM, Marcelo Tosatti wrote:
{
  +  x86_platform.restore_sched_clock_state();
  Isn't it too early? It is scary to tell the hypervisor to write to some
  memory location and then completely replace page tables and half of the
  cpu state in __restore_processor_state. Wouldn't that have the potential
  of writing into a place that is not the restored hv_clock, while the
  restored hv_clock might still be stale?

 No, memory is copied in swsusp_arch_resume(), which happens
 before restore_processor_state. restore_processor_state() is only
 setting up registers and MTRR.


In addition, kvmclock uses physical addresses, so page table changes
don't matter.

Note we could have done this in
__save_processor_state()/__restore_processor_state() (it's just reading
and writing an MSR, like we do for MSR_IA32_MISC_ENABLE), but I think
your patch is the right way.  I'd like an ack from the x86 maintainers
though.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] BUG in pv_clock when overflow condition is detected

2012-02-15 Thread Avi Kivity
On 02/13/2012 08:20 PM, Igor Mammedov wrote:
 BUG when overflow occurs at pvclock.c:pvclock_get_nsec_offset

 u64 delta = native_read_tsc() - shadow->tsc_timestamp;

 this might happen at an attempt to read an uninitialized yet clock.
 It won't prevent stalls and hangs but at least it won't do it silently.

 Signed-off-by: Igor Mammedov imamm...@redhat.com
 ---
  arch/x86/kernel/pvclock.c |5 -
  1 files changed, 4 insertions(+), 1 deletions(-)

 diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
 index 42eb330..35a6190 100644
 --- a/arch/x86/kernel/pvclock.c
 +++ b/arch/x86/kernel/pvclock.c
 @@ -43,7 +43,10 @@ void pvclock_set_flags(u8 flags)
  
  static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
  {
 - u64 delta = native_read_tsc() - shadow->tsc_timestamp;
 + u64 delta;
 + u64 tsc = native_read_tsc();
 + BUG_ON(tsc < shadow->tsc_timestamp);
 + delta = tsc - shadow->tsc_timestamp;
   return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
  shadow->tsc_shift);

Maybe a WARN_ON_ONCE()?  Otherwise a relatively minor hypervisor bug can
kill the guest.

-- 
error compiling committee.c: too many arguments to function



Re: Q: Does linux kvm native tool support loading BIOS as the default loader now?

2012-02-15 Thread Avi Kivity
On 02/13/2012 03:35 PM, Asias He wrote:
 On 02/13/2012 12:38 PM, Pekka Enberg wrote:
  On Mon, Feb 13, 2012 at 08:14:22PM +0800, Yang Bai wrote:
  As I know, the native tool does not support loading a BIOS, so it does not
  support Windows. Is this supported now?
  If not, I may try to implement it.

 You're welcome to do so ;-). This would open the door for non-linux OS
 support in kvm tool.

Also, to load the kernel from /boot, and so allow the normal
distro kernel update mechanism to work.

-- 
error compiling committee.c: too many arguments to function



Re: vsyscall=emulate regression

2012-02-15 Thread Amit Shah
On (Tue) 14 Feb 2012 [08:26:22], Andy Lutomirski wrote:
 On Tue, Feb 14, 2012 at 4:22 AM, Amit Shah amit.s...@redhat.com wrote:
  On (Fri) 03 Feb 2012 [13:57:48], Amit Shah wrote:
  Hello,
 
  I'm booting some latest kernels on a Fedora 11 (released June 2009)
  guest.  After the recent change of default to vsyscall=emulate, the
  guest fails to boot (init segfaults).
 
  I also tried vsyscall=none, as suggested by hpa, and that fails as
  well.  Only vsyscall=native works fine.
 
  The commit that introduced the kernel parameter,
 
  3ae36655b97a03fa1decf72f04078ef945647c1a
 
  is bad too.
 
  I suggest we revert 2e57ae0515124af45dd889bfbd4840fd40fcc07d till we
  track down and fix the vsyscal=emulate case.
 
 Hi-
 
 Sorry, I lost track of this one.  I can't reproduce it, although I
 doubt I've set up the right test environment.  But this is fishy:
 
 init[1]: segfault at ff600400 ip ff600400 sp
 7fff9c8ba098 error 5
 
 Error 5, if I'm decoding it correctly, is a userspace read (i.e. not
 execute) fault.  The vsyscall emulation changes shouldn't have had any
 effect on reads there.
 
 Can you try booting the initramfs here:
 http://web.mit.edu/luto/www/linux/vsyscall_initramfs.img
 with your kernel image (i.e. qemu-kvm -kernel whatever -initrd
 vsyscall_initramfs.img -whatever_else) and seeing what happens?  It
 works for me.

This too results in a similar error.

 I'm also curious what happens if you run without kvm (i.e. straight
 qemu)

Interesting; without kvm, this does work fine.

 and what your .config on the guest kernel is.  It sounds like
 something's wrong with your fixmap, which makes me wonder if your
 qemu/kernel combo is capable of booting even a modern distro
 (up-to-date F16, say) -- the vvar page uses identical fixmap flags as
 the vsyscall page in vsyscall=emulate and vsyscall=none mode.

I didn't try a modern distro, but looks like this is enough evidence
for now to check the kvm emulator code.  I tried the same guests on a
newer kernel (Fedora 16's 3.2), and things worked fine except for
vsyscall=none, panic message below.

 What host cpu are you on and what qemu flags do you use?

$ cat /proc/cpuinfo
processor   : 0
vendor_id   : GenuineIntel
cpu family  : 6
model   : 15
model name  : Intel(R) Core(TM)2 Duo CPU E6550  @ 2.33GHz
stepping: 11
cpu MHz : 2000.000
cache size  : 4096 KB
physical id : 0
siblings: 2
core id : 0
cpu cores   : 2
apicid  : 0
initial apicid  : 0
fpu : yes
fpu_exception   : yes
cpuid level : 10
wp  : yes
flags   : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat 
pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor 
ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dts tpr_shadow vnmi 
flexpriority
bogomips: 4654.73
clflush size: 64
cache_alignment : 64
address sizes   : 36 bits physical, 48 bits virtual
power management:

  Maybe
 something is wrong with your emulator.

Yes, looks like it.  Thanks!

This is what I get with vsyscall=none; emulate and native work
fine on the 3.2 kernel on different host hardware, with the guest staying
the same:


[2.874661] debug: unmapping init memory 8167f000..818dc000
[2.876778] Write protecting the kernel read-only data: 6144k
[2.879111] debug: unmapping init memory 880001318000..88000140
[2.881242] debug: unmapping init memory 8800015a..88000160
[2.884637] init[1] vsyscall attempted with vsyscall=none 
ip:ff600400 cs:33 sp:7fff2f48fe18 ax:7fff2f48fe50 si:7fff2f48ff08 di:0
[2.888078] init[1]: segfault at ff600400 ip ff600400 sp 
7fff2f48fe18 error 15
[2.888193] Refined TSC clocksource calibration: 2691.293 MHz.
[2.892748] 
[2.895219] Kernel panic - not syncing: Attempted to kill init!


Amit


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 04:39 PM, Alexander Graf wrote:
  
  Syscalls are orthogonal to that - they're to avoid the fget_light() and to 
  tighten the vcpu/thread and vm/process relationship.

 How about keeping the ioctl interface but moving vcpu_run to a syscall then?

I dislike half-and-half interfaces even more.  And it's not like the
fget_light() is really painful - it's just that I see it occasionally in
perf top so it annoys me.

  That should really be the only thing that belongs into the fast path, right? 
 Every time we do a register sync in user space, we do something wrong. 
 Instead, user space should either

   a) have wrappers around register accesses, so it can directly ask for 
 specific registers that it needs
 or
   b) keep everything that would be requested by the register synchronization 
 in shared memory

Always-synced shared memory is a liability, since newer hardware might
introduce on-chip caches for that state, making synchronization
expensive.  Or we may choose to keep some of the registers loaded, if we
have a way to trap on their use from userspace - for example we can
return to userspace with the guest fpu loaded, and trap if userspace
tries to use it.

Is an extra syscall for copying TLB entries to user space prohibitively
expensive?

  
  , keep the rest in user space.
  
  
When a device is fully in the kernel, we have a good specification of 
   the ABI: it just implements the spec, and the ABI provides the interface 
   from the device to the rest of the world.  Partially accelerated devices 
   means a much greater effort in specifying exactly what it does.  It's 
   also vulnerable to changes in how the guest uses the device.
  
  Why? For the HPET timer register for example, we could have a simple MMIO 
  hook that says
  
on_read:
  return read_current_time() - shared_page.offset;
on_write:
  handle_in_user_space();
  
  It works for the really simple cases, yes, but if the guest wants to set up 
  one-shot timers, it fails.  

 I don't understand. Why would anything fail here? 

It fails to provide a benefit, I didn't mean it causes guest failures.

You also have to make sure the kernel part and the user part use exactly
the same time bases.

 Once the logic that's implemented by the kernel accelerator doesn't fit 
 anymore, unregister it.

Yeah.


  Also look at the PIT which latches on read.
  
  
  For IDE, it would be as simple as
  
  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
  for (i = 1; i < 7; i++) {
    register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
    register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  }
  
  and we should have reduced overhead of IDE by quite a bit already. All the 
  other 2k LOC in hw/ide/core.c don't matter for us really.
  
  
  Just use virtio.

 Just use xenbus. Seriously, this is not an answer.

Why not?  We invested effort in making it as fast as possible, and in
writing the drivers.  IDE will never, ever, get anything close to virtio
performance, even if we put all of it in the kernel.

However, after these examples, I'm more open to partial acceleration
now.  I won't ever like it though.

  
  - VGA
  - IDE
  
Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
   virtio-scsi).
  
  Because not every guest supports them. Virtio-blk needs 3rd party drivers. 
  AHCI needs 3rd party drivers on w2k3 and wxp. 

3rd party drivers are a way of life for Windows users; and the
incremental benefits of IDE acceleration are still far behind virtio.

 I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 

Cirrus or vesa should be okay for them, I don't see what we could do for
them in the kernel, or why.

 Same for virtio.
  
  Please don't do the Xen mistake again of claiming that all we care about 
  is Linux as a guest.
  
  Rest easy, there's no chance of that.  But if a guest is important enough, 
  virtio drivers will get written.  IDE has no chance in hell of approaching 
  virtio-blk performance, no matter how much effort we put into it.

 Ever used VMware? They basically get virtio-blk performance out of ordinary 
 IDE for linear workloads.

For linear loads, so should we, perhaps with greater cpu utilization.

If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
shouldn't matter.

  
  KVM's strength has always been its close resemblance to hardware.
  
  This will remain.  But we can't optimize everything.

 That's my point. Let's optimize the hot paths and be good. As long as we 
 default to IDE for disk, we should have that be fast, no?

We should make sure that we don't default to IDE.  Qemu has no knowledge
of the guest, so it can't default to virtio, but higher level tools can
and should.

  
  Well, we don't always have shadow page tables. Having hints for unmapped 
  guest memory like this is pretty tricky.
  We're currently running 

Re: [PATCH] BUG in pv_clock when overflow condition is detected

2012-02-15 Thread Igor Mammedov

On 02/15/2012 11:49 AM, Avi Kivity wrote:

On 02/13/2012 08:20 PM, Igor Mammedov wrote:

BUG when overflow occurs at pvclock.c:pvclock_get_nsec_offset

 u64 delta = native_read_tsc() - shadow->tsc_timestamp;

this might happen at an attempt to read an uninitialized yet clock.
It won't prevent stalls and hangs but at least it won't do it silently.

Signed-off-by: Igor Mammedov imamm...@redhat.com
---
  arch/x86/kernel/pvclock.c |5 -
  1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 42eb330..35a6190 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -43,7 +43,10 @@ void pvclock_set_flags(u8 flags)

  static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
  {
-   u64 delta = native_read_tsc() - shadow->tsc_timestamp;
+   u64 delta;
+   u64 tsc = native_read_tsc();
+   BUG_ON(tsc < shadow->tsc_timestamp);
+   delta = tsc - shadow->tsc_timestamp;
    return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
   shadow->tsc_shift);


Maybe a WARN_ON_ONCE()?  Otherwise a relatively minor hypervisor bug can
kill the guest.


An attempt to print from this place is not perfect, since it often leads
to a recursive call into this very function and it hangs there anyway.
But if you insist I'll re-post it with WARN_ON_ONCE.
It won't make much difference, because the guest will hang/stall due to the
overflow anyway.

If there is an intention to keep the guest functional after the event, then
maybe this patch is the way to go:
  http://www.spinics.net/lists/kvm/msg68463.html
This way the clock will be resilient to this kind of error, like the
bare-metal one is.

--
Thanks,
 Igor


Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock

2012-02-15 Thread Xiao Guangrong
On 02/15/2012 05:47 PM, Avi Kivity wrote:

 On 02/15/2012 11:18 AM, Avi Kivity wrote:
 On 02/14/2012 09:43 PM, Marcelo Tosatti wrote:
 Also it should not be necessary for these flushes to be inside mmu_lock
 on EPT/NPT case (since there is no write protection there). 

 We do write protect with TDP, if nested virt is active.  The question is
 whether we have indirect pages or not, not whether TDP is active or not
 (even without TDP, if you don't enable paging in the guest, you don't
 have to write protect).

 But it would
 be awkward to differentiate the unlock position based on EPT/NPT.


 I would really like to move the IPI back out of the lock.

 How about something like a sequence lock:


 spin_lock(mmu_lock)
 need_flush = write_protect_stuff();
 atomic_add(&kvm->want_flush_counter, need_flush);
 spin_unlock(mmu_lock);

 while ((done = atomic_read(&kvm->done_flush_counter)) < (want =
 atomic_read(&kvm->want_flush_counter))) {
   kvm_make_request(flush)
   atomic_cmpxchg(&kvm->done_flush_counter, done, want)
 }

 This (or maybe a corrected and optimized version) ensures that any
 need_flush cannot pass the while () barrier, no matter which thread
 encounters it first.  However it violates the do not invent new locking
 techniques commandment.  Can we map it to some existing method?
 
 There is no need to advance 'want' in the loop.  So we could do
 
 /* must call with mmu_lock held */
 void kvm_mmu_defer_remote_flush(kvm, need_flush)
 {
   if (need_flush)
     ++kvm->flush_counter.want;
 }
 
 /* may call without mmu_lock */
 void kvm_mmu_commit_remote_flush(kvm)
 {
   want = ACCESS_ONCE(kvm->flush_counter.want);
   while ((done = atomic_read(&kvm->flush_counter.done)) < want) {
     kvm_make_request(flush)
     atomic_cmpxchg(&kvm->flush_counter.done, done, want)
   }
 }
 


Hmm, we already have kvm->tlbs_dirty, so we can do it like this:

#define SPTE_INVALID_UNCLEAN (1ull << 63)

in the invalid-page path:
lock mmu_lock
if (spte is invalid)
kvm->tlbs_dirty |= SPTE_INVALID_UNCLEAN;
need_tlb_flush = kvm->tlbs_dirty;
unlock mmu_lock
if (need_tlb_flush)
kvm_flush_remote_tlbs()

And in the page write-protect path:
lock mmu_lock
if (it has a spte changed to readonly ||
  kvm->tlbs_dirty & SPTE_INVALID_UNCLEAN)
kvm_flush_remote_tlbs()
unlock mmu_lock

How about this?



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 12:18, Avi Kivity wrote:

 On 02/07/2012 04:39 PM, Alexander Graf wrote:
 
 Syscalls are orthogonal to that - they're to avoid the fget_light() and to 
 tighten the vcpu/thread and vm/process relationship.
 
 How about keeping the ioctl interface but moving vcpu_run to a syscall then?
 
 I dislike half-and-half interfaces even more.  And it's not like the
 fget_light() is really painful - it's just that I see it occasionally in
 perf top so it annoys me.
 
 That should really be the only thing that belongs into the fast path, right? 
 Every time we do a register sync in user space, we do something wrong. 
 Instead, user space should either
 
  a) have wrappers around register accesses, so it can directly ask for 
 specific registers that it needs
 or
  b) keep everything that would be requested by the register synchronization 
 in shared memory
 
 Always-synced shared memory is a liability, since newer hardware might
 introduce on-chip caches for that state, making synchronization
 expensive.  Or we may choose to keep some of the registers loaded, if we
 have a way to trap on their use from userspace - for example we can
 return to userspace with the guest fpu loaded, and trap if userspace
 tries to use it.
 
 Is an extra syscall for copying TLB entries to user space prohibitively
 expensive?

The copying can be very expensive, yes. We want to have the possibility of 
exposing a very large TLB to the guest, on the order of multiple thousand
entries. Every entry is a struct of 24 bytes.

 
 
 , keep the rest in user space.
 
 
 When a device is fully in the kernel, we have a good specification of the 
 ABI: it just implements the spec, and the ABI provides the interface from 
 the device to the rest of the world.  Partially accelerated devices means 
 a much greater effort in specifying exactly what it does.  It's also 
 vulnerable to changes in how the guest uses the device.
 
 Why? For the HPET timer register for example, we could have a simple MMIO 
 hook that says
 
  on_read:
return read_current_time() - shared_page.offset;
  on_write:
handle_in_user_space();
 
 It works for the really simple cases, yes, but if the guest wants to set up 
 one-shot timers, it fails.  
 
 I don't understand. Why would anything fail here? 
 
 It fails to provide a benefit, I didn't mean it causes guest failures.
 
 You also have to make sure the kernel part and the user part use exactly
 the same time bases.

Right. It's an optional performance accelerator. If anything doesn't align, 
don't use it. But if you happen to have a system where everything's cool, 
you're faster. Sounds like a good deal to me ;).

 
 Once the logic that's implemented by the kernel accelerator doesn't fit 
 anymore, unregister it.
 
 Yeah.
 
 
 Also look at the PIT which latches on read.
 
 
 For IDE, it would be as simple as
 
  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
  for (i = 1; i < 7; i++) {
    register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
    register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  }
 
 and we should have reduced overhead of IDE by quite a bit already. All the 
 other 2k LOC in hw/ide/core.c don't matter for us really.
 
 
 Just use virtio.
 
 Just use xenbus. Seriously, this is not an answer.
 
 Why not?  We invested effort in making it as fast as possible, and in
 writing the drivers.  IDE will never, ever, get anything close to virtio
 performance, even if we put all of it in the kernel.
 
 However, after these examples, I'm more open to partial acceleration
 now.  I won't ever like it though.
 
 
   - VGA
   - IDE
 
 Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
 virtio-scsi).
 
 Because not every guest supports them. Virtio-blk needs 3rd party drivers. 
 AHCI needs 3rd party drivers on w2k3 and wxp. 
 
 3rd party drivers are a way of life for Windows users; and the
 incremental benefits of IDE acceleration are still far behind virtio.

The typical way of life for Windows users are all-included drivers. Which is 
the case for AHCI, where we're getting awesome performance for Vista and above 
guests. The IDE thing was just an idea for legacy ones.

It'd be great to simply try and see how fast we could get by handling a few 
special registers in kernel space vs heavyweight exiting to QEMU. If it's only 
10%, I wouldn't even bother with creating an interface for it. I'd bet the 
benefits are a lot bigger though.

And the main point was that specific partial device emulation buys us more than 
pseudo-generic accelerators like coalesced mmio, which are also only used by 1 
or 2 devices.

 
 I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
 
 Cirrus or vesa should be okay for them, I don't see what we could do for
 them in the kernel, or why.

That's my point. You need fast emulation of standard devices to get a good 
baseline. Do PV on top, but keep the baseline as fast as is reasonable.

 
 Same for virtio.
 
 Please don't do 

Re: AESNI and guest hosts

2012-02-15 Thread Ryan Brown

 I don't think it's supported to pass that functionality to the guest.


 Why not?  Perhaps a new libvirt or qemu is needed.


Should it be the case to add one of the following?

<feature name='aes'/>
or..
<feature name='aesni'/>

something like that?
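For what it's worth, libvirt exposes CPU flags through the cpu element of the domain XML rather than as a bare top-level feature tag. A sketch of what that might look like follows; treat the model choice and whether this libvirt/qemu combination honors 'aes' as assumptions to verify:

```xml
<!-- Hypothetical fragment: request the aes flag on top of a named model.
     Verify that this libvirt/qemu version knows the 'aes' feature name. -->
<cpu match='exact'>
  <model>Westmere</model>
  <feature policy='require' name='aes'/>
</cpu>
```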

Host is using linux kernel 3.2.4 (Debian Wheezy) libvirt (0.9.8-2),
qemu (1.0+dfsg-2), Guest is on linux kernel Ubuntu/3.2.5


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/15/2012 01:57 PM, Alexander Graf wrote:
  
  Is an extra syscall for copying TLB entries to user space prohibitively
  expensive?

 The copying can be very expensive, yes. We want to have the possibility of 
 exposing a very large TLB to the guest, on the order of multiple thousand
 entries. Every entry is a struct of 24 bytes.

You don't need to copy the entire TLB, just the way that maps the
address you're interested in.

btw, why are you interested in virtual addresses in userspace at all?

  
  It works for the really simple cases, yes, but if the guest wants to set 
  up one-shot timers, it fails.  
  
  I don't understand. Why would anything fail here? 
  
  It fails to provide a benefit, I didn't mean it causes guest failures.
  
  You also have to make sure the kernel part and the user part use exactly
  the same time bases.

 Right. It's an optional performance accelerator. If anything doesn't align, 
 don't use it. But if you happen to have a system where everything's cool, 
 you're faster. Sounds like a good deal to me ;).

Depends on how much the alignment relies on guest knowledge.  I guess
with a simple device like HPET, it's simple, but with a complex device,
different guests (or different versions of the same guest) could drive
it very differently.

  
  Because not every guest supports them. Virtio-blk needs 3rd party 
  drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 
  
  3rd party drivers are a way of life for Windows users; and the
  incremental benefits of IDE acceleration are still far behind virtio.

 The typical way of life for Windows users are all-included drivers. Which is 
 the case for AHCI, where we're getting awesome performance for Vista and 
 above guests. The IDE thing was just an idea for legacy ones.

 It'd be great to simply try and see how fast we could get by handling a few 
 special registers in kernel space vs heavyweight exiting to QEMU. If it's 
 only 10%, I wouldn't even bother with creating an interface for it. I'd bet 
 the benefits are a lot bigger though.

 And the main point was that specific partial device emulation buys us more 
 than pseudo-generic accelerators like coalesced mmio, which are also only 
 used by 1 or 2 devices.

Ok.

  
  I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
  
  Cirrus or vesa should be okay for them, I don't see what we could do for
  them in the kernel, or why.

 That's my point. You need fast emulation of standard devices to get a good 
 baseline. Do PV on top, but keep the baseline as fast as is reasonable.

  
  Same for virtio.
  
  Please don't do the Xen mistake again of claiming that all we care about 
  is Linux as a guest.
  
  Rest easy, there's no chance of that.  But if a guest is important 
  enough, virtio drivers will get written.  IDE has no chance in hell of 
  approaching virtio-blk performance, no matter how much effort we put into 
  it.
  
  Ever used VMware? They basically get virtio-blk performance out of 
  ordinary IDE for linear workloads.
  
  For linear loads, so should we, perhaps with greater cpu utilization.
  
  If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
  means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
  shouldn't matter.

 *shrug* last time I checked we were a lot slower. But maybe there's more 
 stuff making things slow than the exit path ;).

One thing that's different is that virtio offloads itself to a thread
very quickly, while IDE does a lot of work in vcpu thread context.

  
  
  KVM's strength has always been its close resemblance to hardware.
  
  This will remain.  But we can't optimize everything.
  
  That's my point. Let's optimize the hot paths and be good. As long as we 
  default to IDE for disk, we should have that be fast, no?
  
  We should make sure that we don't default to IDE.  Qemu has no knowledge
  of the guest, so it can't default to virtio, but higher level tools can
  and should.

 You can only default to virtio on recent Linux. Windows, BSD, etc don't 
 include drivers, so you can't assume it works. You can default to AHCI for 
 basically any recent guest, but that still won't work for XP and the likes :(.

The all-knowing management tool can provide a virtio driver disk, or
even slip-stream the driver into the installation CD.


  
  Ah, because you're on NPT and you can have MMIO hints in the nested page 
  table. Nifty. Yeah, we don't have that luxury :).
  
  Well the real reason is we have an extra bit reported by page faults
  that we can control.  Can't you set up a hashed pte that is configured
  in a way that it will fault, no matter what type of access the guest
  does, and see it in your page fault handler?

 I might be able to synthesize a PTE that is !readable and might throw a 
 permission exception instead of a miss exception. I might be able to 
 synthesize something similar for booke. I don't however get any indication on 
 why things failed.

 So for MMIO reads, I can assume that this is an MMIO because I would never 
 write a non-readable entry. For writes, I'm overloading the bit that also 
 means guest entry is not readable so there I'd have to walk the guest 
 PTEs/TLBs and check if I find a read-only entry. Right now I can just 
 forward write faults to the guest. Since COW is probably a hotter path for 
 the guest than MMIO, this might end up being ineffective.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/12/2012 09:10 AM, Takuya Yoshikawa wrote:
 Avi Kivity a...@redhat.com wrote:

 Slot searching is quite fast since there's a small number of slots, 
and we sort the larger ones to be in the front, so positive lookups are 
fast.  We cache negative lookups in the shadow page tables (an spte can 
be either not mapped, mapped to RAM, or not mapped and known to be 
mmio) so we rarely need to walk the entire list.
  
   Well, we don't always have shadow page tables. Having hints for unmapped 
   guest memory like this is pretty tricky.
   We're currently running into issues with device assignment though, where 
   we get a lot of small slots mapped to real hardware. I'm sure that will 
   hit us on x86 sooner or later too.
  
  For x86 that's not a problem, since once you map a page, it stays mapped 
  (on modern hardware).
  

 I was once thinking about how to search a slot reasonably fast for every case,
 even when we do not have mmio-spte cache.

 One possible way I thought up was to sort slots according to their base_gfn.
 Then the problem would become:  find the first slot whose base_gfn + npages
 is greater than this gfn.

 Since we can do binary search, the search cost is O(log(# of slots)).

 But I guess that most of the time was wasted on reading many memslots just to
 know their base_gfn and npages.

 So the most practically effective thing is to make a separate array which 
 holds
 just their base_gfn.  This will make the task a simple, and cache friendly,
 search on an integer array:  probably faster than using *-tree data structure.

This assumes that there is equal probability for matching any slot.  But
that's not true, even if you have hundreds of slots, the probability is
much greater for the two main memory slots, or if you're playing with
the framebuffer, the framebuffer slot.  Everything else is loaded
quickly into shadow and forgotten.

 If needed, we should make cmp_memslot() architecture specific in the end?

We could, but why is it needed?  This logic holds for all architectures.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 05:23 PM, Anthony Liguori wrote:
 On 02/07/2012 07:40 AM, Alexander Graf wrote:

 Why? For the HPET timer register for example, we could have a simple
 MMIO hook that says

on_read:
  return read_current_time() - shared_page.offset;
on_write:
  handle_in_user_space();

 For IDE, it would be as simple as

register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
for (i = 1; i < 7; i++) {
  register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
}

 You can't easily serialize updates to that address with the kernel
 since two threads are likely going to be accessing it at the same
 time.  That either means an expensive sync operation or a reliance on
 atomic instructions.

 But not all architectures offer non-word sized atomic instructions so
 it gets fairly nasty in practice.


I doubt that any guest accesses IDE registers from two threads in
parallel.  The guest will have some lock, so we could have a lock as
well and be assured that there will never be contention.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 14:29, Avi Kivity wrote:

 On 02/15/2012 01:57 PM, Alexander Graf wrote:
 
 Is an extra syscall for copying TLB entries to user space prohibitively
 expensive?
 
 The copying can be very expensive, yes. We want to have the possibility of 
 exposing a very large TLB to the guest, in the order of multiple kentries. 
 Every entry is a struct of 24 bytes.
 
 You don't need to copy the entire TLB, just the way that maps the
 address you're interested in.

Yeah, unless we do migration in which case we need to introduce another special 
case to fetch the whole thing :(.

 btw, why are you interested in virtual addresses in userspace at all?

We need them for gdb and monitor introspection.

 
 
 It works for the really simple cases, yes, but if the guest wants to set 
 up one-shot timers, it fails.  
 
 I don't understand. Why would anything fail here? 
 
 It fails to provide a benefit, I didn't mean it causes guest failures.
 
 You also have to make sure the kernel part and the user part use exactly
 the same time bases.
 
 Right. It's an optional performance accelerator. If anything doesn't align, 
 don't use it. But if you happen to have a system where everything's cool, 
 you're faster. Sounds like a good deal to me ;).
 
 Depends on how much the alignment relies on guest knowledge.  I guess
 with a simple device like HPET, it's simple, but with a complex device,
 different guests (or different versions of the same guest) could drive
 it very differently.

Right. But accelerating simple devices > not accelerating any devices. No? :)

 
 
 Because not every guest supports them. Virtio-blk needs 3rd party 
 drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 
 
 3rd party drivers are a way of life for Windows users; and the
 incremental benefits of IDE acceleration are still far behind virtio.
 
 The typical way of life for Windows users is all-included drivers. Which is 
 the case for AHCI, where we're getting awesome performance for Vista and 
 above guests. The IDE thing was just an idea for legacy ones.
 
 It'd be great to simply try and see how fast we could get by handling a few 
 special registers in kernel space vs heavyweight exiting to QEMU. If it's 
 only 10%, I wouldn't even bother with creating an interface for it. I'd bet 
 the benefits are a lot bigger though.
 
 And the main point was that specific partial device emulation buys us more 
 than pseudo-generic accelerators like coalesced mmio, which are also only 
 used by 1 or 2 devices.
 
 Ok.
 
 
 I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
 
 Cirrus or vesa should be okay for them, I don't see what we could do for
 them in the kernel, or why.
 
 That's my point. You need fast emulation of standard devices to get a good 
 baseline. Do PV on top, but keep the baseline as fast as is reasonable.
 
 
 Same for virtio.
 
 Please don't do the Xen mistake again of claiming that all we care about 
 is Linux as a guest.
 
 Rest easy, there's no chance of that.  But if a guest is important 
 enough, virtio drivers will get written.  IDE has no chance in hell of 
 approaching virtio-blk performance, no matter how much effort we put into 
 it.
 
 Ever used VMware? They basically get virtio-blk performance out of 
 ordinary IDE for linear workloads.
 
 For linear loads, so should we, perhaps with greater cpu utilization.
 
 If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
 means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
 shouldn't matter.
 
 *shrug* last time I checked we were a lot slower. But maybe there's more 
 stuff making things slow than the exit path ;).
 
 One thing that's different is that virtio offloads itself to a thread
 very quickly, while IDE does a lot of work in vcpu thread context.

So it's all about latencies again, which could be reduced at least a fair bit 
with the scheme I described above. But really, this needs to be prototyped and 
benchmarked to actually give us data on how fast it would get us.

 
 
 
 KVM's strength has always been its close resemblance to hardware.
 
 This will remain.  But we can't optimize everything.
 
 That's my point. Let's optimize the hot paths and be good. As long as we 
 default to IDE for disk, we should have that be fast, no?
 
 We should make sure that we don't default to IDE.  Qemu has no knowledge
 of the guest, so it can't default to virtio, but higher level tools can
 and should.
 
 You can only default to virtio on recent Linux. Windows, BSD, etc don't 
 include drivers, so you can't assume it works. You can default to AHCI for 
 basically any recent guest, but that still won't work for XP and the likes 
 :(.
 
 The all-knowing management tool can provide a virtio driver disk, or
 even slip-stream the driver into the installation CD.

One management tool might do that, another one might not. We can't assume that 
all management tools are all-knowing. Sometimes you also want to run guest OSs 
that the management tool doesn't know (yet).

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 08:12 PM, Rusty Russell wrote:
  I would really love to have this, but the problem is that we'd need a
  general purpose bytecode VM with binding to some kernel APIs.  The
  bytecode VM, if made general enough to host more complicated devices,
  would likely be much larger than the actual code we have in the kernel now.

 We have the ability to upload bytecode into the kernel already.  It's in
 a great bytecode interpreted by the CPU itself.

Unfortunately it's inflexible (has to come with the kernel) and open to
security vulnerabilities.

 If every user were emulating different machines, LPF this would make
 sense.  Are they?  

They aren't.

 Or should we write those helpers once, in C, and
 provide that for them.

There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
them are quite complicated.  However implementing them in bytecode
amounts to exposing a stable kernel ABI, since they use such a vast
range of kernel services.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 06:29 PM, Jan Kiszka wrote:
 
 
  Isn't there another level in between just scheduling and full syscall
  return if the user return notifier has some real work to do?
  
  Depends on whether you're scheduling a kthread or a userspace process, no?  
  If 

 Kthreads can't return, of course. User space threads /may/ do so. And
 then there needs to be a differences between host and guest in the
 tracked MSRs. 

Right.  Until we randomize kernel virtual addresses (what happened to
that?) and then there will always be a difference, even if you run the
same kernel in the host and guest.

 I think to recall it's a question of another few hundred
 cycles.

Right.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/07/2012 06:19 PM, Anthony Liguori wrote:
 Ah. But then ioeventfd has that as well, unless the other end is in
 the kernel too.


 Yes, that was my point exactly :-)

 ioeventfd/mmio-over-socketpair to a different thread is not faster than
 a synchronous KVM_RUN + writing to an eventfd in userspace modulo a
 couple of cheap syscalls.

 The exception is when the other end is in the kernel and there is
 magic optimizations (like there is today with ioeventfd).

vhost seems to schedule a workqueue item unconditionally.

irqfd does have magic optimizations to avoid an extra schedule.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/15/2012 03:37 PM, Alexander Graf wrote:
 On 15.02.2012, at 14:29, Avi Kivity wrote:

  On 02/15/2012 01:57 PM, Alexander Graf wrote:
  
  Is an extra syscall for copying TLB entries to user space prohibitively
  expensive?
  
  The copying can be very expensive, yes. We want to have the possibility of 
  exposing a very large TLB to the guest, in the order of multiple kentries. 
  Every entry is a struct of 24 bytes.
  
  You don't need to copy the entire TLB, just the way that maps the
  address you're interested in.

 Yeah, unless we do migration in which case we need to introduce another 
 special case to fetch the whole thing :(.

Well, the scatter/gather registers I proposed will give you just one
register or all of them.

  btw, why are you interested in virtual addresses in userspace at all?

 We need them for gdb and monitor introspection.

Hardly fast paths that justify shared memory.  I should be much harder
on you.

  
  Right. It's an optional performance accelerator. If anything doesn't 
  align, don't use it. But if you happen to have a system where everything's 
  cool, you're faster. Sounds like a good deal to me ;).
  
  Depends on how much the alignment relies on guest knowledge.  I guess
  with a simple device like HPET, it's simple, but with a complex device,
  different guests (or different versions of the same guest) could drive
  it very differently.

 Right. But accelerating simple devices > not accelerating any devices. No? :)

Yes.  But introducing bugs and vulns < not introducing them.  It's a
tradeoff.  Even an unexploited vulnerability can be a lot more pain,
just because you need to update your entire cluster, than a simple
device that is accelerated for a guest which has maybe 3% utilization. 
Performance is just one parameter we optimize for.  It's easy to overdo
it because it's an easily measurable and sexy parameter, but it's a mistake.

  
  One thing that's different is that virtio offloads itself to a thread
  very quickly, while IDE does a lot of work in vcpu thread context.

 So it's all about latencies again, which could be reduced at least a fair bit 
 with the scheme I described above. But really, this needs to be prototyped 
 and benchmarked to actually give us data on how fast it would get us.

Simply making qemu issue the request from a thread would be way better. 
Something like socketpair mmio, configured for not waiting for the
writes to be seen (posted writes) will also help by buffering writes in
the socket buffer.

  
  The all-knowing management tool can provide a virtio driver disk, or
  even slip-stream the driver into the installation CD.

  One management tool might do that, another one might not. We can't assume 
  that all management tools are all-knowing. Sometimes you also want to run 
  guest OSs that the management tool doesn't know (yet).

That is true, but we have to leave some work for the management guys.

  
  So for MMIO reads, I can assume that this is an MMIO because I would never 
  write a non-readable entry. For writes, I'm overloading the bit that also 
  means guest entry is not readable so there I'd have to walk the guest 
  PTEs/TLBs and check if I find a read-only entry. Right now I can just 
  forward write faults to the guest. Since COW is probably a hotter path for 
  the guest than MMIO, this might end up being ineffective.
  
  COWs usually happen from guest userspace, while mmio is usually from the
  guest kernel, so you can switch on that, maybe.

 Hrm, nice idea. That might fall apart with user space drivers that we might 
 eventually have once vfio turns out to work well, but for the time being it's 
 a nice hack :).

Or nested virt...



-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] BUG in pv_clock when overflow condition is detected

2012-02-15 Thread Avi Kivity
On 02/15/2012 01:23 PM, Igor Mammedov wrote:
    static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time
  *shadow)
    {
  -u64 delta = native_read_tsc() - shadow->tsc_timestamp;
  +u64 delta;
  +u64 tsc = native_read_tsc();
  +BUG_ON(tsc < shadow->tsc_timestamp);
  +delta = tsc - shadow->tsc_timestamp;
    return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
   shadow->tsc_shift);

 Maybe a WARN_ON_ONCE()?  Otherwise a relatively minor hypervisor bug can
 kill the guest.


 An attempt to print from this place is not perfect since it often leads
 to recursive calls into this very function, and it hangs there anyway.
 But if you insist I'll re-post it with WARN_ON_ONCE.
 It won't make much difference because the guest will hang/stall due to
 the overflow anyway.

Won't a BUG_ON() also result in a printk?


 If there is an intention to keep the guest functional after the event then
 maybe this patch is the way to go
   http://www.spinics.net/lists/kvm/msg68463.html
 this way the clock will be resilient to this kind of error, like the
 bare-metal one is.

It's the same patch... do you mean something that detects the overflow
and uses the last value?

-- 
error compiling committee.c: too many arguments to function



Re: AESNI and guest hosts

2012-02-15 Thread Avi Kivity
On 02/15/2012 02:02 PM, Ryan Brown wrote:
 
  I don't think it's supported to pass that functionality to the guest.
 
 
  Why not?  Perhaps a new libvirt or qemu is needed.
 

 Should it be the case to add one of the following?

 <feature name='aes'/>
 or..
 <feature name='aesni'/>

 something like that?

The qemu name is aes.  Don't know about libvirt, suggest you start with
bare qemu first.



-- 
error compiling committee.c: too many arguments to function



Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock

2012-02-15 Thread Avi Kivity
On 02/15/2012 01:37 PM, Xiao Guangrong wrote:
 
  I would really like to move the IPI back out of the lock.
 
  How about something like a sequence lock:
 
 
   spin_lock(mmu_lock)
   need_flush = write_protect_stuff();
   atomic_add(&kvm->want_flush_counter, need_flush);
   spin_unlock(mmu_lock);
  
   while ((done = atomic_read(&kvm->done_flush_counter)) < (want =
   atomic_read(&kvm->want_flush_counter))) {
     kvm_make_request(flush)
     atomic_cmpxchg(&kvm->done_flush_counter, done, want)
   }
 
  This (or maybe a corrected and optimized version) ensures that any
  need_flush cannot pass the while () barrier, no matter which thread
  encounters it first.  However it violates the do not invent new locking
  techniques commandment.  Can we map it to some existing method?
  
  There is no need to advance 'want' in the loop.  So we could do
  
   /* must call with mmu_lock held */
   void kvm_mmu_defer_remote_flush(kvm, need_flush)
   {
     if (need_flush)
       ++kvm->flush_counter.want;
   }
   
   /* may call without mmu_lock */
   void kvm_mmu_commit_remote_flush(kvm)
   {
     want = ACCESS_ONCE(kvm->flush_counter.want)
     while ((done = atomic_read(&kvm->flush_counter.done)) < want) {
       kvm_make_request(flush)
       atomic_cmpxchg(&kvm->flush_counter.done, done, want)
     }
   }
  


 Hmm, we already have kvm->tlbs_dirty, so, we can do it like this:

 #define SPTE_INVALID_UNCLEAN (1 << 63)

 in invalid page path:
 lock mmu_lock
 if (spte is invalid)
   kvm->tlbs_dirty |= SPTE_INVALID_UNCLEAN;
 need_tlb_flush = kvm->tlbs_dirty;
 unlock mmu_lock
 if (need_tlb_flush)
   kvm_flush_remote_tlbs()

 And in page write-protected path:
 lock mmu_lock
   if (it has spte change to readonly ||
 kvm->tlbs_dirty & SPTE_INVALID_UNCLEAN)
   kvm_flush_remote_tlbs()
 unlock mmu_lock

 How about this?

Well, it still has flushes inside the lock.  And it seems to be more
complicated, but maybe that's because I thought of my idea and didn't
fully grok yours yet.

-- 
error compiling committee.c: too many arguments to function



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 14:57, Avi Kivity wrote:

 On 02/15/2012 03:37 PM, Alexander Graf wrote:
 On 15.02.2012, at 14:29, Avi Kivity wrote:
 
 On 02/15/2012 01:57 PM, Alexander Graf wrote:
 
 Is an extra syscall for copying TLB entries to user space prohibitively
 expensive?
 
 The copying can be very expensive, yes. We want to have the possibility of 
 exposing a very large TLB to the guest, in the order of multiple kentries. 
 Every entry is a struct of 24 bytes.
 
 You don't need to copy the entire TLB, just the way that maps the
 address you're interested in.
 
 Yeah, unless we do migration in which case we need to introduce another 
 special case to fetch the whole thing :(.
 
 Well, the scatter/gather registers I proposed will give you just one
 register or all of them.

One register is hardly any use. We either need all ways of a respective address 
to do a full-fledged lookup, or all of them. By sharing the same data structures 
between qemu and kvm, we actually managed to reuse all of the tcg code for 
lookups, just like you do for x86. On x86 you also have shared memory for page 
tables, it's just guest visible, hence in guest memory. The concept is the same.

 
 btw, why are you interested in virtual addresses in userspace at all?
 
 We need them for gdb and monitor introspection.
 
 Hardly fast paths that justify shared memory.  I should be much harder
 on you.

It was a tradeoff on speed and complexity. This way we have the least amount of 
complexity IMHO. All KVM code paths just magically fit in with the TCG code. 
There are essentially no if(kvm_enabled)'s in our MMU walking code, because the 
tables are just there. Makes everything a lot easier (without dragging down 
performance).

 
 
 Right. It's an optional performance accelerator. If anything doesn't 
 align, don't use it. But if you happen to have a system where everything's 
 cool, you're faster. Sounds like a good deal to me ;).
 
 Depends on how much the alignment relies on guest knowledge.  I guess
 with a simple device like HPET, it's simple, but with a complex device,
 different guests (or different versions of the same guest) could drive
 it very differently.
 
 Right. But accelerating simple devices > not accelerating any devices. No? :)
 
 Yes.  But introducing bugs and vulns < not introducing them.  It's a
 tradeoff.  Even an unexploited vulnerability can be a lot more pain,
 just because you need to update your entire cluster, than a simple
 device that is accelerated for a guest which has maybe 3% utilization. 
 Performance is just one parameter we optimize for.  It's easy to overdo
 it because it's an easily measurable and sexy parameter, but it's a mistake.

Yeah, I agree. That's why I was trying to get AHCI to the default storage 
adapter for a while, because I think the same. However, Anthony believes that 
XP/w2k3 is still a major chunk of the guests running on QEMU, so we can't do 
that :(.

I'm mostly trying to think of ways to accelerate the obvious low hanging 
fruits, without overengineering any interfaces.

 
 
 One thing that's different is that virtio offloads itself to a thread
 very quickly, while IDE does a lot of work in vcpu thread context.
 
 So it's all about latencies again, which could be reduced at least a fair 
 bit with the scheme I described above. But really, this needs to be 
 prototyped and benchmarked to actually give us data on how fast it would get 
 us.
 
 Simply making qemu issue the request from a thread would be way better. 
 Something like socketpair mmio, configured for not waiting for the
 writes to be seen (posted writes) will also help by buffering writes in
 the socket buffer.

Yup, nice idea. That only works when all parts of a device are actually 
implemented through the same socket though. Otherwise you could run out of 
order. So if you have a PCI device with a PIO and an MMIO BAR region, they 
would both have to be handled through the same socket.

 
 
 The all-knowing management tool can provide a virtio driver disk, or
 even slip-stream the driver into the installation CD.
 
 One management tool might do that, another one might not. We can't assume 
 that all management tools are all-knowing. Sometimes you also want to run 
 guest OSs that the management tool doesn't know (yet).
 
 That is true, but we have to leave some work for the management guys.

The easier the management stack is, the happier I am ;).

 
 
 So for MMIO reads, I can assume that this is an MMIO because I would never 
 write a non-readable entry. For writes, I'm overloading the bit that also 
 means guest entry is not readable so there I'd have to walk the guest 
 PTEs/TLBs and check if I find a read-only entry. Right now I can just 
 forward write faults to the guest. Since COW is probably a hotter path for 
 the guest than MMIO, this might end up being ineffective.
 
 COWs usually happen from guest userspace, while mmio is usually from the
 guest kernel, so you can switch on that, maybe.
 
 Hrm, nice idea. That might fall apart with user space drivers that we might 
 eventually have once vfio turns out to work well, but for the time being it's 
 a nice hack :).

Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-15 Thread Jamal Hadi Salim
On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote:

 Roopa was likely on the right track here,
 
 http://patchwork.ozlabs.org/patch/123064/

Doesn't seem related to the bridging stuff - the modeling looks
reasonable however.

 But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
 netlink messages. And if possible drive this without extending ndo_ops.
 
 An ideal user space interaction IMHO would look like,
 
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb
 port    mac addr            flags
 veth2   36:a6:35:9b:96:c4   local
 veth4   aa:54:b0:7b:42:ef   local
 veth0   2a:e8:5c:95:6c:1b   local
 veth6   6e:26:d5:43:a3:36   local
 veth0   f2:c1:39:76:6a:fb
 veth8   4e:35:16:af:87:13   local
 veth10  52:e5:62:7b:57:88   static
 veth10  aa:a9:35:21:15:c4   local

Looks nice, where is the targeted bridge (e.g. br0) in that syntax?

 Using Stephen's br tool. First command adds FDB entry to SW bridge and
 if the same tool could be used to add entries to embedded bridge I think
 that would be the best case. 

That would be nice (although it adds a dependency on the presence of the
s/ware bridge). It would be nicer to have a knob in the kernel to
say "synchronize with h/w bridge foo" which can be turned off.

 So no RTNETLINK error on the second cmd. Then
 embedded FDB entries could be dumped this way also so I get a complete view
 of my FDB setup across multiple sw bridges and embedded bridges.

So if you had multiple h/ware bridges - which one is tied to br0? 


 Yes. The hardware has a bit to support this which is currently not exposed
 to user space. That's a case where we have 'yet another knob' that needs
 a clean solution. This causes real bugs today when users try to use the
 macvlan devices in VEPA mode on top of SR-IOV. By the way these modes are
 all part of the 802.1Qbg spec which people actually want to use with Linux
 so a good clean solution is probably needed.


I think the knobs to flood and learn are important. The hardware
seems to have the flood but not the learn/discover. I think the
s/ware bridge needs to have both. At the moment - as pointed out in that
*NEIGH* notification, s/w bridge assumes a policy that could be
considered a security flaw in some circles - just because you are my
neighbor does not mean i trust you to come into my house; i may trust
you partially and allow you only to come through the front door. Even in
Canada with a default policy of not locking your door we sometimes lock
our doors ;->


 I have no problem with drawing the line here and trying to implement something
 over PF_BRIDGE:RTM_xxx nlmsgs. 


My comment/concern was in regard to the bridge built-in policy of
reading from the neighbor updates (refer to above comments)

cheers,
jamal




Correct location for bug report: KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread madengineer10
I'm not sure if this bug is located in userspace or in the kernel.
Could you let me know where to file it?

Bug:
Attempting to boot a 32 bit Debian guest with a Xenomai kernel inside
KVM causes it to hang and spin (using 1 full CPU core) after loading
the initrd, as determined by serial console output. The only error
message is KVM internal error. Suberror: 1/emulation failure.
Booting a regular Debian kernel succeeds, as does running the Xenomai
kernel with software emulation (-no-kvm).

Info:
CPU: Intel Core i7-2670QM
Emulator: qemu-kvm 0.14.1
Host kernel: 3.0.0-15 (Ubuntu build), x86_64
Guest OS: Debian Squeeze, kernel.org 2.6.37 kernel with Xenomai 2.6.0
(config attached)
Qemu command: kvm -M pc-0.14 -enable-kvm -m 1024 -drive
file=/var/lib/libvirt/images/eve.img,if=none,id=drive-ide0-0-0,format=raw
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0
-netdev tap,fd=21,id=hostnet0 -device
e1000,netdev=hostnet0,id=net0,mac=52:54:00:b5:f4:00,bus=pci.0,addr=0x3
-chardev stdio,id=charserial0 -device
isa-serial,chardev=charserial0,id=serial0 -usb -device
usb-tablet,id=input0 -vga cirrus
Effects of flags: Adding one or both of --no-kvm-irqchip or
--no-kvm-pit has no apparent effect. Adding --no-kvm appears to
correct the problem.

Trace will be attached to the final bug submission.

Thanks,
    --Doug Brunner


Re: Correct location for bug report: KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread Avi Kivity
On 02/15/2012 06:40 PM, madengineer10 wrote:
 I'm not sure if this bug is located in userspace or in the kernel.
 Could you let me know where to file it?

 Bug:
 Attempting to boot a 32 bit Debian guest with a Xenomai kernel inside
 KVM causes it to hang and spin (using 1 full CPU core) after loading
 the initrd, as determined by serial console output. The only error
 message is KVM internal error. Suberror: 1/emulation failure.
 Booting a regular Debian kernel succeeds, as does running the Xenomai
 kernel with software emulation (-no-kvm).



Please issue the following commands on the qemu monitor:

  (qemu) info registers
  (qemu) x/30i $eip

and report.

-- 
error compiling committee.c: too many arguments to function



Re: Correct location for bug report: KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread Avi Kivity
On 02/15/2012 06:46 PM, Avi Kivity wrote:
 On 02/15/2012 06:40 PM, madengineer10 wrote:
  I'm not sure if this bug is located in userspace or in the kernel.
  Could you let me know where to file it?
 
  Bug:
  Attempting to boot a 32 bit Debian guest with a Xenomai kernel inside
  KVM causes it to hang and spin (using 1 full CPU core) after loading
  the initrd, as determined by serial console output. The only error
  message is KVM internal error. Suberror: 1/emulation failure.
  Booting a regular Debian kernel succeeds, as does running the Xenomai
  kernel with software emulation (-no-kvm).
 
 

 Please issue the following commands on the qemu monitor:

   (qemu) info registers
   (qemu) x/30i $eip

 and report.


Oh, and wrt your original question, it's likely a kvm bug, please report
in bugzilla.kernel.org.

-- 
error compiling committee.c: too many arguments to function



Re: [PATCH] BUG in pv_clock when overflow condition is detected

2012-02-15 Thread Igor Mammedov


- Original Message -
 From: Avi Kivity a...@redhat.com
 To: Igor Mammedov imamm...@redhat.com
 Cc: linux-ker...@vger.kernel.org, kvm@vger.kernel.org, t...@linutronix.de, 
 mi...@redhat.com, h...@zytor.com,
 r...@redhat.com, amit shah amit.s...@redhat.com, mtosa...@redhat.com
 Sent: Wednesday, February 15, 2012 3:02:04 PM
 Subject: Re: [PATCH] BUG in pv_clock when overflow condition is detected
 
 On 02/15/2012 01:23 PM, Igor Mammedov wrote:
   static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time
  *shadow)
   {
  -u64 delta = native_read_tsc() - shadow->tsc_timestamp;
  +u64 delta;
  +u64 tsc = native_read_tsc();
  +BUG_ON(tsc < shadow->tsc_timestamp);
  +delta = tsc - shadow->tsc_timestamp;
   return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul,
    shadow->tsc_shift);
 
  Maybe a WARN_ON_ONCE()?  Otherwise a relatively minor hypervisor
  bug can
  kill the guest.
 
 
  An attempt to print from this place is not perfect since it often
  leads
  to recursive calling to this very function and it hang there
  anyway.
  But if you insist I'll re-post it with WARN_ON_ONCE,
  It won't make much difference because guest will hang/stall due
  overflow
  anyway.
 
 Won't a BUG_ON() also result in a printk?
Yes, it will. But the stack will still keep the failure point, and
poking at the core with crash/gdb will always show where it BUGged.

In case it manages to print a dump somehow (saw it a couple of times out
of ~30 test cycles), logs from the console or from the kernel message
buffer (again, poking with gdb) will show where it was called from.

If WARN* is used, it will still totally screw up the clock and
last value, and the system will become unusable, requiring looking with
gdb/crash at the core anyway.

So I've just used a more stable failure point that will leave a trace
everywhere it manages to (maybe in the console log, but for sure on the
stack). In case of WARN it might leave a trace on the console or not,
and probably won't reflect the failure point on the stack either,
leaving only the kernel message buffer as a clue.

 
 
  If there is an intention to keep guest functional after the event
  then
  maybe this patch is a way to go
http://www.spinics.net/lists/kvm/msg68463.html
  this way the clock will be resilient to this kind of error, like the
  bare-metal one is.
 
 It's the same patch... do you mean something that detects the
 overflow
 and uses the last value?
I'm sorry, pasted wrong link
here it goes: 
  pvclock: Make pv_clock more robust and fixup it if overflow happens
 http://www.spinics.net/lists/kvm/msg68440.html

 
 --
 error compiling committee.c: too many arguments to function
 
 


Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock

2012-02-15 Thread Andrea Arcangeli
On Wed, Feb 15, 2012 at 04:07:49PM +0200, Avi Kivity wrote:
 Well, it still has flushes inside the lock.  And it seems to be more
 complicated, but maybe that's because I thought of my idea and didn't
 fully grok yours yet.

If we go more complicated I prefer Avi's suggestion to move them all
outside the lock.

Yesterday I was also thinking at the regular pagetables and how we do
not have similar issues there. On the regular pagetables we just do
unconditional flush in fork when we make it readonly and KSM (the
other place that ptes stuff readonly that later can cow) uses
ptep_clear_flush which does an unconditional flush and furthermore it
does it inside the PT lock, so generally we don't optimize for those
things on the regular pagetables. But then these events don't happen
as frequently as they can on KVM without EPT/NPT.


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Scott Wood
On 02/15/2012 05:57 AM, Alexander Graf wrote:
 
 On 15.02.2012, at 12:18, Avi Kivity wrote:
 
 Well the real reason is we have an extra bit reported by page faults
 that we can control.  Can't you set up a hashed pte that is configured
 in a way that it will fault, no matter what type of access the guest
 does, and see it in your page fault handler?
 
 I might be able to synthesize a PTE that is !readable and might throw
 a permission exception instead of a miss exception. I might be able
 to synthesize something similar for booke. I don't however get any
 indication on why things failed.

On booke with ISA 2.06 hypervisor extensions, there's MAS8[VF] that will
trigger a DSI that gets sent to the hypervisor even if normal DSIs go
directly to the guest.  You'll still need to zero out the execute
permission bits.

For other booke, you could use one of the user bits in MAS3 (along with
zeroing out all the permission bits), which you could get to by doing a
tlbsx.

-Scott



Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-02-15 Thread Alexander Graf

On 10.01.2012, at 01:51, Scott Wood wrote:

 On 01/09/2012 11:46 AM, Alexander Graf wrote:
 
 On 21.12.2011, at 02:34, Scott Wood wrote:
 

[...]

 Current issues include:
 - Machine checks from guest state are not routed to the host handler.
 - The guest can cause a host oops by executing an emulated instruction
  in a page that lacks read permission.  Existing e500/4xx support has
  the same problem.
 
 We solve that in book3s pr by doing
 
  LAST_INST = known bad value;
  PACA->kvm_mode = recover at next inst;
  lwz(guest pc);
  do_more_stuff();
 
 That way when an exception occurs at lwz() the DO_KVM handler checks that 
 we're in kvm mode recover which does basically srr0+=4; rfi;.
 
 I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and
 treat it as a kernel fault (search exception table) -- but this works
 too and is a bit cleaner (could be other uses of external pid), at the
 expense of a couple extra instructions in the emulation path (but
 probably a slightly faster host TLB handler).
 
 The check wouldn't go in DO_KVM, though, since on bookehv that only
 deals with diverting flow when xSRR1[GS] is set, which wouldn't be the
 case here.

Thinking about it a bit more, how is this different from a failed get_user()? 
We can just use the same fixup mechanism as there, right?

Alex



[KVM paravirt issue?] Re: vsyscall=emulate regression

2012-02-15 Thread Andy Lutomirski
Hi, kvm people-

Here's a strange failure.  It could be a bug in something
RHEL6-specific, but it could be a generic issue that only triggers
with a paravirt guest with old userspace on a non-ept host.  There was
a bug like this on Xen, and I'm wondering if something's wrong in kvm as
well.

For background, a change in 3.1 (IIRC) means that, when
vsyscall=emulate or vsyscall=none, the vsyscall page in the fixmap is
NX.  It seems like Amit's machine is marking the physical PTE present
but unreadable.  So I could have messed up, or there could be a subtle
bug somewhere.  Any ideas?

I'll try to reproduce on a non-ept host later on, but that will
involve finding one.

On Wed, Feb 15, 2012 at 3:01 AM, Amit Shah amit.s...@redhat.com wrote:
 On (Tue) 14 Feb 2012 [08:26:22], Andy Lutomirski wrote:
 On Tue, Feb 14, 2012 at 4:22 AM, Amit Shah amit.s...@redhat.com wrote:
 Can you try booting the initramfs here:
 http://web.mit.edu/luto/www/linux/vsyscall_initramfs.img
 with your kernel image (i.e. qemu-kvm -kernel whatever -initrd
 vsyscall_initramfs.img -whatever_else) and seeing what happens?  It
 works for me.

 This too results in a similar error.

Can you post the exact error?  I'm interested in how far it gets
before it fails.

 I didn't try a modern distro, but looks like this is enough evidence
 for now to check the kvm emulator code.  I tried the same guests on a
 newer kernel (Fedora 16's 3.2), and things worked fine except for
 vsyscall=none, panic message below.

vsyscall=none isn't supposed to work unless you're running a very
modern distro *and* you have no legacy static binaries *and* you
aren't using anything written in Go (sigh).  It will probably either
never become the default or will take 5-10 years.


 model name      : Intel(R) Core(TM)2 Duo CPU     E6550  @ 2.33GHz
 flags           : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov 
 pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm 
 constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor 
 ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dts tpr_shadow vnmi 
 flexpriority

Hmm.  You don't have ept.  If your guest kernel supports paravirt,
then you might use the hypercall interface instead of programming the
fixmap directly.


 This is what I get with vsyscall=none, where emulate and native work
 fine on the 3.2 kernel on different host hardware, the guest stays the
 same:


 [    2.874661] debug: unmapping init memory 8167f000..818dc000
 [    2.876778] Write protecting the kernel read-only data: 6144k
 [    2.879111] debug: unmapping init memory 880001318000..88000140
 [    2.881242] debug: unmapping init memory 8800015a..88000160
 [    2.884637] init[1] vsyscall attempted with vsyscall=none 
 ip:ff600400 cs:33 sp:7fff2f48fe18 ax:7fff2f48fe50 si:7fff2f48ff08 di:0

This line (vsyscall attempted) means that the emulation worked
correctly.  Your other traces didn't have it or anything like it,
which mostly rules out do_emulate_vsyscall issues.

--Andy


Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-02-15 Thread Scott Wood
On 02/15/2012 01:36 PM, Alexander Graf wrote:
 
 On 10.01.2012, at 01:51, Scott Wood wrote:
 I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and
 treat it as a kernel fault (search exception table) -- but this works
 too and is a bit cleaner (could be other uses of external pid), at the
 expense of a couple extra instructions in the emulation path (but
 probably a slightly faster host TLB handler).

 The check wouldn't go in DO_KVM, though, since on bookehv that only
 deals with diverting flow when xSRR1[GS] is set, which wouldn't be the
 case here.
 
 Thinking about it a bit more, how is this different from a failed get_user()? 
 We can just use the same fixup mechanism as there, right?

The fixup mechanism can be the same (we'd like to know whether it failed
due to TLB miss or DSI, so we know which to reflect -- but if necessary
I think we can figure that out with a tlbsx).  What's different is that
the page fault handler needs to know that any external pid (or AS1)
fault is bad, same as if the address were in the kernel area, and it
should go directly to searching the exception tables instead of trying
to page something in.

-Scott



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Anthony Liguori

On 02/15/2012 07:39 AM, Avi Kivity wrote:

On 02/07/2012 08:12 PM, Rusty Russell wrote:

I would really love to have this, but the problem is that we'd need a
general purpose bytecode VM with binding to some kernel APIs.  The
bytecode VM, if made general enough to host more complicated devices,
would likely be much larger than the actual code we have in the kernel now.


We have the ability to upload bytecode into the kernel already.  It's in
a great bytecode interpreted by the CPU itself.


Unfortunately it's inflexible (has to come with the kernel) and open to
security vulnerabilities.


I wonder if there's any reasonable way to run device emulation within the 
context of the guest.  Could we effectively do something like SMM?


For a given set of traps, reflect back into the guest quickly changing the 
visibility of the VGA region. It may require installing a new CR3 but maybe that 
wouldn't be so bad with VPIDs.


Then you could implement the PIT as guest firmware using kvmclock as the time 
base.

Once you're back in the guest, you could install the old CR3.  Perhaps just hide 
a portion of the physical address space with the e820.


Regards,

Anthony Liguori


If every user were emulating different machines, LPF this would make
sense.  Are they?


They aren't.


Or should we write those helpers once, in C, and
provide that for them.


There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
them are quite complicated.  However implementing them in bytecode
amounts to exposing a stable kernel ABI, since they use such a vast
range of kernel services.





Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Arnd Bergmann
On Tuesday 07 February 2012, Alexander Graf wrote:
 On 07.02.2012, at 07:58, Michael Ellerman wrote:
 
  On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
  You're exposing a large, complex kernel subsystem that does very
  low-level things with the hardware.  It's a potential source of exploits
  (from bugs in KVM or in hardware).  I can see people wanting to be
  selective with access because of that.
  
  Exactly.
  
  In a perfect world I'd agree with Anthony, but in reality I think
  sysadmins are quite happy that they can prevent some users from using
  KVM.
  
  You could presumably achieve something similar with capabilities or
  whatever, but a node in /dev is much simpler.
 
 Well, you could still keep the /dev/kvm node and then have syscalls operate 
 on the fd.
 
 But again, I don't see the problem with the ioctl interface. It's nice, 
 extensible and works great for us.
 

ioctl is good for hardware devices and stuff that you want to enumerate
and/or control permissions on. For something like KVM that is really a
core kernel service, a syscall makes much more sense.

I would certainly never mix the two concepts: If you use a chardev to get
a file descriptor, use ioctl to do operations on it, and if you use a 
syscall to get the file descriptor then use other syscalls to do operations
on it.

I don't really have a good recommendation whether or not to change from an
ioctl based interface to syscall for KVM now. On the one hand I believe it
would be significantly cleaner, on the other hand we cannot remove the
chardev interface any more since there are many existing users.

Arnd


Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Arnd Bergmann
On Tuesday 07 February 2012, Alexander Graf wrote:
  
  Not sure we'll ever get there. For PPC, it will probably take another 1-2 
  years until we get the 32-bit targets stabilized. By then we will have new 
  64-bit support though. And then the next gen will come out giving us even 
  more new constraints.
  
  I would expect that newer archs have less constraints, not more.
 
 Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have 
 today on 32-bit, but extends a
 bunch of registers to 64-bit. So what if we laid out stuff wrong before?
 
 I don't even want to imagine what v7 arm vs v8 arm looks like. It's a 
 completely new architecture.
 

I have not seen the source but I'm pretty sure that v7 and v8 they look very
similar regarding virtualization support because they were designed together,
including the concept that on v8 you can run either a v7 compatible 32 bit
hypervisor with 32 bit guests or a 64 bit hypervisor with a combination of
32 and 64 bit guests. Also, the page table layout in v7-LPAE is identical
to the v8 one. The main difference is the instruction set, but then ARMv7
already has four of these (ARM, Thumb, Thumb2, ThumbEE).

Arnd



Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 20:40, Scott Wood wrote:

 On 02/15/2012 01:36 PM, Alexander Graf wrote:
 
 On 10.01.2012, at 01:51, Scott Wood wrote:
 I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and
 treat it as a kernel fault (search exception table) -- but this works
 too and is a bit cleaner (could be other uses of external pid), at the
 expense of a couple extra instructions in the emulation path (but
 probably a slightly faster host TLB handler).
 
 The check wouldn't go in DO_KVM, though, since on bookehv that only
 deals with diverting flow when xSRR1[GS] is set, which wouldn't be the
 case here.
 
 Thinking about it a bit more, how is this different from a failed 
 get_user()? We can just use the same fixup mechanism as there, right?
 
 The fixup mechanism can be the same (we'd like to know whether it failed
 due to TLB miss or DSI, so we know which to reflect

No, we only want to know that the fast path failed. The reason is a different 
pair of shoes and should be evaluated in the slow path. We shouldn't ever fault 
here during normal operation btw. We already executed a guest instruction, so 
there's almost no reason it can't be read.

 -- but if necessary
 I think we can figure that out with a tlbsx).  What's different is that
 the page fault handler needs to know that any external pid (or AS1)
 fault is bad, same as if the address were in the kernel area, and it
 should go directly to searching the exception tables instead of trying
 to page something in.

Yes and no. We need to force it to search the exception tables. We don't care 
if the page fault handler knows anything about external pids.

Either way, we discussed the further stuff on IRC and came to a working 
solution :). Stay tuned.


Alex



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Michael Ellerman
On Wed, 2012-02-15 at 22:21 +, Arnd Bergmann wrote:
 On Tuesday 07 February 2012, Alexander Graf wrote:
  On 07.02.2012, at 07:58, Michael Ellerman wrote:
  
   On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote:
   You're exposing a large, complex kernel subsystem that does very
   low-level things with the hardware.  It's a potential source of exploits
   (from bugs in KVM or in hardware).  I can see people wanting to be
   selective with access because of that.
   
   Exactly.
   
   In a perfect world I'd agree with Anthony, but in reality I think
   sysadmins are quite happy that they can prevent some users from using
   KVM.
   
   You could presumably achieve something similar with capabilities or
   whatever, but a node in /dev is much simpler.
  
  Well, you could still keep the /dev/kvm node and then have syscalls operate 
  on the fd.
  
  But again, I don't see the problem with the ioctl interface. It's nice, 
  extensible and works great for us.
  
 
 ioctl is good for hardware devices and stuff that you want to enumerate
 and/or control permissions on. For something like KVM that is really a
 core kernel service, a syscall makes much more sense.

Yeah maybe. That distinction is at least in part just historical.

The first problem I see with using a syscall is that you don't need one
syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a
multiplexed syscall like epoll_ctl() - or probably several
(vm/vcpu/etc).

Secondly you still need a handle/context for those syscalls, and I think
the most sane thing to use for that is an fd.

At that point you've basically reinvented ioctl :)

I also think it is an advantage that you have a node in /dev for
permissions. I know other core kernel interfaces don't use a /dev
node, but arguably that is their loss.

 I would certainly never mix the two concepts: If you use a chardev to get
 a file descriptor, use ioctl to do operations on it, and if you use a 
 syscall to get the file descriptor then use other syscalls to do operations
 on it.

Sure, we use a syscall to get the fd (open) and then other syscalls to
do operations on it, ioctl and kvm_vcpu_run. ;)

But seriously, I guess that makes sense. Though it's a bit of a pity
because if you want a syscall for any of it, eg. vcpu_run(), then you
have to basically reinvent ioctl for all the other little operations.

cheers




Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-15 Thread John Fastabend
On 2/15/2012 6:10 AM, Jamal Hadi Salim wrote:
 On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote:
 
 Roopa was likely on the right track here,

 http://patchwork.ozlabs.org/patch/123064/
 
 Doesnt seem related to the bridging stuff - the modeling looks
 reasonable however.
 

The operations are really the same: ADD/DEL/GET of additional MAC
addresses on a port, in this case a macvlan type port. The
difference is that the macvlan port type drops any packet with an
address not in the FDB, where the bridge type floods these.

 But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX
 netlink messages. And if possible drive this without extending ndo_ops.

 An ideal user space interaction IMHO would look like,

 [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10
 [root@jf-dev1-dcblab iproute2]# ./br/br fdb
 port    mac addr            flags
 veth2   36:a6:35:9b:96:c4   local
 veth4   aa:54:b0:7b:42:ef   local
 veth0   2a:e8:5c:95:6c:1b   local
 veth6   6e:26:d5:43:a3:36   local
 veth0   f2:c1:39:76:6a:fb
 veth8   4e:35:16:af:87:13   local
 veth10  52:e5:62:7b:57:88   static
 veth10  aa:a9:35:21:15:c4   local
 
 Looks nice, where is the targeted bridge(eg br0) in that syntax?

[root@jf-dev1-dcblab src]# br fdb help
Usage: br fdb { add | del | replace } ADDR dev DEV
   br fdb {show} [ dev DEV ]

In my example I just dumped all bridge devices,

#br fdb show dev bridge0

 
 Using Stephen's br tool. First command adds FDB entry to SW bridge and
 if the same tool could be used to add entries to embedded bridge I think
 that would be the best case. 
 
 That would be nice (although adds dependency on the presence of the
 s/ware bridge). Would be nicer to have either a knob in the kernel to
 say synchronize with h/w bridge foo which can be turned off.  
 

Seems we need both a synchronize and a { add | del | replace } option.

 So no RTNETLINK error on the second cmd. Then
 embedded FDB entries could be dumped this way also so I get a complete view
 of my FDB setup across multiple sw bridges and embedded bridges.
 
 So if you had multiple h/ware bridges - which one is tied to br0? 
 

Not sure I follow but does the additional dev parameter above answer this?

 
 Yes. The hardware has a bit to support this which is currently not exposed
 to user space. That's a case where we have 'yet another knob' that needs
 a clean solution. This causes real bugs today when users try to use the
 macvlan devices in VEPA mode on top of SR-IOV. By the way these modes are
 all part of the 802.1Qbg spec which people actually want to use with Linux
 so a good clean solution is probably needed.
 
 
 I think the knobs to flood and learn are important. The hardware
 seems to have the flood but not the learn/discover. I think the
 s/ware bridge needs to have both. At the moment - as pointed out in that
 *NEIGH* notification, s/w bridge assumes a policy that could be
 considered a security flaw in some circles - just because you are my
 neighbor does not mean I trust you to come into my house; I may trust
 you partially and allow you only to come through the front door. Even in
 Canada with a default policy of not locking your door we sometimes lock
 our doors ;-)
 
 
 I have no problem with drawing the line here and trying to implement 
 something
 over PF_BRIDGE:RTM_xxx nlmsgs. 
 
 
 My comment/concern was in regard to the bridge built-in policy of
 reading from the neighbor updates (refer to above comments)
 

So I think what you're saying is a per-port bit to disable learning...

Hmm, but if you start tweaking it too much it looks less and less like an
802.1D bridge and more like something you would want to build with tc or
openvswitch or tc+bridge or tc+macvlan.

.John

 cheers,
 jamal
 
 



Re: [Qemu-devel] [PATCH 0/3] [PULL] qemu-kvm.git uq/master queue

2012-02-15 Thread Anthony Liguori

On 02/08/2012 02:01 PM, Marcelo Tosatti wrote:

The following changes since commit cf4dc461a4cfc3e056ee24edb26154f4d34a6278:

   Restore consistent formatting (2012-02-07 22:11:04 +0400)

are available in the git repository at:
   git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master


Pulled.  Thanks.

Regards,

Anthony Liguori



Jan Kiszka (3):
   kvm: Allow to set shadow MMU size
   kvm: Implement kvm_irqchip_in_kernel like kvm_enabled
   apic: Fix legacy vmstate loading for KVM

  hw/apic_common.c  |7 ++-
  hw/pc.c   |4 ++--
  hw/pc_piix.c  |6 +++---
  kvm-all.c |   13 -
  kvm-stub.c|5 -
  kvm.h |8 +---
  qemu-config.c |4 
  qemu-options.hx   |5 -
  target-i386/kvm.c |   17 +++--
  9 files changed, 43 insertions(+), 26 deletions(-)






Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Rusty Russell
On Wed, 15 Feb 2012 15:39:41 +0200, Avi Kivity a...@redhat.com wrote:
 On 02/07/2012 08:12 PM, Rusty Russell wrote:
   I would really love to have this, but the problem is that we'd need a
   general purpose bytecode VM with binding to some kernel APIs.  The
   bytecode VM, if made general enough to host more complicated devices,
   would likely be much larger than the actual code we have in the kernel 
   now.
 
  We have the ability to upload bytecode into the kernel already.  It's in
  a great bytecode interpreted by the CPU itself.
 
 Unfortunately it's inflexible (has to come with the kernel) and open to
 security vulnerabilities.

It doesn't have to come with the kernel, but it does require privs.  And
while the bytecode itself might be invulnerable, the services it will
call will be, so it's not clear it'll be a win, given the reduced
auditability.

The grass is not really greener, and getting there involves many fences.

  If every user were emulating different machines, LPF this would make
  sense.  Are they?  
 
 They aren't.
 
  Or should we write those helpers once, in C, and
  provide that for them.
 
 There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V
 stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of
 them are quite complicated.  However implementing them in bytecode
 amounts to exposing a stable kernel ABI, since they use such a vast
 range of kernel services.

We could think about regularizing and enumerating the various in-kernel
helpers, and give userspace a generic mechanism for wiring them up.
That would surely be the first step towards bytecode anyway.

But the current device assignment ioctls make me think that this
wouldn't be simple or neat.

Cheers,
Rusty.


Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware

2012-02-15 Thread Ben Hutchings
[I'm just catching up with this after getting my own driver changes into
shape.]

On Fri, 2012-02-10 at 10:18 -0500, jamal wrote:
 Hi John,
 
 I went backwards to summarize at the top after going through your email.
 
 TL;DR version 0.1: 
 you provide a good use case where it makes sense to do things in the
 kernel. IMO, you could make the same arguement if your embedded switch
 could do ACLs, IPv4 forwarding etc. And the kernel bloats.
 I am always bigoted to move all policy control to user space instead of
 bloating in the kernel.
[...]
  Now here is the potential issue,
  
  (G) The frame transmitted from ethx.y with the destination address of
  veth0 but the embedded switch is not a learning switch. If the FDB
  update is done in user space its possible (likely?) that the FDB
  entry for veth0 has not been added to the embedded switch yet. 
 
 Ok, got it - so the catch here is the switch is not capable of learning.
 I think this depends on where learning is done. Your intent is to
 use the S/W bridge as something that does the learning for you i.e in
 the kernel. This makes the s/w bridge part of MUST-have-for-this-to-run.
 And that maybe the case for your use case.
[...]

Well, in addition, there are SR-IOV network adapters that don't have any
bridge.  For these, the software bridge is necessary to handle
multicast, broadcast and forwarding between local ports, not only to do
learning.

Solarflare's implementation of accelerated guest networking (which
Shradha and I are gradually sending upstream) builds on libvirt's
existing support for software bridges and assigns VFs to guests as a
means to offload some of the forwarding.

If and when we implement a hardware bridge, we would probably still want
to keep the software bridge as a fallback.  If a guest is dependent on a
VF that's connected to a hardware bridge, it becomes impossible or at
least very disruptive to migrate it to another host that doesn't have a
compatible VF available.

Ben.

-- 
Ben Hutchings, Staff Engineer, Solarflare
Not speaking for my employer; that's the marketing department's job.
They asked us to note that Solarflare product names are trademarked.



Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock

2012-02-15 Thread Xiao Guangrong
On 02/15/2012 10:07 PM, Avi Kivity wrote:

 On 02/15/2012 01:37 PM, Xiao Guangrong wrote:

 I would really like to move the IPI back out of the lock.

 How about something like a sequence lock:


 spin_lock(mmu_lock)
 need_flush = write_protect_stuff();
 atomic_add(kvm->want_flush_counter, need_flush);
 spin_unlock(mmu_lock);

 while ((done = atomic_read(kvm->done_flush_counter)) < (want =
 atomic_read(kvm->want_flush_counter))) {
   kvm_make_request(flush)
   atomic_cmpxchg(kvm->done_flush_counter, done, want)
 }

 This (or maybe a corrected and optimized version) ensures that any
 need_flush cannot pass the while () barrier, no matter which thread
 encounters it first.  However it violates the do not invent new locking
 techniques commandment.  Can we map it to some existing method?

 There is no need to advance 'want' in the loop.  So we could do

 /* must call with mmu_lock held */
 void kvm_mmu_defer_remote_flush(kvm, need_flush)
 {
   if (need_flush)
     ++kvm->flush_counter.want;
 }

 /* may call without mmu_lock */
 void kvm_mmu_commit_remote_flush(kvm)
 {
   want = ACCESS_ONCE(kvm->flush_counter.want)
   while ((done = atomic_read(kvm->flush_counter.done)) < want) {
     kvm_make_request(flush)
     atomic_cmpxchg(kvm->flush_counter.done, done, want)
   }
 }



 Hmm, we already have kvm->tlbs_dirty, so, we can do it like this:

 #define SPTE_INVALID_UNCLEAN (1 << 63)

 in invalid page path:
 lock mmu_lock
 if (spte is invalid)
 	kvm->tlbs_dirty |= SPTE_INVALID_UNCLEAN;
 need_tlb_flush = kvm->tlbs_dirty;
 unlock mmu_lock
 if (need_tlb_flush)
 	kvm_flush_remote_tlbs()

 And in page write-protected path:
 lock mmu_lock
 if (it has spte change to readonly ||
     kvm->tlbs_dirty & SPTE_INVALID_UNCLEAN)
 	kvm_flush_remote_tlbs()
 unlock mmu_lock

 How about this?
 
 Well, it still has flushes inside the lock.  And it seems to be more
 complicated, but maybe that's because I thought of my idea and didn't
 fully grok yours yet.
 


Oh, I did not mean flushing all tlbs outside of mmu-lock, just the invalid
page path.

But there are still some paths that need to flush tlbs inside of mmu-lock
(like sync_children, get_page).

In your code:

 /* must call with mmu_lock held */
 void kvm_mmu_defer_remote_flush(kvm, need_flush)
 {
   if (need_flush)
     ++kvm->flush_counter.want;
 }

 /* may call without mmu_lock */
 void kvm_mmu_commit_remote_flush(kvm)
 {
   want = ACCESS_ONCE(kvm->flush_counter.want)
   while ((done = atomic_read(kvm->flush_counter.done)) < want) {
     kvm_make_request(flush)
     atomic_cmpxchg(kvm->flush_counter.done, done, want)
   }
 }

I think we do not need to handle all tlb-flush requests here, since all of
these requests can be delayed to the point where mmu-lock is released; we can
simply do it like this:

void kvm_mmu_defer_remote_flush(kvm, need_flush)
{
	if (need_flush)
		++kvm->tlbs_dirty;
}

void kvm_mmu_commit_remote_flush(struct kvm *kvm)
{
	int dirty_count = kvm->tlbs_dirty;

	smp_mb();

	if (!dirty_count)
		return;

	if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH))
		++kvm->stat.remote_tlb_flush;
	cmpxchg(&kvm->tlbs_dirty, dirty_count, 0);
}

If this is ok, we only need a small change in the current code, since
kvm_mmu_commit_remote_flush is very similar to kvm_flush_remote_tlbs().
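For anyone following along, here is a compilable userspace sketch of the
want/done counter scheme being discussed.  All struct, function, and variable
names are made up for illustration (they are not the real KVM symbols), and
the remote-TLB-flush IPI is faked by a stub that completes immediately:

```c
/* Userspace model of the deferred/committed remote-flush counters.
 * Illustrative only: names and the flush stub are invented here. */
#include <assert.h>
#include <stdatomic.h>

struct flush_counter {
	atomic_int want;	/* flushes requested, bumped under mmu_lock */
	atomic_int done;	/* flushes known to have completed */
};

static int flushes_sent;	/* counts the stand-in "IPIs" */

static void flush_request(struct flush_counter *fc, int want)
{
	flushes_sent++;
	/* pretend every remote CPU has now flushed up to 'want' */
	atomic_store(&fc->done, want);
}

/* caller models holding mmu_lock */
static void defer_remote_flush(struct flush_counter *fc, int need_flush)
{
	if (need_flush)
		atomic_fetch_add(&fc->want, 1);
}

/* may run after mmu_lock is dropped; no requested flush can be lost */
static void commit_remote_flush(struct flush_counter *fc)
{
	int want = atomic_load(&fc->want);

	while (atomic_load(&fc->done) < want)
		flush_request(fc, want);
}
```

The property being modeled: any need_flush registered under the lock raises
want, and commit cannot return until done catches up, so a flush is never
lost even though commit runs outside mmu_lock.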



Re: [PATCH 3/3] KVM: perf: kvm events analysis tool

2012-02-15 Thread Xiao Guangrong
On 02/13/2012 11:52 PM, David Ahern wrote:


 The first patch is only needed for code compilation, after kvm-events is
 compiled, you can analyse any kernels. :)
 
 understood.
 
 Now that I recall perf's way of handling out of tree builds, a couple of
 comments:
 
 1. you need to add the following to tools/perf/MANIFEST
 arch/x86/include/asm/svm.h
 arch/x86/include/asm/vmx.h
 arch/x86/include/asm/kvm_host.h
 


Right.

 2.scripts/checkpatch.pl is an unhappy camper.
 


It seems checkpatch always complains about TRACE_EVENT and the many
longer-than-80-character lines in perf tools.

 I'll take a look at the code and try out the command when I get some time.
 


Okay, i will post the next version after collecting your new comments!

Thanks for your time, David! :)



Re: [PATCH 3/3] KVM: perf: kvm events analysis tool

2012-02-15 Thread David Ahern

On 2/15/12 9:59 PM, Xiao Guangrong wrote:



Okay, i will post the next version after collecting your new comments!

Thanks for your time, David! :)



I had more comments, but got sidetracked and forgot to come back to 
this. I still haven't looked at the code yet, but some comments from 
testing:


1. The error message:
  Warning: Error: expected type 5 but read 4
  Warning: Error: expected type 5 but read 0
  Warning: unknown op '}'

is fixed by this patch which has not yet made its way into perf:
https://lkml.org/lkml/2011/9/4/41

The most recent request:
https://lkml.org/lkml/2012/2/8/479

Arnaldo: the patch still applies cleanly (but with an offset of -2 lines).


2. negative testing:

perf kvm-events record -e kvm:* -p 2603 -- sleep 10

  Warning: Error: expected type 4 but read 7
  Warning: Error: expected type 5 but read 0
  Warning: failed to read event print fmt for kvm_apic
  Warning: Error: expected type 4 but read 7
  Warning: Error: expected type 5 but read 0
  Warning: failed to read event print fmt for kvm_inj_exception
  Fatal: bad op token {

If other kvm events are specified on the record line, they appear to be 
silently ignored in the report, in which case why allow the -e option for 
record?



3. What is happening for multiple VMs?

a. perf kvm-events report
data is collected for all VMs. What is displayed in the report? An
average for all VMs?

b. perf kvm-events report --vcpu 1
Does this given an average of all vcpu 1's?

Perhaps a -p option for the report to pull out events related to a 
single VM. Really this could be a generic option (to perf-report and 
perf-script as well) to only show/analyze events for the specified pid. 
ie., data is recorded for all VMs (or system wide for the regular 
perf-record) and you want to only consider events for a specific pid. 
e.g., in process_sample_event() skip event if event-ip.pid != 
report_pid (works for perf code because PERF_SAMPLE_TID attribute is 
always set).


David


Re: [PATCH 3/3] KVM: perf: kvm events analysis tool

2012-02-15 Thread Xiao Guangrong
On 02/16/2012 01:05 PM, David Ahern wrote:

 On 2/15/12 9:59 PM, Xiao Guangrong wrote:


 Okay, i will post the next version after collecting your new comments!

 Thanks for your time, David! :)

 
 I had more comments, but got sidetracked and forgot to come back to this. I 
 still haven't looked at the code yet, but some comments from testing:
 
 1. The error message:
   Warning: Error: expected type 5 but read 4
   Warning: Error: expected type 5 but read 0
   Warning: unknown op '}'
 
 is fixed by this patch which has not yet made its way into perf:
 https://lkml.org/lkml/2011/9/4/41
 
 The most recent request:
 https://lkml.org/lkml/2012/2/8/479
 
 Arnaldo: the patch still applies cleanly (but with an offset of -2 lines).
 


Great, it is a good fix.

But, it does not hurt the development of kvm-events.

 
 2. negatve testing:
 
 perf kvm-events record -e kvm:* -p 2603 -- sleep 10
 
   Warning: Error: expected type 4 but read 7
   Warning: Error: expected type 5 but read 0
   Warning: failed to read event print fmt for kvm_apic
   Warning: Error: expected type 4 but read 7
   Warning: Error: expected type 5 but read 0
   Warning: failed to read event print fmt for kvm_inj_exception
   Fatal: bad op token {
 
 If other kvm events are specified in the record line they appear to be 
 silently ignored in the report in which case why allow the -e option to 
 record?
 


Yes, kvm-events does not analyse the events specified by the -e option, since
those events are not needed by the vmexit/ioport/mmio analysis.

And after a kvm-events record, you can still see those events with perf script.

 
 3. What is happening for multiple VMs?
 
 a. perf kvm-events report
 data is collected for all VMs. What is displayed in the report? An
 average for all VMs?
 


Yes

 b. perf kvm-events report --vcpu 1
 Does this given an average of all vcpu 1's?
 


Yes

 Perhaps a -p option for the report to pull out events related to a single VM. 
 Really this could be a generic option (to perf-report and perf-script as 
 well) to only show/analyze events for the specified pid. ie., data is 
 recorded for all VMs (or system wide for the regular perf-record) and you 
 want to only consider events for a specific pid. e.g., in 
 process_sample_event() skip event if event-ip.pid != report_pid (works for 
 perf code because PERF_SAMPLE_TID attribute is always set).

Per-VM analysis is a good idea, but please allow me to put it into my TODO 
list. :)



[Bug 42779] New: KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779

   Summary: KVM domain hangs after loading initrd with Xenomai
kernel
   Product: Virtualization
   Version: unspecified
Kernel Version: 3.0.0-15
  Platform: All
OS/Version: Linux
  Tree: Mainline
Status: NEW
  Severity: normal
  Priority: P1
 Component: kvm
AssignedTo: virtualization_...@kernel-bugs.osdl.org
ReportedBy: madenginee...@gmail.com
Regression: No


Attempting to boot a 32 bit Debian guest with a Xenomai kernel inside KVM
causes it to hang and spin (using 1 full CPU core) after loading the initrd, as
determined by serial console output. The only error message is "KVM internal
error. Suberror: 1" / "emulation failure". Booting a regular Debian kernel
succeeds, as does running the Xenomai kernel with software emulation (-no-kvm).

Info:
CPU: Intel Core i7-2670QM
Emulator: qemu-kvm 0.14.1
Host kernel: 3.0.0-15 (Ubuntu build), x86_64
Guest OS: Debian Squeeze, kernel.org 2.6.37 kernel with Xenomai 2.6.0 (config
attached)
Qemu command: kvm -M pc-0.14 -enable-kvm -m 1024 -drive
file=/var/lib/libvirt/images/eve.img,if=none,id=drive-ide0-0-0,format=raw
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev
tap,fd=21,id=hostnet0 -device
e1000,netdev=hostnet0,id=net0,mac=52:54:00:b5:f4:00,bus=pci.0,addr=0x3 -chardev
stdio,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb
-device usb-tablet,id=input0 -vga cirrus
Effects of flags: Adding one or both of --no-kvm-irqchip or --no-kvm-pit has no
apparent effect. Adding --no-kvm appears to correct the problem, at the cost of
performance due to using the software emulator.

-- 
Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email
--- You are receiving this mail because: ---
You are watching the assignee of the bug.


[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779





--- Comment #1 from madenginee...@gmail.com  2012-02-16 05:46:07 ---
Created an attachment (id=72393)
 -- (https://bugzilla.kernel.org/attachment.cgi?id=72393)
Configuration of the guest kernel



[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779





--- Comment #2 from madenginee...@gmail.com  2012-02-16 05:47:18 ---
Created an attachment (id=72394)
 -- (https://bugzilla.kernel.org/attachment.cgi?id=72394)
Result of 'registers info' and 'x/30i $eip' after fault



[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779


madenginee...@gmail.com changed:

   What|Removed |Added

  Attachment #72393|application/octet-stream|text/plain
  mime type||






[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779


madenginee...@gmail.com changed:

   What|Removed |Added

  Attachment #72394|application/octet-stream|text/plain
  mime type||






[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779





--- Comment #3 from madenginee...@gmail.com  2012-02-16 05:57:25 ---
Couldn't attach the trace I recorded of the fault occurring since it's 3 MB
compressed with xz, bigger still with other formats. I can email it if it will
be useful.



[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779





--- Comment #4 from madenginee...@gmail.com  2012-02-16 06:32:48 ---
Same problem occurs with qemu-kvm 1.0 from
https://launchpad.net/~bderzhavets/+archive/lib-usbredir39:

$ sudo kvm -M pc-1.0 -enable-kvm -m 1024 -drive
file=/var/lib/libvirt/images/eve.img,if=none,id=drive-ide0-0-0,format=raw
-device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev
tap,fd=21,id=hostnet0 -device
e1000,netdev=hostnet0,id=net0,mac=52:54:00:b5:f4:00,bus=pci.0,addr=0x3 -chardev
vc,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb
-device usb-tablet,id=input0 -vga cirrus
kvm: -netdev tap,fd=21,id=hostnet0: TUNGETIFF ioctl() failed: Bad file
descriptor
TUNSETOFFLOAD ioctl() failed: Bad file descriptor
kvm: -device
e1000,netdev=hostnet0,id=net0,mac=52:54:00:b5:f4:00,bus=pci.0,addr=0x3:
pci_add_option_rom: failed to find romfile pxe-e1000.rom
KVM internal error. Suberror: 1
emulation failure
EAX=f681 EBX=003e ECX=003e EDX=c00b8000
ESI=c00b8000 EDI=c15b EBP=c15b1f74 ESP=c15b1f58
EIP=c1228905 EFL=00010206 [-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0
ES =007b   00c0f300 DPL=3 DS   [-WA]
CS =0060   00c09b00 DPL=0 CS32 [-RA]
SS =0068   00c09300 DPL=0 DS   [-WA]
DS =007b   00c0f300 DPL=3 DS   [-WA]
FS =   
GS =   
LDT=   
TR =0080 c15b6300 206b 8b00 DPL=0 TSS32-busy
GDT= c15b3000 00ff
IDT= c15b2000 07ff
CR0=80050033 CR2=ffee4000 CR3=01663000 CR4=0690
DR0= DR1= DR2=
DR3= 
DR6=0ff0 DR7=0400
EFER=
Code=8e 2b 01 00 00 8b 4d f0 89 f2 8b 45 ec 0f 0d 82 40 01 00 00 0f 6f 02 0f
6f 4a 08 0f 6f 52 10 0f 6f 5a 18 0f 7f 00 0f 7f 48 08 0f 7f 50 10 0f 7f 58 18



[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779


Gleb g...@redhat.com changed:

   What|Removed |Added

 CC||g...@redhat.com




--- Comment #5 from Gleb g...@redhat.com  2012-02-16 07:15:49 ---
(In reply to comment #3)
 Couldn't attach the trace I recorded of the fault occurring since it's 3 MB
 compressed with xz, bigger still with other formats. I can email it if it will
 be useful.

Can you do tail -1 on it and attach it here?



[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779





--- Comment #6 from madenginee...@gmail.com  2012-02-16 07:22:41 ---
Created an attachment (id=72395)
 -- (https://bugzilla.kernel.org/attachment.cgi?id=72395)
Last 10k lines of a trace showing the fault

Per Gleb's request



[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779





--- Comment #7 from Gleb g...@redhat.com  2012-02-16 07:43:11 ---
Have you installed trace-cmd before capturing the trace? It failed to parse kvm
events. qemu hadn't paused the guest after the emulation error (looks like a
bug), so the 'x/30i $eip' output is not useful either. Can you do 'x/30i 0xXXX'
where XXX is the address in EIP from the register dump you see after the
instruction emulation failure message (c1228905 in your output from comment #4)?



[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779





--- Comment #8 from madenginee...@gmail.com  2012-02-16 07:56:03 ---
Not sure what you mean by installing trace-cmd before capturing the trace--I
did do that, otherwise I wouldn't have had a trace-cmd to run. The package
version is trace-cmd 1.0.3-0ubuntu1 if that helps. I tried running it again
against qemu 1.0 (the last one was for qemu 0.14); the trace still contained a
bunch of [FAILED TO PARSE] messages.

Attaching the output you requested separately.



[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel

2012-02-15 Thread bugzilla-daemon
https://bugzilla.kernel.org/show_bug.cgi?id=42779





--- Comment #9 from madenginee...@gmail.com  2012-02-16 07:57:35 ---
Created an attachment (id=72397)
 -- (https://bugzilla.kernel.org/attachment.cgi?id=72397)
Register state and code disassembly at failure point with qemu 1.0



Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 12:18, Avi Kivity wrote:

 On 02/07/2012 04:39 PM, Alexander Graf wrote:
 
 Syscalls are orthogonal to that - they're to avoid the fget_light() and to 
 tighten the vcpu/thread and vm/process relationship.
 
 How about keeping the ioctl interface but moving vcpu_run to a syscall then?
 
 I dislike half-and-half interfaces even more.  And it's not like the
 fget_light() is really painful - it's just that I see it occasionally in
 perf top so it annoys me.
 
 That should really be the only thing that belongs into the fast path, right? 
 Every time we do a register sync in user space, we do something wrong. 
 Instead, user space should either
 
  a) have wrappers around register accesses, so it can directly ask for 
 specific registers that it needs
 or
  b) keep everything that would be requested by the register synchronization 
 in shared memory
 
 Always-synced shared memory is a liability, since newer hardware might
 introduce on-chip caches for that state, making synchronization
 expensive.  Or we may choose to keep some of the registers loaded, if we
 have a way to trap on their use from userspace - for example we can
 return to userspace with the guest fpu loaded, and trap if userspace
 tries to use it.
 
 Is an extra syscall for copying TLB entries to user space prohibitively
 expensive?

The copying can be very expensive, yes. We want to have the possibility of 
exposing a very large TLB to the guest, on the order of multiple thousand 
entries. Every entry is a struct of 24 bytes.

 
 
 , keep the rest in user space.
 
 
 When a device is fully in the kernel, we have a good specification of the 
 ABI: it just implements the spec, and the ABI provides the interface from 
 the device to the rest of the world.  Partially accelerated devices means 
 a much greater effort in specifying exactly what it does.  It's also 
 vulnerable to changes in how the guest uses the device.
 
 Why? For the HPET timer register for example, we could have a simple MMIO 
 hook that says
 
  on_read:
return read_current_time() - shared_page.offset;
  on_write:
handle_in_user_space();
 
 It works for the really simple cases, yes, but if the guest wants to set up 
 one-shot timers, it fails.  
 
 I don't understand. Why would anything fail here? 
 
 It fails to provide a benefit, I didn't mean it causes guest failures.
 
 You also have to make sure the kernel part and the user part use exactly
 the same time bases.

Right. It's an optional performance accelerator. If anything doesn't align, 
don't use it. But if you happen to have a system where everything's cool, 
you're faster. Sounds like a good deal to me ;).

 
 Once the logic that's implemented by the kernel accelerator doesn't fit 
 anymore, unregister it.
 
 Yeah.
 
 
 Also look at the PIT which latches on read.
 
 
 For IDE, it would be as simple as
 
  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
  for (i = 1; i < 7; i++) {
    register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
    register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  }
 
 and we should have reduced overhead of IDE by quite a bit already. All the 
 other 2k LOC in hw/ide/core.c don't matter for us really.
 
 
 Just use virtio.
 
 Just use xenbus. Seriously, this is not an answer.
 
 Why not?  We invested effort in making it as fast as possible, and in
 writing the drivers.  IDE will never, ever, get anything close to virtio
 performance, even if we put all of it in the kernel.
 
 However, after these examples, I'm more open to partial acceleration
 now.  I won't ever like it though.
 
 
   - VGA
   - IDE
 
 Why?  There are perfectly good replacements for these (qxl, virtio-blk, 
 virtio-scsi).
 
 Because not every guest supports them. Virtio-blk needs 3rd party drivers. 
 AHCI needs 3rd party drivers on w2k3 and wxp. 
 
 3rd party drivers are a way of life for Windows users; and the
 incremental benefits of IDE acceleration are still far behind virtio.

The typical way of life for Windows users is all-included drivers. Which is 
the case for AHCI, where we're getting awesome performance for Vista and above 
guests. The IDE thing was just an idea for legacy ones.

It'd be great to simply try and see how fast we could get by handling a few 
special registers in kernel space vs heavyweight exiting to QEMU. If it's only 
10%, I wouldn't even bother with creating an interface for it. I'd bet the 
benefits are a lot bigger though.

And the main point was that specific partial device emulation buys us more than 
pseudo-generic accelerators like coalesced mmio, which are also only used by 1 
or 2 devices.
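As an aside, the register_pio_hook idea quoted earlier in this thread can be
modeled in a few lines of plain C.  Everything below is hypothetical -- there
is no such API in KVM or QEMU today; it only illustrates how a fast-path
table could satisfy hooked ports in-kernel and punt everything else (including
writes to a read-only hook) back to userspace:

```c
/* Toy model of a hypothetical in-kernel PIO hook table.  Not real
 * KVM/QEMU code: the names and the API shape are invented here. */
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define MAX_HOOKS 16

struct pio_hook {
	uint16_t port;
	uint8_t *ptr;		/* device-state byte backing this port */
	int writable;
};

static struct pio_hook hooks[MAX_HOOKS];
static size_t nr_hooks;

static void register_pio_hook(uint16_t port, uint8_t *ptr, int writable)
{
	hooks[nr_hooks++] = (struct pio_hook){ port, ptr, writable };
}

/* Returns 1 if the access was handled in the fast path,
 * 0 if it must exit to userspace for full emulation. */
static int pio_access(uint16_t port, int is_write, uint8_t *val)
{
	for (size_t i = 0; i < nr_hooks; i++) {
		if (hooks[i].port != port)
			continue;
		if (is_write && !hooks[i].writable)
			return 0;	/* read-only hook: punt the write */
		if (is_write)
			*hooks[i].ptr = *val;
		else
			*val = *hooks[i].ptr;
		return 1;
	}
	return 0;			/* unhooked port: punt */
}
```

The design point this illustrates is the same one made above: only the hot
registers need a hook; the other 2k LOC of device emulation stay in userspace
and are reached exactly when the table says it cannot handle the access.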

 
 I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
 
 Cirrus or vesa should be okay for them, I don't see what we could do for
 them in the kernel, or why.

That's my point. You need fast emulation of standard devices to get a good 
baseline. Do PV on top, but keep the baseline as fast as is reasonable.

 
 Same for virtio.
 
 Please don't do the Xen mistake again of claiming that all we care about 
 is Linux as a guest.

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Avi Kivity
On 02/15/2012 01:57 PM, Alexander Graf wrote:
  
  Is an extra syscall for copying TLB entries to user space prohibitively
  expensive?

 The copying can be very expensive, yes. We want to have the possibility of 
 exposing a very large TLB to the guest, in the order of multiple kentries. 
 Every entry is a struct of 24 bytes.

You don't need to copy the entire TLB, just the way that maps the
address you're interested in.

btw, why are you interested in virtual addresses in userspace at all?

  
  It works for the really simple cases, yes, but if the guest wants to set 
  up one-shot timers, it fails.  
  
  I don't understand. Why would anything fail here? 
  
  It fails to provide a benefit, I didn't mean it causes guest failures.
  
  You also have to make sure the kernel part and the user part use exactly
  the same time bases.

 Right. It's an optional performance accelerator. If anything doesn't align, 
 don't use it. But if you happen to have a system where everything's cool, 
 you're faster. Sounds like a good deal to me ;).

Depends on how much the alignment relies on guest knowledge.  I guess
with a simple device like HPET, it's simple, but with a complex device,
different guests (or different versions of the same guest) could drive
it very differently.

  
  Because not every guest supports them. Virtio-blk needs 3rd party 
  drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 
  
  3rd party drivers are a way of life for Windows users; and the
  incremental benefits of IDE acceleration are still far behind virtio.

 The typical way of life for Windows users are all-included drivers. Which is 
 the case for AHCI, where we're getting awesome performance for Vista and 
 above guests. The iDE thing was just an idea for legacy ones.

 It'd be great to simply try and see how fast we could get by handling a few 
 special registers in kernel space vs heavyweight exiting to QEMU. If it's 
 only 10%, I wouldn't even bother with creating an interface for it. I'd bet 
 the benefits are a lot bigger though.

 And the main point was that specific partial device emulation buys us more 
 than pseudo-generic accelerators like coalesced mmio, which are also only 
 used by 1 or 2 devices.

Ok.

  
  I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
  
  Cirrus or vesa should be okay for them, I don't see what we could do for
  them in the kernel, or why.

 That's my point. You need fast emulation of standard devices to get a good 
 baseline. Do PV on top, but keep the baseline as fast as is reasonable.

  
  Same for virtio.
  
  Please don't do the Xen mistake again of claiming that all we care about 
  is Linux as a guest.
  
  Rest easy, there's no chance of that.  But if a guest is important 
  enough, virtio drivers will get written.  IDE has no chance in hell of 
  approaching virtio-blk performance, no matter how much effort we put into 
  it.
  
  Ever used VMware? They basically get virtio-blk performance out of 
  ordinary IDE for linear workloads.
  
  For linear loads, so should we, perhaps with greater cpu utilization.
  
  If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
  means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
  shouldn't matter.

 *shrug* last time I checked we were a lot slower. But maybe there's more 
 stuff making things slow than the exit path ;).

One thing that's different is that virtio offloads itself to a thread
very quickly, while IDE does a lot of work in vcpu thread context.

  
  
  KVM's strength has always been its close resemblance to hardware.
  
  This will remain.  But we can't optimize everything.
  
  That's my point. Let's optimize the hot paths and be good. As long as we 
  default to IDE for disk, we should have that be fast, no?
  
  We should make sure that we don't default to IDE.  Qemu has no knowledge
  of the guest, so it can't default to virtio, but higher level tools can
  and should.

 You can only default to virtio on recent Linux. Windows, BSD, etc don't 
 include drivers, so you can't assume it works. You can default to AHCI for 
 basically any recent guest, but that still won't work for XP and the likes :(.

The all-knowing management tool can provide a virtio driver disk, or
even slip-stream the driver into the installation CD.


  
  Ah, because you're on NPT and you can have MMIO hints in the nested page 
  table. Nifty. Yeah, we don't have that luxury :).
  
  Well the real reason is we have an extra bit reported by page faults
  that we can control.  Can't you set up a hashed pte that is configured
  in a way that it will fault, no matter what type of access the guest
  does, and see it in your page fault handler?

 I might be able to synthesize a PTE that is !readable and might throw a 
 permission exception instead of a miss exception. I might be able to 
 synthesize something similar for booke. I don't however get any indication on 
 why things failed.

 So for MMIO reads, 

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 14:29, Avi Kivity wrote:

 On 02/15/2012 01:57 PM, Alexander Graf wrote:
 
 Is an extra syscall for copying TLB entries to user space prohibitively
 expensive?
 
 The copying can be very expensive, yes. We want to have the possibility of 
 exposing a very large TLB to the guest, on the order of several thousand entries. 
 Every entry is a struct of 24 bytes.
 
 You don't need to copy the entire TLB, just the way that maps the
 address you're interested in.

Yeah, unless we do migration, in which case we need to introduce another special 
case to fetch the whole thing :(.

 btw, why are you interested in virtual addresses in userspace at all?

We need them for gdb and monitor introspection.

 
 
 It works for the really simple cases, yes, but if the guest wants to set 
 up one-shot timers, it fails.  
 
 I don't understand. Why would anything fail here? 
 
 It fails to provide a benefit, I didn't mean it causes guest failures.
 
 You also have to make sure the kernel part and the user part use exactly
 the same time bases.
 
 Right. It's an optional performance accelerator. If anything doesn't align, 
 don't use it. But if you happen to have a system where everything's cool, 
 you're faster. Sounds like a good deal to me ;).
 
 Depends on how much the alignment relies on guest knowledge.  I guess
 with a simple device like HPET, it's simple, but with a complex device,
 different guests (or different versions of the same guest) could drive
 it very differently.

Right. But accelerating simple devices > not accelerating any devices. No? :)

 
 
 Because not every guest supports them. Virtio-blk needs 3rd party 
 drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 
 
 3rd party drivers are a way of life for Windows users; and the
 incremental benefits of IDE acceleration are still far behind virtio.
 
 The typical way of life for Windows users are all-included drivers. Which is 
 the case for AHCI, where we're getting awesome performance for Vista and 
 above guests. The IDE thing was just an idea for legacy ones.
 
 It'd be great to simply try and see how fast we could get by handling a few 
 special registers in kernel space vs heavyweight exiting to QEMU. If it's 
 only 10%, I wouldn't even bother with creating an interface for it. I'd bet 
 the benefits are a lot bigger though.
 
 And the main point was that specific partial device emulation buys us more 
 than pseudo-generic accelerators like coalesced mmio, which are also only 
 used by 1 or 2 devices.
 
 Ok.
 
 
 I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. 
 
 Cirrus or vesa should be okay for them, I don't see what we could do for
 them in the kernel, or why.
 
 That's my point. You need fast emulation of standard devices to get a good 
 baseline. Do PV on top, but keep the baseline as fast as is reasonable.
 
 
 Same for virtio.
 
 Please don't do the Xen mistake again of claiming that all we care about 
 is Linux as a guest.
 
 Rest easy, there's no chance of that.  But if a guest is important 
 enough, virtio drivers will get written.  IDE has no chance in hell of 
 approaching virtio-blk performance, no matter how much effort we put into 
 it.
 
 Ever used VMware? They basically get virtio-blk performance out of 
 ordinary IDE for linear workloads.
 
 For linear loads, so should we, perhaps with greater cpu utilization.
 
 If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple)
 means 0.5 msec/transaction.  Spending 30 usec on some heavyweight exits
 shouldn't matter.
 
 *shrug* last time I checked we were a lot slower. But maybe there's more 
 stuff making things slow than the exit path ;).
 
 One thing that's different is that virtio offloads itself to a thread
 very quickly, while IDE does a lot of work in vcpu thread context.

So it's all about latencies again, which could be reduced at least a fair bit 
with the scheme I described above. But really, this needs to be prototyped and 
benchmarked to actually give us data on how fast it would get us.

 
 
 
 KVM's strength has always been its close resemblance to hardware.
 
 This will remain.  But we can't optimize everything.
 
 That's my point. Let's optimize the hot paths and be good. As long as we 
 default to IDE for disk, we should have that be fast, no?
 
 We should make sure that we don't default to IDE.  Qemu has no knowledge
 of the guest, so it can't default to virtio, but higher level tools can
 and should.
 
 You can only default to virtio on recent Linux. Windows, BSD, etc don't 
 include drivers, so you can't assume it works. You can default to AHCI for 
 basically any recent guest, but that still won't work for XP and the likes 
 :(.
 
 The all-knowing management tool can provide a virtio driver disk, or
 even slip-stream the driver into the installation CD.

One management tool might do that, another one might not. We can't assume that 
all management tools are all-knowing. Sometimes you also want to run guest OSs 
that 

Re: [Qemu-devel] [RFC] Next gen kvm api

2012-02-15 Thread Arnd Bergmann
On Tuesday 07 February 2012, Alexander Graf wrote:
  
  Not sure we'll ever get there. For PPC, it will probably take another 1-2 
  years until we get the 32-bit targets stabilized. By then we will have new 
  64-bit support though. And then the next gen will come out giving us even 
  more new constraints.
  
  I would expect that newer archs have less constraints, not more.
 
 Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have 
 today on 32-bit, but extends a
 bunch of registers to 64-bit. So what if we laid out stuff wrong before?
 
 I don't even want to imagine what v7 arm vs v8 arm looks like. It's a 
 completely new architecture.
 

I have not seen the source but I'm pretty sure that v7 and v8 look very
similar regarding virtualization support because they were designed together,
including the concept that on v8 you can run either a v7 compatible 32 bit
hypervisor with 32 bit guests or a 64 bit hypervisor with a combination of
32 and 64 bit guests. Also, the page table layout in v7-LPAE is identical
to the v8 one. The main difference is the instruction set, but then ARMv7
already has four of these (ARM, Thumb, Thumb2, ThumbEE).

Arnd

--
To unsubscribe from this list: send the line unsubscribe kvm-ppc in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support

2012-02-15 Thread Alexander Graf

On 15.02.2012, at 20:40, Scott Wood wrote:

 On 02/15/2012 01:36 PM, Alexander Graf wrote:
 
 On 10.01.2012, at 01:51, Scott Wood wrote:
 I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and
 treat it as a kernel fault (search exception table) -- but this works
 too and is a bit cleaner (could be other uses of external pid), at the
 expense of a couple extra instructions in the emulation path (but
 probably a slightly faster host TLB handler).
 
 The check wouldn't go in DO_KVM, though, since on bookehv that only
 deals with diverting flow when xSRR1[GS] is set, which wouldn't be the
 case here.
 
 Thinking about it a bit more, how is this different from a failed 
 get_user()? We can just use the same fixup mechanism as there, right?
 
 The fixup mechanism can be the same (we'd like to know whether it failed
 due to TLB miss or DSI, so we know which to reflect

No, we only want to know that the fast path failed. The reason why is a separate 
matter and should be evaluated in the slow path. We shouldn't ever fault here 
during normal operation btw. We already executed a guest instruction, so 
there's almost no reason it can't be read.

 -- but if necessary
 I think we can figure that out with a tlbsx).  What's different is that
 the page fault handler needs to know that any external pid (or AS1)
 fault is bad, same as if the address were in the kernel area, and it
 should go directly to searching the exception tables instead of trying
 to page something in.

Yes and no. We need to force it to search the exception tables. We don't care 
if the page fault handlers knows anything about external pids.

Either way, we discussed the further stuff on IRC and came to a working 
solution :). Stay tuned.


Alex
