Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/14/2012 09:43 PM, Marcelo Tosatti wrote: Also it should not be necessary for these flushes to be inside mmu_lock on EPT/NPT case (since there is no write protection there). We do write protect with TDP, if nested virt is active. The question is whether we have indirect pages or not, not whether TDP is active or not (even without TDP, if you don't enable paging in the guest, you don't have to write protect). But it would be awkward to differentiate the unlock position based on EPT/NPT. I would really like to move the IPI back out of the lock. How about something like a sequence lock:

  spin_lock(mmu_lock);
  need_flush = write_protect_stuff();
  atomic_add(kvm->want_flush_counter, need_flush);
  spin_unlock(mmu_lock);

  while ((done = atomic_read(kvm->done_flush_counter)) < (want = atomic_read(kvm->want_flush_counter))) {
          kvm_make_request(flush);
          atomic_cmpxchg(kvm->done_flush_counter, done, want);
  }

This (or maybe a corrected and optimized version) ensures that any need_flush cannot pass the while () barrier, no matter which thread encounters it first. However it violates the "do not invent new locking techniques" commandment. Can we map it to some existing method? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: qemu-kvm-1.0 regression with usb tablet after live migration
Anyone? Peter Lieven wrote: Hi, I recently started updating our VMs to qemu-kvm 1.0. Since then I see that the usb tablet device (used as a pointer device for accurate mouse positioning) becomes unavailable after live migration. If I migrate a few times, a Windows 7 VM reliably stops using the USB tablet and falls back to the PS/2 mouse. If I do the same with qemu-kvm-0.12.5 with the very same VM, it works fine. Can anyone imagine what introduced this flaw? Thanks, Peter
[Bug 42755] KVM is being extremely slow on AMD Athlon64 4000+ Dual Core 2.1GHz Brisbane
https://bugzilla.kernel.org/show_bug.cgi?id=42755 --- Comment #30 from Avi Kivity a...@redhat.com 2012-02-15 09:28:12 --- Disable ksm, and build with debug information so we get useful information instead of hex addresses. -- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are watching the assignee of the bug.
Re: AESNI and guest hosts
On 02/14/2012 08:18 PM, Brian Jackson wrote: On Tuesday, February 14, 2012 03:31:10 AM Ryan Brown wrote: Sorry for being a noob here, Any clues with this?, anyone ... On Mon, Feb 13, 2012 at 2:05 AM, Ryan Brown mp3g...@gmail.com wrote: Host/KVM server is running linux 3.2.4 (Debian wheezy), and the guest kernel is 3.2.5. The cpu is an E3-1230, but for some reason it's not able to supply the guest with aesni. Is there a config option or is there something we're missing? I don't think it's supported to pass that functionality to the guest. Why not? Perhaps a new libvirt or qemu is needed.
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/15/2012 11:18 AM, Avi Kivity wrote: On 02/14/2012 09:43 PM, Marcelo Tosatti wrote: Also it should not be necessary for these flushes to be inside mmu_lock on EPT/NPT case (since there is no write protection there). We do write protect with TDP, if nested virt is active. The question is whether we have indirect pages or not, not whether TDP is active or not (even without TDP, if you don't enable paging in the guest, you don't have to write protect). But it would be awkward to differentiate the unlock position based on EPT/NPT. I would really like to move the IPI back out of the lock. How about something like a sequence lock:

  spin_lock(mmu_lock);
  need_flush = write_protect_stuff();
  atomic_add(kvm->want_flush_counter, need_flush);
  spin_unlock(mmu_lock);

  while ((done = atomic_read(kvm->done_flush_counter)) < (want = atomic_read(kvm->want_flush_counter))) {
          kvm_make_request(flush);
          atomic_cmpxchg(kvm->done_flush_counter, done, want);
  }

This (or maybe a corrected and optimized version) ensures that any need_flush cannot pass the while () barrier, no matter which thread encounters it first. However it violates the "do not invent new locking techniques" commandment. Can we map it to some existing method? There is no need to advance 'want' in the loop. So we could do

  /* must call with mmu_lock held */
  void kvm_mmu_defer_remote_flush(kvm, need_flush)
  {
          if (need_flush)
                  ++kvm->flush_counter.want;
  }

  /* may call without mmu_lock */
  void kvm_mmu_commit_remote_flush(kvm)
  {
          want = ACCESS_ONCE(kvm->flush_counter.want);
          while ((done = atomic_read(kvm->flush_counter.done)) < want) {
                  kvm_make_request(flush);
                  atomic_cmpxchg(kvm->flush_counter.done, done, want);
          }
  }
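The want/done counter scheme above can be sketched in userspace with C11 atomics. This is an illustrative model, not KVM code: `struct flush_counter`, `defer_remote_flush`, `commit_remote_flush`, and the `flush_requests` stand-in for `kvm_make_request(flush)` are all hypothetical names mirroring the mail.

```c
#include <stdatomic.h>

/* Illustrative userspace sketch of the want/done flush counters above;
 * the names mirror the mail, not actual KVM code. */
struct flush_counter {
	atomic_uint want;	/* flushes requested, bumped under mmu_lock */
	atomic_uint done;	/* flushes known to have completed */
};

static int flush_requests;	/* stands in for kvm_make_request(flush) */

/* must be called with mmu_lock held */
static void defer_remote_flush(struct flush_counter *c, int need_flush)
{
	if (need_flush)
		atomic_fetch_add(&c->want, 1);
}

/* may be called without mmu_lock: any flush deferred before we read
 * 'want' is covered either by our request or by another committer's */
static void commit_remote_flush(struct flush_counter *c)
{
	unsigned int want = atomic_load(&c->want);
	unsigned int done;

	while ((done = atomic_load(&c->done)) < want) {
		flush_requests++;	/* kvm_make_request(flush) */
		/* publish progress; losing this race is fine, since the
		 * winner advanced 'done' at least as far as we needed */
		atomic_compare_exchange_strong(&c->done, &done, want);
	}
}
```

A committer that observes done >= want falls straight through the loop, so only threads with flushes actually pending pay for a request.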
Re: The way of mapping BIOS into the guest's address space
On Tue, Feb 14, 2012 at 11:07:08PM -0500, Kevin O'Connor wrote: ... hardware. Maybe we could poke someone from the KVM camp for a hint? SeaBIOS has two ways to be deployed - the first is to copy the image to the top of the first 1MB (eg, 0xe-0xf) and jump to 0xf000:0xfff0 in 16bit mode. The second way is to use the SeaBIOS elf and deploy into memory (according to the elf memory map) and jump to SeaBIOS in 32bit mode (according to the elf entry point). SeaBIOS doesn't really need to be in the top 4G of ram. SeaBIOS does expect to have normal PC hardware devices (eg, a PIC), though many hardware devices can be compiled out via its kconfig interface. The more interesting challenge will likely be in communicating critical pieces of information (eg, total memory size) into SeaBIOS. The SeaBIOS mailing list (seab...@seabios.org) is probably a better location for technical seabios questions. Hi Kevin, thanks for the pointer. Yes, providing info back to seabios to set up MTRRs and such (so seabios would recognize them) is the most challenging part, I think. Cyrill
Re: x86: kvmclock: abstract save/restore sched_clock_state (v2)
On 02/13/2012 05:52 PM, Marcelo Tosatti wrote: { + x86_platform.restore_sched_clock_state(); Isn't it too early? It is scary to tell the hypervisor to write to some memory location and then completely replace page tables and half of the cpu state in __restore_processor_state. Wouldn't that have the potential of writing into a place that is not yet restored, so the restored hv_clock might still be stale? No, memory is copied in swsusp_arch_resume(), which happens before restore_processor_state(). restore_processor_state() only sets up registers and MTRRs. In addition, kvmclock uses physical addresses, so page table changes don't matter. Note we could have done this in __save_processor_state()/__restore_processor_state() (it's just reading and writing an MSR, like we do for MSR_IA32_MISC_ENABLE), but I think your patch is the right way. I'd like an ack from the x86 maintainers though.
Re: [PATCH] BUG in pv_clock when overflow condition is detected
On 02/13/2012 08:20 PM, Igor Mammedov wrote: BUG when overflow occurs at pvclock.c:pvclock_get_nsec_offset u64 delta = native_read_tsc() - shadow->tsc_timestamp; this might happen at an attempt to read an uninitialized yet clock. It won't prevent stalls and hangs but at least it won't do it silently. Signed-off-by: Igor Mammedov imamm...@redhat.com
---
 arch/x86/kernel/pvclock.c |    5 ++++-
 1 files changed, 4 insertions(+), 1 deletions(-)

diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c
index 42eb330..35a6190 100644
--- a/arch/x86/kernel/pvclock.c
+++ b/arch/x86/kernel/pvclock.c
@@ -43,7 +43,10 @@ void pvclock_set_flags(u8 flags)
 static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow)
 {
-	u64 delta = native_read_tsc() - shadow->tsc_timestamp;
+	u64 delta;
+	u64 tsc = native_read_tsc();
+	BUG_ON(tsc < shadow->tsc_timestamp);
+	delta = tsc - shadow->tsc_timestamp;
 	return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift);

Maybe a WARN_ON_ONCE()? Otherwise a relatively minor hypervisor bug can kill the guest.
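One possible shape of the WARN_ON_ONCE() alternative, as a hypothetical userspace sketch: warn the first time the TSC appears to go backwards and clamp the delta to zero so the guest keeps running. The clamping behavior and all names here are illustrative assumptions, not what the posted patch does, and the tsc_to_nsec_mul/tsc_shift scaling is omitted.

```c
#include <stdint.h>
#include <stdio.h>

/* Hypothetical stand-in for struct pvclock_shadow_time. */
struct shadow_time {
	uint64_t tsc_timestamp;
};

static int tsc_warned;

static uint64_t get_nsec_offset(uint64_t tsc, const struct shadow_time *shadow)
{
	if (tsc < shadow->tsc_timestamp) {
		if (!tsc_warned) {	/* WARN_ON_ONCE() stand-in */
			fprintf(stderr, "pvclock: TSC went backwards\n");
			tsc_warned = 1;
		}
		return 0;	/* clamp instead of BUG_ON(), so the guest survives */
	}
	return tsc - shadow->tsc_timestamp;	/* scaling omitted */
}
```

Without the check, the unsigned subtraction would wrap to a huge delta, which is exactly the silent overflow the patch wants to catch.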
Re: Q: Does linux kvm native tool support loading BIOS as the default loader now?
On 02/13/2012 03:35 PM, Asias He wrote: On 02/13/2012 12:38 PM, Pekka Enberg wrote: On Mon, Feb 13, 2012 at 08:14:22PM +0800, Yang Bai wrote: As I know, the native tool does not support loading a BIOS, so it does not support Windows. Is this supported now? If not, I may try to implement it. You're welcome to do so ;-). This would open the door for non-Linux OS support in kvm tool. Also to loading the kernel from /boot, and so allowing the normal distro kernel update mechanism to work.
Re: vsyscall=emulate regression
On (Tue) 14 Feb 2012 [08:26:22], Andy Lutomirski wrote: On Tue, Feb 14, 2012 at 4:22 AM, Amit Shah amit.s...@redhat.com wrote: On (Fri) 03 Feb 2012 [13:57:48], Amit Shah wrote: Hello, I'm booting some latest kernels on a Fedora 11 (released June 2009) guest. After the recent change of default to vsyscall=emulate, the guest fails to boot (init segfaults). I also tried vsyscall=none, as suggested by hpa, and that fails as well. Only vsyscall=native works fine. The commit that introduced the kernel parameter, 3ae36655b97a03fa1decf72f04078ef945647c1a is bad too. I suggest we revert 2e57ae0515124af45dd889bfbd4840fd40fcc07d till we track down and fix the vsyscall=emulate case. Hi- Sorry, I lost track of this one. I can't reproduce it, although I doubt I've set up the right test environment. But this is fishy: init[1]: segfault at ff600400 ip ff600400 sp 7fff9c8ba098 error 5 Error 5, if I'm decoding it correctly, is a userspace read (i.e. not execute) fault. The vsyscall emulation changes shouldn't have had any effect on reads there. Can you try booting the initramfs here: http://web.mit.edu/luto/www/linux/vsyscall_initramfs.img with your kernel image (i.e. qemu-kvm -kernel whatever -initrd vsyscall_initramfs.img -whatever_else) and seeing what happens? It works for me. This too results in a similar error. I'm also curious what happens if you run without kvm (i.e. straight qemu) Interesting; without kvm, this does work fine. and what your .config on the guest kernel is. It sounds like something's wrong with your fixmap, which makes me wonder if your qemu/kernel combo is capable of booting even a modern distro (up-to-date F16, say) -- the vvar page uses identical fixmap flags as the vsyscall page in vsyscall=emulate and vsyscall=none mode. I didn't try a modern distro, but looks like this is enough evidence for now to check the kvm emulator code.
I tried the same guests on a newer kernel (Fedora 16's 3.2), and things worked fine except for vsyscall=none, panic message below. What host cpu are you on and what qemu flags do you use?

$ cat /proc/cpuinfo
processor	: 0
vendor_id	: GenuineIntel
cpu family	: 6
model		: 15
model name	: Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz
stepping	: 11
cpu MHz		: 2000.000
cache size	: 4096 KB
physical id	: 0
siblings	: 2
core id		: 0
cpu cores	: 2
apicid		: 0
initial apicid	: 0
fpu		: yes
fpu_exception	: yes
cpuid level	: 10
wp		: yes
flags		: fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dts tpr_shadow vnmi flexpriority
bogomips	: 4654.73
clflush size	: 64
cache_alignment	: 64
address sizes	: 36 bits physical, 48 bits virtual
power management:

Maybe something is wrong with your emulator. Yes, looks like it. Thanks! This is what I get with vsyscall=none, where emulate and native work fine on the 3.2 kernel on different host hardware, the guest stays the same:

[2.874661] debug: unmapping init memory 8167f000..818dc000
[2.876778] Write protecting the kernel read-only data: 6144k
[2.879111] debug: unmapping init memory 880001318000..88000140
[2.881242] debug: unmapping init memory 8800015a..88000160
[2.884637] init[1] vsyscall attempted with vsyscall=none ip:ff600400 cs:33 sp:7fff2f48fe18 ax:7fff2f48fe50 si:7fff2f48ff08 di:0
[2.888078] init[1]: segfault at ff600400 ip ff600400 sp 7fff2f48fe18 error 15
[2.888193] Refined TSC clocksource calibration: 2691.293 MHz.
[2.892748]
[2.895219] Kernel panic - not syncing: Attempted to kill init!

Amit
Re: [Qemu-devel] [RFC] Next gen kvm api
On 02/07/2012 04:39 PM, Alexander Graf wrote: Syscalls are orthogonal to that - they're to avoid the fget_light() and to tighten the vcpu/thread and vm/process relationship. How about keeping the ioctl interface but moving vcpu_run to a syscall then? I dislike half-and-half interfaces even more. And it's not like the fget_light() is really painful - it's just that I see it occasionally in perf top so it annoys me. That should really be the only thing that belongs into the fast path, right? Every time we do a register sync in user space, we do something wrong. Instead, user space should either a) have wrappers around register accesses, so it can directly ask for specific registers that it needs or b) keep everything that would be requested by the register synchronization in shared memory Always-synced shared memory is a liability, since newer hardware might introduce on-chip caches for that state, making synchronization expensive. Or we may choose to keep some of the registers loaded, if we have a way to trap on their use from userspace - for example we can return to userspace with the guest fpu loaded, and trap if userspace tries to use it. Is an extra syscall for copying TLB entries to user space prohibitively expensive? , keep the rest in user space. When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world. Partially accelerated devices means a much greater effort in specifying exactly what it does. It's also vulnerable to changes in how the guest uses the device. Why? For the HPET timer register for example, we could have a simple MMIO hook that says on_read: return read_current_time() - shared_page.offset; on_write: handle_in_user_space(); It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails. I don't understand. Why would anything fail here? 
It fails to provide a benefit, I didn't mean it causes guest failures. You also have to make sure the kernel part and the user part use exactly the same time bases. Once the logic that's implemented by the kernel accelerator doesn't fit anymore, unregister it. Yeah. Also look at the PIT which latches on read. For IDE, it would be as simple as

  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
  for (i = 1; i < 7; i++) {
          register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
          register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  }

and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really. Just use virtio. Just use xenbus. Seriously, this is not an answer. Why not? We invested effort in making it as fast as possible, and in writing the drivers. IDE will never, ever, get anything close to virtio performance, even if we put all of it in the kernel. However, after these examples, I'm more open to partial acceleration now. I won't ever like it though. - VGA - IDE Why? There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi). Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 3rd party drivers are a way of life for Windows users; and the incremental benefits of IDE acceleration are still far behind virtio. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Cirrus or vesa should be okay for them, I don't see what we could do for them in the kernel, or why. Same for virtio. Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest. Rest easy, there's no chance of that. But if a guest is important enough, virtio drivers will get written. IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it. Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads.
For linear loads, so should we, perhaps with greater cpu utliization. If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple) means 0.5 msec/transaction. Spending 30 usec on some heavyweight exits shouldn't matter. KVM's strength has always been its close resemblance to hardware. This will remain. But we can't optimize everything. That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no? We should make sure that we don't default to IDE. Qemu has no knowledge of the guest, so it can't default to virtio, but higher level tools can and should. Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky. We're currently running
Re: [PATCH] BUG in pv_clock when overflow condition is detected
On 02/15/2012 11:49 AM, Avi Kivity wrote: On 02/13/2012 08:20 PM, Igor Mammedov wrote: BUG when overflow occurs at pvclock.c:pvclock_get_nsec_offset u64 delta = native_read_tsc() - shadow->tsc_timestamp; this might happen at an attempt to read an uninitialized yet clock. It won't prevent stalls and hangs but at least it won't do it silently. Signed-off-by: Igor Mammedov imamm...@redhat.com --- arch/x86/kernel/pvclock.c |    5 ++++- 1 files changed, 4 insertions(+), 1 deletions(-) diff --git a/arch/x86/kernel/pvclock.c b/arch/x86/kernel/pvclock.c index 42eb330..35a6190 100644 --- a/arch/x86/kernel/pvclock.c +++ b/arch/x86/kernel/pvclock.c @@ -43,7 +43,10 @@ void pvclock_set_flags(u8 flags) static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) { - u64 delta = native_read_tsc() - shadow->tsc_timestamp; + u64 delta; + u64 tsc = native_read_tsc(); + BUG_ON(tsc < shadow->tsc_timestamp); + delta = tsc - shadow->tsc_timestamp; return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift); Maybe a WARN_ON_ONCE()? Otherwise a relatively minor hypervisor bug can kill the guest. An attempt to print from this place is not perfect since it often leads to a recursive call into this very function and it hangs there anyway. But if you insist I'll re-post it with WARN_ON_ONCE. It won't make much difference because the guest will hang/stall due to the overflow anyway. If there is an intention to keep the guest functional after the event then maybe this patch is the way to go http://www.spinics.net/lists/kvm/msg68463.html this way the clock will be resilient to this kind of error, like the bare-metal one is. -- Thanks, Igor
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/15/2012 05:47 PM, Avi Kivity wrote: On 02/15/2012 11:18 AM, Avi Kivity wrote: On 02/14/2012 09:43 PM, Marcelo Tosatti wrote: Also it should not be necessary for these flushes to be inside mmu_lock on EPT/NPT case (since there is no write protection there). We do write protect with TDP, if nested virt is active. The question is whether we have indirect pages or not, not whether TDP is active or not (even without TDP, if you don't enable paging in the guest, you don't have to write protect). But it would be awkward to differentiate the unlock position based on EPT/NPT. I would really like to move the IPI back out of the lock. How about something like a sequence lock:

  spin_lock(mmu_lock);
  need_flush = write_protect_stuff();
  atomic_add(kvm->want_flush_counter, need_flush);
  spin_unlock(mmu_lock);

  while ((done = atomic_read(kvm->done_flush_counter)) < (want = atomic_read(kvm->want_flush_counter))) {
          kvm_make_request(flush);
          atomic_cmpxchg(kvm->done_flush_counter, done, want);
  }

This (or maybe a corrected and optimized version) ensures that any need_flush cannot pass the while () barrier, no matter which thread encounters it first. However it violates the "do not invent new locking techniques" commandment. Can we map it to some existing method? There is no need to advance 'want' in the loop.
So we could do

  /* must call with mmu_lock held */
  void kvm_mmu_defer_remote_flush(kvm, need_flush)
  {
          if (need_flush)
                  ++kvm->flush_counter.want;
  }

  /* may call without mmu_lock */
  void kvm_mmu_commit_remote_flush(kvm)
  {
          want = ACCESS_ONCE(kvm->flush_counter.want);
          while ((done = atomic_read(kvm->flush_counter.done)) < want) {
                  kvm_make_request(flush);
                  atomic_cmpxchg(kvm->flush_counter.done, done, want);
          }
  }

Hmm, we already have kvm->tlbs_dirty, so we can do it like this:

  #define SPTE_INVALID_UNCLEAN (1 << 63)

in the invalid page path:

  lock mmu_lock
  if (spte is invalid)
          kvm->tlbs_dirty |= SPTE_INVALID_UNCLEAN;
  need_tlb_flush = kvm->tlbs_dirty;
  unlock mmu_lock
  if (need_tlb_flush)
          kvm_flush_remote_tlbs();

And in the page write-protected path:

  lock mmu_lock
  if (it has a spte changed to readonly || kvm->tlbs_dirty & SPTE_INVALID_UNCLEAN)
          kvm_flush_remote_tlbs();
  unlock mmu_lock

How about this?
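The tlbs_dirty variant above can be sketched as plain C, with locking elided and comments marking where mmu_lock would be held. All names with a `_sketch` suffix are illustrative stand-ins, not KVM functions, and the single-flag model is an assumption of this sketch.

```c
#include <stdint.h>

/* Borrow one high bit of the dirty state as an
 * "invalid but not yet flushed spte" flag, as in the mail. */
#define SPTE_INVALID_UNCLEAN (1ull << 63)

static uint64_t tlbs_dirty;
static int remote_flushes;

static void kvm_flush_remote_tlbs_sketch(void)
{
	remote_flushes++;
	tlbs_dirty = 0;	/* a flush clears all deferred work */
}

/* invalid page path: record the deferred flush, then flush after unlock */
static void invalidate_page_sketch(void)
{
	/* under mmu_lock */
	tlbs_dirty |= SPTE_INVALID_UNCLEAN;
	uint64_t need_tlb_flush = tlbs_dirty;
	/* after unlock */
	if (need_tlb_flush)
		kvm_flush_remote_tlbs_sketch();
}

/* write-protect path: must flush while still under mmu_lock if a spte
 * became readonly or an invalidated spte has not been flushed yet */
static void write_protect_sketch(int made_readonly)
{
	/* under mmu_lock */
	if (made_readonly || (tlbs_dirty & SPTE_INVALID_UNCLEAN))
		kvm_flush_remote_tlbs_sketch();
	/* unlock mmu_lock */
}
```

The point of the flag is that the write-protect path, which must flush before releasing the lock, also covers any invalidation whose flush was deferred but has not happened yet.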
Re: [Qemu-devel] [RFC] Next gen kvm api
On 15.02.2012, at 12:18, Avi Kivity wrote: On 02/07/2012 04:39 PM, Alexander Graf wrote: Syscalls are orthogonal to that - they're to avoid the fget_light() and to tighten the vcpu/thread and vm/process relationship. How about keeping the ioctl interface but moving vcpu_run to a syscall then? I dislike half-and-half interfaces even more. And it's not like the fget_light() is really painful - it's just that I see it occasionally in perf top so it annoys me. That should really be the only thing that belongs into the fast path, right? Every time we do a register sync in user space, we do something wrong. Instead, user space should either a) have wrappers around register accesses, so it can directly ask for specific registers that it needs or b) keep everything that would be requested by the register synchronization in shared memory Always-synced shared memory is a liability, since newer hardware might introduce on-chip caches for that state, making synchronization expensive. Or we may choose to keep some of the registers loaded, if we have a way to trap on their use from userspace - for example we can return to userspace with the guest fpu loaded, and trap if userspace tries to use it. Is an extra syscall for copying TLB entries to user space prohibitively expensive? The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes. , keep the rest in user space. When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world. Partially accelerated devices means a much greater effort in specifying exactly what it does. It's also vulnerable to changes in how the guest uses the device. Why? 
For the HPET timer register for example, we could have a simple MMIO hook that says on_read: return read_current_time() - shared_page.offset; on_write: handle_in_user_space(); It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails. I don't understand. Why would anything fail here? It fails to provide a benefit, I didn't mean it causes guest failures. You also have to make sure the kernel part and the user part use exactly the same time bases. Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;). Once the logic that's implemented by the kernel accelerator doesn't fit anymore, unregister it. Yeah. Also look at the PIT which latches on read. For IDE, it would be as simple as

  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
  for (i = 1; i < 7; i++) {
          register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
          register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  }

and we should have reduced overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really. Just use virtio. Just use xenbus. Seriously, this is not an answer. Why not? We invested effort in making it as fast as possible, and in writing the drivers. IDE will never, ever, get anything close to virtio performance, even if we put all of it in the kernel. However, after these examples, I'm more open to partial acceleration now. I won't ever like it though. - VGA - IDE Why? There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi). Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 3rd party drivers are a way of life for Windows users; and the incremental benefits of IDE acceleration are still far behind virtio. The typical way of life for Windows users is all-included drivers.
Which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The IDE thing was just an idea for legacy ones. It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though. And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Cirrus or vesa should be okay for them, I don't see what we could do for them in the kernel, or why. That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable. Same for virtio. Please don't do
Re: AESNI and guest hosts
I don't think it's supported to pass that functionality to the guest. Why not? Perhaps a new libvirt or qemu is needed. Should it be the case to add one of the following? <feature name='aes'/> or.. <feature name='aesni'/> something like that? Host is using linux kernel 3.2.4 (Debian Wheezy), libvirt (0.9.8-2), qemu (1.0+dfsg-2). Guest is on linux kernel Ubuntu/3.2.5
Re: [Qemu-devel] [RFC] Next gen kvm api
On 02/15/2012 01:57 PM, Alexander Graf wrote: Is an extra syscall for copying TLB entries to user space prohibitively expensive? The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes. You don't need to copy the entire TLB, just the way that maps the address you're interested in. btw, why are you interested in virtual addresses in userspace at all? It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails. I don't understand. Why would anything fail here? It fails to provide a benefit, I didn't mean it causes guest failures. You also have to make sure the kernel part and the user part use exactly the same time bases. Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;). Depends on how much the alignment relies on guest knowledge. I guess with a simple device like HPET, it's simple, but with a complex device, different guests (or different versions of the same guest) could drive it very differently. Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 3rd party drivers are a way of life for Windows users; and the incremental benefits of IDE acceleration are still far behind virtio. The typical way of life for Windows users are all-included drivers. Which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The iDE thing was just an idea for legacy ones. It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though. 
And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices. Ok. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Cirrus or vesa should be okay for them, I don't see what we could do for them in the kernel, or why. That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable. Same for virtio. Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest. Rest easy, there's no chance of that. But if a guest is important enough, virtio drivers will get written. IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it. Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads. For linear loads, so should we, perhaps with greater cpu utilization. If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple) means 0.5 msec/transaction. Spending 30 usec on some heavyweight exits shouldn't matter. *shrug* last time I checked we were a lot slower. But maybe there's more stuff making things slow than the exit path ;). One thing that's different is that virtio offloads itself to a thread very quickly, while IDE does a lot of work in vcpu thread context. KVM's strength has always been its close resemblance to hardware. This will remain. But we can't optimize everything. That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no? We should make sure that we don't default to IDE. Qemu has no knowledge of the guest, so it can't default to virtio, but higher level tools can and should. You can only default to virtio on recent Linux. Windows, BSD, etc don't include drivers, so you can't assume it works.
You can default to AHCI for basically any recent guest, but that still won't work for XP and the likes :(. The all-knowing management tool can provide a virtio driver disk, or even slip-stream the driver into the installation CD. Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :). Well the real reason is we have an extra bit reported by page faults that we can control. Can't you set up a hashed pte that is configured in a way that it will fault, no matter what type of access the guest does, and see it in your page fault handler? I might be able to synthesize a PTE that is !readable and might throw a permission exception instead of a miss exception. I might be able to synthesize something similar for booke. I don't however get any indication on why things failed. So for MMIO reads,
Re: [Qemu-devel] [RFC] Next gen kvm api
On 02/12/2012 09:10 AM, Takuya Yoshikawa wrote: Avi Kivity a...@redhat.com wrote: Slot searching is quite fast since there's a small number of slots, and we sort the larger ones to be in the front, so positive lookups are fast. We cache negative lookups in the shadow page tables (an spte can be either not mapped, mapped to RAM, or not mapped and known to be mmio) so we rarely need to walk the entire list. Well, we don't always have shadow page tables. Having hints for unmapped guest memory like this is pretty tricky. We're currently running into issues with device assignment though, where we get a lot of small slots mapped to real hardware. I'm sure that will hit us on x86 sooner or later too. For x86 that's not a problem, since once you map a page, it stays mapped (on modern hardware). I was once thinking about how to search a slot reasonably fast for every case, even when we do not have mmio-spte cache. One possible way I thought up was to sort slots according to their base_gfn. Then the problem would become: find the first slot whose base_gfn + npages is greater than this gfn. Since we can do binary search, the search cost is O(log(# of slots)). But I guess that most of the time was wasted on reading many memslots just to know their base_gfn and npages. So the most practically effective thing is to make a separate array which holds just their base_gfn. This will make the task a simple, and cache friendly, search on an integer array: probably faster than using *-tree data structure. This assumes that there is equal probability for matching any slot. But that's not true, even if you have hundreds of slots, the probability is much greater for the two main memory slots, or if you're playing with the framebuffer, the framebuffer slot. Everything else is loaded quickly into shadow and forgotten. If needed, we should make cmp_memslot() architecture specific in the end? We could, but why is it needed? This logic holds for all architectures. 
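A minimal userspace sketch of the sorted-array lookup Takuya describes (struct and function names are illustrative, not actual KVM code): binary-search for the first slot whose base_gfn + npages exceeds the gfn, then check that the gfn really falls inside it.

```c
#include <assert.h>

/* Hypothetical, simplified memslot: only the fields the search needs. */
struct slot {
	unsigned long base_gfn;
	unsigned long npages;
};

/*
 * Slots are sorted by base_gfn.  Find the first slot with
 * base_gfn + npages > gfn; if the gfn lies inside it, that is the
 * mapping.  O(log n) instead of a linear scan over every slot.
 */
static int find_slot(const struct slot *slots, int n, unsigned long gfn)
{
	int lo = 0, hi = n;

	while (lo < hi) {
		int mid = lo + (hi - lo) / 2;

		if (slots[mid].base_gfn + slots[mid].npages <= gfn)
			lo = mid + 1;
		else
			hi = mid;
	}
	if (lo < n && gfn >= slots[lo].base_gfn)
		return lo;
	return -1;	/* unmapped: no slot covers this gfn */
}
```

As noted in the reply above, whether this beats the sorted-by-size linear scan depends on the access distribution (two big RAM slots dominating), not just on asymptotics.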
-- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] [RFC] Next gen kvm api
On 02/07/2012 05:23 PM, Anthony Liguori wrote: On 02/07/2012 07:40 AM, Alexander Graf wrote: Why? For the HPET timer register for example, we could have a simple MMIO hook that says on_read: return read_current_time() - shared_page.offset; on_write: handle_in_user_space(); For IDE, it would be as simple as register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]); for (i = 1; i < 7; i++) { register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]); register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]); } You can't easily serialize updates to that address with the kernel since two threads are likely going to be accessing it at the same time. That either means an expensive sync operation or a reliance on atomic instructions. But not all architectures offer non-word sized atomic instructions so it gets fairly nasty in practice. I doubt that any guest accesses IDE registers from two threads in parallel. The guest will have some lock, so we could have a lock as well and be assured that there will never be contention.
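For what it's worth, the pio-hook idea above can be sketched in plain userspace C. The register_pio_hook_ptr_r/w names come from Alexander's example; the tables and dispatch are illustrative stand-ins, not an existing KVM interface:

```c
#include <assert.h>
#include <stdint.h>

#define PIO_PORTS 0x10000

/*
 * Hypothetical per-port hook tables: a registered pointer means the
 * access can be satisfied by reading/writing that byte directly,
 * without a heavyweight exit to the userspace device model.
 */
static uint8_t *pio_read_ptr[PIO_PORTS];
static uint8_t *pio_write_ptr[PIO_PORTS];

static void register_pio_hook_ptr_r(uint16_t port, uint8_t *p)
{
	pio_read_ptr[port] = p;
}

static void register_pio_hook_ptr_w(uint16_t port, uint8_t *p)
{
	pio_write_ptr[port] = p;
}

/* Fast-path dispatch: returns 1 if handled here, 0 if the access
 * would need a full exit to the device model. */
static int pio_write(uint16_t port, uint8_t val)
{
	if (!pio_write_ptr[port])
		return 0;
	*pio_write_ptr[port] = val;
	return 1;
}

static int pio_read(uint16_t port, uint8_t *val)
{
	if (!pio_read_ptr[port])
		return 0;
	*val = *pio_read_ptr[port];
	return 1;
}
```

The serialization question raised above still applies: a real implementation would need a lock around the fast path, which stays uncontended as long as the guest itself never races on the registers.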
Re: [Qemu-devel] [RFC] Next gen kvm api
On 15.02.2012, at 14:29, Avi Kivity wrote: On 02/15/2012 01:57 PM, Alexander Graf wrote: Is an extra syscall for copying TLB entries to user space prohibitively expensive? The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes. You don't need to copy the entire TLB, just the way that maps the address you're interested in. Yeah, unless we do migration in which case we need to introduce another special case to fetch the whole thing :(. btw, why are you interested in virtual addresses in userspace at all? We need them for gdb and monitor introspection. It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails. I don't understand. Why would anything fail here? It fails to provide a benefit, I didn't mean it causes guest failures. You also have to make sure the kernel part and the user part use exactly the same time bases. Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;). Depends on how much the alignment relies on guest knowledge. I guess with a simple device like HPET, it's simple, but with a complex device, different guests (or different versions of the same guest) could drive it very differently. Right. But accelerating simple devices > not accelerating any devices. No? :) Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 3rd party drivers are a way of life for Windows users; and the incremental benefits of IDE acceleration are still far behind virtio. The typical way of life for Windows users is all-included drivers. Which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The IDE thing was just an idea for legacy ones.
It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though. And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices. Ok. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Cirrus or vesa should be okay for them, I don't see what we could do for them in the kernel, or why. That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable. Same for virtio. Please don't make the Xen mistake again of claiming that all we care about is Linux as a guest. Rest easy, there's no chance of that. But if a guest is important enough, virtio drivers will get written. IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it. Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads. For linear loads, so should we, perhaps with greater cpu utilization. If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple) means 0.5 msec/transaction. Spending 30 usec on some heavyweight exits shouldn't matter. *shrug* last time I checked we were a lot slower. But maybe there's more stuff making things slow than the exit path ;). One thing that's different is that virtio offloads itself to a thread very quickly, while IDE does a lot of work in vcpu thread context. So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us. KVM's strength has always been its close resemblance to hardware. This will remain.
But we can't optimize everything. That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no? We should make sure that we don't default to IDE. Qemu has no knowledge of the guest, so it can't default to virtio, but higher level tools can and should. You can only default to virtio on recent Linux. Windows, BSD, etc don't include drivers, so you can't assume it works. You can default to AHCI for basically any recent guest, but that still won't work for XP and the likes :(. The all-knowing management tool can provide a virtio driver disk, or even slip-stream the driver into the installation CD. One management tool might do that, another one might not. We can't assume that all management tools are all-knowing. Sometimes you also want to run guest OSs that
Re: [Qemu-devel] [RFC] Next gen kvm api
On 02/07/2012 08:12 PM, Rusty Russell wrote: I would really love to have this, but the problem is that we'd need a general purpose bytecode VM with binding to some kernel APIs. The bytecode VM, if made general enough to host more complicated devices, would likely be much larger than the actual code we have in the kernel now. We have the ability to upload bytecode into the kernel already. It's in a great bytecode interpreted by the CPU itself. Unfortunately it's inflexible (has to come with the kernel) and open to security vulnerabilities. If every user were emulating different machines, LPF this would make sense. Are they? They aren't. Or should we write those helpers once, in C, and provide that for them. There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of them are quite complicated. However implementing them in bytecode amounts to exposing a stable kernel ABI, since they use such a vast range of kernel services.
Re: [Qemu-devel] [RFC] Next gen kvm api
On 02/07/2012 06:29 PM, Jan Kiszka wrote: Isn't there another level in between just scheduling and full syscall return if the user return notifier has some real work to do? Depends on whether you're scheduling a kthread or a userspace process, no? Kthreads can't return, of course. User space threads /may/ do so. And then there needs to be a difference between host and guest in the tracked MSRs. Right. Until we randomize kernel virtual addresses (what happened to that?) and then there will always be a difference, even if you run the same kernel in the host and guest. I think to recall it's a question of another few hundred cycles. Right.
Re: [Qemu-devel] [RFC] Next gen kvm api
On 02/07/2012 06:19 PM, Anthony Liguori wrote: Ah. But then ioeventfd has that as well, unless the other end is in the kernel too. Yes, that was my point exactly :-) ioeventfd/mmio-over-socketpair to a different thread is not faster than a synchronous KVM_RUN + writing to an eventfd in userspace modulo a couple of cheap syscalls. The exception is when the other end is in the kernel and there are magic optimizations (like there are today with ioeventfd). vhost seems to schedule a workqueue item unconditionally. irqfd does have magic optimizations to avoid an extra schedule.
Re: [Qemu-devel] [RFC] Next gen kvm api
On 02/15/2012 03:37 PM, Alexander Graf wrote: On 15.02.2012, at 14:29, Avi Kivity wrote: On 02/15/2012 01:57 PM, Alexander Graf wrote: Is an extra syscall for copying TLB entries to user space prohibitively expensive? The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes. You don't need to copy the entire TLB, just the way that maps the address you're interested in. Yeah, unless we do migration in which case we need to introduce another special case to fetch the whole thing :(. Well, the scatter/gather registers I proposed will give you just one register or all of them. btw, why are you interested in virtual addresses in userspace at all? We need them for gdb and monitor introspection. Hardly fast paths that justify shared memory. I should be much harder on you. Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;). Depends on how much the alignment relies on guest knowledge. I guess with a simple device like HPET, it's simple, but with a complex device, different guests (or different versions of the same guest) could drive it very differently. Right. But accelerating simple devices > not accelerating any devices. No? :) Yes. But introducing bugs and vulns < not introducing them. It's a tradeoff. Even an unexploited vulnerability can be a lot more pain, just because you need to update your entire cluster, than a simple device that is accelerated for a guest which has maybe 3% utilization. Performance is just one parameter we optimize for. It's easy to overdo it because it's an easily measurable and sexy parameter, but it's a mistake. One thing that's different is that virtio offloads itself to a thread very quickly, while IDE does a lot of work in vcpu thread context.
So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us. Simply making qemu issue the request from a thread would be way better. Something like socketpair mmio, configured for not waiting for the writes to be seen (posted writes) will also help by buffering writes in the socket buffer. The all-knowing management tool can provide a virtio driver disk, or even slip-stream the driver into the installation CD. One management tool might do that, another one might not. We can't assume that all management tools are all-knowing. Sometimes you also want to run guest OSs that the management tool doesn't know (yet). That is true, but we have to leave some work for the management guys. So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means guest entry is not readable so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective. COWs usually happen from guest userspace, while mmio is usually from the guest kernel, so you can switch on that, maybe. Hrm, nice idea. That might fall apart with user space drivers that we might eventually have once vfio turns out to work well, but for the time being it's a nice hack :). Or nested virt...
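The socketpair-mmio idea with posted writes can be sketched in userspace like this (the request record layout is made up for illustration; this is not an existing QEMU/KVM interface):

```c
#include <assert.h>
#include <stdint.h>
#include <sys/socket.h>
#include <unistd.h>

/* Illustrative MMIO request record; the layout is invented for this sketch. */
struct mmio_req {
	uint64_t addr;
	uint32_t len;
	uint32_t data;
};

/*
 * "Posted" write: queue the request in the socket buffer and return
 * immediately, without waiting for the device side to consume it.
 * Ordering is preserved because everything goes through one socket.
 */
static int post_mmio_write(int fd, uint64_t addr, uint32_t len, uint32_t data)
{
	struct mmio_req req = { .addr = addr, .len = len, .data = data };

	return write(fd, &req, sizeof(req)) == (ssize_t)sizeof(req) ? 0 : -1;
}
```

This also illustrates the point made later in the thread about needing all parts of a device behind the same socket: splitting a device's PIO and MMIO BARs across two sockets would lose the ordering guarantee.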
Re: [PATCH] BUG in pv_clock when overflow condition is detected
On 02/15/2012 01:23 PM, Igor Mammedov wrote: static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) { -u64 delta = native_read_tsc() - shadow->tsc_timestamp; +u64 delta; +u64 tsc = native_read_tsc(); +BUG_ON(tsc < shadow->tsc_timestamp); +delta = tsc - shadow->tsc_timestamp; return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift); Maybe a WARN_ON_ONCE()? Otherwise a relatively minor hypervisor bug can kill the guest. An attempt to print from this place is not perfect since it often leads to recursive calling to this very function and it hangs there anyway. But if you insist I'll re-post it with WARN_ON_ONCE. It won't make much difference because guest will hang/stall due to overflow anyway. Won't a BUG_ON() also result in a printk? If there is an intention to keep guest functional after the event then maybe this patch is a way to go http://www.spinics.net/lists/kvm/msg68463.html this way clock will be resilient to this kind of error, like the bare-metal one is. It's the same patch... do you mean something that detects the overflow and uses the last value?
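The resilient alternative being discussed here (don't crash the guest, just refuse to let the subtraction wrap when the TSC appears to be behind the shadow timestamp) might look roughly like this. Userspace sketch with illustrative names, not the actual pvclock code:

```c
#include <assert.h>
#include <stdint.h>

/*
 * If the current TSC reading is behind the shadow timestamp (a
 * hypervisor bug that would otherwise make the u64 subtraction wrap
 * to a huge value and break the clock), clamp the delta to zero
 * instead of killing the guest with a BUG_ON.
 */
static uint64_t safe_tsc_delta(uint64_t tsc, uint64_t tsc_timestamp)
{
	if (tsc < tsc_timestamp)	/* would underflow: report no progress */
		return 0;
	return tsc - tsc_timestamp;
}
```

A WARN_ON_ONCE at the clamp point would preserve the diagnostic without the recursive-printk hang described above, at the cost of a less obvious failure point in the stack.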
Re: AESNI and guest hosts
On 02/15/2012 02:02 PM, Ryan Brown wrote: I don't think it's supported to pass that functionality to the guest. Why not? Perhaps a new libvirt or qemu is needed. Should it be the case to add one of the following? <feature name='aes'/> or.. <feature name='aesni'/> something like that? The qemu name is aes. Don't know about libvirt, suggest you start with bare qemu first.
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/15/2012 01:37 PM, Xiao Guangrong wrote: I would really like to move the IPI back out of the lock. How about something like a sequence lock: spin_lock(mmu_lock); need_flush = write_protect_stuff(); atomic_add(need_flush, &kvm->want_flush_counter); spin_unlock(mmu_lock); while ((done = atomic_read(&kvm->done_flush_counter)) < (want = atomic_read(&kvm->want_flush_counter))) { kvm_make_request(flush); atomic_cmpxchg(&kvm->done_flush_counter, done, want); } This (or maybe a corrected and optimized version) ensures that any need_flush cannot pass the while () barrier, no matter which thread encounters it first. However it violates the do not invent new locking techniques commandment. Can we map it to some existing method? There is no need to advance 'want' in the loop. So we could do /* must call with mmu_lock held */ void kvm_mmu_defer_remote_flush(kvm, need_flush) { if (need_flush) ++kvm->flush_counter.want; } /* may call without mmu_lock */ void kvm_mmu_commit_remote_flush(kvm) { want = ACCESS_ONCE(kvm->flush_counter.want); while ((done = atomic_read(&kvm->flush_counter.done)) < want) { kvm_make_request(flush); atomic_cmpxchg(&kvm->flush_counter.done, done, want); } } Hmm, we already have kvm->tlbs_dirty, so, we can do it like this: #define SPTE_INVALID_UNCLEAN (1ull << 63) in invalid page path: lock mmu_lock if (spte is invalid) kvm->tlbs_dirty |= SPTE_INVALID_UNCLEAN; need_tlb_flush = kvm->tlbs_dirty; unlock mmu_lock if (need_tlb_flush) kvm_flush_remote_tlbs() And in page write-protected path: lock mmu_lock if (it has spte change to readonly || kvm->tlbs_dirty & SPTE_INVALID_UNCLEAN) kvm_flush_remote_tlbs() unlock mmu_lock How about this? Well, it still has flushes inside the lock. And it seems to be more complicated, but maybe that's because I thought of my idea and didn't fully grok yours yet.
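The defer/commit scheme above can be prototyped in userspace with C11 atomics to check the logic. Single-threaded illustration; the names are invented and this is not the kernel implementation:

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* Userspace stand-in for the want/done flush counters. */
struct flush_ctr {
	atomic_int want;
	atomic_int done;
};

static int flushes_issued;	/* stands in for kvm_make_request(flush) */

static void make_flush_request(void)
{
	++flushes_issued;
}

/* Would be called with mmu_lock held: record that a flush is owed. */
static void defer_remote_flush(struct flush_ctr *c, bool need_flush)
{
	if (need_flush)
		atomic_fetch_add(&c->want, 1);
}

/* May run without mmu_lock: flush until done catches up with want. */
static void commit_remote_flush(struct flush_ctr *c)
{
	int want = atomic_load(&c->want);
	int done;

	while ((done = atomic_load(&c->done)) < want) {
		make_flush_request();
		/* Publish "flushed up to want"; a racing committer may
		 * have advanced done already, which is fine. */
		atomic_compare_exchange_strong(&c->done, &done, want);
	}
}
```

The key property, as stated above: once defer has bumped want, no thread can get past commit without a flush having been issued covering it.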
Re: [Qemu-devel] [RFC] Next gen kvm api
On 15.02.2012, at 14:57, Avi Kivity wrote: On 02/15/2012 03:37 PM, Alexander Graf wrote: On 15.02.2012, at 14:29, Avi Kivity wrote: On 02/15/2012 01:57 PM, Alexander Graf wrote: Is an extra syscall for copying TLB entries to user space prohibitively expensive? The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes. You don't need to copy the entire TLB, just the way that maps the address you're interested in. Yeah, unless we do migration in which case we need to introduce another special case to fetch the whole thing :(. Well, the scatter/gather registers I proposed will give you just one register or all of them. One register is hardly any use. We either need all ways of a respective address to do a full fledged lookup or all of them. By sharing the same data structures between qemu and kvm, we actually managed to reuse all of the tcg code for lookups, just like you do for x86. On x86 you also have shared memory for page tables, it's just guest visible, hence in guest memory. The concept is the same. btw, why are you interested in virtual addresses in userspace at all? We need them for gdb and monitor introspection. Hardly fast paths that justify shared memory. I should be much harder on you. It was a tradeoff on speed and complexity. This way we have the least amount of complexity IMHO. All KVM code paths just magically fit in with the TCG code. There are essentially no if(kvm_enabled)'s in our MMU walking code, because the tables are just there. Makes everything a lot easier (without dragging down performance). Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;). Depends on how much the alignment relies on guest knowledge. 
I guess with a simple device like HPET, it's simple, but with a complex device, different guests (or different versions of the same guest) could drive it very differently. Right. But accelerating simple devices > not accelerating any devices. No? :) Yes. But introducing bugs and vulns < not introducing them. It's a tradeoff. Even an unexploited vulnerability can be a lot more pain, just because you need to update your entire cluster, than a simple device that is accelerated for a guest which has maybe 3% utilization. Performance is just one parameter we optimize for. It's easy to overdo it because it's an easily measurable and sexy parameter, but it's a mistake. Yeah, I agree. That's why I was trying to get AHCI to be the default storage adapter for a while, because I think the same. However, Anthony believes that XP/w2k3 is still a major chunk of the guests running on QEMU, so we can't do that :(. I'm mostly trying to think of ways to accelerate the obvious low-hanging fruit, without overengineering any interfaces. One thing that's different is that virtio offloads itself to a thread very quickly, while IDE does a lot of work in vcpu thread context. So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us. Simply making qemu issue the request from a thread would be way better. Something like socketpair mmio, configured for not waiting for the writes to be seen (posted writes) will also help by buffering writes in the socket buffer. Yup, nice idea. That only works when all parts of a device are actually implemented through the same socket though. Otherwise you could run out of order. So if you have a PCI device with a PIO and an MMIO BAR region, they would both have to be handled through the same socket.
The all-knowing management tool can provide a virtio driver disk, or even slip-stream the driver into the installation CD. One management tool might do that, another one might not. We can't assume that all management tools are all-knowing. Sometimes you also want to run guest OSs that the management tool doesn't know (yet). That is true, but we have to leave some work for the management guys. The easier the management stack is, the happier I am ;). So for MMIO reads, I can assume that this is an MMIO because I would never write a non-readable entry. For writes, I'm overloading the bit that also means guest entry is not readable so there I'd have to walk the guest PTEs/TLBs and check if I find a read-only entry. Right now I can just forward write faults to the guest. Since COW is probably a hotter path for the guest than MMIO, this might end up being ineffective. COWs usually happen from guest userspace, while mmio is usually from the guest kernel, so you can switch on that, maybe. Hrm, nice idea.
Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware
On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote: Roopa was likely on the right track here, http://patchwork.ozlabs.org/patch/123064/ Doesn't seem related to the bridging stuff - the modeling looks reasonable however. But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX netlink messages. And if possible drive this without extending ndo_ops. An ideal user space interaction IMHO would look like, [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10 [root@jf-dev1-dcblab iproute2]# ./br/br fdb (columns: port, mac addr, flags) veth2 36:a6:35:9b:96:c4 local veth4 aa:54:b0:7b:42:ef local veth0 2a:e8:5c:95:6c:1b local veth6 6e:26:d5:43:a3:36 local veth0 f2:c1:39:76:6a:fb veth8 4e:35:16:af:87:13 local veth10 52:e5:62:7b:57:88 static veth10 aa:a9:35:21:15:c4 local Looks nice, where is the targeted bridge (e.g. br0) in that syntax? Using Stephen's br tool. First command adds FDB entry to SW bridge and if the same tool could be used to add entries to embedded bridge I think that would be the best case. That would be nice (although adds dependency on the presence of the s/ware bridge). Would be nicer to have either a knob in the kernel to say synchronize with h/w bridge foo which can be turned off. So no RTNETLINK error on the second cmd. Then embedded FDB entries could be dumped this way also so I get a complete view of my FDB setup across multiple sw bridges and embedded bridges. So if you had multiple h/ware bridges - which one is tied to br0? Yes. The hardware has a bit to support this which is currently not exposed to user space. That's a case where we have 'yet another knob' that needs a clean solution. This causes real bugs today when users try to use the macvlan devices in VEPA mode on top of SR-IOV. By the way these modes are all part of the 802.1Qbg spec which people actually want to use with Linux so a good clean solution is probably needed. I think the knobs to flood and learn are important.
The hardware seems to have the flood but not the learn/discover. I think the s/ware bridge needs to have both. At the moment - as pointed out in that *NEIGH* notification, s/w bridge assumes a policy that could be considered a security flaw in some circles - just because you are my neighbor does not mean I trust you to come into my house; I may trust you partially and allow you only to come through the front door. Even in Canada with a default policy of not locking your door we sometimes lock our doors ;-> I have no problem with drawing the line here and trying to implement something over PF_BRIDGE:RTM_xxx nlmsgs. My comment/concern was in regard to the bridge built-in policy of reading from the neighbor updates (refer to above comments) cheers, jamal
Correct location for bug report: KVM domain hangs after loading initrd with Xenomai kernel
I'm not sure if this bug is located in userspace or in the kernel. Could you let me know where to file it? Bug: Attempting to boot a 32 bit Debian guest with a Xenomai kernel inside KVM causes it to hang and spin (using 1 full CPU core) after loading the initrd, as determined by serial console output. The only error message is KVM internal error. Suberror: 1/emulation failure. Booting a regular Debian kernel succeeds, as does running the Xenomai kernel with software emulation (-no-kvm). Info: CPU: Intel Core i7-2670QM Emulator: qemu-kvm 0.14.1 Host kernel: 3.0.0-15 (Ubuntu build), x86_64 Guest OS: Debian Squeeze, kernel.org 2.6.37 kernel with Xenomai 2.6.0 (config attached) Qemu command: kvm -M pc-0.14 -enable-kvm -m 1024 -drive file=/var/lib/libvirt/images/eve.img,if=none,id=drive-ide0-0-0,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=21,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:b5:f4:00,bus=pci.0,addr=0x3 -chardev stdio,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -device usb-tablet,id=input0 -vga cirrus Effects of flags: Adding one or both of --no-kvm-irqchip or --no-kvm-pit has no apparent effect. Adding --no-kvm appears to correct the problem. Trace will be attached to the final bug submission. Thanks, --Doug Brunner
Re: Correct location for bug report: KVM domain hangs after loading initrd with Xenomai kernel
On 02/15/2012 06:40 PM, madengineer10 wrote: I'm not sure if this bug is located in userspace or in the kernel. Could you let me know where to file it? Bug: Attempting to boot a 32 bit Debian guest with a Xenomai kernel inside KVM causes it to hang and spin (using 1 full CPU core) after loading the initrd, as determined by serial console output. The only error message is KVM internal error. Suberror: 1/emulation failure. Booting a regular Debian kernel succeeds, as does running the Xenomai kernel with software emulation (-no-kvm). Please issue the following commands on the qemu monitor: (qemu) info registers (qemu) x/30i $eip and report.
Re: Correct location for bug report: KVM domain hangs after loading initrd with Xenomai kernel
On 02/15/2012 06:46 PM, Avi Kivity wrote: On 02/15/2012 06:40 PM, madengineer10 wrote: I'm not sure if this bug is located in userspace or in the kernel. Could you let me know where to file it? Bug: Attempting to boot a 32 bit Debian guest with a Xenomai kernel inside KVM causes it to hang and spin (using 1 full CPU core) after loading the initrd, as determined by serial console output. The only error message is KVM internal error. Suberror: 1/emulation failure. Booting a regular Debian kernel succeeds, as does running the Xenomai kernel with software emulation (-no-kvm). Please issue the following commands on the qemu monitor: (qemu) info registers (qemu) x/30i $eip and report. Oh, and wrt your original question, it's likely a kvm bug, please report in bugzilla.kernel.org.
Re: [PATCH] BUG in pv_clock when overflow condition is detected
- Original Message - From: Avi Kivity a...@redhat.com To: Igor Mammedov imamm...@redhat.com Cc: linux-ker...@vger.kernel.org, kvm@vger.kernel.org, t...@linutronix.de, mi...@redhat.com, h...@zytor.com, r...@redhat.com, amit shah amit.s...@redhat.com, mtosa...@redhat.com Sent: Wednesday, February 15, 2012 3:02:04 PM Subject: Re: [PATCH] BUG in pv_clock when overflow condition is detected On 02/15/2012 01:23 PM, Igor Mammedov wrote: static u64 pvclock_get_nsec_offset(struct pvclock_shadow_time *shadow) { - u64 delta = native_read_tsc() - shadow->tsc_timestamp; + u64 delta; + u64 tsc = native_read_tsc(); + BUG_ON(tsc < shadow->tsc_timestamp); + delta = tsc - shadow->tsc_timestamp; return pvclock_scale_delta(delta, shadow->tsc_to_nsec_mul, shadow->tsc_shift); Maybe a WARN_ON_ONCE()? Otherwise a relatively minor hypervisor bug can kill the guest. An attempt to print from this place is not perfect since it often leads to recursive calling of this very function and it hangs there anyway. But if you insist I'll re-post it with WARN_ON_ONCE; it won't make much difference because the guest will hang/stall due to the overflow anyway. Won't a BUG_ON() also result in a printk? Yes, it will. But the stack will still keep the failure point, and poking with crash/gdb at the core will always show where it BUGged. In case it manages to print a dump somehow (saw it a couple of times in ~30 test cycles), logs from the console or from the kernel message buffer (again, poking with gdb) will show where it was called from. If WARN* is used, it will still totally screw up the clock and last value and the system will become unusable, requiring looking with gdb/crash at the core anyway. So I've just used a more stable failure point that will leave a trace everywhere it manages (maybe in the console log, but for sure on the stack); in case of WARN it might leave a trace on the console or not, and probably won't reflect the failure point in the stack either, leaving only the kernel message buffer for a clue.
If there is an intention to keep the guest functional after the event then maybe this patch is the way to go: http://www.spinics.net/lists/kvm/msg68463.html. This way the clock will be resilient to this kind of error, like the bare-metal one is. It's the same patch... do you mean something that detects the overflow and uses the last value? I'm sorry, I pasted the wrong link; here it goes: pvclock: Make pv_clock more robust and fixup it if overflow happens http://www.spinics.net/lists/kvm/msg68440.html
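The clamp-instead-of-BUG alternative under discussion can be sketched in user space. This is an illustrative model only, not the kernel's pvclock code: scale_delta mimics pvclock_scale_delta's (delta * mul) >> 32 scaling, and the warned flag stands in for WARN_ON_ONCE.

```c
#include <stdint.h>

/* Mimics pvclock_scale_delta: shift delta, then multiply by a 32.32
 * fixed-point factor and keep the integer part. */
uint64_t scale_delta(uint64_t delta, uint32_t mul, int8_t shift)
{
    if (shift < 0)
        delta >>= -shift;
    else
        delta <<= shift;
    return (uint64_t)(((unsigned __int128)delta * mul) >> 32);
}

/* If the TSC read is behind the shadow timestamp, the unsigned
 * subtraction would wrap to a huge delta.  Instead of BUG-ing,
 * raise a flag (the WARN_ON_ONCE analogue) and clamp delta to 0. */
uint64_t nsec_offset(uint64_t tsc, uint64_t tsc_timestamp,
                     uint32_t mul, int8_t shift, int *warned)
{
    if (tsc < tsc_timestamp) {  /* the overflow condition */
        *warned = 1;
        tsc = tsc_timestamp;
    }
    return scale_delta(tsc - tsc_timestamp, mul, shift);
}
```

With mul = 0x80000000 (i.e. a factor of 0.5) a delta of 50 cycles scales to 25 ns, while a backwards TSC yields 0 instead of a wrapped value.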
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On Wed, Feb 15, 2012 at 04:07:49PM +0200, Avi Kivity wrote: Well, it still has flushes inside the lock. And it seems to be more complicated, but maybe that's because I thought of my idea and didn't fully grok yours yet. If we go more complicated I prefer Avi's suggestion to move them all outside the lock. Yesterday I was also thinking about the regular pagetables and how we do not have similar issues there. On the regular pagetables we just do an unconditional flush in fork when we make it readonly, and KSM (the other place that makes ptes readonly that can later cow) uses ptep_clear_flush, which does an unconditional flush, and furthermore it does it inside the PT lock, so generally we don't optimize for these things on the regular pagetables. But then these events don't happen as frequently as they can on KVM without EPT/NPT.
Re: [Qemu-devel] [RFC] Next gen kvm api
On 02/15/2012 05:57 AM, Alexander Graf wrote: On 15.02.2012, at 12:18, Avi Kivity wrote: Well the real reason is we have an extra bit reported by page faults that we can control. Can't you set up a hashed pte that is configured in a way that it will fault, no matter what type of access the guest does, and see it in your page fault handler? I might be able to synthesize a PTE that is !readable and might throw a permission exception instead of a miss exception. I might be able to synthesize something similar for booke. I don't however get any indication on why things failed. On booke with ISA 2.06 hypervisor extensions, there's MAS8[VF] that will trigger a DSI that gets sent to the hypervisor even if normal DSIs go directly to the guest. You'll still need to zero out the execute permission bits. For other booke, you could use one of the user bits in MAS3 (along with zeroing out all the permission bits), which you could get to by doing a tlbsx. -Scott
Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support
On 10.01.2012, at 01:51, Scott Wood wrote: On 01/09/2012 11:46 AM, Alexander Graf wrote: On 21.12.2011, at 02:34, Scott Wood wrote: [...] Current issues include: - Machine checks from guest state are not routed to the host handler. - The guest can cause a host oops by executing an emulated instruction in a page that lacks read permission. Existing e500/4xx support has the same problem. We solve that in book3s pr by doing LAST_INST = known bad value; PACA->kvm_mode = recover at next inst; lwz(guest pc); do_more_stuff(); That way when an exception occurs at lwz() the DO_KVM handler checks that we're in kvm mode recover, which does basically srr0 += 4; rfi;. I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and treat it as a kernel fault (search exception table) -- but this works too and is a bit cleaner (could be other uses of external pid), at the expense of a couple extra instructions in the emulation path (but probably a slightly faster host TLB handler). The check wouldn't go in DO_KVM, though, since on bookehv that only deals with diverting flow when xSRR1[GS] is set, which wouldn't be the case here. Thinking about it a bit more, how is this different from a failed get_user()? We can just use the same fixup mechanism as there, right? Alex
[KVM paravirt issue?] Re: vsyscall=emulate regression
Hi, kvm people- Here's a strange failure. It could be a bug in something RHEL6-specific, but it could be a generic issue that only triggers with a paravirt guest with old userspace on a non-ept host. There was a bug like this on Xen, and I'm wondering if something's wrong on kvm as well. For background, a change in 3.1 (IIRC) means that, when vsyscall=emulate or vsyscall=none, the vsyscall page in the fixmap is NX. It seems like Amit's machine is marking the physical PTE present but unreadable. So I could have messed up, or there could be a subtle bug somewhere. Any ideas? I'll try to reproduce on a non-ept host later on, but that will involve finding one. On Wed, Feb 15, 2012 at 3:01 AM, Amit Shah amit.s...@redhat.com wrote: On (Tue) 14 Feb 2012 [08:26:22], Andy Lutomirski wrote: On Tue, Feb 14, 2012 at 4:22 AM, Amit Shah amit.s...@redhat.com wrote: Can you try booting the initramfs here: http://web.mit.edu/luto/www/linux/vsyscall_initramfs.img with your kernel image (i.e. qemu-kvm -kernel whatever -initrd vsyscall_initramfs.img -whatever_else) and seeing what happens? It works for me. This too results in a similar error. Can you post the exact error? I'm interested in how far it gets before it fails. I didn't try a modern distro, but looks like this is enough evidence for now to check the kvm emulator code. I tried the same guests on a newer kernel (Fedora 16's 3.2), and things worked fine except for vsyscall=none, panic message below. vsyscall=none isn't supposed to work unless you're running a very modern distro *and* you have no legacy static binaries *and* you aren't using anything written in Go (sigh). It will probably either never become the default or will take 5-10 years.
model name : Intel(R) Core(TM)2 Duo CPU E6550 @ 2.33GHz flags : fpu vme de pse tsc msr pae mce cx8 apic mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx lm constant_tsc arch_perfmon pebs bts rep_good aperfmperf pni dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm lahf_lm dts tpr_shadow vnmi flexpriority Hmm. You don't have ept. If your guest kernel supports paravirt, then you might use the hypercall interface instead of programming the fixmap directly. This is what I get with vsyscall=none, where emulate and native work fine on the 3.2 kernel on different host hardware, the guest stays the same: [ 2.874661] debug: unmapping init memory 8167f000..818dc000 [ 2.876778] Write protecting the kernel read-only data: 6144k [ 2.879111] debug: unmapping init memory 880001318000..88000140 [ 2.881242] debug: unmapping init memory 8800015a..88000160 [ 2.884637] init[1] vsyscall attempted with vsyscall=none ip:ff600400 cs:33 sp:7fff2f48fe18 ax:7fff2f48fe50 si:7fff2f48ff08 di:0 This line (vsyscall attempted) means that the emulation worked correctly. Your other traces didn't have it or anything like it, which mostly rules out do_emulate_vsyscall issues. --Andy
Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support
On 02/15/2012 01:36 PM, Alexander Graf wrote: On 10.01.2012, at 01:51, Scott Wood wrote: I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and treat it as a kernel fault (search exception table) -- but this works too and is a bit cleaner (could be other uses of external pid), at the expense of a couple extra instructions in the emulation path (but probably a slightly faster host TLB handler). The check wouldn't go in DO_KVM, though, since on bookehv that only deals with diverting flow when xSRR1[GS] is set, which wouldn't be the case here. Thinking about it a bit more, how is this different from a failed get_user()? We can just use the same fixup mechanism as there, right? The fixup mechanism can be the same (we'd like to know whether it failed due to TLB miss or DSI, so we know which to reflect -- but if necessary I think we can figure that out with a tlbsx). What's different is that the page fault handler needs to know that any external pid (or AS1) fault is bad, same as if the address were in the kernel area, and it should go directly to searching the exception tables instead of trying to page something in. -Scott
Re: [Qemu-devel] [RFC] Next gen kvm api
On 02/15/2012 07:39 AM, Avi Kivity wrote: On 02/07/2012 08:12 PM, Rusty Russell wrote: I would really love to have this, but the problem is that we'd need a general purpose bytecode VM with binding to some kernel APIs. The bytecode VM, if made general enough to host more complicated devices, would likely be much larger than the actual code we have in the kernel now. We have the ability to upload bytecode into the kernel already. It's in a great bytecode interpreted by the CPU itself. Unfortunately it's inflexible (has to come with the kernel) and open to security vulnerabilities. I wonder if there's any reasonable way to run device emulation within the context of the guest. Could we effectively do something like SMM? For a given set of traps, reflect back into the guest quickly changing the visibility of the VGA region. It may require installing a new CR3 but maybe that wouldn't be so bad with VPIDs. Then you could implement the PIT as guest firmware using kvmclock as the time base. Once you're back in the guest, you could install the old CR3. Perhaps just hide a portion of the physical address space with the e820. Regards, Anthony Liguori If every user were emulating different machines, LPF this would make sense. Are they? They aren't. Or should we write those helpers once, in C, and provide that for them. There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of them are quite complicated. However implementing them in bytecode amounts to exposing a stable kernel ABI, since they use such a vast range of kernel services.
Re: [Qemu-devel] [RFC] Next gen kvm api
On Tuesday 07 February 2012, Alexander Graf wrote: On 07.02.2012, at 07:58, Michael Ellerman wrote: On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote: You're exposing a large, complex kernel subsystem that does very low-level things with the hardware. It's a potential source of exploits (from bugs in KVM or in hardware). I can see people wanting to be selective with access because of that. Exactly. In a perfect world I'd agree with Anthony, but in reality I think sysadmins are quite happy that they can prevent some users from using KVM. You could presumably achieve something similar with capabilities or whatever, but a node in /dev is much simpler. Well, you could still keep the /dev/kvm node and then have syscalls operate on the fd. But again, I don't see the problem with the ioctl interface. It's nice, extensible and works great for us. ioctl is good for hardware devices and stuff that you want to enumerate and/or control permissions on. For something like KVM that is really a core kernel service, a syscall makes much more sense. I would certainly never mix the two concepts: If you use a chardev to get a file descriptor, use ioctl to do operations on it, and if you use a syscall to get the file descriptor then use other syscalls to do operations on it. I don't really have a good recommendation whether or not to change from an ioctl based interface to syscall for KVM now. On the one hand I believe it would be significantly cleaner, on the other hand we cannot remove the chardev interface any more since there are many existing users. Arnd
Re: [Qemu-devel] [RFC] Next gen kvm api
On Tuesday 07 February 2012, Alexander Graf wrote: Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints. I would expect that newer archs have less constraints, not more. Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid out stuff wrong before? I don't even want to imagine what v7 arm vs v8 arm looks like. It's a completely new architecture. I have not seen the source but I'm pretty sure that v7 and v8 look very similar regarding virtualization support because they were designed together, including the concept that on v8 you can run either a v7 compatible 32 bit hypervisor with 32 bit guests or a 64 bit hypervisor with a combination of 32 and 64 bit guests. Also, the page table layout in v7-LPAE is identical to the v8 one. The main difference is the instruction set, but then ARMv7 already has four of these (ARM, Thumb, Thumb2, ThumbEE). Arnd
Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support
On 15.02.2012, at 20:40, Scott Wood wrote: On 02/15/2012 01:36 PM, Alexander Graf wrote: On 10.01.2012, at 01:51, Scott Wood wrote: I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and treat it as a kernel fault (search exception table) -- but this works too and is a bit cleaner (could be other uses of external pid), at the expense of a couple extra instructions in the emulation path (but probably a slightly faster host TLB handler). The check wouldn't go in DO_KVM, though, since on bookehv that only deals with diverting flow when xSRR1[GS] is set, which wouldn't be the case here. Thinking about it a bit more, how is this different from a failed get_user()? We can just use the same fixup mechanism as there, right? The fixup mechanism can be the same (we'd like to know whether it failed due to TLB miss or DSI, so we know which to reflect No, we only want to know that the fast path failed. The reason is a different pair of shoes and should be evaluated in the slow path. We shouldn't ever fault here during normal operation btw. We already executed a guest instruction, so there's almost no reason it can't be read. -- but if necessary I think we can figure that out with a tlbsx). What's different is that the page fault handler needs to know that any external pid (or AS1) fault is bad, same as if the address were in the kernel area, and it should go directly to searching the exception tables instead of trying to page something in. Yes and no. We need to force it to search the exception tables. We don't care if the page fault handler knows anything about external pids. Either way, we discussed the further stuff on IRC and came to a working solution :). Stay tuned. Alex
Re: [Qemu-devel] [RFC] Next gen kvm api
On Wed, 2012-02-15 at 22:21 +, Arnd Bergmann wrote: On Tuesday 07 February 2012, Alexander Graf wrote: On 07.02.2012, at 07:58, Michael Ellerman wrote: On Mon, 2012-02-06 at 13:46 -0600, Scott Wood wrote: You're exposing a large, complex kernel subsystem that does very low-level things with the hardware. It's a potential source of exploits (from bugs in KVM or in hardware). I can see people wanting to be selective with access because of that. Exactly. In a perfect world I'd agree with Anthony, but in reality I think sysadmins are quite happy that they can prevent some users from using KVM. You could presumably achieve something similar with capabilities or whatever, but a node in /dev is much simpler. Well, you could still keep the /dev/kvm node and then have syscalls operate on the fd. But again, I don't see the problem with the ioctl interface. It's nice, extensible and works great for us. ioctl is good for hardware devices and stuff that you want to enumerate and/or control permissions on. For something like KVM that is really a core kernel service, a syscall makes much more sense. Yeah maybe. That distinction is at least in part just historical. The first problem I see with using a syscall is that you don't need one syscall for KVM, you need ~90. OK so you wouldn't do that, you'd use a multiplexed syscall like epoll_ctl() - or probably several (vm/vcpu/etc). Secondly you still need a handle/context for those syscalls, and I think the most sane thing to use for that is an fd. At that point you've basically reinvented ioctl :) I also think it is an advantage that you have a node in /dev for permissions. I know other core kernel interfaces don't use a /dev node, but arguably that is their loss. I would certainly never mix the two concepts: If you use a chardev to get a file descriptor, use ioctl to do operations on it, and if you use a syscall to get the file descriptor then use other syscalls to do operations on it. 
Sure, we use a syscall to get the fd (open) and then other syscalls to do operations on it, ioctl and kvm_vcpu_run. ;) But seriously, I guess that makes sense. Though it's a bit of a pity because if you want a syscall for any of it, eg. vcpu_run(), then you have to basically reinvent ioctl for all the other little operations. cheers
Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware
On 2/15/2012 6:10 AM, Jamal Hadi Salim wrote: On Tue, 2012-02-14 at 10:57 -0800, John Fastabend wrote: Roopa was likely on the right track here, http://patchwork.ozlabs.org/patch/123064/ Doesnt seem related to the bridging stuff - the modeling looks reasonable however. The operations are really the same ADD/DEL/GET additional MAC addresses to a port, in this case a macvlan type port. The difference is the macvlan port type drops any packet with an address not in the FDB where the bridge type floods these. But I think the proper syntax is to use the existing PF_BRIDGE:RTM_XXX netlink messages. And if possible drive this without extending ndo_ops. An ideal user space interaction IMHO would look like, [root@jf-dev1-dcblab iproute2]# ./br/br fdb add 52:e5:62:7b:57:88 dev veth10 [root@jf-dev1-dcblab iproute2]# ./br/br fdb port mac addr flags veth2 36:a6:35:9b:96:c4 local veth4 aa:54:b0:7b:42:ef local veth0 2a:e8:5c:95:6c:1b local veth6 6e:26:d5:43:a3:36 local veth0 f2:c1:39:76:6a:fb veth8 4e:35:16:af:87:13 local veth10 52:e5:62:7b:57:88 static veth10 aa:a9:35:21:15:c4 local Looks nice, where is the targeted bridge(eg br0) in that syntax? [root@jf-dev1-dcblab src]# br fdb help Usage: br fdb { add | del | replace } ADDR dev DEV br fdb {show} [ dev DEV ] In my example I just dumped all bridge devices, #br fdb show dev bridge0 Using Stephen's br tool. First command adds FDB entry to SW bridge and if the same tool could be used to add entries to embedded bridge I think that would be the best case. That would be nice (although adds dependency on the presence of the s/ware bridge). Would be nicer to have either a knob in the kernel to say synchronize with h/w bridge foo which can be turned off. Seems we need both a synchronize and a { add | del | replace } option. So no RTNETLINK error on the second cmd. Then embedded FDB entries could be dumped this way also so I get a complete view of my FDB setup across multiple sw bridges and embedded bridges.
So if you had multiple h/ware bridges - which one is tied to br0? Not sure I follow but does the additional dev parameter above answer this? Yes. The hardware has a bit to support this which is currently not exposed to user space. That's a case where we have 'yet another knob' that needs a clean solution. This causes real bugs today when users try to use the macvlan devices in VEPA mode on top of SR-IOV. By the way these modes are all part of the 802.1Qbg spec which people actually want to use with Linux so a good clean solution is probably needed. I think the knobs to flood and learn are important. The hardware seems to have the flood but not the learn/discover. I think the s/ware bridge needs to have both. At the moment - as pointed out in that *NEIGH* notification, s/w bridge assumes a policy that could be considered a security flaw in some circles - just because you are my neighbor does not mean i trust you to come into my house; i may trust you partially and allow you only to come through the front door. Even in Canada with a default policy of not locking your door we sometimes lock our doors ;- I have no problem with drawing the line here and trying to implement something over PF_BRIDGE:RTM_xxx nlmsgs. My comment/concern was in regard to the bridge built-in policy of reading from the neighbor updates (refer to above comments) So I think what you're saying is a per port bit to disable learning... hmm but if you start tweaking it too much it looks less and less like a 802.1D bridge and more like something you would want to build with tc or openvswitch or tc+bridge or tc+macvlan. .John cheers, jamal
Re: [Qemu-devel] [PATCH 0/3] [PULL] qemu-kvm.git uq/master queue
On 02/08/2012 02:01 PM, Marcelo Tosatti wrote: The following changes since commit cf4dc461a4cfc3e056ee24edb26154f4d34a6278: Restore consistent formatting (2012-02-07 22:11:04 +0400) are available in the git repository at: git://git.kernel.org/pub/scm/virt/kvm/qemu-kvm.git uq/master Pulled. Thanks. Regards, Anthony Liguori Jan Kiszka (3): kvm: Allow to set shadow MMU size kvm: Implement kvm_irqchip_in_kernel like kvm_enabled apic: Fix legacy vmstate loading for KVM hw/apic_common.c | 7 ++- hw/pc.c | 4 ++-- hw/pc_piix.c | 6 +++--- kvm-all.c | 13 - kvm-stub.c | 5 - kvm.h | 8 +--- qemu-config.c | 4 qemu-options.hx | 5 - target-i386/kvm.c | 17 +++-- 9 files changed, 43 insertions(+), 26 deletions(-)
Re: [Qemu-devel] [RFC] Next gen kvm api
On Wed, 15 Feb 2012 15:39:41 +0200, Avi Kivity a...@redhat.com wrote: On 02/07/2012 08:12 PM, Rusty Russell wrote: I would really love to have this, but the problem is that we'd need a general purpose bytecode VM with binding to some kernel APIs. The bytecode VM, if made general enough to host more complicated devices, would likely be much larger than the actual code we have in the kernel now. We have the ability to upload bytecode into the kernel already. It's in a great bytecode interpreted by the CPU itself. Unfortunately it's inflexible (has to come with the kernel) and open to security vulnerabilities. It doesn't have to come with the kernel, but it does require privs. And the bytecode itself might be invulnerable, but the services it will call will be, so it's not clear it'll be a win, given the reduced auditability. The grass is not really greener, and getting there involves many fences. If every user were emulating different machines, LPF this would make sense. Are they? They aren't. Or should we write those helpers once, in C, and provide that for them. There are many of them: PIT/PIC/IOAPIC/MSIX tables/HPET/kvmclock/Hyper-V stuff/vhost-net/DMA remapping/IO remapping (just for x86), and some of them are quite complicated. However implementing them in bytecode amounts to exposing a stable kernel ABI, since they use such a vast range of kernel services. We could think about regularizing and enumerating the various in-kernel helpers, and give userspace a generic mechanism for wiring them up. That would surely be the first step towards bytecode anyway. But the current device assignment ioctls make me think that this wouldn't be simple or neat. Cheers, Rusty.
Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware
[I'm just catching up with this after getting my own driver changes into shape.] On Fri, 2012-02-10 at 10:18 -0500, jamal wrote: Hi John, I went backwards to summarize at the top after going through your email. TL;DR version 0.1: you provide a good use case where it makes sense to do things in the kernel. IMO, you could make the same arguement if your embedded switch could do ACLs, IPv4 forwarding etc. And the kernel bloats. I am always bigoted to move all policy control to user space instead of bloating in the kernel. [...] Now here is the potential issue, (G) The frame transmitted from ethx.y with the destination address of veth0 but the embedded switch is not a learning switch. If the FDB update is done in user space its possible (likely?) that the FDB entry for veth0 has not been added to the embedded switch yet. Ok, got it - so the catch here is the switch is not capable of learning. I think this depends on where learning is done. Your intent is to use the S/W bridge as something that does the learning for you i.e in the kernel. This makes the s/w bridge part of MUST-have-for-this-to-run. And that maybe the case for your use case. [...] Well, in addition, there are SR-IOV network adapters that don't have any bridge. For these, the software bridge is necessary to handle multicast, broadcast and forwarding between local ports, not only to do learning. Solarflare's implementation of accelerated guest networking (which Shradha and I are gradually sending upstream) builds on libvirt's existing support for software bridges and assigns VFs to guests as a means to offload some of the forwarding. If and when we implement a hardware bridge, we would probably still want to keep the software bridge as a fallback. If a guest is dependent on a VF that's connected to a hardware bridge, it becomes impossible or at least very disruptive to migrate it to another host that doesn't have a compatible VF available. Ben. 
-- Ben Hutchings, Staff Engineer, Solarflare Not speaking for my employer; that's the marketing department's job. They asked us to note that Solarflare product names are trademarked.
Re: [PATCH 1/2] KVM: mmu_notifier: Flush TLBs before releasing mmu_lock
On 02/15/2012 10:07 PM, Avi Kivity wrote: On 02/15/2012 01:37 PM, Xiao Guangrong wrote: I would really like to move the IPI back out of the lock. How about something like a sequence lock: spin_lock(mmu_lock) need_flush = write_protect_stuff(); atomic_add(kvm->want_flush_counter, need_flush); spin_unlock(mmu_lock); while ((done = atomic_read(kvm->done_flush_counter)) < (want = atomic_read(kvm->want_flush_counter))) { kvm_make_request(flush) atomic_cmpxchg(kvm->done_flush_counter, done, want) } This (or maybe a corrected and optimized version) ensures that any need_flush cannot pass the while () barrier, no matter which thread encounters it first. However it violates the do not invent new locking techniques commandment. Can we map it to some existing method? There is no need to advance 'want' in the loop. So we could do /* must call with mmu_lock held */ void kvm_mmu_defer_remote_flush(kvm, need_flush) { if (need_flush) ++kvm->flush_counter.want; } /* may call without mmu_lock */ void kvm_mmu_commit_remote_flush(kvm) { want = ACCESS_ONCE(kvm->flush_counter.want) while ((done = atomic_read(kvm->flush_counter.done)) < want) { kvm_make_request(flush) atomic_cmpxchg(kvm->flush_counter.done, done, want) } } Hmm, we already have kvm->tlbs_dirty, so, we can do it like this: #define SPTE_INVALID_UNCLEAN (1ull << 63) in invalid page path: lock mmu_lock if (spte is invalid) kvm->tlbs_dirty |= SPTE_INVALID_UNCLEAN; need_tlb_flush = kvm->tlbs_dirty; unlock mmu_lock if (need_tlb_flush) kvm_flush_remote_tlbs() And in page write-protected path: lock mmu_lock if (it has spte changed to readonly || kvm->tlbs_dirty & SPTE_INVALID_UNCLEAN) kvm_flush_remote_tlbs() unlock mmu_lock How about this? Well, it still has flushes inside the lock. And it seems to be more complicated, but maybe that's because I thought of my idea and didn't fully grok yours yet. Oh, I was not thinking of flushing all tlbs outside of mmu-lock, just the invalid page path.
But there still are some paths that need to flush tlbs inside of mmu-lock (like sync_children, get_page). In your code: /* must call with mmu_lock held */ void kvm_mmu_defer_remote_flush(kvm, need_flush) { if (need_flush) ++kvm->flush_counter.want; } /* may call without mmu_lock */ void kvm_mmu_commit_remote_flush(kvm) { want = ACCESS_ONCE(kvm->flush_counter.want) while ((done = atomic_read(kvm->flush_counter.done)) < want) { kvm_make_request(flush) atomic_cmpxchg(kvm->flush_counter.done, done, want) } } I think we do not need to handle all tlb-flush requests here since all of these requests can be delayed to the point where mmu-lock is released; we can simply do it: void kvm_mmu_defer_remote_flush(kvm, need_flush) { if (need_flush) ++kvm->tlbs_dirty; } void kvm_mmu_commit_remote_flush(struct kvm *kvm) { int dirty_count = kvm->tlbs_dirty; smp_mb(); if (!dirty_count) return; if (make_all_cpus_request(kvm, KVM_REQ_TLB_FLUSH)) ++kvm->stat.remote_tlb_flush; cmpxchg(&kvm->tlbs_dirty, dirty_count, 0); } If this is ok, we only need a small change in the current code, since kvm_mmu_commit_remote_flush is very similar to kvm_flush_remote_tlbs().
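The defer/commit counter pattern discussed in this thread can be sketched in user space with C11 atomics. This is a single-threaded illustration of the invariant (no flush recorded before commit can escape past the loop), not the kernel code; the struct and names are this example's own.

```c
#include <stdatomic.h>
#include <stdbool.h>

/* Illustrative stand-in for struct kvm's flush bookkeeping. */
struct kvm_sim {
    atomic_long want;     /* flushes requested (bumped under mmu_lock) */
    atomic_long done;     /* flushes known to have completed */
    long flush_requests;  /* counts kvm_make_request + IPI equivalents */
};

/* In the real code this would be called with mmu_lock held. */
void defer_remote_flush(struct kvm_sim *kvm, bool need_flush)
{
    if (need_flush)
        atomic_fetch_add(&kvm->want, 1);
}

/* May be called after mmu_lock is dropped: loops until 'done' has
 * caught up with every request recorded before the unlock. */
void commit_remote_flush(struct kvm_sim *kvm)
{
    long want = atomic_load(&kvm->want);
    long done;

    while ((done = atomic_load(&kvm->done)) < want) {
        kvm->flush_requests++;  /* send the flush (the IPI) */
        /* advance 'done', unless another committer already did */
        atomic_compare_exchange_strong(&kvm->done, &done, want);
    }
}
```

A commit with nothing pending falls straight through the loop, which is what makes moving the IPI outside the lock safe: a racing committer either sees the new 'want' or the flush was already accounted for in 'done'.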
Re: [PATCH 3/3] KVM: perf: kvm events analysis tool
On 02/13/2012 11:52 PM, David Ahern wrote: The first patch is only needed for code compilation; after kvm-events is compiled, you can analyse any kernels. :) understood. Now that I recall perf's way of handling out-of-tree builds, a couple of comments: 1. you need to add the following to tools/perf/MANIFEST: arch/x86/include/asm/svm.h arch/x86/include/asm/vmx.h arch/x86/include/asm/kvm_host.h Right. 2. scripts/checkpatch.pl is an unhappy camper. It seems checkpatch always complains about TRACE_EVENT and many more-than-80-character lines in perf tools. I'll take a look at the code and try out the command when I get some time. Okay, I will post the next version after collecting your new comments! Thanks for your time, David! :)
Re: [PATCH 3/3] KVM: perf: kvm events analysis tool
On 2/15/12 9:59 PM, Xiao Guangrong wrote: Okay, I will post the next version after collecting your new comments! Thanks for your time, David! :) I had more comments, but got sidetracked and forgot to come back to this. I still haven't looked at the code yet, but some comments from testing:

1. The error message: Warning: Error: expected type 5 but read 4 Warning: Error: expected type 5 but read 0 Warning: unknown op '}' is fixed by this patch which has not yet made its way into perf: https://lkml.org/lkml/2011/9/4/41 The most recent request: https://lkml.org/lkml/2012/2/8/479 Arnaldo: the patch still applies cleanly (but with an offset of -2 lines).

2. negative testing: perf kvm-events record -e kvm:* -p 2603 -- sleep 10 Warning: Error: expected type 4 but read 7 Warning: Error: expected type 5 but read 0 Warning: failed to read event print fmt for kvm_apic Warning: Error: expected type 4 but read 7 Warning: Error: expected type 5 but read 0 Warning: failed to read event print fmt for kvm_inj_exception Fatal: bad op token { If other kvm events are specified in the record line they appear to be silently ignored in the report, in which case why allow the -e option to record?

3. What is happening for multiple VMs? a. perf kvm-events report data is collected for all VMs. What is displayed in the report? An average for all VMs? b. perf kvm-events report --vcpu 1 Does this give an average of all vcpu 1's? Perhaps a -p option for the report to pull out events related to a single VM. Really this could be a generic option (to perf-report and perf-script as well) to only show/analyze events for the specified pid. i.e., data is recorded for all VMs (or system-wide for the regular perf-record) and you want to only consider events for a specific pid. e.g., in process_sample_event() skip the event if event->ip.pid != report_pid (works for perf code because the PERF_SAMPLE_TID attribute is always set).
David
Re: [PATCH 3/3] KVM: perf: kvm events analysis tool
On 02/16/2012 01:05 PM, David Ahern wrote: On 2/15/12 9:59 PM, Xiao Guangrong wrote: Okay, I will post the next version after collecting your new comments! Thanks for your time, David! :) I had more comments, but got sidetracked and forgot to come back to this. I still haven't looked at the code yet, but some comments from testing: 1. The error message: Warning: Error: expected type 5 but read 4 Warning: Error: expected type 5 but read 0 Warning: unknown op '}' is fixed by this patch which has not yet made its way into perf: https://lkml.org/lkml/2011/9/4/41 The most recent request: https://lkml.org/lkml/2012/2/8/479 Arnaldo: the patch still applies cleanly (but with an offset of -2 lines). Great, it is a good fix. But it does not hurt the development of kvm-events. 2. negative testing: perf kvm-events record -e kvm:* -p 2603 -- sleep 10 Warning: Error: expected type 4 but read 7 Warning: Error: expected type 5 but read 0 Warning: failed to read event print fmt for kvm_apic Warning: Error: expected type 4 but read 7 Warning: Error: expected type 5 but read 0 Warning: failed to read event print fmt for kvm_inj_exception Fatal: bad op token { If other kvm events are specified in the record line they appear to be silently ignored in the report, in which case why allow the -e option to record? Yes, kvm-events does not analyse the events specified by the -e option since these events are not needed by vmexit/ioport/mmio analysis. And after kvm-events record, you can see these events via perf script. 3. What is happening for multiple VMs? a. perf kvm-events report data is collected for all VMs. What is displayed in the report? An average for all VMs? Yes b. perf kvm-events report --vcpu 1 Does this give an average of all vcpu 1's? Yes Perhaps a -p option for the report to pull out events related to a single VM. Really this could be a generic option (to perf-report and perf-script as well) to only show/analyze events for the specified pid.
i.e., data is recorded for all VMs (or system-wide for the regular perf-record) and you want to only consider events for a specific pid. e.g., in process_sample_event() skip the event if event->ip.pid != report_pid (works for perf code because the PERF_SAMPLE_TID attribute is always set). Per-VM analysis is a good idea, but please allow me to put it into my TODO list. :)
[Bug 42779] New: KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779

           Summary: KVM domain hangs after loading initrd with Xenomai kernel
           Product: Virtualization
           Version: unspecified
    Kernel Version: 3.0.0-15
          Platform: All
        OS/Version: Linux
              Tree: Mainline
            Status: NEW
          Severity: normal
          Priority: P1
         Component: kvm
        AssignedTo: virtualization_...@kernel-bugs.osdl.org
        ReportedBy: madenginee...@gmail.com
        Regression: No

Attempting to boot a 32-bit Debian guest with a Xenomai kernel inside KVM causes it to hang and spin (using one full CPU core) after loading the initrd, as determined by serial console output. The only error message is "KVM internal error. Suberror: 1" (emulation failure). Booting a regular Debian kernel succeeds, as does running the Xenomai kernel with software emulation (-no-kvm).

Info:
CPU: Intel Core i7-2670QM
Emulator: qemu-kvm 0.14.1
Host kernel: 3.0.0-15 (Ubuntu build), x86_64
Guest OS: Debian Squeeze, kernel.org 2.6.37 kernel with Xenomai 2.6.0 (config attached)

Qemu command: kvm -M pc-0.14 -enable-kvm -m 1024 -drive file=/var/lib/libvirt/images/eve.img,if=none,id=drive-ide0-0-0,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=21,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:b5:f4:00,bus=pci.0,addr=0x3 -chardev stdio,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -device usb-tablet,id=input0 -vga cirrus

Effects of flags: Adding one or both of --no-kvm-irqchip or --no-kvm-pit has no apparent effect. Adding --no-kvm appears to correct the problem, at the cost of performance due to using the software emulator.

-- Configure bugmail: https://bugzilla.kernel.org/userprefs.cgi?tab=email --- You are receiving this mail because: --- You are watching the assignee of the bug.
[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779 --- Comment #1 from madenginee...@gmail.com 2012-02-16 05:46:07 --- Created an attachment (id=72393) -- (https://bugzilla.kernel.org/attachment.cgi?id=72393) Configuration of the guest kernel
[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779 --- Comment #2 from madenginee...@gmail.com 2012-02-16 05:47:18 --- Created an attachment (id=72394) -- (https://bugzilla.kernel.org/attachment.cgi?id=72394) Result of 'registers info' and 'x/30i $eip' after fault
[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779 madenginee...@gmail.com changed:

    What                  |Removed                  |Added
    Attachment #72393     |application/octet-stream |text/plain
     mime type            |                         |
[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779 madenginee...@gmail.com changed:

    What                  |Removed                  |Added
    Attachment #72394     |application/octet-stream |text/plain
     mime type            |                         |
[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779 --- Comment #3 from madenginee...@gmail.com 2012-02-16 05:57:25 --- Couldn't attach the trace I recorded of the fault occurring since it's 3 MB compressed with xz, bigger still with other formats. I can email it if it will be useful.
[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779 --- Comment #4 from madenginee...@gmail.com 2012-02-16 06:32:48 --- Same problem occurs with qemu-kvm 1.0 from https://launchpad.net/~bderzhavets/+archive/lib-usbredir39: $ sudo kvm -M pc-1.0 -enable-kvm -m 1024 -drive file=/var/lib/libvirt/images/eve.img,if=none,id=drive-ide0-0-0,format=raw -device ide-drive,bus=ide.0,unit=0,drive=drive-ide0-0-0,id=ide0-0-0 -netdev tap,fd=21,id=hostnet0 -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:b5:f4:00,bus=pci.0,addr=0x3 -chardev vc,id=charserial0 -device isa-serial,chardev=charserial0,id=serial0 -usb -device usb-tablet,id=input0 -vga cirrus kvm: -netdev tap,fd=21,id=hostnet0: TUNGETIFF ioctl() failed: Bad file descriptor TUNSETOFFLOAD ioctl() failed: Bad file descriptor kvm: -device e1000,netdev=hostnet0,id=net0,mac=52:54:00:b5:f4:00,bus=pci.0,addr=0x3: pci_add_option_rom: failed to find romfile pxe-e1000.rom KVM internal error. Suberror: 1 emulation failure EAX=f681 EBX=003e ECX=003e EDX=c00b8000 ESI=c00b8000 EDI=c15b EBP=c15b1f74 ESP=c15b1f58 EIP=c1228905 EFL=00010206 [-P-] CPL=0 II=0 A20=1 SMM=0 HLT=0 ES =007b 00c0f300 DPL=3 DS [-WA] CS =0060 00c09b00 DPL=0 CS32 [-RA] SS =0068 00c09300 DPL=0 DS [-WA] DS =007b 00c0f300 DPL=3 DS [-WA] FS = GS = LDT= TR =0080 c15b6300 206b 8b00 DPL=0 TSS32-busy GDT= c15b3000 00ff IDT= c15b2000 07ff CR0=80050033 CR2=ffee4000 CR3=01663000 CR4=0690 DR0= DR1= DR2= DR3= DR6=0ff0 DR7=0400 EFER= Code=8e 2b 01 00 00 8b 4d f0 89 f2 8b 45 ec 0f 0d 82 40 01 00 00 0f 6f 02 0f 6f 4a 08 0f 6f 52 10 0f 6f 5a 18 0f 7f 00 0f 7f 48 08 0f 7f 50 10 0f 7f 58 18
[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779 Gleb g...@redhat.com changed:

    What    |Removed    |Added
    CC      |           |g...@redhat.com

--- Comment #5 from Gleb g...@redhat.com 2012-02-16 07:15:49 --- (In reply to comment #3) Couldn't attach the trace I recorded of the fault occurring since it's 3 MB compressed with xz, bigger still with other formats. I can email it if it will be useful. Can you do tail -1 on it and attach it here?
[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779 --- Comment #6 from madenginee...@gmail.com 2012-02-16 07:22:41 --- Created an attachment (id=72395) -- (https://bugzilla.kernel.org/attachment.cgi?id=72395) Last 10k lines of a trace showing the fault Per Gleb's request
[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779 --- Comment #7 from Gleb g...@redhat.com 2012-02-16 07:43:11 --- Have you installed trace-cmd before capturing the trace? It failed to parse kvm events. qemu hasn't paused the guest after the emulation error (looks like a bug), so the 'x/30i $eip' output is not useful either. Can you do 'x/30i 0xXXX' where XXX is the address in EIP from the register dump you see after the instruction emulation failure message (c1228905 in your output from comment #4)?
[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779 --- Comment #8 from madenginee...@gmail.com 2012-02-16 07:56:03 --- Not sure what you mean by installing trace-cmd before capturing the trace--I did do that, otherwise I wouldn't have had a trace-cmd to run. The package version is trace-cmd 1.0.3-0ubuntu1 if that helps. I tried running it again against qemu 1.0 (the last one was for qemu 0.14), still contained a bunch of [FAILED TO PARSE] messages. Attaching the output you requested separately.
[Bug 42779] KVM domain hangs after loading initrd with Xenomai kernel
https://bugzilla.kernel.org/show_bug.cgi?id=42779 --- Comment #9 from madenginee...@gmail.com 2012-02-16 07:57:35 --- Created an attachment (id=72397) -- (https://bugzilla.kernel.org/attachment.cgi?id=72397) Register state and code disassembly at failure point with qemu 1.0
Re: [Qemu-devel] [RFC] Next gen kvm api
On 15.02.2012, at 12:18, Avi Kivity wrote: On 02/07/2012 04:39 PM, Alexander Graf wrote: Syscalls are orthogonal to that - they're to avoid the fget_light() and to tighten the vcpu/thread and vm/process relationship. How about keeping the ioctl interface but moving vcpu_run to a syscall then? I dislike half-and-half interfaces even more. And it's not like the fget_light() is really painful - it's just that I see it occasionally in perf top so it annoys me. That should really be the only thing that belongs into the fast path, right? Every time we do a register sync in user space, we do something wrong. Instead, user space should either a) have wrappers around register accesses, so it can directly ask for specific registers that it needs or b) keep everything that would be requested by the register synchronization in shared memory Always-synced shared memory is a liability, since newer hardware might introduce on-chip caches for that state, making synchronization expensive. Or we may choose to keep some of the registers loaded, if we have a way to trap on their use from userspace - for example we can return to userspace with the guest fpu loaded, and trap if userspace tries to use it. Is an extra syscall for copying TLB entries to user space prohibitively expensive? The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes. , keep the rest in user space. When a device is fully in the kernel, we have a good specification of the ABI: it just implements the spec, and the ABI provides the interface from the device to the rest of the world. Partially accelerated devices means a much greater effort in specifying exactly what it does. It's also vulnerable to changes in how the guest uses the device. Why? 
For the HPET timer register for example, we could have a simple MMIO hook that says on_read: return read_current_time() - shared_page.offset; on_write: handle_in_user_space(); It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails. I don't understand. Why would anything fail here? It fails to provide a benefit, I didn't mean it causes guest failures. You also have to make sure the kernel part and the user part use exactly the same time bases. Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;). Once the logic that's implemented by the kernel accelerator doesn't fit anymore, unregister it. Yeah. Also look at the PIT which latches on read. For IDE, it would be as simple as

  register_pio_hook_ptr_r(PIO_IDE, SIZE_BYTE, &s->cmd[0]);
  for (i = 1; i < 7; i++) {
      register_pio_hook_ptr_r(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
      register_pio_hook_ptr_w(PIO_IDE + i, SIZE_BYTE, &s->cmd[i]);
  }

and we should have reduced the overhead of IDE by quite a bit already. All the other 2k LOC in hw/ide/core.c don't matter for us really. Just use virtio. Just use xenbus. Seriously, this is not an answer. Why not? We invested effort in making it as fast as possible, and in writing the drivers. IDE will never, ever, get anything close to virtio performance, even if we put all of it in the kernel. However, after these examples, I'm more open to partial acceleration now. I won't ever like it though. - VGA - IDE Why? There are perfectly good replacements for these (qxl, virtio-blk, virtio-scsi). Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 3rd party drivers are a way of life for Windows users; and the incremental benefits of IDE acceleration are still far behind virtio. The typical way of life for Windows users is all-included drivers.
Which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The IDE thing was just an idea for legacy ones. It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though. And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Cirrus or vesa should be okay for them, I don't see what we could do for them in the kernel, or why. That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable. Same for virtio. Please don't do
Re: [Qemu-devel] [RFC] Next gen kvm api
On 02/15/2012 01:57 PM, Alexander Graf wrote: Is an extra syscall for copying TLB entries to user space prohibitively expensive? The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes. You don't need to copy the entire TLB, just the way that maps the address you're interested in. btw, why are you interested in virtual addresses in userspace at all? It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails. I don't understand. Why would anything fail here? It fails to provide a benefit, I didn't mean it causes guest failures. You also have to make sure the kernel part and the user part use exactly the same time bases. Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;). Depends on how much the alignment relies on guest knowledge. I guess with a simple device like HPET, it's simple, but with a complex device, different guests (or different versions of the same guest) could drive it very differently. Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 3rd party drivers are a way of life for Windows users; and the incremental benefits of IDE acceleration are still far behind virtio. The typical way of life for Windows users is all-included drivers. Which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The IDE thing was just an idea for legacy ones. It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though.
And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices. Ok. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Cirrus or vesa should be okay for them, I don't see what we could do for them in the kernel, or why. That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable. Same for virtio. Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest. Rest easy, there's no chance of that. But if a guest is important enough, virtio drivers will get written. IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it. Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads. For linear loads, so should we, perhaps with greater cpu utilization. If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple) means 0.5 msec/transaction. Spending 30 usec on some heavyweight exits shouldn't matter. *shrug* last time I checked we were a lot slower. But maybe there's more stuff making things slow than the exit path ;). One thing that's different is that virtio offloads itself to a thread very quickly, while IDE does a lot of work in vcpu thread context. KVM's strength has always been its close resemblance to hardware. This will remain. But we can't optimize everything. That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no? We should make sure that we don't default to IDE. Qemu has no knowledge of the guest, so it can't default to virtio, but higher level tools can and should. You can only default to virtio on recent Linux. Windows, BSD, etc don't include drivers, so you can't assume it works.
You can default to AHCI for basically any recent guest, but that still won't work for XP and the likes :(. The all-knowing management tool can provide a virtio driver disk, or even slip-stream the driver into the installation CD. Ah, because you're on NPT and you can have MMIO hints in the nested page table. Nifty. Yeah, we don't have that luxury :). Well the real reason is we have an extra bit reported by page faults that we can control. Can't you set up a hashed pte that is configured in a way that it will fault, no matter what type of access the guest does, and see it in your page fault handler? I might be able to synthesize a PTE that is !readable and might throw a permission exception instead of a miss exception. I might be able to synthesize something similar for booke. I don't however get any indication on why things failed. So for MMIO reads,
Re: [Qemu-devel] [RFC] Next gen kvm api
On 15.02.2012, at 14:29, Avi Kivity wrote: On 02/15/2012 01:57 PM, Alexander Graf wrote: Is an extra syscall for copying TLB entries to user space prohibitively expensive? The copying can be very expensive, yes. We want to have the possibility of exposing a very large TLB to the guest, in the order of multiple kentries. Every entry is a struct of 24 bytes. You don't need to copy the entire TLB, just the way that maps the address you're interested in. Yeah, unless we do migration in which case we need to introduce another special case to fetch the whole thing :(. btw, why are you interested in virtual addresses in userspace at all? We need them for gdb and monitor introspection. It works for the really simple cases, yes, but if the guest wants to set up one-shot timers, it fails. I don't understand. Why would anything fail here? It fails to provide a benefit, I didn't mean it causes guest failures. You also have to make sure the kernel part and the user part use exactly the same time bases. Right. It's an optional performance accelerator. If anything doesn't align, don't use it. But if you happen to have a system where everything's cool, you're faster. Sounds like a good deal to me ;). Depends on how much the alignment relies on guest knowledge. I guess with a simple device like HPET, it's simple, but with a complex device, different guests (or different versions of the same guest) could drive it very differently. Right. But accelerating simple devices > not accelerating any devices. No? :) Because not every guest supports them. Virtio-blk needs 3rd party drivers. AHCI needs 3rd party drivers on w2k3 and wxp. 3rd party drivers are a way of life for Windows users; and the incremental benefits of IDE acceleration are still far behind virtio. The typical way of life for Windows users is all-included drivers. Which is the case for AHCI, where we're getting awesome performance for Vista and above guests. The IDE thing was just an idea for legacy ones.
It'd be great to simply try and see how fast we could get by handling a few special registers in kernel space vs heavyweight exiting to QEMU. If it's only 10%, I wouldn't even bother with creating an interface for it. I'd bet the benefits are a lot bigger though. And the main point was that specific partial device emulation buys us more than pseudo-generic accelerators like coalesced mmio, which are also only used by 1 or 2 devices. Ok. I'm pretty sure non-Linux non-Windows systems won't get QXL drivers. Cirrus or vesa should be okay for them, I don't see what we could do for them in the kernel, or why. That's my point. You need fast emulation of standard devices to get a good baseline. Do PV on top, but keep the baseline as fast as is reasonable. Same for virtio. Please don't do the Xen mistake again of claiming that all we care about is Linux as a guest. Rest easy, there's no chance of that. But if a guest is important enough, virtio drivers will get written. IDE has no chance in hell of approaching virtio-blk performance, no matter how much effort we put into it. Ever used VMware? They basically get virtio-blk performance out of ordinary IDE for linear workloads. For linear loads, so should we, perhaps with greater cpu utilization. If we DMA 64 kB at a time, then 128 MB/sec (to keep the numbers simple) means 0.5 msec/transaction. Spending 30 usec on some heavyweight exits shouldn't matter. *shrug* last time I checked we were a lot slower. But maybe there's more stuff making things slow than the exit path ;). One thing that's different is that virtio offloads itself to a thread very quickly, while IDE does a lot of work in vcpu thread context. So it's all about latencies again, which could be reduced at least a fair bit with the scheme I described above. But really, this needs to be prototyped and benchmarked to actually give us data on how fast it would get us. KVM's strength has always been its close resemblance to hardware. This will remain.
But we can't optimize everything. That's my point. Let's optimize the hot paths and be good. As long as we default to IDE for disk, we should have that be fast, no? We should make sure that we don't default to IDE. Qemu has no knowledge of the guest, so it can't default to virtio, but higher level tools can and should. You can only default to virtio on recent Linux. Windows, BSD, etc don't include drivers, so you can't assume it works. You can default to AHCI for basically any recent guest, but that still won't work for XP and the likes :(. The all-knowing management tool can provide a virtio driver disk, or even slip-stream the driver into the installation CD. One management tool might do that, another one might not. We can't assume that all management tools are all-knowing. Sometimes you also want to run guest OSs that
Re: [Qemu-devel] [RFC] Next gen kvm api
On Tuesday 07 February 2012, Alexander Graf wrote: Not sure we'll ever get there. For PPC, it will probably take another 1-2 years until we get the 32-bit targets stabilized. By then we will have new 64-bit support though. And then the next gen will come out giving us even more new constraints. I would expect that newer archs have fewer constraints, not more. Heh. I doubt it :). The 64-bit booke stuff is pretty similar to what we have today on 32-bit, but extends a bunch of registers to 64-bit. So what if we laid out stuff wrong before? I don't even want to imagine what v7 arm vs v8 arm looks like. It's a completely new architecture. I have not seen the source but I'm pretty sure that v7 and v8 look very similar regarding virtualization support because they were designed together, including the concept that on v8 you can run either a v7-compatible 32-bit hypervisor with 32-bit guests or a 64-bit hypervisor with a combination of 32- and 64-bit guests. Also, the page table layout in v7-LPAE is identical to the v8 one. The main difference is the instruction set, but then ARMv7 already has four of these (ARM, Thumb, Thumb2, ThumbEE). Arnd -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH 14/16] KVM: PPC: booke: category E.HV (GS-mode) support
On 15.02.2012, at 20:40, Scott Wood wrote: On 02/15/2012 01:36 PM, Alexander Graf wrote: On 10.01.2012, at 01:51, Scott Wood wrote: I was thinking we'd check ESR[EPID] or SRR1[IS] as appropriate, and treat it as a kernel fault (search exception table) -- but this works too and is a bit cleaner (could be other uses of external pid), at the expense of a couple extra instructions in the emulation path (but probably a slightly faster host TLB handler). The check wouldn't go in DO_KVM, though, since on bookehv that only deals with diverting flow when xSRR1[GS] is set, which wouldn't be the case here. Thinking about it a bit more, how is this different from a failed get_user()? We can just use the same fixup mechanism as there, right? The fixup mechanism can be the same (we'd like to know whether it failed due to TLB miss or DSI, so we know which to reflect No, we only want to know fast path failed. The reason is a different pair of shoes and should be evaluated in the slow path. We shouldn't ever fault here during normal operation btw. We already executed a guest instruction, so there's almost no reason it can't be read. -- but if necessary I think we can figure that out with a tlbsx). What's different is that the page fault handler needs to know that any external pid (or AS1) fault is bad, same as if the address were in the kernel area, and it should go directly to searching the exception tables instead of trying to page something in. Yes and no. We need to force it to search the exception tables. We don't care if the page fault handlers knows anything about external pids. Either way, we discussed the further stuff on IRC and came to a working solution :). Stay tuned. Alex