Re: virtio-blk performance regression and qemu-kvm
On Tue, Feb 28, 2012 at 5:15 PM, Martin Mailand mar...@tuxadero.com wrote: Hi Stefan, I was bisecting qemu-kvm.git. qemu-kvm.git regularly merges from qemu.git. The history of the qemu-kvm.git repository is not linear because of these periodic merges from the qemu.git tree. I think what happened is that git bisect went down a qemu.git merge, which resulted in you effectively testing qemu.git instead of qemu-kvm.git! You can see this by using gitk(1) or git log --graph and searching for the bad commit (12d4536f7d). This will show a merge commit (0b9b128530b999e36f0629dddcbafeda114fb4fb) where these qemu.git commits were brought into qemu-kvm.git's history. qemu-kvm.git | ... | * merge commit (0b9b128530b999e36f0629dddcbafeda114fb4fb) |\ | * bad commit (12d4536f7d911b6d87a766ad7300482ea663cea2) | | | ... more qemu.git commits | * previous commit, also a merge commit (4fefc55ab04dd77002750f771e96477b5d2a473f) Bisect seems to have gone down the qemu.git side of the merge at 0b9b128530b. Instead we need to go down the qemu-kvm.git side of the history. The key thing is that the second git-bisect steps from qemu-kvm.git into qemu.git history the source tree will be fairly different and performance between the two is not easy to compare. I suggest testing both of the qemu-kvm.git merge commits, 0b9b128530b and 4fefc55ab04d. My guess is you will find they perform the same, i.e. the qemu.git commits which were merged did not affect performance in qemu-kvm.git. That would be evidence that git-bisect has not located the real culprit. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
On 02/29/2012 09:49 AM, Takuya Yoshikawa wrote: Avi Kivity a...@redhat.com wrote: The key part of the implementation is the use of xchg() operation for clearing dirty bits atomically. Since this allows us to update only BITS_PER_LONG pages at once, we need to iterate over the dirty bitmap until every dirty bit is cleared again for the next call. What about using cmpxchg16b? That should reduce locked ops by a factor of 2 (but note it needs 16 bytes alignment). I tried cmpxchg16b first: the implementation could not be naturally combined with the for loop over the unsigned long array. Extra if not zero, alignement check and ... it was ugly and I guessed it would be slow. Taking it into account that cmpxchg16b needs more cycles than others, I think this should be tried carefully with more measurement later. How about concentrating on xchg now? Okay. But note you don't need the alignment check; simply allocate the array aligned, and a multiple of 16 bytes, in the first place. The implementation is simple and gives us enough improvement for now. At least, I want to see whether xchg-based implementation works well for one release. GET_DIRTY_LOG can be easily tuned to one particular case and it is really hard to check whether the implementation works well for every important case. I really want feedback from users before adding non-obvious optimization. In addition, we should care about the new API. It is not decided about what kind of range can be ordered. I think restricting the range to be long size aligned is natural. Do you have any plan? Not really. But the current changes are all internal and don't affect the user API. Another point to note is that we do not use for_each_set_bit() to check which ones in each BITS_PER_LONG pages are actually dirty. Instead we simply use __ffs() and __fls() and pass the range in between the two positions found by them to kvm_mmu_write_protect_pt_range(). This seems artificial. OK, then I want to pass the bits (unsingned long) as a mask. Non-NPT machines may gain some. Even though the passed range may include clean pages, it is much faster than repeatedly call find_next_bit() due to the locality of dirty pages. Perhaps this is due to the implementation of find_next_bit()? would using bsf improve things? I need to explain what I did in the past. Before srcu-less work, I had already noticed the slowness of for_each_set_bit() and replaced it with simple for loop like now: the improvement was significant. Yes, find_next_bit() is for generic use and not at all good when there are many consecutive bits set: it cannot assume anything so needs to check a lot of cases - we have long size aligned bitmap and bits is already known to be non-zero after the first check of the for loop. Of course, doing 64 function calls alone should be avoided in our case. I also do not want to call kvm_mmu_* for each bit. So, above, I proposed just passing bits to kvm_mmu_*: we can check each bit i in a register before using rmap[i] if needed. __ffs is really fast compared to other APIs. You could do something like this for (i = 0; i bitmap_size_in_longs; ++i) { mask = bitmap[i]; if (!mask) continue; for (j = __ffs(mask); mask; mask = mask - 1, j = __ffs(mask)) handle i * BITS_PER_LONG + j; } This gives you the speed of __ffs() but without working on zero bits. One note is that we will lose in cases like bits = 0x.. 2271171.412064.9 138.6 16K -45 3375866.214743.3 103.0 32K -55 4408395.610720.067.2 64K -51 5915336.226538.145.1 128K -44 8497356.416441.032.4 256K -29 So the last one will become worse. For other 4 patterns I am not sure. I thought that we should tune to the last case for gaining a lot from the locality of WWS. What do you think about this point? - } else { - r = -EFAULT; - if (clear_user(log-dirty_bitmap, n)) - goto out; + kvm_mmu_write_protect_pt_range(kvm, memslot, start, end); If indeed the problem is find_next_bit(), then we could hanve kvm_mmu_write_protect_slot_masked() which would just take the bitmap as a parameter. This would allow covering just this function with the spinlock, not the xchg loop. We may see partial display updates if we do not hold the mmu_lock during xchg loop: it is possible that pages near the end of the framebuffer alone gets updated sometimes - I noticed this problem when I fixed the TLB flush issue. I don't understand why this happens. Not a big problem but still maybe-noticeable change, so I think we should do it separately with some comments if needed. Well if it's noticable on the framebuffer it's also noticable with live migration. We could do
Re: Reconciling qemu-kvm and qemu's PIT
On 02/28/2012 11:47 PM, Jan Kiszka wrote: Looks like isa-pit has zero gpio pins, so it fails when crashing. I must have mismerged it, but where is the gpio pin count set? In pit_initfn. Does -no-kvm-irqchip work fine? It's a bit tricky to discuss this without seeing your code. pushed it into qemu-kvm.git usptream-merge -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor
On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote: On Tue, 28 Feb 2012, Ian Campbell wrote: On Tue, 2012-02-28 at 10:20 +, Dave Martin wrote: On Mon, Feb 27, 2012 at 07:33:39PM +, Ian Campbell wrote: On Mon, 2012-02-27 at 18:03 +, Dave Martin wrote: Since we support only ARMv7+ there are T2 and T3 encodings available which do allow direct mov of an immediate into R12, but are 32 bit Thumb instructions. Should we use r7 instead to maximise instruction density for Thumb code? The difference seems trivial when put into context, even if you code a special Thumb version of the code to maximise density (the Thumb-2 code which gets built from assembler in the kernel is very suboptimal in size, but there simply isn't a high proportion of asm code in the kernel anyway.) I wouldn't consider the ARM/Thumb differences as an important factor when deciding on a register. OK, that's useful information. thanks. One argument for _not_ using r12 for this purpose is that it is then harder to put a generic HVC function (analogous to the syscall syscall) out-of-line, since r12 could get destroyed by the call. For an out of line syscall(2) wouldn't the syscall number either be in a standard C calling convention argument register or on the stack when the function was called, since it is just a normal argument at that point? As you point out it cannot be passed in r12 (and could never be, due to the clobbering). The syscall function itself would have to move the arguments and syscall nr etc around before issuing the syscall. I think the same is true of a similar hypercall(2) If you don't think you will ever care about putting HVC out of line though, it may not matter. If you have both inline and out-of-line hypercalls, it's hard to ensure that you never have to shuffle the registers in either case. Agreed. I think we want to optimise for the inline case since those are the majority. They are not just the majority, all of them are static inline at the moment, even on x86 (where the number of hypercalls is much higher). So yes, we should optimize for the inline case. The only non-inline case is the special privcmd ioctl which is the mechanism that allows the Xen toolstack to make hypercalls. It's somewhat akin to syscall(2). By the time you get to it you will already have done a system call for the ioctl, pulled the arguments from the ioctl argument structure etc, plus such hypercalls are not really performance critical. Even the privcmd hypercall (privcmd_call) is a static inline function, it is just that at the moment there is only one caller :) Shuffling can be reduced but only at the expense of strange argument ordering in some cases when calling from C -- the complexity is probably not worth it. Linux doesn't bother for its own syscalls. Note that even in assembler, a branch from one section to a label in another section may cause r12 to get destroyed, so you will need to be careful about how you code the hypervisor trap handler. However, this is not different from coding exception handlers in general, so I don't know that it constitutes a conclusive argument on its own. We are happy to arrange that this doesn't occur on our trap entry paths, at least until the guest register state has been saved. Currently the hypercall dispatcher is in C and gets r12 from the on-stack saved state. We will likely eventually optimise the hypercall path directly in ASM and in that case we are happy to take steps to ensure we don't clobber r12 before we need it. Yes, I don't think this should be an issue. Fair enough. My instinctive preference would therefore be for r7 (which also seems to be good enough for Linux syscalls) -- but it really depends how many arguments you expect to need to support. Apparently r7 is the frame pointer for gcc in thumb mode which I think is a good reason to avoid it. We currently have some 5 argument hypercalls and there have been occasional suggestions for interfaces which use 6 -- although none of them have come to reality. I don't have a very strong opinion on which register we should use, but I would like to avoid r7 if it is already actively used by gcc. But there is no framepointer for Thumb-2 code (?) The fact that r12 can be destroyed so easily is actually a good argument for using it because it means it is less likely to contain useful data that needs to be saved/restored by gcc. That's a fair point. Cheers ---Dave -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On Wed, Feb 29, 2012 at 09:08:52AM +0800, Wen Congyang wrote: At 02/28/2012 06:45 PM, Gleb Natapov Wrote: On Tue, Feb 28, 2012 at 11:19:47AM +0100, Jan Kiszka wrote: On 2012-02-28 10:42, Wen Congyang wrote: At 02/28/2012 05:34 PM, Jan Kiszka Wrote: On 2012-02-28 09:23, Wen Congyang wrote: At 02/27/2012 11:08 PM, Jan Kiszka Wrote: On 2012-02-27 04:01, Wen Congyang wrote: We can know the guest is paniced when the guest runs on xen. But we do not have such feature on kvm. This patch implemnts this feature, and the implementation is the same as xen: register panic notifier, and call hypercall when the guest is paniced. Signed-off-by: Wen Congyang we...@cn.fujitsu.com --- arch/x86/kernel/kvm.c| 12 arch/x86/kvm/svm.c |8 ++-- arch/x86/kvm/vmx.c |8 ++-- arch/x86/kvm/x86.c | 13 +++-- include/linux/kvm.h |1 + include/linux/kvm_para.h |1 + 6 files changed, 37 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index f0c6fd6..b928d1d 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -331,6 +331,17 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; +static int +kvm_pv_panic_notify(struct notifier_block *nb, unsigned long code, void *unused) +{ + kvm_hypercall0(KVM_HC_GUEST_PANIC); + return NOTIFY_DONE; +} + +static struct notifier_block kvm_pv_panic_nb = { + .notifier_call = kvm_pv_panic_notify, +}; + You should split up host and guest-side changes. static u64 kvm_steal_clock(int cpu) { u64 steal; @@ -417,6 +428,7 @@ void __init kvm_guest_init(void) paravirt_ops_setup(); register_reboot_notifier(kvm_pv_reboot_nb); + atomic_notifier_chain_register(panic_notifier_list, kvm_pv_panic_nb); for (i = 0; i KVM_TASK_SLEEP_HASHSIZE; i++) spin_lock_init(async_pf_sleepers[i].lock); if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 0b7690e..38b4705 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1900,10 +1900,14 @@ static int halt_interception(struct vcpu_svm *svm) static int vmmcall_interception(struct vcpu_svm *svm) { + int ret; + svm-next_rip = kvm_rip_read(svm-vcpu) + 3; skip_emulated_instruction(svm-vcpu); - kvm_emulate_hypercall(svm-vcpu); - return 1; + ret = kvm_emulate_hypercall(svm-vcpu); + + /* Ignore the error? */ + return ret == 0 ? 0 : 1; Why can't kvm_emulate_hypercall return the right value? kvm_emulate_hypercall() will call kvm_hv_hypercall(), and kvm_hv_hypercall() will return 0 when vcpu's CPL 0. If vcpu's CPL 0, does kvm need to exit and tell it to qemu? No, there is currently no exit to userspace due to hypercalls, neither of HV nor KVM kind. The point is that the return code of kvm_emulate_hypercall is unused so far, so you can easily redefine it to encode continue vs. exit to userspace. Once someone has different needs, this could still be refactored again. So, it is OK to change the return value of kvm_hv_hypercall() if vcpu's CPL 0? Yes, change it to encode what vendor modules need to return to their callers. Better introduce new request flag and set it in your hypercall emulation. See how triple fault is handled. triple fault sets KVM_EXIT_SHUTDOWN and exits to userspace. Do you mean introduce a new value(like KVM_EXIT_SHUTDOWN)? I mean introduce new request bit (like KVM_REQ_TRIPLE_FAULT) and set it in your hypercall if exit to userspace is needed instead of changing return values. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On 02/29/2012 03:29 AM, Wen Congyang wrote: At 02/28/2012 07:23 PM, Avi Kivity Wrote: On 02/27/2012 05:01 AM, Wen Congyang wrote: We can know the guest is paniced when the guest runs on xen. But we do not have such feature on kvm. This patch implemnts this feature, and the implementation is the same as xen: register panic notifier, and call hypercall when the guest is paniced. What's the motivation for this? Xen does this is insufficient. Another purpose is: management app(for example: libvirt) can do auto dump when the guest is crashed. If management app does not do auto dump, the guest's user can do dump by hand if he sees the guest is paniced. I am thinking about another status: dumping. This status tells the guest's user that the guest is paniced, and the OS's dump function is working. These two status can tell the guest's user whether the guest is pancied, and what should he do if the guest is paniced. How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On Wed, Feb 29, 2012 at 11:49:58AM +0200, Avi Kivity wrote: On 02/29/2012 03:29 AM, Wen Congyang wrote: At 02/28/2012 07:23 PM, Avi Kivity Wrote: On 02/27/2012 05:01 AM, Wen Congyang wrote: We can know the guest is paniced when the guest runs on xen. But we do not have such feature on kvm. This patch implemnts this feature, and the implementation is the same as xen: register panic notifier, and call hypercall when the guest is paniced. What's the motivation for this? Xen does this is insufficient. Another purpose is: management app(for example: libvirt) can do auto dump when the guest is crashed. If management app does not do auto dump, the guest's user can do dump by hand if he sees the guest is paniced. I am thinking about another status: dumping. This status tells the guest's user that the guest is paniced, and the OS's dump function is working. These two status can tell the guest's user whether the guest is pancied, and what should he do if the guest is paniced. How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). Isn't it unreliable after the guest panicked? Having special kdump kernel that transfers dump to a host via virtio-serial channel though sounds interesting. May be that's what you mean. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor
On Wed, 2012-02-29 at 09:34 +, Dave Martin wrote: On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote: I don't have a very strong opinion on which register we should use, but I would like to avoid r7 if it is already actively used by gcc. But there is no framepointer for Thumb-2 code (?) Peter Maydell suggested there was: r7 is (used by gcc as) the Thumb frame pointer; I don't know if this makes it worth avoiding in this context. Sounds like it might be a gcc-ism, possibly a non-default option? Anyway, I think r12 will be fine for our purposes so the point is rather moot. Ian. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On Wed, Feb 29, 2012 at 11:49:58AM +0200, Avi Kivity wrote: On 02/29/2012 03:29 AM, Wen Congyang wrote: At 02/28/2012 07:23 PM, Avi Kivity Wrote: On 02/27/2012 05:01 AM, Wen Congyang wrote: We can know the guest is paniced when the guest runs on xen. But we do not have such feature on kvm. This patch implemnts this feature, and the implementation is the same as xen: register panic notifier, and call hypercall when the guest is paniced. What's the motivation for this? Xen does this is insufficient. Another purpose is: management app(for example: libvirt) can do auto dump when the guest is crashed. If management app does not do auto dump, the guest's user can do dump by hand if he sees the guest is paniced. I am thinking about another status: dumping. This status tells the guest's user that the guest is paniced, and the OS's dump function is working. These two status can tell the guest's user whether the guest is pancied, and what should he do if the guest is paniced. How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). When the guest OS has crashed, any dumps will be done from the host OS using libvirt's core dump mechanism. The guest OS isn't involved and is likely too dead to be of any use anyway. Likewise it is quite probably too dead to work a virtio-serial channel or any similarly complex device. We're really just after the simplest possible notification that the guest kernel has paniced. Regards, Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On 02/29/2012 11:55 AM, Gleb Natapov wrote: How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). Isn't it unreliable after the guest panicked? So is calling hypercalls, or dumping, or writing to the screen. Of course calling a hypercall is simpler and so is more reliable. Having special kdump kernel that transfers dump to a host via virtio-serial channel though sounds interesting. May be that's what you mean. Yes. The panic, starting dump signal should be initiated by the panicking kernel though, in case the dump fails. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On 02/29/2012 11:58 AM, Daniel P. Berrange wrote: How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). When the guest OS has crashed, any dumps will be done from the host OS using libvirt's core dump mechanism. The guest OS isn't involved and is likely too dead to be of any use anyway. Likewise it is quite probably too dead to work a virtio-serial channel or any similarly complex device. We're really just after the simplest possible notification that the guest kernel has paniced. If it's alive enough to panic, it's alive enough to kexec its kdump kernel. After that it can do anything. Guest-internal dumps are more useful IMO that host-initiated dumps. In a cloud, the host-initiated dump is left on the host, outside the reach of the guest admin, outside the guest image where all the symbols are, and sometimes not even on the same host if a live migration occurred. It's more useful in small setups, or if the problem is in the hypervisor, not the guest. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On Wed, Feb 29, 2012 at 12:00:41PM +0200, Avi Kivity wrote: On 02/29/2012 11:55 AM, Gleb Natapov wrote: How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). Isn't it unreliable after the guest panicked? So is calling hypercalls, or dumping, or writing to the screen. Of course calling a hypercall is simpler and so is more reliable. Yes, crash can be so severe that it is not even detected by a kernel itself, so not OOPS message even printed. But in most cases if kernel is functional enough to print OOPS it is functional enough to call single hypercall instruction. Having special kdump kernel that transfers dump to a host via virtio-serial channel though sounds interesting. May be that's what you mean. Yes. The panic, starting dump signal should be initiated by the panicking kernel though, in case the dump fails. Then panic hypercall sounds like a reasonable solution. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On 02/29/2012 12:05 PM, Gleb Natapov wrote: On Wed, Feb 29, 2012 at 12:00:41PM +0200, Avi Kivity wrote: On 02/29/2012 11:55 AM, Gleb Natapov wrote: How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). Isn't it unreliable after the guest panicked? So is calling hypercalls, or dumping, or writing to the screen. Of course calling a hypercall is simpler and so is more reliable. Yes, crash can be so severe that it is not even detected by a kernel itself, so not OOPS message even printed. But in most cases if kernel is functional enough to print OOPS it is functional enough to call single hypercall instruction. Why not print the oops to virtio-serial? Or even just a regular serial port? That's what bare metal does. Having special kdump kernel that transfers dump to a host via virtio-serial channel though sounds interesting. May be that's what you mean. Yes. The panic, starting dump signal should be initiated by the panicking kernel though, in case the dump fails. Then panic hypercall sounds like a reasonable solution. It is, but I'm trying to see if we can get away with doing nothing. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
Avi Kivity a...@redhat.com wrote: Okay. But note you don't need the alignment check; simply allocate the array aligned, and a multiple of 16 bytes, in the first place. OKay, then we can do something like: for each (x = bitmap[i], y = bitmap[i+1]) if (!x !y) continue else if ... cmpxchg16b or 8b ... I will try later. In addition, we should care about the new API. It is not decided about what kind of range can be ordered. I think restricting the range to be long size aligned is natural. Do you have any plan? Not really. But the current changes are all internal and don't affect the user API. Then I will do my best to improve the performance now! __ffs is really fast compared to other APIs. You could do something like this for (i = 0; i bitmap_size_in_longs; ++i) { mask = bitmap[i]; if (!mask) continue; for (j = __ffs(mask); mask; mask = mask - 1, j = __ffs(mask)) handle i * BITS_PER_LONG + j; } This gives you the speed of __ffs() but without working on zero bits. Yes, I will do this in v2. We may see partial display updates if we do not hold the mmu_lock during xchg loop: it is possible that pages near the end of the framebuffer alone gets updated sometimes - I noticed this problem when I fixed the TLB flush issue. I don't understand why this happens. Because only mmu_lock protects the bitmap for VGA. xchg i = 1 xchg i = 2 ... xchg i = N We cannot get a complete snapshot without mmu_lock; if the guest faults on the Nth page during xchg'ing i = 1, 2, ... then the i = N alone will become newer. But others will be updated by the next call, so the problem is restricted: maybe not noticeable. Not a big problem but still maybe-noticeable change, so I think we should do it separately with some comments if needed. Well if it's noticable on the framebuffer it's also noticable with live migration. We could do it later, but we need to really understand it first. About live migration, we do not mind whether the bitmap is a complete snapshot. In addition, we cannot do anything because the emulator can access the bitmap without mmu_lock. What we are doing is calling GET_DIRTY_LOG slot by slot: so already the result is not a snapshot at all. In the end, at the last stage, we will stop the guest and get a complete snapshot. In addition, we do not want to scan the dirty bitmap twice. Using the bits value soon after it is read into a register seems to be the fastest. Probably. BTW, I also want to decide the design of the new API at this chance. Let's wait with that. We're adding APIs too quickly. Let's squeeze everything we can out of the current APIs. I agree with you of course. At the same time, we cannot say anything without actually implementing sample userspace programs. So I want to see how much improvement the proposed API can achieve. I thought this might be a good GSoC project but ... Takuya -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
At 02/29/2012 05:36 PM, Gleb Natapov Wrote: On Wed, Feb 29, 2012 at 09:08:52AM +0800, Wen Congyang wrote: At 02/28/2012 06:45 PM, Gleb Natapov Wrote: On Tue, Feb 28, 2012 at 11:19:47AM +0100, Jan Kiszka wrote: On 2012-02-28 10:42, Wen Congyang wrote: At 02/28/2012 05:34 PM, Jan Kiszka Wrote: On 2012-02-28 09:23, Wen Congyang wrote: At 02/27/2012 11:08 PM, Jan Kiszka Wrote: On 2012-02-27 04:01, Wen Congyang wrote: We can know the guest is paniced when the guest runs on xen. But we do not have such feature on kvm. This patch implemnts this feature, and the implementation is the same as xen: register panic notifier, and call hypercall when the guest is paniced. Signed-off-by: Wen Congyang we...@cn.fujitsu.com --- arch/x86/kernel/kvm.c| 12 arch/x86/kvm/svm.c |8 ++-- arch/x86/kvm/vmx.c |8 ++-- arch/x86/kvm/x86.c | 13 +++-- include/linux/kvm.h |1 + include/linux/kvm_para.h |1 + 6 files changed, 37 insertions(+), 6 deletions(-) diff --git a/arch/x86/kernel/kvm.c b/arch/x86/kernel/kvm.c index f0c6fd6..b928d1d 100644 --- a/arch/x86/kernel/kvm.c +++ b/arch/x86/kernel/kvm.c @@ -331,6 +331,17 @@ static struct notifier_block kvm_pv_reboot_nb = { .notifier_call = kvm_pv_reboot_notify, }; +static int +kvm_pv_panic_notify(struct notifier_block *nb, unsigned long code, void *unused) +{ + kvm_hypercall0(KVM_HC_GUEST_PANIC); + return NOTIFY_DONE; +} + +static struct notifier_block kvm_pv_panic_nb = { + .notifier_call = kvm_pv_panic_notify, +}; + You should split up host and guest-side changes. static u64 kvm_steal_clock(int cpu) { u64 steal; @@ -417,6 +428,7 @@ void __init kvm_guest_init(void) paravirt_ops_setup(); register_reboot_notifier(kvm_pv_reboot_nb); + atomic_notifier_chain_register(panic_notifier_list, kvm_pv_panic_nb); for (i = 0; i KVM_TASK_SLEEP_HASHSIZE; i++) spin_lock_init(async_pf_sleepers[i].lock); if (kvm_para_has_feature(KVM_FEATURE_ASYNC_PF)) diff --git a/arch/x86/kvm/svm.c b/arch/x86/kvm/svm.c index 0b7690e..38b4705 100644 --- a/arch/x86/kvm/svm.c +++ b/arch/x86/kvm/svm.c @@ -1900,10 +1900,14 @@ static int halt_interception(struct vcpu_svm *svm) static int vmmcall_interception(struct vcpu_svm *svm) { + int ret; + svm-next_rip = kvm_rip_read(svm-vcpu) + 3; skip_emulated_instruction(svm-vcpu); - kvm_emulate_hypercall(svm-vcpu); - return 1; + ret = kvm_emulate_hypercall(svm-vcpu); + + /* Ignore the error? */ + return ret == 0 ? 0 : 1; Why can't kvm_emulate_hypercall return the right value? kvm_emulate_hypercall() will call kvm_hv_hypercall(), and kvm_hv_hypercall() will return 0 when vcpu's CPL 0. If vcpu's CPL 0, does kvm need to exit and tell it to qemu? No, there is currently no exit to userspace due to hypercalls, neither of HV nor KVM kind. The point is that the return code of kvm_emulate_hypercall is unused so far, so you can easily redefine it to encode continue vs. exit to userspace. Once someone has different needs, this could still be refactored again. So, it is OK to change the return value of kvm_hv_hypercall() if vcpu's CPL 0? Yes, change it to encode what vendor modules need to return to their callers. Better introduce new request flag and set it in your hypercall emulation. See how triple fault is handled. triple fault sets KVM_EXIT_SHUTDOWN and exits to userspace. Do you mean introduce a new value(like KVM_EXIT_SHUTDOWN)? I mean introduce new request bit (like KVM_REQ_TRIPLE_FAULT) and set it in your hypercall if exit to userspace is needed instead of changing return values. I donot notice it. Thanks for you suggestion. I will modify my patch. Thanks Wen Congyang -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [Qemu-devel] blockdev operations [was: KVM call agenda for Tuesday 28th]
Am 28.02.2012 17:07, schrieb Eric Blake: On 02/28/2012 07:58 AM, Stefan Hajnoczi wrote: On Tue, Feb 28, 2012 at 2:47 PM, Paolo Bonzini pbonz...@redhat.com wrote: Il 28/02/2012 15:39, Stefan Hajnoczi ha scritto: I'm not a fan of transactions or freeze/thaw (if used to atomically perform other commands). We should not export low-level block device operations so that external software can micromanage via QMP. I don't think this is a good idea because it takes the block device offline and possibly blocks the VM. We're reaching a level comparable to an HTTP interface for acquiring pthread mutex, doing some operations, and then another HTTP request to unlock it. This is micromanagement it will create more problems because we will have to support lots of little API functions. So you're for extending Jeff's patches to group mirroring etc.? That's also my favorite one, assuming we can do it in time for 1.1. Yes, that's the approach I like the most. It's relatively clean and leaves us space to develop -blockdev. Here's the idea I was forming based on today's call: Jeff's idea of a group operation can be extended to allow multiple operations while reusing the framework. For oVirt, we need the ability to open a mirror (by passing the mirror file alongside the name of the new external snapshot), as well as reopening a blockdev (to pivot to the other side of an already-open mirror). Is there a way to express a designated union in QMP? I'm thinking something along the lines of having the overall group command take a list of operations, where each operation can either be 'create a snapshot', 'create a snapshot and mirror', or 'reopen a mirror'. I'm thinking it might look something like: { 'enum': 'BlockdevOp', 'data': [ 'snapshot', 'snapshot-mirror', 'reopen' ] } { 'type': 'BlockdevAction', 'data': {'device': 'str', 'op': 'BlockdevOp', 'file': 'str', '*format': 'str', '*reuse': 'bool', '*mirror': 'str', '*mirror-format': 'str' } } { 'command': 'blkdev-group-action-sync', 'data': { 'actionlist': [ 'BlockdevAction' ] } } I think the general approach is good. Your implementation in QAPI is kind of ugly because you mix the arguments of all three commands in BlockdevAction (how about extensibility? And the optional flags aren't the full truth either, we'd have to add checks in the handlers.). We really want to have some real union support in QAPI that creates three different C structs. So something like (I'll reintroduce the bad word transaction, because it's really what we're doing here, just everything in one command): { 'type': 'BlockdevSnapshot', 'data': {'device': 'str', 'snapshot-file': 'str', '*format': 'str' } } { 'union': 'BlockdevAction', 'types': { 'snapshot': 'BlockdevSnapshot', 'mirror': 'BlockdevMirror', ... } } { 'command': 'blockdev-transaction', 'data': { 'actionlist': [ 'BlockdevAction' ] } } On the wire in QMP this would look like: { 'execute': 'blockdev-transaction', 'arguments': { 'actionlist': [ { 'type': 'snapshot', 'data': { 'device': 'ide-hd0', ... } }, { 'type': 'mirror', 'data': { ... } ] } } Now if you look closely, BlockdevSnapshot is exactly what Jeff called SnapshotDev in his series, which in turn is the definition of the blockdev-snapshot-sync command. We can reuse all of this even in the API of a new more generic command. So my conclusion for now is that we can merge the group snapshots right now instead of waiting for the mirror design to be completed, and we can reuse everything in a more complex transaction command later. Kevin -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Reconciling qemu-kvm and qemu's PIT
On 2012-02-29 10:29, Avi Kivity wrote: On 02/28/2012 11:47 PM, Jan Kiszka wrote: Looks like isa-pit has zero gpio pins, so it fails when crashing. I must have mismerged it, but where is the gpio pin count set? In pit_initfn. Does -no-kvm-irqchip work fine? It's a bit tricky to discuss this without seeing your code. pushed it into qemu-kvm.git usptream-merge As I assumed: You try to initialize the kvm-pit like the userspace pit, that cannot work. Don't run the qdev_connect_gpio_out in pit_init, just like my kvm-pit patches does in their kvm_pit_init. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
At 02/29/2012 06:08 PM, Avi Kivity Wrote: On 02/29/2012 12:05 PM, Gleb Natapov wrote: On Wed, Feb 29, 2012 at 12:00:41PM +0200, Avi Kivity wrote: On 02/29/2012 11:55 AM, Gleb Natapov wrote: How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). Isn't it unreliable after the guest panicked? So is calling hypercalls, or dumping, or writing to the screen. Of course calling a hypercall is simpler and so is more reliable. Yes, crash can be so severe that it is not even detected by a kernel itself, so not OOPS message even printed. But in most cases if kernel is functional enough to print OOPS it is functional enough to call single hypercall instruction. Why not print the oops to virtio-serial? Or even just a regular serial port? That's what bare metal does. If virtio-serial's driver has bug or the guest doesn't have such device... Having special kdump kernel that transfers dump to a host via virtio-serial channel though sounds interesting. May be that's what you mean. Yes. The panic, starting dump signal should be initiated by the panicking kernel though, in case the dump fails. Then panic hypercall sounds like a reasonable solution. It is, but I'm trying to see if we can get away with doing nothing. If we have a reliable way with doing nothing, it is better. But I donot find such way. Thanks Wen Congyang -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On Wed, Feb 29, 2012 at 12:05:32PM +0200, Avi Kivity wrote: On 02/29/2012 11:58 AM, Daniel P. Berrange wrote: How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). When the guest OS has crashed, any dumps will be done from the host OS using libvirt's core dump mechanism. The guest OS isn't involved and is likely too dead to be of any use anyway. Likewise it is quite probably too dead to work a virtio-serial channel or any similarly complex device. We're really just after the simplest possible notification that the guest kernel has paniced. If it's alive enough to panic, it's alive enough to kexec its kdump kernel. After that it can do anything. Guest-internal dumps are more useful IMO that host-initiated dumps. In a cloud, the host-initiated dump is left on the host, outside the reach of the guest admin, outside the guest image where all the symbols are, and sometimes not even on the same host if a live migration occurred. It's more useful in small setups, or if the problem is in the hypervisor, not the guest. I don't think guest vs host dumps should be considered mutually exclusive, they both have pluses+minuses. Configuring kexec+kdump requires non-negligable guest admin configuration work before it's usable, and this work is guest OS specific, if it is possible at all. A permanent panic notifier that's built in the kernel by default requires zero guest admin config, and can allow host admin to automate collection of dumps across all their hosts/guests. The KVM hypercall notification is fairly trivially ported to any OS kernel, by comparison with a full virtio + virtio-serial impl. Regards, Daniel -- |: http://berrange.com -o-http://www.flickr.com/photos/dberrange/ :| |: http://libvirt.org -o- http://virt-manager.org :| |: http://autobuild.org -o- http://search.cpan.org/~danberr/ :| |: http://entangle-photo.org -o- http://live.gnome.org/gtk-vnc :| -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
At 02/29/2012 06:05 PM, Avi Kivity Wrote: On 02/29/2012 11:58 AM, Daniel P. Berrange wrote: How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). When the guest OS has crashed, any dumps will be done from the host OS using libvirt's core dump mechanism. The guest OS isn't involved and is likely too dead to be of any use anyway. Likewise it is quite probably too dead to work a virtio-serial channel or any similarly complex device. We're really just after the simplest possible notification that the guest kernel has paniced. If it's alive enough to panic, it's alive enough to kexec its kdump kernel. After that it can do anything. Guest-internal dumps are more useful IMO that host-initiated dumps. In Yes, guest-internal dump is better than host dump. But the user may not start guest-internal dump or guest-internal dump failed. So we need the following feature: 1. If the guest-internal dump does not work, the guest's status is 'crashed'. And then the user does the host dump. 2. If the guest-internal dump is working, the guest's status should be 'dumping'. The user see this status and know the guest has paniced, and the guest-internal dump is working. Thanks Wen Congyang a cloud, the host-initiated dump is left on the host, outside the reach of the guest admin, outside the guest image where all the symbols are, and sometimes not even on the same host if a live migration occurred. It's more useful in small setups, or if the problem is in the hypervisor, not the guest. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On 02/29/2012 12:17 PM, Wen Congyang wrote: Yes, crash can be so severe that it is not even detected by a kernel itself, so not OOPS message even printed. But in most cases if kernel is functional enough to print OOPS it is functional enough to call single hypercall instruction. Why not print the oops to virtio-serial? Or even just a regular serial port? That's what bare metal does. If virtio-serial's driver has bug or the guest doesn't have such device... We have the same issue with the hypercall; and virtio-serial is available on many deployed versions. Having special kdump kernel that transfers dump to a host via virtio-serial channel though sounds interesting. May be that's what you mean. Yes. The panic, starting dump signal should be initiated by the panicking kernel though, in case the dump fails. Then panic hypercall sounds like a reasonable solution. It is, but I'm trying to see if we can get away with doing nothing. If we have a reliable way with doing nothing, it is better. But I donot find such way. We won't have a 100% reliable way. But I think a variant of the driver that doesn't use interrupts, or just using the ordinary serial driver, should be reliable enough. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On 02/29/2012 12:19 PM, Daniel P. Berrange wrote: On Wed, Feb 29, 2012 at 12:05:32PM +0200, Avi Kivity wrote: On 02/29/2012 11:58 AM, Daniel P. Berrange wrote: How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). When the guest OS has crashed, any dumps will be done from the host OS using libvirt's core dump mechanism. The guest OS isn't involved and is likely too dead to be of any use anyway. Likewise it is quite probably too dead to work a virtio-serial channel or any similarly complex device. We're really just after the simplest possible notification that the guest kernel has paniced. If it's alive enough to panic, it's alive enough to kexec its kdump kernel. After that it can do anything. Guest-internal dumps are more useful IMO that host-initiated dumps. In a cloud, the host-initiated dump is left on the host, outside the reach of the guest admin, outside the guest image where all the symbols are, and sometimes not even on the same host if a live migration occurred. It's more useful in small setups, or if the problem is in the hypervisor, not the guest. I don't think guest vs host dumps should be considered mutually exclusive, they both have pluses+minuses. True. Configuring kexec+kdump requires non-negligable guest admin configuration work before it's usable, and this work is guest OS specific, if it is possible at all. I think it's on by default on Windows and requires installing a package on Linux, which may be part of the default configuration on many distros. A permanent panic notifier that's built in the kernel by default requires zero guest admin config, and can allow host admin to automate collection of dumps across all their hosts/guests. The KVM hypercall notification is fairly trivially ported to any OS kernel, by comparison with a full virtio + virtio-serial impl. That's the path of least resistance. But it's not necessarily the best path. We end up with a wide set of disconnected ABIs instead of a narrow set that is more flexible. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On Wed, Feb 29, 2012 at 12:08:19PM +0200, Avi Kivity wrote: On 02/29/2012 12:05 PM, Gleb Natapov wrote: On Wed, Feb 29, 2012 at 12:00:41PM +0200, Avi Kivity wrote: On 02/29/2012 11:55 AM, Gleb Natapov wrote: How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). Isn't it unreliable after the guest panicked? So is calling hypercalls, or dumping, or writing to the screen. Of course calling a hypercall is simpler and so is more reliable. Yes, crash can be so severe that it is not even detected by a kernel itself, so not OOPS message even printed. But in most cases if kernel is functional enough to print OOPS it is functional enough to call single hypercall instruction. Why not print the oops to virtio-serial? Or even just a regular serial port? That's what bare metal does. The more interface is complex the more chances it will fail during panic. Regular serial is likely more reliable than virtio-serial. Hypercall is likely more reliable than uart. On serial user can fake panic notification. Can this be problematic? Having special kdump kernel that transfers dump to a host via virtio-serial channel though sounds interesting. May be that's what you mean. Yes. The panic, starting dump signal should be initiated by the panicking kernel though, in case the dump fails. Then panic hypercall sounds like a reasonable solution. It is, but I'm trying to see if we can get away with doing nothing. Fair enough. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On 02/29/2012 12:31 PM, Wen Congyang wrote: At 02/29/2012 06:05 PM, Avi Kivity Wrote: On 02/29/2012 11:58 AM, Daniel P. Berrange wrote: How about using a virtio-serial channel for this? You can transfer any amount of information (including the dump itself). When the guest OS has crashed, any dumps will be done from the host OS using libvirt's core dump mechanism. The guest OS isn't involved and is likely too dead to be of any use anyway. Likewise it is quite probably too dead to work a virtio-serial channel or any similarly complex device. We're really just after the simplest possible notification that the guest kernel has paniced. If it's alive enough to panic, it's alive enough to kexec its kdump kernel. After that it can do anything. Guest-internal dumps are more useful IMO that host-initiated dumps. In Yes, guest-internal dump is better than host dump. But the user may not start guest-internal dump or guest-internal dump failed. So we need the following feature: 1. If the guest-internal dump does not work, the guest's status is 'crashed'. And then the user does the host dump. 2. If the guest-internal dump is working, the guest's status should be 'dumping'. The user see this status and know the guest has paniced, and the guest-internal dump is working. I agree. There is room for host dump, and we do want notification about what the guest is doing. The question is whether we should reuse virtio-serial for guest-host communication in this case. It's more complicated, but allows us to avoid touching the hypervisor. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On 02/29/2012 12:44 PM, Gleb Natapov wrote: Yes, crash can be so severe that it is not even detected by a kernel itself, so not OOPS message even printed. But in most cases if kernel is functional enough to print OOPS it is functional enough to call single hypercall instruction. Why not print the oops to virtio-serial? Or even just a regular serial port? That's what bare metal does. The more interface is complex the more chances it will fail during panic. Regular serial is likely more reliable than virtio-serial. Hypercall is likely more reliable than uart. On serial user can fake panic notification. The serial device is under control of the kernel, so the user can only access it if the kernel so allows. Can this be problematic? From the point of view of the guest, yes, it can generate support calls in fill up the admin's dump quota. From the point of view of the host, there's no difference between the guest's admin and a random user. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
On Wed, Feb 29, 2012 at 12:48:57PM +0200, Avi Kivity wrote: On 02/29/2012 12:44 PM, Gleb Natapov wrote: Yes, crash can be so severe that it is not even detected by a kernel itself, so not OOPS message even printed. But in most cases if kernel is functional enough to print OOPS it is functional enough to call single hypercall instruction. Why not print the oops to virtio-serial? Or even just a regular serial port? That's what bare metal does. The more interface is complex the more chances it will fail during panic. Regular serial is likely more reliable than virtio-serial. Hypercall is likely more reliable than uart. On serial user can fake panic notification. The serial device is under control of the kernel, so the user can only access it if the kernel so allows. Yes, but if we will hijack UART for panic notification VM user will not be able to use it for anything else. virtio-serial have many channels, but AFAIK it does not work at early stages of boot process. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [SeaBIOS] [PATCH RFC] seabios: add OSHP method stub
diff --git a/src/ssdt-pcihp.dsl b/src/ssdt-pcihp.dsl index 442e7a8..3f50169 100644 --- a/src/ssdt-pcihp.dsl +++ b/src/ssdt-pcihp.dsl @@ -24,6 +24,7 @@ DefinitionBlock (ssdt-pcihp.aml, SSDT, 0x01, BXPC, BXSSDTPCIHP, 0x1) ACPI_EXTRACT_METHOD_STRING aml_ej0_name \ Method (_EJ0, 1) { Return(PCEJ(0x##slot)) } \ Name (_SUN, 0x##slot)\ + Method (OSHP, 1) { Return(0x0) } \ } Linux kernel (kernel-3.2.7-1.fc16.x86_64) complains: ACPI Warning: For \_SB_.PCI0.S10_.OSHP: Insufficient arguments - needs 1, found 0 (20110623/nspredef-321) cheers, Gerd -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor
On Wed, Feb 29, 2012 at 09:56:02AM +, Ian Campbell wrote: On Wed, 2012-02-29 at 09:34 +, Dave Martin wrote: On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote: I don't have a very strong opinion on which register we should use, but I would like to avoid r7 if it is already actively used by gcc. But there is no framepointer for Thumb-2 code (?) Peter Maydell suggested there was: r7 is (used by gcc as) the Thumb frame pointer; I don't know if this makes it worth avoiding in this context. Sounds like it might be a gcc-ism, possibly a non-default option? I seem to remember discussions about some cruft in gcc related to this. gcc actually barfs at you if you try to allocate r7 to inline asm without -fomit-frame-pointer. That use for r7 really relates to the legacy ABI, so this may be a bug. Anyway, I think r12 will be fine for our purposes so the point is rather moot. Yes, it sounds like it. If that r7 issue is a gcc bug, this would avoid it. If you leave the job of putting the right constant into r12 up to gcc, it should generate reasonable for you without having to code it explicitly anyway: register int hvc_num asm(r12) = 0xDEADBEEF; asm volatile ( hvc0 :: r (hvc_num) ) Cheers ---Dave -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
On 02/29/2012 12:16 PM, Takuya Yoshikawa wrote: We may see partial display updates if we do not hold the mmu_lock during xchg loop: it is possible that pages near the end of the framebuffer alone gets updated sometimes - I noticed this problem when I fixed the TLB flush issue. I don't understand why this happens. Because only mmu_lock protects the bitmap for VGA. xchg i = 1 xchg i = 2 ... xchg i = N We cannot get a complete snapshot without mmu_lock; if the guest faults on the Nth page during xchg'ing i = 1, 2, ... then the i = N alone will become newer. Ah, so there is no data corruption (missed dirty bits), just the display is updated inconsistently? I don't think we can get a consistent snapshot anyway, since the guest can update the framebuffer while userspace is processing it. But others will be updated by the next call, so the problem is restricted: maybe not noticeable. Not a big problem but still maybe-noticeable change, so I think we should do it separately with some comments if needed. Well if it's noticable on the framebuffer it's also noticable with live migration. We could do it later, but we need to really understand it first. About live migration, we do not mind whether the bitmap is a complete snapshot. In addition, we cannot do anything because the emulator can access the bitmap without mmu_lock. What we are doing is calling GET_DIRTY_LOG slot by slot: so already the result is not a snapshot at all. In the end, at the last stage, we will stop the guest and get a complete snapshot. Understood. I don't think we can get a consistent vga snapshot without stopping the guest, and even then, it depends on how the guest updates the framebuffer. In addition, we do not want to scan the dirty bitmap twice. Using the bits value soon after it is read into a register seems to be the fastest. Probably. BTW, I also want to decide the design of the new API at this chance. Let's wait with that. We're adding APIs too quickly. Let's squeeze everything we can out of the current APIs. I agree with you of course. At the same time, we cannot say anything without actually implementing sample userspace programs. So I want to see how much improvement the proposed API can achieve. I thought this might be a good GSoC project but ... It may be too involved for GSoC, the issues are difficult. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor
On Wed, Feb 29, 2012 at 09:56:02AM +, Ian Campbell wrote: On Wed, 2012-02-29 at 09:34 +, Dave Martin wrote: On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote: I don't have a very strong opinion on which register we should use, but I would like to avoid r7 if it is already actively used by gcc. But there is no framepointer for Thumb-2 code (?) Peter Maydell suggested there was: r7 is (used by gcc as) the Thumb frame pointer; I don't know if this makes it worth avoiding in this context. Sounds like it might be a gcc-ism, possibly a non-default option? Anyway, I think r12 will be fine for our purposes so the point is rather moot. Just had a chat with some tools guys -- apparently, when passing register arguments to gcc inline asms there really isn't a guarantee that those variables will be in the expected registers on entry to the inline asm. If gcc reorders other function calls or other code around the inline asm (which it can do, except under certain controlled situations), then intervening code can clobber any registers in general. Or, to summarise another way, there is no way to control which register is used to pass something to an inline asm in general (often we get away with this, and there are a lot of inline asms in the kernel that assume it works, but the more you inline the more likely you are to get nasty surprises). There is no workaroud, except on some architectures where special asm constraints allow specific individual registers to be specified for operands (i386 for example). If you need a specific register, this means that you must set up that register explicitly inside the asm if you want a guarantee that the code will work: asm volatile ( movw r12, %[hvc_num]\n\t ... hvc#0 :: [hvc_num] i (NUMBER) : r12 ); Of course, if you need to set up more than about 5 or 6 registers in this way, the doubled register footprint means that the compiler will have to start spilling stuff to the stack. This is the kind of problem which goes away when out-of-lining the hvc wrapper behind a C function interface, since the ABI then provides guarantees about how values are mershaled into and out of that code. Notwithstanding the above, even if we do make theoretically unsound (but often true) assumptions about inline asms, ARM will be no worse than other arches in this respect. Other than serving as a reminder that inline asm is a deep can of worms, this doesn't really give us a neat solution... ---Dave -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: virtio-blk performance regression and qemu-kvm
Hi Stefan, you are right, the performance for the commits 0b9b128530b and 4fefc55ab04d are both good. What is the best approach to stay in the qemu-kvm.git history? -martin On 29.02.2012 09:38, Stefan Hajnoczi wrote: I suggest testing both of the qemu-kvm.git merge commits, 0b9b128530b and 4fefc55ab04d. My guess is you will find they perform the same, i.e. the qemu.git commits which were merged did not affect performance in qemu-kvm.git. That would be evidence that git-bisect has not located the real culprit. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] perf/x86: Fix HO/GO counting with SVM disabled
On Tue, Feb 28, 2012 at 07:38:34PM +0200, Avi Kivity wrote: On 02/28/2012 07:36 PM, David Ahern wrote: I was to suggest the reverse: since this patch addesses an AMD bug, why not push those functions into perf_event_amd.c and make them dependent on CONFIG_CPU_SUP_AMD as well. It depends on which direction you expect the code to grow. These hooks seem reasonable, so I think they should be generic. Okay, I am fine with both directions. But since this patch is a fix in the first place I make it more AMD specific for now. If we need this as a generic feature it can easily be changed back. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3] KVM: Resize kvm_io_range array dynamically
kvm_io_bus devices are used for ioevent, pit, pic, ioapic, coalesced_mmio. Currently Qemu only emulates one PCI bus, it contains 32 slots, one slot contains 8 functions, maximum of supported PCI devices: 1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100, each zone has an iobus devices. 300 io_bus devices is not enough. This patch makes the kvm_io_range array can be resized dynamically. Changes from v1: - fix typo: kvm_io_bus_range - kvm_io_range Changes from v2: - unregister device only when it exists Signed-off-by: Amos Kong ak...@redhat.com Reviewed-by: Jason Wang jasow...@redhat.com CC: Alex Williamson alex.william...@redhat.com --- include/linux/kvm_host.h |3 +-- virt/kvm/kvm_main.c | 41 + 2 files changed, 22 insertions(+), 22 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 355e445..0e6d9d2 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -69,8 +69,7 @@ struct kvm_io_range { struct kvm_io_bus { int dev_count; -#define NR_IOBUS_DEVS 300 - struct kvm_io_range range[NR_IOBUS_DEVS]; + struct kvm_io_range range[]; }; enum kvm_bus { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e4431ad..1275979 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2389,9 +2389,6 @@ int kvm_io_bus_sort_cmp(const void *p1, const void *p2) int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev, gpa_t addr, int len) { - if (bus-dev_count == NR_IOBUS_DEVS) - return -ENOSPC; - bus-range[bus-dev_count++] = (struct kvm_io_range) { .addr = addr, .len = len, @@ -2491,10 +2488,12 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, struct kvm_io_bus *new_bus, *bus; bus = kvm-buses[bus_idx]; - if (bus-dev_count NR_IOBUS_DEVS-1) - return -ENOSPC; - new_bus = kmemdup(bus, sizeof(struct kvm_io_bus), GFP_KERNEL); + new_bus = kzalloc(sizeof(*bus) + ((bus-dev_count + 1) * + sizeof(struct kvm_io_range)), GFP_KERNEL); + if (new_bus) + memcpy(new_bus, bus, sizeof(*bus) + (bus-dev_count * + sizeof(struct kvm_io_range))); if (!new_bus) return -ENOMEM; kvm_io_bus_insert_dev(new_bus, dev, addr, len); @@ -2513,26 +2512,28 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx, struct kvm_io_bus *new_bus, *bus; bus = kvm-buses[bus_idx]; - - new_bus = kmemdup(bus, sizeof(*bus), GFP_KERNEL); - if (!new_bus) - return -ENOMEM; - r = -ENOENT; - for (i = 0; i new_bus-dev_count; i++) - if (new_bus-range[i].dev == dev) { + for (i = 0; i bus-dev_count; i++) + if (bus-range[i].dev == dev) { r = 0; - new_bus-dev_count--; - new_bus-range[i] = new_bus-range[new_bus-dev_count]; - sort(new_bus-range, new_bus-dev_count, -sizeof(struct kvm_io_range), -kvm_io_bus_sort_cmp, NULL); break; } - if (r) { - kfree(new_bus); + if (r) return r; + + new_bus = kmemdup(bus, sizeof(*bus) + ((bus-dev_count - 1) * + sizeof(struct kvm_io_range)), GFP_KERNEL); + if (!new_bus) + return -ENOMEM; + + new_bus-dev_count--; + /* copy last entry of bus-range to deleted entry spot if + deleted entry isn't the last entry of bus-range */ + if (i != bus-dev_count - 1) { + new_bus-range[i] = bus-range[bus-dev_count - 1]; + sort(new_bus-range, new_bus-dev_count, +sizeof(struct kvm_io_range), kvm_io_bus_sort_cmp, NULL); } rcu_assign_pointer(kvm-buses[bus_idx], new_bus); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Reconciling qemu-kvm and qemu's PIT
On 02/29/2012 12:14 PM, Jan Kiszka wrote: On 2012-02-29 10:29, Avi Kivity wrote: On 02/28/2012 11:47 PM, Jan Kiszka wrote: Looks like isa-pit has zero gpio pins, so it fails when crashing. I must have mismerged it, but where is the gpio pin count set? In pit_initfn. Does -no-kvm-irqchip work fine? It's a bit tricky to discuss this without seeing your code. pushed it into qemu-kvm.git usptream-merge As I assumed: You try to initialize the kvm-pit like the userspace pit, that cannot work. Don't run the qdev_connect_gpio_out in pit_init, just like my kvm-pit patches does in their kvm_pit_init. The merge is now in 'next', hopefully not too many regressions. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: virtio-blk performance regression and qemu-kvm
On Wed, Feb 29, 2012 at 1:12 PM, Martin Mailand mar...@tuxadero.com wrote: Hi Stefan, you are right, the performance for the commits 0b9b128530b and 4fefc55ab04d are both good. What is the best approach to stay in the qemu-kvm.git history? I didn't know the answer so I asked on #git on freenode: charon stefanha: so just tell it that the upstream tip is good charon git-bisect assumes you are looking for a single commit C such that any commit that contains C (in its history) is bad, and any other commit is good. if you declare up front that upstream does not contain C, then it won't go looking there of course if that declaration was wrong, it will give wrong results. I think there are two approaches: 1. Side-step this issue by bisecting qemu.git instead of qemu-kvm.git. 2. First test qemu-kvm.git origin/upstream-merge and if there is no performance issue, do: git bisect good origin/upstream-merge. If we're luckily this will avoid going down all the qemu.git merge trees and instead stay focussed on qemu-kvm.git commits. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: virtio-blk performance regression and qemu-kvm
On Wed, Feb 29, 2012 at 1:44 PM, Stefan Hajnoczi stefa...@gmail.com wrote: On Wed, Feb 29, 2012 at 1:12 PM, Martin Mailand mar...@tuxadero.com wrote: Hi Stefan, you are right, the performance for the commits 0b9b128530b and 4fefc55ab04d are both good. What is the best approach to stay in the qemu-kvm.git history? I didn't know the answer so I asked on #git on freenode: And additional information from Avi: http://permalink.gmane.org/gmane.comp.emulators.kvm.devel/82851 Hope this helps. Stefan -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware
On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote: OK back to this. The last piece is where to put these messages... we could take PF_ROUTE:RTM_*NEIGH PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded switch. PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded switch. PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc may play a role in the user space app populating the FDB, i dont think they are necessary players. Learning could be via a table entry miss and packet redirect to user space. So my suggestion is to use FDB_*ENTRY for names The neighbor code is using the PF_UNSPEC protocol type so we won't collide with these unless someone was using PF_ROUTE and relying on falling back to PF_UNSPEC however I couldn't find any programs that did this iproute2 certainly doesn't. And the bridge pieces are using PF_BRIDGE so no collision there. They have to be different calls from the calls that talk to the s/ware bridge. In my opinion, as controversial as this may sound, you need to be flexible enough that some vendor can replace these calls with proprietary calls which are more efficient for their hardware. So a plugin to replace these calls in the user space code would be a good idea. Alternatively, you could make that something they do at the driver level i.e from user space to kernel it is hardware, please addthistotheFDBtable() call and the implementation of that could be proprietary to the specific hardware. [..] Also if there are embedded switches with learning capabilities they might want to trigger events to user space. In this case having a protocol type makes user space a bit easier to manage. I've added Lennert so maybe he can comment I think the Marvell chipsets might support something along these lines. The SR-IOV chipsets I'm aware of _today_ don't do learning. Learning makes the event model more plausible. The other events to consider is aging of hardware entries. The other mechanism would be to embed some more attributes into the PF_UNSPEC:RTM_XXXLINK msg however I'm thinking that if we want to support learning and triggering events then we likely also don't want to send these events to every app with RTNLGRP_LINK set. I think this needs to be a different event message. FDB_TABLEMISS? FDB_EXCEPTION? Plus there is already a proliferation of LINK attributes and dumping the FDB out of this seems a bit much but could be done with some bitmasks. Although the current ext_filter_mask u32 doesn't seem to be sufficient for events to trigger this. Dumping the FDB table should be something along the lines of FDB_GET with the dump flag. It shouldnt tie to the LINK side of things. cheers, jamal -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware
On Tue, 2012-02-28 at 21:14 -0800, John Fastabend wrote: Just checked looks like the DSA infrastructure has commands to enable STP so guess it is doing learning. IIRC, Lennert built some of this stuff tied to the kernel. cheers, jamal -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled
It turned out that a performance counter on AMD does not count at all when the GO or HO bit is set in the control register and SVM is disabled in EFER. This patch works around this issue by masking out the HO bit in the performance counter control register when SVM is not enabled. The GO bit is not touched because it is only set when the user wants to count in guest-mode only. So when SVM is disabled the counter should not run at all and the not-counting is the intended behaviour. Cc: Peter Zijlstra a.p.zijls...@chello.nl Cc: Ingo Molnar mi...@elte.hu Cc: Avi Kivity a...@redhat.com Cc: Stephane Eranian eran...@google.com Cc: David Ahern dsah...@gmail.com Cc: Gleb Natapov g...@redhat.com Cc: Robert Richter robert.rich...@amd.com Cc: sta...@vger.kernel.org # 3.2 Signed-off-by: Joerg Roedel joerg.roe...@amd.com --- arch/x86/include/asm/perf_event.h|8 +++ arch/x86/kernel/cpu/perf_event.h |6 - arch/x86/kernel/cpu/perf_event_amd.c | 37 - arch/x86/kvm/svm.c |5 4 files changed, 53 insertions(+), 3 deletions(-) diff --git a/arch/x86/include/asm/perf_event.h b/arch/x86/include/asm/perf_event.h index 096c975e..ffad310 100644 --- a/arch/x86/include/asm/perf_event.h +++ b/arch/x86/include/asm/perf_event.h @@ -242,4 +242,12 @@ static inline void perf_get_x86_pmu_capability(struct x86_pmu_capability *cap) static inline void perf_events_lapic_init(void){ } #endif +#if defined(CONFIG_PERF_EVENTS) defined(CONFIG_CPU_SUP_AMD) +extern void amd_pmu_enable_virt(void); +extern void amd_pmu_disable_virt(void); +#else /* CONFIG_PERF_EVENTS CONFIG_CPU_SUP_AMD */ +static inline void amd_pmu_enable_virt(void) { } +static inline void amd_pmu_disable_virt(void) { } +#endif /* CONFIG_PERF_EVENTS CONFIG_CPU_SUP_AMD */ + #endif /* _ASM_X86_PERF_EVENT_H */ diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h index 8944062..2c581b9 100644 --- a/arch/x86/kernel/cpu/perf_event.h +++ b/arch/x86/kernel/cpu/perf_event.h @@ -148,6 +148,8 @@ struct cpu_hw_events { * AMD specific bits */ struct amd_nb *amd_nb; + /* Inverted mask of bits to clear in the perf_ctr ctrl registers */ + u64 perf_ctr_virt_mask; void*kfree_on_online; }; @@ -417,9 +419,11 @@ void x86_pmu_disable_all(void); static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc, u64 enable_mask) { + u64 disable_mask = __this_cpu_read(cpu_hw_events.perf_ctr_virt_mask); + if (hwc-extra_reg.reg) wrmsrl(hwc-extra_reg.reg, hwc-extra_reg.config); - wrmsrl(hwc-config_base, hwc-config | enable_mask); + wrmsrl(hwc-config_base, (hwc-config | enable_mask) ~disable_mask); } void x86_pmu_enable_all(int added); diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c index 0397b23..67250a5 100644 --- a/arch/x86/kernel/cpu/perf_event_amd.c +++ b/arch/x86/kernel/cpu/perf_event_amd.c @@ -1,4 +1,5 @@ #include linux/perf_event.h +#include linux/export.h #include linux/types.h #include linux/init.h #include linux/slab.h @@ -357,7 +358,9 @@ static void amd_pmu_cpu_starting(int cpu) struct amd_nb *nb; int i, nb_id; - if (boot_cpu_data.x86_max_cores 2) + cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY; + + if (boot_cpu_data.x86_max_cores 2 || boot_cpu_data.x86 == 0x15) return; nb_id = amd_get_nb_id(cpu); @@ -587,9 +590,9 @@ static __initconst const struct x86_pmu amd_pmu_f15h = { .put_event_constraints = amd_put_event_constraints, .cpu_prepare= amd_pmu_cpu_prepare, - .cpu_starting = amd_pmu_cpu_starting, .cpu_dead = amd_pmu_cpu_dead, #endif + .cpu_starting = amd_pmu_cpu_starting, }; __init int amd_pmu_init(void) @@ -621,3 +624,33 @@ __init int amd_pmu_init(void) return 0; } + +void amd_pmu_enable_virt(void) +{ + struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events); + + cpuc-perf_ctr_virt_mask = 0; + + /* Reload all events */ + x86_pmu_disable_all(); + x86_pmu_enable_all(0); +} +EXPORT_SYMBOL_GPL(amd_pmu_enable_virt); + +void amd_pmu_disable_virt(void) +{ + struct cpu_hw_events *cpuc = __get_cpu_var(cpu_hw_events); + + /* +* We only mask out the Host-only bit so that host-only counting works +* when SVM is disabled. If someone sets up a guest-only counter when +* SVM is disabled the Guest-only bits still gets set and the counter +* will not count anything. +*/ + cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY; + + /* Reload all events */ + x86_pmu_disable_all(); + x86_pmu_enable_all(0); +} +EXPORT_SYMBOL_GPL(amd_pmu_disable_virt); diff --git
Re: [PATCH 3/4] KVM: Switch to srcu-less get_dirty_log()
Avi Kivity a...@redhat.com wrote: We cannot get a complete snapshot without mmu_lock; if the guest faults on the Nth page during xchg'ing i = 1, 2, ... then the i = N alone will become newer. Ah, so there is no data corruption (missed dirty bits), just the display is updated inconsistently? Yes, no data corruption and just a matter of ... feeling. The basic rule I wrote in the comment in this patch should be enough to not lose dirty bits. I don't think we can get a consistent snapshot anyway, since the guest can update the framebuffer while userspace is processing it. Yes, nothing will be broken. I was just not sure what the API should promise to the userspace. To some extent the inconsistency may be felt more than before. Understood. I don't think we can get a consistent vga snapshot without stopping the guest, and even then, it depends on how the guest updates the framebuffer. OKay, I will not care about this from now. So I want to see how much improvement the proposed API can achieve. I thought this might be a good GSoC project but ... It may be too involved for GSoC, the issues are difficult. I am now checking QEMU code closely - rather readable than before! Though I can make experimental KVM patches for the student, just changing QEMU's live migration code seems to be difficult - but not sure now. Anyway, we should try this before deciding the new API: I mean if I cannot find anyone who want to try this, no need to be a student, I may do instead at some point. Takuya -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] KVM: Resize kvm_io_range array dynamically
On 2012-02-29 14:30, Amos Kong wrote: kvm_io_bus devices are used for ioevent, pit, pic, ioapic, coalesced_mmio. Currently Qemu only emulates one PCI bus, it contains 32 slots, one slot contains 8 functions, maximum of supported PCI devices: 1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100, each zone has an iobus devices. 300 io_bus devices is not enough. This patch makes the kvm_io_range array can be resized dynamically. Is there any limit, or can userspace allocate arbitrary amounts of kernel memory this way? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Reconciling qemu-kvm and qemu's PIT
On 2012-02-29 14:45, Avi Kivity wrote: On 02/29/2012 12:14 PM, Jan Kiszka wrote: On 2012-02-29 10:29, Avi Kivity wrote: On 02/28/2012 11:47 PM, Jan Kiszka wrote: Looks like isa-pit has zero gpio pins, so it fails when crashing. I must have mismerged it, but where is the gpio pin count set? In pit_initfn. Does -no-kvm-irqchip work fine? It's a bit tricky to discuss this without seeing your code. pushed it into qemu-kvm.git usptream-merge As I assumed: You try to initialize the kvm-pit like the userspace pit, that cannot work. Don't run the qdev_connect_gpio_out in pit_init, just like my kvm-pit patches does in their kvm_pit_init. The merge is now in 'next', hopefully not too many regressions. Hope we can quickly finish the switch to the new kvm-pit. When do you plan to push or rebase uq/master? Then I could check/refresh my kvm-pit series on top of it. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: Reconciling qemu-kvm and qemu's PIT
On 02/29/2012 04:24 PM, Jan Kiszka wrote: On 2012-02-29 14:45, Avi Kivity wrote: On 02/29/2012 12:14 PM, Jan Kiszka wrote: On 2012-02-29 10:29, Avi Kivity wrote: On 02/28/2012 11:47 PM, Jan Kiszka wrote: Looks like isa-pit has zero gpio pins, so it fails when crashing. I must have mismerged it, but where is the gpio pin count set? In pit_initfn. Does -no-kvm-irqchip work fine? It's a bit tricky to discuss this without seeing your code. pushed it into qemu-kvm.git usptream-merge As I assumed: You try to initialize the kvm-pit like the userspace pit, that cannot work. Don't run the qdev_connect_gpio_out in pit_init, just like my kvm-pit patches does in their kvm_pit_init. The merge is now in 'next', hopefully not too many regressions. Hope we can quickly finish the switch to the new kvm-pit. When do you plan to push or rebase uq/master? Then I could check/refresh my kvm-pit series on top of it. Real soon... I hoped to do it after the upstream merge, but it doesn't pass autotest due to a screendump regression. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor
On Wed, 2012-02-29 at 12:58 +, Dave Martin wrote: On Wed, Feb 29, 2012 at 09:56:02AM +, Ian Campbell wrote: On Wed, 2012-02-29 at 09:34 +, Dave Martin wrote: On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote: I don't have a very strong opinion on which register we should use, but I would like to avoid r7 if it is already actively used by gcc. But there is no framepointer for Thumb-2 code (?) Peter Maydell suggested there was: r7 is (used by gcc as) the Thumb frame pointer; I don't know if this makes it worth avoiding in this context. Sounds like it might be a gcc-ism, possibly a non-default option? Anyway, I think r12 will be fine for our purposes so the point is rather moot. Just had a chat with some tools guys -- apparently, when passing register arguments to gcc inline asms there really isn't a guarantee that those variables will be in the expected registers on entry to the inline asm. If gcc reorders other function calls or other code around the inline asm (which it can do, except under certain controlled situations), then intervening code can clobber any registers in general. Or, to summarise another way, there is no way to control which register is used to pass something to an inline asm in general (often we get away with this, and there are a lot of inline asms in the kernel that assume it works, but the more you inline the more likely you are to get nasty surprises). There is no workaroud, except on some architectures where special asm constraints allow specific individual registers to be specified for operands (i386 for example). I had assumed I just couldn't find the right syntax. Useful to know that I couldn't find it because it doesn't exist! If you need a specific register, this means that you must set up that register explicitly inside the asm if you want a guarantee that the code will work: asm volatile ( movw r12, %[hvc_num]\n\t Is gcc (or gas?) smart enough to optimise this away if it turns out that %[hvc_num] == r12? ... hvc#0 :: [hvc_num] i (NUMBER) : r12 ); Of course, if you need to set up more than about 5 or 6 registers in this way, the doubled register footprint means that the compiler will have to start spilling stuff to the stack. This is the kind of problem which goes away when out-of-lining the hvc wrapper behind a C function interface, since the ABI then provides guarantees about how values are mershaled into and out of that code. I don't think anything would stop gcc from clobbering an argument register right on function entry (e..g it might move r0 to r8 and clobber r0, for whatever reason), so that they are no longer where you expect them to be when you hit the asm. Unlikely perhaps but no more so than the other issues you've raised? Or did you mean out-of-line as in written in a .S file as well as out of line? Notwithstanding the above, even if we do make theoretically unsound (but often true) assumptions about inline asms, ARM will be no worse than other arches in this respect. This is true. Other than serving as a reminder that inline asm is a deep can of worms, this doesn't really give us a neat solution... How are system calls implemented on the userspace side? I confess I don't know what the ARM syscall ABI looks like -- is it all registers or is some of it on the stack? It sounds like the solution ought to be pretty similar though. Ian. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH-WIP 01/13] xen/arm: use r12 to pass the hypercall number to the hypervisor
On Wed, 29 Feb 2012, Dave Martin wrote: On Wed, Feb 29, 2012 at 09:56:02AM +, Ian Campbell wrote: On Wed, 2012-02-29 at 09:34 +, Dave Martin wrote: On Tue, Feb 28, 2012 at 12:28:29PM +, Stefano Stabellini wrote: I don't have a very strong opinion on which register we should use, but I would like to avoid r7 if it is already actively used by gcc. But there is no framepointer for Thumb-2 code (?) Peter Maydell suggested there was: r7 is (used by gcc as) the Thumb frame pointer; I don't know if this makes it worth avoiding in this context. Sounds like it might be a gcc-ism, possibly a non-default option? Anyway, I think r12 will be fine for our purposes so the point is rather moot. Just had a chat with some tools guys -- apparently, when passing register arguments to gcc inline asms there really isn't a guarantee that those variables will be in the expected registers on entry to the inline asm. If gcc reorders other function calls or other code around the inline asm (which it can do, except under certain controlled situations), then intervening code can clobber any registers in general. Or, to summarise another way, there is no way to control which register is used to pass something to an inline asm in general (often we get away with this, and there are a lot of inline asms in the kernel that assume it works, but the more you inline the more likely you are to get nasty surprises). There is no workaroud, except on some architectures where special asm constraints allow specific individual registers to be specified for operands (i386 for example). If you need a specific register, this means that you must set up that register explicitly inside the asm if you want a guarantee that the code will work: asm volatile ( movw r12, %[hvc_num]\n\t ... hvc#0 :: [hvc_num] i (NUMBER) : r12 ); OK, we can arrange the hypercall code to be like that. Also with your patch series it would be _hvc because of the .macro, right? This is the kind of problem which goes away when out-of-lining the hvc wrapper behind a C function interface, since the ABI then provides guarantees about how values are mershaled into and out of that code. Do you mean implementing the entire HYPERVISOR_example_op in assembly and calling it from C? Because I guess that gcc would still be free to mess with the registers between the C function entry point and any inline assembly code. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4] KVM: Allow host IRQ sharing for assigned PCI 2.3 devices
On Tue, 2012-02-28 at 14:19 +0100, Jan Kiszka wrote: PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share legacy IRQs of such devices with other host devices when passing them to a guest. The new IRQ sharing feature introduced here is optional, user space has to request it explicitly. Is this really true? Looks like it's automatic. Moreover, user space can inform us about its view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt and signaling it if the guest masked it via the virtualized PCI config space. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Changes in v4: - Integrated doc changes as proposed by Alex - Fixed deassign_host_irq /wrt MSI - Fixed kvm_vm_ioctl_set_pci_irq_mask /wrt INTx unmasking of non-2.3 devices Documentation/virtual/kvm/api.txt | 41 +++ arch/x86/kvm/x86.c|1 + include/linux/kvm.h |6 + include/linux/kvm_host.h |2 + virt/kvm/assigned-dev.c | 209 +++- 5 files changed, 230 insertions(+), 29 deletions(-) [snip] +static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm, + struct kvm_assigned_pci_dev *assigned_dev) +{ + int r = 0; + struct kvm_assigned_dev_kernel *match; + + mutex_lock(kvm-lock); + + match = kvm_find_assigned_dev(kvm-arch.assigned_dev_head, + assigned_dev-assigned_dev_id); + if (!match) { + r = -ENODEV; + goto out; + } + + mutex_lock(match-intx_mask_lock); + + match-flags = ~KVM_DEV_ASSIGN_MASK_INTX; + match-flags |= assigned_dev-flags KVM_DEV_ASSIGN_MASK_INTX; + + if (match-irq_requested_type KVM_DEV_IRQ_GUEST_INTX) { + if (assigned_dev-flags KVM_DEV_ASSIGN_MASK_INTX) { + kvm_set_irq(match-kvm, match-irq_source_id, + match-guest_irq, 0); + /* + * Masking at hardware-level is performed on demand, + * i.e. when an IRQ actually arrives at the host. + */ + } else if (!(assigned_dev-flags KVM_DEV_ASSIGN_PCI_2_3)) { + /* + * Unmask the IRQ line if required. Unmasking at + * device level will be performed by user space. + */ + spin_lock_irq(match-intx_lock); + if (match-host_irq_disabled) { + enable_irq(match-host_irq); + match-host_irq_disabled = false; + } + spin_unlock_irq(match-intx_lock); This still looks broken. If we start with a PCI 2.3 device with INTx disabled, how does this ever kick start to get another interrupt? Shouldn't we just call kvm_assigned_dev_ack_irq() here and handle both INTx modes? Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] KVM: Resize kvm_io_range array dynamically
- Original Message - On 2012-02-29 14:30, Amos Kong wrote: kvm_io_bus devices are used for ioevent, pit, pic, ioapic, coalesced_mmio. Currently Qemu only emulates one PCI bus, it contains 32 slots, one slot contains 8 functions, maximum of supported PCI devices: 1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100, each zone has an iobus devices. 300 io_bus devices is not enough. This patch makes the kvm_io_range array can be resized dynamically. Is there any limit, or can userspace allocate arbitrary amounts of kernel memory this way? Hi Jan, There is a fixed array in linux-2.6/include/linux/kvm_host.h, we can only register 300 devices. struct kvm_io_range { gpa_t addr; int len; struct kvm_io_device *dev; }; struct kvm_io_bus { int dev_count; #define NR_IOBUS_DEVS 300 struct kvm_io_range range[NR_IOBUS_DEVS]; }; ^^ Amos. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] KVM: Resize kvm_io_range array dynamically
On 2012-02-29 16:22, Amos Kong wrote: - Original Message - On 2012-02-29 14:30, Amos Kong wrote: kvm_io_bus devices are used for ioevent, pit, pic, ioapic, coalesced_mmio. Currently Qemu only emulates one PCI bus, it contains 32 slots, one slot contains 8 functions, maximum of supported PCI devices: 1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100, each zone has an iobus devices. 300 io_bus devices is not enough. This patch makes the kvm_io_range array can be resized dynamically. Is there any limit, or can userspace allocate arbitrary amounts of kernel memory this way? Hi Jan, There is a fixed array in linux-2.6/include/linux/kvm_host.h, we can only register 300 devices. struct kvm_io_range { gpa_t addr; int len; struct kvm_io_device *dev; }; struct kvm_io_bus { int dev_count; #define NR_IOBUS_DEVS 300 struct kvm_io_range range[NR_IOBUS_DEVS]; }; ^^ Right. But doesn't your patch remove precisely this limit? So what limits userspace now? To register 300 million devices...? Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4] KVM: Allow host IRQ sharing for assigned PCI 2.3 devices
On 2012-02-29 16:22, Alex Williamson wrote: On Tue, 2012-02-28 at 14:19 +0100, Jan Kiszka wrote: PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share legacy IRQs of such devices with other host devices when passing them to a guest. The new IRQ sharing feature introduced here is optional, user space has to request it explicitly. Is this really true? Looks like it's automatic. It is true: no IRQ sharing without userspace setting KVM_DEV_ASSIGN_PCI_2_3 during KVM_ASSIGN_PCI_DEVICE. My qemu-kvm patches will allow to control this, but make it default on (reasons for this will be provided in that context). Moreover, user space can inform us about its view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt and signaling it if the guest masked it via the virtualized PCI config space. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Changes in v4: - Integrated doc changes as proposed by Alex - Fixed deassign_host_irq /wrt MSI - Fixed kvm_vm_ioctl_set_pci_irq_mask /wrt INTx unmasking of non-2.3 devices Documentation/virtual/kvm/api.txt | 41 +++ arch/x86/kvm/x86.c|1 + include/linux/kvm.h |6 + include/linux/kvm_host.h |2 + virt/kvm/assigned-dev.c | 209 +++- 5 files changed, 230 insertions(+), 29 deletions(-) [snip] +static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm, +struct kvm_assigned_pci_dev *assigned_dev) +{ +int r = 0; +struct kvm_assigned_dev_kernel *match; + +mutex_lock(kvm-lock); + +match = kvm_find_assigned_dev(kvm-arch.assigned_dev_head, + assigned_dev-assigned_dev_id); +if (!match) { +r = -ENODEV; +goto out; +} + +mutex_lock(match-intx_mask_lock); + +match-flags = ~KVM_DEV_ASSIGN_MASK_INTX; +match-flags |= assigned_dev-flags KVM_DEV_ASSIGN_MASK_INTX; + +if (match-irq_requested_type KVM_DEV_IRQ_GUEST_INTX) { +if (assigned_dev-flags KVM_DEV_ASSIGN_MASK_INTX) { +kvm_set_irq(match-kvm, match-irq_source_id, +match-guest_irq, 0); +/* + * Masking at hardware-level is performed on demand, + * i.e. when an IRQ actually arrives at the host. + */ +} else if (!(assigned_dev-flags KVM_DEV_ASSIGN_PCI_2_3)) { +/* + * Unmask the IRQ line if required. Unmasking at + * device level will be performed by user space. + */ +spin_lock_irq(match-intx_lock); +if (match-host_irq_disabled) { +enable_irq(match-host_irq); +match-host_irq_disabled = false; +} +spin_unlock_irq(match-intx_lock); This still looks broken. If we start with a PCI 2.3 device with INTx disabled, how does this ever kick start to get another interrupt? Shouldn't we just call kvm_assigned_dev_ack_irq() here and handle both INTx modes? Thanks, No, that would be wrong. The IRQ must be delivered to the guest first. And that will happen with 2.3 once userspace unmasks the device INTx and, thus, triggers another host-side IRQ. Same for non-2.3 devices, just that this happens on enable_irq. Jan -- Siemens AG, Corporate Technology, CT T DE IT 1 Corporate Competence Center Embedded Linux -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH] kvm: fill in padding to help valgrind
valgrind warns about padding fields which are passed to vcpu ioctls uninitialized. This is not an error in practice because kvm ignored padding. Since the ioctls in question are off data path and the cost is zero anyway, initialize padding to 0 to suppress these errors. Signed-off-by: Michael S. Tsirkin m...@redhat.com --- kvm-all.c |2 ++ target-i386/kvm.c |6 ++ 2 files changed, 8 insertions(+), 0 deletions(-) diff --git a/kvm-all.c b/kvm-all.c index c58c77b..3bc0eb3 100644 --- a/kvm-all.c +++ b/kvm-all.c @@ -447,6 +447,7 @@ int kvm_coalesce_mmio_region(target_phys_addr_t start, ram_addr_t size) zone.addr = start; zone.size = size; +zone.pad = 0; ret = kvm_vm_ioctl(s, KVM_REGISTER_COALESCED_MMIO, zone); } @@ -464,6 +465,7 @@ int kvm_uncoalesce_mmio_region(target_phys_addr_t start, ram_addr_t size) zone.addr = start; zone.size = size; +zone.pad = 0; ret = kvm_vm_ioctl(s, KVM_UNREGISTER_COALESCED_MMIO, zone); } diff --git a/target-i386/kvm.c b/target-i386/kvm.c index 981192d..285cf55 100644 --- a/target-i386/kvm.c +++ b/target-i386/kvm.c @@ -555,6 +555,7 @@ int kvm_arch_init_vcpu(CPUState *env) qemu_add_vm_change_state_handler(cpu_update_state, env); +cpuid_data.cpuid.padding = 0; r = kvm_vcpu_ioctl(env, KVM_SET_CPUID2, cpuid_data); if (r) { return r; @@ -740,6 +741,7 @@ static void set_seg(struct kvm_segment *lhs, const SegmentCache *rhs) lhs-g = (flags DESC_G_MASK) != 0; lhs-avl = (flags DESC_AVL_MASK) != 0; lhs-unusable = 0; +lhs-padding = 0; } static void get_seg(SegmentCache *lhs, const struct kvm_segment *rhs) @@ -919,8 +921,10 @@ static int kvm_put_sregs(CPUState *env) sregs.idt.limit = env-idt.limit; sregs.idt.base = env-idt.base; +memset(sregs.idt.padding, 0, sizeof sregs.idt.padding); sregs.gdt.limit = env-gdt.limit; sregs.gdt.base = env-gdt.base; +memset(sregs.gdt.padding, 0, sizeof sregs.gdt.padding); sregs.cr0 = env-cr[0]; sregs.cr2 = env-cr[2]; @@ -1392,6 +1396,7 @@ static int kvm_put_vcpu_events(CPUState *env, int level) events.exception.nr = env-exception_injected; events.exception.has_error_code = env-has_error_code; events.exception.error_code = env-error_code; +events.exception.pad = 0; events.interrupt.injected = (env-interrupt_injected = 0); events.interrupt.nr = env-interrupt_injected; @@ -1400,6 +1405,7 @@ static int kvm_put_vcpu_events(CPUState *env, int level) events.nmi.injected = env-nmi_injected; events.nmi.pending = env-nmi_pending; events.nmi.masked = !!(env-hflags2 HF2_NMI_MASK); +events.nmi.pad = 0; events.sipi_vector = env-sipi_vector; -- 1.7.9.111.gf3fb0 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4] KVM: Allow host IRQ sharing for assigned PCI 2.3 devices
On Wed, 2012-02-29 at 16:38 +0100, Jan Kiszka wrote: On 2012-02-29 16:22, Alex Williamson wrote: On Tue, 2012-02-28 at 14:19 +0100, Jan Kiszka wrote: PCI 2.3 allows to generically disable IRQ sources at device level. This enables us to share legacy IRQs of such devices with other host devices when passing them to a guest. The new IRQ sharing feature introduced here is optional, user space has to request it explicitly. Is this really true? Looks like it's automatic. It is true: no IRQ sharing without userspace setting KVM_DEV_ASSIGN_PCI_2_3 during KVM_ASSIGN_PCI_DEVICE. My qemu-kvm patches will allow to control this, but make it default on (reasons for this will be provided in that context). Ah right, we only clear it if not available. Thanks. Moreover, user space can inform us about its view of PCI_COMMAND_INTX_DISABLE so that we can avoid unmasking the interrupt and signaling it if the guest masked it via the virtualized PCI config space. Signed-off-by: Jan Kiszka jan.kis...@siemens.com --- Changes in v4: - Integrated doc changes as proposed by Alex - Fixed deassign_host_irq /wrt MSI - Fixed kvm_vm_ioctl_set_pci_irq_mask /wrt INTx unmasking of non-2.3 devices Documentation/virtual/kvm/api.txt | 41 +++ arch/x86/kvm/x86.c|1 + include/linux/kvm.h |6 + include/linux/kvm_host.h |2 + virt/kvm/assigned-dev.c | 209 +++- 5 files changed, 230 insertions(+), 29 deletions(-) [snip] +static int kvm_vm_ioctl_set_pci_irq_mask(struct kvm *kvm, + struct kvm_assigned_pci_dev *assigned_dev) +{ + int r = 0; + struct kvm_assigned_dev_kernel *match; + + mutex_lock(kvm-lock); + + match = kvm_find_assigned_dev(kvm-arch.assigned_dev_head, +assigned_dev-assigned_dev_id); + if (!match) { + r = -ENODEV; + goto out; + } + + mutex_lock(match-intx_mask_lock); + + match-flags = ~KVM_DEV_ASSIGN_MASK_INTX; + match-flags |= assigned_dev-flags KVM_DEV_ASSIGN_MASK_INTX; + + if (match-irq_requested_type KVM_DEV_IRQ_GUEST_INTX) { + if (assigned_dev-flags KVM_DEV_ASSIGN_MASK_INTX) { + kvm_set_irq(match-kvm, match-irq_source_id, + match-guest_irq, 0); + /* + * Masking at hardware-level is performed on demand, + * i.e. when an IRQ actually arrives at the host. + */ + } else if (!(assigned_dev-flags KVM_DEV_ASSIGN_PCI_2_3)) { + /* + * Unmask the IRQ line if required. Unmasking at + * device level will be performed by user space. + */ + spin_lock_irq(match-intx_lock); + if (match-host_irq_disabled) { + enable_irq(match-host_irq); + match-host_irq_disabled = false; + } + spin_unlock_irq(match-intx_lock); This still looks broken. If we start with a PCI 2.3 device with INTx disabled, how does this ever kick start to get another interrupt? Shouldn't we just call kvm_assigned_dev_ack_irq() here and handle both INTx modes? Thanks, No, that would be wrong. The IRQ must be delivered to the guest first. And that will happen with 2.3 once userspace unmasks the device INTx and, thus, triggers another host-side IRQ. Same for non-2.3 devices, just that this happens on enable_irq. Ok, got it. I was tempted to write in the revised description that userspace could skip the device command register update if INTx disable is the only change, but because of the way this works, that is clearly not the case. Thanks. Acked-by: Alex Williamson alex.william...@redhat.com -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] KVM: Resize kvm_io_range array dynamically
- Original Message - On 2012-02-29 16:22, Amos Kong wrote: - Original Message - On 2012-02-29 14:30, Amos Kong wrote: kvm_io_bus devices are used for ioevent, pit, pic, ioapic, coalesced_mmio. Currently Qemu only emulates one PCI bus, it contains 32 slots, one slot contains 8 functions, maximum of supported PCI devices: 1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100, each zone has an iobus devices. 300 io_bus devices is not enough. This patch makes the kvm_io_range array can be resized dynamically. Is there any limit, or can userspace allocate arbitrary amounts of kernel memory this way? Hi Jan, There is a fixed array in linux-2.6/include/linux/kvm_host.h, we can only register 300 devices. struct kvm_io_range { gpa_t addr; int len; struct kvm_io_device *dev; }; struct kvm_io_bus { int dev_count; #define NR_IOBUS_DEVS 300 struct kvm_io_range range[NR_IOBUS_DEVS]; }; ^^ Right. But doesn't your patch remove precisely this limit? So what limits userspace now? To register 300 million devices...? Hi Jan, It seems we need to reserve the limitation in kvm_host.h #define NR_IOBUS_DEVS 600 /* Currently Qemu only emulates one PCI bus, it contains 32 slots, one slot contains 8 functions. Only 29 slots can be used to add multiple function devices. Maximum of supported PCI devices: 29 * 8 = 232. Each virtio-blk device needs 1 iobus device, each virtio-net(vhost) device requires 2 such devices to service notifications (ioevent) from rx/tx queues. The maximum of coalesced mmio zone is 100, each zone has an iobus devices. ioevent, pit, ioapic take less iobus devices. So we can set max limitation to 600. */ - check limit when register dev virt/kvm/kvm_main.c: /* Caller must hold slots_lock. */ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, int len, struct kvm_io_device *dev) { struct kvm_io_bus *new_bus, *bus; bus = kvm-buses[bus_idx]; if (bus-dev_count NR_IOBUS_DEVS - 1) // can only register 600 devices return -ENOSPC; Amos. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled
On 02/29/2012 03:57 PM, Joerg Roedel wrote: It turned out that a performance counter on AMD does not count at all when the GO or HO bit is set in the control register and SVM is disabled in EFER. This patch works around this issue by masking out the HO bit in the performance counter control register when SVM is not enabled. The GO bit is not touched because it is only set when the user wants to count in guest-mode only. So when SVM is disabled the counter should not run at all and the not-counting is the intended behaviour. diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c index 0397b23..67250a5 100644 --- a/arch/x86/kernel/cpu/perf_event_amd.c +++ b/arch/x86/kernel/cpu/perf_event_amd.c @@ -1,4 +1,5 @@ #include linux/perf_event.h +#include linux/export.h #include linux/types.h #include linux/init.h #include linux/slab.h @@ -357,7 +358,9 @@ static void amd_pmu_cpu_starting(int cpu) struct amd_nb *nb; int i, nb_id; - if (boot_cpu_data.x86_max_cores 2) + cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY; + + if (boot_cpu_data.x86_max_cores 2 || boot_cpu_data.x86 == 0x15) return; Why this (boot_cpu_data.x86 == 0x15) change? -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled
On Wed, Feb 29, 2012 at 07:00:09PM +0200, Avi Kivity wrote: On 02/29/2012 03:57 PM, Joerg Roedel wrote: It turned out that a performance counter on AMD does not count at all when the GO or HO bit is set in the control register and SVM is disabled in EFER. This patch works around this issue by masking out the HO bit in the performance counter control register when SVM is not enabled. The GO bit is not touched because it is only set when the user wants to count in guest-mode only. So when SVM is disabled the counter should not run at all and the not-counting is the intended behaviour. diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c index 0397b23..67250a5 100644 --- a/arch/x86/kernel/cpu/perf_event_amd.c +++ b/arch/x86/kernel/cpu/perf_event_amd.c @@ -1,4 +1,5 @@ #include linux/perf_event.h +#include linux/export.h #include linux/types.h #include linux/init.h #include linux/slab.h @@ -357,7 +358,9 @@ static void amd_pmu_cpu_starting(int cpu) struct amd_nb *nb; int i, nb_id; - if (boot_cpu_data.x86_max_cores 2) + cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY; + + if (boot_cpu_data.x86_max_cores 2 || boot_cpu_data.x86 == 0x15) return; Why this (boot_cpu_data.x86 == 0x15) change? I think it is related to .cpu_starting callback now available in amd_pmu_f15h where previously it wasn't. -- Gleb. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled
On Wed, Feb 29, 2012 at 07:00:09PM +0200, Avi Kivity wrote: On 02/29/2012 03:57 PM, Joerg Roedel wrote: It turned out that a performance counter on AMD does not count at all when the GO or HO bit is set in the control register and SVM is disabled in EFER. This patch works around this issue by masking out the HO bit in the performance counter control register when SVM is not enabled. The GO bit is not touched because it is only set when the user wants to count in guest-mode only. So when SVM is disabled the counter should not run at all and the not-counting is the intended behaviour. diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c index 0397b23..67250a5 100644 --- a/arch/x86/kernel/cpu/perf_event_amd.c +++ b/arch/x86/kernel/cpu/perf_event_amd.c @@ -1,4 +1,5 @@ #include linux/perf_event.h +#include linux/export.h #include linux/types.h #include linux/init.h #include linux/slab.h @@ -357,7 +358,9 @@ static void amd_pmu_cpu_starting(int cpu) struct amd_nb *nb; int i, nb_id; - if (boot_cpu_data.x86_max_cores 2) + cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY; + + if (boot_cpu_data.x86_max_cores 2 || boot_cpu_data.x86 == 0x15) return; Why this (boot_cpu_data.x86 == 0x15) change? This is because this function did not run on Fam15h before but now it has to so that cpuc-perf_ctr_virt_mask is initialized. The other stuff done in this function is setup for northbridge counter. These are not yet implemented for Fam15h CPUs so this setup must not run on those CPUs. Therefore the check was added. Once northbridge counters are implemented for Fam15h this check can go away again. Joerg -- AMD Operating System Research Center Advanced Micro Devices GmbH Einsteinring 24 85609 Dornach General Managers: Alberto Bozzo Registration: Dornach, Landkr. Muenchen; Registerger. Muenchen, HRB Nr. 43632 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled
On 02/29/2012 07:05 PM, Joerg Roedel wrote: On Wed, Feb 29, 2012 at 07:00:09PM +0200, Avi Kivity wrote: On 02/29/2012 03:57 PM, Joerg Roedel wrote: It turned out that a performance counter on AMD does not count at all when the GO or HO bit is set in the control register and SVM is disabled in EFER. This patch works around this issue by masking out the HO bit in the performance counter control register when SVM is not enabled. The GO bit is not touched because it is only set when the user wants to count in guest-mode only. So when SVM is disabled the counter should not run at all and the not-counting is the intended behaviour. diff --git a/arch/x86/kernel/cpu/perf_event_amd.c b/arch/x86/kernel/cpu/perf_event_amd.c index 0397b23..67250a5 100644 --- a/arch/x86/kernel/cpu/perf_event_amd.c +++ b/arch/x86/kernel/cpu/perf_event_amd.c @@ -1,4 +1,5 @@ #include linux/perf_event.h +#include linux/export.h #include linux/types.h #include linux/init.h #include linux/slab.h @@ -357,7 +358,9 @@ static void amd_pmu_cpu_starting(int cpu) struct amd_nb *nb; int i, nb_id; - if (boot_cpu_data.x86_max_cores 2) + cpuc-perf_ctr_virt_mask = AMD_PERFMON_EVENTSEL_HOSTONLY; + + if (boot_cpu_data.x86_max_cores 2 || boot_cpu_data.x86 == 0x15) return; Why this (boot_cpu_data.x86 == 0x15) change? This is because this function did not run on Fam15h before but now it has to so that cpuc-perf_ctr_virt_mask is initialized. The other stuff done in this function is setup for northbridge counter. These are not yet implemented for Fam15h CPUs so this setup must not run on those CPUs. Therefore the check was added. Once northbridge counters are implemented for Fam15h this check can go away again. Thanks. Ingo, this had better go through tip.git since most of the changes are perf related. -- error compiling committee.c: too many arguments to function -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled
On Wed, 2012-02-29 at 18:05 +0100, Joerg Roedel wrote: Once northbridge counters are implemented for Fam15h this check can go away again. Nope, northbridge on fam15 is completely disjoint from the regular pmu and should thus be a separate driver. The only reason the old amd driver has them both is because how they shared the registers so there is a shared resource to manage. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware
On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote: On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote: OK back to this. The last piece is where to put these messages... we could take PF_ROUTE:RTM_*NEIGH PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded switch. PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded switch. PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc may play a role in the user space app populating the FDB, i dont think they are necessary players. Learning could be via a table entry miss and packet redirect to user space. So my suggestion is to use FDB_*ENTRY for names Well I think NETLINK_ROUTE is the most correct type to use in this case. Per netlink.h its for routing and device hooks. #define NETLINK_ROUTE 0 /* Routing/device hook */ And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix were merely a copy from the SW BRIDGE code paths. How about, PF_BRIDGE:RTM_FDB_NEWENTRY PF_BRIDGE:RTM_FDB_DELENTRY PF_BRIDGE:RTM_FDB_GETENTRY And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct rtnl locking semantics for free. The neighbor code is using the PF_UNSPEC protocol type so we won't collide with these unless someone was using PF_ROUTE and relying on falling back to PF_UNSPEC however I couldn't find any programs that did this iproute2 certainly doesn't. And the bridge pieces are using PF_BRIDGE so no collision there. They have to be different calls from the calls that talk to the s/ware bridge. In my opinion, as controversial as this may sound, you need to be flexible enough that some vendor can replace these calls with proprietary calls which are more efficient for their hardware. So a plugin to replace these calls in the user space code would be a good idea. Alternatively, you could make that something they do at the driver level i.e from user space to kernel it is hardware, please addthistotheFDBtable() call and the implementation of that could be proprietary to the specific hardware. Agreed. I think adding some ndo_ops for bridging offloads here would work. For example the DSA infrastructure and/or macvlan devices might need this. Along the lines of extending this RFC, [RFC] hardware bridging support for DSA switches http://patchwork.ozlabs.org/patch/16578/ .John -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2] perf/x86: Fix HO/GO counting with SVM disabled
On Wed, 2012-02-29 at 14:57 +0100, Joerg Roedel wrote: diff --git a/arch/x86/kernel/cpu/perf_event.h b/arch/x86/kernel/cpu/perf_event.h index 8944062..2c581b9 100644 --- a/arch/x86/kernel/cpu/perf_event.h +++ b/arch/x86/kernel/cpu/perf_event.h @@ -148,6 +148,8 @@ struct cpu_hw_events { * AMD specific bits */ struct amd_nb *amd_nb; + /* Inverted mask of bits to clear in the perf_ctr ctrl registers */ + u64 perf_ctr_virt_mask; void*kfree_on_online; }; @@ -417,9 +419,11 @@ void x86_pmu_disable_all(void); static inline void __x86_pmu_enable_event(struct hw_perf_event *hwc, u64 enable_mask) { + u64 disable_mask = __this_cpu_read(cpu_hw_events.perf_ctr_virt_mask); + if (hwc-extra_reg.reg) wrmsrl(hwc-extra_reg.reg, hwc-extra_reg.config); - wrmsrl(hwc-config_base, hwc-config | enable_mask); + wrmsrl(hwc-config_base, (hwc-config | enable_mask) ~disable_mask); } Its starting to look like we should kill this helper, the extra_reg muck is Intel only and the disable_mask is AMD. Queued it for now though.. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM
On 02/28/2012 08:16 PM, Alexander Graf wrote: When we know that we're running inside of a KVM guest, we don't have to worry about synchronizing timebases between different CPUs, since the host already took care of that. This fixes CPU overcommit scenarios where vCPUs could hang forever trying to sync each other while not being scheduled. Reported-by: Stuart Yoder b08...@freescale.com Signed-off-by: Alexander Graf ag...@suse.de This should apply to any hypervisor, not just KVM. On book3e, Power ISA says timebase is read-only on virtualized implementations. My understanding is that book3s is paravirt-only (guest state is not considered an implementation of the Power ISA), and it says Writing the Time Base is privileged, and can be done only in hypervisor state. Which platforms are you seeing this on? If it's on Freescale chips, U-Boot should be doing the sync and Linux should never do it, even in the absence of a hypervisor. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware
On Wed, 29 Feb 2012 09:25:56 -0800 John Fastabend john.r.fastab...@intel.com wrote: On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote: On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote: OK back to this. The last piece is where to put these messages... we could take PF_ROUTE:RTM_*NEIGH PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded switch. PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded switch. PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc may play a role in the user space app populating the FDB, i dont think they are necessary players. Learning could be via a table entry miss and packet redirect to user space. So my suggestion is to use FDB_*ENTRY for names Well I think NETLINK_ROUTE is the most correct type to use in this case. Per netlink.h its for routing and device hooks. #define NETLINK_ROUTE 0 /* Routing/device hook */ And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix were merely a copy from the SW BRIDGE code paths. How about, PF_BRIDGE:RTM_FDB_NEWENTRY PF_BRIDGE:RTM_FDB_DELENTRY PF_BRIDGE:RTM_FDB_GETENTRY And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct rtnl locking semantics for free. The neighbor code is using the PF_UNSPEC protocol type so we won't collide with these unless someone was using PF_ROUTE and relying on falling back to PF_UNSPEC however I couldn't find any programs that did this iproute2 certainly doesn't. And the bridge pieces are using PF_BRIDGE so no collision there. They have to be different calls from the calls that talk to the s/ware bridge. In my opinion, as controversial as this may sound, you need to be flexible enough that some vendor can replace these calls with proprietary calls which are more efficient for their hardware. So a plugin to replace these calls in the user space code would be a good idea. Alternatively, you could make that something they do at the driver level i.e from user space to kernel it is hardware, please addthistotheFDBtable() call and the implementation of that could be proprietary to the specific hardware. Agreed. I think adding some ndo_ops for bridging offloads here would work. For example the DSA infrastructure and/or macvlan devices might need this. Along the lines of extending this RFC, [RFC] hardware bridging support for DSA switches http://patchwork.ozlabs.org/patch/16578/ I want to see a unified API so that user space control applications (RSTP, TRILL?) can use one set of netlink calls for both software bridge and hardware offloaded bridges. Does this proposal meet that requirement? -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [RFC PATCH v0 1/2] net: bridge: propagate FDB table into hardware
On 2/29/2012 9:52 AM, Stephen Hemminger wrote: On Wed, 29 Feb 2012 09:25:56 -0800 John Fastabend john.r.fastab...@intel.com wrote: On 2/29/2012 5:56 AM, Jamal Hadi Salim wrote: On Tue, 2012-02-28 at 20:40 -0800, John Fastabend wrote: OK back to this. The last piece is where to put these messages... we could take PF_ROUTE:RTM_*NEIGH PF_ROUTE:RTM_NEWNEIGH - Add a new FDB entry to an offloaded switch. PF_ROUTE:RTM_DELNEIGH - Delete a FDB entry from an offlaoded switch. PF_ROUTE:RTM_GETNEIGH - Dumps the embedded FDB table Why RTM_*NEIGH? RTM tends to map to Route/L3 and NEIGH tends to map to ndisc or ARP both tied to IP address resolution. While both ARP/Ndisc may play a role in the user space app populating the FDB, i dont think they are necessary players. Learning could be via a table entry miss and packet redirect to user space. So my suggestion is to use FDB_*ENTRY for names Well I think NETLINK_ROUTE is the most correct type to use in this case. Per netlink.h its for routing and device hooks. #define NETLINK_ROUTE 0 /* Routing/device hook */ And NETLINK_ROUTE msg_types use the RTM_* prefix. The _*NEIGH postfix were merely a copy from the SW BRIDGE code paths. How about, PF_BRIDGE:RTM_FDB_NEWENTRY PF_BRIDGE:RTM_FDB_DELENTRY PF_BRIDGE:RTM_FDB_GETENTRY And a new group RTNLGRP_FDB. Also using NETLINK_ROUTE gives the correct rtnl locking semantics for free. The neighbor code is using the PF_UNSPEC protocol type so we won't collide with these unless someone was using PF_ROUTE and relying on falling back to PF_UNSPEC however I couldn't find any programs that did this iproute2 certainly doesn't. And the bridge pieces are using PF_BRIDGE so no collision there. They have to be different calls from the calls that talk to the s/ware bridge. In my opinion, as controversial as this may sound, you need to be flexible enough that some vendor can replace these calls with proprietary calls which are more efficient for their hardware. So a plugin to replace these calls in the user space code would be a good idea. Alternatively, you could make that something they do at the driver level i.e from user space to kernel it is hardware, please addthistotheFDBtable() call and the implementation of that could be proprietary to the specific hardware. Agreed. I think adding some ndo_ops for bridging offloads here would work. For example the DSA infrastructure and/or macvlan devices might need this. Along the lines of extending this RFC, [RFC] hardware bridging support for DSA switches http://patchwork.ozlabs.org/patch/16578/ I want to see a unified API so that user space control applications (RSTP, TRILL?) can use one set of netlink calls for both software bridge and hardware offloaded bridges. Does this proposal meet that requirement? With the patches I sent out last night the same netlink calls are used for both SW and HW with a flag set in ndm_flags to indicate it is a hardware entry. The flag is needed when a port has offload support and is also a slave of a SW bridge. Another option would be to apply the command to both hardware and software tables. This might be good enough and user space would not have to make distinctions between HW and SW bridges. Also helps with my original use case where I want the SW and HW bridge FDBs to be in sync. In response to Jamal's comment I proposed changing the type to RTM_FDB_XXXENTRY but the message contents are the same in both cases. Jamal, so why do They have to be different calls? I'm not so sure anymore... moving to RTM_FDB_XXXENTRY saved some refactoring in the bridge module but that is just cosmetic. Thanks, John -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM
On 29.02.2012, at 18:50, Scott Wood scottw...@freescale.com wrote: On 02/28/2012 08:16 PM, Alexander Graf wrote: When we know that we're running inside of a KVM guest, we don't have to worry about synchronizing timebases between different CPUs, since the host already took care of that. This fixes CPU overcommit scenarios where vCPUs could hang forever trying to sync each other while not being scheduled. Reported-by: Stuart Yoder b08...@freescale.com Signed-off-by: Alexander Graf ag...@suse.de This should apply to any hypervisor, not just KVM. Sure, but do you have a generic function to evaluate that? :) On book3e, Power ISA says timebase is read-only on virtualized implementations. My understanding is that book3s is paravirt-only (guest state is not considered an implementation of the Power ISA), and it says Writing the Time Base is privileged, and can be done only in hypervisor state. For PR non-PAPR KVM, we are non-paravirt, but ignore tb writes iirc. Which platforms are you seeing this on? If it's on Freescale chips, U-Boot should be doing the sync and Linux should never do it, even in the absence of a hypervisor. This is on e500mc. Alex -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH V2] Quirk for IVB graphics FLR errata
On Wed, 29 Feb 2012 03:11:24 + Hao, Xudong xudong@intel.com wrote: For IvyBridge Mobile platform, a system hang may occur if a FLR(Function Level Reset) is asserted to internal graphics. This quirk patch is workaround for the IVB FLR errata issue. We are disabling the FLR reset handshake between the PCH and CPU display, then manually powering down the panel power sequencing and resetting the PCH display. Signed-off-by: Xudong Hao xudong@intel.com Signed-off-by: Kay, Allen M allen.m@intel.com --- drivers/pci/quirks.c | 49 + 1 files changed, 49 insertions(+), 0 deletions(-) diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 6476547..5223b80 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -29,6 +29,11 @@ #include asm/dma.h /* isa_dma_bridge_buggy */ #include pci.h +#include ../gpu/drm/i915/i915_reg.h Ugg... this is ugly. Should probably go near the definition of the quirk at any rate. +#include asm/tsc.h +/* 10 seconds */ +#define IGD_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000) + Same here, the asm/tsc.h can go at the top, but the timeout definition should go near the reset function. May as well make it a static const cycles_t as well. +#define MSG_CTL 0x45010 + +static int reset_ivb_igd(struct pci_dev *dev, int probe) { + u8 *mmio_base; + u32 val; + + if (probe) + return 0; + + mmio_base = ioremap_nocache(pci_resource_start(dev, 0), + pci_resource_len(dev, 0)); + if (!mmio_base) + return -ENOMEM; + + /* Work Around */ + *((u32 *)(mmio_base + MSG_CTL)) = 0x0002; + *((u32 *)(mmio_base + SOUTH_CHICKEN2)) = 0x0005; + val = *((u32 *)(mmio_base + PCH_PP_CONTROL)) 0xfffe; + *((u32 *)(mmio_base + PCH_PP_CONTROL)) = val; These clobber existing values. Since we're doing a reset, clobbering PCH_PP_CONTROL should be ok, but clobbering SOUTH_CHICKEN2 is only ok if the next driver that loads sets the right bits (if i915 was previously using the device, it would have set a couple of bits). But again since it's it a reset that's probably ok, but you should put a comment indicating as much. + do { + cycles_t start_time = get_cycles(); + while (1) { + val = *((u32 *)(mmio_base + PCH_PP_STATUS)); + if (((val 0x8000) == 0) + ((val 0x3000) == 0)) + break; + if (IGD_OPERATION_TIMEOUT (get_cycles() - start_time)) + break; + cpu_relax(); + } + } while (0); + *((u32 *)(mmio_base + 0xd0100)) = 0x0002; + + iounmap(pci_resource_start(dev, 0)); + return 0; +} + #define PCI_DEVICE_ID_INTEL_82599_SFP_VF 0x10ed +#define PCI_DEVICE_ID_INTEL_IVB_M_VGA 0x0156 +#define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = { { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF, reset_intel_82599_sfp_virtfn }, + { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M_VGA, + reset_ivb_igd }, + { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA, + reset_ivb_igd }, { PCI_VENDOR_ID_INTEL, PCI_ANY_ID, reset_intel_generic_dev }, { 0 } Looks ok otherwise, thanks. Jesse -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM
On 02/29/2012 12:28 PM, Alexander Graf wrote: On 29.02.2012, at 18:50, Scott Wood scottw...@freescale.com wrote: On 02/28/2012 08:16 PM, Alexander Graf wrote: When we know that we're running inside of a KVM guest, we don't have to worry about synchronizing timebases between different CPUs, since the host already took care of that. This fixes CPU overcommit scenarios where vCPUs could hang forever trying to sync each other while not being scheduled. Reported-by: Stuart Yoder b08...@freescale.com Signed-off-by: Alexander Graf ag...@suse.de This should apply to any hypervisor, not just KVM. Sure, but do you have a generic function to evaluate that? :) The presence of a hypervisor node without testing compatible. Might not get them all, but at least it will cover more than just KVM. Which platforms are you seeing this on? If it's on Freescale chips, U-Boot should be doing the sync and Linux should never do it, even in the absence of a hypervisor. This is on e500mc. On e500mc Linux should never by trying to sync the timebase. If it is, let's fix that. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: Device isolation groups
On Thu, 2012-02-02 at 12:24 +1100, David Gibson wrote: On Wed, Feb 01, 2012 at 01:08:39PM -0700, Alex Williamson wrote: On Wed, 2012-02-01 at 15:46 +1100, David Gibson wrote: This patch series introduces a new infrastructure to the driver core for representing device isolation groups. That is, groups of devices which can be isolated in such a way that the rest of the system can be protected from them, even in the presence of userspace or a guest OS directly driving the devices. Isolation will typically be due to an IOMMU which can safely remap DMA and interrupts coming from these devices. We need to represent whole groups, rather than individual devices, because there are a number of cases where the group can be isolated as a whole, but devices within it cannot be safely isolated from each other - this usually occurs because the IOMMU cannot reliably distinguish which device in the group initiated a transaction. In other words, isolation groups represent the minimum safe granularity for passthrough to guests or userspace. This series provides the core infraustrcture for tracking isolation groups, and example implementations initializing the groups appropriately for two PCI bridges (which include IOMMUs) found on IBM POWER systems. Actually using the group information is not included here, but David Woodhouse has expressed an interest in using a structure like this to represent operations in iommu_ops more correctly. Some tracking of groups is a prerequisite for safe passthrough of devices to guests or userspace, such as done by VFIO. Current VFIO patches use the iommu_ops-device_group mechanism for this. However, that mechanism is awkward, because without an in-kernel concrete representation of groups, enumerating a group requires traversing every device on a given bus type. It also fails to cover some very plausible IOMMU topologies, because its groups cannot span devices on multiple bus types. So far so good, but there's not much meat on the bone yet. True.. Any update to this series? It would be great if we could map out the functionality to the point of breaking down and distributing work... or determining if the end result has any value add to what VFIO already does. Thanks, Alex -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
RE: [PATCH V2] Quirk for IVB graphics FLR errata
-Original Message- From: Jesse Barnes [mailto:jbar...@virtuousgeek.org] Sent: Thursday, March 01, 2012 2:32 AM To: Hao, Xudong Cc: linux-...@vger.kernel.org; linux-ker...@vger.kernel.org; kvm@vger.kernel.org; Kay, Allen M; Zhang, Xiantao Subject: Re: [PATCH V2] Quirk for IVB graphics FLR errata On Wed, 29 Feb 2012 03:11:24 + Hao, Xudong xudong@intel.com wrote: For IvyBridge Mobile platform, a system hang may occur if a FLR(Function Level Reset) is asserted to internal graphics. This quirk patch is workaround for the IVB FLR errata issue. We are disabling the FLR reset handshake between the PCH and CPU display, then manually powering down the panel power sequencing and resetting the PCH display. Signed-off-by: Xudong Hao xudong@intel.com Signed-off-by: Kay, Allen M allen.m@intel.com --- drivers/pci/quirks.c | 49 + 1 files changed, 49 insertions(+), 0 deletions(-) diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 6476547..5223b80 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -29,6 +29,11 @@ #include asm/dma.h /* isa_dma_bridge_buggy */ #include pci.h +#include ../gpu/drm/i915/i915_reg.h Ugg... this is ugly. Should probably go near the definition of the quirk at any rate. OK, I'll move it just above the reset_ivb_igd function. +#include asm/tsc.h +/* 10 seconds */ +#define IGD_OPERATION_TIMEOUT ((cycles_t) tsc_khz*10*1000) + Same here, the asm/tsc.h can go at the top, but the timeout definition should go near the reset function. May as well make it a static const cycles_t as well. Will modify it. +#define MSG_CTL0x45010 + +static int reset_ivb_igd(struct pci_dev *dev, int probe) { + u8 *mmio_base; + u32 val; + + if (probe) + return 0; + + mmio_base = ioremap_nocache(pci_resource_start(dev, 0), +pci_resource_len(dev, 0)); + if (!mmio_base) + return -ENOMEM; + + /* Work Around */ + *((u32 *)(mmio_base + MSG_CTL)) = 0x0002; + *((u32 *)(mmio_base + SOUTH_CHICKEN2)) = 0x0005; + val = *((u32 *)(mmio_base + PCH_PP_CONTROL)) 0xfffe; + *((u32 *)(mmio_base + PCH_PP_CONTROL)) = val; These clobber existing values. Since we're doing a reset, clobbering PCH_PP_CONTROL should be ok, but clobbering SOUTH_CHICKEN2 is only ok if the next driver that loads sets the right bits (if i915 was previously using the device, it would have set a couple of bits). But again since it's it a reset that's probably ok, but you should put a comment indicating as much. I'll add comment here, thanks Jesse. + do { + cycles_t start_time = get_cycles(); + while (1) { + val = *((u32 *)(mmio_base + PCH_PP_STATUS)); + if (((val 0x8000) == 0) +((val 0x3000) == 0)) + break; + if (IGD_OPERATION_TIMEOUT (get_cycles() - start_time)) + break; + cpu_relax(); + } + } while (0); + *((u32 *)(mmio_base + 0xd0100)) = 0x0002; + + iounmap(pci_resource_start(dev, 0)); + return 0; +} + #define PCI_DEVICE_ID_INTEL_82599_SFP_VF 0x10ed +#define PCI_DEVICE_ID_INTEL_IVB_M_VGA 0x0156 +#define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = { { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF, reset_intel_82599_sfp_virtfn }, + { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M_VGA, + reset_ivb_igd }, + { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA, + reset_ivb_igd }, { PCI_VENDOR_ID_INTEL, PCI_ANY_ID, reset_intel_generic_dev }, { 0 } Looks ok otherwise, thanks. Jesse -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] powerpc/e500: make load_up_spe a normal fuction
From: Liu Yu yu@freescale.com So that we can call it when improving SPE switch like book3e did for fp switch. Signed-off-by: Liu Yu yu@freescale.com Signed-off-by: Olivia Yin hong-hua@freescale.com --- v2: add Signed-off-by arch/powerpc/kernel/head_fsl_booke.S | 23 ++- 1 files changed, 6 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S index d5d78c4..c96e025 100644 --- a/arch/powerpc/kernel/head_fsl_booke.S +++ b/arch/powerpc/kernel/head_fsl_booke.S @@ -539,8 +539,10 @@ interrupt_base: /* SPE Unavailable */ START_EXCEPTION(SPEUnavailable) NORMAL_EXCEPTION_PROLOG - bne load_up_spe - addir3,r1,STACK_FRAME_OVERHEAD + beq 1f + bl load_up_spe + b fast_exception_return +1: addir3,r1,STACK_FRAME_OVERHEAD EXC_XFER_EE_LITE(0x2010, KernelSPE) #else EXCEPTION(0x2020, SPEUnavailable, unknown_exception, EXC_XFER_EE) @@ -743,7 +745,7 @@ tlb_write_entry: /* Note that the SPE support is closely modeled after the AltiVec * support. Changes to one are likely to be applicable to the * other! */ -load_up_spe: +_GLOBAL(load_up_spe) /* * Disable SPE for the task which had SPE previously, * and save its SPE registers in its thread_struct. @@ -791,20 +793,7 @@ load_up_spe: subir4,r5,THREAD stw r4,last_task_used_spe@l(r3) #endif /* !CONFIG_SMP */ - /* restore registers and return */ -2: REST_4GPRS(3, r11) - lwz r10,_CCR(r11) - REST_GPR(1, r11) - mtcrr10 - lwz r10,_LINK(r11) - mtlrr10 - REST_GPR(10, r11) - mtspr SPRN_SRR1,r9 - mtspr SPRN_SRR0,r12 - REST_GPR(9, r11) - REST_GPR(12, r11) - lwz r11,GPR11(r11) - rfi + blr /* * SPE unavailable trap from kernel - print a message, but let -- 1.6.4 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/2] KVM: booke: Improve SPE switch
From: Liu Yu yu@freescale.com Like book3s did for fp switch, instead of switch SPE between host and guest, the patch switch SPE state between qemu and guest. In this way, we can simulate a host loadup SPE when load guest SPE state, and let host to decide when to giveup SPE state. Therefor it cooperates better with host SPE usage, and so that has some performance benifit in UP host(lazy SPE). Moreover, since the patch save guest SPE state into linux thread field, it creates the condition to emulate guest SPE instructions in host, so that we can avoid injecting SPE exception to guest. The patch also turns all asm code into C code, and add SPE stat counts. Signed-off-by: Liu Yu yu@freescale.com Signed-off-by: Olivia Yin hong-hua@freescale.com --- v2: Keep shadow MSR[SPE] consistent with thread MSR[SPE] in kvmppc_core_vcpu_load arch/powerpc/include/asm/kvm_host.h | 11 +- arch/powerpc/kernel/asm-offsets.c |7 arch/powerpc/kvm/booke.c| 63 +++ arch/powerpc/kvm/booke.h|8 + arch/powerpc/kvm/booke_interrupts.S | 37 arch/powerpc/kvm/e500.c | 13 --- arch/powerpc/kvm/timing.c |5 +++ arch/powerpc/kvm/timing.h | 11 ++ 8 files changed, 91 insertions(+), 64 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 1843d5d..6186d08 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -117,6 +117,11 @@ struct kvm_vcpu_stat { u32 st; u32 st_slow; #endif +#ifdef CONFIG_SPE + u32 spe_unavail; + u32 spe_fp_data; + u32 spe_fp_round; +#endif }; enum kvm_exit_types { @@ -147,6 +152,11 @@ enum kvm_exit_types { FP_UNAVAIL, DEBUG_EXITS, TIMEINGUEST, +#ifdef CONFIG_SPE + SPE_UNAVAIL, + SPE_FP_DATA, + SPE_FP_ROUND, +#endif __NUMBER_OF_KVM_EXIT_TYPES }; @@ -330,7 +340,6 @@ struct kvm_vcpu_arch { #ifdef CONFIG_SPE ulong evr[32]; ulong spefscr; - ulong host_spefscr; u64 acc; #endif #ifdef CONFIG_ALTIVEC diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 8e0db0b..ff68f71 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -604,13 +604,6 @@ int main(void) DEFINE(TLBCAM_MAS7, offsetof(struct tlbcam, MAS7)); #endif -#if defined(CONFIG_KVM) defined(CONFIG_SPE) - DEFINE(VCPU_EVR, offsetof(struct kvm_vcpu, arch.evr[0])); - DEFINE(VCPU_ACC, offsetof(struct kvm_vcpu, arch.acc)); - DEFINE(VCPU_SPEFSCR, offsetof(struct kvm_vcpu, arch.spefscr)); - DEFINE(VCPU_HOST_SPEFSCR, offsetof(struct kvm_vcpu, arch.host_spefscr)); -#endif - #ifdef CONFIG_KVM_EXIT_TIMING DEFINE(VCPU_TIMING_EXIT_TBU, offsetof(struct kvm_vcpu, arch.timing_exit.tv32.tbu)); diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index ee9e1ee..f20010b 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -55,6 +55,11 @@ struct kvm_stats_debugfs_item debugfs_entries[] = { { dec,VCPU_STAT(dec_exits) }, { ext_intr, VCPU_STAT(ext_intr_exits) }, { halt_wakeup, VCPU_STAT(halt_wakeup) }, +#ifdef CONFIG_SPE + { spe_unavail, VCPU_STAT(spe_unavail) }, + { spe_fp_data, VCPU_STAT(spe_fp_data) }, + { spe_fp_round, VCPU_STAT(spe_fp_round) }, +#endif { NULL } }; @@ -80,11 +85,11 @@ void kvmppc_dump_vcpu(struct kvm_vcpu *vcpu) } #ifdef CONFIG_SPE -void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu) +static void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu) { preempt_disable(); - enable_kernel_spe(); - kvmppc_save_guest_spe(vcpu); + if (current-thread.regs-msr MSR_SPE) + giveup_spe(current); vcpu-arch.shadow_msr = ~MSR_SPE; preempt_enable(); } @@ -92,8 +97,10 @@ void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu) static void kvmppc_vcpu_enable_spe(struct kvm_vcpu *vcpu) { preempt_disable(); - enable_kernel_spe(); - kvmppc_load_guest_spe(vcpu); + if (!(current-thread.regs-msr MSR_SPE)) { + load_up_spe(NULL); + current-thread.regs-msr |= MSR_SPE; + } vcpu-arch.shadow_msr |= MSR_SPE; preempt_enable(); } @@ -104,7 +111,7 @@ static void kvmppc_vcpu_sync_spe(struct kvm_vcpu *vcpu) if (!(vcpu-arch.shadow_msr MSR_SPE)) kvmppc_vcpu_enable_spe(vcpu); } else if (vcpu-arch.shadow_msr MSR_SPE) { - kvmppc_vcpu_disable_spe(vcpu); + vcpu-arch.shadow_msr = ~MSR_SPE; } } #else @@ -124,7 +131,8 @@ void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr) vcpu-arch.shared-msr = new_msr; kvmppc_mmu_msr_notify(vcpu, old_msr); -
Re: [PATCH] kvm: notify host when guest paniced
At 02/29/2012 06:39 PM, Avi Kivity Wrote: On 02/29/2012 12:17 PM, Wen Congyang wrote: Yes, crash can be so severe that it is not even detected by a kernel itself, so not OOPS message even printed. But in most cases if kernel is functional enough to print OOPS it is functional enough to call single hypercall instruction. Why not print the oops to virtio-serial? Or even just a regular serial port? That's what bare metal does. If virtio-serial's driver has bug or the guest doesn't have such device... We have the same issue with the hypercall; and virtio-serial is available on many deployed versions. How to know whether a guest has virtio-serial? Thanks Wen Congyang Having special kdump kernel that transfers dump to a host via virtio-serial channel though sounds interesting. May be that's what you mean. Yes. The panic, starting dump signal should be initiated by the panicking kernel though, in case the dump fails. Then panic hypercall sounds like a reasonable solution. It is, but I'm trying to see if we can get away with doing nothing. If we have a reliable way with doing nothing, it is better. But I donot find such way. We won't have a 100% reliable way. But I think a variant of the driver that doesn't use interrupts, or just using the ordinary serial driver, should be reliable enough. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH V3] Quirk for IVB graphics FLR errata
For IvyBridge Mobile platform, a system hang may occur if a FLR(Function Level Reset) is asserted to internal graphics. This quirk patch is workaround for the IVB FLR errata issue. We are disabling the FLR reset handshake between the PCH and CPU display, then manually powering down the panel power sequencing and resetting the PCH display. Signed-off-by: Xudong Hao xudong@intel.com Signed-off-by: Kay, Allen M allen.m@intel.com Reviewed-by: Xiantao Zhang xiantao.zh...@intel.com --- drivers/pci/quirks.c | 53 ++ 1 files changed, 53 insertions(+), 0 deletions(-) diff --git a/drivers/pci/quirks.c b/drivers/pci/quirks.c index 6476547..c687218 100644 --- a/drivers/pci/quirks.c +++ b/drivers/pci/quirks.c @@ -27,6 +27,7 @@ #include linux/pci-aspm.h #include linux/ioport.h #include asm/dma.h /* isa_dma_bridge_buggy */ +#include asm/tsc.h #include pci.h /* @@ -3069,11 +3070,63 @@ static int reset_intel_82599_sfp_virtfn(struct pci_dev *dev, int probe) return 0; } +#include ../gpu/drm/i915/i915_reg.h +#define MSG_CTL0x45010 +static const int op_timeout = 10; /* set timeout 10 seconds */ + +static int reset_ivb_igd(struct pci_dev *dev, int probe) { + u8 *mmio_base; + u32 val; + cycles_t cyc_op_timeout = tsc_khz*op_timeout*1000; + + if (probe) + return 0; + + mmio_base = ioremap_nocache(pci_resource_start(dev, 0), +pci_resource_len(dev, 0)); + if (!mmio_base) + return -ENOMEM; + + /* Work Around */ + *((u32 *)(mmio_base + MSG_CTL)) = 0x0002; + /* Clobbering SOUTH_CHICKEN2 register is fine only if the next +* driver loaded sets the right bits. However, this's a reset and +* the bits have been set by i915 previously, so we clobber +* SOUTH_CHICKEN2 register directly here. +*/ + *((u32 *)(mmio_base + SOUTH_CHICKEN2)) = 0x0005; + val = *((u32 *)(mmio_base + PCH_PP_CONTROL)) 0xfffe; + *((u32 *)(mmio_base + PCH_PP_CONTROL)) = val; + do { + cycles_t start_time = get_cycles(); + while (1) { + val = *((u32 *)(mmio_base + PCH_PP_STATUS)); + if (((val 0x8000) == 0) +((val 0x3000) == 0)) + break; + if (cyc_op_timeout (get_cycles() - start_time)) + break; + cpu_relax(); + } + } while (0); + *((u32 *)(mmio_base + 0xd0100)) = 0x0002; + + iounmap(pci_resource_start(dev, 0)); + return 0; +} + #define PCI_DEVICE_ID_INTEL_82599_SFP_VF 0x10ed +#define PCI_DEVICE_ID_INTEL_IVB_M_VGA 0x0156 +#define PCI_DEVICE_ID_INTEL_IVB_M2_VGA 0x0166 static const struct pci_dev_reset_methods pci_dev_reset_methods[] = { { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_82599_SFP_VF, reset_intel_82599_sfp_virtfn }, + { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M_VGA, + reset_ivb_igd }, + { PCI_VENDOR_ID_INTEL, PCI_DEVICE_ID_INTEL_IVB_M2_VGA, + reset_ivb_igd }, { PCI_VENDOR_ID_INTEL, PCI_ANY_ID, reset_intel_generic_dev }, { 0 } -- 1.6.0.rc1 -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3] KVM: Resize kvm_io_range array dynamically
On 01/03/12 00:34, Amos Kong wrote: - Original Message - On 2012-02-29 16:22, Amos Kong wrote: - Original Message - On 2012-02-29 14:30, Amos Kong wrote: kvm_io_bus devices are used for ioevent, pit, pic, ioapic, coalesced_mmio. Currently Qemu only emulates one PCI bus, it contains 32 slots, one slot contains 8 functions, maximum of supported PCI devices: 1 * 32 * 8 = 256. The maximum of coalesced mmio zone is 100, each zone has an iobus devices. 300 io_bus devices is not enough. This patch makes the kvm_io_range array can be resized dynamically. Is there any limit, or can userspace allocate arbitrary amounts of kernel memory this way? Hi Jan, There is a fixed array in linux-2.6/include/linux/kvm_host.h, we can only register 300 devices. struct kvm_io_range { gpa_t addr; int len; struct kvm_io_device *dev; }; struct kvm_io_bus { int dev_count; #define NR_IOBUS_DEVS 300 struct kvm_io_range range[NR_IOBUS_DEVS]; }; ^^ Right. But doesn't your patch remove precisely this limit? So what limits userspace now? To register 300 million devices...? Hi Jan, It seems we need to reserve the limitation in kvm_host.h #define NR_IOBUS_DEVS 600 /* Currently Qemu only emulates one PCI bus, it contains 32 slots, one slot contains 8 functions. Only 29 slots can be used to add multiple function devices. Maximum of supported PCI devices: 29 * 8 = 232. Each virtio-blk device needs 1 iobus device, each virtio-net(vhost) device requires 2 such devices to service notifications (ioevent) from rx/tx queues. The maximum of coalesced mmio zone is 100, each zone has an iobus devices. ioevent, pit, ioapic take less iobus devices. So we can set max limitation to 600. */ One virtio-net(vhost=on) takes two iobus devices, and it needs three IRQs for MSI/MIS-X. I started a guest with 232 virtio-net(vhost=on), guest IRQ 24 to 191 were used for virtio-config/input/output, and only 56 virtio-nics' MSIX were enabled. 56 virtio-net(vhost=on) registered 56 * 2 = 112 iobus devices. It's safe to set the limit to 300, right ? - check limit when register dev virt/kvm/kvm_main.c: /* Caller must hold slots_lock. */ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, int len, struct kvm_io_device *dev) { struct kvm_io_bus *new_bus, *bus; bus = kvm-buses[bus_idx]; if (bus-dev_count NR_IOBUS_DEVS - 1) // can only register 600 devices return -ENOSPC; Amos. -- Amos. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] kvm: notify host when guest paniced
At 02/29/2012 06:39 PM, Avi Kivity Wrote: On 02/29/2012 12:17 PM, Wen Congyang wrote: Yes, crash can be so severe that it is not even detected by a kernel itself, so not OOPS message even printed. But in most cases if kernel is functional enough to print OOPS it is functional enough to call single hypercall instruction. Why not print the oops to virtio-serial? Or even just a regular serial port? That's what bare metal does. If virtio-serial's driver has bug or the guest doesn't have such device... We have the same issue with the hypercall; and virtio-serial is available on many deployed versions. virtio-serial is available, but it is an optional device. If the guest does not have this device, the guest cannot tell the host that is is paniced. So I still prefer to touch the hypervisor. Thanks Wen Congyang Having special kdump kernel that transfers dump to a host via virtio-serial channel though sounds interesting. May be that's what you mean. Yes. The panic, starting dump signal should be initiated by the panicking kernel though, in case the dump fails. Then panic hypercall sounds like a reasonable solution. It is, but I'm trying to see if we can get away with doing nothing. If we have a reliable way with doing nothing, it is better. But I donot find such way. We won't have a 100% reliable way. But I think a variant of the driver that doesn't use interrupts, or just using the ordinary serial driver, should be reliable enough. -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v4] KVM: Resize kvm_io_range array dynamically
kvm_io_bus devices are used for ioevent, pit, pic, ioapic, coalesced_mmio. Currently Qemu only emulates one PCI bus, it contains 32 slots, one slot contains 8 functions, maximum of supported PCI devices: 1 * 32 * 8 = 256. One virtio-blk takes one iobus device, one virtio-net(vhost=on) takes two iobus devices. The maximum of coalesced mmio zone is 100, each zone has an iobus devices. So 300 io_bus devices are not enough. This patch makes the kvm_io_range array can be resized dynamically. Set an upper bounds for kvm_io_range to limit userspace. 1000 is a very large limit and not bloat the typical user. Changes from v1: - fix typo: kvm_io_bus_range - kvm_io_range Changes from v2: - unregister device only when it exists Changes from v3: - set upper bounds to limit userspace Signed-off-by: Amos Kong ak...@redhat.com --- include/linux/kvm_host.h |5 +++-- virt/kvm/kvm_main.c | 41 ++--- 2 files changed, 25 insertions(+), 21 deletions(-) diff --git a/include/linux/kvm_host.h b/include/linux/kvm_host.h index 355e445..24ee2db 100644 --- a/include/linux/kvm_host.h +++ b/include/linux/kvm_host.h @@ -67,10 +67,11 @@ struct kvm_io_range { struct kvm_io_device *dev; }; +#define NR_IOBUS_DEVS 1000 + struct kvm_io_bus { int dev_count; -#define NR_IOBUS_DEVS 300 - struct kvm_io_range range[NR_IOBUS_DEVS]; + struct kvm_io_range range[]; }; enum kvm_bus { diff --git a/virt/kvm/kvm_main.c b/virt/kvm/kvm_main.c index e4431ad..1baed68 100644 --- a/virt/kvm/kvm_main.c +++ b/virt/kvm/kvm_main.c @@ -2389,9 +2389,6 @@ int kvm_io_bus_sort_cmp(const void *p1, const void *p2) int kvm_io_bus_insert_dev(struct kvm_io_bus *bus, struct kvm_io_device *dev, gpa_t addr, int len) { - if (bus-dev_count == NR_IOBUS_DEVS) - return -ENOSPC; - bus-range[bus-dev_count++] = (struct kvm_io_range) { .addr = addr, .len = len, @@ -2491,10 +2488,14 @@ int kvm_io_bus_register_dev(struct kvm *kvm, enum kvm_bus bus_idx, gpa_t addr, struct kvm_io_bus *new_bus, *bus; bus = kvm-buses[bus_idx]; - if (bus-dev_count NR_IOBUS_DEVS-1) + if (bus-dev_count NR_IOBUS_DEVS - 1) return -ENOSPC; - new_bus = kmemdup(bus, sizeof(struct kvm_io_bus), GFP_KERNEL); + new_bus = kzalloc(sizeof(*bus) + ((bus-dev_count + 1) * + sizeof(struct kvm_io_range)), GFP_KERNEL); + if (new_bus) + memcpy(new_bus, bus, sizeof(*bus) + (bus-dev_count * + sizeof(struct kvm_io_range))); if (!new_bus) return -ENOMEM; kvm_io_bus_insert_dev(new_bus, dev, addr, len); @@ -2513,26 +2514,28 @@ int kvm_io_bus_unregister_dev(struct kvm *kvm, enum kvm_bus bus_idx, struct kvm_io_bus *new_bus, *bus; bus = kvm-buses[bus_idx]; - - new_bus = kmemdup(bus, sizeof(*bus), GFP_KERNEL); - if (!new_bus) - return -ENOMEM; - r = -ENOENT; - for (i = 0; i new_bus-dev_count; i++) - if (new_bus-range[i].dev == dev) { + for (i = 0; i bus-dev_count; i++) + if (bus-range[i].dev == dev) { r = 0; - new_bus-dev_count--; - new_bus-range[i] = new_bus-range[new_bus-dev_count]; - sort(new_bus-range, new_bus-dev_count, -sizeof(struct kvm_io_range), -kvm_io_bus_sort_cmp, NULL); break; } - if (r) { - kfree(new_bus); + if (r) return r; + + new_bus = kmemdup(bus, sizeof(*bus) + ((bus-dev_count - 1) * + sizeof(struct kvm_io_range)), GFP_KERNEL); + if (!new_bus) + return -ENOMEM; + + new_bus-dev_count--; + /* copy last entry of bus-range to deleted entry spot if + deleted entry isn't the last entry of bus-range */ + if (i != bus-dev_count - 1) { + new_bus-range[i] = bus-range[bus-dev_count - 1]; + sort(new_bus-range, new_bus-dev_count, +sizeof(struct kvm_io_range), kvm_io_bus_sort_cmp, NULL); } rcu_assign_pointer(kvm-buses[bus_idx], new_bus); -- To unsubscribe from this list: send the line unsubscribe kvm in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM
On 02/28/2012 08:16 PM, Alexander Graf wrote: When we know that we're running inside of a KVM guest, we don't have to worry about synchronizing timebases between different CPUs, since the host already took care of that. This fixes CPU overcommit scenarios where vCPUs could hang forever trying to sync each other while not being scheduled. Reported-by: Stuart Yoder b08...@freescale.com Signed-off-by: Alexander Graf ag...@suse.de This should apply to any hypervisor, not just KVM. On book3e, Power ISA says timebase is read-only on virtualized implementations. My understanding is that book3s is paravirt-only (guest state is not considered an implementation of the Power ISA), and it says Writing the Time Base is privileged, and can be done only in hypervisor state. Which platforms are you seeing this on? If it's on Freescale chips, U-Boot should be doing the sync and Linux should never do it, even in the absence of a hypervisor. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM
On 29.02.2012, at 18:50, Scott Wood scottw...@freescale.com wrote: On 02/28/2012 08:16 PM, Alexander Graf wrote: When we know that we're running inside of a KVM guest, we don't have to worry about synchronizing timebases between different CPUs, since the host already took care of that. This fixes CPU overcommit scenarios where vCPUs could hang forever trying to sync each other while not being scheduled. Reported-by: Stuart Yoder b08...@freescale.com Signed-off-by: Alexander Graf ag...@suse.de This should apply to any hypervisor, not just KVM. Sure, but do you have a generic function to evaluate that? :) On book3e, Power ISA says timebase is read-only on virtualized implementations. My understanding is that book3s is paravirt-only (guest state is not considered an implementation of the Power ISA), and it says Writing the Time Base is privileged, and can be done only in hypervisor state. For PR non-PAPR KVM, we are non-paravirt, but ignore tb writes iirc. Which platforms are you seeing this on? If it's on Freescale chips, U-Boot should be doing the sync and Linux should never do it, even in the absence of a hypervisor. This is on e500mc. Alex -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH] KVM: PPC: Don't sync timebase when inside KVM
On 02/29/2012 12:28 PM, Alexander Graf wrote: On 29.02.2012, at 18:50, Scott Wood scottw...@freescale.com wrote: On 02/28/2012 08:16 PM, Alexander Graf wrote: When we know that we're running inside of a KVM guest, we don't have to worry about synchronizing timebases between different CPUs, since the host already took care of that. This fixes CPU overcommit scenarios where vCPUs could hang forever trying to sync each other while not being scheduled. Reported-by: Stuart Yoder b08...@freescale.com Signed-off-by: Alexander Graf ag...@suse.de This should apply to any hypervisor, not just KVM. Sure, but do you have a generic function to evaluate that? :) The presence of a hypervisor node without testing compatible. Might not get them all, but at least it will cover more than just KVM. Which platforms are you seeing this on? If it's on Freescale chips, U-Boot should be doing the sync and Linux should never do it, even in the absence of a hypervisor. This is on e500mc. On e500mc Linux should never by trying to sync the timebase. If it is, let's fix that. -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 1/2] powerpc/e500: make load_up_spe a normal fuction
From: Liu Yu yu@freescale.com So that we can call it when improving SPE switch like book3e did for fp switch. Signed-off-by: Liu Yu yu@freescale.com Signed-off-by: Olivia Yin hong-hua@freescale.com --- v2: add Signed-off-by arch/powerpc/kernel/head_fsl_booke.S | 23 ++- 1 files changed, 6 insertions(+), 17 deletions(-) diff --git a/arch/powerpc/kernel/head_fsl_booke.S b/arch/powerpc/kernel/head_fsl_booke.S index d5d78c4..c96e025 100644 --- a/arch/powerpc/kernel/head_fsl_booke.S +++ b/arch/powerpc/kernel/head_fsl_booke.S @@ -539,8 +539,10 @@ interrupt_base: /* SPE Unavailable */ START_EXCEPTION(SPEUnavailable) NORMAL_EXCEPTION_PROLOG - bne load_up_spe - addir3,r1,STACK_FRAME_OVERHEAD + beq 1f + bl load_up_spe + b fast_exception_return +1: addir3,r1,STACK_FRAME_OVERHEAD EXC_XFER_EE_LITE(0x2010, KernelSPE) #else EXCEPTION(0x2020, SPEUnavailable, unknown_exception, EXC_XFER_EE) @@ -743,7 +745,7 @@ tlb_write_entry: /* Note that the SPE support is closely modeled after the AltiVec * support. Changes to one are likely to be applicable to the * other! */ -load_up_spe: +_GLOBAL(load_up_spe) /* * Disable SPE for the task which had SPE previously, * and save its SPE registers in its thread_struct. @@ -791,20 +793,7 @@ load_up_spe: subir4,r5,THREAD stw r4,last_task_used_spe@l(r3) #endif /* !CONFIG_SMP */ - /* restore registers and return */ -2: REST_4GPRS(3, r11) - lwz r10,_CCR(r11) - REST_GPR(1, r11) - mtcrr10 - lwz r10,_LINK(r11) - mtlrr10 - REST_GPR(10, r11) - mtspr SPRN_SRR1,r9 - mtspr SPRN_SRR0,r12 - REST_GPR(9, r11) - REST_GPR(12, r11) - lwz r11,GPR11(r11) - rfi + blr /* * SPE unavailable trap from kernel - print a message, but let -- 1.6.4 -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 2/2] KVM: booke: Improve SPE switch
From: Liu Yu yu@freescale.com Like book3s did for fp switch, instead of switch SPE between host and guest, the patch switch SPE state between qemu and guest. In this way, we can simulate a host loadup SPE when load guest SPE state, and let host to decide when to giveup SPE state. Therefor it cooperates better with host SPE usage, and so that has some performance benifit in UP host(lazy SPE). Moreover, since the patch save guest SPE state into linux thread field, it creates the condition to emulate guest SPE instructions in host, so that we can avoid injecting SPE exception to guest. The patch also turns all asm code into C code, and add SPE stat counts. Signed-off-by: Liu Yu yu@freescale.com Signed-off-by: Olivia Yin hong-hua@freescale.com --- v2: Keep shadow MSR[SPE] consistent with thread MSR[SPE] in kvmppc_core_vcpu_load arch/powerpc/include/asm/kvm_host.h | 11 +- arch/powerpc/kernel/asm-offsets.c |7 arch/powerpc/kvm/booke.c| 63 +++ arch/powerpc/kvm/booke.h|8 + arch/powerpc/kvm/booke_interrupts.S | 37 arch/powerpc/kvm/e500.c | 13 --- arch/powerpc/kvm/timing.c |5 +++ arch/powerpc/kvm/timing.h | 11 ++ 8 files changed, 91 insertions(+), 64 deletions(-) diff --git a/arch/powerpc/include/asm/kvm_host.h b/arch/powerpc/include/asm/kvm_host.h index 1843d5d..6186d08 100644 --- a/arch/powerpc/include/asm/kvm_host.h +++ b/arch/powerpc/include/asm/kvm_host.h @@ -117,6 +117,11 @@ struct kvm_vcpu_stat { u32 st; u32 st_slow; #endif +#ifdef CONFIG_SPE + u32 spe_unavail; + u32 spe_fp_data; + u32 spe_fp_round; +#endif }; enum kvm_exit_types { @@ -147,6 +152,11 @@ enum kvm_exit_types { FP_UNAVAIL, DEBUG_EXITS, TIMEINGUEST, +#ifdef CONFIG_SPE + SPE_UNAVAIL, + SPE_FP_DATA, + SPE_FP_ROUND, +#endif __NUMBER_OF_KVM_EXIT_TYPES }; @@ -330,7 +340,6 @@ struct kvm_vcpu_arch { #ifdef CONFIG_SPE ulong evr[32]; ulong spefscr; - ulong host_spefscr; u64 acc; #endif #ifdef CONFIG_ALTIVEC diff --git a/arch/powerpc/kernel/asm-offsets.c b/arch/powerpc/kernel/asm-offsets.c index 8e0db0b..ff68f71 100644 --- a/arch/powerpc/kernel/asm-offsets.c +++ b/arch/powerpc/kernel/asm-offsets.c @@ -604,13 +604,6 @@ int main(void) DEFINE(TLBCAM_MAS7, offsetof(struct tlbcam, MAS7)); #endif -#if defined(CONFIG_KVM) defined(CONFIG_SPE) - DEFINE(VCPU_EVR, offsetof(struct kvm_vcpu, arch.evr[0])); - DEFINE(VCPU_ACC, offsetof(struct kvm_vcpu, arch.acc)); - DEFINE(VCPU_SPEFSCR, offsetof(struct kvm_vcpu, arch.spefscr)); - DEFINE(VCPU_HOST_SPEFSCR, offsetof(struct kvm_vcpu, arch.host_spefscr)); -#endif - #ifdef CONFIG_KVM_EXIT_TIMING DEFINE(VCPU_TIMING_EXIT_TBU, offsetof(struct kvm_vcpu, arch.timing_exit.tv32.tbu)); diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c index ee9e1ee..f20010b 100644 --- a/arch/powerpc/kvm/booke.c +++ b/arch/powerpc/kvm/booke.c @@ -55,6 +55,11 @@ struct kvm_stats_debugfs_item debugfs_entries[] = { { dec,VCPU_STAT(dec_exits) }, { ext_intr, VCPU_STAT(ext_intr_exits) }, { halt_wakeup, VCPU_STAT(halt_wakeup) }, +#ifdef CONFIG_SPE + { spe_unavail, VCPU_STAT(spe_unavail) }, + { spe_fp_data, VCPU_STAT(spe_fp_data) }, + { spe_fp_round, VCPU_STAT(spe_fp_round) }, +#endif { NULL } }; @@ -80,11 +85,11 @@ void kvmppc_dump_vcpu(struct kvm_vcpu *vcpu) } #ifdef CONFIG_SPE -void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu) +static void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu) { preempt_disable(); - enable_kernel_spe(); - kvmppc_save_guest_spe(vcpu); + if (current-thread.regs-msr MSR_SPE) + giveup_spe(current); vcpu-arch.shadow_msr = ~MSR_SPE; preempt_enable(); } @@ -92,8 +97,10 @@ void kvmppc_vcpu_disable_spe(struct kvm_vcpu *vcpu) static void kvmppc_vcpu_enable_spe(struct kvm_vcpu *vcpu) { preempt_disable(); - enable_kernel_spe(); - kvmppc_load_guest_spe(vcpu); + if (!(current-thread.regs-msr MSR_SPE)) { + load_up_spe(NULL); + current-thread.regs-msr |= MSR_SPE; + } vcpu-arch.shadow_msr |= MSR_SPE; preempt_enable(); } @@ -104,7 +111,7 @@ static void kvmppc_vcpu_sync_spe(struct kvm_vcpu *vcpu) if (!(vcpu-arch.shadow_msr MSR_SPE)) kvmppc_vcpu_enable_spe(vcpu); } else if (vcpu-arch.shadow_msr MSR_SPE) { - kvmppc_vcpu_disable_spe(vcpu); + vcpu-arch.shadow_msr = ~MSR_SPE; } } #else @@ -124,7 +131,8 @@ void kvmppc_set_msr(struct kvm_vcpu *vcpu, u32 new_msr) vcpu-arch.shared-msr = new_msr; kvmppc_mmu_msr_notify(vcpu, old_msr); -