[v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()
We should ensure that preemption cannot occur while calling get_paca()
inside hard_irq_disable(); otherwise the paca_struct may be the wrong
one just afterwards. And, incidentally, we may update timing stats in
this case.

Signed-off-by: Tiejun Chen <tiejun.c...@windriver.com>
---
 arch/powerpc/kvm/booke.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index dcc94f0..9dae25d 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -839,6 +839,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	WARN_ON(local_paca->irq_happened != 0);
 #endif

+	preempt_disable();
 	/*
 	 * We enter with interrupts disabled in hardware, but
 	 * we need to call hard_irq_disable anyway to ensure that
@@ -848,6 +849,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	/* update before a new last_exit_type is rewritten */
 	kvmppc_update_timing_stats(vcpu);

+	preempt_enable();
 	/* restart interrupts if they were meant for the host */
 	kvmppc_restart_interrupt(vcpu, exit_nr);
--
1.7.9.5
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in
the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
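The race the patch is trying to close is easiest to see spelled out. The following is an illustrative kernel-context sketch, not standalone-compilable code; the PACA field access is schematic, and it assumes hard_irq_disable() expands to per-CPU PACA accesses roughly like these:

```c
/*
 * Sketch only: on ppc64, get_paca() yields the per-CPU PACA, reached
 * through r13.  With preemption enabled, the task can be migrated
 * between obtaining the pointer and dereferencing it.
 */
struct paca_struct *paca = get_paca();  /* PACA of the CPU we run on now */
/*
 * <-- if the scheduler preempts and migrates us here, 'paca' still
 *     points at the old CPU's PACA, so the store below hits the wrong
 *     per-CPU state
 */
paca->soft_enabled = 0;  /* schematic stand-in for what the macro updates */
```

Whether that window is actually reachable on this exit path, which runs with interrupts hard-disabled, is exactly what the review discussion below turns on.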
Re: [PATCH 0/8] kvm/ppc: fixes for 3.10
On 07/10/2013 05:33 AM, Alexander Graf wrote:
> On 09.07.2013, at 11:09, tiejun.chen wrote:
>> On 07/09/2013 04:53 PM, Alexander Graf wrote:
>>> On 09.07.2013 at 09:21, tiejun.chen <tiejun.c...@windriver.com> wrote:
>>>> On 06/07/2013 08:16 AM, Scott Wood wrote:
>>>>> Most of these have been posted before, but I grouped them together
>>>>> as there are some contextual dependencies between them.
>>>>>
>>>>> Gleb/Paolo: As Alex doesn't appear to be back yet, can you apply
>>>>> these if there's no objection over the next few days?
>>>> Scott and Alex,
>>>> Any plan to merge these commits into
>>>> git://github.com/agraf/linux-2.6.git as well?
>>> Which commits exactly are you missing there?
>> Alex,
>>
>> Four patches,
>>   kvm/ppc/booke: Delay kvmppc_lazy_ee_enable
>>   kvm/ppc/booke64: Fix lazy ee handling in kvmppc_handle_exit()
>>   kvm/ppc/booke: Hold srcu lock when calling gfn functions
>>   kvm/ppc/booke64: Disable e6500 support
>> are already in Linus's tree,
>> https://git.kernel.org/cgit/linux/kernel/git/torvalds/linux.git/log/?qt=author&q=Scott+Wood
>> So should we sync these commits into your tree?
> They should already be in there?

Yes, I already see this.

Thanks,
Tiejun
Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()
On 10.07.2013, at 08:02, Tiejun Chen wrote:
> We should ensure the preemption cannot occur while calling get_paca()
> inside hard_irq_disable(), otherwise the paca_struct may be the wrong
> one just after. And btw, we may update timing stats in this case.
>
> Signed-off-by: Tiejun Chen <tiejun.c...@windriver.com>
> ---
>  arch/powerpc/kvm/booke.c | 2 ++
>  1 file changed, 2 insertions(+)
>
> diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
> index dcc94f0..9dae25d 100644
> --- a/arch/powerpc/kvm/booke.c
> +++ b/arch/powerpc/kvm/booke.c
> @@ -839,6 +839,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
>  	WARN_ON(local_paca->irq_happened != 0);
>  #endif
>
> +	preempt_disable();
>  	/*
>  	 * We enter with interrupts disabled in hardware, but
>  	 * we need to call hard_irq_disable anyway to ensure that
> @@ -848,6 +849,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
>  	/* update before a new last_exit_type is rewritten */
>  	kvmppc_update_timing_stats(vcpu);
>
> +	preempt_enable();

All of the code here is already called with interrupts disabled. I
don't see how we could preempt then?

Alex

>  	/* restart interrupts if they were meant for the host */
>  	kvmppc_restart_interrupt(vcpu, exit_nr);
> --
> 1.7.9.5
Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls
On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote:
> On 07/10/2013 03:02 AM, Alexander Graf wrote:
>> On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote:
>>> This adds real mode handlers for the H_PUT_TCE_INDIRECT and
>>> H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO
>>> devices or emulated PCI. These calls allow adding multiple entries
>>> (up to 512) into the TCE table in one call which saves time on
>>> transition to/from real mode.
>> We don't mention QEMU explicitly in KVM code usually.
>>> This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs
>>> (copied from user and verified) before writing the whole list into
>>> the TCE table. This cache will be utilized more in the upcoming
>>> VFIO/IOMMU support to continue TCE list processing in the virtual
>>> mode in the case if the real mode handler failed for some reason.
>>>
>>> This adds a guest physical to host real address converter and calls
>>> the existing H_PUT_TCE handler. The converting function is going to
>>> be fully utilized by upcoming VFIO supporting patches.
>>>
>>> This also implements the KVM_CAP_PPC_MULTITCE capability, so in
>>> order to support the functionality of this patch, QEMU needs to
>>> query for this capability and set the hcall-multi-tce hypertas
>>> property only if the capability is present, otherwise there will be
>>> serious performance degradation.
>> Same as above. But really you're only giving recommendations here.
>> What's the point? Please describe what the benefit of this patch is,
>> not what some other random subsystem might do with the benefits it
>> brings.
>>> Signed-off-by: Paul Mackerras <pau...@samba.org>
>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>> ---
>>> Changelog:
>>> 2013/07/06:
>>> * fixed number of wrong get_page()/put_page() calls
>>>
>>> 2013/06/27:
>>> * fixed clear of BUSY bit in kvmppc_lookup_pte()
>>> * H_PUT_TCE_INDIRECT does realmode_get_page() now
>>> * KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
>>> * updated doc
>>>
>>> 2013/06/05:
>>> * fixed mistype about IBMVIO in the commit message
>>> * updated doc and moved it to another section
>>> * changed capability number
>>>
>>> 2013/05/21:
>>> * added kvm_vcpu_arch::tce_tmp
>>> * removed cleanup if put_indirect failed, instead we do not even
>>>   start writing to TCE table if we cannot get TCEs from the user and
>>>   they are invalid
>>> * kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce and
>>>   kvmppc_emulated_validate_tce (for the previous item)
>>> * fixed bug with failthrough for H_IPI
>>> * removed all get_user() from real mode handlers
>>> * kvmppc_lookup_pte() added (instead of making lookup_linux_pte
>>>   public)
>>>
>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>> ---
>>>  Documentation/virtual/kvm/api.txt       |  25 +++
>>>  arch/powerpc/include/asm/kvm_host.h     |   9 ++
>>>  arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
>>>  arch/powerpc/kvm/book3s_64_vio.c        | 154 ++-
>>>  arch/powerpc/kvm/book3s_64_vio_hv.c     | 260
>>>  arch/powerpc/kvm/book3s_hv.c            |  41 -
>>>  arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
>>>  arch/powerpc/kvm/book3s_pr_papr.c       |  37 -
>>>  arch/powerpc/kvm/powerpc.c              |   3 +
>>>  9 files changed, 517 insertions(+), 34 deletions(-)
>>>
>>> diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
>>> index 6365fef..762c703 100644
>>> --- a/Documentation/virtual/kvm/api.txt
>>> +++ b/Documentation/virtual/kvm/api.txt
>>> @@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to userspace to be handled.
>>>
>>> +4.86 KVM_CAP_PPC_MULTITCE
>>> +
>>> +Capability: KVM_CAP_PPC_MULTITCE
>>> +Architectures: ppc
>>> +Type: vm
>>> +
>>> +This capability means the kernel is capable of handling hypercalls
>>> +H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
>>> +space.
>>> +This significanly accelerates DMA operations for PPC KVM guests.
>> significanly? Please run this through a spell checker.
>>> +The user space should expect that its handlers for these hypercalls
>> s/The//
>>> +are not going to be called.
>> Is user space guaranteed they will not be called? Or can it still
>> happen?
> ... if user space previously registered LIOBN in KVM (via
> KVM_CREATE_SPAPR_TCE or similar calls). ok?

How about this?

"The hypercalls mentioned above may or may not be processed
successfully in the kernel based fast path. If they can not be handled
by the kernel, they will get passed on to user space. So user space
still has to have an implementation for these despite the in kernel
acceleration."

---

The target audience for this documentation is user space KVM API users.
Someone developing kvm tool for example. They want to know implications
specific CAPs have.

There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel
yet and may never get there.

>>> +In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
>>> +the user space might have to advertise it for the guest. For example,
>>> +IBM pSeries guest starts using them if hcall-multi-tce is present in
>>> +the ibm,hypertas-functions
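The user-space side of the negotiation discussed in this review -- query the capability first, and only then advertise hcall-multi-tce to the guest -- can be sketched as follows. KVM_CHECK_EXTENSION is the standard KVM capability-query ioctl, but the capability number 93 is mirrored from the hunk quoted in this thread rather than from a released header, so treat both defines as assumptions:

```c
#include <fcntl.h>
#include <sys/ioctl.h>
#include <unistd.h>

/* Assumed values, mirrored from the patch under review rather than a
 * released <linux/kvm.h>. */
#define KVM_CAP_SPAPR_MULTITCE 93
#define KVM_CHECK_EXTENSION _IO(0xAE, 0x03)   /* KVMIO = 0xAE */

/*
 * Returns 1 if the kernel handles H_PUT_TCE_INDIRECT/H_STUFF_TCE in
 * its fast path, 0 if not, and -1 if /dev/kvm cannot be opened (e.g.
 * a host without KVM).
 */
int multitce_supported(void)
{
    int kvm = open("/dev/kvm", O_RDONLY);
    int ret;

    if (kvm < 0)
        return -1;
    ret = ioctl(kvm, KVM_CHECK_EXTENSION, KVM_CAP_SPAPR_MULTITCE);
    close(kvm);
    return ret > 0 ? 1 : 0;
}
```

Per the review discussion, a positive answer still does not guarantee the hypercalls never reach user space, so a VMM would keep its own handlers either way.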
Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation
On 10.07.2013, at 02:06, Scott Wood wrote:
> On 07/09/2013 04:44:24 PM, Alexander Graf wrote:
>> On 09.07.2013, at 20:46, Scott Wood wrote:
>>> I suspect that tlbsx is faster, or at worst similar. And unlike
>>> comparing tlbsx to lwepx (not counting a fix for the threading
>>> problem), we don't already have code to search the guest TLB, so
>>> testing would be more work.
>> We have code to walk the guest TLB for TLB misses. This really is
>> just the TLB miss search without host TLB injection.
>>
>> So let's say we're using the shadow TLB. The guest always has its,
>> say, 64 TLB entries that it can count on - we never evict anything by
>> accident, because we store all of the 64 entries in our guest TLB
>> cache. When the guest faults at an address, the first thing we do is
>> check the cache whether we have that page already mapped.
>>
>> However, with this method we now have 2 enumeration methods for guest
>> TLB searches. We have the tlbsx one which searches the host TLB, and
>> we have our guest TLB cache. The guest TLB cache might still contain
>> an entry for an address that we already invalidated on the host.
>> Would that impose a problem? I guess not, because we're swizzling the
>> exit code around to instead be an instruction miss, which means we
>> restore the TLB entry into our host's TLB so that when we resume, we
>> land here and the tlbsx hits. But it feels backwards.
>>
>> Any better way?
> Searching the guest TLB won't work for the LRAT case, so we'd need to
> have this logic around anyway. We shouldn't add a second codepath
> unless it's a clear performance gain -- and again, I suspect it would
> be the opposite, especially if the entry is not in TLB0 or in one of
> the first few entries searched in TLB1. The tlbsx miss case is not
> what we should optimize for.

Hrm. So let's redesign this thing theoretically. We would have an exit
that requires an instruction fetch. We would override
kvmppc_get_last_inst() to always do kvmppc_ld_inst(). That one can fail
because it can't find the TLB entry in the host TLB.
When it fails, we have to abort the emulation and resume the guest at
the same IP. Now the guest gets the TLB miss, we populate, and go back
into the guest. The guest hits the emulation failure again. We go back
to kvmppc_ld_inst(), which succeeds this time, and we can emulate the
instruction.

I think this works. Just make sure that the gateway to the instruction
fetch is kvmppc_get_last_inst() and make that failable. Then the
difference between looking for the TLB entry in the host's TLB or in
the guest's TLB cache is hopefully negligible.

>> At least this code has to become something more generic, such as
>> kvmppc_read_guest(vcpu, addr, TYPE_INSN), and move into the host mmu
>> implementation, as it's 100% host mmu specific.
> I agree that e500_mmu_host.c is a better place for it (with an ifdef
> for BOOKEHV), but supporting anything other than instruction fetches
> could wait until we have a user for it (it means extra code to figure
> out if permissions are correct).

Works for me, as long as it's either documented through BUG_ON/WARN_ONs
or an explicit naming convention.

Alex
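The abort-and-retry flow described above can be modelled in miniature in plain user-space C. Everything here is invented for illustration (a toy TLB, an arbitrary instruction word); it mirrors only the control flow -- failable fetch, guest repopulates, retry at the same IP -- not the real KVM functions:

```c
#include <stdbool.h>
#include <stdint.h>

#define TLB_SIZE 4

/* Toy TLB: maps a guest effective address to the instruction stored there. */
struct tlb_entry { uint64_t ea; uint32_t inst; bool valid; };
static struct tlb_entry tlb[TLB_SIZE];

/* Failable fetch, in the spirit of a kvmppc_get_last_inst() that may
 * miss in the host TLB. */
bool try_get_last_inst(uint64_t ea, uint32_t *inst)
{
    for (int i = 0; i < TLB_SIZE; i++) {
        if (tlb[i].valid && tlb[i].ea == ea) {
            *inst = tlb[i].inst;
            return true;
        }
    }
    return false;  /* caller aborts emulation and re-enters the guest */
}

/* Stand-in for the guest taking the TLB miss and the entry being
 * repopulated before execution resumes at the same IP. */
void guest_tlb_miss(uint64_t ea, uint32_t inst)
{
    tlb[0] = (struct tlb_entry){ .ea = ea, .inst = inst, .valid = true };
}

/* The retry loop: on a miss, "resume the guest" (here: repopulate) and
 * come back around; the second pass succeeds and emulation proceeds. */
uint32_t emulate_at(uint64_t ea)
{
    uint32_t inst;

    while (!try_get_last_inst(ea, &inst))
        guest_tlb_miss(ea, 0x7c0004acu);  /* arbitrary instruction word */
    return inst;
}
```

The round-trip cost debated elsewhere in the thread is exactly the extra pass through this loop.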
Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation
On 10.07.2013, at 02:12, Scott Wood wrote:
> On 07/09/2013 04:45:10 PM, Alexander Graf wrote:
>> On 28.06.2013, at 11:20, Mihai Caraman wrote:
>>> +	/* Get page size */
>>> +	if (MAS0_GET_TLBSEL(mfspr(SPRN_MAS0)) == 0)
>>> +		psize_shift = PAGE_SHIFT;
>>> +	else
>>> +		psize_shift = MAS1_GET_TSIZE(mas1) + 10;
>>> +
>>> +	mas7_mas3 = (((u64) mfspr(SPRN_MAS7)) << 32) |
>>> +		    mfspr(SPRN_MAS3);
>>> +	addr = (mas7_mas3 & (~0ULL << psize_shift)) |
>>> +	       (geaddr & ((1ULL << psize_shift) - 1ULL));
>>> +
>>> +	/* Map a page and get guest's instruction */
>>> +	page = pfn_to_page(addr >> PAGE_SHIFT);
>> While looking at this I just realized that you're missing a check
>> here. What if our IP is in some PCI BAR? Or can't we execute from
>> those? We at least need to check pfn_valid() first.
> That'll just keep us from accessing a bad pointer in the host kernel,
> though -- it won't make the emulation actually work. If we need that,
> we'll probably need to create a temporary TLB entry manually.

ioremap()?

However, if we were walking the guest TLB cache instead we would get a
guest physical address which we can always resolve to a host virtual
address. I'm not sure how important that whole use case is though.
Maybe we should just error out to the guest for now.

Alex
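Separately from the pfn_valid() question, the address computation in the quoted hunk is plain masking arithmetic and can be sanity-checked in isolation. This is a user-space restatement with the MAS register reads turned into parameters; the +10 tsize encoding follows the quoted code and is otherwise an assumption here:

```c
#include <stdint.h>

/* Page-size shift from the MAS1 TSIZE field, as the quoted code
 * computes it: shift = tsize + 10 (so TSIZE=2 -> 4KB pages). */
unsigned int tsize_to_shift(unsigned int tsize)
{
    return tsize + 10;
}

/*
 * Combine the real page number from MAS7||MAS3 with the in-page offset
 * of the guest effective address:
 *   addr = (mas7_mas3 & (~0ULL << psize_shift)) |
 *          (geaddr    & ((1ULL << psize_shift) - 1ULL));
 */
uint64_t guest_real_addr(uint64_t mas7_mas3, uint64_t geaddr,
                         unsigned int psize_shift)
{
    return (mas7_mas3 & (~0ULL << psize_shift)) |
           (geaddr & ((1ULL << psize_shift) - 1ULL));
}
```

For a 4KB page (shift 12), the top bits come from the TLB entry and the low 12 bits from the faulting address, which is what lets the code then do pfn_to_page(addr >> PAGE_SHIFT).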
Re: [PATCH 2/2] KVM: PPC: Book3E: Emulate MCSRR0/1 SPR and rfmci instruction
On 10.07.2013, at 00:26, Scott Wood wrote:
> On 07/09/2013 05:00:26 PM, Alexander Graf wrote:
>> On 09.07.2013, at 23:54, Scott Wood wrote:
>>> On 07/09/2013 04:49:32 PM, Alexander Graf wrote:
>>>> Not sure I understand. What the timing stats do is that they
>>>> measure the time between [exit ... entry], right? We'd do the same
>>>> thing, just all in C code. That means we would become slightly less
>>>> accurate, but gain dynamic enabling of the traces and get rid of
>>>> all the timing stat asm code.
>>> Compile-time enabling bothers me less than a loss of accuracy (not
>>> just a small loss by moving into C code, but a potential for a large
>>> loss if we overflow the buffer)
>> Then don't overflow the buffer. Make it large enough.
> How large is that? Does the tool recognize and report when overflow
> happens? How much will the overhead of running some python script on
> the host, consuming a large volume of data, affect the results?
>> IIRC ftrace improved recently to dynamically increase the buffer size
>> too. Steven, do I remember correctly here?
> Yay more complexity. So now we get to worry about possible memory
> allocations happening when we try to log something? Or if there is a
> way to do an atomic log, we're back to the buffer-might-be-full
> situation.
>>> and a dependency on a userspace tool
>> We already have that for kvm_stat. It's a simple python script - and
>> you surely have python on your rootfs, no?
>>> (both in terms of the tool needing to be written, and in the hassle
>>> of ensuring that it's present in the root filesystem of whatever
>>> system I'm testing). And the whole mechanism will be more
>>> complicated.
>> It'll also be more flexible at the same time. You could take the logs
>> and actually check what's going on to debug issues that you're
>> encountering, for example. We could even go as far as sharing the
>> same tool with other architectures, so that we only have to learn how
>> to debug things once.
> Have you encountered an actual need for this flexibility, or is it
> theoretical?
Yeah, the first thing I did back then to actually debug kvm failures
was to add trace points.

> Is there common infrastructure for dealing with measuring intervals
> and tracking statistics thereof, rather than just tracking points and
> letting userspace connect the dots (though it could still do that as
> an option)? Even if it must be done in userspace, it doesn't seem
> like something that should be KVM-specific.

Would you like to have different ways of measuring mm subsystem
overhead? I don't :). The same goes for KVM really. If we could
converge towards a single user space interface to get exit timings,
it'd make debugging a lot easier.

We already have this for the debugfs counters btw. And the timing
framework does break kvm_stat today already, as it emits textual stats
rather than numbers, which all of the other debugfs stats do. But at
least I can take the x86 kvm_stat tool and run it on ppc just fine to
see exit stats.

> Lots of debug options are enabled at build time; why must this be
> different?

Because I think it's valuable as a debug tool for cases where compile
time switches are not the best way of debugging things. It's not a high
profile thing to tackle for me tbh, but I don't really think working
heavily on the timing stat thing is the correct path to walk along.

> Adding new exit types isn't working heavily on it.

No, but the fact that the first patch is a fix to add exit stats for
exits that we missed out on before doesn't give me a lot of confidence
that lots of people use timing stats. And I am always very wary of
#ifdef'ed code, as it blows up the test matrix heavily.

Alex
Re: [PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
On 10.07.2013, at 01:35, Alexey Kardashevskiy wrote:
> On 07/10/2013 01:35 AM, Alexander Graf wrote:
>> On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>> ---
>>>  include/uapi/linux/kvm.h | 2 ++
>>>  1 file changed, 2 insertions(+)
>>>
>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>> index 970b1f5..0865c01 100644
>>> --- a/include/uapi/linux/kvm.h
>>> +++ b/include/uapi/linux/kvm.h
>>> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>>>  #define KVM_CAP_PPC_RTAS 91
>>>  #define KVM_CAP_IRQ_XICS 92
>>>  #define KVM_CAP_SPAPR_MULTITCE 93
>>> +#define KVM_CAP_SPAPR_TCE_IOMMU 94
>>>
>>>  #ifdef KVM_CAP_IRQ_ROUTING
>>> @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
>>>  /* Available with KVM_CAP_PPC_ALLOC_HTAB */
>>>  #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
>>>  #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct kvm_create_spapr_tce)
>>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu)
>> Please order them by number.
> Oh. Again :( We have had this discussion with Scott Wood here already.
> Where _exactly_ do you want me to put it?

8 lines further down. With a comment saying when it's available. Also,
why is it af, not ad?

> Many sections, not really ordered. Thank you.

They should all be ordered inside of their own categories.

Alex
Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
On Wed, 2013-07-10 at 12:33 +0200, Alexander Graf wrote:
> It's not exactly obvious that you're calling it with writing == 1 :).
> Can you create a new local variable is_write in the calling function,
> set that to 1 before the call to get_user_pages_fast and pass it in
> instead of the 1? The compiler should easily optimize all of that
> away, but it makes the code by far easier to read.

Ugh? Nobody else does that (look at futex :-)

Ben.
Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
On 10.07.2013, at 12:39, Benjamin Herrenschmidt wrote:
> On Wed, 2013-07-10 at 12:33 +0200, Alexander Graf wrote:
>> It's not exactly obvious that you're calling it with writing == 1 :).
>> Can you create a new local variable is_write in the calling function,
>> set that to 1 before the call to get_user_pages_fast and pass it in
>> instead of the 1? The compiler should easily optimize all of that
>> away, but it makes the code by far easier to read.
> Ugh? Nobody else does that (look at futex :-)

Yeah, that's fortunately code that I don't have to read :).

Alex
Re: [PATCH 8/8] KVM: PPC: Add hugepage support for IOMMU in-kernel handling
On 10.07.2013, at 12:40, Alexander Graf wrote:
> On 10.07.2013, at 12:39, Benjamin Herrenschmidt wrote:
>> On Wed, 2013-07-10 at 12:33 +0200, Alexander Graf wrote:
>>> It's not exactly obvious that you're calling it with writing == 1
>>> :). Can you create a new local variable is_write in the calling
>>> function, set that to 1 before the call to get_user_pages_fast and
>>> pass it in instead of the 1? The compiler should easily optimize all
>>> of that away, but it makes the code by far easier to read.
>> Ugh? Nobody else does that (look at futex :-)
> Yeah, that's fortunately code that I don't have to read :).

The proper alternative would be to pass an enum for read/write into the
function rather than an int. But that'd be a pretty controversial, big
change that I'd rather not put on Alexey. With a local variable we're
nicely self-contained readable ;)

Alex
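The change being requested here is purely cosmetic and fits in two lines. This is a hedged sketch with a hypothetical caller -- the surrounding function isn't quoted in this thread, so the variable names around the call are invented:

```c
/* Before: the bare third argument gives no hint that 1 means "writable". */
ret = get_user_pages_fast(hva, 1, 1, &page);

/* After: a named local makes the call site self-documenting; the
 * compiler folds the constant right back in. */
int is_write = 1;
ret = get_user_pages_fast(hva, 1, is_write, &page);
```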
Re: [PATCH 1/2] KVM: PPC: Book3S HV: Correct tlbie usage
On 08.07.2013, at 12:08, Paul Mackerras wrote:
> This corrects the usage of the tlbie (TLB invalidate entry)
> instruction in HV KVM. The tlbie instruction changed between PPC970
> and POWER7. On the PPC970, the bit to select large vs. small page is
> in the instruction, not in the RB register value. This changes the
> code to use the correct form on PPC970.
>
> On POWER7 we were calculating the AVAL (Abbreviated Virtual Address,
> Lower) field of the RB value incorrectly for 64k pages. This fixes it.
>
> Since we now have several cases to handle for the tlbie instruction,
> this factors out the code to do a sequence of tlbies into a new
> function, do_tlbies(), and calls that from the various places where
> the code was doing tlbie instructions inline. It also makes
> kvmppc_h_bulk_remove() use the same global_invalidates() function for
> determining whether to do local or global TLB invalidations as is used
> in other places, for consistency, and also to make sure that
> kvm->arch.need_tlb_flush gets updated properly.
>
> Signed-off-by: Paul Mackerras <pau...@samba.org>
> Cc: sta...@vger.kernel.org

Thanks, applied both to kvm-ppc-queue.

Alex
Re: [PATCH 2/8] KVM: PPC: reserve a capability and ioctl numbers for realmode VFIO
On 07/10/2013 08:27 PM, Alexander Graf wrote:
> On 10.07.2013, at 01:35, Alexey Kardashevskiy wrote:
>> On 07/10/2013 01:35 AM, Alexander Graf wrote:
>>> On 06/27/2013 07:02 AM, Alexey Kardashevskiy wrote:
>>>> Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
>>>> ---
>>>>  include/uapi/linux/kvm.h | 2 ++
>>>>  1 file changed, 2 insertions(+)
>>>>
>>>> diff --git a/include/uapi/linux/kvm.h b/include/uapi/linux/kvm.h
>>>> index 970b1f5..0865c01 100644
>>>> --- a/include/uapi/linux/kvm.h
>>>> +++ b/include/uapi/linux/kvm.h
>>>> @@ -667,6 +667,7 @@ struct kvm_ppc_smmu_info {
>>>>  #define KVM_CAP_PPC_RTAS 91
>>>>  #define KVM_CAP_IRQ_XICS 92
>>>>  #define KVM_CAP_SPAPR_MULTITCE 93
>>>> +#define KVM_CAP_SPAPR_TCE_IOMMU 94
>>>>
>>>>  #ifdef KVM_CAP_IRQ_ROUTING
>>>> @@ -923,6 +924,7 @@ struct kvm_s390_ucas_mapping {
>>>>  /* Available with KVM_CAP_PPC_ALLOC_HTAB */
>>>>  #define KVM_PPC_ALLOCATE_HTAB _IOWR(KVMIO, 0xa7, __u32)
>>>>  #define KVM_CREATE_SPAPR_TCE _IOW(KVMIO, 0xa8, struct kvm_create_spapr_tce)
>>>> +#define KVM_CREATE_SPAPR_TCE_IOMMU _IOW(KVMIO, 0xaf, struct kvm_create_spapr_tce_iommu)
>>> Please order them by number.
>> Oh. Again :( We have had this discussion with Scott Wood here
>> already. Where _exactly_ do you want me to put it?
> 8 lines further down. With a comment saying when it's available. Also,
> why is it af, not ad?

0xad and 0xae are taken. Where should I have commented this? In the
commit message? Or in the patch itself?

>> Many sections, not really ordered. Thank you.
> They should all be ordered inside of their own categories.

--
Alexey
Re: [PATCH 2/2] KVM: PPC: Book3E: Emulate MCSRR0/1 SPR and rfmci instruction
On 07/10/2013 05:23:36 AM, Alexander Graf wrote:
> On 10.07.2013, at 00:26, Scott Wood wrote:
>> On 07/09/2013 05:00:26 PM, Alexander Graf wrote:
>>> It'll also be more flexible at the same time. You could take the
>>> logs and actually check what's going on to debug issues that you're
>>> encountering, for example. We could even go as far as sharing the
>>> same tool with other architectures, so that we only have to learn
>>> how to debug things once.
>> Have you encountered an actual need for this flexibility, or is it
>> theoretical?
> Yeah, the first thing I did back then to actually debug kvm failures
> was to add trace points.

I meant specifically for handling exit timings this way.

>> Is there common infrastructure for dealing with measuring intervals
>> and tracking statistics thereof, rather than just tracking points and
>> letting userspace connect the dots (though it could still do that as
>> an option)? Even if it must be done in userspace, it doesn't seem
>> like something that should be KVM-specific.
> Would you like to have different ways of measuring mm subsystem
> overhead? I don't :). The same goes for KVM really. If we could
> converge towards a single user space interface to get exit timings,
> it'd make debugging a lot easier.

I agree -- that's why I said it doesn't seem like something that should
be KVM-specific. But that's orthogonal to whether it's done in kernel
space or user space. The ability to get begin/end events from userspace
would be nice when it is specifically requested, but it would also be
nice if the kernel could track some basic statistics so we wouldn't
have to ship so much data around to arrive at the same result.

At the very least, I'd like such a tool/infrastructure to exist before
we start complaining about doing minor maintenance of the current
mechanism.

> We already have this for the debugfs counters btw. And the timing
> framework does break kvm_stat today already, as it emits textual stats
> rather than numbers, which all of the other debugfs stats do.
> But at least I can take the x86 kvm_stat tool and run it on ppc just
> fine to see exit stats.

We already have what? The last two sentences seem contradictory -- can
you or can't you use kvm_stat as is? I'm not familiar with kvm_stat.
What does x86 KVM expose in debugfs?

>> Lots of debug options are enabled at build time; why must this be
>> different?
> Because I think it's valuable as a debug tool for cases where compile
> time switches are not the best way of debugging things. It's not a
> high profile thing to tackle for me tbh, but I don't really think
> working heavily on the timing stat thing is the correct path to walk
> along.
>> Adding new exit types isn't working heavily on it.
> No, but the fact that the first patch is a fix to add exit stats for
> exits that we missed out on before doesn't give me a lot of confidence
> that lots of people use timing stats. And I am always very wary of
> #ifdef'ed code, as it blows up the test matrix heavily.

I used it quite a lot when I was doing KVM performance work. It's just
been a while since I last did that.

-Scott
Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation
On 07/10/2013 05:18:10 AM, Alexander Graf wrote:
> On 10.07.2013, at 02:12, Scott Wood wrote:
>> On 07/09/2013 04:45:10 PM, Alexander Graf wrote:
>>> On 28.06.2013, at 11:20, Mihai Caraman wrote:
>>>> +	/* Get page size */
>>>> +	if (MAS0_GET_TLBSEL(mfspr(SPRN_MAS0)) == 0)
>>>> +		psize_shift = PAGE_SHIFT;
>>>> +	else
>>>> +		psize_shift = MAS1_GET_TSIZE(mas1) + 10;
>>>> +
>>>> +	mas7_mas3 = (((u64) mfspr(SPRN_MAS7)) << 32) |
>>>> +		    mfspr(SPRN_MAS3);
>>>> +	addr = (mas7_mas3 & (~0ULL << psize_shift)) |
>>>> +	       (geaddr & ((1ULL << psize_shift) - 1ULL));
>>>> +
>>>> +	/* Map a page and get guest's instruction */
>>>> +	page = pfn_to_page(addr >> PAGE_SHIFT);
>>> While looking at this I just realized that you're missing a check
>>> here. What if our IP is in some PCI BAR? Or can't we execute from
>>> those? We at least need to check pfn_valid() first.
>> That'll just keep us from accessing a bad pointer in the host kernel,
>> though -- it won't make the emulation actually work. If we need that,
>> we'll probably need to create a temporary TLB entry manually.
> ioremap()?

That's a bit heavy... also we'd need to deal with cacheability. This
code is already engaged in directly creating TLB entries, so it doesn't
seem like much of a stretch to create one for this. It should be faster
than ioremap() or kmap_atomic(). The one complication is allocating the
virtual address space, but maybe we could just use the page that
kmap_atomic would have used? Of course, if we want to handle execution
from other than normal kernel memory, we'll need to make sure that the
virtual address space is allocated even when highmem is not present
(e.g. 64-bit).

> However, if we were walking the guest TLB cache instead we would get a
> guest physical address which we can always resolve to a host virtual
> address. I'm not sure how important that whole use case is though.
> Maybe we should just error out to the guest for now.

It's not that important, now that we are using hugetlb rather than
directly mapping a large hunk of reserved memory. It would be nice to
handle it though, if we can do so without too much hassle.
And I think manually creating a TLB entry could be faster than
kmap_atomic(), or searching the guest TLB and then doing a reverse
memslot lookup.

-Scott
Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation
On 07/10/2013 05:15:09 AM, Alexander Graf wrote:
> On 10.07.2013, at 02:06, Scott Wood wrote:
>> On 07/09/2013 04:44:24 PM, Alexander Graf wrote:
>>> On 09.07.2013, at 20:46, Scott Wood wrote:
>>>> I suspect that tlbsx is faster, or at worst similar. And unlike
>>>> comparing tlbsx to lwepx (not counting a fix for the threading
>>>> problem), we don't already have code to search the guest TLB, so
>>>> testing would be more work.
>>> We have code to walk the guest TLB for TLB misses. This really is
>>> just the TLB miss search without host TLB injection.
>>>
>>> So let's say we're using the shadow TLB. The guest always has its,
>>> say, 64 TLB entries that it can count on - we never evict anything
>>> by accident, because we store all of the 64 entries in our guest TLB
>>> cache. When the guest faults at an address, the first thing we do is
>>> check the cache whether we have that page already mapped.
>>>
>>> However, with this method we now have 2 enumeration methods for
>>> guest TLB searches. We have the tlbsx one which searches the host
>>> TLB, and we have our guest TLB cache. The guest TLB cache might
>>> still contain an entry for an address that we already invalidated on
>>> the host. Would that impose a problem? I guess not, because we're
>>> swizzling the exit code around to instead be an instruction miss,
>>> which means we restore the TLB entry into our host's TLB so that
>>> when we resume, we land here and the tlbsx hits. But it feels
>>> backwards.
>>>
>>> Any better way?
>> Searching the guest TLB won't work for the LRAT case, so we'd need to
>> have this logic around anyway. We shouldn't add a second codepath
>> unless it's a clear performance gain -- and again, I suspect it would
>> be the opposite, especially if the entry is not in TLB0 or in one of
>> the first few entries searched in TLB1. The tlbsx miss case is not
>> what we should optimize for.
> Hrm. So let's redesign this thing theoretically. We would have an exit
> that requires an instruction fetch. We would override
> kvmppc_get_last_inst() to always do kvmppc_ld_inst().
That one can fail because it can't find the TLB entry in the host TLB. When it fails, we have to abort the emulation and resume the guest at the same IP. Now the guest gets the TLB miss, we populate, go back into the guest. The guest hits the emulation failure again. We go back to kvmppc_ld_inst() which succeeds this time and we can emulate the instruction. That's pretty much what this patch does, except that it goes immediately to the TLB miss code rather than having the extra round-trip back to the guest. Is there any benefit from adding that extra round-trip? Rewriting the exit type instead doesn't seem that bad... I think this works. Just make sure that the gateway to the instruction fetch is kvmppc_get_last_inst() and make that failable. Then the difference between looking for the TLB entry in the host's TLB or in the guest's TLB cache is hopefully negligible. I don't follow here. What does this have to do with looking in the guest TLB? -Scott -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()
On 07/10/2013 01:02:19 AM, Tiejun Chen wrote: We should ensure that preemption cannot occur while calling get_paca() inside hard_irq_disable(), otherwise the paca_struct may be the wrong one just after. And btw, we may update timing stats in this case. The soft-ee mechanism depends on accessing the PACA directly via r13 to avoid this. We probably should be using inline asm to read was_enabled rather than hoping the compiler doesn't do anything silly. Plus what Alex said, regarding this patch specifically. -Scott
Re: [PATCH 2/2] KVM: PPC: Book3E: Emulate MCSRR0/1 SPR and rfmci instruction
On 10.07.2013, at 20:24, Scott Wood wrote: On 07/10/2013 05:23:36 AM, Alexander Graf wrote: On 10.07.2013, at 00:26, Scott Wood wrote: On 07/09/2013 05:00:26 PM, Alexander Graf wrote: It'll also be more flexible at the same time. You could take the logs and actually check what's going on to debug issues that you're encountering for example. We could even go as far as sharing the same tool with other architectures, so that we only have to learn how to debug things once. Have you encountered an actual need for this flexibility, or is it theoretical? Yeah, first thing I did back then to actually debug kvm failures was to add trace points. I meant specifically for handling exit timings this way. No, but I did encounter the need for debugging exits. And the only thing we would need to get exit timing stats once we already have trace points for exit types would be to have trace points for guest entry and maybe type specific events to indicate what the exit is about as well. Is there common infrastructure for dealing with measuring intervals and tracking statistics thereof, rather than just tracking points and letting userspace connect the dots (though it could still do that as an option)? Even if it must be done in userspace, it doesn't seem like something that should be KVM-specific. Would you like to have different ways of measuring mm subsystem overhead? I don't :). The same goes for KVM really. If we could converge towards a single user space interface to get exit timings, it'd make debugging a lot easier. I agree -- that's why I said it doesn't seem like something that should be KVM-specific. But that's orthogonal to whether it's done in kernel space or user space. The ability to get begin/end events from userspace would be nice when it is specifically requested, but it would also be nice if the kernel could track some basic statistics so we wouldn't have to ship so much data around to arrive at the same result. 
At the very least, I'd like such a tool/infrastructure to exist before we start complaining about doing minor maintenance of the current mechanism. I admit that I don't fully understand qemu/scripts/kvm/kvm_stat, but it seems to me as if it already does pretty much what we want. It sets up a filter to only get events and their time stamps through. It does use normal exit trace points on x86 to replace the old debugfs based stat counters. And it seems to work reasonably well for that. We already have this for the debugfs counters btw. And the timing framework does break kvm_stat today already, as it emits textual stats rather than numbers which all of the other debugfs stats do. But at least I can take the x86 kvm_stat tool and run it on ppc just fine to see exit stats. We already have what? The last two sentences seem contradictory -- can you or can't you use kvm_stat as is? I'm not familiar with kvm_stat. kvm_stat back in the day used debugfs to give you an idea on what exit event happens most often. That mechanism got replaced by trace points later which the current kvm_stat uses. I still have a copy of the old kvm_stat that I always use to get a first feeling for what goes wrong if something goes wrong. The original code couldn't deal with the fact that we have a debugfs file that contains text though. I patched it locally. It also works just fine if you simply disable timing stats, since then you won't have the text file. What does x86 KVM expose in debugfs? The same thing it always exposed - exit stats. I am fairly sure Avi wanted to completely deprecate that interface in favor of the trace point based approach, but I don't think he ever got around to it. Alex
[PATCH 3/3] kvm/ppc/booke: Don't call kvm_guest_enter twice
kvm_guest_enter() was already called by kvmppc_prepare_to_enter(). Don't call it again.

Signed-off-by: Scott Wood scottw...@freescale.com
---
 arch/powerpc/kvm/booke.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index aa3bc36..57c1462 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -672,8 +672,6 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 		goto out;
 	}
 
-	kvm_guest_enter();
-
 #ifdef CONFIG_PPC_FPU
 	/* Save userspace FPU state in stack */
 	enable_kernel_fp();
--
1.8.1.2
[PATCH 2/3] kvm/ppc: IRQ disabling cleanup
Simplify the handling of lazy EE by going directly from fully-enabled to hard-disabled. This replaces the lazy_irq_pending() check (including its misplaced kvm_guest_exit() call).

As suggested by Tiejun Chen, move the interrupt disabling into kvmppc_prepare_to_enter() rather than have each caller do it. Also move the IRQ enabling on heavyweight exit into kvmppc_prepare_to_enter().

Signed-off-by: Scott Wood scottw...@freescale.com
---
 arch/powerpc/include/asm/kvm_ppc.h |  6 ++
 arch/powerpc/kvm/book3s_pr.c       | 12 +++-
 arch/powerpc/kvm/booke.c           | 11 +++
 arch/powerpc/kvm/powerpc.c         | 23 ++-
 4 files changed, 22 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 6885846..e4474f8 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -404,6 +404,12 @@ static inline void kvmppc_fix_ee_before_entry(void)
 	trace_hardirqs_on();
 
 #ifdef CONFIG_PPC64
+	/*
+	 * To avoid races, the caller must have gone directly from having
+	 * interrupts fully-enabled to hard-disabled.
+	 */
+	WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS);
+
 	/* Only need to enable IRQs by hard enabling them after this */
 	local_paca->irq_happened = 0;
 	local_paca->soft_enabled = 1;
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index ddfaf56..c13caa0 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -884,14 +884,11 @@ program_interrupt:
 		 * and if we really did time things so badly, then we just exit
 		 * again due to a host external interrupt.
 		 */
-		local_irq_disable();
 		s = kvmppc_prepare_to_enter(vcpu);
-		if (s <= 0) {
-			local_irq_enable();
+		if (s <= 0)
 			r = s;
-		} else {
+		else
 			kvmppc_fix_ee_before_entry();
-		}
 	}
 
 	trace_kvm_book3s_reenter(r, vcpu);
@@ -1121,12 +1118,9 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 	 * really did time things so badly, then we just exit again due to
 	 * a host external interrupt.
 	 */
-	local_irq_disable();
 	ret = kvmppc_prepare_to_enter(vcpu);
-	if (ret <= 0) {
-		local_irq_enable();
+	if (ret <= 0)
 		goto out;
-	}
 
 	/* Save FPU state in stack */
 	if (current->thread.regs->msr & MSR_FP)
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 6e35351..aa3bc36 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -617,7 +617,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 		local_irq_enable();
 		kvm_vcpu_block(vcpu);
 		clear_bit(KVM_REQ_UNHALT, &vcpu->requests);
-		local_irq_disable();
+		hard_irq_disable();
 
 		kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
 		r = 1;
@@ -666,10 +666,8 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 		return -EINVAL;
 	}
 
-	local_irq_disable();
 	s = kvmppc_prepare_to_enter(vcpu);
 	if (s <= 0) {
-		local_irq_enable();
 		ret = s;
 		goto out;
 	}
@@ -1162,14 +1160,11 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	 * aren't already exiting to userspace for some other reason.
 	 */
 	if (!(r & RESUME_HOST)) {
-		local_irq_disable();
 		s = kvmppc_prepare_to_enter(vcpu);
-		if (s <= 0) {
-			local_irq_enable();
+		if (s <= 0)
 			r = (s << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
-		} else {
+		else
 			kvmppc_fix_ee_before_entry();
-		}
 	}
 
 	return r;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 4e05f8c..2f7a221 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -64,12 +64,14 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
 {
 	int r = 1;
 
-	WARN_ON_ONCE(!irqs_disabled());
+	WARN_ON(irqs_disabled());
+	hard_irq_disable();
+
 	while (true) {
 		if (need_resched()) {
 			local_irq_enable();
 			cond_resched();
-			local_irq_disable();
+			hard_irq_disable();
 			continue;
 		}
@@ -95,7 +97,7 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
 			local_irq_enable();
 			trace_kvm_check_requests(vcpu);
 			r =
Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation
On 10.07.2013, at 20:37, Scott Wood wrote: On 07/10/2013 05:18:10 AM, Alexander Graf wrote: On 10.07.2013, at 02:12, Scott Wood wrote: On 07/09/2013 04:45:10 PM, Alexander Graf wrote: On 28.06.2013, at 11:20, Mihai Caraman wrote: + /* Get page size */ + if (MAS0_GET_TLBSEL(mfspr(SPRN_MAS0)) == 0) + psize_shift = PAGE_SHIFT; + else + psize_shift = MAS1_GET_TSIZE(mas1) + 10; + + mas7_mas3 = (((u64) mfspr(SPRN_MAS7)) 32) | + mfspr(SPRN_MAS3); + addr = (mas7_mas3 (~0ULL psize_shift)) | + (geaddr ((1ULL psize_shift) - 1ULL)); + + /* Map a page and get guest's instruction */ + page = pfn_to_page(addr PAGE_SHIFT); While looking at this I just realized that you're missing a check here. What if our IP is in some PCI BAR? Or can't we execute from those? We at least need to check pfn_valid() first. That'll just keep us from accessing a bad pointer in the host kernel, though -- it won't make the emulation actually work. If we need that, we'll probably need to create a temporary TLB entry manually. ioremap()? That's a bit heavy... also we'd need to deal with cacheability. This code is already engaged in directly creating TLB entries, so it doesn't seem like much of a stretch to create one for this. It should be faster than ioremap() or kmap_atomic(). Do we have any guarantees that the TLB entry we create doesn't get evicted by another thread by the time we want to use it? Alex The one complication is allocating the virtual address space, but maybe we could just use the page that kmap_atomic would have used? Of course, if we want to handle execution from other than normal kernel memory, we'll need to make sure that the virtual address space is allocated even when highmem is not present (e.g. 64-bit). However, if we were walking the guest TLB cache instead we would get a guest physical address which we can always resolve to a host virtual address. I'm not sure how important that whole use case is though. Maybe we should just error out to the guest for now. 
It's not that important, now that we are using hugetlb rather than directly mapping a large hunk of reserved memory. It would be nice to handle it though, if we can do without too much hassle. And I think manually creating a TLB entry could be faster than kmap_atomic(), or searching the guest TLB and then doing a reverse memslot lookup. -Scott
[PATCH 1/3] kvm/ppc: Call trace_hardirqs_on before entry
Currently this is only being done on 64-bit. Rather than just move it out of the 64-bit ifdef, move it to kvm_lazy_ee_enable() so that it is consistent with lazy ee state, and so that we don't track more host code as interrupts-enabled than necessary.

Rename kvm_lazy_ee_enable() to kvm_fix_ee_before_entry() to reflect that this function now has a role on 32-bit as well.

Signed-off-by: Scott Wood scottw...@freescale.com
---
 arch/powerpc/include/asm/kvm_ppc.h | 11 ---
 arch/powerpc/kvm/book3s_pr.c       |  4 ++--
 arch/powerpc/kvm/booke.c           |  4 ++--
 arch/powerpc/kvm/powerpc.c         |  2 --
 4 files changed, 12 insertions(+), 9 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index a5287fe..6885846 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -394,10 +394,15 @@ static inline void kvmppc_mmu_flush_icache(pfn_t pfn)
 	}
 }
 
-/* Please call after prepare_to_enter. This function puts the lazy ee state
-   back to normal mode, without actually enabling interrupts. */
-static inline void kvmppc_lazy_ee_enable(void)
+/*
+ * Please call after prepare_to_enter. This function puts the lazy ee and irq
+ * disabled tracking state back to normal mode, without actually enabling
+ * interrupts.
+ */
+static inline void kvmppc_fix_ee_before_entry(void)
 {
+	trace_hardirqs_on();
+
 #ifdef CONFIG_PPC64
 	/* Only need to enable IRQs by hard enabling them after this */
 	local_paca->irq_happened = 0;
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index 19498a5..ddfaf56 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -890,7 +890,7 @@ program_interrupt:
 			local_irq_enable();
 			r = s;
 		} else {
-			kvmppc_lazy_ee_enable();
+			kvmppc_fix_ee_before_entry();
 		}
 	}
@@ -1161,7 +1161,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 	if (vcpu->arch.shared->msr & MSR_FP)
 		kvmppc_handle_ext(vcpu, BOOK3S_INTERRUPT_FP_UNAVAIL, MSR_FP);
 
-	kvmppc_lazy_ee_enable();
+	kvmppc_fix_ee_before_entry();
 
 	ret = __kvmppc_vcpu_run(kvm_run, vcpu);
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index dcc94f0..6e35351 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -698,7 +698,7 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 	kvmppc_load_guest_fp(vcpu);
 #endif
 
-	kvmppc_lazy_ee_enable();
+	kvmppc_fix_ee_before_entry();
 
 	ret = __kvmppc_vcpu_run(kvm_run, vcpu);
@@ -1168,7 +1168,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 			local_irq_enable();
 			r = (s << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
 		} else {
-			kvmppc_lazy_ee_enable();
+			kvmppc_fix_ee_before_entry();
 		}
 	}
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 6316ee3..4e05f8c 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -117,8 +117,6 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
 			kvm_guest_exit();
 			continue;
 		}
-
-		trace_hardirqs_on();
 #endif
 
 		kvm_guest_enter();
--
1.8.1.2
Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation
On 10.07.2013, at 20:42, Scott Wood wrote: On 07/10/2013 05:15:09 AM, Alexander Graf wrote: On 10.07.2013, at 02:06, Scott Wood wrote: On 07/09/2013 04:44:24 PM, Alexander Graf wrote: On 09.07.2013, at 20:46, Scott Wood wrote: I suspect that tlbsx is faster, or at worst similar. And unlike comparing tlbsx to lwepx (not counting a fix for the threading problem), we don't already have code to search the guest TLB, so testing would be more work. We have code to walk the guest TLB for TLB misses. This really is just the TLB miss search without host TLB injection. So let's say we're using the shadow TLB. The guest always has its say 64 TLB entries that it can count on - we never evict anything by accident, because we store all of the 64 entries in our guest TLB cache. When the guest faults at an address, the first thing we do is we check the cache whether we have that page already mapped. However, with this method we now have 2 enumeration methods for guest TLB searches. We have the tlbsx one which searches the host TLB and we have our guest TLB cache. The guest TLB cache might still contain an entry for an address that we already invalidated on the host. Would that impose a problem? I guess not because we're swizzling the exit code around to instead be an instruction miss which means we restore the TLB entry into our host's TLB so that when we resume, we land here and the tlbsx hits. But it feels backwards. Any better way? Searching the guest TLB won't work for the LRAT case, so we'd need to have this logic around anyway. We shouldn't add a second codepath unless it's a clear performance gain -- and again, I suspect it would be the opposite, especially if the entry is not in TLB0 or in one of the first few entries searched in TLB1. The tlbsx miss case is not what we should optimize for. Hrm. So let's redesign this thing theoretically. We would have an exit that requires an instruction fetch. We would override kvmppc_get_last_inst() to always do kvmppc_ld_inst(). 
That one can fail because it can't find the TLB entry in the host TLB. When it fails, we have to abort the emulation and resume the guest at the same IP. Now the guest gets the TLB miss, we populate, go back into the guest. The guest hits the emulation failure again. We go back to kvmppc_ld_inst() which succeeds this time and we can emulate the instruction. That's pretty much what this patch does, except that it goes immediately to the TLB miss code rather than having the extra round-trip back to the guest. Is there any benefit from adding that extra round-trip? Rewriting the exit type instead doesn't seem that bad... It's pretty bad. I want to have code that is easy to follow - and I don't care whether the very rare case of a TLB entry getting evicted by a random other thread right when we execute the exit path is slower by a few percent if we get cleaner code for that. I think this works. Just make sure that the gateway to the instruction fetch is kvmppc_get_last_inst() and make that failable. Then the difference between looking for the TLB entry in the host's TLB or in the guest's TLB cache is hopefully negligible. I don't follow here. What does this have to do with looking in the guest TLB? I want to hide the fact that we're cheating as much as possible, that's it. Alex
Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup
On 11.07.2013, at 00:47, Scott Wood wrote:

Simplify the handling of lazy EE by going directly from fully-enabled to hard-disabled. This replaces the lazy_irq_pending() check (including its misplaced kvm_guest_exit() call).

As suggested by Tiejun Chen, move the interrupt disabling into kvmppc_prepare_to_enter() rather than have each caller do it. Also move the IRQ enabling on heavyweight exit into kvmppc_prepare_to_enter().

Signed-off-by: Scott Wood scottw...@freescale.com

Ben, could you please review this too? :)

Thanks a bunch!

---
 arch/powerpc/include/asm/kvm_ppc.h |  6 ++
 arch/powerpc/kvm/book3s_pr.c       | 12 +++-
 arch/powerpc/kvm/booke.c           | 11 +++
 arch/powerpc/kvm/powerpc.c         | 23 ++-
 4 files changed, 22 insertions(+), 30 deletions(-)

diff --git a/arch/powerpc/include/asm/kvm_ppc.h b/arch/powerpc/include/asm/kvm_ppc.h
index 6885846..e4474f8 100644
--- a/arch/powerpc/include/asm/kvm_ppc.h
+++ b/arch/powerpc/include/asm/kvm_ppc.h
@@ -404,6 +404,12 @@ static inline void kvmppc_fix_ee_before_entry(void)
 	trace_hardirqs_on();
 
 #ifdef CONFIG_PPC64
+	/*
+	 * To avoid races, the caller must have gone directly from having
+	 * interrupts fully-enabled to hard-disabled.
+	 */
+	WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS);

WARN_ON(lazy_irq_pending()); ?

Alex

+
 	/* Only need to enable IRQs by hard enabling them after this */
 	local_paca->irq_happened = 0;
 	local_paca->soft_enabled = 1;
diff --git a/arch/powerpc/kvm/book3s_pr.c b/arch/powerpc/kvm/book3s_pr.c
index ddfaf56..c13caa0 100644
--- a/arch/powerpc/kvm/book3s_pr.c
+++ b/arch/powerpc/kvm/book3s_pr.c
@@ -884,14 +884,11 @@ program_interrupt:
 		 * and if we really did time things so badly, then we just exit
 		 * again due to a host external interrupt.
 		 */
-		local_irq_disable();
 		s = kvmppc_prepare_to_enter(vcpu);
-		if (s <= 0) {
-			local_irq_enable();
+		if (s <= 0)
 			r = s;
-		} else {
+		else
 			kvmppc_fix_ee_before_entry();
-		}
 	}
 
 	trace_kvm_book3s_reenter(r, vcpu);
@@ -1121,12 +1118,9 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 	 * really did time things so badly, then we just exit again due to
 	 * a host external interrupt.
 	 */
-	local_irq_disable();
 	ret = kvmppc_prepare_to_enter(vcpu);
-	if (ret <= 0) {
-		local_irq_enable();
+	if (ret <= 0)
 		goto out;
-	}
 
 	/* Save FPU state in stack */
 	if (current->thread.regs->msr & MSR_FP)
diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index 6e35351..aa3bc36 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -617,7 +617,7 @@ int kvmppc_core_prepare_to_enter(struct kvm_vcpu *vcpu)
 		local_irq_enable();
 		kvm_vcpu_block(vcpu);
 		clear_bit(KVM_REQ_UNHALT, &vcpu->requests);
-		local_irq_disable();
+		hard_irq_disable();
 
 		kvmppc_set_exit_type(vcpu, EMULATED_MTMSRWE_EXITS);
 		r = 1;
@@ -666,10 +666,8 @@ int kvmppc_vcpu_run(struct kvm_run *kvm_run, struct kvm_vcpu *vcpu)
 		return -EINVAL;
 	}
 
-	local_irq_disable();
 	s = kvmppc_prepare_to_enter(vcpu);
 	if (s <= 0) {
-		local_irq_enable();
 		ret = s;
 		goto out;
 	}
@@ -1162,14 +1160,11 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	 * aren't already exiting to userspace for some other reason.
 	 */
 	if (!(r & RESUME_HOST)) {
-		local_irq_disable();
 		s = kvmppc_prepare_to_enter(vcpu);
-		if (s <= 0) {
-			local_irq_enable();
+		if (s <= 0)
 			r = (s << 2) | RESUME_HOST | (r & RESUME_FLAG_NV);
-		} else {
+		else
 			kvmppc_fix_ee_before_entry();
-		}
 	}
 
 	return r;
diff --git a/arch/powerpc/kvm/powerpc.c b/arch/powerpc/kvm/powerpc.c
index 4e05f8c..2f7a221 100644
--- a/arch/powerpc/kvm/powerpc.c
+++ b/arch/powerpc/kvm/powerpc.c
@@ -64,12 +64,14 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
 {
 	int r = 1;
 
-	WARN_ON_ONCE(!irqs_disabled());
+	WARN_ON(irqs_disabled());
+	hard_irq_disable();
+
 	while (true) {
 		if (need_resched()) {
 			local_irq_enable();
 			cond_resched();
-			local_irq_disable();
+			hard_irq_disable();
 			continue;
 		}
@@ -95,7 +97,7 @@ int kvmppc_prepare_to_enter(struct kvm_vcpu *vcpu)
 			local_irq_enable();
Re: [PATCH 0/3] kvm/ppc: fixes/cleanup
On 11.07.2013, at 00:47, Scott Wood wrote: These are the remaining patches from the set of 8 that was originally posted for 3.10, but which weren't deemed urgent enough for that.

Thanks, applied 1/3 and 3/3 to kvm-ppc-queue.

Alex

Scott Wood (3):
  kvm/ppc: Call trace_hardirqs_on before entry
  kvm/ppc: IRQ disabling cleanup
  kvm/ppc/booke: Don't call kvm_guest_enter twice

 arch/powerpc/include/asm/kvm_ppc.h | 17 ++---
 arch/powerpc/kvm/book3s_pr.c       | 16 +---
 arch/powerpc/kvm/booke.c           | 17 +
 arch/powerpc/kvm/powerpc.c         | 25 ++---
 4 files changed, 34 insertions(+), 41 deletions(-)

--
1.8.1.2
Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup
On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote: #ifdef CONFIG_PPC64 + /* + * To avoid races, the caller must have gone directly from having + * interrupts fully-enabled to hard-disabled. + */ + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS); WARN_ON(lazy_irq_pending()); ? Different semantics. What you propose will not catch irq_happened == 0 :-) Cheers, Ben.
Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup
On 11.07.2013, at 01:01, Benjamin Herrenschmidt wrote: On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote: #ifdef CONFIG_PPC64 + /* + * To avoid races, the caller must have gone directly from having + * interrupts fully-enabled to hard-disabled. + */ + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS); WARN_ON(lazy_irq_pending()); ? Different semantics. What you propose will not catch irq_happened == 0 :-) Right, but we only ever reach here after hard_irq_disable() I think. Alex
Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup
On 07/10/2013 06:04:53 PM, Alexander Graf wrote: On 11.07.2013, at 01:01, Benjamin Herrenschmidt wrote: On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote: #ifdef CONFIG_PPC64 + /* + * To avoid races, the caller must have gone directly from having + * interrupts fully-enabled to hard-disabled. + */ + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS); WARN_ON(lazy_irq_pending()); ? Different semantics. What you propose will not catch irq_happened == 0 :-) Right, but we only ever reach here after hard_irq_disable() I think. And the WARN_ON helps us ensure that it stays that way. -Scott
Re: [PATCH 2/3] kvm/ppc: IRQ disabling cleanup
On 11.07.2013, at 01:08, Scott Wood wrote: On 07/10/2013 06:04:53 PM, Alexander Graf wrote: On 11.07.2013, at 01:01, Benjamin Herrenschmidt wrote: On Thu, 2013-07-11 at 00:57 +0200, Alexander Graf wrote: #ifdef CONFIG_PPC64 + /* + * To avoid races, the caller must have gone directly from having + * interrupts fully-enabled to hard-disabled. + */ + WARN_ON(local_paca->irq_happened != PACA_IRQ_HARD_DIS); WARN_ON(lazy_irq_pending()); ? Different semantics. What you propose will not catch irq_happened == 0 :-) Right, but we only ever reach here after hard_irq_disable() I think. And the WARN_ON helps us ensure that it stays that way. Heh - ok :). Works for me. Alex
Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation
On 07/10/2013 05:50:01 PM, Alexander Graf wrote: On 10.07.2013, at 20:42, Scott Wood wrote: On 07/10/2013 05:15:09 AM, Alexander Graf wrote: On 10.07.2013, at 02:06, Scott Wood wrote: On 07/09/2013 04:44:24 PM, Alexander Graf wrote: On 09.07.2013, at 20:46, Scott Wood wrote: I suspect that tlbsx is faster, or at worst similar. And unlike comparing tlbsx to lwepx (not counting a fix for the threading problem), we don't already have code to search the guest TLB, so testing would be more work. We have code to walk the guest TLB for TLB misses. This really is just the TLB miss search without host TLB injection. So let's say we're using the shadow TLB. The guest always has its say 64 TLB entries that it can count on - we never evict anything by accident, because we store all of the 64 entries in our guest TLB cache. When the guest faults at an address, the first thing we do is we check the cache whether we have that page already mapped. However, with this method we now have 2 enumeration methods for guest TLB searches. We have the tlbsx one which searches the host TLB and we have our guest TLB cache. The guest TLB cache might still contain an entry for an address that we already invalidated on the host. Would that impose a problem? I guess not because we're swizzling the exit code around to instead be an instruction miss which means we restore the TLB entry into our host's TLB so that when we resume, we land here and the tlbsx hits. But it feels backwards. Any better way? Searching the guest TLB won't work for the LRAT case, so we'd need to have this logic around anyway. We shouldn't add a second codepath unless it's a clear performance gain -- and again, I suspect it would be the opposite, especially if the entry is not in TLB0 or in one of the first few entries searched in TLB1. The tlbsx miss case is not what we should optimize for. Hrm. So let's redesign this thing theoretically. We would have an exit that requires an instruction fetch. 
We would override kvmppc_get_last_inst() to always do kvmppc_ld_inst(). That one can fail because it can't find the TLB entry in the host TLB. When it fails, we have to abort the emulation and resume the guest at the same IP. Now the guest gets the TLB miss, we populate, go back into the guest. The guest hits the emulation failure again. We go back to kvmppc_ld_inst() which succeeds this time and we can emulate the instruction. That's pretty much what this patch does, except that it goes immediately to the TLB miss code rather than having the extra round-trip back to the guest. Is there any benefit from adding that extra round-trip? Rewriting the exit type instead doesn't seem that bad... It's pretty bad. I want to have code that is easy to follow - and I don't care whether the very rare case of a TLB entry getting evicted by a random other thread right when we execute the exit path is slower by a few percent if we get cleaner code for that. I guess I just don't see how this is so much harder to follow than returning to guest. I find it harder to follow the flow when there are more round trips to the guest involved. Treat this as an ITLB miss is simpler than, Let this fail, and make sure we retry the trapping instruction on failure. Then, an ITLB miss will happen. Also note that making kvmppc_get_last_inst() able to fail means updating several existing callsites, both for the change in function signature and to actually handle failures. I don't care that deeply either way, it just doesn't seem obviously better. I think this works. Just make sure that the gateway to the instruction fetch is kvmppc_get_last_inst() and make that failable. Then the difference between looking for the TLB entry in the host's TLB or in the guest's TLB cache is hopefully negligible. I don't follow here. What does this have to do with looking in the guest TLB? I want to hide the fact that we're cheating as much as possible, that's it. 
How are we cheating, and what specifically are you proposing to do to hide that? How is the guest TLB involved at all in the change you're asking for? -Scott
Re: [PATCH 2/2] KVM: PPC: Book3E: Get vcpu's last instruction for emulation
On 11.07.2013, at 02:15, Scott Wood wrote: On 07/10/2013 05:50:01 PM, Alexander Graf wrote: On 10.07.2013, at 20:42, Scott Wood wrote: On 07/10/2013 05:15:09 AM, Alexander Graf wrote: On 10.07.2013, at 02:06, Scott Wood wrote: On 07/09/2013 04:44:24 PM, Alexander Graf wrote: On 09.07.2013, at 20:46, Scott Wood wrote: I suspect that tlbsx is faster, or at worst similar. And unlike comparing tlbsx to lwepx (not counting a fix for the threading problem), we don't already have code to search the guest TLB, so testing would be more work. We have code to walk the guest TLB for TLB misses. This really is just the TLB miss search without host TLB injection. So let's say we're using the shadow TLB. The guest always has its say 64 TLB entries that it can count on - we never evict anything by accident, because we store all of the 64 entries in our guest TLB cache. When the guest faults at an address, the first thing we do is we check the cache whether we have that page already mapped. However, with this method we now have 2 enumeration methods for guest TLB searches. We have the tlbsx one which searches the host TLB and we have our guest TLB cache. The guest TLB cache might still contain an entry for an address that we already invalidated on the host. Would that impose a problem? I guess not because we're swizzling the exit code around to instead be an instruction miss which means we restore the TLB entry into our host's TLB so that when we resume, we land here and the tlbsx hits. But it feels backwards. Any better way? Searching the guest TLB won't work for the LRAT case, so we'd need to have this logic around anyway. We shouldn't add a second codepath unless it's a clear performance gain -- and again, I suspect it would be the opposite, especially if the entry is not in TLB0 or in one of the first few entries searched in TLB1. The tlbsx miss case is not what we should optimize for. Hrm. So let's redesign this thing theoretically. 
We would have an exit that requires an instruction fetch. We would override kvmppc_get_last_inst() to always do kvmppc_ld_inst(). That one can fail because it can't find the TLB entry in the host TLB. When it fails, we have to abort the emulation and resume the guest at the same IP. Now the guest gets the TLB miss, we populate, go back into the guest. The guest hits the emulation failure again. We go back to kvmppc_ld_inst() which succeeds this time and we can emulate the instruction. That's pretty much what this patch does, except that it goes immediately to the TLB miss code rather than having the extra round-trip back to the guest. Is there any benefit from adding that extra round-trip? Rewriting the exit type instead doesn't seem that bad... It's pretty bad. I want to have code that is easy to follow - and I don't care whether the very rare case of a TLB entry getting evicted by a random other thread right when we execute the exit path is slower by a few percent if we get cleaner code for that. I guess I just don't see how this is so much harder to follow than returning to guest. I find it harder to follow the flow when there are more round trips to the guest involved. Treat this as an ITLB miss is simpler than, Let this fail, and make sure we retry the trapping instruction on failure. Then, an ITLB miss will happen. Also note that making kvmppc_get_last_inst() able to fail means updating several existing callsites, both for the change in function signature and to actually handle failures. I don't care that deeply either way, it just doesn't seem obviously better. I think this works. Just make sure that the gateway to the instruction fetch is kvmppc_get_last_inst() and make that failable. Then the difference between looking for the TLB entry in the host's TLB or in the guest's TLB cache is hopefully negligible. I don't follow here. What does this have to do with looking in the guest TLB? 
I want to hide the fact that we're cheating as much as possible, that's it. How are we cheating, and what specifically are you proposing to do to hide that? How is the guest TLB involved at all in the change you're asking for? It's not involved, but it's basically what we do on book3s pr kvm. There kvmppc_ld reads the guest htab, not the host htab. I think it's fine to expose both cases as the same thing to the rest of kvm. Alex
Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()
On 07/10/2013 05:49 PM, Alexander Graf wrote: On 10.07.2013, at 08:02, Tiejun Chen wrote: We should ensure the preemption cannot occur while calling get_paca() inside hard_irq_disable(), otherwise the paca_struct may be the wrong one just after. And btw, we may update timing stats in this case. Signed-off-by: Tiejun Chen tiejun.c...@windriver.com
---
 arch/powerpc/kvm/booke.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/arch/powerpc/kvm/booke.c b/arch/powerpc/kvm/booke.c
index dcc94f0..9dae25d 100644
--- a/arch/powerpc/kvm/booke.c
+++ b/arch/powerpc/kvm/booke.c
@@ -839,6 +839,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	WARN_ON(local_paca->irq_happened != 0);
 #endif
+	preempt_disable();
 	/*
 	 * We enter with interrupts disabled in hardware, but
 	 * we need to call hard_irq_disable anyway to ensure that
@@ -848,6 +849,7 @@ int kvmppc_handle_exit(struct kvm_run *run, struct kvm_vcpu *vcpu,
 	/* update before a new last_exit_type is rewritten */
 	kvmppc_update_timing_stats(vcpu);
+	preempt_enable();
 	/* restart interrupts if they were meant for the host */
 	kvmppc_restart_interrupt(vcpu, exit_nr);

All of the code here is already called with interrupts disabled. I don't see how we could preempt then? But the kernel still checks this in the preempt case:

#define get_paca()	((void) debug_smp_processor_id(), local_paca)

which then triggers the known call trace I observed :)

BUG: using smp_processor_id() in preemptible [] code: qemu-system-ppc/2065
caller is .kvmppc_handle_exit+0x48/0x810
CPU: 0 PID: 2065 Comm: qemu-system-ppc Not tainted 3.10.0-172036-ge2daa28-dirty #116
Call Trace:
[c001fc637570] [c000835c] .show_stack+0x7c/0x1f0 (unreliable)
[c001fc637640] [c0673a0c] .dump_stack+0x28/0x3c
[c001fc6376b0] [c02f04d8] .debug_smp_processor_id+0x108/0x120
[c001fc637740] [c00444e8] .kvmppc_handle_exit+0x48/0x810
[c001fc6377f0] [c0047c80] .kvmppc_resume_host+0xa4/0xf8

Tiejun
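[Editor's sketch] The hazard the patch closes can be shown with a user-space toy model: with preemption enabled, a task may migrate between two per-CPU reads, so a second get_paca() can return a different CPU's paca_struct; pinning with preempt_disable() makes the pointer stable. Every name below (toy_paca, toy_get_paca, preemption_point, ...) is a hypothetical stand-in, not kernel code:

```c
#include <assert.h>

/* Two fake per-CPU areas standing in for paca_struct. */
struct toy_paca { int cpu; };

static struct toy_paca pacas[2] = { { 0 }, { 1 } };
static int cur_cpu;        /* which CPU the task currently runs on */
static int preempt_count;  /* non-zero: preemption disabled */

static void toy_preempt_disable(void) { preempt_count++; }
static void toy_preempt_enable(void)  { preempt_count--; }

static struct toy_paca *toy_get_paca(void) { return &pacas[cur_cpu]; }

/* A preemption point: the scheduler may migrate the task, but only
 * while preemption is enabled. Always migrates, to model the worst
 * case the BUG splat above corresponds to. */
static void preemption_point(void)
{
    if (preempt_count == 0)
        cur_cpu ^= 1;
}

/* Returns 1 if two get_paca() reads around a preemption point saw the
 * same per-CPU structure. */
int paca_stable(int pin)
{
    struct toy_paca *a, *b;

    if (pin)
        toy_preempt_disable();
    a = toy_get_paca();
    preemption_point();   /* e.g. an involuntary reschedule */
    b = toy_get_paca();
    if (pin)
        toy_preempt_enable();
    return a == b;
}
```

This is exactly the pattern debug_smp_processor_id() flags: dereferencing a per-CPU pointer across a window where the task could migrate.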
Re: [v1][PATCH 1/1] KVM: PPC: disable preemption when using hard_irq_disable()
On 07/11/2013 03:15 AM, Scott Wood wrote: On 07/10/2013 01:02:19 AM, Tiejun Chen wrote: We should ensure the preemption cannot occur while calling get_paca() inside hard_irq_disable(), otherwise the paca_struct may be the wrong one just after. And btw, we may update timing stats in this case. The soft-ee mechanism depends on accessing the PACA directly via r13 to avoid this. We probably should be using inline asm to read was_enabled rather than hoping the compiler doesn't do anything silly. Yes. Do you recommend I should directly replace get_paca() with local_paca inside hard_irq_disable()?

 #define hard_irq_disable()	do {			\
-	u8 _was_enabled = get_paca()->soft_enabled;	\
+	u8 _was_enabled = local_paca->soft_enabled;	\

But is this safe for all scenarios? Tiejun
Re: [PATCH 6/8] KVM: PPC: Add support for multiple-TCE hcalls
On 07/10/2013 08:05 PM, Alexander Graf wrote: On 10.07.2013, at 07:00, Alexey Kardashevskiy wrote: On 07/10/2013 03:02 AM, Alexander Graf wrote: On 07/06/2013 05:07 PM, Alexey Kardashevskiy wrote: This adds real mode handlers for the H_PUT_TCE_INDIRECT and H_STUFF_TCE hypercalls for QEMU emulated devices such as IBMVIO devices or emulated PCI. These calls allow adding multiple entries (up to 512) into the TCE table in one call which saves time on transition to/from real mode. We don't mention QEMU explicitly in KVM code usually. This adds a tce_tmp cache to kvm_vcpu_arch to save valid TCEs (copied from user and verified) before writing the whole list into the TCE table. This cache will be utilized more in the upcoming VFIO/IOMMU support to continue TCE list processing in the virtual mode in the case if the real mode handler failed for some reason. This adds a guest physical to host real address converter and calls the existing H_PUT_TCE handler. The converting function is going to be fully utilized by upcoming VFIO supporting patches. This also implements the KVM_CAP_PPC_MULTITCE capability, so in order to support the functionality of this patch, QEMU needs to query for this capability and set the hcall-multi-tce hypertas property only if the capability is present, otherwise there will be serious performance degradation. Same as above. But really you're only giving recommendations here. What's the point? Please describe what the benefit of this patch is, not what some other random subsystem might do with the benefits it brings. 
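[Editor's sketch] The commit message describes a two-phase pattern: TCEs are first copied from the guest and validated into the vcpu's tce_tmp cache, and only a fully valid list is committed to the TCE table, so a failure never leaves a partially written table behind. A minimal model of that pattern, with simplified stand-in names and a made-up validity bit (the real handlers validate against guest memory layout, not a single flag):

```c
#include <assert.h>
#include <stddef.h>

#define TOY_MAX_TCES  512       /* matches the "up to 512 entries" limit */
#define TOY_TCE_VALID 0x1UL     /* pretend low bit marks a valid entry */

/* Returns 0 on success. On any invalid entry it fails *before* the
 * table is touched, so no partial-write cleanup is needed (the
 * changelog's "do not even start writing to TCE table if ... invalid"). */
int toy_put_tce_indirect(unsigned long *table,
                         const unsigned long *guest_list,
                         size_t npages,
                         unsigned long *tce_tmp)
{
    size_t i;

    if (npages > TOY_MAX_TCES)
        return -1;

    /* Phase 1: copy from the guest and validate into the cache. */
    for (i = 0; i < npages; i++) {
        if (!(guest_list[i] & TOY_TCE_VALID))
            return -1;
        tce_tmp[i] = guest_list[i];
    }

    /* Phase 2: commit the known-good list to the TCE table. */
    for (i = 0; i < npages; i++)
        table[i] = tce_tmp[i];

    return 0;
}
```

The cached copy is also what lets the upcoming VFIO code resume processing in virtual mode if the real-mode pass bails out partway through.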
Signed-off-by: Paul Mackerras <pau...@samba.org>
Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
Changelog:
2013/07/06:
* fixed number of wrong get_page()/put_page() calls

2013/06/27:
* fixed clear of BUSY bit in kvmppc_lookup_pte()
* H_PUT_TCE_INDIRECT does realmode_get_page() now
* KVM_CAP_SPAPR_MULTITCE now depends on CONFIG_PPC_BOOK3S_64
* updated doc

2013/06/05:
* fixed mistype about IBMVIO in the commit message
* updated doc and moved it to another section
* changed capability number

2013/05/21:
* added kvm_vcpu_arch::tce_tmp
* removed cleanup if put_indirect failed, instead we do not even start writing to TCE table if we cannot get TCEs from the user and they are invalid
* kvmppc_emulated_h_put_tce is split to kvmppc_emulated_put_tce and kvmppc_emulated_validate_tce (for the previous item)
* fixed bug with failthrough for H_IPI
* removed all get_user() from real mode handlers
* kvmppc_lookup_pte() added (instead of making lookup_linux_pte public)

Signed-off-by: Alexey Kardashevskiy <a...@ozlabs.ru>
---
 Documentation/virtual/kvm/api.txt       |  25 +++
 arch/powerpc/include/asm/kvm_host.h     |   9 ++
 arch/powerpc/include/asm/kvm_ppc.h      |  16 +-
 arch/powerpc/kvm/book3s_64_vio.c        | 154 ++-
 arch/powerpc/kvm/book3s_64_vio_hv.c     | 260
 arch/powerpc/kvm/book3s_hv.c            |  41 -
 arch/powerpc/kvm/book3s_hv_rmhandlers.S |   6 +
 arch/powerpc/kvm/book3s_pr_papr.c       |  37 -
 arch/powerpc/kvm/powerpc.c              |   3 +
 9 files changed, 517 insertions(+), 34 deletions(-)

diff --git a/Documentation/virtual/kvm/api.txt b/Documentation/virtual/kvm/api.txt
index 6365fef..762c703 100644
--- a/Documentation/virtual/kvm/api.txt
+++ b/Documentation/virtual/kvm/api.txt
@@ -2362,6 +2362,31 @@ calls by the guest for that service will be passed to userspace to be handled.
+4.86 KVM_CAP_PPC_MULTITCE
+
+Capability: KVM_CAP_PPC_MULTITCE
+Architectures: ppc
+Type: vm
+
+This capability means the kernel is capable of handling hypercalls
+H_PUT_TCE_INDIRECT and H_STUFF_TCE without passing those into the user
+space.
+This significanly accelerates DMA operations for PPC KVM guests.

significanly? Please run this through a spell checker.

+The user space should expect that its handlers for these hypercalls

s/The//

+are not going to be called.

Is user space guaranteed they will not be called? Or can it still happen? ... if user space previously registered LIOBN in KVM (via KVM_CREATE_SPAPR_TCE or similar calls). ok? How about this? The hypercalls mentioned above may or may not be processed successfully in the kernel based fast path. If they can not be handled by the kernel, they will get passed on to user space. So user space still has to have an implementation for these despite the in kernel acceleration. --- The target audience for this documentation is user space KVM API users. Someone developing kvm tool for example. They want to know implications specific CAPs have. There is also KVM_CREATE_SPAPR_TCE_IOMMU but it is not in the kernel yet and may never get there.

+In order to enable H_PUT_TCE_INDIRECT and H_STUFF_TCE use in the guest,
+the user space might have to advertise it for the guest. For example,
+IBM pSeries guest starts using them if hcall-multi-tce is present in