kvm BookE and SPRGs
Hi Hollis ! I was roaming through kernel usage of SPRGs and noticed a small detail in kvmppc for BookE ... any reason why in OP_31_XOP_MTSPR, you open coded the emulation of SPRG0..3, but 4...7 are handled in kvmppc_core_emulate_mtspr() ? It occurs to me that in fact for both MTSPR and MFSPR, the code should be moved into kvmppc_core_emulate_mtspr() and kvmppc_core_emulate_mfspr() for consistency. Also, from looking at the FSL BookE code, it seems that there is such a thing as SPRG9 (and so I suppose there must be an SPRG8 somewhere too), shouldn't we handle it too ? Cheers, Ben. -- To unsubscribe from this list: send the line unsubscribe kvm-ppc in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
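For illustration only, here is a minimal standalone sketch of what handling all of SPRG0..7 in one place could look like. The struct and the sketch_* names are hypothetical stand-ins, not the real struct kvm_vcpu or the actual kvmppc_core_emulate_mtspr()/kvmppc_core_emulate_mfspr() signatures; the only facts assumed are the BookE supervisor SPR numbers 272..279 for SPRG0..7.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Supervisor SPR numbers on BookE: SPRG0..3 = 272..275, SPRG4..7 = 276..279. */
enum { SKETCH_SPRN_SPRG0 = 272, SKETCH_SPRN_SPRG7 = 279 };

/* Hypothetical return codes, standing in for the kernel's emulation results. */
enum { SKETCH_EMULATE_DONE = 0, SKETCH_EMULATE_FAIL = 1 };

/* Hypothetical, trimmed-down vcpu state holding only the SPRG shadows. */
struct sketch_vcpu {
    uint32_t sprg[8];
};

/* Returns the SPRG index 0..7 for a supervisor SPRG number, or -1 otherwise. */
static int sketch_sprg_index(int sprn)
{
    if (sprn >= SKETCH_SPRN_SPRG0 && sprn <= SKETCH_SPRN_SPRG7)
        return sprn - SKETCH_SPRN_SPRG0;
    return -1;
}

/* One handler for every SPRG write, instead of open coding 0..3 elsewhere. */
int sketch_emulate_mtspr(struct sketch_vcpu *vcpu, int sprn, uint32_t val)
{
    int n = sketch_sprg_index(sprn);

    assert(vcpu != NULL);
    if (n < 0)
        return SKETCH_EMULATE_FAIL;   /* not an SPRG: left to other code */
    vcpu->sprg[n] = val;
    return SKETCH_EMULATE_DONE;
}

/* The matching handler for every SPRG read. */
int sketch_emulate_mfspr(const struct sketch_vcpu *vcpu, int sprn, uint32_t *val)
{
    int n = sketch_sprg_index(sprn);

    assert(vcpu != NULL && val != NULL);
    if (n < 0)
        return SKETCH_EMULATE_FAIL;
    *val = vcpu->sprg[n];
    return SKETCH_EMULATE_DONE;
}
```

Any additional SPRGs (such as the SPRG8/SPRG9 mentioned above) would need their own SPR numbers added to the range check; those numbers are deliberately not guessed here.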
Re: kvm BookE and SPRGs
On Fri, 2009-07-10 at 16:31 +1000, Benjamin Herrenschmidt wrote: I was roaming through kernel usage of SPRGs and noticed a small detail in kvmppc for BookE ... any reason why in OP_31_XOP_MTSPR, you open coded the emulation of SPRG0..3, but 4...7 are handled in kvmppc_core_emulate_mtspr() ? It occurs to me that in fact for both MTSPR and MFSPR, the code should be moved into kvmppc_core_emulate_mtspr() and kvmppc_core_emulate_mfspr() for consistency. Also, from looking at the FSL BookE code, it seems that there is such a thing as SPRG9 (and so I suppose there must be an SPRG8 somewhere too), shouldn't we handle it too ? BTW. That leads me to another question (CC'ing Avi there too), which is what is the policy vs. para-virtualization ? IE. Are we ok with adding paravirt tricks to speed things up ? A prime example I have in mind that could possibly help a lot here is to have a shared page mapped at -4K (at the top of the address space) when the guest is in supervisor mode only that hosts part of the current VCPU supervisor register state. That way, we could, either using our existing alternate instruction patching mechanism, or maybe lazily patching them as we trap on them, replace instructions such as mtsprg and mfsprg with la/sta (load absolute/store absolute) from/to this page (absolute addresses on ppc are 16 bits signed so can reach either the top or the bottom of the address space). We could also access the guest MSR read only that way, the guest SRR0 and SRR1, and a few more things. I also have ideas to do soft irq disabling that way as well which would eventually remove most if not all the spurious emulation traps in the exception entry/exit of the guest kernel.
(Note: this is paravirt even if we patch instructions on traps, in part because if we use that instead of SPRGs, then the values will not be reflected in the user readable SPRG aliases, so the guest kernel needs to be aware of that, typically, the current BookE code -does- use the user readable variants of SPRG4..7 so we must be careful here). The cost of course is an additional TLB entry for mapping that -4K page (but only when running guest kernel code). (Note: this technique would apply to KVM ppc64 from Alex as well) Cheers, Ben.
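To make the addressing argument concrete, here is a small sketch of the arithmetic, assuming a 32-bit guest and a D-form load/store whose base register field is 0 (so the effective address is just the sign-extended 16-bit displacement). The function names are made up for this example.

```c
#include <stdint.h>

/*
 * Effective address of a D-form access with base register field 0 on a
 * 32-bit machine: the 16-bit displacement is sign-extended and used as-is.
 */
uint32_t sketch_abs_ea32(uint16_t d_field)
{
    /* Portable sign extension of a 16-bit field to 32 bits. */
    if (d_field & 0x8000u)
        return 0xFFFF0000u | d_field;
    return d_field;
}

/* True if the access lands in the last 4K page of the 32-bit address space. */
int sketch_hits_top_page(uint16_t d_field)
{
    return sketch_abs_ea32(d_field) >= 0xFFFFF000u;
}
```

A displacement of -4096 (field value 0xF000) therefore addresses 0xFFFFF000, the first byte of a page mapped at -4K, while positive displacements reach only the bottom 32K of the address space.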
Re: kvm BookE and SPRGs
Hi Ben, On 10.07.2009, at 10:10, Benjamin Herrenschmidt wrote: On Fri, 2009-07-10 at 16:31 +1000, Benjamin Herrenschmidt wrote: I was roaming through kernel usage of SPRGs and noticed a small detail in kvmppc for BookE ... any reason why in OP_31_XOP_MTSPR, you open coded the emulation of SPRG0..3, but 4...7 are handled in kvmppc_core_emulate_mtspr() ? It occurs to me that in fact for both MTSPR and MFSPR, the code should be moved into kvmppc_core_emulate_mtspr() and kvmppc_core_emulate_mfspr() for consistency. Also, from looking at the FSL BookE code, it seems that there is such a thing as SPRG9 (and so I suppose there must be an SPRG8 somewhere too), shouldn't we handle it too ? BTW. That leads me to another question (CC'ing Avi there too), which is what is the policy vs. para-virtualization ? IE. Are we ok with adding paravirt tricks to speed things up ? IMHO paravirt stuff can be really useful, but should stay in the guest. I don't really like the idea of adding binary patching of guests in the hypervisor more than for dcbz where I didn't see another way to do it. Linux does provide pv_ops for such purposes, or maybe you could use the magic kernel patches itself hacks that exist in the power port today already. So then newer guests would be fast, older guests would be slow. Sounds like a good tradeoff to me :-). Maybe we could also do the hacks in the hypervisor, but #ifdef them out by default. I always get stomachaches from patching guests by default ;-). [...] That way, we could, either using our existing alternate instruction patching mechanism, or maybe lazily patching them as we trap on them, replace instructions such as mtsprg and mfsprg with la/sta (load absolute/store absolute) from/to this page (absolute addresses on ppc are 16 bits signed so can reach either the top of the bottom of the address space). That seems to be guest responsibility, no? 
Re: kvm BookE and SPRGs
On Fri, 2009-07-10 at 10:42 +0200, Alexander Graf wrote: IMHO paravirt stuff can be really useful, but should stay in the guest. I don't really like the idea of adding binary patching of guests in the hypervisor more than for dcbz where I didn't see another way to do it. I wasn't talking about that sort of binary patching :-) There are two ways to do it: - One is when you fault on an instruction like mtsprg2, you can patch -that- instruction and replace it with a magic stwa to the shared page. However, I prefer -real- paravirt which is: - The guest can use the existing self-binary patching facility we have to replace its own SPR access instructions with instructions that access the magic shared page. Linux does provide pv_ops for such purposes, or maybe you could use the magic kernel patches itself hacks that exist in the power port today already. pv_ops are useful for higher level things. We don't necessarily need them anyway as we already have various hooks for our existing hypervisors which are all some kind of paravirt. But the problem we have now with running supervisor instructions in user mode is too low level and performance sensitive for something like pv_ops. My proposed scheme would be much more efficient and remains reasonably simple. So then newer guests would be fast, older guests would be slow. Sounds like a good tradeoff to me :-). Right :-) Maybe we could also do the hacks in the hypervisor, but #ifdef them out by default. I always get stomachaches from patching guests by default ;-). I don't like patching guests from the HV that much either, I prefer paravirt for things like that. The case where we may -have- to do it would be if we tried to run legacy non-open source OSes like MacOS to handle things like cache line size issues, but then, it should be special options that have to be explicitly enabled via some sort of flags passed from userspace.
Thus, from the userspace tools, when creating a VM, you could enable special MacOS 9 compatibility hacks for example. But let's deal with that later, right now, the focus is linux on linux. I was just proposing a simple paravirt approach that would significantly speed up a whole bunch of existing low level exception entry/exit code paths. Another approach would be to do that at a higher level, by having more C-like entry points for the HV to call the guest into but that seems too inflexible to me and complicated. [...] That seems to be guest responsibility, no? Yes. mostly. The host side KVM code would have to provide the shared page which contains the shadows of SPRGs, SRR's, MSR, etc... and properly context switch and update it, and provide a way to map it up the top of the address space (ie, we should make it appear in pseudo real-mode too on KVM server, on existing KVM BookE, I suppose the guest can do an explicit call to the HV to instantiate it). But for the actual replacement of the various instructions with accesses to this page, that would be the responsibility of the guest to patch itself, for which we already have appropriate mechanisms so it should be reasonably easy. Cheers, Ben.
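As a rough sketch of the host side described above, the shared page could be a plain structure of register shadows that the host fills before entering the guest and reads back on exit. Everything here is an assumption made for illustration: the field order, the sketch_* names, the 32-bit widths, and the choice to treat the MSR shadow as read-only for the guest (so it is never copied back).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define SKETCH_MAGIC_BASE 0xFFFFF000u   /* guest address of the page at -4K */

/* Hypothetical layout of the shared page; offsets are an assumption. */
struct sketch_magic_page {
    uint32_t sprg[8];   /* shadows of SPRG0..7 */
    uint32_t srr0;
    uint32_t srr1;
    uint32_t msr;       /* guest reads this, host owns it */
};

/* Hypothetical host-side copy of the same guest registers. */
struct sketch_vcpu_regs {
    uint32_t sprg[8];
    uint32_t srr0, srr1, msr;
};

/* Host -> page: run before switching into the guest. */
void sketch_publish(struct sketch_magic_page *pg, const struct sketch_vcpu_regs *r)
{
    for (int i = 0; i < 8; i++)
        pg->sprg[i] = r->sprg[i];
    pg->srr0 = r->srr0;
    pg->srr1 = r->srr1;
    pg->msr = r->msr;
}

/* Page -> host: run after a guest exit. The MSR shadow is not trusted. */
void sketch_harvest(struct sketch_vcpu_regs *r, const struct sketch_magic_page *pg)
{
    for (int i = 0; i < 8; i++)
        r->sprg[i] = pg->sprg[i];
    r->srr0 = pg->srr0;
    r->srr1 = pg->srr1;
}

/* Guest virtual address of the SPRGn shadow under this layout. */
uint32_t sketch_sprg_addr(unsigned n)
{
    assert(n < 8);
    return SKETCH_MAGIC_BASE
         + (uint32_t)offsetof(struct sketch_magic_page, sprg) + 4u * n;
}
```

The structure only needs to stay within one 4K page; how the page gets mapped and how the guest learns the offsets are left open here, as they are in the discussion.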
RE: kvm BookE and SPRGs
On Fri, 2009-07-10 at 17:15 +0800, Liu Yu-B13201 wrote: Sounds reasonable. There are some old patchset which implemented the binary patch as Ben described. http://marc.info/?l=kvm-ppc&m=122154653905212&w=2 http://marc.info/?l=kvm-ppc&m=122154657905306&w=2 Interesting. Any reason why that wasn't merged ? Cheers, Ben.
RE: kvm BookE and SPRGs
On Fri, 2009-07-10 at 19:17 +1000, Benjamin Herrenschmidt wrote: On Fri, 2009-07-10 at 17:15 +0800, Liu Yu-B13201 wrote: Sounds reasonable. There are some old patchset which implemented the binary patch as Ben described. http://marc.info/?l=kvm-ppc&m=122154653905212&w=2 http://marc.info/?l=kvm-ppc&m=122154657905306&w=2 Interesting. Any reason why that wasn't merged ? Ok, I had a look and it seems like he's rewriting the guest instructions from the hypervisor. I prefer having the guest rewrite its own instructions. That does mean that the layout inside the magic page has to be fixed to a certain extent (or we need the hypervisor to at least pass some kind of description of where the various fields are) but that's a much better approach I believe. The main reason is because of the user-readable SPRG4..7. Because the guest will -not- trap when reading them, it will be able to read the value from the real underlying registers. However, when the writes to them are replaced by writing to the magic page, the underlying register is not kept in sync and things will break. Thus I prefer having the guest itself replace those instructions with magic page accesses in both cases (stores and loads), it becomes the guest's responsibility to ensure it's properly using the magic page -only- and doesn't trap on the actual instructions. We would thus continue trapping on the normal instructions and emulate them the old way (though we can probably move that emulation to asm code that is run before the switch back to the linux mm via the magic page :-) and thus make the emulation much faster, but that's a different deal). But still, the bulk of the patches for adding the cleaner paravirt interfaces, the magic page etc... seems sane. Cheers, Ben.
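A hedged sketch of the guest-side rewrite described above: given one 32-bit instruction word, replace mtspr/mfspr on an SPRG with a stw/lwz to the magic page, covering both stores and loads, including loads through the user-readable SPRG4..7 aliases. The instruction encodings (mtspr = opcode 31/XO 467, mfspr = 31/339, stw = 36, lwz = 32, split SPR field) are standard PowerPC; the page layout (SPRGn at offset 4*n from -4096) and all sketch_* names are assumptions for this example only.

```c
#include <stdint.h>

#define SKETCH_X_MASK   0xFC0007FEu                 /* primary + extended opcode */
#define SKETCH_MTSPR    ((31u << 26) | (467u << 1))
#define SKETCH_MFSPR    ((31u << 26) | (339u << 1))
#define SKETCH_OP_STW   (36u << 26)
#define SKETCH_OP_LWZ   (32u << 26)

/* Decode the SPR number; the two 5-bit halves are swapped in the encoding. */
static unsigned sketch_insn_spr(uint32_t insn)
{
    unsigned f = (insn >> 11) & 0x3FFu;
    return ((f & 0x1Fu) << 5) | (f >> 5);
}

/* Map an SPR number to an SPRG index, or -1 if it is not one we shadow. */
static int sketch_sprg_index(unsigned spr, int is_read)
{
    if (spr >= 272u && spr <= 279u)                 /* supervisor SPRG0..7 */
        return (int)(spr - 272u);
    if (is_read && spr >= 260u && spr <= 263u)      /* user-readable SPRG4..7 */
        return (int)(spr - 256u);
    return -1;
}

/* Returns the replacement instruction, or insn unchanged if not applicable. */
uint32_t sketch_patch_sprg_access(uint32_t insn)
{
    uint32_t op = insn & SKETCH_X_MASK;
    uint32_t reg = (insn >> 21) & 0x1Fu;            /* RS for mtspr, RT for mfspr */
    int is_read, n;

    if (op == SKETCH_MFSPR)
        is_read = 1;
    else if (op == SKETCH_MTSPR)
        is_read = 0;
    else
        return insn;

    n = sketch_sprg_index(sketch_insn_spr(insn), is_read);
    if (n < 0)
        return insn;

    /* Base register field 0, displacement -4096 + 4*n (assumed layout). */
    return (is_read ? SKETCH_OP_LWZ : SKETCH_OP_STW)
         | (reg << 21) | (0xF000u + 4u * (uint32_t)n);
}
```

Under these assumptions, mtsprg0 r3 (0x7c7043a6) becomes stw r3,-4096(0) (0x9060f000), and a read of the user-readable SPRG4 alias is redirected to the same shadow slot that writes to SPRG4 use, which is the consistency the guest has to guarantee for itself.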
Re: kvm BookE and SPRGs
On Fri, 2009-07-10 at 11:18 +0200, Alexander Graf wrote: The only problem I see is that the firmware lives in the high 4k, so we'd have to have some sort of enabling HV-call too. What firmware out of curiosity ? The treeboot thingy ? And yes, we definitely need an enabling HV call, ie, we stick to traps until it's enabled, we are talking about virtual space here so the kernel takes over and puts what it wants up there for BookE. For server, we really don't care much what's in the ROM after we have booted either. But yes, my idea did involve an enabling HV call. Cheers, Ben.
Re: kvm BookE and SPRGs
On 10.07.2009, at 11:39, Benjamin Herrenschmidt wrote: On Fri, 2009-07-10 at 11:18 +0200, Alexander Graf wrote: The only problem I see is that the firmware lives in the high 4k, so we'd have to have some sort of enabling HV-call too. What firmware out of curiosity ? The treeboot thingy ? And yes, we definitely need an enabling HV call, ie, we stick to traps until it's On PPC32 openbios is somewhere up there. On PPC64 openbios stays where it was on PPC32, so it's fine. Alex
Re: kvm BookE and SPRGs
On Fri, 2009-07-10 at 11:39 +0200, Alexander Graf wrote: Oh so we could have the emulation code mapped into the guest and could just jump there from our trampoline code, so all page faults and other fun traps still work. That'd be nice :-) We can put -some- code in there yes, but some things will still have to do the big switch over to linux. Again, in your case, let's get your thingy stable and merged first, but I may toy with the magic page on the BookE KVM if I have some spare time. Cheers, Ben.
Re: kvm BookE and SPRGs
On Fri, 2009-07-10 at 11:43 +0200, Alexander Graf wrote: What firmware out of curiosity ? The treeboot thingy ? And yes, we definitely need an enabling HV call, ie, we stick to traps until it's On PPC32 openbios is somewhere up there. On PPC64 openbios stays where it was on PPC32, so it's fine. Ok, but that's no big deal, we'll only enable it once we don't need the FW anymore. Cheers, Ben.
RE: kvm BookE and SPRGs
On Fri, 2009-07-10 at 19:17 +1000, Benjamin Herrenschmidt wrote: On Fri, 2009-07-10 at 17:15 +0800, Liu Yu-B13201 wrote: Sounds reasonable. There are some old patchset which implemented the binary patch as Ben described. http://marc.info/?l=kvm-ppc&m=122154653905212&w=2 http://marc.info/?l=kvm-ppc&m=122154657905306&w=2 Interesting. Any reason why that wasn't merged ? Basically because we ran out of manpower to maintain it. We didn't want to push PV changes into upstream Linux, useful only to us, and then disappear. -- Hollis Blanchard IBM Linux Technology Center
Re: kvm BookE and SPRGs
On Fri, 2009-07-10 at 18:10 +1000, Benjamin Herrenschmidt wrote: On Fri, 2009-07-10 at 16:31 +1000, Benjamin Herrenschmidt wrote: I was roaming through kernel usage of SPRGs and noticed a small detail in kvmppc for BookE ... any reason why in OP_31_XOP_MTSPR, you open coded the emulation of SPRG0..3, but 4...7 are handled in kvmppc_core_emulate_mtspr() ? It occurs to me that in fact for both MTSPR and MFSPR, the code should be moved into kvmppc_core_emulate_mtspr() and kvmppc_core_emulate_mfspr() for consistency. Also, from looking at the FSL BookE code, it seems that there is such a thing as SPRG9 (and so I suppose there must be an SPRG8 somewhere too), shouldn't we handle it too ? BTW. That leads me to another question (CC'ing Avi there too), which is what is the policy vs. para-virtualization ? IE. Are we ok with adding paravirt tricks to speed things up ? Yes, that's fine. We would rather not *require* paravirtualization though. A prime example I have in mind that could possibly help a lot here is to have a shared page mapped at -4K (at the top of the address space) when the guest is in supervisor mode only that hosts part of the current VCPU supervisor register state. ... The cost of course is an additional TLB entry for mapping that -4K page (but only when running guest kernel code). It was a net win when Christian implemented it last year. While the first access may miss in the TLB, these register accesses tend to come in bunches (i.e. the guest interrupt vectors). -- Hollis Blanchard IBM Linux Technology Center