Re: RFC: New API for PPC for vcpu mmu access
On Sun, 13 Feb 2011 23:43:40 +0100 Alexander Graf <ag...@suse.de> wrote:

    struct kvmppc_book3e_tlb_entry {
        union {
            __u64 mas8_1;
            struct {
                __u32 mas8;
                __u32 mas1;
            };
        };
        __u64 mas2;
        union {
            __u64 mas7_3;
            struct {
                __u32 mas7;
                __u32 mas3;
            };
        };
    };

Looks good to me, except for the anonymous unions and structs of course. Avi dislikes those :).

:-(

Is there any obvious reason we need to have unions in the first place? The compiler should be clever enough to pick the right size accessors when writing/reading masked __u64 entries, no?

Yes, the intent was just to make it easier to access individual mas registers, and to reuse existing bit declarations that are defined relative to individual registers. Why clutter up the source code when the compiler can deal with it?

Same applies to the anonymous unions/structs -- it's done that way to present the fields in the most straightforward manner (equivalent to how the SPRs themselves are named and aliased).

If it's a firm NACK on the existing structure, I think I'd rather just drop the paired versions altogether (but still order them this way so it can be changed back if there's any desire to in the future, without breaking compatibility).

The struct name should also have a version indicator - it's the data descriptor for only a single specific mmu_type, right?

It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.

For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in kvmppc_book3e_tlb_entry is present but not supported.

    struct kvmppc_book3e_tlb_params {
        /*
         * book3e defines 4 TLBs.  Individual implementations may have
         * fewer.  TLBs that do not already exist on the target must be
         * configured with a size of zero.  A tlb_ways value of zero means
         * the array is fully associative.  Only TLBs that are already
         * set-associative on the target may be configured with a different
         * associativity.  A set-associative TLB may not exceed 255 ways.
         *
         * KVM will adjust TLBnCFG based on the sizes configured here,
         * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
         * set to zero.
         *
         * The size of any TLB that is set-associative must be a multiple of
         * the number of ways, and the number of sets must be a power of two.
         *
         * The page sizes supported by a TLB shall be determined by reading
         * the TLB configuration registers.  This is not adjustable by
         * userspace.  [Note: need sregs]
         */
        __u32 tlb_sizes[4];
        __u8 tlb_ways[4];
        __u32 reserved[11];
    };

    KVM_CONFIG_TLB
    --------------
    Capability: KVM_CAP_SW_TLB
    Type: vcpu ioctl
    Parameters: struct kvm_config_tlb (in)
    Returns: 0 on success, -1 on error

    struct kvm_config_tlb {
        __u64 params;
        __u64 array;
        __u32 mmu_type;
        __u32 array_len;

Some reserved bits please. IIRC Avi also really likes it when, in addition to reserved fields, there's also a features int that indicates which reserved fields are actually usable. Shouldn't hurt here either, right?

params itself is a versioned struct, with reserved bits. Do we really need it here as well?

    - The hash for determining set number is:

          (MAS2[EPN] / page_size) & (num_sets - 1)

      where page_size is the smallest page size supported by the TLB, and
      num_sets is the tlb_sizes[] value divided by the tlb_ways[] value.
      If a book3e chip is made for which a different hash is needed, a new
      MMU type must be used, to ensure that userspace and KVM agree on
      layout.

Please state the size explicitly then. It's 1k, right?

It's 4K on Freescale chips. We should probably implement sregs first, in which case qemu can read the MMU config registers to find out the minimum supported page size.

If we specify 4K here, we should probably just go ahead and stick FSL in the MMU type name. Specifying the hash itself already makes me nervous about claiming the more generic name.

-Scott
--
To unsubscribe from this list: send the line "unsubscribe kvm-ppc" in the body of a message to majord...@vger.kernel.org
More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: RFC: New API for PPC for vcpu mmu access
On 14.02.2011, at 18:11, Scott Wood <scottw...@freescale.com> wrote:

On Sun, 13 Feb 2011 23:43:40 +0100 Alexander Graf <ag...@suse.de> wrote:

    struct kvmppc_book3e_tlb_entry {
        union {
            __u64 mas8_1;
            struct {
                __u32 mas8;
                __u32 mas1;
            };
        };
        __u64 mas2;
        union {
            __u64 mas7_3;
            struct {
                __u32 mas7;
                __u32 mas3;
            };
        };
    };

Looks good to me, except for the anonymous unions and structs of course. Avi dislikes those :).

:-(

Is there any obvious reason we need to have unions in the first place? The compiler should be clever enough to pick the right size accessors when writing/reading masked __u64 entries, no?

Yes, the intent was just to make it easier to access individual mas registers, and to reuse existing bit declarations that are defined relative to individual registers. Why clutter up the source code when the compiler can deal with it? Same applies to the anonymous unions/structs -- it's done that way to present the fields in the most straightforward manner (equivalent to how the SPRs themselves are named and aliased). If it's a firm NACK on the existing structure, I think I'd rather just drop the paired versions altogether (but still order them this way so it can be changed back if there's any desire to in the future, without breaking compatibility).

There's no nack here :). The only thing that needs to change is the anonymous part, as that's a GNU extension. Just name the structs and unions and all is well.

The reason I was asking is that I assumed the code would end up being easier, not more complex, without the u32s. In fact, it probably would. I'll leave the final decision to you, whether you want to access things by entry->u81.split.mas8 or entry->mas8_1 & MAS8_1_MAS8_MASK.

The struct name should also have a version indicator - it's the data descriptor for only a single specific mmu_type, right?

It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.

Even fictional future changes to the tlb layout? Either way, we can name that differently when it comes.
For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in kvmppc_book3e_tlb_entry is present but not supported.

    struct kvmppc_book3e_tlb_params {
        /*
         * book3e defines 4 TLBs.  Individual implementations may have
         * fewer.  TLBs that do not already exist on the target must be
         * configured with a size of zero.  A tlb_ways value of zero means
         * the array is fully associative.  Only TLBs that are already
         * set-associative on the target may be configured with a different
         * associativity.  A set-associative TLB may not exceed 255 ways.
         *
         * KVM will adjust TLBnCFG based on the sizes configured here,
         * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
         * set to zero.
         *
         * The size of any TLB that is set-associative must be a multiple of
         * the number of ways, and the number of sets must be a power of two.
         *
         * The page sizes supported by a TLB shall be determined by reading
         * the TLB configuration registers.  This is not adjustable by
         * userspace.  [Note: need sregs]
         */
        __u32 tlb_sizes[4];
        __u8 tlb_ways[4];
        __u32 reserved[11];
    };

    KVM_CONFIG_TLB
    --------------
    Capability: KVM_CAP_SW_TLB
    Type: vcpu ioctl
    Parameters: struct kvm_config_tlb (in)
    Returns: 0 on success, -1 on error

    struct kvm_config_tlb {
        __u64 params;
        __u64 array;
        __u32 mmu_type;
        __u32 array_len;

Some reserved bits please. IIRC Avi also really likes it when, in addition to reserved fields, there's also a features int that indicates which reserved fields are actually usable. Shouldn't hurt here either, right?

params itself is a versioned struct, with reserved bits. Do we really need it here as well?

Right. Probably not :).

    - The hash for determining set number is:

          (MAS2[EPN] / page_size) & (num_sets - 1)

      where page_size is the smallest page size supported by the TLB, and
      num_sets is the tlb_sizes[] value divided by the tlb_ways[] value.
      If a book3e chip is made for which a different hash is needed, a new
      MMU type must be used, to ensure that userspace and KVM agree on
      layout.

Please state the size explicitly then.
It's 1k, right?

It's 4K on Freescale chips. We should probably implement sregs first, in which case qemu can read the MMU config registers to find out the minimum supported page size. If we specify 4K here, we should probably just go ahead and stick FSL in the MMU type name. Specifying the hash itself already makes me nervous about claiming the more generic name.

Yup, sounds good :).

Alex
Re: RFC: New API for PPC for vcpu mmu access
On 14.02.2011, at 22:16, Scott Wood wrote:

On Mon, 14 Feb 2011 21:19:19 +0100 Alexander Graf <ag...@suse.de> wrote:

There's no nack here :). The only thing that needs to change is the anonymous part, as that's a GNU extension. Just name the structs and unions and all is well.

Ah, I thought it was an aesthetic objection -- didn't realize it was a GNUism.

Oh well. Maybe it was some other random extension, but it's certainly less compatible :).

The reason I was asking is that I assumed the code would end up being easier, not more complex, without the u32s. In fact, it probably would. I'll leave the final decision to you, whether you want to access things by entry->u81.split.mas8 or entry->mas8_1 & MAS8_1_MAS8_MASK.

After sending that, I was thinking that mas7_3 is more naturally used as a pair, so I'd stick with the u64 there. I think mas8_1 benefits less from the pairing, though -- it's only really useful if you're going to put the value directly in hardware, which we won't.

Sounds good. Make it 2 u32s then :).

The struct name should also have a version indicator - it's the data descriptor for only a single specific mmu_type, right?

It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.

Even fictional future changes to the tlb layout?

No, those need a new MMU type ID. ... which means they need a new name, but then booke_tlb_entry is taken.

Please state the size explicitly then. It's 1k, right?

It's 4K on Freescale chips. We should probably implement sregs first, in which case qemu can read the MMU config registers to find out the minimum supported page size. If we specify 4K here, we should probably just go ahead and stick FSL in the MMU type name. Specifying the hash itself already makes me nervous about claiming the more generic name.

Yup, sounds good :).

Which one, read the MMU config registers, or specify 4K and stick FSL in the name?

The "specify 4K and stick FSL in the name" part.

If we simply always define it to 4K for all currently supported clients of the interface (e500), we should be good, no? No need for evaluations then.

Alex
Re: RFC: New API for PPC for vcpu mmu access
On 15.02.2011, at 00:49, Scott Wood wrote:

On Tue, 15 Feb 2011 00:39:51 +0100 Alexander Graf <ag...@suse.de> wrote:

On 14.02.2011, at 22:16, Scott Wood wrote:

On Mon, 14 Feb 2011 21:19:19 +0100 Alexander Graf <ag...@suse.de> wrote:

The struct name should also have a version indicator - it's the data descriptor for only a single specific mmu_type, right?

It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.

Even fictional future changes to the tlb layout?

No, those need a new MMU type ID. ... which means they need a new name, but then booke_tlb_entry is taken.

The MMU ID name and struct name are about equally generic. I'll add FSL to both. If there's a v2 down the road, then that goes in both.

Very good - thank you!

Alex
Re: RFC: New API for PPC for vcpu mmu access
On 12.02.2011, at 01:57, Scott Wood wrote:

On Fri, 11 Feb 2011 22:07:11 +0100 Alexander Graf <ag...@suse.de> wrote:

On 11.02.2011, at 21:53, Scott Wood wrote:

On Fri, 11 Feb 2011 02:41:35 +0100 Alexander Graf <ag...@suse.de> wrote:

Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between, and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?

Hmm, the idea is growing on me. So then everything we need to get all the functionality we need is a hint from kernel to user space that something changed, and vice versa.

From kernel to user space is simple. We can just document that after every RUN, all fields can be modified. From user space to kernel, we could modify the entries directly and then pass in an ioctl that passes a dirty bitmap to kernel space. KVM can then decide what to do with it. I guess the easiest implementation for now would be to ignore the bitmap and simply flush the shadow tlb.

That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and pass in the "something changed" call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint on the amount of flushes, so KVM can implement some threshold.

OK. We'll also need a config ioctl to specify MMU type/size and the address of the arrays.

Right, a setup call basically :).

OK, here goes v3:

[Note: one final thought after writing this up -- this isn't going to work too well in cases where the guest can directly manipulate its TLB, such as with the LRAT feature of Power Arch 2.06. We'll still need a copy-in/copy-out mechanism for that.]

In that case qemu sets the mmu mode respectively and is aware of the fact that it needs get/set methods. We'll get to that when we have LRAT implemented :).

    struct kvmppc_book3e_tlb_entry {
        union {
            __u64 mas8_1;
            struct {
                __u32 mas8;
                __u32 mas1;
            };
        };
        __u64 mas2;
        union {
            __u64 mas7_3;
            struct {
                __u32 mas7;
                __u32 mas3;
            };
        };
    };

Looks good to me, except for the anonymous unions and structs of course. Avi dislikes those :).

Is there any obvious reason we need to have unions in the first place? The compiler should be clever enough to pick the right size accessors when writing/reading masked __u64 entries, no?

The struct name should also have a version indicator - it's the data descriptor for only a single specific mmu_type, right?

For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in kvmppc_book3e_tlb_entry is present but not supported.

    struct kvmppc_book3e_tlb_params {
        /*
         * book3e defines 4 TLBs.  Individual implementations may have
         * fewer.  TLBs that do not already exist on the target must be
         * configured with a size of zero.  A tlb_ways value of zero means
         * the array is fully associative.  Only TLBs that are already
         * set-associative on the target may be configured with a different
         * associativity.  A set-associative TLB may not exceed 255 ways.
         *
         * KVM will adjust TLBnCFG based on the sizes configured here,
         * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
         * set to zero.
         *
         * The size of any TLB that is set-associative must be a multiple of
         * the number of ways, and the number of sets must be a power of two.
         *
         * The page sizes supported by a TLB shall be determined by reading
         * the TLB configuration registers.  This is not adjustable by
         * userspace.  [Note: need sregs]
         */
        __u32 tlb_sizes[4];
        __u8 tlb_ways[4];
        __u32 reserved[11];
    };

    KVM_CONFIG_TLB
    --------------
    Capability: KVM_CAP_SW_TLB
    Type: vcpu ioctl
    Parameters: struct kvm_config_tlb (in)
    Returns: 0 on success, -1 on error

    struct kvm_config_tlb {
        __u64 params;
        __u64 array;
        __u32 mmu_type;
        __u32 array_len;

Some reserved bits please. IIRC Avi also really likes it when, in addition to reserved fields, there's also a features int that indicates which reserved fields are actually usable. Shouldn't hurt here either, right?

    };

Configures the virtual CPU's TLB array, establishing a shared memory area between userspace and KVM. The params and array fields are userspace addresses of mmu-type-specific data
Re: RFC: New API for PPC for vcpu mmu access
On Fri, 11 Feb 2011 02:41:35 +0100 Alexander Graf <ag...@suse.de> wrote:

Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between, and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?

Hmm, the idea is growing on me. So then everything we need to get all the functionality we need is a hint from kernel to user space that something changed, and vice versa.

From kernel to user space is simple. We can just document that after every RUN, all fields can be modified. From user space to kernel, we could modify the entries directly and then pass in an ioctl that passes a dirty bitmap to kernel space. KVM can then decide what to do with it. I guess the easiest implementation for now would be to ignore the bitmap and simply flush the shadow tlb.

That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and pass in the "something changed" call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint on the amount of flushes, so KVM can implement some threshold.

OK. We'll also need a config ioctl to specify MMU type/size and the address of the arrays.

Also, please tell me you didn't implement the previous revisions already.

I didn't. :-)

-Scott
Re: RFC: New API for PPC for vcpu mmu access
On 11.02.2011, at 21:53, Scott Wood wrote:

On Fri, 11 Feb 2011 02:41:35 +0100 Alexander Graf <ag...@suse.de> wrote:

Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between, and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?

Hmm, the idea is growing on me. So then everything we need to get all the functionality we need is a hint from kernel to user space that something changed, and vice versa.

From kernel to user space is simple. We can just document that after every RUN, all fields can be modified. From user space to kernel, we could modify the entries directly and then pass in an ioctl that passes a dirty bitmap to kernel space. KVM can then decide what to do with it. I guess the easiest implementation for now would be to ignore the bitmap and simply flush the shadow tlb.

That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and pass in the "something changed" call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint on the amount of flushes, so KVM can implement some threshold.

OK. We'll also need a config ioctl to specify MMU type/size and the address of the arrays.

Right, a setup call basically :).

Alex
Re: RFC: New API for PPC for vcpu mmu access
On Fri, 11 Feb 2011 22:07:11 +0100 Alexander Graf <ag...@suse.de> wrote:

On 11.02.2011, at 21:53, Scott Wood wrote:

On Fri, 11 Feb 2011 02:41:35 +0100 Alexander Graf <ag...@suse.de> wrote:

Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between, and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right?

Hmm, the idea is growing on me. So then everything we need to get all the functionality we need is a hint from kernel to user space that something changed, and vice versa.

From kernel to user space is simple. We can just document that after every RUN, all fields can be modified. From user space to kernel, we could modify the entries directly and then pass in an ioctl that passes a dirty bitmap to kernel space. KVM can then decide what to do with it. I guess the easiest implementation for now would be to ignore the bitmap and simply flush the shadow tlb.

That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and pass in the "something changed" call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint on the amount of flushes, so KVM can implement some threshold.

OK. We'll also need a config ioctl to specify MMU type/size and the address of the arrays.

Right, a setup call basically :).

OK, here goes v3:

[Note: one final thought after writing this up -- this isn't going to work too well in cases where the guest can directly manipulate its TLB, such as with the LRAT feature of Power Arch 2.06. We'll still need a copy-in/copy-out mechanism for that.]

    struct kvmppc_book3e_tlb_entry {
        union {
            __u64 mas8_1;
            struct {
                __u32 mas8;
                __u32 mas1;
            };
        };
        __u64 mas2;
        union {
            __u64 mas7_3;
            struct {
                __u32 mas7;
                __u32 mas3;
            };
        };
    };

For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in kvmppc_book3e_tlb_entry is present but not supported.

    struct kvmppc_book3e_tlb_params {
        /*
         * book3e defines 4 TLBs.  Individual implementations may have
         * fewer.  TLBs that do not already exist on the target must be
         * configured with a size of zero.  A tlb_ways value of zero means
         * the array is fully associative.  Only TLBs that are already
         * set-associative on the target may be configured with a different
         * associativity.  A set-associative TLB may not exceed 255 ways.
         *
         * KVM will adjust TLBnCFG based on the sizes configured here,
         * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
         * set to zero.
         *
         * The size of any TLB that is set-associative must be a multiple of
         * the number of ways, and the number of sets must be a power of two.
         *
         * The page sizes supported by a TLB shall be determined by reading
         * the TLB configuration registers.  This is not adjustable by
         * userspace.  [Note: need sregs]
         */
        __u32 tlb_sizes[4];
        __u8 tlb_ways[4];
        __u32 reserved[11];
    };

KVM_CONFIG_TLB
--------------
Capability: KVM_CAP_SW_TLB
Type: vcpu ioctl
Parameters: struct kvm_config_tlb (in)
Returns: 0 on success, -1 on error

    struct kvm_config_tlb {
        __u64 params;
        __u64 array;
        __u32 mmu_type;
        __u32 array_len;
    };

Configures the virtual CPU's TLB array, establishing a shared memory area between userspace and KVM. The params and array fields are userspace addresses of mmu-type-specific data structures. The array_len field is a safety mechanism, and should be set to the size in bytes of the memory that userspace has reserved for the array. It must be at least the size dictated by mmu_type and params.

On return from this call, KVM assumes the state of the TLBs to be empty. Prior to calling KVM_RUN, userspace must call KVM_DIRTY_TLB to tell KVM about any valid entries.

While KVM_RUN is active, the shared region is under control of KVM. Its contents are undefined, and any modification by userspace results in boundedly undefined behavior. On return from KVM_RUN, the shared region will reflect the current state of the guest's TLB. If userspace makes any changes, it must call KVM_DIRTY_TLB to tell KVM which entries have been changed, prior to calling KVM_RUN again on this vcpu.
Re: RFC: New API for PPC for vcpu mmu access
Scott Wood wrote:

On Wed, 9 Feb 2011 18:21:40 +0100 Alexander Graf <ag...@suse.de> wrote:

On 07.02.2011, at 21:15, Scott Wood wrote:

That's pretty much what the proposed API does -- except it uses a void pointer instead of uint64_t *.

Oh? Did I miss something there? The proposal looked as if it only transfers a single TLB entry at a time.

Right, I just meant in terms of avoiding a fixed reference to a hw-specific type. How about:

    struct kvmppc_booke_tlb_entry {
        union {
            __u64 mas0_1;
            struct {
                __u32 mas0;
                __u32 mas1;
            };
        };
        __u64 mas2;
        union {
            __u64 mas7_3;
            struct {
                __u32 mas7;
                __u32 mas3;
            };
        };
        __u32 mas8;
        __u32 pad;

Would it make sense to add some reserved fields or would we just bump up the mmu id?

I was thinking we'd just bump the ID. I only stuck pad in there for alignment.

And we're making a large array of it, so padding could hurt.

Ok, thinking about this a bit more. You're basically proposing a list of tlb set calls, with each array field identifying one tlb set call. What I was thinking of was a full TLB sync, so we could keep qemu's internal TLB representation identical to the ioctl layout and then just call that one ioctl to completely overwrite all of qemu's internal data (and vice versa).

    struct kvmppc_booke_tlb_params {
        /*
         * book3e defines 4 TLBs.  Individual implementations may have
         * fewer.  TLBs that do not exist on the target must be configured
         * with a size of zero.  KVM will adjust TLBnCFG based on the sizes
         * configured here, though arrays greater than 2048 entries will
         * have TLBnCFG[NENTRY] set to zero.
         */
        __u32 tlb_sizes[4];

Add some reserved fields?

MMU type ID also controls this, but could add some padding to make extensions simpler (esp. since we're not making an array of it). How much would you recommend?

How about making it 64 bytes? That should leave us plenty of room.

    struct kvmppc_booke_tlb_search {

Search? I thought we agreed on having a search later, after the full get/set is settled?

We agreed on having a full array-like get/set... my preference was to keep it all under one capability, which implies adding it at the same time. But if we do KVM_TRANSLATE, we can probably drop KVM_SEARCH_TLB. I'm skeptical that array-only will not be a performance issue under any usage pattern, but we can implement it and try it out before finalizing any of this.

Yup. We can even implement it, measure what exactly is slow, and then decide on how to implement it. I'd bet that only the emulation stub is slow - and for that KVM_TRANSLATE seems like a good fit.

        struct kvmppc_booke_tlb_entry entry;
        union {
            __u64 mas5_6;
            struct {
                __u64 mas5;
                __u64 mas6;
            };
        };
    };

The fields inside the struct should be __u32, of course. :-P

Ugh, yes :). But since we're dropping this anyways, it doesn't matter, right? :)

    - An entry with MAS1[V] = 0 terminates the list early (but there will
      be no terminating entry if the full array is valid).  On a call to
      KVM_GET_TLB, the contents of elements after the terminator are
      undefined.  On a call to KVM_SET_TLB, excess elements beyond the
      terminating entry may not be accessed by KVM.

Very implementation specific, but ok with me.

I assumed most MMU types would have some straightforward way of marking an entry invalid (if not, it can add a software field in the struct), and that it would be MMU-specific code that is processing the list.

See above :). It's constrained to the BOOKE implementation of that GET/SET anyway. Is this how the hardware works too?

Hardware doesn't process lists of entries. But MAS1[V] is the valid bit in hardware.

    [Note: Once we implement sregs, Qemu can determine which TLBs are
    implemented by reading MMUCFG/TLBnCFG -- but in no case should a TLB
    be unsupported by KVM if its existence is implied by the target CPU]

KVM_SET_TLB
-----------
Capability: KVM_CAP_SW_TLB
Type: vcpu ioctl
Parameters: struct kvm_set_tlb (in)
Returns: 0 on success, -1 on error

    struct kvm_set_tlb {
        __u64 params;
        __u64 array;
        __u32 mmu_type;
    };

    [Note: I used __u64 rather than void * to avoid the need for special
    compat handling with 32-bit userspace on a 64-bit kernel -- if the
    other way is preferred, that's fine with me]

Oh, now I understand what you were proposing :). Sorry. No, this way is sane.

What about the ioctls that take only a pointer? The actual calling mechanism should work without compat, but in order for _IOR and such to not assign a different IOCTL number based
Re: RFC: New API for PPC for vcpu mmu access
Scott Wood wrote:

On Thu, 3 Feb 2011 10:19:06 +0100 Alexander Graf <ag...@suse.de> wrote:

Yeah, that one's tricky. Usually the way the memory resolver in qemu works is as follows:

    * kvm goes to qemu
    * qemu fetches all mmu and register data from kvm
    * qemu runs its mmu resolution function as if the target was emulated

So the normal way would be to fetch _all_ TLB entries from KVM, shove them into env, and implement the MMU in qemu (at least enough of it to enable debugging). No other target modifies this code path. But no other target needs to copy 30kb of data only to get the mmu data either :).

I guess you mean that cpu_synchronize_state() is supposed to pull in the MMU state, though I don't see where it gets called for 'm'/'M' commands in the gdb stub.

Well, we could also call it in get_phys_page_debug in target-ppc, but yes. I guess the reason it works for now is that SDR1 is pretty constant and was fetched earlier on. For BookE, not syncing is obviously even more broken.

The MMU code seems to be pretty target-specific. It's not clear to what extent there is a "normal way", versus what book3s happens to rely on in its get_physical_address() code. I don't think there are any platforms supported yet (with both KVM and a non-empty cpu_get_phys_page_debug() implementation) that have a pure software-managed TLB. x86 has page tables, and book3s has the hash table (603/e300 doesn't, or more accurately Linux doesn't use it, but I guess that's not supported by KVM yet?).

As for PPC, only 440, e500 and G3-5 are basically supported. It happens to work on POWER4 and above too, and I've even got reports that it's good on e600 :).

We could probably do some sort of lazy state transfer, only when MMU code that needs it is run. This could initially include debug translations, for testing a non-KVM-dependent get_physical_address() implementation, but eventually that would use KVM_TRANSLATE (when KVM is used) and thus not trigger the state transfer.

Yup :).

I'd also like to add an "info tlb" command, which would require the state transfer.

Very nice.

BTW, how much other than the MMU is missing to be able to run an e500 target in qemu, without kvm? The last person working on BookE emulation was Edgar. Edgar, how far did you get?

Alex
Re: RFC: New API for PPC for vcpu mmu access
On Thu, Feb 10, 2011 at 12:55:22PM +0100, Alexander Graf wrote: Scott Wood wrote: On Thu, 3 Feb 2011 10:19:06 +0100 Alexander Graf ag...@suse.de wrote: Yeah, that one's tricky. Usually the way the memory resolver in qemu works is as follows: * kvm goes to qemu * qemu fetches all mmu and register data from kvm * qemu runs its mmu resolution function as if the target was emulated So the normal way would be to fetch _all_ TLB entries from KVM, shove them into env and implement the MMU in qemu (at least enough of it to enable debugging). No other target modifies this code path. But no other target needs to copy 30kb of data only to get the mmu data either :). I guess you mean that cpu_synchronize_state() is supposed to pull in the MMU state, though I don't see where it gets called for 'm'/'M' commands in the gdb stub. Well, we could also call it in get_phys_page_debug in target-ppc, but yes. I guess the reason it works for now is that SDR1 is pretty constant and was fetched earlier on. For BookE not syncing is obviously even more broken. The MMU code seems to be pretty target-specific. It's not clear to what extent there is a normal way, versus what book3s happens to rely on in its get_physical_address() code. I don't think there are any platforms supported yet (with both KVM and a non-empty cpu_get_phys_page_debug() implementation) that have a pure software-managed TLB. x86 has page tables, and book3s has the hash table (603/e300 doesn't, or more accurately Linux doesn't use it, but I guess that's not supported by KVM yet?). As for PPC, only 440, e500 and G3-5 are basically supported. It happens to work on POWER4 and above too and I've even got reports that it's good on e600 :). We could probably do some sort of lazy state transfer only when MMU code that needs it is run. 
This could initially include debug translations, for testing a non-KVM-dependent get_physical_address() implementation, but eventually that would use KVM_TRANSLATE (when KVM is used) and thus not trigger the state transfer. Yup :). I'd also like to add an info tlb command, which would require the state transfer. Very nice. BTW, how much other than the MMU is missing to be able to run an e500 target in qemu, without kvm? The last person working on BookE emulation was Edgar. Edgar, how far did you get? Hi, TBH, I don't really know. My goal was to get linux running on a PPC-440 embedded with the Xilinx FPGAs. I managed to fix enough BookE emulation to get that far. After that, we've done a few more hacks to run fsboot and uboot. Also, we've added support for some of the BookE debug registers to be able to run gdbserver from within linux guests. Some of these patches haven't made it upstream yet. I haven't taken the time to compare the specs to qemu code, so I don't really know how much is missing. My guess is that if you want to run linux guests, the MMU won't be the limiting factor. Cheers
Re: RFC: New API for PPC for vcpu mmu access
On Thu, 10 Feb 2011 12:45:38 +0100 Alexander Graf ag...@suse.de wrote: Ok, thinking about this a bit more. You're basically proposing a list of tlb set calls, with each array field identifying one tlb set call. What I was thinking of was a full TLB sync, so we could keep qemu's internal TLB representation identical to the ioctl layout and then just call that one ioctl to completely overwrite all of qemu's internal data (and vice versa). No, this is a full sync -- the list replaces any existing TLB entries (need to make that explicit in the doc). Basically it's an invalidate plus a list of tlb set operations. Qemu's internal representation will want to be ordered with no missing entries. If we require that of the transfer representation we can't do early termination. It would also limit Qemu's flexibility in choosing its internal representation, and make it more awkward to support multiple MMU types. Let's see if the format conversion imposes significant overhead before imposing a less flexible/larger transfer format. :-) MMU type ID also controls this, but could add some padding to make extensions simpler (esp. since we're not making an array of it). How much would you recommend? How about making it 64 bytes? That should leave us plenty of room. OK. The fields inside the struct should be __u32, of course. :-P Ugh, yes :). But since we're dropping this anyways, it doesn't matter, right? :) Right. I assumed most MMU types would have some straightforward way of marking an entry invalid (if not, it can add a software field in the struct), and that it would be MMU-specific code that is processing the list. See above :). Which part? -Scott
Re: RFC: New API for PPC for vcpu mmu access
On 10.02.2011, at 19:51, Scott Wood wrote: On Thu, 10 Feb 2011 12:45:38 +0100 Alexander Graf ag...@suse.de wrote: Ok, thinking about this a bit more. You're basically proposing a list of tlb set calls, with each array field identifying one tlb set call. What I was thinking of was a full TLB sync, so we could keep qemu's internal TLB representation identical to the ioctl layout and then just call that one ioctl to completely overwrite all of qemu's internal data (and vice versa). No, this is a full sync -- the list replaces any existing TLB entries (need to make that explicit in the doc). Basically it's an invalidate plus a list of tlb set operations. Qemu's internal representation will want to be ordered with no missing entries. If we require that of the transfer representation we can't do early termination. It would also limit Qemu's flexibility in choosing its internal representation, and make it more awkward to support multiple MMU types. Well, but this way it means we'll have to assemble/disassemble a list of entries multiple times: SET: * qemu assembles the list from its internal representation * kvm disassembles the list into its internal structure GET: * kvm assembles the list from its internal representation * qemu disassembles the list into its internal structure Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right? Alex
Re: RFC: New API for PPC for vcpu mmu access
On 11.02.2011, at 01:20, Alexander Graf wrote: On 10.02.2011, at 19:51, Scott Wood wrote: On Thu, 10 Feb 2011 12:45:38 +0100 Alexander Graf ag...@suse.de wrote: Ok, thinking about this a bit more. You're basically proposing a list of tlb set calls, with each array field identifying one tlb set call. What I was thinking of was a full TLB sync, so we could keep qemu's internal TLB representation identical to the ioctl layout and then just call that one ioctl to completely overwrite all of qemu's internal data (and vice versa). No, this is a full sync -- the list replaces any existing TLB entries (need to make that explicit in the doc). Basically it's an invalidate plus a list of tlb set operations. Qemu's internal representation will want to be ordered with no missing entries. If we require that of the transfer representation we can't do early termination. It would also limit Qemu's flexibility in choosing its internal representation, and make it more awkward to support multiple MMU types. Well, but this way it means we'll have to assemble/disassemble a list of entries multiple times: SET: * qemu assembles the list from its internal representation * kvm disassembles the list into its internal structure GET: * kvm assembles the list from its internal representation * qemu disassembles the list into its internal structure Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right? If we could then keep qemu's internal representation == shared data with kvm == kvm's internal data for guest visible stuff, we get this done with almost no additional overhead. 
And I don't see any problem with this. Should be easily doable. Alex
Re: RFC: New API for PPC for vcpu mmu access
On 11.02.2011, at 01:22, Alexander Graf wrote: On 11.02.2011, at 01:20, Alexander Graf wrote: On 10.02.2011, at 19:51, Scott Wood wrote: On Thu, 10 Feb 2011 12:45:38 +0100 Alexander Graf ag...@suse.de wrote: Ok, thinking about this a bit more. You're basically proposing a list of tlb set calls, with each array field identifying one tlb set call. What I was thinking of was a full TLB sync, so we could keep qemu's internal TLB representation identical to the ioctl layout and then just call that one ioctl to completely overwrite all of qemu's internal data (and vice versa). No, this is a full sync -- the list replaces any existing TLB entries (need to make that explicit in the doc). Basically it's an invalidate plus a list of tlb set operations. Qemu's internal representation will want to be ordered with no missing entries. If we require that of the transfer representation we can't do early termination. It would also limit Qemu's flexibility in choosing its internal representation, and make it more awkward to support multiple MMU types. Well, but this way it means we'll have to assemble/disassemble a list of entries multiple times: SET: * qemu assembles the list from its internal representation * kvm disassembles the list into its internal structure GET: * kvm assembles the list from its internal representation * qemu disassembles the list into its internal structure Maybe we should go with Avi's proposal after all and simply keep the full soft-mmu synced between kernel and user space? That way we only need a setup call at first, no copying in between and simply update the user space version whenever something changes in the guest. We need to store the TLB's contents off somewhere anyways, so all we need is an additional in-kernel array with internal translation data, but that can be separate from the guest visible data, right? 
If we could then keep qemu's internal representation == shared data with kvm == kvm's internal data for guest visible stuff, we get this done with almost no additional overhead. And I don't see any problem with this. Should be easily doable. So then all we need to get the full functionality is a hint from kernel to user space that something changed, and vice versa. From kernel to user space is simple. We can just document that after every RUN, all fields can be modified. From user space to kernel, we could modify the entries directly and then pass in an ioctl that passes in a dirty bitmap to kernel space. KVM can then decide what to do with it. I guess the easiest implementation for now would be to ignore the bitmap and simply flush the shadow tlb. That gives us the flush almost for free. All we need to do is set the tlb to all zeros (should be done by env init anyways) and pass in the something changed call. KVM can then decide to simply drop all of its shadow state or loop through every shadow entry and flush it individually. Maybe we should give a hint on the amount of flushes, so KVM can implement some threshold. Also, please tell me you didn't implement the previous revisions already. It'd be a real bummer to see that work wasted only because we're still iterating through the spec O_o. Alex
Re: RFC: New API for PPC for vcpu mmu access
On 02/07/2011 07:30 PM, Yoder Stuart-B08248 wrote: -Original Message- From: kvm-ppc-ow...@vger.kernel.org [mailto:kvm-ppc-ow...@vger.kernel.org] On Behalf Of Avi Kivity Sent: Monday, February 07, 2011 11:14 AM To: Alexander Graf Cc: Wood Scott-B07421; Yoder Stuart-B08248; kvm-ppc@vger.kernel.org; k...@vger.kernel.org; qemu-de...@nongnu.org Subject: Re: RFC: New API for PPC for vcpu mmu access On 02/03/2011 11:19 AM, Alexander Graf wrote: I have no idea what things will look like 10 years down the road, but currently e500mc has 576 entries (512 TLB0, 64 TLB1). That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we want to transfer every time qemu feels like resolving an EA. You could have an ioctl to translate addresses (x86 had KVM_TRANSLATE or similar), or have the TLB stored in user memory, so there is no need to transfer it (on the other hand, you have to re-validate it every time you peek at it). The most convenient and flexible thing for Power Book III-E I think will be something that operates like a TLB search instruction. Inputs are 'address space' and 'process id' and outputs are in which TLB the entry was found and all the components of a TLB entry: address space pid entry number ea rpn guest state permissions flags attributes (WIMGE) Since all of those fields are architected in MAS registers, in the previous proposal we just proposed to return several 32-bit fields (one per MAS) that use the architected layout instead of inventing a brand new structure defining these fields. This looks reasonable assuming you can take the hit of a system call per translation. -- error compiling committee.c: too many arguments to function
Re: RFC: New API for PPC for vcpu mmu access
On 04.02.2011, at 23:33, Scott Wood wrote: On Thu, 3 Feb 2011 10:19:06 +0100 Alexander Graf ag...@suse.de wrote: On 02.02.2011, at 23:08, Scott Wood wrote: On Wed, 2 Feb 2011 22:33:41 +0100 Alexander Graf ag...@suse.de wrote: This seems too fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years? I have no idea what things will look like 10 years down the road, but currently e500mc has 576 entries (512 TLB0, 64 TLB1). That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we want to transfer every time qemu feels like resolving an EA. And that's only with the standard hardware TLB size. On Topaz (our standalone hypervisor) we increased the guest's TLB0 to 16384 entries. It speeds up some workloads nicely, but invalidation-heavy loads get hurt. Yup - I do a similar trick for Book3S. Just that the TLB is implementation dependent anyways and mostly hidden from the OS :). Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around. No, we don't know the size (or necessarily even the structure) of the TLB. KVM may provide a guest TLB that is larger than what hardware has, as a cache to reduce the number of TLB misses that have to go to the guest (we do this now in another hypervisor). Plus sometimes it's just simpler -- why bother halving the size of the guest TLB when running on e500v1? Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :). Maybe, though keeping it in KVM means we can change it whenever we want without having to sync up Qemu and worry about backward compatibility. Quite the contrary - you have to worry more about backward compatibility.
If we implement a new feature that doesn't work on old kernels, we can just tell qemu to not work on those old versions. For the kernel interfaces, we have to keep supporting old userspace. Same-as-hardware TLB geometry with a Qemu-specified number of sets is probably good enough for the foreseeable future, though. There were some other schemes we considered a while back for Topaz, but we ended up just going with a larger version of what's in hardware. Sounds good. Not sure how we'd handle page tables in this whole scheme yet, but I guess we really should take one step at a time. Maybe this even needs to be potentially runtime switchable, in case you boot off with u-boot in the guest, load a kernel and the kernel activates some PV extensions. U-Boot should be OK with it -- the increased TLB size is architecturally valid, and U-boot doesn't do much with TLB0 anyway. If we later have true PV extensions such as a page table, that'd be another matter. Yeah. Or imagine we realize that we should make the TLB size dynamic, so that non-PV aware guest OSs get the exact same behavior as on real hardware (because they're buggy for example) while PV aware guests can bump up the TLB size. But really, let's worry about that later. The only reason we need to do this is because there's no proper reset function in qemu for the e500 tlb. I'd prefer to have that there and push the TLB contents down on reset. The other way to look at it is that there's no need for a reset function if all the state is properly settable. :-) You make it sound as if it was hard to implement a reset function in qemu :). Really, that's where it belongs. Sorry, I misread reset function in qemu as reset ioctl in KVM. This is meant to be used by a qemu reset function. If there's a full-tlb set, that could be used instead, though it'd be slower. With the API as proposed it's needed to clear the slate before you set the individual entries you want.
Yeah, let's just go for a full TLB get/set and optimize specific slow parts later. It's easier to fine-tune stuff that's slow than to miss out on functionality because we were trying to be too clever from the start. Which we want anyway for debugging (and migration, though I wonder if anyone would actually use that with embedded hardware). We certainly should not close the door on migration either way. So all the state has to be 100% user space receivable. Oh, I agree -- or I wouldn't have even mentioned it. :-) I just wouldn't make it the primary optimization concern at this point. No, not at all - no optimization for it. But keeping the door open :). Haven't fully made up my mind on the tlb entry structure yet. Maybe something like struct kvm_ppc_booke_tlbe { __u64 data[8]; }; would be enough? The rest is implementation dependent anyways. Exposing those details to user space doesn't buy us anything. By keeping it generic we can at least
Re: RFC: New API for PPC for vcpu mmu access
On Mon, 7 Feb 2011 17:49:51 +0100 Alexander Graf ag...@suse.de wrote: On 07.02.2011, at 17:40, Yoder Stuart-B08248 wrote: Suggested change to this would be to have Qemu set tlb_type as an _input_ argument. If KVM supports it, that type gets used, else an error is returned. This would allow Qemu to tell the kernel what type of MMU it is prepared to support. Without this Qemu would just have to error out if the type returned is unknown. Yes, we could use the same struct for get and set. On set, it could transfer the mmu type, on get it could tell userspace the mmu type. What happens if a get is done before the first set, and there are multiple MMU type options for this hardware, with differing entry sizes? Qemu would have to know beforehand how large to make the buffer. We could say that going forward, it's expected that qemu will do a TLB set (either a full one, or a lightweight alternative) after creating a vcpu. For compatibility, if this doesn't happen before the vcpu is run, the TLB is created and initialized as it is today, but no new Qemu-visible features will be enabled that way. If Qemu does a get without ever doing some set operation, it should get an error, since the requirement to do a set is added at the same time as the get API. -Scott
Re: RFC: New API for PPC for vcpu mmu access
On Mon, 7 Feb 2011 16:43:02 +0100 Alexander Graf ag...@suse.de wrote: On 04.02.2011, at 23:33, Scott Wood wrote: On Thu, 3 Feb 2011 10:19:06 +0100 Alexander Graf ag...@suse.de wrote: Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :). Maybe, though keeping it in KVM means we can change it whenever we want without having to sync up Qemu and worry about backward compatibility. Quite the contrary - you have to worry more about backward compatibility. If we implement a new feature that doesn't work on old kernels, we can just tell qemu to not work on those old versions. For the kernel interfaces, we have to keep supporting old userspace. If you're talking about actual interface changes, yes. But a change in how KVM implements things behind the scenes shouldn't break an old Qemu, unless it's buggy and makes assumptions not permitted by the API. How's that different from backing the void pointer up with a different struct depending on the MMU type? We weren't proposing unions. A fixed array does mean you wouldn't have to worry about whether qemu supports the more advanced struct format if fields are added -- you can just unconditionally write it, as long as it's backwards compatible. Unless you hit the limit of the pre-determined array size, that is. And if that gets made higher for future expansion, that's even more data that has to get transferred, before it's really needed. Yes, it is. And I don't see how we could easily avoid it. Maybe just pass in a random __user pointer that we directly write to from kernel space and tell qemu how big and what type a tlb entry is? struct request_ppc_tlb { int tlb_type; int tlb_entries; uint64_t __user *tlb_data; }; That's pretty much what the proposed API does -- except it uses a void pointer instead of uint64_t *. Would you really want to loop through 16k entries, doing an ioctl for each? Not really.
The API was modeled after something we did on Topaz where it's just a function call. But something array-based would have been awkward without constraining the geometry. Now that we're going to constrain the geometry, providing an array-based get/set would be easy and should definitely be a part of the API. Then performance really would always be an issue. For cases where you really need to do a full get/set, yes. I would really prefer we tackle it with a full-on tlb get/set first and then put the very flexible one on top, because to me the full-on approach feels like the more generic one. I'm very open to adding an individual tlb get/set and maybe even a kvm, please translate EA x to RA y ioctl. But those should come after we cover the big hammer that just copies everything. If we add everything at once, it minimizes the possibilities that Qemu has to deal with -- either the full MMU API is there, or it's not. BTW, I wonder if we should leave PPC out of the name. It seems like any arch with a software-visible TLB could use this, since the hw details are hidden behind the MMU type. How about: struct kvmppc_booke_tlb_entry { union { __u64 mas0_1; struct { __u32 mas0; __u32 mas1; }; }; __u64 mas2; union { __u64 mas7_3; struct { __u32 mas7; __u32 mas3; }; }; __u32 mas8; __u32 pad; }; struct kvmppc_booke_tlb_params { /* * book3e defines 4 TLBs. Individual implementations may have * fewer. TLBs that do not exist on the target must be configured * with a size of zero. KVM will adjust TLBnCFG based on the sizes * configured here, though arrays greater than 2048 entries will * have TLBnCFG[NENTRY] set to zero. */ __u32 tlb_sizes[4]; }; struct kvmppc_booke_tlb_search { struct kvmppc_booke_tlb_entry entry; union { __u64 mas5_6; struct { __u32 mas5; __u32 mas6; }; }; }; For an MMU type of PPC_BOOKE_NOHV, the mas5 field in kvmppc_booke_tlb_search and the mas8 field in kvmppc_booke_tlb_entry are present but not supported.
For an MMU type of PPC_BOOKE_NOHV or PPC_BOOKE_HV: - TLB entries in get/set arrays may be provided in any order, and all TLBs are get/set at once. - An entry with MAS1[V] = 0 terminates the list early (but there will be no terminating entry if the full array is valid). On a call to KVM_GET_TLB, the contents of elements after the terminator are undefined. On a call to KVM_SET_TLB, excess elements beyond the terminating entry may not be accessed by KVM. [Note: Once we implement sregs, Qemu can determine which TLBs are
Re: RFC: New API for PPC for vcpu mmu access
On Thu, 3 Feb 2011 10:19:06 +0100 Alexander Graf ag...@suse.de wrote: On 02.02.2011, at 23:08, Scott Wood wrote: On Wed, 2 Feb 2011 22:33:41 +0100 Alexander Graf ag...@suse.de wrote: This seems too fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years? I have no idea what things will look like 10 years down the road, but currently e500mc has 576 entries (512 TLB0, 64 TLB1). That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we want to transfer every time qemu feels like resolving an EA. And that's only with the standard hardware TLB size. On Topaz (our standalone hypervisor) we increased the guest's TLB0 to 16384 entries. It speeds up some workloads nicely, but invalidation-heavy loads get hurt. Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around. No, we don't know the size (or necessarily even the structure) of the TLB. KVM may provide a guest TLB that is larger than what hardware has, as a cache to reduce the number of TLB misses that have to go to the guest (we do this now in another hypervisor). Plus sometimes it's just simpler -- why bother halving the size of the guest TLB when running on e500v1? Makes sense. So we basically need an ioctl that tells KVM the MMU type and TLB size. Remember, the userspace tool is the place for policies :). Maybe, though keeping it in KVM means we can change it whenever we want without having to sync up Qemu and worry about backward compatibility. Same-as-hardware TLB geometry with a Qemu-specified number of sets is probably good enough for the foreseeable future, though. There were some other schemes we considered a while back for Topaz, but we ended up just going with a larger version of what's in hardware.
Maybe this even needs to be potentially runtime switchable, in case you boot off with u-boot in the guest, load a kernel and the kernel activates some PV extensions. U-Boot should be OK with it -- the increased TLB size is architecturally valid, and U-boot doesn't do much with TLB0 anyway. If we later have true PV extensions such as a page table, that'd be another matter. The only reason we need to do this is because there's no proper reset function in qemu for the e500 tlb. I'd prefer to have that there and push the TLB contents down on reset. The other way to look at it is that there's no need for a reset function if all the state is properly settable. :-) You make it sound as if it was hard to implement a reset function in qemu :). Really, that's where it belongs. Sorry, I misread reset function in qemu as reset ioctl in KVM. This is meant to be used by a qemu reset function. If there's a full-tlb set, that could be used instead, though it'd be slower. With the API as proposed it's needed to clear the slate before you set the individual entries you want. Which we want anyway for debugging (and migration, though I wonder if anyone would actually use that with embedded hardware). We certainly should not close the door on migration either way. So all the state has to be 100% user space receivable. Oh, I agree -- or I wouldn't have even mentioned it. :-) I just wouldn't make it the primary optimization concern at this point. Haven't fully made up my mind on the tlb entry structure yet. Maybe something like struct kvm_ppc_booke_tlbe { __u64 data[8]; }; would be enough? The rest is implementation dependent anyways. Exposing those details to user space doesn't buy us anything. By keeping it generic we can at least still build against older kernel headers :). If it's not exposed to userspace, how is userspace going to interpret/fill in the data? It can overlay cast according to the MMU type. 
How's that different from backing the void pointer up with a different struct depending on the MMU type? We weren't proposing unions. A fixed array does mean you wouldn't have to worry about whether qemu supports the more advanced struct format if fields are added -- you can just unconditionally write it, as long as it's backwards compatible. Unless you hit the limit of the pre-determined array size, that is. And if that gets made higher for future expansion, that's even more data that has to get transferred, before it's really needed. As for kernel headers, I think qemu needs to provide its own copy, like qemu-kvm does, and like http://kernelnewbies.org/KernelHeaders suggests for programs which rely on recent kernel APIs (which Qemu+KVM tends to do already). Yeah, tedious story... I guess it's come up before? Userspace should only really need the TLB entries for 1) Debugging 2) Migration So I don't see the
Re: RFC: New API for PPC for vcpu mmu access
On 02.02.2011, at 23:34, Yoder Stuart-B08248 wrote: -Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Wednesday, February 02, 2011 3:34 PM To: Yoder Stuart-B08248 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; qemu-de...@nongnu.org Subject: Re: RFC: New API for PPC for vcpu mmu access On 02.02.2011, at 21:33, Yoder Stuart-B08248 wrote: Below is a proposal for a new API for PPC to allow KVM clients to set MMU state in a vcpu. BookE processors have one or more software managed TLBs and currently there is no mechanism for Qemu to initialize or access them. This is needed for normal initialization as well as debug. There are 4 APIs: -KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type of MMU with KVM-- the type determines the size and format of the data in the other APIs This should be done through the PVR hint in sregs, no? Usually a single CPU type only has a single MMU type. -KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all TLBs in the vcpu -KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture specifies the format of the MMU data passed in This seems too fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years? Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around. Yes, we thought about that approach but the idea here, as Scott described, was to provide an API that could work if user space is unaware of the geometry of the TLB. Userspace shouldn't be unaware of the TLB, that's the whole point :). In principle, all state should be fetchable from userspace - so it has to know the geometry. Take a look at Power ISA Version 2.06.1 (on power.org) at the definition of TLBnCFG in Book E.
The NENTRY and ASSOC fields now have meaning that allow TLB geometries that cannot be described in the TLBnCFG registers. It's certainly not easy to assemble all the required information in userspace, but we need to do so nevertheless - having the state available simply buys us a lot in terms of flexibility. I think the use case where this API would be used the most would be from a gdb stub that needed to look up an effective address. I agree. As I stated in my previous mail, there's probably good rationale to explicitly tune that path. That doesn't mean that we have to leave out the generic one. It should only be an acceleration. After all, this whole flexibility thing with all the potential possibilities is KVM's strong point. We should not close the doors on those :). Alex
RE: RFC: New API for PPC for vcpu mmu access
-Original Message- From: Alexander Graf [mailto:ag...@suse.de] Sent: Wednesday, February 02, 2011 3:34 PM To: Yoder Stuart-B08248 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; qemu-de...@nongnu.org Subject: Re: RFC: New API for PPC for vcpu mmu access On 02.02.2011, at 21:33, Yoder Stuart-B08248 wrote: Below is a proposal for a new API for PPC to allow KVM clients to set MMU state in a vcpu. BookE processors have one or more software managed TLBs and currently there is no mechanism for Qemu to initialize or access them. This is needed for normal initialization as well as debug. There are 4 APIs: -KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type of MMU with KVM-- the type determines the size and format of the data in the other APIs This should be done through the PVR hint in sregs, no? Usually a single CPU type only has a single MMU type. -KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all TLBs in the vcpu -KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture specifies the format of the MMU data passed in This seems too fine-grained. I'd prefer a list of all TLB entries to be pushed in either direction. What's the foreseeable number of TLB entries within the next 10 years? Having the whole stack available would make the sync with qemu easier and also allows us to only do a single ioctl for all the TLB management. Thanks to the PVR we know the size of the TLB, so we don't have to shove that around. Yes, we thought about that approach but the idea here, as Scott described, was to provide an API that could work if user space is unaware of the geometry of the TLB. Take a look at Power ISA Version 2.06.1 (on power.org) at the definition of TLBnCFG in Book E. The NENTRY and ASSOC fields now have meaning that allow TLB geometries that cannot be described in the TLBnCFG registers. I think the use case where this API would be used the most would be from a gdb stub that needed to look up an effective address.
Stuart