Re: RFC: New API for PPC for vcpu mmu access

2011-02-14 Thread Scott Wood
On Sun, 13 Feb 2011 23:43:40 +0100
Alexander Graf ag...@suse.de wrote:

  struct kvmppc_book3e_tlb_entry {
  union {
  __u64 mas8_1;
  struct {
  __u32 mas8;
  __u32 mas1;
  };
  };
  __u64 mas2;
  union {
  __u64 mas7_3;
  struct {
  __u32 mas7;
  __u32 mas3;
  };
  };
  };
 
 Looks good to me, except for the anonymous unions and structs of course. Avi 
 dislikes those :). 

:-(

 Is there any obvious reason we need to have unions in the first place? The
 compiler should be clever enough to pick the right size accessors when
 writing/reading masked  __u64 entries, no?

Yes, the intent was just to make it easier to access individual mas
registers, and reuse existing bit declarations that are defined relative
to individual registers.

Why clutter up the source code when the compiler can deal with it?  Same
applies to the anonymous unions/structs -- it's done that way to present
the fields in the most straightforward manner (equivalent to how the SPRs
themselves are named and aliased).

If it's a firm NACK on the existing structure, I think I'd rather just drop
the paired versions altogether (but still order them this way so it can be
changed back if there's any desire to in the future, without breaking
compatibility).

 The struct name should also have
 a version indicator - it's the data descriptor for only a single specific
 mmu_type, right?

It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.

  For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in 
  kvmppc_book3e_tlb_entry is
  present but not supported.
  
  struct kvmppc_book3e_tlb_params {
  /*
   * book3e defines 4 TLBs.  Individual implementations may have
   * fewer.  TLBs that do not already exist on the target must be
   * configured with a size of zero.  A tlb_ways value of zero means
   * the array is fully associative.  Only TLBs that are already
   * set-associative on the target may be configured with a different
   * associativity.  A set-associative TLB may not exceed 255 ways.
   *
   * KVM will adjust TLBnCFG based on the sizes configured here,
   * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
   * set to zero.
   *
   * The size of any TLB that is set-associative must be a multiple of
   * the number of ways, and the number of sets must be a power of two.
   *
   * The page sizes supported by a TLB shall be determined by reading
   * the TLB configuration registers.  This is not adjustable by 
  userspace.
   * [Note: need sregs]
   */
  __u32 tlb_sizes[4];
  __u8 tlb_ways[4];
  __u32 reserved[11];
  };
  
  KVM_CONFIG_TLB
  --
  
  Capability: KVM_CAP_SW_TLB
  Type: vcpu ioctl
  Parameters: struct kvm_config_tlb (in)
  Returns: 0 on success
  -1 on error
  
  struct kvm_config_tlb {
  __u64 params;
  __u64 array;
  __u32 mmu_type;
  __u32 array_len;
 
 Some reserved bits please. IIRC Avi also really likes it when in addition to 
 reserved fields there's also a features int that indicates which reserved 
 fields are actually usable. Shouldn't hurt here either, right?

params itself is a versioned struct, with reserved bits.  Do we really need
it here as well?

  - The hash for determining set number is:
  (MAS2[EPN] / page_size) & (num_sets - 1)
where page_size is the smallest page size supported by the TLB, and
num_sets is the tlb_sizes[] value divided by the tlb_ways[] value.
If a book3e chip is made for which a different hash is needed, a new
MMU type must be used, to ensure that userspace and KVM agree on layout.
 
 Please state the size explicitly then. It's 1k, right?

It's 4K on Freescale chips.  We should probably implement sregs first, in
which case qemu can read the MMU config registers to find out the minimum
supported page size.

If we specify 4K here, we should probably just go ahead and stick FSL in
the MMU type name.  Specifying the hash itself already makes me nervous
about claiming the more generic name.
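
For illustration, a minimal sketch of that hash in C, using the 4K minimum
page size mentioned here and an e500-style 512-entry TLB0 as the example
geometry (the 4-way associativity is an assumption for the example, not part
of the proposal):

#include <stdint.h>

/* Example geometry (assumed): 512-entry TLB0, 4 ways -> 128 sets. */
#define TLB0_ENTRIES   512
#define TLB0_WAYS      4
#define TLB0_SETS      (TLB0_ENTRIES / TLB0_WAYS)  /* tlb_sizes[0] / tlb_ways[0] */
#define MIN_PAGE_SIZE  4096                        /* smallest supported page size */

/* Set number as described above: (MAS2[EPN] / page_size) & (num_sets - 1),
 * where the EPN is the effective address with the sub-page bits cleared. */
static inline uint32_t tlb0_set_of(uint64_t epn)
{
	return (uint32_t)((epn / MIN_PAGE_SIZE) & (TLB0_SETS - 1));
}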

-Scott



Re: RFC: New API for PPC for vcpu mmu access

2011-02-14 Thread Alexander Graf

On 14.02.2011, at 18:11, Scott Wood scottw...@freescale.com wrote:

 On Sun, 13 Feb 2011 23:43:40 +0100
 Alexander Graf ag...@suse.de wrote:
 
 struct kvmppc_book3e_tlb_entry {
union {
__u64 mas8_1;
struct {
__u32 mas8;
__u32 mas1;
};
};
__u64 mas2;
union {
__u64 mas7_3;
struct {
__u32 mas7;
__u32 mas3;
};
};
 };
 
 Looks good to me, except for the anonymous unions and structs of course. Avi 
 dislikes those :). 
 
 :-(
 
 Is there any obvious reason we need to have unions in the first place? The
 compiler should be clever enough to pick the right size accessors when
 writing/reading masked  __u64 entries, no?
 
 Yes, the intent was just to make it easier to access individual mas
 registers, and reuse existing bit declarations that are defined relative
 to individual registers.
 
 Why clutter up the source code when the compiler can deal with it?  Same
 applies to the anonymous unions/structs -- it's done that way to present
 the fields in the most straightforward manner (equivalent to how the SPRs
 themselves are named and aliased).
 
 If it's a firm NACK on the existing structure, I think I'd rather just drop
 the paired versions altogether (but still order them this way so it can be
 changed back if there's any desire to in the future, without breaking
 compatibility).

There's no nack here :). The only thing that needs to change is the anonymous 
part, as that's a gnu extension. Just name the structs and unions and all is 
well.

The reason I was asking is that I assumed the code would end up being easier,
not more complex without the u32s. In fact, it probably would. I'll leave the
final decision to you on whether you want to access things by entry->u81.split.mas8
or entry->mas8_1 & MAS8_1_MAS8_MASK.
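
To make the two options concrete, a small sketch; the u81/split names and the
MAS8_1_* constants follow the examples in this mail and are hypothetical, not
an existing header, and the mask assumes the big-endian layout where mas8 is
the upper half of mas8_1:

#include <stdint.h>

/* Option 1: named (non-anonymous) union and struct. */
struct tlb_entry_named {
	union {
		uint64_t mas8_1;
		struct {
			uint32_t mas8;
			uint32_t mas1;
		} split;
	} u81;
	/* ... mas2, mas7_3 ... */
};

/* Option 2: plain u64 plus hypothetical mask/shift constants. */
#define MAS8_1_MAS8_SHIFT	32
#define MAS8_1_MAS8_MASK	(0xffffffffULL << MAS8_1_MAS8_SHIFT)

static inline uint32_t mas8_of_named(const struct tlb_entry_named *entry)
{
	return entry->u81.split.mas8;
}

static inline uint32_t mas8_of_packed(uint64_t mas8_1)
{
	return (uint32_t)((mas8_1 & MAS8_1_MAS8_MASK) >> MAS8_1_MAS8_SHIFT);
}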

 
 The struct name should also have
 a version indicator - it's the data descriptor for only a single specific
 mmu_type, right?
 
 It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.

Even fictional future changes to the tlb layout? Either way, we can name that 
differently when it comes.

 
 For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in 
 kvmppc_book3e_tlb_entry is
 present but not supported.
 
 struct kvmppc_book3e_tlb_params {
/*
 * book3e defines 4 TLBs.  Individual implementations may have
 * fewer.  TLBs that do not already exist on the target must be
 * configured with a size of zero.  A tlb_ways value of zero means
 * the array is fully associative.  Only TLBs that are already
 * set-associative on the target may be configured with a different
 * associativity.  A set-associative TLB may not exceed 255 ways.
 *
 * KVM will adjust TLBnCFG based on the sizes configured here,
 * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
 * set to zero.
 *
 * The size of any TLB that is set-associative must be a multiple of
 * the number of ways, and the number of sets must be a power of two.
 *
 * The page sizes supported by a TLB shall be determined by reading
 * the TLB configuration registers.  This is not adjustable by userspace.
 * [Note: need sregs]
 */
__u32 tlb_sizes[4];
__u8 tlb_ways[4];
__u32 reserved[11];
 };
 
 KVM_CONFIG_TLB
 --
 
 Capability: KVM_CAP_SW_TLB
 Type: vcpu ioctl
 Parameters: struct kvm_config_tlb (in)
 Returns: 0 on success
-1 on error
 
 struct kvm_config_tlb {
__u64 params;
__u64 array;
__u32 mmu_type;
__u32 array_len;
 
 Some reserved bits please. IIRC Avi also really likes it when in addition to 
 reserved fields there's also a features int that indicates which reserved 
 fields are actually usable. Shouldn't hurt here either, right?
 
 params itself is a versioned struct, with reserved bits.  Do we really need
 it here as well?

Right. Probably not :).

 
 - The hash for determining set number is:
 (MAS2[EPN] / page_size) & (num_sets - 1)
  where page_size is the smallest page size supported by the TLB, and
  num_sets is the tlb_sizes[] value divided by the tlb_ways[] value.
  If a book3e chip is made for which a different hash is needed, a new
  MMU type must be used, to ensure that userspace and KVM agree on layout.
 
 Please state the size explicitly then. It's 1k, right?
 
 It's 4K on Freescale chips.  We should probably implement sregs first, in
 which case qemu can read the MMU config registers to find out the minimum
 supported page size.
 
 If we specify 4K here, we should probably just go ahead and stick FSL in
 the MMU type name.  Specifying the hash itself already makes me nervous
 about claiming the more generic name.

Yup, sounds good :).


Alex

 


Re: RFC: New API for PPC for vcpu mmu access

2011-02-14 Thread Alexander Graf

On 14.02.2011, at 22:16, Scott Wood wrote:

 On Mon, 14 Feb 2011 21:19:19 +0100
 Alexander Graf ag...@suse.de wrote:
 
 There's no nack here :). The only thing that needs to change is the 
 anonymous part, as that's a gnu extension. Just name the structs and unions 
 and all is well.
 
 Ah, I thought it was an aesthetic objection -- didn't realize it was a
 GNUism.  Oh well.

Maybe it was some other random extension, but it's certainly less compatible :).

 
 The reason I was asking is that I assumed the code would end up being 
 easier, not more complex without the u32s. In fact, it probably would. I'll 
 leave the final decision to you on whether you want to access things by
 entry->u81.split.mas8 or entry->mas8_1 & MAS8_1_MAS8_MASK.
 
 After sending that, I was thinking that mas7_3 is more naturally used
 as a pair, so I'd stick with the u64 there.
 
 I think mas8_1 benefits less from the pairing, though -- it's only really
 useful if you're going to put the value directly in hardware, which we
 won't.

Sounds good. Make it 2 u32s then :).
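
Putting the two conclusions together (mas8 and mas1 as separate __u32s, mas7_3
kept as a pair), the entry would end up looking roughly like this; a sketch of
the direction agreed on here, not a final layout:

struct kvmppc_book3e_tlb_entry {
	__u32 mas8;	/* present but not supported for KVM_MMU_PPC_BOOK3E_NOHV */
	__u32 mas1;
	__u64 mas2;
	__u64 mas7_3;	/* kept as a pair: naturally used as MAS7 || MAS3 */
};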

 
 The struct name should also have
 a version indicator - it's the data descriptor for only a single specific
 mmu_type, right?
 
 It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.
 
 Even fictional future changes to the tlb layout?
 
 No, those need a new MMU type ID.

... which means they need a new name, but then booke_tlb_entry is taken.

 
 Please state the size explicitly then. It's 1k, right?
 
 It's 4K on Freescale chips.  We should probably implement sregs first, in
 which case qemu can read the MMU config registers to find out the minimum
 supported page size.
 
 If we specify 4K here, we should probably just go ahead and stick FSL in
 the MMU type name.  Specifying the hash itself already makes me nervous
 about claiming the more generic name.
 
 Yup, sounds good :).
 
 Which one, read the MMU config registers or specify 4K and stick FSL in
 the name?

The "specify 4k and stick FSL in the name" part. If we simply always define it
to 4k for all currently supported clients of the interface (e500) we should be
good, no? No need for evaluations then.


Alex



Re: RFC: New API for PPC for vcpu mmu access

2011-02-14 Thread Alexander Graf

On 15.02.2011, at 00:49, Scott Wood wrote:

 On Tue, 15 Feb 2011 00:39:51 +0100
 Alexander Graf ag...@suse.de wrote:
 
 On 14.02.2011, at 22:16, Scott Wood wrote:
 
 On Mon, 14 Feb 2011 21:19:19 +0100
 Alexander Graf ag...@suse.de wrote:
 The struct name should also have
 a version indicator - it's the data descriptor for only a single specific
 mmu_type, right?
 
 It handles both KVM_MMU_PPC_BOOK3E_NOHV and KVM_MMU_PPC_BOOK3E_HV.
 
 Even fictional future changes to the tlb layout?
 
 No, those need a new MMU type ID.
 
 ... which means they need a new name, but then booke_tlb_entry is taken.
 
 The MMU ID name and struct name are about equally generic.  I'll add FSL to
 both.  If there's a v2 down the road, then that goes in both.

Very good - thank you!


Alex



Re: RFC: New API for PPC for vcpu mmu access

2011-02-13 Thread Alexander Graf

On 12.02.2011, at 01:57, Scott Wood wrote:

 On Fri, 11 Feb 2011 22:07:11 +0100
 Alexander Graf ag...@suse.de wrote:
 
 
 On 11.02.2011, at 21:53, Scott Wood wrote:
 
 On Fri, 11 Feb 2011 02:41:35 +0100
 Alexander Graf ag...@suse.de wrote:
 
 Maybe we should go with Avi's proposal after all and simply keep the 
 full soft-mmu synced between kernel and user space? That way we only 
 need a setup call at first, no copying in between and simply update the 
 user space version whenever something changes in the guest. We need to 
 store the TLB's contents off somewhere anyways, so all we need is an 
 additional in-kernel array with internal translation data, but that can 
 be separate from the guest visible data, right?
 
 Hmm, the idea is growing on me.
 
 So then everything we need to get all the functionality we need is a hint 
 from kernel to user space that something changed and vice versa.
 
 From kernel to user space is simple. We can just document that after every 
 RUN, all fields can be modified.
 From user space to kernel, we could modify the entries directly and then 
 pass in an ioctl that passes in a dirty bitmap to kernel space. KVM can 
 then decide what to do with it. I guess the easiest implementation for now 
 would be to ignore the bitmap and simply flush the shadow tlb.
 
 That gives us the flush almost for free. All we need to do is set the tlb 
 to all zeros (should be done by env init anyways) and pass in the 
 something changed call. KVM can then decide to simply drop all of its 
 shadow state or loop through every shadow entry and flush it individually. 
 Maybe we should give a hint on the amount of flushes, so KVM can implement 
 some threshold.
 
 OK.  We'll also need a config ioctl to specify MMU type/size and the address
 of the arrays.
 
 Right, a setup call basically :).
 
 OK, here goes v3:
 
 [Note: one final thought after writing this up -- this isn't going to work
 too well in cases where the guest can directly manipulate its TLB, such as
 with the LRAT feature of Power Arch 2.06.  We'll still need a
 copy-in/copy-out mechanism for that.]

In that case qemu sets the mmu mode accordingly and is aware of the fact that
it needs get/set methods. We'll get to that when we have LRAT implemented :).

 struct kvmppc_book3e_tlb_entry {
   union {
   __u64 mas8_1;
   struct {
   __u32 mas8;
   __u32 mas1;
   };
   };
   __u64 mas2;
   union {
   __u64 mas7_3;
   struct {
   __u32 mas7;
   __u32 mas3;
   };
   };
 };

Looks good to me, except for the anonymous unions and structs of course. Avi
dislikes those :). Is there any obvious reason we need to have unions in the
first place? The compiler should be clever enough to pick the right size
accessors when writing/reading masked __u64 entries, no? The struct name
should also have a version indicator - it's the data descriptor for only a single
specific mmu_type, right?

 
 For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in 
 kvmppc_book3e_tlb_entry is
 present but not supported.
 
 struct kvmppc_book3e_tlb_params {
   /*
* book3e defines 4 TLBs.  Individual implementations may have
* fewer.  TLBs that do not already exist on the target must be
* configured with a size of zero.  A tlb_ways value of zero means
* the array is fully associative.  Only TLBs that are already
* set-associative on the target may be configured with a different
* associativity.  A set-associative TLB may not exceed 255 ways.
*
* KVM will adjust TLBnCFG based on the sizes configured here,
* though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
* set to zero.
*
* The size of any TLB that is set-associative must be a multiple of
* the number of ways, and the number of sets must be a power of two.
*
* The page sizes supported by a TLB shall be determined by reading
* the TLB configuration registers.  This is not adjustable by 
 userspace.
* [Note: need sregs]
*/
   __u32 tlb_sizes[4];
   __u8 tlb_ways[4];
   __u32 reserved[11];
 };
 
 KVM_CONFIG_TLB
 --
 
 Capability: KVM_CAP_SW_TLB
 Type: vcpu ioctl
 Parameters: struct kvm_config_tlb (in)
 Returns: 0 on success
 -1 on error
 
 struct kvm_config_tlb {
   __u64 params;
   __u64 array;
   __u32 mmu_type;
   __u32 array_len;

Some reserved bits please. IIRC Avi also really likes it when in addition to 
reserved fields there's also a features int that indicates which reserved 
fields are actually usable. Shouldn't hurt here either, right?

 };
 
 Configures the virtual CPU's TLB array, establishing a shared memory area
 between userspace and KVM.  The params and array fields are userspace
 addresses of mmu-type-specific data 

Re: RFC: New API for PPC for vcpu mmu access

2011-02-11 Thread Scott Wood
On Fri, 11 Feb 2011 02:41:35 +0100
Alexander Graf ag...@suse.de wrote:

  Maybe we should go with Avi's proposal after all and simply keep the full 
  soft-mmu synced between kernel and user space? That way we only need a 
  setup call at first, no copying in between and simply update the user 
  space version whenever something changes in the guest. We need to store 
  the TLB's contents off somewhere anyways, so all we need is an additional 
  in-kernel array with internal translation data, but that can be separate 
  from the guest visible data, right?

Hmm, the idea is growing on me.

 So then everything we need to get all the functionality we need is a hint 
 from kernel to user space that something changed and vice versa.
 
 From kernel to user space is simple. We can just document that after every 
 RUN, all fields can be modified.
 From user space to kernel, we could modify the entries directly and then pass 
 in an ioctl that passes in a dirty bitmap to kernel space. KVM can then 
 decide what to do with it. I guess the easiest implementation for now would 
 be to ignore the bitmap and simply flush the shadow tlb.
 
 That gives us the flush almost for free. All we need to do is set the tlb to 
 all zeros (should be done by env init anyways) and pass in the something 
 changed call. KVM can then decide to simply drop all of its shadow state or 
 loop through every shadow entry and flush it individually. Maybe we should 
 give a hint on the amount of flushes, so KVM can implement some threshold.

OK.  We'll also need a config ioctl to specify MMU type/size and the address
of the arrays.

 Also, please tell me you didn't implement the previous revisions already.

I didn't. :-)

-Scott



Re: RFC: New API for PPC for vcpu mmu access

2011-02-11 Thread Alexander Graf

On 11.02.2011, at 21:53, Scott Wood wrote:

 On Fri, 11 Feb 2011 02:41:35 +0100
 Alexander Graf ag...@suse.de wrote:
 
 Maybe we should go with Avi's proposal after all and simply keep the full 
 soft-mmu synced between kernel and user space? That way we only need a 
 setup call at first, no copying in between and simply update the user 
 space version whenever something changes in the guest. We need to store 
 the TLB's contents off somewhere anyways, so all we need is an additional 
 in-kernel array with internal translation data, but that can be separate 
 from the guest visible data, right?
 
 Hmm, the idea is growing on me.
 
 So then everything we need to get all the functionality we need is a hint 
 from kernel to user space that something changed and vice versa.
 
 From kernel to user space is simple. We can just document that after every 
 RUN, all fields can be modified.
 From user space to kernel, we could modify the entries directly and then 
 pass in an ioctl that passes in a dirty bitmap to kernel space. KVM can then 
 decide what to do with it. I guess the easiest implementation for now would 
 be to ignore the bitmap and simply flush the shadow tlb.
 
 That gives us the flush almost for free. All we need to do is set the tlb to 
 all zeros (should be done by env init anyways) and pass in the something 
 changed call. KVM can then decide to simply drop all of its shadow state or 
 loop through every shadow entry and flush it individually. Maybe we should 
 give a hint on the amount of flushes, so KVM can implement some threshold.
 
 OK.  We'll also need a config ioctl to specify MMU type/size and the address
 of the arrays.

Right, a setup call basically :).


Alex



Re: RFC: New API for PPC for vcpu mmu access

2011-02-11 Thread Scott Wood
On Fri, 11 Feb 2011 22:07:11 +0100
Alexander Graf ag...@suse.de wrote:

 
 On 11.02.2011, at 21:53, Scott Wood wrote:
 
  On Fri, 11 Feb 2011 02:41:35 +0100
  Alexander Graf ag...@suse.de wrote:
  
  Maybe we should go with Avi's proposal after all and simply keep the 
  full soft-mmu synced between kernel and user space? That way we only 
  need a setup call at first, no copying in between and simply update the 
  user space version whenever something changes in the guest. We need to 
  store the TLB's contents off somewhere anyways, so all we need is an 
  additional in-kernel array with internal translation data, but that can 
  be separate from the guest visible data, right?
  
  Hmm, the idea is growing on me.
  
  So then everything we need to get all the functionality we need is a hint 
  from kernel to user space that something changed and vice versa.
  
  From kernel to user space is simple. We can just document that after every 
  RUN, all fields can be modified.
  From user space to kernel, we could modify the entries directly and then 
  pass in an ioctl that passes in a dirty bitmap to kernel space. KVM can 
  then decide what to do with it. I guess the easiest implementation for now 
  would be to ignore the bitmap and simply flush the shadow tlb.
  
  That gives us the flush almost for free. All we need to do is set the tlb 
  to all zeros (should be done by env init anyways) and pass in the 
  something changed call. KVM can then decide to simply drop all of its 
  shadow state or loop through every shadow entry and flush it individually. 
  Maybe we should give a hint on the amount of flushes, so KVM can implement 
  some threshold.
  
  OK.  We'll also need a config ioctl to specify MMU type/size and the address
  of the arrays.
 
 Right, a setup call basically :).

OK, here goes v3:

[Note: one final thought after writing this up -- this isn't going to work
too well in cases where the guest can directly manipulate its TLB, such as
with the LRAT feature of Power Arch 2.06.  We'll still need a
copy-in/copy-out mechanism for that.]

struct kvmppc_book3e_tlb_entry {
union {
__u64 mas8_1;
struct {
__u32 mas8;
__u32 mas1;
};
};
__u64 mas2;
union {
__u64 mas7_3;
struct {
__u32 mas7;
__u32 mas3;
};
};
};

For an MMU type of KVM_MMU_PPC_BOOK3E_NOHV, the mas8 in kvmppc_book3e_tlb_entry
is present but not supported.

struct kvmppc_book3e_tlb_params {
/*
 * book3e defines 4 TLBs.  Individual implementations may have
 * fewer.  TLBs that do not already exist on the target must be
 * configured with a size of zero.  A tlb_ways value of zero means
 * the array is fully associative.  Only TLBs that are already
 * set-associative on the target may be configured with a different
 * associativity.  A set-associative TLB may not exceed 255 ways.
 *
 * KVM will adjust TLBnCFG based on the sizes configured here,
 * though arrays greater than 2048 entries will have TLBnCFG[NENTRY]
 * set to zero.
 *
 * The size of any TLB that is set-associative must be a multiple of
 * the number of ways, and the number of sets must be a power of two.
 *
 * The page sizes supported by a TLB shall be determined by reading
 * the TLB configuration registers.  This is not adjustable by 
userspace.
 * [Note: need sregs]
 */
__u32 tlb_sizes[4];
__u8 tlb_ways[4];
__u32 reserved[11];
};
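
As an illustration of the constraints above, this is how userspace might fill
the params for an e500mc-like target (512-entry TLB0 and 64-entry fully
associative TLB1, as mentioned elsewhere in the thread; the 4-way TLB0
associativity is an assumption for the example). The reserved[11] field pads
the struct to the 64 bytes discussed elsewhere in the thread:

struct kvmppc_book3e_tlb_params params = {
	/* TLB0: 512 entries, assumed 4-way set associative -> 128 sets. */
	/* TLB1: 64 entries; tlb_ways = 0 means fully associative. */
	/* TLB2/TLB3: not present on this target -> size 0. */
	.tlb_sizes = { 512, 64, 0, 0 },
	.tlb_ways  = { 4, 0, 0, 0 },
};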

KVM_CONFIG_TLB
--

Capability: KVM_CAP_SW_TLB
Type: vcpu ioctl
Parameters: struct kvm_config_tlb (in)
Returns: 0 on success
 -1 on error

struct kvm_config_tlb {
__u64 params;
__u64 array;
__u32 mmu_type;
__u32 array_len;
};

Configures the virtual CPU's TLB array, establishing a shared memory area
between userspace and KVM.  The params and array fields are userspace
addresses of mmu-type-specific data structures.  The array_len field is a
safety mechanism, and should be set to the size in bytes of the memory that
userspace has reserved for the array.  It must be at least the size dictated by
mmu_type and params.

On return from this call, KVM assumes the state of the TLBs to be empty. 
Prior to calling KVM_RUN, userspace must call KVM_DIRTY_TLB to tell KVM about
any valid entries.

While KVM_RUN is active, the shared region is under control of KVM.  Its
contents are undefined, and any modification by userspace results in boundedly
undefined behavior.

On return from KVM_RUN, the shared region will reflect the current state of
the guest's TLB.  If userspace makes any changes, it must call KVM_DIRTY_TLB
to tell KVM which entries have been changed, prior to calling KVM_RUN again
on this vcpu.
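
A rough sketch of the userspace side of this call, as the RFC stands; the
KVM_CONFIG_TLB / KVM_MMU_PPC_BOOK3E_NOHV names and the structs are the proposed
ones, not yet in any header, and error handling is omitted:

#include <stdlib.h>
#include <string.h>
#include <sys/ioctl.h>
#include <linux/types.h>

int config_guest_tlb(int vcpu_fd)
{
	struct kvmppc_book3e_tlb_params params;
	struct kvmppc_book3e_tlb_entry *array;
	struct kvm_config_tlb cfg;
	size_t array_bytes;

	memset(&params, 0, sizeof(params));
	params.tlb_sizes[0] = 512;	/* example geometry, as in the params sketch */
	params.tlb_sizes[1] = 64;
	params.tlb_ways[0] = 4;

	/* Shared region large enough for every entry of every TLB. */
	array_bytes = (512 + 64) * sizeof(*array);
	array = calloc(1, array_bytes);

	cfg.params    = (__u64)(unsigned long)&params;
	cfg.array     = (__u64)(unsigned long)array;
	cfg.mmu_type  = KVM_MMU_PPC_BOOK3E_NOHV;
	cfg.array_len = array_bytes;

	/* After this, KVM assumes the TLBs are empty; valid entries must be
	 * announced with KVM_DIRTY_TLB before the first KVM_RUN. */
	return ioctl(vcpu_fd, KVM_CONFIG_TLB, &cfg);
}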


Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Alexander Graf
Scott Wood wrote:
 On Wed, 9 Feb 2011 18:21:40 +0100
 Alexander Graf ag...@suse.de wrote:

   
 On 07.02.2011, at 21:15, Scott Wood wrote:

 
 That's pretty much what the proposed API does -- except it uses a void
 pointer instead of uint64_t *.
   
 Oh? Did I miss something there? The proposal looked as if it only transfers 
 a single TLB entry at a time.
 

 Right, I just meant in terms of avoiding a fixed reference to a hw-specific
 type.

   
 How about:

 struct kvmppc_booke_tlb_entry {
 union {
 __u64 mas0_1;
 struct {
 __u32 mas0;
 __u32 mas1;
 };
 };
 __u64 mas2;
 union {
 __u64 mas7_3;
 struct {
 __u32 mas7;
 __u32 mas3;
 };
 };
 __u32 mas8;
 __u32 pad;
   
 Would it make sense to add some reserved fields or would we just bump up the 
 mmu id?
 

 I was thinking we'd just bump the ID.  I only stuck pad in there for
 alignment.  And we're making a large array of it, so padding could hurt.
   

Ok, thinking about this a bit more. You're basically proposing a list of
tlb set calls, with each array field identifying one tlb set call. What
I was thinking of was a full TLB sync, so we could keep qemu's internal
TLB representation identical to the ioctl layout and then just call that
one ioctl to completely overwrite all of qemu's internal data (and vice
versa).

 struct kvmppc_booke_tlb_params {
 /*
  * book3e defines 4 TLBs.  Individual implementations may have
  * fewer.  TLBs that do not exist on the target must be configured
  * with a size of zero.  KVM will adjust TLBnCFG based on the sizes
  * configured here, though arrays greater than 2048 entries will
  * have TLBnCFG[NENTRY] set to zero.
  */
 __u32 tlb_sizes[4];
   
 Add some reserved fields?
 

 MMU type ID also controls this, but could add some padding to make
 extensions simpler (esp. since we're not making an array of it).  How much
 would you recommend?
   

How about making it 64 bytes? That should leave us plenty of room.

   
 struct kvmppc_booke_tlb_search {
   
 Search? I thought we agreed on having a search later, after the full get/set 
 is settled?
 

 We agreed on having a full array-like get/set... my preference was to keep
 it all under one capability, which implies adding it at the same time.
 But if we do KVM_TRANSLATE, we can probably drop KVM_SEARCH_TLB.  I'm
 skeptical that array-only will not be a performance issue under any usage
 pattern, but we can implement it and try it out before finalizing any of
 this.
   

Yup. We can even implement it, measure what exactly is slow and then
decide on how to implement it. I'd bet that only the emulation stub is
slow - and for that KVM_TRANSLATE seems like a good fit.

   
 struct kvmppc_booke_tlb_entry entry;
 union {
 __u64 mas5_6;
 struct {
 __u64 mas5;
 __u64 mas6;
 };
 };
 };
   

 The fields inside the struct should be __u32, of course. :-P
   

Ugh, yes :). But since we're dropping this anyways, it doesn't matter,
right? :)

   
 - An entry with MAS1[V] = 0 terminates the list early (but there will
   be no terminating entry if the full array is valid).  On a call to
   KVM_GET_TLB, the contents of elements after the terminator are undefined.
   On a call to KVM_SET_TLB, excess elements beyond the terminating
   entry may not be accessed by KVM.
   
 Very implementation specific, but ok with me. 
 

 I assumed most MMU types would have some straightforward way of marking an
 entry invalid (if not, it can add a software field in the struct), and that
 it would be MMU-specific code that is processing the list.
   

See above :).

   
 It's constrained to the BOOKE implementation of that GET/SET anyway. Is
 this how the hardware works too?
 

 Hardware doesn't process lists of entries.  But MAS1[V] is the valid
 bit in hardware.

   
 [Note: Once we implement sregs, Qemu can determine which TLBs are
 implemented by reading MMUCFG/TLBnCFG -- but in no case should a TLB be
 unsupported by KVM if its existence is implied by the target CPU]

 KVM_SET_TLB
 ---

 Capability: KVM_CAP_SW_TLB
 Type: vcpu ioctl
 Parameters: struct kvm_set_tlb (in)
 Returns: 0 on success
 -1 on error

 struct kvm_set_tlb {
 __u64 params;
 __u64 array;
 __u32 mmu_type;
 };

 [Note: I used __u64 rather than void * to avoid the need for special
 compat handling with 32-bit userspace on a 64-bit kernel -- if the other
 way is preferred, that's fine with me]
   
 Oh, now I understand what you were proposing :). Sorry. No, this way is sane.
 

 What about the ioctls that take only a pointer?  The actual calling
 mechanism should work without compat, but in order for _IOR and such to not
 assign a different IOCTL number based 

Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Alexander Graf
Scott Wood wrote:
 On Thu, 3 Feb 2011 10:19:06 +0100
 Alexander Graf ag...@suse.de wrote:

   
 Yeah, that one's tricky. Usually the way the memory resolver in qemu works 
 is as follows:

  * kvm goes to qemu
  * qemu fetches all mmu and register data from kvm
  * qemu runs its mmu resolution function as if the target was emulated

 So the normal way would be to fetch _all_ TLB entries from KVM, shove them 
 into env and implement the MMU in qemu (at least enough of it to enable 
 debugging). No other target modifies this code path. But no other target 
 needs to copy > 30kb of data only to get the mmu data either :).
 

 I guess you mean that cpu_synchronize_state() is supposed to pull in the
 MMU state, though I don't see where it gets called for 'm'/'M' commands in
 the gdb stub.
   

Well, we could also call it in get_phys_page_debug in target-ppc, but
yes. I guess the reason it works for now is that SDR1 is pretty constant
and was fetched earlier on. For BookE not syncing is obviously even more
broken.

 The MMU code seems to be pretty target-specific.  It's not clear to what
 extent there is a normal way, versus what book3s happens to rely on in
 its get_physical_address() code.  I don't think there are any platforms
 supported yet (with both KVM and a non-empty cpu_get_phys_page_debug()
 implementation) that have a pure software-managed TLB.  x86 has page
 tables, and book3s has the hash table (603/e300 doesn't, or more accurately
 Linux doesn't use it, but I guess that's not supported by KVM yet?).
   

As for PPC, only 440, e500 and G3-5 are basically supported. It happens
to work on POWER4 and above too and I've even got reports that it's good
on e600 :).

 We could probably do some sort of lazy state transfer only when MMU code
 that needs it is run.  This could initially include debug translations, for
 testing a non-KVM-dependent get_physical_address() implementation, but
 eventually that would use KVM_TRANSLATE (when KVM is used) and thus not
   

Yup :).

 trigger the state transfer.  I'd also like to add an info tlb command,
 which would require the state transfer.
   

Very nice.

 BTW, how much other than the MMU is missing to be able to run an e500
 target in qemu, without kvm?
   

The last person working on BookE emulation was Edgar. Edgar, how far did
you get?


Alex



Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Edgar E. Iglesias
On Thu, Feb 10, 2011 at 12:55:22PM +0100, Alexander Graf wrote:
 Scott Wood wrote:
  On Thu, 3 Feb 2011 10:19:06 +0100
  Alexander Graf ag...@suse.de wrote:
 

  Yeah, that one's tricky. Usually the way the memory resolver in qemu works 
  is as follows:
 
   * kvm goes to qemu
   * qemu fetches all mmu and register data from kvm
   * qemu runs its mmu resolution function as if the target was emulated
 
  So the normal way would be to fetch _all_ TLB entries from KVM, shove 
  them into env and implement the MMU in qemu (at least enough of it to 
  enable debugging). No other target modifies this code path. But no other 
  target needs to copy > 30kb of data only to get the mmu data either :).
  
 
  I guess you mean that cpu_synchronize_state() is supposed to pull in the
  MMU state, though I don't see where it gets called for 'm'/'M' commands in
  the gdb stub.

 
 Well, we could also call it in get_phys_page_debug in target-ppc, but
 yes. I guess the reason it works for now is that SDR1 is pretty constant
 and was fetched earlier on. For BookE not syncing is obviously even more
 broken.
 
  The MMU code seems to be pretty target-specific.  It's not clear to what
  extent there is a normal way, versus what book3s happens to rely on in
  its get_physical_address() code.  I don't think there are any platforms
  supported yet (with both KVM and a non-empty cpu_get_phys_page_debug()
  implementation) that have a pure software-managed TLB.  x86 has page
  tables, and book3s has the hash table (603/e300 doesn't, or more accurately
  Linux doesn't use it, but I guess that's not supported by KVM yet?).

 
 As for PPC, only 440, e500 and G3-5 are basically supported. It happens
 to work on POWER4 and above too and I've even got reports that it's good
 on e600 :).
 
  We could probably do some sort of lazy state transfer only when MMU code
  that needs it is run.  This could initially include debug translations, for
  testing a non-KVM-dependent get_physical_address() implementation, but
  eventually that would use KVM_TRANSLATE (when KVM is used) and thus not

 
 Yup :).
 
  trigger the state transfer.  I'd also like to add an info tlb command,
  which would require the state transfer.

 
 Very nice.
 
  BTW, how much other than the MMU is missing to be able to run an e500
  target in qemu, without kvm?

 
 The last person working on BookE emulation was Edgar. Edgar, how far did
 you get?

Hi,

TBH, I don't really know. My goal was to get Linux running on a PPC-440
embedded in the Xilinx FPGAs. I managed to fix enough BookE emulation
to get that far.

After that, we've done a few more hacks to run fsboot and uboot. Also,
we've added support for some of the BookE debug registers to be able
to run gdbserver from within linux guests. Some of these patches haven't
made it upstream yet.

I haven't taken the time to compare the specs to qemu code, so I don't
really know how much is missing. My guess is that if you want to run
Linux guests, the MMU won't be the limiting factor.

Cheers


Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Scott Wood
On Thu, 10 Feb 2011 12:45:38 +0100
Alexander Graf ag...@suse.de wrote:

 Ok, thinking about this a bit more. You're basically proposing a list of
 tlb set calls, with each array field identifying one tlb set call. What
 I was thinking of was a full TLB sync, so we could keep qemu's internal
 TLB representation identical to the ioctl layout and then just call that
 one ioctl to completely overwrite all of qemu's internal data (and vice
 versa).

No, this is a full sync -- the list replaces any existing TLB entries (need
to make that explicit in the doc).  Basically it's an invalidate plus a
list of tlb set operations.

Qemu's internal representation will want to be ordered with no missing
entries.  If we require that of the transfer representation we can't do
early termination.  It would also limit Qemu's flexibility in choosing its
internal representation, and make it more awkward to support multiple MMU
types.

Let's see if the format conversion imposes significant overhead before
imposing a less flexible/larger transfer format. :-)

  MMU type ID also controls this, but could add some padding to make
  extensions simpler (esp. since we're not making an array of it).  How much
  would you recommend?

 
 How about making it 64 bytes? That should leave us plenty of room.

OK.

  The fields inside the struct should be __u32, of course. :-P

 
  Ugh, yes :). But since we're dropping this anyways, it doesn't matter,
  right? :)

Right.

  I assumed most MMU types would have some straightforward way of marking an
  entry invalid (if not, it can add a software field in the struct), and that
  it would be MMU-specific code that is processing the list.

 
 See above :).

Which part?

-Scott



Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Alexander Graf

On 10.02.2011, at 19:51, Scott Wood wrote:

 On Thu, 10 Feb 2011 12:45:38 +0100
 Alexander Graf ag...@suse.de wrote:
 
 Ok, thinking about this a bit more. You're basically proposing a list of
 tlb set calls, with each array field identifying one tlb set call. What
 I was thinking of was a full TLB sync, so we could keep qemu's internal
 TLB representation identical to the ioctl layout and then just call that
 one ioctl to completely overwrite all of qemu's internal data (and vice
 versa).
 
 No, this is a full sync -- the list replaces any existing TLB entries (need
 to make that explicit in the doc).  Basically it's an invalidate plus a
 list of tlb set operations.
 
 Qemu's internal representation will want to be ordered with no missing
 entries.  If we require that of the transfer representation we can't do
 early termination.  It would also limit Qemu's flexibility in choosing its
 internal representation, and make it more awkward to support multiple MMU
 types.

Well, but this way it means we'll have to assemble/disassemble a list of 
entries multiple times:

SET:
 * qemu assembles the list from its internal representation
 * kvm disassembles the list into its internal structure

GET:
 * kvm assembles the list from its internal representation
 * qemu disassembles the list into its internal structure

Maybe we should go with Avi's proposal after all and simply keep the full 
soft-mmu synced between kernel and user space? That way we only need a setup 
call at first, no copying in between and simply update the user space version 
whenever something changes in the guest. We need to store the TLB's contents 
off somewhere anyways, so all we need is an additional in-kernel array with 
internal translation data, but that can be separate from the guest visible 
data, right?


Alex



Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Alexander Graf

On 11.02.2011, at 01:20, Alexander Graf wrote:

 
 On 10.02.2011, at 19:51, Scott Wood wrote:
 
 On Thu, 10 Feb 2011 12:45:38 +0100
 Alexander Graf ag...@suse.de wrote:
 
 Ok, thinking about this a bit more. You're basically proposing a list of
 tlb set calls, with each array field identifying one tlb set call. What
 I was thinking of was a full TLB sync, so we could keep qemu's internal
 TLB representation identical to the ioctl layout and then just call that
 one ioctl to completely overwrite all of qemu's internal data (and vice
 versa).
 
 No, this is a full sync -- the list replaces any existing TLB entries (need
 to make that explicit in the doc).  Basically it's an invalidate plus a
 list of tlb set operations.
 
 Qemu's internal representation will want to be ordered with no missing
 entries.  If we require that of the transfer representation we can't do
 early termination.  It would also limit Qemu's flexibility in choosing its
 internal representation, and make it more awkward to support multiple MMU
 types.
 
 Well, but this way it means we'll have to assemble/disassemble a list of 
 entries multiple times:
 
 SET:
 * qemu assembles the list from its internal representation
 * kvm disassembles the list into its internal structure
 
 GET:
 * kvm assembles the list from its internal representation
 * qemu disassembles the list into its internal structure
 
 Maybe we should go with Avi's proposal after all and simply keep the full 
 soft-mmu synced between kernel and user space? That way we only need a setup 
 call at first, no copying in between and simply update the user space version 
 whenever something changes in the guest. We need to store the TLB's contents 
 off somewhere anyways, so all we need is an additional in-kernel array with 
 internal translation data, but that can be separate from the guest visible 
 data, right?

If we could then keep qemu's internal representation == shared data with kvm == 
kvm's internal data for guest visible stuff, we get this done with almost no 
additional overhead. And I don't see any problem with this. Should be easily 
doable.


Alex



Re: RFC: New API for PPC for vcpu mmu access

2011-02-10 Thread Alexander Graf

On 11.02.2011, at 01:22, Alexander Graf wrote:

 
 On 11.02.2011, at 01:20, Alexander Graf wrote:
 
 
 On 10.02.2011, at 19:51, Scott Wood wrote:
 
 On Thu, 10 Feb 2011 12:45:38 +0100
 Alexander Graf ag...@suse.de wrote:
 
 Ok, thinking about this a bit more. You're basically proposing a list of
 tlb set calls, with each array field identifying one tlb set call. What
 I was thinking of was a full TLB sync, so we could keep qemu's internal
 TLB representation identical to the ioctl layout and then just call that
 one ioctl to completely overwrite all of qemu's internal data (and vice
 versa).
 
 No, this is a full sync -- the list replaces any existing TLB entries (need
 to make that explicit in the doc).  Basically it's an invalidate plus a
 list of tlb set operations.
 
 Qemu's internal representation will want to be ordered with no missing
 entries.  If we require that of the transfer representation we can't do
 early termination.  It would also limit Qemu's flexibility in choosing its
 internal representation, and make it more awkward to support multiple MMU
 types.
 
 Well, but this way it means we'll have to assemble/disassemble a list of 
 entries multiple times:
 
 SET:
 * qemu assembles the list from its internal representation
 * kvm disassembles the list into its internal structure
 
 GET:
 * kvm assembles the list from its internal representation
 * qemu disassembles the list into its internal structure
 
 Maybe we should go with Avi's proposal after all and simply keep the full 
 soft-mmu synced between kernel and user space? That way we only need a setup 
 call at first, no copying in between and simply update the user space 
 version whenever something changes in the guest. We need to store the TLB's 
 contents off somewhere anyways, so all we need is an additional in-kernel 
 array with internal translation data, but that can be separate from the 
 guest visible data, right?
 
 If we could then keep qemu's internal representation == shared data with kvm 
 == kvm's internal data for guest visible stuff, we get this done with almost 
 no additional overhead. And I don't see any problem with this. Should be 
 easily doable.

So then everything we need to get all the functionality we need is a hint from 
kernel to user space that something changed and vice versa.

From kernel to user space is simple. We can just document that after every 
RUN, all fields can be modified.
From user space to kernel, we could modify the entries directly and then pass 
in an ioctl that passes in a dirty bitmap to kernel space. KVM can then decide 
what to do with it. I guess the easiest implementation for now would be to 
ignore the bitmap and simply flush the shadow tlb.

That gives us the flush almost for free. All we need to do is set the tlb to 
all zeros (should be done by env init anyways) and pass in the something 
changed call. KVM can then decide to simply drop all of its shadow state or 
loop through every shadow entry and flush it individually. Maybe we should give 
a hint on the amount of flushes, so KVM can implement some threshold.
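
A hypothetical sketch of that flow from the userspace side; only the idea of
marking changes in a bitmap over the shared array comes from this mail, the
ioctl name follows the KVM_DIRTY_TLB call used in the v3 writeup, and the
argument layout is an assumption for illustration:

#include <sys/ioctl.h>
#include <linux/types.h>

/* Hypothetical argument layout for the dirty-notification call. */
struct kvm_dirty_tlb_args {
	__u64 bitmap;		/* userspace address of the dirty bitmap */
	__u32 num_dirty;	/* hint about how many entries changed */
};

static int push_tlb_changes(int vcpu_fd, unsigned long *dirty_bitmap,
			    unsigned int num_dirty)
{
	struct kvm_dirty_tlb_args args = {
		.bitmap    = (__u64)(unsigned long)dirty_bitmap,
		/* The hint lets KVM choose between flushing the whole shadow
		 * TLB and flushing the changed entries individually. */
		.num_dirty = num_dirty,
	};

	return ioctl(vcpu_fd, KVM_DIRTY_TLB, &args);
}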

Also, please tell me you didn't implement the previous revisions already. It'd 
be a real bummer to see that work wasted only because we're still iterating 
through the spec O_o.


Alex



Re: RFC: New API for PPC for vcpu mmu access

2011-02-08 Thread Avi Kivity

On 02/07/2011 07:30 PM, Yoder Stuart-B08248 wrote:


  -Original Message-
  From: kvm-ppc-ow...@vger.kernel.org [mailto:kvm-ppc-ow...@vger.kernel.org]
  On Behalf Of Avi Kivity
  Sent: Monday, February 07, 2011 11:14 AM
  To: Alexander Graf
  Cc: Wood Scott-B07421; Yoder Stuart-B08248; kvm-ppc@vger.kernel.org;
  k...@vger.kernel.org; qemu-de...@nongnu.org
  Subject: Re: RFC: New API for PPC for vcpu mmu access

  On 02/03/2011 11:19 AM, Alexander Graf wrote:

   I have no idea what things will look like 10 years down the road,
  but  currently e500mc has 576 entries (512 TLB0, 64 TLB1).
  
That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we
  want to transfer every time qemu feels like resolving an EA.

  You could have an ioctl to translate addresses (x86 had KVM_TRANSLATE or
  similar), or have the TLB stored in user memory, so there is no need to
  transfer it (on the other hand, you have to re-validate it every time you
  peek at it).

The most convenient and flexible thing for  Power Book III-E I think
will be something that operates like a TLB search instruction.  Inputs
are 'address space' and 'process id' and outputs are in which TLB the
entry was found and all the components of a TLB entry:
address space
pid
entry number
ea
rpn
guest state
permissions flags
attributes (WIMGE)

Since all of those fields are architected in MAS registers, in the previous
proposal we just proposed to return several 32-bit fields (one per MAS)
that use the architected layout instead of inventing a brand new
structure defining these fields.



This looks reasonable assuming you can take the hit of a system call per 
translation.
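
For reference, a sketch of what such a per-translation call could look like on
top of the kvmppc_booke_tlb_search structure proposed elsewhere in the thread;
KVM_SEARCH_TLB is only a proposed name, and the MAS6[SPID]/MAS6[SAS] encoding
of the pid and address-space inputs is an assumption based on the architected
register layout:

#include <string.h>
#include <sys/ioctl.h>
#include <linux/types.h>

static int search_guest_tlb(int vcpu_fd, __u64 ea, __u32 pid, int as,
			    struct kvmppc_booke_tlb_search *s)
{
	memset(s, 0, sizeof(*s));
	s->entry.mas2 = ea & ~0xfffULL;		/* EPN, assuming 4K minimum pages */
	s->mas6 = ((__u64)pid << 16) | (as & 1);	/* MAS6[SPID], MAS6[SAS] */

	/* On a hit, KVM would fill s->entry with the matching TLB entry. */
	return ioctl(vcpu_fd, KVM_SEARCH_TLB, s);
}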


--
error compiling committee.c: too many arguments to function



Re: RFC: New API for PPC for vcpu mmu access

2011-02-07 Thread Alexander Graf

On 04.02.2011, at 23:33, Scott Wood wrote:

 On Thu, 3 Feb 2011 10:19:06 +0100
 Alexander Graf ag...@suse.de wrote:
 
 On 02.02.2011, at 23:08, Scott Wood wrote:
 On Wed, 2 Feb 2011 22:33:41 +0100
 Alexander Graf ag...@suse.de wrote:
  This seems too fine-grained. I'd prefer a list of all TLB entries to be 
 pushed in either direction. What's the foreseeable number of TLB entries 
 within the next 10 years?
 
 I have no idea what things will look like 10 years down the road, but
 currently e500mc has 576 entries (512 TLB0, 64 TLB1).
 
 That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we 
 want to transfer every time qemu feels like resolving an EA.
 
 And that's only with the standard hardware TLB size.  On Topaz (our
 standalone hypervisor) we increased the guest's TLB0 to 16384 entries.
 It speeds up some workloads nicely, but invalidation-heavy loads get
 hurt.

Yup - I do a similar trick for Book3S. Just that the TLB is implementation 
dependent anyways and mostly hidden from the OS :).

 
 Having the whole stack available would make the sync with qemu easier and 
 also allows us to only do a single ioctl for all the TLB management. 
 Thanks to the PVR we know the size of the TLB, so we don't have to shove 
 that around.
 
 No, we don't know the size (or necessarily even the structure) of the
 TLB.  KVM may provide a guest TLB that is larger than what hardware has,
 as a cache to reduce the number of TLB misses that have to go to the
 guest (we do this now in another hypervisor).
 
 Plus sometimes it's just simpler -- why bother halving the size of the
 guest TLB when running on e500v1?
 
 Makes sense. So we basically need an ioctl that tells KVM the MMU type and 
 TLB size. Remember, the userspace tool is the place for policies :).
 
 Maybe, though keeping it in KVM means we can change it whenever we want
 without having to sync up Qemu and worry about backward compatibility.

Quite the contrary - you have to worry more about backward compatibility. If we 
implement a new feature that doesn't work on old kernels, we can just tell qemu 
to not work on those old versions. For the kernel interfaces, we have to keep 
supporting old userspace.

 Same-as-hardware TLB geometry with a Qemu-specified number of sets is
 probably good enough for the forseeable future, though.  There were
 some other schemes we considered a while back for Topaz, but we ended
 up just going with a larger version of what's in hardware.

Sounds good. Not sure how we'd handle page tables in this whole scheme yet, but 
I guess we really should take one step at a time.

 
 Maybe this even needs to be potentially runtime switchable, in case
 you boot off with u-boot in the guest, load a kernel and the kernel
 activates some PV extensions.
 
 U-Boot should be OK with it -- the increased TLB size is
 architecturally valid, and U-boot doesn't do much with TLB0 anyway.
 
 If we later have true PV extensions such as a page table, that'd be
 another matter.

Yeah. Or imagine we realize that we should make the TLB size dynamic, so that 
non-PV aware guest OSs get the exact same behavior as on real hardware (because 
they're buggy for example) while PV aware guests can bump up the TLB size. But 
really, let's worry about that later.

 
 The only reason we need to do this is because there's no proper reset 
 function in qemu for the e500 tlb. I'd prefer to have that there and push 
 the TLB contents down on reset.
 
 The other way to look at it is that there's no need for a reset
 function if all the state is properly settable. :-)
 
 You make it sound as if it was hard to implement a reset function in qemu 
 :). Really, that's where it belongs.
 
 Sorry, I misread reset function in qemu as reset ioctl in KVM.
 
 This is meant to be used by a qemu reset function.  If there's a 
 full-tlb set, that could be used instead, though it'd be slower.  With
 the API as proposed it's needed to clear the slate before you set the
 individual entries you want.

Yeah, let's just go for a full TLB get/set and optimize specific slow parts 
later. It's easier to fine-tune stuff that's slow than to miss out on 
functionality because we were trying to be too clever from the start.

 
 Which we want anyway for debugging (and migration, though I wonder if
 anyone would actually use that with embedded hardware).
 
 We certainly should not close the door on migration either way. So all the 
 state has to be 100% user space receivable.
 
 Oh, I agree -- or I wouldn't have even mentioned it. :-)
 
 I just wouldn't make it the primary optimization concern at this point.

No, not at all - no optimization for it. But keeping the door open :).

 
 Haven't fully made up my mind on the tlb entry structure yet. Maybe 
 something like
 
 struct kvm_ppc_booke_tlbe {
   __u64 data[8];
 };
 
 would be enough? The rest is implementation dependent anyways. Exposing 
 those details to user space doesn't buy us anything. By keeping it generic 
 we can at least 

Re: RFC: New API for PPC for vcpu mmu access

2011-02-07 Thread Scott Wood
On Mon, 7 Feb 2011 17:49:51 +0100
Alexander Graf ag...@suse.de wrote:

 
 On 07.02.2011, at 17:40, Yoder Stuart-B08248 wrote:
 
  Suggested change to this would be to have Qemu set tlb_type as 
  an _input_ argument.   If KVM supports it, that type gets used,
   else an error is returned.  This would allow Qemu to tell
  the kernel what type of MMU it is prepared to support.   Without
  this Qemu would just have to error out if the type returned is
  unknown.
 
 Yes, we could use the same struct for get and set. On set, it could transfer 
 the mmu type, on get it could tell userspace the mmu type.

What happens if a get is done before the first set, and there are
multiple MMU type options for this hardware, with differing entry sizes?

Qemu would have to know beforehand how large to make the buffer.

We could say that going forward, it's expected that qemu will do a
TLB set (either a full one, or a lightweight alternative) after
creating a vcpu.  For compatibility, if this doesn't happen before the
vcpu is run, the TLB is created and initialized as it is today, but
no new Qemu-visible features will be enabled that way.

If Qemu does a get without ever doing some set operation, it should
get an error, since the requirement to do a set is added at the same
time as the get API.

-Scott



Re: RFC: New API for PPC for vcpu mmu access

2011-02-07 Thread Scott Wood
On Mon, 7 Feb 2011 16:43:02 +0100
Alexander Graf ag...@suse.de wrote:

 On 04.02.2011, at 23:33, Scott Wood wrote:
 
  On Thu, 3 Feb 2011 10:19:06 +0100
  Alexander Graf ag...@suse.de wrote:
  
  Makes sense. So we basically need an ioctl that tells KVM the MMU type and 
  TLB size. Remember, the userspace tool is the place for policies :).
  
  Maybe, though keeping it in KVM means we can change it whenever we want
  without having to sync up Qemu and worry about backward compatibility.
 
 Quite the contrary - you have to worry more about backward compatibility. If 
 we implement a new feature that doesn't work on old kernels, we can just tell 
 qemu to not work on those old versions. For the kernel interfaces, we have to 
 keep supporting old userspace.

If you're talking about actual interface changes, yes.  But a change in
how KVM implements things behind the scenes shouldn't break an old
Qemu, unless it's buggy and makes assumptions not permitted by the API.

  How's that different from backing the void pointer up with a different
  struct depending on the MMU type?  We weren't proposing unions.
  
  A fixed array does mean you wouldn't have to worry about whether qemu
  supports the more advanced struct format if fields are added --
  you can just unconditionally write it, as long as it's backwards
  compatible.  Unless you hit the limit of the pre-determined array size,
  that is.  And if that gets made higher for future expansion, that's
  even more data that has to get transferred, before it's really needed.
 
 Yes, it is. And I don't see how we could easily avoid it. Maybe just pass in 
 a random __user pointer that we directly write to from kernel space and tell 
 qemu how big and what type a tlb entry is?
 
 struct request_ppc_tlb {
 int tlb_type;
 int tlb_entries;
  uint64_t __user *tlb_data;
 };

That's pretty much what the proposed API does -- except it uses a void
pointer instead of uint64_t *.

 Would you really want to loop through 16k entries, doing an ioctl for each? 

Not really.  The API was modeled after something we did on Topaz where
it's just a function call.  But something array-based would have been
awkward without constraining the geometry.

Now that we're going to constrain the geometry, providing an array-based
get/set would be easy and should definitely be a part of the API.
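
As a rough sketch of what the array-based get could look like from
userspace (buffer sized from the tlb_sizes[] idea further down; the entry
layout and ioctl name are still RFC material, so this is illustrative):

#include <stdlib.h>
#include <sys/ioctl.h>

/* Sketch only: one bulk KVM_GET_TLB call instead of per-entry ioctls. */
static struct kvmppc_booke_tlb_entry *
get_all_tlb_entries(int vcpu_fd, const struct kvmppc_booke_tlb_params *p,
		    unsigned int *count)
{
	struct kvmppc_booke_tlb_entry *buf;
	unsigned int i, total = 0;

	for (i = 0; i < 4; i++)
		total += p->tlb_sizes[i];

	buf = calloc(total, sizeof(*buf));
	if (!buf)
		return NULL;

	if (ioctl(vcpu_fd, KVM_GET_TLB, buf) < 0) {
		free(buf);
		return NULL;
	}

	*count = total;
	return buf;
}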

 Then performance really would always be an issue.

For cases where you really need to do a full get/set, yes.

 I would really prefer we tackle it with a full-on tlb get/set first and then
 put the very flexible one on top, because to me the full-on approach feels
 like the more generic one. I'm very open to adding an individual tlb get/set
 and maybe even a "kvm, please translate EA x to RA y" ioctl. But those should
 come after we cover the big hammer that just copies everything.

If we add everything at once, it minimizes the possibilities that Qemu
has to deal with -- either the full MMU API is there, or it's not.

BTW, I wonder if we should leave PPC out of the name.  It seems like
any arch with a software-visible TLB could use this, since the hw
details are hidden behind the MMU type.

How about:

struct kvmppc_booke_tlb_entry {
	union {
		__u64 mas0_1;
		struct {
			__u32 mas0;
			__u32 mas1;
		};
	};
	__u64 mas2;
	union {
		__u64 mas7_3;
		struct {
			__u32 mas7;
			__u32 mas3;
		};
	};
	__u32 mas8;
	__u32 pad;
};

struct kvmppc_booke_tlb_params {
	/*
	 * book3e defines 4 TLBs.  Individual implementations may have
	 * fewer.  TLBs that do not exist on the target must be configured
	 * with a size of zero.  KVM will adjust TLBnCFG based on the sizes
	 * configured here, though arrays greater than 2048 entries will
	 * have TLBnCFG[NENTRY] set to zero.
	 */
	__u32 tlb_sizes[4];
};

struct kvmppc_booke_tlb_search {
	struct kvmppc_booke_tlb_entry entry;
	union {
		__u64 mas5_6;
		struct {
			__u32 mas5;
			__u32 mas6;
		};
	};
};

For an MMU type of PPC_BOOKE_NOHV, the mas5 field in
kvmppc_booke_tlb_search and the mas8 field in kvmppc_booke_tlb_entry
are present but not supported.
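
For illustration, a gdb-stub style EA lookup might fill the search struct
roughly like this (a sketch under assumptions: the EPN goes in entry.mas2
as the input, the MAS6 bit layout is the usual e500 one, and the ioctl
that would consume the struct isn't defined in this RFC):

#include <string.h>
#include <stdint.h>

/* Sketch: prepare a tlbsx-style lookup.  The MAS6[SPID0]/[SAS] encoding
 * follows the standard e500 definitions; whether the EA is passed via
 * entry.mas2 is an assumption, not something this RFC pins down. */
static void prepare_tlb_search(struct kvmppc_booke_tlb_search *s,
			       uint64_t ea, uint32_t pid, int as)
{
	memset(s, 0, sizeof(*s));
	s->mas5 = 0;					/* LPID search: HV only, ignored on NOHV */
	s->mas6 = ((pid & 0x3fff) << 16) | (as & 1);	/* MAS6[SPID0] | MAS6[SAS] */
	s->entry.mas2 = ea & ~0xfffULL;			/* EPN, assuming a 4K page */
}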

For an MMU type of PPC_BOOKE_NOHV or PPC_BOOKE_HV:
 - TLB entries in get/set arrays may be provided in any order, and all 
   TLBs are get/set at once.
 - An entry with MAS1[V] = 0 terminates the list early (but there will
   be no terminating entry if the full array is valid).  On a call to
   KVM_GET_TLB, the contents of elements after the terminator are undefined.
   On a call to KVM_SET_TLB, excess elements beyond the terminating
   entry may not be accessed by KVM.  (See the walking sketch below.)
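
A consumer of a KVM_GET_TLB array would then stop at the first invalid
entry, roughly as below (sketch; MAS1_VALID is the standard e500 MAS1[V]
bit definition, not something this RFC introduces):

/* Count valid entries, honoring the MAS1[V]=0 terminator; if all `max'
 * entries are valid there is no terminator to find. */
#define MAS1_VALID	0x80000000

static unsigned int count_valid_entries(const struct kvmppc_booke_tlb_entry *e,
					unsigned int max)
{
	unsigned int n;

	for (n = 0; n < max; n++)
		if (!(e[n].mas1 & MAS1_VALID))
			break;

	return n;
}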

[Note: Once we implement sregs, Qemu can determine which TLBs are

Re: RFC: New API for PPC for vcpu mmu access

2011-02-04 Thread Scott Wood
On Thu, 3 Feb 2011 10:19:06 +0100
Alexander Graf ag...@suse.de wrote:

 On 02.02.2011, at 23:08, Scott Wood wrote:
  On Wed, 2 Feb 2011 22:33:41 +0100
  Alexander Graf ag...@suse.de wrote:
  This seems too fine-grained. I'd prefer a list of all TLB entries to be 
  pushed in either direction. What's the foreseeable number of TLB entries 
  within the next 10 years?
  
  I have no idea what things will look like 10 years down the road, but
  currently e500mc has 576 entries (512 TLB0, 64 TLB1).
 
 That sums up to 64 * 576 bytes, which is 36kb. Ouch. Certainly nothing we 
 want to transfer every time qemu feels like resolving an EA.

And that's only with the standard hardware TLB size.  On Topaz (our
standalone hypervisor) we increased the guest's TLB0 to 16384 entries.
It speeds up some workloads nicely, but invalidation-heavy loads get
hurt.

  Having the whole stack available would make the sync with qemu easier and 
  also allows us to only do a single ioctl for all the TLB management. 
  Thanks to the PVR we know the size of the TLB, so we don't have to shove 
  that around.
  
  No, we don't know the size (or necessarily even the structure) of the
  TLB.  KVM may provide a guest TLB that is larger than what hardware has,
  as a cache to reduce the number of TLB misses that have to go to the
  guest (we do this now in another hypervisor).
  
  Plus sometimes it's just simpler -- why bother halving the size of the
  guest TLB when running on e500v1?
 
 Makes sense. So we basically need an ioctl that tells KVM the MMU type and 
 TLB size. Remember, the userspace tool is the place for policies :).

Maybe, though keeping it in KVM means we can change it whenever we want
without having to sync up Qemu and worry about backward compatibility.

Same-as-hardware TLB geometry with a Qemu-specified number of sets is
probably good enough for the foreseeable future, though.  There were
some other schemes we considered a while back for Topaz, but we ended
up just going with a larger version of what's in hardware.

 Maybe this even needs to be potentially runtime switchable, in case
 you boot off with u-boot in the guest, load a kernel and the kernel
 activates some PV extensions.

U-Boot should be OK with it -- the increased TLB size is
architecturally valid, and U-Boot doesn't do much with TLB0 anyway.

If we later have true PV extensions such as a page table, that'd be
another matter.

  The only reason we need to do this is because there's no proper reset 
  function in qemu for the e500 tlb. I'd prefer to have that there and push 
  the TLB contents down on reset.
  
  The other way to look at it is that there's no need for a reset
  function if all the state is properly settable. :-)
 
 You make it sound as if it was hard to implement a reset function in qemu :). 
 Really, that's where it belongs.

Sorry, I misread "reset function in qemu" as "reset ioctl in KVM".

This is meant to be used by a qemu reset function.  If there's a 
full-tlb set, that could be used instead, though it'd be slower.  With
the API as proposed it's needed to clear the slate before you set the
individual entries you want.
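
Roughly, the qemu-side reset this is aimed at could look like the
following (a sketch only; KVM_PPC_INVALIDATE_TLB and KVM_PPC_SET_TLBE are
the per-entry ioctls from Stuart's original proposal quoted further down,
and may well change):

#include <sys/ioctl.h>

/* Sketch: clear the slate, then install only the entries the machine
 * model wants to exist after reset. */
static int e500_tlb_reset(int vcpu_fd,
			  const struct kvmppc_booke_tlb_entry *boot_entries,
			  unsigned int n)
{
	unsigned int i;

	if (ioctl(vcpu_fd, KVM_PPC_INVALIDATE_TLB, 0) < 0)
		return -1;

	for (i = 0; i < n; i++)
		if (ioctl(vcpu_fd, KVM_PPC_SET_TLBE, &boot_entries[i]) < 0)
			return -1;

	return 0;
}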

  Which we want anyway for debugging (and migration, though I wonder if
  anyone would actually use that with embedded hardware).
 
 We certainly should not close the door on migration either way. So all the 
 state has to be 100% user space receivable.

Oh, I agree -- or I wouldn't have even mentioned it. :-)

I just wouldn't make it the primary optimization concern at this point.

  Haven't fully made up my mind on the tlb entry structure yet. Maybe 
  something like
  
  struct kvm_ppc_booke_tlbe {
          __u64 data[8];
  };
  
  would be enough? The rest is implementation dependent anyways. Exposing 
  those details to user space doesn't buy us anything. By keeping it generic 
  we can at least still build against older kernel headers :).
  
  If it's not exposed to userspace, how is userspace going to
  interpret/fill in the data?
 
 It can overlay cast according to the MMU type.

How's that different from backing the void pointer up with a different
struct depending on the MMU type?  We weren't proposing unions.
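
The overlay cast would amount to something like this (sketch; both structs
and the MMU type constants are proposals from this thread, nothing merged):

#include <stdio.h>

/* Sketch: reinterpret the opaque container according to the negotiated
 * MMU type -- functionally the same as handing KVM a typed pointer. */
static void print_tlbe(const struct kvm_ppc_booke_tlbe *raw, int mmu_type)
{
	if (mmu_type == PPC_BOOKE_NOHV || mmu_type == PPC_BOOKE_HV) {
		const struct kvmppc_booke_tlb_entry *e =
			(const void *)raw->data;

		printf("MAS1 %08x MAS2 %016llx MAS7/3 %016llx\n",
		       e->mas1, (unsigned long long)e->mas2,
		       (unsigned long long)e->mas7_3);
	}
}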

A fixed array does mean you wouldn't have to worry about whether qemu
supports the more advanced struct format if fields are added --
you can just unconditionally write it, as long as it's backwards
compatible.  Unless you hit the limit of the pre-determined array size,
that is.  And if that gets made higher for future expansion, that's
even more data that has to get transferred, before it's really needed.

  As for kernel headers, I think qemu needs to provide its own copy, like
  qemu-kvm does, and like http://kernelnewbies.org/KernelHeaders suggests
  for programs which rely on recent kernel APIs (which Qemu+KVM tends
  to do already).
 
 Yeah, tedious story...

I guess it's come up before?

  Userspace should only really need the TLB entries for
  
   1) Debugging
   2) Migration
  
  So I don't see the 

Re: RFC: New API for PPC for vcpu mmu access

2011-02-03 Thread Alexander Graf

On 02.02.2011, at 23:34, Yoder Stuart-B08248 wrote:

 
 
 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Wednesday, February 02, 2011 3:34 PM
 To: Yoder Stuart-B08248
 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; qemu-de...@nongnu.org
 Subject: Re: RFC: New API for PPC for vcpu mmu access
 
 
 On 02.02.2011, at 21:33, Yoder Stuart-B08248 wrote:
 
 Below is a proposal for a new API for PPC to allow KVM clients to set
 MMU state in a vcpu.
 
 BookE processors have one or more software managed TLBs and currently
 there is no mechanism for Qemu to initialize or access them.  This is
 needed for normal initialization as well as debug.
 
 There are 4 APIs:
 
 -KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type of MMU
 with KVM-- the type determines the size and format of the data in the
 other APIs
 
 This should be done through the PVR hint in sregs, no? Usually a single CPU
 type only has a single MMU type.
 
 -KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all TLBs in the
 vcpu
 
 -KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture specifies
 the format of the MMU data passed in
 
 This seems too fine-grained. I'd prefer a list of all TLB entries to be
 pushed in either direction. What's the foreseeable number of TLB entries
 within the next 10 years?
 
 Having the whole stack available would make the sync with qemu easier and
 also allows us to only do a single ioctl for all the TLB management. Thanks
 to the PVR we know the size of the TLB, so we don't have to shove that
 around.
 
 Yes, we thought about that approach but the idea here, as Scott 
 described, was to provide an API that could work if user space
 is unaware of the geometry of the TLB.

Userspace shouldn't be unaware of the TLB, that's the whole point :). In 
principle, all state should be fetchable from userspace - so it has to know the 
geometry.

 Take a look at Power ISA Version 2.06.1 (on power.org) at the definition
 of TLBnCFG in Book E.  The NENTRY and ASSOC fields now have meaning that
 allow TLB geometries that cannot be described in the TLBnCFG
 registers.

It's certainly not easy to assemble all the required information in userspace, 
but we need to do so nevertheless - having the state available simply buys us a 
lot in terms of flexibility.

 I think the use case where this API would be used the most
 would be from a gdb stub that needed to look up an effective
 address.

I agree. As I stated in my previous mail, there's probably good rationale to 
explicitly tune that path. That doesn't mean that we have to leave out the 
generic one. It should only be an acceleration.

After all, this whole flexibility thing with all the potential possibilities is 
KVM's strong point. We should not close the doors on those :).


Alex



RE: RFC: New API for PPC for vcpu mmu access

2011-02-02 Thread Yoder Stuart-B08248


 -Original Message-
 From: Alexander Graf [mailto:ag...@suse.de]
 Sent: Wednesday, February 02, 2011 3:34 PM
 To: Yoder Stuart-B08248
 Cc: kvm-ppc@vger.kernel.org; k...@vger.kernel.org; qemu-de...@nongnu.org
 Subject: Re: RFC: New API for PPC for vcpu mmu access
 
 
 On 02.02.2011, at 21:33, Yoder Stuart-B08248 wrote:
 
  Below is a proposal for a new API for PPC to allow KVM clients to set
  MMU state in a vcpu.
 
  BookE processors have one or more software managed TLBs and currently
  there is no mechanism for Qemu to initialize or access them.  This is
  needed for normal initialization as well as debug.
 
  There are 4 APIs:
 
  -KVM_PPC_SET_MMU_TYPE allows the client to negotiate the type of MMU
  with KVM-- the type determines the size and format of the data in the
  other APIs
 
 This should be done through the PVR hint in sregs, no? Usually a single CPU
 type only has a single MMU type.
 
  -KVM_PPC_INVALIDATE_TLB invalidates all TLB entries in all TLBs in the
  vcpu
 
  -KVM_PPC_SET_TLBE sets a TLB entry-- the Power architecture specifies
  the format of the MMU data passed in
 
 This seems too fine-grained. I'd prefer a list of all TLB entries to be
 pushed in either direction. What's the foreseeable number of TLB entries
 within the next 10 years?
 
 Having the whole stack available would make the sync with qemu easier and
 also allows us to only do a single ioctl for all the TLB management. Thanks
 to the PVR we know the size of the TLB, so we don't have to shove that
 around.

Yes, we thought about that approach but the idea here, as Scott 
described, was to provide an API that could work if user space
is unaware of the geometry of the TLB.

Take a look at Power ISA Version 2.06.1 (on power.org) at the definition
of TLBnCFG in Book E.  The NENTRY and ASSOC fields now have meaning that
allow TLB geometries that cannot be described in the TLBnCFG
registers.

I think the use case where this API would be used the most
would be from a gdb stub that needed to look up an effective
address.

Stuart
