Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Benjamin Herrenschmidt
On Thu, 2014-06-05 at 17:25 +1000, Alexey Kardashevskiy wrote:
 +This creates a virtual TCE (translation control entry) table, which
 +is an IOMMU for PAPR-style virtual I/O.  It is used to translate
 +logical addresses used in virtual I/O into guest physical addresses,
 +and provides a scatter/gather capability for PAPR virtual I/O.
 +
 +/* for KVM_CAP_SPAPR_TCE_64 */
 +struct kvm_create_spapr_tce_64 {
 +   __u64 liobn;
 +   __u64 window_size;
 +   __u64 bus_offset;
 +   __u32 page_shift;
 +   __u32 flags;
 +};
 +
 +The liobn field gives the logical IO bus number for which to create a
 +TCE table. The window_size field specifies the size of the DMA window
 +which this TCE table will translate - the table will contain one 64
 +bit TCE entry for every IOMMU page. The bus_offset field tells where
 +this window is mapped on the IO bus. 
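
(A minimal user-space sketch of driving this ioctl, assuming the uapi
definitions added by this patch; the liobn, window geometry and bus
offset below are purely illustrative:)

    #include <sys/ioctl.h>
    #include <linux/kvm.h>  /* with this patch applied: struct + ioctl number */

    /* Create a 64GB DMA window of 16MB IOMMU pages, high on the bus.
     * Returns the TCE table fd on success, -1 on error. */
    static int create_tce64_window(int vmfd)
    {
            struct kvm_create_spapr_tce_64 args = {
                    .liobn       = 0x80000001ULL, /* made-up LIOBN            */
                    .window_size = 64ULL << 30,   /* 64GB window to translate */
                    .bus_offset  = 1ULL << 59,    /* window start on the bus  */
                    .page_shift  = 24,            /* 16MB IOMMU pages         */
                    .flags       = 0,
            };

            return ioctl(vmfd, KVM_CREATE_SPAPR_TCE_64, &args);
    }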

Hrm, the bus_offset cannot be set arbitrarily, it has some pretty strong
HW limits depending on the type of bridge & architecture version...

Do you plan to have that knowledge in qemu ? Or do you have some other
mechanism to query it ? (I might be missing a piece of the puzzle here).

Also one thing I've been pondering ...

We'll end up wasting a ton of memory with those TCE tables. If you have
3 PEs mapped into a guest, it will try to create 3 DDWs (dynamic DMA
windows) mapping the entire guest memory, and so 3 TCE tables large
enough for that ... which will contain exactly the same entries!

We really want to look into extending PAPR to allow the creation of
table aliases so that the guest can essentially create one table and
associate it with multiple PEs. We might still decide to do multiple
copies for NUMA reasons but no more than one per node for example... at
least we can have the policy in qemu/kvm.

Also, do you currently require allocating a single physically contiguous
table or do you support TCE trees in your implementation ?

Cheers,
Ben.



Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Alexey Kardashevskiy
On 06/05/2014 05:38 PM, Benjamin Herrenschmidt wrote:
 On Thu, 2014-06-05 at 17:25 +1000, Alexey Kardashevskiy wrote:
 +This creates a virtual TCE (translation control entry) table, which
 +is an IOMMU for PAPR-style virtual I/O.  It is used to translate
 +logical addresses used in virtual I/O into guest physical addresses,
 +and provides a scatter/gather capability for PAPR virtual I/O.
 +
 +/* for KVM_CAP_SPAPR_TCE_64 */
 +struct kvm_create_spapr_tce_64 {
 +   __u64 liobn;
 +   __u64 window_size;
 +   __u64 bus_offset;
 +   __u32 page_shift;
 +   __u32 flags;
 +};
 +
 +The liobn field gives the logical IO bus number for which to create a
 +TCE table. The window_size field specifies the size of the DMA window
 +which this TCE table will translate - the table will contain one 64
 +bit TCE entry for every IOMMU page. The bus_offset field tells where
 +this window is mapped on the IO bus. 
 
 Hrm, the bus_offset cannot be set arbitrarily, it has some pretty strong
 HW limits depending on the type of bridge & architecture version...
 
 Do you plan to have that knowledge in qemu ? Or do you have some other
 mechanism to query it ? (I might be missing a piece of the puzzle here).


Yes. QEMU will have this knowledge as it has to implement
ibm,create-pe-dma-window and return this address to the guest. There will
be a container API to receive it from the powernv code via a funky ppc_md
callback.

There are 2 steps:
1. query + create window
2. enable in-kernel KVM acceleration for it.

Everything will work without step 2 and, frankly speaking, we do not need
it that much for DDW, but it does not cost much either.

By having bus_offset in the ioctl, which is only used for step 2, I reduce
the dependence on powernv.
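
A rough sketch of that flow (includes as in the earlier sketch;
ddw_query_and_create_window() is a stand-in for the eventual container
API, not an existing call -- only the new ioctl is from this patch):

    static int setup_ddw(int container_fd, int vmfd, unsigned long long liobn)
    {
            /* Step 1: query the host IOMMU and create the HW DMA window.
             * The host (powernv) picks the bus offset; QEMU just learns it. */
            unsigned long long bus_offset =
                    ddw_query_and_create_window(container_fd,
                                                64ULL << 30 /* size */,
                                                24 /* page_shift */);

            /* Step 2 (optional): enable in-kernel acceleration, handing the
             * already-known bus_offset to KVM so KVM never has to ask
             * powernv itself. */
            struct kvm_create_spapr_tce_64 args = {
                    .liobn = liobn, .window_size = 64ULL << 30,
                    .bus_offset = bus_offset, .page_shift = 24,
            };
            return ioctl(vmfd, KVM_CREATE_SPAPR_TCE_64, &args);
    }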


 Also one thing I've been pondering ...
 
 We'll end up wasting a ton of memory with those TCE tables. If you have
 3 PEs mapped into a guest, it will try to create 3 DDWs (dynamic DMA
 windows) mapping the entire guest memory, and so 3 TCE tables large
 enough for that ... which will contain exactly the same entries!

This is in the plan too, do not rush :)


 We really want to look into extending PAPR to allow the creation of
 table aliases so that the guest can essentially create one table and
 associate it with multiple PEs. We might still decide to do multiple
 copies for NUMA reasons but no more than one per node for example... at
 least we can have the policy in qemu/kvm.
 
 Also, do you currently require allocating a single physically contiguous
 table or do you support TCE trees in your implementation ?


No trees yet. For 64GB window we need (64<<30)/(16<<20)*8 = 32K TCE table.
Do we really need trees?


-- 
Alexey


Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Benjamin Herrenschmidt
On Thu, 2014-06-05 at 19:26 +1000, Alexey Kardashevskiy wrote:
 
 No trees yet. For 64GB window we need (64<<30)/(16<<20)*8 = 32K TCE table.
 Do we really need trees?

The above is assuming hugetlbfs-backed guests. Those are the least of my
worries indeed. But we need to deal with 4k and 64k guests.
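
For scale, the same 64GB window at those page sizes, still assuming
8-byte TCEs and a flat table:

    (64<<30)/(16<<20)*8 =  32K   (16MB hugetlbfs pages)
    (64<<30)/(64<<10)*8 =   8M   (64k pages)
    (64<<30)/( 4<<10)*8 = 128M   (4k pages)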

Cheers,
Ben




Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Alexander Graf


On 05.06.14 12:27, Benjamin Herrenschmidt wrote:

On Thu, 2014-06-05 at 19:26 +1000, Alexey Kardashevskiy wrote:

No trees yet. For 64GB window we need (64<<30)/(16<<20)*8 = 32K TCE table.
Do we really need trees?

The above is assuming hugetlbfs-backed guests. Those are the least of my
worries indeed. But we need to deal with 4k and 64k guests.


What if we ask user space to give us a pointer to user space allocated 
memory along with the TCE registration? We would still ask user space to 
only use the returned fd for TCE modifications, but would have some 
nicely swappable memory we can store the TCE entries in.
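
(Concretely -- and purely hypothetically, this is not part of any patch --
the registration struct could grow a user-space backing pointer:)

    struct kvm_create_spapr_tce_64 {
            __u64 liobn;
            __u64 window_size;
            __u64 bus_offset;
            __u32 page_shift;
            __u32 flags;
            __u64 table_uaddr;      /* user-space memory backing the TCE
                                     * entries: swappable, and accounted
                                     * to the process like any other
                                     * memory */
    };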


In fact, the code as is today can allocate an arbitrary amount of pinned 
kernel memory from within user space without any checks.



Alex



Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Benjamin Herrenschmidt
On Thu, 2014-06-05 at 13:56 +0200, Alexander Graf wrote:
 What if we ask user space to give us a pointer to user space allocated 
 memory along with the TCE registration? We would still ask user space to 
 only use the returned fd for TCE modifications, but would have some 
 nicely swappable memory we can store the TCE entries in.

That isn't going to work terribly well for VFIO :-) But yes, for
emulated devices, we could improve things a bit, including for
the 32-bit TCE tables.

For emulated devices, the real mode path could walk the page tables and fall
back to virtual mode & get_user if the page isn't present, thus operating
directly on qemu memory TCE tables instead of the current pinned stuff.

However that has a cost in performance, but since that's really only
used for emulated devices and PAPR VIOs, it might not be a huge issue.

But for VFIO we don't have much choice, we need to create something the
HW can access.
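
Something like this for the emulated path (kernel-side sketch only;
realmode_lookup_tce_slot() is a hypothetical helper, while H_TOO_HARD is
the existing convention for punting a hypercall to virtual mode):

    /* Real-mode H_PUT_TCE: handle only the easy case and punt otherwise,
     * letting the virtual-mode handler use get_user()/put_user() on the
     * qemu-owned table page. */
    long h_put_tce_rm(struct kvm_vcpu *vcpu, unsigned long liobn,
                      unsigned long ioba, unsigned long tce)
    {
            u64 *tcep = realmode_lookup_tce_slot(vcpu->kvm, liobn, ioba);

            if (!tcep)
                    return H_TOO_HARD;      /* table page not resident:
                                             * redo the hcall in virtual
                                             * mode */
            *tcep = tce;
            return H_SUCCESS;
    }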

 In fact, the code as is today can allocate an arbitrary amount of pinned 
 kernel memory from within user space without any checks.

Right. We should at least account it in the locked limit.
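
i.e. roughly the usual pinned-memory accounting pattern (a sketch, not
from the patch; npages is the pinned table size in pages):

    unsigned long locked, lock_limit;

    down_write(&current->mm->mmap_sem);
    locked = current->mm->pinned_vm + npages;
    lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
    if (locked > lock_limit && !capable(CAP_IPC_LOCK))
            ret = -ENOMEM;  /* over RLIMIT_MEMLOCK and no CAP_IPC_LOCK */
    else
            current->mm->pinned_vm += npages;
    up_write(&current->mm->mmap_sem);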

Cheers,
Ben.




Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Alexander Graf


On 05.06.14 14:30, Benjamin Herrenschmidt wrote:

On Thu, 2014-06-05 at 13:56 +0200, Alexander Graf wrote:

What if we ask user space to give us a pointer to user space allocated
memory along with the TCE registration? We would still ask user space to
only use the returned fd for TCE modifications, but would have some
nicely swappable memory we can store the TCE entries in.

That isn't going to work terribly well for VFIO :-) But yes, for
emulated devices, we could improve things a bit, including for
the 32-bit TCE tables.

For emulated devices, the real mode path could walk the page tables and fall
back to virtual mode & get_user if the page isn't present, thus operating
directly on qemu memory TCE tables instead of the current pinned stuff.

However that has a cost in performance, but since that's really only
used for emulated devices and PAPR VIOs, it might not be a huge issue.

But for VFIO we don't have much choice, we need to create something the
HW can access.


But we need to create separate tables for VFIO anyway, because these
TCE tables contain virtual addresses, no?



Alex




In fact, the code as is today can allocate an arbitrary amount of pinned
kernel memory from within user space without any checks.

Right. We should at least account it in the locked limit.

Cheers,
Ben.






Re: [PATCH 3/3] PPC: KVM: Add support for 64bit TCE windows

2014-06-05 Thread Alexey Kardashevskiy
On 06/05/2014 10:30 PM, Benjamin Herrenschmidt wrote:
 On Thu, 2014-06-05 at 13:56 +0200, Alexander Graf wrote:
 What if we ask user space to give us a pointer to user space allocated 
 memory along with the TCE registration? We would still ask user space to 
 only use the returned fd for TCE modifications, but would have some 
 nicely swappable memory we can store the TCE entries in.
 
 That isn't going to work terribly well for VFIO :-) But yes, for
 emulated devices, we could improve things a bit, including for
 the 32-bit TCE tables.
 
 For emulated devices, the real mode path could walk the page tables and fall
 back to virtual mode & get_user if the page isn't present, thus operating
 directly on qemu memory TCE tables instead of the current pinned stuff.
 
 However that has a cost in performance, but since that's really only
 used for emulated devices and PAPR VIOs, it might not be a huge issue.
 
 But for VFIO we don't have much choice, we need to create something the
 HW can access.

You are confusing things here.

There are 2 tables:
1. the guest-visible TCE table, which is what is allocated for VIO or
emulated PCI;
2. the real HW DMA window; one exists already for DMA32 and I will
allocate another for a huge window.

I have just #2 for VFIO now, but we will need both in order to implement
H_GET_TCE correctly, and this is the table I will allocate with this new ioctl.
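
(Per window, something like this, with hypothetical names:)

    struct tce_window {
            u64 *guest_view;        /* table 1: guest-visible TCEs (guest
                                     * physical addresses), what H_GET_TCE
                                     * must be able to read back */
            u64 *hw_table;          /* table 2: the real HW DMA window
                                     * (host addresses the IOMMU walks) */
    };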


 In fact, the code as is today can allocate an arbitrary amount of pinned 
 kernel memory from within user space without any checks.
 
 Right. We should at least account it in the locked limit.

Yup. And (probably) this thing will keep a counter of how many windows were
created per KVM instance to avoid having multiple copies of the same table.


-- 
Alexey