Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-07 Thread Shamir Rabinovitch
On Thu, Nov 05, 2015 at 04:11:21PM -0500, David Miller wrote:
> 
> And for the record Sowmini fixed a lot of the lock contention:
> 
> commit ff7d37a502022149655c18035b99a53391be0383
> Author: Sowmini Varadhan 
> Date:   Thu Apr 9 15:33:30 2015 -0400
> 
> Break up monolithic iommu table/lock into finer graularity pools and lock
> 

The poor rds-stress results w/o IOMMU bypass I sent in an earlier post were
taken from a kernel that has the above patch and all the needed changes in
arch/sparc to use this new feature.

It seems that it worked well for the 10G ETH IOMMU lock contention, but it
still does not solve the rds-stress issue.

The difference can come from:

1. Lock contention still left with this enhancement <-- zero in bypass
2. Overhead to set up the IOMMU mapping <-- almost zero in bypass (requires 1 HV call)
3. Overhead to use the IOMMU mapping <-- not sure how to measure this
4. Overhead to tear down the IOMMU mapping <-- zero in bypass


--
To unsubscribe from this list: send the line "unsubscribe kvm" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Shamir Rabinovitch
On Tue, Nov 03, 2015 at 07:13:28AM +1100, Benjamin Herrenschmidt wrote:
> 
> Then I would argue for naming this differently. Make it an optional
> hint "DMA_ATTR_HIGH_PERF" or something like that. Whether this is
> achieved via using a bypass or other means in the backend not the
> business of the driver.
> 

Correct. This comment also came up in internal review :-) although
currently there is strong opposition to such a new attribute.

> 
> It will only partially, but it's just an example of another way the
> backend could provide some improved performance without a bypass.

Just curious...

In the Infiniband case, where the user application requests the driver to
DMA map some random pages it allocated - will your suggestion still
work?

I mean, can you use this limited 1:1 mapping window to map any page the
user application chooses? How?

With bypass we just use the physical address of the page (almost as is).
We rely on the fact that the bypass address space covers the whole memory,
so we have a mapping for any address.

In the IOMMU case we use the IOMMU translation tables and so can map a
64-bit address to some 32-bit address (or less).

In your case it is not clear to me what we can do with such a limited 1:1
mapping.

> 
> Cheers,
> Ben.
> 


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Shamir Rabinovitch
On Mon, Nov 02, 2015 at 03:44:27PM +0100, Joerg Roedel wrote:
> 
> How do you envision this per-mapping by-pass to work? For the DMA-API
> mappings you have only one device address space. For by-pass to work,
> you need to map all physical memory of the machine into this space,
> which leaves the question where you want to put the window for
> remapping.
> 
> You surely have to put it after the physical mappings, but any
> protection will be gone, as the device can access all physical memory.

Correct. This issue is one of the concerns raised in the previous replies.
I will take a different approach which will not require the IOMMU bypass
per mapping, and will try to shift to the x86 'iommu=pt' approach.

> 
> So instead of working around the shortcomings of DMA-API
> implementations, can you present us some numbers and analysis of how bad
> the performance impact with an IOMMU is and what causes it?

We had a bunch of issues around the SPARC IOMMU, not all of them related to
performance. The first issue was that on SPARC we currently have only a
limited IOMMU address space, so we had trouble doing large DMA mappings
for Infiniband. The second issue was that we identified high contention on
the IOMMU locks even in the ETH driver.

> 
> I know that we have lock-contention issues in our IOMMU drivers, which
> can be fixed. Maybe the performance problems you are seeing can be fixed
> too, when you give us more details about them.
> 

I do not want to put too much information here, but you can see some results:

rds-stress test from sparc t5-2 -> x86:

with iommu bypass:
------------------
sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
tsks   tx/s  rx/s   tx+rx K/s  mbi K/s  mbo K/s  tx us/c   rtt us  cpu %
   3 141278     0  1165565.81     0.00     0.00     8.93   376.60  -1.00  (average)

without iommu bypass:
---------------------
sparc->x86 cmdline = -r XXX -s XXX -q 256 -a 8192 -T 10 -d 10 -t 3 -o XXX
tsks   tx/s  rx/s   tx+rx K/s  mbi K/s  mbo K/s  tx us/c   rtt us  cpu %
   3  78558     0   648101.41     0.00     0.00    15.05   876.72  -1.00  (average)

+ RDMA tests do not work at all (might be due to failure to DMA map all
the memory).

So IOMMU bypass gives an ~80% performance boost.

> 
>   Joerg


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-02 Thread Shamir Rabinovitch
On Mon, Nov 02, 2015 at 09:00:34PM +1100, Benjamin Herrenschmidt wrote:
> 
> Choosing on a per-mapping basis *in the back end* might still make some

In my case, choosing the mapping based on the hardware that will use these
mappings makes more sense. Most hardware is not as performance sensitive
as Infiniband hardware.

> amount of sense. What I don't completely grasp is what does it give
> you to expose that choice to the *driver* all the way up the chain. Why
> does the driver knows better whether something should use the bypass or
> not ?

The driver knows for which hardware it is mapping the memory, so it knows
whether the memory will be used by performance-sensitive hardware or not.

> 
> I can imagine some in-between setups, for example, on POWER (and
> probably x86), I could setup a window that is TCE-mapped (TCEs are our
> iommu PTEs) but used to create a 1:1 mapping. IE. A given TCE always
> map to the same physical page. I could then use map/unmap to adjust the
> protection, the idea being that only "relaxing" the protection requires
> flushing the IO TLB, ie, we could delay most flushes.

In your case, what would give the better performance - 1:1 mapping or IOMMU
mapping? When you say 'relaxing the protection', do you refer to the 1:1
mapping?

Also, how does this 1:1 window address the security concerns raised by
others here?

> 
> But that sort of optimization only makes sense in the back-end.
> 
> So what was your original idea where you thought the driver was the
> right one to decide whether to use the bypass or not for a given map
> operation ? That's what I don't grasp... you might have a valid case
> that I just fail to see.

Please see above.

> 
> Cheers,
> Ben.
> 

I think that given that IOMMU bypass on a per-allocation basis raises some
concerns, the only path for SPARC is this:

1. Support 'iommu=pt' as x86 does, for a total IOMMU bypass, as an
intermediate step. Systems that use Infiniband will be set to pass-through.

2. Add support for DVMA, which allows less contention on the IOMMU resources
while doing map/unmap, a bigger address range, and full protection.
Still, it is not clear what the performance cost of using DVMA will be.
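For reference, pass-through mode on x86 is selected on the kernel command line; on Intel systems the IOMMU driver itself must also be enabled:

```
iommu=pt intel_iommu=on
```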


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-01 Thread Shamir Rabinovitch
On Mon, Nov 02, 2015 at 08:10:49AM +1100, Benjamin Herrenschmidt wrote:
> But but but ...
> 
> What I don't understand is how that brings you any safety.

Limited safety, maybe? If some device DMA mappings go via the IOMMU
and fall into some address range that is far from the bypass /
pass-through range, then small drifts in address might still be caught
if we do not bypass / pass through the IOMMU - right?

The device can surely use the bypass address and just reach the memory w/o
IOMMU protection. See some comments about that below.

> 
> Basically, either your bridge has a bypass window, or it doesn't. (Or
> it has one and it's enabled or not, same thing).

Agree.

> 
> If you request the bypass on a per-mapping basis, you basically have to
> keep the window always enabled, unless you do some nasty refcounting of
> how many people have a bypass mapping in flight, but that doesn't seem
> useful.
> 
> Thus you have already lost all protection from the device, since your
> entire memory is accessible via the bypass mapping.

Correct, see my comment above.

> 
> Which means what is the point of then having non-bypass mappings for
> other things ? Just to make things slower ?
> 
> I really don't see what this whole "bypass on a per-mapping basis" buys
> you.
> 
> Note that we implicitly already do that on powerpc, but not for those
> reasons, we do it based on the DMA mask, so that if your coherent mask
> is 32-bit but your dma mask is 64-bit (which is not an uncommon
> combination), we service the "coherent" requests (basically the long
> lifed dma allocs) from the remapped region and the "stream" requests
> (ie, map_page/map_sg) from the bypass.



To summarize:

1. The whole point of the IOMMU pass-through was to get a bigger address space
and faster map/unmap operations for performance-critical hardware.
2. The SPARC IOMMU in particular has the DVMA capability, which addresses all
the protection concerns raised above. Not sure what the performance
impact will be, though. This still needs a lot of work before we can test it.
3. On x86 we use the IOMMU in pass-through mode, so all the above concerns are valid.

The questions are:

1. Does partial use of the IOMMU while the pass-through window is enabled add
some protection?
2. Would we rather follow the x86 way of doing this, which is to enable/disable
IOMMU translations at the kernel level?

I think that I can live with option (2) until I have DVMA, if there is strong
disagreement on the need for per-allocation IOMMU bypass.


> 
> Cheers,
> Ben.
> 


Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-11-01 Thread Shamir Rabinovitch
On Thu, Oct 29, 2015 at 10:35:25PM +, David Woodhouse wrote:
> > 
> > Could this be mitigated using pools?  I don't know if the net code
> > would play along easily.
> 
> For the receive side, it shouldn't be beyond the wit of man to
> introduce an API which allocates *and* DMA-maps a skb. Pass it to
> netif_rx() still mapped, with a destructor that just shoves it back in
> a pool for re-use.
> 
> Doing it for transmit might be a little more complex, but perhaps still
> possible.

Not sure this use case is possible for Infiniband, where the application holds
the data buffers and there is no way to force the application to reuse the
buffer as suggested.

This is why I think there will be no easy way to bypass the DMA mapping cost
for all use cases unless we allow applications to request bypass/pass-through
DMA mapping (implicitly or explicitly).

> 
> -- 
> dwmw2
> 




Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-29 Thread Shamir Rabinovitch
On Thu, Oct 29, 2015 at 09:42:12AM +0900, David Woodhouse wrote:

> Aside from the lack of security, the other disadvantage of that is that
> you have to pin *all* pages of a guest in case DMA happens; you don't
> get to pin *only* those pages which are referenced by that guest's
> IOMMU page tables...

We do bypass the IOMMU but not the DMA API, and given that we pin the page
before we call the DMA API, we do not need to pin all the pages -
just the ones we use for the DMA.

To me this flag looks orthogonal to the page-pinning issue you brought up. It
is just a hint to the DMA API that we want to use a simple & fast mapping while
knowing we lose IOMMU protection for this memory.

For the IB case, setting up and tearing down DMA mappings for the driver's
data buffers is expensive. But we could, for example, consider mapping all
the HW control objects that validate the HW access to the driver's data
buffers as IOMMU protected, and so have IOMMU protection on those critical
objects while having fast set-up/tear-down of the driver data buffers. The
HW control objects have a stable mapping for the lifetime of the system. So
the case of using both types of DMA mappings is still valid.
 
> 
> -- 
> dwmw2
> 
> 




Re: [PATCH v1 2/2] dma-mapping-common: add DMA attribute - DMA_ATTR_IOMMU_BYPASS

2015-10-28 Thread Shamir Rabinovitch
On Wed, Oct 28, 2015 at 03:30:01PM +0900, David Woodhouse wrote:
> > > +For systems with IOMMU it is assumed all DMA translations use the IOMMU.
> 
> Not entirely true. We have per-device dma_ops on a most architectures
> already, and we were just talking about the need to add them to
> POWER/SPARC too, because we need to avoid trying to use the IOMMU to
> map virtio devices too.

SPARC has its dma_ops implementation under arch/sparc (sun4v_dma_ops).

Some drivers on SPARC use the IOMMU, for example ixgbe (Intel 10G ETH).
Some, like IB, suffer from the IOMMU map setup/tear-down cost & the limited
address range. On SPARC, IOMMU bypass is not a total bypass of the IOMMU but
rather a much simpler translation that does not require any complex
translation tables.

> 
> As we look at that (and make the per-device dma_ops a generic thing
> rather than per-arch), we should probably look at folding your case in
> too.

Whether to use the IOMMU or not for DMA is up to the driver. The above example
shows a real situation where one driver can use the IOMMU and the other can't.
It seems that the device cannot know when to use IOMMU bypass and when not.
Even within one driver we may decide to DMA map some buffers using IOMMU
translation and some as IOMMU bypass.

Do you agree that we need this attribute in the generic DMA API?

> 
> -- 
> dwmw2
> 
> 

