Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-12 Thread KyongHo Cho
On Sat, Nov 12, 2011 at 11:04 AM, Stepan Moskovchenko wrote:
> Yes, the scatterlist is allowed to have different page sizes. But they are
> required to have a length that is a multiple of 4K. If a page in the list is
> bigger than 4K, the code will iteratively map it with 4K pages.

An element in a scatterlist may not be 4K-aligned, but I think the IOMMU driver
must round it up to 4K if the IOMMU API supports iommu_map_range().

The first element in a scatterlist is often not 4K-aligned if the pages are
provided from userspace. The element's offset + length must be a multiple of
4K, though.
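
A minimal sketch of that rounding (the helper name and the 4K constant are
illustrative only, not part of any posted patch):

#include <linux/kernel.h>
#include <linux/scatterlist.h>

#define MIN_IOMMU_PAGE	4096UL	/* assumption: 4K minimum IOMMU page size */

/*
 * Derive the page-aligned window covering one scatterlist element whose
 * offset may be non-zero, assuming offset + length is a multiple of 4K
 * as described above.
 */
static void sg_iommu_window(struct scatterlist *sg,
			    phys_addr_t *phys, size_t *len)
{
	/* align the start down to the containing 4K page ... */
	*phys = sg_phys(sg) & ~(phys_addr_t)(MIN_IOMMU_PAGE - 1);
	/* ... and round offset + length up so the whole element is covered */
	*len = ALIGN(sg->offset + sg->length, MIN_IOMMU_PAGE);
}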

Thank you,
KyongHo.


Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-11 Thread Stepan Moskovchenko

On 11/11/2011 5:24 AM, Joerg Roedel wrote:

> On Thu, Nov 10, 2011 at 01:12:00PM -0800, Stepan Moskovchenko wrote:
>
> > I have been experimenting with an iommu_map_range call, which maps a
> > given scatterlist of discontiguous physical pages into a contiguous
> > virtual region at a given IOVA. This has some performance advantages
> > over just calling iommu_map iteratively. First, it reduces the
> > amount of table walking / calculation needed for mapping each page,
> > given how you know that all the pages will be mapped into a single
> > virtually-contiguous region (so in most cases, the first-level table
> > calculation can be reused). Second, it allows one to defer the TLB
> > (and sometimes cache) maintenance operations until the entire
> > scatterlist has been mapped, rather than doing a TLB invalidate
> > after mapping each page, as would have been the case if iommu_map
> > were just being called from within a loop. Granted, just using
> > iommu_map many times may be acceptable on the slow path, but I have
> > seen significant performance gains when using this approach on the
> > fast path.
>
> Yes, from a performance point-of-view that makes sense, as an addition
> to the existing iommu_map interface. Are the pages in the list allowed
> to have different page-sizes?
>
> Joerg



Hello

Yes, the scatterlist is allowed to have different page sizes. But they 
are required to have a length that is a multiple of 4K. If a page in the 
list is bigger than 4K, the code will iteratively map it with 4K pages. 
I suppose that, given how my implementation is written, it would not be too 
difficult to add checks for the proper length and VA/PA alignments, and 
insert a 64K / 1M / 16M mapping if the alignment is lucky and the SG 
item is big enough.


In my particular test case, even though the pages in the list might be 
of different sizes, they are not guaranteed to be aligned properly and I 
would most likely have to fall back on mapping them as multiple 
consecutive 4K pages anyway. But even despite this, having map_range 
consolidate a lot of the common operations into one call still gives me a 
nice speed boost.


I hadn't sent the patches out because this was all for my testing, but 
would you be interested in me adding a map_range to the API? The 
iommu_map_range call could even check whether the ops structure supports a 
.map_range callback, and fall back on calling iommu_map repeatedly if the 
driver doesn't support this operation natively. In my code, the function 
takes a domain, iova, scatterlist, length, and prot.
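
For reference, here is a rough sketch of the fallback path such a wrapper
could use (illustrative only: it assumes the size-based iommu_map() and
iommu_unmap() introduced by this series, and the native ->map_range callback
mentioned above does not exist in the current API):

#include <linux/iommu.h>
#include <linux/scatterlist.h>

/*
 * Fallback-only sketch: a real iommu_map_range() would first try a native
 * driver ->map_range callback (hypothetical) and use this loop otherwise.
 */
static int iommu_map_sg_fallback(struct iommu_domain *domain,
				 unsigned long iova, struct scatterlist *sg,
				 unsigned int nents, int prot)
{
	struct scatterlist *s;
	unsigned long cur = iova;
	unsigned int i;
	int ret;

	for_each_sg(sg, s, nents, i) {
		/* map each element individually with plain iommu_map() */
		ret = iommu_map(domain, cur, sg_phys(s), s->length, prot);
		if (ret)
			goto undo;
		cur += s->length;
	}
	return 0;

undo:
	/* roll back everything mapped so far */
	if (cur > iova)
		iommu_unmap(domain, iova, cur - iova);
	return ret;
}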


Steve


Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-11 Thread Joerg Roedel
On Fri, Nov 11, 2011 at 01:27:28PM +, David Woodhouse wrote:
> On Fri, 2011-11-11 at 13:58 +0100, Joerg Roedel wrote:
> > For AMD IOMMU there is a feature called not-present cache. It says that
> > the IOMMU caches non-present entries as well and needs an IOTLB flush
> > when something is mapped (meant for software implementations of the
> > IOMMU).
> > So it can't be really taken out of the fast-path. But the IOMMU driver
> > can optimize the function so that it only flushes the IOTLB when there
> > was an unmap-call before. 
> 
> We have exactly the same situation with the Intel IOMMU (we call it
> 'Caching Mode') for the same reasons.
> 
> I'd be wary about making the IOMMU driver *track* whether there was an
> unmap call before — that seems like hard work and more cache contention,
> especially if the ->commit() call happens on a CPU other than the one
> that just did the unmap.

Well, the IOMMU driver does not need to track this globally or
per-IOMMU. It basically tracks it per-domain. This lowers the
cache-bouncing a bit.
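
Roughly like this (every name below is made up for illustration):

#include <linux/types.h>

/* Sketch: per-domain "did we unmap since the last flush" tracking. */
struct example_domain {
	/* ... driver-specific page-table state ... */
	bool unmapped_since_flush;
};

static void example_hw_flush_iotlb(struct example_domain *dom);	/* hw-specific */

/* called at the end of every unmap operation */
static void example_unmap_done(struct example_domain *dom)
{
	dom->unmapped_since_flush = true;
}

/*
 * Called on the map side for IOMMUs that cache non-present entries
 * (the AMD not-present cache, Intel Caching Mode): the flush can be
 * skipped when nothing was unmapped since the last one.
 */
static void example_map_side_flush(struct example_domain *dom)
{
	if (!dom->unmapped_since_flush)
		return;
	example_hw_flush_iotlb(dom);
	dom->unmapped_since_flush = false;
}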

> I'm also not sure exactly when you'd call the ->commit() function when
> the DMA API is being used, and which 'side' of that API the
> deferred-flush optimisations would live.

The commit function is called every time before one of the DMA-API
functions returns.

> Would the optimisation be done on the generic side, only calling
> ->commit when it absolutely *has* to happen? (Or periodically after
> unmaps have happened to avoid entries hanging around for ever?)
> 
> Or would the optimisation be done in the IOMMU driver, thus turning the
> ->commit() function into more of a *hint*? You could add a 'simon_says'
> boolean argument to it, I suppose...?

The IOMMU driver can't do that because special knowledge about the
address allocator is needed. So it has to happen in the generic part.
Okay, that optimization does not really work with the ->commit() idea, now
that I think about it. The requirement to call commit() on the map side
would result in too many flushes.

So it may be better to just call it iommu_flush_tlb(domain) or
something. The user of the IOMMU-API can assume that the flush is only
necessary on the unmap-path. Stricter flushing rules need to be handled
in the IOMMU driver itself.
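
Usage would then look roughly like this (iommu_flush_tlb() is the proposed
call and does not exist yet; the unmaps are the size-based variant from this
series):

#include <linux/iommu.h>

void iommu_flush_tlb(struct iommu_domain *domain);	/* proposed interface */

static void teardown_two_buffers(struct iommu_domain *domain,
				 unsigned long iova_a, size_t size_a,
				 unsigned long iova_b, size_t size_b)
{
	iommu_unmap(domain, iova_a, size_a);
	iommu_unmap(domain, iova_b, size_b);

	/* a single flush makes both unmaps visible to the hardware */
	iommu_flush_tlb(domain);
}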

> > It is also an improvement over the current
> > situation where every iommu_unmap call results in a flush implicitly.
> > This is pretty much a no-go for using the IOMMU-API in DMA mapping at the
> > moment.
> 
> Right. That definitely needs to be handled. We just need to work out the
> (above and other) details.

Yes, I am convinced now that the commit() idea is not the right solution
:)
Something with less generic semantics, like an iommu_flush_tlb callback,
would do better with such an optimization.

> Certainly your concerns are valid, but I think we can cope with them
> fairly reasonably. If we *do* have a large number of CPUs allocating for a
> given domain, we can move to a per-node rather than per-CPU allocator.
> And we can have dynamically sized allocation regions, so we aren't
> wasting too much space on unused bitmaps if you map just *one* page from
> each of your 4096 CPUs.

It also somewhat requires that the chunks be reasonably small. On a
typical small system the amount of allocated DMA memory is rather small
(GPUs are the exception, of course).

Btw, do you have numbers on how much time is spent in spinlocks for
multi-node machines and the VT-d driver for some workloads? I remember
that I did this measurement some time ago on a 24 core AMD machine with
netperf on a 10GBit network card and the time spent spinning came out
at ~5%.
For a 1Gbit network and disk load it was around 0.5% on that machine.

> > How much lock contention will be lowered also depends on the work-load.
> If dma-handles are frequently freed on a different cpu than the one they
> were allocated from, the same problem re-appears.
> 
> The idea is that dma handles are *infrequently* freed, in batches. So
> we'll bounce the lock's cache line occasionally, but not all the time.

Hmm, btw, how does this batch-freeing work? Is the freeing done when
the address space is filled up, or is another approach used?

> > But in the end we have to try it out and see what works best :)
> 
> Indeed. I'm just trying to work out if I should try to do the allocator
> thing purely inside the Intel code first, and then try to move it out
> and make it generic — or if I should start with making the DMA API work
> with a wrapper around the IOMMU API, with your ->commit() and other
> necessary changes. I think I'd prefer the latter, if we can work out how
> it should look.

I think it is best to start with a generic DMA-mapping
implementation, benchmark it, and then find the bottlenecks and try out
possible optimizations. In the beginning it is hard enough to come up
with an implementation that fits X86 and ARM at the same time. When this
is worked out we can continue to optimize :)


Joerg

-- 
AMD Operating System Research Center


Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-11 Thread David Woodhouse
On Fri, 2011-11-11 at 13:58 +0100, Joerg Roedel wrote:
> For AMD IOMMU there is a feature called not-present cache. It says that
> the IOMMU caches non-present entries as well and needs an IOTLB flush
> when something is mapped (meant for software implementations of the
> IOMMU).
> So it can't be really taken out of the fast-path. But the IOMMU driver
> can optimize the function so that it only flushes the IOTLB when there
> was an unmap-call before. 

We have exactly the same situation with the Intel IOMMU (we call it
'Caching Mode') for the same reasons.

I'd be wary about making the IOMMU driver *track* whether there was an
unmap call before — that seems like hard work and more cache contention,
especially if the ->commit() call happens on a CPU other than the one
that just did the unmap.

I'm also not sure exactly when you'd call the ->commit() function when
the DMA API is being used, and which 'side' of that API the
deferred-flush optimisations would live.

Would the optimisation be done on the generic side, only calling
->commit when it absolutely *has* to happen? (Or periodically after
unmaps have happened to avoid entries hanging around for ever?)

Or would the optimisation be done in the IOMMU driver, thus turning the
->commit() function into more of a *hint*? You could add a 'simon_says'
boolean argument to it, I suppose...?

> It is also an improvement over the current
> situation where every iommu_unmap call results in a flush implicitly.
> This is pretty much a no-go for using the IOMMU-API in DMA mapping at the
> moment.

Right. That definitely needs to be handled. We just need to work out the
(above and other) details.

> > But also, it's not *so* much of an issue to divide the space up even
> > when it's limited. The idea was not to have it *strictly* per-CPU, but
> > just for a CPU to try allocating from "its own" subrange first…
> 
> Yeah, I get the idea. I fear that the memory consumption will get pretty
> high with that approach. It basically means one round-robin allocator
> per cpu and device. What does that mean on a 4096 CPU machine :)

Well, if your network device is taking interrupts, and mapping/unmapping
buffers across all 4096 CPUs, then your performance is screwed anyway :)

Certainly your concerns are valid, but I think we can cope with them
fairly reasonably. If we *do* have a large number of CPUs allocating for a
given domain, we can move to a per-node rather than per-CPU allocator.
And we can have dynamically sized allocation regions, so we aren't
wasting too much space on unused bitmaps if you map just *one* page from
each of your 4096 CPUs.

> How much lock contention will be lowered also depends on the work-load.
> If dma-handles are frequently freed on a different cpu than the one they
> were allocated from, the same problem re-appears.

The idea is that dma handles are *infrequently* freed, in batches. So
we'll bounce the lock's cache line occasionally, but not all the time.

In "strict" or "unmap_flush" mode, you get to go slowly unless you do
the unmap on the same CPU that you mapped it from. I can live with that.

> But in the end we have to try it out and see what works best :)

Indeed. I'm just trying to work out if I should try to do the allocator
thing purely inside the Intel code first, and then try to move it out
and make it generic — or if I should start with making the DMA API work
with a wrapper around the IOMMU API, with your ->commit() and other
necessary changes. I think I'd prefer the latter, if we can work out how
it should look.

-- 
dwmw2




Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-11 Thread Joerg Roedel
On Thu, Nov 10, 2011 at 01:12:00PM -0800, Stepan Moskovchenko wrote:
> I have been experimenting with an iommu_map_range call, which maps a
> given scatterlist of discontiguous physical pages into a contiguous
> virtual region at a given IOVA. This has some performance advantages
> over just calling iommu_map iteratively. First, it reduces the
> amount of table walking / calculation needed for mapping each page,
> given how you know that all the pages will be mapped into a single
> virtually-contiguous region (so in most cases, the first-level table
> calculation can be reused). Second, it allows one to defer the TLB
> (and sometimes cache) maintenance operations until the entire
> scatterlist has been mapped, rather than doing a TLB invalidate
> after mapping each page, as would have been the case if iommu_map
> were just being called from within a loop. Granted, just using
> iommu_map many times may be acceptable on the slow path, but I have
> seen significant performance gains when using this approach on the
> fast path.

Yes, from a performance point-of-view that makes sense, as an addition
to the existing iommu_map interface. Are the pages in the list allowed
to have different page-sizes?


Joerg

-- 
AMD Operating System Research Center




Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-11 Thread Joerg Roedel
On Thu, Nov 10, 2011 at 07:28:39PM +, David Woodhouse wrote:

> ... which implies that a mapping, once made, might *never* actually get
> torn down until we loop and start reusing address space? That has
> interesting security implications.

Yes, it is a trade-off between security and performance. But if the user
wants more security, the unmap_flush parameter can be used.

> Is it true even for devices which have been assigned to a VM and then
> unassigned?

No, this is only used in the DMA-API path. The device-assignment code
uses the IOMMU-API directly. There the IOTLB is always flushed on unmap.

> > There is something similar on the AMD IOMMU side. There it is called
> > unmap_flush.
> 
> OK, so that definitely wants consolidating into a generic option.

Agreed.

> > Some time ago I proposed the iommu_commit() interface which changes
> > these requirements. With this interface the requirement is that after a
> > couple of map/unmap operations the IOMMU-API user has to call
> > iommu_commit() to make these changes visible to the hardware (so mostly
> > sync the IOTLBs). As discussed at that time this would make sense for
> > the Intel and AMD IOMMU drivers.
> 
> I would *really* want to keep those off the fast path (thinking mostly
> about DMA API here, since that's the performance issue). But as long as
> we can achieve that, that's fine.

For AMD IOMMU there is a feature called not-present cache. It says that
the IOMMU caches non-present entries as well and needs an IOTLB flush
when something is mapped (meant for software implementations of the
IOMMU).
So it can't be really taken out of the fast-path. But the IOMMU driver
can optimize the function so that it only flushes the IOTLB when there
was an unmap-call before. It is also an improvement over the current
situation where every iommu_unmap call results in a flush implicitly.
This is pretty much a no-go for using the IOMMU-API in DMA mapping at the
moment.

> But also, it's not *so* much of an issue to divide the space up even
> when it's limited. The idea was not to have it *strictly* per-CPU, but
> just for a CPU to try allocating from "its own" subrange first, and then
> fall back to allocating a new subrange, and *then* fall back to
> allocating from subranges "belonging" to other CPUs. It's not that the
> allocation from a subrange would be lockless — it's that the lock would
> almost never leave the l1 cache of the CPU that *normally* uses that
> subrange.

Yeah, I get the idea. I fear that the memory consumption will get pretty
high with that approach. It basically means one round-robin allocator
per cpu and device. What does that mean on a 4096 CPU machine :)
How much lock contention will be lowered also depends on the work-load.
If dma-handles are frequently freed on a different cpu than the one they
were allocated from, the same problem re-appears.
But in the end we have to try it out and see what works best :)


Regards,

Joerg

-- 
AMD Operating System Research Center




Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread Stepan Moskovchenko

On 11/10/2011 9:09 AM, Joerg Roedel wrote:
> The plan is to have a single DMA-API implementation for all IOMMU
> drivers (X86 and ARM) which just uses the IOMMU-API. But to make this
> perform reasonably well, a few changes to the IOMMU-API are
> required. I already have some ideas which we can discuss if you want.


I have been experimenting with an iommu_map_range call, which maps a 
given scatterlist of discontiguous physical pages into a contiguous 
virtual region at a given IOVA. This has some performance advantages 
over just calling iommu_map iteratively. First, it reduces the amount of 
table walking / calculation needed for mapping each page, given how you 
know that all the pages will be mapped into a single 
virtually-contiguous region (so in most cases, the first-level table 
calculation can be reused). Second, it allows one to defer the TLB (and 
sometimes cache) maintenance operations until the entire scatterlist has 
been mapped, rather than doing a TLB invalidate after mapping each page, 
as would have been the case if iommu_map were just being called from 
within a loop. Granted, just using iommu_map many times may be 
acceptable on the slow path, but I have seen significant performance 
gains when using this approach on the fast path.


Steve



Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread David Woodhouse
On Thu, 2011-11-10 at 18:09 +0100, Joerg Roedel wrote:
> The requirement for the DMA-API is, that the IOTLB must be consistent
> with existing mappings, and only with the parts that are really mapped.
> The unmapped parts are not important.
> 
> This allows nice optimizations like your 'batched unmap' on the Intel
> IOMMU driver side. The AMD IOMMU driver uses a round-robin bitmap
> allocator for the IO addresses which makes it very easy to flush certain
> IOTLB ranges only before they are reused.

... which implies that a mapping, once made, might *never* actually get
torn down until we loop and start reusing address space? That has
interesting security implications. Is it true even for devices which
have been assigned to a VM and then unassigned?

> >   - ... unless booted with 'intel_iommu=strict', in which case we do the
> > unmap and IOTLB flush immediately before returning to the driver.
> 
> There is something similar on the AMD IOMMU side. There it is called
> unmap_flush.

OK, so that definitely wants consolidating into a generic option.

> >   - But the IOMMU API for virtualisation is different. In fact that
> > doesn't seem to flush the IOTLB at all. Which is probably a bug.
> 
> Well, *current* requirement is, that the IOTLB is in sync with the
> page-table at every time. This is true for the iommu_map and especially
> for the iommu_unmap function. It means basically that the unmapped area
> needs to be flushed out of the IOTLBs before iommu_unmap returns.
> 
> Some time ago I proposed the iommu_commit() interface which changes
> these requirements. With this interface the requirement is that after a
> couple of map/unmap operations the IOMMU-API user has to call
> iommu_commit() to make these changes visible to the hardware (so mostly
> sync the IOTLBs). As discussed at that time this would make sense for
> the Intel and AMD IOMMU drivers.

I would *really* want to keep those off the fast path (thinking mostly
about DMA API here, since that's the performance issue). But as long as
we can achieve that, that's fine.

> > What is acceptable, though? That batched unmap is quite important for
> > performance, because it means that we don't have to bash on the hardware
> > and wait for a flush to complete in the fast path of network driver RX,
> > for example.
> 
> Have you considered a round-robin bitmap-allocator? It allows quite nice
> flushing behavior.

Yeah, I was just looking through the AMD code with a view to doing
something similar. I was thinking of using that technique *within* each
larger range allocated from the whole address space.

> > If we move to a model where we have a separate ->flush_iotlb() call, we
> > need to be careful that we still allow necessary optimisations to
> > happen.
> 
> With iommu_commit() this should be possible, still.
> 
> > I'm looking at fixing performance issues in the Intel IOMMU code, with
> > its virtual address space allocation (the rbtree-based one in iova.c
> > that nobody else uses, which has a single spinlock that *all* CPUs bash
> > on when they need to allocate).
> > 
> > The plan is, vaguely, to allocate large chunks of space to each CPU, and
> > then for each CPU to allocate from its own region first, thus ensuring
> > that the common case doesn't bounce locks between CPUs. It'll be rare
> > for one CPU to have to touch a subregion 'belonging' to another CPU, so
> > lock contention should be drastically reduced.
> 
> That's an interesting issue. It exists on the AMD IOMMU side too, the
> bitmap-allocator runs in a per-domain spinlock which can get high
> contention. I am not sure how per-cpu chunks of the address space scale
> to large numbers of cpus, though, given that some devices only have a
> small address range that they can address.

I don't care about performance of broken hardware. If we have a single
*global* "subrange" for the <4GiB range of address space, that's
absolutely fine by me.

But also, it's not *so* much of an issue to divide the space up even
when it's limited. The idea was not to have it *strictly* per-CPU, but
just for a CPU to try allocating from "its own" subrange first, and then
fall back to allocating a new subrange, and *then* fall back to
allocating from subranges "belonging" to other CPUs. It's not that the
allocation from a subrange would be lockless — it's that the lock would
almost never leave the l1 cache of the CPU that *normally* uses that
subrange.
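
In rough pseudo-C the lookup order would be something like this (every type
and helper here is hypothetical; only the ordering matters):

#include <linux/cpumask.h>
#include <linux/smp.h>

struct iova_subrange;					/* hypothetical */

unsigned long subrange_alloc(struct iova_subrange *sr, unsigned long pages);
struct iova_subrange *cpu_subrange(int cpu);		/* a CPU's own subrange */
struct iova_subrange *grow_new_subrange(int cpu);	/* carve out a fresh one */

/* returns an IOVA page frame number, or 0 when the space is exhausted */
static unsigned long iova_alloc_sketch(unsigned long pages)
{
	int cpu, me = raw_smp_processor_id();
	unsigned long iova;

	/* 1. this CPU's own subrange: its lock normally stays in our L1 */
	iova = subrange_alloc(cpu_subrange(me), pages);
	if (iova)
		return iova;

	/* 2. carve a new subrange out of the global address space */
	if (grow_new_subrange(me)) {
		iova = subrange_alloc(cpu_subrange(me), pages);
		if (iova)
			return iova;
	}

	/* 3. last resort: allocate from subranges belonging to other CPUs */
	for_each_online_cpu(cpu) {
		if (cpu == me)
			continue;
		iova = subrange_alloc(cpu_subrange(cpu), pages);
		if (iova)
			return iova;
	}

	return 0;
}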

With batched unmaps, the CPU doing the unmap may end up taking the lock
occasionally, and bounce cache lines then. But it's infrequent enough
that it shouldn't be a performance problem.

> I have been thinking about some lockless algorithms for the
> bitmap-allocator. But the ideas are not finalized yet, so I still don't
> know if they will work out at all :)

As explained above, I wasn't going for lockless. I was going for
lock-contention-less. Or at least mostly :)

Do you think that approach sounds reasonable?


Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread Joerg Roedel
On Thu, Nov 10, 2011 at 03:28:50PM +, David Woodhouse wrote:

> Which brings me to another question I have been pondering... do we even
> have a consensus on exactly *when* the IOTLB should be flushed?

Well, sort of, there is still the outstanding idea of the
iommu_commit() interface for the IOMMU-API.

> Even just for the Intel IOMMU, we have three different behaviours:
> 
>   - For DMA API users by default, we do 'batched unmap', so a mapping
> may be active for a period of time after the driver has requested
> that it be unmapped.

The requirement for the DMA-API is, that the IOTLB must be consistent
with existing mappings, and only with the parts that are really mapped.
The unmapped parts are not important.

This allows nice optimizations like your 'batched unmap' on the Intel
IOMMU driver side. The AMD IOMMU driver uses a round-robin bitmap
allocator for the IO addresses which makes it very easy to flush certain
IOTLB ranges only before they are reused.
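
Schematically, such an allocator looks roughly like this (an illustration,
not the actual AMD driver code):

#include <linux/bitmap.h>
#include <linux/spinlock.h>

struct rr_iova_range {
	spinlock_t lock;
	unsigned long *bitmap;		/* one bit per IOVA page */
	unsigned long nr_pages;
	unsigned long next;		/* round-robin cursor */
};

void rr_hw_flush_iotlb(struct rr_iova_range *r);	/* hardware-specific */

/* returns a page index within the range, or -1UL when it is exhausted */
static unsigned long rr_alloc(struct rr_iova_range *r, unsigned int pages)
{
	unsigned long idx;

	spin_lock(&r->lock);
	idx = bitmap_find_next_zero_area(r->bitmap, r->nr_pages, r->next,
					 pages, 0);
	if (idx >= r->nr_pages) {
		/*
		 * The cursor wrapped around: addresses below it may have been
		 * unmapped earlier, so flush once before any of them is reused.
		 */
		rr_hw_flush_iotlb(r);
		idx = bitmap_find_next_zero_area(r->bitmap, r->nr_pages, 0,
						 pages, 0);
	}
	if (idx < r->nr_pages) {
		bitmap_set(r->bitmap, idx, pages);
		r->next = idx + pages;
	} else {
		idx = -1UL;
	}
	spin_unlock(&r->lock);

	return idx;
}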

>   - ... unless booted with 'intel_iommu=strict', in which case we do the
> unmap and IOTLB flush immediately before returning to the driver.

There is something similar on the AMD IOMMU side. There it is called
unmap_flush.

>   - But the IOMMU API for virtualisation is different. In fact that
> doesn't seem to flush the IOTLB at all. Which is probably a bug.

Well, *current* requirement is, that the IOTLB is in sync with the
page-table at every time. This is true for the iommu_map and especially
for the iommu_unmap function. It means basically that the unmapped area
needs to be flushed out of the IOTLBs before iommu_unmap returns.

Some time ago I proposed the iommu_commit() interface which changes
these requirements. With this interface the requirement is that after a
couple of map/unmap operations the IOMMU-API user has to call
iommu_commit() to make these changes visible to the hardware (so mostly
sync the IOTLBs). As discussed at that time this would make sense for
the Intel and AMD IOMMU drivers.
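
For illustration, an IOMMU-API user would then batch operations like this
(iommu_commit() is the proposed call and does not exist yet; map/unmap are
the size-based variants from this series):

#include <linux/iommu.h>

int iommu_commit(struct iommu_domain *domain);	/* proposed interface */

static int replace_mapping(struct iommu_domain *domain, unsigned long iova,
			   phys_addr_t new_paddr, size_t size)
{
	int ret;

	iommu_unmap(domain, iova, size);
	ret = iommu_map(domain, iova, new_paddr, size,
			IOMMU_READ | IOMMU_WRITE);
	if (ret)
		return ret;

	/* only now do the accumulated changes have to reach the IOTLB */
	return iommu_commit(domain);
}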

> What is acceptable, though? That batched unmap is quite important for
> performance, because it means that we don't have to bash on the hardware
> and wait for a flush to complete in the fast path of network driver RX,
> for example.

Have you considered a round-robin bitmap-allocator? It allows quite nice
flushing behavior.

> If we move to a model where we have a separate ->flush_iotlb() call, we
> need to be careful that we still allow necessary optimisations to
> happen.

With iommu_commit() this should be possible, still.

> I'm looking at fixing performance issues in the Intel IOMMU code, with
> its virtual address space allocation (the rbtree-based one in iova.c
> that nobody else uses, which has a single spinlock that *all* CPUs bash
> on when they need to allocate).
> 
> The plan is, vaguely, to allocate large chunks of space to each CPU, and
> then for each CPU to allocate from its own region first, thus ensuring
> that the common case doesn't bounce locks between CPUs. It'll be rare
> for one CPU to have to touch a subregion 'belonging' to another CPU, so
> lock contention should be drastically reduced.

That's an interesting issue. It exists on the AMD IOMMU side too, the
bitmap-allocator runs in a per-domain spinlock which can get high
contention. I am not sure how per-cpu chunks of the address space scale
to large numbers of cpus, though, given that some devices only have a
small address range that they can address.

I have been thinking about some lockless algorithms for the
bitmap-allocator. But the ideas are not finalized yet, so I still don't
know if they will work out at all :)
The basic idea builds on the fact that most allocations using the
DMA-API fit into one page. So we could probably split the address space
into a region for one-page allocations which can be accessed without
locks, and another region for larger allocations which still need locks.
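
One (unfinished and purely illustrative) shape this could take is to claim
single-page slots with atomic bit operations instead of a spinlock:

#include <linux/atomic.h>
#include <linux/bitops.h>

#define ONE_PAGE_SLOTS	(64 * 1024)	/* illustrative region size, in pages */

static unsigned long one_page_bitmap[ONE_PAGE_SLOTS / BITS_PER_LONG];
static atomic_t one_page_hint;

/* claim one page-sized IOVA slot without taking any spinlock */
static long one_page_alloc(void)
{
	unsigned long start = (unsigned long)atomic_inc_return(&one_page_hint);
	unsigned long i;

	for (i = 0; i < ONE_PAGE_SLOTS; i++) {
		unsigned long slot = (start + i) % ONE_PAGE_SLOTS;

		if (!test_and_set_bit(slot, one_page_bitmap))
			return slot;	/* atomic claim, no lock needed */
	}
	return -1;	/* region full: fall back to the locked allocator */
}

static void one_page_free(long slot)
{
	clear_bit(slot, one_page_bitmap);
}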

> Should I be planning to drop the DMA API support from intel-iommu.c
> completely, and have the new allocator just call into the IOMMU API
> functions instead? Other people have been looking at that, haven't they?

Yes, Marek Szyprowski from the ARM side is looking into this already,
but his patches are very ARM specific and not suitable for x86 yet.

> Is there any code? Or special platform-specific requirements for such a
> generic wrapper that I might not have thought of? Details about when to
> flush the IOTLB are one such thing which might need special handling for
> certain hardware...

The plan is to have a single DMA-API implementation for all IOMMU
drivers (X86 and ARM) which just uses the IOMMU-API. But to make this
perform reasonably well, a few changes to the IOMMU-API are required.
I already have some ideas which we can discuss if you want.


Joerg

-- 
AMD Operating System Research Center


Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread David Woodhouse
On Thu, 2011-11-10 at 14:17 +0800, Kai Huang wrote:
> And another question: have we considered the IOTLB flush operation? I
> think we need to implement similar logic when flushing the DVMA range.
> Intel VT-d's manual says software needs to specify the appropriate
> mask value to flush large pages, but it does not say we need to
> exactly match the page size as it was mapped. I guess it's not
> necessary for the Intel IOMMU, but other vendors' IOMMUs may have such a
> limitation (or some other limitations). In my understanding the current
> implementation does not provide page size information for particular
> DVMA ranges that have been mapped, which makes it awkward to implement
> the IOTLB flush code (e.g., we may need to walk through the page table to
> find out the actual page size). Maybe we can also add iommu_ops->flush_iotlb?

Which brings me to another question I have been pondering... do we even
have a consensus on exactly *when* the IOTLB should be flushed?

Even just for the Intel IOMMU, we have three different behaviours:

  - For DMA API users by default, we do 'batched unmap', so a mapping
may be active for a period of time after the driver has requested
that it be unmapped.

  - ... unless booted with 'intel_iommu=strict', in which case we do the
unmap and IOTLB flush immediately before returning to the driver.

  - But the IOMMU API for virtualisation is different. In fact that
doesn't seem to flush the IOTLB at all. Which is probably a bug.

What is acceptable, though? That batched unmap is quite important for
performance, because it means that we don't have to bash on the hardware
and wait for a flush to complete in the fast path of network driver RX,
for example.

If we move to a model where we have a separate ->flush_iotlb() call, we
need to be careful that we still allow necessary optimisations to
happen.

Since I have the right people on Cc and the iommu list is still down,
and it's vaguely tangentially related...

I'm looking at fixing performance issues in the Intel IOMMU code, with
its virtual address space allocation (the rbtree-based one in iova.c
that nobody else uses, which has a single spinlock that *all* CPUs bash
on when they need to allocate).

The plan is, vaguely, to allocate large chunks of space to each CPU, and
then for each CPU to allocate from its own region first, thus ensuring
that the common case doesn't bounce locks between CPUs. It'll be rare
for one CPU to have to touch a subregion 'belonging' to another CPU, so
lock contention should be drastically reduced.

Should I be planning to drop the DMA API support from intel-iommu.c
completely, and have the new allocator just call into the IOMMU API
functions instead? Other people have been looking at that, haven't they?
Is there any code? Or special platform-specific requirements for such a
generic wrapper that I might not have thought of? Details about when to
flush the IOTLB are one such thing which might need special handling for
certain hardware...

-- 
dwmw2



Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread Joerg Roedel
On Thu, Nov 10, 2011 at 10:35:34PM +0800, cody wrote:
> Yes, I totally agree that page-size is not required for unmap operations
> and should not be added as a parameter to map/unmap operations. I am
> not talking about the unmap operation, but the IOTLB flush operation. My
> point is that we may also need to add similar logic in the IOTLB flush
> code (such as in the Intel IOMMU driver) to guarantee that, when issuing
> an IOTLB flush command for a large page, we still meet the hardware
> limitation on flushing large pages. It seems that for the Intel IOMMU the
> only limitation is that the mask value (indicating the number of 4k pages)
> cannot be smaller than the value needed to cover the large page, and the
> current Intel IOMMU driver code seems to have covered this case well. I am
> not familiar with how the AMD IOMMU issues IOTLB flush commands, but it
> should be able to handle this large-page case too. So at this moment, this
> patch should not have any issues :)

The map-operation actually takes a size, as it should. The idea is to
change this function to a map_page interface which takes a page-size
parameter, but that's another story.
The IOTLB flushing is not exposed by the IOMMU-API anyway. Whatever is
necessary to do that is the business of the IOMMU driver. So in
the unmap-path the driver finds out the page-size to unmap and can
immediately flush the IOTLB for that page.


Joerg

-- 
AMD Operating System Research Center




Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread cody

On 11/10/2011 09:08 PM, Joerg Roedel wrote:
> On Thu, Nov 10, 2011 at 08:16:16PM +0800, cody wrote:
> > On 11/10/2011 03:31 PM, Ohad Ben-Cohen wrote:
> > > On Thu, Nov 10, 2011 at 8:17 AM, Kai Huang wrote:
> > > > Seems the unmap function doesn't take phys as a parameter, does this
> > > > mean domain->ops->unmap will walk through the page table to find out
> > > > the actual page size?
> > >
> > > The short answer is yes, and furthermore, we are also considering
> > > removing the size param from domain->ops->unmap entirely at some point.
> > >
> > > We had a long discussion about it, please see:
> > >
> > > https://lkml.org/lkml/2011/10/10/234
> >
> > Yes, I've seen your discussion, I followed this thread from the beginning :)
> >
> > How about the IOTLB flush? As I said, I think we need to consider
> > that an IOMMU (even one that does not exist now) may have some limitation
> > on IOTLB flush, and hiding the page size from the IOTLB flush code may
> > hurt performance, or even worse, trigger undefined behaviors.
>
> We can only care about IOMMUs that exist today or ones that will exist
> and we already know of.
> In general, for the hardware I know of, a page-size is not required for
> implementing unmap operations. Requiring it would imply that any user
> of the IOMMU-API needs to keep track of the page-sizes used to map a
> given area.
> This would be a huge burden which is not really necessary, because the
> IOMMU driver already has this information and can return it to the user.
> So if you want to change that you need a very good reason for it.

Yes, I totally agree that page-size is not required for unmap operations and 
should not be added as a parameter to map/unmap operations. I am not 
talking about the unmap operation, but the IOTLB flush operation. My point is 
that we may also need to add similar logic in the IOTLB flush code (such as 
in the Intel IOMMU driver) to guarantee that, when issuing an IOTLB flush 
command for a large page, we still meet the hardware limitation on flushing 
large pages. It seems that for the Intel IOMMU the only limitation is that 
the mask value (indicating the number of 4k pages) cannot be smaller than 
the value needed to cover the large page, and the current Intel IOMMU driver 
code seems to have covered this case well. I am not familiar with how the 
AMD IOMMU issues IOTLB flush commands, but it should be able to handle this 
large-page case too. So at this moment, this patch should not have any issues :)
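
In code form the constraint is roughly the following (an illustration, not
the intel-iommu driver's implementation; it assumes PAGE_SIZE is 4K so that
get_order() counts 4K pages):

#include <linux/kernel.h>
#include <asm/page.h>

/*
 * Order (log2 of the number of 4K pages) to use for a page-selective IOTLB
 * invalidation covering 'size' bytes that were mapped with pages of 'pgsize'
 * bytes.  The mask may cover more than requested, but never less than one
 * mapped large page.
 */
static unsigned int iotlb_flush_order(size_t size, size_t pgsize)
{
	unsigned int order = get_order(size);
	unsigned int min_order = get_order(pgsize);

	return max(order, min_order);
}

/*
 * Example: invalidating 8K that lies inside a 2M mapping must use at least
 * order 9 (512 x 4K = 2M), even though 8K on its own would only need order 1.
 */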


-cody





Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread Joerg Roedel
On Thu, Nov 10, 2011 at 08:16:16PM +0800, cody wrote:
> On 11/10/2011 03:31 PM, Ohad Ben-Cohen wrote:
> >On Thu, Nov 10, 2011 at 8:17 AM, Kai Huang  wrote:
> >>Seems the unmap function doesn't take phys as a parameter, does this mean
> >>domain->ops->unmap will walk through the page table to find out the
> >>actual page size?
> >The short answer is yes, and furthermore, we are also considering removing
> >the size param from domain->ops->unmap entirely at some point.
> >
> >We had a long discussion about it, please see:
> >
> >https://lkml.org/lkml/2011/10/10/234
> Yes, I've seen your discussion, I followed this thread from the beginning :)
> 
> How about the IOTLB flush? As I said, I think we need to consider
> that an IOMMU (even one that does not exist now) may have some limitation on
> IOTLB flush, and hiding the page size from the IOTLB flush code may hurt
> performance, or even worse, trigger undefined behaviors.

We can only care about IOMMUs that exist today or ones that will exist
and we already know of.
In general, for the hardware I know of, a page-size is not required for
implementing unmap operations. Requiring it would imply that any user
of the IOMMU-API needs to keep track of the page-sizes used to map a
given area.
This would be a huge burden which is not really necessary, because the
IOMMU driver already has this information and can return it to the user.
So if you want to change that you need a very good reason for it.


Joerg

-- 
AMD Operating System Research Center




Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-10 Thread cody

On 11/10/2011 03:31 PM, Ohad Ben-Cohen wrote:
> On Thu, Nov 10, 2011 at 8:17 AM, Kai Huang wrote:
> > Seems the unmap function doesn't take phys as a parameter, does this
> > mean domain->ops->unmap will walk through the page table to find out
> > the actual page size?
>
> The short answer is yes, and furthermore, we are also considering
> removing the size param from domain->ops->unmap entirely at some point.
>
> We had a long discussion about it, please see:
>
> https://lkml.org/lkml/2011/10/10/234

Yes, I've seen your discussion, I followed this thread from the beginning :)

How about the IOTLB flush? As I said, I think we need to consider that an 
IOMMU (even one that does not exist now) may have some limitation on IOTLB 
flush, and hiding the page size from the IOTLB flush code may hurt 
performance, or even worse, trigger undefined behaviors.


-cody


Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-09 Thread Ohad Ben-Cohen
On Thu, Nov 10, 2011 at 8:17 AM, Kai Huang  wrote:
> Seems the unmap function doesn't take phys as a parameter, does this mean
> domain->ops->unmap will walk through the page table to find out the
> actual page size?

The short answer is yes, and furthermore, we are also considering removing
the size param from domain->ops->unmap entirely at some point.

We had a long discussion about it, please see:

https://lkml.org/lkml/2011/10/10/234


Re: [PATCH v4 2/7] iommu/core: split mapping to page sizes as supported by the hardware

2011-11-09 Thread Kai Huang
>
> -int iommu_unmap(struct iommu_domain *domain, unsigned long iova, int gfp_order)
> +size_t iommu_unmap(struct iommu_domain *domain, unsigned long iova, size_t size)
>  {
> -       size_t size, unmapped;
> +       size_t unmapped_page, unmapped = 0;
> +       unsigned int min_pagesz;
>
>        if (unlikely(domain->ops->unmap == NULL))
>                return -ENODEV;
>
> -       size         = PAGE_SIZE << gfp_order;
> -
> -       BUG_ON(!IS_ALIGNED(iova, size));
> -
> -       unmapped = domain->ops->unmap(domain, iova, size);
> -
> -       return get_order(unmapped);
> +       /* find out the minimum page size supported */
> +       min_pagesz = 1 << __ffs(domain->ops->pgsize_bitmap);
> +
> +       /*
> +        * The virtual address, as well as the size of the mapping, must be
> +        * aligned (at least) to the size of the smallest page supported
> +        * by the hardware
> +        */
> +       if (!IS_ALIGNED(iova | size, min_pagesz)) {
> +               pr_err("unaligned: iova 0x%lx size 0x%lx min_pagesz 0x%x\n",
> +                                       iova, (unsigned long)size, min_pagesz);
> +               return -EINVAL;
> +       }
> +
> +       pr_debug("unmap this: iova 0x%lx size 0x%lx\n", iova,
> +                                                       (unsigned long)size);
> +
> +       /*
> +        * Keep iterating until we either unmap 'size' bytes (or more)
> +        * or we hit an area that isn't mapped.
> +        */
> +       while (unmapped < size) {
> +               size_t left = size - unmapped;
> +
> +               unmapped_page = domain->ops->unmap(domain, iova, left);
> +               if (!unmapped_page)
> +                       break;
> +
> +               pr_debug("unmapped: iova 0x%lx size %lx\n", iova,
> +                                       (unsigned long)unmapped_page);
> +
> +               iova += unmapped_page;
> +               unmapped += unmapped_page;
> +       }
> +
> +       return unmapped;
>  }
>  EXPORT_SYMBOL_GPL(iommu_unmap);
>

Seems the unmap function doesn't take phys as a parameter, does this mean
domain->ops->unmap will walk through the page table to find out the
actual page size?

And another question: have we considered the IOTLB flush operation? I
think we need to implement similar logic when flushing the DVMA range.
Intel VT-d's manual says software needs to specify the appropriate
mask value to flush large pages, but it does not say we need to
exactly match the page size as it was mapped. I guess it's not
necessary for the Intel IOMMU, but other vendors' IOMMUs may have such a
limitation (or some other limitations). In my understanding the current
implementation does not provide page size information for particular
DVMA ranges that have been mapped, which makes it awkward to implement
the IOTLB flush code (e.g., we may need to walk through the page table to
find out the actual page size). Maybe we can also add iommu_ops->flush_iotlb?

-cody