Re: [PATCH v10 0/3] Generic IOMMU pooled allocator

2015-04-16 Thread David Miller
From: Sowmini Varadhan 
Date: Thu,  9 Apr 2015 15:33:29 -0400

> Investigation of network performance on Sparc shows a high
> degree of locking contention in the IOMMU allocator, and it
> was noticed that the PowerPC code has a better locking model.
> 
> This patch series tries to extract the generic parts of the 
> PowerPC code so that it can be shared across multiple PCI
> devices and architectures.

Series applied, thanks everyone for all the hard work!

Re: [PATCH v10 0/3] Generic IOMMU pooled allocator

2015-04-09 Thread David Miller
From: Sowmini Varadhan 
Date: Thu,  9 Apr 2015 15:33:29 -0400

> v10: resend patchv9 without RFC tag, and a new mail Message-Id,
> (previous non-RFC attempt did not show up on the patchwork queue?)

Yes, if the patch is identical, the patch posting hashes to the same value
as the RFC one, and therefore patchwork thinks it's a dup.

I asked the patchwork maintainer to adjust this behavior in the
future if he can.

Nevertheless, thanks for posting, I was aware of the update.


[PATCH v10 0/3] Generic IOMMU pooled allocator

2015-04-09 Thread Sowmini Varadhan

Investigation of network performance on Sparc shows a high
degree of locking contention in the IOMMU allocator, and it
was noticed that the PowerPC code has a better locking model.

This patch series tries to extract the generic parts of the 
PowerPC code so that it can be shared across multiple PCI
devices and architectures.

v10: resend patchv9 without RFC tag, and a new mail Message-Id,
(previous non-RFC attempt did not show up on the patchwork queue?)

Full revision history below:
v2 changes:
  - incorporate David Miller editorial comments: sparc specific
fields moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when
the iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary
check, for use case such as LDC.

v3: eliminate iommu_sparc, rearrange the ->demap indirection to
be invoked under the pool lock.

v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32.

v5: removed ->cookie_to_index and ->demap indirection from the
iommu_tbl_ops. The caller needs to call these functions as needed,
before invoking the generic arena allocator functions.
Added the "skip_span_boundary" argument to iommu_tbl_pool_init() for
those callers, like LDC, which do not care about span boundary checks.

v6: removed iommu_tbl_ops, and instead pass the ->flush_all as
an indirection to iommu_tbl_pool_init(); only invoke ->flush_all
when there is no large_pool, based on the assumption that large-pool
usage is infrequently encountered

v7: moved pool_hash initialization to lib/iommu-common.c and cleaned up
code duplication from sun4v/sun4u/ldc. 

v8: Addresses BenH comments with one exception: I've left the
IOMMU_POOL_HASH as is, so that powerpc can tailor it to their
convenience. Discarded the trylock in favor of a simple spin_lock to acquire the pool.

v9: Addresses latest BenH comments: need_flush checks, add support
for dma mask and align_order. 

v10: resend without RFC tag, and new mail Message-Id.

Sowmini Varadhan (3):
  Break up monolithic iommu table/lock into finer granularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions
  Make LDC use common iommu poll management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  172 ++--
 arch/sparc/kernel/iommu_common.h  |8 -
 arch/sparc/kernel/ldc.c   |  153 +
 arch/sparc/kernel/pci_sun4v.c |  183 --
 include/linux/iommu-common.h  |   51 +++
 lib/Makefile  |2 +-
 lib/iommu-common.c|  266 +
 8 files changed, 511 insertions(+), 331 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c


[PATCH v9 RFC 0/3] Generic IOMMU pooled allocator

2015-04-05 Thread Sowmini Varadhan
Addresses latest BenH comments: need_flush checks, add support
for dma mask and align_order. 

Sowmini Varadhan (3):
  Break up monolithic iommu table/lock into finer granularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions
  Make LDC use common iommu poll management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  172 ++--
 arch/sparc/kernel/iommu_common.h  |8 -
 arch/sparc/kernel/ldc.c   |  153 +
 arch/sparc/kernel/pci_sun4v.c |  183 --
 include/linux/iommu-common.h  |   51 +++
 lib/Makefile  |2 +-
 lib/iommu-common.c|  266 +
 8 files changed, 511 insertions(+), 331 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c


Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-04-02 Thread Benjamin Herrenschmidt
On Fri, 2015-04-03 at 07:22 +1100, Benjamin Herrenschmidt wrote:
> On Thu, 2015-04-02 at 12:21 -0400, David Miller wrote:
> > From: Sowmini Varadhan 
> > Date: Thu, 2 Apr 2015 08:51:52 -0400
> > 
> > > do I need to resubmit this without the RFC tag? Perhaps I should
> > > have dropped that some time ago.
> > 
> > I want to hear from the powerpc folks whether they can positively
> > adopt the new generic code or not.
> 
> Oh yes, I'm fine with it, it's basically our code slightly refactored,
> but I still have a few minor review comments on Sowmini's patch I
> started to put together yesterday but didn't finish. I'll send them
> today.

Actually I found some more serious issues that for some reason I didn't
spot the other day. Will send comments in a jiffy.

Ben.



Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-04-02 Thread Benjamin Herrenschmidt
On Thu, 2015-04-02 at 12:21 -0400, David Miller wrote:
> From: Sowmini Varadhan 
> Date: Thu, 2 Apr 2015 08:51:52 -0400
> 
> > do I need to resubmit this without the RFC tag? Perhaps I should
> > have dropped that some time ago.
> 
> I want to hear from the powerpc folks whether they can positively
> adopt the new generic code or not.

Oh yes, I'm fine with it, it's basically our code slightly refactored,
but I still have a few minor review comments on Sowmini's patch I
started to put together yesterday but didn't finish. I'll send them
today.

Cheers,
Ben.



Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-04-02 Thread David Miller
From: Sowmini Varadhan 
Date: Thu, 2 Apr 2015 08:51:52 -0400

> do I need to resubmit this without the RFC tag? Perhaps I should
> have dropped that some time ago.

I want to hear from the powerpc folks whether they can positively
adopt the new generic code or not.

Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-04-02 Thread Sowmini Varadhan
On (03/31/15 23:12), David Miller wrote:
> 
> It's much more amortized with smart buffering strategies, which are
> common on current generation networking cards.
> 
> There you only eat one map/unmap per "PAGE_SIZE / rx_pkt_size".
> 
> Maybe the infiniband stuff is doing things very suboptimally, and
> actually with that subsystem and drivers absolutely nothing would
> surprise me.

Yeah, we are trying to get more info from them about what their
bottlenecks are. Until then, let's just stick with the spin_lock.

do I need to resubmit this without the RFC tag? Perhaps I should
have dropped that some time ago.

--Sowmini

Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-03-31 Thread David Miller
From: Sowmini Varadhan 
Date: Tue, 31 Mar 2015 21:08:18 -0400

> I'm starting to wonder if  some approximation of dma premapped
> buffers may be needed. Doing a map/unmap on each packet is expensive.

It's much more amortized with smart buffering strategies, which are
common on current generation networking cards.

There you only eat one map/unmap per "PAGE_SIZE / rx_pkt_size".
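
As a rough worked example (with illustrative numbers, not figures from
this thread): assuming 8 KiB pages and ~1.5 KiB receive buffers,
PAGE_SIZE / rx_pkt_size is about 8192 / 1500, i.e. roughly 5, so one
IOMMU map/unmap is amortized over about five received packets.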

Maybe the infiniband stuff is doing things very suboptimally, and
actually with that subsystem and drivers absolutely nothing would
surprise me.

Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-03-31 Thread Sowmini Varadhan

On 03/31/2015 09:01 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2015-03-31 at 14:06 -0400, Sowmini Varadhan wrote:
> > Having bravely said that..
> > 
> > the IB team informs me that they see a 10% degradation using
> > the spin_lock as opposed to the trylock.
> > 
> > one path going forward is to continue processing this patch-set
> > as is. I can investigate this further, and later revise the spin_lock
> > to the trylock, after we are certain that it is good/necessary.
> 
> Have they tried using more pools instead ?

we just tried 32 instead of 16, no change to perf.

Looks like their current bottleneck is the find_next_zero_bit (they
can get a 2X perf improvement with the lock fragmentation, but are
then hitting a new ceiling, even with the trylock version)


I'm starting to wonder if  some approximation of dma premapped
buffers may be needed. Doing a map/unmap on each packet is expensive.


Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-03-31 Thread Benjamin Herrenschmidt
On Tue, 2015-03-31 at 14:06 -0400, Sowmini Varadhan wrote:
> Having bravely said that..
> 
> the IB team informs me that they see a 10% degradation using 
> the spin_lock as opposed to the trylock.
> 
> one path going forward is to continue processing this patch-set 
> as is. I can investigate this further, and later revise the spin_lock
> to the trylock, after we are certain that it is good/necessary.

Have they tried using more pools instead ?

Cheers,
Ben.



Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-03-31 Thread David Miller
From: Sowmini Varadhan 
Date: Tue, 31 Mar 2015 14:06:42 -0400

> Having bravely said that..
> 
> the IB team informs me that they see a 10% degradation using 
> the spin_lock as opposed to the trylock.
> 
> one path going forward is to continue processing this patch-set 
> as is. I can investigate this further, and later revise the spin_lock
> to the trylock, after we are certain that it is good/necessary.
> 
> thoughts?

Let's address the trylock vs. spin_lock thing later and use plain
spin_lock for now.

Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-03-31 Thread Sowmini Varadhan
On (03/31/15 10:40), Sowmini Varadhan wrote:
> 
> I've not heard back from the IB folks, but I'm going to make
> a judgement call here and go with the spin_lock. *If* they
> report some significant benefit from the trylock, probably 
> need to revisit this (and then probably start by re-examining
> the hash function to avoid collisions, before resorting to 
> trylock).

Having bravely said that..

the IB team informs me that they see a 10% degradation using 
the spin_lock as opposed to the trylock.

one path going forward is to continue processing this patch-set 
as is. I can investigate this further, and later revise the spin_lock
to the trylock, after we are certain that it is good/necessary.

thoughts?

--Sowmini



[PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-03-31 Thread Sowmini Varadhan
Addresses BenH comments with one exception: I've left the
IOMMU_POOL_HASH as is, so that powerpc can tailor it to their
convenience. 

I've not heard back from the IB folks, but I'm going to make
a judgement call here and go with the spin_lock. *If* they
report some significant benefit from the trylock, probably 
need to revisit this (and then probably start by re-examining
the hash function to avoid collisions, before resorting to 
trylock).

Sowmini Varadhan (3):
  Break up monolithic iommu table/lock into finer granularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions
  Make LDC use common iommu poll management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  171 +++
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/ldc.c   |  153 ++--
 arch/sparc/kernel/pci_sun4v.c |  181 +
 include/linux/iommu-common.h  |   48 
 lib/Makefile  |2 +-
 lib/iommu-common.c|  233 +
 8 files changed, 472 insertions(+), 331 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c


Re: [PATCH v7 0/3] Generic IOMMU pooled allocator

2015-03-27 Thread Sowmini Varadhan
On (03/26/15 08:05), Benjamin Herrenschmidt wrote:
> > PowerPC folks, what do you think?
> 
> I'll give it another look today.
> 
> Cheers,
> Ben.

Hi Ben, 

did you have a chance to look at this?

--Sowmini

Re: Generic IOMMU pooled allocator

2015-03-26 Thread Benjamin Herrenschmidt
On Thu, 2015-03-26 at 16:00 -0700, David Miller wrote:
> From: casca...@linux.vnet.ibm.com
> Date: Wed, 25 Mar 2015 21:43:42 -0300
> 
> > On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
> >> From: Benjamin Herrenschmidt 
> >> Date: Tue, 24 Mar 2015 13:08:10 +1100
> >> 
> >> > For the large pool, we don't keep a hint so we don't know it's
> >> > wrapped, in fact we purposefully don't use a hint to limit
> >> > fragmentation on it, but then, it should be used rarely enough that
> >> > flushing always is, I suspect, a good option.
> >> 
> >> I can't think of any use case where the largepool would be hit a lot
> >> at all.
> > 
> > Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
> > driver mapped a whole 64KiB page, it would hit the largepool.
> 
> We don't plan to ever use 64KB page size on sparc64, so I think we're
> safe there.
> 
> There are many reasons using 64KB pages is really stupid, what you
> see here with the IOMMU stuff is just one of many symptoms, but I bet
> your powerpc guys are kind of sick of hearing it by now... :-)

Lalalala :-)

Cheers,
Ben.



Re: Generic IOMMU pooled allocator

2015-03-26 Thread David Miller
From: casca...@linux.vnet.ibm.com
Date: Wed, 25 Mar 2015 21:43:42 -0300

> On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
>> From: Benjamin Herrenschmidt 
>> Date: Tue, 24 Mar 2015 13:08:10 +1100
>> 
>> > For the large pool, we don't keep a hint so we don't know it's
>> > wrapped, in fact we purposefully don't use a hint to limit
>> > fragmentation on it, but then, it should be used rarely enough that
>> > flushing always is, I suspect, a good option.
>> 
>> I can't think of any use case where the largepool would be hit a lot
>> at all.
> 
> Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
> driver mapped a whole 64KiB page, it would hit the largepool.

We don't plan to ever use 64KB page size on sparc64, so I think we're
safe there.

There are many reasons using 64KB pages is really stupid, what you
see here with the IOMMU stuff is just one of many symptoms, but I bet
your powerpc guys are kind of sick of hearing it by now... :-)


Re: Generic IOMMU pooled allocator

2015-03-26 Thread Sowmini Varadhan
On (03/25/15 21:43), casca...@linux.vnet.ibm.com wrote:
> However, when using large TCP send/recv (I used uperf with 64KB
> writes/reads), I noticed that on the transmit side, largealloc is not
> used, but on the receive side, cxgb4 almost only uses largealloc, while
> qlge seems to have a 1/1 usage of largealloc/non-largealloc mappings.
> When turning GRO off, that ratio is closer to 1/10, meaning there is
> still some fair use of largealloc in that scenario.
> 
> I confess my experiments are not complete. I would like to test a couple
> of other drivers as well, including mlx4_en and bnx2x, and test with
> small packet sizes. I suspected that MTU size could make a difference,
> but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I
> didn't notice any significant hit of largepool with either qlge or
> cxgb4.

I guess we also need to consider the "average use-case", i.e., 
something that interleaves small packets and interactive data with
jumbo/bulk data.. in those cases, the largepool would not get many
hits, and might actually be undesirable?

> But I believe that on the receive side, all drivers should map entire
> pages, using some allocation strategy similar to mlx4_en, in order to
> avoid DMA mapping all the time. 

Good point. I think in the early phase of my perf investigation,
it was brought up that Solaris does pre-mapped DMA buffers (they
have to do this carefully, to avoid resource-starvation vulnerabilities -
see http://www.spinics.net/lists/sparclinux/msg13217.html
and the threads leading to it).

This is  not something that the common iommu-arena allocator
can/should get involved in, of course. The scope of the arena-allocator
is much more rigorously defined.

I don't know if there is a way to set up a generalized DMA premapped
buffer infrastructure for Linux today.

fwiw, when I instrumented this for Solaris (there are hooks to disable
the pre-mapped buffers) the impact on a T5-2 (8 sockets, 2 numa nodes,
64 cpus) was not very significant for a single 10G ixgbe port- approx
8 Gbps instead of 9.X Gbps. I think the DMA buffer pre-mapping is
only significant when you start trying to scale to multiple ethernet 
ports.

--Sowmini


Re: Generic IOMMU pooled allocator

2015-03-25 Thread Benjamin Herrenschmidt
On Wed, 2015-03-25 at 21:43 -0300, casca...@linux.vnet.ibm.com wrote:
> On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
> > From: Benjamin Herrenschmidt 
> > Date: Tue, 24 Mar 2015 13:08:10 +1100
> > 
> > > For the large pool, we don't keep a hint so we don't know it's
> > > wrapped, in fact we purposefully don't use a hint to limit
> > > fragmentation on it, but then, it should be used rarely enough that
> > > flushing always is, I suspect, a good option.
> > 
> > I can't think of any use case where the largepool would be hit a lot
> > at all.
> 
> Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
> driver mapped a whole 64KiB page, it would hit the largepool.

Yes but I was talking of sparc here ...

> I have been suspicious for some time that after Anton's work on the
> pools, the large mappings optimization would throw away the benefit of
> using the 4 pools, since some drivers would always hit the largepool.

Right, I was thinking we should change the test for large pool from > 15
to > (PAGE_SHIFT * n) where n is TBD by experimentation.

> Of course, drivers that map entire pages, when not buggy, are optimized
> already to avoid calling dma_map all the time. I worked on that for
> mlx4_en, and I would expect that its receive side would always hit the
> largepool.
> 
> So, I decided to experiment and count the number of times that
> largealloc is true versus false.
> 
> On the transmit side, or when using ICMP, I didn't notice many large
> allocations with qlge or cxgb4.
> 
> However, when using large TCP send/recv (I used uperf with 64KB
> writes/reads), I noticed that on the transmit side, largealloc is not
> used, but on the receive side, cxgb4 almost only uses largealloc, while
> qlge seems to have a 1/1 usage of largealloc/non-largealloc mappings.
> When turning GRO off, that ratio is closer to 1/10, meaning there is
> still some fair use of largealloc in that scenario.

What are the sizes involved ? Always just 64K ? Or more ? Maybe just
changing 15 to 16 in the test would be sufficient ? We should make the
threshold a parameter set at init time so archs/platforms can adjust it.
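
For illustration, a minimal sketch of such a tunable threshold (all
names here are hypothetical, not the merged code); a value of 0 would
disable the large pool entirely, as discussed elsewhere in the thread:

#include <linux/types.h>

/* Hypothetical: large-allocation threshold chosen at init time instead
 * of a hard-wired "npages > 15"; 0 means "no large pool at all". */
struct iommu_pool_config {
	unsigned long	large_alloc_thresh;	/* in IOMMU pages */
};

static bool use_large_pool(const struct iommu_pool_config *cfg,
			   unsigned long npages)
{
	return cfg->large_alloc_thresh != 0 &&
	       npages > cfg->large_alloc_thresh;
}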

> I confess my experiments are not complete. I would like to test a couple
> of other drivers as well, including mlx4_en and bnx2x, and test with
> small packet sizes. I suspected that MTU size could make a difference,
> but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I
> didn't notice any significant hit of largepool with either qlge or
> cxgb4.
> 
> Also, we need to keep in mind that IOMMU_PAGE_SIZE is now dynamic in the
> latest code, with plans on using 64KiB in some situations, Alexey or Ben
> should have more details.

We still mostly use 4K afaik... We will use 64K in some KVM setups and I
do plan to switch to 64K under some circumstances when we can but we
have some limits imposed by PAPR under hypervisors here.

> But I believe that on the receive side, all drivers should map entire
> pages, using some allocation strategy similar to mlx4_en, in order to
> avoid DMA mapping all the time. Some believe that is bad for latency,
> and prefer to call something like skb_alloc for every packet received,
> but I haven't seen any hard numbers, and I don't know why we couldn't
> make such an allocator as good as using something like the SLAB/SLUB
> allocator. Maybe there is a jitter problem, since the allocator has to
> go out and get some new pages and map them, once in a while. But I don't
> see why this would not be a problem with SLAB/SLUB as well. Calling
> dma_map is even worse with the current implementation. It's just that
> some architectures do no work at all when dma_map/unmap is called.
> 
> Hope that helps consider the best strategy for the DMA space allocation
> as of now.

In any case, I don't think Sparc has the same issue. At this point
that's all I care about, once we adapt powerpc to use the new code, we
can revisit that problem on our side.

Cheers,
Ben.

> Regards.
> Cascardo.



Re: Generic IOMMU pooled allocator

2015-03-25 Thread cascardo
On Mon, Mar 23, 2015 at 10:15:08PM -0400, David Miller wrote:
> From: Benjamin Herrenschmidt 
> Date: Tue, 24 Mar 2015 13:08:10 +1100
> 
> > For the large pool, we don't keep a hint so we don't know it's
> > wrapped, in fact we purposefully don't use a hint to limit
> > fragmentation on it, but then, it should be used rarely enough that
> > flushing always is, I suspect, a good option.
> 
> I can't think of any use case where the largepool would be hit a lot
> at all.

Well, until recently, IOMMU_PAGE_SIZE was 4KiB on Power, so every time a
driver mapped a whole 64KiB page, it would hit the largepool.

I have been suspicious for some time that after Anton's work on the
pools, the large mappings optimization would throw away the benefit of
using the 4 pools, since some drivers would always hit the largepool.

Of course, drivers that map entire pages, when not buggy, are optimized
already to avoid calling dma_map all the time. I worked on that for
mlx4_en, and I would expect that its receive side would always hit the
largepool.

So, I decided to experiment and count the number of times that
largealloc is true versus false.

On the transmit side, or when using ICMP, I didn't notice many large
allocations with qlge or cxgb4.

However, when using large TCP send/recv (I used uperf with 64KB
writes/reads), I noticed that on the transmit side, largealloc is not
used, but on the receive side, cxgb4 almost only uses largealloc, while
qlge seems to have a 1/1 usage of largealloc/non-largealloc mappings.
When turning GRO off, that ratio is closer to 1/10, meaning there is
still some fair use of largealloc in that scenario.

I confess my experiments are not complete. I would like to test a couple
of other drivers as well, including mlx4_en and bnx2x, and test with
small packet sizes. I suspected that MTU size could make a difference,
but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I
didn't notice any significant hit of largepool with either qlge or
cxgb4.

Also, we need to keep in mind that IOMMU_PAGE_SIZE is now dynamic in the
latest code, with plans on using 64KiB in some situations, Alexey or Ben
should have more details.

But I believe that on the receive side, all drivers should map entire
pages, using some allocation strategy similar to mlx4_en, in order to
avoid DMA mapping all the time. Some believe that is bad for latency,
and prefer to call something like skb_alloc for every packet received,
but I haven't seen any hard numbers, and I don't know why we couldn't
make such an allocator as good as using something like the SLAB/SLUB
allocator. Maybe there is a jitter problem, since the allocator has to
go out and get some new pages and map them, once in a while. But I don't
see why this would not be a problem with SLAB/SLUB as well. Calling
dma_map is even worse with the current implementation. It's just that
some architectures do no work at all when dma_map/unmap is called.

Hope that helps consider the best strategy for the DMA space allocation
as of now.

Regards.
Cascardo.


Re: [PATCH v7 0/3] Generic IOMMU pooled allocator

2015-03-25 Thread Benjamin Herrenschmidt
On Wed, 2015-03-25 at 14:12 -0400, David Miller wrote:
> From: Sowmini Varadhan 
> Date: Wed, 25 Mar 2015 13:34:45 -0400
> 
> > Changes from patchv6: moved pool_hash initialization to
> > lib/iommu-common.c and cleaned up code duplication from 
> > sun4v/sun4u/ldc. 
> 
> Looks good to me.
> 
> PowerPC folks, what do you think?

I'll give it another look today.

Cheers,
Ben.



Re: [PATCH v7 0/3] Generic IOMMU pooled allocator

2015-03-25 Thread David Miller
From: Sowmini Varadhan 
Date: Wed, 25 Mar 2015 13:34:45 -0400

> Changes from patchv6: moved pool_hash initialization to
> lib/iommu-common.c and cleaned up code duplication from 
> sun4v/sun4u/ldc. 

Looks good to me.

PowerPC folks, what do you think?

Re: [PATCH v6 0/3] Generic IOMMU pooled allocator

2015-03-25 Thread Sowmini Varadhan
On (03/24/15 18:16), David Miller wrote:
> Generally this looks fine to me.
> 
> But about patch #2, I see no reason to have multiple iommu_pool_hash
> tables.  Even from a purely sparc perspective, we can always just do
> with just one of them.
> 
> Furthermore, you can even probably move it down into lib/iommu-common.c
> itself.  iommu_tbl_pool_init() can do the one time initialization.

fixed in v7.

Ben, Alexey, do you need more time to review this?

--Sowmini

[PATCH v7 0/3] Generic IOMMU pooled allocator

2015-03-25 Thread Sowmini Varadhan

Changes from patchv6: moved pool_hash initialization to
lib/iommu-common.c and cleaned up code duplication from 
sun4v/sun4u/ldc. 

Sowmini (2):
  Break up monolithic iommu table/lock into finer granularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions

Sowmini Varadhan (1):
  Make LDC use common iommu poll management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  174 +++
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/ldc.c   |  152 ++--
 arch/sparc/kernel/pci_sun4v.c |  179 +
 include/linux/iommu-common.h  |   48 
 lib/Makefile  |2 +-
 lib/iommu-common.c|  235 +
 8 files changed, 475 insertions(+), 330 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c


Re: [PATCH v6 0/3] Generic IOMMU pooled allocator

2015-03-24 Thread David Miller
From: Sowmini Varadhan 
Date: Tue, 24 Mar 2015 13:10:27 -0400

> Deltas from patchv5:
> - removed iommu_tbl_ops, and instead pass the ->flush_all as
>   an indirection to iommu_tbl_pool_init()
> - only invoke ->flush_all when there is no large_pool, based on
>   the assumption that large-pool usage is infrequently encountered.

Generally this looks fine to me.

But about patch #2, I see no reason to have multiple iommu_pool_hash
tables.  Even from a purely sparc perspective, we can always just do
with just one of them.

Furthermore, you can even probably move it down into lib/iommu-common.c
itself.  iommu_tbl_pool_init() can do the one time initialization.
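
A rough sketch of that kind of library-private, one-time setup
(hypothetical names and hash width; the merged lib/iommu-common.c may
differ in detail):

#include <linux/percpu.h>
#include <linux/cpumask.h>
#include <linux/hash.h>

/* Hypothetical: a single per-CPU hash used to spread CPUs across pools,
 * owned by the library and initialized exactly once from
 * iommu_tbl_pool_init(). */
static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);

static void setup_iommu_pool_hash(void)
{
	static bool done;
	unsigned int i;

	if (done)
		return;
	done = true;
	for_each_possible_cpu(i)
		per_cpu(iommu_pool_hash, i) = hash_32(i, 4 /* assumed hash bits */);
}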

[PATCH v6 0/3] Generic IOMMU pooled allocator

2015-03-24 Thread Sowmini Varadhan
Deltas from patchv5:
- removed iommu_tbl_ops, and instead pass the ->flush_all as
  an indirection to iommu_tbl_pool_init()
- only invoke ->flush_all when there is no large_pool, based on
  the assumption that large-pool usage is infrequently encountered.
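
A sketch of how the ->flush_all indirection described above can hang
off the allocator state instead of a separate ops structure
(illustrative only; the names are invented and this is not the merged
lib/iommu-common.c interface):

/* Hypothetical allocator state with the lazy flush as a plain function
 * pointer supplied at init time. */
struct dma_arena {
	void	(*lazy_flush)(struct dma_arena *);	/* NULL on sun4v-style HW */
	bool	have_large_pool;
	/* pools, spinlocks, bitmap, hints ... */
};

static void dma_arena_init(struct dma_arena *arena,
			   void (*lazy_flush)(struct dma_arena *),
			   bool large_pool)
{
	arena->lazy_flush = lazy_flush;
	arena->have_large_pool = large_pool;
}

/* The v6 rule from the changelog: only lazy-flush when there is no
 * large pool, on the assumption that large-pool usage is rare. */
static void dma_arena_wrapped(struct dma_arena *arena)
{
	if (arena->lazy_flush && !arena->have_large_pool)
		arena->lazy_flush(arena);
}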

Sowmini (2):
  Break up monolithic iommu table/lock into finer granularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions

Sowmini Varadhan (1):
  Make LDC use common iommu poll management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  182 ++-
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/ldc.c   |  169 ++---
 arch/sparc/kernel/pci_sun4v.c |  192 -
 include/linux/iommu-common.h  |   49 
 lib/Makefile  |2 +-
 lib/iommu-common.c|  219 +
 8 files changed, 503 insertions(+), 325 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c


Re: Generic IOMMU pooled allocator

2015-03-23 Thread David Miller
From: Benjamin Herrenschmidt 
Date: Tue, 24 Mar 2015 13:08:10 +1100

> For the large pool, we don't keep a hint so we don't know it's
> wrapped, in fact we purposefully don't use a hint to limit
> fragmentation on it, but then, it should be used rarely enough that
> flushing always is, I suspect, a good option.

I can't think of any use case where the largepool would be hit a lot
at all.

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Benjamin Herrenschmidt
On Mon, 2015-03-23 at 21:44 -0400, David Miller wrote:
> From: Benjamin Herrenschmidt 
> Date: Tue, 24 Mar 2015 09:21:05 +1100
> 
> > Dave, what's your feeling there ? Does anybody around still have
> > some HW that we can test with ?
> 
> I don't see what the actual problem is.
> 
> Even if you use multiple pools, which we should for scalability on
> sun4u too, just do the flush when allocation in _any_ pool wraps
> around.
> 
> That's still better than not doing the optimization at all.

I agree, I wasn't sure it was good enough for you, so was just putting
other options on the table.

> That is always going to be correct, and you can use a separate
> spinlock to make sure only one thread of control does the full
> IOMMU flush at a time.

That would have to be done inside the flush callback, but I don't
see a big issue there.

For the large pool, we don't keep a hint so we don't know it's wrapped,
in fact we purposefully don't use a hint to limit fragmentation on it,
but then, it should be used rarely enough that flushing always is, I
suspect, a good option.

Cheers,
Ben.



Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
benh> It might be sufficient to add a flush counter and compare it between runs
benh> if actual wall-clock benchmarks are too hard to do (especially if you
benh> don't have things like very fast network cards at hand).
benh>
benh> Number of flush / number of packets might be a sufficient metric, it..

I was just going to say: I can add those counters tomorrow,
and get some stats, but seems like it doesn't really matter what 
the outcome is, because: 

On (03/23/15 21:44), David Miller wrote:
> 
> From: Benjamin Herrenschmidt 
> Date: Tue, 24 Mar 2015 09:21:05 +1100
> 
> > Dave, what's your feeling there ? Does anybody around still have
> > some HW that we can test with ?
> 
> I don't see what the actual problem is.
> 
> Even if you use multiple pools, which we should for scalability on
> sun4u too, just do the flush when allocation in _any_ pool wraps
> around.
> 
> That's still better than not doing the optimization at all.
> 
> That is always going to be correct, and you can use a separate
> spinlock to make sure only one thread of control does the full
> IOMMU flush at a time.

--Sowmini

Re: Generic IOMMU pooled allocator

2015-03-23 Thread David Miller
From: Benjamin Herrenschmidt 
Date: Tue, 24 Mar 2015 09:21:05 +1100

> Dave, what's your feeling there ? Does anybody around still have
> some HW that we can test with ?

I don't see what the actual problem is.

Even if you use multiple pools, which we should for scalability on
sun4u too, just do the flush when allocation in _any_ pool wraps
around.

That's still better than not doing the optimization at all.

That is always going to be correct, and you can use a separate
spinlock to make sure only one thread of control does the full
IOMMU flush at a time.

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
On (03/24/15 11:47), Benjamin Herrenschmidt wrote:
> 
> Yes, pass a function pointer argument that can be NULL or just make it a
> member of the iommu_allocator struct (or whatever you call it) passed to
> the init function and that can be NULL. My point is we don't need a
> separate "ops" structure.

Ok, I'll make this a function pointer argument to the iommu_tbl_pool_init()
function (tomorrow).

fwiw, the ops struct came out as a result of DaveM input, 
  http://www.spinics.net/lists/sparclinux/msg13232.html
albeit the original context is now moot.

> Pass it to init and stash it in the table but don't call it
> "iommu_table", let's use a name that conveys better the fact that this
> is purely a DMA space allocator (to be embedded by the arch specific
> iommu table). Something like iommu_alloc_zone or whatever you want to
> call it. I keep finding new names whenever I think of it :-)

please pick a name that you like, as this mooshes all the patches,
and I myself really don't care what it gets called (rose? :-)) as long
as the fat-fingering risk is minimized.

Regarding the relationship with largepool,

> But that might not be necessary. If indeed we very rarely use the large
> pool, then just make it flush always. My feeling is that it will only
> ever be used at driver init/remove time when allocating things like
> descriptor rings, where the flush overhead doesn't matter.

ok, will factor this in tomorrow (assuming DaveM has no objections
to anything proposed here, esp around the function pointer for flushall).

--Sowmini

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Benjamin Herrenschmidt
On Mon, 2015-03-23 at 19:19 -0400, Sowmini Varadhan wrote:

> What I've tried to do is to have a bool large_pool arg passed
> to iommu_tbl_pool_init. In my observation (instrumented for scsi, ixgbe), 
> we never allocate more than 4 pages at a time, so I pass in 
> large_pool == false for all the sparc platforms. 

But that might not be necessary. If indeed we very rarely use the large
pool, then just make it flush always. My feeling is that it will only
ever be used at driver init/remove time when allocating things like
descriptor rings, where the flush overhead doesn't matter.

> > Or we can decide that large allocs are rare (typically
> > pci_alloc_consistent, ie, driver init time), and thus always flush on
> > them (or rather on free of a large chunk). David, what's your take
> > there ? I have a feeling that should work fine without a noticeable
> > performance issue...
> > 
> > I would also keep a "dirty" flag set on any free and cleared on any
> > flush to avoid more spurious flushes, but here too the benefit might be
> > in the noise.

Ben.



Re: Generic IOMMU pooled allocator

2015-03-23 Thread Benjamin Herrenschmidt
On Mon, 2015-03-23 at 19:08 -0400, Sowmini Varadhan wrote:

> > Sowmini, I see various options for the second choice. We could stick to
> > 1 pool, and basically do as before, ie, if we fail on the first pass of
> > alloc, it means we wrap around and do a flush, I don't think that will
> > cause a significant degradation from today, do you ? We might have an
> > occasional additional flush but I would expect it to be in the noise.
> 
> Isn't this essentially what I have in patch v5 here:
> http://www.spinics.net/lists/sparclinux/msg13534.html

Possibly, I haven't looked at it in details yet.

My point is that while we might flush a bit more often than the original
code, it should remain in the noise performance-wise and thus David
might rescind his objection, but we need to prove this with a
measurement.

> (the ops->reset is the flushall indirection, can be renamed if the
> latter is preferred)
> 
> > Dave, what's your feeling there ? Does anybody around still have some
> > HW that we can test with ?
> 
> I actually tested this on a V440 and a ultra45 (had a heck of a
> time finding these, since the owners keep them turned off because
> they are too noisy and consume too much power :-). Thus while I
> have no opinion, I would not shed any tears if we lost this extra
> perf-tweak in the interest of being earth-friendly  :-))
> 
> so testing it is not a problem, though I don't have any perf
> benchmarks for them either.

That's what we'll need unfortunately to confirm our "gut feeling". It
might be sufficient to add a flush counter and compare it between runs
if actual wall-clock benchmarks are too hard to do (especially if you
don't have things like very fast network cards at hand).

Number of flush / number of packets might be a sufficient metric, it
depends on whether it makes David happy :-)
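
A minimal sketch of that kind of instrumentation (hypothetical; counter
names and the reporting mechanism are made up):

#include <linux/atomic.h>
#include <linux/printk.h>

static atomic64_t iommu_lazy_flushes = ATOMIC64_INIT(0);
static atomic64_t iommu_map_calls = ATOMIC64_INIT(0);

/* Bump these from the lazy-flush callback and from the map path, then
 * compare the flush/mapping ratio between benchmark runs. */
static inline void count_lazy_flush(void) { atomic64_inc(&iommu_lazy_flushes); }
static inline void count_map(void) { atomic64_inc(&iommu_map_calls); }

static void report_flush_ratio(void)
{
	pr_info("iommu: %lld flushes for %lld mappings\n",
		(long long)atomic64_read(&iommu_lazy_flushes),
		(long long)atomic64_read(&iommu_map_calls));
}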

> > Sowmini, I think we can still kill the ops and have a separate data
> > structure exclusively concerned by allocations by having the alloc
> > functions take the lazy flush function as an argument (which can be
> > NULL), I don't think we should bother with ops.
> 
> I don't quite follow what you have in mind? The caller would somehow
> have to specify a flush_all indirection for the legacy platforms 

Yes, pass a function pointer argument that can be NULL or just make it a
member of the iommu_allocator struct (or whatever you call it) passed to
the init function and that can be NULL. My point is we don't need a
separate "ops" structure.

> Also, you mention
> 
> > You must hold the lock until you do the flush, otherwise somebody
> > else might allocate the not-yet-flushed areas and try to use them...
> > kaboom. However if that's the only callback left, pass it as an
> > argument.
> 
> Passing  in a function pointer to the flushall to iommu_tbl_range_alloc
> would work, or we could pass it in as an arg to iommu_tbl_init,
> and stash it in the struct iommu_table.

Pass it to init and stash it in the table but don't call it
"iommu_table", let's use a name that conveys better the fact that this
is purely a DMA space allocator (to be embedded by the arch specific
iommu table). Something like iommu_alloc_zone or whatever you want to
call it. I keep finding new names whenever I think of it :-)

Cheers,
Ben.



Re: Generic IOMMU pooled allocator

2015-03-23 Thread chase rayfield
On Mar 23, 2015 7:13 PM, "Sowmini Varadhan" 
wrote:
>
> On (03/24/15 09:21), Benjamin Herrenschmidt wrote:
> >
> > So we have two choices here that I can see:
> >
> >  - Keep that old platform use the old/simpler allocator
>
> Problem with that approach is that the base "struct iommu" structure
> for sparc gets a split personality: the older one is used with
> the older allocator, and other ugly things ensue.  (alternatively,
> you end up duplicating a version of the code with the flush_all
> inlined).
>
> >  - Try to regain the bulk of that benefit with the new one
> >
> > Sowmini, I see various options for the second choice. We could stick to
> > 1 pool, and basically do as before, ie, if we fail on the first pass of
> > alloc, it means we wrap around and do a flush, I don't think that will
> > cause a significant degradation from today, do you ? We might have an
> > occasional additional flush but I would expect it to be in the noise.
>
> Isn't this essentially what I have in patch v5 here:
> http://www.spinics.net/lists/sparclinux/msg13534.html
>
> (the ops->reset is the flushall indirection, can be renamed if the
> latter is preferred)
>
> > Dave, what's your feeling there ? Does anybody around still have some
> > HW that we can test with ?
>
> I actually tested this on a V440 and a ultra45 (had a heck of a
> time finding these, since the owners keep them turned off because
> they are too noisy and consume too much power :-).

So we need tests then more than hw? I have an ultra1, ultra10 and t2000
I can test on if needed. And I'd appreciate my caches not being flushed
excessively on these boxes :-) too. I'll see if I can't get Gentoo on the
u10 tonight.

Also... It would be more "green" to make the code run faster on these boxes
than otherwise!

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
On (03/24/15 09:36), Benjamin Herrenschmidt wrote:
> 
>  - One pool only
> 
>  - Whenever the allocation is before the previous hint, do a flush, that
> should only happen if a wrap around occurred or in some cases if the
> device DMA mask forced it. I think we always update the hint whenever we
> successfully allocate from the small pools.

right, I believe this is close to what I discussed in the previous
thread, and approx what I have in patchv5, except that the flush 
indirection can be passed as a function pointer, or via the table_ops.

> 
>  - Deal with the largealloc case. That's the contentious issue, see
> below.
  :
> The largealloc issue is a different can of worms. We can try adding an
> option to disable the largealloc business completely (instead of hard
> wiring the "15", make that a variable and define that 0 means no
> largealloc pool).

What I've tried to do is to have a bool large_pool arg passed
to iommu_tbl_pool_init. In my observation (instrumented for scsi, ixgbe), 
we never allocate more than 4 pages at a time, so I pass in 
large_pool == false for all the sparc platforms. 

> Or we can decide that large allocs are rare (typically
> pci_alloc_consistent, ie, driver init time), and thus always flush on
> them (or rather on free of a large chunk). David, what's your take
> there ? I have a feeling that should work fine without a noticeable
> performance issue...
> 
> I would also keep a "dirty" flag set on any free and cleared on any
> flush to avoid more spurious flushes, but here too the benefit might be
> in the noise.

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
On (03/24/15 09:21), Benjamin Herrenschmidt wrote:
> 
> So we have two choices here that I can see:
> 
>  - Keep that old platform use the old/simpler allocator

Problem with that approach is that the base "struct iommu" structure
for sparc gets a split personality: the older one is used with
the older allocator, and other ugly things ensue.  (alternatively,
you end up duplicating a version of the code with the flush_all 
inlined). 

>  - Try to regain the bulk of that benefit with the new one
> 
> Sowmini, I see various options for the second choice. We could stick to
> 1 pool, and basically do as before, ie, if we fail on the first pass of
> alloc, it means we wrap around and do a flush, I don't think that will
> cause a significant degradation from today, do you ? We might have an
> occasional additional flush but I would expect it to be in the noise.

Isn't this essentially what I have in patch v5 here:
http://www.spinics.net/lists/sparclinux/msg13534.html

(the ops->reset is the flushall indirection, can be renamed if the
latter is preferred)

> Dave, what's your feeling there ? Does anybody around still have some
> HW that we can test with ?

I actually tested this on a V440 and a ultra45 (had a heck of a
time finding these, since the owners keep them turned off because
they are too noisy and consume too much power :-). Thus while I
have no opinion, I would not shed any tears if we lost this extra
perf-tweak in the interest of being earth-friendly  :-))

so testing it is not a problem, though I don't have any perf
benchmarks for them either.

> Sowmini, I think we can still kill the ops and have a separate data
> structure exclusively concerned by allocations by having the alloc
> functions take the lazy flush function as an argument (which can be
> NULL), I don't think we should bother with ops.

I don't quite follow what you have in mind? The caller would somehow
have to specify a flush_all indirection for the legacy platforms 

Also, you mention

> You must hold the lock until you do the flush, otherwise somebody
> else might allocate the not-yet-flushed areas and try to use them...
> kaboom. However if that's the only callback left, pass it as an
> argument.

Passing  in a function pointer to the flushall to iommu_tbl_range_alloc
would work, or we could pass it in as an arg to iommu_tbl_init,
and stash it in the struct iommu_table.

--Sowmini


Re: Generic IOMMU pooled allocator

2015-03-23 Thread Benjamin Herrenschmidt
On Mon, 2015-03-23 at 15:05 -0400, David Miller wrote:
> From: Sowmini Varadhan 
> Date: Mon, 23 Mar 2015 12:54:06 -0400
> 
> > If it was only an optimization (i.e., removing it would not break
> > any functionality), and if this was done for older hardware,
> > and *if* we believe that the direction of most architectures is to 
> > follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> > arena pool anyway, one thought that I have for refactoring this
> > is the following:
> 
> Why add performance regressions to old machines who already are
> suffering too much from all the bloat we are constantly adding to the
> kernel?

Our allocator more/less allocates bottom-up from the last successful
allocation and tries again if it fails, so in essence, it is not so
different from what you have done. What we need to do is:

 - One pool only

 - Whenever the allocation is before the previous hint, do a flush, that
should only happen if a wrap around occurred or in some cases if the
device DMA mask forced it. I think we always update the hint whenever we
successfully allocate from the small pools.

 - Deal with the largealloc case. That's the contentious issue, see
below.

So if we ignore the largealloc split we have, we can pretty much recover
David's behaviour using our allocator. There might be the odd flush here
or there that we do and he didn't due to differences in implementation
details, but I doubt those are statistically relevant, as long as we
don't flush on every network packet, we should be good.

The largealloc issue is a different can of worms. We can try adding an
option to disable the largealloc business completely (instead of hard
wiring the "15", make that a variable and define that 0 means no
largealloc pool).

Or we can decide that large allocs are rare (typically
pci_alloc_consistent, ie, driver init time), and thus always flush on
them (or rather on free of a large chunk). David, what's your take
there ? I have a feeling that should work fine without a noticeable
performance issue...

I would also keep a "dirty" flag set on any free and cleared on any
flush to avoid more spurious flushes, but here too the benefit might be
in the noise.
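
A sketch of that "dirty" flag (illustrative only), assuming it sits next
to whatever per-pool state the allocator already keeps:

#include <linux/types.h>

struct pool_state {
	bool	dirty;		/* set on any free, cleared on any flush */
	/* bitmap, hint, lock ... */
};

static void pool_note_free(struct pool_state *p)
{
	/* ... clear the freed range in the bitmap ... */
	p->dirty = true;
}

static void pool_maybe_flush(struct pool_state *p, void (*flushall)(void))
{
	if (!p->dirty)
		return;		/* nothing has been freed since the last flush */
	flushall();
	p->dirty = false;
}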

Cheers,
Ben.


Re: Generic IOMMU pooled allocator

2015-03-23 Thread Benjamin Herrenschmidt
On Mon, 2015-03-23 at 12:54 -0400, Sowmini Varadhan wrote:

> If it was only an optimization (i.e., removing it would not break
> any functionality), and if this was done for older hardware,
> and *if* we believe that the direction of most architectures is to 
> follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> arena pool anyway, one thought that I have for refactoring this
> is the following:
> 
> - Caller of iommu_tbl_range_alloc() can do the flush_all if they 
>   see start <= end for the one single pool 
> - lose the other ->flush_all invocation (i.e., the one that is
>   done when iommu_area_alloc() fails for pass == 0, and we reset
>   start to 0 to roll-back)
> 
> that would avoid the need for any iommu_tbl_ops in my patch-set.

You must hold the lock until you do the flush, otherwise somebody
else might allocate the not-yet-flushed areas and try to use them...
kaboom. However if that's the only callback left, pass it as an
argument.

> But it would imply that you would still take the perf hit for the roll-back
> if we failed the pass == 0 iteration through iommu_area_alloc().
> Perhaps this is an acceptable compromise in favor of cleaner code
> (again, assuming that current/future archs will all follow the HV
> based design).
> 
> --Sowmini
>  



Re: Generic IOMMU pooled allocator

2015-03-23 Thread Benjamin Herrenschmidt
On Mon, 2015-03-23 at 15:05 -0400, David Miller wrote:
> From: Sowmini Varadhan 
> Date: Mon, 23 Mar 2015 12:54:06 -0400
> 
> > If it was only an optimization (i.e., removing it would not break
> > any functionality), and if this was done for older hardware,
> > and *if* we believe that the direction of most architectures is to 
> > follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> > arena pool anyway, one thought that I have for refactoring this
> > is the following:
> 
> Why add performance regressions to old machines who already are
> suffering too much from all the bloat we are constantly adding to the
> kernel?

So we have two choices here that I can see:

 - Keep that old platform use the old/simpler allocator

 - Try to regain the bulk of that benefit with the new one

Sowmini, I see various options for the second choice. We could stick to
1 pool, and basically do as before, ie, if we fail on the first pass of
alloc, it means we wrap around and do a flush, I don't think that will
cause a significant degradation from today, do you ? We might have an
occasional additional flush but I would expect it to be in the noise.

Another option would be trickier, is to keep an additional bitmap
of "stale" entries. When an entry is freed, instead of freeing it
in the main bitmap, set a bit in the "stale" bit map. If we fail to
allocate, then flush, xor off the main bitmap bits using the stale
bitmap, and try again.

However, the second approach means that as the main bitmap gets full, we
will start allocating from remote pools, so it partially defeats the
pool system, unless we do everything locally per-pool (ie flush when the
pool is full before we fall back to another pool), in which case we go
back to flushing more often than we used to. But here too, the
difference might end up in the noise; we still flush orders of magnitude
less than once per translation update.
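
A sketch of that second, "stale bitmap" approach (hypothetical,
per-pool; bitmap_andnot() stands in for the xor mentioned above, which
is equivalent here since the stale bits are a subset of the allocated
bits):

#include <linux/bitmap.h>

struct lazy_pool {
	unsigned long	*map;	/* live allocation bitmap */
	unsigned long	*stale;	/* freed but not yet flushed entries */
	unsigned long	nbits;
};

/* Free path: defer the real free, just remember the range as stale. */
static void lazy_pool_free(struct lazy_pool *p, unsigned long entry,
			   unsigned long npages)
{
	bitmap_set(p->stale, entry, npages);
}

/* Allocation-failure path: flush once, return the stale entries to the
 * allocator, and start over with an empty stale bitmap. */
static void lazy_pool_reclaim(struct lazy_pool *p, void (*flushall)(void))
{
	flushall();
	bitmap_andnot(p->map, p->map, p->stale, p->nbits);
	bitmap_zero(p->stale, p->nbits);
}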

Dave, what's your feeling there ? Does anybody around still have some
HW that we can test with ?

Sowmini, I think we can still kill the ops and have a separate data
structure exclusively concerned by allocations by having the alloc
functions take the lazy flush function as an argument (which can be
NULL), I don't think we should bother with ops.

Cheers,
Ben.



Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
On (03/23/15 15:05), David Miller wrote:
> 
> Why add performance regressions to old machines who already are
> suffering too much from all the bloat we are constantly adding to the
> kernel?

I have no personal opinion on this- it's a matter of  choosing
whether we want to have some extra baggage in the library to support
old hardware, or keep the code lean-and-mean for the future,
where the model has changed.

I'll let others on this list make the call.

--Sowmini


Re: Generic IOMMU pooled allocator

2015-03-23 Thread David Miller
From: Sowmini Varadhan 
Date: Mon, 23 Mar 2015 12:54:06 -0400

> If it was only an optimization (i.e., removing it would not break
> any functionality), and if this was done for older hardware,
> and *if* we believe that the direction of most architectures is to 
> follow the sun4v/HV model, then, given that the sun4u code only uses 1 
> arena pool anyway, one thought that I have for refactoring this
> is the following:

Why add performance regressions to old machines who already are
suffering too much from all the bloat we are constantly adding to the
kernel?

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Arnd Bergmann
On Monday 23 March 2015, Benjamin Herrenschmidt wrote:
> On Mon, 2015-03-23 at 07:04 +0100, Arnd Bergmann wrote:
> > 
> > My guess is that the ARM code so far has been concerned mainly with
> > getting things to work in the first place, but scalability problems
> > will only be seen when there are faster CPU cores become available.
> 
> In any case, I think this is mostly a non-issue. The complexity of the
> ARM code is in various areas related to making shit work (handling
> coherent allocations mostly) but only remotely related to the actual
> iommu DMA space allocator (iova in ARM as far as I understand the code)
> which is pretty standard.

Ok, got it. Thanks for explaining that part.

> The work Sowmini is doing is about specifically the allocator. Making
> our (powerpc) allocator generic since it has some nice scalability
> features.
> 
> In fact, what Aik and I have been pushing and Sowmini is close to
> achieving is to mostly disconnect that allocator from the rest of the
> iommu management (the caller).
> 
> So in the end, the allocator itself should be splittable into something
> separate that resides in lib/ or similar, which ARM can choose to use as
> well.

Yes, this sounds like a good idea. I'm currently at ELC and will bring this
up with the people working on ARM IOMMU support.

Arnd

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
On (03/23/15 12:29), David Miller wrote:
> 
> In order to elide the IOMMU flush as much as possible, I implemnented
> a scheme for sun4u wherein we always allocated from low IOMMU
> addresses to high IOMMU addresses.
> 
> In this regime, we only need to flush the IOMMU when we rolled over
> back to low IOMMU addresses during an allocation.
> 
> It made a noticeable difference in performance.
> 
> Unfortunately, with sun4v and the hypervisor, I'm not allowed to
> control when the IOMMU flush happens; it has to occur on every
> single IOMMU mapping change.  So this optimization was no longer
> possible there.
> 
> Anyways, that's the history behind it.
> --

I see.

If it was only an optimization (i.e., removing it would not break
any functionality), and if this was done for older hardware,
and *if* we believe that the direction of most architectures is to 
follow the sun4v/HV model, then, given that the sun4u code only uses 1 
arena pool anyway, one thought that I have for refactoring this
is the following:

- The caller of iommu_tbl_range_alloc() can do the flush_all if it sees
  start <= end for the single pool
- lose the other ->flush_all invocation (i.e., the one that is
  done when iommu_area_alloc() fails for pass == 0, and we reset
  start to 0 to roll back)

that would avoid the need for any iommu_tbl_ops in my patch-set.

But it would imply that you would still take the perf hit for the roll-back
if we failed the pass == 0 iteration through iommu_area_alloc().
Perhaps this is an acceptable compromise in favor of cleaner code
(again, assuming that current/future archs will all follow the HV
based design).
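
Purely for illustration, a toy model of that caller-side division of labor
(this is one reading of the proposal, not code from the series; the pool
layout, lib_range_alloc() and hw_flushall() are all hypothetical, and
"start <= end" is read here as "the allocation wrapped back below the
previous position"):

    /* Toy model: the library only allocates; the sun4u-side caller notices
     * the wrap-around and issues the flush itself.
     */
    struct pool {
            unsigned long hint;         /* next candidate entry */
    };

    extern unsigned long lib_range_alloc(struct pool *p, unsigned long npages);
    extern void hw_flushall(void);      /* would hit the IOMMU flush register */

    static unsigned long sun4u_map(struct pool *p, unsigned long npages)
    {
            unsigned long prev = p->hint;
            unsigned long entry = lib_range_alloc(p, npages);

            if (entry < prev)           /* wrapped back to low addresses */
                    hw_flushall();
            return entry;
    }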

--Sowmini
 
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-23 Thread David Miller
From: Sowmini Varadhan 
Date: Sun, 22 Mar 2015 15:27:26 -0400

> That leaves only the odd iommu_flushall() hook, I'm trying
> to find the history behind that (needed for sun4u platforms,
> afaik, and not sure if there are other ways to achieve this).

In order to elide the IOMMU flush as much as possible, I implemented
a scheme for sun4u wherein we always allocated from low IOMMU
addresses to high IOMMU addresses.

In this regime, we only need to flush the IOMMU when we rolled over
back to low IOMMU addresses during an allocation.

It made a noticeable difference in performance.

Unfortunately, with sun4v and the hypervisor, I'm not allowed to
control when the IOMMU flush happens; it has to occur on every
single IOMMU mapping change.  So this optimization was no longer
possible there.

Anyways, that's the history behind it.
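
A small standalone model of that scheme, just to show its shape (this is
not the actual sparc code; struct map_table here is a stand-in, and the
free/bitmap side is omitted):

    /* Bump-allocate from low to high addresses; only when an allocation has
     * to wrap back to the bottom of the table is a full IOMMU flush needed,
     * since only then can entries with stale IOTLB state be reused.
     */
    struct map_table {
            unsigned long hint;         /* next candidate entry    */
            unsigned long size;         /* total number of entries */
    };

    extern void iommu_flushall(struct map_table *tbl);  /* HW-specific */

    static unsigned long range_alloc(struct map_table *tbl, unsigned long npages)
    {
            unsigned long entry = tbl->hint;

            if (entry + npages > tbl->size) {
                    entry = 0;              /* wrap back to low addresses */
                    iommu_flushall(tbl);    /* flush before reusing them  */
            }
            tbl->hint = entry + npages;
            return entry;
    }

The point is simply that the flush cost is paid once per wrap rather than
once per mapping change.
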
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Benjamin Herrenschmidt
On Mon, 2015-03-23 at 07:04 +0100, Arnd Bergmann wrote:
> 
> My guess is that the ARM code so far has been concerned mainly with
> getting things to work in the first place, but scalability problems
> will only be seen when faster CPU cores become available.

In any case, I think this is mostly a non-issue. The complexity of the
ARM code is in various areas related to making shit work (handling
coherent allocations mostly) but only remotely related to the actual
iommu DMA space allocator (iova in ARM as far as I understand the code)
which is pretty standard.

The work Sowmini is doing is about specifically the allocator. Making
our (powerpc) allocator generic since it has some nice scalability
features.

In fact, what Aik and I have been pushing and Sowmini is close to
achieving is to mostly disconnect that allocator from the rest of the
iommu management (the caller).

So in the end, the allocator itself should be splittable into something
separate that resides in lib/ or similar, which ARM can choose to use as
well.

Sowmini: it might be worthwhile creating a separate data structure,
iommu_virt_zone or something like that, which is what the allocator works
on exclusively and which we would just embed into sparc and powerpc. The
only remaining problem I see is your "reset" op, but I think we should
be able to sort it out once we understand what it's there for :-)
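
Roughly what that embedding could look like (iommu_virt_zone is the
placeholder name used above; the fields shown are just plausible examples,
not a proposed layout):

    /* Sketch of the split: generic allocator state lives in lib/, and each
     * arch wraps it in its own iommu structure.
     */
    struct iommu_virt_zone {
            unsigned long   table_size;     /* number of IOMMU entries  */
            unsigned long   *bitmap;        /* one bit per entry        */
            /* per-pool hints, spinlocks, large-pool state, ...         */
    };

    struct arch_iommu {                     /* e.g. sparc or powerpc    */
            struct iommu_virt_zone  zone;   /* generic allocation state */
            unsigned long           page_table_map_base;
            /* HW specifics: TSB/TCE pointers, flush registers, ...     */
    };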

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-22 Thread Arnd Bergmann
On Sunday 22 March 2015, Benjamin Herrenschmidt wrote:
> On Sun, 2015-03-22 at 18:07 -0400, Sowmini Varadhan wrote:
> > On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > > > How does this relate to the ARM implementation? There is currently
> > > > an effort going on to make that one shared with ARM64 and possibly
> > > > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > > > allocation to see if we could pick one of the two to work on
> > > > all architectures?
> > > 
> > > What I see in ARM is horribly complex, I can't quite make sense of it
> > > in a couple of minutes of looking at it, and doesn't seem to address the
> > > basic issue we are addressing here which is the splitting of the iommu
> > > table lock.
> > 
> > Amen to that.. I thought it was just me :-)

Two problems on ARM IOMMU hardware are:

- a very wide range of incompatible approaches, from a simple sw-loaded
  iotlb with just a few hugepage entries to some rather complex
  ones
- a rather overdesigned "SMMU" that will be used by half the SoCs, but
  with some of them implementing slightly incompatible variants.

> > I plan to go through the code to see if/where the ARM iommu code
> > does its locking and achieves its parallelism, but the mapping
> > between the sparc/powerpc approach and ARM is not immediately obvious
> > to me.

My guess is that the ARM code so far has been concerned mainly with
getting things to work in the first place, but scalability problems
will only be seen when faster CPU cores become available.

> We might have more chance with x86_64 though... They would surely have
> similar scalability issues.
> 

I believe the complexity of the modern Intel and AMD IOMMUs is similar
to the ARM SMMU, which appears to have been modeled after the Intel
one to some degree.

Arnd
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v5 RFC 0/3] Generic IOMMU pooled allocator

2015-03-22 Thread Benjamin Herrenschmidt
On Sun, 2015-03-22 at 15:22 -0400, Sowmini Varadhan wrote:
> Follows up on the feedback in the thread at
>   http://www.spinics.net/lists/sparclinux/msg13493.html
> 
> - removed ->cookie_to_index and ->demap indirection from the iommu_tbl_ops.
>   The caller needs to call these functions as needed, before invoking
>   the generic arena allocator functions.
> 
> - added the "skip_span_boundary" argument to iommu_tbl_pool_init() for
>   those callers like LDC which do not care about span boundary checks.

Thanks. I'll try to look in depth some time this week, but poke me if
you don't hear from me, I'm a bit swamped at the moment. In the
meantime, Alexey might have some bandwidth to also check it out.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-22 Thread Benjamin Herrenschmidt
On Sun, 2015-03-22 at 18:07 -0400, Sowmini Varadhan wrote:
> On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > > How does this relate to the ARM implementation? There is currently
> > > an effort going on to make that one shared with ARM64 and possibly
> > > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > > allocation to see if we could pick one of the two to work on
> > > all architectures?
> > 
> > What I see in ARM is horribly complex, I can't quite make sense of it
> > in a couple of minutes of looking at it, and doesn't seem to address the
> > basic issue we are addressing here which is the splitting of the iommu
> > table lock.
> 
> Amen to that.. I thought it was just me :-)
> 
> I plan to go through the code to see if/where the ARM iommu code
> does its locking and achieves its parallelism, but the mapping
> between the sparc/powerpc approach and ARM is not immediately obvious
> to me.

We might have more chance with x86_64 though... They would surely have
similar scalability issues.

Cheers,
Ben.


___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-22 Thread Sowmini Varadhan
On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
> > How does this relate to the ARM implementation? There is currently
> > an effort going on to make that one shared with ARM64 and possibly
> > x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> > allocation to see if we could pick one of the two to work on
> > all architectures?
> 
> What I see in ARM is horribly complex, I can't quite make sense of it
> in a couple of minutes of looking at it, and doesn't seem to address the
> basic issue we are addressing here which is the splitting of the iommu
> table lock.

Amen to that.. I thought it was just me :-)

I plan to go through the code to see if/where the ARM iommu code
does its locking and achieves its parallelism, but the mapping
between the sparc/powerpc approach and ARM is not immediately obvious
to me.

--Sowmini

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-22 Thread Benjamin Herrenschmidt
On Sun, 2015-03-22 at 20:36 +0100, Arnd Bergmann wrote:

> How does this relate to the ARM implementation? There is currently
> an effort going on to make that one shared with ARM64 and possibly
> x86. Has anyone looked at both the PowerPC and ARM ways of doing the
> allocation to see if we could pick one of the two to work on
> all architectures?

What I see in ARM is horribly complex, I can't quite make sense of it
in a couple of minutes of looking at it, and doesn't seem to address the
basic issue we are addressing here which is the splitting of the iommu
table lock.

I think we are looking at two very different approaches here. One is
to deal with horrendously broken HW, which is an ARM SoC vendor's
signature :) The other one is about performance and scalability on sane
HW.

Cheers,
Ben.

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-22 Thread Arnd Bergmann
On Thursday 19 March 2015, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.
> 
> Sowmini put together a patch series that tries to extract out the
> generic parts of your code and place it in lib/iommu-common.c so
> that both Sparc and PowerPC can make use of it.
> 
> The real test is if powerpc can be converted to use it.
> 
> So if you guys could please take a look at the current version of
> this series at:
> 
> http://patchwork.ozlabs.org/patch/449712/
> http://patchwork.ozlabs.org/patch/449713/
> http://patchwork.ozlabs.org/patch/449715/
> 
> and give us some feedback that would be great.
> 
> After we sort out making this work properly for both architectures we
> can figure out who merges which parts via what tree(s)

How does this relate to the ARM implementation? There is currently
an effort going on to make that one shared with ARM64 and possibly
x86. Has anyone looked at both the PowerPC and ARM ways of doing the
allocation to see if we could pick one of the two to work on
all architectures?

Arnd
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-22 Thread Sowmini Varadhan
It turned out that I was able to iterate over it, and remove
both the ->cookie_to_index and the ->demap indirection from
iommu_tbl_ops.

That leaves only the odd iommu_flushall() hook, I'm trying
to find the history behind that (needed for sun4u platforms,
afaik, and not sure if there are other ways to achieve this).

Just sent out a new patch-set, which should reach all the 
people/lists on this thread.

If possible, it would also help to share any preliminary
code samples of the iommu_table_ops you folks have in mind.

Please take a look at patchv5, and share your thoughts.

--Sowmini
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v5 RFC 0/3] Generic IOMMU pooled allocator

2015-03-22 Thread Sowmini Varadhan
Follows up on the feedback in the thread at
  http://www.spinics.net/lists/sparclinux/msg13493.html

- removed ->cookie_to_index and ->demap indirection from the iommu_tbl_ops.
  The caller needs to call these functions as needed, before invoking
  the generic arena allocator functions.

- added the "skip_span_boundary" argument to iommu_tbl_pool_init() for
  those callers like LDC which do not care about span boundary checks
  (sketched below).
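
A rough sketch of what such an init signature could look like (the argument
list and types here are illustrative, not copied from the patches):

    #include <stdbool.h>

    struct iommu_map_table;

    /* Sketch only: the flag simply disables the span-boundary check for
     * callers, like LDC, that never care about segment boundaries.
     */
    void iommu_tbl_pool_init(struct iommu_map_table *tbl,
                             unsigned long num_entries,
                             unsigned int npools,
                             bool skip_span_boundary);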


Sowmini (2):
  Break up monolithic iommu table/lock into finer graularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions

Sowmini Varadhan (1):
  Make LDC use common iommu poll management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  187 +++-
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/ldc.c   |  172 ++---
 arch/sparc/kernel/pci_sun4v.c |  194 -
 include/linux/iommu-common.h  |   55 +
 lib/Makefile  |2 +-
 lib/iommu-common.c|  219 +
 8 files changed, 519 insertions(+), 325 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-19 Thread Sowmini Varadhan

On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:

Ben> One thing I noticed is the asymmetry in your code between the alloc
Ben> and the free path. The alloc path is similar to us in that the lock
Ben> covers the allocation and that's about it, there's no actual mapping to
Ben> the HW done, it's done by the caller level right ?

Yes, the only constraint is that the h/w alloc transaction should be
done after the arena-alloc, whereas for the unmap, the h/w transaction
should happen first, and the arena-unmap should happen after.
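
A skeleton of that ordering, purely for illustration (the helper names are
made up, not from the series):

    /* Map: reserve arena entries first, then program the HW translation.
     * Unmap: tear down the HW translation first, then release the entries.
     */
    struct arena;

    extern unsigned long arena_alloc(struct arena *a, unsigned long npages);
    extern void arena_free(struct arena *a, unsigned long entry,
                           unsigned long npages);
    extern void hw_map(unsigned long entry, void *cpu, unsigned long npages);
    extern void hw_unmap(unsigned long entry, unsigned long npages);

    static unsigned long dma_map(struct arena *a, void *cpu, unsigned long npages)
    {
            unsigned long entry = arena_alloc(a, npages);   /* 1: reserve   */

            hw_map(entry, cpu, npages);                     /* 2: program   */
            return entry;
    }

    static void dma_unmap(struct arena *a, unsigned long entry,
                          unsigned long npages)
    {
            hw_unmap(entry, npages);                        /* 1: drop HW   */
            arena_free(a, entry, npages);                   /* 2: release   */
    }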

Ben> The free path however, in your case, takes the lock and calls back into
Ben> "demap" (which I assume is what removes the translation from the HW)
Ben> with the lock held. There's also some mapping between cookies
Ben> and index which here too isn't exposed to the alloc side but is
Ben> exposed to the free side.

Regarding the ->demap indirection: somewhere between V1 and V2, I
realized that, at least for sun4v, it's not necessary to hold
the pool lock when doing the unmap (V1 had originally passed this
as the ->demap). Revisiting the LDC change, I think that even that
has no pool-specific info that needs to be passed in, so possibly the
->demap is not required at all?

I can remove that, and re-verify the LDC code (though I might
not be able to get to this till early next week, as I'm travelling
at the moment).

About the cookie_to_index, that came out of observation of the
LDC code (ldc_cookie_to_index in patchset 3). In the LDC case, 
the workflow is approximately
   base = alloc_npages(..); /* calls iommu_tbl_range_alloc() */
   /* set up cookie_state using base */
   /* populate cookies calling fill_cookies() -> make_cookie() */
The make_cookie() is the inverse operation of cookie_to_index()
(afaict, the code is not very well commented, I'm afraid), but 
I need that indirection to figure out which bitmap to clear.

I don't know if there's a better way to do this, or whether
the ->cookie_to_index can get more complex for other IOMMU users.
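
To make the inverse relationship concrete, here is a hypothetical cookie
layout (this is NOT the real LDC encoding, only an illustration of why the
free side needs cookie_to_index()):

    /* Hypothetical layout: page-size code in the top bits, table index in
     * the rest.  cookie_to_index() recovers which bitmap entry to clear.
     */
    #define COOKIE_PGSZ_SHIFT   60
    #define COOKIE_INDEX_MASK   ((1ULL << COOKIE_PGSZ_SHIFT) - 1)

    static unsigned long long make_cookie(unsigned long long index,
                                          unsigned long long pgsz_code)
    {
            return (pgsz_code << COOKIE_PGSZ_SHIFT) | index;
    }

    static unsigned long long cookie_to_index(unsigned long long cookie)
    {
            return cookie & COOKIE_INDEX_MASK;
    }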

Ben> One thing that Alexey is doing on our side is to move some of the
Ben> hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
Ben> global ppc_md. data structure to a new iommu_table_ops, so your patches
Ben> will definitely collide with our current work so we'll have to figure
Ben> out how things can be made to match. We might be able to move more than
Ben> just the allocator to the generic code, and the whole implementation of
Ben> map_sg/unmap_sg if we have the right set of "ops", unless you see a
Ben> reason why that wouldn't work for you ?

I cant think of why that wont work, though it would help to see
the patch itself..


Ben> We also need to add some additional platform specific fields to certain
Ben> iommu table instances to deal with some KVM related tracking of pinned
Ben> "DMAble" memory, here too we might have to be creative and possibly
Ben> enclose the generic iommu_table in a platform specific variant.
Ben> 
Ben> Alexey, any other comment ?
Ben> 
Ben> Cheers,
Ben> Ben.
Ben> 
Ben> Alexey, any other comment ?

On (03/19/15 16:27), Alexey Kardashevskiy wrote:
Alexey> 
Alexey> Agree about missing symmetry. In general, I would call it "zoned
Alexey> pool-locked memory allocator"-ish rather than iommu_table and have
Alexey> no callbacks there.
Alexey> 
Alexey> The iommu_tbl_range_free() caller could call cookie_to_index() and

The problem is that tbl_range_free itself needs the `entry' from
->cookie_to_index; I don't know if there's a way to move the code
to avoid that.

Alexey> what the reset() callback does here - I do not understand, some

The ->reset callback came out of the sun4u use-case. Davem might
have more history here than I do, but my understanding is that the
iommu_flushall() was needed on the older sun4u architectures, where
there was no intermediating HV?

Alexey> documentation would help here, and demap() does not have to be
Alexey> executed under the lock (as map() is not executed under the lock).
Alexey> 
Alexey> btw why demap, not unmap? :)

Maybe neither is needed, as you folks made me realize above.

--Sowmini

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-18 Thread Alexey Kardashevskiy

On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:

On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:

PowerPC folks, we're trying to kill the locking contention in our
IOMMU allocators and noticed that you guys have a nice solution to
this in your IOMMU code.


  .../...

Adding Alexey too, who is currently doing some changes to our iommu
code to deal with KVM.

One thing I noticed is the asymmetry in your code between the alloc
and the free path. The alloc path is similar to us in that the lock
covers the allocation and that's about it, there's no actual mapping to
the HW done, it's done by the caller level right ?

The free path however, in your case, takes the lock and calls back into
"demap" (which I assume is what removes the translation from the HW)
with the lock held. There's also some mapping between cookies
and index which here too isn't exposed to the alloc side but is
exposed to the free side.

I'm sure there's a rationale for that but it would be good if you could
explain it and document the semantics of those 3 callbacks in the iommu
ops.

One thing that Alexey is doing on our side is to move some of the
hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
global ppc_md. data structure to a new iommu_table_ops, so your patches
will definitely collide with our current work so we'll have to figure
out how things can made to match. We might be able to move more than
just the allocator to the generic code, and the whole implementation of
map_sg/unmap_sg if we have the right set of "ops", unless you see a
reason why that wouldn't work for you ?

We also need to add some additional platform specific fields to certain
iommu table instances to deal with some KVM related tracking of pinned
"DMAble" memory, here too we might have to be creative and possibly
enclose the generic iommu_table in a platform specific variant.

Alexey, any other comment ?



Agree about missing symmetry. In general, I would call it "zoned 
pool-locked memory allocator"-ish rather than iommu_table and have no 
callbacks there.


The iommu_tbl_range_free() caller could call cookie_to_index() itself; what the
reset() callback does here I do not understand, and some documentation would
help. Also, demap() does not have to be executed under the lock (as
map() is not executed under the lock).


btw why demap, not unmap? :)


--
Alexey
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-18 Thread Benjamin Herrenschmidt
On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.

 .../...

Adding Alexey too, who is currently doing some changes to our iommu
code to deal with KVM.

One thing I noticed is the asymmetry in your code between the alloc
and the free path. The alloc path is similar to us in that the lock
covers the allocation and that's about it, there's no actual mapping to
the HW done, it's done by the caller level right ?

The free path however, in your case, takes the lock and calls back into
"demap" (which I assume is what removes the translation from the HW)
with the lock held. There's also some mapping between cookies
and index which here too isn't exposed to the alloc side but is
exposed to the free side.

I'm sure there's a rationale for that but it would be good if you could
explain it and document the semantics of those 3 callbacks in the iommu
ops.
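
For reference while reading the above, roughly what the three hooks in the
posted series look like; the comments paraphrase their semantics as
discussed in this thread, and the names/argument lists are approximate
rather than copied from the patches:

    struct iommu_map_table;

    struct iommu_tbl_ops {
            /* sun4u-style lazy flush of the whole IOMMU, run when the
             * allocator wraps back to low addresses
             */
            void (*reset)(struct iommu_map_table *tbl);

            /* map a caller-visible cookie (e.g. an LDC cookie) back to the
             * table index whose bitmap bit is cleared on free
             */
            unsigned long (*cookie_to_index)(unsigned long cookie, void *arg);

            /* remove the translation from the HW; on the free path this is
             * currently invoked with the pool lock held
             */
            void (*demap)(void *arg, unsigned long entry, unsigned long npages);
    };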

One thing that Alexey is doing on our side is to move some of the
hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
global ppc_md. data structure to a new iommu_table_ops, so your patches
will definitely collide with our current work so we'll have to figure
out how things can be made to match. We might be able to move more than
just the allocator to the generic code, and the whole implementation of
map_sg/unmap_sg if we have the right set of "ops", unless you see a
reason why that wouldn't work for you ?

We also need to add some additional platform specific fields to certain
iommu table instances to deal with some KVM related tracking of pinned
"DMAble" memory, here too we might have to be creative and possibly
enclose the generic iommu_table in a platform specific variant.

Alexey, any other comment ?

Cheers,
Ben.



___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-18 Thread David Miller
From: Benjamin Herrenschmidt 
Date: Thu, 19 Mar 2015 13:46:15 +1100

> Sounds like a good idea ! CC'ing Anton who wrote the pool stuff. I'll
> try to find somebody to work on that here & will let you know asap.

Thanks a lot Ben.
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-18 Thread Benjamin Herrenschmidt
On Wed, 2015-03-18 at 22:25 -0400, David Miller wrote:
> PowerPC folks, we're trying to kill the locking contention in our
> IOMMU allocators and noticed that you guys have a nice solution to
> this in your IOMMU code.
> 
> Sowmini put together a patch series that tries to extract out the
> generic parts of your code and place it in lib/iommu-common.c so
> that both Sparc and PowerPC can make use of it.
> 
> The real test is if powerpc can be converted to use it.
> 
> So if you guys could please take a look at the current version of
> this series at:
> 
> http://patchwork.ozlabs.org/patch/449712/
> http://patchwork.ozlabs.org/patch/449713/
> http://patchwork.ozlabs.org/patch/449715/
> 
> and give us some feedback that would be great.
> 
> After we sort out making this work properly for both architectures we
> can figure out who merges which parts via what tree(s).

Sounds like a good idea ! CC'ing Anton who wrote the pool stuff. I'll
try to find somebody to work on that here & will let you know asap.

Cheers,
Ben.



___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Generic IOMMU pooled allocator

2015-03-18 Thread David Miller

PowerPC folks, we're trying to kill the locking contention in our
IOMMU allocators and noticed that you guys have a nice solution to
this in your IOMMU code.

Sowmini put together a patch series that tries to extract out the
generic parts of your code and place it in lib/iommu-common.c so
that both Sparc and PowerPC can make use of it.

The real test is if powerpc can be converted to use it.

So if you guys could please take a look at the current version of
this series at:

http://patchwork.ozlabs.org/patch/449712/
http://patchwork.ozlabs.org/patch/449713/
http://patchwork.ozlabs.org/patch/449715/

and give us some feedback that would be great.

After we sort out making this work properly for both architectures we
can figure out who merges which parts via what tree(s).

Thanks!
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev