Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Sat, 2015-04-04 at 07:27 -0400, Sowmini Varadhan wrote:
> One last question before I spin out v9.. the dma_mask code
> is a bit confusing to me, so I want to make sure... the code is
>
> >	if (limit + tbl->it_offset > mask) {
> >		limit = mask - tbl->it_offset + 1;
> >		/* If we're constrained on address range, first try
> >		 * at the masked hint to avoid O(n) search complexity,
> >		 * but on second pass, start at 0 in pool 0.
> >		 */
> >		if ((start & mask) >= limit || pass > 0) {
> >			spin_unlock(&(pool->lock));
> >			pool = &(tbl->pools[0]);
> >			spin_lock(&(pool->lock));
> >			start = pool->start;
>
> So here I would need to mark need_flush to true, right?

Well, no, but... You aren't changing any pool hint here, so you don't have
to mark any pool for flushing. The "n < hint" check will do the flush later
on if needed (the one I told you to add). You only need to force a flush if
you modify the hint.

*However*, you do an unlock, which means that if you have modified any
*other* pool's hint (ie, need_flush is true), you need to actually perform
a flush.

I think that business with flushing on unlock is becoming too ugly, so I
would suggest you simplify things by moving need_flush to be part of the
pool structure (maybe a flag). IE, when you modify a pool's hint, you mark
it for flushing. When you allocate from a pool, you flush it at the end,
just before you return success, if either it was marked for flushing or
n < hint (and clear the mark, of course).

That should cleanly factor all cases into a simpler & more readable code
path. The flush is done only when needed (an actual allocation happens on
a pool); you just "mark them dirty" when you change their hints, and you
don't have to worry about the dropped-lock case and pool hop'ing anymore.

Basically the logic looks like this:

	.../...
	n = iommu_area_alloc(...)
	if (n == -1) {
		if (pass == 0) {
			change pool hint
			pool->need_flush = true;
			pass++;
			goto again;
		} else if (!largealloc && ) {
			unlock/pool hop/lock
			// no need to flush anything here
			pool->hint = pool->start;
			pool->need_flush = true;
			etc...
		} else {
			n = DMA_ERROR_CODE;
			goto fail;
		}
	}
	end = n + pages;
	if (n < pool->hint || pool->need_flush) {
		pool->need_flush = false;
		iommu->lazy_flush(...);
	}
	pool->hint = end;
	.../...

Now this can be further simplified imho. For example, this logic:

+	if (pass == 0 && handle && *handle &&
+	    (*handle >= pool->start) && (*handle < pool->end))
+		start = *handle;
+	else
+		start = pool->hint;

can move above the "again:" label. That way, you remove the pass test and
you no longer have to adjust pool->hint in the "if (likely(pass == 0))"
failure case. Just adjust "start" and try again, which means you also no
longer need to set need_flush there. Additionally, when pool hop'ing, now
you only need to adjust "start" as well. By not resetting the hint, you
also remove the need for marking it "need flush".

I would add to that: why not set "start" to the new pool's hint instead of
the new pool's start? That would save some flushes, no? Anton, do you
remember why you reset the hint when changing pool? I don't see that
gaining us anything, and it does make lazy flushing more complex.

At that point, you have removed all setters of need_flush, and that logic
can be entirely removed, or am I just too tired (it's past midnight) and
missing something entirely here? We only need the "if (n < pool->hint)"
check to do the lazy flush at the end, and that's it.

Cheers,
Ben.
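A rough sketch of the fast path Ben is describing, assuming struct iommu_pool
grows a "bool need_flush" member. Pool selection, the dma mask, alignment and
the pool-hop path are elided, and the caller is assumed to hold pool->lock;
this is illustrative only, not the final patch:

/* Sketch: per-pool dirty marking plus one flush point at the end. */
static unsigned long pool_range_alloc(struct iommu_map_table *iommu,
				      struct iommu_pool *pool,
				      unsigned long start,
				      unsigned long npages)
{
	unsigned long shift = iommu->table_map_base >> iommu->table_shift;
	unsigned long boundary_size =
		ALIGN(1UL << 32, 1 << iommu->table_shift) >> iommu->table_shift;
	unsigned long n, end;
	int pass = 0;

again:
	n = iommu_area_alloc(iommu->map, pool->end, start, npages,
			     shift, boundary_size, 0);
	if (n == -1) {
		if (pass == 0) {
			/* rescan this pool from its start and mark it
			 * dirty, since the hint will move backward */
			start = pool->start;
			pool->need_flush = true;
			pass++;
			goto again;
		}
		/* pool hop'ing / give up elided in this sketch */
		return DMA_ERROR_CODE;
	}

	end = n + npages;
	/* flush only if we allocated behind the previous hint, or if this
	 * pool was marked dirty earlier; then clear the mark */
	if (n < pool->hint || pool->need_flush) {
		pool->need_flush = false;
		if (iommu->lazy_flush)
			iommu->lazy_flush(iommu);
	}
	pool->hint = end;
	return n;
}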
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
One last question before I spin out v9.. the dma_mask code
is a bit confusing to me, so I want to make sure... the code is

>	if (limit + tbl->it_offset > mask) {
>		limit = mask - tbl->it_offset + 1;
>		/* If we're constrained on address range, first try
>		 * at the masked hint to avoid O(n) search complexity,
>		 * but on second pass, start at 0 in pool 0.
>		 */
>		if ((start & mask) >= limit || pass > 0) {
>			spin_unlock(&(pool->lock));
>			pool = &(tbl->pools[0]);
>			spin_lock(&(pool->lock));
>			start = pool->start;

So here I would need to mark need_flush to true, right?

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On (04/04/15 08:06), Benjamin Herrenschmidt wrote:
>
> No, I meant "n < pool->hint", ie, the start of the newly allocated
> block.

ah, got it. I'll do my drill with the patchset and get back, probably
by Monday.

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Fri, 2015-04-03 at 14:28 -0400, Sowmini Varadhan wrote:
> Just want to confirm:
>
> > > +  again:
> > > +	if (pass == 0 && handle && *handle &&
> > > +	    (*handle >= pool->start) && (*handle < pool->end))
> > > +		start = *handle;
> > > +	else
> > > +		start = pool->hint;
> >
> > Now this means "handle" might be < pool->hint, in that case you also
> > need a lazy flush. Or rather only if the resulting alloc is. My
> :
> > > +	} else {
> > > +		/* give up */
> > > +		n = DMA_ERROR_CODE;
> > > +		goto bail;
> > > +	}
> > > +	}
> >
> > Here, make this something like:
> >
> > 	} else if (end < pool->hint)
> > 		need_flush = true;
>
> you mean
>
> 	} else if (start < pool->hint)
>
> right? (so I'm not missing some corner-case that you are thinking
> about here)

No, I meant "n < pool->hint", ie, the start of the newly allocated block.
"end" hasn't been adjusted yet at that point, but we don't want to compare
"end" anyway (which will be n + npages); we really want n, ie whether the
*beginning* of the newly allocated chunk is before the end of the previous
one (after all, they may even overlap if the previous one has been freed).

Ben.
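In code terms, the check Ben is describing compares the start of the new
allocation against the pool's previous hint. A small sketch, with names
following the v8 patch and the helper itself being illustrative:

/* Sketch: post-allocation bookkeeping.  "n" is the index returned by
 * iommu_area_alloc(); the pool lock is held.  It is the *start* of the
 * new chunk that matters, not its end (n + npages).
 */
static void update_hint_and_mark_flush(struct iommu_pool *pool,
				       unsigned long n, unsigned long npages,
				       bool *need_flush)
{
	if (n < pool->hint)
		*need_flush = true;	/* new chunk begins behind old hint */
	pool->hint = n + npages;	/* hint now points past this chunk */
}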
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
Just want to confirm:

> > +  again:
> > +	if (pass == 0 && handle && *handle &&
> > +	    (*handle >= pool->start) && (*handle < pool->end))
> > +		start = *handle;
> > +	else
> > +		start = pool->hint;
>
> Now this means "handle" might be < pool->hint, in that case you also
> need a lazy flush. Or rather only if the resulting alloc is. My
:
> > +	} else {
> > +		/* give up */
> > +		n = DMA_ERROR_CODE;
> > +		goto bail;
> > +	}
> > +	}
>
> Here, make this something like:
>
> 	} else if (end < pool->hint)
> 		need_flush = true;

you mean

	} else if (start < pool->hint)

right? (so I'm not missing some corner-case that you are thinking
about here)

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
From: Sowmini Varadhan
Date: Thu, 2 Apr 2015 17:54:53 -0400

> the other question that comes to my mind is: the whole lazy_flush
> optimization probably works best when there is exactly one pool,
> and no large pools. In most other cases, we'd end up doing a lazy_flush
> when we wrap within our pool itself, losing the benefit of that
> optimization.
>
> Given that the lazy_flush is mostly there to avoid regressions for
> the older sun4u architectures (which have other hardware bottlenecks
> anyway), and this code is rapidly getting messy, does it make sense
> to constrain the lazy_flush check to only apply for the 1-pool,
> no-large-pool case?

I think it's better to have the multiple pools with more frequent global
flushing.
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Thu, 2015-04-02 at 18:15 -0400, Sowmini Varadhan wrote:
> On (04/03/15 08:57), Benjamin Herrenschmidt wrote:
> >
> > > > I only just noticed too, you completely dropped the code to honor
> > > > the dma mask. Why that ? Some devices rely on this.
>
> /* Sowmini's comment about this coming from sparc origins.. */
>
> > Probably, not that many devices have limits on DMA mask but they do
> > exist. It becomes more important if we decide to create a very large
> > IOMMU window that spans beyond 4G in order to support devices with
> > 32-bit DMA masks. Otherwise it's older devices mostly with <32-bit
> > masks.
> >
> > In any case, for a generic piece of code, this should be supported.
> > Basically, assume that if we have something in the powerpc code, we need
> > it, if you remove it, we won't be able to use your code generically.
>
> I see.
>
> is the mask something that can be stored in the iommu_map_table as
> part of the init?

No, the mask is per device and has to be retrieved from the device.
Additionally, the mask used for dma_map_* can be different from the
consistent mask used for alloc_coherent (we have a bug there on powerpc
which I'm trying to fix, btw). So it should be passed as an argument by
the caller.

> I can see that the align_order has to be an additional arg to
> iommu_tbl_range_alloc, not sure if mask falls in that category
> as well.

It does.

Cheers,
Ben.
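A sketch of how a caller might pass the per-device mask and alignment down.
The two extra parameters to iommu_tbl_range_alloc() are a guess at what a
later revision would look like; the v8 prototype only takes four arguments:

/* Hypothetical caller in an arch dma_ops map routine (sketch only). */
static dma_addr_t map_page_example(struct device *dev,
				   struct iommu_map_table *tbl,
				   unsigned long npages,
				   unsigned int align_order)
{
	/* The streaming mask comes from the device itself; a coherent
	 * allocation would pass dev->coherent_dma_mask instead. */
	u64 mask = dev ? dma_get_mask(dev) : DMA_BIT_MASK(32);
	unsigned long entry;

	entry = iommu_tbl_range_alloc(dev, tbl, npages, NULL /* handle */,
				      mask, align_order);
	if (entry == DMA_ERROR_CODE)
		return DMA_ERROR_CODE;

	return tbl->table_map_base + (entry << tbl->table_shift);
}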
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On (04/03/15 08:57), Benjamin Herrenschmidt wrote:
>
> > > I only just noticed too, you completely dropped the code to honor
> > > the dma mask. Why that ? Some devices rely on this.

/* Sowmini's comment about this coming from sparc origins.. */

> Probably, not that many devices have limits on DMA mask but they do
> exist. It becomes more important if we decide to create a very large
> IOMMU window that spans beyond 4G in order to support devices with
> 32-bit DMA masks. Otherwise it's older devices mostly with <32-bit
> masks.
>
> In any case, for a generic piece of code, this should be supported.
> Basically, assume that if we have something in the powerpc code, we need
> it, if you remove it, we won't be able to use your code generically.

I see.

is the mask something that can be stored in the iommu_map_table as
part of the init?

I can see that the align_order has to be an additional arg to
iommu_tbl_range_alloc, not sure if mask falls in that category
as well.

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Thu, 2015-04-02 at 17:54 -0400, Sowmini Varadhan wrote:
> the other question that comes to my mind is: the whole lazy_flush
> optimization probably works best when there is exactly one pool,
> and no large pools. In most other cases, we'd end up doing a lazy_flush
> when we wrap within our pool itself, losing the benefit of that
> optimization.

Not a big deal. The point of the optimization is to avoid flushing on
every map/unmap. As long as the pool wrapping is an order of magnitude
rarer than actual map/unmap operations, it's a win.

> Given that the lazy_flush is mostly there to avoid regressions for
> the older sun4u architectures (which have other hardware bottlenecks
> anyway), and this code is rapidly getting messy, does it make sense
> to constrain the lazy_flush check to only apply for the 1-pool,
> no-large-pool case?

I was planning to use it to improve things on older G5s as well, and I
definitely want pools on these. Leave it in, we just need to get it right.

I think it's not that messy: if you refactor the way I suggested, it
basically boils down to conditionally flushing in two places, at the point
where the pool lock is dropped, if the hint for that pool was moved
backward. Do the test once in the else case of the n == -1 for all the
"normal" cases (which includes the hint > limit, multi-pass wrap and
*handle below hint, they are all covered) and once in the pool hop'ing
case.

Cheers,
Ben.
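The "flush at the point where the pool lock is dropped" case could be
factored into a small helper, roughly like this (a sketch; the helper name
is illustrative and need_flush is the local flag from the v8 code):

/* Sketch: flush a pool's pending lazy flush before releasing its lock,
 * so a hint that was moved backward is never left unflushed when we hop
 * to another pool.  Assumes the caller holds pool->lock.
 */
static void iommu_pool_drop_lock(struct iommu_map_table *iommu,
				 struct iommu_pool *pool, bool *need_flush)
{
	if (*need_flush) {
		*need_flush = false;
		if (iommu->lazy_flush)
			iommu->lazy_flush(iommu);
	}
	spin_unlock(&pool->lock);
}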
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Fri, 2015-04-03 at 09:01 +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2015-04-03 at 08:57 +1100, Benjamin Herrenschmidt wrote:
>
> > No, the last argument to iommu_area_alloc() which is passed from the
> > callers when doing consistent allocs. Basically, the DMA api mandates
> > that consistent allocs are naturally aligned (to their own size), we
> > implement that on powerpc by passing that alignment argument down.
>
> Talking of this ... I notice this is not documented in DMA-API.txt...
> however many drivers make that assumption, and the DMA pool allocator
> in mm/dmapool.c as well.
>
> Maybe somebody should update DMA-API.txt...

Ah, it *is* documented in DMA-API-HOWTO.txt :-)

Cheers,
Ben.
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Fri, 2015-04-03 at 08:57 +1100, Benjamin Herrenschmidt wrote:
> No, the last argument to iommu_area_alloc() which is passed from the
> callers when doing consistent allocs. Basically, the DMA api mandates
> that consistent allocs are naturally aligned (to their own size), we
> implement that on powerpc by passing that alignment argument down.

Talking of this ... I notice this is not documented in DMA-API.txt...
however many drivers make that assumption, and the DMA pool allocator
in mm/dmapool.c as well.

Maybe somebody should update DMA-API.txt...

Cheers,
Ben.
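Natural alignment means the alignment order is derived from the allocation
size itself. A minimal sketch of how a caller could compute the align_mask
handed to iommu_area_alloc(), assuming one IOMMU entry maps one CPU page
(the helper name is illustrative, not from the powerpc code):

/* Sketch: align_mask for a consistent allocation of "size" bytes, so the
 * resulting range is aligned to its own size rounded up to a power of two.
 * The mask is in units of IOMMU entries, as iommu_area_alloc() expects.
 */
static unsigned long coherent_align_mask(size_t size)
{
	unsigned int align_order = get_order(size);	/* in pages */

	return (1UL << align_order) - 1;
}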
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Thu, 2015-04-02 at 17:43 -0400, Sowmini Varadhan wrote:
> On (04/03/15 07:54), Benjamin Herrenschmidt wrote:
> > > +	limit = pool->end;
> > > +
> > > +	/* The case below can happen if we have a small segment appended
> > > +	 * to a large, or when the previous alloc was at the very end of
> > > +	 * the available space. If so, go back to the beginning and flush.
> > > +	 */
> > > +	if (start >= limit) {
> > > +		start = pool->start;
> > > +		if (!large_pool && iommu->lazy_flush != NULL)
> > > +			iommu->lazy_flush(iommu);
> >
> > Add need_flush = false;
>
> A few clarifications, while I parse the rest of your comments:
>
> Not sure I follow- need_flush is initialized to true at the start of the
> function?

No, but you can loop back there via "goto again". However, if you follow my
other comment and move the flush back to the end, then you don't need that
at all.

> > I only just noticed too, you completely dropped the code to honor
> > the dma mask. Why that ? Some devices rely on this.
>
> so that's an interesting question: the existing iommu_range_alloc() in
> arch/sparc/kernel/iommu.c does not use the mask at all. I based most of
> the code on this (except for the lock fragmentation part).
> I dont know if this is arch specific.

Probably, not that many devices have limits on DMA mask but they do
exist. It becomes more important if we decide to create a very large
IOMMU window that spans beyond 4G in order to support devices with
32-bit DMA masks. Otherwise it's older devices mostly with <32-bit
masks.

In any case, for a generic piece of code, this should be supported.
Basically, assume that if we have something in the powerpc code, we need
it; if you remove it, we won't be able to use your code generically.

There are a few cases we can debate, like our block allocation, but things
like the mask or the alignment constraints aren't in that group.

> > > +	if (dev)
> > > +		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > > +				      1 << iommu->table_shift);
> > > +	else
> > > +		boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
> > > +
> > > +	shift = iommu->table_map_base >> iommu->table_shift;
> > > +	boundary_size = boundary_size >> iommu->table_shift;
> > > +	/*
> > > +	 * if the skip_span_boundary_check had been set during init, we set
> > > +	 * things up so that iommu_is_span_boundary() merely checks if the
> > > +	 * (index + npages) < num_tsb_entries
> > > +	 */
> > > +	if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
> > > +		shift = 0;
> > > +		boundary_size = iommu->poolsize * iommu->nr_pools;
> > > +	}
> > > +	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
> > > +			     boundary_size, 0);
> >
> > You have completely dropped the alignment support. This will break
> > drivers. There are cases (especially with consistent allocations) where
>
> Again, not sure I follow? are you referring to the IOMMU_NO_SPAN_BOUND case?

No, the last argument to iommu_area_alloc() which is passed from the
callers when doing consistent allocs. Basically, the DMA api mandates
that consistent allocs are naturally aligned (to their own size), we
implement that on powerpc by passing that alignment argument down.

> That's very specific to LDC (sparc ldoms virtualization infra). The default
> is to not have IOMMU_NO_SPAN_BOUND set.
> For the rest of the drivers, the code that sets up boundary_size aligns
> things in the same way as the ppc code.
>
> > the driver have alignment constraints on the address, those must be
> > preserved.
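For reference, one way a generic allocator could honor the per-device mask
Ben is asking for is to clamp the search limit before calling
iommu_area_alloc(). A sketch only: the helper name is illustrative, this is
not in the v8 patch, and it assumes dma_mask >= table_map_base:

/* Sketch: highest usable table index such that the mapped DMA address
 * stays within the device's mask.  "limit" is an exclusive entry index.
 */
static unsigned long clamp_limit_to_mask(struct iommu_map_table *iommu,
					 unsigned long limit, u64 dma_mask)
{
	unsigned long max_entry =
		(dma_mask - iommu->table_map_base) >> iommu->table_shift;

	return min_t(unsigned long, limit, max_entry + 1);
}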
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Tue, 2015-03-31 at 10:40 -0400, Sowmini Varadhan wrote:
> +	if (largealloc) {
> +		pool = &(iommu->large_pool);
> +		spin_lock_irqsave(&pool->lock, flags);
> +		pool_nr = 0; /* to keep compiler happy */
> +	} else {
> +		/* pick out pool_nr */
> +		pool_nr = pool_hash & (npools - 1);
> +		pool = &(iommu->pools[pool_nr]);
> +		spin_lock_irqsave(&(pool->lock), flags);
> +	}

Move spin_lock_irqsave outside of the if/else.

> +  again:
> +	if (pass == 0 && handle && *handle &&
> +	    (*handle >= pool->start) && (*handle < pool->end))
> +		start = *handle;
> +	else
> +		start = pool->hint;

Now this means "handle" might be < pool->hint, in that case you also
need a lazy flush. Or rather, only if the resulting alloc is. My
suggestion originally was to test that after the area alloc. See further
down.

> +	limit = pool->end;
> +
> +	/* The case below can happen if we have a small segment appended
> +	 * to a large, or when the previous alloc was at the very end of
> +	 * the available space. If so, go back to the beginning and flush.
> +	 */
> +	if (start >= limit) {
> +		start = pool->start;
> +		if (!large_pool && iommu->lazy_flush != NULL)
> +			iommu->lazy_flush(iommu);

Add need_flush = false;

In fact, looking more closely, I see another problem: you never flush the
large pool. You should either remove all your large_pool checks, since you
treat it as a normal pool in your code, and let it flush normally, or you
should have an unconditional flush somewhere in there for the large pool
(or in free()), either way. The current implementation you show me, afaik,
never calls lazy_flush on a large allocation.

Also, if you fix the problem I mentioned earlier about *handle < hint, you
also don't need the above flush, it will be handled further down in the
else case for the if (n == -1), see below.

> +	}

I only just noticed too, you completely dropped the code to honor
the dma mask. Why that ? Some devices rely on this.

> +	if (dev)
> +		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> +				      1 << iommu->table_shift);
> +	else
> +		boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
> +
> +	shift = iommu->table_map_base >> iommu->table_shift;
> +	boundary_size = boundary_size >> iommu->table_shift;
> +	/*
> +	 * if the skip_span_boundary_check had been set during init, we set
> +	 * things up so that iommu_is_span_boundary() merely checks if the
> +	 * (index + npages) < num_tsb_entries
> +	 */
> +	if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
> +		shift = 0;
> +		boundary_size = iommu->poolsize * iommu->nr_pools;
> +	}
> +	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
> +			     boundary_size, 0);

You have completely dropped the alignment support. This will break
drivers. There are cases (especially with consistent allocations) where
the driver have alignment constraints on the address, those must be
preserved.

> +	if (n == -1) {
> +		if (likely(pass == 0)) {
> +			/* First failure, rescan from the beginning. */
> +			pool->hint = pool->start;
> +			if (!large_pool && iommu->lazy_flush != NULL)
> +				need_flush = true;

Same question about the large pool check... I wouldn't bother with the
function pointer NULL check here either, only test it at the call site.

> +			pass++;
> +			goto again;
> +		} else if (!largealloc && pass <= iommu->nr_pools) {

Here, you might have moved the hint of the pool (case above), and then hit
this code path, no ? In that case, need_flush might be set, which means you
must flush before you unlock.

> +			spin_unlock(&(pool->lock));
> +			pool_nr = (pool_nr + 1) & (iommu->nr_pools - 1);
> +			pool = &(iommu->pools[pool_nr]);
> +			spin_lock(&(pool->lock));
> +			pool->hint = pool->start;
> +			if (!large_pool && iommu->lazy_flush != NULL)
> +				need_flush = true;

I wouldn't bother with the test. This can't be large_alloc afaik, and the
NULL check can be done at the call site.

> +			pass++;
> +			goto again;
> +		} else {
> +			/* give up */
> +			n = DMA_ERROR_CODE;
> +			goto bail;
> +		}
> +	}

Here, make this something like:

	} else if (end < pool->hint)
		need_flush = true;

To handle the case where you started your alloc below the hint without
having updated the hint yet, ie, a *handle < hint or a
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
the other question that comes to my mind is: the whole lazy_flush
optimization probably works best when there is exactly one pool,
and no large pools. In most other cases, we'd end up doing a lazy_flush
when we wrap within our pool itself, losing the benefit of that
optimization.

Given that the lazy_flush is mostly there to avoid regressions for
the older sun4u architectures (which have other hardware bottlenecks
anyway), and this code is rapidly getting messy, does it make sense
to constrain the lazy_flush check to only apply for the 1-pool,
no-large-pool case?
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On (04/03/15 07:54), Benjamin Herrenschmidt wrote:
> > +	limit = pool->end;
> > +
> > +	/* The case below can happen if we have a small segment appended
> > +	 * to a large, or when the previous alloc was at the very end of
> > +	 * the available space. If so, go back to the beginning and flush.
> > +	 */
> > +	if (start >= limit) {
> > +		start = pool->start;
> > +		if (!large_pool && iommu->lazy_flush != NULL)
> > +			iommu->lazy_flush(iommu);
>
> Add need_flush = false;

A few clarifications, while I parse the rest of your comments:

Not sure I follow- need_flush is initialized to true at the start of the
function?

> I only just noticed too, you completely dropped the code to honor
> the dma mask. Why that ? Some devices rely on this.

so that's an interesting question: the existing iommu_range_alloc() in
arch/sparc/kernel/iommu.c does not use the mask at all. I based most of
the code on this (except for the lock fragmentation part).
I dont know if this is arch specific.

> > +	if (dev)
> > +		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > +				      1 << iommu->table_shift);
> > +	else
> > +		boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
> > +
> > +	shift = iommu->table_map_base >> iommu->table_shift;
> > +	boundary_size = boundary_size >> iommu->table_shift;
> > +	/*
> > +	 * if the skip_span_boundary_check had been set during init, we set
> > +	 * things up so that iommu_is_span_boundary() merely checks if the
> > +	 * (index + npages) < num_tsb_entries
> > +	 */
> > +	if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
> > +		shift = 0;
> > +		boundary_size = iommu->poolsize * iommu->nr_pools;
> > +	}
> > +	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
> > +			     boundary_size, 0);
>
> You have completely dropped the alignment support. This will break
> drivers. There are cases (especially with consistent allocations) where

Again, not sure I follow? are you referring to the IOMMU_NO_SPAN_BOUND case?
That's very specific to LDC (sparc ldoms virtualization infra). The default
is to not have IOMMU_NO_SPAN_BOUND set.
For the rest of the drivers, the code that sets up boundary_size aligns
things in the same way as the ppc code.

> the driver have alignment constraints on the address, those must be
> preserved.

--Sowmini
RE: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
From: Sowmini Varadhan
> Investigation of multithreaded iperf experiments on an ethernet
> interface show the iommu->lock as the hottest lock identified by
> lockstat, with something of the order of 21M contentions out of
> 27M acquisitions, and an average wait time of 26 us for the lock.
> This is not efficient. A more scalable design is to follow the ppc
> model, where the iommu_map_table has multiple pools, each stretching
> over a segment of the map, and with a separate lock for each pool.
> This model allows for better parallelization of the iommu map search.

I've wondered whether the iommu setup for ethernet receive (in particular)
could be made much more efficient if there were a function that would
unmap one buffer and map a second buffer?
My thought is that the iommu pte entry used by the old buffer could just
be modified to reference the new one.
In effect each ring entry would end up using a fixed iommu pte.

The other question is how much data can be copied in 26us ?
On iommu systems 'copybreak' limits on receive and transmit
may need to be quite high.

	David
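A very rough sketch of the kind of helper David is speculating about. This
is entirely hypothetical: no such API exists in the patch, and the
write_iopte()/flush_iotlb_entry() placeholders stand in for whatever an
arch would actually provide:

/* Hypothetical: each rx ring slot permanently owns one IOMMU entry, and
 * refilling the slot just rewrites that entry to point at the new buffer,
 * instead of doing a full unmap + map.
 */
struct rx_slot {
	unsigned long	iommu_entry;	/* fixed at ring setup time */
	dma_addr_t	dma_addr;	/* fixed: base + (entry << shift) */
	void		*buf;		/* currently mapped buffer */
};

static void rx_slot_refill(struct iommu_map_table *iommu,
			   struct rx_slot *slot, void *new_buf)
{
	/* Arch-specific steps, indicated only as placeholders:
	 *   write_iopte(iommu, slot->iommu_entry, virt_to_phys(new_buf));
	 *   flush_iotlb_entry(iommu, slot->iommu_entry);
	 */
	slot->buf = new_buf;
	/* slot->dma_addr is unchanged, so the device descriptor can be
	 * reused as-is. */
}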
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On (03/31/15 15:15), David Laight wrote:
>
> I've wondered whether the iommu setup for ethernet receive (in particular)
> could be made much more efficient if there were a function that
> would unmap one buffer and map a second buffer?
> My thought is that iommu pte entry used by the old buffer could just
> be modified to reference the new one.
> In effect each ring entry would end up using a fixed iommu pte.

There are a number of interesting things to investigate in this space,
and the above is just one of them: ways to avoid the overhead of a
full-blown map/unmap on each call.
See http://www.spinics.net/lists/sparclinux/msg13613.html

But the scope of this patchset is actually very rigidly defined: to
refactor the iommu pool/arena allocator into a common library, and avoid
code duplication (today even the single sparc arch duplicates it for
sun4[u,v] and ldc, and that's not even counting the duplication across
other archs/pci-drivers).

Investigating ways to provide a generalized infra that can avoid a dma
map/unmap for every packet would be a good follow-on.

> The other question is how much data can be copied in 26us ?
> On iommu systems 'copybreak' limits on receive and transmit
> may need to be quite high.

where does the "26us" number come from? I may be missing that context?

--Sowmini
[PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
Investigation of multithreaded iperf experiments on an ethernet
interface show the iommu->lock as the hottest lock identified by
lockstat, with something of the order of 21M contentions out of
27M acquisitions, and an average wait time of 26 us for the lock.
This is not efficient. A more scalable design is to follow the ppc
model, where the iommu_map_table has multiple pools, each stretching
over a segment of the map, and with a separate lock for each pool.
This model allows for better parallelization of the iommu map search.

This patch adds the iommu range alloc/free function infrastructure.

Signed-off-by: Sowmini Varadhan
---
v2 changes:
  - incorporate David Miller editorial comments: sparc specific fields
    moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when the
    iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary check,
    for use case such as LDC.
v3: eliminate iommu_sparc, rearrange the ->demap indirection to be
    invoked under the pool lock.
v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32.
v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru
  - removed ->cookie_to_index and ->demap indirection: caller should
    invoke these as needed before calling into the generic allocator
v6: Benh/DaveM discussion eliminating iommu_tbl_ops, but retaining
    flush_all optimization.
v7: one-time initialization of pool_hash from iommu_tbl_pool_init()
v8: Benh code review comments.

 include/linux/iommu-common.h |  48 +
 lib/Makefile                 |   2 +-
 lib/iommu-common.c           | 233 ++
 3 files changed, 282 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
new file mode 100644
index 000..7b1b761
--- /dev/null
+++ b/include/linux/iommu-common.h
@@ -0,0 +1,48 @@
+#ifndef _LINUX_IOMMU_COMMON_H
+#define _LINUX_IOMMU_COMMON_H
+
+#include
+#include
+#include
+
+#define IOMMU_POOL_HASHBITS	4
+#define IOMMU_NR_POOLS		(1 << IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+	unsigned long	start;
+	unsigned long	end;
+	unsigned long	hint;
+	spinlock_t	lock;
+};
+
+struct iommu_map_table {
+	unsigned long		table_map_base;
+	unsigned long		table_shift;
+	unsigned long		nr_pools;
+	void			(*lazy_flush)(struct iommu_map_table *);
+	unsigned long		poolsize;
+	struct iommu_pool	pools[IOMMU_NR_POOLS];
+	u32			flags;
+#define	IOMMU_HAS_LARGE_POOL	0x0001
+#define	IOMMU_NO_SPAN_BOUND	0x0002
+	struct iommu_pool	large_pool;
+	unsigned long		*map;
+};
+
+extern void iommu_tbl_pool_init(struct iommu_map_table *iommu,
+				unsigned long num_entries,
+				u32 table_shift,
+				void (*lazy_flush)(struct iommu_map_table *),
+				bool large_pool, u32 npools,
+				bool skip_span_boundary_check);
+
+extern unsigned long iommu_tbl_range_alloc(struct device *dev,
+					   struct iommu_map_table *iommu,
+					   unsigned long npages,
+					   unsigned long *handle);
+
+extern void iommu_tbl_range_free(struct iommu_map_table *iommu,
+				 u64 dma_addr, unsigned long npages,
+				 unsigned long entry);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 58f74d2..60c22e6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,7 +106,7 @@
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
-obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
 obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
diff --git a/lib/iommu-common.c b/lib/iommu-common.c
new file mode 100644
index 000..35dacbe
--- /dev/null
+++ b/lib/iommu-common.c
@@ -0,0 +1,233 @@
+/*
+ * IOMMU mmap management and range allocation functions.
+ * Based almost entirely upon the powerpc iommu allocator.
+ */
+
+#include
+#include
+#include
+#include
+#include
+#include
+#include
+
+unsigned long iommu_large_alloc = 15;
+
+static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
+
+static void setup_iommu_pool_hash(void)
+{
+	unsigned int i;
+	static bool do_once;
+
+	if (do_o
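For orientation, a minimal sketch of how an arch might consume this API,
based only on the prototypes declared in iommu-common.h above. The flush
callback, the map bitmap handling, the use of PAGE_SHIFT as the table
shift, npools == 0 meaning "use the default", and the DMA_ERROR_CODE
return convention are all assumptions here, not things the truncated patch
body confirms:

#include <linux/dma-mapping.h>
#include <linux/iommu-common.h>

/* Illustrative flush callback; a real one would poke IOMMU hardware. */
static void example_lazy_flush(struct iommu_map_table *tbl)
{
	/* arch-specific IOTLB flush would go here */
}

static struct iommu_map_table example_tbl;

static void example_init(unsigned long num_entries, unsigned long base)
{
	example_tbl.table_map_base = base;
	/* the caller is assumed to allocate the bitmap backing ->map */
	example_tbl.map = NULL;

	iommu_tbl_pool_init(&example_tbl, num_entries, PAGE_SHIFT,
			    example_lazy_flush, false /* no large pool */,
			    0 /* assumed: default number of pools */,
			    false /* keep span boundary checks */);
}

static dma_addr_t example_map(struct device *dev, unsigned long npages)
{
	unsigned long entry = iommu_tbl_range_alloc(dev, &example_tbl,
						    npages, NULL);

	if (entry == DMA_ERROR_CODE)
		return DMA_ERROR_CODE;
	/* the arch would now program the IOMMU page table entries for
	 * [entry, entry + npages) before handing out the address */
	return example_tbl.table_map_base +
	       (entry << example_tbl.table_shift);
}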