Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
One last question before I spin out v9.. the dma_mask code is a bit
confusing to me, so I want to make sure... the code is

	if (limit + tbl->it_offset > mask) {
		limit = mask - tbl->it_offset + 1;
		/* If we're constrained on address range, first try
		 * at the masked hint to avoid O(n) search complexity,
		 * but on second pass, start at 0 in pool 0.
		 */
		if ((start & mask) >= limit || pass > 0) {
			spin_unlock(&(pool->lock));
			pool = &(tbl->pools[0]);
			spin_lock(&(pool->lock));
			start = pool->start;

So here I would need to mark need_flush to true, right?

--Sowmini

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Sat, 2015-04-04 at 07:27 -0400, Sowmini Varadhan wrote:
> One last question before I spin out v9.. the dma_mask code is a bit
> confusing to me, so I want to make sure... the code is
>
>	if (limit + tbl->it_offset > mask) {
>		limit = mask - tbl->it_offset + 1;
>		/* If we're constrained on address range, first try
>		 * at the masked hint to avoid O(n) search complexity,
>		 * but on second pass, start at 0 in pool 0.
>		 */
>		if ((start & mask) >= limit || pass > 0) {
>			spin_unlock(&(pool->lock));
>			pool = &(tbl->pools[0]);
>			spin_lock(&(pool->lock));
>			start = pool->start;
>
> So here I would need to mark need_flush to true, right?

Well, no but ... So you aren't changing any pool hint, so you don't
have to mark any pool for flushing. The n < hint check will do the
flush later on if needed (the one I told you to add). You only need to
force a flush if you modify the hint.

*However*, you do an unlock, which means that if you have modified any
*other* pool's hint (ie, need_flush is true), you need to actually
perform a flush.

I think that business with flushing on unlock is becoming too ugly, I
would suggest you simplify things by moving need_flush to be part of
the pool structure (maybe a flag).

IE, when you modify a pool's hint, you mark it for flushing. When you
allocate from a pool, you flush it at the end, just before you return
success, if either it was marked for flushing or n < hint (and clear
the mark of course).

That should cleanly factor all cases in a simpler, more readable code
path. The flush is done only when needed (an actual allocation happens
on a pool), you just mark them dirty when you change their hints, and
you don't have to worry about the lock-dropped case when pool hop'ing
anymore.

Basically the logic looks like that:

	.../...

	n = iommu_area_alloc(...)
	if (n == -1) {
		if (pass == 0) {
			/* change pool hint */
			pool->need_flush = true;
			pass++;
			goto again;
		} else if (!largealloc) {
			/* unlock/pool hop/lock */
			/* no need to flush anything here */
			pool->hint = pool->start;
			pool->need_flush = true;
			etc...
		} else {
			n = DMA_ERROR_CODE;
			goto fail;
		}
	}
	end = n + pages;
	if (n < pool->hint || pool->need_flush) {
		pool->need_flush = false;
		iommu->lazy_flush(...);
	}
	pool->hint = end;

	.../...

Now this can be further simplified imho. For example that logic:

+	if (pass == 0 && handle && *handle &&
+	    (*handle >= pool->start) && (*handle < pool->end))
+		start = *handle;
+	else
+		start = pool->hint;

Can move above the again: label. That way, you remove the pass test
and you no longer have to adjust pool->hint in the if (likely(pass ==
0)) failure case. Just adjust start and try again, which means you
also no longer need to set need_flush.

Additionally, when pool hop'ing, now you only need to adjust start as
well. By not resetting it, you also remove the need for marking it
need_flush. I would add to that, why not set start to the new pool's
hint instead of the new pool's start ? That will save some flushes,
no ?

Anton, do you remember why you reset the hint when changing pool ? I
don't see that gaining us anything and it does make lazy flushing more
complex.

At that point, you have removed all setters of need_flush, that logic
can be entirely removed, or am I just too tired (it's past midnight)
and missing something entirely here ? We only need the if (n <
pool->hint) check to do the lazy flush at the end, that's it.

Cheers,
Ben.
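[Editor's note: the control flow Ben sketches can be modeled as a
standalone userspace program. Everything here is illustrative: the
struct layout, `lazy_flush()`, and the linear-scan `area_alloc()` are
stand-ins for the kernel's iommu structures and iommu_area_alloc(),
not the actual patch code.]

```c
/* Userspace model of Ben's suggestion: need_flush lives in the pool,
 * is set whenever a pool's hint is moved backward, and the single
 * lazy flush happens just before a successful return. */
#include <assert.h>
#include <stdbool.h>
#include <string.h>

#define MAP_SIZE 64

struct pool {
	unsigned long start, end, hint;
	bool need_flush;
};

static unsigned char map[MAP_SIZE];	/* one byte per entry, for clarity */
static int flush_count;			/* stands in for iommu->lazy_flush() */

static void lazy_flush(void) { flush_count++; }

/* Linear-scan stand-in for iommu_area_alloc(): find npages free
 * entries in [start, limit), mark them used, return the index or -1. */
static long area_alloc(unsigned long start, unsigned long limit,
		       unsigned long npages)
{
	for (unsigned long n = start; n + npages <= limit; n++) {
		unsigned long i;
		for (i = 0; i < npages; i++)
			if (map[n + i])
				break;
		if (i == npages) {
			memset(&map[n], 1, npages);
			return (long)n;
		}
	}
	return -1;
}

/* Simplified allocation path: one flush point at the end. */
static long pool_alloc(struct pool *pool, unsigned long npages)
{
	int pass = 0;
	unsigned long start = pool->hint;
	long n;

again:
	n = area_alloc(start, pool->end, npages);
	if (n == -1) {
		if (pass == 0) {
			start = pool->start;	/* rescan from the beginning */
			pool->need_flush = true;/* hint will move backward */
			pass++;
			goto again;
		}
		return -1;	/* give up (no pool hop in this model) */
	}
	/* flush iff the new block starts below the previous hint,
	 * or the pool was marked dirty earlier */
	if ((unsigned long)n < pool->hint || pool->need_flush) {
		pool->need_flush = false;
		lazy_flush();
	}
	pool->hint = (unsigned long)n + npages;
	return n;
}
```

Running the model shows the property Ben is after: allocations that
proceed forward from the hint never flush, and exactly one flush fires
when the search wraps below the old hint.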
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
Just want to confirm:

> > +again:
> > +	if (pass == 0 && handle && *handle &&
> > +	    (*handle >= pool->start) && (*handle < pool->end))
> > +		start = *handle;
> > +	else
> > +		start = pool->hint;
>
> Now this means handle might be < pool->hint, in that case you also
> need a lazy flush. Or rather only if the resulting alloc is. My :

> > +	} else {
> > +		/* give up */
> > +		n = DMA_ERROR_CODE;
> > +		goto bail;
> > +	}
> > +	}
>
> Here, make this something like:
>
>	} else if (end < pool->hint)
>		need_flush = true;

you mean

	} else if (start < pool->hint)

right? (so I'm not missing some corner-case that you are thinking
about here)

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On (04/04/15 08:06), Benjamin Herrenschmidt wrote:
> No, I meant n < pool->hint, ie, the start of the newly allocated
> block.

ah, got it. I'll do my drill with the patchset and get back, probably
by Monday.

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Fri, 2015-04-03 at 14:28 -0400, Sowmini Varadhan wrote:
> Just want to confirm:
>
> > > +again:
> > > +	if (pass == 0 && handle && *handle &&
> > > +	    (*handle >= pool->start) && (*handle < pool->end))
> > > +		start = *handle;
> > > +	else
> > > +		start = pool->hint;
> >
> > Now this means handle might be < pool->hint, in that case you also
> > need a lazy flush. Or rather only if the resulting alloc is. My :
>
> > > +	} else {
> > > +		/* give up */
> > > +		n = DMA_ERROR_CODE;
> > > +		goto bail;
> > > +	}
> > > +	}
> >
> > Here, make this something like:
> >
> >	} else if (end < pool->hint)
> >		need_flush = true;
>
> you mean
>
>	} else if (start < pool->hint)
>
> right? (so I'm not missing some corner-case that you are thinking
> about here)

No, I meant n < pool->hint, ie, the start of the newly allocated
block. end hasn't been adjusted yet at that point but we don't want to
compare end anyway (which will be n + npages), we really want n, ie if
the *beginning* of the newly allocated chunk is before the end of the
previous one (after all they may even overlap if the previous one has
been freed).

Ben.
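[Editor's note: the distinction being nailed down here (compare n, the
start of the new block, not end or the search start) matters when a
freed block below the hint is re-used so that the new allocation
overlaps the previously mapped range. A toy model, with hypothetical
helper names that are not in the patch:]

```c
#include <assert.h>
#include <stdbool.h>

/* The form first suggested: misses the case where the new block
 * starts below the hint but extends past it. */
static bool flush_if_end_below_hint(unsigned long end, unsigned long hint)
{
	return end < hint;
}

/* Ben's correction: n is the start of the newly allocated block; if it
 * lies below the old hint, stale IOMMU TLB entries may cover it. */
static bool flush_if_start_below_hint(unsigned long n, unsigned long hint)
{
	return n < hint;
}
```

With the previous allocation ending at entry 10 and a freed block
re-used at entry 6 for 8 pages, the end-based test would skip a flush
that is actually required.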
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Thu, 2015-04-02 at 17:54 -0400, Sowmini Varadhan wrote:
> the other question that comes to my mind is: the whole lazy_flush
> optimization probably works best when there is exactly one pool, and
> no large pools. In most other cases, we'd end up doing a lazy_flush
> when we wrap within our pool itself, losing the benefit of that
> optimization.

Not a big deal. The point of the optimization is to avoid flushing on
every map/unmap. As long as the pool wrapping is an order of magnitude
rarer than actual map/unmap operations, it's a win.

> Given that the lazy_flush is mostly there to avoid regressions for
> the older sun4u architectures (which have other hardware bottlenecks
> anyway), and this code is rapidly getting messy, does it make sense
> to constrain the lazy_flush check to only apply for the 1-pool,
> no-large-pool case?

I was planning to use it to improve things on older G5s as well and I
definitely want pools on these. Leave it in, we just need to get it
right.

I think it's not that messy, if you refactor the way I suggested, it
basically boils down to conditionally flushing in two places, at the
point where the pool lock is dropped, if the hint for that pool was
moved backward.

Do the test once in the else case of the n == -1 for all the normal
cases (which includes the hint > limit, multi-pass wrap and *handle
below hint, they are all covered) and in the pool hop'ing case.

Cheers,
Ben.
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On (04/03/15 08:57), Benjamin Herrenschmidt wrote:
> > > I only just noticed too, you completely dropped the code to honor
> > > the dma mask. Why that ? Some devices rely on this.
> >
> > /* Sowmini's comment about this coming from sparc origins.. */
>
> Probably, not that many devices have limits on DMA mask but they do
> exist. It becomes more important if we decide to create a very large
> IOMMU window that spans beyond 4G in order to support devices with
> 32-bit DMA masks. Otherwise it's older devices mostly with 32-bit
> masks. In any case, for a generic piece of code, this should be
> supported. Basically, assume that if we have something in the powerpc
> code, we need it, if you remove it, we won't be able to use your code
> generically.

I see. is the mask something that can be stored in the iommu_map_table
as part of the init? I can see that the align_order has to be an
additional arg to iommu_tbl_range_alloc, not sure if mask falls in
that category as well.

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Tue, 2015-03-31 at 10:40 -0400, Sowmini Varadhan wrote:

> +	if (largealloc) {
> +		pool = &(iommu->large_pool);
> +		spin_lock_irqsave(&pool->lock, flags);
> +		pool_nr = 0; /* to keep compiler happy */
> +	} else {
> +		/* pick out pool_nr */
> +		pool_nr = pool_hash & (npools - 1);
> +		pool = &(iommu->pools[pool_nr]);
> +		spin_lock_irqsave(&(pool->lock), flags);
> +	}

Move spin_lock_irqsave outside of if/else

> +again:
> +	if (pass == 0 && handle && *handle &&
> +	    (*handle >= pool->start) && (*handle < pool->end))
> +		start = *handle;
> +	else
> +		start = pool->hint;

Now this means handle might be < pool->hint, in that case you also
need a lazy flush. Or rather only if the resulting alloc is. My
suggestion originally was to test that after the area alloc. See
further down.

> +	limit = pool->end;
> +
> +	/* The case below can happen if we have a small segment appended
> +	 * to a large, or when the previous alloc was at the very end of
> +	 * the available space. If so, go back to the beginning and flush.
> +	 */
> +	if (start >= limit) {
> +		start = pool->start;
> +		if (!large_pool && iommu->lazy_flush != NULL)
> +			iommu->lazy_flush(iommu);

Add need_flush = false;

In fact, looking more closely, I see another problem: you never flush
the large pool. You should either remove all your large_pool checks
since you treat it as a normal pool in your code, and let it flush
normally, or you should have an unconditional flush somewhere in there
for the large pool (or in free()), either way. The current
implementation you show me, afaik, never calls lazy_flush on a large
allocation.

Also, if you fix the problem I mentioned earlier about *handle < hint,
you also don't need the above flush, it will be handled further down
in the else case for the if (n == -1), see below

> +	}

I only just noticed too, you completely dropped the code to honor the
dma mask. Why that ? Some devices rely on this.

> +	if (dev)
> +		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> +				      1 << iommu->table_shift);
> +	else
> +		boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
> +
> +	shift = iommu->table_map_base >> iommu->table_shift;
> +	boundary_size = boundary_size >> iommu->table_shift;
> +	/*
> +	 * if the skip_span_boundary_check had been set during init, we set
> +	 * things up so that iommu_is_span_boundary() merely checks if the
> +	 * (index + npages) < num_tsb_entries
> +	 */
> +	if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
> +		shift = 0;
> +		boundary_size = iommu->poolsize * iommu->nr_pools;
> +	}
> +	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
> +			     boundary_size, 0);

You have completely dropped the alignment support. This will break
drivers. There are cases (especially with consistent allocations)
where the driver has alignment constraints on the address, those must
be preserved.

> +	if (n == -1) {
> +		if (likely(pass == 0)) {
> +			/* First failure, rescan from the beginning. */
> +			pool->hint = pool->start;
> +			if (!large_pool && iommu->lazy_flush != NULL)
> +				need_flush = true;

Same question about the large pool check... I wouldn't bother with the
function pointer NULL check here either, only test it at the call
site.

> +			pass++;
> +			goto again;
> +		} else if (!largealloc && pass <= iommu->nr_pools) {

Here, you might have moved the hint of the pool (case above), and then
hit this code path, no ? In that case, need_flush might be set, which
means you must flush before you unlock.

> +			spin_unlock(&(pool->lock));
> +			pool_nr = (pool_nr + 1) & (iommu->nr_pools - 1);
> +			pool = &(iommu->pools[pool_nr]);
> +			spin_lock(&(pool->lock));
> +			pool->hint = pool->start;
> +			if (!large_pool && iommu->lazy_flush != NULL)
> +				need_flush = true;

I wouldn't bother with the test. This can't be large_alloc afaik and
the NULL check can be done at the call site.

> +			pass++;
> +			goto again;
> +		} else {
> +			/* give up */
> +			n = DMA_ERROR_CODE;
> +			goto bail;
> +		}
> +	}

Here, make this something like:

	} else if (end < pool->hint)
		need_flush = true;

To handle the case where you started your alloc below the hint without
having updated the hint yet, ie, a *handle < hint or a wrap around,
this will also handle the start >= limit case. Basically you need to
flush iff you are going to provide an alloc below the current hint,
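[Editor's note: the pool-picking arithmetic in the quoted diff relies
on the number of pools being a power of two: `pool_hash & (npools - 1)`
replaces a modulo, and `(pool_nr + 1) & (nr_pools - 1)` wraps around
when hopping to the next pool after a failed allocation. A minimal
userspace model of just that arithmetic, with illustrative names:]

```c
#include <assert.h>

/* Pick the initial pool from a per-cpu hash; npools must be a power
 * of two for the mask trick to be equivalent to pool_hash % npools. */
static unsigned int pick_pool(unsigned int pool_hash, unsigned int npools)
{
	return pool_hash & (npools - 1);
}

/* Hop to the next pool on allocation failure, wrapping at the end. */
static unsigned int next_pool(unsigned int pool_nr, unsigned int npools)
{
	return (pool_nr + 1) & (npools - 1);
}
```

After nr_pools hops the search has visited every pool once, which is
why the quoted code gives up when pass exceeds iommu->nr_pools.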
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Thu, 2015-04-02 at 18:15 -0400, Sowmini Varadhan wrote:
> On (04/03/15 08:57), Benjamin Herrenschmidt wrote:
> > > > I only just noticed too, you completely dropped the code to
> > > > honor the dma mask. Why that ? Some devices rely on this.
> > >
> > > /* Sowmini's comment about this coming from sparc origins.. */
> >
> > Probably, not that many devices have limits on DMA mask but they do
> > exist. It becomes more important if we decide to create a very
> > large IOMMU window that spans beyond 4G in order to support devices
> > with 32-bit DMA masks. Otherwise it's older devices mostly with
> > 32-bit masks. In any case, for a generic piece of code, this should
> > be supported. Basically, assume that if we have something in the
> > powerpc code, we need it, if you remove it, we won't be able to use
> > your code generically.
>
> I see. is the mask something that can be stored in the
> iommu_map_table as part of the init?

No, the mask is per device and has to be retrieved from the device.
Additionally the mask used for dma_map_* can be different from the
consistent mask used for alloc_coherent (we have a bug there on
powerpc which I'm trying to fix btw). So it should be passed as an
argument by the caller.

> I can see that the align_order has to be an additional arg to
> iommu_tbl_range_alloc, not sure if mask falls in that category as
> well.

It does.

Cheers,
Ben.
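[Editor's note: the mask handling under discussion is the clamp quoted
at the top of the thread: if the table extends past what the device
can address, shrink the search limit so every returned entry maps
below the mask. A standalone sketch of that arithmetic, in IOMMU-page
units; the function name is mine, the expression follows the quoted
code:]

```c
#include <assert.h>

/* it_offset is the table's first entry number; entry i corresponds to
 * bus page (it_offset + i), so keep it_offset + limit - 1 <= mask. */
static unsigned long clamp_limit(unsigned long limit,
				 unsigned long it_offset,
				 unsigned long mask)
{
	if (limit + it_offset > mask)
		limit = mask - it_offset + 1;
	return limit;
}
```

This is also why the mask has to come from the device per call rather
than being stored in the iommu_map_table at init, as Ben notes: two
devices sharing one table can have different masks.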
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Fri, 2015-04-03 at 09:01 +1100, Benjamin Herrenschmidt wrote:
> On Fri, 2015-04-03 at 08:57 +1100, Benjamin Herrenschmidt wrote:
> > No, the last argument to iommu_area_alloc() which is passed from
> > the callers when doing consistent allocs. Basically, the DMA api
> > mandates that consistent allocs are naturally aligned (to their own
> > size), we implement that on powerpc by passing that alignment
> > argument down.
>
> Talking of this ... I notice this is not documented in DMA-API.txt...
> however many drivers make that assumption, and the DMA pool allocator
> in mm/dmapool.c as well. Maybe somebody should update DMA-API.txt...

Ah it *is* documented in DMA-API-HOWTO.txt :-)

Cheers,
Ben.
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Fri, 2015-04-03 at 08:57 +1100, Benjamin Herrenschmidt wrote:
> No, the last argument to iommu_area_alloc() which is passed from the
> callers when doing consistent allocs. Basically, the DMA api mandates
> that consistent allocs are naturally aligned (to their own size), we
> implement that on powerpc by passing that alignment argument down.

Talking of this ... I notice this is not documented in DMA-API.txt...
however many drivers make that assumption, and the DMA pool allocator
in mm/dmapool.c as well. Maybe somebody should update DMA-API.txt...

Cheers,
Ben.
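[Editor's note: "naturally aligned (to their own size)" means a
coherent allocation of N bytes starts on an N-byte boundary (N rounded
up to a power of two of pages). A sketch of the arithmetic behind the
align_order argument; the helper names are illustrative, order_for()
mimicking the kernel's get_order():]

```c
#include <assert.h>
#include <stddef.h>

/* log2 of the number of IOMMU pages needed for `size` bytes */
static unsigned int order_for(size_t size, unsigned int page_shift)
{
	size_t pages = (size + (1UL << page_shift) - 1) >> page_shift;
	unsigned int order = 0;

	while ((1UL << order) < pages)
		order++;
	return order;
}

/* Does dma_addr satisfy the natural-alignment rule for `size`? */
static int naturally_aligned(unsigned long dma_addr, size_t size,
			     unsigned int page_shift)
{
	unsigned long align = 1UL << (order_for(size, page_shift) + page_shift);

	return (dma_addr & (align - 1)) == 0;
}
```

Passing order_for(size, page_shift) down as align_order is what lets
the bitmap allocator only consider suitably aligned starting entries.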
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On (04/03/15 07:54), Benjamin Herrenschmidt wrote:
> > +	limit = pool->end;
> > +
> > +	/* The case below can happen if we have a small segment appended
> > +	 * to a large, or when the previous alloc was at the very end of
> > +	 * the available space. If so, go back to the beginning and flush.
> > +	 */
> > +	if (start >= limit) {
> > +		start = pool->start;
> > +		if (!large_pool && iommu->lazy_flush != NULL)
> > +			iommu->lazy_flush(iommu);
>
> Add need_flush = false;

A few clarifications, while I parse the rest of your comments:

Not sure I follow- need_flush is initialized to true at the start of
the function?

> I only just noticed too, you completely dropped the code to honor the
> dma mask. Why that ? Some devices rely on this.

so that's an interesting question: the existing iommu_range_alloc() in
arch/sparc/kernel/iommu.c does not use the mask at all. I based most
of the code on this (except for the lock fragmentation part). I dont
know if this is arch specific.

> > +	if (dev)
> > +		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > +				      1 << iommu->table_shift);
> > +	else
> > +		boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
> > +
> > +	shift = iommu->table_map_base >> iommu->table_shift;
> > +	boundary_size = boundary_size >> iommu->table_shift;
> > +	/*
> > +	 * if the skip_span_boundary_check had been set during init, we set
> > +	 * things up so that iommu_is_span_boundary() merely checks if the
> > +	 * (index + npages) < num_tsb_entries
> > +	 */
> > +	if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
> > +		shift = 0;
> > +		boundary_size = iommu->poolsize * iommu->nr_pools;
> > +	}
> > +	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
> > +			     boundary_size, 0);
>
> You have completely dropped the alignment support. This will break
> drivers. There are cases (especially with consistent allocations)
> where the driver has alignment constraints on the address, those must
> be preserved.

Again, not sure I follow? are you referring to the IOMMU_NO_SPAN_BOUND
case? That's very specific to LDC (sparc ldoms virtualization infra).
The default is to not have IOMMU_NO_SPAN_BOUND set. For the rest of
the drivers, the code that sets up boundary_size aligns things in the
same way as the ppc code.

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On Thu, 2015-04-02 at 17:43 -0400, Sowmini Varadhan wrote:
> On (04/03/15 07:54), Benjamin Herrenschmidt wrote:
> > > +	limit = pool->end;
> > > +
> > > +	/* The case below can happen if we have a small segment appended
> > > +	 * to a large, or when the previous alloc was at the very end of
> > > +	 * the available space. If so, go back to the beginning and flush.
> > > +	 */
> > > +	if (start >= limit) {
> > > +		start = pool->start;
> > > +		if (!large_pool && iommu->lazy_flush != NULL)
> > > +			iommu->lazy_flush(iommu);
> >
> > Add need_flush = false;
>
> A few clarifications, while I parse the rest of your comments:
>
> Not sure I follow- need_flush is initialized to true at the start of
> the function?

No but you can loop back there via goto again. However if you follow
my other comment and move the flush back to the end, then you don't
need that at all.

> > I only just noticed too, you completely dropped the code to honor
> > the dma mask. Why that ? Some devices rely on this.
>
> so that's an interesting question: the existing iommu_range_alloc()
> in arch/sparc/kernel/iommu.c does not use the mask at all. I based
> most of the code on this (except for the lock fragmentation part). I
> dont know if this is arch specific.

Probably, not that many devices have limits on DMA mask but they do
exist. It becomes more important if we decide to create a very large
IOMMU window that spans beyond 4G in order to support devices with
32-bit DMA masks. Otherwise it's older devices mostly with 32-bit
masks.

In any case, for a generic piece of code, this should be supported.
Basically, assume that if we have something in the powerpc code, we
need it, if you remove it, we won't be able to use your code
generically. There are a few cases we can debate like our block
allocation, but things like the mask or the alignment constraints
aren't in that group.

> > > +	if (dev)
> > > +		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > > +				      1 << iommu->table_shift);
> > > +	else
> > > +		boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
> > > +
> > > +	shift = iommu->table_map_base >> iommu->table_shift;
> > > +	boundary_size = boundary_size >> iommu->table_shift;
> > > +	/*
> > > +	 * if the skip_span_boundary_check had been set during init, we set
> > > +	 * things up so that iommu_is_span_boundary() merely checks if the
> > > +	 * (index + npages) < num_tsb_entries
> > > +	 */
> > > +	if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
> > > +		shift = 0;
> > > +		boundary_size = iommu->poolsize * iommu->nr_pools;
> > > +	}
> > > +	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
> > > +			     boundary_size, 0);
> >
> > You have completely dropped the alignment support. This will break
> > drivers. There are cases (especially with consistent allocations)
> > where the driver has alignment constraints on the address, those
> > must be preserved.
>
> Again, not sure I follow? are you referring to the
> IOMMU_NO_SPAN_BOUND case?

No, the last argument to iommu_area_alloc() which is passed from the
callers when doing consistent allocs. Basically, the DMA api mandates
that consistent allocs are naturally aligned (to their own size), we
implement that on powerpc by passing that alignment argument down.

> That's very specific to LDC (sparc ldoms virtualization infra). The
> default is to not have IOMMU_NO_SPAN_BOUND set. For the rest of the
> drivers, the code that sets up boundary_size aligns things in the
> same way as the ppc code.
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
the other question that comes to my mind is: the whole lazy_flush
optimization probably works best when there is exactly one pool, and
no large pools. In most other cases, we'd end up doing a lazy_flush
when we wrap within our pool itself, losing the benefit of that
optimization.

Given that the lazy_flush is mostly there to avoid regressions for the
older sun4u architectures (which have other hardware bottlenecks
anyway), and this code is rapidly getting messy, does it make sense to
constrain the lazy_flush check to only apply for the 1-pool,
no-large-pool case?
RE: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
From: Sowmini Varadhan
> Investigation of multithreaded iperf experiments on an ethernet
> interface show the iommu->lock as the hottest lock identified by
> lockstat, with something of the order of 21M contentions out of 27M
> acquisitions, and an average wait time of 26 us for the lock. This is
> not efficient. A more scalable design is to follow the ppc model,
> where the iommu_map_table has multiple pools, each stretching over a
> segment of the map, and with a separate lock for each pool. This
> model allows for better parallelization of the iommu map search.

I've wondered whether the iommu setup for ethernet receive (in
particular) could be made much more efficient if there were a function
that would unmap one buffer and map a second buffer? My thought is
that the iommu pte entry used by the old buffer could just be modified
to reference the new one. In effect each ring entry would end up using
a fixed iommu pte.

The other question is how much data can be copied in 26us ? On iommu
systems 'copybreak' limits on receive and transmit may need to be
quite high.

	David
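[Editor's note: the ppc-style layout the patch description refers to
splits one big map into npools contiguous segments, each with its own
hint and, in the kernel, its own spinlock, so concurrent mappers
usually contend on different locks. A userspace sketch of just the
partitioning (locks omitted; field names follow the discussion but
this is not the patch code):]

```c
#include <assert.h>

#define NPOOLS 4	/* a power of two in the real code */

struct pool {
	unsigned long start, end, hint;
};

/* Carve num_entries map slots into NPOOLS equal, contiguous segments;
 * each pool's search hint starts at its own beginning. */
static void setup_pools(struct pool pools[NPOOLS], unsigned long num_entries)
{
	unsigned long poolsize = num_entries / NPOOLS;

	for (unsigned int i = 0; i < NPOOLS; i++) {
		pools[i].start = i * poolsize;
		pools[i].end   = (i + 1) * poolsize;
		pools[i].hint  = pools[i].start;
	}
}
```

Because the segments are disjoint, a thread that hashes to pool 2 can
allocate concurrently with one in pool 0 without touching shared state.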
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock
On (03/31/15 15:15), David Laight wrote:
> I've wondered whether the iommu setup for ethernet receive (in
> particular) could be made much more efficient if there were a
> function that would unmap one buffer and map a second buffer? My
> thought is that the iommu pte entry used by the old buffer could just
> be modified to reference the new one. In effect each ring entry would
> end up using a fixed iommu pte.

There are a number of interesting things to investigate in this space,
and the above is just one of them, ways to avoid the overhead of a
full-blown map/unmap on each call. See
http://www.spinics.net/lists/sparclinux/msg13613.html

But the scope of this patchset is actually very rigidly defined: to
refactor the iommu pool/arena allocator into a common library, and
avoid code duplication (today even the single sparc arch duplicates it
for sun4[u,v] and ldc, and that's not even counting the duplication
across other archs/pci-drivers). Investigating ways to provide a
generalized infra that can avoid a dma map/unmap for every packet
would be a good follow-on.

> The other question is how much data can be copied in 26us ? On iommu
> systems 'copybreak' limits on receive and transmit may need to be
> quite high.

where does the 26us number come from? I may be missing that context?

--Sowmini