Fwd: Re: sparc64: Build failure due to commit f1600e549b94 (sparc: Make sparc64 use scalable lib/iommu-common.c functions)
On (04/19/15 14:09), David Miller wrote:
> On (04/18/15 21:23), Guenter Roeck wrote:
> > lib/built-in.o:(.discard+0x1): multiple definition of `__pcpu_unique_iommu_pool_hash'
> > arch/powerpc/kernel/built-in.o:(.discard+0x18): first defined here
> > ..
> > I get a similar failure in the powerpc:allmodconfig build
>
> Maybe ping the powerpc folks, because if they can do a quick conversion,
> this change isn't necessary.

linuxppc-dev,

The discussion above [http://www.spinics.net/lists/sparclinux/msg13835.html]
is in reference to the issue that Guenter Roeck identified.

The problem is that we have a

	static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);

in both lib/iommu-common.c and arch/powerpc/kernel/iommu.c, and, as Guenter
correctly pointed out, DEFINE_PER_CPU translates to DEFINE_PER_CPU_SECTION,
which in turn is defined as

	#define DEFINE_PER_CPU_SECTION(type, name, sec)			\
		__PCPU_DUMMY_ATTRS char __pcpu_scope_##name;		\
		extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;	\
		__PCPU_DUMMY_ATTRS char __pcpu_unique_##name;	<<==	\
		extern __PCPU_ATTRS(sec) __typeof__(type) name;		\
		__PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES __weak		\
		__typeof__(type) name

when CONFIG_DEBUG_FORCE_WEAK_PER_CPU is configured, which is the case here.
The marked line above shows that __pcpu_unique_iommu_pool_hash is declared
as a global variable. This results in a build error for allmodconfig on
powerpc, due to the multiple definition of __pcpu_unique_iommu_pool_hash.

Given that the goal is to use the functions in lib/iommu-common.c as the
generic IOMMU pooled allocator that will be a superset of the functionality
in arch/powerpc/kernel/iommu.c, DaveM suggested that I ping you to see if
you can do a quick conversion to handle this duplicate definition, rather
than rename iommu_pool_hash in iommu-common.c.

Would that be possible?

--Sowmini
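For illustration, here is roughly what each of the two translation units
contains after preprocessing, with CONFIG_DEBUG_FORCE_WEAK_PER_CPU=y (a
simplified sketch; the section attributes are elided, and only the linkage
of each symbol matters here):

	/* Expansion of: static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
	 * The caller's "static" binds to the first declarator only. */
	static char __pcpu_scope_iommu_pool_hash;	/* absorbs the "static" */
	extern char __pcpu_unique_iommu_pool_hash;
	char __pcpu_unique_iommu_pool_hash;		/* external linkage */
	extern __typeof__(unsigned int) iommu_pool_hash;
	__attribute__((weak)) __typeof__(unsigned int) iommu_pool_hash;

Both lib/iommu-common.o and arch/powerpc/kernel/iommu.o therefore emit a
global definition of __pcpu_unique_iommu_pool_hash, and the final
allmodconfig link fails with the "multiple definition" error quoted above.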
[PATCH v10 1/3] Break up monolithic iommu table/lock into finer granularity pools and lock
Investigation of multithreaded iperf experiments on an ethernet interface
shows the iommu->lock as the hottest lock identified by lockstat, with
something of the order of 21M contentions out of 27M acquisitions, and an
average wait time of 26 us for the lock. This is not efficient.

A more scalable design is to follow the ppc model, where the
iommu_map_table has multiple pools, each stretching over a segment of the
map, and with a separate lock for each pool. This model allows for better
parallelization of the iommu map search.

This patch adds the iommu range alloc/free function infrastructure.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Acked-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---
v2 changes:
  - incorporate David Miller editorial comments: sparc specific fields
    moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when the
    iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary check,
    for use cases such as LDC
v3: eliminate iommu_sparc, rearrange the ->demap indirection to be
    invoked under the pool lock
v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32
v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru
  - removed ->cookie_to_index and ->demap indirection: the caller should
    invoke these as needed before calling into the generic allocator
v6: Benh/DaveM discussion eliminating iommu_tbl_ops, but retaining the
    flush_all optimization
v7: one-time initialization of pool_hash from iommu_tbl_pool_init()
v8: Benh code review comments
v9: More Benh code review comments; added dma_mask, align_order logic to
    iommu_tbl_range_alloc
v10: resend without RFC tag, and new mail message-id

 include/linux/iommu-common.h |   51 ++++++
 lib/Makefile                 |    2 +-
 lib/iommu-common.c           |  266 ++++++++++++++++++++++++++++++
 3 files changed, 318 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
new file mode 100644
index 0000000..bbced83
--- /dev/null
+++ b/include/linux/iommu-common.h
@@ -0,0 +1,51 @@
+#ifndef _LINUX_IOMMU_COMMON_H
+#define _LINUX_IOMMU_COMMON_H
+
+#include <linux/spinlock_types.h>
+#include <linux/device.h>
+#include <asm/page.h>
+
+#define IOMMU_POOL_HASHBITS	4
+#define IOMMU_NR_POOLS		(1 << IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+	unsigned long	start;
+	unsigned long	end;
+	unsigned long	hint;
+	spinlock_t	lock;
+};
+
+struct iommu_map_table {
+	unsigned long		table_map_base;
+	unsigned long		table_shift;
+	unsigned long		nr_pools;
+	void			(*lazy_flush)(struct iommu_map_table *);
+	unsigned long		poolsize;
+	struct iommu_pool	pools[IOMMU_NR_POOLS];
+	u32			flags;
+#define	IOMMU_HAS_LARGE_POOL	0x00000001
+#define	IOMMU_NO_SPAN_BOUND	0x00000002
+#define	IOMMU_NEED_FLUSH	0x00000004
+	struct iommu_pool	large_pool;
+	unsigned long		*map;
+};
+
+extern void iommu_tbl_pool_init(struct iommu_map_table *iommu,
+				unsigned long num_entries,
+				u32 table_shift,
+				void (*lazy_flush)(struct iommu_map_table *),
+				bool large_pool, u32 npools,
+				bool skip_span_boundary_check);
+
+extern unsigned long iommu_tbl_range_alloc(struct device *dev,
+					   struct iommu_map_table *iommu,
+					   unsigned long npages,
+					   unsigned long *handle,
+					   unsigned long mask,
+					   unsigned int align_order);
+
+extern void iommu_tbl_range_free(struct iommu_map_table *iommu,
+				 u64 dma_addr, unsigned long npages,
+				 unsigned long entry);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 58f74d2..60c22e6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,7 +106,7 @@
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
-obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
 obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
diff --git a/lib/iommu-common.c b/lib/iommu-common.c
new file mode 100644
index 0000000..b99f1d7
--- /dev/null
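To make the calling convention of the interface above concrete, here is a
minimal, hypothetical consumer (a sketch only: the my_* names and the
table size are invented, error handling is abbreviated, and DMA_ERROR_CODE
is the sparc failure sentinel mentioned in the changelog):

	#include <linux/iommu-common.h>
	#include <linux/dma-mapping.h>
	#include <linux/slab.h>

	#define MY_TBL_ENTRIES	(8 * 1024)		/* hypothetical size */

	static struct iommu_map_table my_tbl;		/* hypothetical table */

	static int my_iommu_setup(void)
	{
		/* One bitmap bit per table entry, rounded up to 8 bytes. */
		unsigned long sz = ALIGN(MY_TBL_ENTRIES / 8, 8UL);

		my_tbl.map = kzalloc(sz, GFP_KERNEL);
		if (!my_tbl.map)
			return -ENOMEM;

		/* 16 pools, no large pool, no lazy flush, honor span
		 * boundaries. */
		iommu_tbl_pool_init(&my_tbl, MY_TBL_ENTRIES, PAGE_SHIFT,
				    NULL, false, 16, false);
		return 0;
	}

	static u64 my_map_pages(struct device *dev, unsigned long npages)
	{
		/* mask of ~0UL: no addressing constraint; align_order 0:
		 * no extra alignment requirement. */
		unsigned long entry = iommu_tbl_range_alloc(dev, &my_tbl,
							    npages, NULL,
							    ~0UL, 0);

		if (entry == DMA_ERROR_CODE)
			return DMA_ERROR_CODE;

		/* The caller then programs its own IOMMU PTEs for the range
		 * [entry, entry + npages) before handing out the address. */
		return my_tbl.table_map_base + (entry << my_tbl.table_shift);
	}

	static void my_unmap_pages(u64 dma_addr, unsigned long npages)
	{
		unsigned long entry = (dma_addr - my_tbl.table_map_base) >>
				      my_tbl.table_shift;

		iommu_tbl_range_free(&my_tbl, dma_addr, npages, entry);
	}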
[PATCH v10 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions
In iperf experiments running linux as the Tx side (TCP client) with
10 threads, there is a severe performance drop when TSO is disabled,
indicating a weakness in the software that can be avoided by using the
scalable IOMMU arena DMA allocation.

Baseline numbers before this patch:
  with default settings (TSO enabled):       9-9.5 Gbps
  TSO disabled using ethtool - drops badly:  2-3 Gbps

After this patch, an iperf client with 10 threads can give a throughput
of at least 8.5 Gbps, even when TSO is disabled.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Acked-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---
v2: moved sparc specific fields into iommu_sparc
v3: converted all sparc users of iommu, so lot of cleanup and streamlining
v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - reverts pci_impl.h (now that all iommu usage has been converted)
v5: benh/aik feedback modifies the function signatures: pass in modified
    args to iommu_tbl_pool_init() and iommu_tbl_range_free()
v6: removed iommu_tbl_ops. Pass flush_all as function pointer to
    iommu_tbl_pool_init
v7: move pool_hash initialization to iommu_tbl_pool_init()
v8: BenH comments
v9: more BenH comments
v10: resend without RFC tag, and new mail message-id

 arch/sparc/include/asm/iommu_64.h |    7 +-
 arch/sparc/kernel/iommu.c         |  172 +++++++++-----------
 arch/sparc/kernel/iommu_common.h  |    8 --
 arch/sparc/kernel/pci_sun4v.c     |  183 +++++++++++-----------
 4 files changed, 128 insertions(+), 242 deletions(-)

diff --git a/arch/sparc/include/asm/iommu_64.h b/arch/sparc/include/asm/iommu_64.h
index 2b9321a..cd0d69f 100644
--- a/arch/sparc/include/asm/iommu_64.h
+++ b/arch/sparc/include/asm/iommu_64.h
@@ -16,6 +16,7 @@
 #define IOPTE_WRITE   0x0000000000000002UL
 
 #define IOMMU_NUM_CTXS	4096
+#include <linux/iommu-common.h>
 
 struct iommu_arena {
 	unsigned long	*map;
@@ -24,11 +25,10 @@ struct iommu_arena {
 };
 
 struct iommu {
+	struct iommu_map_table	tbl;
 	spinlock_t		lock;
-	struct iommu_arena	arena;
-	void			(*flush_all)(struct iommu *);
+	u32			dma_addr_mask;
 	iopte_t			*page_table;
-	u32			page_table_map_base;
 	unsigned long		iommu_control;
 	unsigned long		iommu_tsbbase;
 	unsigned long		iommu_flush;
@@ -40,7 +40,6 @@ struct iommu {
 	unsigned long		dummy_page_pa;
 	unsigned long		ctx_lowest_free;
 	DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS);
-	u32			dma_addr_mask;
 };
 
 struct strbuf {
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index bfa4d0c..5320689 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <linux/iommu-helper.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #ifdef CONFIG_PCI
 #include <linux/pci.h>
@@ -45,8 +46,9 @@
 			       "i" (ASI_PHYS_BYPASS_EC_E))
 
 /* Must be invoked under the IOMMU lock. */
-static void iommu_flushall(struct iommu *iommu)
+static void iommu_flushall(struct iommu_map_table *iommu_map_table)
 {
+	struct iommu *iommu = container_of(iommu_map_table, struct iommu, tbl);
 	if (iommu->iommu_flushinv) {
 		iommu_write(iommu->iommu_flushinv, ~(u64)0);
 	} else {
@@ -87,94 +89,6 @@ static inline void iopte_make_dummy(struct iommu *iommu, iopte_t *iopte)
 	iopte_val(*iopte) = val;
 }
 
-/* Based almost entirely upon the ppc64 iommu allocator. If you use the 'handle'
- * facility it must all be done in one pass while under the iommu lock.
- *
- * On sun4u platforms, we only flush the IOMMU once every time we've passed
- * over the entire page table doing allocations. Therefore we only ever advance
- * the hint and cannot backtrack it.
- */
-unsigned long iommu_range_alloc(struct device *dev,
-				struct iommu *iommu,
-				unsigned long npages,
-				unsigned long *handle)
-{
-	unsigned long n, end, start, limit, boundary_size;
-	struct iommu_arena *arena = &iommu->arena;
-	int pass = 0;
-
-	/* This allocator was derived from x86_64's bit string search */
-
-	/* Sanity check */
-	if (unlikely(npages == 0)) {
-		if (printk_ratelimit())
-			WARN_ON(1);
-		return DMA_ERROR_CODE;
-	}
-
-	if (handle && *handle)
-		start = *handle;
-	else
-		start = arena->hint;
-
-	limit = arena->limit;
-
-	/* The case below can happen if we have a small segment appended
-	 * to a large, or when the previous alloc was at the very end of
-	 * the available space. If so, go back to the beginning and flush
[PATCH v10 3/3] sparc: Make LDC use common iommu pool management functions
Note that this conversion is only being done to consolidate the code and
ensure that the common code provides a sufficient abstraction. It is not
expected to result in any noticeable performance improvement, as there is
typically one ldc_iommu per vnet_port, and each one has 8k entries, with a
typical request for 1-4 pages. Thus LDC uses npools == 1.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
Acked-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>
---
v3: added this file to be a consumer of the common iommu library
v4: removed ->cookie_to_index and ->demap from iommu_tbl_ops and instead
    inline these calls into ldc before calling into iommu-common
v6: remove iommu_tbl_ops
v7: move pool_hash initialization to iommu_tbl_pool_init
v8: BenH comments
v9: More BenH comments
v10: resend without RFC tag, and new mail message-id

 arch/sparc/kernel/ldc.c |  153 ++++++++++++++++++-----------------
 1 files changed, 65 insertions(+), 88 deletions(-)

diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index 274a9f5..d2ae0f7 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -15,6 +15,7 @@
 #include <linux/list.h>
 #include <linux/init.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #include <asm/hypervisor.h>
 #include <asm/iommu.h>
@@ -27,6 +28,10 @@
 #define DRV_MODULE_VERSION	"1.1"
 #define DRV_MODULE_RELDATE	"July 22, 2008"
 
+#define COOKIE_PGSZ_CODE	0xf000000000000000ULL
+#define COOKIE_PGSZ_CODE_SHIFT	60ULL
+
+
 static char version[] =
 	DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
 
 #define LDC_PACKET_SIZE		64
@@ -98,10 +103,10 @@ static const struct ldc_mode_ops stream_ops;
 int ldom_domaining_enabled;
 
 struct ldc_iommu {
-	/* Protects arena alloc/free. */
+	/* Protects ldc_unmap. */
 	spinlock_t		lock;
-	struct iommu_arena	arena;
 	struct ldc_mtable_entry	*page_table;
+	struct iommu_map_table	iommu_map_table;
 };
 
 struct ldc_channel {
@@ -998,31 +1003,59 @@ static void free_queue(unsigned long num_entries, struct ldc_packet *q)
 	free_pages((unsigned long)q, order);
 }
 
+static unsigned long ldc_cookie_to_index(u64 cookie, void *arg)
+{
+	u64 szcode = cookie >> COOKIE_PGSZ_CODE_SHIFT;
+	/* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */
+
+	cookie &= ~COOKIE_PGSZ_CODE;
+
+	return (cookie >> (13ULL + (szcode * 3ULL)));
+}
+
+static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie,
+		      unsigned long entry, unsigned long npages)
+{
+	struct ldc_mtable_entry *base;
+	unsigned long i, shift;
+
+	shift = (cookie >> COOKIE_PGSZ_CODE_SHIFT) * 3;
+	base = iommu->page_table + entry;
+	for (i = 0; i < npages; i++) {
+		if (base->cookie)
+			sun4v_ldc_revoke(id, cookie + (i << shift),
+					 base->cookie);
+		base->mte = 0;
+	}
+}
+
 /* XXX Make this configurable... XXX */
 #define LDC_IOTABLE_SIZE	(8 * 1024)
 
-static int ldc_iommu_init(struct ldc_channel *lp)
+static int ldc_iommu_init(const char *name, struct ldc_channel *lp)
 {
 	unsigned long sz, num_tsb_entries, tsbsize, order;
-	struct ldc_iommu *iommu = &lp->iommu;
+	struct ldc_iommu *ldc_iommu = &lp->iommu;
+	struct iommu_map_table *iommu = &ldc_iommu->iommu_map_table;
 	struct ldc_mtable_entry *table;
 	unsigned long hv_err;
 	int err;
 
 	num_tsb_entries = LDC_IOTABLE_SIZE;
 	tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry);
-
-	spin_lock_init(&iommu->lock);
+	spin_lock_init(&ldc_iommu->lock);
 
 	sz = num_tsb_entries / 8;
 	sz = (sz + 7UL) & ~7UL;
-	iommu->arena.map = kzalloc(sz, GFP_KERNEL);
-	if (!iommu->arena.map) {
+	iommu->map = kzalloc(sz, GFP_KERNEL);
+	if (!iommu->map) {
 		printk(KERN_ERR PFX "Alloc of arena map failed, sz=%lu\n", sz);
 		return -ENOMEM;
 	}
-
-	iommu->arena.limit = num_tsb_entries;
+	iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT,
+			    NULL, false /* no large pool */,
+			    1 /* npools */,
+			    true /* skip span boundary check */);
 
 	order = get_order(tsbsize);
 
@@ -1037,7 +1070,7 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 	memset(table, 0, PAGE_SIZE << order);
 
-	iommu->page_table = table;
+	ldc_iommu->page_table = table;
 
 	hv_err = sun4v_ldc_set_map_table(lp->id, __pa(table),
 					 num_tsb_entries);
@@ -1049,31 +1082,32 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 out_free_table:
 	free_pages((unsigned long) table, order);
-	iommu->page_table = NULL;
+	ldc_iommu->page_table = NULL;
 
 out_free_map:
-	kfree(iommu
[PATCH v10 0/3] Generic IOMMU pooled allocator
Investigation of network performance on Sparc shows a high degree of
locking contention in the IOMMU allocator, and it was noticed that the
PowerPC code has a better locking model.

This patch series tries to extract the generic parts of the PowerPC code
so that it can be shared across multiple PCI devices and architectures.

v10: resend patchv9 without RFC tag, and a new mail Message-Id (previous
     non-RFC attempt did not show up on the patchwork queue?)

Full revision history below:
v2 changes:
  - incorporate David Miller editorial comments: sparc specific fields
    moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when the
    iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary check,
    for use cases such as LDC
v3: eliminate iommu_sparc, rearrange the ->demap indirection to be
    invoked under the pool lock
v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32
v5: removed ->cookie_to_index and ->demap indirection from the
    iommu_tbl_ops. The caller needs to call these functions as needed,
    before invoking the generic arena allocator functions. Added the
    skip_span_boundary argument to iommu_tbl_pool_init() for those
    callers, like LDC, which do not care about span boundary checks.
v6: removed iommu_tbl_ops, and instead pass the ->flush_all as an
    indirection to iommu_tbl_pool_init(); only invoke ->flush_all when
    there is no large_pool, based on the assumption that large-pool
    usage is infrequently encountered
v7: moved pool_hash initialization to lib/iommu-common.c and cleaned up
    code duplication from sun4v/sun4u/ldc
v8: Addresses BenH comments, with one exception: I've left the
    IOMMU_POOL_HASH as is, so that powerpc can tailor it to their
    convenience. Discard trylock for simple spin_lock to acquire pool.
v9: Addresses latest BenH comments: need_flush checks, add support for
    dma mask and align_order
v10: resend without RFC tag, and new mail Message-Id

Sowmini Varadhan (3):
  Break up monolithic iommu table/lock into finer granularity pools
    and lock
  Make sparc64 use scalable lib/iommu-common.c functions
  Make LDC use common iommu pool management functions

 arch/sparc/include/asm/iommu_64.h |    7 +-
 arch/sparc/kernel/iommu.c         |  172 ++++++++--------
 arch/sparc/kernel/iommu_common.h  |    8 -
 arch/sparc/kernel/ldc.c           |  153 +++++++-------
 arch/sparc/kernel/pci_sun4v.c     |  183 ++++++++--------
 include/linux/iommu-common.h      |   51 +++++
 lib/Makefile                      |    2 +-
 lib/iommu-common.c                |  266 ++++++++++++++++++++
 8 files changed, 511 insertions(+), 331 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c
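The heart of the locking model can be sketched as follows (an
approximation of the idea described above, not the exact library code;
iommu_pool_hash is the per-cpu value seeded once at init time as of v7):

	#include <linux/bitmap.h>
	#include <linux/percpu.h>
	#include <linux/spinlock.h>
	#include <linux/iommu-common.h>

	static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);

	/* Each CPU hashes to its own "preferred" pool, so concurrent
	 * allocators usually take different locks and search disjoint
	 * segments of the map. */
	static unsigned long alloc_sketch(struct iommu_map_table *tbl,
					  unsigned long npages)
	{
		unsigned int nr = __this_cpu_read(iommu_pool_hash) &
				  (tbl->nr_pools - 1);
		struct iommu_pool *pool = &tbl->pools[nr];
		unsigned long entry;

		spin_lock(&pool->lock);
		entry = bitmap_find_next_zero_area(tbl->map, pool->end,
						   pool->hint, npages, 0);
		if (entry < pool->end) {
			bitmap_set(tbl->map, entry, npages);
			pool->hint = entry + npages;
		}
		spin_unlock(&pool->lock);
		return entry;
	}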
Re: [PATCHv9 RFC 1/3] Break up monolithic iommu table/lock into finer granularity pools and lock
On (04/08/15 18:30), Benjamin Herrenschmidt wrote:
> I'm happy with your last version, feel free to add my
>
> Acked-by: Benjamin Herrenschmidt <b...@kernel.crashing.org>

sounds good, I'll do this and resend a non-RFC version today.

Thanks for all the feedback - it was very useful to me, and I'm much
happier with the end-result.

--Sowmini
[PATCH v9 RFC 3/3] sparc: Make LDC use common iommu pool management functions
Note that this conversion is only being done to consolidate the code and
ensure that the common code provides a sufficient abstraction. It is not
expected to result in any noticeable performance improvement, as there is
typically one ldc_iommu per vnet_port, and each one has 8k entries, with a
typical request for 1-4 pages. Thus LDC uses npools == 1.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v3: added this file to be a consumer of the common iommu library
v4: removed ->cookie_to_index and ->demap from iommu_tbl_ops and instead
    inline these calls into ldc before calling into iommu-common
v6: remove iommu_tbl_ops
v7: move pool_hash initialization to iommu_tbl_pool_init
v8: BenH comments
v9: More BenH comments

 arch/sparc/kernel/ldc.c |  153 ++++++++++++++++++-----------------
 1 files changed, 65 insertions(+), 88 deletions(-)

diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index 274a9f5..d2ae0f7 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -15,6 +15,7 @@
 #include <linux/list.h>
 #include <linux/init.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #include <asm/hypervisor.h>
 #include <asm/iommu.h>
@@ -27,6 +28,10 @@
 #define DRV_MODULE_VERSION	"1.1"
 #define DRV_MODULE_RELDATE	"July 22, 2008"
 
+#define COOKIE_PGSZ_CODE	0xf000000000000000ULL
+#define COOKIE_PGSZ_CODE_SHIFT	60ULL
+
+
 static char version[] =
 	DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
 
 #define LDC_PACKET_SIZE		64
@@ -98,10 +103,10 @@ static const struct ldc_mode_ops stream_ops;
 int ldom_domaining_enabled;
 
 struct ldc_iommu {
-	/* Protects arena alloc/free. */
+	/* Protects ldc_unmap. */
 	spinlock_t		lock;
-	struct iommu_arena	arena;
 	struct ldc_mtable_entry	*page_table;
+	struct iommu_map_table	iommu_map_table;
 };
 
 struct ldc_channel {
@@ -998,31 +1003,59 @@ static void free_queue(unsigned long num_entries, struct ldc_packet *q)
 	free_pages((unsigned long)q, order);
 }
 
+static unsigned long ldc_cookie_to_index(u64 cookie, void *arg)
+{
+	u64 szcode = cookie >> COOKIE_PGSZ_CODE_SHIFT;
+	/* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */
+
+	cookie &= ~COOKIE_PGSZ_CODE;
+
+	return (cookie >> (13ULL + (szcode * 3ULL)));
+}
+
+static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie,
+		      unsigned long entry, unsigned long npages)
+{
+	struct ldc_mtable_entry *base;
+	unsigned long i, shift;
+
+	shift = (cookie >> COOKIE_PGSZ_CODE_SHIFT) * 3;
+	base = iommu->page_table + entry;
+	for (i = 0; i < npages; i++) {
+		if (base->cookie)
+			sun4v_ldc_revoke(id, cookie + (i << shift),
+					 base->cookie);
+		base->mte = 0;
+	}
+}
+
 /* XXX Make this configurable... XXX */
 #define LDC_IOTABLE_SIZE	(8 * 1024)
 
-static int ldc_iommu_init(struct ldc_channel *lp)
+static int ldc_iommu_init(const char *name, struct ldc_channel *lp)
 {
 	unsigned long sz, num_tsb_entries, tsbsize, order;
-	struct ldc_iommu *iommu = &lp->iommu;
+	struct ldc_iommu *ldc_iommu = &lp->iommu;
+	struct iommu_map_table *iommu = &ldc_iommu->iommu_map_table;
 	struct ldc_mtable_entry *table;
 	unsigned long hv_err;
 	int err;
 
 	num_tsb_entries = LDC_IOTABLE_SIZE;
 	tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry);
-
-	spin_lock_init(&iommu->lock);
+	spin_lock_init(&ldc_iommu->lock);
 
 	sz = num_tsb_entries / 8;
 	sz = (sz + 7UL) & ~7UL;
-	iommu->arena.map = kzalloc(sz, GFP_KERNEL);
-	if (!iommu->arena.map) {
+	iommu->map = kzalloc(sz, GFP_KERNEL);
+	if (!iommu->map) {
 		printk(KERN_ERR PFX "Alloc of arena map failed, sz=%lu\n", sz);
 		return -ENOMEM;
 	}
-
-	iommu->arena.limit = num_tsb_entries;
+	iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT,
+			    NULL, false /* no large pool */,
+			    1 /* npools */,
+			    true /* skip span boundary check */);
 
 	order = get_order(tsbsize);
 
@@ -1037,7 +1070,7 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 	memset(table, 0, PAGE_SIZE << order);
 
-	iommu->page_table = table;
+	ldc_iommu->page_table = table;
 
 	hv_err = sun4v_ldc_set_map_table(lp->id, __pa(table),
 					 num_tsb_entries);
@@ -1049,31 +1082,32 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 out_free_table:
 	free_pages((unsigned long) table, order);
-	iommu->page_table = NULL;
+	ldc_iommu->page_table = NULL;
 
 out_free_map:
-	kfree(iommu->arena.map);
-	iommu->arena.map = NULL;
+	kfree(iommu->map);
+	iommu->map = NULL
[PATCH v9 RFC 0/3] Generic IOMMU pooled allocator
Addresses latest BenH comments: need_flush checks, add support for dma
mask and align_order.

Sowmini Varadhan (3):
  Break up monolithic iommu table/lock into finer granularity pools
    and lock
  Make sparc64 use scalable lib/iommu-common.c functions
  Make LDC use common iommu pool management functions

 arch/sparc/include/asm/iommu_64.h |    7 +-
 arch/sparc/kernel/iommu.c         |  172 ++++++++--------
 arch/sparc/kernel/iommu_common.h  |    8 -
 arch/sparc/kernel/ldc.c           |  153 +++++++-------
 arch/sparc/kernel/pci_sun4v.c     |  183 ++++++++--------
 include/linux/iommu-common.h      |   51 +++++
 lib/Makefile                      |    2 +-
 lib/iommu-common.c                |  266 ++++++++++++++++++++
 8 files changed, 511 insertions(+), 331 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c
[PATCHv9 RFC 1/3] Break up monolithic iommu table/lock into finer granularity pools and lock
Investigation of multithreaded iperf experiments on an ethernet interface
shows the iommu->lock as the hottest lock identified by lockstat, with
something of the order of 21M contentions out of 27M acquisitions, and an
average wait time of 26 us for the lock. This is not efficient.

A more scalable design is to follow the ppc model, where the
iommu_map_table has multiple pools, each stretching over a segment of the
map, and with a separate lock for each pool. This model allows for better
parallelization of the iommu map search.

This patch adds the iommu range alloc/free function infrastructure.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2 changes:
  - incorporate David Miller editorial comments: sparc specific fields
    moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when the
    iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary check,
    for use cases such as LDC
v3: eliminate iommu_sparc, rearrange the ->demap indirection to be
    invoked under the pool lock
v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32
v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru
  - removed ->cookie_to_index and ->demap indirection: the caller should
    invoke these as needed before calling into the generic allocator
v6: Benh/DaveM discussion eliminating iommu_tbl_ops, but retaining the
    flush_all optimization
v7: one-time initialization of pool_hash from iommu_tbl_pool_init()
v8: Benh code review comments
v9: More Benh code review comments; added dma_mask, align_order logic to
    iommu_tbl_range_alloc

 include/linux/iommu-common.h |   51 ++++++
 lib/Makefile                 |    2 +-
 lib/iommu-common.c           |  266 ++++++++++++++++++++++++++++++
 3 files changed, 318 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
new file mode 100644
index 0000000..bbced83
--- /dev/null
+++ b/include/linux/iommu-common.h
@@ -0,0 +1,51 @@
+#ifndef _LINUX_IOMMU_COMMON_H
+#define _LINUX_IOMMU_COMMON_H
+
+#include <linux/spinlock_types.h>
+#include <linux/device.h>
+#include <asm/page.h>
+
+#define IOMMU_POOL_HASHBITS	4
+#define IOMMU_NR_POOLS		(1 << IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+	unsigned long	start;
+	unsigned long	end;
+	unsigned long	hint;
+	spinlock_t	lock;
+};
+
+struct iommu_map_table {
+	unsigned long		table_map_base;
+	unsigned long		table_shift;
+	unsigned long		nr_pools;
+	void			(*lazy_flush)(struct iommu_map_table *);
+	unsigned long		poolsize;
+	struct iommu_pool	pools[IOMMU_NR_POOLS];
+	u32			flags;
+#define	IOMMU_HAS_LARGE_POOL	0x00000001
+#define	IOMMU_NO_SPAN_BOUND	0x00000002
+#define	IOMMU_NEED_FLUSH	0x00000004
+	struct iommu_pool	large_pool;
+	unsigned long		*map;
+};
+
+extern void iommu_tbl_pool_init(struct iommu_map_table *iommu,
+				unsigned long num_entries,
+				u32 table_shift,
+				void (*lazy_flush)(struct iommu_map_table *),
+				bool large_pool, u32 npools,
+				bool skip_span_boundary_check);
+
+extern unsigned long iommu_tbl_range_alloc(struct device *dev,
+					   struct iommu_map_table *iommu,
+					   unsigned long npages,
+					   unsigned long *handle,
+					   unsigned long mask,
+					   unsigned int align_order);
+
+extern void iommu_tbl_range_free(struct iommu_map_table *iommu,
+				 u64 dma_addr, unsigned long npages,
+				 unsigned long entry);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 58f74d2..60c22e6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,7 +106,7 @@
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
-obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
 obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
diff --git a/lib/iommu-common.c b/lib/iommu-common.c
new file mode 100644
index 0000000..b99f1d7
--- /dev/null
+++ b/lib/iommu-common.c
@@ -0,0 +1,266 @@
+/*
+ * IOMMU mmap management and range allocation functions
[PATCH v9 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions
In iperf experiments running linux as the Tx side (TCP client) with
10 threads, there is a severe performance drop when TSO is disabled,
indicating a weakness in the software that can be avoided by using the
scalable IOMMU arena DMA allocation.

Baseline numbers before this patch:
  with default settings (TSO enabled):       9-9.5 Gbps
  TSO disabled using ethtool - drops badly:  2-3 Gbps

After this patch, an iperf client with 10 threads can give a throughput
of at least 8.5 Gbps, even when TSO is disabled.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: moved sparc specific fields into iommu_sparc
v3: converted all sparc users of iommu, so lot of cleanup and streamlining
v4: David Miller review change:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - reverts pci_impl.h (now that all iommu usage has been converted)
v5: benh/aik feedback modifies the function signatures: pass in modified
    args to iommu_tbl_pool_init() and iommu_tbl_range_free()
v6: removed iommu_tbl_ops. Pass flush_all as function pointer to
    iommu_tbl_pool_init
v7: move pool_hash initialization to iommu_tbl_pool_init()
v8: BenH comments
v9: more BenH comments

 arch/sparc/include/asm/iommu_64.h |    7 +-
 arch/sparc/kernel/iommu.c         |  172 +++++++++-----------
 arch/sparc/kernel/iommu_common.h  |    8 --
 arch/sparc/kernel/pci_sun4v.c     |  183 +++++++++++-----------
 4 files changed, 128 insertions(+), 242 deletions(-)

diff --git a/arch/sparc/include/asm/iommu_64.h b/arch/sparc/include/asm/iommu_64.h
index 2b9321a..cd0d69f 100644
--- a/arch/sparc/include/asm/iommu_64.h
+++ b/arch/sparc/include/asm/iommu_64.h
@@ -16,6 +16,7 @@
 #define IOPTE_WRITE   0x0000000000000002UL
 
 #define IOMMU_NUM_CTXS	4096
+#include <linux/iommu-common.h>
 
 struct iommu_arena {
 	unsigned long	*map;
@@ -24,11 +25,10 @@ struct iommu_arena {
 };
 
 struct iommu {
+	struct iommu_map_table	tbl;
 	spinlock_t		lock;
-	struct iommu_arena	arena;
-	void			(*flush_all)(struct iommu *);
+	u32			dma_addr_mask;
 	iopte_t			*page_table;
-	u32			page_table_map_base;
 	unsigned long		iommu_control;
 	unsigned long		iommu_tsbbase;
 	unsigned long		iommu_flush;
@@ -40,7 +40,6 @@ struct iommu {
 	unsigned long		dummy_page_pa;
 	unsigned long		ctx_lowest_free;
 	DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS);
-	u32			dma_addr_mask;
 };
 
 struct strbuf {
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index bfa4d0c..5320689 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <linux/iommu-helper.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #ifdef CONFIG_PCI
 #include <linux/pci.h>
@@ -45,8 +46,9 @@
 			       "i" (ASI_PHYS_BYPASS_EC_E))
 
 /* Must be invoked under the IOMMU lock. */
-static void iommu_flushall(struct iommu *iommu)
+static void iommu_flushall(struct iommu_map_table *iommu_map_table)
 {
+	struct iommu *iommu = container_of(iommu_map_table, struct iommu, tbl);
 	if (iommu->iommu_flushinv) {
 		iommu_write(iommu->iommu_flushinv, ~(u64)0);
 	} else {
@@ -87,94 +89,6 @@ static inline void iopte_make_dummy(struct iommu *iommu, iopte_t *iopte)
 	iopte_val(*iopte) = val;
 }
 
-/* Based almost entirely upon the ppc64 iommu allocator. If you use the 'handle'
- * facility it must all be done in one pass while under the iommu lock.
- *
- * On sun4u platforms, we only flush the IOMMU once every time we've passed
- * over the entire page table doing allocations. Therefore we only ever advance
- * the hint and cannot backtrack it.
- */
-unsigned long iommu_range_alloc(struct device *dev,
-				struct iommu *iommu,
-				unsigned long npages,
-				unsigned long *handle)
-{
-	unsigned long n, end, start, limit, boundary_size;
-	struct iommu_arena *arena = &iommu->arena;
-	int pass = 0;
-
-	/* This allocator was derived from x86_64's bit string search */
-
-	/* Sanity check */
-	if (unlikely(npages == 0)) {
-		if (printk_ratelimit())
-			WARN_ON(1);
-		return DMA_ERROR_CODE;
-	}
-
-	if (handle && *handle)
-		start = *handle;
-	else
-		start = arena->hint;
-
-	limit = arena->limit;
-
-	/* The case below can happen if we have a small segment appended
-	 * to a large, or when the previous alloc was at the very end of
-	 * the available space. If so, go back to the beginning and flush.
-	 */
-	if (start >= limit) {
-		start = 0;
-		if (iommu->flush_all)
-			iommu
Re: [PATCHv9 RFC 1/3] Break up monolithic iommu table/lock into finer granularity pools and lock
On (04/05/15 22:26), Benjamin Herrenschmidt wrote:
> So you decided to keep the logic here that updates the hint instead of
> just getting rid of need_flush altogether ? Out of curiosity, what's
> the rationale ? Did you find a reason why resetting the hint in those
> two cases (rather than just setting start appropriately) is actually
> useful ?

To be honest, I actually did not want to poke at this too much, given
that both ppc and sparc seem to have essentially the same logic. But my
gut feeling is that if the pool->hint hasn't resulted in a successful
area_alloc, we might as well reset it for the next iommu*range_alloc-er,
because it's likely that the next alloc will also need to wrap and search
anyway..

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
One last question before I spin out v9.. the dma_mask code is a bit
confusing to me, so I want to make sure... the code is

	if (limit + tbl->it_offset > mask) {
		limit = mask - tbl->it_offset + 1;
		/* If we're constrained on address range, first try
		 * at the masked hint to avoid O(n) search complexity,
		 * but on second pass, start at 0 in pool 0.
		 */
		if ((start & mask) >= limit || pass > 0) {
			spin_unlock(&(pool->lock));
			pool = &(tbl->pools[0]);
			spin_lock(&(pool->lock));
			start = pool->start;

So here I would need to mark need_flush to true, right?

--Sowmini
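To spell out that fragment with made-up numbers (assuming 4K IOMMU pages,
so a 32-bit device mask corresponds to 0x100000 table entries):

	/*
	 *   mask      = 0xfffff   (highest entry index the device can reach)
	 *   it_offset = 0
	 *   limit     = 0x180000  (this pool extends past the device's reach)
	 *
	 *   limit + it_offset > mask      -> true
	 *   limit = mask - it_offset + 1  -> 0x100000
	 *
	 * i.e. the bitmap search is clamped to entries the device can
	 * address. If even the start hint lies beyond the clamped limit
	 * (or this is a retry pass), the code hops to pool 0 and restarts
	 * at pool->start. Since that moves the search behind the old
	 * hint, it is the "hopped to a new pool and forgot the flush"
	 * case flagged later in this thread: need_flush must be set.
	 */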
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
Just want to confirm:

> > +again:
> > +	if (pass == 0 && handle && *handle &&
> > +	    (*handle >= pool->start) && (*handle < pool->end))
> > +		start = *handle;
> > +	else
> > +		start = pool->hint;
>
> Now this means handle might be < pool->hint, in that case you also need
> a lazy flush. Or rather only if the resulting alloc is.

My:

> > +	} else {
> > +		/* give up */
> > +		n = DMA_ERROR_CODE;
> > +		goto bail;
> > +	}
> > +	}
>
> Here, make this something like:
>
> 	} else if (end < pool->hint)
> 		need_flush = true;

you mean

	} else if (start < pool->hint)

right? (so I'm not missing some corner-case that you are thinking about
here)

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
On (04/04/15 08:06), Benjamin Herrenschmidt wrote:
> No, I meant n < pool->hint, ie, the start of the newly allocated block.

ah, got it. I'll do my drill with patchset and get back, probably by
Monday.

--Sowmini
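In code form, the agreed-upon check is tiny (a sketch of BenH's
suggestion, where n is the entry index returned by the bitmap search):

	/* The search wrapped behind the pool's current hint: the newly
	 * allocated block starts below it, so a flush is needed before
	 * the hardware may safely reuse those entries. */
	if (n < pool->hint)
		need_flush = true;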
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
On (04/03/15 08:57), Benjamin Herrenschmidt wrote:
> > > I only just noticed too, you completely dropped the code to honor
> > > the dma mask. Why that ? Some devices rely on this.
> >
> > /* Sowmini's comment about this coming from sparc origins.. */
>
> Probably, not that many devices have limits on DMA mask but they do
> exist. It becomes more important if we decide to create a very large
> IOMMU window that spans beyond 4G in order to support devices with
> 32-bit DMA masks. Otherwise it's older devices mostly with 32-bit
> masks. In any case, for a generic piece of code, this should be
> supported. Basically, assume that if we have something in the powerpc
> code, we need it; if you remove it, we won't be able to use your code
> generically.

I see. is the mask something that can be stored in the iommu_map_table
as part of the init? I can see that the align_order has to be an
additional arg to iommu_tbl_range_alloc, not sure if mask falls in that
category as well.

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
On (04/03/15 07:54), Benjamin Herrenschmidt wrote:
> > +	limit = pool->end;
> > +
> > +	/* The case below can happen if we have a small segment appended
> > +	 * to a large, or when the previous alloc was at the very end of
> > +	 * the available space. If so, go back to the beginning and flush.
> > +	 */
> > +	if (start >= limit) {
> > +		start = pool->start;
> > +		if (!large_pool && iommu->lazy_flush != NULL)
> > +			iommu->lazy_flush(iommu);
>
> Add need_flush = false;

A few clarifications, while I parse the rest of your comments: Not sure I
follow - need_flush is initialized to true at the start of the function?

> I only just noticed too, you completely dropped the code to honor the
> dma mask. Why that ? Some devices rely on this.

so that's an interesting question: the existing iommu_range_alloc() in
arch/sparc/kernel/iommu.c does not use the mask at all. I based most of
the code on this (except for the lock fragmentation part). I dont know
if this is arch specific.

> > +	if (dev)
> > +		boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
> > +				      1 << iommu->table_shift);
> > +	else
> > +		boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
> > +
> > +	shift = iommu->table_map_base >> iommu->table_shift;
> > +	boundary_size = boundary_size >> iommu->table_shift;
> > +	/*
> > +	 * if the skip_span_boundary_check had been set during init, we set
> > +	 * things up so that iommu_is_span_boundary() merely checks if the
> > +	 * (index + npages) < num_tsb_entries
> > +	 */
> > +	if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
> > +		shift = 0;
> > +		boundary_size = iommu->poolsize * iommu->nr_pools;
> > +	}
> > +	n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
> > +			     boundary_size, 0);
>
> You have completely dropped the alignment support. This will break
> drivers. There are cases (especially with consistent allocations) where
> the driver has alignment constraints on the address; those must be
> preserved.

Again, not sure I follow? are you referring to the IOMMU_NO_SPAN_BOUND
case? That's very specific to LDC (sparc ldoms virtualization infra). The
default is to not have IOMMU_NO_SPAN_BOUND set. For the rest of the
drivers, the code that sets up boundary_size aligns things in the same
way as the ppc code.

--Sowmini
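For concreteness, the boundary arithmetic quoted above works out as
follows with assumed values (table_shift = 13, i.e. 8K IOMMU pages, and
the common 4GB default from dma_get_seg_boundary()):

	/*
	 *   boundary_size = ALIGN(0xffffffff + 1, 1 << 13) = 0x100000000
	 *   boundary_size >>= 13                           = 0x80000 entries
	 *   shift = table_map_base >> 13  (entry offset of DMA address 0)
	 *
	 * iommu_area_alloc() then rejects any candidate range
	 * [n, n + npages) that would cross a 0x80000-entry boundary, i.e.
	 * whose DMA addresses would straddle a 4GB segment boundary.
	 *
	 * In the IOMMU_NO_SPAN_BOUND case (LDC), shift = 0 and
	 * boundary_size covers the whole table, so the span check
	 * degenerates to "(index + npages) < num_tsb_entries", as the
	 * quoted comment says.
	 */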
Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
On (03/31/15 23:12), David Miller wrote:
> It's much more amortized with smart buffering strategies, which are
> common on current generation networking cards. There you only eat one
> map/unmap per PAGE_SIZE / rx_pkt_size.
>
> Maybe the infiniband stuff is doing things very suboptimally, and
> actually with that subsystem and drivers absolutely nothing would
> surprise me.

yeh, we are trying to get more info from them about what their
bottle-necks are. Until then, lets just stick with the spin_lock.

do I need to resubmit this without the RFC tag? Perhaps I should have
dropped that some time ago.

--Sowmini
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
the other question that comes to my mind is: the whole lazy_flush
optimization probably works best when there is exactly one pool, and no
large pools. In most other cases, we'd end up doing a lazy_flush when we
wrap within our pool itself, losing the benefit of that optimization.

Given that the lazy_flush is mostly there to avoid regressions for the
older sun4u architectures (which have other hardware bottlenecks anyway),
and this code is rapidly getting messy, does it make sense to constrain
the lazy_flush check to only apply for the 1-pool, no-large-pool case?
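Expressed as code, the proposed constraint might look like this (a sketch
of the suggestion, not committed code):

	/* Only arm the lazy-flush optimization when a wrapped hint
	 * reliably means "we swept the entire map": a single pool and
	 * no large pool. */
	static bool lazy_flush_applicable(const struct iommu_map_table *tbl)
	{
		return tbl->lazy_flush != NULL &&
		       tbl->nr_pools == 1 &&
		       !(tbl->flags & IOMMU_HAS_LARGE_POOL);
	}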
Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
On 03/31/2015 09:01 PM, Benjamin Herrenschmidt wrote:
> On Tue, 2015-03-31 at 14:06 -0400, Sowmini Varadhan wrote:
> > Having bravely said that.. the IB team informs me that they see a 10%
> > degradation using the spin_lock as opposed to the trylock.
> >
> > one path going forward is to continue processing this patch-set as
> > is. I can investigate this further, and later revise the spin_lock to
> > the trylock, after we are certain that it is good/necessary.
>
> Have they tried using more pools instead ?

we just tried 32 instead of 16, no change to perf. Looks like their
current bottleneck is the find_next_zero_bit (they can get a 2X perf
improvement with the lock fragmentation, but are then hitting a new
ceiling, even with the trylock version)

I'm starting to wonder if some approximation of dma premapped buffers may
be needed. Doing a map/unmap on each packet is expensive.
[PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
Investigation of multithreaded iperf experiments on an ethernet interface
shows the iommu->lock as the hottest lock identified by lockstat, with
something of the order of 21M contentions out of 27M acquisitions, and an
average wait time of 26 us for the lock. This is not efficient.

A more scalable design is to follow the ppc model, where the
iommu_map_table has multiple pools, each stretching over a segment of the
map, and with a separate lock for each pool. This model allows for better
parallelization of the iommu map search.

This patch adds the iommu range alloc/free function infrastructure.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2 changes:
  - incorporate David Miller editorial comments: sparc specific fields
    moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when the
    iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary check,
    for use cases such as LDC
v3: eliminate iommu_sparc, rearrange the ->demap indirection to be
    invoked under the pool lock
v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32
v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru
  - removed ->cookie_to_index and ->demap indirection: the caller should
    invoke these as needed before calling into the generic allocator
v6: Benh/DaveM discussion eliminating iommu_tbl_ops, but retaining the
    flush_all optimization
v7: one-time initialization of pool_hash from iommu_tbl_pool_init()
v8: Benh code review comments

 include/linux/iommu-common.h |   48 +++++
 lib/Makefile                 |    2 +-
 lib/iommu-common.c           |  233 ++++++++++++++++++++++++
 3 files changed, 282 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
new file mode 100644
index 0000000..7b1b761
--- /dev/null
+++ b/include/linux/iommu-common.h
@@ -0,0 +1,48 @@
+#ifndef _LINUX_IOMMU_COMMON_H
+#define _LINUX_IOMMU_COMMON_H
+
+#include <linux/spinlock_types.h>
+#include <linux/device.h>
+#include <asm/page.h>
+
+#define IOMMU_POOL_HASHBITS	4
+#define IOMMU_NR_POOLS		(1 << IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+	unsigned long	start;
+	unsigned long	end;
+	unsigned long	hint;
+	spinlock_t	lock;
+};
+
+struct iommu_map_table {
+	unsigned long		table_map_base;
+	unsigned long		table_shift;
+	unsigned long		nr_pools;
+	void			(*lazy_flush)(struct iommu_map_table *);
+	unsigned long		poolsize;
+	struct iommu_pool	pools[IOMMU_NR_POOLS];
+	u32			flags;
+#define	IOMMU_HAS_LARGE_POOL	0x00000001
+#define	IOMMU_NO_SPAN_BOUND	0x00000002
+	struct iommu_pool	large_pool;
+	unsigned long		*map;
+};
+
+extern void iommu_tbl_pool_init(struct iommu_map_table *iommu,
+				unsigned long num_entries,
+				u32 table_shift,
+				void (*lazy_flush)(struct iommu_map_table *),
+				bool large_pool, u32 npools,
+				bool skip_span_boundary_check);
+
+extern unsigned long iommu_tbl_range_alloc(struct device *dev,
+					   struct iommu_map_table *iommu,
+					   unsigned long npages,
+					   unsigned long *handle);
+
+extern void iommu_tbl_range_free(struct iommu_map_table *iommu,
+				 u64 dma_addr, unsigned long npages,
+				 unsigned long entry);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 58f74d2..60c22e6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,7 +106,7 @@
 obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
-obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
 obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
diff --git a/lib/iommu-common.c b/lib/iommu-common.c
new file mode 100644
index 0000000..35dacbe
--- /dev/null
+++ b/lib/iommu-common.c
@@ -0,0 +1,233 @@
+/*
+ * IOMMU mmap management and range allocation functions.
+ * Based almost entirely upon the powerpc iommu allocator.
+ */
+
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/bug.h>
+#include <linux/iommu-helper.h>
+#include <linux/iommu-common.h>
+#include <linux/dma-mapping.h>
+#include <linux/hash.h>
+
+unsigned long iommu_large_alloc
[PATCH v8 RFC 3/3] sparc: Make LDC use common iommu pool management functions
Note that this conversion is only being done to consolidate the code and
ensure that the common code provides a sufficient abstraction. It is not
expected to result in any noticeable performance improvement, as there is
typically one ldc_iommu per vnet_port, and each one has 8k entries, with a
typical request for 1-4 pages. Thus LDC uses npools == 1.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v3: added this file to be a consumer of the common iommu library
v4: removed ->cookie_to_index and ->demap from iommu_tbl_ops and instead
    inline these calls into ldc before calling into iommu-common
v6: remove iommu_tbl_ops
v7: move pool_hash initialization to iommu_tbl_pool_init
v8: BenH comments

 arch/sparc/kernel/ldc.c |  153 ++++++++++++++++++-----------------
 1 files changed, 65 insertions(+), 88 deletions(-)

diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index 274a9f5..f842512 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -15,6 +15,7 @@
 #include <linux/list.h>
 #include <linux/init.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #include <asm/hypervisor.h>
 #include <asm/iommu.h>
@@ -27,6 +28,10 @@
 #define DRV_MODULE_VERSION	"1.1"
 #define DRV_MODULE_RELDATE	"July 22, 2008"
 
+#define COOKIE_PGSZ_CODE	0xf000000000000000ULL
+#define COOKIE_PGSZ_CODE_SHIFT	60ULL
+
+
 static char version[] =
 	DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
 
 #define LDC_PACKET_SIZE		64
@@ -98,10 +103,10 @@ static const struct ldc_mode_ops stream_ops;
 int ldom_domaining_enabled;
 
 struct ldc_iommu {
-	/* Protects arena alloc/free. */
+	/* Protects ldc_unmap. */
 	spinlock_t		lock;
-	struct iommu_arena	arena;
 	struct ldc_mtable_entry	*page_table;
+	struct iommu_map_table	iommu_map_table;
 };
 
 struct ldc_channel {
@@ -998,31 +1003,59 @@ static void free_queue(unsigned long num_entries, struct ldc_packet *q)
 	free_pages((unsigned long)q, order);
 }
 
+static unsigned long ldc_cookie_to_index(u64 cookie, void *arg)
+{
+	u64 szcode = cookie >> COOKIE_PGSZ_CODE_SHIFT;
+	/* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */
+
+	cookie &= ~COOKIE_PGSZ_CODE;
+
+	return (cookie >> (13ULL + (szcode * 3ULL)));
+}
+
+static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie,
+		      unsigned long entry, unsigned long npages)
+{
+	struct ldc_mtable_entry *base;
+	unsigned long i, shift;
+
+	shift = (cookie >> COOKIE_PGSZ_CODE_SHIFT) * 3;
+	base = iommu->page_table + entry;
+	for (i = 0; i < npages; i++) {
+		if (base->cookie)
+			sun4v_ldc_revoke(id, cookie + (i << shift),
+					 base->cookie);
+		base->mte = 0;
+	}
+}
+
 /* XXX Make this configurable... XXX */
 #define LDC_IOTABLE_SIZE	(8 * 1024)
 
-static int ldc_iommu_init(struct ldc_channel *lp)
+static int ldc_iommu_init(const char *name, struct ldc_channel *lp)
 {
 	unsigned long sz, num_tsb_entries, tsbsize, order;
-	struct ldc_iommu *iommu = &lp->iommu;
+	struct ldc_iommu *ldc_iommu = &lp->iommu;
+	struct iommu_map_table *iommu = &ldc_iommu->iommu_map_table;
 	struct ldc_mtable_entry *table;
 	unsigned long hv_err;
 	int err;
 
 	num_tsb_entries = LDC_IOTABLE_SIZE;
 	tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry);
-
-	spin_lock_init(&iommu->lock);
+	spin_lock_init(&ldc_iommu->lock);
 
 	sz = num_tsb_entries / 8;
 	sz = (sz + 7UL) & ~7UL;
-	iommu->arena.map = kzalloc(sz, GFP_KERNEL);
-	if (!iommu->arena.map) {
+	iommu->map = kzalloc(sz, GFP_KERNEL);
+	if (!iommu->map) {
 		printk(KERN_ERR PFX "Alloc of arena map failed, sz=%lu\n", sz);
 		return -ENOMEM;
 	}
-
-	iommu->arena.limit = num_tsb_entries;
+	iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT,
+			    NULL, false /* no large pool */,
+			    1 /* npools */,
+			    true /* skip span boundary check */);
 
 	order = get_order(tsbsize);
 
@@ -1037,7 +1070,7 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 	memset(table, 0, PAGE_SIZE << order);
 
-	iommu->page_table = table;
+	ldc_iommu->page_table = table;
 
 	hv_err = sun4v_ldc_set_map_table(lp->id, __pa(table),
 					 num_tsb_entries);
@@ -1049,31 +1082,32 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 out_free_table:
 	free_pages((unsigned long) table, order);
-	iommu->page_table = NULL;
+	ldc_iommu->page_table = NULL;
 
 out_free_map:
-	kfree(iommu->arena.map);
-	iommu->arena.map = NULL;
+	kfree(iommu->map);
+	iommu->map = NULL;
 
 	return err;
 }
 
 static
[PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
Addresses BenH comments, with one exception: I've left the
IOMMU_POOL_HASH as is, so that powerpc can tailor it to their
convenience.

I've not heard back from the IB folks, but I'm going to make a judgement
call here and go with the spin_lock. *If* they report some significant
benefit from the trylock, probably need to revisit this (and then
probably start by re-examining the hash function to avoid collisions,
before resorting to trylock).

Sowmini Varadhan (3):
  Break up monolithic iommu table/lock into finer granularity pools
    and lock
  Make sparc64 use scalable lib/iommu-common.c functions
  Make LDC use common iommu pool management functions

 arch/sparc/include/asm/iommu_64.h |    7 +-
 arch/sparc/kernel/iommu.c         |  171 ++++++++-------
 arch/sparc/kernel/iommu_common.h  |    8 --
 arch/sparc/kernel/ldc.c           |  153 ++++++-------
 arch/sparc/kernel/pci_sun4v.c     |  181 ++++++++------
 include/linux/iommu-common.h      |   48 +++++
 lib/Makefile                      |    2 +-
 lib/iommu-common.c                |  233 ++++++++++++++++
 8 files changed, 472 insertions(+), 331 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c
Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
On (03/31/15 15:15), David Laight wrote:
> I've wondered whether the iommu setup for ethernet receive (in
> particular) could be made much more efficient if there were a function
> that would unmap one buffer and map a second buffer? My thought is that
> the iommu pte entry used by the old buffer could just be modified to
> reference the new one. In effect each ring entry would end up using a
> fixed iommu pte.

There are a number of interesting things to investigate in this space,
and the above is just one of them, ways to avoid the overhead of a
full-blown map/unmap on each call. See
http://www.spinics.net/lists/sparclinux/msg13613.html

But the scope of this patchset is actually very rigidly defined: to
refactor the iommu pool/arena allocator into a common library, and avoid
code duplication (today even the single sparc arch duplicates it for
sun4[u,v] and ldc, and that's not even counting the duplication across
other archs/pci-drivers). Investigating ways to provide a generalized
infra that can avoid a dma map/unmap for every packet would be a good
follow-on.

> The other question is how much data can be copied in 26us ? On iommu
> systems 'copybreak' limits on receive and transmit may need to be
> quite high.

where does the 26us number come from? I may be missing that context?

--Sowmini
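For reference, the swap idea being floated might be sketched like this
(pure illustration: hypothetical_iommu_pte_set() and the ring structure
are invented, and nothing like this exists in the patch set):

	struct rx_ring_entry {
		void	*buf;		/* current rx buffer */
		u64	dma_addr;	/* fixed: one IOMMU PTE per slot */
	};

	static void rx_swap_buffer(struct iommu_map_table *tbl,
				   struct rx_ring_entry *slot, void *new_buf)
	{
		unsigned long entry = (slot->dma_addr - tbl->table_map_base)
				      >> tbl->table_shift;

		/* Rewrite the existing PTE to point at the new buffer; the
		 * DMA address in the ring descriptor never changes, so no
		 * allocator round-trip and no unmap/map pair is needed. */
		hypothetical_iommu_pte_set(tbl, entry, __pa(new_buf));
		slot->buf = new_buf;
	}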
Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator
On (03/31/15 10:40), Sowmini Varadhan wrote:
> I've not heard back from the IB folks, but I'm going to make a
> judgement call here and go with the spin_lock. *If* they report some
> significant benefit from the trylock, probably need to revisit this
> (and then probably start by re-examining the hash function to avoid
> collisions, before resorting to trylock).

Having bravely said that.. the IB team informs me that they see a 10%
degradation using the spin_lock as opposed to the trylock.

one path going forward is to continue processing this patch-set as is. I
can investigate this further, and later revise the spin_lock to the
trylock, after we are certain that it is good/necessary.

thoughts?

--Sowmini
[PATCH v8 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions
In iperf experiments running linux as the Tx side (TCP client) with
10 threads, there is a severe performance drop when TSO is disabled,
indicating a weakness in the software that can be avoided by using the
scalable IOMMU arena DMA allocation.

Baseline numbers before this patch:
  with default settings (TSO enabled):       9-9.5 Gbps
  TSO disabled using ethtool - drops badly:  2-3 Gbps

After this patch, an iperf client with 10 threads can give a throughput
of at least 8.5 Gbps, even when TSO is disabled.

Signed-off-by: Sowmini Varadhan <sowmini.varad...@oracle.com>
---
v2: moved sparc specific fields into iommu_sparc
v3: converted all sparc users of iommu, so lot of cleanup and streamlining
v4: David Miller review change:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - reverts pci_impl.h (now that all iommu usage has been converted)
v5: benh/aik feedback modifies the function signatures: pass in modified
    args to iommu_tbl_pool_init() and iommu_tbl_range_free()
v6: removed iommu_tbl_ops. Pass flush_all as function pointer to
    iommu_tbl_pool_init
v7: move pool_hash initialization to iommu_tbl_pool_init()
v8: BenH comments

 arch/sparc/include/asm/iommu_64.h |    7 +-
 arch/sparc/kernel/iommu.c         |  171 +++++++++----------
 arch/sparc/kernel/iommu_common.h  |    8 --
 arch/sparc/kernel/pci_sun4v.c     |  181 +++++++++----------
 4 files changed, 125 insertions(+), 242 deletions(-)

diff --git a/arch/sparc/include/asm/iommu_64.h b/arch/sparc/include/asm/iommu_64.h
index 2b9321a..cd0d69f 100644
--- a/arch/sparc/include/asm/iommu_64.h
+++ b/arch/sparc/include/asm/iommu_64.h
@@ -16,6 +16,7 @@
 #define IOPTE_WRITE   0x0000000000000002UL
 
 #define IOMMU_NUM_CTXS	4096
+#include <linux/iommu-common.h>
 
 struct iommu_arena {
 	unsigned long	*map;
@@ -24,11 +25,10 @@ struct iommu_arena {
 };
 
 struct iommu {
+	struct iommu_map_table	tbl;
 	spinlock_t		lock;
-	struct iommu_arena	arena;
-	void			(*flush_all)(struct iommu *);
+	u32			dma_addr_mask;
 	iopte_t			*page_table;
-	u32			page_table_map_base;
 	unsigned long		iommu_control;
 	unsigned long		iommu_tsbbase;
 	unsigned long		iommu_flush;
@@ -40,7 +40,6 @@ struct iommu {
 	unsigned long		dummy_page_pa;
 	unsigned long		ctx_lowest_free;
 	DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS);
-	u32			dma_addr_mask;
 };
 
 struct strbuf {
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index bfa4d0c..8b76e21 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <linux/iommu-helper.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #ifdef CONFIG_PCI
 #include <linux/pci.h>
@@ -45,8 +46,9 @@
 			       "i" (ASI_PHYS_BYPASS_EC_E))
 
 /* Must be invoked under the IOMMU lock. */
-static void iommu_flushall(struct iommu *iommu)
+static void iommu_flushall(struct iommu_map_table *iommu_map_table)
 {
+	struct iommu *iommu = container_of(iommu_map_table, struct iommu, tbl);
 	if (iommu->iommu_flushinv) {
 		iommu_write(iommu->iommu_flushinv, ~(u64)0);
 	} else {
@@ -87,94 +89,6 @@ static inline void iopte_make_dummy(struct iommu *iommu, iopte_t *iopte)
 	iopte_val(*iopte) = val;
 }
 
-/* Based almost entirely upon the ppc64 iommu allocator. If you use the 'handle'
- * facility it must all be done in one pass while under the iommu lock.
- *
- * On sun4u platforms, we only flush the IOMMU once every time we've passed
- * over the entire page table doing allocations. Therefore we only ever advance
- * the hint and cannot backtrack it.
- */
-unsigned long iommu_range_alloc(struct device *dev,
-				struct iommu *iommu,
-				unsigned long npages,
-				unsigned long *handle)
-{
-	unsigned long n, end, start, limit, boundary_size;
-	struct iommu_arena *arena = &iommu->arena;
-	int pass = 0;
-
-	/* This allocator was derived from x86_64's bit string search */
-
-	/* Sanity check */
-	if (unlikely(npages == 0)) {
-		if (printk_ratelimit())
-			WARN_ON(1);
-		return DMA_ERROR_CODE;
-	}
-
-	if (handle && *handle)
-		start = *handle;
-	else
-		start = arena->hint;
-
-	limit = arena->limit;
-
-	/* The case below can happen if we have a small segment appended
-	 * to a large, or when the previous alloc was at the very end of
-	 * the available space. If so, go back to the beginning and flush.
-	 */
-	if (start >= limit) {
-		start = 0;
-		if (iommu->flush_all)
-			iommu->flush_all(iommu
Re: [PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
On (03/30/15 21:55), Benjamin Herrenschmidt wrote: No that's not my point. The lock is only taken for a short time but might still collide, the bouncing in that case will probably (at least that's my feeling) hurt more than help. However, I have another concern with your construct. Essentially you spin looking for an unlocked pool without a cpu_relax. Now it's unlikely but you *can* end up eating cycles, which on a high SMT like POWER8 might mean slowing down the actual owner of the pool lock. So I tried looking at the code, and perhaps there is some arch-specific subtlety here that I am missing, but where does spin_lock itself do the cpu_relax? afaict, LOCK_CONTENDED() itself does not have this. What is iperf ? What does that mean there was lock contention ? IE, was the overall performance improved or not ? Removing contention by trading it for cache line bouncing will not necessarily help. I'm not saying this is a bad optimisation but it makes the code messy and I think deserves some solid numbers demonstrating its worth. iperf is a network perf benchmark. I'll try to regenerate some numbers today and get some instrumentation of cache-line bounces vs trylock misses. There was also an actual bug in the case where you hop'ed to a new pool and forgot the flush. yes, thanks for catching that, I'll fix that too, of course. --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
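To make the construct being debated concrete, here is a hedged sketch (names assumed, not taken from the patch) of a bounded trylock scan that falls back to a blocking spin_lock(), whose architecture slow path is where cpu_relax() normally lives:

	/* Hypothetical pool selection: one pass of spin_trylock() over
	 * the pools, then block on this CPU's preferred pool. */
	static struct iommu_pool *pick_pool(struct iommu_table *tbl,
					    unsigned int pool_hash)
	{
		unsigned int i, n = tbl->nr_pools;
		unsigned int start = pool_hash % n;
		struct iommu_pool *pool;

		for (i = 0; i < n; i++) {
			pool = &tbl->arena_pool[(start + i) % n];
			if (spin_trylock(&pool->lock))
				return pool;	/* uncontended pool found */
		}
		/* all pools busy: wait on the preferred one instead of
		 * spinning over trylocks without a cpu_relax() */
		pool = &tbl->arena_pool[start];
		spin_lock(&pool->lock);
		return pool;
	}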
Re: [PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
On (03/30/15 14:24), Benjamin Herrenschmidt wrote: + +#define IOMMU_POOL_HASHBITS 4 +#define IOMMU_NR_POOLS (1 << IOMMU_POOL_HASHBITS) I don't like those macros. You changed the value from what we had on powerpc. It could be that the new values are as good for us but I'd like to do a bit differently. Can you make the bits a variable ? Or at least an arch-provided macro which we can change later if needed ? Actually, these are just the upper bound (16) on the number of pools. The actual number is selected by the value passed to iommu_tbl_pool_init(), and is not hard-coded (as was the case with the powerpc code). Thus powerpc can continue to use 4 pools without any loss of generality. : Let's make it clear that this is for allocation of DMA space only, it would thus make my life easier when adapting powerpc to use a different name, something like struct iommu_area works for me, or iommu alloc_region .. whatever you fancy the most. : Why add the 'arena' prefix ? What was wrong with pools in the powerpc implementation ? For the same reason you want to re-baptize iommu_table above: at the time, I was doing it to minimize conflicts with existing usage. But I can rename everything if you like. +#define IOMMU_LARGE_ALLOC 15 Make that a variable, here too, the arch might want to tweak it. I think 15 is actually not a good value for powerpc with 4k iommu pages and 64k PAGE_SIZE, we should make the above some kind of factor of PAGE_SHIFT - iommu_page_shift. Ok. + /* Sanity check */ + if (unlikely(npages == 0)) { + printk_ratelimited("npages == 0\n"); You removed the WARN_ON here which had the nice property of giving you a backtrace which points to the offender. The above message alone is useless. Yes, the original code was generating checkpatch warnings and errors. That's why I removed it. I am not certain about the first unlocked pool... We take the lock for a fairly short amount of time, I have the gut feeling that the cache line bouncing introduced by looking at a different pool may well cost more than waiting a bit longer. Did you do some measurements of that optimisation ? If you are really only taking it for a short amount of time, then the trylock should just succeed, so there is no cache line bouncing. But yes, I did instrument it with iperf, and there was lock contention on the spinlock, which was eliminated by the trylock. I'll fix the rest of the variable-naming etc. nits and send out a new version later today. --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
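As a worked example of the point that IOMMU_NR_POOLS is only an upper bound: under the v7 signature shown later in this digest, a powerpc-style caller keeping its traditional 4 pools would simply pass npools = 4 at init time. The values below are illustrative, not from any posted patch:

	/* hypothetical caller: the 16-pool macro is just the array
	 * bound; the arch chooses the live pool count here */
	iommu_tbl_pool_init(&tbl,
			    num_entries,
			    12,		/* page_table_shift, e.g. 4K pages */
			    NULL,	/* no lazy flush_all on this arch */
			    false,	/* no large pool */
			    4,		/* npools: powerpc's historic count */
			    false);	/* keep span-boundary checking */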
Re: [PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
On (03/31/15 08:28), Benjamin Herrenschmidt wrote: Provided that the IB test doesn't come up with a significant difference, I definitely vote for the simpler version of doing a normal spin_lock. Sounds good. Let me wait for the confirmation from IB, and I'll send out patch v8 soon after. FWIW, I'm ok with all the other comments; thanks for the feedback. --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
On (03/30/15 09:01), Sowmini Varadhan wrote: So I tried looking at the code, and perhaps there is some arch-specific subtlety here that I am missing, but where does spin_lock itself do the cpu_relax? afaict, LOCK_CONTENDED() itself does not have this. To answer my own question: I'd missed CONFIG_LOCK_STAT (which David Ahern pointed out to me); the above is only true for the LOCK_STAT case. In any case, I ran some experiments today: I was running iperf [http://en.wikipedia.org/wiki/Iperf] over ixgbe, which is where I'd noticed the original perf issues for sparc. I was running iperf2 (which is more aggressively threaded than iperf3) with 8, 10, 16, 20 threads, and with TSO turned off. In each case, I was making sure that I was able to reach 9.X Gbps (this is a 10 Gbps link). I don't see any significant difference in the perf profile between the spin_trylock and the spin_lock version (other than, of course, the change to the lock contention for the trylock version). I looked at the perf-profiled cache misses (works out to about 1400M for 10 threads, with or without the trylock). I'm still waiting for some of the IB folks to try out the spin_lock version (they had also seen some significant perf improvements from breaking down the monolithic lock into multiple pools, so their workload is also sensitive to this). But as such, it looks like it doesn't matter much whether you use the trylock to find the first available pool, or block on the spin_lock. I'll let folks on this list vote on this one (assuming the IB tests also come out without a significant variation between the two locking choices). --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: [PATCH v7 0/3] Generic IOMMU pooled allocator
On (03/26/15 08:05), Benjamin Herrenschmidt wrote: PowerPC folks, what do you think? I'll give it another look today. Cheers, Ben. Hi Ben, did you have a chance to look at this? --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Generic IOMMU pooled allocator
On (03/25/15 21:43), casca...@linux.vnet.ibm.com wrote: However, when using large TCP send/recv (I used uperf with 64KB writes/reads), I noticed that on the transmit side, largealloc is not used, but on the receive side, cxgb4 almost only uses largealloc, while qlge seems to have a 1/1 usage of largealloc/non-largealloc mappings. When turning GRO off, that ratio is closer to 1/10, meaning there is still some fair use of largealloc in that scenario. I confess my experiments are not complete. I would like to test a couple of other drivers as well, including mlx4_en and bnx2x, and test with small packet sizes. I suspected that MTU size could make a difference, but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I didn't notice any significant hit of largepool with either qlge or cxgb4. I guess we also need to consider the average use-case, i.e., something that interleaves small packets and interactive data with jumbo/bulk data... in those cases, the largepool would not get many hits, and might actually be undesirable? But I believe that on the receive side, all drivers should map entire pages, using some allocation strategy similar to mlx4_en, in order to avoid DMA mapping all the time. Good point. I think in the early phase of my perf investigation, it was brought up that Solaris does pre-mapped DMA buffers (they have to do this carefully, to avoid resource-starvation vulnerabilities; see http://www.spinics.net/lists/sparclinux/msg13217.html and the threads leading to it). This is not something that the common iommu-arena allocator can/should get involved in, of course; the scope of the arena allocator is much more rigorously defined. I don't know if there is a way to set up a generalized DMA-premapped buffer infrastructure for linux today. FWIW, when I instrumented this for Solaris (there are hooks to disable the pre-mapped buffers), the impact on a T5-2 (8 sockets, 2 numa nodes, 64 cpus) was not very significant for a single 10G ixgbe port: approx 8 Gbps instead of 9.X Gbps. I think the DMA buffer pre-mapping is only significant when you start trying to scale to multiple ethernet ports. --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v7 RFC 3/3] sparc: Make LDC use common iommu pool management functions
Note that this conversion is only being done to consolidate the code and ensure that the common code provides sufficient abstraction. It is not expected to result in any noticeable performance improvement, as there is typically one ldc_iommu per vnet_port, and each one has 8k entries, with a typical request for 1-4 pages. Thus LDC uses npools == 1. Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com --- v3: added this file to be a consumer of the common iommu library v4: removed ->cookie_to_index and ->demap from iommu_tbl_ops and instead inline these calls into ldc before calling into iommu-common v6: remove iommu_tbl_ops v7: move pool_hash initialization to iommu_tbl_pool_init arch/sparc/kernel/ldc.c | 152 --- 1 files changed, 64 insertions(+), 88 deletions(-) diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c index 274a9f5..e858968 100644 --- a/arch/sparc/kernel/ldc.c +++ b/arch/sparc/kernel/ldc.c @@ -15,6 +15,7 @@ #include <linux/list.h> #include <linux/init.h> #include <linux/bitmap.h> +#include <linux/iommu-common.h> #include <asm/hypervisor.h> #include <asm/iommu.h> @@ -27,6 +28,10 @@ #define DRV_MODULE_VERSION "1.1" #define DRV_MODULE_RELDATE "July 22, 2008" +#define COOKIE_PGSZ_CODE 0xf000000000000000ULL +#define COOKIE_PGSZ_CODE_SHIFT 60ULL + + static char version[] = DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n"; #define LDC_PACKET_SIZE 64 @@ -98,10 +103,10 @@ static const struct ldc_mode_ops stream_ops; int ldom_domaining_enabled; struct ldc_iommu { - /* Protects arena alloc/free. */ + /* Protects ldc_unmap. */ spinlock_t lock; - struct iommu_arena arena; struct ldc_mtable_entry *page_table; + struct iommu_table iommu_table; }; struct ldc_channel { @@ -998,31 +1003,59 @@ static void free_queue(unsigned long num_entries, struct ldc_packet *q) free_pages((unsigned long)q, order); } +static unsigned long ldc_cookie_to_index(u64 cookie, void *arg) +{ + u64 szcode = cookie >> COOKIE_PGSZ_CODE_SHIFT; + /* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */ + + cookie &= ~COOKIE_PGSZ_CODE; + + return (cookie >> (13ULL + (szcode * 3ULL))); +} + +static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie, + unsigned long entry, unsigned long npages) +{ + struct ldc_mtable_entry *base; + unsigned long i, shift; + + shift = (cookie >> COOKIE_PGSZ_CODE_SHIFT) * 3; + base = iommu->page_table + entry; + for (i = 0; i < npages; i++) { + if (base->cookie) + sun4v_ldc_revoke(id, cookie + (i << shift), + base->cookie); + base->mte = 0; + base++; + } +} + /* XXX Make this configurable...
XXX */ #define LDC_IOTABLE_SIZE (8 * 1024) -static int ldc_iommu_init(struct ldc_channel *lp) +static int ldc_iommu_init(const char *name, struct ldc_channel *lp) { unsigned long sz, num_tsb_entries, tsbsize, order; - struct ldc_iommu *iommu = &lp->iommu; + struct ldc_iommu *ldc_iommu = &lp->iommu; + struct iommu_table *iommu = &ldc_iommu->iommu_table; struct ldc_mtable_entry *table; unsigned long hv_err; int err; num_tsb_entries = LDC_IOTABLE_SIZE; tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry); - - spin_lock_init(&iommu->lock); + spin_lock_init(&ldc_iommu->lock); sz = num_tsb_entries / 8; sz = (sz + 7UL) & ~7UL; - iommu->arena.map = kzalloc(sz, GFP_KERNEL); - if (!iommu->arena.map) { + iommu->map = kzalloc(sz, GFP_KERNEL); + if (!iommu->map) { printk(KERN_ERR PFX "Alloc of arena map failed, sz=%lu\n", sz); return -ENOMEM; } - - iommu->arena.limit = num_tsb_entries; + iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT, + NULL, false /* no large pool */, + 1 /* npools */, + true /* skip span boundary check */); order = get_order(tsbsize); @@ -1037,7 +1070,7 @@ static int ldc_iommu_init(struct ldc_channel *lp) memset(table, 0, PAGE_SIZE << order); - iommu->page_table = table; + ldc_iommu->page_table = table; hv_err = sun4v_ldc_set_map_table(lp->id, __pa(table), num_tsb_entries); @@ -1049,31 +1082,32 @@ out_free_table: free_pages((unsigned long) table, order); - iommu->page_table = NULL; + ldc_iommu->page_table = NULL; out_free_map: - kfree(iommu->arena.map); - iommu->arena.map = NULL; + kfree(iommu->map); + iommu->map = NULL; return err; } static void ldc_iommu_release(struct
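With ->cookie_to_index and ->demap gone from the ops structure, the unmap path has to sequence those steps itself before returning the range to the generic allocator. A sketch of that ordering, consistent with the helpers in the patch above (the wrapper name is hypothetical):

	/* hypothetical free path: decode the cookie into a table entry,
	 * revoke the HV mappings, then release the range to the pool */
	static void example_ldc_free(struct ldc_iommu *ldc_iommu,
				     unsigned long id, u64 cookie,
				     unsigned long npages)
	{
		unsigned long entry = ldc_cookie_to_index(cookie, ldc_iommu);

		ldc_demap(ldc_iommu, id, cookie, entry, npages);
		iommu_tbl_range_free(&ldc_iommu->iommu_table, cookie,
				     npages, entry);
	}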
[PATCH v7 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions
In iperf experiments running linux as the Tx side (TCP client) with 10 threads results in a severe performance drop when TSO is disabled, indicating a weakness in the software that can be avoided by using the scalable IOMMU arena DMA allocation. Baseline numbers before this patch: with default settings (TSO enabled) :9-9.5 Gbps Disable TSO using ethtool- drops badly: 2-3 Gbps. After this patch, iperf client with 10 threads, can give a throughput of at least 8.5 Gbps, even when TSO is disabled. Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com --- v2: moved sparc specific fileds into iommu_sparc v3: converted all sparc users of iommu, so lot of cleanup and streamlining v4: David Miller review change: - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE - reverts pci_impl.h (now that all iommu usage has been converted) v5: benh/aik feedback modifies the function signatures: pass in modified args to iommmu_tbl_pool_init() and iommu_tbl_range_free() v6: removed iommu_tbl_ops. Pass flush_all as function pointer to iommu_tbl_pool_init v7: move pool_hash initialization to iommu_tbl_pool_init() arch/sparc/include/asm/iommu_64.h |7 +- arch/sparc/kernel/iommu.c | 174 +--- arch/sparc/kernel/iommu_common.h |8 -- arch/sparc/kernel/pci_sun4v.c | 179 4 files changed, 127 insertions(+), 241 deletions(-) diff --git a/arch/sparc/include/asm/iommu_64.h b/arch/sparc/include/asm/iommu_64.h index 2b9321a..e3cd449 100644 --- a/arch/sparc/include/asm/iommu_64.h +++ b/arch/sparc/include/asm/iommu_64.h @@ -16,6 +16,7 @@ #define IOPTE_WRITE 0x0002UL #define IOMMU_NUM_CTXS 4096 +#include linux/iommu-common.h struct iommu_arena { unsigned long *map; @@ -24,11 +25,10 @@ struct iommu_arena { }; struct iommu { + struct iommu_table tbl; spinlock_t lock; - struct iommu_arena arena; - void(*flush_all)(struct iommu *); + u32 dma_addr_mask; iopte_t *page_table; - u32 page_table_map_base; unsigned long iommu_control; unsigned long iommu_tsbbase; unsigned long iommu_flush; @@ -40,7 +40,6 @@ struct iommu { unsigned long dummy_page_pa; unsigned long ctx_lowest_free; DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS); - u32 dma_addr_mask; }; struct strbuf { diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c index bfa4d0c..f7fdff2 100644 --- a/arch/sparc/kernel/iommu.c +++ b/arch/sparc/kernel/iommu.c @@ -13,6 +13,7 @@ #include linux/errno.h #include linux/iommu-helper.h #include linux/bitmap.h +#include linux/iommu-common.h #ifdef CONFIG_PCI #include linux/pci.h @@ -45,8 +46,9 @@ i (ASI_PHYS_BYPASS_EC_E)) /* Must be invoked under the IOMMU lock. */ -static void iommu_flushall(struct iommu *iommu) +static void iommu_flushall(struct iommu_table *iommu_table) { + struct iommu *iommu = container_of(iommu_table, struct iommu, tbl); if (iommu-iommu_flushinv) { iommu_write(iommu-iommu_flushinv, ~(u64)0); } else { @@ -87,94 +89,6 @@ static inline void iopte_make_dummy(struct iommu *iommu, iopte_t *iopte) iopte_val(*iopte) = val; } -/* Based almost entirely upon the ppc64 iommu allocator. If you use the 'handle' - * facility it must all be done in one pass while under the iommu lock. - * - * On sun4u platforms, we only flush the IOMMU once every time we've passed - * over the entire page table doing allocations. Therefore we only ever advance - * the hint and cannot backtrack it. 
- */ -unsigned long iommu_range_alloc(struct device *dev, - struct iommu *iommu, - unsigned long npages, - unsigned long *handle) -{ - unsigned long n, end, start, limit, boundary_size; - struct iommu_arena *arena = iommu-arena; - int pass = 0; - - /* This allocator was derived from x86_64's bit string search */ - - /* Sanity check */ - if (unlikely(npages == 0)) { - if (printk_ratelimit()) - WARN_ON(1); - return DMA_ERROR_CODE; - } - - if (handle *handle) - start = *handle; - else - start = arena-hint; - - limit = arena-limit; - - /* The case below can happen if we have a small segment appended -* to a large, or when the previous alloc was at the very end of -* the available space. If so, go back to the beginning and flush. -*/ - if (start = limit) { - start = 0; - if (iommu-flush_all) - iommu-flush_all(iommu); - } - - again
[PATCH v7 0/3] Generic IOMMU pooled allocator
Changes from patch v6: moved pool_hash initialization to lib/iommu-common.c and cleaned up code duplication from sun4v/sun4u/ldc. Sowmini (2): Break up monolithic iommu table/lock into finer granularity pools and lock Make sparc64 use scalable lib/iommu-common.c functions Sowmini Varadhan (1): Make LDC use common iommu pool management functions arch/sparc/include/asm/iommu_64.h |7 +- arch/sparc/kernel/iommu.c | 174 +++ arch/sparc/kernel/iommu_common.h |8 -- arch/sparc/kernel/ldc.c | 152 ++-- arch/sparc/kernel/pci_sun4v.c | 179 + include/linux/iommu-common.h | 48 lib/Makefile |2 +- lib/iommu-common.c| 235 + 8 files changed, 475 insertions(+), 330 deletions(-) create mode 100644 include/linux/iommu-common.h create mode 100644 lib/iommu-common.c ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
Investigation of multithreaded iperf experiments on an ethernet interface shows the iommu->lock as the hottest lock identified by lockstat, with something of the order of 21M contentions out of 27M acquisitions, and an average wait time of 26 us for the lock. This is not efficient. A more scalable design is to follow the ppc model, where the iommu_table has multiple pools, each stretching over a segment of the map, and with a separate lock for each pool. This model allows for better parallelization of the iommu map search. This patch adds the iommu range alloc/free function infrastructure. Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com --- v2 changes: - incorporate David Miller editorial comments: sparc specific fields moved from iommu-common into sparc's iommu_64.h - make the npools value an input parameter, for the case when the iommu map size is not very large - cookie_to_index mapping, and optimizations for span-boundary check, for use case such as LDC. v3: eliminate iommu_sparc, rearrange the ->demap indirection to be invoked under the pool lock. v4: David Miller review changes: - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE - page_table_map_base and page_table_shift are unsigned long, not u32. v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru - removed ->cookie_to_index and ->demap indirection: caller should invoke these as needed before calling into the generic allocator v6: Benh/DaveM discussion eliminating iommu_tbl_ops, but retaining flush_all optimization. v7: one-time initialization of pool_hash from iommu_tbl_pool_init() include/linux/iommu-common.h | 48 + lib/Makefile |2 +- lib/iommu-common.c | 235 ++ 3 files changed, 284 insertions(+), 1 deletions(-) create mode 100644 include/linux/iommu-common.h create mode 100644 lib/iommu-common.c diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h new file mode 100644 index 000..197111b --- /dev/null +++ b/include/linux/iommu-common.h @@ -0,0 +1,48 @@ +#ifndef _LINUX_IOMMU_COMMON_H +#define _LINUX_IOMMU_COMMON_H + +#include <linux/spinlock_types.h> +#include <linux/device.h> +#include <asm/page.h> + +#define IOMMU_POOL_HASHBITS 4 +#define IOMMU_NR_POOLS (1 << IOMMU_POOL_HASHBITS) + +struct iommu_pool { + unsigned long start; + unsigned long end; + unsigned long hint; + spinlock_t lock; +}; + +struct iommu_table { + unsigned long page_table_map_base; + unsigned long page_table_shift; + unsigned long nr_pools; + void (*flush_all)(struct iommu_table *); + unsigned long poolsize; + struct iommu_pool arena_pool[IOMMU_NR_POOLS]; + u32 flags; +#define IOMMU_HAS_LARGE_POOL 0x00000001 +#define IOMMU_NO_SPAN_BOUND 0x00000002 + struct iommu_pool large_pool; + unsigned long *map; +}; + +extern void iommu_tbl_pool_init(struct iommu_table *iommu, + unsigned long num_entries, + u32 page_table_shift, + void (*flush_all)(struct iommu_table *), + bool large_pool, u32 npools, + bool skip_span_boundary_check); + +extern unsigned long iommu_tbl_range_alloc(struct device *dev, + struct iommu_table *iommu, + unsigned long npages, + unsigned long *handle); + +extern void iommu_tbl_range_free(struct iommu_table *iommu, + u64 dma_addr, unsigned long npages, + unsigned long entry); + +#endif diff --git a/lib/Makefile b/lib/Makefile index 3c3b30b..0ea2ac6 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -102,7 +102,7 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o obj-$(CONFIG_SWIOTLB) += swiotlb.o -obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o +obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o diff --git a/lib/iommu-common.c b/lib/iommu-common.c new file mode 100644 index 000..bb7e706 --- /dev/null +++ b/lib/iommu-common.c @@ -0,0 +1,235 @@ +/* + * IOMMU mmap management and range allocation functions. + * Based almost entirely upon the powerpc iommu allocator. + */ + +#include <linux/export.h> +#include <linux/bitmap.h> +#include <linux/bug.h> +#include <linux/iommu-helper.h> +#include <linux/iommu-common.h> +#include <linux/dma-mapping.h> +#include <linux/hash.h> + +#define IOMMU_LARGE_ALLOC 15 + +static DEFINE_PER_CPU(unsigned int
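On the arch side, the converted map path reduces to a thin wrapper over the library call. A hedged sketch under the v7 signature above (the function and variable names here are illustrative, not a hunk from the patch; DMA_ERROR_CODE and IO_PAGE_SHIFT are the sparc-side constants):

	/* sketch: allocate a range, then scale the entry number into a
	 * bus address using the table's base and page shift */
	static u64 example_map(struct device *dev, struct iommu *iommu,
			       unsigned long npages)
	{
		unsigned long entry;

		entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages, NULL);
		if (unlikely(entry == DMA_ERROR_CODE))
			return DMA_ERROR_CODE;

		return iommu->tbl.page_table_map_base +
		       (entry << IO_PAGE_SHIFT);
	}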
Re: [PATCH v6 0/3] Generic IOMMU pooled allocator
On (03/24/15 18:16), David Miller wrote: Generally this looks fine to me. But about patch #2, I see no reason to have multiple iommu_pool_hash tables. Even from a purely sparc perspective, we can always just do with just one of them. Furthermore, you can even probably move it down into lib/iommu-common.c itself. iommu_tbl_pool_init() can do the one time initialization. fixed in v7. Ben, Alexey, do you need more time to review this? --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
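What DaveM is asking for amounts to hoisting the one-time per-cpu hash setup into the library; the helper below mirrors the setup_*_pool_hash() code visible in the v6 patches further down this digest, relocated into lib/iommu-common.c:

	static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);

	/* one-time init, now run from iommu_tbl_pool_init() itself so
	 * sun4u/sun4v/ldc stop carrying private copies */
	static void setup_iommu_pool_hash(void)
	{
		static bool do_once;
		unsigned int i;

		if (do_once)
			return;
		do_once = true;
		for_each_possible_cpu(i)
			per_cpu(iommu_pool_hash, i) =
				hash_32(i, IOMMU_POOL_HASHBITS);
	}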
[PATCH v6 RFC 3/3] sparc: Make LDC use common iommu pool management functions
Note that this conversion is only being done to consolidate the code and ensure that the common code provides the sufficient abstraction. It is not expected to result in any noticeable performance improvement, as there is typically one ldc_iommu per vnet_port, and each one has 8k entries, with a typical request for 1-4 pages. Thus LDC uses npools == 1. Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com --- v3: added this file to be a consumer of the common iommu library v4: removed -cookie_to_index and -demap from iommu_tbl_ops and instead inline these calls into ldc before calling into iommu-common v6: remove iommu_tbl_ops arch/sparc/kernel/ldc.c | 169 ++ 1 files changed, 81 insertions(+), 88 deletions(-) diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c index 274a9f5..c6b1bb1 100644 --- a/arch/sparc/kernel/ldc.c +++ b/arch/sparc/kernel/ldc.c @@ -15,6 +15,8 @@ #include linux/list.h #include linux/init.h #include linux/bitmap.h +#include linux/hash.h +#include linux/iommu-common.h #include asm/hypervisor.h #include asm/iommu.h @@ -27,6 +29,11 @@ #define DRV_MODULE_VERSION 1.1 #define DRV_MODULE_RELDATE July 22, 2008 +#define COOKIE_PGSZ_CODE 0xf000ULL +#define COOKIE_PGSZ_CODE_SHIFT 60ULL + +static DEFINE_PER_CPU(unsigned int, ldc_pool_hash); + static char version[] = DRV_MODULE_NAME .c:v DRV_MODULE_VERSION ( DRV_MODULE_RELDATE )\n; #define LDC_PACKET_SIZE64 @@ -98,10 +105,10 @@ static const struct ldc_mode_ops stream_ops; int ldom_domaining_enabled; struct ldc_iommu { - /* Protects arena alloc/free. */ + /* Protects ldc_unmap. */ spinlock_t lock; - struct iommu_arena arena; struct ldc_mtable_entry *page_table; + struct iommu_table iommu_table; }; struct ldc_channel { @@ -998,31 +1005,73 @@ static void free_queue(unsigned long num_entries, struct ldc_packet *q) free_pages((unsigned long)q, order); } +static unsigned long ldc_cookie_to_index(u64 cookie, void *arg) +{ + u64 szcode = cookie COOKIE_PGSZ_CODE_SHIFT; + /* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */ + + cookie = ~COOKIE_PGSZ_CODE; + + return (cookie (13ULL + (szcode * 3ULL))); +} + +static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie, + unsigned long entry, unsigned long npages) +{ + struct ldc_mtable_entry *base; + unsigned long i, shift; + + shift = (cookie COOKIE_PGSZ_CODE_SHIFT) * 3; + base = iommu-page_table + entry; + for (i = 0; i npages; i++) { + if (base-cookie) + sun4v_ldc_revoke(id, cookie + (i shift), +base-cookie); + base-mte = 0; + } +} + /* XXX Make this configurable... 
XXX */ #define LDC_IOTABLE_SIZE (8 * 1024) -static int ldc_iommu_init(struct ldc_channel *lp) +static void setup_ldc_pool_hash(void) +{ + unsigned int i; + static bool do_once; + + if (do_once) + return; + do_once = true; + for_each_possible_cpu(i) + per_cpu(ldc_pool_hash, i) = hash_32(i, IOMMU_POOL_HASHBITS); +} + + +static int ldc_iommu_init(const char *name, struct ldc_channel *lp) { unsigned long sz, num_tsb_entries, tsbsize, order; - struct ldc_iommu *iommu = lp-iommu; + struct ldc_iommu *ldc_iommu = lp-iommu; + struct iommu_table *iommu = ldc_iommu-iommu_table; struct ldc_mtable_entry *table; unsigned long hv_err; int err; num_tsb_entries = LDC_IOTABLE_SIZE; tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry); - - spin_lock_init(iommu-lock); + setup_ldc_pool_hash(); + spin_lock_init(ldc_iommu-lock); sz = num_tsb_entries / 8; sz = (sz + 7UL) ~7UL; - iommu-arena.map = kzalloc(sz, GFP_KERNEL); - if (!iommu-arena.map) { + iommu-map = kzalloc(sz, GFP_KERNEL); + if (!iommu-map) { printk(KERN_ERR PFX Alloc of arena map failed, sz=%lu\n, sz); return -ENOMEM; } - - iommu-arena.limit = num_tsb_entries; + iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT, + NULL, false /* no large pool */, + 1 /* npools */, + true /* skip span boundary check */); order = get_order(tsbsize); @@ -1037,7 +1086,7 @@ static int ldc_iommu_init(struct ldc_channel *lp) memset(table, 0, PAGE_SIZE order); - iommu-page_table = table; + ldc_iommu-page_table = table; hv_err = sun4v_ldc_set_map_table(lp-id, __pa(table), num_tsb_entries); @@ -1049,31 +1098,32 @@ static int ldc_iommu_init(struct ldc_channel *lp
[PATCH v6 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
Investigation of multithreaded iperf experiments on an ethernet interface show the iommu-lock as the hottest lock identified by lockstat, with something of the order of 21M contentions out of 27M acquisitions, and an average wait time of 26 us for the lock. This is not efficient. A more scalable design is to follow the ppc model, where the iommu_table has multiple pools, each stretching over a segment of the map, and with a separate lock for each pool. This model allows for better parallelization of the iommu map search. This patch adds the iommu range alloc/free function infrastructure. Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com --- v2 changes: - incorporate David Miller editorial comments: sparc specific fields moved from iommu-common into sparc's iommu_64.h - make the npools value an input parameter, for the case when the iommu map size is not very large - cookie_to_index mapping, and optimizations for span-boundary check, for use case such as LDC. v3: eliminate iommu_sparc, rearrange the -demap indirection to be invoked under the pool lock. v4: David Miller review changes: - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE - page_table_map_base and page_table_shift are unsigned long, not u32. v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru - removed -cookie_to_index and -demap indirection: caller should invoke these as needed before calling into the generic allocator v6: Benh/DaveM discussion eliminationg iommu_tbl_ops, but retaining flush_all optimization. include/linux/iommu-common.h | 49 ++ lib/Makefile |2 +- lib/iommu-common.c | 219 ++ 3 files changed, 269 insertions(+), 1 deletions(-) create mode 100644 include/linux/iommu-common.h create mode 100644 lib/iommu-common.c diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h new file mode 100644 index 000..266a169 --- /dev/null +++ b/include/linux/iommu-common.h @@ -0,0 +1,49 @@ +#ifndef _LINUX_IOMMU_COMMON_H +#define _LINUX_IOMMU_COMMON_H + +#include linux/spinlock_types.h +#include linux/device.h +#include asm/page.h + +#define IOMMU_POOL_HASHBITS 4 +#define IOMMU_NR_POOLS (1 IOMMU_POOL_HASHBITS) + +struct iommu_pool { + unsigned long start; + unsigned long end; + unsigned long hint; + spinlock_t lock; +}; + +struct iommu_table { + unsigned long page_table_map_base; + unsigned long page_table_shift; + unsigned long nr_pools; + void(*flush_all)(struct iommu_table *); + unsigned long poolsize; + struct iommu_pool arena_pool[IOMMU_NR_POOLS]; + u32 flags; +#defineIOMMU_HAS_LARGE_POOL0x0001 +#defineIOMMU_NO_SPAN_BOUND 0x0002 + struct iommu_pool large_pool; + unsigned long *map; +}; + +extern void iommu_tbl_pool_init(struct iommu_table *iommu, + unsigned long num_entries, + u32 page_table_shift, + void (*flush_all)(struct iommu_table *), + bool large_pool, u32 npools, + bool skip_span_boundary_check); + +extern unsigned long iommu_tbl_range_alloc(struct device *dev, + struct iommu_table *iommu, + unsigned long npages, + unsigned long *handle, + unsigned int pool_hash); + +extern void iommu_tbl_range_free(struct iommu_table *iommu, +u64 dma_addr, unsigned long npages, +unsigned long entry); + +#endif diff --git a/lib/Makefile b/lib/Makefile index 3c3b30b..0ea2ac6 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -102,7 +102,7 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o obj-$(CONFIG_SWIOTLB) += swiotlb.o -obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o +obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o 
obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o diff --git a/lib/iommu-common.c b/lib/iommu-common.c new file mode 100644 index 000..86a5bca --- /dev/null +++ b/lib/iommu-common.c @@ -0,0 +1,219 @@ +/* + * IOMMU mmap management and range allocation functions. + * Based almost entirely upon the powerpc iommu allocator. + */ + +#include linux/export.h +#include linux/bitmap.h +#include linux/bug.h +#include linux/iommu-helper.h +#include linux/iommu-common.h +#include linux/dma-mapping.h + +#define IOMMU_LARGE_ALLOC 15 + +/* + * Initialize iommu_pool entries for the iommu_table
[PATCH v6 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions
In iperf experiments running linux as the Tx side (TCP client) with 10 threads results in a severe performance drop when TSO is disabled, indicating a weakness in the software that can be avoided by using the scalable IOMMU arena DMA allocation. Baseline numbers before this patch: with default settings (TSO enabled) :9-9.5 Gbps Disable TSO using ethtool- drops badly: 2-3 Gbps. After this patch, iperf client with 10 threads, can give a throughput of at least 8.5 Gbps, even when TSO is disabled. Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com --- v2: moved sparc specific fileds into iommu_sparc v3: converted all sparc users of iommu, so lot of cleanup and streamlining v4: David Miller review change: - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE - reverts pci_impl.h (now that all iommu usage has been converted) v5: benh/aik feedback modifies the function signatures: pass in modified args to iommmu_tbl_pool_init() and iommu_tbl_range_free() v6: removed iommu_tbl_ops. Pass flush_all as function pointer to iommu_tbl_pool_init arch/sparc/include/asm/iommu_64.h |7 +- arch/sparc/kernel/iommu.c | 182 +++ arch/sparc/kernel/iommu_common.h |8 -- arch/sparc/kernel/pci_sun4v.c | 192 ++--- 4 files changed, 153 insertions(+), 236 deletions(-) diff --git a/arch/sparc/include/asm/iommu_64.h b/arch/sparc/include/asm/iommu_64.h index 2b9321a..e3cd449 100644 --- a/arch/sparc/include/asm/iommu_64.h +++ b/arch/sparc/include/asm/iommu_64.h @@ -16,6 +16,7 @@ #define IOPTE_WRITE 0x0002UL #define IOMMU_NUM_CTXS 4096 +#include linux/iommu-common.h struct iommu_arena { unsigned long *map; @@ -24,11 +25,10 @@ struct iommu_arena { }; struct iommu { + struct iommu_table tbl; spinlock_t lock; - struct iommu_arena arena; - void(*flush_all)(struct iommu *); + u32 dma_addr_mask; iopte_t *page_table; - u32 page_table_map_base; unsigned long iommu_control; unsigned long iommu_tsbbase; unsigned long iommu_flush; @@ -40,7 +40,6 @@ struct iommu { unsigned long dummy_page_pa; unsigned long ctx_lowest_free; DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS); - u32 dma_addr_mask; }; struct strbuf { diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c index bfa4d0c..9886186 100644 --- a/arch/sparc/kernel/iommu.c +++ b/arch/sparc/kernel/iommu.c @@ -13,11 +13,15 @@ #include linux/errno.h #include linux/iommu-helper.h #include linux/bitmap.h +#include linux/hash.h +#include linux/iommu-common.h #ifdef CONFIG_PCI #include linux/pci.h #endif +static DEFINE_PER_CPU(unsigned int, iommu_pool_hash); + #include asm/iommu.h #include iommu_common.h @@ -45,8 +49,9 @@ i (ASI_PHYS_BYPASS_EC_E)) /* Must be invoked under the IOMMU lock. */ -static void iommu_flushall(struct iommu *iommu) +static void iommu_flushall(struct iommu_table *iommu_table) { + struct iommu *iommu = container_of(iommu_table, struct iommu, tbl); if (iommu-iommu_flushinv) { iommu_write(iommu-iommu_flushinv, ~(u64)0); } else { @@ -87,93 +92,18 @@ static inline void iopte_make_dummy(struct iommu *iommu, iopte_t *iopte) iopte_val(*iopte) = val; } -/* Based almost entirely upon the ppc64 iommu allocator. If you use the 'handle' - * facility it must all be done in one pass while under the iommu lock. - * - * On sun4u platforms, we only flush the IOMMU once every time we've passed - * over the entire page table doing allocations. Therefore we only ever advance - * the hint and cannot backtrack it. 
- */ -unsigned long iommu_range_alloc(struct device *dev, - struct iommu *iommu, - unsigned long npages, - unsigned long *handle) +static void setup_iommu_pool_hash(void) { - unsigned long n, end, start, limit, boundary_size; - struct iommu_arena *arena = iommu-arena; - int pass = 0; - - /* This allocator was derived from x86_64's bit string search */ - - /* Sanity check */ - if (unlikely(npages == 0)) { - if (printk_ratelimit()) - WARN_ON(1); - return DMA_ERROR_CODE; - } - - if (handle *handle) - start = *handle; - else - start = arena-hint; - - limit = arena-limit; - - /* The case below can happen if we have a small segment appended -* to a large, or when the previous alloc was at the very end of -* the available space. If so, go back to the beginning and flush. -*/ - if (start = limit) { - start
[PATCH v6 0/3] Generic IOMMU pooled allocator
Deltas from patch v5: - removed iommu_tbl_ops, and instead pass the ->flush_all as an indirection to iommu_tbl_pool_init() - only invoke ->flush_all when there is no large_pool, based on the assumption that large-pool usage is infrequently encountered. Sowmini (2): Break up monolithic iommu table/lock into finer granularity pools and lock Make sparc64 use scalable lib/iommu-common.c functions Sowmini Varadhan (1): Make LDC use common iommu pool management functions arch/sparc/include/asm/iommu_64.h |7 +- arch/sparc/kernel/iommu.c | 182 ++- arch/sparc/kernel/iommu_common.h |8 -- arch/sparc/kernel/ldc.c | 169 ++--- arch/sparc/kernel/pci_sun4v.c | 192 - include/linux/iommu-common.h | 49 lib/Makefile |2 +- lib/iommu-common.c| 219 + 8 files changed, 503 insertions(+), 325 deletions(-) create mode 100644 include/linux/iommu-common.h create mode 100644 lib/iommu-common.c ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Generic IOMMU pooled allocator
On (03/23/15 12:29), David Miller wrote: In order to elide the IOMMU flush as much as possible, I implemented a scheme for sun4u wherein we always allocated from low IOMMU addresses to high IOMMU addresses. In this regime, we only need to flush the IOMMU when we rolled over back to low IOMMU addresses during an allocation. It made a noticeable difference in performance. Unfortunately, with sun4v and the hypervisor, I'm not allowed to control when the IOMMU flush happens, it has to occur on every single IOMMU mapping change. So this optimization was no longer possible there. Anyways, that's the history behind it. -- I see. If it was only an optimization (i.e., removing it would not break any functionality), and if this was done for older hardware, and *if* we believe that the direction of most architectures is to follow the sun4v/HV model, then, given that the sun4u code only uses 1 arena pool anyway, one thought that I have for refactoring this is the following: - the caller of iommu_tbl_range_alloc() can do the flush_all if they see start >= end for the one single pool - lose the other ->flush_all invocation (i.e., the one that is done when iommu_area_alloc() fails for pass == 0, and we reset start to 0 to roll back) That would avoid the need for any iommu_tbl_ops in my patch-set. But it would imply that you would still take the perf hit for the roll-back if we failed the pass == 0 iteration through iommu_area_alloc(). Perhaps this is an acceptable compromise in favor of cleaner code (again, assuming that current/future archs will all follow the HV-based design). --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
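Spelled out in code, the proposal is a single-pool allocator that flushes only on wrap-around. A hedged sketch with assumed names, using the v5 ops->reset hook (see the v5 patch later in this digest) as the flush indirection; the caller treats a return of pool->end or beyond as failure:

	/* sketch only: one pool, lazy flush on roll-over */
	static unsigned long example_alloc(struct iommu_table *tbl,
					   unsigned long npages)
	{
		struct iommu_pool *pool = &tbl->arena_pool[0];
		unsigned long n;

		spin_lock(&pool->lock);
		n = bitmap_find_next_zero_area(tbl->map, pool->end,
					       pool->hint, npages, 0);
		if (n >= pool->end) {
			/* wrapped: flush once, retry from the bottom */
			if (tbl->iommu_tbl_ops && tbl->iommu_tbl_ops->reset)
				tbl->iommu_tbl_ops->reset(tbl);
			n = bitmap_find_next_zero_area(tbl->map, pool->end,
						       pool->start, npages, 0);
		}
		if (n < pool->end) {
			bitmap_set(tbl->map, n, npages);
			pool->hint = n + npages;
		}
		spin_unlock(&pool->lock);
		return n;
	}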
Re: Generic IOMMU pooled allocator
On (03/23/15 15:05), David Miller wrote: Why add performance regressions to old machines that are already suffering too much from all the bloat we are constantly adding to the kernel? I have no personal opinion on this; it's a matter of choosing whether we want to have some extra baggage in the library to support old hardware, or keep the code lean-and-mean for the future, where the model has changed. I'll let others on this list make the call. --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Generic IOMMU pooled allocator
On (03/24/15 11:47), Benjamin Herrenschmidt wrote: Yes, pass a function pointer argument that can be NULL or just make it a member of the iommu_allocator struct (or whatever you call it) passed to the init function and that can be NULL. My point is we don't need a separate ops structure. Ok, I'll make this a function pointer passed to the iommu_tbl_pool_init() function (tomorrow). FWIW, the ops struct came out as a result of DaveM's input, http://www.spinics.net/lists/sparclinux/msg13232.html albeit the original context is now moot. Pass it to init and stash it in the table but don't call it iommu_table, let's use a name that conveys better the fact that this is purely a DMA space allocator (to be embedded by the arch specific iommu table). Something like iommu_alloc_zone or whatever you want to call it. I keep finding new names whenever I think of it :-) Please pick a name that you like, as this mooshes all the patches, and I myself really don't care what it gets called (rose? :-)) as long as fat-fingering risk is minimized. Regarding the relationship with largepool: But that might not be necessary. If indeed we very rarely use the large pool, then just make it flush always. My feeling is that it will only ever be used at driver init/remove time when allocating things like descriptor rings, where the flush overhead doesn't matter. Ok, will factor this in tomorrow (assuming DaveM has no objections to anything proposed here, esp. around the function pointer for flushall). --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
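In code, the shape BenH is describing (and what the v6 patches above adopt) is just a callback argument stashed by init. A hedged example for a sun4u-style caller, with NULL on platforms like sun4v where the hypervisor flushes on every mapping change anyway:

	/* hypothetical sun4u call: the lazy flush is a plain function
	 * pointer handed to init, no ops structure needed */
	iommu_tbl_pool_init(&iommu->tbl, num_tsb_entries, IO_PAGE_SHIFT,
			    iommu_flushall,	/* NULL on sun4v */
			    false,		/* no large pool */
			    1,			/* npools */
			    false);		/* keep span-boundary check */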
Re: Generic IOMMU pooled allocator
On (03/24/15 09:21), Benjamin Herrenschmidt wrote: So we have two choices here that I can see: - Keep that old platform use the old/simpler allocator Problem with that approach is that the base struct iommu structure for sparc gets a split personality: the older one is used with the older allocator, and other ugly things ensue (alternatively, you end up duplicating a version of the code with the flush_all inlined). - Try to regain the bulk of that benefit with the new one Sowmini, I see various options for the second choice. We could stick to 1 pool, and basically do as before, ie, if we fail on the first pass of alloc, it means we wrap around and do a flush, I don't think that will cause a significant degradation from today, do you ? We might have an occasional additional flush but I would expect it to be in the noise. Isn't this essentially what I have in patch v5 here: http://www.spinics.net/lists/sparclinux/msg13534.html (the ops->reset is the flushall indirection; it can be renamed if the latter is preferred)? Dave, what's your feeling there ? Does anybody around still have some HW that we can test with ? I actually tested this on a V440 and an Ultra 45 (had a heck of a time finding these, since the owners keep them turned off because they are too noisy and consume too much power :-). Thus, while I have no opinion, I would not shed any tears if we lost this extra perf tweak in the interest of being earth-friendly :-)) so testing it is not a problem, though I don't have any perf benchmarks for them either. Sowmini, I think we can still kill the ops and have a separate data structure exclusively concerned by allocations by having the alloc functions take the lazy flush function as an argument (which can be NULL), I don't think we should bother with ops. I don't quite follow what you have in mind: the caller would somehow have to specify a flush_all indirection for the legacy platforms. Also, you mention: "You must hold the lock until you do the flush, otherwise somebody else might allocate the not-yet-flushed areas and try to use them... kaboom." However if that's the only callback left, pass it as an argument. Passing a function pointer for the flushall to iommu_tbl_range_alloc would work, or we could pass it in as an arg to iommu_tbl_init, and stash it in the struct iommu_table. --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Generic IOMMU pooled allocator
On (03/24/15 09:36), Benjamin Herrenschmidt wrote: - One pool only - Whenever the allocation is before the previous hint, do a flush, that should only happen if a wrap around occurred or in some cases if the device DMA mask forced it. I think we always update the hint whenever we successfully allocate from the small pools. Right, I believe this is close to what I discussed in the previous thread, and approximately what I have in patch v5, except that the flush indirection can be passed as a function pointer, or via the table_ops. - Deal with the largealloc case. That's the contentious issue, see below. : The largealloc issue is a different can of worms. We can try adding an option to disable the largealloc business completely (instead of hard-wiring the 15, make that a variable and define that 0 means no largealloc pool). What I've tried to do is to have a bool large_pool arg passed to iommu_tbl_pool_init. In my observation (instrumented for scsi, ixgbe), we never allocate more than 4 pages at a time, so I pass in large_pool == false for all the sparc platforms. Or we can decide that large allocs are rare (typically pci_alloc_consistent, ie, driver init time), and thus always flush on them (or rather on free of a large chunk). David, what's your take there ? I have a feeling that should work fine without a noticeable performance issue... I would also keep a dirty flag set on any free and cleared on any flush to avoid more spurious flushes, but here too the benefit might be in the noise. ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
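The largealloc plumbing under discussion boils down to a small routing decision at the top of the allocator; a sketch with assumed names, treating the hard-wired 15 as the threshold and honoring the large_pool init flag:

	/* sketch: route big requests to the single large pool when one
	 * was configured, everything else to a hashed small pool */
	static struct iommu_pool *select_pool(struct iommu_table *tbl,
					      unsigned long npages,
					      unsigned int pool_hash)
	{
		if ((tbl->flags & IOMMU_HAS_LARGE_POOL) &&
		    npages > IOMMU_LARGE_ALLOC)
			return &tbl->large_pool;

		return &tbl->arena_pool[pool_hash % tbl->nr_pools];
	}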
Re: Generic IOMMU pooled allocator
benh> It might be sufficient to add a flush counter and compare it between runs benh> if actual wall-clock benchmarks are too hard to do (especially if you benh> don't have things like very fast network cards at hand). benh> benh> Number of flush / number of packets might be a sufficient metric, it.. I was just going to say: I can add those counters tomorrow, and get some stats, but it seems like it doesn't really matter what the outcome is, because: On (03/23/15 21:44), David Miller wrote: From: Benjamin Herrenschmidt b...@kernel.crashing.org Date: Tue, 24 Mar 2015 09:21:05 +1100 Dave, what's your feeling there ? Does anybody around still have some HW that we can test with ? I don't see what the actual problem is. Even if you use multiple pools, which we should for scalability on sun4u too, just do the flush when allocation in _any_ pool wraps around. That's still better than not doing the optimization at all. That is always going to be correct, and you can use a separate spinlock to make sure only one thread of control does the full IOMMU flush at a time. --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Generic IOMMU pooled allocator
On (03/23/15 09:02), Benjamin Herrenschmidt wrote: How does this relate to the ARM implementation? There is currently an effort going on to make that one shared with ARM64 and possibly x86. Has anyone looked at both the PowerPC and ARM ways of doing the allocation to see if we could pick one of the two to work on all architectures? What I see in ARM is horribly complex, I can't quite make sense of it in a couple of minutes of looking at it, and doesn't seem to address the basic issue we are addressing here which is the splitting of the iommu table lock. Amen to that... I thought it was just me :-) I plan to go through the code to see if/where the ARM iommu code does its locking and achieves its parallelism, but the mapping between the sparc/powerpc approach and ARM is not immediately obvious to me. --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
Re: Generic IOMMU pooled allocator
It turned out that I was able to iterate over it, and remove both the ->cookie_to_index and the ->demap indirection from iommu_tbl_ops. That leaves only the odd iommu_flushall() hook; I'm trying to find the history behind that (needed for sun4u platforms, afaik, and not sure if there are other ways to achieve this). Just sent out a new patch-set, which should reach all the people/lists on this thread. If it's possible, it would also help to share any preliminary code samples with the iommu_table_ops you folks have in mind. Please take a look at patch v5, and share your thoughts. --Sowmini ___ Linuxppc-dev mailing list Linuxppc-dev@lists.ozlabs.org https://lists.ozlabs.org/listinfo/linuxppc-dev
[PATCH v5 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock
Investigation of multithreaded iperf experiments on an ethernet interface show the iommu-lock as the hottest lock identified by lockstat, with something of the order of 21M contentions out of 27M acquisitions, and an average wait time of 26 us for the lock. This is not efficient. A more scalable design is to follow the ppc model, where the iommu_table has multiple pools, each stretching over a segment of the map, and with a separate lock for each pool. This model allows for better parallelization of the iommu map search. This patch adds the iommu range alloc/free function infrastructure. Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com --- v2 changes: - incorporate David Miller editorial comments: sparc specific fields moved from iommu-common into sparc's iommu_64.h - make the npools value an input parameter, for the case when the iommu map size is not very large - cookie_to_index mapping, and optimizations for span-boundary check, for use case such as LDC. v3: eliminate iommu_sparc, rearrange the -demap indirection to be invoked under the pool lock. v4: David Miller review changes: - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE - page_table_map_base and page_table_shift are unsigned long, not u32. v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru - removed -cookie_to_index and -demap indirection: caller should invoke these as needed before calling into the generic allocator include/linux/iommu-common.h | 55 +++ lib/Makefile |2 +- lib/iommu-common.c | 219 ++ 3 files changed, 275 insertions(+), 1 deletions(-) create mode 100644 include/linux/iommu-common.h create mode 100644 lib/iommu-common.c diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h new file mode 100644 index 000..50753cf --- /dev/null +++ b/include/linux/iommu-common.h @@ -0,0 +1,55 @@ +#ifndef _LINUX_IOMMU_COMMON_H +#define _LINUX_IOMMU_COMMON_H + +#include linux/spinlock_types.h +#include linux/device.h +#include asm/page.h + +#define IOMMU_POOL_HASHBITS 4 +#define IOMMU_NR_POOLS (1 IOMMU_POOL_HASHBITS) + +struct iommu_pool { + unsigned long start; + unsigned long end; + unsigned long hint; + spinlock_t lock; +}; + +struct iommu_table; + +struct iommu_tbl_ops { + void(*reset)(struct iommu_table *); +}; + +struct iommu_table { + unsigned long page_table_map_base; + unsigned long page_table_shift; + unsigned long nr_pools; + const struct iommu_tbl_ops *iommu_tbl_ops; + unsigned long poolsize; + struct iommu_pool arena_pool[IOMMU_NR_POOLS]; + u32 flags; +#defineIOMMU_HAS_LARGE_POOL0x0001 +#defineIOMMU_NO_SPAN_BOUND 0x0002 + struct iommu_pool large_pool; + unsigned long *map; +}; + +extern void iommu_tbl_pool_init(struct iommu_table *iommu, + unsigned long num_entries, + u32 page_table_shift, + const struct iommu_tbl_ops *iommu_tbl_ops, + bool large_pool, u32 npools, + bool skip_span_boundary_check); + +extern unsigned long iommu_tbl_range_alloc(struct device *dev, + struct iommu_table *iommu, + unsigned long npages, + unsigned long *handle, + unsigned int pool_hash); + +extern void iommu_tbl_range_free(struct iommu_table *iommu, +u64 dma_addr, unsigned long npages, +unsigned long entry); + +#endif diff --git a/lib/Makefile b/lib/Makefile index 3c3b30b..0ea2ac6 100644 --- a/lib/Makefile +++ b/lib/Makefile @@ -102,7 +102,7 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o obj-$(CONFIG_SWIOTLB) += swiotlb.o -obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o +obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o obj-$(CONFIG_FAULT_INJECTION) += 
fault-inject.o obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o diff --git a/lib/iommu-common.c b/lib/iommu-common.c new file mode 100644 index 000..f416191 --- /dev/null +++ b/lib/iommu-common.c @@ -0,0 +1,219 @@ +/* + * IOMMU mmap management and range allocation functions. + * Based almost entirely upon the powerpc iommu allocator. + */ + +#include linux/export.h +#include linux/bitmap.h +#include linux/bug.h +#include linux/iommu-helper.h +#include linux/iommu-common.h +#include linux/dma-mapping.h + +#define IOMMU_LARGE_ALLOC 15 + +/* + * Initialize iommu_pool entries for the iommu_table
[PATCH v5 RFC 3/3] sparc: Make LDC use common iommu pool management functions
Note that this conversion is only being done to consolidate the code and ensure that the common code provides the sufficient abstraction. It is not expected to result in any noticeable performance improvement, as there is typically one ldc_iommu per vnet_port, and each one has 8k entries, with a typical request for 1-4 pages. Thus LDC uses npools == 1. Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com --- v3: added this file to be a consumer of the common iommu library v4: removed -cookie_to_index and -demap from iommu_tbl_ops and instead inline these calls into ldc before calling into iommu-common arch/sparc/kernel/ldc.c | 172 +++ 1 files changed, 84 insertions(+), 88 deletions(-) diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c index 274a9f5..3210f81 100644 --- a/arch/sparc/kernel/ldc.c +++ b/arch/sparc/kernel/ldc.c @@ -15,6 +15,8 @@ #include linux/list.h #include linux/init.h #include linux/bitmap.h +#include linux/hash.h +#include linux/iommu-common.h #include asm/hypervisor.h #include asm/iommu.h @@ -27,6 +29,11 @@ #define DRV_MODULE_VERSION 1.1 #define DRV_MODULE_RELDATE July 22, 2008 +#define COOKIE_PGSZ_CODE 0xf000ULL +#define COOKIE_PGSZ_CODE_SHIFT 60ULL + +static DEFINE_PER_CPU(unsigned int, ldc_pool_hash); + static char version[] = DRV_MODULE_NAME .c:v DRV_MODULE_VERSION ( DRV_MODULE_RELDATE )\n; #define LDC_PACKET_SIZE64 @@ -98,10 +105,10 @@ static const struct ldc_mode_ops stream_ops; int ldom_domaining_enabled; struct ldc_iommu { - /* Protects arena alloc/free. */ + /* Protects ldc_unmap. */ spinlock_t lock; - struct iommu_arena arena; struct ldc_mtable_entry *page_table; + struct iommu_table iommu_table; }; struct ldc_channel { @@ -998,31 +1005,76 @@ static void free_queue(unsigned long num_entries, struct ldc_packet *q) free_pages((unsigned long)q, order); } +static unsigned long ldc_cookie_to_index(u64 cookie, void *arg) +{ + u64 szcode = cookie COOKIE_PGSZ_CODE_SHIFT; + /* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */ + + cookie = ~COOKIE_PGSZ_CODE; + + return (cookie (13ULL + (szcode * 3ULL))); +} + +static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie, + unsigned long entry, unsigned long npages) +{ + struct ldc_mtable_entry *base; + unsigned long i, shift; + + shift = (cookie COOKIE_PGSZ_CODE_SHIFT) * 3; + base = iommu-page_table + entry; + for (i = 0; i npages; i++) { + if (base-cookie) + sun4v_ldc_revoke(id, cookie + (i shift), +base-cookie); + base-mte = 0; + } +} + /* XXX Make this configurable... 
XXX */ #define LDC_IOTABLE_SIZE (8 * 1024) -static int ldc_iommu_init(struct ldc_channel *lp) +struct iommu_tbl_ops ldc_iommu_ops; + +static void setup_ldc_pool_hash(void) +{ + unsigned int i; + static bool do_once; + + if (do_once) + return; + do_once = true; + for_each_possible_cpu(i) + per_cpu(ldc_pool_hash, i) = hash_32(i, IOMMU_POOL_HASHBITS); +} + + +static int ldc_iommu_init(const char *name, struct ldc_channel *lp) { unsigned long sz, num_tsb_entries, tsbsize, order; - struct ldc_iommu *iommu = lp-iommu; + struct ldc_iommu *ldc_iommu = lp-iommu; + struct iommu_table *iommu = ldc_iommu-iommu_table; struct ldc_mtable_entry *table; unsigned long hv_err; int err; num_tsb_entries = LDC_IOTABLE_SIZE; tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry); - - spin_lock_init(iommu-lock); + setup_ldc_pool_hash(); + spin_lock_init(ldc_iommu-lock); sz = num_tsb_entries / 8; sz = (sz + 7UL) ~7UL; - iommu-arena.map = kzalloc(sz, GFP_KERNEL); - if (!iommu-arena.map) { + iommu-map = kzalloc(sz, GFP_KERNEL); + if (!iommu-map) { printk(KERN_ERR PFX Alloc of arena map failed, sz=%lu\n, sz); return -ENOMEM; } - - iommu-arena.limit = num_tsb_entries; + iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT, + ldc_iommu_ops, + false /* no large pool */, + 1 /* npools */, + true /* skip span boundary check */); order = get_order(tsbsize); @@ -1037,7 +1089,7 @@ static int ldc_iommu_init(struct ldc_channel *lp) memset(table, 0, PAGE_SIZE order); - iommu-page_table = table; + ldc_iommu-page_table = table; hv_err = sun4v_ldc_set_map_table(lp-id, __pa(table), num_tsb_entries); @@ -1049,31 +1101,32
[PATCH v5 RFC 0/3] Generic IOMMU pooled allocator
Follows up on the feedback in the thread at http://www.spinics.net/lists/sparclinux/msg13493.html

- removed the ->cookie_to_index and ->demap indirection from the iommu_tbl_ops. The caller needs to call these functions as needed, before invoking the generic arena allocator functions (a sketch of the resulting caller pattern follows the diffstat below).
- added the skip_span_boundary argument to iommu_tbl_pool_init() for those callers, like LDC, which do not care about span boundary checks.

Sowmini (2):
  Break up monolithic iommu table/lock into finer granularity pools and lock
  Make sparc64 use scalable lib/iommu-common.c functions

Sowmini Varadhan (1):
  Make LDC use common iommu pool management functions

 arch/sparc/include/asm/iommu_64.h |    7 +-
 arch/sparc/kernel/iommu.c         |  187 +++-
 arch/sparc/kernel/iommu_common.h  |    8 --
 arch/sparc/kernel/ldc.c           |  172 ++---
 arch/sparc/kernel/pci_sun4v.c     |  194 -
 include/linux/iommu-common.h      |   55 +
 lib/Makefile                      |    2 +-
 lib/iommu-common.c                |  219 +
 8 files changed, 519 insertions(+), 325 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c
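To make the first bullet concrete, here is a hedged sketch of the unmap path a consumer ends up with once ->cookie_to_index and ->demap are gone from the ops. The helper names are borrowed from the LDC patch in this series, but the exact signatures below are illustrative rather than verbatim.

	/* Illustrative only: helper names from the LDC patch, signatures hedged. */
	static void ldc_unmap_sketch(struct ldc_iommu *ldc_iommu, unsigned long id,
				     u64 cookie, unsigned long npages)
	{
		/* 1. recover the table index; formerly the ->cookie_to_index hook */
		unsigned long entry = ldc_cookie_to_index(cookie, ldc_iommu);

		/* 2. revoke the hypervisor mapping first; formerly the ->demap hook */
		ldc_demap(ldc_iommu, id, cookie, entry, npages);

		/* 3. only then return the range to the generic arena */
		iommu_tbl_range_free(&ldc_iommu->iommu_table, cookie, npages, entry);
	}

The point of the change is that none of this runs under the pool lock inside the library; the library only ever locks around its own bitmap.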
[PATCH v5 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions
Running iperf with linux as the Tx side (TCP client) and 10 threads shows a severe performance drop when TSO is disabled, indicating a weakness in the software that can be avoided by using the scalable IOMMU arena DMA allocation.

Baseline numbers before this patch:
  with default settings (TSO enabled): 9-9.5 Gbps
  with TSO disabled via ethtool, throughput drops badly: 2-3 Gbps

After this patch, the iperf client with 10 threads can deliver a throughput of at least 8.5 Gbps, even when TSO is disabled.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v2: moved sparc specific fields into iommu_sparc
v3: converted all sparc users of iommu, so a lot of cleanup and streamlining
v4: David Miller review change:
    - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
    - reverts pci_impl.h (now that all iommu usage has been converted)
v5: benh/aik feedback modifies the function signatures: pass in modified
    args to iommu_tbl_pool_init() and iommu_tbl_range_free()

 arch/sparc/include/asm/iommu_64.h |    7 +-
 arch/sparc/kernel/iommu.c         |  187 
 arch/sparc/kernel/iommu_common.h  |    8 --
 arch/sparc/kernel/pci_sun4v.c     |  194 ++---
 4 files changed, 160 insertions(+), 236 deletions(-)

diff --git a/arch/sparc/include/asm/iommu_64.h b/arch/sparc/include/asm/iommu_64.h
index 2b9321a..e3cd449 100644
--- a/arch/sparc/include/asm/iommu_64.h
+++ b/arch/sparc/include/asm/iommu_64.h
@@ -16,6 +16,7 @@
 #define IOPTE_WRITE   0x0000000000000002UL
 
 #define IOMMU_NUM_CTXS	4096
+#include <linux/iommu-common.h>
 
 struct iommu_arena {
 	unsigned long *map;
@@ -24,11 +25,10 @@ struct iommu_arena {
 };
 
 struct iommu {
+	struct iommu_table	tbl;
 	spinlock_t		lock;
-	struct iommu_arena	arena;
-	void			(*flush_all)(struct iommu *);
+	u32			dma_addr_mask;
 	iopte_t			*page_table;
-	u32			page_table_map_base;
 	unsigned long		iommu_control;
 	unsigned long		iommu_tsbbase;
 	unsigned long		iommu_flush;
@@ -40,7 +40,6 @@ struct iommu {
 	unsigned long		dummy_page_pa;
 	unsigned long		ctx_lowest_free;
 	DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS);
-	u32			dma_addr_mask;
 };
 
 struct strbuf {
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index bfa4d0c..71a76d3 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -13,11 +13,15 @@
 #include <linux/errno.h>
 #include <linux/iommu-helper.h>
 #include <linux/bitmap.h>
+#include <linux/hash.h>
+#include <linux/iommu-common.h>
 
 #ifdef CONFIG_PCI
 #include <linux/pci.h>
 #endif
 
+static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
+
 #include <asm/iommu.h>
 
 #include "iommu_common.h"
@@ -45,8 +49,9 @@
 	  "i" (ASI_PHYS_BYPASS_EC_E))
 
 /* Must be invoked under the IOMMU lock. */
-static void iommu_flushall(struct iommu *iommu)
+static void iommu_flushall(struct iommu_table *iommu_table)
 {
+	struct iommu *iommu = container_of(iommu_table, struct iommu, tbl);
 	if (iommu->iommu_flushinv) {
 		iommu_write(iommu->iommu_flushinv, ~(u64)0);
 	} else {
@@ -87,94 +92,23 @@ static inline void iopte_make_dummy(struct iommu *iommu, iopte_t *iopte)
 	iopte_val(*iopte) = val;
 }
 
-/* Based almost entirely upon the ppc64 iommu allocator. If you use the 'handle'
- * facility it must all be done in one pass while under the iommu lock.
- *
- * On sun4u platforms, we only flush the IOMMU once every time we've passed
- * over the entire page table doing allocations. Therefore we only ever advance
- * the hint and cannot backtrack it.
- */
-unsigned long iommu_range_alloc(struct device *dev,
-				struct iommu *iommu,
-				unsigned long npages,
-				unsigned long *handle)
-{
-	unsigned long n, end, start, limit, boundary_size;
-	struct iommu_arena *arena = &iommu->arena;
-	int pass = 0;
-
-	/* This allocator was derived from x86_64's bit string search */
-
-	/* Sanity check */
-	if (unlikely(npages == 0)) {
-		if (printk_ratelimit())
-			WARN_ON(1);
-		return DMA_ERROR_CODE;
-	}
-
-	if (handle && *handle)
-		start = *handle;
-	else
-		start = arena->hint;
-
-	limit = arena->limit;
-
-	/* The case below can happen if we have a small segment appended
-	 * to a large, or when the previous alloc was at the very end of
-	 * the available space. If so, go back to the beginning and flush.
-	 */
-	if (start >= limit) {
-		start = 0;
-		if (iommu->flush_all)
-			iommu->flush_all(iommu);
-	}
-
-      again:
-
-	if (dev
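The per-CPU iommu_pool_hash added at the top of iommu.c is what spreads allocating CPUs across the pool locks. The user-space model below illustrates the idea only; it assumes a multiplicative hash of the same shape as the kernel's hash_32(), and none of the names are from the patch.

	#include <stdio.h>
	#include <stdint.h>

	#define IOMMU_POOL_HASHBITS	4
	#define IOMMU_NR_POOLS		(1 << IOMMU_POOL_HASHBITS)

	/* multiplicative hash of the same shape as the kernel's hash_32() */
	static unsigned int pool_hash(uint32_t cpu)
	{
		return (uint32_t)(cpu * 0x9e370001u) >> (32 - IOMMU_POOL_HASHBITS);
	}

	int main(void)
	{
		unsigned int cpu;

		/* each CPU is hashed once at init time (cf. iommu_pool_hash);
		 * two CPUs mapping concurrently then usually contend on
		 * different pool locks instead of one global iommu lock */
		for (cpu = 0; cpu < 8; cpu++)
			printf("cpu %u -> pool %u of %d\n",
			       cpu, pool_hash(cpu), IOMMU_NR_POOLS);
		return 0;
	}

This is the design that replaces the single hot lock seen in the lockstat data: contention now only occurs between CPUs that hash to the same pool.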
Re: Generic IOMMU pooled allocator
On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:

Ben> One thing I noticed is the asymmetry in your code between the alloc
Ben> and the free path. The alloc path is similar to us in that the lock
Ben> covers the allocation and that's about it, there's no actual mapping to
Ben> the HW done, it's done by the caller level right ?

Yes, the only constraint is that the h/w alloc transaction should be done after the arena-alloc, whereas for the unmap, the h/w transaction should happen first, and the arena-unmap should happen after.

Ben> The free path however, in your case, takes the lock and calls back into
Ben> demap (which I assume is what removes the translation from the HW)
Ben> with the lock held. There's also some mapping between cookies
Ben> and index which here too isn't exposed to the alloc side but is
Ben> exposed to the free side.

Regarding the ->demap indirection: somewhere between V1 and V2, I realized that, at least for sun4v, it's not necessary to hold the pool lock when doing the unmap (V1 had originally passed this as the ->demap). Revisiting the LDC change, I think that even that has no pool-specific info that needs to be passed in, so possibly the ->demap is not required at all? I can remove that and re-verify the LDC code (though I might not be able to get to this till early next week, as I'm travelling at the moment).

About the ->cookie_to_index, that came out of observation of the LDC code (ldc_cookie_to_index in patchset 3). In the LDC case, the workflow is approximately:

    base = alloc_npages(..);  /* calls iommu_tbl_range_alloc */
    /* set up cookie_state using base */
    /* populate cookies, calling fill_cookies() -> make_cookie() */

The make_cookie() is the inverse operation of cookie_to_index() (afaict; the code is not very well commented, I'm afraid), but I need that indirection to figure out which bitmap to clear. I don't know if there's a better way to do this, or if ->cookie_to_index can get more complex for other IOMMU users.

Ben> One thing that Alexey is doing on our side is to move some of the
Ben> hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
Ben> global ppc_md. data structure to a new iommu_table_ops, so your patches
Ben> will definitely collide with our current work so we'll have to figure
Ben> out how things can be made to match. We might be able to move more than
Ben> just the allocator to the generic code, and the whole implementation of
Ben> map_sg/unmap_sg if we have the right set of ops, unless you see a
Ben> reason why that wouldn't work for you ?

I can't think of why that won't work, though it would help to see the patch itself..

Ben> We also need to add some additional platform specific fields to certain
Ben> iommu table instances to deal with some KVM related tracking of pinned
Ben> DMAble memory, here too we might have to be creative and possibly
Ben> enclose the generic iommu_table in a platform specific variant.
Ben>
Ben> Alexey, any other comment ?
Ben>
Ben> Cheers,
Ben> Ben.

On (03/19/15 16:27), Alexey Kardashevskiy wrote:

Alexey> Agree about missing symmetry. In general, I would call it zoned
Alexey> pool-locked memory allocator-ish rather than iommu_table and have
Alexey> no callbacks there.
Alexey> The iommu_tbl_range_free() caller could call cookie_to_index() and

Problem is that tbl_range_free itself needs the `entry' from ->cookie_to_index.. don't know if there's a way to move the code to avoid that..

Alexey> what the reset() callback does here - I do not understand, some

The ->reset callback came out of the sun4u use-case. Davem might have more history here than I do, but my understanding is that the iommu_flushall() was needed on the older sun4u architectures, where there was no intermediating HV?

Alexey> documentation would help here, and demap() does not have to be
Alexey> executed under the lock (as map() is not executed under the lock).
Alexey>
Alexey> btw why demap, not unmap? :)

Maybe neither is needed, as you folks made me realize above.

--Sowmini
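Summing up the locking contract this thread converges on, in one hedged kernel-style sketch: hw_map() and hw_unmap() are placeholder names for the caller-owned hypervisor/hardware operations (they are not functions from this series), and the allocator signatures follow the v10 header shown earlier.

	/* map side: reserve entries under the pool lock, program HW afterwards */
	static u64 map_sketch(struct device *dev, struct iommu_map_table *tbl,
			      unsigned long npages)
	{
		unsigned long entry;

		entry = iommu_tbl_range_alloc(dev, tbl, npages, NULL, ~0UL, 0);
		if (entry == DMA_ERROR_CODE)
			return DMA_ERROR_CODE;
		return hw_map(tbl, entry, npages);	/* HW/HV call, no pool lock held */
	}

	/* unmap side: mirror image, HW teardown first, then free the range */
	static void unmap_sketch(struct iommu_map_table *tbl, u64 dma_addr,
				 unsigned long npages, unsigned long entry)
	{
		hw_unmap(tbl, entry, npages);		/* HW/HV call, no pool lock held */
		iommu_tbl_range_free(tbl, dma_addr, npages, entry);
	}

With no ->demap or ->cookie_to_index callbacks left in the library, the symmetry Ben asks about lives entirely in the caller, and the pool lock only ever covers bitmap manipulation.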