Fwd: Re: sparc64: Build failure due to commit f1600e549b94 (sparc: Make sparc64 use scalable lib/iommu-common.c functions)

2015-04-19 Thread Sowmini Varadhan

 On (04/19/15 14:09), David Miller wrote:

  On (04/18/15 21:23), Guenter Roeck wrote:
  lib/built-in.o:(.discard+0x1): multiple definition of
  `__pcpu_unique_iommu_pool_hash'
  arch/powerpc/kernel/built-in.o:(.discard+0x18): first defined here
  .. I get a similar failure in the
  powerpc:allmodconfig build
:
 
 Maybe ping the powerpc folks because if they can do a quick
 conversion, this change isn't necessary.

linuxppc-dev,

The discussion above [http://www.spinics.net/lists/sparclinux/msg13835.html]
refers to the issue that Guenter Roeck identified.
The problem is that we have a
  static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
in both lib/iommu-common.c as well as arch/powerpc/kernel/iommu.c,
and as Guenter correctly pointed out,

 DEFINE_PER_CPU translates to DEFINE_PER_CPU_SECTION, which in turn is
  defined as

  #define DEFINE_PER_CPU_SECTION(type, name, sec) \
__PCPU_DUMMY_ATTRS char __pcpu_scope_##name;\
extern __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;\
  --   __PCPU_DUMMY_ATTRS char __pcpu_unique_##name;   \
extern __PCPU_ATTRS(sec) __typeof__(type) name; \
__PCPU_ATTRS(sec) PER_CPU_DEF_ATTRIBUTES __weak \
__typeof__(type) name

  if CONFIG_DEBUG_FORCE_WEAK_PER_CPU is configured, which is the case here.
  The marked line above shows that __pcpu_unique_iommu_pool_hash is declared
  as a global variable.

This results in a build error for allmodconfig on powerpc, due to 
a multiple definition of __pcpu_unique_iommu_pool_hash.
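
To make the collision concrete (a hedged illustration, not output from the
actual build), with CONFIG_DEBUG_FORCE_WEAK_PER_CPU enabled each of the two
static definitions also emits a global __pcpu_unique_##name symbol, so the
final link sees that symbol twice:

    /* lib/iommu-common.c */
    static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
    /* expands (simplified) to:
     *    char __pcpu_scope_iommu_pool_hash;
     *    char __pcpu_unique_iommu_pool_hash;   <-- global, not static
     *    __weak unsigned int iommu_pool_hash;
     */

    /* arch/powerpc/kernel/iommu.c */
    static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
    /* emits a second global __pcpu_unique_iommu_pool_hash, hence the
     * "multiple definition of `__pcpu_unique_iommu_pool_hash'" error
     */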

Given that the goal is to use the functions in lib/iommu-common.c
as the Generic IOMMU pooled allocator that will be a superset of
functionality in arch/powerpc/kernel/iommu.c, DaveM suggested that I 
ping you to see if you can do a quick conversion to handle this
duplicate definition, rather than renaming iommu_pool_hash in iommu-common.c.

Would that be possible?

--Sowmini

[PATCH v10 1/3] Break up monolithic iommu table/lock into finer granularity pools and lock

2015-04-09 Thread Sowmini Varadhan
Investigation of multithreaded iperf experiments on an ethernet
interface shows the iommu lock as the hottest lock identified by
lockstat, with on the order of 21M contentions out of
27M acquisitions, and an average wait time of 26 us for the lock.
This is not efficient. A more scalable design is to follow the ppc
model, where the iommu_map_table has multiple pools, each stretching
over a segment of the map, and with a separate lock for each pool.
This model allows for better parallelization of the iommu map search.
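
For readers without the ppc code at hand, here is a minimal sketch of the
idea with hypothetical names (the real lib/iommu-common.c adds a large pool,
lazy flushing, dma-mask and alignment handling, and picks the pool from a
precomputed per-cpu hash rather than the raw cpu number):

    #include <linux/bitmap.h>
    #include <linux/smp.h>
    #include <linux/spinlock.h>

    struct demo_pool {
            unsigned long start, end, hint;
            spinlock_t lock;
    };

    struct demo_map_table {
            unsigned long nr_pools;              /* power of two */
            struct demo_pool pools[16];
            unsigned long *map;                  /* one bit per table entry */
    };

    static unsigned long demo_range_alloc(struct demo_map_table *tbl,
                                          unsigned long npages)
    {
            /* pick a pool based on the executing cpu, so concurrent
             * mappers usually contend on different locks */
            unsigned int pn = raw_smp_processor_id() & (tbl->nr_pools - 1);
            struct demo_pool *pool = &tbl->pools[pn];
            unsigned long flags, n;

            spin_lock_irqsave(&pool->lock, flags);
            /* search only this pool's segment of the bitmap; hint starts
             * at pool->start and advances after each allocation */
            n = bitmap_find_next_zero_area(tbl->map, pool->end,
                                           pool->hint, npages, 0);
            if (n < pool->end) {
                    bitmap_set(tbl->map, n, npages);
                    pool->hint = n + npages;
            } else {
                    n = ~0UL;                    /* caller retries or fails */
            }
            spin_unlock_irqrestore(&pool->lock, flags);
            return n;
    }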

This patch adds the iommu range alloc/free function infrastructure.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org
---
v2 changes:
  - incorporate David Miller editorial comments: sparc specific
fields moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when
the iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary
check, for use case such as LDC.
v3: eliminate iommu_sparc, rearrange the ->demap indirection to
be invoked under the pool lock.

v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32.

v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru
  - removed ->cookie_to_index and ->demap indirection: caller should
invoke these as needed before calling into the generic allocator

v6: BenH/DaveM discussion eliminating iommu_tbl_ops, but retaining flush_all
optimization.

v7: one-time initialization of pool_hash from iommu_tbl_pool_init()

v8: Benh code review comments.

v9: More Benh code review comments, added dma_mask, align_order logic to
iommu_tbl_range_alloc.

v10: resend without RFC tag, and new mail message-id

 include/linux/iommu-common.h |   51 
 lib/Makefile |2 +-
 lib/iommu-common.c   |  266 ++
 3 files changed, 318 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
new file mode 100644
index 000..bbced83
--- /dev/null
+++ b/include/linux/iommu-common.h
@@ -0,0 +1,51 @@
+#ifndef _LINUX_IOMMU_COMMON_H
+#define _LINUX_IOMMU_COMMON_H
+
+#include <linux/spinlock_types.h>
+#include <linux/device.h>
+#include <asm/page.h>
+
+#define IOMMU_POOL_HASHBITS 4
+#define IOMMU_NR_POOLS  (1 << IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+   unsigned long   start;
+   unsigned long   end;
+   unsigned long   hint;
+   spinlock_t  lock;
+};
+
+struct iommu_map_table {
+   unsigned long   table_map_base;
+   unsigned long   table_shift;
+   unsigned long   nr_pools;
+   void(*lazy_flush)(struct iommu_map_table *);
+   unsigned long   poolsize;
+   struct iommu_pool   pools[IOMMU_NR_POOLS];
+   u32 flags;
+#defineIOMMU_HAS_LARGE_POOL0x0001
+#defineIOMMU_NO_SPAN_BOUND 0x0002
+#defineIOMMU_NEED_FLUSH0x0004
+   struct iommu_pool   large_pool;
+   unsigned long   *map;
+};
+
+extern void iommu_tbl_pool_init(struct iommu_map_table *iommu,
+   unsigned long num_entries,
+   u32 table_shift,
+   void (*lazy_flush)(struct iommu_map_table *),
+   bool large_pool, u32 npools,
+   bool skip_span_boundary_check);
+
+extern unsigned long iommu_tbl_range_alloc(struct device *dev,
+  struct iommu_map_table *iommu,
+  unsigned long npages,
+  unsigned long *handle,
+  unsigned long mask,
+  unsigned int align_order);
+
+extern void iommu_tbl_range_free(struct iommu_map_table *iommu,
+u64 dma_addr, unsigned long npages,
+unsigned long entry);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 58f74d2..60c22e6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,7 +106,7 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
-obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
 obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
diff --git a/lib/iommu-common.c b/lib/iommu-common.c
new file mode 100644
index 000..b99f1d7
--- /dev/null

[PATCH v10 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions

2015-04-09 Thread Sowmini Varadhan
iperf experiments running Linux as the Tx side (TCP client) with
10 threads show a severe performance drop when TSO is disabled,
indicating a weakness in the software that can be avoided by using
the scalable IOMMU arena DMA allocation.

Baseline numbers before this patch:
   with default settings (TSO enabled): 9-9.5 Gbps
   with TSO disabled via ethtool (drops badly): 2-3 Gbps.

After this patch, iperf client with 10 threads, can give a
throughput of at least 8.5 Gbps, even when TSO is disabled.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org
---
v2: moved sparc specific fields into iommu_sparc
v3: converted all sparc users of iommu, so lot of cleanup and streamlining
v4: David Miller review change:
- s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
- reverts pci_impl.h (now that all iommu usage has been converted)
v5: benh/aik feedback modifies the function signatures: pass in 
modified args to iommu_tbl_pool_init() and iommu_tbl_range_free()
v6: removed iommu_tbl_ops. Pass flush_all as function pointer to 
iommu_tbl_pool_init
v7: move pool_hash initialization to iommu_tbl_pool_init()
v8: BenH comments
v9: more BenH comments
v10: resend without RFC tag, and new mail message-id

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  172 +--
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/pci_sun4v.c |  183 +
 4 files changed, 128 insertions(+), 242 deletions(-)

diff --git a/arch/sparc/include/asm/iommu_64.h 
b/arch/sparc/include/asm/iommu_64.h
index 2b9321a..cd0d69f 100644
--- a/arch/sparc/include/asm/iommu_64.h
+++ b/arch/sparc/include/asm/iommu_64.h
@@ -16,6 +16,7 @@
 #define IOPTE_WRITE   0x0002UL
 
 #define IOMMU_NUM_CTXS 4096
+#include <linux/iommu-common.h>
 
 struct iommu_arena {
unsigned long   *map;
@@ -24,11 +25,10 @@ struct iommu_arena {
 };
 
 struct iommu {
+   struct iommu_map_table  tbl;
spinlock_t  lock;
-   struct iommu_arena  arena;
-   void(*flush_all)(struct iommu *);
+   u32 dma_addr_mask;
iopte_t *page_table;
-   u32 page_table_map_base;
unsigned long   iommu_control;
unsigned long   iommu_tsbbase;
unsigned long   iommu_flush;
@@ -40,7 +40,6 @@ struct iommu {
unsigned long   dummy_page_pa;
unsigned long   ctx_lowest_free;
DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS);
-   u32 dma_addr_mask;
 };
 
 struct strbuf {
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index bfa4d0c..5320689 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <linux/iommu-helper.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #ifdef CONFIG_PCI
 #include <linux/pci.h>
@@ -45,8 +46,9 @@
     "i" (ASI_PHYS_BYPASS_EC_E))
 
 /* Must be invoked under the IOMMU lock. */
-static void iommu_flushall(struct iommu *iommu)
+static void iommu_flushall(struct iommu_map_table *iommu_map_table)
 {
+   struct iommu *iommu = container_of(iommu_map_table, struct iommu, tbl);
if (iommu->iommu_flushinv) {
iommu_write(iommu->iommu_flushinv, ~(u64)0);
} else {
@@ -87,94 +89,6 @@ static inline void iopte_make_dummy(struct iommu *iommu, 
iopte_t *iopte)
iopte_val(*iopte) = val;
 }
 
-/* Based almost entirely upon the ppc64 iommu allocator.  If you use the 
'handle'
- * facility it must all be done in one pass while under the iommu lock.
- *
- * On sun4u platforms, we only flush the IOMMU once every time we've passed
- * over the entire page table doing allocations.  Therefore we only ever 
advance
- * the hint and cannot backtrack it.
- */
-unsigned long iommu_range_alloc(struct device *dev,
-   struct iommu *iommu,
-   unsigned long npages,
-   unsigned long *handle)
-{
-   unsigned long n, end, start, limit, boundary_size;
-   struct iommu_arena *arena = &iommu->arena;
-   int pass = 0;
-
-   /* This allocator was derived from x86_64's bit string search */
-
-   /* Sanity check */
-   if (unlikely(npages == 0)) {
-   if (printk_ratelimit())
-   WARN_ON(1);
-   return DMA_ERROR_CODE;
-   }
-
-   if (handle && *handle)
-   start = *handle;
-   else
-   start = arena->hint;
-
-   limit = arena->limit;
-
-   /* The case below can happen if we have a small segment appended
-* to a large, or when the previous alloc was at the very end of
-* the available space. If so, go back to the beginning and flush

[PATCH v10 3/3] sparc: Make LDC use common iommu poll management functions

2015-04-09 Thread Sowmini Varadhan
Note that this conversion is only being done to consolidate the
code and ensure that the common code provides the sufficient
abstraction. It is not expected to result in any noticeable
performance improvement, as there is typically one ldc_iommu
per vnet_port, and each one has 8k entries, with a typical
request for 1-4 pages.  Thus LDC uses npools == 1.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org
---
v3: added this file to be a consumer of the common iommu library
v4: removed ->cookie_to_index and ->demap from iommu_tbl_ops and instead
inline these calls into ldc before calling into iommu-common
v6: remove iommu_tbl_ops
v7: move pool_hash initialization to iommu_tbl_pool_init
v8: BenH comments
v9: More BenH comments
v10: resend without RFC tag, and new mail message-id

 arch/sparc/kernel/ldc.c |  153 ---
 1 files changed, 65 insertions(+), 88 deletions(-)

diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index 274a9f5..d2ae0f7 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -15,6 +15,7 @@
 #include <linux/list.h>
 #include <linux/init.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #include <asm/hypervisor.h>
 #include <asm/iommu.h>
@@ -27,6 +28,10 @@
 #define DRV_MODULE_VERSION "1.1"
 #define DRV_MODULE_RELDATE "July 22, 2008"
 
+#define COOKIE_PGSZ_CODE   0xf000000000000000ULL
+#define COOKIE_PGSZ_CODE_SHIFT 60ULL
+
+
 static char version[] =
DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
 #define LDC_PACKET_SIZE64
@@ -98,10 +103,10 @@ static const struct ldc_mode_ops stream_ops;
 int ldom_domaining_enabled;
 
 struct ldc_iommu {
-   /* Protects arena alloc/free.  */
+   /* Protects ldc_unmap.  */
spinlock_t  lock;
-   struct iommu_arena  arena;
struct ldc_mtable_entry *page_table;
+   struct iommu_map_table  iommu_map_table;
 };
 
 struct ldc_channel {
@@ -998,31 +1003,59 @@ static void free_queue(unsigned long num_entries, struct 
ldc_packet *q)
free_pages((unsigned long)q, order);
 }
 
+static unsigned long ldc_cookie_to_index(u64 cookie, void *arg)
+{
+   u64 szcode = cookie >> COOKIE_PGSZ_CODE_SHIFT;
+   /* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */
+
+   cookie &= ~COOKIE_PGSZ_CODE;
+
+   return (cookie >> (13ULL + (szcode * 3ULL)));
+}
+
+static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie,
+ unsigned long entry, unsigned long npages)
+{
+   struct ldc_mtable_entry *base;
+   unsigned long i, shift;
+
+   shift = (cookie >> COOKIE_PGSZ_CODE_SHIFT) * 3;
+   base = iommu->page_table + entry;
+   for (i = 0; i < npages; i++) {
+   if (base->cookie)
+   sun4v_ldc_revoke(id, cookie + (i << shift),
+base->cookie);
+   base->mte = 0;
+   base++;
+   }
+}
+
 /* XXX Make this configurable... XXX */
 #define LDC_IOTABLE_SIZE   (8 * 1024)
 
-static int ldc_iommu_init(struct ldc_channel *lp)
+static int ldc_iommu_init(const char *name, struct ldc_channel *lp)
 {
unsigned long sz, num_tsb_entries, tsbsize, order;
-   struct ldc_iommu *iommu = &lp->iommu;
+   struct ldc_iommu *ldc_iommu = &lp->iommu;
+   struct iommu_map_table *iommu = &ldc_iommu->iommu_map_table;
struct ldc_mtable_entry *table;
unsigned long hv_err;
int err;
 
num_tsb_entries = LDC_IOTABLE_SIZE;
tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry);
-
-   spin_lock_init(&iommu->lock);
+   spin_lock_init(&ldc_iommu->lock);
 
sz = num_tsb_entries / 8;
sz = (sz + 7UL) & ~7UL;
-   iommu->arena.map = kzalloc(sz, GFP_KERNEL);
-   if (!iommu->arena.map) {
+   iommu->map = kzalloc(sz, GFP_KERNEL);
+   if (!iommu->map) {
printk(KERN_ERR PFX "Alloc of arena map failed, sz=%lu\n", sz);
return -ENOMEM;
}
-
-   iommu->arena.limit = num_tsb_entries;
+   iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT,
+   NULL, false /* no large pool */,
+   1 /* npools */,
+   true /* skip span boundary check */);
 
order = get_order(tsbsize);
 
@@ -1037,7 +1070,7 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
memset(table, 0, PAGE_SIZE << order);
 
-   iommu->page_table = table;
+   ldc_iommu->page_table = table;
 
hv_err = sun4v_ldc_set_map_table(lp->id, __pa(table),
 num_tsb_entries);
@@ -1049,31 +1082,32 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 out_free_table:
free_pages((unsigned long) table, order);
-   iommu->page_table = NULL;
+   ldc_iommu->page_table = NULL;
 
 out_free_map:
-   kfree(iommu

[PATCH v10 0/3] Generic IOMMU pooled allocator

2015-04-09 Thread Sowmini Varadhan

Investigation of network performance on Sparc shows a high
degree of locking contention in the IOMMU allocator, and it
was noticed that the PowerPC code has a better locking model.

This patch series tries to extract the generic parts of the 
PowerPC code so that it can be shared across multiple PCI
devices and architectures.

v10: resend patchv9 without RFC tag, and a new mail Message-Id,
(previous non-RFC attempt did not show up on the patchwork queue?)

Full revision history below:
v2 changes:
  - incorporate David Miller editorial comments: sparc specific
fields moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when
the iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary
check, for use case such as LDC.

v3: eliminate iommu_sparc, rearrange the ->demap indirection to
be invoked under the pool lock.

v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32.

v5: removed ->cookie_to_index and ->demap indirection from the
iommu_tbl_ops. The caller needs to call these functions as needed,
before invoking the generic arena allocator functions.
Added the skip_span_boundary argument to iommu_tbl_pool_init() for
those callers like LDC which do no care about span boundary checks.

v6: removed iommu_tbl_ops, and instead pass the ->flush_all as
an indirection to iommu_tbl_pool_init(); only invoke ->flush_all
when there is no large_pool, based on the assumption that large-pool
usage is infrequently encountered

v7: moved pool_hash initialization to lib/iommu-common.c and cleaned up
code duplication from sun4v/sun4u/ldc. 

v8: Addresses BenH comments with one exception: I've left the
IOMMU_POOL_HASH as is, so that powerpc can tailor it to their
convenience.  Discard trylock for simple spin_lock to acquire pool

v9: Addresses latest BenH comments: need_flush checks, add support
for dma mask and align_order. 

v10: resend without RFC tag, and new mail Message-Id.

Sowmini Varadhan (3):
  Break up monolithic iommu table/lock into finer granularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions
  Make LDC use common iommu poll management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  172 ++--
 arch/sparc/kernel/iommu_common.h  |8 -
 arch/sparc/kernel/ldc.c   |  153 +
 arch/sparc/kernel/pci_sun4v.c |  183 --
 include/linux/iommu-common.h  |   51 +++
 lib/Makefile  |2 +-
 lib/iommu-common.c|  266 +
 8 files changed, 511 insertions(+), 331 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c


Re: [PATCHv9 RFC 1/3] Break up monolithic iommu table/lock into finer granularity pools and lock

2015-04-08 Thread Sowmini Varadhan
On (04/08/15 18:30), Benjamin Herrenschmidt wrote:
 
 I'm happy with your last version, feel free to add my
 
 Acked-by: Benjamin Herrenschmidt b...@kernel.crashing.org

sounds good, I'll do this and resend a non-RFC version today.

Thanks for all the feedback - it was very useful to me, and
I'm much happier with the end-result.

--Sowmini


[PATCH v9 RFC 3/3] sparc: Make LDC use common iommu poll management functions

2015-04-05 Thread Sowmini Varadhan
Note that this conversion is only being done to consolidate the
code and ensure that the common code provides the sufficient
abstraction. It is not expected to result in any noticeable
performance improvement, as there is typically one ldc_iommu
per vnet_port, and each one has 8k entries, with a typical
request for 1-4 pages.  Thus LDC uses npools == 1.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v3: added this file to be a consumer of the common iommu library
v4: removed ->cookie_to_index and ->demap from iommu_tbl_ops and instead
inline these calls into ldc before calling into iommu-common
v6: remove iommu_tbl_ops
v7: move pool_hash initialization to iommu_tbl_pool_init
v8: BenH comments
v9: More BenH comments
 arch/sparc/kernel/ldc.c |  153 ---
 1 files changed, 65 insertions(+), 88 deletions(-)

diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index 274a9f5..d2ae0f7 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -15,6 +15,7 @@
 #include <linux/list.h>
 #include <linux/init.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #include <asm/hypervisor.h>
 #include <asm/iommu.h>
@@ -27,6 +28,10 @@
 #define DRV_MODULE_VERSION "1.1"
 #define DRV_MODULE_RELDATE "July 22, 2008"
 
+#define COOKIE_PGSZ_CODE   0xf000000000000000ULL
+#define COOKIE_PGSZ_CODE_SHIFT 60ULL
+
+
 static char version[] =
DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
 #define LDC_PACKET_SIZE64
@@ -98,10 +103,10 @@ static const struct ldc_mode_ops stream_ops;
 int ldom_domaining_enabled;
 
 struct ldc_iommu {
-   /* Protects arena alloc/free.  */
+   /* Protects ldc_unmap.  */
spinlock_t  lock;
-   struct iommu_arena  arena;
struct ldc_mtable_entry *page_table;
+   struct iommu_map_table  iommu_map_table;
 };
 
 struct ldc_channel {
@@ -998,31 +1003,59 @@ static void free_queue(unsigned long num_entries, struct 
ldc_packet *q)
free_pages((unsigned long)q, order);
 }
 
+static unsigned long ldc_cookie_to_index(u64 cookie, void *arg)
+{
+   u64 szcode = cookie >> COOKIE_PGSZ_CODE_SHIFT;
+   /* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */
+
+   cookie &= ~COOKIE_PGSZ_CODE;
+
+   return (cookie >> (13ULL + (szcode * 3ULL)));
+}
+
+static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie,
+ unsigned long entry, unsigned long npages)
+{
+   struct ldc_mtable_entry *base;
+   unsigned long i, shift;
+
+   shift = (cookie >> COOKIE_PGSZ_CODE_SHIFT) * 3;
+   base = iommu->page_table + entry;
+   for (i = 0; i < npages; i++) {
+   if (base->cookie)
+   sun4v_ldc_revoke(id, cookie + (i << shift),
+base->cookie);
+   base->mte = 0;
+   base++;
+   }
+}
+
 /* XXX Make this configurable... XXX */
 #define LDC_IOTABLE_SIZE   (8 * 1024)
 
-static int ldc_iommu_init(struct ldc_channel *lp)
+static int ldc_iommu_init(const char *name, struct ldc_channel *lp)
 {
unsigned long sz, num_tsb_entries, tsbsize, order;
-   struct ldc_iommu *iommu = &lp->iommu;
+   struct ldc_iommu *ldc_iommu = &lp->iommu;
+   struct iommu_map_table *iommu = &ldc_iommu->iommu_map_table;
struct ldc_mtable_entry *table;
unsigned long hv_err;
int err;
 
num_tsb_entries = LDC_IOTABLE_SIZE;
tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry);
-
-   spin_lock_init(&iommu->lock);
+   spin_lock_init(&ldc_iommu->lock);
 
sz = num_tsb_entries / 8;
sz = (sz + 7UL) & ~7UL;
-   iommu->arena.map = kzalloc(sz, GFP_KERNEL);
-   if (!iommu->arena.map) {
+   iommu->map = kzalloc(sz, GFP_KERNEL);
+   if (!iommu->map) {
printk(KERN_ERR PFX "Alloc of arena map failed, sz=%lu\n", sz);
return -ENOMEM;
}
-
-   iommu->arena.limit = num_tsb_entries;
+   iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT,
+   NULL, false /* no large pool */,
+   1 /* npools */,
+   true /* skip span boundary check */);
 
order = get_order(tsbsize);
 
@@ -1037,7 +1070,7 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
memset(table, 0, PAGE_SIZE << order);
 
-   iommu->page_table = table;
+   ldc_iommu->page_table = table;
 
hv_err = sun4v_ldc_set_map_table(lp->id, __pa(table),
 num_tsb_entries);
@@ -1049,31 +1082,32 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 out_free_table:
free_pages((unsigned long) table, order);
-   iommu->page_table = NULL;
+   ldc_iommu->page_table = NULL;
 
 out_free_map:
-   kfree(iommu->arena.map);
-   iommu->arena.map = NULL;
+   kfree(iommu->map);
+   iommu->map = NULL;

[PATCH v9 RFC 0/3] Generic IOMMU pooled allocator

2015-04-05 Thread Sowmini Varadhan
Addresses latest BenH comments: need_flush checks, add support
for dma mask and align_order. 

Sowmini Varadhan (3):
  Break up monolithic iommu table/lock into finer granularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions
  Make LDC use common iommu poll management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  172 ++--
 arch/sparc/kernel/iommu_common.h  |8 -
 arch/sparc/kernel/ldc.c   |  153 +
 arch/sparc/kernel/pci_sun4v.c |  183 --
 include/linux/iommu-common.h  |   51 +++
 lib/Makefile  |2 +-
 lib/iommu-common.c|  266 +
 8 files changed, 511 insertions(+), 331 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c


[PATCHv9 RFC 1/3] Break up monolithic iommu table/lock into finer granularity pools and lock

2015-04-05 Thread Sowmini Varadhan
Investigation of multithreaded iperf experiments on an ethernet
interface shows the iommu lock as the hottest lock identified by
lockstat, with on the order of 21M contentions out of
27M acquisitions, and an average wait time of 26 us for the lock.
This is not efficient. A more scalable design is to follow the ppc
model, where the iommu_map_table has multiple pools, each stretching
over a segment of the map, and with a separate lock for each pool.
This model allows for better parallelization of the iommu map search.

This patch adds the iommu range alloc/free function infrastructure.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v2 changes:
  - incorporate David Miller editorial comments: sparc specific
fields moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when
the iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary
check, for use case such as LDC.
v3: eliminate iommu_sparc, rearrange the ->demap indirection to
be invoked under the pool lock.

v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32.

v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru
  - removed ->cookie_to_index and ->demap indirection: caller should
invoke these as needed before calling into the generic allocator

v6: BenH/DaveM discussion eliminating iommu_tbl_ops, but retaining flush_all
optimization.

v7: one-time initialization of pool_hash from iommu_tbl_pool_init()

v8: Benh code review comments.

v9: More Benh code review comments, added dma_mask, align_order logic to
iommu_tbl_range_alloc.

 include/linux/iommu-common.h |   51 
 lib/Makefile |2 +-
 lib/iommu-common.c   |  266 ++
 3 files changed, 318 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
new file mode 100644
index 000..bbced83
--- /dev/null
+++ b/include/linux/iommu-common.h
@@ -0,0 +1,51 @@
+#ifndef _LINUX_IOMMU_COMMON_H
+#define _LINUX_IOMMU_COMMON_H
+
+#include <linux/spinlock_types.h>
+#include <linux/device.h>
+#include <asm/page.h>
+
+#define IOMMU_POOL_HASHBITS 4
+#define IOMMU_NR_POOLS  (1 << IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+   unsigned long   start;
+   unsigned long   end;
+   unsigned long   hint;
+   spinlock_t  lock;
+};
+
+struct iommu_map_table {
+   unsigned long   table_map_base;
+   unsigned long   table_shift;
+   unsigned long   nr_pools;
+   void(*lazy_flush)(struct iommu_map_table *);
+   unsigned long   poolsize;
+   struct iommu_pool   pools[IOMMU_NR_POOLS];
+   u32 flags;
+#defineIOMMU_HAS_LARGE_POOL0x0001
+#defineIOMMU_NO_SPAN_BOUND 0x0002
+#defineIOMMU_NEED_FLUSH0x0004
+   struct iommu_pool   large_pool;
+   unsigned long   *map;
+};
+
+extern void iommu_tbl_pool_init(struct iommu_map_table *iommu,
+   unsigned long num_entries,
+   u32 table_shift,
+   void (*lazy_flush)(struct iommu_map_table *),
+   bool large_pool, u32 npools,
+   bool skip_span_boundary_check);
+
+extern unsigned long iommu_tbl_range_alloc(struct device *dev,
+  struct iommu_map_table *iommu,
+  unsigned long npages,
+  unsigned long *handle,
+  unsigned long mask,
+  unsigned int align_order);
+
+extern void iommu_tbl_range_free(struct iommu_map_table *iommu,
+u64 dma_addr, unsigned long npages,
+unsigned long entry);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 58f74d2..60c22e6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,7 +106,7 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
-obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
 obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
diff --git a/lib/iommu-common.c b/lib/iommu-common.c
new file mode 100644
index 000..b99f1d7
--- /dev/null
+++ b/lib/iommu-common.c
@@ -0,0 +1,266 @@
+/*
+ * IOMMU mmap management and range allocation functions

[PATCH v9 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions

2015-04-05 Thread Sowmini Varadhan
iperf experiments running Linux as the Tx side (TCP client) with
10 threads show a severe performance drop when TSO is disabled,
indicating a weakness in the software that can be avoided by using
the scalable IOMMU arena DMA allocation.

Baseline numbers before this patch:
   with default settings (TSO enabled): 9-9.5 Gbps
   with TSO disabled via ethtool (drops badly): 2-3 Gbps.

After this patch, iperf client with 10 threads, can give a
throughput of at least 8.5 Gbps, even when TSO is disabled.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v2: moved sparc specific fields into iommu_sparc
v3: converted all sparc users of iommu, so lot of cleanup and streamlining
v4: David Miller review change:
- s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
- reverts pci_impl.h (now that all iommu usage has been converted)
v5: benh/aik feedback modifies the function signatures: pass in 
modified args to iommu_tbl_pool_init() and iommu_tbl_range_free()
v6: removed iommu_tbl_ops. Pass flush_all as function pointer to 
iommu_tbl_pool_init
v7: move pool_hash initialization to iommu_tbl_pool_init()
v8: BenH comments
v9: more BenH comments
 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  172 +--
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/pci_sun4v.c |  183 +
 4 files changed, 128 insertions(+), 242 deletions(-)

diff --git a/arch/sparc/include/asm/iommu_64.h 
b/arch/sparc/include/asm/iommu_64.h
index 2b9321a..cd0d69f 100644
--- a/arch/sparc/include/asm/iommu_64.h
+++ b/arch/sparc/include/asm/iommu_64.h
@@ -16,6 +16,7 @@
 #define IOPTE_WRITE   0x0002UL
 
 #define IOMMU_NUM_CTXS 4096
+#include <linux/iommu-common.h>
 
 struct iommu_arena {
unsigned long   *map;
@@ -24,11 +25,10 @@ struct iommu_arena {
 };
 
 struct iommu {
+   struct iommu_map_table  tbl;
spinlock_t  lock;
-   struct iommu_arena  arena;
-   void(*flush_all)(struct iommu *);
+   u32 dma_addr_mask;
iopte_t *page_table;
-   u32 page_table_map_base;
unsigned long   iommu_control;
unsigned long   iommu_tsbbase;
unsigned long   iommu_flush;
@@ -40,7 +40,6 @@ struct iommu {
unsigned long   dummy_page_pa;
unsigned long   ctx_lowest_free;
DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS);
-   u32 dma_addr_mask;
 };
 
 struct strbuf {
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index bfa4d0c..5320689 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <linux/iommu-helper.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #ifdef CONFIG_PCI
 #include <linux/pci.h>
@@ -45,8 +46,9 @@
     "i" (ASI_PHYS_BYPASS_EC_E))
 
 /* Must be invoked under the IOMMU lock. */
-static void iommu_flushall(struct iommu *iommu)
+static void iommu_flushall(struct iommu_map_table *iommu_map_table)
 {
+   struct iommu *iommu = container_of(iommu_map_table, struct iommu, tbl);
if (iommu->iommu_flushinv) {
iommu_write(iommu->iommu_flushinv, ~(u64)0);
} else {
@@ -87,94 +89,6 @@ static inline void iopte_make_dummy(struct iommu *iommu, 
iopte_t *iopte)
iopte_val(*iopte) = val;
 }
 
-/* Based almost entirely upon the ppc64 iommu allocator.  If you use the 
'handle'
- * facility it must all be done in one pass while under the iommu lock.
- *
- * On sun4u platforms, we only flush the IOMMU once every time we've passed
- * over the entire page table doing allocations.  Therefore we only ever 
advance
- * the hint and cannot backtrack it.
- */
-unsigned long iommu_range_alloc(struct device *dev,
-   struct iommu *iommu,
-   unsigned long npages,
-   unsigned long *handle)
-{
-   unsigned long n, end, start, limit, boundary_size;
-   struct iommu_arena *arena = &iommu->arena;
-   int pass = 0;
-
-   /* This allocator was derived from x86_64's bit string search */
-
-   /* Sanity check */
-   if (unlikely(npages == 0)) {
-   if (printk_ratelimit())
-   WARN_ON(1);
-   return DMA_ERROR_CODE;
-   }
-
-   if (handle && *handle)
-   start = *handle;
-   else
-   start = arena->hint;
-
-   limit = arena->limit;
-
-   /* The case below can happen if we have a small segment appended
-* to a large, or when the previous alloc was at the very end of
-* the available space. If so, go back to the beginning and flush.
-*/
-   if (start >= limit) {
-   start = 0;
-   if (iommu->flush_all)
-   iommu

Re: [PATCHv9 RFC 1/3] Break up monolithic iommu table/lock into finer granularity pools and lock

2015-04-05 Thread Sowmini Varadhan
On (04/05/15 22:26), Benjamin Herrenschmidt wrote:
 
 So you decided to keep the logic here that updates the hint instead of
 just getting rid of need_flush alltogether ?
 
 Out of curiosity, what's the rationale ? Did you find a reason why
 resetting the hint in those two cases (rather than just setting start
 appropriately) is actually useful ?

To be honest, I actually did not want to poke at this too much, given
that both ppc and sparc seem to have essentially the same
logic.

But my gut feeling is that if the pool->hint hasn't resulted
in a successful area_alloc, we might as well reset it for the next
iommu*range_alloc-er, because it's likely that the next alloc
will also need to wrap and search anyway.

--Sowmini



Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock

2015-04-04 Thread Sowmini Varadhan
One last question before I spin out v9.. the dma_mask code
is a bit confusing to me, so I want to make sure... the code is 

 if (limit + tbl->it_offset > mask) {
   limit = mask - tbl->it_offset + 1;
   /* If we're constrained on address range, first try
* at the masked hint to avoid O(n) search complexity,
* but on second pass, start at 0 in pool 0.
*/
   if ((start & mask) >= limit || pass > 0) {
   spin_unlock(&(pool->lock));
   pool = &(tbl->pools[0]);
   spin_lock(&(pool->lock));
   start = pool->start;

So here I would need to mark need_flush to true, right?


--Sowmini


Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock

2015-04-03 Thread Sowmini Varadhan


Just want to confirm:

  + again:
  +   if (pass == 0 && handle && *handle &&
  +   (*handle >= pool->start) && (*handle < pool->end))
  +   start = *handle;
  +   else
  +   start = pool->hint;
 
 Now this means handle might be < pool->hint, in that case you also
 need a lazy flush. Or rather only if the resulting alloc is. My
 :

  +   } else {
  +   /* give up */
  +   n = DMA_ERROR_CODE;
  +   goto bail;
  +   }
  +   }
 
 Here, make this something like:
 
    } else if (end < pool->hint)
    need_flush = true;

you mean 

} else if (start < pool->hint)

right?  (so I'm not missing some corner-case that you are thinking
about here)

--Sowmini
 

Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock

2015-04-03 Thread Sowmini Varadhan
On (04/04/15 08:06), Benjamin Herrenschmidt wrote:
 
 No, I meant n < pool->hint, ie, the start of the newly allocated
 block.

ah, got it. I'll do my drill with patchset and get back, probably by
Monday.

--Sowmini

Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock

2015-04-02 Thread Sowmini Varadhan
On (04/03/15 08:57), Benjamin Herrenschmidt wrote:
 
   I only just noticed too, you completely dropped the code to honor
   the dma mask. Why that ? Some devices rely on this.

/* Sowmini's comment about this coming from sparc origins.. */

 Probably, not that many devices have limits on DMA mask but they do
 exist. It becomes more important if we decide to create a very large
 IOMMU window that spans beyond 4G in order to support devices with
 32-bit DMA masks. Otherwise it's older devices mostly with 32-bit
 masks.
 
 In any case, for a generic piece of code, this should be supported.
 Basically, assume that if we have something in the powerpc code, we need
 it, if you remove it, we won't be able to use your code generically.

I see.

is the mask something that can be stored in the iommu_map_table as
part of the init? 

I can see that the align_order has to be an additional arg to
iommu_tbl_range_alloc; I'm not sure whether mask falls into that category
as well.
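
For illustration only, a hedged sketch of what a call site could look like
once both values become parameters of iommu_tbl_range_alloc() (which is the
shape the v9/v10 prototype in this series ends up with); the mask derivation
below is an assumption, not code from any posted patch:

    /* inside a driver's dma-map path; dev, iommu and npages assumed */
    unsigned long mask = *dev->dma_mask >> iommu->tbl.table_shift;
    unsigned int align_order = 0;   /* e.g. get_order(size) for consistent allocs */
    unsigned long entry;

    entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages, NULL,
                                  mask, align_order);
    if (entry == DMA_ERROR_CODE)
            return DMA_ERROR_CODE;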

--Sowmini


Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock

2015-04-02 Thread Sowmini Varadhan
On (04/03/15 07:54), Benjamin Herrenschmidt wrote:
  +   limit = pool->end;
  +
  +   /* The case below can happen if we have a small segment appended
  +* to a large, or when the previous alloc was at the very end of
  +* the available space. If so, go back to the beginning and flush.
  +*/
  +   if (start >= limit) {
  +   start = pool->start;
  +   if (!large_pool && iommu->lazy_flush != NULL)
  +   iommu->lazy_flush(iommu);
 
 Add need_flush = false;

A few clarifications, while I parse the rest of your comments:

Not sure I follow: need_flush is initialized to true at the start of the
function?

 I only just noticed too, you completely dropped the code to honor
 the dma mask. Why that ? Some devices rely on this.

so that's an interesting question: the existing iommu_range_alloc() in
arch/sparc/kernel/iommu.c does not use the mask at all. I based most of
the code  on this (except for the lock fragmentation part). 
I dont know if this is arch specific.

 
  +   if (dev)
  +   boundary_size = ALIGN(dma_get_seg_boundary(dev) + 1,
  + 1 << iommu->table_shift);
  +   else
  +   boundary_size = ALIGN(1UL << 32, 1 << iommu->table_shift);
  +
  +   shift = iommu->table_map_base >> iommu->table_shift;
  +   boundary_size = boundary_size >> iommu->table_shift;
  +   /*
  +* if the skip_span_boundary_check had been set during init, we set
  +* things up so that iommu_is_span_boundary() merely checks if the
  +* (index + npages) < num_tsb_entries
  +*/
  +   if ((iommu->flags & IOMMU_NO_SPAN_BOUND) != 0) {
  +   shift = 0;
  +   boundary_size = iommu->poolsize * iommu->nr_pools;
  +   }
  +   n = iommu_area_alloc(iommu->map, limit, start, npages, shift,
  +boundary_size, 0);
 
 You have completely dropped the alignment support. This will break
 drivers. There are cases (especially with consistent allocations) where

Again, not sure I follow? are you referring to the IOMMU_NO_SPAN_BOUND case?
That's very specific to LDC (sparc ldoms virtualization infra). The default
is to not have IOMMU_NO_SPAN_BOUND set. 
For the rest of the drivers, the code that sets up boundary_size aligns things
in the same way as the ppc code.


 the driver have alignment constraints on the address, those must be
 preserved.
 

--Sowmini

Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-04-02 Thread Sowmini Varadhan
On (03/31/15 23:12), David Miller wrote:
 
 It's much more amortized with smart buffering strategies, which are
 common on current generation networking cards.
 
 There you only eat one map/unmap per PAGE_SIZE / rx_pkt_size.
 
 Maybe the infiniband stuff is doing things very suboptimally, and
 actually with that subsystem and drivers absolutely nothing would
 surprise me.

yeah, we are trying to get more info from them about what their
bottlenecks are. Until then, let's just stick with the spin_lock.

do I need to resubmit this without the RFC tag? Perhaps I should
have dropped that some time ago.

--Sowmini

Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock

2015-04-02 Thread Sowmini Varadhan
the other question that comes to my mind is: the whole lazy_flush
optimization probably works best when there is exactly one pool,
and no large pools. In most other cases, we'd end up doing a lazy_flush
when we wrap within our pool itself, losing the benefit of that 
optimization. 

Given that the lazy_flush is mostly there to avoid regressions for 
the older sun4u architectures (which have other hardware bottlenecks 
anyway), and this code is rapidly getting messy, does it make sense
to constrain the lazy_flush check to only apply for the 1-pool, 
no-large-pool case?
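
A hedged sketch of the restriction being floated here, with a hypothetical
helper name (nothing like this is in the posted patches):

    #include <linux/iommu-common.h>

    /* Only lazy-flush when the table is a single pool with no large pool;
     * with many small pools, wrapping inside one pool is common and would
     * otherwise trigger flushes constantly. */
    static void maybe_lazy_flush(struct iommu_map_table *iommu,
                                 struct iommu_pool *pool, unsigned long *start)
    {
            bool single_pool = (iommu->nr_pools == 1) &&
                               !(iommu->flags & IOMMU_HAS_LARGE_POOL);

            if (*start >= pool->end) {           /* search wrapped in this pool */
                    *start = pool->start;
                    if (single_pool && iommu->lazy_flush)
                            iommu->lazy_flush(iommu);
            }
    }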



Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-03-31 Thread Sowmini Varadhan

On 03/31/2015 09:01 PM, Benjamin Herrenschmidt wrote:

On Tue, 2015-03-31 at 14:06 -0400, Sowmini Varadhan wrote:

Having bravely said that..

the IB team informs me that they see a 10% degradation using
the spin_lock as opposed to the trylock.

one path going forward is to continue processing this patch-set
as is. I can investigate this further, and later revise the spin_lock
to the trylock, after we are certain that it is good/necessary.


Have they tried using more pools instead ?



we just tried 32 instead of 16, no change to perf.

Looks like their current bottleneck is the find_next_zero_bit (they
can get a 2X perf improvement with the lock fragmentation, but are
then hitting a new ceiling, even with the trylock version)


I'm starting to wonder if  some approximation of dma premapped
buffers may be needed. Doing a map/unmap on each packet is expensive.


[PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock

2015-03-31 Thread Sowmini Varadhan
Investigation of multithreaded iperf experiments on an ethernet
interface shows the iommu lock as the hottest lock identified by
lockstat, with on the order of 21M contentions out of
27M acquisitions, and an average wait time of 26 us for the lock.
This is not efficient. A more scalable design is to follow the ppc
model, where the iommu_map_table has multiple pools, each stretching
over a segment of the map, and with a separate lock for each pool.
This model allows for better parallelization of the iommu map search.

This patch adds the iommu range alloc/free function infrastructure.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v2 changes:
  - incorporate David Miller editorial comments: sparc specific
fields moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when
the iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary
check, for use case such as LDC.
v3: eliminate iommu_sparc, rearrange the ->demap indirection to
be invoked under the pool lock.

v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32.

v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru
  - removed ->cookie_to_index and ->demap indirection: caller should
invoke these as needed before calling into the generic allocator

v6: BenH/DaveM discussion eliminating iommu_tbl_ops, but retaining flush_all
optimization.

v7: one-time initialization of pool_hash from iommu_tbl_pool_init()

v8: Benh code review comments.

 include/linux/iommu-common.h |   48 +
 lib/Makefile |2 +-
 lib/iommu-common.c   |  233 ++
 3 files changed, 282 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
new file mode 100644
index 000..7b1b761
--- /dev/null
+++ b/include/linux/iommu-common.h
@@ -0,0 +1,48 @@
+#ifndef _LINUX_IOMMU_COMMON_H
+#define _LINUX_IOMMU_COMMON_H
+
+#include <linux/spinlock_types.h>
+#include <linux/device.h>
+#include <asm/page.h>
+
+#define IOMMU_POOL_HASHBITS 4
+#define IOMMU_NR_POOLS  (1 << IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+   unsigned long   start;
+   unsigned long   end;
+   unsigned long   hint;
+   spinlock_t  lock;
+};
+
+struct iommu_map_table {
+   unsigned long   table_map_base;
+   unsigned long   table_shift;
+   unsigned long   nr_pools;
+   void(*lazy_flush)(struct iommu_map_table *);
+   unsigned long   poolsize;
+   struct iommu_pool   pools[IOMMU_NR_POOLS];
+   u32 flags;
+#defineIOMMU_HAS_LARGE_POOL0x0001
+#defineIOMMU_NO_SPAN_BOUND 0x0002
+   struct iommu_pool   large_pool;
+   unsigned long   *map;
+};
+
+extern void iommu_tbl_pool_init(struct iommu_map_table *iommu,
+   unsigned long num_entries,
+   u32 table_shift,
+   void (*lazy_flush)(struct iommu_map_table *),
+   bool large_pool, u32 npools,
+   bool skip_span_boundary_check);
+
+extern unsigned long iommu_tbl_range_alloc(struct device *dev,
+  struct iommu_map_table *iommu,
+  unsigned long npages,
+  unsigned long *handle);
+
+extern void iommu_tbl_range_free(struct iommu_map_table *iommu,
+u64 dma_addr, unsigned long npages,
+unsigned long entry);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 58f74d2..60c22e6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -106,7 +106,7 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
-obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
 obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
diff --git a/lib/iommu-common.c b/lib/iommu-common.c
new file mode 100644
index 000..35dacbe
--- /dev/null
+++ b/lib/iommu-common.c
@@ -0,0 +1,233 @@
+/*
+ * IOMMU mmap management and range allocation functions.
+ * Based almost entirely upon the powerpc iommu allocator.
+ */
+
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/bug.h>
+#include <linux/iommu-helper.h>
+#include <linux/iommu-common.h>
+#include <linux/dma-mapping.h>
+#include <linux/hash.h>
+
+unsigned long iommu_large_alloc

[PATCH v8 RFC 3/3] sparc: Make LDC use common iommu poll management functions

2015-03-31 Thread Sowmini Varadhan
Note that this conversion is only being done to consolidate the
code and ensure that the common code provides the sufficient
abstraction. It is not expected to result in any noticeable
performance improvement, as there is typically one ldc_iommu
per vnet_port, and each one has 8k entries, with a typical
request for 1-4 pages.  Thus LDC uses npools == 1.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v3: added this file to be a consumer of the common iommu library
v4: removed ->cookie_to_index and ->demap from iommu_tbl_ops and instead
inline these calls into ldc before calling into iommu-common
v6: remove iommu_tbl_ops
v7: move pool_hash initialization to iommu_tbl_pool_init
v8: BenH comments

 arch/sparc/kernel/ldc.c |  153 ---
 1 files changed, 65 insertions(+), 88 deletions(-)

diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index 274a9f5..f842512 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -15,6 +15,7 @@
 #include <linux/list.h>
 #include <linux/init.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #include <asm/hypervisor.h>
 #include <asm/iommu.h>
@@ -27,6 +28,10 @@
 #define DRV_MODULE_VERSION "1.1"
 #define DRV_MODULE_RELDATE "July 22, 2008"
 
+#define COOKIE_PGSZ_CODE   0xf000000000000000ULL
+#define COOKIE_PGSZ_CODE_SHIFT 60ULL
+
+
 static char version[] =
DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
 #define LDC_PACKET_SIZE64
@@ -98,10 +103,10 @@ static const struct ldc_mode_ops stream_ops;
 int ldom_domaining_enabled;
 
 struct ldc_iommu {
-   /* Protects arena alloc/free.  */
+   /* Protects ldc_unmap.  */
spinlock_t  lock;
-   struct iommu_arena  arena;
struct ldc_mtable_entry *page_table;
+   struct iommu_map_table  iommu_map_table;
 };
 
 struct ldc_channel {
@@ -998,31 +1003,59 @@ static void free_queue(unsigned long num_entries, struct 
ldc_packet *q)
free_pages((unsigned long)q, order);
 }
 
+static unsigned long ldc_cookie_to_index(u64 cookie, void *arg)
+{
+   u64 szcode = cookie >> COOKIE_PGSZ_CODE_SHIFT;
+   /* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */
+
+   cookie &= ~COOKIE_PGSZ_CODE;
+
+   return (cookie >> (13ULL + (szcode * 3ULL)));
+}
+
+static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie,
+ unsigned long entry, unsigned long npages)
+{
+   struct ldc_mtable_entry *base;
+   unsigned long i, shift;
+
+   shift = (cookie >> COOKIE_PGSZ_CODE_SHIFT) * 3;
+   base = iommu->page_table + entry;
+   for (i = 0; i < npages; i++) {
+   if (base->cookie)
+   sun4v_ldc_revoke(id, cookie + (i << shift),
+base->cookie);
+   base->mte = 0;
+   base++;
+   }
+}
+
 /* XXX Make this configurable... XXX */
 #define LDC_IOTABLE_SIZE   (8 * 1024)
 
-static int ldc_iommu_init(struct ldc_channel *lp)
+static int ldc_iommu_init(const char *name, struct ldc_channel *lp)
 {
unsigned long sz, num_tsb_entries, tsbsize, order;
-   struct ldc_iommu *iommu = &lp->iommu;
+   struct ldc_iommu *ldc_iommu = &lp->iommu;
+   struct iommu_map_table *iommu = &ldc_iommu->iommu_map_table;
struct ldc_mtable_entry *table;
unsigned long hv_err;
int err;
 
num_tsb_entries = LDC_IOTABLE_SIZE;
tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry);
-
-   spin_lock_init(&iommu->lock);
+   spin_lock_init(&ldc_iommu->lock);
 
sz = num_tsb_entries / 8;
sz = (sz + 7UL) & ~7UL;
-   iommu->arena.map = kzalloc(sz, GFP_KERNEL);
-   if (!iommu->arena.map) {
+   iommu->map = kzalloc(sz, GFP_KERNEL);
+   if (!iommu->map) {
printk(KERN_ERR PFX "Alloc of arena map failed, sz=%lu\n", sz);
return -ENOMEM;
}
-
-   iommu->arena.limit = num_tsb_entries;
+   iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT,
+   NULL, false /* no large pool */,
+   1 /* npools */,
+   true /* skip span boundary check */);
 
order = get_order(tsbsize);
 
@@ -1037,7 +1070,7 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
memset(table, 0, PAGE_SIZE << order);
 
-   iommu->page_table = table;
+   ldc_iommu->page_table = table;
 
hv_err = sun4v_ldc_set_map_table(lp->id, __pa(table),
 num_tsb_entries);
@@ -1049,31 +1082,32 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 out_free_table:
free_pages((unsigned long) table, order);
-   iommu->page_table = NULL;
+   ldc_iommu->page_table = NULL;
 
 out_free_map:
-   kfree(iommu->arena.map);
-   iommu->arena.map = NULL;
+   kfree(iommu->map);
+   iommu->map = NULL;
 
return err;
 }
 
 static

[PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-03-31 Thread Sowmini Varadhan
Addresses BenH comments with one exception: I've left the
IOMMU_POOL_HASH as is, so that powerpc can tailor it to their
convenience. 

I've not heard back from the IB folks, but I'm going to make
a judgement call here and go with the spin_lock. *If* they
report some significant benefit from the trylock, probably 
need to revisit this (and then probably start by re-examining
the hash function to avoid collisions, before resorting to 
trylock).
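
For reference, a hedged sketch of the trylock scheme being weighed against
the plain spin_lock (illustrative only; pool selection and fallback details
differ in the real patches):

    #include <linux/iommu-common.h>
    #include <linux/spinlock.h>

    static struct iommu_pool *grab_a_pool(struct iommu_map_table *tbl,
                                          unsigned int home)
    {
            unsigned int i, pn;

            /* try each pool's lock without spinning ... */
            for (i = 0; i < tbl->nr_pools; i++) {
                    pn = (home + i) & (tbl->nr_pools - 1);
                    if (spin_trylock(&tbl->pools[pn].lock))
                            return &tbl->pools[pn];
            }
            /* ... and only block on the "home" pool if all were busy */
            spin_lock(&tbl->pools[home].lock);
            return &tbl->pools[home];
    }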

Sowmini Varadhan (3):
  Break up monolithic iommu table/lock into finer granularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions
  Make LDC use common iommu poll management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  171 +++
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/ldc.c   |  153 ++--
 arch/sparc/kernel/pci_sun4v.c |  181 +
 include/linux/iommu-common.h  |   48 
 lib/Makefile  |2 +-
 lib/iommu-common.c|  233 +
 8 files changed, 472 insertions(+), 331 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c


Re: [PATCH v8 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer granularity pools and lock

2015-03-31 Thread Sowmini Varadhan
On (03/31/15 15:15), David Laight wrote:
 
 I've wondered whether the iommu setup for ethernet receive (in particular)
 could be made much more efficient if there were a function that
 would unmap one buffer and map a second buffer?
 My thought is that iommu pte entry used by the old buffer could just
 be modified to reference the new one.
 In effect each ring entry would end up using a fixed iommu pte.

There are a number of interesting things to investigate in
this space, and the above is just one of them: ways to avoid
the overhead of a full-blown map/unmap on each call. See 
  http://www.spinics.net/lists/sparclinux/msg13613.html

But the scope of this patchset is actually very rigidly defined:
to refactor the iommu pool/arena allocator into a common library,
and avoid code duplication (today even the single sparc arch
duplicates it for sun4[u,v] and ldc, and that's not even counting
the duplication across other archs/pci-drivers).

Investigating ways to provide a generalized infra that can
avoid a dma map/unmap for every packet would be a good follow-on.
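
Purely as an illustration of the quoted remap idea (no such helper exists in
this series, and every name here is hypothetical):

    #include <linux/types.h>

    /* Keep one fixed IOMMU entry per rx ring slot; when the buffer is
     * replaced, rewrite that entry's PTE and flush it instead of doing a
     * full unmap + map. The slot's bus address never changes, so the rx
     * descriptor does not need to be rewritten either. */
    struct demo_rx_slot {
            unsigned long iommu_entry;   /* fixed entry owned by this slot */
            dma_addr_t    dma_addr;      /* fixed bus address for the slot */
    };

    static void demo_remap_rx_slot(struct demo_rx_slot *slot,
                                   unsigned long new_buf_pa,
                                   void (*write_pte)(unsigned long entry,
                                                     unsigned long pa),
                                   void (*flush_entry)(unsigned long entry))
    {
            write_pte(slot->iommu_entry, new_buf_pa);
            flush_entry(slot->iommu_entry);
    }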

 The other question is how much data can be copied in 26us ?
 On iommu systems 'copybreak' limits on receive and transmit
 may need to be quite high.

where does the 26us number come from? I may be missing that context?

--Sowmini

Re: [PATCH v8 RFC 0/3] Generic IOMMU pooled allocator

2015-03-31 Thread Sowmini Varadhan
On (03/31/15 10:40), Sowmini Varadhan wrote:
 
 I've not heard back from the IB folks, but I'm going to make
 a judgement call here and go with the spin_lock. *If* they
 report some significant benefit from the trylock, probably 
 need to revisit this (and then probably start by re-examining
 the hash function to avoid collisions, before resorting to 
 trylock).

Having bravely said that..

the IB team informs me that they see a 10% degradation using 
the spin_lock as opposed to the trylock.

one path going forward is to continue processing this patch-set 
as is. I can investigate this further, and later revise the spin_lock
to the trylock, after we are certain that it is good/necessary.

thoughts?

--Sowmini



[PATCH v8 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions

2015-03-31 Thread Sowmini Varadhan
iperf experiments running Linux as the Tx side (TCP client) with
10 threads show a severe performance drop when TSO is disabled,
indicating a weakness in the software that can be avoided by using
the scalable IOMMU arena DMA allocation.

Baseline numbers before this patch:
   with default settings (TSO enabled): 9-9.5 Gbps
   with TSO disabled via ethtool (drops badly): 2-3 Gbps.

After this patch, iperf client with 10 threads, can give a
throughput of at least 8.5 Gbps, even when TSO is disabled.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v2: moved sparc specific fields into iommu_sparc
v3: converted all sparc users of iommu, so lot of cleanup and streamlining
v4: David Miller review change:
- s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
- reverts pci_impl.h (now that all iommu usage has been converted)
v5: benh/aik feedback modifies the function signatures: pass in 
modified args to iommu_tbl_pool_init() and iommu_tbl_range_free()
v6: removed iommu_tbl_ops. Pass flush_all as function pointer to 
iommu_tbl_pool_init
v7: move pool_hash initialization to iommu_tbl_pool_init()
v8: BenH comments

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  171 +--
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/pci_sun4v.c |  181 -
 4 files changed, 125 insertions(+), 242 deletions(-)

diff --git a/arch/sparc/include/asm/iommu_64.h 
b/arch/sparc/include/asm/iommu_64.h
index 2b9321a..cd0d69f 100644
--- a/arch/sparc/include/asm/iommu_64.h
+++ b/arch/sparc/include/asm/iommu_64.h
@@ -16,6 +16,7 @@
 #define IOPTE_WRITE   0x0002UL
 
 #define IOMMU_NUM_CTXS 4096
+#include <linux/iommu-common.h>
 
 struct iommu_arena {
unsigned long   *map;
@@ -24,11 +25,10 @@ struct iommu_arena {
 };
 
 struct iommu {
+   struct iommu_map_table  tbl;
spinlock_t  lock;
-   struct iommu_arena  arena;
-   void(*flush_all)(struct iommu *);
+   u32 dma_addr_mask;
iopte_t *page_table;
-   u32 page_table_map_base;
unsigned long   iommu_control;
unsigned long   iommu_tsbbase;
unsigned long   iommu_flush;
@@ -40,7 +40,6 @@ struct iommu {
unsigned long   dummy_page_pa;
unsigned long   ctx_lowest_free;
DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS);
-   u32 dma_addr_mask;
 };
 
 struct strbuf {
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index bfa4d0c..8b76e21 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -13,6 +13,7 @@
 #include <linux/errno.h>
 #include <linux/iommu-helper.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #ifdef CONFIG_PCI
 #include <linux/pci.h>
@@ -45,8 +46,9 @@
   i (ASI_PHYS_BYPASS_EC_E))
 
 /* Must be invoked under the IOMMU lock. */
-static void iommu_flushall(struct iommu *iommu)
+static void iommu_flushall(struct iommu_map_table *iommu_map_table)
 {
+   struct iommu *iommu = container_of(iommu_map_table, struct iommu, tbl);
	if (iommu->iommu_flushinv) {
		iommu_write(iommu->iommu_flushinv, ~(u64)0);
} else {
@@ -87,94 +89,6 @@ static inline void iopte_make_dummy(struct iommu *iommu, 
iopte_t *iopte)
iopte_val(*iopte) = val;
 }
 
-/* Based almost entirely upon the ppc64 iommu allocator.  If you use the 'handle'
- * facility it must all be done in one pass while under the iommu lock.
- *
- * On sun4u platforms, we only flush the IOMMU once every time we've passed
- * over the entire page table doing allocations.  Therefore we only ever advance
- * the hint and cannot backtrack it.
- */
-unsigned long iommu_range_alloc(struct device *dev,
-				struct iommu *iommu,
-				unsigned long npages,
-				unsigned long *handle)
-{
-	unsigned long n, end, start, limit, boundary_size;
-	struct iommu_arena *arena = &iommu->arena;
-	int pass = 0;
-
-	/* This allocator was derived from x86_64's bit string search */
-
-	/* Sanity check */
-	if (unlikely(npages == 0)) {
-		if (printk_ratelimit())
-			WARN_ON(1);
-		return DMA_ERROR_CODE;
-	}
-
-	if (handle && *handle)
-		start = *handle;
-	else
-		start = arena->hint;
-
-	limit = arena->limit;
-
-	/* The case below can happen if we have a small segment appended
-	 * to a large, or when the previous alloc was at the very end of
-	 * the available space. If so, go back to the beginning and flush.
-	 */
-	if (start >= limit) {
-		start = 0;
-		if (iommu->flush_all)
-			iommu->flush_all(iommu

Re: [PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock

2015-03-30 Thread Sowmini Varadhan
On (03/30/15 21:55), Benjamin Herrenschmidt wrote:
 
 No that's not my point. The lock is only taken for a short time but
 might still collide, the bouncing in that case will probably (at least
 that's my feeling) hurt more than help.
 
 However, I have another concern with your construct. Essentially you
 spin looking for an unlocked pool without a cpu_relax. Now it's unlikely
 but you *can* end up eating cycles, which on a high SMT like POWER8
 might mean slowing down the actual owner of the pool lock. 

So I tried looking at the code, and perhaps there is some arch-specific
subtlety here that I am missing, but where does spin_lock itself
do the cpu_relax? afaict, LOCK_CONTENDED() itself does not have this.


 What is iperf ? What does that mean there was lock contention ? IE,
 was the overall performance improved or not ? Removing contention by
 trading it for cache line bouncing will not necessarily help. I'm not
 saying this is a bad optimisation but it makes the code messy and I
 think deserves some solid numbers demonstrating its worth.

iperf is a network perf benchmark. I'll try to regenerate some 
numbers today and get some instrumentation of cache-line bounces
vs trylock misses.
   
 There was also an actual bug in the case where you hop'ed to a new pool
 and forgot the flush.

yes, thanks for catching that, I'll fix that too, of course.

--Sowmini
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock

2015-03-30 Thread Sowmini Varadhan
On (03/30/15 14:24), Benjamin Herrenschmidt wrote:
  +
  +#define IOMMU_POOL_HASHBITS 4
  +#define IOMMU_NR_POOLS  (1 << IOMMU_POOL_HASHBITS)
 
 I don't like those macros. You changed the value from what we had on
 powerpc. It could be that the new values are as good for us but I'd
 like to do a bit differently. Can you make the bits a variable ? Or at
 least an arch-provided macro which we can change later if needed ?

Actually, these are just the upper bound (16) on the number of pools.

The actual number is selected by the value passed to the 
iommu_tbl_range_init(), and is not hard-coded (as was the case with
the power-pc code).

Thus powerpc can continue to use 4 pools without any loss of 
generality.
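
For example, using the v7 prototype quoted later in this thread, powerpc
could keep its current four pools at init time; sketch only, where
ppc_flush_all and tbl_page_shift are placeholder names:

	iommu_tbl_pool_init(&tbl, num_entries, tbl_page_shift,
			    ppc_flush_all,  /* lazy flush hook, may be NULL */
			    true,           /* keep a large pool, as ppc does today */
			    4,              /* npools: the existing ppc value */
			    false);         /* honor the span boundary check */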

   :
 Let's make it clear that this is for allocation of DMA space only, it
 would thus make my life easier when adapting powerpc to use a different
 names, something like struct iommu_area works for me, or iommu
 alloc_region .. whatever you fancy the most.
  :
 Why adding the 'arena' prefix ? What was wrong with pools in the
 powerpc implementation?

for the same reason you want to re-baptize iommu_table above- at
the time, I was doing it to minimize conflicts with existing usage.
But I can rename everything if you like. 

  +#define IOMMU_LARGE_ALLOC  15
 
 Make that a variable, here too, the arch might want to tweak it.
 
 I think 15 is actually not a good value for powerpc with 4k iommu pages
 and 64k PAGE_SIZE, we should make the above some kind of factor of
 PAGE_SHIFT - iommu_page_shift.

Ok.
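
One way to derive that cutoff rather than hard-wiring 15 might be the
following (sketch only, not what the posted patch does; a real version
would also want a sane floor for when the two shifts are equal):

	static unsigned long iommu_large_alloc(struct iommu_table *tbl)
	{
		/* e.g. 64K PAGE_SIZE over 4K IOMMU pages => 16 entries */
		return 1UL << (PAGE_SHIFT - tbl->page_table_shift);
	}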

  +   /* Sanity check */
  +   if (unlikely(npages == 0)) {
  +   printk_ratelimited("npages == 0\n");
 
 You removed the WARN_ON here which had the nice property of giving you a
 backtrace which points to the offender. The above message alone is
 useless.

yes, the original code was generating checkpatch warnings and errors.
That's why I removed it. 

 I am not certain about the first unlocked pool... We take the lock for
 a fairly short amount of time, I have the gut feeling that the cache
 line bouncing introduced by looking at a different pool may well cost
 more than waiting a bit longer. Did do some measurements of that
 optimisation ?

if you are really only taking it for a short amount of time, then
the trylock should just succeed, so there is no cache line bouncing.

But yes, I did instrument it with iperf, and there was lock contention
on the spinlock, which was eliminated by the trylock. 

I'll fix the rest of the variable naming etc nits and send out
a new version later today.

--Sowmini

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock

2015-03-30 Thread Sowmini Varadhan
On (03/31/15 08:28), Benjamin Herrenschmidt wrote:
 
 Provided that the IB test doesn't come up with a significant difference,
 I definitely vote for the simpler version of doing a normal spin_lock.

sounds good. let me wait for the confirmation from IB,
and I'll send out patchv8 soon after. 

FWIW, I'm ok with all the other comments- thanks for the feedback.

--Sowmini

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock

2015-03-30 Thread Sowmini Varadhan
On (03/30/15 09:01), Sowmini Varadhan wrote:
 
 So I tried looking at the code, and perhaps there is some arch-specific
 subtlety here that I am missing, but where does spin_lock itself
 do the cpu_relax? afaict, LOCK_CONTENDED() itself does not have this.

To answer my question:
I'd missed the CONFIG_LOCK_STAT (which David Ahern pointed out to me).
the above is only true for the LOCK_STAT case.
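
For reference, with CONFIG_LOCK_STAT the macro is roughly (quoting
include/linux/lockdep.h of that era from memory, so treat as
approximate; without LOCK_STAT it collapses to just lock(_lock)):

	#define LOCK_CONTENDED(_lock, try, lock)			\
	do {								\
		if (!try(_lock)) {					\
			lock_contended(&(_lock)->dep_map, _RET_IP_);	\
			lock(_lock);					\
		}							\
		lock_acquired(&(_lock)->dep_map, _RET_IP_);		\
	} while (0)

i.e., no cpu_relax() here either; where one exists, it is down in the
arch spin-lock slowpath itself.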

In any case, I ran some experiments today: I was running 
iperf [http://en.wikipedia.org/wiki/Iperf] over ixgbe, which
is where I'd noticed the original perf issues for sparc. I was
running iperf2 (which is more aggressively threaded than iperf3) with
8, 10, 16, 20 threads, and with TSO turned off. In each case, I was
making sure that I was able to reach 9.X Gbps (this is a 10Gbps link)

I don't see any significant difference in the perf profile between the
spin_trylock and the spin_lock version (other than, of course, the change
to the lock-contention for the trylock version). I looked at the
perf profiled cache-misses (works out to about 1400M for 10 threads,
with or without the trylock).

I'm still waiting for some of the IB folks to try out the spin_lock
version (they had also seen some significant perf improvements from
breaking down the monolithic lock into multiple pools, so their workload
is also sensitive to this)

But as such, it looks like it doesn't matter much whether you use
the trylock to find the first available pool, or block on the spin_lock.
I'll let folks on this list vote on this one (assuming the IB tests also 
come out without a significant variation between the 2 locking choices).

--Sowmini
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: [PATCH v7 0/3] Generic IOMMU pooled allocator

2015-03-27 Thread Sowmini Varadhan
On (03/26/15 08:05), Benjamin Herrenschmidt wrote:
  PowerPC folks, what do you think?
 
 I'll give it another look today.
 
 Cheers,
 Ben.

Hi Ben, 

did you have a chance to look at this?

--Sowmini
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-26 Thread Sowmini Varadhan
On (03/25/15 21:43), casca...@linux.vnet.ibm.com wrote:
 However, when using large TCP send/recv (I used uperf with 64KB
 writes/reads), I noticed that on the transmit side, largealloc is not
 used, but on the receive side, cxgb4 almost only uses largealloc, while
 qlge seems to have a 1/1 usage of largealloc/non-largealloc mappings.
 When turning GRO off, that ratio is closer to 1/10, meaning there is
 still some fair use of largealloc in that scenario.
 
 I confess my experiments are not complete. I would like to test a couple
 of other drivers as well, including mlx4_en and bnx2x, and test with
 small packet sizes. I suspected that MTU size could make a difference,
 but in the case of ICMP, with MTU 9000 and payload of 8000 bytes, I
 didn't notice any significant hit of largepool with either qlge or
 cxgb4.

I guess we also need to consider the average use-case, i.e., 
something that interleaves small packets and interactive data with
jumbo/bulk data.. in those cases, the largepool would not get many
hits, and might actually be undesirable?

 But I believe that on the receive side, all drivers should map entire
 pages, using some allocation strategy similar to mlx4_en, in order to
 avoid DMA mapping all the time. 

good point. I think in the early phase of my perf investigation,
it was brought up that Solaris does pre-mapped DMA buffers (they
have to do this carefully, to avoid resource-starvation vulnerabilities;
see http://www.spinics.net/lists/sparclinux/msg13217.html
and the threads leading to it).

This is  not something that the common iommu-arena allocator
can/should get involved in, of course. The scope of the arena-allocator
is much more rigorously defined.

I dont know if there is a way to set up a generalized DMA premapped
buffer infra for linux today.

fwiw, when I instrumented this for solaris (there are hooks to disable
the pre-mapped buffers), the impact on a T5-2 (8 sockets, 2 numa nodes,
64 cpus) was not very significant for a single 10G ixgbe port- approx
8 Gbps instead of 9.X Gbps. I think the DMA buffer pre-mapping is
only significant when you start trying to scale to multiple ethernet 
ports.

--Sowmini

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v7 RFC 3/3] sparc: Make LDC use common iommu poll management functions

2015-03-25 Thread Sowmini Varadhan
Note that this conversion is only being done to consolidate the
code and ensure that the common code provides the sufficient
abstraction. It is not expected to result in any noticeable
performance improvement, as there is typically one ldc_iommu
per vnet_port, and each one has 8k entries, with a typical
request for 1-4 pages.  Thus LDC uses npools == 1.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v3: added this file to be a consumer of the common iommu library
v4: removed ->cookie_to_index and ->demap from iommu_tbl_ops and instead
inline these calls into ldc before calling into iommu-common
v6: remove iommu_tbl_ops
v7: move pool_hash initialization to iommu_tbl_pool_init

 arch/sparc/kernel/ldc.c |  152 ---
 1 files changed, 64 insertions(+), 88 deletions(-)

diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index 274a9f5..e858968 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -15,6 +15,7 @@
 #include <linux/list.h>
 #include <linux/init.h>
 #include <linux/bitmap.h>
+#include <linux/iommu-common.h>
 
 #include <asm/hypervisor.h>
 #include <asm/iommu.h>
@@ -27,6 +28,10 @@
 #define DRV_MODULE_VERSION 1.1
 #define DRV_MODULE_RELDATE July 22, 2008
 
+#define COOKIE_PGSZ_CODE   0xf000000000000000ULL
+#define COOKIE_PGSZ_CODE_SHIFT 60ULL
+
+
 static char version[] =
DRV_MODULE_NAME .c:v DRV_MODULE_VERSION  ( DRV_MODULE_RELDATE )\n;
 #define LDC_PACKET_SIZE64
@@ -98,10 +103,10 @@ static const struct ldc_mode_ops stream_ops;
 int ldom_domaining_enabled;
 
 struct ldc_iommu {
-   /* Protects arena alloc/free.  */
+   /* Protects ldc_unmap.  */
spinlock_t  lock;
-   struct iommu_arena  arena;
struct ldc_mtable_entry *page_table;
+   struct iommu_table  iommu_table;
 };
 
 struct ldc_channel {
@@ -998,31 +1003,59 @@ static void free_queue(unsigned long num_entries, struct 
ldc_packet *q)
free_pages((unsigned long)q, order);
 }
 
+static unsigned long ldc_cookie_to_index(u64 cookie, void *arg)
+{
+	u64 szcode = cookie >> COOKIE_PGSZ_CODE_SHIFT;
+	/* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */
+
+	cookie &= ~COOKIE_PGSZ_CODE;
+
+	return (cookie >> (13ULL + (szcode * 3ULL)));
+}
+
+static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie,
+		      unsigned long entry, unsigned long npages)
+{
+	struct ldc_mtable_entry *base;
+	unsigned long i, shift;
+
+	shift = (cookie >> COOKIE_PGSZ_CODE_SHIFT) * 3;
+	base = iommu->page_table + entry;
+	for (i = 0; i < npages; i++) {
+		if (base->cookie)
+			sun4v_ldc_revoke(id, cookie + (i << shift),
+					 base->cookie);
+		base->mte = 0;
+	}
+}
+
 /* XXX Make this configurable... XXX */
 #define LDC_IOTABLE_SIZE   (8 * 1024)
 
-static int ldc_iommu_init(struct ldc_channel *lp)
+static int ldc_iommu_init(const char *name, struct ldc_channel *lp)
 {
unsigned long sz, num_tsb_entries, tsbsize, order;
-	struct ldc_iommu *iommu = &lp->iommu;
+	struct ldc_iommu *ldc_iommu = &lp->iommu;
+	struct iommu_table *iommu = &ldc_iommu->iommu_table;
struct ldc_mtable_entry *table;
unsigned long hv_err;
int err;
 
num_tsb_entries = LDC_IOTABLE_SIZE;
tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry);
-
-	spin_lock_init(&iommu->lock);
+	spin_lock_init(&ldc_iommu->lock);
 
 	sz = num_tsb_entries / 8;
 	sz = (sz + 7UL) & ~7UL;
-	iommu->arena.map = kzalloc(sz, GFP_KERNEL);
-	if (!iommu->arena.map) {
+	iommu->map = kzalloc(sz, GFP_KERNEL);
+	if (!iommu->map) {
 		printk(KERN_ERR PFX "Alloc of arena map failed, sz=%lu\n", sz);
return -ENOMEM;
}
-
-	iommu->arena.limit = num_tsb_entries;
+   iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT,
+   NULL, false /* no large pool */,
+   1 /* npools */,
+   true /* skip span boundary check */);
 
order = get_order(tsbsize);
 
@@ -1037,7 +1070,7 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
memset(table, 0, PAGE_SIZE  order);
 
-   iommu-page_table = table;
+   ldc_iommu-page_table = table;
 
hv_err = sun4v_ldc_set_map_table(lp-id, __pa(table),
 num_tsb_entries);
@@ -1049,31 +1082,32 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
 out_free_table:
free_pages((unsigned long) table, order);
-   iommu-page_table = NULL;
+   ldc_iommu-page_table = NULL;
 
 out_free_map:
-   kfree(iommu-arena.map);
-   iommu-arena.map = NULL;
+   kfree(iommu-map);
+   iommu-map = NULL;
 
return err;
 }
 
 static void ldc_iommu_release(struct

[PATCH v7 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions

2015-03-25 Thread Sowmini Varadhan
iperf experiments running linux as the Tx side (TCP client) with
10 threads show a severe performance drop when TSO is disabled,
indicating a weakness in the software that can be avoided by using
the scalable IOMMU arena DMA allocation.

Baseline numbers before this patch:
   with default settings (TSO enabled):  9-9.5 Gbps
   with TSO disabled via ethtool, it drops badly:  2-3 Gbps.

After this patch, iperf client with 10 threads, can give a
throughput of at least 8.5 Gbps, even when TSO is disabled.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v2: moved sparc specific fields into iommu_sparc
v3: converted all sparc users of iommu, so a lot of cleanup and streamlining
v4: David Miller review change:
- s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
- reverts pci_impl.h (now that all iommu usage has been converted)
v5: benh/aik feedback modifies the function signatures: pass in 
modified args to iommu_tbl_pool_init() and iommu_tbl_range_free()
v6: removed iommu_tbl_ops. Pass flush_all as function pointer to 
iommu_tbl_pool_init
v7: move pool_hash initialization to iommu_tbl_pool_init()

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  174 +---
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/pci_sun4v.c |  179 
 4 files changed, 127 insertions(+), 241 deletions(-)

diff --git a/arch/sparc/include/asm/iommu_64.h 
b/arch/sparc/include/asm/iommu_64.h
index 2b9321a..e3cd449 100644
--- a/arch/sparc/include/asm/iommu_64.h
+++ b/arch/sparc/include/asm/iommu_64.h
@@ -16,6 +16,7 @@
 #define IOPTE_WRITE   0x0002UL
 
 #define IOMMU_NUM_CTXS 4096
+#include linux/iommu-common.h
 
 struct iommu_arena {
unsigned long   *map;
@@ -24,11 +25,10 @@ struct iommu_arena {
 };
 
 struct iommu {
+   struct iommu_table  tbl;
spinlock_t  lock;
-   struct iommu_arena  arena;
-   void(*flush_all)(struct iommu *);
+   u32 dma_addr_mask;
iopte_t *page_table;
-   u32 page_table_map_base;
unsigned long   iommu_control;
unsigned long   iommu_tsbbase;
unsigned long   iommu_flush;
@@ -40,7 +40,6 @@ struct iommu {
unsigned long   dummy_page_pa;
unsigned long   ctx_lowest_free;
DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS);
-   u32 dma_addr_mask;
 };
 
 struct strbuf {
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index bfa4d0c..f7fdff2 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -13,6 +13,7 @@
 #include linux/errno.h
 #include linux/iommu-helper.h
 #include linux/bitmap.h
+#include linux/iommu-common.h
 
 #ifdef CONFIG_PCI
 #include linux/pci.h
@@ -45,8 +46,9 @@
   i (ASI_PHYS_BYPASS_EC_E))
 
 /* Must be invoked under the IOMMU lock. */
-static void iommu_flushall(struct iommu *iommu)
+static void iommu_flushall(struct iommu_table *iommu_table)
 {
+   struct iommu *iommu = container_of(iommu_table, struct iommu, tbl);
if (iommu-iommu_flushinv) {
iommu_write(iommu-iommu_flushinv, ~(u64)0);
} else {
@@ -87,94 +89,6 @@ static inline void iopte_make_dummy(struct iommu *iommu, 
iopte_t *iopte)
iopte_val(*iopte) = val;
 }
 
-/* Based almost entirely upon the ppc64 iommu allocator.  If you use the 
'handle'
- * facility it must all be done in one pass while under the iommu lock.
- *
- * On sun4u platforms, we only flush the IOMMU once every time we've passed
- * over the entire page table doing allocations.  Therefore we only ever 
advance
- * the hint and cannot backtrack it.
- */
-unsigned long iommu_range_alloc(struct device *dev,
-   struct iommu *iommu,
-   unsigned long npages,
-   unsigned long *handle)
-{
-   unsigned long n, end, start, limit, boundary_size;
-   struct iommu_arena *arena = iommu-arena;
-   int pass = 0;
-
-   /* This allocator was derived from x86_64's bit string search */
-
-   /* Sanity check */
-   if (unlikely(npages == 0)) {
-   if (printk_ratelimit())
-   WARN_ON(1);
-   return DMA_ERROR_CODE;
-   }
-
-   if (handle  *handle)
-   start = *handle;
-   else
-   start = arena-hint;
-
-   limit = arena-limit;
-
-   /* The case below can happen if we have a small segment appended
-* to a large, or when the previous alloc was at the very end of
-* the available space. If so, go back to the beginning and flush.
-*/
-   if (start = limit) {
-   start = 0;
-   if (iommu-flush_all)
-   iommu-flush_all(iommu);
-   }
-
- again

[PATCH v7 0/3] Generic IOMMU pooled allocator

2015-03-25 Thread Sowmini Varadhan

Changes from patchv6: moved pool_hash initialization to
lib/iommu-common.c and cleaned up code duplication from 
sun4v/sun4u/ldc. 

Sowmini (2):
  Break up monolithic iommu table/lock into finer graularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions

Sowmini Varadhan (1):
  Make LDC use common iommu poll management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  174 +++
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/ldc.c   |  152 ++--
 arch/sparc/kernel/pci_sun4v.c |  179 +
 include/linux/iommu-common.h  |   48 
 lib/Makefile  |2 +-
 lib/iommu-common.c|  235 +
 8 files changed, 475 insertions(+), 330 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v7 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock

2015-03-25 Thread Sowmini Varadhan
Investigation of multithreaded iperf experiments on an ethernet
interface show the iommu-lock as the hottest lock identified by
lockstat, with something of the order of  21M contentions out of
27M acquisitions, and an average wait time of 26 us for the lock.
This is not efficient. A more scalable design is to follow the ppc
model, where the iommu_table has multiple pools, each stretching
over a segment of the map, and with a separate lock for each pool.
This model allows for better parallelization of the iommu map search.

This patch adds the iommu range alloc/free function infrastructure.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v2 changes:
  - incorporate David Miller editorial comments: sparc specific
fields moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when
the iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary
check, for use case such as LDC.
v3: eliminate iommu_sparc, rearrange the -demap indirection to
be invoked under the pool lock.

v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32.

v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru
  - removed ->cookie_to_index and ->demap indirection: caller should
invoke these as needed before calling into the generic allocator

v6: Benh/DaveM discussion eliminating iommu_tbl_ops, but retaining flush_all
optimization.

v7: one-time initialization of pool_hash from iommu_tbl_pool_init()

 include/linux/iommu-common.h |   48 +
 lib/Makefile |2 +-
 lib/iommu-common.c   |  235 ++
 3 files changed, 284 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
new file mode 100644
index 000..197111b
--- /dev/null
+++ b/include/linux/iommu-common.h
@@ -0,0 +1,48 @@
+#ifndef _LINUX_IOMMU_COMMON_H
+#define _LINUX_IOMMU_COMMON_H
+
+#include <linux/spinlock_types.h>
+#include <linux/device.h>
+#include <asm/page.h>
+
+#define IOMMU_POOL_HASHBITS 4
+#define IOMMU_NR_POOLS  (1 << IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+   unsigned long   start;
+   unsigned long   end;
+   unsigned long   hint;
+   spinlock_t  lock;
+};
+
+struct iommu_table {
+   unsigned long   page_table_map_base;
+   unsigned long   page_table_shift;
+   unsigned long   nr_pools;
+   void(*flush_all)(struct iommu_table *);
+   unsigned long   poolsize;
+   struct iommu_pool   arena_pool[IOMMU_NR_POOLS];
+   u32 flags;
+#define	IOMMU_HAS_LARGE_POOL	0x00000001
+#define	IOMMU_NO_SPAN_BOUND	0x00000002
+   struct iommu_pool   large_pool;
+   unsigned long   *map;
+};
+
+extern void iommu_tbl_pool_init(struct iommu_table *iommu,
+   unsigned long num_entries,
+   u32 page_table_shift,
+   void (*flush_all)(struct iommu_table *),
+   bool large_pool, u32 npools,
+   bool skip_span_boundary_check);
+
+extern unsigned long iommu_tbl_range_alloc(struct device *dev,
+  struct iommu_table *iommu,
+  unsigned long npages,
+  unsigned long *handle);
+
+extern void iommu_tbl_range_free(struct iommu_table *iommu,
+u64 dma_addr, unsigned long npages,
+unsigned long entry);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 3c3b30b..0ea2ac6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -102,7 +102,7 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
-obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
 obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
diff --git a/lib/iommu-common.c b/lib/iommu-common.c
new file mode 100644
index 000..bb7e706
--- /dev/null
+++ b/lib/iommu-common.c
@@ -0,0 +1,235 @@
+/*
+ * IOMMU mmap management and range allocation functions.
+ * Based almost entirely upon the powerpc iommu allocator.
+ */
+
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/bug.h>
+#include <linux/iommu-helper.h>
+#include <linux/iommu-common.h>
+#include <linux/dma-mapping.h>
+#include <linux/hash.h>
+
+#define IOMMU_LARGE_ALLOC  15
+
+static DEFINE_PER_CPU(unsigned int

Re: [PATCH v6 0/3] Generic IOMMU pooled allocator

2015-03-25 Thread Sowmini Varadhan
On (03/24/15 18:16), David Miller wrote:
 Generally this looks fine to me.
 
 But about patch #2, I see no reason to have multiple iommu_pool_hash
 tables.  Even from a purely sparc perspective, we can always just do
 with just one of them.
 
 Furthermore, you can even probably move it down into lib/iommu-common.c
 itself.  iommu_tbl_pool_init() can do the one time initialization.

fixed in v7.
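
For reference, the one-time initialization that moved into
lib/iommu-common.c boils down to the same per-cpu hash setup as the
setup_*_pool_hash() helpers quoted further down in this archive
(sketch of the v7 shape):

	static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);

	static void setup_iommu_pool_hash(void)
	{
		static bool do_once;
		unsigned int i;

		if (do_once)
			return;
		do_once = true;
		for_each_possible_cpu(i)
			per_cpu(iommu_pool_hash, i) =
				hash_32(i, IOMMU_POOL_HASHBITS);
	}

	/* called from iommu_tbl_pool_init(), so arch code need not bother */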

Ben, Alexey, do you need more time to review this?

--Sowmini
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v6 RFC 3/3] sparc: Make LDC use common iommu poll management functions

2015-03-24 Thread Sowmini Varadhan
Note that this conversion is only being done to consolidate the
code and ensure that the common code provides the sufficient
abstraction. It is not expected to result in any noticeable
performance improvement, as there is typically one ldc_iommu
per vnet_port, and each one has 8k entries, with a typical
request for 1-4 pages.  Thus LDC uses npools == 1.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v3: added this file to be a consumer of the common iommu library
v4: removed ->cookie_to_index and ->demap from iommu_tbl_ops and instead
inline these calls into ldc before calling into iommu-common
v6: remove iommu_tbl_ops

 arch/sparc/kernel/ldc.c |  169 ++
 1 files changed, 81 insertions(+), 88 deletions(-)

diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index 274a9f5..c6b1bb1 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -15,6 +15,8 @@
 #include linux/list.h
 #include linux/init.h
 #include linux/bitmap.h
+#include linux/hash.h
+#include linux/iommu-common.h
 
 #include asm/hypervisor.h
 #include asm/iommu.h
@@ -27,6 +29,11 @@
 #define DRV_MODULE_VERSION 1.1
 #define DRV_MODULE_RELDATE July 22, 2008
 
+#define COOKIE_PGSZ_CODE   0xf000ULL
+#define COOKIE_PGSZ_CODE_SHIFT 60ULL
+
+static DEFINE_PER_CPU(unsigned int, ldc_pool_hash);
+
 static char version[] =
DRV_MODULE_NAME .c:v DRV_MODULE_VERSION  ( DRV_MODULE_RELDATE )\n;
 #define LDC_PACKET_SIZE64
@@ -98,10 +105,10 @@ static const struct ldc_mode_ops stream_ops;
 int ldom_domaining_enabled;
 
 struct ldc_iommu {
-   /* Protects arena alloc/free.  */
+   /* Protects ldc_unmap.  */
spinlock_t  lock;
-   struct iommu_arena  arena;
struct ldc_mtable_entry *page_table;
+   struct iommu_table  iommu_table;
 };
 
 struct ldc_channel {
@@ -998,31 +1005,73 @@ static void free_queue(unsigned long num_entries, struct 
ldc_packet *q)
free_pages((unsigned long)q, order);
 }
 
+static unsigned long ldc_cookie_to_index(u64 cookie, void *arg)
+{
+   u64 szcode = cookie  COOKIE_PGSZ_CODE_SHIFT;
+   /* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */
+
+   cookie = ~COOKIE_PGSZ_CODE;
+
+   return (cookie  (13ULL + (szcode * 3ULL)));
+}
+
+static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie,
+ unsigned long entry, unsigned long npages)
+{
+   struct ldc_mtable_entry *base;
+   unsigned long i, shift;
+
+   shift = (cookie  COOKIE_PGSZ_CODE_SHIFT) * 3;
+   base = iommu-page_table + entry;
+   for (i = 0; i  npages; i++) {
+   if (base-cookie)
+   sun4v_ldc_revoke(id, cookie + (i  shift),
+base-cookie);
+   base-mte = 0;
+   }
+}
+
 /* XXX Make this configurable... XXX */
 #define LDC_IOTABLE_SIZE   (8 * 1024)
 
-static int ldc_iommu_init(struct ldc_channel *lp)
+static void setup_ldc_pool_hash(void)
+{
+   unsigned int i;
+   static bool do_once;
+
+   if (do_once)
+   return;
+   do_once = true;
+   for_each_possible_cpu(i)
+   per_cpu(ldc_pool_hash, i) = hash_32(i, IOMMU_POOL_HASHBITS);
+}
+
+
+static int ldc_iommu_init(const char *name, struct ldc_channel *lp)
 {
unsigned long sz, num_tsb_entries, tsbsize, order;
-   struct ldc_iommu *iommu = lp-iommu;
+   struct ldc_iommu *ldc_iommu = lp-iommu;
+   struct iommu_table *iommu = ldc_iommu-iommu_table;
struct ldc_mtable_entry *table;
unsigned long hv_err;
int err;
 
num_tsb_entries = LDC_IOTABLE_SIZE;
tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry);
-
-   spin_lock_init(iommu-lock);
+   setup_ldc_pool_hash();
+   spin_lock_init(ldc_iommu-lock);
 
sz = num_tsb_entries / 8;
sz = (sz + 7UL)  ~7UL;
-   iommu-arena.map = kzalloc(sz, GFP_KERNEL);
-   if (!iommu-arena.map) {
+   iommu-map = kzalloc(sz, GFP_KERNEL);
+   if (!iommu-map) {
printk(KERN_ERR PFX Alloc of arena map failed, sz=%lu\n, sz);
return -ENOMEM;
}
-
-   iommu-arena.limit = num_tsb_entries;
+   iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT,
+   NULL, false /* no large pool */,
+   1 /* npools */,
+   true /* skip span boundary check */);
 
order = get_order(tsbsize);
 
@@ -1037,7 +1086,7 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
memset(table, 0, PAGE_SIZE  order);
 
-   iommu-page_table = table;
+   ldc_iommu-page_table = table;
 
hv_err = sun4v_ldc_set_map_table(lp-id, __pa(table),
 num_tsb_entries);
@@ -1049,31 +1098,32 @@ static int ldc_iommu_init(struct ldc_channel *lp

[PATCH v6 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock

2015-03-24 Thread Sowmini Varadhan
Investigation of multithreaded iperf experiments on an ethernet
interface show the iommu-lock as the hottest lock identified by
lockstat, with something of the order of  21M contentions out of
27M acquisitions, and an average wait time of 26 us for the lock.
This is not efficient. A more scalable design is to follow the ppc
model, where the iommu_table has multiple pools, each stretching
over a segment of the map, and with a separate lock for each pool.
This model allows for better parallelization of the iommu map search.

This patch adds the iommu range alloc/free function infrastructure.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v2 changes:
  - incorporate David Miller editorial comments: sparc specific
fields moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when
the iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary
check, for use case such as LDC.
v3: eliminate iommu_sparc, rearrange the -demap indirection to
be invoked under the pool lock.

v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32.

v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru
  - removed ->cookie_to_index and ->demap indirection: caller should
invoke these as needed before calling into the generic allocator

v6: Benh/DaveM discussion eliminating iommu_tbl_ops, but retaining flush_all
optimization.

 include/linux/iommu-common.h |   49 ++
 lib/Makefile |2 +-
 lib/iommu-common.c   |  219 ++
 3 files changed, 269 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
new file mode 100644
index 000..266a169
--- /dev/null
+++ b/include/linux/iommu-common.h
@@ -0,0 +1,49 @@
+#ifndef _LINUX_IOMMU_COMMON_H
+#define _LINUX_IOMMU_COMMON_H
+
+#include linux/spinlock_types.h
+#include linux/device.h
+#include asm/page.h
+
+#define IOMMU_POOL_HASHBITS 4
+#define IOMMU_NR_POOLS  (1  IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+   unsigned long   start;
+   unsigned long   end;
+   unsigned long   hint;
+   spinlock_t  lock;
+};
+
+struct iommu_table {
+   unsigned long   page_table_map_base;
+   unsigned long   page_table_shift;
+   unsigned long   nr_pools;
+   void(*flush_all)(struct iommu_table *);
+   unsigned long   poolsize;
+   struct iommu_pool   arena_pool[IOMMU_NR_POOLS];
+   u32 flags;
+#defineIOMMU_HAS_LARGE_POOL0x0001
+#defineIOMMU_NO_SPAN_BOUND 0x0002
+   struct iommu_pool   large_pool;
+   unsigned long   *map;
+};
+
+extern void iommu_tbl_pool_init(struct iommu_table *iommu,
+   unsigned long num_entries,
+   u32 page_table_shift,
+   void (*flush_all)(struct iommu_table *),
+   bool large_pool, u32 npools,
+   bool skip_span_boundary_check);
+
+extern unsigned long iommu_tbl_range_alloc(struct device *dev,
+  struct iommu_table *iommu,
+  unsigned long npages,
+  unsigned long *handle,
+  unsigned int pool_hash);
+
+extern void iommu_tbl_range_free(struct iommu_table *iommu,
+u64 dma_addr, unsigned long npages,
+unsigned long entry);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 3c3b30b..0ea2ac6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -102,7 +102,7 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
-obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
 obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
diff --git a/lib/iommu-common.c b/lib/iommu-common.c
new file mode 100644
index 000..86a5bca
--- /dev/null
+++ b/lib/iommu-common.c
@@ -0,0 +1,219 @@
+/*
+ * IOMMU mmap management and range allocation functions.
+ * Based almost entirely upon the powerpc iommu allocator.
+ */
+
+#include linux/export.h
+#include linux/bitmap.h
+#include linux/bug.h
+#include linux/iommu-helper.h
+#include linux/iommu-common.h
+#include linux/dma-mapping.h
+
+#define IOMMU_LARGE_ALLOC  15
+
+/*
+ * Initialize iommu_pool entries for the iommu_table

[PATCH v6 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions

2015-03-24 Thread Sowmini Varadhan
iperf experiments running linux as the Tx side (TCP client) with
10 threads show a severe performance drop when TSO is disabled,
indicating a weakness in the software that can be avoided by using
the scalable IOMMU arena DMA allocation.

Baseline numbers before this patch:
   with default settings (TSO enabled):  9-9.5 Gbps
   with TSO disabled via ethtool, it drops badly:  2-3 Gbps.

After this patch, iperf client with 10 threads, can give a
throughput of at least 8.5 Gbps, even when TSO is disabled.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v2: moved sparc specific fields into iommu_sparc
v3: converted all sparc users of iommu, so a lot of cleanup and streamlining
v4: David Miller review change:
- s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
- reverts pci_impl.h (now that all iommu usage has been converted)
v5: benh/aik feedback modifies the function signatures: pass in 
modified args to iommu_tbl_pool_init() and iommu_tbl_range_free()
v6: removed iommu_tbl_ops. Pass flush_all as function pointer to 
iommu_tbl_pool_init

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  182 +++
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/pci_sun4v.c |  192 ++---
 4 files changed, 153 insertions(+), 236 deletions(-)

diff --git a/arch/sparc/include/asm/iommu_64.h 
b/arch/sparc/include/asm/iommu_64.h
index 2b9321a..e3cd449 100644
--- a/arch/sparc/include/asm/iommu_64.h
+++ b/arch/sparc/include/asm/iommu_64.h
@@ -16,6 +16,7 @@
 #define IOPTE_WRITE   0x0002UL
 
 #define IOMMU_NUM_CTXS 4096
+#include linux/iommu-common.h
 
 struct iommu_arena {
unsigned long   *map;
@@ -24,11 +25,10 @@ struct iommu_arena {
 };
 
 struct iommu {
+   struct iommu_table  tbl;
spinlock_t  lock;
-   struct iommu_arena  arena;
-   void(*flush_all)(struct iommu *);
+   u32 dma_addr_mask;
iopte_t *page_table;
-   u32 page_table_map_base;
unsigned long   iommu_control;
unsigned long   iommu_tsbbase;
unsigned long   iommu_flush;
@@ -40,7 +40,6 @@ struct iommu {
unsigned long   dummy_page_pa;
unsigned long   ctx_lowest_free;
DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS);
-   u32 dma_addr_mask;
 };
 
 struct strbuf {
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index bfa4d0c..9886186 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -13,11 +13,15 @@
 #include linux/errno.h
 #include linux/iommu-helper.h
 #include linux/bitmap.h
+#include linux/hash.h
+#include linux/iommu-common.h
 
 #ifdef CONFIG_PCI
 #include linux/pci.h
 #endif
 
+static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
+
 #include asm/iommu.h
 
 #include iommu_common.h
@@ -45,8 +49,9 @@
   i (ASI_PHYS_BYPASS_EC_E))
 
 /* Must be invoked under the IOMMU lock. */
-static void iommu_flushall(struct iommu *iommu)
+static void iommu_flushall(struct iommu_table *iommu_table)
 {
+   struct iommu *iommu = container_of(iommu_table, struct iommu, tbl);
if (iommu-iommu_flushinv) {
iommu_write(iommu-iommu_flushinv, ~(u64)0);
} else {
@@ -87,93 +92,18 @@ static inline void iopte_make_dummy(struct iommu *iommu, 
iopte_t *iopte)
iopte_val(*iopte) = val;
 }
 
-/* Based almost entirely upon the ppc64 iommu allocator.  If you use the 
'handle'
- * facility it must all be done in one pass while under the iommu lock.
- *
- * On sun4u platforms, we only flush the IOMMU once every time we've passed
- * over the entire page table doing allocations.  Therefore we only ever 
advance
- * the hint and cannot backtrack it.
- */
-unsigned long iommu_range_alloc(struct device *dev,
-   struct iommu *iommu,
-   unsigned long npages,
-   unsigned long *handle)
+static void setup_iommu_pool_hash(void)
 {
-   unsigned long n, end, start, limit, boundary_size;
-   struct iommu_arena *arena = iommu-arena;
-   int pass = 0;
-
-   /* This allocator was derived from x86_64's bit string search */
-
-   /* Sanity check */
-   if (unlikely(npages == 0)) {
-   if (printk_ratelimit())
-   WARN_ON(1);
-   return DMA_ERROR_CODE;
-   }
-
-   if (handle  *handle)
-   start = *handle;
-   else
-   start = arena-hint;
-
-   limit = arena-limit;
-
-   /* The case below can happen if we have a small segment appended
-* to a large, or when the previous alloc was at the very end of
-* the available space. If so, go back to the beginning and flush.
-*/
-   if (start = limit) {
-   start

[PATCH v6 0/3] Generic IOMMU pooled allocator

2015-03-24 Thread Sowmini Varadhan
Deltas from patchv5:
- removed iommu_tbl_ops, and instead pass the ->flush_all as
  an indirection to iommu_tbl_pool_init()
- only invoke ->flush_all when there is no large_pool, based on
  the assumption that large-pool usage is infrequently encountered.

Sowmini (2):
  Break up monolithic iommu table/lock into finer graularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions

Sowmini Varadhan (1):
  Make LDC use common iommu poll management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  182 ++-
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/ldc.c   |  169 ++---
 arch/sparc/kernel/pci_sun4v.c |  192 -
 include/linux/iommu-common.h  |   49 
 lib/Makefile  |2 +-
 lib/iommu-common.c|  219 +
 8 files changed, 503 insertions(+), 325 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
On (03/23/15 12:29), David Miller wrote:
 
 In order to elide the IOMMU flush as much as possible, I implemented
 a scheme for sun4u wherein we always allocated from low IOMMU
 addresses to high IOMMU addresses.
 
 In this regime, we only need to flush the IOMMU when we rolled over
 back to low IOMMU addresses during an allocation.
 
 It made a noticable difference in performance.
 
 Unfortunately, with sun4v and the hypervisor, I'm not allowed to
 control when the IOMMU flush happens, it has to occur on every
 single IOMMU mapping change.  So this optimization was no longer
 possible there.
 
 Anyways, that's the history behind it.
 --

I see.

If it was only an optimization (i.e., removing it would not break
any functionality), and if this was done for older hardware,
and *if* we believe that the direction of most architectures is to 
follow the sun4v/HV model, then, given that the sun4u code only uses 1 
arena pool anyway, one thought that I have for refactoring this
is the following:

- Caller of iommu_tbl_range_alloc() can do the flush_all if they 
  see start = end for the one single pool 
- lose the other ->flush_all invocation (i.e., the one that is
  done when iommu_area_alloc() fails for pass == 0, and we reset
  start to 0 to roll-back)

that would avoid the need for any iommu_tbl_ops in my patch-set.

But it would imply that you would still take the perf hit for the roll-back
if we failed the pass == 0 iteration through iommu_area_alloc().
Perhaps this is an acceptable compromise in favor of cleaner code
(again, assuming that current/future archs will all follow the HV
based design).
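
A rough sketch of what that caller-side variant could look like on
sun4u, using the later 4-argument iommu_tbl_range_alloc(); last_entry
is a hypothetical per-table hint kept by the caller, and the
flush-before-unlock caveat raised elsewhere in this thread still
applies:

	entry = iommu_tbl_range_alloc(dev, &iommu->tbl, npages, &handle);
	if (entry != DMA_ERROR_CODE && entry < last_entry)
		iommu_flushall(&iommu->tbl);	/* allocation wrapped around */
	last_entry = entry;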

--Sowmini
 
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
On (03/23/15 15:05), David Miller wrote:
 
 Why add performance regressions to old machines who already are
 suffering too much from all the bloat we are constantly adding to the
 kernel?

I have no personal opinion on this- it's a matter of  choosing
whether we want to have some extra baggage in the library to support
old hardware, or keep the code lean-and-mean for the future,
where the model has changed.

I'll let others on this list make the call.

--Sowmini

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
On (03/24/15 11:47), Benjamin Herrenschmidt wrote:
 
 Yes, pass a function pointer argument that can be NULL or just make it a
 member of the iommu_allocator struct (or whatever you call it) passed to
 the init function and that can be NULL. My point is we don't need a
 separate ops structure.

Ok, I'll make this a function pointer to the iommu_tbl_pool_init()
function (tomorrow)

fwiw, the ops struct came out as a result of DaveM input, 
  http://www.spinics.net/lists/sparclinux/msg13232.html
albeit the original context is now moot.

 Pass it to init and stash it in the table but don't call it
 iommu_table, let's use a name that conveys better the fact that this
 is purely a DMA space allocator (to be embedded by the arch specific
 iommu table). Something like iommu_alloc_zone or whatever you want to
 call it. I keep finding new names whenever I think of it :-)

please pick a name that you like, as this mooshes all the patches,
and I myself really don't care what it gets called (rose? :-)) as long 
as the fat-fingering risk is minimized.

Regarding the relationship with largepool,

 But that might not be necessary. If indeed we very rarely use the large
 pool, then just make it flush always. My feeling is that it will only
 ever be used at driver init/remove time when allocating things like
 descriptor rings, where the flush overhead dont' matter.

ok, will factor this in tomorrow (assuming DaveM has no objections
to anything proposed here, esp around the function pointer for flushall).

--Sowmini
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
On (03/24/15 09:21), Benjamin Herrenschmidt wrote:
 
 So we have two choices here that I can see:
 
  - Keep that old platform use the old/simpler allocator

Problem with that approach is that the base struct iommu structure
for sparc gets a split personality: the older one is used with
the older allocator, and other ugly things ensue.  (alternatively,
you end up duplicating a version of the code with the flush_all 
inlined). 

  - Try to regain the bulk of that benefit with the new one
 
 Sowmini, I see various options for the second choice. We could stick to
 1 pool, and basically do as before, ie, if we fail on the first pass of
 alloc, it means we wrap around and do a flush, I don't think that will
 cause a significant degradation from today, do you ? We might have an
 occasional additional flush but I would expect it to be in the noise.

Isn't this essentially what I have in patch v5 here:
http://www.spinics.net/lists/sparclinux/msg13534.html

(the ops->reset is the flushall indirection, can be renamed if the
latter is preferred)

 Dave, what's your feeling there ? Does anybody around still have some
 HW that we can test with ?

I actually tested this on a V440 and a ultra45 (had a heck of a
time finding these, since the owners keep them turned off because
they are too noisy and consume too much power :-). Thus while I
have no opinion, I would not shed any tears if we lost this extra
perf-tweak in the interest of being earth-friendly  :-))

so testing it is not a problem, though I dont have any perf
benchmarks for them either.

 Sowmini, I think we can still kill the ops and have a separate data
 structure exclusively concerned by allocations by having the alloc
 functions take the lazy flush function as an argument (which can be
 NULL), I don't think we should bother with ops.

I don't quite follow what you have in mind? The caller would somehow
have to specify a flush_all indirection for the legacy platforms.

Also, you mention

 You must hold the lock until you do the flush, otherwise somebody
 else might allocate the not-yet-flushed areas and try to use them...
 kaboom. However if that's the only callback left, pass it as an
 argument.

Passing  in a function pointer to the flushall to iommu_tbl_range_alloc
would work, or we could pass it in as an arg to iommu_tbl_init,
and stash it in the struct iommu_table.

--Sowmini

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
On (03/24/15 09:36), Benjamin Herrenschmidt wrote:
 
  - One pool only
 
  - Whenever the allocation is before the previous hint, do a flush, that
 should only happen if a wrap around occurred or in some cases if the
 device DMA mask forced it. I think we always update the hint whenever we
 successfully allocate from the small pools.

right, I believe this is close to what I discussed in the previous
thread, and approx what I have in patchv5, except that the flush 
indirection can be passed as a function pointer, or via the table_ops.

 
  - Deal with the largealloc case. That's the contentious issue, see
 below.
  :
 The largealloc issue is a different can of worms. We can try adding an
 option to disable the largealloc business completely (instead of hard
 wiring the 15, make that a variable and define that 0 means no
 largealloc pool).

What I've tried to do is to have a bool large_pool arg passed
to iommu_tbl_pool_init. In my observation (instrumented for scsi, ixgbe), 
we never allocate more than 4 pages at a time, so I pass in 
large_pool == false for all the sparc platforms. 

 Or we can decide that large allocs are rare (typically
 pci_alloc_consistent, ie, driver init time), and thus always flush on
 them (or rather on free of a large chunk). David, what's your take
 there ? I have a feeling that should work fine without a noticeable
 performance issue...
 
 I would also keep a dirty flag set on any free and cleared on any
 flush to avoid more spurious flushes, but here too the benefit might be
 in the noise.
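
(A sketch of the "always flush on free of a large chunk" idea, using the
function and field names from the posted patches; free_large() itself is
hypothetical:)

	static void free_large(struct iommu_table *tbl, u64 dma_addr,
			       unsigned long npages, unsigned long entry)
	{
		iommu_tbl_range_free(tbl, dma_addr, npages, entry);
		if (tbl->flush_all)
			tbl->flush_all(tbl);	/* rare path: driver init/teardown */
	}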
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-23 Thread Sowmini Varadhan
benh It might be sufficient to add a flush counter and compare it between runs
benh if actual wall-clock benchmarks are too hard to do (especially if you
benh don't have things like very fast network cards at hand).
benh
benh Number of flush / number of packets might be a sufficient metric, it..
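
(The counters being suggested are cheap to add; a sketch, with
hypothetical names:)

	static atomic_long_t iommu_flush_cnt;	/* bump wherever we flush */
	static atomic_long_t iommu_alloc_cnt;	/* bump per range alloc   */

	/* compare atomic_long_read() of the two across benchmark runs */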

I was just going to say: I can add those counters tomorrow,
and get some stats, but seems like it doesn't really matter what 
the outcome is, because: 

On (03/23/15 21:44), David Miller wrote:
 
 From: Benjamin Herrenschmidt b...@kernel.crashing.org
 Date: Tue, 24 Mar 2015 09:21:05 +1100
 
  Dave, what's your feeling there ? Does anybody around still have
  some HW that we can test with ?
 
 I don't see what the actual problem is.
 
 Even if you use multiple pools, which we should for scalability on
 sun4u too, just do the flush when allocation in _any_ pool wraps
 around.
 
 That's still better than not doing the optimization at all.
 
 That is always going to be correct, and you can use a separate
 spinlock to make sure only one thread of control does the full
 IOMMU flush at a time.

--Sowmini
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-22 Thread Sowmini Varadhan
On (03/23/15 09:02), Benjamin Herrenschmidt wrote:
  How does this relate to the ARM implementation? There is currently
  an effort going on to make that one shared with ARM64 and possibly
  x86. Has anyone looked at both the PowerPC and ARM ways of doing the
  allocation to see if we could pick one of the two to work on
  all architectures?
 
 What I see in ARM is horribly complex, I can't quite make sense of it
 in a couple of minutes of looking at it, and doesn't seem to address the
 basic issue we are addressing here which is the splitting of the iommu
 table lock.

Amen to that.. I thought it was just me :-)

I plan to go through the code to see if/where the arm iommu code
does its locking and achieves its parallelism, but the mapping
between the sparc/powerpc approach and the arm one is not immediately obvious
to me.

--Sowmini

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

Re: Generic IOMMU pooled allocator

2015-03-22 Thread Sowmini Varadhan
Turned out that I was able to iterate over it, and remove
both the -cookie_to_index and the -demap indirection from
iommu_tbl_ops.

That leaves only the odd iommu_flushall() hook, I'm trying
to find the history behind that (needed for sun4u platforms,
afaik, and not sure if there are other ways to achieve this).

Just sent out a new patch-set, which should reach all the 
people/lists on this thread.

If it's possible, it would also help to share any preliminary
code-samples with the iommu_table_ops you folks have in mind.

Please take a look at patchv5, and share your thoughts.

--Sowmini
___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v5 RFC 1/3] sparc: Break up monolithic iommu table/lock into finer graularity pools and lock

2015-03-22 Thread Sowmini Varadhan
Investigation of multithreaded iperf experiments on an ethernet
interface show the iommu-lock as the hottest lock identified by
lockstat, with something of the order of  21M contentions out of
27M acquisitions, and an average wait time of 26 us for the lock.
This is not efficient. A more scalable design is to follow the ppc
model, where the iommu_table has multiple pools, each stretching
over a segment of the map, and with a separate lock for each pool.
This model allows for better parallelization of the iommu map search.

This patch adds the iommu range alloc/free function infrastructure.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v2 changes:
  - incorporate David Miller editorial comments: sparc specific
fields moved from iommu-common into sparc's iommu_64.h
  - make the npools value an input parameter, for the case when
the iommu map size is not very large
  - cookie_to_index mapping, and optimizations for span-boundary
check, for use case such as LDC.
v3: eliminate iommu_sparc, rearrange the -demap indirection to
be invoked under the pool lock.

v4: David Miller review changes:
  - s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
  - page_table_map_base and page_table_shift are unsigned long, not u32.

v5: Feedback from b...@kernel.crashing.org and a...@ozlabs.ru
  - removed ->cookie_to_index and ->demap indirection: caller should
invoke these as needed before calling into the generic allocator

 include/linux/iommu-common.h |   55 +++
 lib/Makefile |2 +-
 lib/iommu-common.c   |  219 ++
 3 files changed, 275 insertions(+), 1 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

diff --git a/include/linux/iommu-common.h b/include/linux/iommu-common.h
new file mode 100644
index 000..50753cf
--- /dev/null
+++ b/include/linux/iommu-common.h
@@ -0,0 +1,55 @@
+#ifndef _LINUX_IOMMU_COMMON_H
+#define _LINUX_IOMMU_COMMON_H
+
+#include linux/spinlock_types.h
+#include linux/device.h
+#include asm/page.h
+
+#define IOMMU_POOL_HASHBITS 4
+#define IOMMU_NR_POOLS  (1  IOMMU_POOL_HASHBITS)
+
+struct iommu_pool {
+   unsigned long   start;
+   unsigned long   end;
+   unsigned long   hint;
+   spinlock_t  lock;
+};
+
+struct iommu_table;
+
+struct iommu_tbl_ops {
+   void(*reset)(struct iommu_table *);
+};
+
+struct iommu_table {
+   unsigned long   page_table_map_base;
+   unsigned long   page_table_shift;
+   unsigned long   nr_pools;
+   const struct iommu_tbl_ops  *iommu_tbl_ops;
+   unsigned long   poolsize;
+   struct iommu_pool   arena_pool[IOMMU_NR_POOLS];
+   u32 flags;
+#defineIOMMU_HAS_LARGE_POOL0x0001
+#defineIOMMU_NO_SPAN_BOUND 0x0002
+   struct iommu_pool   large_pool;
+   unsigned long   *map;
+};
+
+extern void iommu_tbl_pool_init(struct iommu_table *iommu,
+   unsigned long num_entries,
+   u32 page_table_shift,
+   const struct iommu_tbl_ops *iommu_tbl_ops,
+   bool large_pool, u32 npools,
+   bool skip_span_boundary_check);
+
+extern unsigned long iommu_tbl_range_alloc(struct device *dev,
+  struct iommu_table *iommu,
+  unsigned long npages,
+  unsigned long *handle,
+  unsigned int pool_hash);
+
+extern void iommu_tbl_range_free(struct iommu_table *iommu,
+u64 dma_addr, unsigned long npages,
+unsigned long entry);
+
+#endif
diff --git a/lib/Makefile b/lib/Makefile
index 3c3b30b..0ea2ac6 100644
--- a/lib/Makefile
+++ b/lib/Makefile
@@ -102,7 +102,7 @@ obj-$(CONFIG_AUDIT_GENERIC) += audit.o
 obj-$(CONFIG_AUDIT_COMPAT_GENERIC) += compat_audit.o
 
 obj-$(CONFIG_SWIOTLB) += swiotlb.o
-obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o
+obj-$(CONFIG_IOMMU_HELPER) += iommu-helper.o iommu-common.o
 obj-$(CONFIG_FAULT_INJECTION) += fault-inject.o
 obj-$(CONFIG_NOTIFIER_ERROR_INJECTION) += notifier-error-inject.o
 obj-$(CONFIG_CPU_NOTIFIER_ERROR_INJECT) += cpu-notifier-error-inject.o
diff --git a/lib/iommu-common.c b/lib/iommu-common.c
new file mode 100644
index 000..f416191
--- /dev/null
+++ b/lib/iommu-common.c
@@ -0,0 +1,219 @@
+/*
+ * IOMMU mmap management and range allocation functions.
+ * Based almost entirely upon the powerpc iommu allocator.
+ */
+
+#include <linux/export.h>
+#include <linux/bitmap.h>
+#include <linux/bug.h>
+#include <linux/iommu-helper.h>
+#include <linux/iommu-common.h>
+#include <linux/dma-mapping.h>
+
+#define IOMMU_LARGE_ALLOC  15
+
+/*
+ * Initialize iommu_pool entries for the iommu_table

[PATCH v5 RFC 3/3] sparc: Make LDC use common iommu pool management functions

2015-03-22 Thread Sowmini Varadhan
Note that this conversion is only being done to consolidate the
code and ensure that the common code provides the sufficient
abstraction. It is not expected to result in any noticeable
performance improvement, as there is typically one ldc_iommu
per vnet_port, and each one has 8k entries, with a typical
request for 1-4 pages.  Thus LDC uses npools == 1.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v3: added this file to be a consumer of the common iommu library
v4: removed ->cookie_to_index and ->demap from iommu_tbl_ops and instead
inline these calls into ldc before calling into iommu-common

 arch/sparc/kernel/ldc.c |  172 +++
 1 files changed, 84 insertions(+), 88 deletions(-)

diff --git a/arch/sparc/kernel/ldc.c b/arch/sparc/kernel/ldc.c
index 274a9f5..3210f81 100644
--- a/arch/sparc/kernel/ldc.c
+++ b/arch/sparc/kernel/ldc.c
@@ -15,6 +15,8 @@
 #include <linux/list.h>
 #include <linux/init.h>
 #include <linux/bitmap.h>
+#include <linux/hash.h>
+#include <linux/iommu-common.h>
 
 #include <asm/hypervisor.h>
 #include <asm/iommu.h>
@@ -27,6 +29,11 @@
 #define DRV_MODULE_VERSION	"1.1"
 #define DRV_MODULE_RELDATE	"July 22, 2008"
 
+#define COOKIE_PGSZ_CODE	0xf000000000000000ULL
+#define COOKIE_PGSZ_CODE_SHIFT	60ULL
+
+static DEFINE_PER_CPU(unsigned int, ldc_pool_hash);
+
 static char version[] =
	DRV_MODULE_NAME ".c:v" DRV_MODULE_VERSION " (" DRV_MODULE_RELDATE ")\n";
 #define LDC_PACKET_SIZE	64
@@ -98,10 +105,10 @@ static const struct ldc_mode_ops stream_ops;
 int ldom_domaining_enabled;
 
 struct ldc_iommu {
-   /* Protects arena alloc/free.  */
+   /* Protects ldc_unmap.  */
spinlock_t  lock;
-   struct iommu_arena  arena;
struct ldc_mtable_entry *page_table;
+   struct iommu_table  iommu_table;
 };
 
 struct ldc_channel {
@@ -998,31 +1005,76 @@ static void free_queue(unsigned long num_entries, struct ldc_packet *q)
free_pages((unsigned long)q, order);
 }
 
+static unsigned long ldc_cookie_to_index(u64 cookie, void *arg)
+{
+	u64 szcode = cookie >> COOKIE_PGSZ_CODE_SHIFT;
+	/* struct ldc_iommu *ldc_iommu = (struct ldc_iommu *)arg; */
+
+	cookie &= ~COOKIE_PGSZ_CODE;
+
+	return (cookie >> (13ULL + (szcode * 3ULL)));
+}
+
+static void ldc_demap(struct ldc_iommu *iommu, unsigned long id, u64 cookie,
+		      unsigned long entry, unsigned long npages)
+{
+	struct ldc_mtable_entry *base;
+	unsigned long i, shift;
+
+	shift = (cookie >> COOKIE_PGSZ_CODE_SHIFT) * 3;
+	base = iommu->page_table + entry;
+	for (i = 0; i < npages; i++) {
+		if (base->cookie)
+			sun4v_ldc_revoke(id, cookie + (i << shift),
+					 base->cookie);
+		base->mte = 0;
+		base++;
+	}
+}
+
 /* XXX Make this configurable... XXX */
 #define LDC_IOTABLE_SIZE   (8 * 1024)
 
-static int ldc_iommu_init(struct ldc_channel *lp)
+struct iommu_tbl_ops ldc_iommu_ops;
+
+static void setup_ldc_pool_hash(void)
+{
+   unsigned int i;
+   static bool do_once;
+
+   if (do_once)
+   return;
+   do_once = true;
+   for_each_possible_cpu(i)
+   per_cpu(ldc_pool_hash, i) = hash_32(i, IOMMU_POOL_HASHBITS);
+}
+
+
+static int ldc_iommu_init(const char *name, struct ldc_channel *lp)
 {
unsigned long sz, num_tsb_entries, tsbsize, order;
-	struct ldc_iommu *iommu = lp->iommu;
+	struct ldc_iommu *ldc_iommu = lp->iommu;
+	struct iommu_table *iommu = &ldc_iommu->iommu_table;
struct ldc_mtable_entry *table;
unsigned long hv_err;
int err;
 
num_tsb_entries = LDC_IOTABLE_SIZE;
tsbsize = num_tsb_entries * sizeof(struct ldc_mtable_entry);
-
-	spin_lock_init(&iommu->lock);
+	setup_ldc_pool_hash();
+	spin_lock_init(&ldc_iommu->lock);
 
 	sz = num_tsb_entries / 8;
 	sz = (sz + 7UL) & ~7UL;
-	iommu->arena.map = kzalloc(sz, GFP_KERNEL);
-	if (!iommu->arena.map) {
+	iommu->map = kzalloc(sz, GFP_KERNEL);
+	if (!iommu->map) {
 		printk(KERN_ERR PFX "Alloc of arena map failed, sz=%lu\n", sz);
 		return -ENOMEM;
 	}
-
-	iommu->arena.limit = num_tsb_entries;
+	iommu_tbl_pool_init(iommu, num_tsb_entries, PAGE_SHIFT,
+			    &ldc_iommu_ops,
+			    false /* no large pool */,
+			    1 /* npools */,
+			    true /* skip span boundary check */);
 
order = get_order(tsbsize);
 
@@ -1037,7 +1089,7 @@ static int ldc_iommu_init(struct ldc_channel *lp)
 
memset(table, 0, PAGE_SIZE  order);
 
-	iommu->page_table = table;
+	ldc_iommu->page_table = table;
 
 	hv_err = sun4v_ldc_set_map_table(lp->id, __pa(table),
 num_tsb_entries);
@@ -1049,31 +1101,32

[PATCH v5 RFC 0/3] Generic IOMMU pooled allocator

2015-03-22 Thread Sowmini Varadhan
Follows up on the feedback in the thread at
  http://www.spinics.net/lists/sparclinux/msg13493.html

- removed ->cookie_to_index and ->demap indirection from the iommu_tbl_ops.
  The caller needs to call these functions as needed, before invoking
  the generic arena allocator functions (see the sketch after this list).

- added the skip_span_boundary argument to iommu_tbl_pool_init() for
  those callers like LDC which do not care about span boundary checks.
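
As a rough sketch of what the first change means for a caller such as LDC
(using the ldc_cookie_to_index()/ldc_demap() helpers from patch 3/3; error
handling omitted, variable setup assumed):

	/* caller-side ordering: translate the cookie to a table index, tear
	 * down the hardware mapping, then release the range in the generic
	 * allocator
	 */
	entry = ldc_cookie_to_index(cookie, ldc_iommu);
	ldc_demap(ldc_iommu, id, cookie, entry, npages);
	iommu_tbl_range_free(&ldc_iommu->iommu_table, cookie, npages, entry);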


Sowmini (2):
  Break up monolithic iommu table/lock into finer granularity pools and
lock
  Make sparc64 use scalable lib/iommu-common.c functions

Sowmini Varadhan (1):
  Make LDC use common iommu pool management functions

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  187 +++-
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/ldc.c   |  172 ++---
 arch/sparc/kernel/pci_sun4v.c |  194 -
 include/linux/iommu-common.h  |   55 +
 lib/Makefile  |2 +-
 lib/iommu-common.c|  219 +
 8 files changed, 519 insertions(+), 325 deletions(-)
 create mode 100644 include/linux/iommu-common.h
 create mode 100644 lib/iommu-common.c

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev

[PATCH v5 RFC 2/3] sparc: Make sparc64 use scalable lib/iommu-common.c functions

2015-03-22 Thread Sowmini Varadhan
Running iperf with linux as the Tx side (TCP client) and 10 threads
results in a severe performance drop when TSO is disabled, indicating
a weakness in the software that can be avoided by using the scalable
IOMMU arena DMA allocation.

Baseline numbers before this patch:
   with default settings (TSO enabled): 9-9.5 Gbps
   with TSO disabled using ethtool (drops badly): 2-3 Gbps

After this patch, an iperf client with 10 threads can give a
throughput of at least 8.5 Gbps, even when TSO is disabled.

Signed-off-by: Sowmini Varadhan sowmini.varad...@oracle.com
---
v2: moved sparc specific fields into iommu_sparc
v3: converted all sparc users of iommu, so a lot of cleanup and streamlining
v4: David Miller review change:
- s/IOMMU_ERROR_CODE/DMA_ERROR_CODE
- reverts pci_impl.h (now that all iommu usage has been converted)
v5: benh/aik feedback modifies the function signatures: pass in 
modified args to iommu_tbl_pool_init() and iommu_tbl_range_free()

 arch/sparc/include/asm/iommu_64.h |7 +-
 arch/sparc/kernel/iommu.c |  187 
 arch/sparc/kernel/iommu_common.h  |8 --
 arch/sparc/kernel/pci_sun4v.c |  194 ++---
 4 files changed, 160 insertions(+), 236 deletions(-)

diff --git a/arch/sparc/include/asm/iommu_64.h b/arch/sparc/include/asm/iommu_64.h
index 2b9321a..e3cd449 100644
--- a/arch/sparc/include/asm/iommu_64.h
+++ b/arch/sparc/include/asm/iommu_64.h
@@ -16,6 +16,7 @@
 #define IOPTE_WRITE   0x0002UL
 
 #define IOMMU_NUM_CTXS 4096
+#include <linux/iommu-common.h>
 
 struct iommu_arena {
unsigned long   *map;
@@ -24,11 +25,10 @@ struct iommu_arena {
 };
 
 struct iommu {
+   struct iommu_table  tbl;
spinlock_t  lock;
-   struct iommu_arena  arena;
-   void(*flush_all)(struct iommu *);
+   u32 dma_addr_mask;
iopte_t *page_table;
-   u32 page_table_map_base;
unsigned long   iommu_control;
unsigned long   iommu_tsbbase;
unsigned long   iommu_flush;
@@ -40,7 +40,6 @@ struct iommu {
unsigned long   dummy_page_pa;
unsigned long   ctx_lowest_free;
DECLARE_BITMAP(ctx_bitmap, IOMMU_NUM_CTXS);
-   u32 dma_addr_mask;
 };
 
 struct strbuf {
diff --git a/arch/sparc/kernel/iommu.c b/arch/sparc/kernel/iommu.c
index bfa4d0c..71a76d3 100644
--- a/arch/sparc/kernel/iommu.c
+++ b/arch/sparc/kernel/iommu.c
@@ -13,11 +13,15 @@
 #include <linux/errno.h>
 #include <linux/iommu-helper.h>
 #include <linux/bitmap.h>
+#include <linux/hash.h>
+#include <linux/iommu-common.h>
 
 #ifdef CONFIG_PCI
 #include <linux/pci.h>
 #endif
 
+static DEFINE_PER_CPU(unsigned int, iommu_pool_hash);
+
 #include <asm/iommu.h>
 
 #include "iommu_common.h"
@@ -45,8 +49,9 @@
 			       "i" (ASI_PHYS_BYPASS_EC_E))
 
 /* Must be invoked under the IOMMU lock. */
-static void iommu_flushall(struct iommu *iommu)
+static void iommu_flushall(struct iommu_table *iommu_table)
 {
+   struct iommu *iommu = container_of(iommu_table, struct iommu, tbl);
 	if (iommu->iommu_flushinv) {
 		iommu_write(iommu->iommu_flushinv, ~(u64)0);
} else {
@@ -87,94 +92,23 @@ static inline void iopte_make_dummy(struct iommu *iommu, iopte_t *iopte)
iopte_val(*iopte) = val;
 }
 
-/* Based almost entirely upon the ppc64 iommu allocator.  If you use the 'handle'
- * facility it must all be done in one pass while under the iommu lock.
- *
- * On sun4u platforms, we only flush the IOMMU once every time we've passed
- * over the entire page table doing allocations.  Therefore we only ever advance
- * the hint and cannot backtrack it.
- */
-unsigned long iommu_range_alloc(struct device *dev,
-   struct iommu *iommu,
-   unsigned long npages,
-   unsigned long *handle)
-{
-   unsigned long n, end, start, limit, boundary_size;
-	struct iommu_arena *arena = &iommu->arena;
-   int pass = 0;
-
-   /* This allocator was derived from x86_64's bit string search */
-
-   /* Sanity check */
-   if (unlikely(npages == 0)) {
-   if (printk_ratelimit())
-   WARN_ON(1);
-   return DMA_ERROR_CODE;
-   }
-
-	if (handle && *handle)
-		start = *handle;
-	else
-		start = arena->hint;
-
-	limit = arena->limit;
-
-   /* The case below can happen if we have a small segment appended
-* to a large, or when the previous alloc was at the very end of
-* the available space. If so, go back to the beginning and flush.
-*/
-	if (start >= limit) {
-		start = 0;
-		if (iommu->flush_all)
-			iommu->flush_all(iommu);
-	}
-
- again:
-
-   if (dev

Re: Generic IOMMU pooled allocator

2015-03-19 Thread Sowmini Varadhan

On 03/19/2015 02:01 PM, Benjamin Herrenschmidt wrote:

Ben> One thing I noticed is the asymmetry in your code between the alloc
Ben> and the free path. The alloc path is similar to us in that the lock
Ben> covers the allocation and that's about it, there's no actual mapping to
Ben> the HW done, it's done by the caller level right ?

yes, the only constraint is that the h/w alloc transaction should be
done after the arena-alloc, whereas for the unmap, the h/w transaction
should happen first, and arena-unmap should happen after.
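
In pseudocode, the ordering constraint being described (hw_map()/hw_unmap()
are hypothetical stand-ins for the caller's hardware hooks, invoked outside
the pool lock; dev, tbl, pool_hash and dma_addr are assumed):

	/* map path: reserve from the arena first, then program the hardware */
	entry = iommu_tbl_range_alloc(dev, tbl, npages, NULL, pool_hash);
	hw_map(tbl, entry, npages);

	/* unmap path: tear down the hardware translation first, then release */
	hw_unmap(tbl, entry, npages);
	iommu_tbl_range_free(tbl, dma_addr, npages, entry);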

Ben> The free path however, in your case, takes the lock and calls back into
Ben> demap (which I assume is what removes the translation from the HW)
Ben> with the lock held. There's also some mapping between cookies
Ben> and index which here too isn't exposed to the alloc side but is
Ben> exposed to the free side.

Regarding the ->demap indirection - somewhere between V1 and V2, I
realized that, at least for sun4v, it's not necessary to hold
the pool lock when doing the unmap (V1 had originally passed this
as the ->demap). Revisiting the LDC change, I think that even that
has no pool-specific info that needs to be passed in, so possibly the
->demap is not required at all?

I can remove that, and re-verify the LDC code (though I might
not be able to get to this till early next week, as I'm travelling
at the moment).

About the cookie_to_index, that came out of observation of the
LDC code (ldc_cookie_to_index in patchset 3). In the LDC case, 
the workflow is approximately
   base = alloc_npages(..); /* calls iommu_tbl_range_alloc */
   /* set up cookie_state using base */
   /* populate cookies calling fill_cookies() -> make_cookie() */
The make_cookie() is the inverse operation of cookie_to_index()
(afaict, the code is not very well commented, I'm afraid), but 
I need that indirection to figure out which bitmap to clear.

I dont know if there's a better way to do this, or if
the ->cookie_to_index can get more complex for other IOMMU users
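
To make the inverse relationship concrete, a sketch inferred from the LDC
code in patch 3/3 (the make_cookie() form shown here is an assumption and
may differ in detail from the actual ldc.c helper):

	/* pack the page-size code and table index into a cookie; on free,
	 * ldc_cookie_to_index() (patch 3/3) recovers the index so the right
	 * bitmap bits can be cleared
	 */
	static u64 make_cookie(u64 index, u64 pgsz_code, u64 page_offset)
	{
		return ((pgsz_code << COOKIE_PGSZ_CODE_SHIFT) |
			(index << (13ULL + 3ULL * pgsz_code)) |
			page_offset);
	}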

Ben> One thing that Alexey is doing on our side is to move some of the
Ben> hooks to manipulate the underlying TCEs (ie. iommu PTEs) from our
Ben> global ppc_md. data structure to a new iommu_table_ops, so your patches
Ben> will definitely collide with our current work so we'll have to figure
Ben> out how things can be made to match. We might be able to move more than
Ben> just the allocator to the generic code, and the whole implementation of
Ben> map_sg/unmap_sg if we have the right set of ops, unless you see a
Ben> reason why that wouldn't work for you ?

I cant think of why that wont work, though it would help to see
the patch itself..


Ben> We also need to add some additional platform specific fields to certain
Ben> iommu table instances to deal with some KVM related tracking of pinned
Ben> DMAble memory, here too we might have to be creative and possibly
Ben> enclose the generic iommu_table in a platform specific variant.
Ben>
Ben> Alexey, any other comment ?
Ben>
Ben> Cheers,
Ben> Ben.
Ben> Alexey, any other comment ?

On (03/19/15 16:27), Alexey Kardashevskiy wrote:
Alexey>
Alexey> Agree about missing symmetry. In general, I would call it zoned
Alexey> pool-locked memory allocator-ish rather than iommu_table and have
Alexey> no callbacks there.
Alexey>
Alexey> The iommu_tbl_range_free() caller could call cookie_to_index() and

Problem is that tbl_range_free itself needs the `entry' from
->cookie_to_index.. dont know if there's a way to move the code
to avoid that..

Alexey> what the reset() callback does here - I do not understand, some

The ->reset callback came out of the sun4u use-case. Davem might
have more history here than I do, but my understanding is that the
iommu_flushall() was needed on the older sun4u architectures, where
there was no intermediating HV?
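
For what it's worth, a sketch of where such a reset hook would fire inside
the generic allocator (hypothetical placement; start, limit and pool are
assumed allocator locals, and the relevant lib/iommu-common.c hunk is
truncated above):

	/* when the search wraps back to the start of the map on sun4u-style
	 * hardware, give the arch a chance to flush the IOMMU before entries
	 * are reused
	 */
	if (start >= limit) {
		start = pool->start;
		if (iommu->iommu_tbl_ops && iommu->iommu_tbl_ops->reset)
			iommu->iommu_tbl_ops->reset(iommu);
	}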

Alexey> documentation would help here, and demap() does not have to be
Alexey> executed under the lock (as map() is not executed under the lock).
Alexey>
Alexey> btw why demap, not unmap? :)

Maybe neither is needed, as you folks made me realize above.

--Sowmini

___
Linuxppc-dev mailing list
Linuxppc-dev@lists.ozlabs.org
https://lists.ozlabs.org/listinfo/linuxppc-dev