Re: [PATCH 22/22] IB/iser: Chain all iser transaction send work requests

2015-07-30 Thread Sagi Grimberg

On 7/30/2015 1:27 PM, Or Gerlitz wrote:

On Thu, Jul 30, 2015 at 11:06 AM, Sagi Grimberg sa...@mellanox.com wrote:

Concatination of send work requests benefits performance
by reducing the send queue lock contention (acquired in
ib_post_send) and saves us HW doorbells, since the doorbell is rung
only once per chain.


s/Concatination/Concatenation/

AFAIK, we do this today, isn't that the case? If only partially, please
specify in the change-log which flows were not fully optimized in that
respect and are optimized after the patch.


I'll add which current work requests are not chained.

Thanks!
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
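
For illustration only, a minimal user-space model of the chaining idea
discussed in this message. It is a sketch, not the iser code: every model_*
name is hypothetical, and it assumes (as the commit message states) that one
post call takes the send queue lock once and rings one doorbell, however many
work requests are linked through the next pointer.

```c
#include <stddef.h>

/* Hypothetical stand-in for a send work request; only the chaining
 * pointer matters for this sketch. */
struct model_send_wr {
	struct model_send_wr *next;	/* next WR in the chain, or NULL */
	int opcode;			/* placeholder payload */
};

/* Hypothetical stand-in for a send queue with simple counters. */
struct model_sq {
	unsigned int lock_acquisitions;	/* times the SQ lock was taken */
	unsigned int doorbells;		/* times the HW was notified */
	unsigned int posted;		/* WRs placed on the queue */
};

/* Model of a post-send call (assumption): one lock round trip and one
 * doorbell per call, regardless of how many WRs hang off the chain. */
static void model_post_send(struct model_sq *sq, struct model_send_wr *wr)
{
	sq->lock_acquisitions++;
	for (; wr; wr = wr->next)
		sq->posted++;
	sq->doorbells++;
}

/* Link an array of n WRs into a single chain and return its head. */
static struct model_send_wr *model_chain(struct model_send_wr *wrs, size_t n)
{
	size_t i;

	if (n == 0)
		return NULL;
	for (i = 0; i + 1 < n; i++)
		wrs[i].next = &wrs[i + 1];
	wrs[n - 1].next = NULL;
	return wrs;
}
```

In this model, posting three work requests as one chain costs one lock
acquisition and one doorbell, while posting them one at a time costs three
of each.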


Re: COMMERCIAL: Re: [PATCH 06/22] IB/iser: Fix possible bogus DMA unmapping

2015-07-30 Thread Or Gerlitz

On 7/30/2015 3:09 PM, Sagi Grimberg wrote:
I'll add the Fixes tag. 


don't forget to use --abbrev=12 for the Fixes: tag


Re: [PATCH 20/22] IB/iser: Support up to 8MB data transfer in a single command

2015-07-30 Thread Sagi Grimberg

On 7/30/2015 1:22 PM, Or Gerlitz wrote:

On Thu, Jul 30, 2015 at 11:06 AM, Sagi Grimberg sa...@mellanox.com wrote:

iser currently supports up to 512KB of data transfer in a single SCSI
command. In order to support up to 8MB, iser needs to pre-allocate
larger memory regions and larger page vectors.



We should be doing things for a reason, and while we are following that
practice, the reason is missing here.

I believe we have nice motivation to put here for why we want to do
that. Please add it.


Sure.


Re: [PATCH 11/22] IB/iser: Remove dead code in fmr_pool alloc/free

2015-07-30 Thread Sagi Grimberg

On 7/30/2015 1:31 PM, Or Gerlitz wrote:

On Thu, Jul 30, 2015 at 11:06 AM, Sagi Grimberg sa...@mellanox.com wrote:

In the past we always tried to allocate an fmr_pool
and if it failed on ENOSYS (not supported) then we continued
with dma mr. This is not the case anymore and if we tried to
allocate an fmr_pool then it is supported and we expect to succeed.


AFAIK, the ENOSYS flow came into play when working, e.g., over VF drivers
such as mlx4's that don't support FMRs, where we still wanted optimal
performance. What does "this is not the case anymore" mean? These
VF drivers are still out there.


Today, iser is not usable without either FRWR or FMR support. (It once was,
when we bounced to higher-order allocations, but we don't do that
anymore.) Memory registration support is a requirement for iser today.


Re: [PATCH for-4.3 00/15] Modify MR allocation API

2015-07-30 Thread Christoph Hellwig
On Thu, Jul 30, 2015 at 10:32:33AM +0300, Sagi Grimberg wrote:
 This patch set is detached from my WIP for modifying our
 fast registration kernel API. I incorporated some comments
 from Jason and Christoph. The current set is a drop-in replacement
 of ib_alloc_fast_reg_mr to ib_alloc_mr which receives a memory
 region type (which can be IB_MR_TYPE_MEM_REG for normal memory
 registration, IB_MR_TYPE_SIGNATURE for a data-integrity capable
 memory region and future arbitrary SG support capable memory
 region).

While this series doesn't get us much yet it looks reasonable.
So let's start small and get this one in.


Re: [PATCH 11/22] IB/iser: Remove dead code in fmr_pool alloc/free

2015-07-30 Thread Or Gerlitz
On Thu, Jul 30, 2015 at 3:23 PM, Sagi Grimberg sa...@dev.mellanox.co.il wrote:
 Today, iser is not usable without either FRWR or FMR support. (It once was,
 when we bounced to higher-order allocations, but we don't do that
 anymore.) Memory registration support is a requirement for iser today.

OK, sure. We now have FRWR support; back then we didn't.


[PATCH v4 12/14] IB/cma: Use found net_dev for passive connections

2015-07-30 Thread Haggai Eran
When receiving a new connection in cma_req_handler, we actually already
know the net_dev that is used for the connection's creation. Instead of
calling cma_translate_addr to resolve the new connection id's source
address, just use the net_dev that was found.

Signed-off-by: Haggai Eran hagg...@mellanox.com
---
 drivers/infiniband/core/cma.c | 76 ---
 1 file changed, 49 insertions(+), 27 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index f272b3d1799d..c1cd47eab149 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1277,33 +1277,31 @@ static struct rdma_id_private *cma_find_listener(
 }
 
 static struct rdma_id_private *cma_id_from_event(struct ib_cm_id *cm_id,
-						 struct ib_cm_event *ib_event)
+						 struct ib_cm_event *ib_event,
+						 struct net_device **net_dev)
 {
 	struct cma_req_info req;
 	struct rdma_bind_list *bind_list;
 	struct rdma_id_private *id_priv;
-	struct net_device *net_dev;
 	int err;
 
 	err = cma_save_req_info(ib_event, &req);
 	if (err)
 		return ERR_PTR(err);
 
-	net_dev = cma_get_net_dev(ib_event, &req);
-	if (IS_ERR(net_dev)) {
-		if (PTR_ERR(net_dev) == -EAFNOSUPPORT) {
+	*net_dev = cma_get_net_dev(ib_event, &req);
+	if (IS_ERR(*net_dev)) {
+		if (PTR_ERR(*net_dev) == -EAFNOSUPPORT) {
 			/* Assuming the protocol is AF_IB */
-			net_dev = NULL;
+			*net_dev = NULL;
 		} else {
-			return ERR_PTR(PTR_ERR(net_dev));
+			return ERR_PTR(PTR_ERR(*net_dev));
 		}
 	}
 
 	bind_list = cma_ps_find(rdma_ps_from_service_id(req.service_id),
 				cma_port_from_service_id(req.service_id));
-	id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, net_dev);
-
-	dev_put(net_dev);
+	id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, *net_dev);
 
 	return id_priv;
 }
@@ -1553,7 +1551,8 @@ out:
 }
 
 static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id,
-					       struct ib_cm_event *ib_event)
+					       struct ib_cm_event *ib_event,
+					       struct net_device *net_dev)
 {
 	struct rdma_id_private *id_priv;
 	struct rdma_cm_id *id;
@@ -1585,14 +1584,16 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id,
 	if (rt->num_paths == 2)
 		rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path;
 
-	if (cma_any_addr(cma_src_addr(id_priv))) {
-		rt->addr.dev_addr.dev_type = ARPHRD_INFINIBAND;
-		rdma_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid);
-		ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey));
-	} else {
-		ret = cma_translate_addr(cma_src_addr(id_priv), &rt->addr.dev_addr);
+	if (net_dev) {
+		ret = rdma_copy_addr(&rt->addr.dev_addr, net_dev, NULL);
 		if (ret)
 			goto err;
+	} else {
+		/* An AF_IB connection */
+		WARN_ON_ONCE(ss_family != AF_IB);
+
+		cma_translate_ib((struct sockaddr_ib *)cma_src_addr(id_priv),
+				 &rt->addr.dev_addr);
 	}
 	rdma_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid);
 
@@ -1605,7 +1606,8 @@ err:
 }
 
 static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id,
-					      struct ib_cm_event *ib_event)
+					      struct ib_cm_event *ib_event,
+					      struct net_device *net_dev)
 {
 	struct rdma_id_private *id_priv;
 	struct rdma_cm_id *id;
@@ -1624,10 +1626,18 @@ static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id,
 			      ib_event->param.sidr_req_rcvd.service_id))
 		goto err;
 
-	if (!cma_any_addr((struct sockaddr *) &id->route.addr.src_addr)) {
-		ret = cma_translate_addr(cma_src_addr(id_priv), &id->route.addr.dev_addr);
+	if (net_dev) {
+		ret = rdma_copy_addr(&id->route.addr.dev_addr, net_dev, NULL);
 		if (ret)
 			goto err;
+	} else {
+		/* An AF_IB connection */
+		WARN_ON_ONCE(ss_family != AF_IB);
+
+		if (!cma_any_addr(cma_src_addr(id_priv)))
+			cma_translate_ib((struct sockaddr_ib *)
+						cma_src_addr(id_priv),
+ 

[PATCH v4 03/14] IB/core: Find the network device matching connection parameters

2015-07-30 Thread Haggai Eran
From: Yotam Kenneth yota...@mellanox.com

In the case of IPoIB, and maybe in other cases, the network device is
managed by an upper-layer protocol (ULP). In order to expose this
network device to other users of the IB device, let ULPs implement
a callback that returns a network device according to connection parameters.

The IB device and port, together with the P_Key and the GID should
be enough to uniquely identify the ULP net device. However, in current
kernels there can be multiple IPoIB interfaces created with the same GID.
Furthermore, such configuration may be desirable to support ipvlan-like
configurations for RDMA CM with IPoIB.  To resolve the device in these
cases the code will also take the IP address as an additional input.

Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Signed-off-by: Guy Shapiro gu...@mellanox.com
---
 drivers/infiniband/core/device.c | 46 
 include/rdma/ib_verbs.h  | 27 +++
 2 files changed, 73 insertions(+)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 623d8e191ced..124597732fe7 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -38,6 +38,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/mutex.h>
+#include <linux/netdevice.h>
 #include <rdma/rdma_netlink.h>
 
 #include "core_priv.h"
@@ -781,6 +782,51 @@ int ib_find_pkey(struct ib_device *device,
 }
 EXPORT_SYMBOL(ib_find_pkey);
 
+/**
+ * ib_get_net_dev_by_params() - Return the appropriate net_dev
+ * for a received CM request
+ * @dev:   An RDMA device on which the request has been received.
+ * @port:  Port number on the RDMA device.
+ * @pkey:  The Pkey the request came on.
+ * @gid:   A GID that the net_dev uses to communicate.
+ * @addr:  Contains the IP address that the request specified as its
+ * destination.
+ */
+struct net_device *ib_get_net_dev_by_params(struct ib_device *dev,
+					    u8 port,
+					    u16 pkey,
+					    const union ib_gid *gid,
+					    const struct sockaddr *addr)
+{
+	struct net_device *net_dev = NULL;
+	struct ib_client_data *context;
+
+	if (!rdma_protocol_ib(dev, port))
+		return NULL;
+
+	down_read(&lists_rwsem);
+
+	list_for_each_entry(context, &dev->client_data_list, list) {
+		struct ib_client *client = context->client;
+
+		if (context->going_down)
+			continue;
+
+		if (client->get_net_dev_by_params) {
+			net_dev = client->get_net_dev_by_params(dev, port, pkey,
+								gid, addr,
+								context->data);
+			if (net_dev)
+				break;
+		}
+	}
+
+	up_read(&lists_rwsem);
+
+	return net_dev;
+}
+EXPORT_SYMBOL(ib_get_net_dev_by_params);
+
 static int __init ib_core_init(void)
 {
int ret;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index aaa5d2217ab5..5c68f8c1c31a 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -48,6 +48,7 @@
 #include <linux/rwsem.h>
 #include <linux/scatterlist.h>
 #include <linux/workqueue.h>
+#include <linux/socket.h>
 #include <uapi/linux/if_ether.h>
 
 #include <linux/atomic.h>
@@ -1766,6 +1767,28 @@ struct ib_client {
void (*add)   (struct ib_device *);
void (*remove)(struct ib_device *, void *client_data);
 
+   /* Returns the net_dev belonging to this ib_client and matching the
+* given parameters.
+* @dev: An RDMA device that the net_dev use for communication.
+* @port:A physical port number on the RDMA device.
+* @pkey:P_Key that the net_dev uses if applicable.
+* @gid: A GID that the net_dev uses to communicate.
+* @addr:An IP address the net_dev is configured with.
+* @client_data: The device's client data set by ib_set_client_data().
+*
+* An ib_client that implements a net_dev on top of RDMA devices
+* (such as IP over IB) should implement this callback, allowing the
+* rdma_cm module to find the right net_dev for a given request.
+*
+* The caller is responsible for calling dev_put on the returned
+* netdev. */
+   struct net_device *(*get_net_dev_by_params)(
+   struct ib_device *dev,
+   u8 port,
+   u16 pkey,
+   const union ib_gid *gid,
+   const struct sockaddr 
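
For illustration only, a simplified user-space model of the lookup that
ib_get_net_dev_by_params() performs: walk the per-device client contexts,
skip any that are going down or lack the callback, and return the first
net_dev a client reports. All model_* names are hypothetical, the callback
takes only a port number instead of the full parameter set, and the
lists_rwsem read lock and net_dev reference counting are deliberately left out.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical stand-in for struct net_device. */
struct model_net_dev {
	const char *name;
};

/* Hypothetical per-client context, loosely modeled on ib_client_data. */
struct model_client_ctx {
	/* Optional lookup callback; NULL if the client has no net_dev. */
	struct model_net_dev *(*get_net_dev)(int port, void *data);
	void *data;		/* client private data handed to the callback */
	bool going_down;	/* set while the client is being torn down */
};

/* Return the first net_dev that any live client reports for the port. */
static struct model_net_dev *
model_find_net_dev(struct model_client_ctx *ctxs, size_t n, int port)
{
	size_t i;

	for (i = 0; i < n; i++) {
		struct model_net_dev *nd;

		if (ctxs[i].going_down || !ctxs[i].get_net_dev)
			continue;

		nd = ctxs[i].get_net_dev(port, ctxs[i].data);
		if (nd)
			return nd;
	}
	return NULL;
}

/* Example callback: the private data is the net_dev serving port 1. */
static struct model_net_dev *model_port1_lookup(int port, void *data)
{
	return port == 1 ? data : NULL;
}
```

A context marked going_down is never asked, which is the point of the flag:
its private data may already be on its way to being freed.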

[PATCH v4 13/14] IB/cma: Share ib_cm_ids between rdma_cm_ids

2015-07-30 Thread Haggai Eran
Use ib_cm_insert_listen to create listening IB CM IDs or share existing
ones if needed. When given a request on a specific CM ID, the code now
matches the request to the RDMA CM ID based on the request parameters, so
it no longer needs to rely on the ib_cm's private data matching
capabilities.

Signed-off-by: Haggai Eran hagg...@mellanox.com
---
 drivers/infiniband/core/cma.c | 59 +++
 1 file changed, 4 insertions(+), 55 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index c1cd47eab149..1f26bff5f780 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -1771,42 +1771,6 @@ __be64 rdma_get_service_id(struct rdma_cm_id *id, struct sockaddr *addr)
 }
 EXPORT_SYMBOL(rdma_get_service_id);
 
-static void cma_set_compare_data(enum rdma_port_space ps, struct sockaddr *addr,
-				 struct ib_cm_compare_data *compare)
-{
-	struct cma_hdr *cma_data, *cma_mask;
-	__be32 ip4_addr;
-	struct in6_addr ip6_addr;
-
-	memset(compare, 0, sizeof *compare);
-	cma_data = (void *) compare->data;
-	cma_mask = (void *) compare->mask;
-
-	switch (addr->sa_family) {
-	case AF_INET:
-		ip4_addr = ((struct sockaddr_in *) addr)->sin_addr.s_addr;
-		cma_set_ip_ver(cma_data, 4);
-		cma_set_ip_ver(cma_mask, 0xF);
-		if (!cma_any_addr(addr)) {
-			cma_data->dst_addr.ip4.addr = ip4_addr;
-			cma_mask->dst_addr.ip4.addr = htonl(~0);
-		}
-		break;
-	case AF_INET6:
-		ip6_addr = ((struct sockaddr_in6 *) addr)->sin6_addr;
-		cma_set_ip_ver(cma_data, 6);
-		cma_set_ip_ver(cma_mask, 0xF);
-		if (!cma_any_addr(addr)) {
-			cma_data->dst_addr.ip6 = ip6_addr;
-			memset(&cma_mask->dst_addr.ip6, 0xFF,
-			       sizeof cma_mask->dst_addr.ip6);
-		}
-		break;
-	default:
-		break;
-	}
-}
-
 static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event)
 {
 	struct rdma_id_private *id_priv = iw_id->context;
@@ -1960,33 +1924,18 @@ out:
 
 static int cma_ib_listen(struct rdma_id_private *id_priv)
 {
-	struct ib_cm_compare_data compare_data;
 	struct sockaddr *addr;
 	struct ib_cm_id *id;
 	__be64 svc_id;
-	int ret;
 
-	id = ib_create_cm_id(id_priv->id.device, cma_req_handler, id_priv);
+	addr = cma_src_addr(id_priv);
+	svc_id = rdma_get_service_id(&id_priv->id, addr);
+	id = ib_cm_insert_listen(id_priv->id.device, cma_req_handler, svc_id);
 	if (IS_ERR(id))
 		return PTR_ERR(id);
-
 	id_priv->cm_id.ib = id;
 
-	addr = cma_src_addr(id_priv);
-	svc_id = rdma_get_service_id(&id_priv->id, addr);
-	if (cma_any_addr(addr) && !id_priv->afonly)
-		ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL);
-	else {
-		cma_set_compare_data(id_priv->id.ps, addr, &compare_data);
-		ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, &compare_data);
-	}
-
-	if (ret) {
-		ib_destroy_cm_id(id_priv->cm_id.ib);
-		id_priv->cm_id.ib = NULL;
-	}
-
-	return ret;
+	return 0;
 }
 
 static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog)
-- 
1.7.11.2



[PATCH v4 01/14] IB/core: Add rwsem to allow reading device list or client list

2015-07-30 Thread Haggai Eran
Currently the RDMA subsystem's device list and client list are protected by
a single mutex. This prevents adding user-facing APIs that iterate these
lists, since using them may cause a deadlock. The patch attempts to solve
this problem by adding a read-write semaphore to protect the lists. Readers
now don't need the mutex, and are safe just by read-locking the semaphore.

The ib_register_device, ib_register_client, ib_unregister_device, and
ib_unregister_client functions are modified to lock the semaphore for write
during their respective list modification. Also, in order to make sure
client callbacks are called only between add() and remove() calls, the code
is changed to only add items to the lists after the add() calls and remove
from the lists before the remove() calls.

This patch attempts to solve a similar need [1] that was seen in the RoCE
v2 patch series.

[1] http://www.spinics.net/lists/linux-rdma/msg24733.html

Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com
Cc: Matan Barak mat...@mellanox.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
---
 drivers/infiniband/core/device.c | 39 ---
 1 file changed, 28 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 9567756ca4f9..f08d438205ed 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -55,17 +55,24 @@ struct ib_client_data {
 struct workqueue_struct *ib_wq;
 EXPORT_SYMBOL_GPL(ib_wq);
 
+/* The device_list and client_list contain devices and clients after their
+ * registration has completed, and the devices and clients are removed
+ * during unregistration. */
 static LIST_HEAD(device_list);
 static LIST_HEAD(client_list);
 
 /*
- * device_mutex protects access to both device_list and client_list.
- * There's no real point to using multiple locks or something fancier
- * like an rwsem: we always access both lists, and we're always
- * modifying one list or the other list.  In any case this is not a
- * hot path so there's no point in trying to optimize.
+ * device_mutex and lists_rwsem protect access to both device_list and
+ * client_list.  device_mutex protects writer access by device and client
+ * registration / de-registration.  lists_rwsem protects reader access to
+ * these lists.  Iterators of these lists must lock it for read, while updates
+ * to the lists must be done with a write lock. A special case is when the
+ * device_mutex is locked. In this case locking the lists for read access is
+ * not necessary as the device_mutex implies it.
  */
 static DEFINE_MUTEX(device_mutex);
+static DECLARE_RWSEM(lists_rwsem);
+
 
 static int ib_device_check_mandatory(struct ib_device *device)
 {
@@ -305,8 +312,6 @@ int ib_register_device(struct ib_device *device,
 		goto out;
 	}
 
-	list_add_tail(&device->core_list, &device_list);
-
 	device->reg_state = IB_DEV_REGISTERED;
 
 	{
@@ -317,6 +322,10 @@ int ib_register_device(struct ib_device *device,
 				client->add(device);
 	}
 
+	down_write(&lists_rwsem);
+	list_add_tail(&device->core_list, &device_list);
+	up_write(&lists_rwsem);
+
  out:
 	mutex_unlock(&device_mutex);
 	return ret;
@@ -337,12 +346,14 @@ void ib_unregister_device(struct ib_device *device)
 
 	mutex_lock(&device_mutex);
 
+	down_write(&lists_rwsem);
+	list_del(&device->core_list);
+	up_write(&lists_rwsem);
+
 	list_for_each_entry_reverse(client, &client_list, list)
 		if (client->remove)
 			client->remove(device);
 
-	list_del(&device->core_list);
-
 	mutex_unlock(&device_mutex);
 
 	ib_device_unregister_sysfs(device);
@@ -375,11 +386,14 @@ int ib_register_client(struct ib_client *client)
 
 	mutex_lock(&device_mutex);
 
-	list_add_tail(&client->list, &client_list);
 	list_for_each_entry(device, &device_list, core_list)
 		if (client->add && !add_client_context(device, client))
 			client->add(device);
 
+	down_write(&lists_rwsem);
+	list_add_tail(&client->list, &client_list);
+	up_write(&lists_rwsem);
+
 	mutex_unlock(&device_mutex);
 
 	return 0;
@@ -402,6 +416,10 @@ void ib_unregister_client(struct ib_client *client)
 
 	mutex_lock(&device_mutex);
 
+	down_write(&lists_rwsem);
+	list_del(&client->list);
+	up_write(&lists_rwsem);
+
 	list_for_each_entry(device, &device_list, core_list) {
 		if (client->remove)
 			client->remove(device);
@@ -414,7 +432,6 @@ void ib_unregister_client(struct ib_client *client)
 			}
 		spin_unlock_irqrestore(&device->client_data_lock, flags);
 	}
-	list_del(&client->list);
 
 	mutex_unlock(&device_mutex);
 }
-- 
1.7.11.2


Re: Potential lost receive WCs (was [PATCH WIP 38/43])

2015-07-30 Thread Chuck Lever

On Jul 30, 2015, at 3:00 AM, Sagi Grimberg sa...@dev.mellanox.co.il wrote:

 
 The drivers we have that don't dequeue all the CQEs are doing
 something like NAPI polling and have other mechanisms to guarantee
 progress. Don't copy something like budget without copying the other
 mechanisms :)
 
 OK, that makes total sense. Thanks for clarifying.
 
 IIRC NAPI is soft-IRQ, which Chuck is trying to avoid.
 
 Chuck, I think I was the one that commented on this. I observed a
 situation in iser where the polling loop kept going continuously
 without ever leaving the soft-IRQ context (high workload obviously).
 In addition to the polling loop hogging the CPU, other CQs with the
 same IRQ assignment were starved. So I suggested you should take care
 of it in xprtrdma as well.
 
 The correct approach is NAPI. There is an equivalent for storage which
 is called blk_iopoll (block/blk-iopoll.c) which sort of has nothing
 specific to block devices (also soft-IRQ context). I have attempted to
 convert iser to use it, but I got some unpredictable latency jitters so
 I stopped and didn't get a chance to pick it up ever since.
 
 I still think that draining the CQ without respecting a quota is
 wrong, even if driverX has a glitch there.

The iWARP and IBTA specs disagree: they both recommend clearing
existing CQEs when handling a completion upcall. Thus the API is
designed with the expectation that consumers do not impose a poll
budget.

Any solution to the starvation problem, including quota + NAPI,
involves deferring receive work. xprtrdma already defers work.

Our completion handlers are lightweight. The bulk of receive
handling is done in softIRQ in a tasklet that handles each RPC
reply in a loop. It's more likely the tasklet loop, rather than
completion handling, is going to result in starvation.

The only issue we've seen so far is the reply tasklet can hog
one CPU because it is single-threaded across all transport
connections. Thus it is more effective for us to replace the
tasklet with a work queue where each RPC reply can be globally
scheduled and does not interfere with other work being done
by softIRQ.

In other words, the starvation issue seen in xprtrdma is not
in the receive handler, so fixing it there is likely to be
ineffective.


--
Chuck Lever
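
For illustration only, a toy model of the two polling policies being debated
in this thread: draining the completion queue until it is empty versus
stopping once a budget is spent. It is a sketch under assumptions: the
model_* names are hypothetical, and a completion queue is reduced to a
counter of pending entries.

```c
/* Hypothetical completion queue: just a count of pending entries. */
struct model_cq {
	unsigned int pending;
};

/* Reap up to max entries; returns how many were actually reaped. */
static unsigned int model_poll_cq(struct model_cq *cq, unsigned int max)
{
	unsigned int n = cq->pending < max ? cq->pending : max;

	cq->pending -= n;
	return n;
}

/* Drain policy: keep polling in batches until the CQ is empty. */
static unsigned int model_drain(struct model_cq *cq, unsigned int batch)
{
	unsigned int total = 0, n;

	while ((n = model_poll_cq(cq, batch)) > 0)
		total += n;
	return total;
}

/* Budget policy: stop after at most budget entries. If cq->pending is
 * still non-zero on return, the caller has to reschedule itself. */
static unsigned int model_poll_budget(struct model_cq *cq,
				      unsigned int batch,
				      unsigned int budget)
{
	unsigned int total = 0;

	while (total < budget) {
		unsigned int left = budget - total;
		unsigned int n = model_poll_cq(cq, left < batch ? left : batch);

		if (n == 0)
			break;
		total += n;
	}
	return total;
}
```

The budget variant bounds the time spent per invocation but leaves work
behind, so it only helps if something else guarantees the poller runs again.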





[PATCH v4 02/14] IB/core: lock client data with lists_rwsem

2015-07-30 Thread Haggai Eran
An ib_client callback that is called with the lists_rwsem locked only for
read is protected from changes to the IB client lists, but not from
ib_unregister_device() freeing its client data. This is because
ib_unregister_device() will remove the device from the device list with
lists_rwsem locked for write, but perform the rest of the cleanup,
including the call to remove() without that lock.

Mark client data that is undergoing de-registration with a new going_down
flag in the client data context. Lock the client data list with lists_rwsem
for write in addition to using the spinlock, so that functions calling the
callback would be able to lock only lists_rwsem for read and let callbacks
sleep.

Since ib_unregister_client() now marks the client data context, no need for
remove() to search the context again, so pass the client data directly to
remove() callbacks.

Reviewed-by: Jason Gunthorpe jguntho...@obsidianresearch.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
---
 drivers/infiniband/core/cache.c   |  2 +-
 drivers/infiniband/core/cm.c  |  7 ++--
 drivers/infiniband/core/cma.c |  7 ++--
 drivers/infiniband/core/device.c  | 53 +--
 drivers/infiniband/core/mad.c |  2 +-
 drivers/infiniband/core/multicast.c   |  7 ++--
 drivers/infiniband/core/sa_query.c|  6 ++--
 drivers/infiniband/core/ucm.c |  6 ++--
 drivers/infiniband/core/user_mad.c|  6 ++--
 drivers/infiniband/core/uverbs_main.c |  6 ++--
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  7 ++--
 drivers/infiniband/ulp/srp/ib_srp.c   |  6 ++--
 drivers/infiniband/ulp/srpt/ib_srpt.c |  5 ++-
 include/rdma/ib_verbs.h   |  4 ++-
 net/rds/ib.c  |  5 ++-
 net/rds/iw.c  |  5 ++-
 16 files changed, 82 insertions(+), 52 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 871da832d016..c93af66cc091 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -394,7 +394,7 @@ err:
 	kfree(device->cache.lmc_cache);
 }
 
-static void ib_cache_cleanup_one(struct ib_device *device)
+static void ib_cache_cleanup_one(struct ib_device *device, void *client_data)
 {
 	int p;
 
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 3a972ebf3c0d..82d5c4362aa8 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -58,7 +58,7 @@ MODULE_DESCRIPTION("InfiniBand CM");
 MODULE_LICENSE("Dual BSD/GPL");
 
 static void cm_add_one(struct ib_device *device);
-static void cm_remove_one(struct ib_device *device);
+static void cm_remove_one(struct ib_device *device, void *client_data);
 
 static struct ib_client cm_client = {
 	.name   = "cm",
@@ -3886,9 +3886,9 @@ free:
 	kfree(cm_dev);
 }
 
-static void cm_remove_one(struct ib_device *ib_device)
+static void cm_remove_one(struct ib_device *ib_device, void *client_data)
 {
-	struct cm_device *cm_dev;
+	struct cm_device *cm_dev = client_data;
 	struct cm_port *port;
 	struct ib_port_modify port_modify = {
 		.clr_port_cap_mask = IB_PORT_CM_SUP
@@ -3896,7 +3896,6 @@ static void cm_remove_one(struct ib_device *ib_device)
 	unsigned long flags;
 	int i;
 
-	cm_dev = ib_get_client_data(ib_device, &cm_client);
 	if (!cm_dev)
 		return;
 
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 143ded2bbe7c..6b6cdfa5d231 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -94,7 +94,7 @@ const char *rdma_event_msg(enum rdma_cm_event_type event)
 EXPORT_SYMBOL(rdma_event_msg);
 
 static void cma_add_one(struct ib_device *device);
-static void cma_remove_one(struct ib_device *device);
+static void cma_remove_one(struct ib_device *device, void *client_data);
 
 static struct ib_client cma_client = {
 	.name   = "cma",
@@ -3551,11 +3551,10 @@ static void cma_process_remove(struct cma_device *cma_dev)
 	wait_for_completion(&cma_dev->comp);
 }
 
-static void cma_remove_one(struct ib_device *device)
+static void cma_remove_one(struct ib_device *device, void *client_data)
 {
-	struct cma_device *cma_dev;
+	struct cma_device *cma_dev = client_data;
 
-	cma_dev = ib_get_client_data(device, &cma_client);
 	if (!cma_dev)
 		return;
 
diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index f08d438205ed..623d8e191ced 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -50,6 +50,9 @@ struct ib_client_data {
 	struct list_head  list;
 	struct ib_client *client;
 	void *data;
+	/* The device or client is going down. Do not call client or device
+	 * callbacks other than remove(). */
+	bool  going_down;
 };
 
 struct workqueue_struct 

[PATCH v4 07/14] IB/cma: Refactor RDMA IP CM private-data parsing code

2015-07-30 Thread Haggai Eran
When receiving a connection request, rdma_cm needs to associate the request
with a network device, in order to disambiguate requests. To do this, it
needs to know the request's destination IP. For this the module needs to
allow getting this information from the private data in the request packet,
instead of relying on the information already being in the listening RDMA
CM ID.

When creating a new incoming connection ID, the code in
cma_save_ip{4,6}_info can no longer rely on the listener's private data to
find the port number, so it reads it from the requested service ID.

Signed-off-by: Guy Shapiro gu...@mellanox.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
---
 drivers/infiniband/core/cma.c | 170 ++
 1 file changed, 105 insertions(+), 65 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 6b6cdfa5d231..cf5c48b0b7d5 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -870,107 +870,138 @@ static inline int cma_any_port(struct sockaddr *addr)
 	return !cma_port(addr);
 }
 
-static void cma_save_ib_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id,
+static void cma_save_ib_info(struct sockaddr *src_addr,
+			     struct sockaddr *dst_addr,
+			     struct rdma_cm_id *listen_id,
 			     struct ib_sa_path_rec *path)
 {
 	struct sockaddr_ib *listen_ib, *ib;
 
 	listen_ib = (struct sockaddr_ib *) &listen_id->route.addr.src_addr;
-	ib = (struct sockaddr_ib *) &id->route.addr.src_addr;
-	ib->sib_family = listen_ib->sib_family;
-	if (path) {
-		ib->sib_pkey = path->pkey;
-		ib->sib_flowinfo = path->flow_label;
-		memcpy(&ib->sib_addr, &path->sgid, 16);
-	} else {
-		ib->sib_pkey = listen_ib->sib_pkey;
-		ib->sib_flowinfo = listen_ib->sib_flowinfo;
-		ib->sib_addr = listen_ib->sib_addr;
-	}
-	ib->sib_sid = listen_ib->sib_sid;
-	ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
-	ib->sib_scope_id = listen_ib->sib_scope_id;
-
-	if (path) {
-		ib = (struct sockaddr_ib *) &id->route.addr.dst_addr;
-		ib->sib_family = listen_ib->sib_family;
-		ib->sib_pkey = path->pkey;
-		ib->sib_flowinfo = path->flow_label;
-		memcpy(&ib->sib_addr, &path->dgid, 16);
+	if (src_addr) {
+		ib = (struct sockaddr_ib *)src_addr;
+		ib->sib_family = AF_IB;
+		if (path) {
+			ib->sib_pkey = path->pkey;
+			ib->sib_flowinfo = path->flow_label;
+			memcpy(&ib->sib_addr, &path->sgid, 16);
+			ib->sib_sid = path->service_id;
+			ib->sib_scope_id = 0;
+		} else {
+			ib->sib_pkey = listen_ib->sib_pkey;
+			ib->sib_flowinfo = listen_ib->sib_flowinfo;
+			ib->sib_addr = listen_ib->sib_addr;
+			ib->sib_sid = listen_ib->sib_sid;
+			ib->sib_scope_id = listen_ib->sib_scope_id;
+		}
+		ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
+	}
+	if (dst_addr) {
+		ib = (struct sockaddr_ib *)dst_addr;
+		ib->sib_family = AF_IB;
+		if (path) {
+			ib->sib_pkey = path->pkey;
+			ib->sib_flowinfo = path->flow_label;
+			memcpy(&ib->sib_addr, &path->dgid, 16);
+		}
 	}
 }
 
-static __be16 ss_get_port(const struct sockaddr_storage *ss)
-{
-	if (ss->ss_family == AF_INET)
-		return ((struct sockaddr_in *)ss)->sin_port;
-	else if (ss->ss_family == AF_INET6)
-		return ((struct sockaddr_in6 *)ss)->sin6_port;
-	BUG();
-}
-
-static void cma_save_ip4_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id,
-			      struct cma_hdr *hdr)
+static void cma_save_ip4_info(struct sockaddr *src_addr,
+			      struct sockaddr *dst_addr,
+			      struct cma_hdr *hdr,
+			      __be16 local_port)
 {
 	struct sockaddr_in *ip4;
 
-	ip4 = (struct sockaddr_in *) &id->route.addr.src_addr;
-	ip4->sin_family = AF_INET;
-	ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr;
-	ip4->sin_port = ss_get_port(&listen_id->route.addr.src_addr);
+	if (src_addr) {
+		ip4 = (struct sockaddr_in *)src_addr;
+		ip4->sin_family = AF_INET;
+		ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr;
+		ip4->sin_port = local_port;
+	}
 
-	ip4 = (struct sockaddr_in *) &id->route.addr.dst_addr;
-	ip4->sin_family = AF_INET;
-   ip4-sin_addr.s_addr = 

[PATCH v4 10/14] IB/cma: Add net_dev and private data checks to RDMA CM

2015-07-30 Thread Haggai Eran
Instead of relying on the ib_cm module to check an incoming CM request's
private data header, add these checks to the RDMA CM module. This allows a
following patch to clean up the ib_cm interface and remove the code that
looks into the private headers. It will also allow supporting namespaces in
RDMA CM by making these checks namespace aware later on.

Signed-off-by: Haggai Eran hagg...@mellanox.com
---
 drivers/infiniband/core/cma.c | 188 +-
 1 file changed, 185 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index f2d799209412..011aa7310dd3 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -263,6 +263,15 @@ struct cma_hdr {
 
 #define CMA_VERSION 0x00
 
+struct cma_req_info {
+   struct ib_device *device;
+   int port;
+   union ib_gid local_gid;
+   __be64 service_id;
+   u16 pkey;
+   bool has_gid:1;
+};
+
 static int cma_comp(struct rdma_id_private *id_priv, enum rdma_cm_state comp)
 {
unsigned long flags;
@@ -300,7 +309,7 @@ static enum rdma_cm_state cma_exch(struct rdma_id_private 
*id_priv,
return old;
 }
 
-static inline u8 cma_get_ip_ver(struct cma_hdr *hdr)
+static inline u8 cma_get_ip_ver(const struct cma_hdr *hdr)
 {
 	return hdr->ip_version >> 4;
 }
@@ -1016,7 +1025,7 @@ static int cma_save_ip_info(struct sockaddr *src_addr,
cma_save_ip6_info(src_addr, dst_addr, hdr, port);
break;
default:
-   return -EINVAL;
+   return -EAFNOSUPPORT;
}
 
return 0;
@@ -1040,6 +1049,176 @@ static int cma_save_net_info(struct sockaddr *src_addr,
return cma_save_ip_info(src_addr, dst_addr, ib_event, service_id);
 }
 
+static int cma_save_req_info(const struct ib_cm_event *ib_event,
+struct cma_req_info *req)
+{
+   const struct ib_cm_req_event_param *req_param =
+   ib_event-param.req_rcvd;
+   const struct ib_cm_sidr_req_event_param *sidr_param =
+   ib_event-param.sidr_req_rcvd;
+
+   switch (ib_event-event) {
+   case IB_CM_REQ_RECEIVED:
+   req-device = req_param-listen_id-device;
+   req-port   = req_param-port;
+   memcpy(req-local_gid, req_param-primary_path-sgid,
+  sizeof(req-local_gid));
+   req-has_gid= true;
+   req-service_id = req_param-primary_path-service_id;
+   req-pkey   = req_param-bth_pkey;
+   break;
+   case IB_CM_SIDR_REQ_RECEIVED:
+   req-device = sidr_param-listen_id-device;
+   req-port   = sidr_param-port;
+   req-has_gid= false;
+   req-service_id = sidr_param-service_id;
+   req-pkey   = sidr_param-bth_pkey;
+   break;
+   default:
+   return -EINVAL;
+   }
+
+   return 0;
+}
+
+static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event,
+ const struct cma_req_info *req)
+{
+   struct sockaddr_storage listen_addr_storage;
+   struct sockaddr *listen_addr = (struct sockaddr *)listen_addr_storage;
+   struct net_device *net_dev;
+   const union ib_gid *gid = req-has_gid ? req-local_gid : NULL;
+   int err;
+
+   err = cma_save_ip_info(listen_addr, NULL, ib_event, req-service_id);
+   if (err)
+   return ERR_PTR(err);
+
+   net_dev = ib_get_net_dev_by_params(req-device, req-port, req-pkey,
+  gid, listen_addr);
+   if (!net_dev)
+   return ERR_PTR(-ENODEV);
+
+   return net_dev;
+}
+
+static enum rdma_port_space rdma_ps_from_service_id(__be64 service_id)
+{
+   return (be64_to_cpu(service_id)  16)  0x;
+}
+
+static bool cma_match_private_data(struct rdma_id_private *id_priv,
+				   const struct cma_hdr *hdr)
+{
+	struct sockaddr *addr = cma_src_addr(id_priv);
+	__be32 ip4_addr;
+	struct in6_addr ip6_addr;
+
+	if (cma_any_addr(addr) && !id_priv->afonly)
+		return true;
+
+	switch (addr->sa_family) {
+	case AF_INET:
+		ip4_addr = ((struct sockaddr_in *)addr)->sin_addr.s_addr;
+		if (cma_get_ip_ver(hdr) != 4)
+			return false;
+		if (!cma_any_addr(addr) &&
+		    hdr->dst_addr.ip4.addr != ip4_addr)
+			return false;
+		break;
+	case AF_INET6:
+		ip6_addr = ((struct sockaddr_in6 *)addr)->sin6_addr;
+		if (cma_get_ip_ver(hdr) != 6)
+			return false;
+		if (!cma_any_addr(addr) &&
+		    memcmp(&hdr->dst_addr.ip6, &ip6_addr, sizeof(ip6_addr)))
+			return false;
+		break;
+	case AF_IB:
+   
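
The private-data check in this patch can be pictured with a small userspace
sketch. Everything below is an assumption made for illustration: the struct
layouts, the names hdr_sketch, listener_sketch and match_private_data, and
the choice to keep an IPv4 address in the first four bytes of the buffer are
not taken from the kernel. Only the order of the checks follows the posted
function.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical, simplified stand-in for struct cma_hdr. */
struct hdr_sketch {
	uint8_t ip_version;   /* IP version lives in bits 7:4 */
	uint8_t dst_addr[16]; /* IPv4 uses the first 4 bytes in this sketch */
};

/* Hypothetical stand-in for the listener's bound address. */
struct listener_sketch {
	int family;           /* 4 or 6 */
	bool any_addr;        /* bound to the wildcard address */
	bool afonly;          /* restricted to its own address family */
	uint8_t addr[16];
};

static uint8_t hdr_ip_ver(const struct hdr_sketch *hdr)
{
	return hdr->ip_version >> 4;
}

static bool match_private_data(const struct listener_sketch *l,
			       const struct hdr_sketch *hdr)
{
	size_t len = (l->family == 4) ? 4 : 16;

	/* A wildcard listener without afonly accepts either IP version. */
	if (l->any_addr && !l->afonly)
		return true;
	/* Otherwise the IP version in the header must match the listener. */
	if (hdr_ip_ver(hdr) != l->family)
		return false;
	/* A listener bound to a specific address must match it exactly. */
	if (!l->any_addr && memcmp(hdr->dst_addr, l->addr, len) != 0)
		return false;
	return true;
}
```

With this shape, a listener bound to a wildcard IPv6 address with afonly set
rejects a request whose header carries IP version 4, while the same listener
without afonly accepts it.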

[PATCH v4 00/14] Demux IB CM requests in the rdma_cm module

2015-07-30 Thread Haggai Eran
I'm sending the patchset again with the rwsem patch and rebased over Doug's
to-be-rebased/for-4.3 tree.

Regards,
Haggai

Changes from v3:
- rebase over github.com/dledford/linux to-be-rebased/for-4.3
- add rwsem patch

Changes from v2:
- added missing reviewed-bys
- Patch 5: remove service_mask as a parameter from ib_cm_insert_listen()
- Patch 9:
  * move cma_req_info struct near other structs
  * put GID by value in the struct

Changes from v1:
- Patch 1: mark ib_client_data as going down instead of removing all client
  contexts during de-registration.
- Patch 2:
  * move kdoc to the function definition
  * do not call get_net_dev_by_params() on devices/clients that are going
down
  * pass client data directly to the callback
- Patch 3:
  * pass client data directly to callback
  * fix a lockdep warning in ipoib_match_gid_pkey_addr()
  * remove a debugging print left over
  * set a rate limit to the duplicated IP address warning
- Patch 5:
  * change atomic_dec(id-refcount) to cm_deref_id()
  * always update listen_sharecount under the cm.lock spinlock
- Patch 6: handle AF_IB requests by getting parameters from the listener
- Patch 8: new patch to expose BTH P_Key from ib_cm to rdma_cm
- Patch 9:
  * get P_Key used for de-mux from the BTH
  * use -EAFNOSUPPORT in cma_save_ip_info to designate a possible AF_IB
connection request
  * pass a NULL netdev for AF_IB requests
- Patch 11: handle AF_IB connections by filling connection information from
  the listener id instead of from the net_dev
- Patch 12: fix mention of the old ib_cm_id_create_and_listen function in
  the changelog entry.

Changes from v0:
- Added a patch to prevent a race between ib_unregister_device() and
  ib_get_net_dev_by_params().
- Removed the patch that exported a UD GMP packet's GID from the GRH, and
  related code.
- Patch 3:
  * Add _rcu suffix to ipoib_is_dev_match_addr().
  * Add helper function to get the master netdev for bonding support.
  * Scan for matching net devices in two phases: first without looking at
    the IP address, and then looking at the IP address only when the first
    phase did not find a unique net device.
- Patch 5:
  * Do not init listen_sharecount = 1 for non-listening ib_cm_ids.
  * Remove code that sets a CM ID's state to IB_CM_IDLE right before
destruction.
  * Rename ib_cm_id_create_and_listen() to ib_cm_insert_listen().
  * Do not increase reference counts when failing to add a shared CM ID due
to having a different handler callback.
- Patch 9: Clean IPv4 net_dev validation function.
- Added patch 10: new patch to use the found net_dev in IB/cma for
  eliminating unneeded calls to cma_translate_addr.
- Patch 12: Remove the lock argument to __ib_cm_listen().

The rdma_cm module relies today on the ib_cm module to demux incoming
requests based on their service ID and IP address. The ib_cm module is the
wrong place to perform this task, as it can also be used with services that
do not adhere to the RDMA IP CM service as defined in the IBA
specifications. It is forced to use an opaque private data struct and mask
to compare incoming requests against.

This series moves that demux task responsibility to the rdma_cm module. The
rdma_cm module can look into the private data attached to a CM request,
containing the IP addresses related to the request. It uses the details of
the request to find the net device associated with the request, and uses
that net device to find the correct listening rdma_cm_id.
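
As a rough illustration of the first demux step, the sketch below splits a
service ID into a port space and a port number. It assumes the service ID is
already in host byte order, that the port space sits in bits 31:16 and the
port in bits 15:0, and that the port-space values match those in
include/rdma/rdma_cm.h. The sketch_ names are hypothetical and the constants
should be treated as assumptions rather than a statement of the ABI.

```c
#include <stdint.h>

/* Port-space values assumed for this sketch. */
enum ps_sketch {
	PS_SKETCH_IPOIB = 0x0002,
	PS_SKETCH_TCP   = 0x0106,
	PS_SKETCH_UDP   = 0x0111,
	PS_SKETCH_IB    = 0x013F,
};

/* service_id is in host byte order here; kernel code would first
 * convert from __be64 with be64_to_cpu(). */
static inline uint16_t sketch_ps_from_service_id(uint64_t service_id)
{
	return (uint16_t)((service_id >> 16) & 0xffff);
}

/* The low 16 bits are assumed to carry the IP port number. */
static inline uint16_t sketch_port_from_service_id(uint64_t service_id)
{
	return (uint16_t)(service_id & 0xffff);
}
```

Under these assumptions a service ID of 0x0000000001061144 decodes to the
TCP port space and port 4420.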

The series applies against Doug's for-v4.2 tree with the patch adding a
rwsem to IB core [2] applied.

The series is structured as follows:
Patch 1 prevents a possible race between ib_client.remove() callbacks from
ib_unregister_device(), and ib_client callbacks that rely on the
lists_rwsem locked for read, such as ib_get_net_dev_by_params(). Both
callbacks may call ib_get_client_data(), and the patch makes sure that the
remove callback doesn't free the client data while it is being used by the
other callback.

Patches 2-3 add the ability to look up a network device according to the IB
device, port, P_Key, GID and IP address. They find the matching IPoIB
interfaces, and return a matching net_device if one exists.

Patches 4-5 make necessary changes in ib_cm to allow RDMA CM to get the
information it needs out of CM and SIDR requests, and share a single
ib_cm_id with multiple RDMA CM listeners.

Patches 6-7 do some preliminary refactoring to the rdma_cm module. They
allow extracting information out of incoming requests instead of retrieving
them from a listening CM ID, and add helper functions to access the port
space IDRs.

Finally, patches 8-12 change rdma_cm to demultiplex requests on its own, and
patch 13 cleans up the now unneeded code in ib_cm to compare against the
private data.

This series contains a subset of the RDMA CM namespaces patches [1]. The
changes from v4 of the relevant patches are:
- Patch 1
  * in addition to the IB device, port, P_Key and IP address, pass
also the GID, 

[PATCH v4 08/14] IB/cma: Helper functions to access port space IDRs

2015-07-30 Thread Haggai Eran
Add helper functions to access the IDRs by port-space and port number.

Pass around the port-space enum in cma.c instead of using pointers to
port-space IDRs.

Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Signed-off-by: Guy Shapiro gu...@mellanox.com
---
 drivers/infiniband/core/cma.c | 81 ---
 1 file changed, 60 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index cf5c48b0b7d5..f2d799209412 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -113,6 +113,22 @@ static DEFINE_IDR(udp_ps);
 static DEFINE_IDR(ipoib_ps);
 static DEFINE_IDR(ib_ps);
 
+static struct idr *cma_idr(enum rdma_port_space ps)
+{
+   switch (ps) {
+   case RDMA_PS_TCP:
+   return tcp_ps;
+   case RDMA_PS_UDP:
+   return udp_ps;
+   case RDMA_PS_IPOIB:
+   return ipoib_ps;
+   case RDMA_PS_IB:
+   return ib_ps;
+   default:
+   return NULL;
+   }
+}
+
 struct cma_device {
struct list_headlist;
struct ib_device*device;
@@ -122,11 +138,33 @@ struct cma_device {
 };
 
 struct rdma_bind_list {
-   struct idr  *ps;
+   enum rdma_port_spaceps;
struct hlist_head   owners;
unsigned short  port;
 };
 
+static int cma_ps_alloc(enum rdma_port_space ps,
+   struct rdma_bind_list *bind_list, int snum)
+{
+   struct idr *idr = cma_idr(ps);
+
+   return idr_alloc(idr, bind_list, snum, snum + 1, GFP_KERNEL);
+}
+
+static struct rdma_bind_list *cma_ps_find(enum rdma_port_space ps, int snum)
+{
+   struct idr *idr = cma_idr(ps);
+
+   return idr_find(idr, snum);
+}
+
+static void cma_ps_remove(enum rdma_port_space ps, int snum)
+{
+   struct idr *idr = cma_idr(ps);
+
+   idr_remove(idr, snum);
+}
+
 enum {
CMA_OPTION_AFONLY,
 };
@@ -1069,7 +1107,7 @@ static void cma_release_port(struct rdma_id_private 
*id_priv)
mutex_lock(lock);
hlist_del(id_priv-node);
if (hlist_empty(bind_list-owners)) {
-   idr_remove(bind_list-ps, bind_list-port);
+   cma_ps_remove(bind_list-ps, bind_list-port);
kfree(bind_list);
}
mutex_unlock(lock);
@@ -2365,8 +2403,8 @@ static void cma_bind_port(struct rdma_bind_list 
*bind_list,
hlist_add_head(id_priv-node, bind_list-owners);
 }
 
-static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv,
- unsigned short snum)
+static int cma_alloc_port(enum rdma_port_space ps,
+ struct rdma_id_private *id_priv, unsigned short snum)
 {
struct rdma_bind_list *bind_list;
int ret;
@@ -2375,7 +2413,7 @@ static int cma_alloc_port(struct idr *ps, struct 
rdma_id_private *id_priv,
if (!bind_list)
return -ENOMEM;
 
-   ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL);
+   ret = cma_ps_alloc(ps, bind_list, snum);
if (ret  0)
goto err;
 
@@ -2388,7 +2426,8 @@ err:
return ret == -ENOSPC ? -EADDRNOTAVAIL : ret;
 }
 
-static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv)
+static int cma_alloc_any_port(enum rdma_port_space ps,
+ struct rdma_id_private *id_priv)
 {
static unsigned int last_used_port;
int low, high, remaining;
@@ -2399,7 +2438,7 @@ static int cma_alloc_any_port(struct idr *ps, struct 
rdma_id_private *id_priv)
rover = prandom_u32() % remaining + low;
 retry:
if (last_used_port != rover 
-   !idr_find(ps, (unsigned short) rover)) {
+   !cma_ps_find(ps, (unsigned short)rover)) {
int ret = cma_alloc_port(ps, id_priv, rover);
/*
 * Remember previously used port number in order to avoid
@@ -2454,7 +2493,8 @@ static int cma_check_port(struct rdma_bind_list 
*bind_list,
return 0;
 }
 
-static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv)
+static int cma_use_port(enum rdma_port_space ps,
+   struct rdma_id_private *id_priv)
 {
struct rdma_bind_list *bind_list;
unsigned short snum;
@@ -2464,7 +2504,7 @@ static int cma_use_port(struct idr *ps, struct 
rdma_id_private *id_priv)
if (snum  PROT_SOCK  !capable(CAP_NET_BIND_SERVICE))
return -EACCES;
 
-   bind_list = idr_find(ps, snum);
+   bind_list = cma_ps_find(ps, snum);
if (!bind_list) {
ret = cma_alloc_port(ps, id_priv, snum);
} else {
@@ -2487,25 +2527,24 @@ static int cma_bind_listen(struct rdma_id_private 
*id_priv)
return ret;
 }
 
-static struct idr *cma_select_inet_ps(struct rdma_id_private *id_priv)
+static enum rdma_port_space 
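
The helper pattern introduced here (map an enum to a per-port-space table,
then wrap alloc/find/remove around it) can be sketched in userspace as
follows. This is only an illustration: a fixed array stands in for the
kernel IDR, all sketch_ names are hypothetical, and the NULL checks are an
addition of the sketch rather than something the posted helpers do.

```c
#include <stddef.h>

#define SKETCH_MAX_PORT 65536

enum sketch_port_space { SK_PS_TCP, SK_PS_UDP, SK_PS_IPOIB, SK_PS_IB, SK_PS_COUNT };

/* One table per port space; a plain array stands in for the kernel IDR. */
static void *sketch_tables[SK_PS_COUNT][SKETCH_MAX_PORT];

/* Equivalent of cma_idr(): translate the enum into the backing table. */
static void **sketch_table(enum sketch_port_space ps)
{
	return ((int)ps >= 0 && ps < SK_PS_COUNT) ? sketch_tables[ps] : NULL;
}

/* Returns snum on success, -1 if the port is taken or arguments are bad. */
static int sketch_ps_alloc(enum sketch_port_space ps, void *entry, int snum)
{
	void **t = sketch_table(ps);

	if (!t || snum < 0 || snum >= SKETCH_MAX_PORT || t[snum])
		return -1;
	t[snum] = entry;
	return snum;
}

static void *sketch_ps_find(enum sketch_port_space ps, int snum)
{
	void **t = sketch_table(ps);

	if (!t || snum < 0 || snum >= SKETCH_MAX_PORT)
		return NULL;
	return t[snum];
}

static void sketch_ps_remove(enum sketch_port_space ps, int snum)
{
	void **t = sketch_table(ps);

	if (t && snum >= 0 && snum < SKETCH_MAX_PORT)
		t[snum] = NULL;
}
```

Callers then carry only the enum value around, which is the point of the
patch: the bind list stores a port space rather than a pointer to a table.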

Re: [PATCH for-4.3 02/15] IB: Modify ib_create_mr API

2015-07-30 Thread Steve Wise

On 7/30/2015 2:32 AM, Sagi Grimberg wrote:

Use ib_alloc_mr with specific parameters.
Change the existing callers.

Signed-off-by: Sagi Grimberg sa...@mellanox.com
---
  drivers/infiniband/core/verbs.c  | 31 --
  drivers/infiniband/hw/mlx5/main.c|  2 +-
  drivers/infiniband/hw/mlx5/mlx5_ib.h |  5 +++--
  drivers/infiniband/hw/mlx5/mr.c  | 17 ++-
  drivers/infiniband/ulp/iser/iser_verbs.c |  6 ++
  drivers/infiniband/ulp/isert/ib_isert.c  |  6 +-
  include/rdma/ib_verbs.h  | 37 +---
  7 files changed, 58 insertions(+), 46 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 003bb62..2ac599b 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1272,15 +1272,32 @@ int ib_dereg_mr(struct ib_mr *mr)
  }
  EXPORT_SYMBOL(ib_dereg_mr);
  
-struct ib_mr *ib_create_mr(struct ib_pd *pd,

-  struct ib_mr_init_attr *mr_init_attr)
+/**
+ * ib_alloc_mr() - Allocates a memory region
+ * @pd:protection domain associated with the region
+ * @mr_type:   memory region type
+ * @max_num_sg:maximum sg entries available for registration.
+ *
+ * Notes:
+ * Memory registration page/sg lists must not exceed max_num_sg.
+ * For mr_type IB_MR_TYPE_MEM_REG, the total length cannot exceed
+ * max_num_sg * used_page_size.
+ *


Nit:  the above sounds like used_page_size is a variable.  Something 
like this might work?


max_num_sg * the page size used for this sg list.
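
To make the bound in that note concrete, here is a minimal sketch that
assumes every sg entry maps exactly one page of a fixed size. The function
name and its parameters are hypothetical and are not part of the verbs API.

```c
#include <stdint.h>

/* Upper bound on the bytes one registration can cover, assuming every
 * sg entry maps exactly one page of page_size bytes. The multiplication
 * is done in 64 bits so large values do not wrap. */
static uint64_t sketch_mr_max_len(uint32_t max_num_sg, uint32_t page_size)
{
	return (uint64_t)max_num_sg * page_size;
}
```

For example, with 4096-byte pages, 128 entries bound a registration at
512KB and 2048 entries bound it at 8MB.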



+ */
+struct ib_mr *ib_alloc_mr(struct ib_pd *pd,
+ enum ib_mr_type mr_type,
+ u32 max_num_sg)
  {
struct ib_mr *mr;
  
-	if (!pd-device-create_mr)

-   return ERR_PTR(-ENOSYS);
-
-   mr = pd-device-create_mr(pd, mr_init_attr);
+   if (pd-device-alloc_mr) {
+   mr = pd-device-alloc_mr(pd, mr_type, max_num_sg);
+   } else {
+   if (mr_type != IB_MR_TYPE_MEM_REG ||
+   !pd-device-alloc_fast_reg_mr)
+   return ERR_PTR(-ENOSYS);
+   mr = pd-device-alloc_fast_reg_mr(pd, max_num_sg);
+   }
  
  	if (!IS_ERR(mr)) {

mr-device  = pd-device;
@@ -1292,7 +1309,7 @@ struct ib_mr *ib_create_mr(struct ib_pd *pd,
  
  	return mr;

  }
-EXPORT_SYMBOL(ib_create_mr);
+EXPORT_SYMBOL(ib_alloc_mr);
  
  struct ib_mr *ib_alloc_fast_reg_mr(struct ib_pd *pd, int max_page_list_len)

  {
diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 46d1383..2c2a461 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -1489,7 +1489,7 @@ static void *mlx5_ib_add(struct mlx5_core_dev *mdev)
dev-ib_dev.attach_mcast = mlx5_ib_mcg_attach;
dev-ib_dev.detach_mcast = mlx5_ib_mcg_detach;
dev-ib_dev.process_mad  = mlx5_ib_process_mad;
-   dev-ib_dev.create_mr= mlx5_ib_create_mr;
+   dev-ib_dev.alloc_mr = mlx5_ib_alloc_mr;
dev-ib_dev.alloc_fast_reg_mr= mlx5_ib_alloc_fast_reg_mr;
dev-ib_dev.alloc_fast_reg_page_list = mlx5_ib_alloc_fast_reg_page_list;
dev-ib_dev.free_fast_reg_page_list  = mlx5_ib_free_fast_reg_page_list;
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index 537f42e..3030abe 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -572,8 +572,9 @@ struct ib_mr *mlx5_ib_reg_user_mr(struct ib_pd *pd, u64 
start, u64 length,
  int mlx5_ib_update_mtt(struct mlx5_ib_mr *mr, u64 start_page_index,
   int npages, int zap);
  int mlx5_ib_dereg_mr(struct ib_mr *ibmr);
-struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,
-   struct ib_mr_init_attr *mr_init_attr);
+struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
+  enum ib_mr_type mr_type,
+  u32 max_num_sg);
  struct ib_mr *mlx5_ib_alloc_fast_reg_mr(struct ib_pd *pd,
int max_page_list_len);
  struct ib_fast_reg_page_list *mlx5_ib_alloc_fast_reg_page_list(struct 
ib_device *ibdev,
diff --git a/drivers/infiniband/hw/mlx5/mr.c b/drivers/infiniband/hw/mlx5/mr.c
index 03cf74e..b0b68bb 100644
--- a/drivers/infiniband/hw/mlx5/mr.c
+++ b/drivers/infiniband/hw/mlx5/mr.c
@@ -1246,14 +1246,15 @@ int mlx5_ib_dereg_mr(struct ib_mr *ibmr)
return 0;
  }
  
-struct ib_mr *mlx5_ib_create_mr(struct ib_pd *pd,

-   struct ib_mr_init_attr *mr_init_attr)
+struct ib_mr *mlx5_ib_alloc_mr(struct ib_pd *pd,
+  enum ib_mr_type mr_type,
+  u32 max_num_sg)
  {
struct mlx5_ib_dev *dev = to_mdev(pd-device);
struct mlx5_create_mkey_mbox_in *in;
struct mlx5_ib_mr *mr;
  

[PATCH v4 04/14] IB/ipoib: Return IPoIB devices matching connection parameters

2015-07-30 Thread Haggai Eran
From: Guy Shapiro gu...@mellanox.com

Implement the get_net_dev_by_params callback, which returns a network
device to ib_core according to the connection parameters. Check the ipoib
device and iterate over all child devices to look for a match.

For each IPoIB device we iterate through all upper devices when searching
for a matching IP, in order to support bonding.

Signed-off-by: Guy Shapiro gu...@mellanox.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 229 +-
 1 file changed, 228 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index cca1a0c91ec4..36536ce5a3e2 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -48,6 +48,9 @@
 
 #include <linux/jhash.h>
 #include <net/arp.h>
+#include <net/addrconf.h>
+#include <linux/inetdevice.h>
+#include <rdma/ib_cache.h>
 
 #define DRV_VERSION "1.0.0"
 
@@ -91,11 +94,16 @@ struct ib_sa_client ipoib_sa_client;
 static void ipoib_add_one(struct ib_device *device);
 static void ipoib_remove_one(struct ib_device *device, void *client_data);
 static void ipoib_neigh_reclaim(struct rcu_head *rp);
+static struct net_device *ipoib_get_net_dev_by_params(
+   struct ib_device *dev, u8 port, u16 pkey,
+   const union ib_gid *gid, const struct sockaddr *addr,
+   void *client_data);
 
 static struct ib_client ipoib_client = {
.name   = ipoib,
.add= ipoib_add_one,
-   .remove = ipoib_remove_one
+   .remove = ipoib_remove_one,
+   .get_net_dev_by_params = ipoib_get_net_dev_by_params,
 };
 
 int ipoib_open(struct net_device *dev)
@@ -222,6 +230,225 @@ static int ipoib_change_mtu(struct net_device *dev, int 
new_mtu)
return 0;
 }
 
+/* Called with an RCU read lock taken */
+static bool ipoib_is_dev_match_addr_rcu(const struct sockaddr *addr,
+   struct net_device *dev)
+{
+   struct net *net = dev_net(dev);
+   struct in_device *in_dev;
+   struct sockaddr_in *addr_in = (struct sockaddr_in *)addr;
+   struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr;
+   __be32 ret_addr;
+
+   switch (addr-sa_family) {
+   case AF_INET:
+   in_dev = in_dev_get(dev);
+   if (!in_dev)
+   return false;
+
+   ret_addr = inet_confirm_addr(net, in_dev, 0,
+addr_in-sin_addr.s_addr,
+RT_SCOPE_HOST);
+   in_dev_put(in_dev);
+   if (ret_addr)
+   return true;
+
+   break;
+   case AF_INET6:
+   if (IS_ENABLED(CONFIG_IPV6) 
+   ipv6_chk_addr(net, addr_in6-sin6_addr, dev, 1))
+   return true;
+
+   break;
+   }
+   return false;
+}
+
+/**
+ * Find the master net_device on top of the given net_device.
+ * @dev: base IPoIB net_device
+ *
+ * Returns the master net_device with a reference held, or the same net_device
+ * if no master exists.
+ */
+static struct net_device *ipoib_get_master_net_dev(struct net_device *dev)
+{
+   struct net_device *master;
+
+   rcu_read_lock();
+   master = netdev_master_upper_dev_get_rcu(dev);
+   if (master)
+   dev_hold(master);
+   rcu_read_unlock();
+
+   if (master)
+   return master;
+
+   dev_hold(dev);
+   return dev;
+}
+
+/**
+ * Find a net_device matching the given address, which is an upper device of
+ * the given net_device.
+ * @addr: IP address to look for.
+ * @dev: base IPoIB net_device
+ *
+ * If found, returns the net_device with a reference held. Otherwise return
+ * NULL.
+ */
+static struct net_device *ipoib_get_net_dev_match_addr(
+   const struct sockaddr *addr, struct net_device *dev)
+{
+   struct net_device *upper,
+ *result = NULL;
+   struct list_head *iter;
+
+   rcu_read_lock();
+   if (ipoib_is_dev_match_addr_rcu(addr, dev)) {
+   dev_hold(dev);
+   result = dev;
+   goto out;
+   }
+
+   netdev_for_each_all_upper_dev_rcu(dev, upper, iter) {
+   if (ipoib_is_dev_match_addr_rcu(addr, upper)) {
+   dev_hold(upper);
+   result = upper;
+   break;
+   }
+   }
+out:
+   rcu_read_unlock();
+   return result;
+}
+
+/* Returns the number of IPoIB netdevs on top of a given ipoib device
+ * matching a pkey_index and address, if one exists.
+ *
+ * @found_net_dev: contains a matching net_device if the return value >= 1,
+ * with a reference held. */
+static int ipoib_match_gid_pkey_addr(struct 
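
The two-phase scan mentioned in the cover letter changelog (match on
link-level parameters first, fall back to the IP address only when that is
ambiguous) might look roughly like the sketch below. It is a simplified
userspace model: the dev_sketch struct, the flat device array and both
function names are hypothetical, only the P_Key stands in for the link-level
parameters, and reference counting is left out entirely.

```c
#include <stddef.h>
#include <stdint.h>

struct dev_sketch {
	uint16_t pkey;
	uint32_t ipv4;   /* address configured on the device */
};

/* Counts devices matching pkey and, when addr is non-NULL, the address.
 * *found is set to the last match. */
static int sketch_count_matches(const struct dev_sketch *devs, size_t n,
				uint16_t pkey, const uint32_t *addr,
				const struct dev_sketch **found)
{
	int matches = 0;

	for (size_t i = 0; i < n; i++) {
		if (devs[i].pkey != pkey)
			continue;
		if (addr && devs[i].ipv4 != *addr)
			continue;
		*found = &devs[i];
		matches++;
	}
	return matches;
}

static const struct dev_sketch *sketch_find_dev(const struct dev_sketch *devs,
						size_t n, uint16_t pkey,
						uint32_t addr)
{
	const struct dev_sketch *found = NULL;
	int matches;

	/* Phase 1: link-level parameters only. */
	matches = sketch_count_matches(devs, n, pkey, NULL, &found);
	if (matches == 0)
		return NULL;
	if (matches == 1)
		return found;

	/* Phase 2: ambiguous, so narrow the candidates by IP address. */
	found = NULL;
	matches = sketch_count_matches(devs, n, pkey, &addr, &found);
	return matches ? found : NULL;
}
```

One consequence of this ordering is that a unique phase-1 match is returned
without looking at the address at all.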

[PATCH v4 05/14] IB/cm: Expose service ID in request events

2015-07-30 Thread Haggai Eran
Expose the service ID on an incoming CM or SIDR request to the event
handler. This will allow the RDMA CM module to de-multiplex connection
requests based on the information encoded in the service ID.

Acked-by: Sean Hefty sean.he...@intel.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
---
 drivers/infiniband/core/cm.c | 3 +++
 include/rdma/ib_cm.h | 1 +
 2 files changed, 4 insertions(+)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 82d5c4362aa8..93e9e2f34fc6 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -1268,6 +1268,7 @@ static void cm_format_paths_from_req(struct cm_req_msg 
*req_msg,
primary_path-packet_life_time =
cm_req_get_primary_local_ack_timeout(req_msg);
primary_path-packet_life_time -= (primary_path-packet_life_time  0);
+   primary_path-service_id = req_msg-service_id;
 
if (req_msg-alt_local_lid) {
memset(alt_path, 0, sizeof *alt_path);
@@ -1289,6 +1290,7 @@ static void cm_format_paths_from_req(struct cm_req_msg 
*req_msg,
alt_path-packet_life_time =
cm_req_get_alt_local_ack_timeout(req_msg);
alt_path-packet_life_time -= (alt_path-packet_life_time  0);
+   alt_path-service_id = req_msg-service_id;
}
 }
 
@@ -2992,6 +2994,7 @@ static void cm_format_sidr_req_event(struct cm_work *work,
param = work-cm_event.param.sidr_req_rcvd;
param-pkey = __be16_to_cpu(sidr_req_msg-pkey);
param-listen_id = listen_id;
+   param-service_id = sidr_req_msg-service_id;
param-port = work-port-port_num;
work-cm_event.private_data = sidr_req_msg-private_data;
 }
diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h
index 39ed2d2fbd51..1b567bbc3ad4 100644
--- a/include/rdma/ib_cm.h
+++ b/include/rdma/ib_cm.h
@@ -223,6 +223,7 @@ struct ib_cm_apr_event_param {
 
 struct ib_cm_sidr_req_event_param {
struct ib_cm_id *listen_id;
+   __be64  service_id;
u8  port;
u16 pkey;
 };
-- 
1.7.11.2



[PATCH v4 11/14] IB/cma: Validate routing of incoming requests

2015-07-30 Thread Haggai Eran
Pass incoming request parameters through the relevant IPv4/IPv6 routing
tables and make sure the network stack is configured to handle such
requests.

Signed-off-by: Haggai Eran hagg...@mellanox.com
---
 drivers/infiniband/core/cma.c | 95 +--
 1 file changed, 92 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 011aa7310dd3..f272b3d1799d 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -46,6 +46,8 @@
 
 #include <net/tcp.h>
 #include <net/ipv6.h>
+#include <net/ip_fib.h>
+#include <net/ip6_route.h>
 
 #include <rdma/rdma_cm.h>
 #include <rdma/rdma_cm_ib.h>
@@ -1081,16 +1083,98 @@ static int cma_save_req_info(const struct ib_cm_event 
*ib_event,
return 0;
 }
 
+static bool validate_ipv4_net_dev(struct net_device *net_dev,
+				  const struct sockaddr_in *dst_addr,
+				  const struct sockaddr_in *src_addr)
+{
+	__be32 daddr = dst_addr->sin_addr.s_addr,
+	       saddr = src_addr->sin_addr.s_addr;
+	struct fib_result res;
+	struct flowi4 fl4;
+	int err;
+	bool ret;
+
+	if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr) ||
+	    ipv4_is_lbcast(daddr) || ipv4_is_zeronet(saddr) ||
+	    ipv4_is_zeronet(daddr) || ipv4_is_loopback(daddr) ||
+	    ipv4_is_loopback(saddr))
+		return false;
+
+	memset(&fl4, 0, sizeof(fl4));
+	fl4.flowi4_iif = net_dev->ifindex;
+	fl4.daddr = daddr;
+	fl4.saddr = saddr;
+
+	rcu_read_lock();
+	err = fib_lookup(dev_net(net_dev), &fl4, &res, 0);
+	if (err)
+		ret = false;
+	else
+		ret = FIB_RES_DEV(res) == net_dev;
+	rcu_read_unlock();
+
+	return ret;
+}
+
+static bool validate_ipv6_net_dev(struct net_device *net_dev,
+ const struct sockaddr_in6 *dst_addr,
+ const struct sockaddr_in6 *src_addr)
+{
+#if IS_ENABLED(CONFIG_IPV6)
+   const int strict = ipv6_addr_type(dst_addr-sin6_addr) 
+  IPV6_ADDR_LINKLOCAL;
+   struct rt6_info *rt = rt6_lookup(dev_net(net_dev), dst_addr-sin6_addr,
+src_addr-sin6_addr, net_dev-ifindex,
+strict);
+   bool ret;
+
+   if (!rt)
+   return false;
+
+   ret = rt-rt6i_idev-dev == net_dev;
+   ip6_rt_put(rt);
+
+   return ret;
+#else
+   return false;
+#endif
+}
+
+static bool validate_net_dev(struct net_device *net_dev,
+const struct sockaddr *daddr,
+const struct sockaddr *saddr)
+{
+   const struct sockaddr_in *daddr4 = (const struct sockaddr_in *)daddr;
+   const struct sockaddr_in *saddr4 = (const struct sockaddr_in *)saddr;
+   const struct sockaddr_in6 *daddr6 = (const struct sockaddr_in6 *)daddr;
+   const struct sockaddr_in6 *saddr6 = (const struct sockaddr_in6 *)saddr;
+
+   switch (daddr-sa_family) {
+   case AF_INET:
+   return saddr-sa_family == AF_INET 
+  validate_ipv4_net_dev(net_dev, daddr4, saddr4);
+
+   case AF_INET6:
+   return saddr-sa_family == AF_INET6 
+  validate_ipv6_net_dev(net_dev, daddr6, saddr6);
+
+   default:
+   return false;
+   }
+}
+
 static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event,
  const struct cma_req_info *req)
 {
-   struct sockaddr_storage listen_addr_storage;
-   struct sockaddr *listen_addr = (struct sockaddr *)listen_addr_storage;
+   struct sockaddr_storage listen_addr_storage, src_addr_storage;
+   struct sockaddr *listen_addr = (struct sockaddr *)listen_addr_storage,
+   *src_addr = (struct sockaddr *)src_addr_storage;
struct net_device *net_dev;
const union ib_gid *gid = req-has_gid ? req-local_gid : NULL;
int err;
 
-   err = cma_save_ip_info(listen_addr, NULL, ib_event, req-service_id);
+   err = cma_save_ip_info(listen_addr, src_addr, ib_event,
+  req-service_id);
if (err)
return ERR_PTR(err);
 
@@ -1099,6 +1183,11 @@ static struct net_device *cma_get_net_dev(struct 
ib_cm_event *ib_event,
if (!net_dev)
return ERR_PTR(-ENODEV);
 
+   if (!validate_net_dev(net_dev, listen_addr, src_addr)) {
+   dev_put(net_dev);
+   return ERR_PTR(-EHOSTUNREACH);
+   }
+
return net_dev;
 }
 
-- 
1.7.11.2
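
The address sanity checks at the top of validate_ipv4_net_dev() can be tried
out in userspace with the sketch below. The sk_ helpers are hypothetical
re-implementations written for illustration, addresses are in host byte
order rather than __be32, and the route lookup that follows in the patch is
not modelled. The condition mirrors the posted one, including the fact that
multicast is rejected only for the source address.

```c
#include <stdbool.h>
#include <stdint.h>

/* Addresses are in host byte order in this sketch. */
static bool sk_is_multicast(uint32_t a) { return (a & 0xf0000000u) == 0xe0000000u; }
static bool sk_is_lbcast(uint32_t a)    { return a == 0xffffffffu; }
static bool sk_is_zeronet(uint32_t a)   { return (a & 0xff000000u) == 0; }
static bool sk_is_loopback(uint32_t a)  { return (a & 0xff000000u) == 0x7f000000u; }

/* Returns false for address pairs that should never appear in a
 * connection request; multicast is checked on the source only. */
static bool sk_addrs_plausible(uint32_t daddr, uint32_t saddr)
{
	if (sk_is_multicast(saddr) || sk_is_lbcast(saddr) ||
	    sk_is_lbcast(daddr) || sk_is_zeronet(saddr) ||
	    sk_is_zeronet(daddr) || sk_is_loopback(daddr) ||
	    sk_is_loopback(saddr))
		return false;
	return true;
}
```

A pair such as 10.0.0.1 and 10.0.0.2 passes, while a loopback destination or
a multicast source is rejected before any routing table is consulted.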



Re: [PATCH for-next 0/4] few mlx4 IB driver fixes for 4.3

2015-07-30 Thread Doug Ledford
On 07/30/2015 10:34 AM, Or Gerlitz wrote:
 Hi Doug,
 
 Some fixes included; none of them fixes a regression introduced in 4.2-rc1,
 so all can go to 4.3 -- generated them against 4.2-rc4
 
 Or.
 
 Jack Morgenstein (3):
   IB/mlx4: Fix potential deadlock when sending mad to wire
   IB/mlx4: Deprecate mcast group warning message to debug because of flooding
   IB/mlx4: In sysfs under RoCE, do not allow changing the paravirtualization 
 mapping for pkeys
 
 Noa Osherovich (1):
   IB/mlx4: Use correct SL on AH query under RoCE
 
  drivers/infiniband/hw/mlx4/ah.c|  6 +-
  drivers/infiniband/hw/mlx4/mcg.c   | 15 ++-
  drivers/infiniband/hw/mlx4/sysfs.c |  5 -
  3 files changed, 19 insertions(+), 7 deletions(-)
 

These all looked fine to me.  Picked up for 4.3.

-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD






[PATCH for-next 2/4] IB/mlx4: Deprecate mcast group warning message to debug because of flooding

2015-07-30 Thread Or Gerlitz
From: Jack Morgenstein ja...@dev.mellanox.co.il

The mcg "too many pending requests" warning message fills the log
when OpenSM is down. Demote the warning to debug output.


Signed-off-by: Jack Morgenstein ja...@dev.mellanox.co.il
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/hw/mlx4/mcg.c | 8 ++--
 1 file changed, 6 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/mcg.c b/drivers/infiniband/hw/mlx4/mcg.c
index a0559a8..2d5bccd 100644
--- a/drivers/infiniband/hw/mlx4/mcg.c
+++ b/drivers/infiniband/hw/mlx4/mcg.c
@@ -51,6 +51,10 @@
 	pr_warn("%s-%d: %16s (port %d): WARNING: " format, __func__, __LINE__,\
 	(group)->name, group->demux->port, ## arg)
 
+#define mcg_debug_group(group, format, arg...) \
+	pr_debug("%s-%d: %16s (port %d): WARNING: " format, __func__, __LINE__,\
+		 (group)->name, (group)->demux->port, ## arg)
+
 #define mcg_error_group(group, format, arg...) \
 	pr_err("  %16s: " format, (group)->name, ## arg)
 
@@ -962,8 +966,8 @@ int mlx4_ib_mcg_multiplex_handler(struct ib_device *ibdev, 
int port,
mutex_lock(group-lock);
if (group-func[slave].num_pend_reqs  MAX_PEND_REQS_PER_FUNC) {
mutex_unlock(group-lock);
-   mcg_warn_group(group, Port %d, Func %d has too many 
pending requests (%d), dropping\n,
-  port, slave, MAX_PEND_REQS_PER_FUNC);
+   mcg_debug_group(group, Port %d, Func %d has too many 
pending requests (%d), dropping\n,
+   port, slave, MAX_PEND_REQS_PER_FUNC);
release_group(group, 0);
kfree(req);
return -ENOMEM;
-- 
2.3.7



[PATCH for-next 3/4] IB/mlx4: In sysfs under RoCE, do not allow changing the paravirtualization mapping for pkeys

2015-07-30 Thread Or Gerlitz
From: Jack Morgenstein ja...@dev.mellanox.co.il

The pkey mapping for RoCE must remain the default mapping:
VFs:
  virtual index 0 = mapped to real index 0 (0x)
  All other indices: mapped to a real pkey index containing an
  invalid pkey.
PF:
  virtual index i = real index i.

Fixes: c1e7e466120b ('IB/mlx4: Add iov directory in sysfs under the ib device')
Signed-off-by: Jack Morgenstein ja...@dev.mellanox.co.il
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/hw/mlx4/sysfs.c | 5 -
 1 file changed, 4 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx4/sysfs.c b/drivers/infiniband/hw/mlx4/sysfs.c
index 6797108..69fb5ba 100644
--- a/drivers/infiniband/hw/mlx4/sysfs.c
+++ b/drivers/infiniband/hw/mlx4/sysfs.c
@@ -640,6 +640,8 @@ static int add_port(struct mlx4_ib_dev *dev, int port_num, int slave)
 	struct mlx4_port *p;
 	int i;
 	int ret;
+	int is_eth = rdma_port_get_link_layer(&dev->ib_dev, port_num) ==
+			IB_LINK_LAYER_ETHERNET;
 
 	p = kzalloc(sizeof *p, GFP_KERNEL);
 	if (!p)
@@ -657,7 +659,8 @@ static int add_port(struct mlx4_ib_dev *dev, int port_num, int slave)
 
 	p->pkey_group.name  = "pkey_idx";
 	p->pkey_group.attrs =
-		alloc_group_attrs(show_port_pkey, store_port_pkey,
+		alloc_group_attrs(show_port_pkey,
+				  is_eth ? NULL : store_port_pkey,
 				  dev->dev->caps.pkey_table_len[port_num]);
 	if (!p->pkey_group.attrs) {
 		ret = -ENOMEM;
-- 
2.3.7



[PATCH for-next 4/4] IB/mlx4: Use correct SL on AH query under RoCE

2015-07-30 Thread Or Gerlitz
From: Noa Osherovich no...@mellanox.com

The mlx4 IB driver implementation for ib_query_ah used a wrong offset
(28 instead of 29) when link type is Ethernet. Fixed to use the correct one.
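
To make the offset difference concrete, here is a minimal userspace sketch of the extraction. It assumes the sl_tclass_flowlabel word has already been converted to host byte order; the function name ah_sl and the is_eth flag are hypothetical stand-ins for the driver's link-layer check.

```c
#include <stdbool.h>
#include <stdint.h>

/*
 * Extract the service level from a host-order sl_tclass_flowlabel word.
 * Per the patch description: shift by 29 when the link layer is
 * Ethernet (RoCE), and by 28 when it is InfiniBand.
 */
unsigned int ah_sl(uint32_t sl_tclass_flowlabel, bool is_eth)
{
	return sl_tclass_flowlabel >> (is_eth ? 29 : 28);
}
```

For the word 0x60000000 this yields 3 with the Ethernet shift and 6 with the IB shift, so applying the IB offset on an Ethernet port reports a wrong SL.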

Fixes: fa417f7b520e ('IB/mlx4: Add support for IBoE')
Signed-off-by: Shani Michaeli sha...@mellanox.com
Signed-off-by: Noa Osherovich no...@mellanox.com
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/hw/mlx4/ah.c | 6 +-
 1 file changed, 5 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c
index f50a546..33fdd50 100644
--- a/drivers/infiniband/hw/mlx4/ah.c
+++ b/drivers/infiniband/hw/mlx4/ah.c
@@ -148,9 +148,13 @@ int mlx4_ib_query_ah(struct ib_ah *ibah, struct ib_ah_attr *ah_attr)
 	enum rdma_link_layer ll;
 
 	memset(ah_attr, 0, sizeof *ah_attr);
-	ah_attr->sl = be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28;
 	ah_attr->port_num = be32_to_cpu(ah->av.ib.port_pd) >> 24;
 	ll = rdma_port_get_link_layer(ibah->device, ah_attr->port_num);
+	if (ll == IB_LINK_LAYER_ETHERNET)
+		ah_attr->sl = be32_to_cpu(ah->av.eth.sl_tclass_flowlabel) >> 29;
+	else
+		ah_attr->sl = be32_to_cpu(ah->av.ib.sl_tclass_flowlabel) >> 28;
+
 	ah_attr->dlid = ll == IB_LINK_LAYER_INFINIBAND ? be16_to_cpu(ah->av.ib.dlid) : 0;
 	if (ah->av.ib.stat_rate)
 		ah_attr->static_rate = ah->av.ib.stat_rate - MLX4_STAT_RATE_OFFSET;
-- 
2.3.7



Re: [PATCH 20/22] IB/iser: Support up to 8MB data transfer in a single command

2015-07-30 Thread Steve Wise

On 7/30/2015 3:06 AM, Sagi Grimberg wrote:

iser supports up to 512KB data transfer in a single scsi
command. In order to support up to 8MB, iser needs to pre-allocate
larger memory regions and larger page vectors.

Given that a few target implementations don't support data transfers
of more than 512KB by default and the fact that larger IO sizes require
more resources, we introduce a module parameter to determine the
maximum number of 512B sectors in a single scsi command.
Users that are interested in larger transfers can change this value given
that the target supports larger transfers.

IO operations that consist of N pages will need a page vector
of size N+1 in case the first SG element contains an offset. Given
that some devices allocate memory regions in powers of 2, this
means that allocating a region with N+1 pages will result in
a region resource allocation of the next power of 2. Since we don't
want that to happen, in case we are in the limit of IO size supported
and the first SG element has an offset, we align the SG list using a
bounce buffer (which is OK given that this is not likely to happen a lot).
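
The sizing argument above can be checked with a little arithmetic. The following userspace sketch is illustrative only: the helper names pages_for_sectors and next_pow2 are hypothetical, and the power-of-two rounding models the device behaviour the commit message describes rather than any specific driver.

```c
#define SECTOR_SIZE	512u
#define SHIFT_4K	12

/* Number of 4K pages needed to cover max_sectors sectors of 512 bytes. */
unsigned int pages_for_sectors(unsigned int max_sectors)
{
	return (max_sectors * SECTOR_SIZE) >> SHIFT_4K;
}

/* Smallest power of two that is >= n, for n >= 1. */
unsigned int next_pow2(unsigned int n)
{
	unsigned int p = 1;

	while (p < n)
		p <<= 1;
	return p;
}
```

With 16384 sectors (8MB) this gives 2048 pages. An unaligned first SG element would need 2049 page-vector entries, which a power-of-two allocator rounds up to 4096. That doubling is what the bounce buffer avoids.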

Signed-off-by: Sagi Grimberg sa...@mellanox.com
---
  drivers/infiniband/ulp/iser/iscsi_iser.c | 19 ---
  drivers/infiniband/ulp/iser/iscsi_iser.h | 14 --
  drivers/infiniband/ulp/iser/iser_initiator.c |  2 +-
  drivers/infiniband/ulp/iser/iser_memory.c| 14 --
  drivers/infiniband/ulp/iser/iser_verbs.c | 27 +++
  5 files changed, 60 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c 
b/drivers/infiniband/ulp/iser/iscsi_iser.c
index e3cea61..9eeefc8 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -93,6 +93,10 @@ static unsigned int iscsi_max_lun = 512;
 module_param_named(max_lun, iscsi_max_lun, uint, S_IRUGO);
 MODULE_PARM_DESC(max_lun, "Max LUNs to allow per session (default:512");
 
+unsigned int iser_max_sectors = ISER_DEF_MAX_SECTORS;
+module_param_named(max_sectors, iser_max_sectors, uint, S_IRUGO | S_IWUSR);
+MODULE_PARM_DESC(max_sectors, "Max number of sectors in a single scsi command (default:1024");
+
 bool iser_pi_enable = false;
 module_param_named(pi_enable, iser_pi_enable, bool, S_IRUGO);
 MODULE_PARM_DESC(pi_enable, "Enable T10-PI offload support (default:disabled)");
@@ -625,6 +629,8 @@ iscsi_iser_session_create(struct iscsi_endpoint *ep,
 	if (ep) {
 		iser_conn = ep->dd_data;
 		max_cmds = iser_conn->max_cmds;
+		shost->sg_tablesize = iser_conn->scsi_sg_tablesize;
+		shost->max_sectors = iser_conn->scsi_max_sectors;
 
 		mutex_lock(&iser_conn->state_mutex);
 		if (iser_conn->state != ISER_CONN_UP) {
@@ -643,15 +649,6 @@ iscsi_iser_session_create(struct iscsi_endpoint *ep,
 						   SHOST_DIX_GUARD_CRC);
 		}
 
-		/*
-		 * Limit the sg_tablesize and max_sectors based on the device
-		 * max fastreg page list length.
-		 */
-		shost->sg_tablesize = min_t(unsigned short, shost->sg_tablesize,
-			ib_conn->device->dev_attr.max_fast_reg_page_list_len);
-		shost->max_sectors = min_t(unsigned int,
-			1024, (shost->sg_tablesize * PAGE_SIZE) >> 9);
-
 		if (iscsi_host_add(shost,
 				   ib_conn->device->ib_device->dma_device)) {
 			mutex_unlock(&iser_conn->state_mutex);
@@ -966,8 +963,8 @@ static struct scsi_host_template iscsi_iser_sht = {
 	.name                   = "iSCSI Initiator over iSER",
 	.queuecommand           = iscsi_queuecommand,
 	.change_queue_depth	= scsi_change_queue_depth,
-	.sg_tablesize           = ISCSI_ISER_SG_TABLESIZE,
-	.max_sectors		= 1024,
+	.sg_tablesize           = ISCSI_ISER_DEF_SG_TABLESIZE,
+	.max_sectors		= ISER_DEF_MAX_SECTORS,
 	.cmd_per_lun            = ISER_DEF_CMD_PER_LUN,
 	.eh_abort_handler       = iscsi_eh_abort,
 	.eh_device_reset_handler= iscsi_eh_device_reset,
diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h 
b/drivers/infiniband/ulp/iser/iscsi_iser.h
index e9ebe0b..8a32e20 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -98,8 +98,13 @@
  #define SHIFT_4K  12
  #define SIZE_4K   (1ULL << SHIFT_4K)
  #define MASK_4K   (~(SIZE_4K-1))
-   /* support up to 512KB in one RDMA */
-#define ISCSI_ISER_SG_TABLESIZE (0x8  SHIFT_4K)
+
+/* Default support is 512KB I/O size */
+#define ISER_DEF_MAX_SECTORS		1024
+#define ISCSI_ISER_DEF_SG_TABLESIZE	((ISER_DEF_MAX_SECTORS * 512) >> SHIFT_4K)
+/* Maximum support is 8MB I/O size */
+#define ISCSI_ISER_MAX_SG_TABLESIZE	(16384 * 512 >> SHIFT_4K)
+
  #define 

Re: [PATCH] IB/ipoib: CSUM support in connected mode

2015-07-30 Thread Yuval Shaia
On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote:
 Hi,
 
 Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit :
  This enhancement suggests the usage of IB CRC instead of CSUM in IPoIB 
  CM. IPoIB CM uses RC (Reliable Connection) which guarantees the 
  corruption free delivery of the packet.
  
  InfiniBand uses 32b CRC which provides stronger data integrity 
  protection compared to 16b IP Checksum.
 
 InfiniBand 32b CRC = Ethernet 32b CRC, it's link layer, layer 2.
 
 IPv4 checksum is at another level, it's internet layer, layer 3.
 
   So, there is no added value that IP/TCP Checksum provides in the IB 
  world.
  
 
 Sure, IPv4 checksum is a thing of the past: checksum was dropped from
 IP header in IPv6: it assumes the lower layer, such as Ethernet,
 provides the required integrity check.
 
 I think not checking the IPv4 checksum should be a choice, carefully
 thought, for inside a fabric, as I understand your proposal, packet
 with invalid checksum will be allowed to go in/out of the fabric.
Yes, this is why it is controlled by module parameter.
Maybe a better choice would be to default it to 0.
 
 It sound like it's a departure from the behavior one can expect from an
 IPv4 network stack.
It should be considered as network-fine-tuning parameter so if admin knows his 
fabric he can use it.
 
  The proposal is to tell network stack that IPoIB-CM supports IP 
  Checksum offload. This enables the kernel to save the time of 
  checksum calculation of IPoIB CM packets. Network sends the IP packet 
  without adding the IP Checksum to the header. On the receive side, 
  IPoIB driver again tells the network stack that IP Checksum is good 
  for the incoming packets and network stack avoids the IP Checksum 
  calculations.
  
  During connection establishment the driver determines whether the peer
  supports IB CRC as checksum. This is done so the driver will be able to
  calculate the checksum before transmitting the packet in case the peer
  does not support this feature.
  
 
 Two questions:
Three :)
 
 - What will see tool such as wireshark/tcpdump when sniffing checksum
Zero, or whatever the networking layer puts in csum when H/W supports
CSUM offloading.
Please note that with this patch the driver still supports backward compatibility
(per connection).
This means that for connections with a peer that does not support this
functionality you should expect to see this field filled with a checksum.
 -less IPv4 packets sent/received on IPoIB interface ?
No
 
 - What might happen if such checksum-less IPv4 packet is later routed to a 
 different IPv4 network ?
As noted above, for a network that is open to the outside world this feature
should be blocked.
In general I would say that if a layer 2 terminating device (e.g. a router)
exists in the fabric, this feature can't be used and must be blocked.
Even with this limitation it is still worth using because of the increased
throughput.
 
  With this enhancement throughput is increased by 60%.
  
 
  Signed-off-by: Yuval Shaia yuval.sh...@oracle.com
 
 Regards.
 
 -- 
 Yann Droneaud
 OPTEYA
 
 


[PATCH for-next V7 02/10] net: Add info for NETDEV_CHANGEUPPER event

2015-07-30 Thread Matan Barak
Some consumers of the NETDEV_CHANGEUPPER event would like to know which
upper device was linked/unlinked and what operation was carried out.

Add information in the notifier info block for that purpose.
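
The consumer side is not part of this patch. As a rough userspace sketch of how a listener could recover the extra fields from the generic info pointer, assuming the generic block is the first member of the extended one, consider the following. All type and function names here are simplified, hypothetical stand-ins for the kernel's.

```c
/* Stand-in for the generic notifier info block. */
struct notifier_info {
	const char *dev_name;
};

enum changeupper_event {
	CHANGEUPPER_LINK,
	CHANGEUPPER_UNLINK,
};

/* Extended block: the generic info must be the first member. */
struct changeupper_info {
	struct notifier_info	info;	/* must be first */
	enum changeupper_event	event;
	const char		*upper_name;
};

/*
 * A listener is handed only the generic pointer. Because info is the
 * first member, a pointer to it is also a valid pointer to the
 * enclosing changeupper_info, so the cast below is well defined.
 */
int is_link_event(const struct notifier_info *info)
{
	const struct changeupper_info *ci =
		(const struct changeupper_info *)info;

	return ci->event == CHANGEUPPER_LINK;
}
```

This is why the struct in the patch marks its info member as "must be first": the notifier chain passes the generic pointer, and the listener widens it back.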

Signed-off-by: Matan Barak mat...@mellanox.com
---
 include/linux/netdevice.h | 14 ++
 net/core/dev.c| 12 ++--
 2 files changed, 24 insertions(+), 2 deletions(-)

diff --git a/include/linux/netdevice.h b/include/linux/netdevice.h
index e20979d..2b7fe4e 100644
--- a/include/linux/netdevice.h
+++ b/include/linux/netdevice.h
@@ -3556,6 +3556,20 @@ struct sk_buff *__skb_gso_segment(struct sk_buff *skb,
 struct sk_buff *skb_mac_gso_segment(struct sk_buff *skb,
netdev_features_t features);
 
+enum netdev_changeupper_event {
+   NETDEV_CHANGEUPPER_LINK,
+   NETDEV_CHANGEUPPER_UNLINK,
+};
+
+struct netdev_changeupper_info {
+   struct netdev_notifier_info info; /* must be first */
+   enum netdev_changeupper_event   event;
+   struct net_device   *upper;
+};
+
+void netdev_changeupper_info_change(struct net_device *dev,
+   struct netdev_changeupper_info *info);
+
 struct netdev_bonding_info {
ifslave slave;
ifbond  master;
diff --git a/net/core/dev.c b/net/core/dev.c
index a8e4dd4..6e6f14e 100644
--- a/net/core/dev.c
+++ b/net/core/dev.c
@@ -5302,6 +5302,7 @@ static int __netdev_upper_dev_link(struct net_device *dev,
   void *private)
 {
struct netdev_adjacent *i, *j, *to_i, *to_j;
+   struct netdev_changeupper_info changeupper_info;
int ret = 0;
 
ASSERT_RTNL();
@@ -5357,7 +5358,10 @@ static int __netdev_upper_dev_link(struct net_device *dev,
 			goto rollback_lower_mesh;
 	}
 
-	call_netdevice_notifiers(NETDEV_CHANGEUPPER, dev);
+	changeupper_info.event = NETDEV_CHANGEUPPER_LINK;
+	changeupper_info.upper = upper_dev;
+	call_netdevice_notifiers_info(NETDEV_CHANGEUPPER, dev,
+				      &changeupper_info.info);
 	return 0;
 
 rollback_lower_mesh:
@@ -5453,6 +5457,7 @@ void netdev_upper_dev_unlink(struct net_device *dev,
 struct net_device *upper_dev)
 {
struct netdev_adjacent *i, *j;
+   struct netdev_changeupper_info changeupper_info;
ASSERT_RTNL();
 
__netdev_adjacent_dev_unlink_neighbour(dev, upper_dev);
@@ -5474,7 +5479,10 @@ void netdev_upper_dev_unlink(struct net_device *dev,
 	list_for_each_entry(i, &upper_dev->all_adj_list.upper, list)
 		__netdev_adjacent_dev_unlink(dev, i->dev);
 
-	call_netdevice_notifiers(NETDEV_CHANGEUPPER, dev);
+	changeupper_info.event = NETDEV_CHANGEUPPER_UNLINK;
+	changeupper_info.upper = upper_dev;
+	call_netdevice_notifiers_info(NETDEV_CHANGEUPPER, dev,
+				      &changeupper_info.info);
 }
 EXPORT_SYMBOL(netdev_upper_dev_unlink);
 
-- 
2.1.0

Cc: net...@vger.kernel.org


[PATCH for-next V7 07/10] net/mlx4: Postpone the registration of net_device

2015-07-30 Thread Matan Barak
From: Moni Shoua mo...@mellanox.com

The mlx4 network driver was registered in the context of the 'add'
function of the core driver (called when HW should be registered).
This causes the NETDEV_REGISTER netdev event to be sent in a context
where the get_protocol_dev() callback still returns NULL. This may
be confusing to listeners of netdev events.
This patch is a preparation for the patch that implements the
get_netdev() callback in the IB/mlx4 driver.
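
The ordering problem can be illustrated with a small userspace model of a two-phase "add, then activate" registration. This is only a sketch under simplifying assumptions: no locking, a fixed-size array instead of a list, and hypothetical names (intf, add_device, ctx_registered, demo_*) that do not come from the mlx4 code.

```c
#include <stddef.h>

#define MAX_CTX 8

struct intf {
	void *(*add)(void);		/* phase 1: build a context, may fail */
	void (*activate)(void *ctx);	/* phase 2: optional, may emit events */
};

static void *ctx_list[MAX_CTX];
static int ctx_count;

/* Lookup that an event listener would perform. */
int ctx_registered(const void *ctx)
{
	for (int i = 0; i < ctx_count; i++)
		if (ctx_list[i] == ctx)
			return 1;
	return 0;
}

/* Returns 1 if the context was added and activated, 0 otherwise. */
int add_device(const struct intf *intf)
{
	void *ctx = intf->add();

	if (!ctx || ctx_count == MAX_CTX)
		return 0;
	ctx_list[ctx_count++] = ctx;	/* context is now discoverable */
	if (intf->activate)
		intf->activate(ctx);	/* lookups made from here succeed */
	return 1;
}

/* Demo interface: records whether the lookup worked during activate. */
static int demo_ctx;
int lookup_ok_during_activate;

static void *demo_add(void)
{
	return &demo_ctx;
}

static void demo_activate(void *ctx)
{
	lookup_ok_during_activate = ctx_registered(ctx);
}

const struct intf demo_intf = { demo_add, demo_activate };
```

Had the work done in demo_activate been performed inside demo_add, the lookup would have failed, which mirrors the NULL answer described above.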

Signed-off-by: Moni Shoua mo...@mellanox.com
---
 drivers/net/ethernet/mellanox/mlx4/en_main.c | 36 
 drivers/net/ethernet/mellanox/mlx4/intf.c|  3 +++
 include/linux/mlx4/driver.h  |  1 +
 3 files changed, 25 insertions(+), 15 deletions(-)

diff --git a/drivers/net/ethernet/mellanox/mlx4/en_main.c 
b/drivers/net/ethernet/mellanox/mlx4/en_main.c
index 913b716..a946e4b 100644
--- a/drivers/net/ethernet/mellanox/mlx4/en_main.c
+++ b/drivers/net/ethernet/mellanox/mlx4/en_main.c
@@ -224,6 +224,26 @@ static void mlx4_en_remove(struct mlx4_dev *dev, void 
*endev_ptr)
kfree(mdev);
 }
 
+static void mlx4_en_activate(struct mlx4_dev *dev, void *ctx)
+{
+	int i;
+	struct mlx4_en_dev *mdev = ctx;
+
+	/* Create a netdev for each port */
+	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_ETH) {
+		mlx4_info(mdev, "Activating port:%d\n", i);
+		if (mlx4_en_init_netdev(mdev, i, &mdev->profile.prof[i]))
+			mdev->pndev[i] = NULL;
+	}
+
+	/* register notifier */
+	mdev->nb.notifier_call = mlx4_en_netdev_event;
+	if (register_netdevice_notifier(&mdev->nb)) {
+		mdev->nb.notifier_call = NULL;
+		mlx4_err(mdev, "Failed to create notifier\n");
+	}
+}
+
 static void *mlx4_en_add(struct mlx4_dev *dev)
 {
struct mlx4_en_dev *mdev;
@@ -297,21 +317,6 @@ static void *mlx4_en_add(struct mlx4_dev *dev)
 	mutex_init(&mdev->state_lock);
 	mdev->device_up = true;
 
-	/* Setup ports */
-
-	/* Create a netdev for each port */
-	mlx4_foreach_port(i, dev, MLX4_PORT_TYPE_ETH) {
-		mlx4_info(mdev, "Activating port:%d\n", i);
-		if (mlx4_en_init_netdev(mdev, i, &mdev->profile.prof[i]))
-			mdev->pndev[i] = NULL;
-	}
-	/* register notifier */
-	mdev->nb.notifier_call = mlx4_en_netdev_event;
-	if (register_netdevice_notifier(&mdev->nb)) {
-		mdev->nb.notifier_call = NULL;
-		mlx4_err(mdev, "Failed to create notifier\n");
-	}
-
 	return mdev;
 
 err_mr:
@@ -335,6 +340,7 @@ static struct mlx4_interface mlx4_en_interface = {
.event  = mlx4_en_event,
.get_dev= mlx4_en_get_netdev,
.protocol   = MLX4_PROT_ETH,
+   .activate   = mlx4_en_activate,
 };
 
 static void mlx4_en_verify_params(void)
diff --git a/drivers/net/ethernet/mellanox/mlx4/intf.c 
b/drivers/net/ethernet/mellanox/mlx4/intf.c
index 0d80aed..0472941 100644
--- a/drivers/net/ethernet/mellanox/mlx4/intf.c
+++ b/drivers/net/ethernet/mellanox/mlx4/intf.c
@@ -63,8 +63,11 @@ static void mlx4_add_device(struct mlx4_interface *intf, struct mlx4_priv *priv)
 		spin_lock_irq(&priv->ctx_lock);
 		list_add_tail(&dev_ctx->list, &priv->ctx_list);
 		spin_unlock_irq(&priv->ctx_lock);
+		if (intf->activate)
+			intf->activate(&priv->dev, dev_ctx->context);
 	} else
 		kfree(dev_ctx);
+
 }
 
 static void mlx4_remove_device(struct mlx4_interface *intf, struct mlx4_priv 
*priv)
diff --git a/include/linux/mlx4/driver.h b/include/linux/mlx4/driver.h
index 9553a73..5a06d96 100644
--- a/include/linux/mlx4/driver.h
+++ b/include/linux/mlx4/driver.h
@@ -59,6 +59,7 @@ struct mlx4_interface {
 	void			(*event) (struct mlx4_dev *dev, void *context,
 					  enum mlx4_dev_event event, unsigned long param);
 	void *			(*get_dev)(struct mlx4_dev *dev, void *context, u8 port);
+	void			(*activate)(struct mlx4_dev *dev, void *context);
 	struct list_head	list;
 	enum mlx4_protocol	protocol;
 	int			flags;
-- 
2.1.0



Re: [PATCH for-4.3 00/15] Modify MR allocation API

2015-07-30 Thread Steve Wise

On 7/30/2015 2:32 AM, Sagi Grimberg wrote:

This patch set is detached from my WIP for modifying our
fast registration kernel API. I incorporated some comments
from Jason and Christoph. The current set is a drop-in replacement
of ib_alloc_fast_reg_mr to ib_alloc_mr which receives a memory
region type (which can be IB_MR_TYPE_MEM_REG for normal memory
registration, IB_MR_TYPE_SIGNATURE for a data-integrity capable
memory region and future arbitrary SG support capable memory
region).




Series looks good.

Reviewed-by: Steve Wise sw...@opengridcomputing.com


Re: [PATCH v3 00/13] Demux IB CM requests in the rdma_cm module

2015-07-30 Thread Doug Ledford
On 07/30/2015 05:03 AM, Haggai Eran wrote:
 On 29/07/2015 17:49, Doug Ledford wrote:
 This doesn't apply on to a clean 4.2-rc4 kernel tree.  Can you please
 rebase against either that or my to-be-rebase/for-4.3 branch of my
 github repo?
 
 Sure. Keep in mind that this patchset also depends on the patch from
 Matan's series to protect the device and client list with a rwsem [2].
 Do you want me to add it to this series?

You added it to the series, which is OK and I've picked that series up.
 In general though, if there are two different patch series that both
need the same patch, I would prefer it if that one patch is sent by
itself so I can grab it and not included in either of the other two
series.  I think some of the patches that didn't apply were when I
forgot to remove a duplicate patch found in two different series.

 Haggai
 
 [2] [PATCH for-next V5 02/12] IB/core: Add rwsem to allow reading device 
 list or client list
 http://www.spinics.net/lists/linux-rdma/msg25931.html
 
 


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD






[PATCH for-next V7 01/10] net/ipv6: Export addrconf_ifid_eui48

2015-07-30 Thread Matan Barak
For loopback purposes, RoCE devices should have a default GID in the
port GID table, even when the interface is down. In order to do so,
we use the IPv6 link local address which would have been generated
for the related Ethernet netdevice when it goes up as a default GID.

addrconf_ifid_eui48 is used to generate this address; export it.
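
For readers unfamiliar with the derivation, the common case of the helper can be modelled in userspace as follows. This is a simplified sketch: the name mac_to_ifid is hypothetical, and the dev_id special case handled by the kernel function is deliberately left out.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/*
 * Build a modified EUI-64 interface identifier from a 48-bit MAC:
 * copy the two 3-byte halves, insert 0xFF 0xFE between them, and
 * flip the universal/local bit of the first byte.
 */
void mac_to_ifid(uint8_t eui[8], const uint8_t mac[6])
{
	assert(eui != NULL && mac != NULL);

	memcpy(eui, mac, 3);
	memcpy(eui + 5, mac + 3, 3);
	eui[3] = 0xFF;
	eui[4] = 0xFE;
	eui[0] ^= 2;
}
```

For the MAC 00:02:c9:12:34:56 this produces the identifier 02:02:c9:ff:fe:12:34:56. Combined with the fe80::/64 link-local prefix, that gives the address used as the default GID.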

Signed-off-by: Matan Barak mat...@mellanox.com
---
 include/net/addrconf.h | 31 +++
 net/ipv6/addrconf.c| 31 ---
 2 files changed, 31 insertions(+), 31 deletions(-)

diff --git a/include/net/addrconf.h b/include/net/addrconf.h
index def59d3..431fdfa 100644
--- a/include/net/addrconf.h
+++ b/include/net/addrconf.h
@@ -91,6 +91,37 @@ int ipv6_rcv_saddr_equal(const struct sock *sk, const struct sock *sk2);
 void addrconf_join_solict(struct net_device *dev, const struct in6_addr *addr);
 void addrconf_leave_solict(struct inet6_dev *idev, const struct in6_addr *addr);
 
+static inline int addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
+{
+	if (dev->addr_len != ETH_ALEN)
+		return -1;
+	memcpy(eui, dev->dev_addr, 3);
+	memcpy(eui + 5, dev->dev_addr + 3, 3);
+
+	/*
+	 * The zSeries OSA network cards can be shared among various
+	 * OS instances, but the OSA cards have only one MAC address.
+	 * This leads to duplicate address conflicts in conjunction
+	 * with IPv6 if more than one instance uses the same card.
+	 *
+	 * The driver for these cards can deliver a unique 16-bit
+	 * identifier for each instance sharing the same card.  It is
+	 * placed instead of 0xFFFE in the interface identifier.  The
+	 * "u" bit of the interface identifier is not inverted in this
+	 * case.  Hence the resulting interface identifier has local
+	 * scope according to RFC2373.
+	 */
+	if (dev->dev_id) {
+		eui[3] = (dev->dev_id >> 8) & 0xFF;
+		eui[4] = dev->dev_id & 0xFF;
+	} else {
+		eui[3] = 0xFF;
+		eui[4] = 0xFE;
+		eui[0] ^= 2;
+	}
+	return 0;
+}
+
 static inline unsigned long addrconf_timeout_fixup(u32 timeout,
   unsigned int unit)
 {
diff --git a/net/ipv6/addrconf.c b/net/ipv6/addrconf.c
index 21c2c81..5b0c041 100644
--- a/net/ipv6/addrconf.c
+++ b/net/ipv6/addrconf.c
@@ -1845,37 +1845,6 @@ static void addrconf_leave_anycast(struct inet6_ifaddr *ifp)
 	__ipv6_dev_ac_dec(ifp->idev, &addr);
 }
 
-static int addrconf_ifid_eui48(u8 *eui, struct net_device *dev)
-{
-	if (dev->addr_len != ETH_ALEN)
-		return -1;
-	memcpy(eui, dev->dev_addr, 3);
-	memcpy(eui + 5, dev->dev_addr + 3, 3);
-
-	/*
-	 * The zSeries OSA network cards can be shared among various
-	 * OS instances, but the OSA cards have only one MAC address.
-	 * This leads to duplicate address conflicts in conjunction
-	 * with IPv6 if more than one instance uses the same card.
-	 *
-	 * The driver for these cards can deliver a unique 16-bit
-	 * identifier for each instance sharing the same card.  It is
-	 * placed instead of 0xFFFE in the interface identifier.  The
-	 * "u" bit of the interface identifier is not inverted in this
-	 * case.  Hence the resulting interface identifier has local
-	 * scope according to RFC2373.
-	 */
-	if (dev->dev_id) {
-		eui[3] = (dev->dev_id >> 8) & 0xFF;
-		eui[4] = dev->dev_id & 0xFF;
-	} else {
-		eui[3] = 0xFF;
-		eui[4] = 0xFE;
-		eui[0] ^= 2;
-	}
-	return 0;
-}
-
 static int addrconf_ifid_eui64(u8 *eui, struct net_device *dev)
 {
 	if (dev->addr_len != IEEE802154_ADDR_LEN)
-- 
2.1.0

Cc: net...@vger.kernel.org


[PATCH for-next V7 03/10] net/bonding: Export bond_option_active_slave_get_rcu

2015-07-30 Thread Matan Barak
Some consumers of the netdev events API would like to know which slave is
active when a NETDEV_CHANGEUPPER or NETDEV_BONDING_FAILOVER
event occurs. For example, when managing RoCE GIDs, GIDs based on the
bond's IPs should only be set on the port which corresponds to the active
slave netdevice.

Signed-off-by: Matan Barak mat...@mellanox.com
---
 drivers/net/bonding/bond_options.c | 13 -
 include/net/bonding.h  |  7 +++
 2 files changed, 7 insertions(+), 13 deletions(-)

diff --git a/drivers/net/bonding/bond_options.c b/drivers/net/bonding/bond_options.c
index e9c624d..28bd005 100644
--- a/drivers/net/bonding/bond_options.c
+++ b/drivers/net/bonding/bond_options.c
@@ -730,19 +730,6 @@ static int bond_option_mode_set(struct bonding *bond,
 	return 0;
 }
 
-static struct net_device *__bond_option_active_slave_get(struct bonding *bond,
-							 struct slave *slave)
-{
-	return bond_uses_primary(bond) && slave ? slave->dev : NULL;
-}
-
-struct net_device *bond_option_active_slave_get_rcu(struct bonding *bond)
-{
-	struct slave *slave = rcu_dereference(bond->curr_active_slave);
-
-	return __bond_option_active_slave_get(bond, slave);
-}
-
 static int bond_option_active_slave_set(struct bonding *bond,
const struct bond_opt_value *newval)
 {
diff --git a/include/net/bonding.h b/include/net/bonding.h
index 20defc0..c1740a2 100644
--- a/include/net/bonding.h
+++ b/include/net/bonding.h
@@ -310,6 +310,13 @@ static inline bool bond_uses_primary(struct bonding *bond)
return bond_mode_uses_primary(BOND_MODE(bond));
 }
 
+static inline struct net_device *bond_option_active_slave_get_rcu(struct bonding *bond)
+{
+	struct slave *slave = rcu_dereference(bond->curr_active_slave);
+
+	return bond_uses_primary(bond) && slave ? slave->dev : NULL;
+}
+
 static inline bool bond_slave_is_up(struct slave *slave)
 {
 	return netif_running(slave->dev) && netif_carrier_ok(slave->dev);
-- 
2.1.0

Cc: net...@vger.kernel.org


[PATCH for-next V7 00/10] Move RoCE GID management to IB/Core

2015-07-30 Thread Matan Barak
This series has been running in linux-rdma for a while. We added here
CC to netdev for the three pre-patches which come first. They allow
the IB core to access some helpers (e.g. generating default Eth IPv6
link local address), gain more info on bonding changes, etc.

Previously, every vendor implemented its net device notifiers in its own
driver. This introduces a huge amount of code duplication, as figuring out
whether an event is related to the vendor's net device in the
various cases (bonding, vlan or any other upper device) is
similar for all vendors. In the future, when multiple GID types are
supported, this code duplication would get even worse.

Therefore, we decided to move this into common core code.
roce_gid_table and roce_gid_mgmt were created in order to store and
manage the new GID table, by filling it when getting the related events.
Vendors now only have to implement modify_gid and get_netdev IB
device calls, which are truly unique for each vendor.
roce_gid_table is implemented as IB client that manages the GID
table of the IB device. Each GID is associated with a GID type and a
network device (which is mandatory for management of the GID table).
The GID table is populated by using roce_gid_mgmt. roce_gid_mgmt
registers to net device/inet/inet6 events and calls roce_gid_table
in order to populate the GID table accordingly.

Patch 0005 is the core patch in this series. It creates a new infrastructure
for storing GIDs and their attributes in IB/core. This infrastructure supports
reading and writing GIDs along with their metadata. The new infrastructure
is used for managing both RoCE ports and IB ports. The core difference is that
for IB ports this infrastructure is used solely as a cache, while in RoCE we
actually manage the vendor's GID table by calling add_gid and del_gid callbacks.
In RoCE, we always enable default gids for an active device (an active device
is defined here as a device that doesn't have a bonding master or is the current
active slave). This is done in order to allow loopback traffic.

Patch 0004 replaces the locking scheme for IB devices. Previously, device_mutex
was used to lock the devices/clients list against every modification.
However, downstream patches add new functions which iterate over the device
list. Those functions could be executed from workqueue context on behalf
of IB clients. Thus, when a client is removed, we need to wait for all works
to finish. Since client removal was done under the device_mutex lock, we would
in fact be waiting for a work which itself needs to take device_mutex
(=DEADLOCK). To mitigate this problem, we use a rw semaphore to allow
multiple readers. We use a mutex to resolve races between adding
(or removing) a client and a device simultaneously, which could have resulted
in calling client->add (or client->remove) twice for the same device and client.
This patch was sent as part of the "Add network namespace support in the RDMA-CM"
series.

Patch 0006 adds population of this table for the bonding case based on net
device events. Only the active slaves retain their master's IP based gids and
default gids.

Patch 0001 exports addrconf_ifid_eui48 in order to generate the default GID.
Patch 0002 adds information for NETDEV_CHANGEUPPER which is used in order to
understand the nature of change - link/unlink and which master net-device is
related to this change.
Patch 0003 exports bond_option_active_slave_get_rcu which is necessary in
order to assign the GIDs only to the active slave.

The rest of the patches add support for ocrdma and mlx4 devices.

This series is rebased over Doug's to-be-rebase/for-4.3.

Thanks,
Devesh, Somnath, Moni and Matan

Changes from V6:
(1) Addressed Jason's comments:
(a) Cache is no longer a client but part of IB infrastructure
(b) No need for READ_ONCE and flush_workqueue when tearing down
the cache

Changes from V5:
(1) Incorporate the changes to cache.c so we use the same infrastructure
to manage both IB and RoCE (per Doug's request)
(2) Replace the locking mechanism in the IB core GID cache from seqcount +
rcu to rwlock (addressing comments from Jason)
(3) get_netdev returns a held (dev_hold) device
(4) Squashed the RoCE GID table, RoCE GID management and default GID handling
code into one patch (per Doug's request).
(5) Change modify_gid to add_gid and del_gid.
(6) set the netdev related changes into three dedicated patches and make
them be 1st in the series.

Changes from V4:
(1) Remove any API changes.
(2) Fixed a bug regarding bonding upper devices.
(3) Rebased on top of Doug's k.o/for-4.2.

Changes from V3:
(1) Remove RoCE V2 functionality (it will be sent at later patchset).
(2) Instead of removing qp_attr_mask flags, reserve them.
(3) Remove the kref from IB devices in favor of rwsem.
(4) Change the name of roce_gid_cache to roce_gid_table.
(5) Fix a race when roce_gid_table is free'd while getting events.
(6) Remove the roce_gid_cache 

[PATCH for-next V7 08/10] IB/mlx4: Implement ib_device callbacks

2015-07-30 Thread Matan Barak
From: Moni Shoua mo...@mellanox.com

get_netdev: get the net_device on the physical port of the IB transport port. In
port aggregation mode it is required to return the netdev of the active port.

modify_gid: notification of a change in the RoCE GID cache. Handle this by
writing to the hardware GID table. It is possible that indexes in the cache and
hardware tables won't match, so a translation is required when modifying a QP
or creating an address handle.

Signed-off-by: Moni Shoua mo...@mellanox.com
---
 drivers/infiniband/core/cache.c  |   3 +-
 drivers/infiniband/hw/mlx4/main.c| 236 ++-
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  17 +++
 include/linux/mlx4/device.h  |   3 +-
 include/rdma/ib_verbs.h  |   2 +
 5 files changed, 257 insertions(+), 4 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 9b4f16b..a9d5c70 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -55,7 +55,8 @@ struct ib_update_work {
u8 port_num;
 };
 
-static union ib_gid zgid;
+union ib_gid zgid;
+EXPORT_SYMBOL(zgid);
 
 static const struct ib_gid_attr zattr;
 
diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 2f81723..61df5c9 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -45,6 +45,9 @@
 #include <rdma/ib_smi.h>
 #include <rdma/ib_user_verbs.h>
 #include <rdma/ib_addr.h>
+#include <rdma/ib_cache.h>
+
+#include <net/bonding.h>
 
 #include <linux/mlx4/driver.h>
 #include <linux/mlx4/cmd.h>
@@ -93,8 +96,6 @@ static void init_query_mad(struct ib_smp *mad)
 	mad->method = IB_MGMT_METHOD_GET;
 }
 
-static union ib_gid zgid;
-
 static int check_flow_steering_support(struct mlx4_dev *dev)
 {
int eth_num_ports = 0;
@@ -131,6 +132,237 @@ static int num_ib_ports(struct mlx4_dev *dev)
return ib_ports;
 }
 
+static struct net_device *mlx4_ib_get_netdev(struct ib_device *device, u8 port_num)
+{
+	struct mlx4_ib_dev *ibdev = to_mdev(device);
+	struct net_device *dev;
+
+	rcu_read_lock();
+	dev = mlx4_get_protocol_dev(ibdev->dev, MLX4_PROT_ETH, port_num);
+
+	if (dev) {
+		if (mlx4_is_bonded(ibdev->dev)) {
+			struct net_device *upper = NULL;
+
+			upper = netdev_master_upper_dev_get_rcu(dev);
+			if (upper) {
+				struct net_device *active;
+
+				active = bond_option_active_slave_get_rcu(netdev_priv(upper));
+				if (active)
+					dev = active;
+			}
+		}
+	}
+	if (dev)
+		dev_hold(dev);
+
+	rcu_read_unlock();
+	return dev;
+}
+
+static int mlx4_ib_update_gids(struct gid_entry *gids,
+			       struct mlx4_ib_dev *ibdev,
+			       u8 port_num)
+{
+	struct mlx4_cmd_mailbox *mailbox;
+	int err;
+	struct mlx4_dev *dev = ibdev->dev;
+	int i;
+	union ib_gid *gid_tbl;
+
+	mailbox = mlx4_alloc_cmd_mailbox(dev);
+	if (IS_ERR(mailbox))
+		return -ENOMEM;
+
+	gid_tbl = mailbox->buf;
+
+	for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i)
+		memcpy(&gid_tbl[i], &gids[i].gid, sizeof(union ib_gid));
+
+	err = mlx4_cmd(dev, mailbox->dma,
+		       MLX4_SET_PORT_GID_TABLE << 8 | port_num,
+		       1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
+		       MLX4_CMD_WRAPPED);
+	if (mlx4_is_bonded(dev))
+		err += mlx4_cmd(dev, mailbox->dma,
+				MLX4_SET_PORT_GID_TABLE << 8 | 2,
+				1, MLX4_CMD_SET_PORT, MLX4_CMD_TIME_CLASS_B,
+				MLX4_CMD_WRAPPED);
+
+	mlx4_free_cmd_mailbox(dev, mailbox);
+	return err;
+}
+
+static int mlx4_ib_add_gid(struct ib_device *device,
+			   u8 port_num,
+			   unsigned int index,
+			   const union ib_gid *gid,
+			   const struct ib_gid_attr *attr,
+			   void **context)
+{
+	struct mlx4_ib_dev *ibdev = to_mdev(device);
+	struct mlx4_ib_iboe *iboe = &ibdev->iboe;
+	struct mlx4_port_gid_table   *port_gid_table;
+	int free = -1, found = -1;
+	int ret = 0;
+	int hw_update = 0;
+	int i;
+	struct gid_entry *gids = NULL;
+
+	if (!rdma_cap_roce_gid_table(device, port_num))
+		return -EINVAL;
+
+	if (port_num > MLX4_MAX_PORTS)
+		return -EINVAL;
+
+	if (!context)
+		return -EINVAL;
+
+	port_gid_table = &iboe->gids[port_num - 1];
+	spin_lock_bh(&iboe->lock);
+	for (i = 0; i < MLX4_MAX_PORT_GIDS; ++i) {
+   if 

Re: [PATCH 12/22] IB/iser: Introduce iser_reg_ops

2015-07-30 Thread Steve Wise

On 7/30/2015 3:06 AM, Sagi Grimberg wrote:

Move all the per-device function pointers to an easily
extensible iser_reg_ops structure that contains all
the iser registration operations.

Signed-off-by: Sagi Grimberg sa...@mellanox.com
---
  drivers/infiniband/ulp/iser/iscsi_iser.h | 39 ++--
  drivers/infiniband/ulp/iser/iser_initiator.c | 16 ++--
  drivers/infiniband/ulp/iser/iser_memory.c| 35 +
  drivers/infiniband/ulp/iser/iser_verbs.c | 30 +
  4 files changed, 75 insertions(+), 45 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 70bf6e7..9ce090c 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -326,6 +326,25 @@ struct iser_comp {
  };
  
  /**
+ * struct iser_reg_ops - Memory registration operations
+ * per-device registration schemes
+ *
+ * @alloc_reg_res: Allocate registration resources
+ * @free_reg_res:  Free registration resources
+ * @reg_rdma_mem:  Register memory buffers
+ * @unreg_rdma_mem:Un-register memory buffers
+ */
+struct iser_reg_ops {
+	int  (*alloc_reg_res)(struct ib_conn *ib_conn,
+			      unsigned cmds_max);
+	void (*free_reg_res)(struct ib_conn *ib_conn);
+	int  (*reg_rdma_mem)(struct iscsi_iser_task *iser_task,
+			     enum iser_data_dir cmd_dir);
+	void (*unreg_rdma_mem)(struct iscsi_iser_task *iser_task,
+			       enum iser_data_dir cmd_dir);
+};
+
+/**
   * struct iser_device - iSER device handle
   *
   * @ib_device: RDMA device
@@ -338,11 +357,7 @@ struct iser_comp {
   * @comps_used:Number of completion contexts used, Min between online
   * cpus and device max completion vectors
   * @comps: Dinamically allocated array of completion handlers
- * Memory registration pool Function pointers (FMR or Fastreg):
- * @iser_alloc_rdma_reg_res: Allocation of memory regions pool
- * @iser_free_rdma_reg_res:  Free of memory regions pool
- * @iser_reg_rdma_mem:   Memory registration routine
- * @iser_unreg_rdma_mem: Memory deregistration routine
+ * @reg_ops:   Registration ops
   */
  struct iser_device {
struct ib_device *ib_device;
@@ -354,13 +369,7 @@ struct iser_device {
int  refcount;
int  comps_used;
struct iser_comp *comps;
-	int  (*iser_alloc_rdma_reg_res)(struct ib_conn *ib_conn,
-					unsigned cmds_max);
-	void (*iser_free_rdma_reg_res)(struct ib_conn *ib_conn);
-	int  (*iser_reg_rdma_mem)(struct iscsi_iser_task *iser_task,
-				  enum iser_data_dir cmd_dir);
-	void (*iser_unreg_rdma_mem)(struct iscsi_iser_task *iser_task,
-				    enum iser_data_dir cmd_dir);
+	struct iser_reg_ops  *reg_ops;
  };
  
  #define ISER_CHECK_GUARD	0xc0

@@ -563,6 +572,8 @@ extern int iser_debug_level;
  extern bool iser_pi_enable;
  extern int iser_pi_guard;
  
+int iser_assign_reg_ops(struct iser_device *device);

+
  int iser_send_control(struct iscsi_conn *conn,
  struct iscsi_task *task);
  
@@ -636,9 +647,9 @@ int  iser_initialize_task_headers(struct iscsi_task *task,

struct iser_tx_desc *tx_desc);
  int iser_alloc_rx_descriptors(struct iser_conn *iser_conn,
  struct iscsi_session *session);
-int iser_create_fmr_pool(struct ib_conn *ib_conn, unsigned cmds_max);
+int iser_alloc_fmr_pool(struct ib_conn *ib_conn, unsigned cmds_max);
  void iser_free_fmr_pool(struct ib_conn *ib_conn);
-int iser_create_fastreg_pool(struct ib_conn *ib_conn, unsigned cmds_max);
+int iser_alloc_fastreg_pool(struct ib_conn *ib_conn, unsigned cmds_max);
  void iser_free_fastreg_pool(struct ib_conn *ib_conn);
  u8 iser_check_task_pi_status(struct iscsi_iser_task *iser_task,
 enum iser_data_dir cmd_dir, sector_t *sector);
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index 42d6f42..88d8a89 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -73,7 +73,7 @@ static int iser_prepare_read_cmd(struct iscsi_task *task)
 		return err;
 	}
 
-	err = device->iser_reg_rdma_mem(iser_task, ISER_DIR_IN);
+	err = device->reg_ops->reg_rdma_mem(iser_task, ISER_DIR_IN);
 	if (err) {
 		iser_err("Failed to set up Data-IN RDMA\n");
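
For readers who only see the truncated diff, the ops-table pattern the patch describes can be sketched in plain C. This is an illustration under stated assumptions, not the iser code: `struct conn`, `struct device`, the `has_fmr`/`has_fastreg` flags, the stub callbacks and `assign_reg_ops()` are all hypothetical, and the order of preference between the two schemes is an assumption of the sketch.

```c
#include <stdbool.h>
#include <stddef.h>

struct conn; /* opaque stand-in for a connection object */

/* One table of callbacks per registration scheme. */
struct reg_ops {
	int  (*alloc_reg_res)(struct conn *c, unsigned cmds_max);
	void (*free_reg_res)(struct conn *c);
};

/* Stub implementations; a real driver would allocate pools here. */
int  fmr_alloc(struct conn *c, unsigned cmds_max) { (void)c; (void)cmds_max; return 0; }
void fmr_free(struct conn *c) { (void)c; }
int  fastreg_alloc(struct conn *c, unsigned cmds_max) { (void)c; (void)cmds_max; return 0; }
void fastreg_free(struct conn *c) { (void)c; }

const struct reg_ops fmr_ops     = { fmr_alloc, fmr_free };
const struct reg_ops fastreg_ops = { fastreg_alloc, fastreg_free };

/* Hypothetical device: capability flags plus the selected ops table. */
struct device {
	bool has_fmr;
	bool has_fastreg;
	const struct reg_ops *reg_ops;
};

/* Pick a table once at device setup; callers then go through dev->reg_ops. */
int assign_reg_ops(struct device *dev)
{
	if (dev->has_fmr)
		dev->reg_ops = &fmr_ops;
	else if (dev->has_fastreg)
		dev->reg_ops = &fastreg_ops;
	else
		return -1; /* no supported registration scheme */
	return 0;
}
```

After selection, a call site reads `dev->reg_ops->alloc_reg_res(conn, cmds_max)`, which mirrors the `device->reg_ops->reg_rdma_mem(...)` form in the hunk above. Adding a new scheme then means adding one more table rather than more pointers in the device struct.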
  

Re: [PATCH] IB/ipath: Move ipath driver to staging.

2015-07-30 Thread Doug Ledford
On 07/30/2015 09:25 AM, dennis.dalessan...@intel.com wrote:
 From: Dennis Dalessandro dennis.dalessan...@intel.com
 
 It is now time for the ipath driver to begin to be phased out of the kernel.
 This patch moves the ipath driver from the Infiniband sub tree to the staging
 area where it will remain until the code is removed from the kernel in a few
 releases.
 
 Reviewed-by: Mike Marciniszyn mike.marcinis...@intel.com
 Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com

Thanks, picked up for 4.3.


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD






Re: [PATCH v4 00/14] Demux IB CM requests in the rdma_cm module

2015-07-30 Thread Doug Ledford
On 07/30/2015 10:50 AM, Haggai Eran wrote:
 I'm sending the patchset again with the rwsem patch and rebased over Doug's
 to-be-rebased/for-4.3 tree.

Thanks for rebasing, I was able to apply them this time.

 
 Regards,
 Haggai
 
 Changes from v3:
 - rebase over github.com/dledford/linux to-be-rebased/for-4.3
 - add rwsem patch
 
 Changes from v2:
 - added missing reviewed-bys
 - Patch 5: remove service_mask as a parameter from ib_cm_insert_listen()
 - Patch 9:
   * move cma_req_info struct near other structs
   * put GID by value in the struct
 
 Changes from v1:
 - Patch 1: mark ib_client_data as going down instead of removing all client
   contexts during de-registration.
 - Patch 2:
   * move kdoc to the function definition
   * do not call get_net_dev_by_params() on devices/clients that are going
 down
   * pass client data directly to the callback
 - Patch 3:
   * pass client data directly to callback
   * fix a lockdep warning in ipoib_match_gid_pkey_addr()
   * remove a debugging print left over
   * set a rate limit to the duplicated IP address warning
 - Patch 5:
   * change atomic_dec(id-refcount) to cm_deref_id()
   * always update listen_sharecount under the cm.lock spinlock
 - Patch 6: handle AF_IB requests by getting parameters from the listener
 - Patch 8: new patch to expose BTH P_Key from ib_cm to rdma_cm
 - Patch 9:
   * get P_Key used for de-mux from the BTH
   * use -EAFNOSUPPORT in cma_save_ip_info to designate a possible AF_IB
 connection request
   * pass a NULL netdev for AF_IB requests
 - Patch 11: handle AF_IB connections by filling connection information from
   the listener id instead of from the net_dev
 - Patch 12: fix mention of the old ib_cm_id_create_and_listen function in
   the changelog entry.
 
 Changes from v0:
 - Added a patch to prevent a race between ib_unregister_device() and
   ib_get_net_dev_by_params().
 - Removed the patch that exported a UD GMP packet's GID from the GRH, and
   related code.
 - Patch 3:
   * Add _rcu suffix to ipoib_is_dev_match_addr().
   * Add helper function to get the master netdev for bonding support.
   * Scan for matching net devices in two phases: first without looking at
     the IP address, and then looking at the IP address only when the first
     phase did not find a unique net device.
 - Patch 5:
   * Do not init listen_sharecount = 1 for non-listening ib_cm_ids.
   * Remove code that sets a CM ID's state to IB_CM_IDLE right before
 destruction.
   * Rename ib_cm_id_create_and_listen() to ib_cm_insert_listen().
   * Do not increase reference counts when failing to add a shared CM ID due
 to having a different handler callback.
 - Patch 9: Clean IPv4 net_dev validation function.
 - Added patch 10: new patch to use the found net_dev in IB/cma for
   eliminating unneeded calls to cma_translate_addr.
 - Patch 12: Remove the lock argument to __ib_cm_listen().
 
 The rdma_cm module relies today on the ib_cm module to demux incoming
 requests based on their service ID and IP address. The ib_cm module is the
 wrong place to perform this task, as it can also be used with services that
 do not adhere to the RDMA IP CM service as defined in the IBA
 specifications. It is forced to use an opaque private data struct and mask
 to compare incoming requests against.
 
 This series moves that demux task responsibility to the rdma_cm module. The
 rdma_cm module can look into the private data attached to a CM request,
 containing the IP addresses related to the request. It uses the details of
 the request to find the net device associated with the request, and use
 that net device to find the correct listening rdma_cm_id.
 
 The series applies against Doug's for-v4.2 tree with the patch adding a
 rwsem to IB core [2] applied.
 
 The series is structured as follows:
 Patch 1 prevents a possible race between ib_client.remove() callbacks from
 ib_unregister_device(), and ib_client callbacks that rely on the
 lists_rwsem locked for read, such as ib_get_net_dev_by_params(). Both
 callbacks may call ib_get_client_data(), and the patch makes sure that the
 remove callback doesn't free the client data while it is being used by the
 other callback.
 
 Patches 2-3 add the ability to look up a network device according to the IB
 device, port, P_Key, GID and IP address. They find the matching IPoIB
 interfaces, and return a matching net_device if one exists.
 
 Patches 4-5 make necessary changes in ib_cm to allow RDMA CM to get the
 information it needs out of CM and SIDR requests, and share a single
 ib_cm_id with multiple RDMA CM listeners.
 
 Patches 6-7 do some preliminary refactoring to the rdma_cm module. They
 allow extracting information out of incoming requests instead of retrieving
 them from a listening CM ID, and add helper functions to access the port
 space IDRs.
 
 Finally, patches 8-12 change rdma_cm to demultiplex requests on its own, and
 patch 13 cleans up the now unneeded code in ib_cm to compare against the
 private data.
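
The two-phase net device scan mentioned in the v0 changelog (match without the IP address first, and consult the IP address only when that does not give a unique device) can be sketched as below. This is a simplified user-space illustration, not the series' code: `netdev_entry`, `lookup_key` and `find_netdev()` are hypothetical, the GID is left out, and the address is reduced to a single IPv4 value.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical, simplified view of an IPoIB interface. */
struct netdev_entry {
	int      port;
	uint16_t pkey;
	uint32_t ipv4;
};

struct lookup_key {
	int      port;
	uint16_t pkey;
	uint32_t ipv4;
};

const struct netdev_entry *find_netdev(const struct netdev_entry *devs,
				       size_t n,
				       const struct lookup_key *key)
{
	const struct netdev_entry *match = NULL;
	size_t count = 0;

	/* Phase 1: match on port and P_Key only, ignoring the IP address. */
	for (size_t i = 0; i < n; ++i) {
		if (devs[i].port == key->port && devs[i].pkey == key->pkey) {
			match = &devs[i];
			++count;
		}
	}
	if (count <= 1)
		return match; /* unique device, or NULL when nothing matched */

	/* Phase 2: several candidates, so the IP address breaks the tie. */
	for (size_t i = 0; i < n; ++i) {
		if (devs[i].port == key->port && devs[i].pkey == key->pkey &&
		    devs[i].ipv4 == key->ipv4)
			return &devs[i];
	}
	return NULL;
}
```

In this sketch a unique phase-1 match is returned even if its IP address differs from the key; whether the real code validates the address in that case is not stated in the cover letter.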
 
 

Re: [PATCH] IB/ipoib: CSUM support in connected mode

2015-07-30 Thread Jason Gunthorpe
On Thu, Jul 30, 2015 at 11:46:36PM +0300, Yuval Shaia wrote:
 On Thu, Jul 30, 2015 at 11:15:38AM -0600, Jason Gunthorpe wrote:
  On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote:
  
   In its current state, I have my doubts about this patch.  However, it
   seems to me that this should be relatively easy to fix in such a way
   that you get 90%+ of the performance benefit, and can turn it on by
   default, and we don't cause any problems.
  
  The best way to implement this is to leverage all the checksum
  offload work people did for virtualization.
  
  Forward the checksum offload status through the RC connection and
  recover it on the other side.
 The current approach is to utilize IPoIB's private-data to exchange this
 information.

You need private-data exchange to negotiate the feature.

The feature should be a per-packet csum status header.

When sending a skb that is already fully csumed the receiver sets
CHECKSUM_UNNECESSARY.

When sending a skb that has CHECKSUM_PARTIAL then the
receiver needs to call skb_partial_csum_set.

Look at how something like VIRTIO_NET_HDR_F_NEEDS_CSUM works and copy
that scheme.

DO NOT EVER set CHECKSUM_UNNECESSARY on packets that do not have valid
csums - that breaks the net stack.

Yes, you need to add a header to all packets to support this scheme,
that is what the private-data negotiation is for.

While you are at it, I'd make room for something like
VIRTIO_NET_HDR_GSO_* in the RC protocol too. Implementing GSO
forwarding is probably another big performance win.

Jason
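
A rough sketch of the per-packet csum status header scheme described above, loosely modeled on the virtio-net header. Everything here is an assumption for illustration: `csum_hdr`, the `CSUM_HDR_F_*` flags, the `rx_csum_action` values and `classify_rx_csum()` are hypothetical names, and the field layout is not a proposed wire format. In a real receive path the three outcomes would correspond to leaving the skb unchecked, setting CHECKSUM_UNNECESSARY, and calling skb_partial_csum_set().

```c
#include <stddef.h>
#include <stdint.h>

#define CSUM_HDR_F_NEEDS_CSUM 0x1 /* sender left a partial checksum */
#define CSUM_HDR_F_DATA_VALID 0x2 /* sender's packet carries a valid checksum */

/* Hypothetical per-packet header, agreed on via private-data negotiation. */
struct csum_hdr {
	uint8_t  flags;
	uint16_t csum_start;  /* offset where checksumming starts */
	uint16_t csum_offset; /* offset of the checksum field from csum_start */
};

enum rx_csum_action {
	RX_CSUM_NONE,        /* let the stack verify the checksum itself */
	RX_CSUM_UNNECESSARY, /* checksum already known to be valid */
	RX_CSUM_PARTIAL,     /* checksum still has to be completed */
	RX_DROP,             /* header is inconsistent with the packet */
};

enum rx_csum_action classify_rx_csum(const struct csum_hdr *h,
				     size_t payload_len)
{
	if (h->flags & CSUM_HDR_F_NEEDS_CSUM) {
		/* The 16-bit checksum field must lie inside the payload. */
		size_t end = (size_t)h->csum_start + h->csum_offset + 2;

		return end <= payload_len ? RX_CSUM_PARTIAL : RX_DROP;
	}
	if (h->flags & CSUM_HDR_F_DATA_VALID)
		return RX_CSUM_UNNECESSARY;
	/* Never claim a valid checksum the sender did not vouch for. */
	return RX_CSUM_NONE;
}
```

The default branch is the important one: without an explicit flag from the sender the receiver falls back to normal verification instead of marking the packet as checked.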


[PATCH v4 37/50] IB/hfi1: add sdma routines

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/sdma.c | 2962 +
 1 file changed, 2962 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/sdma.c

diff --git a/drivers/infiniband/hw/hfi1/sdma.c b/drivers/infiniband/hw/hfi1/sdma.c
new file mode 100644
index 000..37bd767
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/sdma.c
@@ -0,0 +1,2962 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <linux/spinlock.h>
+#include <linux/seqlock.h>
+#include <linux/netdevice.h>
+#include <linux/moduleparam.h>
+#include <linux/bitops.h>
+#include <linux/timer.h>
+#include <linux/vmalloc.h>
+
+#include "hfi.h"
+#include "common.h"
+#include "qp.h"
+#include "sdma.h"
+#include "iowait.h"
+#include "trace.h"
+
+/* must be a power of 2 >= 64 <= 32768 */
+#define SDMA_DESCQ_CNT 1024
+#define INVALID_TAIL 0x
+
+static uint sdma_descq_cnt = SDMA_DESCQ_CNT;
+module_param(sdma_descq_cnt, uint, S_IRUGO);
+MODULE_PARM_DESC(sdma_descq_cnt, "Number of SDMA descq entries");
+
+static uint sdma_idle_cnt = 250;
+module_param(sdma_idle_cnt, uint, S_IRUGO);
+MODULE_PARM_DESC(sdma_idle_cnt, "sdma interrupt idle delay (ns,default 250)");
+
+uint mod_num_sdma;
+module_param_named(num_sdma, mod_num_sdma, uint, S_IRUGO);
+MODULE_PARM_DESC(num_sdma, "Set max number SDMA engines to use");
+
+#define SDMA_WAIT_BATCH_SIZE 20
+/* max wait time for a SDMA engine to indicate it has halted */
+#define SDMA_ERR_HALT_TIMEOUT 10 /* ms */
+/* all SDMA engine errors that cause a halt */
+
+#define SD(name) SEND_DMA_##name
+#define 

RE: [PATCH v4 01/50] IB: Add CNP opcode enumeration.

2015-07-30 Thread Marciniszyn, Mike
   That is obvious and useless. Patches should have a meaningful
   description and justify the changes.
  
 
  The driver uses the CNP opcode for congestion control.
 
 And that requires a new transport protocol???
 

The opcode is 0x80, which appears in the protocol part of the 8 bit opcode.  
That is what is specified in A3.10.2 of the 1.3 spec.

That also happens to land in the upper bits of the opcode. 

Would this fit better with a  IB_OPCODE_CNP_TRANS (0x80) with a single opcode 
of IB_OPCODE_CNP_OP 0x00, combined with the IB_OPCODE macro to produce 
IB_OPCODE_CNP?

Mike
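
To make the suggestion concrete, here is a small sketch of how an 8-bit opcode splits into a transport class (upper 3 bits) and an operation (lower 5 bits), and how a CNP opcode would be composed from the two proposed constants. The `IB_OPCODE_CNP_TRANS` and `IB_OPCODE_CNP_OP` values come from the mail above; the mask names and the two helper functions are hypothetical and exist only for this illustration.

```c
#include <stdint.h>

/* Transport class lives in the upper 3 bits, the operation in the lower 5. */
#define OPCODE_TRANSPORT_MASK 0xe0
#define OPCODE_OP_MASK        0x1f

enum {
	IB_OPCODE_CNP_TRANS = 0x80, /* proposed transport-class value */
	IB_OPCODE_CNP_OP    = 0x00, /* the single operation in that class */
	IB_OPCODE_CNP       = IB_OPCODE_CNP_TRANS + IB_OPCODE_CNP_OP,
};

/* Return the transport-class bits of an opcode. */
uint8_t opcode_transport(uint8_t opcode)
{
	return opcode & OPCODE_TRANSPORT_MASK;
}

/* Return the operation bits of an opcode. */
uint8_t opcode_operation(uint8_t opcode)
{
	return opcode & OPCODE_OP_MASK;
}
```

With this split, 0x80 decodes to transport class 0x80 and operation 0x00, so the combined constant has the same numeric value as a standalone CNP opcode would.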


[PATCH v4 41/50] IB/hfi1: add tracepoint debug routines

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/trace.c |  211 +
 drivers/infiniband/hw/hfi1/trace.h | 1421 
 2 files changed, 1632 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/trace.c
 create mode 100644 drivers/infiniband/hw/hfi1/trace.h

diff --git a/drivers/infiniband/hw/hfi1/trace.c b/drivers/infiniband/hw/hfi1/trace.c
new file mode 100644
index 000..afbb212
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/trace.c
@@ -0,0 +1,211 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#define CREATE_TRACE_POINTS
+#define HFI1_TRACE_DO_NOT_CREATE_INLINES
+#include "trace.h"
+
+u8 ibhdr_exhdr_len(struct hfi1_ib_header *hdr)
+{
+	struct hfi1_other_headers *ohdr;
+	u8 opcode;
+	u8 lnh = (u8)(be16_to_cpu(hdr->lrh[0]) & 3);
+
+	if (lnh == HFI1_LRH_BTH)
+		ohdr = &hdr->u.oth;
+	else
+		ohdr = &hdr->u.l.oth;
+	opcode = be32_to_cpu(ohdr->bth[0]) >> 24;
+	return hdr_len_by_opcode[opcode] == 0 ?
+	       0 : hdr_len_by_opcode[opcode] - (12 + 8);
+}
+
+#define IMM_PRN  "imm %d"
+#define RETH_PRN "reth vaddr 0x%.16llx rkey 0x%.8x dlen 0x%.8x"
+#define AETH_PRN "aeth syn 0x%.2x msn 0x%.8x"
+#define DETH_PRN "deth qkey 0x%.8x sqpn 0x%.6x"
+#define ATOMICACKETH_PRN "origdata %lld"
+#define ATOMICETH_PRN "vaddr 0x%llx rkey 0x%.8x sdata %lld cdata %lld"
+
+#define OP(transport, op) IB_OPCODE_## transport ## _ ## op
+
+static u64 ib_u64_get(__be32 *p)
+{
+	return ((u64)be32_to_cpu(p[0]) << 32) | be32_to_cpu(p[1]);
+}
+
+const 

[PATCH v4 48/50] IB/hfi1: add multicast routines

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/verbs.c   |1 
 drivers/infiniband/hw/hfi1/verbs_mcast.c |  385 ++
 2 files changed, 386 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/verbs_mcast.c

diff --git a/drivers/infiniband/hw/hfi1/verbs.c b/drivers/infiniband/hw/hfi1/verbs.c
index 230c10f..c60e28b 100644
--- a/drivers/infiniband/hw/hfi1/verbs.c
+++ b/drivers/infiniband/hw/hfi1/verbs.c
@@ -56,6 +56,7 @@
 #include <linux/rculist.h>
 #include <linux/mm.h>
 #include <linux/random.h>
+#include <linux/vmalloc.h>
 
 #include "hfi.h"
 #include "common.h"
diff --git a/drivers/infiniband/hw/hfi1/verbs_mcast.c b/drivers/infiniband/hw/hfi1/verbs_mcast.c
new file mode 100644
index 000..afc6b4c
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/verbs_mcast.c
@@ -0,0 +1,385 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <linux/rculist.h>
+
+#include "hfi.h"
+
+/**
+ * mcast_qp_alloc - alloc a struct to link a QP to mcast GID struct
+ * @qp: the QP to link
+ */
+static struct hfi1_mcast_qp *mcast_qp_alloc(struct hfi1_qp *qp)
+{
+	struct hfi1_mcast_qp *mqp;
+
+	mqp = kmalloc(sizeof(*mqp), GFP_KERNEL);
+	if (!mqp)
+		goto bail;
+
+	mqp->qp = qp;
+	atomic_inc(&qp->refcount);
+
+bail:
+	return mqp;
+}
+
+static void mcast_qp_free(struct hfi1_mcast_qp *mqp)
+{
+	struct hfi1_qp *qp = mqp->qp;
+
+	/* Notify hfi1_destroy_qp() if it is waiting. */
+	if (atomic_dec_and_test(&qp->refcount))
+		wake_up(&qp->wait);
+
+

[PATCH v4 47/50] IB/hfi1: add general verbs handling

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/verbs.c | 2143 
 drivers/infiniband/hw/hfi1/verbs.h | 1149 +++
 2 files changed, 3292 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/verbs.c
 create mode 100644 drivers/infiniband/hw/hfi1/verbs.h

diff --git a/drivers/infiniband/hw/hfi1/verbs.c b/drivers/infiniband/hw/hfi1/verbs.c
new file mode 100644
index 000..230c10f
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/verbs.c
@@ -0,0 +1,2143 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <rdma/ib_mad.h>
+#include <rdma/ib_user_verbs.h>
+#include <linux/io.h>
+#include <linux/module.h>
+#include <linux/utsname.h>
+#include <linux/rculist.h>
+#include <linux/mm.h>
+#include <linux/random.h>
+
+#include "hfi.h"
+#include "common.h"
+#include "device.h"
+#include "trace.h"
+#include "qp.h"
+#include "sdma.h"
+
+unsigned int hfi1_lkey_table_size = 16;
+module_param_named(lkey_table_size, hfi1_lkey_table_size, uint,
+                   S_IRUGO);
+MODULE_PARM_DESC(lkey_table_size,
+                 "LKEY table size in bits (2^n, 1 <= n <= 23)");
+
+static unsigned int hfi1_max_pds = 0x;
+module_param_named(max_pds, hfi1_max_pds, uint, S_IRUGO);
+MODULE_PARM_DESC(max_pds,
+                 "Maximum number of protection domains to support");
+
+static unsigned int hfi1_max_ahs = 0x;
+module_param_named(max_ahs, hfi1_max_ahs, uint, S_IRUGO);
+MODULE_PARM_DESC(max_ahs, "Maximum number of address handles to support");
+
+unsigned int hfi1_max_cqes = 0x2;
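The `lkey_table_size` parameter above is expressed in bits, so the table holds 2^n entries. A small sketch of how such a value could be validated and converted is shown here; it reads the parameter description as an inclusive range of 1 to 23 bits, and the helper name `demo_lkey_table_entries` and the bound macros are hypothetical, not taken from the patch.

```c
#include <stdint.h>

/* Bounds as read from the parameter description (assumed inclusive). */
#define DEMO_LKEY_BITS_MIN 1u
#define DEMO_LKEY_BITS_MAX 23u

/*
 * Hypothetical helper: turn a size in bits into an entry count.
 * Returns 0 on success, -1 if bits is outside the assumed range.
 */
static int demo_lkey_table_entries(unsigned int bits, uint32_t *entries)
{
	if (bits < DEMO_LKEY_BITS_MIN || bits > DEMO_LKEY_BITS_MAX)
		return -1;

	*entries = UINT32_C(1) << bits;
	return 0;
}
```

With the default of 16 quoted in the patch, this yields 65536 entries.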

[PATCH v4 40/50] IB/hfi1: add sysfs routines and documentation

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 Documentation/infiniband/sysfs.txt |   22 +
 drivers/infiniband/hw/hfi1/sysfs.c |  761 
 2 files changed, 783 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/sysfs.c

diff --git a/Documentation/infiniband/sysfs.txt b/Documentation/infiniband/sysfs.txt
index ddd519b..23d1b6d 100644
--- a/Documentation/infiniband/sysfs.txt
+++ b/Documentation/infiniband/sysfs.txt
@@ -64,3 +64,25 @@ MTHCA
 fw_ver   - Firmware version
 hca_type - HCA type: MT23108, MT25208 (MT23108 compat mode),
or MT25208
+
+HFI1
+
+  The hfi1 driver also creates these additional files:
+
+   hw_rev - hardware revision
+   board_id - manufacturing board id
+   version - driver version
+   tempsense - thermal sense information
+   serial - board serial number
+   nfreectxts - number of free user contexts
+   nctxts - number of allowed contexts (PSM2)
+   localbus_info - PCIe info
+   chip_reset - diagnostic (root only)
+   boardversion - board version
+   ports/1/
+  CMgtA/
+   cc_settings_bin - CCA tables used by PSM2
+   cc_table_bin
+  sc2v/ - 32 files (0 - 31) used to translate sl->vl
+  sl2sc/ - 32 files (0 - 31) used to translate sl->sc
+  vl2mtu/ - 16 (0 - 15) files used to determine MTU for vl
diff --git a/drivers/infiniband/hw/hfi1/sysfs.c b/drivers/infiniband/hw/hfi1/sysfs.c
new file mode 100644
index 000..b10e857
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/sysfs.c
@@ -0,0 +1,761 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE 

[PATCH v4 44/50] IB/hfi1: add UD QP handling

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/ud.c |  885 +++
 1 file changed, 885 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/ud.c

diff --git a/drivers/infiniband/hw/hfi1/ud.c b/drivers/infiniband/hw/hfi1/ud.c
new file mode 100644
index 000..d40d1a1
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/ud.c
@@ -0,0 +1,885 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <linux/net.h>
+#include <rdma/ib_smi.h>
+
+#include "hfi.h"
+#include "mad.h"
+#include "qp.h"
+
+/**
+ * ud_loopback - handle send on loopback QPs
+ * @sqp: the sending QP
+ * @swqe: the send work request
+ *
+ * This is called from hfi1_make_ud_req() to forward a WQE addressed
+ * to the same HFI.
+ * Note that the receive interrupt handler may be calling hfi1_ud_rcv()
+ * while this is being called.
+ */
+static void ud_loopback(struct hfi1_qp *sqp, struct hfi1_swqe *swqe)
+{
+   struct hfi1_ibport *ibp = to_iport(sqp->ibqp.device, sqp->port_num);
+   struct hfi1_pportdata *ppd;
+   struct hfi1_qp *qp;
+   struct ib_ah_attr *ah_attr;
+   unsigned long flags;
+   struct hfi1_sge_state ssge;
+   struct hfi1_sge *sge;
+   struct ib_wc wc;
+   u32 length;
+   enum ib_qp_type sqptype, dqptype;
+
+   rcu_read_lock();
+
+   qp = hfi1_lookup_qpn(ibp, swqe->wr.wr.ud.remote_qpn);
+   if (!qp) {
+      ibp->n_pkt_drops++;
+      rcu_read_unlock();
+      return;
+   }
+
+   sqptype = sqp->ibqp.qp_type == IB_QPT_GSI ?
+ 

[PATCH v4 36/50] IB/hfi1: add common routines for RC/UC

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/ruc.c |  948 ++
 1 file changed, 948 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/ruc.c

diff --git a/drivers/infiniband/hw/hfi1/ruc.c b/drivers/infiniband/hw/hfi1/ruc.c
new file mode 100644
index 000..a411528
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/ruc.c
@@ -0,0 +1,948 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <linux/spinlock.h>
+
+#include "hfi.h"
+#include "mad.h"
+#include "qp.h"
+#include "sdma.h"
+
+/*
+ * Convert the AETH RNR timeout code into the number of microseconds.
+ */
+const u32 ib_hfi1_rnr_table[32] = {
+   655360, /* 00: 655.36 */
+   10,     /* 01:    .01 */
+   20,     /* 02     .02 */
+   30,     /* 03:    .03 */
+   40,     /* 04:    .04 */
+   60,     /* 05:    .06 */
+   80,     /* 06:    .08 */
+   120,    /* 07:    .12 */
+   160,    /* 08:    .16 */
+   240,    /* 09:    .24 */
+   320,    /* 0A:    .32 */
+   480,    /* 0B:    .48 */
+   640,    /* 0C:    .64 */
+   960,    /* 0D:    .96 */
+   1280,   /* 0E:   1.28 */
+   1920,   /* 0F:   1.92 */
+   2560,   /* 10:   2.56 */
+   3840,   /* 11:   3.84 */
+   5120,   /* 12:   5.12 */
+   7680,   /* 13:   7.68 */
+   10240,  /* 14:  10.24 */
+   15360,  /* 15:  15.36 */
+   20480,  /* 16:  20.48 */
+   30720,  /* 17:  30.72 */
+   40960,  /* 18:  40.96 */
+   61440,  /* 19:  61.44 */
+   81920,  /* 1A:  81.92 */
+   
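The table above maps an RNR timeout code to a delay in microseconds (the comments give the same values in milliseconds). A minimal lookup sketch follows. It copies only the entries visible in this excerpt (codes 0x00 through 0x1A); the excerpt is truncated after that, so the remaining codes are deliberately not filled in. The names `demo_rnr_usecs` and `demo_rnr_code_to_usecs` are hypothetical, and returning 0 for an uncovered code is a choice made for this sketch only.

```c
#include <stdint.h>
#include <stddef.h>

/*
 * Entries 0x00..0x1A as quoted in the patch excerpt, in microseconds.
 * Codes above 0x1A are not shown in the excerpt and are left out here.
 */
static const uint32_t demo_rnr_usecs[] = {
	655360, 10, 20, 30, 40, 60, 80, 120,
	160, 240, 320, 480, 640, 960, 1280, 1920,
	2560, 3840, 5120, 7680, 10240, 15360, 20480, 30720,
	40960, 61440, 81920,
};

#define DEMO_RNR_KNOWN (sizeof(demo_rnr_usecs) / sizeof(demo_rnr_usecs[0]))

/*
 * Hypothetical helper: map a 5-bit RNR timeout code to microseconds.
 * Returns 0 for codes this sketch does not cover.
 */
static uint32_t demo_rnr_code_to_usecs(uint8_t code)
{
	code &= 0x1f;	/* the table has 32 slots, so only 5 bits are used */
	if (code >= DEMO_RNR_KNOWN)
		return 0;
	return demo_rnr_usecs[code];
}
```

Note that code 0x00 is the longest delay (655.36 ms), not the shortest, which is why a table is used instead of a formula.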

Re: [PATCH v4 17/50] IB/hfi1: add PSM driver control/data path

2015-07-30 Thread Jason Gunthorpe
On Thu, Jul 30, 2015 at 03:18:59PM -0400, Mike Marciniszyn wrote:
> +static ssize_t hfi1_write_iter(struct kiocb *kiocb, struct iov_iter *from)
> +{
> +   struct hfi1_user_sdma_pkt_q *pq;
> +   struct hfi1_user_sdma_comp_q *cq;
> +   int ret = 0, done = 0, reqs = 0;
> +   unsigned long dim = from->nr_segs;

I thought you were getting rid of this?

Jason


Re: [PATCH] IB/ipoib: CSUM support in connected mode

2015-07-30 Thread Yuval Shaia
On Thu, Jul 30, 2015 at 11:15:38AM -0600, Jason Gunthorpe wrote:
> On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote:
>
> > In its current state, I have my doubts about this patch.  However, it
> > seems to me that this should be relatively easy to fix in such a way
> > that you get 90%+ of the performance benefit, and can turn it on by
> > default, and we don't cause any problems.
>
> The best way to implement this is to leverage all the checksum
> offload work people did for virtualization.
>
> Forward the checksum offload status through the RC connection and
> recover it on the other side.
The current approach is to utilize IPoIB's private-data to exchange this
information.
>
> Then the far side stack will know it is dealing with a partial
> checksum packet and will properly regenerate the checksum if it
> re-transmits.
>
> ie doing it this way doesn't totally break the netstack :)
>
> Jason


RE: [PATCH v4 01/50] IB: Add CNP opcode enumeration.

2015-07-30 Thread Marciniszyn, Mike
> The opcode is 0x80, which appears in the protocol part of the 8 bit opcode.
> That is what is specified in A3.10.2 of the 1.3 spec.
>

Correction: A10.3.2.

Mike
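To illustrate the statement that 0x80 sits in the protocol part of the 8-bit opcode, here is a small sketch that splits an opcode byte into its two fields. It assumes the usual InfiniBand layout in which the upper 3 bits select the transport class and the lower 5 bits select the operation; the macro and function names are hypothetical and are not the kernel's.

```c
#include <stdint.h>

/* Hypothetical name; 0x80 is the CNP value quoted in the thread. */
#define DEMO_OPCODE_CNP 0x80u

/* Upper 3 bits of the 8-bit opcode: the transport/protocol class. */
static unsigned int demo_opcode_class(uint8_t opcode)
{
	return opcode >> 5;
}

/* Lower 5 bits of the 8-bit opcode: the operation within that class. */
static unsigned int demo_opcode_op(uint8_t opcode)
{
	return opcode & 0x1fu;
}
```

Under that assumption, 0x80 decodes to class 4 with operation 0, so the value identifies a class of its own and leaves the operation bits clear.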


Re: [PATCH] IB/ipoib: CSUM support in connected mode

2015-07-30 Thread Yuval Shaia
On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote:
> On 07/30/2015 04:46 AM, Yuval Shaia wrote:
> >   struct ipoib_cm_data {
> >      __be32 qpn; /* High byte MUST be ignored on receive */
> >      __be32 mtu;
> > +    __be16 sig; /* must be IPOIB_CM_PROTO_SIG */
> > +    __be16 caps; /* 4 bits proto ver and 12 bits capabilities */
> >   };
>
> This patch modifies the private login data format that has been
> standardized by the IETF in RFC 4755. Has this modification already
> been discussed with the IETF?
>
> See also https://tools.ietf.org/html/rfc4755#section-6.
Yes.
I first want to check how the Linux community reacts to this proposal.

Please note that although the standard specifies 64 bits of data, the actual
data the driver reads/writes can be up to 196 bytes.
>
> Bart.
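The proposed `caps` field packs a 4-bit protocol version and 12 capability bits into one 16-bit big-endian value. The quoted patch does not say which end of the field carries the version, so the sketch below assumes the version occupies the top 4 bits; that layout and all `demo_` names are assumptions made for illustration only.

```c
#include <stdint.h>

/* Assumed layout (not stated in the patch): version in bits 15..12. */
static uint16_t demo_caps_pack(unsigned int ver, unsigned int caps)
{
	return (uint16_t)(((ver & 0xfu) << 12) | (caps & 0xfffu));
}

/* Extract the 4-bit protocol version under the assumed layout. */
static unsigned int demo_caps_ver(uint16_t v)
{
	return v >> 12;
}

/* Extract the 12 capability bits under the assumed layout. */
static unsigned int demo_caps_bits(uint16_t v)
{
	return v & 0xfffu;
}

/* Store a 16-bit value in big-endian (wire) order, as __be16 implies. */
static void demo_put_be16(uint8_t out[2], uint16_t v)
{
	out[0] = (uint8_t)(v >> 8);
	out[1] = (uint8_t)(v & 0xffu);
}
```

Writing the bytes explicitly keeps the sketch independent of host endianness; kernel code would normally use `cpu_to_be16()` for the same purpose.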


Re: [PATCH v4 01/50] IB: Add CNP opcode enumeration.

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Mike Marciniszyn wrote:

> This patch adds the value of the CNP opcode to the existing list of enumerated
> opcodes.

That is obvious and useless. Patches should have a meaningful
description and justify the changes.

Why do you add the CNP opcode and what in the world does it do? CNP is
what? And why do the other enum values not work for you?



RE: [PATCH v4 01/50] IB: Add CNP opcode enumeration.

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Marciniszyn, Mike wrote:

> > That is obvious and useless. Patches should have a meaningful description
> > and justify the changes.
> >
>
> The driver uses the CNP opcode for congestion control.

And that requires a new transport protocol???

> > Why do you add the CNP opcode and what in the world does it do? CNP is
> > what? And why do the other enum values not work for you?
>
> The driver supports congestion control in software vs. outboard
> firmware, so the opcode should be available in the appropriate kernel
> include file.

So is CNP an operation or a protocol?



RE: [PATCH v4 17/50] IB/hfi1: add PSM driver control/data path

2015-07-30 Thread Marciniszyn, Mike
 
> I thought you were getting rid of this?
>
> Jason

Doug wanted the v4 submitted as we currently have it.

Doug?

Mike


RE: [PATCH v4 01/50] IB: Add CNP opcode enumeration.

2015-07-30 Thread Marciniszyn, Mike
> That is obvious and useless. Patches should have a meaningful description
> and justify the changes.
>

The driver uses the CNP opcode for congestion control.

> Why do you add the CNP opcode and what in the world does it do? CNP is
> what? And why do the other enum values not work for you?

The driver supports congestion control in software vs. outboard firmware, so 
the opcode should be available in the appropriate kernel include file.

Mike


Re: [PATCH v4 40/50] IB/hfi1: add sysfs routines and documentation

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Mike Marciniszyn wrote:

> +HFI1
> +
> +  The hfi1 driver also creates these additional files:
> +
> +   hw_rev - hardware revision
> +   board_id - manufacturing board id
> +   version - driver version
> +   tempsense - thermal sense information
> +   serial - board serial number
> +   nfreectxts - number of free user contexts
> +   nctxts - number of allowed contexts (PSM2)
> +   localbus_info - PCIe info
> +   chip_reset - diagnostic (root only)
> +   boardversion - board version

Aren't these already provided by the PCIe driver framework? Tools will not
work if you do not put the information out there in a way that they can be
scanned.

For example, the following output of lspci -vv shows a revision, and the
board_id is also usually available. The kernel driver version is also there
via the driver/module directory, and so on. Please integrate properly into
the kernel device driver infrastructure and do not create useless new entries.

lspci -vv

00:19.0 Ethernet controller: Intel Corporation Ethernet Connection I217-LM (rev 04)
    Subsystem: Fujitsu Technology Solutions Device 11ed
    Control: I/O+ Mem+ BusMaster+ SpecCycle- MemWINV- VGASnoop- ParErr- Stepping- SERR- FastB2B- DisINTx+
    Status: Cap+ 66MHz- UDF- FastB2B- ParErr- DEVSEL=fast >TAbort- <TAbort- <MAbort- >SERR- <PERR- INTx-
    Latency: 0
    Interrupt: pin A routed to IRQ 28
    Region 0: Memory at f7c0 (32-bit, non-prefetchable) [size=128K]
    Region 1: Memory at f7c3d000 (32-bit, non-prefetchable) [size=4K]
    Region 2: I/O ports at f080 [size=32]
    Capabilities: [c8] Power Management version 2
        Flags: PMEClk- DSI+ D1- D2- AuxCurrent=0mA PME(D0+,D1-,D2-,D3hot+,D3cold+)
        Status: D0 NoSoftRst- PME-Enable- DSel=0 DScale=1 PME-
    Capabilities: [d0] MSI: Enable+ Count=1/1 Maskable- 64bit+
        Address: fee00378  Data: 
    Capabilities: [e0] PCI Advanced Features
        AFCap: TP+ FLR+
        AFCtrl: FLR-
        AFStatus: TP-
    Kernel driver in use: e1000e

ls -l /sys/devices/pci\:00/:00:19.0/driver/module/
total 0
-r--r--r-- 1 root root 4096 Jul 30 14:59 coresize
drwxr-xr-x 2 root root    0 Jul 30 15:47 drivers
drwxr-xr-x 2 root root    0 Jul 30 14:59 holders
-r--r--r-- 1 root root 4096 Jul 30 15:47 initsize
-r--r--r-- 1 root root 4096 Jul 30 14:59 initstate
drwxr-xr-x 2 root root    0 Jul 30 15:47 notes
drwxr-xr-x 2 root root    0 Jul 30 15:47 parameters
-r--r--r-- 1 root root 4096 Jul 30 14:59 refcnt
drwxr-xr-x 2 root root    0 Jul 30 15:47 sections
-r--r--r-- 1 root root 4096 Jul 30 15:47 srcversion
-r--r--r-- 1 root root 4096 Jul 30 15:47 taint
--w------- 1 root root 4096 Jul 30 14:59 uevent
-r--r--r-- 1 root root 4096 Jul 30 15:49 version




[PATCH v4 46/50] IB/hfi1: add PSM sdma hooks

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/user_sdma.c | 1444 
 drivers/infiniband/hw/hfi1/user_sdma.h |   89 ++
 2 files changed, 1533 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/user_sdma.c
 create mode 100644 drivers/infiniband/hw/hfi1/user_sdma.h

diff --git a/drivers/infiniband/hw/hfi1/user_sdma.c b/drivers/infiniband/hw/hfi1/user_sdma.c
new file mode 100644
index 000..5552661
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/user_sdma.c
@@ -0,0 +1,1444 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#include <linux/mm.h>
+#include <linux/types.h>
+#include <linux/device.h>
+#include <linux/dmapool.h>
+#include <linux/slab.h>
+#include <linux/list.h>
+#include <linux/highmem.h>
+#include <linux/io.h>
+#include <linux/uio.h>
+#include <linux/rbtree.h>
+#include <linux/spinlock.h>
+#include <linux/delay.h>
+#include <linux/kthread.h>
+#include <linux/mmu_context.h>
+#include <linux/module.h>
+#include <linux/vmalloc.h>
+
+#include "hfi.h"
+#include "sdma.h"
+#include "user_sdma.h"
+#include "sdma.h"
+#include "verbs.h"  /* for the headers */
+#include "common.h" /* for struct hfi1_tid_info */
+#include "trace.h"
+
+static uint hfi1_sdma_comp_ring_size = 128;
+module_param_named(sdma_comp_size, hfi1_sdma_comp_ring_size, uint, S_IRUGO);
+MODULE_PARM_DESC(sdma_comp_size, Size of User SDMA completion ring. Default: 
128);
+
+/* The maximum number of Data io vectors per message/request */
+#define MAX_VECTORS_PER_REQ 8
+/*
+ * Maximum number of packet to send from each message/request
+ * before moving to the 

[PATCH v4 45/50] IB/hfi1: add low level page locking

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/user_pages.c |  156 +++
 1 file changed, 156 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/user_pages.c

diff --git a/drivers/infiniband/hw/hfi1/user_pages.c b/drivers/infiniband/hw/hfi1/user_pages.c
new file mode 100644
index 000..9071afb
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/user_pages.c
@@ -0,0 +1,156 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *    notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *    notice, this list of conditions and the following disclaimer in
+ *    the documentation and/or other materials provided with the
+ *    distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *    contributors may be used to endorse or promote products derived
+ *    from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <linux/mm.h>
+#include <linux/device.h>
+
+#include "hfi.h"
+
+static void __hfi1_release_user_pages(struct page **p, size_t num_pages,
+				      int dirty)
+{
+	size_t i;
+
+	for (i = 0; i < num_pages; i++) {
+		if (dirty)
+			set_page_dirty_lock(p[i]);
+		put_page(p[i]);
+	}
+}
+
+/*
+ * Call with current->mm->mmap_sem held.
+ */
+static int __hfi1_get_user_pages(unsigned long start_page, size_t num_pages,
+				 struct page **p)
+{
+	unsigned long lock_limit;
+	size_t got;
+	int ret;
+
+	lock_limit = rlimit(RLIMIT_MEMLOCK) >> PAGE_SHIFT;
+
+	if (num_pages > lock_limit && !capable(CAP_IPC_LOCK)) {
+		ret = -ENOMEM;
+		goto bail;
+	}
+
+	for (got = 0; got < num_pages; got += ret) {
+		ret = get_user_pages(current, current->mm,
+				     start_page + got * PAGE_SIZE,
+				     num_pages - got, 1, 1,
+ 
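
[Editorial sketch, not part of the patch] The RLIMIT_MEMLOCK check at the top of __hfi1_get_user_pages can be illustrated outside the kernel. The following is a minimal userspace sketch under stated assumptions: the helper name pages_fit_memlock and its parameters are hypothetical, the boolean can_ipc_lock merely stands in for capable(CAP_IPC_LOCK), and the division by the page size plays the role of the shift by PAGE_SHIFT.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <sys/resource.h>

/*
 * Hypothetical userspace analogue of the check in the patch: convert a
 * RLIMIT_MEMLOCK value given in bytes to whole pages and compare it with
 * the number of pages the caller wants to pin.
 * can_ipc_lock stands in for the kernel's capable(CAP_IPC_LOCK).
 */
static bool pages_fit_memlock(size_t num_pages, rlim_t limit_bytes,
			      size_t page_size, bool can_ipc_lock)
{
	assert(page_size > 0);

	/* A privileged caller or an unlimited rlimit always passes. */
	if (can_ipc_lock || limit_bytes == RLIM_INFINITY)
		return true;

	return num_pages <= (size_t)(limit_bytes / page_size);
}
```

In a real program the limit would presumably come from getrlimit(RLIMIT_MEMLOCK, ...) and the page size from sysconf(_SC_PAGESIZE); they are passed in here only to keep the sketch deterministic.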

[PATCH v4 50/50] IB/core: Add opa driver to kbuild

2015-07-30 Thread Mike Marciniszyn
From: Jubin John jubin.j...@intel.com

Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
---
 drivers/infiniband/Kconfig |1 +
 drivers/infiniband/hw/Makefile |1 +
 2 files changed, 2 insertions(+)

diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index da4c697..f84eecd 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -56,6 +56,7 @@ config INFINIBAND_ADDR_TRANS
 
 source "drivers/infiniband/hw/mthca/Kconfig"
 source "drivers/infiniband/hw/qib/Kconfig"
+source "drivers/infiniband/hw/hfi1/Kconfig"
 source "drivers/infiniband/hw/ehca/Kconfig"
 source "drivers/infiniband/hw/cxgb3/Kconfig"
 source "drivers/infiniband/hw/cxgb4/Kconfig"
diff --git a/drivers/infiniband/hw/Makefile b/drivers/infiniband/hw/Makefile
index 1bdb999..52f3788 100644
--- a/drivers/infiniband/hw/Makefile
+++ b/drivers/infiniband/hw/Makefile
@@ -1,5 +1,6 @@
 obj-$(CONFIG_INFINIBAND_MTHCA) += mthca/
 obj-$(CONFIG_INFINIBAND_QIB)   += qib/
+obj-$(CONFIG_INFINIBAND_HFI1)  += hfi1/
 obj-$(CONFIG_INFINIBAND_EHCA)  += ehca/
 obj-$(CONFIG_INFINIBAND_CXGB3) += cxgb3/
 obj-$(CONFIG_INFINIBAND_CXGB4) += cxgb4/



Re: [PATCH v4 00/50] Add OPA gen1 driver

2015-07-30 Thread Christoph Lameter
On Thu, 30 Jul 2015, Mike Marciniszyn wrote:

> As a verbs driver the device functions as an InfiniBand device and
> supports the standard features of the IBTA specification v1.3 with
> the exceptions noted below.

Hmmm... So OPA networks and IB networks (Truescale?) will be able to
interoperate?

> The public information can be reviewed at:
>
> http://www.intel.com/content/www/us/en/omni-path/omni-path-fabric-overview.html

That is very helpful although I have to guess what the various marketing
terms mean. Is there more detail on NICs and switch specifications
available?


Re: [PATCH] IB/ipoib: CSUM support in connected mode

2015-07-30 Thread Yuval Shaia
On Thu, Jul 30, 2015 at 11:51:12AM -0400, Doug Ledford wrote:
 On 07/30/2015 11:20 AM, Yuval Shaia wrote:
  On Thu, Jul 30, 2015 at 03:58:13PM +0200, Yann Droneaud wrote:
  Hi,
 
  Le jeudi 30 juillet 2015 à 04:46 -0700, Yuval Shaia a écrit :
  This enhancement suggests using the IB CRC instead of CSUM in IPoIB
  CM. IPoIB CM uses RC (Reliable Connection), which guarantees
  corruption-free delivery of the packet.
 
  InfiniBand uses a 32b CRC, which provides stronger data integrity
  protection compared to the 16b IP checksum.
 
  InfiniBand 32b CRC = Ethernet 32b CRC, it's link layer, layer 2.
 
  IPv4 checksum is at another level, it's internet layer, layer 3.
 
   So, there is no added value that IP/TCP Checksum provides in the IB 
  world.
 
 
  Sure, IPv4 checksum is a thing of the past: checksum was dropped from
  IP header in IPv6: it assumes the lower layer, such as Ethernet,
  provides the required integrity check.
 
  I think not checking the IPv4 checksum should be a choice, carefully
  thought, for inside a fabric, as I understand your proposal, packet
  with invalid checksum will be allowed to go in/out of the fabric.
  Yes, this is why it is controlled by module parameter.
  Maybe a better choice would be to default it to 0.
 
 In its current form, yes, it should default to 0.
 
 
  It sounds like it's a departure from the behavior one can expect from an
  IPv4 network stack.
  It should be considered a network fine-tuning parameter, so an admin who
  knows his fabric can use it.
 
  The proposal is to tell network stack that IPoIB-CM supports IP 
  Checksum offload. This enables the kernel to save the time of 
  checksum calculation of IPoIB CM packets. Network sends the IP packet 
  without adding the IP Checksum to the header. On the receive side, 
  IPoIB driver again tells the network stack that IP Checksum is good 
  for the incoming packets and network stack avoids the IP Checksum 
  calculations.
 
  During connection establishment the driver determines whether the peer
  supports IB CRC as checksum. This is done so the driver will be able to
  calculate the checksum before transmitting the packet in case the peer
  does not support this feature.
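
[Editorial sketch, not part of the thread] For reference, the 16b checksum being discussed is the one's-complement Internet checksum. The block below is a minimal, self-contained sketch in the style of RFC 1071; the function name inet_checksum is an assumption for illustration and this is not the kernel's optimized csum code or anything taken from the patch.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/*
 * 16-bit one's-complement Internet checksum (RFC 1071 style).
 * The buffer is read as big-endian 16-bit words; an odd trailing byte is
 * padded with zero. Intended for packet-sized buffers: the 32-bit
 * accumulator is only folded after the loop.
 */
static uint16_t inet_checksum(const uint8_t *data, size_t len)
{
	uint32_t sum = 0;

	assert(data != NULL || len == 0);

	while (len > 1) {
		sum += ((uint32_t)data[0] << 8) | data[1];
		data += 2;
		len -= 2;
	}
	if (len == 1)
		sum += (uint32_t)data[0] << 8;

	/* Fold the carries back into the low 16 bits. */
	while (sum >> 16)
		sum = (sum & 0xffff) + (sum >> 16);

	return (uint16_t)~sum;
}
```

Running the same function over a header that already contains a correct checksum yields 0, which is how a receiver verifies it. That per-packet summation is the work the proposal wants to skip when both peers agree to rely on the IB CRC.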
 
 
  Two questions:
  Three :)
 
 No, he really only had 2, the second one was a line split of the word
 checksum-less done by his mailer ;-)
 
 
  - What will see tool such as wireshark/tcpdump when sniffing checksum
  Zero, or whatever the networking layer puts in csum when H/W supports
  CSUM offloading.
  Please note that with this patch the driver still supports backward
  compatibility (per connection).
  This means that for connections with a peer that does not support this
  functionality, you can expect to see this value filled with the checksum.
  -less IPv4 packets sent/received on IPoIB interface ?
  No
 
  - What might happen if such checksum-less IPv4 packet is later routed to a 
  different IPv4 network ?
  As noted above, for a network that is open to the outside world this
  feature should be blocked.
  In general I would say that if a layer 2 terminating device (e.g. a router)
  exists in the fabric, this feature can't be used and must be blocked.
  Even with this limitation it is still worth using because it increases
  throughput.
 
 In its current state, I have my doubts about this patch.  However, it
 seems to me that this should be relatively easy to fix in such a way
 that you get 90%+ of the performance benefit, and can turn it on by
 default, and we don't cause any problems.  Why not perform the checksum
 operation on a per connection basis?  This is all IPoIB traffic anyway,
This part is already implemented.
Actually this is the main purpose of adding the 'caps' field to ipoib_cm_tx.
The peer capabilities (currently only one option, but the design lets us add
up to 12 capabilities in the future) are passed in IPoIB's private data and
saved in ipoib_cm_tx.caps on a per-connection basis.
Then, in ipoib_cm_send, the decision is made based on that (and on some
other conditions) and, if needed, the driver calculates the checksum just
before sending.
 so every send will have a src ip and dst ip.  If the dst ip is link
 local to our src ip device, and the connected mode partner is capable of
 running without csum, then send that specific packet without doing a
 checksum.  If the IP address is not link local, then do the checksum as
 normal.  That way if our final destination is on the other side of a
 router, we aren't leaking un-checksummed packets.  It means we would
 miss out on being able to do checksum-less transfers from host A on
 fabric 0 through host B as a router to host C on fabric 1, but I doubt
 that's a very common situation to be in.  Or maybe a better way of
 putting this is if our next hop IP address != our dest IP address, then
 perform the checksum, otherwise if capable of checksum-less operation,
 do so.  Can you rework the patch to operate in that manner?
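
[Editorial sketch, not part of the thread] The per-packet rule suggested above (checksum whenever the next hop differs from the destination, otherwise skip it only if the peer is capable) can be written as a small predicate. This is a hedged illustration only: need_sw_checksum and PEER_CAP_CSUM_LESS are hypothetical names, and the addresses are plain 32-bit values rather than anything from the IPoIB driver.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-connection capability bit; not a name from the patch. */
#define PEER_CAP_CSUM_LESS 0x1u

/*
 * Return true when the sender must compute the IP checksum in software.
 * next_hop_ip and dest_ip are IPv4 addresses in any consistent byte order.
 */
static bool need_sw_checksum(uint32_t next_hop_ip, uint32_t dest_ip,
			     uint32_t peer_caps)
{
	/* The packet will be routed onward: never leak it un-checksummed. */
	if (next_hop_ip != dest_ip)
		return true;

	/* Directly reachable peer, but it did not advertise the capability. */
	if (!(peer_caps & PEER_CAP_CSUM_LESS))
		return true;

	return false;
}
```

Under these assumptions the checksum is skipped only for a directly reachable, capable peer, which is the case the suggestion expects to deliver most of the performance benefit.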
I think that the concern with 'router' is that when packet goes into it
and then goes 

RE: [PATCH v3 01/49] IB/core: Add header definitions

2015-07-30 Thread Marciniszyn, Mike
Hal,

I missed this email.   Ira and I agree with the comments.

We will address this quickly with a follow up patch.

Sorry,
Mike

 -Original Message-
 From: Hal Rosenstock [mailto:h...@dev.mellanox.co.il]
 Sent: Wednesday, June 17, 2015 10:13 AM
 To: Marciniszyn, Mike
 Cc: dledf...@redhat.com; linux-rdma@vger.kernel.org; Weiny, Ira
 Subject: Re: [PATCH v3 01/49] IB/core: Add header definitions
 
 On 6/17/2015 8:28 AM, Mike Marciniszyn wrote:
  From: Ira Weiny ira.we...@intel.com
 
  Add common OPA header definitions for driver
  build:
  - opa_port_info.h
  - opa_smi.h
  - hfi1_user.h
 
  Additionally, ib_mad.h, has additional definitions that are common to
  ib_drivers including:
  - trap support
  - cca support
 
  The qib driver has the duplication removed in favor of those in ib_mad.h
 
  Reviewed-by: Mike Marciniszyn mike.marcinis...@intel.com
  Reviewed-by: John, Jubin jubin.j...@intel.com
  Signed-off-by: Ira Weiny ira.we...@intel.com
  ---
   drivers/infiniband/hw/qib/qib_mad.h |  147 +---
   include/rdma/ib_mad.h   |  138 +++
   include/rdma/opa_port_info.h|  433
 +++
 
 Should opa_port_info.h be in include/rdma or in drivers/infiniband/hw/hfi1
 ?
 
   include/rdma/opa_smi.h  |   47 
   include/uapi/rdma/hfi/hfi1_user.h   |  427
 +++
   5 files changed, 1053 insertions(+), 139 deletions(-)  create mode
  100644 include/rdma/opa_port_info.h  create mode 100644
  include/uapi/rdma/hfi/hfi1_user.h
 
  diff --git a/drivers/infiniband/hw/qib/qib_mad.h
  b/drivers/infiniband/hw/qib/qib_mad.h
  index 941d4d5..57e99dc 100644
  --- a/drivers/infiniband/hw/qib/qib_mad.h
  +++ b/drivers/infiniband/hw/qib/qib_mad.h
  @@ -36,148 +36,17 @@
 
   #include <rdma/ib_pma.h>
 
  -#define IB_SMP_UNSUP_VERSION    cpu_to_be16(0x0004)
  -#define IB_SMP_UNSUP_METHOD     cpu_to_be16(0x0008)
  -#define IB_SMP_UNSUP_METH_ATTR  cpu_to_be16(0x000C)
  -#define IB_SMP_INVALID_FIELD    cpu_to_be16(0x001C)
  +#define IB_SMP_UNSUP_VERSION \
  +cpu_to_be16(IB_MGMT_MAD_STATUS_BAD_VERSION)
 
  -struct ib_node_info {
  -   u8 base_version;
  -   u8 class_version;
  -   u8 node_type;
  -   u8 num_ports;
  -   __be64 sys_guid;
  -   __be64 node_guid;
  -   __be64 port_guid;
  -   __be16 partition_cap;
  -   __be16 device_id;
  -   __be32 revision;
  -   u8 local_port_num;
  -   u8 vendor_id[3];
  -} __packed;
  -
  -struct ib_mad_notice_attr {
  -   u8 generic_type;
  -   u8 prod_type_msb;
  -   __be16 prod_type_lsb;
  -   __be16 trap_num;
  -   __be16 issuer_lid;
  -   __be16 toggle_count;
  -
  -   union {
  -   struct {
  -   u8  details[54];
  -   } raw_data;
  -
  -   struct {
  -   __be16  reserved;
  -   __be16  lid;        /* where violation happened */
  -   u8  port_num;   /* where violation happened */
  -   } __packed ntc_129_131;
  -
  -   struct {
  -   __be16  reserved;
  -   __be16  lid;        /* LID where change occurred */
  -   u8  reserved2;
  -   u8  local_changes;  /* low bit - local changes */
  -   __be32  new_cap_mask;   /* new capability mask */
  -   u8  reserved3;
  -   u8  change_flags;   /* low 3 bits only */
  -   } __packed ntc_144;
  -
  -   struct {
  -   __be16  reserved;
  -   __be16  lid;        /* lid where sys guid changed */
  -   __be16  reserved2;
  -   __be64  new_sys_guid;
  -   } __packed ntc_145;
  -
  -   struct {
  -   __be16  reserved;
  -   __be16  lid;
  -   __be16  dr_slid;
  -   u8  method;
  -   u8  reserved2;
  -   __be16  attr_id;
  -   __be32  attr_mod;
  -   __be64  mkey;
  -   u8  reserved3;
  -   u8  dr_trunc_hop;
  -   u8  dr_rtn_path[30];
  -   } __packed ntc_256;
  -
  -   struct {
  -   __be16  reserved;
  -   __be16  lid1;
  -   __be16  lid2;
  -   __be32  key;
  -   __be32  sl_qp1; /* SL: high 4 bits */
  -   __be32  qp2;        /* high 8 bits reserved */
  -   union ib_gid    gid1;
  -   union ib_gid    gid2;
  -   } __packed ntc_257_258;
  -
  -   } details;
  -};
  -
  -/*
  - * Generic trap/notice types
  - */
  -#define IB_NOTICE_TYPE_FATAL       0x80
  -#define IB_NOTICE_TYPE_URGENT      0x81
  -#define IB_NOTICE_TYPE_SECURITY    0x82
  -#define IB_NOTICE_TYPE_SM          0x83
  -#define IB_NOTICE_TYPE_INFO        0x84

Re: [PATCH v4 17/50] IB/hfi1: add PSM driver control/data path

2015-07-30 Thread Doug Ledford
On 07/30/2015 04:01 PM, Marciniszyn, Mike wrote:

 I thought you were getting rid of this?

 Jason
 
 Doug wanted the v4 submitted as we currently have it.

To be accurate, I said "If you want a chance at making 4.3, I need a
v4."  I didn't comment on whether or not any specific review comments
were addressed.

 Doug?

I have no problem with this code.  That Al finds the user space ABI for
this driver to be bizarre is neither here nor there to me.  Sure, this
file does not exhibit normal file API behavior.  Who cares?  It's not a
normal file in *any* sense of the word.  For example, the normal write
routine will never, ever accept just plain data.  It's always in the
form of a command.  If you don't have the right magic decoder ring, you
will get nothing but errors when trying to do something with this file.
 Much like /dev/infiniband/uverbs? files, it is a command interface, not
a raw data interface.  I actually think the fact that you guys use write
for a single command and writev/write_iter for a command queue is an
elegant solution to your particular needs.  The only reason Al threw a
hissy over it is because it tripped him up when he went to do the
conversion from writev to write_iter.  That's understandable.  So, some
clear documentation so someone like Al doesn't have to go reading
through sources and try to figure out what you are doing would be the
generally nice thing to do for other kernel generalists that might come
poking around this way.  Or, another option would be to drop the write
function altogether and just make all commands come through
writev/write_iter and if you only have one command, you only send one
element.  Regardless, those things can be cleaned up in follow on
patches, please do not resubmit this set for that.

-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD






Re: [PATCH v4 01/50] IB: Add CNP opcode enumeration.

2015-07-30 Thread ira.weiny
On Thu, Jul 30, 2015 at 09:18:17PM +, Marciniszyn, Mike wrote:
  The opcode is 0x80, which appears in the protocol part of the 8 bit opcode.
  That is what is specified in A3.10.2 of the 1.3 spec.
  
 
 Correction A10.3.2.

This value also appears in Table 38 of section 9.2 where the other OpCodes
which appear in this enum are defined.

So it does appear to be the correct place to put this value.

Ira



Re: [PATCH v4 17/50] IB/hfi1: add PSM driver control/data path

2015-07-30 Thread Doug Ledford
On 07/30/2015 06:00 PM, Marciniszyn, Mike wrote:
 On 07/30/2015 04:01 PM, Marciniszyn, Mike wrote:

 I thought you were getting rid of this?

 Jason

 Doug wanted the v4 submitted as we currently have it.

 To be accurate, I said "If you want a chance at making 4.3, I need a v4."  I
 didn't comment on whether or not any specific review comments were
 addressed.

 Doug?

 I have no problem with this code.  That Al finds the user space ABI for this
 driver to be bizarre is neither here nor there to me.  Sure, this file does
 not exhibit normal file API behavior.  Who cares?  It's not a normal file in
 *any* sense of the word.  For example, the normal write routine will never,
 ever accept just plain data.  It's always in the form of a command.  If you
 don't have the right magic decoder ring, you will get nothing but errors
 when trying to do something with this file.
  Much like /dev/infiniband/uverbs? files, it is a command interface, not a
 raw data interface.  I actually think the fact that you guys use write for a
 single command and writev/write_iter for a command queue is an elegant
 solution to your particular needs.  The only reason Al threw a hissy over it
 is because it tripped him up when he went to do the conversion from writev
 to write_iter.  That's understandable.  So, some clear documentation so
 someone like Al doesn't have to go reading through sources and try to figure
 out what you are doing would be the generally nice thing to do for other
 kernel generalists that might come poking around this way.  Or, another
 option would be to drop the write function altogether and just make all
 commands come through writev/write_iter and if you only have one
 command, you only send one element.  Regardless, those things can be
 cleaned up in follow on patches, please do not resubmit this set for that.

 
 Jason,
 
 I did ask you in http://marc.info/?l=linux-rdma&m=143707462806767&w=2 if you
 thought ioctl was ok.
 
 Hearing nothing, we left the interface as it was.

I think the interface is fine as is, with the only thing I would do, if
*really* forced to by Al, would be to do as I suggested above and
convert all of the write cases to writev with a single element.

 I suspect (I lack the early history) that the ioctl BKL might have forced 
 both uverbs and PSM to go this route.

An ioctl interface is not really designed for a queue of commands to be
sent in a single operation any more than any other interface is.  I
don't personally see a great benefit to it.

 
 Doug,
 
 Where would be the appropriate location to document?  In the source itself?  
 Somewhere else?

At the function for both write and writev (and a patch to update to
write_iter would be a good next step to keep us in line with qib),
document how each is used and point to the other one and point out that
they differ in their basic usage.  If Al had a clear comment saying our
write function is used to pass a single command to our driver, and our
writev function is used to pass a queue of formatted commands, one per
element, he might have not written what he did in his commit message.
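
[Editorial sketch, not part of the thread] The distinction described above, write for a single command and writev for a queue of commands with one command per iovec element, can be shown from the userspace side. This is a minimal sketch under loud assumptions: struct demo_cmd, DEMO_MAX_QUEUE and both helper names are hypothetical and do not reflect the hfi1 ABI; only write(2) and writev(2) are real interfaces.

```c
#include <stdint.h>
#include <sys/types.h>
#include <sys/uio.h>
#include <unistd.h>

#define DEMO_MAX_QUEUE 8	/* hypothetical cap on queued commands */

/* Hypothetical fixed-size command record; not the driver's real layout. */
struct demo_cmd {
	uint32_t type;
	uint32_t len;
	uint64_t addr;
};

/* write(): exactly one command per call. */
static ssize_t send_one_cmd(int fd, const struct demo_cmd *cmd)
{
	return write(fd, cmd, sizeof(*cmd));
}

/* writev(): a queue of commands, one formatted command per iovec element. */
static ssize_t send_cmd_queue(int fd, const struct demo_cmd *cmds, int count)
{
	struct iovec iov[DEMO_MAX_QUEUE];
	int i;

	if (count < 1 || count > DEMO_MAX_QUEUE)
		return -1;

	for (i = 0; i < count; i++) {
		iov[i].iov_base = (void *)&cmds[i];
		iov[i].iov_len = sizeof(cmds[i]);
	}
	return writev(fd, iov, count);
}
```

A comment of roughly this shape next to each kernel entry point, stating which of the two calling conventions it serves and pointing at the other, is the kind of documentation being asked for.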


-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD






Re: [PATCH v4 00/50] Add OPA gen1 driver

2015-07-30 Thread ira.weiny
On Thu, Jul 30, 2015 at 03:37:24PM -0500, Christoph Lameter wrote:
> On Thu, 30 Jul 2015, Mike Marciniszyn wrote:
>
> > As a verbs driver the device functions as an InfiniBand device and
> > supports the standard features of the IBTA specification v1.3 with
> > the exceptions noted below.
>
> Hmmm... So OPA networks and IB networks (Truescale?) will be able to
> interoperate?

No, you can't plug an IB device into an OPA switch or a HFI into an IB switch.

The comment "As a verbs driver" means that we present the standard verbs
software interface to the core kernel and userspace.  This was to aid in review
of the patch series and how it interacts with the rest of the infiniband
subtree.

"Intel’s host software strategy is to utilize the existing OpenFabrics Alliance
interfaces, thus ensuring that today’s application software written to those
interfaces run with Intel OPA with no code changes required."

-- 
https://www-ssl.intel.com/content/www/us/en/high-performance-computing-fabrics/omni-path-fabric-software-components.html

Ira



Re: [PATCH v4 00/50] Add OPA gen1 driver

2015-07-30 Thread Or Gerlitz
On Thu, Jul 30, 2015 at 10:17 PM, Mike Marciniszyn
mike.marcinis...@intel.com wrote:

 The following patch series adds the OPA device driver.
[...]
   IB/hfi1: add qp handling
   IB/hfi1: add RC QP handling
   IB/hfi1: add UC QP handling
   IB/hfi1: add UD QP handling

On Wed, Jun 17, 2015, Mike Marciniszyn mike.marcinis...@intel.com wrote:
 This patch series adds the OPA gen1 driver.
[...]
   IB/hfi1: add qp handling
   IB/hfi1: add RC QP handling
   IB/hfi1: add routines for RC/UC
   IB/hfi1: add UC QP handling
   IB/hfi1: add UD QP handling

Mike, nothing changed since your V3 in that respect, so repeating myself:

This is the 3rd time in a row {(1) ipath (2) qib (3) OPA Gen1} for you
guys to implement IB transports in SW @ your low-level drivers.

So... enough is enough, please put it in a kernel module residing in
the IB core and use it in this driver, to begin with. The fact that
ipath is going away makes the code duplication only 2X vs. 3X,
but it's still 2X.

Or.


RE: [PATCH v4 40/50] IB/hfi1: add sysfs routines and documentation

2015-07-30 Thread Marciniszyn, Mike
  +HFI1
  +
  +  The hfi1 driver also creates these additional files:
  +
  +   hw_rev - hardware revision

I'm checking on this to see if it is indeed a duplicate.

  +   board_id - manufacturing board id

There is no PCIe equivalent.

  +   version - driver version

This IS a duplicate of /sys/module/hfi1/version.

Will remove.

  +   tempsense - thermal sense information

No PCIe equivalent.

  +   serial - board serial number

No PCIe equivalent.

  +   localbus_info - PCIe info

Already present in PCIe.  Will remove.

  +   chip_reset - diagnostic (root only)

Used by our manufacturing process.

  +   boardversion - board version
 

These are sourced from chip registers.  There is no PCIe equivalent.

 Aren't these already provided by the pci-e driver framework? Tools will not
 work if you do not put the information out there in a way that they can be
 scanned.
 

Doug, do you want a revision with the ones I know can be removed, or is a
follow-up ok?

Mike


RE: [PATCH v4 40/50] IB/hfi1: add sysfs routines and documentation

2015-07-30 Thread Marciniszyn, Mike
  There is no PCIe equivalent.
 
+   serial - board serial number
 
  No PCIe equivalent.
 
+   boardversion - board version
 
 These all have PCI-E versions. Most should live in the config space VPD.
 

I'm not seeing VPD in our current cards.

I'm checking to make sure.

Was this an lspci -vvv output for the example you showed?

Mike


[PATCH v2 12/12] rds/ib: Remove ib_get_dma_mr calls

2015-07-30 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
 net/rds/ib.c  | 8 
 net/rds/ib.h  | 2 --
 net/rds/ib_cm.c   | 4 +---
 net/rds/ib_recv.c | 6 +++---
 net/rds/ib_send.c | 8 
 5 files changed, 8 insertions(+), 20 deletions(-)

diff --git a/net/rds/ib.c b/net/rds/ib.c
index ba2dffeff608..56c570131667 100644
--- a/net/rds/ib.c
+++ b/net/rds/ib.c
@@ -99,8 +99,6 @@ static void rds_ib_dev_free(struct work_struct *work)
 
 	if (rds_ibdev->mr_pool)
 		rds_ib_destroy_mr_pool(rds_ibdev->mr_pool);
-	if (rds_ibdev->mr)
-		ib_dereg_mr(rds_ibdev->mr);
 	if (rds_ibdev->pd)
 		ib_dealloc_pd(rds_ibdev->pd);
 
@@ -164,12 +162,6 @@ static void rds_ib_add_one(struct ib_device *device)
 		goto put_dev;
 	}
 
-	rds_ibdev->mr = ib_get_dma_mr(rds_ibdev->pd, IB_ACCESS_LOCAL_WRITE);
-	if (IS_ERR(rds_ibdev->mr)) {
-		rds_ibdev->mr = NULL;
-		goto put_dev;
-	}
-
 	rds_ibdev->mr_pool = rds_ib_create_mr_pool(rds_ibdev);
 	if (IS_ERR(rds_ibdev->mr_pool)) {
 		rds_ibdev->mr_pool = NULL;
diff --git a/net/rds/ib.h b/net/rds/ib.h
index 86d88ec5d556..36f7d808ffaa 100644
--- a/net/rds/ib.h
+++ b/net/rds/ib.h
@@ -100,7 +100,6 @@ struct rds_ib_connection {
 	/* alphabet soup, IBTA style */
 	struct rdma_cm_id	*i_cm_id;
 	struct ib_pd		*i_pd;
-	struct ib_mr		*i_mr;
 	struct ib_cq		*i_send_cq;
 	struct ib_cq		*i_recv_cq;
 
@@ -173,7 +172,6 @@ struct rds_ib_device {
 	struct list_head	conn_list;
 	struct ib_device	*dev;
 	struct ib_pd		*pd;
-	struct ib_mr		*mr;
 	struct rds_ib_mr_pool	*mr_pool;
 	unsigned int		fmr_max_remaps;
 	unsigned int		max_fmrs;
diff --git a/net/rds/ib_cm.c b/net/rds/ib_cm.c
index 0da2a45b33bd..a75e8832bc23 100644
--- a/net/rds/ib_cm.c
+++ b/net/rds/ib_cm.c
@@ -269,7 +269,6 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 
 	/* Protection domain and memory range */
 	ic->i_pd = rds_ibdev->pd;
-	ic->i_mr = rds_ibdev->mr;
 
 	cq_attr.cqe = ic->i_send_ring.w_nr + 1;
 	ic->i_send_cq = ib_create_cq(dev, rds_ib_send_cq_comp_handler,
@@ -375,7 +374,7 @@ static int rds_ib_setup_qp(struct rds_connection *conn)
 
 	rds_ib_recv_init_ack(ic);
 
-	rdsdebug("conn %p pd %p mr %p cq %p %p\n", conn, ic->i_pd, ic->i_mr,
+	rdsdebug("conn %p pd %p cq %p %p\n", conn, ic->i_pd,
 		 ic->i_send_cq, ic->i_recv_cq);
 
 out:
@@ -678,7 +677,6 @@ void rds_ib_conn_shutdown(struct rds_connection *conn)
 
 		ic->i_cm_id = NULL;
 		ic->i_pd = NULL;
-		ic->i_mr = NULL;
 		ic->i_send_cq = NULL;
 		ic->i_recv_cq = NULL;
 		ic->i_send_hdrs = NULL;
diff --git a/net/rds/ib_recv.c b/net/rds/ib_recv.c
index cac5b4506ee3..0ceb4c60d2a3 100644
--- a/net/rds/ib_recv.c
+++ b/net/rds/ib_recv.c
@@ -62,12 +62,12 @@ void rds_ib_recv_init_ring(struct rds_ib_connection *ic)
 		sge = &recv->r_sge[0];
 		sge->addr = ic->i_recv_hdrs_dma + (i * sizeof(struct rds_header));
 		sge->length = sizeof(struct rds_header);
-		sge->lkey = ic->i_mr->lkey;
+		sge->lkey = ic->i_pd->local_dma_lkey;
 
 		sge = &recv->r_sge[1];
 		sge->addr = 0;
 		sge->length = RDS_FRAG_SIZE;
-		sge->lkey = ic->i_mr->lkey;
+		sge->lkey = ic->i_pd->local_dma_lkey;
 	}
 }
 
@@ -520,7 +520,7 @@ void rds_ib_recv_init_ack(struct rds_ib_connection *ic)
 
 	sge->addr = ic->i_ack_dma;
 	sge->length = sizeof(struct rds_header);
-	sge->lkey = ic->i_mr->lkey;
+	sge->lkey = ic->i_pd->local_dma_lkey;
 
 	wr->sg_list = sge;
 	wr->num_sge = 1;
diff --git a/net/rds/ib_send.c b/net/rds/ib_send.c
index 5d0a704fa039..f6c829d43373 100644
--- a/net/rds/ib_send.c
+++ b/net/rds/ib_send.c
@@ -202,9 +202,9 @@ void rds_ib_send_init_ring(struct rds_ib_connection *ic)
 		sge = &send->s_sge[0];
 		sge->addr = ic->i_send_hdrs_dma + (i * sizeof(struct rds_header));
 		sge->length = sizeof(struct rds_header);
-		sge->lkey = ic->i_mr->lkey;
+		sge->lkey = ic->i_pd->local_dma_lkey;
 
-		send->s_sge[1].lkey = ic->i_mr->lkey;
+		send->s_sge[1].lkey = ic->i_pd->local_dma_lkey;
 	}
 }
 
@@ -813,7 +813,7 @@ int rds_ib_xmit_atomic(struct rds_connection *conn, struct rm_atomic_op *op)
 	/* Convert our struct scatterlist to struct ib_sge */
 	send->s_sge[0].addr = ib_sg_dma_address(ic->i_cm_id->device, op->op_sg);
 	send->s_sge[0].length = ib_sg_dma_len(ic->i_cm_id->device, op->op_sg);
-	send->s_sge[0].lkey = ic->i_mr->lkey;
+	send->s_sge[0].lkey 

[PATCH v2 11/12] net/9p: Remove ib_get_dma_mr calls

2015-07-30 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
Tested-by: Dominique Martinet dominique.marti...@cea.fr
---
 net/9p/trans_rdma.c | 26 ++
 1 file changed, 2 insertions(+), 24 deletions(-)

diff --git a/net/9p/trans_rdma.c b/net/9p/trans_rdma.c
index 37a78d20c0f6..ba1210253f5e 100644
--- a/net/9p/trans_rdma.c
+++ b/net/9p/trans_rdma.c
@@ -94,8 +94,6 @@ struct p9_trans_rdma {
struct ib_pd *pd;
struct ib_qp *qp;
struct ib_cq *cq;
-   struct ib_mr *dma_mr;
-   u32 lkey;
long timeout;
int sq_depth;
struct semaphore sq_sem;
@@ -382,9 +380,6 @@ static void rdma_destroy_trans(struct p9_trans_rdma *rdma)
if (!rdma)
return;
 
-   if (rdma->dma_mr && !IS_ERR(rdma->dma_mr))
-   ib_dereg_mr(rdma->dma_mr);
-
if (rdma->qp && !IS_ERR(rdma->qp))
ib_destroy_qp(rdma->qp);
 
@@ -415,7 +410,7 @@ post_recv(struct p9_client *client, struct p9_rdma_context *c)
 
sge.addr = c->busa;
sge.length = client->msize;
-   sge.lkey = rdma->lkey;
+   sge.lkey = rdma->pd->local_dma_lkey;
 
wr.next = NULL;
c->wc_op = IB_WC_RECV;
@@ -506,7 +501,7 @@ dont_need_post_recv:
 
sge.addr = c->busa;
sge.length = c->req->tc->size;
-   sge.lkey = rdma->lkey;
+   sge.lkey = rdma->pd->local_dma_lkey;
 
wr.next = NULL;
c->wc_op = IB_WC_SEND;
@@ -647,7 +642,6 @@ rdma_create_trans(struct p9_client *client, const char *addr, char *args)
struct p9_trans_rdma *rdma;
struct rdma_conn_param conn_param;
struct ib_qp_init_attr qp_attr;
-   struct ib_device_attr devattr;
struct ib_cq_init_attr cq_attr = {};
 
/* Parse the transport specific mount options */
@@ -700,11 +694,6 @@ rdma_create_trans(struct p9_client *client, const char *addr, char *args)
if (err || (rdma->state != P9_RDMA_ROUTE_RESOLVED))
goto error;
 
-   /* Query the device attributes */
-   err = ib_query_device(rdma->cm_id->device, &devattr);
-   if (err)
-   goto error;
-
/* Create the Completion Queue */
cq_attr.cqe = opts.sq_depth + opts.rq_depth + 1;
rdma->cq = ib_create_cq(rdma->cm_id->device, cq_comp_handler,
@@ -719,17 +708,6 @@ rdma_create_trans(struct p9_client *client, const char *addr, char *args)
if (IS_ERR(rdma->pd))
goto error;
 
-   /* Cache the DMA lkey in the transport */
-   rdma->dma_mr = NULL;
-   if (devattr.device_cap_flags & IB_DEVICE_LOCAL_DMA_LKEY)
-   rdma->lkey = rdma->cm_id->device->local_dma_lkey;
-   else {
-   rdma->dma_mr = ib_get_dma_mr(rdma->pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(rdma->dma_mr))
-   goto error;
-   rdma->lkey = rdma->dma_mr->lkey;
-   }
-
/* Create the Queue Pair */
memset(&qp_attr, 0, sizeof qp_attr);
qp_attr.event_handler = qp_event_handler;
-- 
2.1.4

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v2 10/12] ib_srpt: Remove ib_get_dma_mr calls

2015-07-30 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
Reviewed-by: Sagi Grimberg sa...@mellanox.com
---
 drivers/infiniband/ulp/srpt/ib_srpt.c | 15 ---
 drivers/infiniband/ulp/srpt/ib_srpt.h |  1 -
 2 files changed, 4 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.c b/drivers/infiniband/ulp/srpt/ib_srpt.c
index 60ff0a2390e5..20adb96ba0b2 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.c
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.c
@@ -783,7 +783,7 @@ static int srpt_post_recv(struct srpt_device *sdev,
 
list.addr = ioctx->ioctx.dma;
list.length = srp_max_req_size;
-   list.lkey = sdev->mr->lkey;
+   list.lkey = sdev->pd->local_dma_lkey;
 
wr.next = NULL;
wr.sg_list = &list;
@@ -818,7 +818,7 @@ static int srpt_post_send(struct srpt_rdma_ch *ch,
 
list.addr = ioctx->ioctx.dma;
list.length = len;
-   list.lkey = sdev->mr->lkey;
+   list.lkey = sdev->pd->local_dma_lkey;
 
wr.next = NULL;
wr.wr_id = encode_wr_id(SRPT_SEND, ioctx->ioctx.index);
@@ -1206,7 +1206,7 @@ static int srpt_map_sg_to_ib_sge(struct srpt_rdma_ch *ch,
 
while (rsize > 0 && tsize > 0) {
sge->addr = dma_addr;
-   sge->lkey = ch->sport->sdev->mr->lkey;
+   sge->lkey = ch->sport->sdev->pd->local_dma_lkey;
 
if (rsize >= dma_len) {
sge->length =
@@ -3211,10 +3211,6 @@ static void srpt_add_one(struct ib_device *device)
if (IS_ERR(sdev->pd))
goto free_dev;
 
-   sdev->mr = ib_get_dma_mr(sdev->pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(sdev->mr))
-   goto err_pd;
-
sdev->srq_size = min(srpt_srq_size, sdev->dev_attr.max_srq_wr);
 
srq_attr.event_handler = srpt_srq_event;
@@ -3226,7 +3222,7 @@ static void srpt_add_one(struct ib_device *device)
 
sdev->srq = ib_create_srq(sdev->pd, &srq_attr);
if (IS_ERR(sdev->srq))
-   goto err_mr;
+   goto err_pd;
 
pr_debug("%s: create SRQ #wr= %d max_allow=%d dev= %s\n",
 __func__, sdev->srq_size, sdev->dev_attr.max_srq_wr,
@@ -3311,8 +3307,6 @@ err_cm:
ib_destroy_cm_id(sdev->cm_id);
 err_srq:
ib_destroy_srq(sdev->srq);
-err_mr:
-   ib_dereg_mr(sdev->mr);
 err_pd:
ib_dealloc_pd(sdev->pd);
 free_dev:
@@ -3358,7 +3352,6 @@ static void srpt_remove_one(struct ib_device *device)
srpt_release_sdev(sdev);
 
ib_destroy_srq(sdev->srq);
-   ib_dereg_mr(sdev->mr);
ib_dealloc_pd(sdev->pd);
 
srpt_free_ioctx_ring((struct srpt_ioctx **)sdev->ioctx_ring, sdev,
diff --git a/drivers/infiniband/ulp/srpt/ib_srpt.h b/drivers/infiniband/ulp/srpt/ib_srpt.h
index 21f8df67522a..5faad8acd789 100644
--- a/drivers/infiniband/ulp/srpt/ib_srpt.h
+++ b/drivers/infiniband/ulp/srpt/ib_srpt.h
@@ -393,7 +393,6 @@ struct srpt_port {
 struct srpt_device {
struct ib_device *device;
struct ib_pd *pd;
-   struct ib_mr *mr;
struct ib_srq *srq;
struct ib_cm_id *cm_id;
struct ib_device_attr dev_attr;
-- 
2.1.4



[PATCH v2 02/12] IB/mad: Remove ib_get_dma_mr calls

2015-07-30 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
 drivers/infiniband/core/mad.c  | 26 +++---
 drivers/infiniband/core/mad_priv.h |  1 -
 include/rdma/ib_mad.h  |  1 -
 3 files changed, 3 insertions(+), 25 deletions(-)

diff --git a/drivers/infiniband/core/mad.c b/drivers/infiniband/core/mad.c
index 786fc51bf04b..7c728a2d1d56 100644
--- a/drivers/infiniband/core/mad.c
+++ b/drivers/infiniband/core/mad.c
@@ -338,13 +338,6 @@ struct ib_mad_agent *ib_register_mad_agent(struct ib_device *device,
goto error1;
}
 
-   mad_agent_priv->agent.mr = ib_get_dma_mr(port_priv->qp_info[qpn].qp->pd,
-IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(mad_agent_priv->agent.mr)) {
-   ret = ERR_PTR(-ENOMEM);
-   goto error2;
-   }
-
if (mad_reg_req) {
reg_req = kmemdup(mad_reg_req, sizeof *reg_req, GFP_KERNEL);
if (!reg_req) {
@@ -429,8 +422,6 @@ error4:
spin_unlock_irqrestore(&port_priv->reg_lock, flags);
kfree(reg_req);
 error3:
-   ib_dereg_mr(mad_agent_priv->agent.mr);
-error2:
kfree(mad_agent_priv);
 error1:
return ret;
@@ -590,7 +581,6 @@ static void unregister_mad_agent(struct ib_mad_agent_private *mad_agent_priv)
wait_for_completion(&mad_agent_priv->comp);
 
kfree(mad_agent_priv->reg_req);
-   ib_dereg_mr(mad_agent_priv->agent.mr);
kfree(mad_agent_priv);
 }
 
@@ -1038,7 +1028,7 @@ struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent,
 
mad_send_wr->mad_agent_priv = mad_agent_priv;
mad_send_wr->sg_list[0].length = hdr_len;
-   mad_send_wr->sg_list[0].lkey = mad_agent->mr->lkey;
+   mad_send_wr->sg_list[0].lkey = mad_agent->qp->pd->local_dma_lkey;
 
/* OPA MADs don't have to be the full 2048 bytes */
if (opa && base_version == OPA_MGMT_BASE_VERSION &&
@@ -1047,7 +1037,7 @@ struct ib_mad_send_buf * ib_create_send_mad(struct ib_mad_agent *mad_agent,
else
mad_send_wr->sg_list[1].length = mad_size - hdr_len;
 
-   mad_send_wr->sg_list[1].lkey = mad_agent->mr->lkey;
+   mad_send_wr->sg_list[1].lkey = mad_agent->qp->pd->local_dma_lkey;
 
mad_send_wr->send_wr.wr_id = (unsigned long) mad_send_wr;
mad_send_wr->send_wr.sg_list = mad_send_wr->sg_list;
@@ -2885,7 +2875,7 @@ static int ib_mad_post_receive_mads(struct ib_mad_qp_info *qp_info,
struct ib_mad_queue *recv_queue = &qp_info->recv_queue;
 
/* Initialize common scatter list fields */
-   sg_list.lkey = (*qp_info->port_priv->mr).lkey;
+   sg_list.lkey = qp_info->port_priv->pd->local_dma_lkey;
 
/* Initialize common receive WR fields */
recv_wr.next = NULL;
@@ -3201,13 +3191,6 @@ static int ib_mad_port_open(struct ib_device *device,
goto error4;
}
 
-   port_priv->mr = ib_get_dma_mr(port_priv->pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(port_priv->mr)) {
-   dev_err(&device->dev, "Couldn't get ib_mad DMA MR\n");
-   ret = PTR_ERR(port_priv->mr);
-   goto error5;
-   }
-
if (has_smi) {
ret = create_mad_qp(&port_priv->qp_info[0], IB_QPT_SMI);
if (ret)
@@ -3248,8 +3231,6 @@ error8:
 error7:
destroy_mad_qp(&port_priv->qp_info[0]);
 error6:
-   ib_dereg_mr(port_priv->mr);
-error5:
ib_dealloc_pd(port_priv->pd);
 error4:
ib_destroy_cq(port_priv->cq);
@@ -3284,7 +3265,6 @@ static int ib_mad_port_close(struct ib_device *device, int port_num)
destroy_workqueue(port_priv->wq);
destroy_mad_qp(&port_priv->qp_info[1]);
destroy_mad_qp(&port_priv->qp_info[0]);
-   ib_dereg_mr(port_priv->mr);
ib_dealloc_pd(port_priv->pd);
ib_destroy_cq(port_priv->cq);
cleanup_recv_queue(&port_priv->qp_info[1]);
diff --git a/drivers/infiniband/core/mad_priv.h 
b/drivers/infiniband/core/mad_priv.h
index 5be89f98928f..4a4f7aad0978 100644
--- a/drivers/infiniband/core/mad_priv.h
+++ b/drivers/infiniband/core/mad_priv.h
@@ -199,7 +199,6 @@ struct ib_mad_port_private {
int port_num;
struct ib_cq *cq;
struct ib_pd *pd;
-   struct ib_mr *mr;
 
spinlock_t reg_lock;
struct ib_mad_mgmt_version_table version[MAX_MGMT_VERSION];
diff --git a/include/rdma/ib_mad.h b/include/rdma/ib_mad.h
index c8422d5a5a91..1f27023c919a 100644
--- a/include/rdma/ib_mad.h
+++ b/include/rdma/ib_mad.h
@@ -388,7 +388,6 @@ enum {
 struct ib_mad_agent {
struct ib_device*device;
struct ib_qp*qp;
-   struct ib_mr*mr;
ib_mad_recv_handler recv_handler;
ib_mad_send_handler send_handler;
ib_mad_snoop_handlersnoop_handler;
-- 
2.1.4


[PATCH v2 03/12] IB/ipoib: Remove ib_get_dma_mr calls

2015-07-30 Thread Jason Gunthorpe
The pd now has a local_dma_lkey member which completely replaces
ib_get_dma_mr, use it instead.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  1 -
 drivers/infiniband/ulp/ipoib/ipoib_cm.c|  2 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 18 +++---
 3 files changed, 4 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 79859c4d43c9..ca2873698d75 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -342,7 +342,6 @@ struct ipoib_dev_priv {
u16   pkey;
u16   pkey_index;
struct ib_pd *pd;
-   struct ib_mr *mr;
struct ib_cq *recv_cq;
struct ib_cq *send_cq;
struct ib_qp *qp;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index ee39be6ccfb0..206227bb385f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -332,7 +332,7 @@ static void ipoib_cm_init_rx_wr(struct net_device *dev,
int i;
 
for (i = 0; i < priv->cm.num_frags; ++i)
-   sge[i].lkey = priv->mr->lkey;
+   sge[i].lkey = priv->pd->local_dma_lkey;
 
sge[0].length = IPOIB_CM_HEAD_SIZE;
for (i = 1; i < priv->cm.num_frags; ++i)
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c 
b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index 9e6ee82a8fd7..3423256d3500 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -152,12 +152,6 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
return -ENODEV;
}
 
-   priv->mr = ib_get_dma_mr(priv->pd, IB_ACCESS_LOCAL_WRITE);
-   if (IS_ERR(priv->mr)) {
-   printk(KERN_WARNING "%s: ib_get_dma_mr failed\n", ca->name);
-   goto out_free_pd;
-   }
-
/*
 * the various IPoIB tasks assume they will never race against
 * themselves, so always use a single thread workqueue
@@ -165,7 +159,7 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
priv->wq = create_singlethread_workqueue("ipoib_wq");
if (!priv->wq) {
printk(KERN_WARNING "ipoib: failed to allocate device WQ\n");
-   goto out_free_mr;
+   goto out_free_pd;
}
 
size = ipoib_recvq_size + 1;
@@ -224,13 +218,13 @@ int ipoib_transport_dev_init(struct net_device *dev, struct ib_device *ca)
priv->dev->dev_addr[3] = (priv->qp->qp_num  ) & 0xff;
 
for (i = 0; i < MAX_SKB_FRAGS + 1; ++i)
-   priv->tx_sge[i].lkey = priv->mr->lkey;
+   priv->tx_sge[i].lkey = priv->pd->local_dma_lkey;
 
priv->tx_wr.opcode  = IB_WR_SEND;
priv->tx_wr.sg_list = priv->tx_sge;
priv->tx_wr.send_flags  = IB_SEND_SIGNALED;
 
-   priv->rx_sge[0].lkey = priv->mr->lkey;
+   priv->rx_sge[0].lkey = priv->pd->local_dma_lkey;
 
priv->rx_sge[0].length = IPOIB_UD_BUF_SIZE(priv->max_ib_mtu);
priv->rx_wr.num_sge = 1;
@@ -253,9 +247,6 @@ out_free_wq:
destroy_workqueue(priv->wq);
priv->wq = NULL;
 
-out_free_mr:
-   ib_dereg_mr(priv->mr);
-
 out_free_pd:
ib_dealloc_pd(priv->pd);
 
@@ -288,9 +279,6 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
priv->wq = NULL;
}
 
-   if (ib_dereg_mr(priv->mr))
-   ipoib_warn(priv, "ib_dereg_mr failed\n");
-
if (ib_dealloc_pd(priv->pd))
ipoib_warn(priv, "ib_dealloc_pd failed\n");
 
-- 
2.1.4



[PATCH v2 09/12] IB/srp: Do not create an all physical insecure rkey by default

2015-07-30 Thread Jason Gunthorpe
The ULP only needs this if the insecure register_always performance
optimization is enabled, or if FRWR/FMR is not supported in the driver.

Do not create an all physical MR unless it is needed to support either of
those modes. Default register_always to true so the out of the box
configuration does not create an insecure all physical MR.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
---
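The decision described in the commit message can be sketched in plain userspace C, purely as an illustration. The `SRP_SKETCH_*` constants and `srp_sketch_rkey_mr_flags()` are hypothetical stand-ins, not kernel symbols, and the bit values are arbitrary; only the branching mirrors what the patch below does in srp_add_one().

```c
#include <stdbool.h>

/* Hypothetical stand-ins for the kernel's IB_ACCESS_* bits (values are arbitrary). */
enum {
    SRP_SKETCH_LOCAL_WRITE  = 1u << 0,
    SRP_SKETCH_REMOTE_WRITE = 1u << 1,
    SRP_SKETCH_REMOTE_READ  = 1u << 2
};

/*
 * Returns the access flags an all-physical MR would be created with,
 * or 0 when no such MR is needed and rkey_mr would stay NULL.
 */
unsigned int srp_sketch_rkey_mr_flags(bool has_fmr, bool has_fr,
                                      bool register_always)
{
    unsigned int mr_flags = 0;

    /* Neither FMR nor FR: fall back to the insecure all-physical rkey. */
    if (!has_fmr && !has_fr)
        mr_flags |= SRP_SKETCH_REMOTE_READ | SRP_SKETCH_REMOTE_WRITE;

    /* The user opted out of always registering memory. */
    if (!register_always)
        mr_flags |= SRP_SKETCH_REMOTE_READ | SRP_SKETCH_REMOTE_WRITE;

    if (!mr_flags)
        return 0;

    return SRP_SKETCH_LOCAL_WRITE | mr_flags;
}
```

With the new default (register_always = true) and a device that supports FMR or FR, the sketch returns 0, which corresponds to not creating the MR at all.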
 drivers/infiniband/ulp/srp/ib_srp.c | 31 +--
 drivers/infiniband/ulp/srp/ib_srp.h |  2 +-
 2 files changed, 22 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index 19a1356f8b2a..a8003079c232 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -69,7 +69,7 @@ static unsigned int cmd_sg_entries;
 static unsigned int indirect_sg_entries;
 static bool allow_ext_sg;
 static bool prefer_fr;
-static bool register_always;
+static bool register_always = true;
 static int topspin_workarounds = 1;
 
 module_param(srp_sg_tablesize, uint, 0444);
@@ -3147,7 +3147,8 @@ static ssize_t srp_create_target(struct device *dev,
target->scsi_host   = target_host;
target->srp_host    = host;
target->lkey        = host->srp_dev->pd->local_dma_lkey;
-   target->rkey    = host->srp_dev->mr->rkey;
+   if (host->srp_dev->rkey_mr)
+   target->rkey    = host->srp_dev->rkey_mr->rkey;
target->cmd_sg_cnt  = cmd_sg_entries;
target->sg_tablesize    = indirect_sg_entries ? : cmd_sg_entries;
target->allow_ext_sg    = allow_ext_sg;
@@ -3378,6 +3379,7 @@ static void srp_add_one(struct ib_device *device)
struct srp_host *host;
int mr_page_shift, p;
u64 max_pages_per_mr;
+   unsigned int mr_flags = 0;
 
dev_attr = kmalloc(sizeof *dev_attr, GFP_KERNEL);
if (!dev_attr)
@@ -3396,8 +3398,11 @@ static void srp_add_one(struct ib_device *device)
device->map_phys_fmr && device->unmap_fmr);
srp_dev->has_fr = (dev_attr->device_cap_flags &
   IB_DEVICE_MEM_MGT_EXTENSIONS);
-   if (!srp_dev->has_fmr && !srp_dev->has_fr)
+   if (!srp_dev->has_fmr && !srp_dev->has_fr) {
dev_warn(&device->dev, "neither FMR nor FR is supported\n");
+   /* Fall back to using an insecure all physical rkey */
+   mr_flags |= IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;
+   }
 
srp_dev->use_fast_reg = (srp_dev->has_fr &&
 (!srp_dev->has_fmr || prefer_fr));
@@ -3433,12 +3438,17 @@ static void srp_add_one(struct ib_device *device)
if (IS_ERR(srp_dev->pd))
goto free_dev;
 
-   srp_dev->mr = ib_get_dma_mr(srp_dev->pd,
-   IB_ACCESS_LOCAL_WRITE |
-   IB_ACCESS_REMOTE_READ |
-   IB_ACCESS_REMOTE_WRITE);
-   if (IS_ERR(srp_dev->mr))
-   goto err_pd;
+   if (!register_always)
+   mr_flags |= IB_ACCESS_REMOTE_READ | IB_ACCESS_REMOTE_WRITE;
+
+   if (mr_flags) {
+   srp_dev->rkey_mr = ib_get_dma_mr(
+   srp_dev->pd, IB_ACCESS_LOCAL_WRITE | mr_flags);
+   if (IS_ERR(srp_dev->rkey_mr))
+   goto err_pd;
+   } else
+   srp_dev->rkey_mr = NULL;
+
 
for (p = rdma_start_port(device); p <= rdma_end_port(device); ++p) {
host = srp_add_port(srp_dev, p);
@@ -3495,7 +3505,8 @@ static void srp_remove_one(struct ib_device *device)
kfree(host);
}
 
-   ib_dereg_mr(srp_dev->mr);
+   if (srp_dev->rkey_mr)
+   ib_dereg_mr(srp_dev->rkey_mr);
ib_dealloc_pd(srp_dev->pd);
 
kfree(srp_dev);
diff --git a/drivers/infiniband/ulp/srp/ib_srp.h 
b/drivers/infiniband/ulp/srp/ib_srp.h
index 17ee3f80ba55..8b241f17f8b8 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.h
+++ b/drivers/infiniband/ulp/srp/ib_srp.h
@@ -95,7 +95,7 @@ struct srp_device {
struct list_headdev_list;
struct ib_device   *dev;
struct ib_pd   *pd;
-   struct ib_mr   *mr;
+   struct ib_mr   *rkey_mr;
u64 mr_page_mask;
int mr_page_size;
int mr_max_size;
-- 
2.1.4



[PATCH v2 00/12] IB: Replace safe uses for ib_get_dma_mr with pd->local_dma_lkey

2015-07-30 Thread Jason Gunthorpe
This series moves handling of the safe all-physical MR:

  ib_get_dma_mr(pd, IB_ACCESS_LOCAL_WRITE);

into ib_alloc_pd, and in the process makes the global local_dma_lkey
functionality broadly available to all ULPs.

The remaining users of ib_get_dma_mr are all unsafe:
 drivers/infiniband/ulp/iser/iser_verbs.c:
device->mr = ib_get_dma_mr(device->pd, IB_ACCESS_LOCAL_WRITE |
   IB_ACCESS_REMOTE_WRITE |
   IB_ACCESS_REMOTE_READ);

 drivers/infiniband/ulp/srp/ib_srp.c:
srp_dev->mr = ib_get_dma_mr(srp_dev->pd,
IB_ACCESS_LOCAL_WRITE |
IB_ACCESS_REMOTE_READ |
IB_ACCESS_REMOTE_WRITE);

 drivers/staging/lustre/lnet/klnds/o2iblnd/o2iblnd.c:
int acflags = IB_ACCESS_LOCAL_WRITE | IB_ACCESS_REMOTE_WRITE;
mr = ib_get_dma_mr(hdev->ibh_pd, acflags);

 net/rds/iw.c:
rds_iwdev->mr = ib_get_dma_mr(rds_iwdev->pd,
IB_ACCESS_REMOTE_READ |
IB_ACCESS_REMOTE_WRITE |
IB_ACCESS_LOCAL_WRITE);

 net/sunrpc/xprtrdma/svc_rdma_transport.c:
if (rdma_protocol_iwarp(newxprt->sc_cm_id->device,
newxprt->sc_cm_id->port_num) &&
!(newxprt->sc_dev_caps & SVCRDMA_DEVCAP_FAST_REG))
dma_mr_acc |= IB_ACCESS_REMOTE_WRITE;
newxprt->sc_phys_mr =
ib_get_dma_mr(newxprt->sc_pd, dma_mr_acc);

 net/sunrpc/xprtrdma/verbs.c:
case RPCRDMA_ALLPHYSICAL:
ia->ri_ops = &rpcrdma_physical_memreg_ops;
mem_priv = IB_ACCESS_LOCAL_WRITE |
IB_ACCESS_REMOTE_WRITE |
IB_ACCESS_REMOTE_READ;
ia->ri_bind_mem = ib_get_dma_mr(ia->ri_pd, mem_priv);

Calling ib_get_dma_mr with IB_ACCESS_REMOTE_* flags is considered to be a
serious security problem and should not be done without the user directly
opting in to an off-by-default scheme. The call allows the peer on the QP
unrestricted access to local physical memory if they can guess the rkey value.
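
As a rough illustration of the rule above, the following userspace C sketch classifies an access mask the way this series does: LOCAL_WRITE alone is the safe case, any REMOTE_* bit is the unsafe one. The `DMA_SKETCH_*` names and values are hypothetical placeholders for the kernel's IB_ACCESS_* flags, not the real definitions.

```c
#include <stdbool.h>

/* Hypothetical placeholders for IB_ACCESS_* (values are arbitrary). */
enum {
    DMA_SKETCH_LOCAL_WRITE  = 1u << 0,
    DMA_SKETCH_REMOTE_WRITE = 1u << 1,
    DMA_SKETCH_REMOTE_READ  = 1u << 2
};

/*
 * True when an all-physical MR created with these flags would hand the
 * peer an rkey covering local physical memory.
 */
bool dma_sketch_mr_is_unsafe(unsigned int access_flags)
{
    return (access_flags &
            (DMA_SKETCH_REMOTE_WRITE | DMA_SKETCH_REMOTE_READ)) != 0;
}
```

Every call site listed above passes at least one REMOTE_* flag on some path, so each would be classified as unsafe by this check.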

A future series will cause the kernel to be tainted by the above call sites to
promote migrating away from this.

To Migrate:
 * If ib_get_dma_mr was being used to get an lkey then use
   local_dma_lkey instead (I believe this series gets all of those cases).

   If the lkey is being used for RDMA_READ, and iWarp support is required then
   iWarp must be detected and FRMR must be used to create a limited temporary
   MR just for the RDMA_READ. (e.g. NFS, RDS)

 * If ib_get_dma_mr was being used to get an rkey then use FRMR to create
   limited temporary MRs (e.g. SRP, iSER, etc.)
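
The migration rules above can be summarized as a small decision function. This is only a sketch in userspace C; the enum names and `mig_sketch_choice()` are made up for illustration and do not exist in the kernel.

```c
#include <stdbool.h>

/* How a ULP was using the key it got from ib_get_dma_mr (hypothetical names). */
enum mig_sketch_use {
    MIG_SKETCH_LKEY,            /* plain local lkey for send/recv buffers */
    MIG_SKETCH_LKEY_RDMA_READ,  /* lkey used as the local side of an RDMA_READ */
    MIG_SKETCH_RKEY             /* rkey handed to the peer */
};

/* What the cover letter suggests using instead. */
enum mig_sketch_target {
    MIG_SKETCH_USE_LOCAL_DMA_LKEY,
    MIG_SKETCH_USE_FRMR
};

enum mig_sketch_target mig_sketch_choice(enum mig_sketch_use use,
                                         bool need_iwarp)
{
    switch (use) {
    case MIG_SKETCH_RKEY:
        /* Any rkey should come from a limited temporary MR. */
        return MIG_SKETCH_USE_FRMR;
    case MIG_SKETCH_LKEY_RDMA_READ:
        /* Only iWarp needs the temporary MR for RDMA_READ. */
        return need_iwarp ? MIG_SKETCH_USE_FRMR
                          : MIG_SKETCH_USE_LOCAL_DMA_LKEY;
    case MIG_SKETCH_LKEY:
    default:
        return MIG_SKETCH_USE_LOCAL_DMA_LKEY;
    }
}
```

For example, a ULP in the NFS or RDS position that must support iWarp would land on FRMR for its RDMA_READ path, while its ordinary send and receive buffers would use local_dma_lkey.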

All patches are compile tested. I've done basic testing up to and including
the IPoIB patch, the rest required specialized setups I don't have access to,
but are fairly straightforward.

Jason Gunthorpe (12):
  IB/core: Guarantee that a local_dma_lkey is available
  IB/mad: Remove ib_get_dma_mr calls
  IB/ipoib: Remove ib_get_dma_mr calls
  IB/mlx4: Remove ib_get_dma_mr calls
  IB/mlx5: Remove ib_get_dma_mr calls
  IB/iser: Use pd->local_dma_lkey
  iser-target: Remove ib_get_dma_mr calls
  IB/srp: Use pd->local_dma_lkey
  IB/srp: Do not create an all physical insecure rkey by default
  ib_srpt: Remove ib_get_dma_mr calls
  net/9p: Remove ib_get_dma_mr calls
  rds/ib: Remove ib_get_dma_mr calls

 drivers/infiniband/core/mad.c| 26 ++-
 drivers/infiniband/core/mad_priv.h   |  1 -
 drivers/infiniband/core/verbs.c  | 47 +---
 drivers/infiniband/hw/mlx4/mad.c | 23 +++---
 drivers/infiniband/hw/mlx4/mlx4_ib.h |  1 -
 drivers/infiniband/hw/mlx5/main.c| 13 
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  1 -
 drivers/infiniband/hw/mlx5/mr.c  |  5 ++-
 drivers/infiniband/ulp/ipoib/ipoib.h |  1 -
 drivers/infiniband/ulp/ipoib/ipoib_cm.c  |  2 +-
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c   | 18 ++-
 drivers/infiniband/ulp/iser/iscsi_iser.c |  2 +-
 drivers/infiniband/ulp/iser/iser_initiator.c |  8 ++---
 drivers/infiniband/ulp/iser/iser_memory.c|  2 +-
 drivers/infiniband/ulp/iser/iser_verbs.c |  2 +-
 drivers/infiniband/ulp/isert/ib_isert.c  | 33 +++
 drivers/infiniband/ulp/isert/ib_isert.h  |  1 -
 drivers/infiniband/ulp/srp/ib_srp.c  | 33 ---
 drivers/infiniband/ulp/srp/ib_srp.h  |  2 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c| 15 +++--
 drivers/infiniband/ulp/srpt/ib_srpt.h|  1 -
 include/rdma/ib_mad.h|  1 -
 include/rdma/ib_verbs.h  |  9 ++
 net/9p/trans_rdma.c  | 26 ++-
 

[PATCH v2 06/12] IB/iser: Use pd->local_dma_lkey

2015-07-30 Thread Jason Gunthorpe
Replace all lkeys with pd->local_dma_lkey. This driver does not support
iWarp, so this is safe.

The insecure use of ib_get_dma_mr is thus isolated to an rkey, and this
looks trivially fixable by forcing the use of registration in a future
patch.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
Reviewed-by: Sagi Grimberg sa...@mellanox.com
---
 drivers/infiniband/ulp/iser/iscsi_iser.c | 2 +-
 drivers/infiniband/ulp/iser/iser_initiator.c | 8 
 drivers/infiniband/ulp/iser/iser_memory.c| 2 +-
 drivers/infiniband/ulp/iser/iser_verbs.c | 2 +-
 4 files changed, 7 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 6a594aac2290..f44c6b879329 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -204,7 +204,7 @@ iser_initialize_task_headers(struct iscsi_task *task,
tx_desc->dma_addr = dma_addr;
tx_desc->tx_sg[0].addr   = tx_desc->dma_addr;
tx_desc->tx_sg[0].length = ISER_HEADERS_LEN;
-   tx_desc->tx_sg[0].lkey   = device->mr->lkey;
+   tx_desc->tx_sg[0].lkey   = device->pd->local_dma_lkey;
 
iser_task->iser_conn = iser_conn;
 out:
diff --git a/drivers/infiniband/ulp/iser/iser_initiator.c b/drivers/infiniband/ulp/iser/iser_initiator.c
index 3e2118e8ed87..2d02f042c69a 100644
--- a/drivers/infiniband/ulp/iser/iser_initiator.c
+++ b/drivers/infiniband/ulp/iser/iser_initiator.c
@@ -173,8 +173,8 @@ static void iser_create_send_desc(struct iser_conn *iser_conn,
 
tx_desc->num_sge = 1;
 
-   if (tx_desc->tx_sg[0].lkey != device->mr->lkey) {
-   tx_desc->tx_sg[0].lkey = device->mr->lkey;
+   if (tx_desc->tx_sg[0].lkey != device->pd->local_dma_lkey) {
+   tx_desc->tx_sg[0].lkey = device->pd->local_dma_lkey;
iser_dbg("sdesc %p lkey mismatch, fixing\n", tx_desc);
}
 }
@@ -291,7 +291,7 @@ int iser_alloc_rx_descriptors(struct iser_conn *iser_conn,
rx_sg = &rx_desc->rx_sg;
rx_sg->addr   = rx_desc->dma_addr;
rx_sg->length = ISER_RX_PAYLOAD_SIZE;
-   rx_sg->lkey   = device->mr->lkey;
+   rx_sg->lkey   = device->pd->local_dma_lkey;
}
 
iser_conn->rx_desc_head = 0;
@@ -543,7 +543,7 @@ int iser_send_control(struct iscsi_conn *conn,
 
tx_dsg->addr    = iser_conn->login_req_dma;
tx_dsg->length  = task->data_count;
-   tx_dsg->lkey    = device->mr->lkey;
+   tx_dsg->lkey    = device->pd->local_dma_lkey;
mdesc->num_sge = 2;
}
 
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index f0cdc961eb11..3129a42150ff 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -393,7 +393,7 @@ iser_reg_dma(struct iser_device *device, struct iser_data_buf *mem,
 {
struct scatterlist *sg = mem->sg;
 
-   reg->sge.lkey = device->mr->lkey;
+   reg->sge.lkey = device->pd->local_dma_lkey;
reg->rkey = device->mr->rkey;
reg->sge.addr = ib_sg_dma_address(device->ib_device, &sg[0]);
reg->sge.length = ib_sg_dma_len(device->ib_device, &sg[0]);
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index 5c9f565ea0e8..52268356c79e 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -1017,7 +1017,7 @@ int iser_post_recvl(struct iser_conn *iser_conn)
 
sge.addr   = iser_conn->login_resp_dma;
sge.length = ISER_RX_LOGIN_SIZE;
-   sge.lkey   = ib_conn->device->mr->lkey;
+   sge.lkey   = ib_conn->device->pd->local_dma_lkey;
 
rx_wr.wr_id   = (uintptr_t)iser_conn->login_resp_buf;
rx_wr.sg_list = &sge;
-- 
2.1.4



[PATCH v2 08/12] IB/srp: Use pd->local_dma_lkey

2015-07-30 Thread Jason Gunthorpe
Replace all lkeys with pd->local_dma_lkey. This driver does not support
iWarp, so this is safe.

The insecure use of ib_get_dma_mr is thus isolated to an rkey, and will
have to be fixed separately.

Signed-off-by: Jason Gunthorpe jguntho...@obsidianresearch.com
Reviewed-by: Sagi Grimberg sa...@mellanox.com
---
 drivers/infiniband/ulp/srp/ib_srp.c | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c
index 31a20b462266..19a1356f8b2a 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -3146,7 +3146,7 @@ static ssize_t srp_create_target(struct device *dev,
target->io_class    = SRP_REV16A_IB_IO_CLASS;
target->scsi_host   = target_host;
target->srp_host    = host;
-   target->lkey    = host->srp_dev->mr->lkey;
+   target->lkey    = host->srp_dev->pd->local_dma_lkey;
target->rkey    = host->srp_dev->mr->rkey;
target->cmd_sg_cnt  = cmd_sg_entries;
target->sg_tablesize    = indirect_sg_entries ? : cmd_sg_entries;
-- 
2.1.4



RE: [PATCH v4 17/50] IB/hfi1: add PSM driver control/data path

2015-07-30 Thread Marciniszyn, Mike
> On 07/30/2015 04:01 PM, Marciniszyn, Mike wrote:
>
> > I thought you were getting rid of this?
>
> > Jason
>
> > Doug wanted the v4 submitted as we currently have it.
>
> To be accurate, I said "If you want a chance at making 4.3, I need a v4."  I
> didn't comment on whether or not any specific review comments were
> addressed.
>
> > Doug?
>
> I have no problem with this code.  That Al finds the user space ABI for this
> driver to be bizarre is neither here nor there to me.  Sure, this file does
> not exhibit normal file API behavior.  Who cares?  It's not a normal file in
> *any* sense of the word.  For example, the normal write routine will never,
> ever accept just plain data.  It's always in the form of a command.  If you
> don't have the right magic decoder ring, you will get nothing but errors when
> trying to do something with this file.
> Much like /dev/infiniband/uverbs? files, it is a command interface, not a
> raw data interface.  I actually think the fact that you guys use write for a
> single command and writev/write_iter for a command queue is an elegant
> solution to your particular needs.  The only reason Al threw a hissy over it
> is because it tripped him up when he went to do the conversion from writev to
> write_iter.  That's understandable.  So, some clear documentation so
> someone like Al doesn't have to go reading through sources and try to figure
> out what you are doing would be the generally nice thing to do for other
> kernel generalists that might come poking around this way.  Or, another
> option would be to drop the write function altogether and just make all
> commands come through writev/write_iter and if you only have one
> command, you only send one element.  Regardless, those things can be
> cleaned up in follow on patches, please do not resubmit this set for that.
>

Jason,

I did ask you in http://marc.info/?l=linux-rdma&m=143707462806767&w=2 if you
thought ioctl was ok.

Hearing nothing, we left the interface as it was.

I suspect (I lack the early history) that the ioctl BKL might have forced both
uverbs and PSM to go this route.

Doug,

Where would be the appropriate location to document?  In the source itself?
Somewhere else?

Mike


Re: [PATCH v4 40/50] IB/hfi1: add sysfs routines and documentation

2015-07-30 Thread Jason Gunthorpe
On Thu, Jul 30, 2015 at 10:28:42PM +0000, Marciniszyn, Mike wrote:
> > > +   board_id - manufacturing board id
>
> There is no PCIe equivalent.
>
> > > +   serial - board serial number
>
> No PCIe equivalent.
>
> > > +   boardversion - board version

These all have PCI-E versions. Most should live in the config space
VPD.

01:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Capabilities: [48] Vital Product Data
Product Name: CX354A - ConnectX-3 QSFP
Read-only fields:
[PN] Part number: MCX354A-QCBT 
[EC] Engineering changes: A1
[SN] Serial number: MT1151X00841
[V0] Vendor specific: PCIe Gen3 x8
[RV] Reserved: checksum good, 0 byte(s) reserved

There is room for various vendor-specific things in the VPD if you
need.

I'm not sure, but I kinda remember there is even a PCI-E standard for
the temp sensor these days..

Jason


Re: [PATCH] IB/ipoib: CSUM support in connected mode

2015-07-30 Thread Bart Van Assche

On 07/30/15 13:09, Yuval Shaia wrote:

On Thu, Jul 30, 2015 at 09:38:54AM -0700, Bart Van Assche wrote:

On 07/30/2015 04:46 AM, Yuval Shaia wrote:

  struct ipoib_cm_data {
__be32 qpn; /* High byte MUST be ignored on receive */
__be32 mtu;
+   __be16 sig; /* must be IPOIB_CM_PROTO_SIG */
+   __be16 caps; /* 4 bits proto ver and 12 bits capabilities */
  };


This patch modifies the private login data format that has been
standardized by the IETF in RFC 4755. Has this modification already
been discussed with the IETF ?


Yes.


Hello Yuval,

It should have been mentioned in the patch description that this patch 
modifies the wire format, how it modifies the wire format, and also 
which feedback (if any) has been received so far from the IETF.
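
To make the request concrete, here is a sketch of the kind of wire-format description that could accompany the patch, written as userspace C. The quoted hunk only says that `caps` holds 4 bits of protocol version and 12 bits of capabilities; placing the version in the top four bits is purely an assumption made for this example, and all `caps_sketch_*` names are hypothetical.

```c
#include <stdint.h>

/* ASSUMPTION: version in bits 15..12, capabilities in bits 11..0. */
uint16_t caps_sketch_pack(unsigned int proto_ver, unsigned int caps)
{
    return (uint16_t)(((proto_ver & 0xFu) << 12) | (caps & 0xFFFu));
}

unsigned int caps_sketch_proto_ver(uint16_t field)
{
    return (field >> 12) & 0xFu;
}

unsigned int caps_sketch_bits(uint16_t field)
{
    return field & 0xFFFu;
}

/* Store a 16-bit value in big-endian (network) byte order, as __be16 implies. */
void caps_sketch_put_be16(uint8_t out[2], uint16_t value)
{
    out[0] = (uint8_t)(value >> 8);
    out[1] = (uint8_t)(value & 0xFFu);
}
```

Under that assumed layout, version 1 with capability bit 0 set would be carried as the two bytes 0x10 0x01 after the existing qpn and mtu fields.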


Thanks,

Bart.


[PATCH v4 07/50] IB/hfi1: add chip register definitions

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/chip_registers.h | 1289 +++
 1 file changed, 1289 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/chip_registers.h

diff --git a/drivers/infiniband/hw/hfi1/chip_registers.h 
b/drivers/infiniband/hw/hfi1/chip_registers.h
new file mode 100644
index 000..6521030
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/chip_registers.h
@@ -0,0 +1,1289 @@
+#ifndef DEF_CHIP_REG
+#define DEF_CHIP_REG
+
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#define CORE   0x
+#define CCE (CORE + 0x)
+#define ASIC   (CORE + 0x0040)
+#define MISC   (CORE + 0x0050)
+#define DC_TOP_CSRS (CORE + 0x0060)
+#define CHIP_DEBUG (CORE + 0x0070)
+#define RXE (CORE + 0x0100)
+#define TXE (CORE + 0x0180)
+#define DCC_CSRS   (DC_TOP_CSRS + 0x)
+#define DC_LCB_CSRS (DC_TOP_CSRS + 0x1000)
+#define DC_8051_CSRS   (DC_TOP_CSRS + 0x2000)
+#define PCIE   0
+
+#define ASIC_NUM_SCRATCH 4
+#define CCE_ERR_INT_CNT 0
+#define CCE_MISC_INT_CNT 2
+#define CCE_NUM_32_BIT_COUNTERS 3
+#define CCE_NUM_32_BIT_INT_COUNTERS 6
+#define CCE_NUM_INT_CSRS 12
+#define CCE_NUM_INT_MAP_CSRS 96
+#define CCE_NUM_MSIX_PBAS 4
+#define CCE_NUM_MSIX_VECTORS 256
+#define CCE_NUM_SCRATCH 4
+#define CCE_PCIE_POSTED_CRDT_STALL_CNT 2
+#define CCE_PCIE_TRGT_STALL_CNT 0

[PATCH v4 09/50] IB/hfi1: add common header file definitions

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/common.h |  415 +++
 1 file changed, 415 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/common.h

diff --git a/drivers/infiniband/hw/hfi1/common.h 
b/drivers/infiniband/hw/hfi1/common.h
new file mode 100644
index 000..5f22937
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/common.h
@@ -0,0 +1,415 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#ifndef _COMMON_H
+#define _COMMON_H
+
+#include <rdma/hfi/hfi1_user.h>
+
+/*
+ * This file contains defines, structures, etc. that are used
+ * to communicate between kernel and user code.
+ */
+
+/* version of protocol header (known to chip also). In the long run,
+ * we should be able to generate and accept a range of version numbers;
+ * for now we only accept one, and it's compiled in.
+ */
+#define IPS_PROTO_VERSION 2
+
+/*
+ * These are compile time constants that you may want to enable or disable
+ * if you are trying to debug problems with code or performance.
+ * HFI1_VERBOSE_TRACING define as 1 if you want additional tracing in
+ * fast path code
+ * HFI1_TRACE_REGWRITES define as 1 if you want register writes to be
+ * traced in fast path code
+ * _HFI1_TRACING define as 0 if you want to remove all tracing in a
+ * compilation unit
+ */
+
+/*
+ * If a packet's QP[23:16] bits match this value, then it is
+ * a PSM packet and the hardware will expect a KDETH header
+ * following the BTH.
+ */
+#define DEFAULT_KDETH_QP 0x80
+
+/* driver/hw feature set bitmask */
+#define 
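
The comment above DEFAULT_KDETH_QP says a packet is treated as a PSM packet when bits 23:16 of its QP number match that value. A minimal sketch of that test follows. The helper name qpn_is_kdeth is hypothetical and is not taken from the driver.

```c
#include <stdbool.h>
#include <stdint.h>

#define DEFAULT_KDETH_QP 0x80

/* True when bits 23:16 of a 24-bit QP number equal the KDETH prefix. */
static bool qpn_is_kdeth(uint32_t qpn, uint8_t kdeth_qp)
{
    return ((qpn >> 16) & 0xFF) == kdeth_qp;
}
```

With the default prefix, QP numbers 0x800000 through 0x80ffff would match under this reading of the comment.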

[PATCH v4 22/50] IB/hfi1: add progress delay/restart hooks

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/iowait.h |  186 +++
 1 file changed, 186 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/iowait.h

diff --git a/drivers/infiniband/hw/hfi1/iowait.h 
b/drivers/infiniband/hw/hfi1/iowait.h
new file mode 100644
index 000..fa361b4
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/iowait.h
@@ -0,0 +1,186 @@
+#ifndef _HFI1_IOWAIT_H
+#define _HFI1_IOWAIT_H
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <linux/list.h>
+#include <linux/workqueue.h>
+#include <linux/sched.h>
+
+/*
+ * typedef (*restart_t)() - restart callback
+ * @work: pointer to work structure
+ */
+typedef void (*restart_t)(struct work_struct *work);
+
+struct sdma_txreq;
+struct sdma_engine;
+/**
+ * struct iowait - linkage for delayed progress/waiting
+ * @list: used to add/insert into QP/PQ wait lists
+ * @tx_head: overflow list of sdma_txreq's
+ * @sleep: no space callback
+ * @wakeup: space callback
+ * @iowork: workqueue overhead
+ * @wait_dma: wait for sdma_busy == 0
+ * @sdma_busy: # of packets in flight
+ * @count: total number of descriptors in tx_head'ed list
+ * @tx_limit: limit for overflow queuing
+ * @tx_count: number of tx entry's in tx_head'ed list
+ *
+ * This is to be embedded in user's state structure
+ * (QP or PQ).
+ *
+ * The sleep and wakeup members are a
+ * bit misnamed.   They do not strictly
+ * speaking sleep or wake up, but they
+ * are callbacks for the ULP to implement
+ * what ever queuing/dequeuing of
+ * the embedded iowait and 

[PATCH v4 24/50] IB/hfi1: add OPA mad handling part1

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/mad.c | 2757 ++
 1 file changed, 2757 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/mad.c

diff --git a/drivers/infiniband/hw/hfi1/mad.c b/drivers/infiniband/hw/hfi1/mad.c
new file mode 100644
index 000..034e284
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/mad.c
@@ -0,0 +1,2757 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <linux/net.h>
+#define OPA_NUM_PKEY_BLOCKS_PER_SMP (OPA_SMP_DR_DATA_SIZE \
+   / (OPA_PARTITION_TABLE_BLK_SIZE * sizeof(u16)))
+
+#include "hfi.h"
+#include "mad.h"
+#include "trace.h"
+
+/* the reset value from the FM is supposed to be 0x, handle both */
+#define OPA_LINK_WIDTH_RESET_OLD 0x0fff
+#define OPA_LINK_WIDTH_RESET 0x
+
+static int reply(struct ib_mad_hdr *smp)
+{
+   /*
+* The verbs framework will handle the directed/LID route
+* packet changes.
+*/
+   smp->method = IB_MGMT_METHOD_GET_RESP;
+   if (smp->mgmt_class == IB_MGMT_CLASS_SUBN_DIRECTED_ROUTE)
+   smp->status |= IB_SMP_DIRECTION;
+   return IB_MAD_RESULT_SUCCESS | IB_MAD_RESULT_REPLY;
+}
+
+static inline void clear_opa_smp_data(struct opa_smp *smp)
+{
+   void *data = opa_get_smp_data(smp);
+   size_t size = opa_get_smp_data_size(smp);
+
+   memset(data, 0, size);
+}
+
+static void send_trap(struct hfi1_ibport *ibp, void *data, unsigned len)
+{
+   struct ib_mad_send_buf *send_buf;
+   struct ib_mad_agent *agent;
+   

[PATCH v4 23/50] IB/hfi1: add rkey/lkey validation

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/keys.c |  411 +
 1 file changed, 411 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/keys.c

diff --git a/drivers/infiniband/hw/hfi1/keys.c 
b/drivers/infiniband/hw/hfi1/keys.c
new file mode 100644
index 000..f6eff17
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/keys.c
@@ -0,0 +1,411 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include "hfi.h"
+
+/**
+ * hfi1_alloc_lkey - allocate an lkey
+ * @mr: memory region that this lkey protects
+ * @dma_region: 0-normal key, 1-restricted DMA key
+ *
+ * Returns 0 if successful, otherwise returns -errno.
+ *
+ * Increments mr reference count as required.
+ *
+ * Sets the lkey field mr for non-dma regions.
+ *
+ */
+
+int hfi1_alloc_lkey(struct hfi1_mregion *mr, int dma_region)
+{
+   unsigned long flags;
+   u32 r;
+   u32 n;
+   int ret = 0;
+   struct hfi1_ibdev *dev = to_idev(mr->pd->device);
+   struct hfi1_lkey_table *rkt = &dev->lk_table;
+
+   hfi1_get_mr(mr);
+   spin_lock_irqsave(&rkt->lock, flags);
+
+   /* special case for dma_mr lkey == 0 */
+   if (dma_region) {
+   struct hfi1_mregion *tmr;
+
+   tmr = rcu_access_pointer(dev->dma_mr);
+   if (!tmr) {
+   rcu_assign_pointer(dev->dma_mr, mr);
+   mr->lkey_published = 1;
+   } else {
+   hfi1_put_mr(mr);
+   }
+   goto success;
+   }
+
+   /* 

[PATCH v4 30/50] IB/hfi1: add pcie routines

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/pcie.c | 1253 +
 1 file changed, 1253 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/pcie.c

diff --git a/drivers/infiniband/hw/hfi1/pcie.c 
b/drivers/infiniband/hw/hfi1/pcie.c
new file mode 100644
index 000..ac5653c
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/pcie.c
@@ -0,0 +1,1253 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * AS IS AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+#include <linux/pci.h>
+#include <linux/io.h>
+#include <linux/delay.h>
+#include <linux/vmalloc.h>
+#include <linux/aer.h>
+#include <linux/module.h>
+
+#include "hfi.h"
+#include "chip_registers.h"
+
+/* link speed vector for Gen3 speed - not in Linux headers */
+#define GEN1_SPEED_VECTOR 0x1
+#define GEN2_SPEED_VECTOR 0x2
+#define GEN3_SPEED_VECTOR 0x3
+
+/*
+ * This file contains PCIe utility routines.
+ */
+
+/*
+ * Code to adjust PCIe capabilities.
+ */
+static void tune_pcie_caps(struct hfi1_devdata *);
+
+/*
+ * Do all the common PCIe setup and initialization.
+ * devdata is not yet allocated, and is not allocated until after this
+ * routine returns success.  Therefore dd_dev_err() can't be used for error
+ * printing.
+ */
+int hfi1_pcie_init(struct pci_dev *pdev, const struct pci_device_id *ent)
+{
+   int ret;
+
+   ret = pci_enable_device(pdev);
+   if (ret) {
+   /*
+* This can happen (in theory) iff:
+* We did a chip reset, and then failed to reprogram the
+* BAR, or the chip reset due to an internal error.  We then
+ 

[PATCH v4 25/50] IB/hfi1: add OPA mad handling part2

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/mad.c | 1502 ++
 1 file changed, 1501 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/hw/hfi1/mad.c b/drivers/infiniband/hw/hfi1/mad.c
index 034e284..0a18fee 100644
--- a/drivers/infiniband/hw/hfi1/mad.c
+++ b/drivers/infiniband/hw/hfi1/mad.c
@@ -2754,4 +2754,1504 @@ static int pma_get_opa_porterrors(struct opa_pma_mad *pmp,
 
if (response_data_size > sizeof(pmp->data)) {
pmp->mad_hdr.status |= IB_SMP_INVALID_FIELD;
-   return r
\ No newline at end of file
+   return reply((struct ib_mad_hdr *)pmp);
+   }
+   /*
+* The bit set in the mask needs to be consistent with the
+* port the request came in on.
+*/
+   port_mask = be64_to_cpu(req->port_select_mask[3]);
+   port_num = find_first_bit((unsigned long *)&port_mask,
+   sizeof(port_mask));
+
+   if ((u8)port_num != port) {
+   pmp->mad_hdr.status |= IB_SMP_INVALID_FIELD;
+   return reply((struct ib_mad_hdr *)pmp);
+   }
+
+   rsp = (struct _port_ectrs *)&(req->port[0]);
+
+   ibp = to_iport(ibdev, port_num);
+   ppd = ppd_from_ibp(ibp);
+
+   memset(rsp, 0, sizeof(*rsp));
+   rsp->port_number = (u8)port_num;
+
+   rsp->port_rcv_constraint_errors =
+   cpu_to_be64(read_port_cntr(ppd, C_SW_RCV_CSTR_ERR,
+  CNTR_INVALID_VL));
+   /* port_rcv_switch_relay_errors is 0 for HFIs */
+   rsp->port_xmit_discards =
+   cpu_to_be64(read_port_cntr(ppd, C_SW_XMIT_DSCD,
+   CNTR_INVALID_VL));
+   rsp->port_rcv_remote_physical_errors =
+   cpu_to_be64(read_dev_cntr(dd, C_DC_RMT_PHY_ERR,
+   CNTR_INVALID_VL));
+   tmp = read_dev_cntr(dd, C_DC_RX_REPLAY, CNTR_INVALID_VL);
+   tmp2 = tmp + read_dev_cntr(dd, C_DC_TX_REPLAY, CNTR_INVALID_VL);
+   if (tmp2 < tmp) {
+   /* overflow/wrapped */
+   rsp->local_link_integrity_errors = cpu_to_be64(~0);
+   } else {
+   rsp->local_link_integrity_errors = cpu_to_be64(tmp2);
+   }
+   tmp = read_dev_cntr(dd, C_DC_SEQ_CRC_CNT, CNTR_INVALID_VL);
+   tmp2 = tmp + read_dev_cntr(dd, C_DC_REINIT_FROM_PEER_CNT,
+   CNTR_INVALID_VL);
+   if (tmp2 > (u32)UINT_MAX || tmp2 < tmp) {
+   /* overflow/wrapped */
+   rsp->link_error_recovery = cpu_to_be32(~0);
+   } else {
+   rsp->link_error_recovery = cpu_to_be32(tmp2);
+   }
+   rsp->port_xmit_constraint_errors =
+   cpu_to_be64(read_port_cntr(ppd, C_SW_XMIT_CSTR_ERR,
+  CNTR_INVALID_VL));
+   rsp->excessive_buffer_overruns =
+   cpu_to_be64(read_dev_cntr(dd, C_RCV_OVF, CNTR_INVALID_VL));
+   rsp->fm_config_errors =
+   cpu_to_be64(read_dev_cntr(dd, C_DC_FM_CFG_ERR,
+   CNTR_INVALID_VL));
+   rsp->link_downed = cpu_to_be32(read_port_cntr(ppd, C_SW_LINK_DOWN,
+   CNTR_INVALID_VL));
+   tmp = read_dev_cntr(dd, C_DC_UNC_ERR, CNTR_INVALID_VL);
+   rsp->uncorrectable_errors = tmp < 0x100 ? (tmp & 0xff) : 0xff;
+
+   vlinfo = (struct _vls_ectrs *)&(rsp->vls[0]);
+   vfi = 0;
+   vl_select_mask = be32_to_cpu(req->vl_select_mask);
+   for_each_set_bit(vl, (unsigned long *)&(vl_select_mask),
+8 * sizeof(req->vl_select_mask)) {
+   memset(vlinfo, 0, sizeof(*vlinfo));
+   /* 

[PATCH v4 26/50] IB/hfi1: add local mad header

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/mad.h |  325 ++
 1 file changed, 325 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/mad.h

diff --git a/drivers/infiniband/hw/hfi1/mad.h b/drivers/infiniband/hw/hfi1/mad.h
new file mode 100644
index 000..4745750
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/mad.h
@@ -0,0 +1,325 @@
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+#ifndef _HFI1_MAD_H
+#define _HFI1_MAD_H
+
+#include <rdma/ib_pma.h>
+#define USE_PI_LED_ENABLE  1 /* use led enabled bit in struct
+  * opa_port_states, if available */
+#include <rdma/opa_smi.h>
+#include <rdma/opa_port_info.h>
+#ifndef PI_LED_ENABLE_SUP
+#define PI_LED_ENABLE_SUP 0
+#endif
+#include "opa_compat.h"
+
+
+
+#define IB_VLARB_LOWPRI_0_31    1
+#define IB_VLARB_LOWPRI_32_63   2
+#define IB_VLARB_HIGHPRI_0_31   3
+#define IB_VLARB_HIGHPRI_32_63  4
+
+#define OPA_MAX_PREEMPT_CAP 32
+#define OPA_VLARB_LOW_ELEMENTS   0
+#define OPA_VLARB_HIGH_ELEMENTS  1
+#define OPA_VLARB_PREEMPT_ELEMENTS   2
+#define OPA_VLARB_PREEMPT_MATRIX 3
+
+#define IB_PMA_PORT_COUNTERS_CONG   cpu_to_be16(0xFF00)
+
+struct ib_pma_portcounters_cong {
+   u8 reserved;
+   u8 reserved1;
+   __be16 port_check_rate;
+   __be16 symbol_error_counter;
+   u8 link_error_recovery_counter;
+   u8 link_downed_counter;
+   __be16 port_rcv_errors;
+   __be16 port_rcv_remphys_errors;
+   __be16 port_rcv_switch_relay_errors;
+   __be16 

[PATCH v4 29/50] IB/hfi1: add misc OPA defines

2015-07-30 Thread Mike Marciniszyn
Signed-off-by: Andrew Friedley andrew.fried...@intel.com
Signed-off-by: Arthur Kepner arthur.kep...@intel.com
Signed-off-by: Brendan Cunningham brendan.cunning...@intel.com
Signed-off-by: Brian Welty brian.we...@intel.com
Signed-off-by: Caz Yokoyama caz.yokoy...@intel.com
Signed-off-by: Dean Luick dean.lu...@intel.com
Signed-off-by: Dennis Dalessandro dennis.dalessan...@intel.com
Signed-off-by: Easwar Hariharan easwar.hariha...@intel.com
Signed-off-by: Harish Chegondi harish.chego...@intel.com
Signed-off-by: Ira Weiny ira.we...@intel.com
Signed-off-by: Jim Snow jim.m.s...@intel.com
Signed-off-by: John Gregor john.a.gre...@intel.com
Signed-off-by: Jubin John jubin.j...@intel.com
Signed-off-by: Kaike Wan kaike@intel.com
Signed-off-by: Kevin Pine kevin.p...@intel.com
Signed-off-by: Kyle Liddell kyle.lidd...@intel.com
Signed-off-by: Mike Marciniszyn mike.marcinis...@intel.com
Signed-off-by: Mitko Haralanov mitko.harala...@intel.com
Signed-off-by: Ravi Krishnaswamy ravi.krishnasw...@intel.com
Signed-off-by: Sadanand Warrier sadanand.warr...@intel.com
Signed-off-by: Sanath Kumar sanath.s.ku...@intel.com
Signed-off-by: Sudeep Dutt sudeep.d...@intel.com
Signed-off-by: Vlad Danushevsky vladimir.danusev...@intel.com
---
 drivers/infiniband/hw/hfi1/opa_compat.h |  129 +++
 1 file changed, 129 insertions(+)
 create mode 100644 drivers/infiniband/hw/hfi1/opa_compat.h

diff --git a/drivers/infiniband/hw/hfi1/opa_compat.h 
b/drivers/infiniband/hw/hfi1/opa_compat.h
new file mode 100644
index 000..f64eec1
--- /dev/null
+++ b/drivers/infiniband/hw/hfi1/opa_compat.h
@@ -0,0 +1,129 @@
+#ifndef _LINUX_H
+#define _LINUX_H
+/*
+ *
+ * This file is provided under a dual BSD/GPLv2 license.  When using or
+ * redistributing this file, you may do so under either license.
+ *
+ * GPL LICENSE SUMMARY
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * This program is free software; you can redistribute it and/or modify
+ * it under the terms of version 2 of the GNU General Public License as
+ * published by the Free Software Foundation.
+ *
+ * This program is distributed in the hope that it will be useful, but
+ * WITHOUT ANY WARRANTY; without even the implied warranty of
+ * MERCHANTABILITY or FITNESS FOR A PARTICULAR PURPOSE.  See the GNU
+ * General Public License for more details.
+ *
+ * BSD LICENSE
+ *
+ * Copyright(c) 2015 Intel Corporation.
+ *
+ * Redistribution and use in source and binary forms, with or without
+ * modification, are permitted provided that the following conditions
+ * are met:
+ *
+ *  - Redistributions of source code must retain the above copyright
+ *notice, this list of conditions and the following disclaimer.
+ *  - Redistributions in binary form must reproduce the above copyright
+ *notice, this list of conditions and the following disclaimer in
+ *the documentation and/or other materials provided with the
+ *distribution.
+ *  - Neither the name of Intel Corporation nor the names of its
+ *contributors may be used to endorse or promote products derived
+ *from this software without specific prior written permission.
+ *
+ * THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS
+ * "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT
+ * LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR
+ * A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT
+ * OWNER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL,
+ * SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT
+ * LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE,
+ * DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY
+ * THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+ * (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE
+ * OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.
+ *
+ */
+
+/*
+ * This header file is for OPA-specific definitions which are
+ * required by the HFI driver, and which aren't yet in the Linux
+ * IB core. We'll collect these all here, then merge them into
+ * the kernel when that's convenient.
+ */
+
+/* OPA SMA attribute IDs */
+#define OPA_ATTRIB_ID_CONGESTION_INFO  cpu_to_be16(0x008b)
+#define OPA_ATTRIB_ID_HFI_CONGESTION_LOG   cpu_to_be16(0x008f)
+#define OPA_ATTRIB_ID_HFI_CONGESTION_SETTING   cpu_to_be16(0x0090)
+#define OPA_ATTRIB_ID_CONGESTION_CONTROL_TABLE cpu_to_be16(0x0091)
+
+/* OPA PMA attribute IDs */
+#define OPA_PM_ATTRIB_ID_PORT_STATUS   cpu_to_be16(0x0040)
+#define OPA_PM_ATTRIB_ID_CLEAR_PORT_STATUS cpu_to_be16(0x0041)
+#define OPA_PM_ATTRIB_ID_DATA_PORT_COUNTERS    cpu_to_be16(0x0042)
+#define OPA_PM_ATTRIB_ID_ERROR_PORT_COUNTERS   cpu_to_be16(0x0043)
+#define OPA_PM_ATTRIB_ID_ERROR_INFO            cpu_to_be16(0x0044)
+
+/* OPA status codes */
+#define OPA_PM_STATUS_REQUEST_TOO_LARGE        cpu_to_be16(0x100)
+
+static inline u8 

Re: [PATCH WIP 28/43] IB/core: Introduce new fast registration API

2015-07-30 Thread Sagi Grimberg



Can you explain what you mean by "downgrades everything to a 2k
alignment"? If the ULP is responsible for a PAGE_SIZE alignment then
how would this get out of alignment with swiotlb?


swiotlb copies all DMA maps to a shared buffer below 4G so it can be
used with 32 bit devices.

The shared buffer is managed in a way that copies each s/g element to
a continuous 2k aligned subsection of the buffer.



Thanks for the explanation.


Basically, swiotlb realigns everything that passes through it.


So this won't ever happen if the ULP will DMA map the SG and check
for gaps right?

Also, is it interesting to support swiotlb even if we don't have
any devices that require it (and should we expect one to ever exist)?



The DMA API allows this, so ultimately, code has to check the dma
physical address when concerned about alignment.. But we should not
expect this to commonly fail.

So, something like..

   if (!ib_does_sgl_fit_in_mr(mr,sg))
  .. bounce buffer ..


I don't understand the need for this if we do the same thing
if the actual mapping fails...



   if (!ib_map_mr_sg(mr,sg)) // does dma mapping and checks it
  .. bounce buffer ..


Each ULP would want to do something different, iser
will bounce but srp would need to use multiple mrs, nfs will
split the request.
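
For illustration only, the gap check discussed above could look roughly
like the user-space C sketch below. The names struct sg_elem and
sgl_fits_in_mr() are hypothetical stand-ins (they are not kernel APIs,
and this is not what ib_map_mr_sg does internally); the sketch only
assumes the usual page-list rule that every element except the first
must start on a page boundary and every element except the last must
end on one.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-in for one DMA-mapped scatterlist element. */
struct sg_elem {
	uint64_t dma_addr;	/* bus address after DMA mapping */
	uint32_t len;		/* length in bytes */
};

/*
 * Return true if the mapped list can be described by a single
 * page-list memory region: no element other than the first may start
 * off a page boundary, and no element other than the last may end
 * off one.
 */
static bool sgl_fits_in_mr(const struct sg_elem *sg, size_t nents,
			   uint64_t page_size)
{
	uint64_t mask = page_size - 1;
	size_t i;

	/* page_size must be a non-zero power of two */
	assert(page_size && (page_size & mask) == 0);

	for (i = 0; i < nents; i++) {
		if (i > 0 && (sg[i].dma_addr & mask))
			return false;	/* gap before this element */
		if (i + 1 < nents && ((sg[i].dma_addr + sg[i].len) & mask))
			return false;	/* gap after this element */
	}
	return true;
}
```

A list whose second element lands on a 2k boundary (as swiotlb may
produce) would fail this check with a 4k page size, and the ULP would
then take its own fallback path (bounce, multiple MRs, or splitting the
request).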
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 05/22] IB/iser: Get rid of un-maintained counters

2015-07-30 Thread Sagi Grimberg
We don't update those anywhere in the code and they
seem pretty useless (no one seems to care about those).

qp_tx_queue_full: We never should get this
fmr_map_not_avail: We can never get to this
eh_abort_cnt: We don't monitor aborts

Go ahead and remove them.

Signed-off-by: Sagi Grimberg sa...@mellanox.com
---
 drivers/infiniband/ulp/iser/iscsi_iser.c | 12 +++-
 1 file changed, 3 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c 
b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 859d9d9..92b1020 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -750,15 +750,9 @@ iscsi_iser_conn_get_stats(struct iscsi_cls_conn *cls_conn, 
struct iscsi_stats *s
stats->r2t_pdus = conn->r2t_pdus_cnt; /* always 0 */
stats->tmfcmd_pdus = conn->tmfcmd_pdus_cnt;
stats->tmfrsp_pdus = conn->tmfrsp_pdus_cnt;
-   stats->custom_length = 4;
-   strcpy(stats->custom[0].desc, "qp_tx_queue_full");
-   stats->custom[0].value = 0; /* TB iser_conn->qp_tx_queue_full; */
-   strcpy(stats->custom[1].desc, "fmr_map_not_avail");
-   stats->custom[1].value = 0; /* TB iser_conn->fmr_map_not_avail */;
-   strcpy(stats->custom[2].desc, "eh_abort_cnt");
-   stats->custom[2].value = conn->eh_abort_cnt;
-   strcpy(stats->custom[3].desc, "fmr_unalign_cnt");
-   stats->custom[3].value = conn->fmr_unalign_cnt;
+   stats->custom_length = 1;
+   strcpy(stats->custom[0].desc, "fmr_unalign_cnt");
+   stats->custom[0].value = conn->fmr_unalign_cnt;
 }
 
 static int iscsi_iser_get_ep_param(struct iscsi_endpoint *ep,
-- 
1.8.4.3

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 14/22] IB/iser: Introduce iser registration pool struct

2015-07-30 Thread Sagi Grimberg
Instead of having it a part of the connection structure,
have it be under a dedicated (embedded) structure in the
connection. A logical separation of the registration pool
and the connection structure.

Signed-off-by: Sagi Grimberg sa...@mellanox.com
Signed-off-by: Adir Lev ad...@mellanox.com
---
 drivers/infiniband/ulp/iser/iscsi_iser.h  | 49 ++---
 drivers/infiniband/ulp/iser/iser_memory.c | 32 +
 drivers/infiniband/ulp/iser/iser_verbs.c  | 60 ++-
 3 files changed, 82 insertions(+), 59 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h 
b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 9ce090c..1fc4c23 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -418,6 +418,33 @@ struct iser_fr_desc {
 };
 
 /**
+ * struct iser_fr_pool: connection fast registration pool
+ *
+ * @lock:protects fmr/fastreg pool
+ * @union.fmr:
+ * @pool:FMR pool for fast registrations
+ * @page_vec:fast reg page list to hold mapped commands pages
+ *   used for registration
+ * @union.fastreg:
+ * @pool:Fast registration descriptors pool for fast
+ *   registrations
+ * @pool_size:   Size of pool
+ */
+struct iser_fr_pool {
+   spinlock_t   lock;
+   union {
+   struct {
+   struct ib_fmr_pool  *pool;
+   struct iser_page_vec*page_vec;
+   } fmr;
+   struct {
+   struct list_head pool;
+   int  pool_size;
+   } fastreg;
+   };
+};
+
+/**
  * struct ib_conn - Infiniband related objects
  *
  * @cma_id:  rdma_cm connection maneger handle
@@ -430,15 +457,7 @@ struct iser_fr_desc {
  * @pi_support:  Indicate device T10-PI support
  * @beacon:  beacon send wr to signal all flush errors were drained
  * @flush_comp:  completes when all connection completions consumed
- * @lock:protects fmr/fastreg pool
- * @union.fmr:
- * @pool:FMR pool for fast registrations
- * @page_vec:page vector to hold mapped commands pages
- *   used for registration
- * @union.fastreg:
- * @pool:Fast registration descriptors pool for fast
- *   registrations
- * @pool_size:   Size of pool
+ * @fr_pool: connection fast registration pool
  */
 struct ib_conn {
struct rdma_cm_id   *cma_id;
@@ -451,17 +470,7 @@ struct ib_conn {
bool pi_support;
struct ib_send_wr    beacon;
struct completion    flush_comp;
-   spinlock_t   lock;
-   union {
-   struct {
-   struct ib_fmr_pool  *pool;
-   struct iser_page_vec*page_vec;
-   } fmr;
-   struct {
-   struct list_head pool;
-   int  pool_size;
-   } fastreg;
-   };
+   struct iser_fr_pool  fr_pool;
 };
 
 /**
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index ff3ec53..65c035d 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -184,14 +184,15 @@ iser_copy_to_bounce(struct iser_data_buf *data)
 struct iser_fr_desc *
 iser_reg_desc_get(struct ib_conn *ib_conn)
 {
+   struct iser_fr_pool *fr_pool = &ib_conn->fr_pool;
struct iser_fr_desc *desc;
unsigned long flags;
 
-   spin_lock_irqsave(&ib_conn->lock, flags);
-   desc = list_first_entry(&ib_conn->fastreg.pool,
+   spin_lock_irqsave(&fr_pool->lock, flags);
+   desc = list_first_entry(&fr_pool->fastreg.pool,
struct iser_fr_desc, list);
list_del(&desc->list);
-   spin_unlock_irqrestore(&ib_conn->lock, flags);
+   spin_unlock_irqrestore(&fr_pool->lock, flags);
 
return desc;
 }
@@ -200,11 +201,12 @@ void
 iser_reg_desc_put(struct ib_conn *ib_conn,
  struct iser_fr_desc *desc)
 {
+   struct iser_fr_pool *fr_pool = &ib_conn->fr_pool;
unsigned long flags;
 
-   spin_lock_irqsave(&ib_conn->lock, flags);
-   list_add(&desc->list, &ib_conn->fastreg.pool);
-   spin_unlock_irqrestore(&ib_conn->lock, flags);
+   spin_lock_irqsave(&fr_pool->lock, flags);
+   list_add(&desc->list, &fr_pool->fastreg.pool);
+   spin_unlock_irqrestore(&fr_pool->lock, flags);
 }
 
 /**
@@ -480,6 +482,7 @@ int iser_reg_page_vec(struct iscsi_iser_task *iser_task,
  struct iser_mem_reg *mem_reg)
 {
struct ib_conn *ib_conn = &iser_task->iser_conn->ib_conn;
+   struct iser_fr_pool *fr_pool = 

[PATCH 10/22] IB/iser: Rename struct fast_reg_descriptor -> iser_fr_desc

2015-07-30 Thread Sagi Grimberg
Avoid struct names without iser_ prefix.

This patch does not change any functionality.

Signed-off-by: Sagi Grimberg sa...@mellanox.com
---
 drivers/infiniband/ulp/iser/iscsi_iser.h  |  8 
 drivers/infiniband/ulp/iser/iser_memory.c | 10 +-
 drivers/infiniband/ulp/iser/iser_verbs.c  | 10 +-
 3 files changed, 14 insertions(+), 14 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h 
b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 9cdfdbd..70bf6e7 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -396,13 +396,13 @@ struct iser_pi_context {
 };
 
 /**
- * struct fast_reg_descriptor - Fast registration descriptor
+ * struct iser_fr_desc - Fast registration descriptor
  *
  * @list:   entry in connection fastreg pool
  * @rsc:data buffer registration resources
  * @pi_ctx: protection information context
  */
-struct fast_reg_descriptor {
+struct iser_fr_desc {
struct list_head  list;
struct iser_reg_resources rsc;
struct iser_pi_context   *pi_ctx;
@@ -642,9 +642,9 @@ int iser_create_fastreg_pool(struct ib_conn *ib_conn, 
unsigned cmds_max);
 void iser_free_fastreg_pool(struct ib_conn *ib_conn);
 u8 iser_check_task_pi_status(struct iscsi_iser_task *iser_task,
 enum iser_data_dir cmd_dir, sector_t *sector);
-struct fast_reg_descriptor *
+struct iser_fr_desc *
 iser_reg_desc_get(struct ib_conn *ib_conn);
 void
 iser_reg_desc_put(struct ib_conn *ib_conn,
- struct fast_reg_descriptor *desc);
+ struct iser_fr_desc *desc);
 #endif
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index e6516bc..4209d73 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -146,15 +146,15 @@ iser_copy_to_bounce(struct iser_data_buf *data)
iser_copy_bounce(data, true);
 }
 
-struct fast_reg_descriptor *
+struct iser_fr_desc *
 iser_reg_desc_get(struct ib_conn *ib_conn)
 {
-   struct fast_reg_descriptor *desc;
+   struct iser_fr_desc *desc;
unsigned long flags;
 
spin_lock_irqsave(&ib_conn->lock, flags);
desc = list_first_entry(&ib_conn->fastreg.pool,
-   struct fast_reg_descriptor, list);
+   struct iser_fr_desc, list);
list_del(&desc->list);
spin_unlock_irqrestore(&ib_conn->lock, flags);
 
@@ -163,7 +163,7 @@ iser_reg_desc_get(struct ib_conn *ib_conn)
 
 void
 iser_reg_desc_put(struct ib_conn *ib_conn,
- struct fast_reg_descriptor *desc)
+ struct iser_fr_desc *desc)
 {
unsigned long flags;
 
@@ -787,7 +787,7 @@ int iser_reg_rdma_mem_fastreg(struct iscsi_iser_task 
*iser_task,
struct ib_device *ibdev = device->ib_device;
struct iser_data_buf *mem = &iser_task->data[cmd_dir];
struct iser_mem_reg *mem_reg = &iser_task->rdma_reg[cmd_dir];
-   struct fast_reg_descriptor *desc = NULL;
+   struct iser_fr_desc *desc = NULL;
int err, aligned_len;
 
aligned_len = iser_data_buf_aligned_len(mem, ibdev);
diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c 
b/drivers/infiniband/ulp/iser/iser_verbs.c
index 46b8875..f7828e3 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -320,7 +320,7 @@ iser_free_reg_res(struct iser_reg_resources *rsc)
 
 static int
 iser_alloc_pi_ctx(struct ib_device *ib_device, struct ib_pd *pd,
- struct fast_reg_descriptor *desc)
+ struct iser_fr_desc *desc)
 {
struct iser_pi_context *pi_ctx = NULL;
int ret;
@@ -365,7 +365,7 @@ iser_free_pi_ctx(struct iser_pi_context *pi_ctx)
 
 static int
 iser_create_fastreg_desc(struct ib_device *ib_device, struct ib_pd *pd,
-bool pi_enable, struct fast_reg_descriptor *desc)
+bool pi_enable, struct iser_fr_desc *desc)
 {
int ret;
 
@@ -397,7 +397,7 @@ pi_ctx_alloc_failure:
 int iser_create_fastreg_pool(struct ib_conn *ib_conn, unsigned cmds_max)
 {
struct iser_device *device = ib_conn->device;
-   struct fast_reg_descriptor *desc;
+   struct iser_fr_desc *desc;
int i, ret;
 
INIT_LIST_HEAD(&ib_conn->fastreg.pool);
@@ -435,7 +435,7 @@ err:
  */
 void iser_free_fastreg_pool(struct ib_conn *ib_conn)
 {
-   struct fast_reg_descriptor *desc, *tmp;
+   struct iser_fr_desc *desc, *tmp;
int i = 0;
 
if (list_empty(&ib_conn->fastreg.pool))
@@ -1252,7 +1252,7 @@ u8 iser_check_task_pi_status(struct iscsi_iser_task 
*iser_task,
 enum iser_data_dir cmd_dir, sector_t *sector)
 {
struct iser_mem_reg *reg = &iser_task->rdma_reg[cmd_dir];
-   struct fast_reg_descriptor *desc = reg->mem_h;
+   struct iser_fr_desc *desc = reg->mem_h;

[PATCH 22/22] IB/iser: Chain all iser transaction send work requests

2015-07-30 Thread Sagi Grimberg
Concatenation of send work requests benefits performance
by reducing send queue lock contention (the lock is acquired in
ib_post_send) and saves us HW doorbells, since the doorbell is
rung only once for the whole chain.
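
As a rough illustration of the chaining idea (not the driver code
itself), the sketch below links per-task work requests through their
next pointers so that a single post of the chain head covers all of
them. struct send_wr, struct tx_desc, tx_next_wr() and chain_len() are
hypothetical, trimmed-down stand-ins; clearing cur->next on hand-out
is an assumption of this sketch and is not shown in the patch.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_WRS 7	/* mirrors ISER_MAX_WRS from this patch */

/* Hypothetical, trimmed-down stand-in for struct ib_send_wr. */
struct send_wr {
	struct send_wr *next;
	int opcode;
};

/* Hypothetical stand-in for the WR-related part of a TX descriptor. */
struct tx_desc {
	unsigned char wr_idx;		/* next free slot in wrs[] */
	struct send_wr wrs[MAX_WRS];
};

/* Hand out the next WR slot and link it behind the previous one. */
static struct send_wr *tx_next_wr(struct tx_desc *d)
{
	struct send_wr *cur;

	assert(d->wr_idx < MAX_WRS);
	cur = &d->wrs[d->wr_idx];
	cur->next = NULL;	/* sketch assumption: terminate the chain */
	if (d->wr_idx)
		d->wrs[d->wr_idx - 1].next = cur;
	d->wr_idx++;
	return cur;
}

/* Count the WRs that one post of the chain head would submit. */
static int chain_len(const struct send_wr *wr)
{
	int n = 0;

	for (; wr; wr = wr->next)
		n++;
	return n;
}
```

With this layout, a local invalidate, a fast registration and the PDU
send taken in that order form one three-element chain starting at
wrs[0], so the send queue lock is taken once rather than three times.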

Signed-off-by: Sagi Grimberg sa...@mellanox.com
---
 drivers/infiniband/ulp/iser/iscsi_iser.c  |   1 +
 drivers/infiniband/ulp/iser/iscsi_iser.h  |  34 +
 drivers/infiniband/ulp/iser/iser_memory.c | 120 +-
 drivers/infiniband/ulp/iser/iser_verbs.c  |  21 +++---
 4 files changed, 99 insertions(+), 77 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.c 
b/drivers/infiniband/ulp/iser/iscsi_iser.c
index 9eeefc8..ec87ce1 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.c
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.c
@@ -204,6 +204,7 @@ iser_initialize_task_headers(struct iscsi_task *task,
goto out;
}
 
+   tx_desc->wr_idx = 0;
tx_desc->mapped = true;
tx_desc->dma_addr = dma_addr;
tx_desc->tx_sg[0].addr   = tx_desc->dma_addr;
diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h 
b/drivers/infiniband/ulp/iser/iscsi_iser.h
index 8a32e20..4af1916 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -265,6 +265,14 @@ enum iser_desc_type {
ISCSI_TX_DATAOUT
 };
 
+/* Maximum number of work requests per task:
+ * Data memory region local invalidate + fast registration
+ * Protection memory region local invalidate + fast registration
+ * Signature memory region local invalidate + fast registration
+ * PDU send
+ */
+#define ISER_MAX_WRS 7
+
 /**
  * struct iser_tx_desc - iSER TX descriptor (for send wr_id)
  *
@@ -277,6 +285,11 @@ enum iser_desc_type {
  * unsolicited data-out or control
  * @num_sge:   number sges used on this TX task
  * @mapped:Is the task header mapped
+ * @wr_idx:Current WR index
+ * @wrs:   Array of WRs per task
+ * @data_reg:  Data buffer registration details
+ * @prot_reg:  Protection buffer registration details
+ * @sig_attrs: Signature attributes
  */
 struct iser_tx_desc {
struct iser_hdr  iser_header;
@@ -286,6 +299,11 @@ struct iser_tx_desc {
struct ib_sge    tx_sg[2];
int  num_sge;
bool mapped;
+   u8   wr_idx;
+   struct ib_send_wr    wrs[ISER_MAX_WRS];
+   struct iser_mem_reg  data_reg;
+   struct iser_mem_reg  prot_reg;
+   struct ib_sig_attrs  sig_attrs;
 };
 
 #define ISER_RX_PAD_SIZE   (256 - (ISER_RX_PAYLOAD_SIZE + \
@@ -689,4 +707,20 @@ iser_reg_desc_get_fmr(struct ib_conn *ib_conn);
 void
 iser_reg_desc_put_fmr(struct ib_conn *ib_conn,
  struct iser_fr_desc *desc);
+
+static inline struct ib_send_wr *
+iser_tx_next_wr(struct iser_tx_desc *tx_desc)
+{
+   struct ib_send_wr *cur_wr = &tx_desc->wrs[tx_desc->wr_idx];
+   struct ib_send_wr *last_wr;
+
+   if (tx_desc->wr_idx) {
+   last_wr = &tx_desc->wrs[tx_desc->wr_idx - 1];
+   last_wr->next = cur_wr;
+   }
+   tx_desc->wr_idx++;
+
+   return cur_wr;
+}
+
 #endif
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index 750f03f..2493cc7 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -664,10 +664,11 @@ iser_inv_rkey(struct ib_send_wr *inv_wr, struct ib_mr *mr)
 {
u32 rkey;
 
-   memset(inv_wr, 0, sizeof(*inv_wr));
inv_wr->opcode = IB_WR_LOCAL_INV;
inv_wr->wr_id = ISER_FASTREG_LI_WRID;
inv_wr->ex.invalidate_rkey = mr->rkey;
+   inv_wr->send_flags = 0;
+   inv_wr->num_sge = 0;
 
rkey = ib_inc_rkey(mr->rkey);
ib_update_fast_reg_key(mr, rkey);
@@ -680,47 +681,38 @@ iser_reg_sig_mr(struct iscsi_iser_task *iser_task,
struct iser_mem_reg *prot_reg,
struct iser_mem_reg *sig_reg)
 {
-   struct ib_conn *ib_conn = &iser_task->iser_conn->ib_conn;
-   struct ib_send_wr sig_wr, inv_wr;
-   struct ib_send_wr *bad_wr, *wr = NULL;
-   struct ib_sig_attrs sig_attrs;
+   struct iser_tx_desc *tx_desc = &iser_task->desc;
+   struct ib_sig_attrs *sig_attrs = &tx_desc->sig_attrs;
+   struct ib_send_wr *wr;
int ret;
 
-   memset(&sig_attrs, 0, sizeof(sig_attrs));
-   ret = iser_set_sig_attrs(iser_task->sc, &sig_attrs);
+   memset(sig_attrs, 0, sizeof(*sig_attrs));
+   ret = iser_set_sig_attrs(iser_task->sc, sig_attrs);
if (ret)
goto err;
 
-   iser_set_prot_checks(iser_task->sc, &sig_attrs.check_mask);
+   iser_set_prot_checks(iser_task->sc, &sig_attrs->check_mask);
 
if (!pi_ctx->sig_mr_valid) {
-   iser_inv_rkey(&inv_wr, pi_ctx->sig_mr);
-   wr = &inv_wr;
+   wr = iser_tx_next_wr(tx_desc);
+   iser_inv_rkey(wr, 

[PATCH 08/22] IB/iser: Remove an unneeded print for unaligned memory

2015-07-30 Thread Sagi Grimberg
We can do it in iser_data_buf_aligned_len instead, which
saves us an argument that is passed to
fall_to_bounce_buf just for the print.

Signed-off-by: Sagi Grimberg sa...@mellanox.com
---
 drivers/infiniband/ulp/iser/iser_memory.c | 20 +---
 1 file changed, 9 insertions(+), 11 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_memory.c 
b/drivers/infiniband/ulp/iser/iser_memory.c
index 3129a42..56dd53b 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -330,8 +330,11 @@ static int iser_data_buf_aligned_len(struct iser_data_buf 
*data,
break;
}
ret_len = (next_sg) ? i : i+1;
-   iser_dbg("Found %d aligned entries out of %d in sg:0x%p\n",
-ret_len, data->dma_nents, data);
+
+   if (unlikely(ret_len != data->dma_nents))
+   iser_warn("rdma alignment violation (%d/%d aligned)\n",
+ ret_len, data->dma_nents);
+
return ret_len;
 }
 
@@ -407,15 +410,12 @@ iser_reg_dma(struct iser_device *device, struct 
iser_data_buf *mem,
 
 static int fall_to_bounce_buf(struct iscsi_iser_task *iser_task,
  struct iser_data_buf *mem,
- enum iser_data_dir cmd_dir,
- int aligned_len)
+ enum iser_data_dir cmd_dir)
 {
struct iscsi_conn *iscsi_conn = iser_task->iser_conn->iscsi_conn;
struct iser_device *device = iser_task->iser_conn->ib_conn.device;
 
iscsi_conn->fmr_unalign_cnt++;
-   iser_warn("rdma alignment violation (%d/%d aligned) or FMR not supported\n",
- aligned_len, mem->size);
 
if (iser_debug_level > 0)
iser_data_buf_dump(mem, device->ib_device);
@@ -537,8 +537,7 @@ int iser_reg_rdma_mem_fmr(struct iscsi_iser_task *iser_task,
 
aligned_len = iser_data_buf_aligned_len(mem, ibdev);
if (aligned_len != mem->dma_nents) {
-   err = fall_to_bounce_buf(iser_task, mem,
-cmd_dir, aligned_len);
+   err = fall_to_bounce_buf(iser_task, mem, cmd_dir);
if (err) {
iser_err("failed to allocate bounce buffer\n");
return err;
@@ -800,8 +799,7 @@ int iser_reg_rdma_mem_fastreg(struct iscsi_iser_task 
*iser_task,
 
aligned_len = iser_data_buf_aligned_len(mem, ibdev);
if (aligned_len != mem->dma_nents) {
-   err = fall_to_bounce_buf(iser_task, mem,
-cmd_dir, aligned_len);
+   err = fall_to_bounce_buf(iser_task, mem, cmd_dir);
if (err) {
iser_err("failed to allocate bounce buffer\n");
return err;
@@ -828,7 +826,7 @@ int iser_reg_rdma_mem_fastreg(struct iscsi_iser_task 
*iser_task,
aligned_len = iser_data_buf_aligned_len(mem, ibdev);
if (aligned_len != mem->dma_nents) {
err = fall_to_bounce_buf(iser_task, mem,
-cmd_dir, aligned_len);
+cmd_dir);
if (err) {
iser_err("failed to allocate bounce buffer\n");
return err;
-- 
1.8.4.3

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 17/22] IB/iser: Make reg_desc_get a per device routine

2015-07-30 Thread Sagi Grimberg
For FMRs we hold a single registration descriptor, since there is
no need for multiple descriptors as in FRWR mode (one descriptor
for each task). This change helps unify the duplicate
registration code paths.
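
As a sketch of the idea only, the following user-space C shows a
per-mode ops table where callers use one get/put interface regardless
of registration mode. All names here (struct reg_desc, struct conn,
conn_init, desc_get_fr, desc_get_fmr, ...) are hypothetical and
simplified; locking is omitted, and treating the FMR put as a no-op is
an assumption of the sketch, not something this patch states.

```c
#include <assert.h>
#include <stddef.h>

#define POOL_SIZE 4

/* Hypothetical stand-in for a registration descriptor. */
struct reg_desc {
	int id;
};

/* Hypothetical connection holding both kinds of descriptor storage. */
struct conn {
	struct reg_desc  fmr_desc;		/* single descriptor (FMR mode) */
	struct reg_desc  fr_storage[POOL_SIZE];	/* backing store (FRWR mode) */
	struct reg_desc *fr_free[POOL_SIZE];	/* stack of free descriptors */
	int              fr_nfree;
};

struct reg_ops {
	struct reg_desc *(*reg_desc_get)(struct conn *c);
	void (*reg_desc_put)(struct conn *c, struct reg_desc *d);
};

static void conn_init(struct conn *c)
{
	int i;

	for (i = 0; i < POOL_SIZE; i++)
		c->fr_free[i] = &c->fr_storage[i];
	c->fr_nfree = POOL_SIZE;
}

/* FRWR mode: each task takes its own descriptor from the pool. */
static struct reg_desc *desc_get_fr(struct conn *c)
{
	assert(c->fr_nfree > 0);
	return c->fr_free[--c->fr_nfree];
}

static void desc_put_fr(struct conn *c, struct reg_desc *d)
{
	assert(c->fr_nfree < POOL_SIZE);
	c->fr_free[c->fr_nfree++] = d;
}

/* FMR mode: every caller shares the one descriptor. */
static struct reg_desc *desc_get_fmr(struct conn *c)
{
	return &c->fmr_desc;
}

static void desc_put_fmr(struct conn *c, struct reg_desc *d)
{
	(void)c;	/* nothing to return; assumed no-op in this sketch */
	(void)d;
}

static const struct reg_ops fastreg_ops = { desc_get_fr, desc_put_fr };
static const struct reg_ops fmr_ops = { desc_get_fmr, desc_put_fmr };
```

A registration path written against struct reg_ops can then call
ops->reg_desc_get() and ops->reg_desc_put() without knowing which mode
the device selected.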

Signed-off-by: Sagi Grimberg sa...@mellanox.com
Signed-off-by: Adir Lev ad...@mellanox.com
---
 drivers/infiniband/ulp/iser/iscsi_iser.h  | 16 ++---
 drivers/infiniband/ulp/iser/iser_memory.c | 38 +++
 2 files changed, 41 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iscsi_iser.h 
b/drivers/infiniband/ulp/iser/iscsi_iser.h
index a8c8177..611abaa 100644
--- a/drivers/infiniband/ulp/iser/iscsi_iser.h
+++ b/drivers/infiniband/ulp/iser/iscsi_iser.h
@@ -333,6 +333,8 @@ struct iser_comp {
  * @free_reg_res:  Free registration resources
  * @reg_rdma_mem:  Register memory buffers
  * @unreg_rdma_mem:Un-register memory buffers
+ * @reg_desc_get:  Get a registration descriptor from the pool
+ * @reg_desc_put:  Put a registration descriptor back into the pool
  */
 struct iser_reg_ops {
int(*alloc_reg_res)(struct ib_conn *ib_conn,
@@ -342,6 +344,9 @@ struct iser_reg_ops {
   enum iser_data_dir cmd_dir);
void   (*unreg_rdma_mem)(struct iscsi_iser_task *iser_task,
 enum iser_data_dir cmd_dir);
+   struct iser_fr_desc * (*reg_desc_get)(struct ib_conn *ib_conn);
+   void   (*reg_desc_put)(struct ib_conn *ib_conn,
+  struct iser_fr_desc *desc);
 };
 
 /**
@@ -657,8 +662,13 @@ void iser_free_fastreg_pool(struct ib_conn *ib_conn);
 u8 iser_check_task_pi_status(struct iscsi_iser_task *iser_task,
 enum iser_data_dir cmd_dir, sector_t *sector);
 struct iser_fr_desc *
-iser_reg_desc_get(struct ib_conn *ib_conn);
+iser_reg_desc_get_fr(struct ib_conn *ib_conn);
 void
-iser_reg_desc_put(struct ib_conn *ib_conn,
- struct iser_fr_desc *desc);
+iser_reg_desc_put_fr(struct ib_conn *ib_conn,
+struct iser_fr_desc *desc);
+struct iser_fr_desc *
+iser_reg_desc_get_fmr(struct ib_conn *ib_conn);
+void
+iser_reg_desc_put_fmr(struct ib_conn *ib_conn,
+ struct iser_fr_desc *desc);
 #endif
diff --git a/drivers/infiniband/ulp/iser/iser_memory.c b/drivers/infiniband/ulp/iser/iser_memory.c
index 11bba5a..4e687ee 100644
--- a/drivers/infiniband/ulp/iser/iser_memory.c
+++ b/drivers/infiniband/ulp/iser/iser_memory.c
@@ -44,6 +44,8 @@ static struct iser_reg_ops fastreg_ops = {
.free_reg_res   = iser_free_fastreg_pool,
.reg_rdma_mem   = iser_reg_rdma_mem_fastreg,
.unreg_rdma_mem = iser_unreg_mem_fastreg,
+   .reg_desc_get   = iser_reg_desc_get_fr,
+   .reg_desc_put   = iser_reg_desc_put_fr,
 };
 
 static struct iser_reg_ops fmr_ops = {
@@ -51,6 +53,8 @@ static struct iser_reg_ops fmr_ops = {
.free_reg_res   = iser_free_fmr_pool,
.reg_rdma_mem   = iser_reg_rdma_mem_fmr,
.unreg_rdma_mem = iser_unreg_mem_fmr,
+   .reg_desc_get   = iser_reg_desc_get_fmr,
+   .reg_desc_put   = iser_reg_desc_put_fmr,
 };
 
 int iser_assign_reg_ops(struct iser_device *device)
@@ -182,7 +186,7 @@ iser_copy_to_bounce(struct iser_data_buf *data)
 }
 
 struct iser_fr_desc *
-iser_reg_desc_get(struct ib_conn *ib_conn)
+iser_reg_desc_get_fr(struct ib_conn *ib_conn)
 {
struct iser_fr_pool *fr_pool = &ib_conn->fr_pool;
struct iser_fr_desc *desc;
@@ -198,8 +202,8 @@ iser_reg_desc_get(struct ib_conn *ib_conn)
 }
 
 void
-iser_reg_desc_put(struct ib_conn *ib_conn,
- struct iser_fr_desc *desc)
+iser_reg_desc_put_fr(struct ib_conn *ib_conn,
+struct iser_fr_desc *desc)
 {
struct iser_fr_pool *fr_pool = &ib_conn->fr_pool;
unsigned long flags;
@@ -209,6 +213,21 @@ iser_reg_desc_put(struct ib_conn *ib_conn,
spin_unlock_irqrestore(&fr_pool->lock, flags);
 }
 
+struct iser_fr_desc *
+iser_reg_desc_get_fmr(struct ib_conn *ib_conn)
+{
+   struct iser_fr_pool *fr_pool = &ib_conn->fr_pool;
+
+   return list_first_entry(&fr_pool->list,
+   struct iser_fr_desc, list);
+}
+
+void
+iser_reg_desc_put_fmr(struct ib_conn *ib_conn,
+ struct iser_fr_desc *desc)
+{
+}
+
 /**
  * iser_start_rdma_unaligned_sg
  */
@@ -544,13 +563,14 @@ void iser_unreg_mem_fmr(struct iscsi_iser_task *iser_task,
 void iser_unreg_mem_fastreg(struct iscsi_iser_task *iser_task,
enum iser_data_dir cmd_dir)
 {
+   struct iser_device *device = iser_task->iser_conn->ib_conn.device;
struct iser_mem_reg *reg = &iser_task->rdma_reg[cmd_dir];
 
if (!reg->mem_h)
return;
 
-   iser_reg_desc_put(&iser_task->iser_conn->ib_conn,
- reg->mem_h);
+   device->reg_ops->reg_desc_put(&iser_task->iser_conn->ib_conn,
+reg->mem_h);
reg-mem_h 

Re: [PATCH WIP 21/43] mlx5: Allocate a private page list in ib_alloc_mr

2015-07-30 Thread Sagi Grimberg

On 7/28/2015 1:57 PM, Haggai Eran wrote:

Hi Sagi,

On 22/07/2015 09:55, Sagi Grimberg wrote:

Signed-off-by: Sagi Grimberg sa...@mellanox.com
---
  drivers/infiniband/hw/mlx5/mlx5_ib.h |  5 
  drivers/infiniband/hw/mlx5/mr.c  | 45 
  2 files changed, 50 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index c2916f1..df5e959 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -315,6 +315,11 @@ enum mlx5_ib_mtt_access_flags {

  struct mlx5_ib_mr {
struct ib_mr ibmr;
+   u64 *pl;
+   __be64  *mpl;
+   dma_addr_t  pl_map;

Nit: could you choose more descriptive names for these fields? It can be
difficult to understand what they mean just based on the acronym.


OK - I'll name it better in v1.




+   int ndescs;

This one isn't used in this patch, right?


Not in this patch - I can move it.

Thanks!


[PATCH for-4.3 07/15] RDS: Convert to ib_alloc_mr

2015-07-30 Thread Sagi Grimberg
Signed-off-by: Sagi Grimberg sa...@mellanox.com
---
 net/rds/iw_rdma.c | 5 +++--
 net/rds/iw_send.c | 5 +++--
 2 files changed, 6 insertions(+), 4 deletions(-)

diff --git a/net/rds/iw_rdma.c b/net/rds/iw_rdma.c
index dba8d08..6a8fbd6 100644
--- a/net/rds/iw_rdma.c
+++ b/net/rds/iw_rdma.c
@@ -667,11 +667,12 @@ static int rds_iw_init_fastreg(struct rds_iw_mr_pool *pool,
struct ib_mr *mr;
int err;
 
-   mr = ib_alloc_fast_reg_mr(rds_iwdev->pd, pool->max_message_size);
+   mr = ib_alloc_mr(rds_iwdev->pd, IB_MR_TYPE_MEM_REG,
+pool->max_message_size);
if (IS_ERR(mr)) {
err = PTR_ERR(mr);
 
-   printk(KERN_WARNING "RDS/IW: ib_alloc_fast_reg_mr failed (err=%d)\n", err);
+   printk(KERN_WARNING "RDS/IW: ib_alloc_mr failed (err=%d)\n", err);
return err;
}
 
diff --git a/net/rds/iw_send.c b/net/rds/iw_send.c
index 334fe98..86152ec 100644
--- a/net/rds/iw_send.c
+++ b/net/rds/iw_send.c
@@ -153,9 +153,10 @@ void rds_iw_send_init_ring(struct rds_iw_connection *ic)
sge->length = sizeof(struct rds_header);
sge->lkey = 0;
 
-   send->s_mr = ib_alloc_fast_reg_mr(ic->i_pd, fastreg_message_size);
+   send->s_mr = ib_alloc_mr(ic->i_pd, IB_MR_TYPE_MEM_REG,
+fastreg_message_size);
if (IS_ERR(send->s_mr)) {
-   printk(KERN_WARNING "RDS/IW: ib_alloc_fast_reg_mr failed\n");
+   printk(KERN_WARNING "RDS/IW: ib_alloc_mr failed\n");
break;
}
 
-- 
1.8.4.3



[PATCH 11/22] IB/iser: Remove dead code in fmr_pool alloc/free

2015-07-30 Thread Sagi Grimberg
In the past we always tried to allocate an fmr_pool, and if that
failed with ENOSYS (not supported) we continued with a DMA MR. This
is no longer the case: if we try to allocate an fmr_pool, FMRs are
supported and we expect the allocation to succeed.

Also, the check for an allocated fmr_pool when free is called is
redundant, as we are guaranteed that the pool exists.
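
As a rough userspace sketch of the simplified error flow (not the driver code): the err_ptr/is_err/ptr_err helpers below are simplified stand-ins for the kernel's ERR_PTR family, and struct conn, create_pool() and the "fail" switch are hypothetical names used only for this example. Unlike the patch, the sketch also clears the pool pointer on failure so that the unconditional free stays safe in this toy setting.

```c
#include <assert.h>
#include <errno.h>
#include <stdint.h>
#include <stdlib.h>

/* Simplified userspace stand-ins for the kernel's ERR_PTR helpers. */
#define MAX_ERRNO 4095

static void *err_ptr(long err) { return (void *)(intptr_t)err; }
static long ptr_err(const void *p) { return (long)(intptr_t)p; }
static int is_err(const void *p)
{
	return (uintptr_t)p >= (uintptr_t)-MAX_ERRNO;
}

/* Hypothetical connection state: a page vector plus a pool handle. */
struct conn {
	void *page_vec;
	void *pool;
};

/* Hypothetical pool constructor; "fail" forces the error path. */
static void *create_pool(int fail)
{
	void *p;

	if (fail)
		return err_ptr(-ENOMEM);
	p = malloc(16);
	return p ? p : err_ptr(-ENOMEM);
}

/* Any failure is a hard error: unwind through one label and return it. */
static int conn_create_pool(struct conn *c, int fail)
{
	int ret;

	c->page_vec = malloc(64);
	if (!c->page_vec)
		return -ENOMEM;

	c->pool = create_pool(fail);
	if (is_err(c->pool)) {
		ret = (int)ptr_err(c->pool);
		c->pool = NULL;		/* sketch-only: keep free() safe */
		goto err;
	}

	return 0;

err:
	free(c->page_vec);
	c->page_vec = NULL;
	return ret;
}

/* Free without a "was it allocated" check, relying on the guarantee above. */
static void conn_free_pool(struct conn *c)
{
	free(c->pool);
	c->pool = NULL;
	free(c->page_vec);
	c->page_vec = NULL;
}
```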

Signed-off-by: Sagi Grimberg sa...@mellanox.com
---
 drivers/infiniband/ulp/iser/iser_verbs.c | 26 ++
 1 file changed, 10 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/ulp/iser/iser_verbs.c b/drivers/infiniband/ulp/iser/iser_verbs.c
index f7828e3..2a0cb42 100644
--- a/drivers/infiniband/ulp/iser/iser_verbs.c
+++ b/drivers/infiniband/ulp/iser/iser_verbs.c
@@ -244,22 +244,18 @@ int iser_create_fmr_pool(struct ib_conn *ib_conn, unsigned cmds_max)
IB_ACCESS_REMOTE_READ);
 
ib_conn->fmr.pool = ib_create_fmr_pool(device->pd, &params);
-   if (!IS_ERR(ib_conn->fmr.pool))
-   return 0;
+   if (IS_ERR(ib_conn->fmr.pool)) {
+   ret = PTR_ERR(ib_conn->fmr.pool);
+   iser_err("FMR allocation failed, err %d\n", ret);
+   goto err;
+   }
+
+   return 0;
 
-   /* no FMR = no need for page_vec */
+err:
kfree(ib_conn->fmr.page_vec);
ib_conn->fmr.page_vec = NULL;
-
-   ret = PTR_ERR(ib_conn->fmr.pool);
-   ib_conn->fmr.pool = NULL;
-   if (ret != -ENOSYS) {
-   iser_err("FMR allocation failed, err %d\n", ret);
-   return ret;
-   } else {
-   iser_warn("FMRs are not supported, using unaligned mode\n");
-   return 0;
-   }
+   return ret;
 }
 
 /**
@@ -270,9 +266,7 @@ void iser_free_fmr_pool(struct ib_conn *ib_conn)
iser_info("freeing conn %p fmr pool %p\n",
  ib_conn, ib_conn->fmr.pool);
 
-   if (ib_conn->fmr.pool != NULL)
-   ib_destroy_fmr_pool(ib_conn->fmr.pool);
-
+   ib_destroy_fmr_pool(ib_conn->fmr.pool);
ib_conn->fmr.pool = NULL;
 
kfree(ib_conn->fmr.page_vec);
-- 
1.8.4.3



[PATCH 00/22] iser patches for 4.3

2015-07-30 Thread Sagi Grimberg
This set is a resend that includes some extra patches that
piled up in the meantime.

I still have some patches in the pipe (including initiator/target
support for remote invalidate), but I'm targeting those for 4.4.

This patch set includes:
- Small fixes for bugs encountered in testing
- Small fixes detected by static checkers
- Memory registration code path rework (consolidate to
  a single code path that branches only at the actual registration
  FRWR vs. FMR). This reduces code duplication that exists in current code.
- Larger IO transfer size support (up to 8MB at the moment) depending on
  the device capabilities.
- Optimize IO path by chaining send work requests and posting them
  only once.
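
To make the last item concrete, here is a toy userspace model of work request chaining. It does not use the verbs API: struct send_wr, struct queue, post_send() and chain_wrs() are hypothetical names, and the "doorbell" is just a counter that is bumped once per post call regardless of how many requests are linked together.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical work request: only an opcode and a link to the next one. */
struct send_wr {
	int opcode;
	struct send_wr *next;
};

/* Hypothetical send queue that counts posted requests and doorbells. */
struct queue {
	int posted;
	int doorbells;
};

/* Post a whole chain in one call: many requests, a single doorbell. */
static int post_send(struct queue *q, struct send_wr *wr)
{
	for (; wr; wr = wr->next)
		q->posted++;
	q->doorbells++;
	return 0;
}

/* Link an array of requests into one chain and return its head. */
static struct send_wr *chain_wrs(struct send_wr *wrs, size_t n)
{
	size_t i;

	if (n == 0)
		return NULL;
	for (i = 0; i + 1 < n; i++)
		wrs[i].next = &wrs[i + 1];
	wrs[n - 1].next = NULL;
	return &wrs[0];
}
```

Posting three unchained requests in this model would cost three post calls and three doorbells; chaining them first brings that down to one of each.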

Adir Lev (1):
  IB/iser: Maintain connection fmr_pool under a single registration
descriptor

Jenny Falkovich (1):
  IB/iser: Change some module parameters to be RO

Sagi Grimberg (20):
  IB/iser: Change minor assignments and logging prints
  IB/iser: Remove '.' from log message
  IB/iser: Fix missing return status check in iser_send_data_out
  IB/iser: Get rid of un-maintained counters
  IB/iser: Fix possible bogus DMA unmapping
  IB/iser: Remove a redundant always-false condition
  IB/iser: Remove an unneeded print for unaligned memory
  IB/iser: Introduce struct iser_reg_resources
  IB/iser: Rename struct fast_reg_descriptor -> iser_fr_desc
  IB/iser: Remove dead code in fmr_pool alloc/free
  IB/iser: Introduce iser_reg_ops
  IB/iser: Move fastreg descriptor allocation to
iser_create_fastreg_desc
  IB/iser: Introduce iser registration pool struct
  IB/iser: Rename iser_reg_page_vec to iser_fast_reg_fmr
  IB/iser: Make reg_desc_get a per device routine
  IB/iser: Unify fast memory registration flows
  IB/iser: Pass registration pool a size parameter
  IB/iser: Support up to 8MB data transfer in a single command
  IB/iser: Add debug prints to the various memory registration methods
  IB/iser: Chain all iser transaction send work requests

 drivers/infiniband/ulp/iser/iscsi_iser.c |  89 +++--
 drivers/infiniband/ulp/iser/iscsi_iser.h | 206 
 drivers/infiniband/ulp/iser/iser_initiator.c |  34 +-
 drivers/infiniband/ulp/iser/iser_memory.c| 480 +++
 drivers/infiniband/ulp/iser/iser_verbs.c | 328 ++
 5 files changed, 645 insertions(+), 492 deletions(-)

-- 
1.8.4.3


