RE: [PATCH v4 14/19] IB/core: Add IB_DEVICE_OPA_MAD_SUPPORT device cap flag

2015-02-11 Thread Weiny, Ira
 On 2/4/2015 6:29 PM, ira.we...@intel.com wrote:
  From: Ira Weiny ira.we...@intel.com
 
  OPA MADs share a common header with IBTA MADs but with a different
  base version and an extended length.  These jumbo MADs increase the
  performance of management traffic.
 
  Sharing a common header with IBTA MADs allows us to share most of the
  MAD processing code when dealing with OPA MADs in addition to
  supporting some IBTA MADs on OPA devices.
 
  Add a device capability flag to indicate OPA MAD support on the device.
 
  Signed-off-by: Ira Weiny ira.we...@intel.com
 
  ---
  include/rdma/ib_verbs.h | 5 +++++
   1 file changed, 5 insertions(+)
 
  diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
  index 3ab4033..2614233 100644
  --- a/include/rdma/ib_verbs.h
  +++ b/include/rdma/ib_verbs.h
  @@ -128,6 +128,10 @@ enum ib_device_cap_flags {
  	IB_DEVICE_ON_DEMAND_PAGING	= (1<<31),
   };
 
  +enum ib_device_cap_flags2 {
  +   IB_DEVICE_OPA_MAD_SUPPORT   = 1
  +};
  +
   enum ib_signature_prot_cap {
  	IB_PROT_T10DIF_TYPE_1 = 1,
  	IB_PROT_T10DIF_TYPE_2 = 1 << 1,
  @@ -210,6 +214,7 @@ struct ib_device_attr {
  int sig_prot_cap;
  int sig_guard_cap;
  struct ib_odp_caps  odp_caps;
  +   u64 device_cap_flags2;
  u32 max_mad_size;
   };
 
 
 Why is OPA support determined via a device capability flag? What are the
 tradeoffs for doing it this way versus the other choices that have been used
 in the past for other RDMA technologies like RoCE, iWARP, usNIC, ... ?

None of those technologies use the MAD stack for Subnet Management.  Other MAD
support is very limited (i.e. IB-compatible PMA queries on the local port only).

Do you have a suggestion for alternatives?

Ira

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: mlx4: IB_EVENT_PATH_MIG not generated on path migration

2015-02-11 Thread Jason Gunthorpe
On Wed, Feb 11, 2015 at 10:21:36AM +0100, Fabian Holler wrote:
 Hello Jason,
 
 thanks for your answer.
 
 On Tue, Feb 10, 2015 at 11:00:39AM -0700, Jason Gunthorpe wrote:
  On Tue, Feb 10, 2015 at 04:56:43PM +0100, Fabian Holler wrote:
  
   Does anybody have an idea what could be wrong?
   Are the PATH_MIG* notifications working with mlx4 drivers for
   anybody?
  
  IIRC rdmacm does not forward these events.
  
  There is a lot missing to support APM with RDMA CM, it doesn't do any
  of the CM state machine operations to coordinate loading alternate
  paths, for instance.
 
 Alternate paths can be set directly at the QP.

APM requires a certain amount of synchronization between the local and
remote QP state, particularly if you want it to work more than
once. That is supposed to be done with CM MADs.

 The event handler specified in the struct ib_qp_init_attr * parameter of
 rdma_create_qp() is registered as QP event handler.
 rdma_create_qp() calls ib_create_qp() and sets qp->event_handler to
 ib_qp_init_attr->event_handler.

Okay, right, I was thinking of something else.

We had exactly this working a few months ago for a demo, in-kernel, using
Lustre and RDMA CM. I checked the patches we made and yes, the
event_handler was being properly called with IB_EVENT_PATH_MIG on
mlx4 and qib cards.

IIRC we may have needed the latest firmware on the mlx side.

Jason


[PATCH v1 05/10] IB/cm,cma: Move RDMA IP CM private-data parsing code from ib_cma to ib_cm

2015-02-11 Thread Shachar Raindel
From: Guy Shapiro gu...@mellanox.com

When receiving a connection request, ib_cm needs to associate the request with
a network namespace. To do this, it needs to know the request's destination
IP. For this the RDMA IP CM packet formatting functionality needs to be
exposed to ib_cm.

This patch merely moves the RDMA IP CM data formatting and parsing functions
to be part of ib_cm. The following patch will utilize the new knowledge to
look up the appropriate namespace. Each namespace maintains an independent
table of RDMA CM service IDs, allowing isolation and separation between the
network namespaces.

When creating a new incoming connection ID, the code in cm_save_ip_info can no
longer rely on the listener's private data to find the port number, so it
reads it from the requested service ID. This required saving the service ID in
cm_format_paths_from_req.

Signed-off-by: Guy Shapiro gu...@mellanox.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com

---
 drivers/infiniband/core/cm.c  | 156 +++
 drivers/infiniband/core/cma.c | 166 +-
 include/rdma/ib_cm.h  |  56 ++
 3 files changed, 230 insertions(+), 148 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 5a45cb7..efc5cff 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -51,6 +51,7 @@
 
 #include <rdma/ib_cache.h>
 #include <rdma/ib_cm.h>
+#include <rdma/ib.h>
 #include "cm_msgs.h"
 
 MODULE_AUTHOR("Sean Hefty");
@@ -701,6 +702,159 @@ static void cm_reject_sidr_req(struct cm_id_private *cm_id_priv,
 	ib_send_cm_sidr_rep(&cm_id_priv->id, &param);
 }
 
+int cm_format_hdr(void *hdr, int family,
+		  struct sockaddr *src_addr,
+		  struct sockaddr *dst_addr)
+{
+	struct cm_hdr *cm_hdr;
+
+	cm_hdr = hdr;
+	cm_hdr->cm_version = RDMA_IP_CM_VERSION;
+	if (family == AF_INET) {
+		struct sockaddr_in *src4, *dst4;
+
+		src4 = (struct sockaddr_in *)src_addr;
+		dst4 = (struct sockaddr_in *)dst_addr;
+
+		cm_set_ip_ver(cm_hdr, 4);
+		cm_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr;
+		cm_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr;
+		cm_hdr->port = src4->sin_port;
+	} else if (family == AF_INET6) {
+		struct sockaddr_in6 *src6, *dst6;
+
+		src6 = (struct sockaddr_in6 *)src_addr;
+		dst6 = (struct sockaddr_in6 *)dst_addr;
+
+		cm_set_ip_ver(cm_hdr, 6);
+		cm_hdr->src_addr.ip6 = src6->sin6_addr;
+		cm_hdr->dst_addr.ip6 = dst6->sin6_addr;
+		cm_hdr->port = src6->sin6_port;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(cm_format_hdr);
+
+static void cm_save_ib_info(struct sockaddr *src_addr,
+			    struct sockaddr *dst_addr,
+			    struct ib_sa_path_rec *path)
+{
+	struct sockaddr_ib *ib;
+
+	if (src_addr) {
+		ib = (struct sockaddr_ib *)src_addr;
+		ib->sib_family = AF_IB;
+		ib->sib_pkey = path->pkey;
+		ib->sib_flowinfo = path->flow_label;
+		memcpy(&ib->sib_addr, &path->sgid, 16);
+		ib->sib_sid = path->service_id;
+		ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
+		ib->sib_scope_id = 0;
+	}
+	if (dst_addr) {
+		ib = (struct sockaddr_ib *)dst_addr;
+		ib->sib_family = AF_IB;
+		ib->sib_pkey = path->pkey;
+		ib->sib_flowinfo = path->flow_label;
+		memcpy(&ib->sib_addr, &path->dgid, 16);
+	}
+}
+
+static void cm_save_ip6_info(struct sockaddr *src_addr,
+			     struct sockaddr *dst_addr,
+			     struct cm_hdr *hdr,
+			     __be16 local_port)
+{
+	struct sockaddr_in6 *ip6;
+
+	if (src_addr) {
+		ip6 = (struct sockaddr_in6 *)src_addr;
+		ip6->sin6_family = AF_INET6;
+		ip6->sin6_addr = hdr->dst_addr.ip6;
+		ip6->sin6_port = local_port;
+	}
+
+	if (dst_addr) {
+		ip6 = (struct sockaddr_in6 *)dst_addr;
+		ip6->sin6_family = AF_INET6;
+		ip6->sin6_addr = hdr->src_addr.ip6;
+		ip6->sin6_port = hdr->port;
+	}
+}
+
+static void cm_save_ip4_info(struct sockaddr *src_addr,
+			     struct sockaddr *dst_addr,
+			     struct cm_hdr *hdr,
+			     __be16 local_port)
+{
+	struct sockaddr_in *ip4;
+
+	if (src_addr) {
+		ip4 = (struct sockaddr_in *)src_addr;
+		ip4->sin_family = AF_INET;
+		ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr;
+		ip4->sin_port = local_port;
+   

[PATCH v1 01/10] IB/addr: Pass network namespace as a parameter

2015-02-11 Thread Shachar Raindel
From: Guy Shapiro gu...@mellanox.com

Add network namespace support to the ib_addr module. For that, all the address
resolution and matching should be done using the appropriate namespace instead
of init_net.

This is achieved by:

1. Adding an explicit network namespace argument to exported functions that
   require a namespace.
2. Saving the namespace in the rdma_addr_client structure.
3. Using it when calling networking functions.

In order to preserve the behavior of calling modules, init_net is
passed as the parameter in calls from other modules. This is modified as
namespace support is added on more levels.

Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Signed-off-by: Guy Shapiro gu...@mellanox.com

---
 drivers/infiniband/core/addr.c   | 31 --
 drivers/infiniband/core/cma.c|  4 ++-
 drivers/infiniband/core/verbs.c  | 14 +++---
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |  3 ++-
 include/rdma/ib_addr.h   | 44 
 5 files changed, 72 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index f80da50..95beaef 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -128,7 +128,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
int ret = -EADDRNOTAVAIL;
 
 	if (dev_addr->bound_dev_if) {
-		dev = dev_get_by_index(&init_net, dev_addr->bound_dev_if);
+		dev = dev_get_by_index(dev_addr->net, dev_addr->bound_dev_if);
 		if (!dev)
 			return -ENODEV;
 		ret = rdma_copy_addr(dev_addr, dev, NULL);
@@ -137,9 +137,10 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
}
 
 	switch (addr->sa_family) {
-	case AF_INET:
-		dev = ip_dev_find(&init_net,
-			((struct sockaddr_in *) addr)->sin_addr.s_addr);
+	case AF_INET: {
+		struct sockaddr_in *addr_in = (struct sockaddr_in *)addr;
+
+		dev = ip_dev_find(dev_addr->net, addr_in->sin_addr.s_addr);
 
 		if (!dev)
 			return ret;
@@ -149,12 +150,12 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
 		*vlan_id = rdma_vlan_dev_vlan_id(dev);
 		dev_put(dev);
 		break;
-
+	}
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
 		rcu_read_lock();
-		for_each_netdev_rcu(&init_net, dev) {
-			if (ipv6_chk_addr(&init_net,
+		for_each_netdev_rcu(dev_addr->net, dev) {
+			if (ipv6_chk_addr(dev_addr->net,
 					  &((struct sockaddr_in6 *) addr)->sin6_addr,
 					  dev, 1)) {
 				ret = rdma_copy_addr(dev_addr, dev, NULL);
@@ -236,7 +237,7 @@ static int addr4_resolve(struct sockaddr_in *src_in,
 	fl4.daddr = dst_ip;
 	fl4.saddr = src_ip;
 	fl4.flowi4_oif = addr->bound_dev_if;
-	rt = ip_route_output_key(&init_net, &fl4);
+	rt = ip_route_output_key(addr->net, &fl4);
 	if (IS_ERR(rt)) {
 		ret = PTR_ERR(rt);
 		goto out;
@@ -278,12 +279,13 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
 	fl6.saddr = src_in->sin6_addr;
 	fl6.flowi6_oif = addr->bound_dev_if;
 
-	dst = ip6_route_output(&init_net, NULL, &fl6);
+	dst = ip6_route_output(addr->net, NULL, &fl6);
 	if ((ret = dst->error))
 		goto put;
 
 	if (ipv6_addr_any(&fl6.saddr)) {
-		ret = ipv6_dev_get_saddr(&init_net, ip6_dst_idev(dst)->dev,
+		ret = ipv6_dev_get_saddr(addr->net,
+					 ip6_dst_idev(dst)->dev,
 					 &fl6.daddr, 0, &fl6.saddr);
 		if (ret)
 			goto put;
@@ -458,7 +460,7 @@ static void resolve_cb(int status, struct sockaddr *src_addr,
 }
 
 int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac,
-			       u16 *vlan_id)
+			       u16 *vlan_id, struct net *net)
 {
 	int ret = 0;
 	struct rdma_dev_addr dev_addr;
@@ -481,6 +483,7 @@ int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac,
 		return ret;
 
 	memset(&dev_addr, 0, sizeof(dev_addr));
+	dev_addr.net = net;
 
 	ctx.addr = &dev_addr;
 	init_completion(&ctx.comp);
@@ -492,7 +495,7 @@ int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac,
 	wait_for_completion(&ctx.comp);
 
 	memcpy(dmac, dev_addr.dst_dev_addr, ETH_ALEN);
-	dev = dev_get_by_index(&init_net, dev_addr.bound_dev_if);
+  

[PATCH v1 03/10] IB/core: Find the network namespace matching connection parameters

2015-02-11 Thread Shachar Raindel
From: Yotam Kenneth yota...@mellanox.com

In the case of IPoIB, and maybe in other cases, the network device is
managed by an upper-layer protocol (ULP). In order to expose this
network device to other users of the IB device, let ULPs implement a
callback that returns the network device matching the connection parameters.

The IB device and port, together with the P_Key and the IP address should be
enough to uniquely identify the ULP net device.

This function is passed to ib_core as part of the ib_client
registration.

Using this functionality, add a way to get the network namespace
corresponding to a work completion. This is needed so that responses to CM
requests can be sent from the same network namespace as the request.

Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Signed-off-by: Guy Shapiro gu...@mellanox.com

---
 drivers/infiniband/core/device.c | 57 
 include/rdma/ib_verbs.h  | 29 
 2 files changed, 86 insertions(+)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 18c1ece..2f06be5 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -38,6 +38,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/mutex.h>
+#include <linux/netdevice.h>
 #include <rdma/rdma_netlink.h>
 
 #include core_priv.h
@@ -733,6 +734,62 @@ int ib_find_pkey(struct ib_device *device,
 }
 EXPORT_SYMBOL(ib_find_pkey);
 
+static struct net_device *ib_get_net_dev_by_port_pkey_ip(struct ib_device *dev,
+							 u8 port,
+							 u16 pkey,
+							 struct sockaddr *addr)
+{
+	struct net_device *ret = NULL;
+	struct ib_client *client;
+
+	mutex_lock(&device_mutex);
+	list_for_each_entry(client, &client_list, list)
+		if (client->get_net_device_by_port_pkey_ip) {
+			ret = client->get_net_device_by_port_pkey_ip(dev, port,
+								     pkey,
+								     addr);
+			if (ret)
+				break;
+		}
+
+	mutex_unlock(&device_mutex);
+	return ret;
+}
+
+struct net *ib_get_net_ns_by_port_pkey_ip(struct ib_device *dev,
+ u8 port,
+ u16 pkey,
+ struct sockaddr *addr)
+{
+   struct net_device *ndev = NULL;
+   struct net *ns;
+
+   switch (rdma_port_get_link_layer(dev, port)) {
+   case IB_LINK_LAYER_INFINIBAND:
+   if (!addr)
+   goto not_found;
+   ndev = ib_get_net_dev_by_port_pkey_ip(dev, port, pkey, addr);
+   break;
+   default:
+   goto not_found;
+   }
+
+   if (!ndev)
+   goto not_found;
+
+   rcu_read_lock();
+   ns = maybe_get_net(dev_net(ndev));
+   dev_put(ndev);
+   rcu_read_unlock();
+   if (!ns)
+   goto not_found;
+   return ns;
+
+not_found:
+	return get_net(&init_net);
+}
+EXPORT_SYMBOL(ib_get_net_ns_by_port_pkey_ip);
+
 static int __init ib_core_init(void)
 {
int ret;
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f4a85de..74b2394 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1683,6 +1683,21 @@ struct ib_client {
void (*add)   (struct ib_device *);
void (*remove)(struct ib_device *);
 
+	/* Returns the net_dev belonging to this ib_client and matching the
+	 * given parameters.
+	 * @dev:	An RDMA device that the net_dev uses for communication.
+	 * @port:	A physical port number on the RDMA device.
+	 * @pkey:	P_Key that the net_dev uses, if applicable.
+	 * @addr:	An IP address the net_dev is configured with.
+	 *
+	 * An ib_client that implements a net_dev on top of RDMA devices
+	 * (such as IP over IB) should implement this callback, allowing the
+	 * rdma_cm module to find the right net_dev for a given request. */
+   struct net_device *(*get_net_device_by_port_pkey_ip)(
+   struct ib_device *dev,
+   u8 port,
+   u16 pkey,
+   struct sockaddr *addr);
struct list_head list;
 };
 
@@ -2679,4 +2694,18 @@ static inline int ib_check_mr_access(int flags)
 int ib_check_mr_status(struct ib_mr *mr, u32 check_mask,
   struct ib_mr_status *mr_status);
 
+/**
+ * ib_get_net_ns_by_port_pkey_ip() - Return the appropriate net namespace
+ * for a received CM request
+ * @dev:   An RDMA device on 

[PATCH v1 04/10] IB/ipoib: Return IPoIB devices as possible matches to get_net_device_by_port_pkey_ip

2015-02-11 Thread Shachar Raindel
From: Guy Shapiro gu...@mellanox.com

Implement a callback that returns a network device to ib_core according to
the connection parameters. Check the IPoIB device and iterate over all child
devices to look for a match.

For each ipoib device we iterate through all upper devices when searching for
a matching IP, in order to support bonding.

Signed-off-by: Guy Shapiro gu...@mellanox.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com

---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 122 +-
 1 file changed, 121 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 58b5aa3..63119ff 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -48,6 +48,9 @@
 
 #include <linux/jhash.h>
 #include <net/arp.h>
+#include <net/addrconf.h>
+#include <linux/inetdevice.h>
+#include <rdma/ib_cache.h>
 
 #define DRV_VERSION "1.0.0"
 
@@ -91,11 +94,15 @@ struct ib_sa_client ipoib_sa_client;
 static void ipoib_add_one(struct ib_device *device);
 static void ipoib_remove_one(struct ib_device *device);
 static void ipoib_neigh_reclaim(struct rcu_head *rp);
+static struct net_device *ipoib_get_net_device_by_port_pkey_ip(
+   struct ib_device *dev, u8 port, u16 pkey,
+   struct sockaddr *addr);
 
 static struct ib_client ipoib_client = {
 	.name   = "ipoib",
 	.add    = ipoib_add_one,
-	.remove = ipoib_remove_one
+	.remove = ipoib_remove_one,
+	.get_net_device_by_port_pkey_ip = ipoib_get_net_device_by_port_pkey_ip,
 };
 
 int ipoib_open(struct net_device *dev)
@@ -222,6 +229,119 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu)
return 0;
 }
 
+static bool ipoib_is_dev_match_addr(struct sockaddr *addr,
+				    struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+
+	if (addr->sa_family == AF_INET) {
+		struct in_device *in_dev = in_dev_get(dev);
+		struct sockaddr_in *addr_in = (struct sockaddr_in *)addr;
+		__be32 ret_addr;
+
+		if (!in_dev)
+			return false;
+
+		ret_addr = inet_confirm_addr(net, in_dev, 0,
+					     addr_in->sin_addr.s_addr,
+					     RT_SCOPE_HOST);
+		in_dev_put(in_dev);
+		if (ret_addr)
+			return true;
+	}
+#if IS_ENABLED(CONFIG_IPV6)
+	else if (addr->sa_family == AF_INET6) {
+		struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr;
+
+		if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1))
+			return true;
+	}
+#endif
+	return false;
+}
+
+/**
+ * Find a net_device matching the given address, which is an upper device of
+ * the given net_device.
+ * @addr: IP address to look for.
+ * @dev: base IPoIB net_device
+ *
+ * If found, returns the net_device with a reference held. Otherwise return
+ * NULL.
+ */
+static struct net_device *ipoib_get_net_dev_match_addr(struct sockaddr *addr,
+  struct net_device *dev)
+{
+   struct net_device *upper,
+ *result = NULL;
+   struct list_head *iter;
+
+   if (ipoib_is_dev_match_addr(addr, dev)) {
+   dev_hold(dev);
+   return dev;
+   }
+
+   rcu_read_lock();
+   netdev_for_each_all_upper_dev_rcu(dev, upper, iter) {
+   if (ipoib_is_dev_match_addr(addr, upper)) {
+   dev_hold(upper);
+   result = upper;
+   break;
+   }
+   }
+   rcu_read_unlock();
+   return result;
+}
+
+static struct net_device *ipoib_get_net_device_by_port_pkey_ip(
+	struct ib_device *dev, u8 port, u16 pkey, struct sockaddr *addr)
+{
+	struct ipoib_dev_priv *priv;
+	struct list_head *dev_list;
+	u16 pkey_index;
+
+	ib_find_cached_pkey(dev, port, pkey, &pkey_index);
+	if (pkey_index == (u16)-1)
+		return NULL;
+
+	if (rdma_node_get_transport(dev->node_type) != RDMA_TRANSPORT_IB)
+		return NULL;
+
+	dev_list = ib_get_client_data(dev, &ipoib_client);
+	if (!dev_list)
+		return NULL;
+
+	list_for_each_entry(priv, dev_list, list) {
+		struct net_device *net_dev = NULL;
+		struct ipoib_dev_priv *child_priv;
+
+		if (priv->port != port)
+			continue;
+
+		if (priv->pkey_index == pkey_index) {
+			net_dev = ipoib_get_net_dev_match_addr(addr, priv->dev);
+			if (net_dev)
+				return net_dev;
+		}
+   

[PATCH v1 06/10] IB/cm: Add network namespace support

2015-02-11 Thread Shachar Raindel
From: Guy Shapiro gu...@mellanox.com

Add namespace support to the IB-CM layer.

- Each CM-ID now has a network namespace it is associated with, assigned at
  creation. This namespace is used as needed during subsequent action on the
  CM-ID or related objects.

- All of the relevant calls to ib_addr and ib_core were updated to use the
  namespace from the CM-ID. External APIs were extended as needed to allow
  specifying the namespace where relevant.

- The listening service ID table is now also indexed by the CM-ID namespace.

- For incoming connection requests, the connection parameters are used to
  select the namespace. The namespace is matched when looking up the listening
  service ID.

To preserve the current behavior, init_net is passed to ib_cm wherever network
namespace parameters were added.

The ib_cm_create_id interface now takes a reference to the relevant network
namespace. CM-IDs created by accepting a connection for a listening CM-ID will
also take a reference to the namespace. When the ID is destroyed, the
namespace reference is released.

Signed-off-by: Guy Shapiro gu...@mellanox.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com

---
 drivers/infiniband/core/cm.c| 124 
 drivers/infiniband/core/cma.c   |   8 ++-
 drivers/infiniband/core/ucm.c   |   3 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |  21 +-
 drivers/infiniband/ulp/srp/ib_srp.c |   2 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c   |   2 +-
 include/rdma/ib_cm.h|   7 +-
 7 files changed, 130 insertions(+), 37 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index efc5cff..75c6ac9 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -241,6 +241,8 @@ struct cm_id_private {
u8 service_timeout;
u8 target_ack_delay;
 
+   struct net *net; /* A network namespace that the ID belongs to */
+
struct list_head work_list;
atomic_t work_count;
 };
@@ -347,12 +349,13 @@ static void cm_set_private_data(struct cm_id_private *cm_id_priv,
 }
 
 static void cm_init_av_for_response(struct cm_port *port, struct ib_wc *wc,
-				    struct ib_grh *grh, struct cm_av *av)
+				    struct ib_grh *grh, struct cm_av *av,
+				    struct net *net)
 {
 	av->port = port;
 	av->pkey_index = wc->pkey_index;
 	ib_init_ah_from_wc(port->cm_dev->ib_device, port->port_num, wc,
-			   grh, &av->ah_attr, &init_net);
+			   grh, &av->ah_attr, net);
 }
 
 static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
@@ -521,10 +524,15 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv)
 		if ((cur_cm_id_priv->id.service_mask & service_id) ==
 		    (service_mask & cur_cm_id_priv->id.service_id) &&
 		    (cm_id_priv->id.device == cur_cm_id_priv->id.device) &&
-		    !data_cmp)
+		    !data_cmp &&
+		    net_eq(cm_id_priv->net, cur_cm_id_priv->net))
 			return cur_cm_id_priv;
 
-		if (cm_id_priv->id.device < cur_cm_id_priv->id.device)
+		if (cm_id_priv->net < cur_cm_id_priv->net)
+			link = &(*link)->rb_left;
+		else if (cm_id_priv->net > cur_cm_id_priv->net)
+			link = &(*link)->rb_right;
+		else if (cm_id_priv->id.device < cur_cm_id_priv->id.device)
 			link = &(*link)->rb_left;
 		else if (cm_id_priv->id.device > cur_cm_id_priv->id.device)
 			link = &(*link)->rb_right;
@@ -544,7 +552,8 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv)
 
 static struct cm_id_private * cm_find_listen(struct ib_device *device,
 					     __be64 service_id,
-					     u8 *private_data)
+					     u8 *private_data,
+					     struct net *net)
 {
 	struct rb_node *node = cm.listen_service_table.rb_node;
 	struct cm_id_private *cm_id_priv;
@@ -556,10 +565,14 @@ static struct cm_id_private * cm_find_listen(struct ib_device *device,
 					  cm_id_priv->compare_data);
 		if ((cm_id_priv->id.service_mask & service_id) ==
 		     cm_id_priv->id.service_id &&
-		    (cm_id_priv->id.device == device) && !data_cmp)
+		    (cm_id_priv->id.device == device) && !data_cmp &&
+		    net_eq(cm_id_priv->net, net))
 			return cm_id_priv;
-
-		if (device < cm_id_priv->id.device)
+		if (net < cm_id_priv->net)
+   node = 

[PATCH v1 00/10] Add network namespace support in the RDMA-CM

2015-02-11 Thread Shachar Raindel
RDMA-CM uses IP based addressing and routing to setup RDMA connections between
hosts. Currently, all of the IP interfaces and addresses used by the RDMA-CM
must reside in the init_net namespace. This restricts containers that use RDMA
to the host network namespace (i.e. the kernel's init_net instance).

This patchset allows using network namespaces with the RDMA-CM.

Each RDMA-CM ID and CM ID keeps a reference to a network namespace.

The reference is taken from the process's network namespace at the time the
object is created, or inherited from the listener.

This network namespace is used to perform all IP and network related
operations. Specifically, the local device lookup, as well as the remote GID
address resolution are done in the context of the RDMA-CM object's namespace.
This allows outgoing connections to reach the right target, even if the same
IP address exists in multiple network namespaces. This can happen if each
network namespace resides on a different pkey.

Additionally, the network namespace is used to split the listener service ID
table. From the user point of view, each network namespace has a unique,
completely independent table of service IDs. This allows running multiple
instances of a single service on the same machine, using containers. To
implement this, the CM layer now parses the IP address from the CM connect
requests, and searches for the matching networking device. The namespace of
the device found is used when looking up the service ID in the listener table.

The functionality introduced by this series comes into play when the
transport is InfiniBand and IPoIB interfaces are assigned to each namespace.
Multiple IPoIB interfaces can be created and assigned to different RDMA-CM
capable containers, for example using pipework [1].

Full support for RoCE will be introduced in a later stage.

The patches apply against kernel v3.19-rc5, with the patch RDMA/CMA: Mark
IPv4 addresses correctly when the listener is IPv6 [2] applied.

The patchset is structured as follows:

Patches 1 and 2 are relatively trivial API extensions, requiring the callers
of certain ib_addr and ib_core functions to provide a network namespace, as
needed.

Patches 3 and 4 add the ability to look up a network namespace according to
the IP address, device and P_Key. The lookup finds the matching IPoIB
interface and safely takes a reference on its network namespace before
returning to the caller.

Patch 5 moves the logic that extracts the IP address from a connect request
into the CM layer. This is needed for the upcoming listener lookup by
namespace.

Patch 6 adds support for network namespaces in the CM layer. All callers are
still passing init_net as the namespace, to maintain backward compatibility.
For incoming requests, the namespace of the relevant IPoIB device is used.

Patches 7 and 8 add proper namespace support to the RDMA-CM module.

Patches 9 and 10 add namespace support to the relevant user facing modules in
the IB stack.


[1] https://github.com/jpetazzo/pipework/pull/108
[2] https://patchwork.kernel.org/patch/5298971/

Change-log:

V0 - V1:
 - Fix code review comments by Yann
 - Rebase on top of linux-3.19

Guy Shapiro (7):
  IB/addr: Pass network namespace as a parameter
  IB/core: Pass network namespace as a parameter to relevant functions
  IB/ipoib: Return IPoIB devices as possible matches to
get_net_device_by_port_pkey_ip
  IB/cm,cma: Move RDMA IP CM private-data parsing code from ib_cma to
ib_cm
  IB/cm: Add network namespace support
  IB/cma: Add support for network namespaces
  IB/ucma: Take the network namespace from the process

Shachar Raindel (1):
  IB/ucm: Add partial support for network namespaces

Yotam Kenneth (2):
  IB/core: Find the network namespace matching connection parameters
  IB/cma: Separate port allocation to network namespaces

 drivers/infiniband/core/addr.c |  31 +-
 drivers/infiniband/core/agent.c|   4 +-
 drivers/infiniband/core/cm.c   | 287 --
 drivers/infiniband/core/cma.c  | 332 +
 drivers/infiniband/core/device.c   |  57 
 drivers/infiniband/core/mad_rmpp.c |  10 +-
 drivers/infiniband/core/ucm.c  |   4 +-
 drivers/infiniband/core/ucma.c |   4 +-
 drivers/infiniband/core/user_mad.c |   4 +-
 drivers/infiniband/core/verbs.c|  22 +-
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c   |   3 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c|  21 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c  | 122 +++-
 drivers/infiniband/ulp/iser/iser_verbs.c   |   2 +-
 drivers/infiniband/ulp/isert/ib_isert.c|   2 +-
 drivers/infiniband/ulp/srp/ib_srp.c|   2 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c  |   5 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h  

[PATCH v1 08/10] IB/cma: Add support for network namespaces

2015-02-11 Thread Shachar Raindel
From: Guy Shapiro gu...@mellanox.com

Add support for network namespaces in the ib_cma module. This is
accomplished by:

1. Adding a network namespace parameter to rdma_create_id. This parameter is
   used to populate the network namespace field in rdma_id_private.
   rdma_create_id keeps a reference on the network namespace.
2. Using the network namespace from the rdma_id instead of init_net inside
   ib_cma.
3. Decrementing the reference count of the associated network namespace when
   calling rdma_destroy_id.

In order to preserve the current behavior init_net is passed when calling from
other modules.

Signed-off-by: Guy Shapiro gu...@mellanox.com
Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com

---
 drivers/infiniband/core/cma.c  | 52 +-
 drivers/infiniband/core/ucma.c |  3 +-
 drivers/infiniband/ulp/iser/iser_verbs.c   |  2 +-
 drivers/infiniband/ulp/isert/ib_isert.c|  2 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h|  4 +-
 include/rdma/rdma_cm.h |  6 ++-
 net/9p/trans_rdma.c|  2 +-
 net/rds/ib.c   |  2 +-
 net/rds/ib_cm.c|  2 +-
 net/rds/iw.c   |  2 +-
 net/rds/iw_cm.c|  2 +-
 net/rds/rdma_transport.c   |  2 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c   |  2 +-
 net/sunrpc/xprtrdma/verbs.c|  3 +-
 14 files changed, 52 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 022b0d0..9ea42fe 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -540,7 +540,8 @@ static int cma_disable_callback(struct rdma_id_private *id_priv,
 
 struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
 				  void *context, enum rdma_port_space ps,
-				  enum ib_qp_type qp_type)
+				  enum ib_qp_type qp_type,
+				  struct net *net)
 {
 	struct rdma_id_private *id_priv;
 
@@ -562,7 +563,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
 	INIT_LIST_HEAD(&id_priv->listen_list);
 	INIT_LIST_HEAD(&id_priv->mc_list);
 	get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
-	id_priv->id.route.addr.dev_addr.net = &init_net;
+	id_priv->id.route.addr.dev_addr.net = get_net(net);
 
 	return &id_priv->id;
 }
@@ -689,7 +690,7 @@ static int cma_modify_qp_rtr(struct rdma_id_private *id_priv,
 	    rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num)
 	    == IB_LINK_LAYER_ETHERNET) {
 		ret = rdma_addr_find_smac_by_sgid(&sgid, qp_attr.smac, NULL,
-						  &init_net);
+				id_priv->id.route.addr.dev_addr.net);
 
 		if (ret)
 			goto out;
@@ -953,6 +954,7 @@ static void cma_cancel_operation(struct rdma_id_private *id_priv,
 static void cma_release_port(struct rdma_id_private *id_priv)
 {
 	struct rdma_bind_list *bind_list = id_priv->bind_list;
+	struct net *net = id_priv->id.route.addr.dev_addr.net;
 
 	if (!bind_list)
 		return;
@@ -960,7 +962,7 @@ static void cma_release_port(struct rdma_id_private *id_priv)
 	mutex_lock(&lock);
 	hlist_del(&id_priv->node);
 	if (hlist_empty(&bind_list->owners)) {
-		cma_ps_remove(bind_list->ps, &init_net, bind_list->port);
+		cma_ps_remove(bind_list->ps, net, bind_list->port);
 		kfree(bind_list);
 	}
 	mutex_unlock(&lock);
@@ -1029,6 +1031,7 @@ void rdma_destroy_id(struct rdma_cm_id *id)
cma_deref_id(id_priv-id.context);
 
kfree(id_priv-id.route.path_rec);
+   put_net(id_priv-id.route.addr.dev_addr.net);
kfree(id_priv);
 }
 EXPORT_SYMBOL(rdma_destroy_id);
@@ -1156,7 +1159,8 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id,
	int ret;
 
	id = rdma_create_id(listen_id->event_handler, listen_id->context,
-			    listen_id->ps, ib_event->param.req_rcvd.qp_type);
+			    listen_id->ps, ib_event->param.req_rcvd.qp_type,
+			    listen_id->route.addr.dev_addr.net);
	if (IS_ERR(id))
		return NULL;
 
@@ -1201,10 +1205,11 @@ static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id,
 {
	struct rdma_id_private *id_priv;
	struct rdma_cm_id *id;
+	struct net *net = listen_id->route.addr.dev_addr.net;
	int ret;
 
	id = rdma_create_id(listen_id->event_handler, 

[PATCH v1 07/10] IB/cma: Separate port allocation to network namespaces

2015-02-11 Thread Shachar Raindel
From: Yotam Kenneth yota...@mellanox.com

Keep a radix-tree for the network namespaces we support for each port-space.
Dynamically allocate an idr for a network namespace upon the first bind
request for a port in the (ps, net) tuple.
Destroy the idr when the (ps, net) tuple no longer contains any bound ports.

This patch is internal infrastructure work for the following patch. In
this patch, init_net is statically used as the network namespace for
the new port-space API.

The radix-tree is protected under the same locking that protects the
rest of the port space data. This locking is practically a big, static
mutex lock for the entire module.

Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Signed-off-by: Guy Shapiro gu...@mellanox.com

---
 drivers/infiniband/core/cma.c | 122 ++
 1 file changed, 99 insertions(+), 23 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 1ce84a0..022b0d0 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -39,11 +39,13 @@
 #include <linux/mutex.h>
 #include <linux/random.h>
 #include <linux/idr.h>
+#include <linux/radix-tree.h>
 #include <linux/inetdevice.h>
 #include <linux/slab.h>
 #include <linux/module.h>
 #include <net/route.h>
 
+#include <net/netns/hash.h>
 #include <net/tcp.h>
 #include <net/ipv6.h>
 
@@ -80,10 +82,83 @@ static LIST_HEAD(dev_list);
 static LIST_HEAD(listen_any_list);
 static DEFINE_MUTEX(lock);
 static struct workqueue_struct *cma_wq;
-static DEFINE_IDR(tcp_ps);
-static DEFINE_IDR(udp_ps);
-static DEFINE_IDR(ipoib_ps);
-static DEFINE_IDR(ib_ps);
+static RADIX_TREE(tcp_ps, GFP_KERNEL);
+static RADIX_TREE(udp_ps, GFP_KERNEL);
+static RADIX_TREE(ipoib_ps, GFP_KERNEL);
+static RADIX_TREE(ib_ps, GFP_KERNEL);
+
+static LIST_HEAD(idrs_list);
+
+struct idr_ll {
+   unsigned net_val;
+   struct net *net;
+   struct radix_tree_root *ps;
+   struct idr idr;
+};
+
+static void zap_ps_idr(struct idr_ll *idr_ll)
+{
+	radix_tree_delete(idr_ll->ps, idr_ll->net_val);
+	idr_destroy(&idr_ll->idr);
+	kfree(idr_ll);
+}
+
+static int cma_ps_alloc(struct radix_tree_root *ps, struct net *net, void *ptr,
+			int snum)
+{
+	struct idr_ll *idr_ll;
+	int err;
+	int res;
+
+	idr_ll = radix_tree_lookup(ps, net_hash_mix(net));
+	if (!idr_ll) {
+		idr_ll = kmalloc(sizeof(*idr_ll), GFP_KERNEL);
+		if (!idr_ll)
+			return -ENOMEM;
+		idr_init(&idr_ll->idr);
+		idr_ll->net_val = net_hash_mix(net);
+		idr_ll->net = net;
+		idr_ll->ps = ps;
+		err = radix_tree_insert(ps, idr_ll->net_val, idr_ll);
+		if (err) {
+			idr_destroy(&idr_ll->idr);
+			kfree(idr_ll);
+			return err;
+		}
+	}
+	res = idr_alloc(&idr_ll->idr, ptr, snum, snum + 1, GFP_KERNEL);
+	if (unlikely((res < 0) && idr_is_empty(&idr_ll->idr))) {
+		zap_ps_idr(idr_ll);
+		return res;
+	}
+	return res;
+}
+
+static void *cma_ps_find(struct radix_tree_root *ps, struct net *net, int snum)
+{
+	struct idr_ll *idr_ll;
+
+	idr_ll = radix_tree_lookup(ps, net_hash_mix(net));
+	if (!idr_ll)
+		return NULL;
+	return idr_find(&idr_ll->idr, snum);
+}
+
+static void cma_ps_remove(struct radix_tree_root *ps, struct net *net, int snum)
+{
+	struct idr_ll *idr_ll;
+
+	idr_ll = radix_tree_lookup(ps, net_hash_mix(net));
+	if (unlikely(!idr_ll)) {
+		WARN(1, "cma_ps_remove can't find expected net ns 0x%lx\n",
+		     (unsigned long)net);
+		return;
+	}
+	idr_remove(&idr_ll->idr, snum);
+	if (idr_is_empty(&idr_ll->idr)) {
+		zap_ps_idr(idr_ll);
+	}
+}
 
 struct cma_device {
struct list_headlist;
@@ -94,9 +169,9 @@ struct cma_device {
 };
 
 struct rdma_bind_list {
-   struct idr  *ps;
-   struct hlist_head   owners;
-   unsigned short  port;
+   struct radix_tree_root  *ps;
+   struct hlist_head   owners;
+   unsigned short  port;
 };
 
 enum {
@@ -885,7 +960,7 @@ static void cma_release_port(struct rdma_id_private *id_priv)
	mutex_lock(&lock);
	hlist_del(&id_priv->node);
	if (hlist_empty(&bind_list->owners)) {
-		idr_remove(bind_list->ps, bind_list->port);
+		cma_ps_remove(bind_list->ps, &init_net, bind_list->port);
		kfree(bind_list);
	}
	mutex_unlock(&lock);
@@ -2198,8 +2273,8 @@ static void cma_bind_port(struct rdma_bind_list *bind_list,
	hlist_add_head(&id_priv->node, &bind_list->owners);
 }
 
-static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv,

[PATCH v1 10/10] IB/ucm: Add partial support for network namespaces

2015-02-11 Thread Shachar Raindel
It is impossible to completely support network namespaces for UCM, as
we cannot identify the target IPoIB device. However, we add support
which will work if the user is following the IB-Spec Annex 11 (RDMA IP
CM Services) with the service ID and private data formatting.

Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Signed-off-by: Guy Shapiro gu...@mellanox.com

---
 drivers/infiniband/core/ucm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 9604ab0..4244210 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -45,6 +45,7 @@
 #include <linux/idr.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
+#include <linux/nsproxy.h>
 
 #include <asm/uaccess.h>
 
@@ -490,7 +491,7 @@ static ssize_t ib_ucm_create_id(struct ib_ucm_file *file,
	ctx->uid = cmd.uid;
	ctx->cm_id = ib_create_cm_id(file->device->ib_dev,
				     ib_ucm_event_handler, ctx,
-				     &init_net);
+				     current->nsproxy->net_ns);
	if (IS_ERR(ctx->cm_id)) {
		result = PTR_ERR(ctx->cm_id);
		goto err1;
-- 
1.7.11.2

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH v1 02/10] IB/core: Pass network namespace as a parameter to relevant functions

2015-02-11 Thread Shachar Raindel
From: Guy Shapiro gu...@mellanox.com

Add a network namespace parameter to the address-related ib_core
functions. The parameter is passed down to the lower level functions,
instead of init_net, so things are done in the correct namespace.

For now, pass init_net from every caller.
Callers that will pass init_net permanently are marked with an
appropriate comment.

Signed-off-by: Haggai Eran hagg...@mellanox.com
Signed-off-by: Yotam Kenneth yota...@mellanox.com
Signed-off-by: Shachar Raindel rain...@mellanox.com
Signed-off-by: Guy Shapiro gu...@mellanox.com

---
 drivers/infiniband/core/agent.c   |  4 +++-
 drivers/infiniband/core/cm.c  |  9 +++--
 drivers/infiniband/core/mad_rmpp.c| 10 --
 drivers/infiniband/core/user_mad.c|  4 +++-
 drivers/infiniband/core/verbs.c   | 10 ++
 drivers/infiniband/ulp/srpt/ib_srpt.c |  3 ++-
 include/rdma/ib_verbs.h   | 15 +--
 7 files changed, 42 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c
index f6d2961..539378d 100644
--- a/drivers/infiniband/core/agent.c
+++ b/drivers/infiniband/core/agent.c
@@ -99,7 +99,9 @@ void agent_send_response(struct ib_mad *mad, struct ib_grh 
*grh,
}
 
	agent = port_priv->agent[qpn];
-	ah = ib_create_ah_from_wc(agent->qp->pd, wc, grh, port_num);
+	/* Physical devices (and their MAD replies) always reside in the host
+	 * network namespace */
+	ah = ib_create_ah_from_wc(agent->qp->pd, wc, grh, port_num, &init_net);
	if (IS_ERR(ah)) {
		dev_err(&device->dev, "ib_create_ah_from_wc error %ld\n",
			PTR_ERR(ah));
diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index e28a494..5a45cb7 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -290,8 +290,13 @@ static int cm_alloc_response_msg(struct cm_port *port,
	struct ib_mad_send_buf *m;
	struct ib_ah *ah;
 
+	/* For IB, the network namespace doesn't affect the created address
+	 * handle, so we use init_net. In the future, RoCE support will
+	 * require finding a specific network namespace to send the response
+	 * from. */
	ah = ib_create_ah_from_wc(port->mad_agent->qp->pd, mad_recv_wc->wc,
-				  mad_recv_wc->recv_buf.grh, port->port_num);
+				  mad_recv_wc->recv_buf.grh, port->port_num,
+				  &init_net);
	if (IS_ERR(ah))
		return PTR_ERR(ah);
 
@@ -346,7 +351,7 @@ static void cm_init_av_for_response(struct cm_port *port, struct ib_wc *wc,
	av->port = port;
	av->pkey_index = wc->pkey_index;
	ib_init_ah_from_wc(port->cm_dev->ib_device, port->port_num, wc,
-			   grh, &av->ah_attr);
+			   grh, &av->ah_attr, &init_net);
 }
 
 static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
diff --git a/drivers/infiniband/core/mad_rmpp.c 
b/drivers/infiniband/core/mad_rmpp.c
index f37878c..6c15762 100644
--- a/drivers/infiniband/core/mad_rmpp.c
+++ b/drivers/infiniband/core/mad_rmpp.c
@@ -157,8 +157,11 @@ static struct ib_mad_send_buf *alloc_response_msg(struct ib_mad_agent *agent,
	struct ib_ah *ah;
	int hdr_len;
 
+	/* Physical devices (and their MAD replies) always reside in the host
+	 * network namespace */
	ah = ib_create_ah_from_wc(agent->qp->pd, recv_wc->wc,
-				  recv_wc->recv_buf.grh, agent->port_num);
+				  recv_wc->recv_buf.grh, agent->port_num,
+				  &init_net);
	if (IS_ERR(ah))
		return (void *) ah;
 
@@ -287,10 +290,13 @@ create_rmpp_recv(struct ib_mad_agent_private *agent,
	if (!rmpp_recv)
		return NULL;
 
+	/* Physical devices (and their MAD replies) always reside in the host
+	 * network namespace */
	rmpp_recv->ah = ib_create_ah_from_wc(agent->agent.qp->pd,
					     mad_recv_wc->wc,
					     mad_recv_wc->recv_buf.grh,
-					     agent->agent.port_num);
+					     agent->agent.port_num,
+					     &init_net);
	if (IS_ERR(rmpp_recv->ah))
		goto error;
 
diff --git a/drivers/infiniband/core/user_mad.c 
b/drivers/infiniband/core/user_mad.c
index 928cdd2..f34c607 100644
--- a/drivers/infiniband/core/user_mad.c
+++ b/drivers/infiniband/core/user_mad.c
@@ -239,7 +239,9 @@ static void recv_handler(struct ib_mad_agent *agent,
 
	ib_init_ah_from_wc(agent->device, agent->port_num,
			   mad_recv_wc->wc, mad_recv_wc->recv_buf.grh,
-			   &ah_attr);
+			   &ah_attr, &init_net);
+	/* Note that network 

[PATCH 12/22] IB/ipoib: Handle -ENETRESET properly in our callback

2015-02-11 Thread Doug Ledford
When the core layer calls our callback with -ENETRESET as the error, we
clear the status to 0 before returning, indicating that we are going to
handle the error ourselves.  This causes the core layer not to free the
mcast->mc structure, but we were releasing our reference to it by
clearing mcast->mc.  So, preserve our reference to the multicast
structure; then, when we next run ipoib_mcast_dev_flush, it will be
able to properly release the mcast->mc entry at the right time in the
right way.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 957e7d2e80c..50d2de2270a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -315,8 +315,10 @@ ipoib_mcast_sendonly_join_complete(int status,
	mutex_lock(&mcast_mutex);
 
	/* We trap for port events ourselves. */
-	if (status == -ENETRESET)
+	if (status == -ENETRESET) {
+		status = 0;
		goto out;
+	}
 
	if (!status)
		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
@@ -344,8 +346,6 @@ out:
	if (status)
		mcast->mc = NULL;
	complete(&mcast->done);
-	if (status == -ENETRESET)
-		status = 0;
	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
	mutex_unlock(&mcast_mutex);
	return status;
@@ -462,8 +462,10 @@ static int ipoib_mcast_join_complete(int status,
	mutex_lock(&mcast_mutex);
 
	/* We trap for port events ourselves. */
-	if (status == -ENETRESET)
+	if (status == -ENETRESET) {
+		status = 0;
		goto out;
+	}
 
	if (!status)
		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
@@ -499,8 +501,6 @@ out:
	if (status)
		mcast->mc = NULL;
	complete(&mcast->done);
-	if (status == -ENETRESET)
-		status = 0;
	spin_unlock_irq(&priv->lock);
	mutex_unlock(&mcast_mutex);
 
-- 
2.1.0



[PATCH 04/22] IB/ipoib: fix mcast_dev_flush/mcast_restart_task race

2015-02-11 Thread Doug Ledford
Our mcast_dev_flush routine and our mcast_restart_task can race
against each other.  In particular, they both hold the priv->lock
while manipulating the rbtree, while removing mcast entries from
the multicast_list, and while adding entries to the remove_list, but
they also both drop their locks prior to doing the actual removes.
The mcast_dev_flush routine is run entirely under the rtnl lock and so
has at least some locking.  The actual race condition is like this:

Thread 1Thread 2
ifconfig ib0 up
  start multicast join for broadcast
  multicast join completes for broadcast
  start to add more multicast joins
call mcast_restart_task to add new entries
ifconfig ib0 down
  mcast_dev_flush
mcast_leave(mcast A)
mcast_leave(mcast A)

As mcast_leave calls ib_sa_multicast_leave, and as the member in
core/multicast.c is ref counted, we run into an unbalanced refcount
issue.  To avoid stomping on each other's removes, take the rtnl lock
specifically when we are deleting the entries from the remove list.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a52c9f3f7e4..41325960e4e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -802,7 +802,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 
	spin_unlock_irqrestore(&priv->lock, flags);
 
-	/* seperate between the wait to the leave*/
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
			wait_for_completion(&mcast->done);
@@ -923,14 +926,38 @@ void ipoib_mcast_restart_task(struct work_struct *work)
	netif_addr_unlock(dev);
	local_irq_restore(flags);
 
-	/* We have to cancel outside of the spinlock */
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
+		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+			wait_for_completion(&mcast->done);
+
+	/*
+	 * We have to cancel outside of the spinlock, but we have to
+	 * take the rtnl lock or else we race with the removal of
+	 * entries from the remove list in mcast_dev_flush as part
+	 * of ipoib_stop() which will call mcast_stop_thread with
+	 * flush == 1 while holding the rtnl lock, and the
+	 * flush_workqueue won't complete until this restart_mcast_task
+	 * completes.  So do like the carrier on task and attempt to
+	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
+	 * goes away, then just return and know that the remove list will
+	 * get flushed later by mcast_dev_flush.
+	 */
+	while (!rtnl_trylock()) {
+		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+			return;
+		else
+			msleep(20);
+	}
	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
		ipoib_mcast_leave(mcast->dev, mcast);
		ipoib_mcast_free(mcast);
	}
-
-	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-		ipoib_mcast_start_thread(dev);
+	ipoib_mcast_start_thread(dev);
+	rtnl_unlock();
 }
 
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
-- 
2.1.0



[PATCH 01/22] IB/ipoib: Consolidate rtnl_lock tasks in workqueue

2015-02-11 Thread Doug Ledford
Setting the MTU can safely be moved to the carrier_on_task, which keeps
us from needing to take the rtnl lock in the join_finish section.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index ffb83b5f7e8..eee66d13e5b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
		spin_unlock_irq(&priv->lock);
		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
		set_qkey = 1;
-
-		if (!ipoib_cm_admin_enabled(dev)) {
-			rtnl_lock();
-			dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
-			rtnl_unlock();
-		}
	}
 
	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
	}
 
	rtnl_lock();
+	if (!ipoib_cm_admin_enabled(priv->dev))
+		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
	netif_carrier_on(priv->dev);
	rtnl_unlock();
 }
-- 
2.1.0



[PATCH 13/22] IB/ipoib: don't restart our thread on ENETRESET

2015-02-11 Thread Doug Ledford
Make the send only join completion handler behave like the normal join
completion handler in that we should restart the multicast join thread
whenever our join completes with either an error or success.  But don't
restart the thread if we got a net reset event; instead, let ipoib_event
queue up a device flush, which will then call ipoib_mcast_dev_flush to
remove all of the current mcast entries, and ipoib_mcast_restart_task to
check our current mcast entry list against the netdev list of mcast
entries we are supposed to have and requeue the ones we need to get back.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 50d2de2270a..a03b6a04203 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -336,17 +336,18 @@ ipoib_mcast_sendonly_join_complete(int status,
			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
		}
		netif_tx_unlock_bh(dev);
+		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
	} else {
		/* Join completed, so reset any backoff parameters */
		mcast->backoff = 1;
		mcast->delay_until = jiffies;
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
	}
 out:
	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
	if (status)
		mcast->mc = NULL;
	complete(&mcast->done);
-	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
	mutex_unlock(&mcast_mutex);
	return status;
 }
-- 
2.1.0



[PATCH 03/22] IB/ipoib: fix MCAST_FLAG_BUSY usage

2015-02-11 Thread Doug Ledford
Commit a9c8ba5884 ("IPoIB: Fix usage of uninitialized multicast
objects") added a new flag MCAST_JOIN_STARTED, but was not very strict
in how it was used.  We didn't always initialize the completion struct
before we set the flag, and we didn't always call complete on the
completion struct from all paths that complete it.  This made it less
than totally effective, and certainly made its use confusing.  And in
the flush function we would use the presence of this flag to signal
that we should wait on the completion struct, but we never cleared
this flag, ever.  This is further muddied by the fact that we overload
the MCAST_FLAG_BUSY flag to mean two different things: we have a join
in flight, and we have succeeded in getting an ib_sa_join_multicast.

In order to make things clearer and aid in resolving the rtnl deadlock
bug I've been chasing, I cleaned this up a bit.

 1) Remove the MCAST_JOIN_STARTED flag entirely
 2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight
 3) Test on mcast->mc directly to see if we have completed
    ib_sa_join_multicast (using IS_ERR_OR_NULL)
 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize
    the mcast->done completion struct
 5) Make sure that before calling complete(&mcast->done), we always clear
    the MCAST_FLAG_BUSY bit
 6) Take the mcast_mutex before we call ib_sa_multicast_join and also
    take the mutex in our join callback.  This forces
    ib_sa_multicast_join to return and set mcast->mc before we process
    the callback.  This way, our callback can safely clear mcast->mc
    if there is an error on the join and we will do the right thing as
    a result in mcast_dev_flush.
 7) Because we need the mutex to synchronize mcast->mc, we can no
    longer call mcast_sendonly_join directly from mcast_send and
    instead must add sendonly join processing to the mcast_join_task

A number of different races are resolved with these changes.  These
races existed with the old MCAST_FLAG_BUSY usage, the
MCAST_JOIN_STARTED flag was an attempt to address them, and while it
helped, a determined effort could still trip things up.

One race looks something like this:

Thread 1 Thread 2
ib_sa_join_multicast (as part of running restart mcast task)
  alloc member
  call callback
 ifconfig ib0 down
 wait_for_completion
callback call completes
 wait_for_completion in
 mcast_dev_flush completes
   mcast->mc is PTR_ERR_OR_NULL
   so we skip ib_sa_leave_multicast
return from callback
  return from ib_sa_join_multicast
set mcast->mc = return from ib_sa_multicast

We now have a permanently unbalanced join/leave issue that trips up the
refcounting in core/multicast.c

Another like this:

Thread 1   Thread 2 Thread 3
ib_sa_multicast_join
ifconfig ib0 down
priv->broadcast = NULL
   join_complete
		wait_for_completion
   mcast->mc is not yet set, so don't clear
return from ib_sa_join_multicast and set mcast->mc
   complete
   return -EAGAIN (making mcast->mc invalid)
		call ib_sa_multicast_leave
		on invalid mcast->mc, hang
		forever

By holding the mutex around ib_sa_multicast_join and taking the mutex
early in the callback, we force mcast->mc to be valid at the time we
run the callback.  This allows us to clear mcast->mc if there is an
error and the join is going to fail.  We do this before we complete
the mcast.  In this way, mcast_dev_flush always sees consistent state
in regards to mcast->mc membership at the time that the
wait_for_completion() returns.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  10 +-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 148 -
 2 files changed, 101 insertions(+), 57 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index d7562beb542..f4c1b20b23b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -98,9 +98,15 @@ enum {
 
	IPOIB_MCAST_FLAG_FOUND	  = 0,	/* used in set_multicast_list */
	IPOIB_MCAST_FLAG_SENDONLY = 1,
-	IPOIB_MCAST_FLAG_BUSY	  = 2,	/* joining or already joined */
+	/*
+	 * For IPOIB_MCAST_FLAG_BUSY
+	 * When set, in flight join and mcast->mc is unreliable
+	 * When clear and mcast->mc IS_ERR_OR_NULL, need to restart or
+	 *   haven't started yet
+	 * When clear and 

[PATCH 15/22] IB/ipoib: fix race between mcast_dev_flush and mcast_join

2015-02-11 Thread Doug Ledford
There is a classic race between mcast_dev_flush and mcast_join_thread
in that the join thread is going to loop through all of the mcast
entries in the device mcast list, while dev_flush wants to remove
those entries from the list so that they can eventually be freed when
they are no longer busy.

As always, the list traversal and the list removal must be done under
the same lock to avoid races, so modify the join_task thread code to
hold onto the spinlock for the entirety of its run up until it has
saved the mcast entry that it's going to work on into a private pointer
and set the flags on that entry to mark it as busy.  Only then is it
safe to drop the lock and let mcast_dev_flush remove entries from the
list.

I also reworked the flow of the join_task thread to make it clearer
that it takes a single pass at the work to do and then exits as soon
as it gets that single task queued off.

I then reworked both the regular and send only join routines to know
that they will be called with the BUSY flag already set on the mcast
entry they are passed, so any error condition means they need to
clear the flag and complete the task and they should not check the
busy status as it will now always be true.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 112 -
 1 file changed, 56 insertions(+), 56 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 424c602d5a0..972fa15a225 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -368,22 +368,16 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
	if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
		ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
				"multicast joins\n");
+		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
+		complete(&mcast->done);
		return -ENODEV;
	}
 
-	if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags)) {
-		ipoib_dbg_mcast(priv, "multicast entry busy, skipping "
-				"sendonly join\n");
-		return -EBUSY;
-	}
-
	rec.mgid     = mcast->mcmember.mgid;
	rec.port_gid = priv->local_gid;
	rec.pkey     = cpu_to_be16(priv->pkey);
 
	mutex_lock(&mcast_mutex);
-	init_completion(&mcast->done);
-	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
					 priv->port, &rec,
					 IB_SA_MCMEMBER_REC_MGID	|
@@ -552,8 +546,6 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
	}
 
	mutex_lock(&mcast_mutex);
-	init_completion(&mcast->done);
-	set_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
					 &rec, comp_mask, GFP_KERNEL,
					 ipoib_mcast_join_complete, mcast);
@@ -574,6 +566,8 @@ void ipoib_mcast_join_task(struct work_struct *work)
		container_of(work, struct ipoib_dev_priv, mcast_task.work);
	struct net_device *dev = priv->dev;
	struct ib_port_attr port_attr;
+	struct ipoib_mcast *mcast = NULL;
+	int create = 1;
 
	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
		return;
@@ -591,16 +585,25 @@ void ipoib_mcast_join_task(struct work_struct *work)
	else
		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
 
+	/*
+	 * We have to hold the mutex to keep from racing with the join
+	 * completion threads on setting flags on mcasts, and we have
+	 * to hold the priv->lock because dev_flush will remove entries
+	 * out from underneath us, so at a minimum we need the lock
+	 * through the time that we do the for_each loop of the mcast
+	 * list or else dev_flush can make us oops.
+	 */
+	mutex_lock(&mcast_mutex);
+	spin_lock_irq(&priv->lock);
+	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+		goto out;
+
	if (!priv->broadcast) {
		struct ipoib_mcast *broadcast;
 
-		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-			return;
-
-		broadcast = ipoib_mcast_alloc(dev, 1);
+		broadcast = ipoib_mcast_alloc(dev, 0);
		if (!broadcast) {
			ipoib_warn(priv, "failed to allocate broadcast group\n");
-			mutex_lock(&mcast_mutex);
			/*
			 * Restart us after a 1 second delay to retry
			 * creating our broadcast group and attaching to
@@ -608,67 +611,64 @@ void 

[PATCH 11/22] IB/ipoib: make delayed tasks not hold up everything

2015-02-11 Thread Doug Ledford
Currently, the multicast join thread only processes one task at a
time.  It does this by having itself scheduled as a delayed work
item; when it runs, it finds one, and only one, piece of work to
process, kicks that off via either the normal join process or the
sendonly join process, and then immediately exits.  Both of those
process chains are responsible for restarting the task when they have
completed their specific action.  This makes the entire join process
serial, with only one join supposedly ever outstanding at a time.

However, if we fail a join, and we need to initiate a backoff delay,
that delay holds up the entire join process for all joins, not just
the failed join.  So modify the design such that we can have joins
in delay, and the multicast thread will ignore them until their time
comes around, and then we can process the rest of the queue without
waiting for the delayed items.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 86 +-
 2 files changed, 57 insertions(+), 30 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 8ba80a6d3a4..c79dcd5ee8a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -154,6 +154,7 @@ struct ipoib_mcast {
 
unsigned long created;
unsigned long backoff;
+   unsigned long delay_until;
 
unsigned long flags;
unsigned char logcount;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index cb1e495bd74..957e7d2e80c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -67,10 +67,34 @@ struct ipoib_mcast_iter {
 };
 
 static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv,
+					       struct ipoib_mcast *mcast,
					       int delay)
 {
-	if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+	if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
+		/*
+		 * Mark this mcast for its delay and set a timer to kick the
+		 * thread when the delay completes
+		 */
+		if (mcast && delay) {
+			mcast->backoff *= 2;
+			if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
+				mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+			mcast->delay_until = jiffies + (mcast->backoff * HZ);
+			queue_delayed_work(priv->wq, &priv->mcast_task,
+					   mcast->backoff * HZ);
+		} else if (delay) {
+			/* Special case of retrying after a failure to
+			 * allocate the broadcast multicast group, wait
+			 * 1 second and try again
+			 */
+			queue_delayed_work(priv->wq, &priv->mcast_task, HZ);
+		}
+		/*
+		 * But also rekick the thread immediately for any other
+		 * tasks in queue behind this one
+		 */
		queue_delayed_work(priv->wq, &priv->mcast_task, delay);
+	}
 }
 
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
@@ -110,6 +134,7 @@ static struct ipoib_mcast *ipoib_mcast_alloc(struct 
net_device *dev,
 
	mcast->dev = dev;
	mcast->created = jiffies;
+   mcast->delay_until = jiffies;
	mcast->backoff = 1;

	INIT_LIST_HEAD(&mcast->list);
@@ -309,6 +334,10 @@ ipoib_mcast_sendonly_join_complete(int status,
	dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
}
netif_tx_unlock_bh(dev);
+   } else {
+   /* Join completed, so reset any backoff parameters */
+   mcast->backoff = 1;
+   mcast->delay_until = jiffies;
	}
 out:
	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
@@ -317,7 +346,7 @@ out:
	complete(&mcast->done);
	if (status == -ENETRESET)
	status = 0;
-   __ipoib_mcast_continue_join_thread(priv, 0);
+   __ipoib_mcast_continue_join_thread(priv, NULL, 0);
	mutex_unlock(&mcast_mutex);
return status;
 }
@@ -369,7 +398,7 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast 
*mcast)
	complete(&mcast->done);
	ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
		   "failed (ret = %d)\n", ret);
-   __ipoib_mcast_continue_join_thread(priv, 0);
+   __ipoib_mcast_continue_join_thread(priv, NULL, 0);
	} else {
	ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
	"sendonly join\n", mcast->mcmember.mgid.raw);
@@ -441,7 +470,8 @@ 
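
For reference, the capped exponential backoff the helper above applies to a failing group can be sketched in isolation (a minimal userspace sketch; `IPOIB_MAX_BACKOFF_SECONDS` is 16 in the upstream driver):

```c
#include <assert.h>

/* Sketch of the per-group backoff: the delay doubles on every join
 * failure and is clamped at IPOIB_MAX_BACKOFF_SECONDS, so a group
 * that keeps failing is retried at most once every 16 seconds. */
#define IPOIB_MAX_BACKOFF_SECONDS 16

static unsigned long next_backoff(unsigned long backoff)
{
	backoff *= 2;
	if (backoff > IPOIB_MAX_BACKOFF_SECONDS)
		backoff = IPOIB_MAX_BACKOFF_SECONDS;
	return backoff;
}
```

The point of this patch is that only the group whose join failed waits out this delay; other queued work proceeds immediately.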

[PATCH 09/22] IB/ipoib: fix IPOIB_MCAST_RUN flag usage

2015-02-11 Thread Doug Ledford
The usage of IPOIB_MCAST_RUN as a flag is inconsistent.  In some places
it is used to mean our device is administratively allowed to send
multicast joins/leaves/packets and in other places it means our
multicast join task thread is currently running and will process your
request if you put it on the queue.  However, this latter meaning is in
fact flawed as there is a race condition between the join task testing
the mcast list and finding it empty of remaining work, dropping the
mcast mutex and also the priv->lock spinlock, and clearing the
IPOIB_MCAST_RUN flag.  Further, there are numerous locations that use
the flag in the former fashion, and when all tasks complete and the task
thread clears the RUN flag, all of those other locations will fail to
ever again queue any work.  This results in the interface coming up fine
initially, but having problems adding new multicast groups after the
first round of groups have all been added and the RUN flag is cleared by
the join task thread when it thinks it is done.  To resolve this issue,
convert all locations in the code to treat the RUN flag as an indicator
that the multicast portion of this interface is in fact administratively
up and joins/leaves/sends can be performed.  There is no harm (other
than a slight performance penalty) to never clearing this flag and using
it in this fashion as it simply means that a few places that used to
micro-optimize how often this task was queued on a work queue will now
queue the task a few extra times.  We can address that suboptimal
behavior in future patches.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index bc50dd0d0e4..91b8fe118ec 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -630,8 +630,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
}
 
	ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
-
-   clear_bit(IPOIB_MCAST_RUN, &priv->flags);
 }
 
 int ipoib_mcast_start_thread(struct net_device *dev)
@@ -641,8 +639,8 @@ int ipoib_mcast_start_thread(struct net_device *dev)
	ipoib_dbg_mcast(priv, "starting multicast thread\n");

	mutex_lock(&mcast_mutex);
-   if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
-   queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+   set_bit(IPOIB_MCAST_RUN, &priv->flags);
+   queue_delayed_work(priv->wq, &priv->mcast_task, 0);
	mutex_unlock(&mcast_mutex);
 
return 0;
@@ -725,7 +723,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, 
struct sk_buff *skb)
	memcpy(mcast->mcmember.mgid.raw, mgid, sizeof (union ib_gid));
	__ipoib_mcast_add(dev, mcast);
	list_add_tail(&mcast->list, &priv->multicast_list);
-   if (!test_and_set_bit(IPOIB_MCAST_RUN, &priv->flags))
+   if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
	queue_delayed_work(priv->wq, &priv->mcast_task, 0);
}
 
@@ -951,7 +949,8 @@ void ipoib_mcast_restart_task(struct work_struct *work)
/*
 * Restart our join task if needed
 */
-   ipoib_mcast_start_thread(dev);
+   if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+   queue_delayed_work(priv->wq, &priv->mcast_task, 0);
rtnl_unlock();
 }
 
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
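
In userspace terms, the flag semantics this patch settles on look roughly like the following sketch (`mcast_run`, `queue_join_task`, and `mcast_send` are illustrative stand-ins, not driver API): the flag gates whether work may be queued at all, and the worker never clears it just because its queue drained.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stdbool.h>

/* mcast_run mirrors IPOIB_MCAST_RUN used purely as administrative
 * state; queue_join_task() is a hypothetical stand-in for
 * queue_delayed_work() and just counts how often work was queued. */
static atomic_bool mcast_run;
static int joins_queued;

static void queue_join_task(void)
{
	joins_queued++;
}

/* Queue a join if, and only if, multicast is administratively up.
 * Queueing redundantly is harmless, merely a slight waste of work,
 * which is the performance trade-off the commit message accepts. */
static bool mcast_send(void)
{
	if (!atomic_load(&mcast_run))
		return false;
	queue_join_task();
	return true;
}
```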


[PATCH 02/22] IB/ipoib: Make the carrier_on_task race aware

2015-02-11 Thread Doug Ledford
We blindly assume that we can just take the rtnl lock and that will
prevent races with downing this interface.  Unfortunately, that's not
the case.  In ipoib_mcast_stop_thread() we will call flush_workqueue()
in an attempt to clear out all remaining instances of ipoib_join_task.
But, since this task is put on the same workqueue as the join task,
the flush_workqueue waits on this thread too.  But this thread is
deadlocked on the rtnl lock.  The better thing here is to use trylock
and loop on that until we either get the lock or we see that
FLAG_ADMIN_UP has been cleared, in which case we don't need to do
anything anyway and we just return.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++--
 1 file changed, 15 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index eee66d13e5b..9862c76a83f 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
   carrier_on_task);
struct ib_port_attr attr;
 
-   /*
-* Take rtnl_lock to avoid racing with ipoib_stop() and
-* turning the carrier back on while a device is being
-* removed.
-*/
	if (ib_query_port(priv->ca, priv->port, &attr) ||
	    attr.state != IB_PORT_ACTIVE) {
	ipoib_dbg(priv, "Keeping carrier off until IB port is "
	"active\n");
return;
}
 
-   rtnl_lock();
+   /*
+* Take rtnl_lock to avoid racing with ipoib_stop() and
+* turning the carrier back on while a device is being
+* removed.  However, ipoib_stop() will attempt to flush
+* the workqueue while holding the rtnl lock, so loop
+* on trylock until either we get the lock or we see
+* FLAG_ADMIN_UP go away as that signals that we are bailing
+* and can safely ignore the carrier on work
+*/
+   while (!rtnl_trylock()) {
+   if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+   return;
+   else
+   msleep(20);
+   }
	if (!ipoib_cm_admin_enabled(priv->dev))
	dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
	netif_carrier_on(priv->dev);
-- 
2.1.0

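
The trylock-and-bail pattern above generalizes beyond the rtnl lock. A rough userspace analog using POSIX threads (all names here are illustrative, not kernel API): keep trying for the lock, but give up as soon as the "interface" goes administratively down, since the teardown path holding the lock will do the cleanup for us.

```c
#include <assert.h>
#include <pthread.h>
#include <stdatomic.h>
#include <stdbool.h>
#include <unistd.h>

static pthread_mutex_t cfg_lock = PTHREAD_MUTEX_INITIALIZER;
static atomic_bool admin_up;

/* Analog of the rtnl_trylock() loop in the patch: returns true with
 * cfg_lock held, or false (without the lock) if admin_up dropped
 * while we were waiting. */
static bool lock_or_bail(void)
{
	while (pthread_mutex_trylock(&cfg_lock) != 0) {
		if (!atomic_load(&admin_up))
			return false;
		usleep(20000);	/* mirrors the msleep(20) above */
	}
	return true;
}
```

The key property is that the waiter can never deadlock against a lock holder that is itself waiting for this work item to finish: one of the two exit conditions must eventually fire.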


[PATCH 10/22] IB/ipoib: Add a helper to restart the multicast task

2015-02-11 Thread Doug Ledford
Add a simple helper to do the right thing when restarting
the multicast thread.  Call that helper from all the places
that need to restart the thread.  Add two places that didn't
restart the thread, but should have.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 46 --
 1 file changed, 29 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 91b8fe118ec..cb1e495bd74 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -66,6 +66,13 @@ struct ipoib_mcast_iter {
unsigned int   send_only;
 };
 
+static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv,
+  int delay)
+{
+   if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
+   queue_delayed_work(priv->wq, &priv->mcast_task, delay);
+}
+
 static void ipoib_mcast_free(struct ipoib_mcast *mcast)
 {
struct net_device *dev = mcast-dev;
@@ -270,6 +277,7 @@ ipoib_mcast_sendonly_join_complete(int status,
 {
	struct ipoib_mcast *mcast = multicast->context;
	struct net_device *dev = mcast->dev;
+   struct ipoib_dev_priv *priv = netdev_priv(dev);
 
/*
 * We have to take the mutex to force mcast_sendonly_join to
@@ -309,6 +317,7 @@ out:
	complete(&mcast->done);
	if (status == -ENETRESET)
	status = 0;
+   __ipoib_mcast_continue_join_thread(priv, 0);
	mutex_unlock(&mcast_mutex);
return status;
 }
@@ -360,6 +369,7 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast 
*mcast)
	complete(&mcast->done);
	ipoib_warn(priv, "ib_sa_join_multicast for sendonly join "
		   "failed (ret = %d)\n", ret);
+   __ipoib_mcast_continue_join_thread(priv, 0);
	} else {
	ipoib_dbg_mcast(priv, "no multicast record for %pI6, starting "
	"sendonly join\n", mcast->mcmember.mgid.raw);
@@ -431,8 +441,7 @@ static int ipoib_mcast_join_complete(int status,
 
if (!status) {
	mcast->backoff = 1;
-   if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-   queue_delayed_work(priv->wq, &priv->mcast_task, 0);
+   __ipoib_mcast_continue_join_thread(priv, 0);
 
/*
	 * Defer carrier on work to priv->wq to avoid a
@@ -454,6 +463,13 @@ static int ipoib_mcast_join_complete(int status,
	mcast->backoff *= 2;
	if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
	mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
+   /*
+    * XXX - This is wrong.  *Our* join failed, but because the
+    * join thread does the joins in a serial fashion, if there
+    * are any joins behind ours waiting to complete, they should
+    * not be subjected to our backoff delay.
+    */
+   __ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
}
 out:
	spin_lock_irq(&priv->lock);
@@ -463,9 +479,6 @@ out:
	complete(&mcast->done);
	if (status == -ENETRESET)
	status = 0;
-   if (status && test_bit(IPOIB_MCAST_RUN, &priv->flags))
-   queue_delayed_work(priv->wq, &priv->mcast_task,
-  mcast->backoff * HZ);
	spin_unlock_irq(&priv->lock);
	mutex_unlock(&mcast_mutex);
 
@@ -532,10 +545,13 @@ static void ipoib_mcast_join(struct net_device *dev, 
struct ipoib_mcast *mcast,
	mcast->backoff *= 2;
	if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
	mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
-
-   if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-   queue_delayed_work(priv->wq, &priv->mcast_task,
-  mcast->backoff * HZ);
+   /*
+    * XXX - This is wrong.  *Our* join failed, but because the
+    * join thread does the joins in a serial fashion, if there
+    * are any joins behind ours waiting to complete, they should
+    * not be subjected to our backoff delay.
+    */
+   __ipoib_mcast_continue_join_thread(priv, mcast->backoff * HZ);
	}
	mutex_unlock(&mcast_mutex);
 }
@@ -573,9 +589,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
if (!broadcast) {
	ipoib_warn(priv, "failed to allocate broadcast "
	"group\n");
	mutex_lock(&mcast_mutex);
-   if (test_bit(IPOIB_MCAST_RUN, &priv->flags))
-   queue_delayed_work(priv->wq, &priv->mcast_task,
-  HZ);
+   

[PATCH 07/22] IB/ipoib: Make ipoib_mcast_stop_thread flush the workqueue

2015-02-11 Thread Doug Ledford
We used to pass a flush variable to mcast_stop_thread to indicate if
we should flush the workqueue or not.  This was due to some code
trying to flush a workqueue that it was currently running on which is
a no-no.  Now that we have per-device work queues, and now that
ipoib_mcast_restart_task has taken the fact that it is queued on a
single thread workqueue with all of the ipoib_mcast_join_task's and
therefore has no need to stop the join task while it runs, we can do
away with the flush parameter and unilaterally flush always.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  2 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c|  4 ++--
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +
 3 files changed, 12 insertions(+), 15 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index 45fd10a72ec..28dc927c0e8 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -499,7 +499,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, 
struct sk_buff *skb);
 
 void ipoib_mcast_restart_task(struct work_struct *work);
 int ipoib_mcast_start_thread(struct net_device *dev);
-int ipoib_mcast_stop_thread(struct net_device *dev, int flush);
+int ipoib_mcast_stop_thread(struct net_device *dev);
 
 void ipoib_mcast_dev_down(struct net_device *dev);
 void ipoib_mcast_dev_flush(struct net_device *dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c 
b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index bfd17d41b5f..66096787119 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -747,7 +747,7 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush)
	clear_bit(IPOIB_FLAG_OPER_UP, &priv->flags);
netif_carrier_off(dev);
 
-   ipoib_mcast_stop_thread(dev, flush);
+   ipoib_mcast_stop_thread(dev);
ipoib_mcast_dev_flush(dev);
 
ipoib_flush_paths(dev);
@@ -1097,7 +1097,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
 */
ipoib_flush_paths(dev);
 
-   ipoib_mcast_stop_thread(dev, 1);
+   ipoib_mcast_stop_thread(dev);
ipoib_mcast_dev_flush(dev);
 
ipoib_transport_dev_cleanup(dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 845f910eb21..bc50dd0d0e4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -648,7 +648,7 @@ int ipoib_mcast_start_thread(struct net_device *dev)
return 0;
 }
 
-int ipoib_mcast_stop_thread(struct net_device *dev, int flush)
+int ipoib_mcast_stop_thread(struct net_device *dev)
 {
struct ipoib_dev_priv *priv = netdev_priv(dev);
 
@@ -659,8 +659,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int 
flush)
	cancel_delayed_work(&priv->mcast_task);
	mutex_unlock(&mcast_mutex);
 
-   if (flush)
-   flush_workqueue(priv->wq);
+   flush_workqueue(priv->wq);
 
return 0;
 }
@@ -838,8 +837,6 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 
	ipoib_dbg_mcast(priv, "restarting multicast task\n");
 
-   ipoib_mcast_stop_thread(dev, 0);
-
local_irq_save(flags);
netif_addr_lock(dev);
	spin_lock(&priv->lock);
@@ -936,13 +933,10 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 * We have to cancel outside of the spinlock, but we have to
 * take the rtnl lock or else we race with the removal of
 * entries from the remove list in mcast_dev_flush as part
-* of ipoib_stop() which will call mcast_stop_thread with
-* flush == 1 while holding the rtnl lock, and the
-* flush_workqueue won't complete until this restart_mcast_task
-* completes.  So do like the carrier on task and attempt to
-* take the rtnl lock, but if we can't before the ADMIN_UP flag
-* goes away, then just return and know that the remove list will
-* get flushed later by mcast_stop_thread.
+* of ipoib_stop().  We detect the drop of the ADMIN_UP flag
+* to signal that we have hit this particular race, and we
+* return since we know we don't need to do anything else
+* anyway.
 */
while (!rtnl_trylock()) {
	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
@@ -954,6 +948,9 @@ void ipoib_mcast_restart_task(struct work_struct *work)
	ipoib_mcast_leave(mcast->dev, mcast);
ipoib_mcast_free(mcast);
}
+   /*
+* Restart our join task if needed
+*/
ipoib_mcast_start_thread(dev);
rtnl_unlock();
 }
-- 
2.1.0



[PATCH 00/22] IB/ipoib: Fixups for multicast issues

2015-02-11 Thread Doug Ledford
This patchset revives the 8 patches that were reverted from 3.19,
the 11 patches that fixed the problems with the first 8, the single
patch that was related to reaping of ah's and failure to dealloc
resources on shutdown, and then adds two new patches that would
have been enhancements and not bugfixes and hence weren't appropriate
to post in the 3.19 tussle.

Testing of this patchset is currently underway, but it has done
well so far.  IPv4 multicast, IPv6 multicast, connected mode,
datagram mode, rmmod/insmod while active, restart opensm while
active, ifconfig up/ifconfig down in a tight while loop have
all passed.

There are two outstanding issues that I think still need to be addressed
(while performing all the other testing I ran across these issues,
and I think they existed prior to my patchset, but I haven't booted
up a clean kernel to verify it yet...I'll do that tomorrow and if
things are not as I expect, I'll report back here):

1) In connected mode, the initial ip6 ping to any host takes almost
exactly 1 second to complete.  The debug messages show this delay
very clearly:

[19059.689967] qib_ib0: joining MGID ff12:601b::::0001:ff31:7791
[19059.689970] qib_ib0: successfully started all multicast joins
[19059.690313] qib_ib0: sendonly join completion for 
ff12:601b::::0001:ff31:7791 (status 0)
[19059.690314] qib_ib0: Created ah 88080cc0ef60
[19059.690315] qib_ib0: MGID ff12:601b::::0001:ff31:7791 AV 
88080cc0ef60, LID 0xc035, SL 0 - Final debug message from creating our AH 
and when we should have requeued our sends
[19060.694190] qib_ib0: REQ arrived - almost exactly 1 second later, we
finally start setting up our connection

In datagram mode, this does not happen and initial startup of ping6
is mostly immediate.

2) In connected mode, restarting opensm repeatedly can cause some of
the machines to start failing to find other machines when trying to
use ping6.  However, they don't lose connectivity to all machines,
only specific machines.  A rmmod/insmod cycle solves the problem.
So does a full ifdown/ifup cycle.  Given enough idle time, the
problem goes away.  I suspect that neighbor flushing when in
connected mode is not reliable/sufficient when opensm events come
in.  Again, I think this exists in the upstream kernel and I'll
test more on that tomorrow.

Doug Ledford (22):
  IB/ipoib: Consolidate rtnl_lock tasks in workqueue
  IB/ipoib: Make the carrier_on_task race aware
  IB/ipoib: fix MCAST_FLAG_BUSY usage
  IB/ipoib: fix mcast_dev_flush/mcast_restart_task race
  IB/ipoib: change init sequence ordering
  IB/ipoib: Use dedicated workqueues per interface
  IB/ipoib: Make ipoib_mcast_stop_thread flush the workqueue
  IB/ipoib: No longer use flush as a parameter
  IB/ipoib: fix IPOIB_MCAST_RUN flag usage
  IB/ipoib: Add a helper to restart the multicast task
  IB/ipoib: make delayed tasks not hold up everything
  IB/ipoib: Handle -ENETRESET properly in our callback
  IB/ipoib: don't restart our thread on ENETRESET
  IB/ipoib: remove unneeded locks
  IB/ipoib: fix race between mcast_dev_flush and mcast_join
  IB/ipoib: fix ipoib_mcast_restart_task
  IB/ipoib: flush the ipoib_workqueue on unregister
  IB/ipoib: cleanup a couple debug messages
  IB/ipoib: make sure we reap all our ah on shutdown
  IB/ipoib: don't queue a work struct up twice
  IB/ipoib: deserialize multicast joins
  IB/ipoib: drop mcast_mutex usage

 drivers/infiniband/ulp/ipoib/ipoib.h   |  20 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c|  18 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c|  69 ++--
 drivers/infiniband/ulp/ipoib/ipoib_main.c  |  51 ++-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 479 -
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |  24 +-
 6 files changed, 356 insertions(+), 305 deletions(-)

-- 
2.1.0



[PATCH 21/22] IB/ipoib: deserialize multicast joins

2015-02-11 Thread Doug Ledford
Allow the ipoib layer to attempt to join all outstanding multicast
groups at once.  The ib_sa layer will serialize multiple attempts to
join the same group, but will process attempts to join different groups
in parallel.  Take advantage of that.

In order to make this happen, change the mcast_join_thread to loop
through all needed joins, sending a join request for each one that we
still need to join.  There are a few special cases we handle though:

1) Don't attempt to join anything but the broadcast group until the join
of the broadcast group has succeeded.
2) No longer restart the join task at the end of completion handling as
the join task will fire off everything.  Unless we are the broadcast
group, and then we need to.  Or in case we need to initiate a retry.
3) No longer use separate join/completion routines for regular and
sendonly joins, pass them all through the same routine and just do the
right thing based on the SENDONLY join flag.
4) Only try to join a SENDONLY join twice, then drop the packets and
quit trying.  We leave the mcast group in the list so that if we get a
new packet, all that we have to do is queue up the packet and restart
the join task and it will automatically try to join twice and then
either send or flush the queue again.
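
The send-only policy in (4) amounts to a small per-packet decision; a hedged sketch (the enum and function are illustrative, not the driver's types):

```c
#include <assert.h>

/* What to do with an outgoing packet for a send-only group, given
 * how many join attempts have already failed.  After two failed
 * attempts we drop traffic but keep the group in the list, so a
 * later packet restarts the join task and gets two fresh attempts. */
enum sendonly_action { SENDONLY_JOIN, SENDONLY_DROP };

static enum sendonly_action sendonly_tx(int failed_attempts)
{
	return failed_attempts < 2 ? SENDONLY_JOIN : SENDONLY_DROP;
}
```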

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 264 +
 1 file changed, 88 insertions(+), 176 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 68b47bed33a..355c320dc4d 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -295,113 +295,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast 
*mcast,
return 0;
 }
 
-static int
-ipoib_mcast_sendonly_join_complete(int status,
-  struct ib_sa_multicast *multicast)
-{
-   struct ipoib_mcast *mcast = multicast->context;
-   struct net_device *dev = mcast->dev;
-   struct ipoib_dev_priv *priv = netdev_priv(dev);
-
-   /*
-* We have to take the mutex to force mcast_sendonly_join to
-* return from ib_sa_multicast_join and set mcast->mc to a
-* valid value.  Otherwise we were racing with ourselves in
-* that we might fail here, but get a valid return from
-* ib_sa_multicast_join after we had cleared mcast->mc here,
-* resulting in mis-matched joins and leaves and a deadlock
-*/
-   mutex_lock(&mcast_mutex);
-
-   /* We trap for port events ourselves. */
-   if (status == -ENETRESET) {
-   status = 0;
-   goto out;
-   }
-
-   if (!status)
-   status = ipoib_mcast_join_finish(mcast, &multicast->rec);
-
-   if (status) {
-   if (mcast->logcount++ < 20)
-   ipoib_dbg_mcast(netdev_priv(dev), "sendonly multicast "
-   "join failed for %pI6, status %d\n",
-   mcast->mcmember.mgid.raw, status);
-
-   /* Flush out any queued packets */
-   netif_tx_lock_bh(dev);
-   while (!skb_queue_empty(&mcast->pkt_queue)) {
-   ++dev->stats.tx_dropped;
-   dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
-   }
-   netif_tx_unlock_bh(dev);
-   __ipoib_mcast_continue_join_thread(priv, mcast, 1);
-   } else {
-   /* Join completed, so reset any backoff parameters */
-   mcast->backoff = 1;
-   mcast->delay_until = jiffies;
-   __ipoib_mcast_continue_join_thread(priv, NULL, 0);
-   }
-out:
-   clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-   if (status)
-   mcast->mc = NULL;
-   complete(&mcast->done);
-   mutex_unlock(&mcast_mutex);
-   return status;
-}
-
-static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast)
-{
-   struct net_device *dev = mcast->dev;
-   struct ipoib_dev_priv *priv = netdev_priv(dev);
-   struct ib_sa_mcmember_rec rec = {
-#if 0  /* Some SMs don't support send-only yet */
-   .join_state = 4
-#else
-   .join_state = 1
-#endif
-   };
-   int ret = 0;
-
-   if (!test_bit(IPOIB_FLAG_OPER_UP, &priv->flags)) {
-   ipoib_dbg_mcast(priv, "device shutting down, no sendonly "
-   "multicast joins\n");
-   clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-   complete(&mcast->done);
-   return -ENODEV;
-   }
-
-   rec.mgid = mcast->mcmember.mgid;
-   rec.port_gid = priv->local_gid;
-   rec.pkey = cpu_to_be16(priv->pkey);
-
-   mutex_lock(&mcast_mutex);
-   mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca,
-priv->port, &rec,
-  

[PATCH 17/22] IB/ipoib: flush the ipoib_workqueue on unregister

2015-02-11 Thread Doug Ledford
When we unregister our event handler as a prelude to removing our
device, we need to go ahead and flush the ipoib_workqueue to make sure
there are no delayed flushes for the device we are getting ready to
remove.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 ++
 1 file changed, 2 insertions(+)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c 
b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 6bad17d4d58..8810514cf40 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1643,6 +1643,7 @@ sysfs_failed:
 
 register_failed:
	ib_unregister_event_handler(&priv->event_handler);
+   flush_workqueue(ipoib_workqueue);
/* Stop GC if started before flush */
	set_bit(IPOIB_STOP_NEIGH_GC, &priv->flags);
	cancel_delayed_work(&priv->neigh_reap_task);
@@ -1709,6 +1710,7 @@ static void ipoib_remove_one(struct ib_device *device)
 
list_for_each_entry_safe(priv, tmp, dev_list, list) {
	ib_unregister_event_handler(&priv->event_handler);
+   flush_workqueue(ipoib_workqueue);

	rtnl_lock();
	dev_change_flags(priv->dev, priv->dev->flags & ~IFF_UP);
-- 
2.1.0



[PATCH 14/22] IB/ipoib: remove unneeded locks

2015-02-11 Thread Doug Ledford
During the various work I've done on this, some extra locking has crept
in when things weren't working right.  This is one of those spots.
Remove the unneeded spinlocks.  The mutex is enough to protect against
what we need to protect against.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c 
b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a03b6a04203..424c602d5a0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -497,12 +497,10 @@ static int ipoib_mcast_join_complete(int status,
__ipoib_mcast_continue_join_thread(priv, mcast, 1);
}
 out:
-   spin_lock_irq(&priv->lock);
	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
	if (status)
	mcast->mc = NULL;
	complete(&mcast->done);
-   spin_unlock_irq(&priv->lock);
	mutex_unlock(&mcast_mutex);
 
return status;
-- 
2.1.0



[PATCH 06/22] IB/ipoib: Use dedicated workqueues per interface

2015-02-11 Thread Doug Ledford
During my recent work on the rtnl lock deadlock in the IPoIB driver, I
saw that even once I fixed the apparent races for a single device, as
soon as that device had any children, new races popped up.  It turns
out that this is because no matter how well we protect against races
on a single device, the fact that all devices use the same workqueue,
and flush_workqueue() flushes *everything* from that workqueue, we can
have one device in the middle of a down and holding the rtnl lock and
another totally unrelated device needing to run mcast_restart_task,
which wants the rtnl lock and will loop trying to take it unless it
sees its own FLAG_ADMIN_UP flag go away.  Because the unrelated
interface will never see its own ADMIN_UP flag drop, the interface
going down will deadlock trying to flush the queue.  There are several
possible solutions to this problem:

Make carrier_on_task and mcast_restart_task try to take the rtnl for
some set period of time and if they fail, then bail.  This runs the
real risk of dropping work on the floor, which can end up being its
own separate kind of deadlock.

Set some global flag in the driver that says some device is in the
middle of going down, letting all tasks know to bail.  Again, this can
drop work on the floor.  I suppose if our own ADMIN_UP flag doesn't go
away, then maybe after a few tries on the rtnl lock we can queue our
own task back up as a delayed work and return and avoid dropping work
on the floor that way.  But I'm not 100% convinced that we won't cause
other problems.

Or the method this patch attempts to use, which is when we bring an
interface up, create a workqueue specifically for that interface, so
that when we take it back down, we are flushing only those tasks
associated with our interface.  In addition, keep the global
workqueue, but now limit it to only flush tasks.  In this way, the
flush tasks can always flush the device specific work queues without
having deadlock issues.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib.h   |  1 +
 drivers/infiniband/ulp/ipoib/ipoib_cm.c| 18 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c|  6 +++---
 drivers/infiniband/ulp/ipoib/ipoib_main.c  | 19 ---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 26 --
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 22 +-
 6 files changed, 58 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h 
b/drivers/infiniband/ulp/ipoib/ipoib.h
index f4c1b20b23b..45fd10a72ec 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -323,6 +323,7 @@ struct ipoib_dev_priv {
struct list_head multicast_list;
struct rb_root multicast_tree;
 
+   struct workqueue_struct *wq;
struct delayed_work mcast_task;
struct work_struct carrier_on_task;
struct work_struct flush_light;
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 933efcea0d0..56959adb6c7 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
@@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, 
struct ib_cm_event *even
}
 
	spin_lock_irq(&priv->lock);
-   queue_delayed_work(ipoib_workqueue,
+   queue_delayed_work(priv->wq,
	   &priv->cm.stale_task, IPOIB_CM_RX_DELAY);
/* Add this entry to passive ids list head, but do not re-add it
 * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. */
@@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct 
ib_wc *wc)
	spin_lock_irqsave(&priv->lock, flags);
	list_splice_init(&priv->cm.rx_drain_list, 
	&priv->cm.rx_reap_list);
	ipoib_cm_start_rx_drain(priv);
-   queue_work(ipoib_workqueue, &priv->cm.rx_reap_task);
+   queue_work(priv->wq, &priv->cm.rx_reap_task);
	spin_unlock_irqrestore(&priv->lock, flags);
	} else
	ipoib_warn(priv, "cm recv completion event with wrid %d "
	"(> %d)\n",
@@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct 
ib_wc *wc)
	spin_lock_irqsave(&priv->lock, flags);
	list_move(&p->list, &priv->cm.rx_reap_list);
	spin_unlock_irqrestore(&priv->lock, flags);
-   queue_work(ipoib_workqueue, 
	&priv->cm.rx_reap_task);
+   queue_work(priv->wq, &priv->cm.rx_reap_task);
}
return;
}
@@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct 
ib_wc *wc)
 
	if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, &tx->flags)) {

[PATCH 16/22] IB/ipoib: fix ipoib_mcast_restart_task

2015-02-11 Thread Doug Ledford
There are two intertwined problems here.

First, if we are downing the interface, we will get called as part of
the process.  When that happens, we don't really want to do anything
here as the next thing to happen is ipoib_mcast_dev_flush will get
called.  We just want to let it do all the work since we don't need to
do any address matches and such.  So, we should be testing for whether
or not the interface is up at the beginning of this function and
returning if it isn't.

Second, the attempt to grab the rtnl_lock to prevent racing against
ipoib_mcast_dev_flush was mistaken.  The removed mcast groups are on a
stack local list head and so once we've removed them from the mcast
rbtree, ipoib_mcast_dev_flush can no longer find them and race us on
removing them.  However, because we attempt to take the lock, and then
check the status of the interface, if our interface has gone down, we
simply return.  This is absolutely the wrong thing to do as it means we
just leaked all of the mcast entries on our remove list.

The fix is to check for the interface being up at the very beginning of
the function, and then to always remove our entries on the remove list
no matter what as they will get leaked otherwise, and skip taking the
rtnl_lock since it isn't needed anyway.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 19 +++
 1 file changed, 3 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 972fa15a225..10e05437c83 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -871,6 +871,9 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	unsigned long flags;
 	struct ib_sa_mcmember_rec rec;
 
+	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+		return;
+
 	ipoib_dbg_mcast(priv, "restarting multicast task\n");
 
 	local_irq_save(flags);
@@ -965,21 +968,6 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
 
-	/*
-	 * We have to cancel outside of the spinlock, but we have to
-	 * take the rtnl lock or else we race with the removal of
-	 * entries from the remove list in mcast_dev_flush as part
-	 * of ipoib_stop().  We detect the drop of the ADMIN_UP flag
-	 * to signal that we have hit this particular race, and we
-	 * return since we know we don't need to do anything else
-	 * anyway.
-	 */
-	while (!rtnl_trylock()) {
-		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-			return;
-		else
-			msleep(20);
-	}
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
@@ -988,7 +976,6 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	 * Restart our join task thread if needed
 	 */
 	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
-	rtnl_unlock();
 }
 
 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
-- 
2.1.0

--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH 20/22] IB/ipoib: don't queue a work struct up twice

2015-02-11 Thread Doug Ledford
In b2d408f78b3 (IB/ipoib: make delayed tasks not hold up everything) I
added the ability to have some joins processed around delayed joins.
Because the join task processes one join at a time, we needed to queue
the task to run again immediately and then again at the appropriate
delay time.  I didn't think about the fact that our work struct can only
be on the workqueue once at a given time and instead tried to queue the
same struct up twice.  Instead, kick the queue again immediately on a
delayed restart attempt; when the task sees that it has nothing to do
but delayed work, it will pick the shortest work item and queue a delay
equal to its expiration.  Finally, in case we have delayed work, the
send_mcast routine needs to cancel any pending delayed work before
kicking the thread immediately.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 49 --
 1 file changed, 31 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index ad2bcd105e4..68b47bed33a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -71,29 +71,28 @@ static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv,
 					       int delay)
 {
 	if (test_bit(IPOIB_MCAST_RUN, &priv->flags)) {
-		/*
-		 * Mark this mcast for its delay and set a timer to kick the
-		 * thread when the delay completes
-		 */
 		if (mcast && delay) {
 			mcast->backoff *= 2;
 			if (mcast->backoff > IPOIB_MAX_BACKOFF_SECONDS)
 				mcast->backoff = IPOIB_MAX_BACKOFF_SECONDS;
 			mcast->delay_until = jiffies + (mcast->backoff * HZ);
-			queue_delayed_work(priv->wq, &priv->mcast_task,
-					   mcast->backoff * HZ);
+			/*
+			 * Mark this mcast for its delay, but restart the
+			 * task immediately.  It will make sure to clear
+			 * out all entries without delays, and then
+			 * schedule itself for run again when the delay
+			 * expires
+			 */
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 		} else if (delay) {
-			/* Special case of retrying after a failure to
+			/*
+			 * Special case of retrying after a failure to
 			 * allocate the broadcast multicast group, wait
 			 * 1 second and try again
 			 */
 			queue_delayed_work(priv->wq, &priv->mcast_task, HZ);
-		}
-		/*
-		 * But also rekick the thread immediately for any other
-		 * tasks in queue behind this one
-		 */
-		queue_delayed_work(priv->wq, &priv->mcast_task, delay);
+		} else
+			queue_delayed_work(priv->wq, &priv->mcast_task, 0);
 	}
 }
 
@@ -568,6 +567,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	struct ib_port_attr port_attr;
 	struct ipoib_mcast *mcast = NULL;
 	int create = 1;
+	unsigned long delay_until = 0;
 
 	if (!test_bit(IPOIB_MCAST_RUN, &priv->flags))
 		return;
@@ -636,11 +636,14 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	list_for_each_entry(mcast, &priv->multicast_list, list) {
 		if (IS_ERR_OR_NULL(mcast->mc) &&
 		    !test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags) &&
-		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags) &&
-		    (mcast->backoff == 1 ||
-		     time_after_eq(jiffies, mcast->delay_until))) {
-			/* Found the next unjoined group */
-			break;
+		    !test_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
+			if (mcast->backoff == 1 ||
+			    time_after_eq(jiffies, mcast->delay_until))
+				/* Found the next unjoined group */
+				break;
+			else if (!delay_until ||
+				 time_before(mcast->delay_until, delay_until))
+				delay_until = mcast->delay_until;
 		}
 	}
 
@@ -653,6 +656,9 @@ void ipoib_mcast_join_task(struct work_struct *work)
 		mcast = NULL;
 		ipoib_dbg_mcast(priv, "successfully joined all multicast groups\n");
+		if (delay_until)
+			queue_delayed_work(priv->wq, &priv->mcast_task,

[PATCH 19/22] IB/ipoib: make sure we reap all our ah on shutdown

2015-02-11 Thread Doug Ledford
The introduction of garbage collection for neighbors in the ipoib stack
caused a leak of resources.  In particular, an ah can have a refcount
from a mcast struct, from a neigh struct, and from a path struct.  When
we put an ah ref, the ah gets moved to a dead list to be reaped later if
the refcount goes to 0.

During ipoib_ib_dev_stop we attempt to reap these ah entries, but since
that function does not force the neigh garbage collection to be run, it
can leak any entries held by a neighbor.  And even though the code
attempts to warn us that we will leak entries, that warning is broken
because the entries we are leaking will not yet be on the dead_ahs list
because they still have valid references held elsewhere.  It is during
shutdown, when we finally kill the neighbor garbage collection, that our
final ahs will get put on the dead_ahs list, but after the point at
which we run the neighbor garbage collection, no attempt is made to reap
the newly added ahs, and because during ipoib_ib_dev_stop we killed the
ah_reap task, no timed attempt will be made to clear them either.

Instead, create ipoib_flush_ah and ipoib_stop_ah routines to use at
appropriate times to flush out all remaining ah entries before we shut
the device down.

This is done to prevent resource leaks on shutdown, which manifest with
this message on ib_ipoib module remove:

ibdev: ib_dealloc_pd failed

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c| 44 ++
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c |  4 ++-
 2 files changed, 30 insertions(+), 18 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index fe65abb5150..e144d07d53c 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -659,6 +659,23 @@ void ipoib_reap_ah(struct work_struct *work)
 				   round_jiffies_relative(HZ));
 }
 
+static void ipoib_flush_ah(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	cancel_delayed_work(&priv->ah_reap_task);
+	flush_workqueue(priv->wq);
+	ipoib_reap_ah(&priv->ah_reap_task.work);
+}
+
+static void ipoib_stop_ah(struct net_device *dev)
+{
+	struct ipoib_dev_priv *priv = netdev_priv(dev);
+
+	set_bit(IPOIB_STOP_REAPER, &priv->flags);
+	ipoib_flush_ah(dev);
+}
+
 static void ipoib_ib_tx_timer_func(unsigned long ctx)
 {
 	drain_tx_cq((struct net_device *)ctx);
@@ -877,23 +894,7 @@ timeout:
 	if (ib_modify_qp(priv->qp, &qp_attr, IB_QP_STATE))
 		ipoib_warn(priv, "Failed to modify QP to RESET state\n");
 
-	/* Wait for all AHs to be reaped */
-	set_bit(IPOIB_STOP_REAPER, &priv->flags);
-	cancel_delayed_work(&priv->ah_reap_task);
-	flush_workqueue(priv->wq);
-
-	begin = jiffies;
-
-	while (!list_empty(&priv->dead_ahs)) {
-		__ipoib_reap_ah(dev);
-
-		if (time_after(jiffies, begin + HZ)) {
-			ipoib_warn(priv, "timing out; will leak address handles\n");
-			break;
-		}
-
-		msleep(1);
-	}
+	ipoib_flush_ah(dev);
 
 	ib_req_notify_cq(priv->recv_cq, IB_CQ_NEXT_COMP);
 
@@ -1036,6 +1037,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
 	if (level == IPOIB_FLUSH_LIGHT) {
 		ipoib_mark_paths_invalid(dev);
 		ipoib_mcast_dev_flush(dev);
+		ipoib_flush_ah(dev);
 	}
 
 	if (level >= IPOIB_FLUSH_NORMAL)
@@ -1099,6 +1101,14 @@ void ipoib_ib_dev_cleanup(struct net_device *dev)
 	ipoib_mcast_stop_thread(dev);
 	ipoib_mcast_dev_flush(dev);
 
+	/*
+	 * All of our ah references aren't free until after
+	 * ipoib_mcast_dev_flush(), ipoib_flush_paths, and
+	 * the neighbor garbage collection is stopped and reaped.
+	 * That should all be done now, so make a final ah flush.
+	 */
+	ipoib_stop_ah(dev);
+
 	ipoib_transport_dev_cleanup(dev);
 }
 
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
index b72a753eb41..66df78382a4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c
@@ -263,6 +263,9 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
+	if (priv->wq)
+		flush_workqueue(priv->wq);
+
 	if (priv->qp) {
 		if (ib_destroy_qp(priv->qp))
 			ipoib_warn(priv, "ib_qp_destroy failed\n");
@@ -286,7 +289,6 @@ void ipoib_transport_dev_cleanup(struct net_device *dev)
 		ipoib_warn(priv, "ib_dealloc_pd failed\n");
 
 	if (priv->wq) {
-		flush_workqueue(priv->wq);
 		destroy_workqueue(priv->wq);
 		priv->wq = NULL;
 	}
-- 
2.1.0


[PATCH 22/22] IB/ipoib: drop mcast_mutex usage

2015-02-11 Thread Doug Ledford
We needed the mcast_mutex when we had to protect two different workqueue
threads against stomping on each other's work.  By no longer using
mcast->mc to directly store the return value of ib_sa_join_multicast, we
can switch all of the mcast flag/completion locking to using only the
priv->lock spinlock.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 55 +-
 1 file changed, 19 insertions(+), 36 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 355c320dc4d..e1fab519f96 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -55,8 +55,6 @@ MODULE_PARM_DESC(mcast_debug_level,
 		 "Enable multicast debug tracing if > 0");
 #endif
 
-static DEFINE_MUTEX(mcast_mutex);
-
 struct ipoib_mcast_iter {
 	struct net_device *dev;
 	union ib_gid       mgid;
@@ -340,16 +338,6 @@ static int ipoib_mcast_join_complete(int status,
 			"sendonly " : "",
 			mcast->mcmember.mgid.raw, status);
 
-	/*
-	 * We have to take the mutex to force mcast_join to
-	 * return from ib_sa_multicast_join and set mcast->mc to a
-	 * valid value.  Otherwise we were racing with ourselves in
-	 * that we might fail here, but get a valid return from
-	 * ib_sa_multicast_join after we had cleared mcast->mc here,
-	 * resulting in mis-matched joins and leaves and a deadlock
-	 */
-	mutex_lock(&mcast_mutex);
-
 	/* We trap for port events ourselves. */
 	if (status == -ENETRESET) {
 		status = 0;
@@ -410,11 +398,14 @@ static int ipoib_mcast_join_complete(int status,
 		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 out:
+	spin_lock_irq(&priv->lock);
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	if (status)
 		mcast->mc = NULL;
+	else
+		mcast->mc = multicast;
 	complete(&mcast->done);
-	mutex_unlock(&mcast_mutex);
+	spin_unlock_irq(&priv->lock);
 
 	return status;
 }
@@ -423,6 +414,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 			     int create)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
+	struct ib_sa_multicast *multicast;
 	struct ib_sa_mcmember_rec rec = {
 		.join_state = 1
 	};
@@ -464,19 +456,20 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast,
 		rec.hop_limit	  = priv->broadcast->mcmember.hop_limit;
 	}
 
-	mutex_lock(&mcast_mutex);
-	mcast->mc = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
+	multicast = ib_sa_join_multicast(&ipoib_sa_client, priv->ca, priv->port,
 					 &rec, comp_mask, GFP_KERNEL,
 					 ipoib_mcast_join_complete, mcast);
-	if (IS_ERR(mcast->mc)) {
-		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
-		complete(&mcast->done);
-		ret = PTR_ERR(mcast->mc);
+	if (IS_ERR(multicast)) {
+		ret = PTR_ERR(multicast);
+		mcast->mc = NULL;
 		ipoib_warn(priv, "ib_sa_join_multicast failed, status %d\n", ret);
+		spin_lock_irq(&priv->lock);
+		clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 		/* Requeue this join task with a backoff delay */
 		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
+		complete(&mcast->done);
+		spin_unlock_irq(&priv->lock);
 	}
-	mutex_unlock(&mcast_mutex);
 }
 
 void ipoib_mcast_join_task(struct work_struct *work)
@@ -505,15 +498,6 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	else
 		memcpy(priv->dev->dev_addr + 4, priv->local_gid.raw, sizeof (union ib_gid));
 
-	/*
-	 * We have to hold the mutex to keep from racing with the join
-	 * completion threads on setting flags on mcasts, and we have
-	 * to hold the priv->lock because dev_flush will remove entries
-	 * out from underneath us, so at a minimum we need the lock
-	 * through the time that we do the for_each loop of the mcast
-	 * list or else dev_flush can make us oops.
-	 */
-	mutex_lock(&mcast_mutex);
 	spin_lock_irq(&priv->lock);
 	if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
 		goto out;
@@ -573,9 +557,7 @@ void ipoib_mcast_join_task(struct work_struct *work)
 	else
 		create = 1;
 	spin_unlock_irq(&priv->lock);
-	mutex_unlock(&mcast_mutex);
 	ipoib_mcast_join(dev, mcast, create);

[PATCH 08/22] IB/ipoib: No longer use flush as a parameter

2015-02-11 Thread Doug Ledford
Various places in the IPoIB code had a deadlock related to flushing
the ipoib workqueue.  Now that we have per device workqueues and a
specific flush workqueue, there is no longer a deadlock issue with
flushing the device specific workqueues and we can do so unilaterally.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib.h  |  6 +++---
 drivers/infiniband/ulp/ipoib/ipoib_ib.c   | 19 +--
 drivers/infiniband/ulp/ipoib/ipoib_main.c |  8 
 3 files changed, 16 insertions(+), 17 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h
index 28dc927c0e8..8ba80a6d3a4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib.h
+++ b/drivers/infiniband/ulp/ipoib/ipoib.h
@@ -484,10 +484,10 @@ void ipoib_ib_dev_flush_heavy(struct work_struct *work);
 void ipoib_pkey_event(struct work_struct *work);
 void ipoib_ib_dev_cleanup(struct net_device *dev);
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush);
+int ipoib_ib_dev_open(struct net_device *dev);
 int ipoib_ib_dev_up(struct net_device *dev);
-int ipoib_ib_dev_down(struct net_device *dev, int flush);
-int ipoib_ib_dev_stop(struct net_device *dev, int flush);
+int ipoib_ib_dev_down(struct net_device *dev);
+int ipoib_ib_dev_stop(struct net_device *dev);
 void ipoib_pkey_dev_check_presence(struct net_device *dev);
 
 int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
index 66096787119..fe65abb5150 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c
@@ -664,7 +664,7 @@ static void ipoib_ib_tx_timer_func(unsigned long ctx)
 	drain_tx_cq((struct net_device *)ctx);
 }
 
-int ipoib_ib_dev_open(struct net_device *dev, int flush)
+int ipoib_ib_dev_open(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	int ret;
@@ -706,7 +706,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush)
 dev_stop:
 	if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
 		napi_enable(&priv->napi);
-	ipoib_ib_dev_stop(dev, flush);
+	ipoib_ib_dev_stop(dev);
 	return -1;
 }
 
@@ -738,7 +738,7 @@ int ipoib_ib_dev_up(struct net_device *dev)
 	return ipoib_mcast_start_thread(dev);
 }
 
-int ipoib_ib_dev_down(struct net_device *dev, int flush)
+int ipoib_ib_dev_down(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
@@ -807,7 +807,7 @@ void ipoib_drain_cq(struct net_device *dev)
 	local_bh_enable();
 }
 
-int ipoib_ib_dev_stop(struct net_device *dev, int flush)
+int ipoib_ib_dev_stop(struct net_device *dev)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 	struct ib_qp_attr qp_attr;
@@ -880,8 +880,7 @@ timeout:
 	/* Wait for all AHs to be reaped */
 	set_bit(IPOIB_STOP_REAPER, &priv->flags);
 	cancel_delayed_work(&priv->ah_reap_task);
-	if (flush)
-		flush_workqueue(priv->wq);
+	flush_workqueue(priv->wq);
 
 	begin = jiffies;
 
@@ -918,7 +917,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 		    (unsigned long) dev);
 
 	if (dev->flags & IFF_UP) {
-		if (ipoib_ib_dev_open(dev, 1)) {
+		if (ipoib_ib_dev_open(dev)) {
 			ipoib_transport_dev_cleanup(dev);
 			return -ENODEV;
 		}
@@ -1040,12 +1039,12 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv,
 	}
 
 	if (level >= IPOIB_FLUSH_NORMAL)
-		ipoib_ib_dev_down(dev, 0);
+		ipoib_ib_dev_down(dev);
 
 	if (level == IPOIB_FLUSH_HEAVY) {
 		if (test_bit(IPOIB_FLAG_INITIALIZED, &priv->flags))
-			ipoib_ib_dev_stop(dev, 0);
-		if (ipoib_ib_dev_open(dev, 0) != 0)
+			ipoib_ib_dev_stop(dev);
+		if (ipoib_ib_dev_open(dev) != 0)
 			return;
 		if (netif_queue_stopped(dev))
 			netif_start_queue(dev);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 42e5c278f48..6bad17d4d58 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -108,7 +108,7 @@ int ipoib_open(struct net_device *dev)
 
 	set_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
 
-	if (ipoib_ib_dev_open(dev, 1)) {
+	if (ipoib_ib_dev_open(dev)) {
 		if (!test_bit(IPOIB_PKEY_ASSIGNED, &priv->flags))
 			return 0;
 		goto err_disable;
@@ -139,7 +139,7 @@ int ipoib_open(struct net_device *dev)
 	return 0;
 
 err_stop:
-	ipoib_ib_dev_stop(dev, 1);
+	ipoib_ib_dev_stop(dev);
 
 err_disable:
 	clear_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags);
@@ -157,8 

[PATCH 05/22] IB/ipoib: change init sequence ordering

2015-02-11 Thread Doug Ledford
In preparation for using per device work queues, we need to move the
start of the neighbor thread task to after ipoib_ib_dev_init and move
the destruction of the neighbor task to before ipoib_ib_dev_cleanup.
Otherwise we will end up freeing our workqueue with work possibly
still on it.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +---
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 58b5aa3b6f2..2cf81ef5141 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1262,15 +1262,13 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	if (ipoib_neigh_hash_init(priv) < 0)
-		goto out;
 	/* Allocate RX/TX rings to hold queued skbs */
 	priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
 				GFP_KERNEL);
 	if (!priv->rx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
 		       ca->name, ipoib_recvq_size);
-		goto out_neigh_hash_cleanup;
+		goto out;
 	}
 
 	priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring);
@@ -1285,16 +1283,24 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 	if (ipoib_ib_dev_init(dev, ca, port))
 		goto out_tx_ring_cleanup;
 
+	/*
+	 * Must be after ipoib_ib_dev_init so we can allocate a per
+	 * device wq there and use it here
+	 */
+	if (ipoib_neigh_hash_init(priv) < 0)
+		goto out_dev_uninit;
+
 	return 0;
 
+out_dev_uninit:
+	ipoib_ib_dev_cleanup(dev);
+
 out_tx_ring_cleanup:
 	vfree(priv->tx_ring);
 
 out_rx_ring_cleanup:
 	kfree(priv->rx_ring);
 
-out_neigh_hash_cleanup:
-	ipoib_neigh_hash_uninit(dev);
 out:
 	return -ENOMEM;
 }
@@ -1317,6 +1323,12 @@ void ipoib_dev_cleanup(struct net_device *dev)
 	}
 	unregister_netdevice_many(&head);
 
+	/*
+	 * Must be before ipoib_ib_dev_cleanup or we delete an in use
+	 * work queue
+	 */
+	ipoib_neigh_hash_uninit(dev);
+
 	ipoib_ib_dev_cleanup(dev);
 
 	kfree(priv->rx_ring);
@@ -1324,8 +1336,6 @@ void ipoib_dev_cleanup(struct net_device *dev)
 
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
-
-	ipoib_neigh_hash_uninit(dev);
 }
 
 static const struct header_ops ipoib_header_ops = {
-- 
2.1.0



[PATCH 18/22] IB/ipoib: cleanup a couple debug messages

2015-02-11 Thread Doug Ledford
The one I removed in particular can prevent progress in the driver from
happening because it isn't rate limited.  In my testing, I have seen the
flood of messages from that debug message delay multicast joins for in
excess of 30 seconds.

The added debug message was just so I would know when we were leaving a
multicast group that we didn't fully join.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +--
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 10e05437c83..ad2bcd105e4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -711,6 +711,8 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 
 	if (!IS_ERR_OR_NULL(mcast->mc))
 		ib_sa_free_multicast(mcast->mc);
+	else
+		ipoib_dbg(priv, "ipoib_mcast_leave with mcast->mc invalid\n");
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
 		ipoib_dbg_mcast(priv, "leaving MGID %pI6\n",
@@ -721,7 +723,9 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 			      be16_to_cpu(mcast->mcmember.mlid));
 		if (ret)
 			ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret);
-	}
+	} else if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+		ipoib_dbg(priv, "leaving with no mcmember but not a SENDONLY join\n");
 
 	return 0;
 }
@@ -772,11 +776,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 		++dev->stats.tx_dropped;
 		dev_kfree_skb_any(skb);
 	}
-
-	if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
-		ipoib_dbg_mcast(priv, "no address vector, "
-				"but multicast join already started\n");
-
 	/*
 	 * If lookup completes between here and out:, don't
 	 * want to send packet twice.
-- 
2.1.0



Re: [PATCH 00/22] IB/ipoib: Fixups for multicast issues

2015-02-11 Thread Or Gerlitz

On 2/11/2015 8:43 PM, Doug Ledford wrote:

This patchset revives the 8 patches that were reverted from 3.19,
the 11 patches that fixed the problems with the first 8, the single
patch that was related to reaping of ah's and failure to dealloc
resources on shutdown, and then adds two new patches that would
have been enhancements and not bugfixes and hence weren't appropriate
to post in the 3.19 tussle.
Doug, as asked a few times... please provide a public git tree where this 
series can be pulled from. So no squashes for patches which actually 
represent a fix to a downstream patch within the series? Any specific 
reason for that?



Re: mlx4: IB_EVENT_PATH_MIG not generated on path migration

2015-02-11 Thread Fabian Holler
Hello Jason,

thanks for your answer.

On Tue, Feb 10, 2015 at 11:00:39AM -0700, Jason Gunthorpe wrote:
 On Tue, Feb 10, 2015 at 04:56:43PM +0100, Fabian Holler wrote:
 
  Does anybody have an idea what could be wrong?
  Are the PATH_MIG* notifications with mlx4 drivers are working for somebody?
 
 IIRC rdmacm does not forward these events.
 
 There is a lot missing to support APM with RDMA CM, it doesn't do any
 of the CM state machine operations to coordinate loading alternate
 paths, for instance.

Alternate paths can be set directly at the QP.

  As soon as the migration happens I expect that a IB_EVENT_PATH_MIG* is
  generated on the QP event handler, but I don't receive any QP events.
 
 qp_event_handler is a RDMA CM event handler, it handles
 RDMA_CM_EVENT_* and there is no RDMA_CM_EVENT_PATH_MIGRATED.
 You'd need to listen to the verbs level events directly, I don't
 recall if this is possible with RDMA CM or not, sorry.

The RDMA CM event handler is registered by the rdma_create_id() call.

The event handler specified in the struct ib_qp_init_attr * parameter of
rdma_create_qp() is registered as the QP event handler.
rdma_create_qp() calls ib_create_qp() and sets qp->event_handler to
ib_qp_init_attr->event_handler.


Fabian


[PATCH opensm] osm_ucast_mgr.c: Support diverse paths for LMC 0 when scatter_ports is enabled

2015-02-11 Thread Hal Rosenstock
From: Vladimir Koushnir vladim...@mellanox.com

The patch adds the ability to calculate diverse paths for all LIDs
(LMC > 0) derived from the same Base LID when scatter_ports is
enabled (non-zero).
We apply randomization during selection of the out ports
only when routing is calculated for the Base LID. For LIDs different
from the Base LID, we don't randomize the out port but apply the procedure
for calculating a diverse path as if scatter_ports were 0.
The patch is relevant for minhop/updn routing.

Signed-off-by: Vladimir Koushnir vladim...@mellanox.com
Signed-off-by: Shlomi Nimrodi shlo...@mellanox.com
---
 opensm/osm_ucast_mgr.c |2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
index 0b24e73..219e080 100644
--- a/opensm/osm_ucast_mgr.c
+++ b/opensm/osm_ucast_mgr.c
@@ -260,7 +260,7 @@ static void ucast_mgr_process_port(IN osm_ucast_mgr_t * p_mgr,
 					   p_mgr->p_subn->opt.lmc,
 					   p_mgr->is_dor,
 					   p_mgr->p_subn->opt.port_shifting,
-					   p_port->use_scatter,
+					   !lid_offset && p_port->use_scatter,
 					   OSM_LFT);
 
 	if (port == OSM_NO_PATH) {
-- 
1.7.8.2



Re: [PATCH 00/22] IB/ipoib: Fixups for multicast issues

2015-02-11 Thread Or Gerlitz

On 2/11/2015 11:55 PM, Doug Ledford wrote:

I tried, but the most appropriate squashes were things like patch 18 and
19 squashed into patch 3, and some other similar jumping around of
patches.  The number of conflicts that created in both the original
squash and the follow on patches was perverse.  So, since the original
20 patches as I submitted previously are what has gone through our QA, I
left them as they were.  If I fixed them up and squashed them, and if
some overlooked fixup failure snuck in there, it would invalidate all of
that testing.


I see. Is this sequence bisectable? Namely, does the driver build and 
function after applying patch by patch? Note this is a hard requirement 
when modifying existing upstream drivers.


Or.


Re: [PATCH 00/22] IB/ipoib: Fixups for multicast issues

2015-02-11 Thread Doug Ledford
On Wed, 2015-02-11 at 23:46 -0500, Or Gerlitz wrote:
 On 2/11/2015 8:43 PM, Doug Ledford wrote:
  This patchset revives the 8 patches that were reverted from 3.19,
  the 11 patches that fixed the problems with the first 8, the single
  patch that was related to reaping of ah's and failure to dealloc
  resources on shutdown, and then adds two new patches that would
  have been enhancements and not bugfixes and hence weren't appropriate
  to post in the 3.19 tussle.
 Doug, as asked few times... please provide public git tree where this 
 series can be pulled from.

git://github.com/dledford/linux.git for-3.20

  So no squashes for patches which actually 
 represent a fix to a downstream patch within the series? any specific 
 reason for that?

I tried, but the most appropriate squashes were things like patch 18 and
19 squashed into patch 3, and some other similar jumping around of
patches.  The number of conflicts that created in both the original
squash and the follow on patches was perverse.  So, since the original
20 patches as I submitted previously are what has gone through our QA, I
left them as they were.  If I fixed them up and squashed them, and if
some overlooked fixup failure snuck in there, it would invalidate all of
that testing.

-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD



