RE: [PATCH v4 14/19] IB/core: Add IB_DEVICE_OPA_MAD_SUPPORT device cap flag
> On 2/4/2015 6:29 PM, ira.we...@intel.com wrote:
>> From: Ira Weiny <ira.we...@intel.com>
>>
>> OPA MADs share a common header with IBTA MADs but with a different base
>> version and an extended length. These jumbo MADs increase the performance
>> of management traffic. Sharing a common header with IBTA MADs allows us
>> to share most of the MAD processing code when dealing with OPA MADs, in
>> addition to supporting some IBTA MADs on OPA devices.
>>
>> Add a device capability flag to indicate OPA MAD support on the device.
>>
>> Signed-off-by: Ira Weiny <ira.we...@intel.com>
>> ---
>>  include/rdma/ib_verbs.h | 5 +++++
>>  1 file changed, 5 insertions(+)
>>
>> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
>> index 3ab4033..2614233 100644
>> --- a/include/rdma/ib_verbs.h
>> +++ b/include/rdma/ib_verbs.h
>> @@ -128,6 +128,10 @@ enum ib_device_cap_flags {
>>  	IB_DEVICE_ON_DEMAND_PAGING	= (1<<31),
>>  };
>>
>> +enum ib_device_cap_flags2 {
>> +	IB_DEVICE_OPA_MAD_SUPPORT	= 1
>> +};
>> +
>>  enum ib_signature_prot_cap {
>>  	IB_PROT_T10DIF_TYPE_1 = 1,
>>  	IB_PROT_T10DIF_TYPE_2 = 1 << 1,
>> @@ -210,6 +214,7 @@ struct ib_device_attr {
>>  	int			sig_prot_cap;
>>  	int			sig_guard_cap;
>>  	struct ib_odp_caps	odp_caps;
>> +	u64			device_cap_flags2;
>>  	u32			max_mad_size;
>>  };
>
> Why is OPA support determined via a device capability flag? What are the
> tradeoffs for doing it this way versus the other choices that have been
> used in the past for other RDMA technologies like RoCE, iWARP, usNIC, ...?

None of those technologies use the MAD stack for Subnet Management. Other
MAD support is very limited (i.e. IB-compatible PMA queries on the local
port only).

Do you have a suggestion for alternatives?

Ira
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
Re: mlx4: IB_EVENT_PATH_MIG not generated on path migration
On Wed, Feb 11, 2015 at 10:21:36AM +0100, Fabian Holler wrote:
> Hello Jason,
>
> thanks for your answer.
>
> On Tue, Feb 10, 2015 at 11:00:39AM -0700, Jason Gunthorpe wrote:
> > On Tue, Feb 10, 2015 at 04:56:43PM +0100, Fabian Holler wrote:
> > > Does anybody have an idea what could be wrong? Are the PATH_MIG*
> > > notifications with the mlx4 drivers working for anybody?
> >
> > IIRC rdmacm does not forward these events. There is a lot missing to
> > support APM with RDMA CM; it doesn't do any of the CM state machine
> > operations to coordinate loading alternate paths, for instance.
>
> Alternate paths can be set directly at the QP.

APM requires a certain amount of synchronization between the local and
remote QP state, particularly if you want it to work more than once. That
is supposed to be done with CM MADs.

> The event handler specified in the struct ib_qp_init_attr * parameter of
> rdma_create_qp() is registered as the QP event handler:
> rdma_create_qp() calls ib_create_qp() and sets qp->event_handler to
> ib_qp_init_attr->event_handler.

Okay, right, I was thinking of something else.

We had exactly this working a few months ago for a demo, in kernel, using
Lustre and RDMA CM. I checked the patches we made and yes, the
event_handler was being properly called with IB_EVENT_PATH_MIG on mlx4 and
qib cards. IIRC we may have needed the latest firmware on the mlx side.

Jason
[PATCH v1 05/10] IB/cm,cma: Move RDMA IP CM private-data parsing code from ib_cma to ib_cm
From: Guy Shapiro <gu...@mellanox.com>

When receiving a connection request, ib_cm needs to associate the request
with a network namespace. To do this, it needs to know the request's
destination IP. For this, the RDMA IP CM packet-formatting functionality
needs to be exposed to ib_cm.

This patch merely moves the RDMA IP CM data formatting and parsing
functions to be part of ib_cm. The following patch will utilize the new
knowledge to look up the appropriate namespace. Each namespace maintains
an independent table of RDMA CM service IDs, allowing isolation and
separation between the network namespaces.

When creating a new incoming connection ID, the code in cm_save_ip_info
can no longer rely on the listener's private data to find the port number,
so it reads it from the requested service ID. This required saving the
service ID in cm_format_paths_from_req.

Signed-off-by: Guy Shapiro <gu...@mellanox.com>
Signed-off-by: Haggai Eran <hagg...@mellanox.com>
Signed-off-by: Yotam Kenneth <yota...@mellanox.com>
Signed-off-by: Shachar Raindel <rain...@mellanox.com>
---
 drivers/infiniband/core/cm.c  | 156 +++
 drivers/infiniband/core/cma.c | 166 +-
 include/rdma/ib_cm.h          |  56 ++
 3 files changed, 230 insertions(+), 148 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index 5a45cb7..efc5cff 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -51,6 +51,7 @@
 #include <rdma/ib_cache.h>
 #include <rdma/ib_cm.h>
+#include <rdma/ib.h>
 #include "cm_msgs.h"

 MODULE_AUTHOR("Sean Hefty");
@@ -701,6 +702,159 @@ static void cm_reject_sidr_req(struct cm_id_private *cm_id_priv,
 	ib_send_cm_sidr_rep(&cm_id_priv->id, &param);
 }

+int cm_format_hdr(void *hdr, int family,
+		  struct sockaddr *src_addr,
+		  struct sockaddr *dst_addr)
+{
+	struct cm_hdr *cm_hdr;
+
+	cm_hdr = hdr;
+	cm_hdr->cm_version = RDMA_IP_CM_VERSION;
+	if (family == AF_INET) {
+		struct sockaddr_in *src4, *dst4;
+
+		src4 = (struct sockaddr_in *)src_addr;
+		dst4 = (struct sockaddr_in *)dst_addr;
+
+		cm_set_ip_ver(cm_hdr, 4);
+		cm_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr;
+		cm_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr;
+		cm_hdr->port = src4->sin_port;
+	} else if (family == AF_INET6) {
+		struct sockaddr_in6 *src6, *dst6;
+
+		src6 = (struct sockaddr_in6 *)src_addr;
+		dst6 = (struct sockaddr_in6 *)dst_addr;
+
+		cm_set_ip_ver(cm_hdr, 6);
+		cm_hdr->src_addr.ip6 = src6->sin6_addr;
+		cm_hdr->dst_addr.ip6 = dst6->sin6_addr;
+		cm_hdr->port = src6->sin6_port;
+	}
+	return 0;
+}
+EXPORT_SYMBOL(cm_format_hdr);
+
+static void cm_save_ib_info(struct sockaddr *src_addr,
+			    struct sockaddr *dst_addr,
+			    struct ib_sa_path_rec *path)
+{
+	struct sockaddr_ib *ib;
+
+	if (src_addr) {
+		ib = (struct sockaddr_ib *)src_addr;
+		ib->sib_family = AF_IB;
+		ib->sib_pkey = path->pkey;
+		ib->sib_flowinfo = path->flow_label;
+		memcpy(&ib->sib_addr, &path->sgid, 16);
+		ib->sib_sid = path->service_id;
+		ib->sib_sid_mask = cpu_to_be64(0xffffffffffffffffULL);
+		ib->sib_scope_id = 0;
+	}
+	if (dst_addr) {
+		ib = (struct sockaddr_ib *)dst_addr;
+		ib->sib_family = AF_IB;
+		ib->sib_pkey = path->pkey;
+		ib->sib_flowinfo = path->flow_label;
+		memcpy(&ib->sib_addr, &path->dgid, 16);
+	}
+}
+
+static void cm_save_ip6_info(struct sockaddr *src_addr,
+			     struct sockaddr *dst_addr,
+			     struct cm_hdr *hdr,
+			     __be16 local_port)
+{
+	struct sockaddr_in6 *ip6;
+
+	if (src_addr) {
+		ip6 = (struct sockaddr_in6 *)src_addr;
+		ip6->sin6_family = AF_INET6;
+		ip6->sin6_addr = hdr->dst_addr.ip6;
+		ip6->sin6_port = local_port;
+	}
+
+	if (dst_addr) {
+		ip6 = (struct sockaddr_in6 *)dst_addr;
+		ip6->sin6_family = AF_INET6;
+		ip6->sin6_addr = hdr->src_addr.ip6;
+		ip6->sin6_port = hdr->port;
+	}
+}
+
+static void cm_save_ip4_info(struct sockaddr *src_addr,
+			     struct sockaddr *dst_addr,
+			     struct cm_hdr *hdr,
+			     __be16 local_port)
+{
+	struct sockaddr_in *ip4;
+
+	if (src_addr) {
+		ip4 = (struct sockaddr_in *)src_addr;
+		ip4->sin_family = AF_INET;
+		ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr;
+		ip4->sin_port = local_port;
+
[PATCH v1 01/10] IB/addr: Pass network namespace as a parameter
From: Guy Shapiro <gu...@mellanox.com>

Add network namespace support to the ib_addr module. For that, all the
address resolution and matching should be done using the appropriate
namespace instead of init_net. This is achieved by:

1. Adding an explicit network namespace argument to exported functions
   that require a namespace.
2. Saving the namespace in the rdma_addr_client structure.
3. Using it when calling networking functions.

In order to preserve the behavior of calling modules, init_net is passed
as the parameter in calls from other modules. This is modified as
namespace support is added on more levels.

Signed-off-by: Haggai Eran <hagg...@mellanox.com>
Signed-off-by: Yotam Kenneth <yota...@mellanox.com>
Signed-off-by: Shachar Raindel <rain...@mellanox.com>
Signed-off-by: Guy Shapiro <gu...@mellanox.com>
---
 drivers/infiniband/core/addr.c           | 31 --
 drivers/infiniband/core/cma.c            |  4 ++-
 drivers/infiniband/core/verbs.c          | 14 +++---
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |  3 ++-
 include/rdma/ib_addr.h                   | 44
 5 files changed, 72 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index f80da50..95beaef 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -128,7 +128,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
 	int ret = -EADDRNOTAVAIL;

 	if (dev_addr->bound_dev_if) {
-		dev = dev_get_by_index(&init_net, dev_addr->bound_dev_if);
+		dev = dev_get_by_index(dev_addr->net, dev_addr->bound_dev_if);
 		if (!dev)
 			return -ENODEV;
 		ret = rdma_copy_addr(dev_addr, dev, NULL);
@@ -137,9 +137,10 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
 	}

 	switch (addr->sa_family) {
-	case AF_INET:
-		dev = ip_dev_find(&init_net,
-			((struct sockaddr_in *) addr)->sin_addr.s_addr);
+	case AF_INET: {
+		struct sockaddr_in *addr_in = (struct sockaddr_in *)addr;
+
+		dev = ip_dev_find(dev_addr->net, addr_in->sin_addr.s_addr);
 		if (!dev)
 			return ret;

@@ -149,12 +150,12 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
 			*vlan_id = rdma_vlan_dev_vlan_id(dev);
 		dev_put(dev);
 		break;
-
+	}
 #if IS_ENABLED(CONFIG_IPV6)
 	case AF_INET6:
 		rcu_read_lock();
-		for_each_netdev_rcu(&init_net, dev) {
-			if (ipv6_chk_addr(&init_net,
+		for_each_netdev_rcu(dev_addr->net, dev) {
+			if (ipv6_chk_addr(dev_addr->net,
 					  &((struct sockaddr_in6 *) addr)->sin6_addr,
 					  dev, 1)) {
 				ret = rdma_copy_addr(dev_addr, dev, NULL);
@@ -236,7 +237,7 @@ static int addr4_resolve(struct sockaddr_in *src_in,
 	fl4.daddr = dst_ip;
 	fl4.saddr = src_ip;
 	fl4.flowi4_oif = addr->bound_dev_if;
-	rt = ip_route_output_key(&init_net, &fl4);
+	rt = ip_route_output_key(addr->net, &fl4);
 	if (IS_ERR(rt)) {
 		ret = PTR_ERR(rt);
 		goto out;
@@ -278,12 +279,13 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
 	fl6.saddr = src_in->sin6_addr;
 	fl6.flowi6_oif = addr->bound_dev_if;

-	dst = ip6_route_output(&init_net, NULL, &fl6);
+	dst = ip6_route_output(addr->net, NULL, &fl6);
 	if ((ret = dst->error))
 		goto put;

 	if (ipv6_addr_any(&fl6.saddr)) {
-		ret = ipv6_dev_get_saddr(&init_net, ip6_dst_idev(dst)->dev,
+		ret = ipv6_dev_get_saddr(addr->net,
+					 ip6_dst_idev(dst)->dev,
 					 &fl6.daddr, 0, &fl6.saddr);
 		if (ret)
 			goto put;
@@ -458,7 +460,7 @@ static void resolve_cb(int status, struct sockaddr *src_addr,
 }

 int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac,
-			       u16 *vlan_id)
+			       u16 *vlan_id, struct net *net)
 {
 	int ret = 0;
 	struct rdma_dev_addr dev_addr;
@@ -481,6 +483,7 @@ int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac,
 		return ret;

 	memset(&dev_addr, 0, sizeof(dev_addr));
+	dev_addr.net = net;

 	ctx.addr = &dev_addr;
 	init_completion(&ctx.comp);
@@ -492,7 +495,7 @@ int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac,
 	wait_for_completion(&ctx.comp);

 	memcpy(dmac, dev_addr.dst_dev_addr, ETH_ALEN);
-	dev = dev_get_by_index(&init_net, dev_addr.bound_dev_if);
+
[PATCH v1 03/10] IB/core: Find the network namespace matching connection parameters
From: Yotam Kenneth <yota...@mellanox.com>

In the case of IPoIB, and maybe in other cases, the network device is
managed by an upper-layer protocol (ULP). In order to expose this network
device to other users of the IB device, let ULPs implement a callback that
returns the network device matching the given connection parameters. The
IB device and port, together with the P_Key and the IP address, should be
enough to uniquely identify the ULP net device. This function is passed to
ib_core as part of the ib_client registration.

Using this functionality, add a way to get the network namespace
corresponding to a work completion. This is needed so that responses to CM
requests can be sent from the same network namespace as the request.

Signed-off-by: Haggai Eran <hagg...@mellanox.com>
Signed-off-by: Yotam Kenneth <yota...@mellanox.com>
Signed-off-by: Shachar Raindel <rain...@mellanox.com>
Signed-off-by: Guy Shapiro <gu...@mellanox.com>
---
 drivers/infiniband/core/device.c | 57
 include/rdma/ib_verbs.h          | 29
 2 files changed, 86 insertions(+)

diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c
index 18c1ece..2f06be5 100644
--- a/drivers/infiniband/core/device.c
+++ b/drivers/infiniband/core/device.c
@@ -38,6 +38,7 @@
 #include <linux/slab.h>
 #include <linux/init.h>
 #include <linux/mutex.h>
+#include <linux/netdevice.h>
 #include <rdma/rdma_netlink.h>

 #include "core_priv.h"
@@ -733,6 +734,62 @@ int ib_find_pkey(struct ib_device *device,
 }
 EXPORT_SYMBOL(ib_find_pkey);

+static struct net_device *ib_get_net_dev_by_port_pkey_ip(struct ib_device *dev,
+							 u8 port,
+							 u16 pkey,
+							 struct sockaddr *addr)
+{
+	struct net_device *ret = NULL;
+	struct ib_client *client;
+
+	mutex_lock(&device_mutex);
+	list_for_each_entry(client, &client_list, list)
+		if (client->get_net_device_by_port_pkey_ip) {
+			ret = client->get_net_device_by_port_pkey_ip(dev, port,
+								     pkey,
+								     addr);
+			if (ret)
+				break;
+		}
+
+	mutex_unlock(&device_mutex);
+	return ret;
+}
+
+struct net *ib_get_net_ns_by_port_pkey_ip(struct ib_device *dev,
+					  u8 port,
+					  u16 pkey,
+					  struct sockaddr *addr)
+{
+	struct net_device *ndev = NULL;
+	struct net *ns;
+
+	switch (rdma_port_get_link_layer(dev, port)) {
+	case IB_LINK_LAYER_INFINIBAND:
+		if (!addr)
+			goto not_found;
+		ndev = ib_get_net_dev_by_port_pkey_ip(dev, port, pkey, addr);
+		break;
+	default:
+		goto not_found;
+	}
+
+	if (!ndev)
+		goto not_found;
+
+	rcu_read_lock();
+	ns = maybe_get_net(dev_net(ndev));
+	dev_put(ndev);
+	rcu_read_unlock();
+	if (!ns)
+		goto not_found;
+	return ns;
+
+not_found:
+	return get_net(&init_net);
+}
+EXPORT_SYMBOL(ib_get_net_ns_by_port_pkey_ip);
+
 static int __init ib_core_init(void)
 {
 	int ret;

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index f4a85de..74b2394 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1683,6 +1683,21 @@ struct ib_client {
 	void (*add)   (struct ib_device *);
 	void (*remove)(struct ib_device *);

+	/* Returns the net_dev belonging to this ib_client and matching the
+	 * given parameters.
+	 * @dev:  An RDMA device that the net_dev uses for communication.
+	 * @port: A physical port number on the RDMA device.
+	 * @pkey: P_Key that the net_dev uses, if applicable.
+	 * @addr: An IP address the net_dev is configured with.
+	 *
+	 * An ib_client that implements a net_dev on top of RDMA devices
+	 * (such as IP over IB) should implement this callback, allowing the
+	 * rdma_cm module to find the right net_dev for a given request. */
+	struct net_device *(*get_net_device_by_port_pkey_ip)(
+			struct ib_device *dev,
+			u8 port,
+			u16 pkey,
+			struct sockaddr *addr);
 	struct list_head list;
 };

@@ -2679,4 +2694,18 @@ static inline int ib_check_mr_access(int flags)
 int ib_check_mr_status(struct ib_mr *mr, u32 check_mask,
 		       struct ib_mr_status *mr_status);

+/**
+ * ib_get_net_ns_by_port_pkey_ip() - Return the appropriate net namespace
+ * for a received CM request
+ * @dev: An RDMA device on
[PATCH v1 04/10] IB/ipoib: Return IPoIB devices as possible matches to get_net_device_by_port_pkey_ip
From: Guy Shapiro <gu...@mellanox.com>

Implement the callback that returns a network device to ib_core according
to connection parameters. Check the ipoib device and iterate over all
child devices to look for a match. For each ipoib device we iterate
through all upper devices when searching for a matching IP, in order to
support bonding.

Signed-off-by: Guy Shapiro <gu...@mellanox.com>
Signed-off-by: Haggai Eran <hagg...@mellanox.com>
Signed-off-by: Yotam Kenneth <yota...@mellanox.com>
Signed-off-by: Shachar Raindel <rain...@mellanox.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 122 +-
 1 file changed, 121 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 58b5aa3..63119ff 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -48,6 +48,9 @@
 #include <linux/jhash.h>
 #include <net/arp.h>
+#include <net/addrconf.h>
+#include <linux/inetdevice.h>
+#include <rdma/ib_cache.h>

 #define DRV_VERSION "1.0.0"

@@ -91,11 +94,15 @@ struct ib_sa_client ipoib_sa_client;
 static void ipoib_add_one(struct ib_device *device);
 static void ipoib_remove_one(struct ib_device *device);
 static void ipoib_neigh_reclaim(struct rcu_head *rp);
+static struct net_device *ipoib_get_net_device_by_port_pkey_ip(
+		struct ib_device *dev, u8 port, u16 pkey,
+		struct sockaddr *addr);

 static struct ib_client ipoib_client = {
 	.name   = "ipoib",
 	.add    = ipoib_add_one,
-	.remove = ipoib_remove_one
+	.remove = ipoib_remove_one,
+	.get_net_device_by_port_pkey_ip = ipoib_get_net_device_by_port_pkey_ip,
 };

 int ipoib_open(struct net_device *dev)
@@ -222,6 +229,119 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu)
 	return 0;
 }

+static bool ipoib_is_dev_match_addr(struct sockaddr *addr,
+				    struct net_device *dev)
+{
+	struct net *net = dev_net(dev);
+
+	if (addr->sa_family == AF_INET) {
+		struct in_device *in_dev = in_dev_get(dev);
+		struct sockaddr_in *addr_in = (struct sockaddr_in *)addr;
+		__be32 ret_addr;
+
+		if (!in_dev)
+			return false;
+
+		ret_addr = inet_confirm_addr(net, in_dev, 0,
+					     addr_in->sin_addr.s_addr,
+					     RT_SCOPE_HOST);
+		in_dev_put(in_dev);
+		if (ret_addr)
+			return true;
+	}
+#if IS_ENABLED(CONFIG_IPV6)
+	else if (addr->sa_family == AF_INET6) {
+		struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr;
+
+		if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1))
+			return true;
+	}
+#endif
+	return false;
+}
+
+/**
+ * Find a net_device matching the given address, which is an upper device of
+ * the given net_device.
+ * @addr: IP address to look for.
+ * @dev: base IPoIB net_device
+ *
+ * If found, returns the net_device with a reference held. Otherwise return
+ * NULL.
+ */
+static struct net_device *ipoib_get_net_dev_match_addr(struct sockaddr *addr,
+						       struct net_device *dev)
+{
+	struct net_device *upper,
+			  *result = NULL;
+	struct list_head *iter;
+
+	if (ipoib_is_dev_match_addr(addr, dev)) {
+		dev_hold(dev);
+		return dev;
+	}
+
+	rcu_read_lock();
+	netdev_for_each_all_upper_dev_rcu(dev, upper, iter) {
+		if (ipoib_is_dev_match_addr(addr, upper)) {
+			dev_hold(upper);
+			result = upper;
+			break;
+		}
+	}
+	rcu_read_unlock();
+	return result;
+}
+
+static struct net_device *ipoib_get_net_device_by_port_pkey_ip(
+		struct ib_device *dev, u8 port, u16 pkey, struct sockaddr *addr)
+{
+	struct ipoib_dev_priv *priv;
+	struct list_head *dev_list;
+	u16 pkey_index;
+
+	ib_find_cached_pkey(dev, port, pkey, &pkey_index);
+	if (pkey_index == (u16)-1)
+		return NULL;
+
+	if (rdma_node_get_transport(dev->node_type) != RDMA_TRANSPORT_IB)
+		return NULL;
+
+	dev_list = ib_get_client_data(dev, &ipoib_client);
+	if (!dev_list)
+		return NULL;
+
+	list_for_each_entry(priv, dev_list, list) {
+		struct net_device *net_dev = NULL;
+		struct ipoib_dev_priv *child_priv;
+
+		if (priv->port != port)
+			continue;
+
+		if (priv->pkey_index == pkey_index) {
+			net_dev = ipoib_get_net_dev_match_addr(addr, priv->dev);
+			if (net_dev)
+				return net_dev;
+		}
+
[PATCH v1 06/10] IB/cm: Add network namespace support
From: Guy Shapiro <gu...@mellanox.com>

Add namespace support to the IB-CM layer.

- Each CM-ID now has a network namespace it is associated with, assigned
  at creation. This namespace is used as needed during subsequent action
  on the CM-ID or related objects.
- All of the relevant calls to ib_addr and ib_core were updated to use
  the namespace from the CM-ID. External APIs were extended as needed to
  allow specifying the namespace where relevant.
- The listening service ID table is now also indexed by the CM-ID
  namespace.
- For incoming connection requests, the connection parameters are used to
  select a namespace. The namespace is matched when looking for a
  listening service ID.

To preserve current behavior, init_net is passed to ib_cm wherever network
namespace function parameters were added.

The ib_cm_create_id interface now takes a reference to the relevant
network namespace. CM-IDs created by accepting a connection for a
listening CM-ID will also take a reference to the namespace. When the ID
is destroyed, the namespace reference is released.
Signed-off-by: Guy Shapiro <gu...@mellanox.com>
Signed-off-by: Haggai Eran <hagg...@mellanox.com>
Signed-off-by: Yotam Kenneth <yota...@mellanox.com>
Signed-off-by: Shachar Raindel <rain...@mellanox.com>
---
 drivers/infiniband/core/cm.c            | 124
 drivers/infiniband/core/cma.c           |   8 ++-
 drivers/infiniband/core/ucm.c           |   3 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c |  21 +-
 drivers/infiniband/ulp/srp/ib_srp.c     |   2 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c   |   2 +-
 include/rdma/ib_cm.h                    |   7 +-
 7 files changed, 130 insertions(+), 37 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index efc5cff..75c6ac9 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -241,6 +241,8 @@ struct cm_id_private {
 	u8 service_timeout;
 	u8 target_ack_delay;

+	struct net *net; /* A network namespace that the ID belongs to */
+
 	struct list_head work_list;
 	atomic_t work_count;
 };
@@ -347,12 +349,13 @@ static void cm_set_private_data(struct cm_id_private *cm_id_priv,
 }

 static void cm_init_av_for_response(struct cm_port *port, struct ib_wc *wc,
-				    struct ib_grh *grh, struct cm_av *av)
+				    struct ib_grh *grh, struct cm_av *av,
+				    struct net *net)
 {
 	av->port = port;
 	av->pkey_index = wc->pkey_index;
 	ib_init_ah_from_wc(port->cm_dev->ib_device, port->port_num, wc,
-			   grh, &av->ah_attr, &init_net);
+			   grh, &av->ah_attr, net);
 }

 static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
@@ -521,10 +524,15 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv)
 		if ((cur_cm_id_priv->id.service_mask & service_id) ==
 		    (service_mask & cur_cm_id_priv->id.service_id) &&
 		    (cm_id_priv->id.device == cur_cm_id_priv->id.device) &&
-		    !data_cmp)
+		    !data_cmp &&
+		    net_eq(cm_id_priv->net, cur_cm_id_priv->net))
 			return cur_cm_id_priv;

-		if (cm_id_priv->id.device < cur_cm_id_priv->id.device)
+		if (cm_id_priv->net < cur_cm_id_priv->net)
+			link = &(*link)->rb_left;
+		else if (cm_id_priv->net > cur_cm_id_priv->net)
+			link = &(*link)->rb_right;
+		else if (cm_id_priv->id.device < cur_cm_id_priv->id.device)
 			link = &(*link)->rb_left;
 		else if (cm_id_priv->id.device > cur_cm_id_priv->id.device)
 			link = &(*link)->rb_right;
@@ -544,7 +552,8 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv)

 static struct cm_id_private * cm_find_listen(struct ib_device *device,
 					     __be64 service_id,
-					     u8 *private_data)
+					     u8 *private_data,
+					     struct net *net)
 {
 	struct rb_node *node = cm.listen_service_table.rb_node;
 	struct cm_id_private *cm_id_priv;
@@ -556,10 +565,14 @@ static struct cm_id_private * cm_find_listen(struct ib_device *device,
 					  cm_id_priv->compare_data);
 		if ((cm_id_priv->id.service_mask & service_id) ==
 		     cm_id_priv->id.service_id &&
-		    (cm_id_priv->id.device == device) && !data_cmp)
+		    (cm_id_priv->id.device == device) && !data_cmp &&
+		    net_eq(cm_id_priv->net, net))
 			return cm_id_priv;
-
-		if (device < cm_id_priv->id.device)
+		if (net < cm_id_priv->net)
+			node =
[PATCH v1 00/10] Add network namespace support in the RDMA-CM
RDMA-CM uses IP-based addressing and routing to set up RDMA connections
between hosts. Currently, all of the IP interfaces and addresses used by
the RDMA-CM must reside in the init_net namespace. This restricts the
usage of containers with RDMA to the host network namespace (a.k.a. the
kernel init_net namespace instance).

This patchset allows using network namespaces with the RDMA-CM.

Each RDMA-CM ID and CM ID keeps a reference to a network namespace. This
reference is based on the process network namespace at the time the object
is created, or is inherited from the listener. This network namespace is
used to perform all IP and network-related operations. Specifically, the
local device lookup, as well as the remote GID address resolution, are
done in the context of the RDMA-CM object's namespace. This allows
outgoing connections to reach the right target, even if the same IP
address exists in multiple network namespaces. This can happen if each
network namespace resides on a different pkey.

Additionally, the network namespace is used to split the listener service
ID table. From the user's point of view, each network namespace has a
unique, completely independent table of service IDs. This allows running
multiple instances of a single service on the same machine, using
containers. To implement this, the CM layer now parses the IP address from
the CM connect requests, and searches for the matching network device. The
namespace of the device found is used when looking up the service ID in
the listener table.

The functionality introduced by this series comes into play when the
transport is InfiniBand and IPoIB interfaces are assigned to each
namespace. Multiple IPoIB interfaces can be created and assigned to
different RDMA-CM capable containers, for example using pipework [1]. Full
support for RoCE will be introduced at a later stage.
The patches apply against kernel v3.19-rc5, with the patch "RDMA/CMA: Mark
IPv4 addresses correctly when the listener is IPv6" [2] applied.

The patchset is structured as follows:

Patches 1 and 2 are relatively trivial API extensions, requiring the
callers of certain ib_addr and ib_core functions to provide a network
namespace, as needed.

Patches 3 and 4 add the ability to look up a network namespace according
to the IP address, device and pkey. They find the matching IPoIB
interfaces, and safely take a reference on the network namespace before
returning to the caller.

Patch 5 moves the logic that extracts the IP address from a connect
request into the CM layer. This is needed for the upcoming listener lookup
by namespace.

Patch 6 adds support for network namespaces in the CM layer. All callers
are still passing init_net as the namespace, to maintain backward
compatibility. For incoming requests, the namespace of the relevant IPoIB
device is used.

Patches 7 and 8 add proper namespace support to the RDMA-CM module.

Patches 9 and 10 add namespace support to the relevant user-facing modules
in the IB stack.
[1] https://github.com/jpetazzo/pipework/pull/108
[2] https://patchwork.kernel.org/patch/5298971/

Change-log:
V0 -> V1:
 - Fix code review comments by Yann
 - Rebase on top of linux-3.19

Guy Shapiro (7):
  IB/addr: Pass network namespace as a parameter
  IB/core: Pass network namespace as a parameter to relevant functions
  IB/ipoib: Return IPoIB devices as possible matches to
    get_net_device_by_port_pkey_ip
  IB/cm,cma: Move RDMA IP CM private-data parsing code from ib_cma to
    ib_cm
  IB/cm: Add network namespace support
  IB/cma: Add support for network namespaces
  IB/ucma: Take the network namespace from the process

Shachar Raindel (1):
  IB/ucm: Add partial support for network namespaces

Yotam Kenneth (2):
  IB/core: Find the network namespace matching connection parameters
  IB/cma: Separate port allocation to network namespaces

 drivers/infiniband/core/addr.c                     |  31 +-
 drivers/infiniband/core/agent.c                    |   4 +-
 drivers/infiniband/core/cm.c                       | 287 --
 drivers/infiniband/core/cma.c                      | 332 +
 drivers/infiniband/core/device.c                   |  57
 drivers/infiniband/core/mad_rmpp.c                 |  10 +-
 drivers/infiniband/core/ucm.c                      |   4 +-
 drivers/infiniband/core/ucma.c                     |   4 +-
 drivers/infiniband/core/user_mad.c                 |   4 +-
 drivers/infiniband/core/verbs.c                    |  22 +-
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c           |   3 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c            |  21 +-
 drivers/infiniband/ulp/ipoib/ipoib_main.c          | 122 +++-
 drivers/infiniband/ulp/iser/iser_verbs.c           |   2 +-
 drivers/infiniband/ulp/isert/ib_isert.c            |   2 +-
 drivers/infiniband/ulp/srp/ib_srp.c                |   2 +-
 drivers/infiniband/ulp/srpt/ib_srpt.c              |   5 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h
[PATCH v1 08/10] IB/cma: Add support for network namespaces
From: Guy Shapiro <gu...@mellanox.com>

Add support for network namespaces in the ib_cma module. This is
accomplished by:

1. Adding a network namespace parameter to rdma_create_id. This parameter
   is used to populate the network namespace field in rdma_id_private.
   rdma_create_id keeps a reference on the network namespace.
2. Using the network namespace from the rdma_id instead of init_net
   inside of ib_cma.
3. Decrementing the reference count for the appropriate network namespace
   when calling rdma_destroy_id.

In order to preserve the current behavior, init_net is passed when calling
from other modules.

Signed-off-by: Guy Shapiro <gu...@mellanox.com>
Signed-off-by: Haggai Eran <hagg...@mellanox.com>
Signed-off-by: Yotam Kenneth <yota...@mellanox.com>
Signed-off-by: Shachar Raindel <rain...@mellanox.com>
---
 drivers/infiniband/core/cma.c                      | 52 +-
 drivers/infiniband/core/ucma.c                     |  3 +-
 drivers/infiniband/ulp/iser/iser_verbs.c           |  2 +-
 drivers/infiniband/ulp/isert/ib_isert.c            |  2 +-
 .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h    |  4 +-
 include/rdma/rdma_cm.h                             |  6 ++-
 net/9p/trans_rdma.c                                |  2 +-
 net/rds/ib.c                                       |  2 +-
 net/rds/ib_cm.c                                    |  2 +-
 net/rds/iw.c                                       |  2 +-
 net/rds/iw_cm.c                                    |  2 +-
 net/rds/rdma_transport.c                           |  2 +-
 net/sunrpc/xprtrdma/svc_rdma_transport.c           |  2 +-
 net/sunrpc/xprtrdma/verbs.c                        |  3 +-
 14 files changed, 52 insertions(+), 34 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 022b0d0..9ea42fe 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -540,7 +540,8 @@ static int cma_disable_callback(struct rdma_id_private *id_priv,

 struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
 				  void *context, enum rdma_port_space ps,
-				  enum ib_qp_type qp_type)
+				  enum ib_qp_type qp_type,
+				  struct net *net)
 {
 	struct rdma_id_private *id_priv;
@@ -562,7 +563,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler,
 	INIT_LIST_HEAD(&id_priv->listen_list);
 	INIT_LIST_HEAD(&id_priv->mc_list);
 	get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num);
-	id_priv->id.route.addr.dev_addr.net = &init_net;
+	id_priv->id.route.addr.dev_addr.net = get_net(net);

 	return &id_priv->id;
 }
@@ -689,7 +690,7 @@ static int cma_modify_qp_rtr(struct rdma_id_private *id_priv,
 	    rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num)
 	    == IB_LINK_LAYER_ETHERNET) {
 		ret = rdma_addr_find_smac_by_sgid(&sgid, qp_attr.smac, NULL,
-						  &init_net);
+						  id_priv->id.route.addr.dev_addr.net);
 		if (ret)
 			goto out;
@@ -953,6 +954,7 @@ static void cma_cancel_operation(struct rdma_id_private *id_priv,
 static void cma_release_port(struct rdma_id_private *id_priv)
 {
 	struct rdma_bind_list *bind_list = id_priv->bind_list;
+	struct net *net = id_priv->id.route.addr.dev_addr.net;

 	if (!bind_list)
 		return;
@@ -960,7 +962,7 @@ static void cma_release_port(struct rdma_id_private *id_priv)
 	mutex_lock(&lock);
 	hlist_del(&id_priv->node);
 	if (hlist_empty(&bind_list->owners)) {
-		cma_ps_remove(bind_list->ps, &init_net, bind_list->port);
+		cma_ps_remove(bind_list->ps, net, bind_list->port);
 		kfree(bind_list);
 	}
 	mutex_unlock(&lock);
@@ -1029,6 +1031,7 @@ void rdma_destroy_id(struct rdma_cm_id *id)
 		cma_deref_id(id_priv->id.context);

 	kfree(id_priv->id.route.path_rec);
+	put_net(id_priv->id.route.addr.dev_addr.net);
 	kfree(id_priv);
 }
 EXPORT_SYMBOL(rdma_destroy_id);
@@ -1156,7 +1159,8 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id,
 	int ret;

 	id = rdma_create_id(listen_id->event_handler, listen_id->context,
-			    listen_id->ps, ib_event->param.req_rcvd.qp_type);
+			    listen_id->ps, ib_event->param.req_rcvd.qp_type,
+			    listen_id->route.addr.dev_addr.net);
 	if (IS_ERR(id))
 		return NULL;
@@ -1201,10 +1205,11 @@ static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id,
 {
 	struct rdma_id_private *id_priv;
 	struct rdma_cm_id *id;
+	struct net *net = listen_id->route.addr.dev_addr.net;
 	int ret;

 	id = rdma_create_id(listen_id->event_handler,
[PATCH v1 07/10] IB/cma: Separate port allocation to network namespaces
From: Yotam Kenneth yota...@mellanox.com Keep a radix-tree for the network namespaces we support for each port-space. Dynamically allocate idr for network namespace upon first bind request for a port in the (ps, net) tuple. Destroy the idr when the (ps, net) tuple does not contain any bounded ports. This patch is internal infrastructure work for the following patch. In this patch, init_net is statically used as the network namespace for the new port-space API. The radix-tree is protected under the same locking that protects the rest of the port space data. This locking is practically a big, static mutex lock for the entire module. Signed-off-by: Haggai Eran hagg...@mellanox.com Signed-off-by: Yotam Kenneth yota...@mellanox.com Signed-off-by: Shachar Raindel rain...@mellanox.com Signed-off-by: Guy Shapiro gu...@mellanox.com --- drivers/infiniband/core/cma.c | 122 ++ 1 file changed, 99 insertions(+), 23 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 1ce84a0..022b0d0 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -39,11 +39,13 @@ #include linux/mutex.h #include linux/random.h #include linux/idr.h +#include linux/radix-tree.h #include linux/inetdevice.h #include linux/slab.h #include linux/module.h #include net/route.h +#include net/netns/hash.h #include net/tcp.h #include net/ipv6.h @@ -80,10 +82,83 @@ static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); static struct workqueue_struct *cma_wq; -static DEFINE_IDR(tcp_ps); -static DEFINE_IDR(udp_ps); -static DEFINE_IDR(ipoib_ps); -static DEFINE_IDR(ib_ps); +static RADIX_TREE(tcp_ps, GFP_KERNEL); +static RADIX_TREE(udp_ps, GFP_KERNEL); +static RADIX_TREE(ipoib_ps, GFP_KERNEL); +static RADIX_TREE(ib_ps, GFP_KERNEL); + +static LIST_HEAD(idrs_list); + +struct idr_ll { + unsigned net_val; + struct net *net; + struct radix_tree_root *ps; + struct idr idr; +}; + +static void zap_ps_idr(struct idr_ll *idr_ll) 
+{ + radix_tree_delete(idr_ll-ps, idr_ll-net_val); + idr_destroy(idr_ll-idr); + kfree(idr_ll); +} + +static int cma_ps_alloc(struct radix_tree_root *ps, struct net *net, void *ptr, + int snum) +{ + struct idr_ll *idr_ll; + int err; + int res; + + idr_ll = radix_tree_lookup(ps, net_hash_mix(net)); + if (!idr_ll) { + idr_ll = kmalloc(sizeof(*idr_ll), GFP_KERNEL); + if (!idr_ll) + return -ENOMEM; + idr_init(idr_ll-idr); + idr_ll-net_val = net_hash_mix(net); + idr_ll-net = net; + idr_ll-ps = ps; + err = radix_tree_insert(ps, idr_ll-net_val, idr_ll); + if (err) { + idr_destroy(idr_ll-idr); + kfree(idr_ll); + return err; + } + } + res = idr_alloc(idr_ll-idr, ptr, snum, snum + 1, GFP_KERNEL); + if (unlikely((res 0) idr_is_empty(idr_ll-idr))) { + zap_ps_idr(idr_ll); + return res; + } + return res; +} + +static void *cma_ps_find(struct radix_tree_root *ps, struct net *net, int snum) +{ + struct idr_ll *idr_ll; + + idr_ll = radix_tree_lookup(ps, net_hash_mix(net)); + if (!idr_ll) + return NULL; + return idr_find(idr_ll-idr, snum); +} + +static void cma_ps_remove(struct radix_tree_root *ps, struct net *net, int snum) +{ + struct idr_ll *idr_ll; + + idr_ll = radix_tree_lookup(ps, net_hash_mix(net)); + if (unlikely(!idr_ll)) { + WARN(1, cma_ps_removed can't find expected net ns 0x%lx\n, +(unsigned long)net); + return; + } + idr_remove(idr_ll-idr, snum); + if (idr_is_empty(idr_ll-idr)) { + zap_ps_idr(idr_ll); + } +} struct cma_device { struct list_headlist; @@ -94,9 +169,9 @@ struct cma_device { }; struct rdma_bind_list { - struct idr *ps; - struct hlist_head owners; - unsigned short port; + struct radix_tree_root *ps; + struct hlist_head owners; + unsigned short port; }; enum { @@ -885,7 +960,7 @@ static void cma_release_port(struct rdma_id_private *id_priv) mutex_lock(lock); hlist_del(id_priv-node); if (hlist_empty(bind_list-owners)) { - idr_remove(bind_list-ps, bind_list-port); + cma_ps_remove(bind_list-ps, init_net, bind_list-port); kfree(bind_list); } mutex_unlock(lock); @@ 
-2198,8 +2273,8 @@ static void cma_bind_port(struct rdma_bind_list *bind_list, hlist_add_head(id_priv-node, bind_list-owners); } -static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv,
[PATCH v1 10/10] IB/ucm: Add partial support for network namespaces
It is impossible to completely support network namespaces for UCM, as we cannot identify the target IPoIB device. However, we add support which will work if the user is following the IB-Spec Annex 11 (RDMA IP CM Services) with the service ID and private data formatting.

Signed-off-by: Haggai Eran <hagg...@mellanox.com>
Signed-off-by: Yotam Kenneth <yota...@mellanox.com>
Signed-off-by: Shachar Raindel <rain...@mellanox.com>
Signed-off-by: Guy Shapiro <gu...@mellanox.com>
---
 drivers/infiniband/core/ucm.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c
index 9604ab0..4244210 100644
--- a/drivers/infiniband/core/ucm.c
+++ b/drivers/infiniband/core/ucm.c
@@ -45,6 +45,7 @@
 #include <linux/idr.h>
 #include <linux/mutex.h>
 #include <linux/slab.h>
+#include <linux/nsproxy.h>

 #include <asm/uaccess.h>

@@ -490,7 +491,7 @@ static ssize_t ib_ucm_create_id(struct ib_ucm_file *file,
 	ctx->uid = cmd.uid;
 	ctx->cm_id = ib_create_cm_id(file->device->ib_dev,
 				     ib_ucm_event_handler, ctx,
-				     &init_net);
+				     current->nsproxy->net_ns);
 	if (IS_ERR(ctx->cm_id)) {
 		result = PTR_ERR(ctx->cm_id);
 		goto err1;
--
1.7.11.2
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html
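The namespace patches in this series rely on a simple ownership contract: rdma_create_id() pins the caller's network namespace with get_net(), and rdma_destroy_id() unpins it with put_net(). A minimal userspace sketch of that refcount contract, assuming nothing about the kernel internals; `fake_net`, `fake_get_net`, and `fake_put_net` are illustrative stand-ins, not the kernel API:

```c
#include <assert.h>

/* Userspace model of the get_net()/put_net() contract described in
 * the patches above: each created rdma id takes a reference on the
 * namespace, and destroying the id drops it. */
struct fake_net { int refcount; };

static struct fake_net *fake_get_net(struct fake_net *net)
{
	net->refcount++;	/* rdma_create_id(): pin the namespace */
	return net;
}

static int fake_put_net(struct fake_net *net)
{
	return --net->refcount;	/* rdma_destroy_id(): freeable at zero */
}
```

The point of the pin is that the namespace cannot be torn down while an rdma_cm_id still references it, so lookups through `id.route.addr.dev_addr.net` stay valid for the id's lifetime.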
[PATCH v1 02/10] IB/core: Pass network namespace as a parameter to relevant functions
From: Guy Shapiro gu...@mellanox.com Add network namespace parameters for the address related ib_core functions. The parameter is passed to lower level function, instead of init_net, so things are done in the correct namespace. For now pass init_net on every caller. Callers that will pass init_net permanently are marked with an appropriate comment. Signed-off-by: Haggai Eran hagg...@mellanox.com Signed-off-by: Yotam Kenneth yota...@mellanox.com Signed-off-by: Shachar Raindel rain...@mellanox.com Signed-off-by: Guy Shapiro gu...@mellanox.com --- drivers/infiniband/core/agent.c | 4 +++- drivers/infiniband/core/cm.c | 9 +++-- drivers/infiniband/core/mad_rmpp.c| 10 -- drivers/infiniband/core/user_mad.c| 4 +++- drivers/infiniband/core/verbs.c | 10 ++ drivers/infiniband/ulp/srpt/ib_srpt.c | 3 ++- include/rdma/ib_verbs.h | 15 +-- 7 files changed, 42 insertions(+), 13 deletions(-) diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c index f6d2961..539378d 100644 --- a/drivers/infiniband/core/agent.c +++ b/drivers/infiniband/core/agent.c @@ -99,7 +99,9 @@ void agent_send_response(struct ib_mad *mad, struct ib_grh *grh, } agent = port_priv-agent[qpn]; - ah = ib_create_ah_from_wc(agent-qp-pd, wc, grh, port_num); + /* Physical devices (and their MAD replies) always reside in the host +* network namespace */ + ah = ib_create_ah_from_wc(agent-qp-pd, wc, grh, port_num, init_net); if (IS_ERR(ah)) { dev_err(device-dev, ib_create_ah_from_wc error %ld\n, PTR_ERR(ah)); diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index e28a494..5a45cb7 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -290,8 +290,13 @@ static int cm_alloc_response_msg(struct cm_port *port, struct ib_mad_send_buf *m; struct ib_ah *ah; + /* For IB, the network namespace doesn't affect the created address +* handle, so we use init_net. 
In the future, RoCE support will +* require finding a specific network namespace to send the response +* from. */ ah = ib_create_ah_from_wc(port-mad_agent-qp-pd, mad_recv_wc-wc, - mad_recv_wc-recv_buf.grh, port-port_num); + mad_recv_wc-recv_buf.grh, port-port_num, + init_net); if (IS_ERR(ah)) return PTR_ERR(ah); @@ -346,7 +351,7 @@ static void cm_init_av_for_response(struct cm_port *port, struct ib_wc *wc, av-port = port; av-pkey_index = wc-pkey_index; ib_init_ah_from_wc(port-cm_dev-ib_device, port-port_num, wc, - grh, av-ah_attr); + grh, av-ah_attr, init_net); } static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av) diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c index f37878c..6c15762 100644 --- a/drivers/infiniband/core/mad_rmpp.c +++ b/drivers/infiniband/core/mad_rmpp.c @@ -157,8 +157,11 @@ static struct ib_mad_send_buf *alloc_response_msg(struct ib_mad_agent *agent, struct ib_ah *ah; int hdr_len; + /* Physical devices (and their MAD replies) always reside in the host +* network namespace */ ah = ib_create_ah_from_wc(agent-qp-pd, recv_wc-wc, - recv_wc-recv_buf.grh, agent-port_num); + recv_wc-recv_buf.grh, agent-port_num, + init_net); if (IS_ERR(ah)) return (void *) ah; @@ -287,10 +290,13 @@ create_rmpp_recv(struct ib_mad_agent_private *agent, if (!rmpp_recv) return NULL; + /* Physical devices (and their MAD replies) always reside in the host +* network namespace */ rmpp_recv-ah = ib_create_ah_from_wc(agent-agent.qp-pd, mad_recv_wc-wc, mad_recv_wc-recv_buf.grh, -agent-agent.port_num); +agent-agent.port_num, +init_net); if (IS_ERR(rmpp_recv-ah)) goto error; diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index 928cdd2..f34c607 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -239,7 +239,9 @@ static void recv_handler(struct ib_mad_agent *agent, ib_init_ah_from_wc(agent-device, agent-port_num, mad_recv_wc-wc, 
mad_recv_wc-recv_buf.grh, - ah_attr); + ah_attr, init_net); + /* Note that network
[PATCH 12/22] IB/ipoib: Handle -ENETRESET properly in our callback
When the core layer calls our callback with -ENETRESET as the error, we clear the status to 0 before returning, indicating that we are going to handle the error ourselves. This causes the core layer to not free the mcast->mc structure, but we were releasing our reference to it by clearing mcast->mc. So preserve our reference to the multicast structure; then, when we next run ipoib_mcast_dev_flush, it will be able to properly release the mcast->mc entry at the right time in the right way.

Signed-off-by: Doug Ledford <dledf...@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 12 ++--
 1 file changed, 6 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 957e7d2e80c..50d2de2270a 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -315,8 +315,10 @@ ipoib_mcast_sendonly_join_complete(int status,
 	mutex_lock(&mcast_mutex);

 	/* We trap for port events ourselves. */
-	if (status == -ENETRESET)
+	if (status == -ENETRESET) {
+		status = 0;
 		goto out;
+	}

 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
@@ -344,8 +346,6 @@ out:
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	if (status == -ENETRESET)
-		status = 0;
 	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
@@ -462,8 +462,10 @@ static int ipoib_mcast_join_complete(int status,
 	mutex_lock(&mcast_mutex);

 	/* We trap for port events ourselves. */
-	if (status == -ENETRESET)
+	if (status == -ENETRESET) {
+		status = 0;
 		goto out;
+	}

 	if (!status)
 		status = ipoib_mcast_join_finish(mcast, &multicast->rec);
@@ -499,8 +501,6 @@ out:
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	if (status == -ENETRESET)
-		status = 0;
 	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);

--
2.1.0
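The subtlety the patch fixes is ordering: the handler uses the status value for two things at once, what it returns to the core layer and whether it drops its own mcast->mc reference. A small userspace model of that decision, assuming only the behavior described in the commit message; `mcast_state` and `join_complete_model` are illustrative names, not the kernel functions:

```c
#include <assert.h>
#include <errno.h>
#include <stddef.h>

/* Model of the ordering fix: -ENETRESET must be turned into 0 *before*
 * the common cleanup path runs, so that "status != 0" consistently
 * means "real failure, drop our membership reference". */
struct mcast_state { void *mc; };

static int join_complete_model(struct mcast_state *m, int status,
			       void *joined_mc)
{
	m->mc = joined_mc;
	if (status == -ENETRESET)
		status = 0;	/* we trap port events ourselves */
	if (status)
		m->mc = NULL;	/* real failure: forget the membership */
	return status;		/* 0 tells the core layer not to free it */
}
```

With the old code, status was zeroed only after the `if (status) mcast->mc = NULL;` test had already run, so a reset both cleared the local reference and told the core to keep the object alive, leaking it.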
[PATCH 04/22] IB/ipoib: fix mcast_dev_flush/mcast_restart_task race
Our mcast_dev_flush routine and our mcast_restart_task can race against each other. In particular, they both hold the priv->lock while manipulating the rbtree, while removing mcast entries from the multicast_list, and while adding entries to the remove_list, but they also both drop their locks prior to doing the actual removes. The mcast_dev_flush routine is run entirely under the rtnl lock and so has at least some locking. The actual race condition is like this:

   Thread 1                              Thread 2
   ifconfig ib0 up
     start multicast join for broadcast
     multicast join completes for
       broadcast
     start to add more multicast joins
     call mcast_restart_task to add
       new entries
                                         ifconfig ib0 down
                                           mcast_dev_flush
                                             mcast_leave(mcast A)
     mcast_leave(mcast A)

As mcast_leave calls ib_sa_multicast_leave, and as membership in core/multicast.c is ref counted, we run into an unbalanced refcount issue. To avoid stomping on each other's removes, take the rtnl lock specifically when we are deleting the entries from the remove list.

Signed-off-by: Doug Ledford <dledf...@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 37 ++
 1 file changed, 32 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a52c9f3f7e4..41325960e4e 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -802,7 +802,10 @@ void ipoib_mcast_dev_flush(struct net_device *dev)
 	spin_unlock_irqrestore(&priv->lock, flags);

-	/* seperate between the wait to the leave*/
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
 		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
 			wait_for_completion(&mcast->done);
@@ -923,14 +926,38 @@ void ipoib_mcast_restart_task(struct work_struct *work)
 	netif_addr_unlock(dev);
 	local_irq_restore(flags);

-	/* We have to cancel outside of the spinlock */
+	/*
+	 * make sure the in-flight joins have finished before we attempt
+	 * to leave
+	 */
+	list_for_each_entry_safe(mcast, tmcast, &remove_list, list)
+		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
+			wait_for_completion(&mcast->done);
+
+	/*
+	 * We have to cancel outside of the spinlock, but we have to
+	 * take the rtnl lock or else we race with the removal of
+	 * entries from the remove list in mcast_dev_flush as part
+	 * of ipoib_stop() which will call mcast_stop_thread with
+	 * flush == 1 while holding the rtnl lock, and the
+	 * flush_workqueue won't complete until this restart_mcast_task
+	 * completes.  So do like the carrier on task and attempt to
+	 * take the rtnl lock, but if we can't before the ADMIN_UP flag
+	 * goes away, then just return and know that the remove list will
+	 * get flushed later by mcast_dev_flush.
+	 */
+	while (!rtnl_trylock()) {
+		if (!test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
+			return;
+		else
+			msleep(20);
+	}
 	list_for_each_entry_safe(mcast, tmcast, &remove_list, list) {
 		ipoib_mcast_leave(mcast->dev, mcast);
 		ipoib_mcast_free(mcast);
 	}
-
-	if (test_bit(IPOIB_FLAG_ADMIN_UP, &priv->flags))
-		ipoib_mcast_start_thread(dev);
+	ipoib_mcast_start_thread(dev);
+	rtnl_unlock();
 }

 #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG
--
2.1.0
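The trylock loop in this patch encodes a deadlock-avoidance rule: keep trying for the lock, but bail out as soon as the ADMIN_UP flag disappears, because in that case the other path (mcast_dev_flush) will clean the remove list anyway. A userspace sketch of just that control flow, under the assumption of a fake lock that becomes free after a set number of attempts; `fake_rtnl_trylock` and `lock_or_bail` are illustrative names, not kernel APIs:

```c
#include <assert.h>
#include <stdbool.h>

/* Model of the rtnl_trylock() loop above: the fake lock "frees up"
 * once attempts_until_free reaches zero. */
static int attempts_until_free;

static bool fake_rtnl_trylock(void)
{
	return --attempts_until_free <= 0;
}

/* Returns true if we got the lock, false if we bailed out because
 * the interface went administratively down while we were spinning
 * (in the kernel the spin is a msleep(20), not a busy loop). */
static bool lock_or_bail(const bool *admin_up)
{
	while (!fake_rtnl_trylock()) {
		if (!*admin_up)
			return false;	/* dev_flush will clean up for us */
	}
	return true;
}
```

The design choice is the key point: blocking unconditionally on rtnl_lock() here can deadlock against ipoib_stop(), which holds rtnl while flushing the very workqueue this task runs on, so the task must have an exit path that does not require the lock.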
[PATCH 01/22] IB/ipoib: Consolidate rtnl_lock tasks in workqueue
Setting the MTU can safely be moved to the carrier_on_task, which keeps us from needing to take the rtnl lock in the join_finish section.

Signed-off-by: Doug Ledford <dledf...@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 8 ++--
 1 file changed, 2 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index ffb83b5f7e8..eee66d13e5b 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -190,12 +190,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast,
 		spin_unlock_irq(&priv->lock);
 		priv->tx_wr.wr.ud.remote_qkey = priv->qkey;
 		set_qkey = 1;
-
-		if (!ipoib_cm_admin_enabled(dev)) {
-			rtnl_lock();
-			dev_set_mtu(dev, min(priv->mcast_mtu, priv->admin_mtu));
-			rtnl_unlock();
-		}
 	}

 	if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags)) {
@@ -371,6 +365,8 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work)
 	}

 	rtnl_lock();
+	if (!ipoib_cm_admin_enabled(priv->dev))
+		dev_set_mtu(priv->dev, min(priv->mcast_mtu, priv->admin_mtu));
 	netif_carrier_on(priv->dev);
 	rtnl_unlock();
 }
--
2.1.0
[PATCH 13/22] IB/ipoib: don't restart our thread on ENETRESET
Make the send-only join completion handler behave like the normal join completion handler: restart the multicast join thread whenever our join completes, whether with an error or with success. But don't restart the thread on a net reset event; instead let ipoib_event queue up a device flush, which will then call ipoib_mcast_dev_flush to remove all of the current mcast entries, and ipoib_mcast_restart_task to check our current mcast entry list against the netdev list of mcast entries we are supposed to have and requeue the ones we need to get back.

Signed-off-by: Doug Ledford <dledf...@redhat.com>
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 3 ++-
 1 file changed, 2 insertions(+), 1 deletion(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 50d2de2270a..a03b6a04203 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -336,17 +336,18 @@ ipoib_mcast_sendonly_join_complete(int status,
 			dev_kfree_skb_any(skb_dequeue(&mcast->pkt_queue));
 		}
 		netif_tx_unlock_bh(dev);
+		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	} else {
 		/* Join completed, so reset any backoff parameters */
 		mcast->backoff = 1;
 		mcast->delay_until = jiffies;
+		__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	}
 out:
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	__ipoib_mcast_continue_join_thread(priv, NULL, 0);
 	mutex_unlock(&mcast_mutex);
 	return status;
 }
--
2.1.0
[PATCH 03/22] IB/ipoib: fix MCAST_FLAG_BUSY usage
Commit a9c8ba5884 (IPoIB: Fix usage of uninitialized multicast objects) added a new flag MCAST_JOIN_STARTED, but was not very strict in how it was used. We didn't always initialize the completion struct before we set the flag, and we didn't always call complete on the completion struct from all paths that complete it. This made it less than totally effective, and certainly made its use confusing. And in the flush function we would use the presence of this flag to signal that we should wait on the completion struct, but we never cleared this flag, ever. This is further muddied by the fact that we overload the MCAST_FLAG_BUSY flag to mean two different things: we have a join in flight, and we have succeeded in getting an ib_sa_join_multicast. In order to make things clearer and aid in resolving the rtnl deadlock bug I've been chasing, I cleaned this up a bit. 1) Remove the MCAST_JOIN_STARTED flag entirely 2) Un-overload MCAST_FLAG_BUSY so it now only means a join is in-flight 3) Test on mcast-mc directly to see if we have completed ib_sa_join_multicast (using IS_ERR_OR_NULL) 4) Make sure that before setting MCAST_FLAG_BUSY we always initialize the mcast-done completion struct 5) Make sure that before calling complete(mcast-done), we always clear the MCAST_FLAG_BUSY bit 6) Take the mcast_mutex before we call ib_sa_multicast_join and also take the mutex in our join callback. This forces ib_sa_multicast_join to return and set mcast-mc before we process the callback. This way, our callback can safely clear mcast-mc if there is an error on the join and we will do the right thing as a result in mcast_dev_flush. 7) Because we need the mutex to synchronize mcast-mc, we can no longer call mcast_sendonly_join directly from mcast_send and instead must add sendonly join processing to the mcast_join_task A number of different races are resolved with these changes. 
These races existed with the old MCAST_FLAG_BUSY usage; the MCAST_JOIN_STARTED flag was an attempt to address them, and while it helped, a determined effort could still trip things up. One race looks something like this:

   Thread 1                              Thread 2
   ib_sa_join_multicast (as part of
     running restart mcast task)
     alloc member
     call callback
                                         ifconfig ib0 down
                                         wait_for_completion
     callback call completes
                                         wait_for_completion in
                                         mcast_dev_flush completes
                                           mcast->mc is PTR_ERR_OR_NULL
                                           so we skip ib_sa_leave_multicast
     return from callback
     return from ib_sa_join_multicast
     set mcast->mc = return from
       ib_sa_multicast

We now have a permanently unbalanced join/leave issue that trips up the refcounting in core/multicast.c. Another looks like this:

   Thread 1                 Thread 2              Thread 3
   ib_sa_multicast_join
                                                  ifconfig ib0 down
                                                  priv->broadcast = NULL
                            join_complete
                                                  wait_for_completion
   mcast->mc is not yet set,
     so don't clear
   return from ib_sa_join_multicast
     and set mcast->mc
                            complete
                            return -EAGAIN (making
                              mcast->mc invalid)
                                                  call ib_sa_multicast_leave
                                                  on invalid mcast->mc,
                                                  hang forever

By holding the mutex around ib_sa_multicast_join and taking the mutex early in the callback, we force mcast->mc to be valid at the time we run the callback. This allows us to clear mcast->mc if there is an error and the join is going to fail. We do this before we complete the mcast. In this way, mcast_dev_flush always sees consistent state in regards to mcast->mc membership at the time that the wait_for_completion() returns.
Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib.h | 10 +- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 148 - 2 files changed, 101 insertions(+), 57 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index d7562beb542..f4c1b20b23b 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -98,9 +98,15 @@ enum { IPOIB_MCAST_FLAG_FOUND= 0, /* used in set_multicast_list */ IPOIB_MCAST_FLAG_SENDONLY = 1, - IPOIB_MCAST_FLAG_BUSY = 2, /* joining or already joined */ + /* +* For IPOIB_MCAST_FLAG_BUSY +* When set, in flight join and mcast-mc is unreliable +* When clear and mcast-mc IS_ERR_OR_NULL, need to restart or +* haven't started yet +* When clear and
[PATCH 15/22] IB/ipoib: fix race between mcast_dev_flush and mcast_join
There is a classic race between mcast_dev_flush and mcast_join_thread: the join thread loops through all of the mcast entries in the device mcast list, while dev_flush wants to remove those entries from the list so that they can eventually be freed once they are no longer busy. As always, the list traversal and the list removal must be done under the same lock to avoid races. So modify the join_task thread code to hold the spinlock for the entirety of its run, up until it has saved the mcast entry it is going to work on into a private pointer and set the flags on that entry to mark it as busy. Only then is it safe to drop the lock and let mcast_dev_flush remove entries from the list. I also reworked the flow of the join_task thread to make clearer that it makes a single pass at the work to do and then exits as soon as it gets that single task queued off. I then reworked both the regular and send-only join routines to know that they will be called with the BUSY flag already set on the mcast entry they are passed, so any error condition means they need to clear the flag and complete the task, and they should not check the busy status as it will now always be true.
Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 112 - 1 file changed, 56 insertions(+), 56 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 424c602d5a0..972fa15a225 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -368,22 +368,16 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) if (!test_bit(IPOIB_FLAG_OPER_UP, priv-flags)) { ipoib_dbg_mcast(priv, device shutting down, no sendonly multicast joins\n); + clear_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags); + complete(mcast-done); return -ENODEV; } - if (test_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags)) { - ipoib_dbg_mcast(priv, multicast entry busy, skipping - sendonly join\n); - return -EBUSY; - } - rec.mgid = mcast-mcmember.mgid; rec.port_gid = priv-local_gid; rec.pkey = cpu_to_be16(priv-pkey); mutex_lock(mcast_mutex); - init_completion(mcast-done); - set_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags); mcast-mc = ib_sa_join_multicast(ipoib_sa_client, priv-ca, priv-port, rec, IB_SA_MCMEMBER_REC_MGID| @@ -552,8 +546,6 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, } mutex_lock(mcast_mutex); - init_completion(mcast-done); - set_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags); mcast-mc = ib_sa_join_multicast(ipoib_sa_client, priv-ca, priv-port, rec, comp_mask, GFP_KERNEL, ipoib_mcast_join_complete, mcast); @@ -574,6 +566,8 @@ void ipoib_mcast_join_task(struct work_struct *work) container_of(work, struct ipoib_dev_priv, mcast_task.work); struct net_device *dev = priv-dev; struct ib_port_attr port_attr; + struct ipoib_mcast *mcast = NULL; + int create = 1; if (!test_bit(IPOIB_MCAST_RUN, priv-flags)) return; @@ -591,16 +585,25 @@ void ipoib_mcast_join_task(struct work_struct *work) else memcpy(priv-dev-dev_addr + 4, priv-local_gid.raw, sizeof (union ib_gid)); + /* +* We have to hold the mutex to 
keep from racing with the join +* completion threads on setting flags on mcasts, and we have +* to hold the priv-lock because dev_flush will remove entries +* out from underneath us, so at a minimum we need the lock +* through the time that we do the for_each loop of the mcast +* list or else dev_flush can make us oops. +*/ + mutex_lock(mcast_mutex); + spin_lock_irq(priv-lock); + if (!test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) + goto out; + if (!priv-broadcast) { struct ipoib_mcast *broadcast; - if (!test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) - return; - - broadcast = ipoib_mcast_alloc(dev, 1); + broadcast = ipoib_mcast_alloc(dev, 0); if (!broadcast) { ipoib_warn(priv, failed to allocate broadcast group\n); - mutex_lock(mcast_mutex); /* * Restart us after a 1 second delay to retry * creating our broadcast group and attaching to @@ -608,67 +611,64 @@ void
[PATCH 11/22] IB/ipoib: make delayed tasks not hold up everything
Currently, the multicast join thread only processes one task at a time. It does this by having itself be scheduled as a delayed work item, when it runs, it finds one, and only one, piece of work to process, it then kicks that off via either the normal join process or the sendonly join process, and then it immediately exits. Both of those process chains are responsible for restarting the task when they have completed their specific action. This makes the entire join process serial with only one join supposedly ever outstanding at a time. However, if we fail a join, and we need to initiate a backoff delay, that delay holds up the entire join process for all joins, not just the failed join. So modify the design such that we can have joins in delay, and the multicast thread will ignore them until their time comes around, and then we can process the rest of the queue without waiting for the delayed items. Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 86 +- 2 files changed, 57 insertions(+), 30 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 8ba80a6d3a4..c79dcd5ee8a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -154,6 +154,7 @@ struct ipoib_mcast { unsigned long created; unsigned long backoff; + unsigned long delay_until; unsigned long flags; unsigned char logcount; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index cb1e495bd74..957e7d2e80c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -67,10 +67,34 @@ struct ipoib_mcast_iter { }; static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv, + struct ipoib_mcast *mcast, int delay) { - if (test_bit(IPOIB_MCAST_RUN, priv-flags)) + if (test_bit(IPOIB_MCAST_RUN, priv-flags)) 
{ + /* +* Mark this mcast for its delay and set a timer to kick the +* thread when the delay completes +*/ + if (mcast delay) { + mcast-backoff *= 2; + if (mcast-backoff IPOIB_MAX_BACKOFF_SECONDS) + mcast-backoff = IPOIB_MAX_BACKOFF_SECONDS; + mcast-delay_until = jiffies + (mcast-backoff * HZ); + queue_delayed_work(priv-wq, priv-mcast_task, + mcast-backoff * HZ); + } else if (delay) { + /* Special case of retrying after a failure to +* allocate the broadcast multicast group, wait +* 1 second and try again +*/ + queue_delayed_work(priv-wq, priv-mcast_task, HZ); + } + /* +* But also rekick the thread immediately for any other +* tasks in queue behind this one +*/ queue_delayed_work(priv-wq, priv-mcast_task, delay); + } } static void ipoib_mcast_free(struct ipoib_mcast *mcast) @@ -110,6 +134,7 @@ static struct ipoib_mcast *ipoib_mcast_alloc(struct net_device *dev, mcast-dev = dev; mcast-created = jiffies; + mcast-delay_until = jiffies; mcast-backoff = 1; INIT_LIST_HEAD(mcast-list); @@ -309,6 +334,10 @@ ipoib_mcast_sendonly_join_complete(int status, dev_kfree_skb_any(skb_dequeue(mcast-pkt_queue)); } netif_tx_unlock_bh(dev); + } else { + /* Join completed, so reset any backoff parameters */ + mcast-backoff = 1; + mcast-delay_until = jiffies; } out: clear_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags); @@ -317,7 +346,7 @@ out: complete(mcast-done); if (status == -ENETRESET) status = 0; - __ipoib_mcast_continue_join_thread(priv, 0); + __ipoib_mcast_continue_join_thread(priv, NULL, 0); mutex_unlock(mcast_mutex); return status; } @@ -369,7 +398,7 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) complete(mcast-done); ipoib_warn(priv, ib_sa_join_multicast for sendonly join failed (ret = %d)\n, ret); - __ipoib_mcast_continue_join_thread(priv, 0); + __ipoib_mcast_continue_join_thread(priv, NULL, 0); } else { ipoib_dbg_mcast(priv, no multicast record for %pI6, starting sendonly join\n, mcast-mcmember.mgid.raw); @@ -441,7 +470,8 @@
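The per-mcast backoff introduced in this patch is simple exponential backoff with a cap, applied to one entry's `delay_until` only so the rest of the queue keeps moving. A runnable userspace sketch of that arithmetic; the cap value of 16 is an assumption here (the real value lives in ipoib.h), and `mcast_backoff_state` / `mcast_join_failed` are illustrative names:

```c
#include <assert.h>

#define IPOIB_MAX_BACKOFF_SECONDS 16	/* assumed cap; see ipoib.h */

/* Model of the backoff bookkeeping in
 * __ipoib_mcast_continue_join_thread(): each failed join doubles the
 * delay, capped, and only this mcast's delay_until moves forward. */
struct mcast_backoff_state {
	unsigned long backoff;		/* seconds */
	unsigned long delay_until;	/* in "jiffies" */
};

static void mcast_join_failed(struct mcast_backoff_state *m,
			      unsigned long jiffies, unsigned long hz)
{
	m->backoff *= 2;
	if (m->backoff > IPOIB_MAX_BACKOFF_SECONDS)
		m->backoff = IPOIB_MAX_BACKOFF_SECONDS;
	/* only this entry waits; other queued joins proceed */
	m->delay_until = jiffies + m->backoff * hz;
}
```

A successful join resets the state (`backoff = 1; delay_until = jiffies;`), which is exactly what the completion handlers in this series do.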
[PATCH 09/22] IB/ipoib: fix IPOIB_MCAST_RUN flag usage
The usage of IPOIB_MCAST_RUN as a flag is inconsistent. In some places it is used to mean "our device is administratively allowed to send multicast joins/leaves/packets", and in other places it means "our multicast join task thread is currently running and will process your request if you put it on the queue". This latter meaning is in fact flawed: there is a race condition between the join task testing the mcast list and finding it empty of remaining work, dropping the mcast mutex and the priv->lock spinlock, and clearing the IPOIB_MCAST_RUN flag. Further, numerous locations use the flag in the former fashion, and when all tasks complete and the task thread clears the RUN flag, all of those other locations will fail to ever again queue any work. This results in the interface coming up fine initially, but having problems adding new multicast groups after the first round of groups have all been added and the RUN flag is cleared by the join task thread when it thinks it is done. To resolve this issue, convert all locations in the code to treat the RUN flag as an indicator that the multicast portion of this interface is in fact administratively up and joins/leaves/sends can be performed. There is no harm (other than a slight performance penalty) in never clearing this flag and using it in this fashion; it simply means that a few places that used to micro-optimize how often this task was queued on a work queue will now queue the task a few extra times. We can address that suboptimal behavior in future patches.
Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +-- 1 file changed, 5 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index bc50dd0d0e4..91b8fe118ec 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -630,8 +630,6 @@ void ipoib_mcast_join_task(struct work_struct *work) } ipoib_dbg_mcast(priv, successfully joined all multicast groups\n); - - clear_bit(IPOIB_MCAST_RUN, priv-flags); } int ipoib_mcast_start_thread(struct net_device *dev) @@ -641,8 +639,8 @@ int ipoib_mcast_start_thread(struct net_device *dev) ipoib_dbg_mcast(priv, starting multicast thread\n); mutex_lock(mcast_mutex); - if (!test_and_set_bit(IPOIB_MCAST_RUN, priv-flags)) - queue_delayed_work(priv-wq, priv-mcast_task, 0); + set_bit(IPOIB_MCAST_RUN, priv-flags); + queue_delayed_work(priv-wq, priv-mcast_task, 0); mutex_unlock(mcast_mutex); return 0; @@ -725,7 +723,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb) memcpy(mcast-mcmember.mgid.raw, mgid, sizeof (union ib_gid)); __ipoib_mcast_add(dev, mcast); list_add_tail(mcast-list, priv-multicast_list); - if (!test_and_set_bit(IPOIB_MCAST_RUN, priv-flags)) + if (test_bit(IPOIB_MCAST_RUN, priv-flags)) queue_delayed_work(priv-wq, priv-mcast_task, 0); } @@ -951,7 +949,8 @@ void ipoib_mcast_restart_task(struct work_struct *work) /* * Restart our join task if needed */ - ipoib_mcast_start_thread(dev); + if (test_bit(IPOIB_MCAST_RUN, priv-flags)) + queue_delayed_work(priv-wq, priv-mcast_task, 0); rtnl_unlock(); } -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/22] IB/ipoib: Make the carrier_on_task race aware
We blindly assume that we can just take the rtnl lock and that will prevent races with downing this interface. Unfortunately, that's not the case. In ipoib_mcast_stop_thread() we will call flush_workqueue() in an attempt to clear out all remaining instances of ipoib_join_task. But, since this task is put on the same workqueue as the join task, the flush_workqueue waits on this thread too. But this thread is deadlocked on the rtnl lock. The better thing here is to use trylock and loop on that until we either get the lock or we see that FLAG_ADMIN_UP has been cleared, in which case we don't need to do anything anyway and we just return. Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 +++-- 1 file changed, 15 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index eee66d13e5b..9862c76a83f 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -353,18 +353,27 @@ void ipoib_mcast_carrier_on_task(struct work_struct *work) carrier_on_task); struct ib_port_attr attr; - /* -* Take rtnl_lock to avoid racing with ipoib_stop() and -* turning the carrier back on while a device is being -* removed. -*/ if (ib_query_port(priv-ca, priv-port, attr) || attr.state != IB_PORT_ACTIVE) { ipoib_dbg(priv, Keeping carrier off until IB port is active\n); return; } - rtnl_lock(); + /* +* Take rtnl_lock to avoid racing with ipoib_stop() and +* turning the carrier back on while a device is being +* removed. 
However, ipoib_stop() will attempt to flush +* the workqueue while holding the rtnl lock, so loop +* on trylock until either we get the lock or we see +* FLAG_ADMIN_UP go away as that signals that we are bailing +* and can safely ignore the carrier on work +*/ + while (!rtnl_trylock()) { + if (!test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) + return; + else + msleep(20); + } if (!ipoib_cm_admin_enabled(priv-dev)) dev_set_mtu(priv-dev, min(priv-mcast_mtu, priv-admin_mtu)); netif_carrier_on(priv-dev); -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 10/22] IB/ipoib: Add a helper to restart the multicast task
Add a simple helper to do the right thing when restarting the multicast thread. Call that helper from all the places that need to restart the thread. Add two places that didn't restart the thread, but should have. Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 46 -- 1 file changed, 29 insertions(+), 17 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 91b8fe118ec..cb1e495bd74 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -66,6 +66,13 @@ struct ipoib_mcast_iter { unsigned int send_only; }; +static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv, + int delay) +{ + if (test_bit(IPOIB_MCAST_RUN, priv-flags)) + queue_delayed_work(priv-wq, priv-mcast_task, delay); +} + static void ipoib_mcast_free(struct ipoib_mcast *mcast) { struct net_device *dev = mcast-dev; @@ -270,6 +277,7 @@ ipoib_mcast_sendonly_join_complete(int status, { struct ipoib_mcast *mcast = multicast-context; struct net_device *dev = mcast-dev; + struct ipoib_dev_priv *priv = netdev_priv(dev); /* * We have to take the mutex to force mcast_sendonly_join to @@ -309,6 +317,7 @@ out: complete(mcast-done); if (status == -ENETRESET) status = 0; + __ipoib_mcast_continue_join_thread(priv, 0); mutex_unlock(mcast_mutex); return status; } @@ -360,6 +369,7 @@ static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) complete(mcast-done); ipoib_warn(priv, ib_sa_join_multicast for sendonly join failed (ret = %d)\n, ret); + __ipoib_mcast_continue_join_thread(priv, 0); } else { ipoib_dbg_mcast(priv, no multicast record for %pI6, starting sendonly join\n, mcast-mcmember.mgid.raw); @@ -431,8 +441,7 @@ static int ipoib_mcast_join_complete(int status, if (!status) { mcast-backoff = 1; - if (test_bit(IPOIB_MCAST_RUN, priv-flags)) - queue_delayed_work(priv-wq, priv-mcast_task, 0); + 
__ipoib_mcast_continue_join_thread(priv, 0); /* * Defer carrier on work to priv-wq to avoid a @@ -454,6 +463,13 @@ static int ipoib_mcast_join_complete(int status, mcast-backoff *= 2; if (mcast-backoff IPOIB_MAX_BACKOFF_SECONDS) mcast-backoff = IPOIB_MAX_BACKOFF_SECONDS; + /* +* XXX - This is wrong. *Our* join failed, but because the +* join thread does the joins in a serial fashion, if there +* are any joins behind ours waiting to complete, they should +* not be subjected to our backoff delay. +*/ + __ipoib_mcast_continue_join_thread(priv, mcast-backoff * HZ); } out: spin_lock_irq(priv-lock); @@ -463,9 +479,6 @@ out: complete(mcast-done); if (status == -ENETRESET) status = 0; - if (status test_bit(IPOIB_MCAST_RUN, priv-flags)) - queue_delayed_work(priv-wq, priv-mcast_task, - mcast-backoff * HZ); spin_unlock_irq(priv-lock); mutex_unlock(mcast_mutex); @@ -532,10 +545,13 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, mcast-backoff *= 2; if (mcast-backoff IPOIB_MAX_BACKOFF_SECONDS) mcast-backoff = IPOIB_MAX_BACKOFF_SECONDS; - - if (test_bit(IPOIB_MCAST_RUN, priv-flags)) - queue_delayed_work(priv-wq, priv-mcast_task, - mcast-backoff * HZ); + /* +* XXX - This is wrong. *Our* join failed, but because the +* join thread does the joins in a serial fashion, if there +* are any joins behind ours waiting to complete, they should +* not be subjected to our backoff delay. +*/ + __ipoib_mcast_continue_join_thread(priv, mcast-backoff * HZ); } mutex_unlock(mcast_mutex); } @@ -573,9 +589,7 @@ void ipoib_mcast_join_task(struct work_struct *work) if (!broadcast) { ipoib_warn(priv, failed to allocate broadcast group\n); mutex_lock(mcast_mutex); - if (test_bit(IPOIB_MCAST_RUN, priv-flags)) - queue_delayed_work(priv-wq, priv-mcast_task, - HZ); +
[PATCH 07/22] IB/ipoib: Make ipoib_mcast_stop_thread flush the workqueue
We used to pass a flush variable to mcast_stop_thread to indicate if we should flush the workqueue or not. This was due to some code trying to flush a workqueue that it was currently running on which is a no-no. Now that we have per-device work queues, and now that ipoib_mcast_restart_task has taken the fact that it is queued on a single thread workqueue with all of the ipoib_mcast_join_task's and therefore has no need to stop the join task while it runs, we can do away with the flush parameter and unilaterally flush always. Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib.h | 2 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c| 4 ++-- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 21 + 3 files changed, 12 insertions(+), 15 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 45fd10a72ec..28dc927c0e8 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -499,7 +499,7 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb); void ipoib_mcast_restart_task(struct work_struct *work); int ipoib_mcast_start_thread(struct net_device *dev); -int ipoib_mcast_stop_thread(struct net_device *dev, int flush); +int ipoib_mcast_stop_thread(struct net_device *dev); void ipoib_mcast_dev_down(struct net_device *dev); void ipoib_mcast_dev_flush(struct net_device *dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index bfd17d41b5f..66096787119 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -747,7 +747,7 @@ int ipoib_ib_dev_down(struct net_device *dev, int flush) clear_bit(IPOIB_FLAG_OPER_UP, priv-flags); netif_carrier_off(dev); - ipoib_mcast_stop_thread(dev, flush); + ipoib_mcast_stop_thread(dev); ipoib_mcast_dev_flush(dev); ipoib_flush_paths(dev); @@ -1097,7 +1097,7 @@ void ipoib_ib_dev_cleanup(struct net_device *dev) */ 
ipoib_flush_paths(dev); - ipoib_mcast_stop_thread(dev, 1); + ipoib_mcast_stop_thread(dev); ipoib_mcast_dev_flush(dev); ipoib_transport_dev_cleanup(dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 845f910eb21..bc50dd0d0e4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -648,7 +648,7 @@ int ipoib_mcast_start_thread(struct net_device *dev) return 0; } -int ipoib_mcast_stop_thread(struct net_device *dev, int flush) +int ipoib_mcast_stop_thread(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -659,8 +659,7 @@ int ipoib_mcast_stop_thread(struct net_device *dev, int flush) cancel_delayed_work(priv-mcast_task); mutex_unlock(mcast_mutex); - if (flush) - flush_workqueue(priv-wq); + flush_workqueue(priv-wq); return 0; } @@ -838,8 +837,6 @@ void ipoib_mcast_restart_task(struct work_struct *work) ipoib_dbg_mcast(priv, restarting multicast task\n); - ipoib_mcast_stop_thread(dev, 0); - local_irq_save(flags); netif_addr_lock(dev); spin_lock(priv-lock); @@ -936,13 +933,10 @@ void ipoib_mcast_restart_task(struct work_struct *work) * We have to cancel outside of the spinlock, but we have to * take the rtnl lock or else we race with the removal of * entries from the remove list in mcast_dev_flush as part -* of ipoib_stop() which will call mcast_stop_thread with -* flush == 1 while holding the rtnl lock, and the -* flush_workqueue won't complete until this restart_mcast_task -* completes. So do like the carrier on task and attempt to -* take the rtnl lock, but if we can't before the ADMIN_UP flag -* goes away, then just return and know that the remove list will -* get flushed later by mcast_stop_thread. +* of ipoib_stop(). We detect the drop of the ADMIN_UP flag +* to signal that we have hit this particular race, and we +* return since we know we don't need to do anything else +* anyway. 
*/ while (!rtnl_trylock()) { if (!test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) @@ -954,6 +948,9 @@ void ipoib_mcast_restart_task(struct work_struct *work) ipoib_mcast_leave(mcast-dev, mcast); ipoib_mcast_free(mcast); } + /* +* Restart our join task if needed +*/ ipoib_mcast_start_thread(dev); rtnl_unlock(); } -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/22] IB/ipoib: Fixups for multicast issues
This patchset revives the 8 patches that were reverted from 3.19, the 11 patches that fixed the problems with the first 8, the single patch related to reaping of ah's and the failure to dealloc resources on shutdown, and then adds two new patches that would have been enhancements rather than bugfixes and hence weren't appropriate to post during the 3.19 tussle.

Testing of this patchset is currently underway, but it has done well so far. IPv4 multicast, IPv6 multicast, connected mode, datagram mode, rmmod/insmod while active, restart opensm while active, and ifconfig up/ifconfig down in a tight while loop have all passed.

There are two outstanding issues that I think still need to be addressed (I ran across these issues while performing all the other testing, and I think they existed prior to my patchset, but I haven't booted up a clean kernel to verify that yet...I'll do that tomorrow and if things are not as I expect, I'll report back here):

1) In connected mode, the initial ip6 ping to any host takes almost exactly 1 second to complete. The debug messages show this delay very clearly:

[19059.689967] qib_ib0: joining MGID ff12:601b::::0001:ff31:7791
[19059.689970] qib_ib0: successfully started all multicast joins
[19059.690313] qib_ib0: sendonly join completion for ff12:601b::::0001:ff31:7791 (status 0)
[19059.690314] qib_ib0: Created ah 88080cc0ef60
[19059.690315] qib_ib0: MGID ff12:601b::::0001:ff31:7791 AV 88080cc0ef60, LID 0xc035, SL 0
  - Final debug message from creating our ah and when we should have requeued our sends
[19060.694190] qib_ib0: REQ arrived
  - almost exactly 1 second later, we finally start setting up our connection

In datagram mode, this does not happen and the initial startup of ping6 is mostly immediate.

2) In connected mode, restarting opensm repeatedly can cause some of the machines to start failing to find other machines when trying to use ping6. However, they don't lose connectivity to all machines, only specific machines.
A rmmod/insmod cycle solves the problem. So does a full ifdown/ifup cycle. Given enough idle time, the problem goes away. I suspect that neighbor flushing when in connected mode is not reliable/sufficient when opensm events come in. Again, I think this exists in the upstream kernel and I'll test more on that tomorrow.

Doug Ledford (22):
  IB/ipoib: Consolidate rtnl_lock tasks in workqueue
  IB/ipoib: Make the carrier_on_task race aware
  IB/ipoib: fix MCAST_FLAG_BUSY usage
  IB/ipoib: fix mcast_dev_flush/mcast_restart_task race
  IB/ipoib: change init sequence ordering
  IB/ipoib: Use dedicated workqueues per interface
  IB/ipoib: Make ipoib_mcast_stop_thread flush the workqueue
  IB/ipoib: No longer use flush as a parameter
  IB/ipoib: fix IPOIB_MCAST_RUN flag usage
  IB/ipoib: Add a helper to restart the multicast task
  IB/ipoib: make delayed tasks not hold up everything
  IB/ipoib: Handle -ENETRESET properly in our callback
  IB/ipoib: don't restart our thread on ENETRESET
  IB/ipoib: remove unneeded locks
  IB/ipoib: fix race between mcast_dev_flush and mcast_join
  IB/ipoib: fix ipoib_mcast_restart_task
  IB/ipoib: flush the ipoib_workqueue on unregister
  IB/ipoib: cleanup a couple debug messages
  IB/ipoib: make sure we reap all our ah on shutdown
  IB/ipoib: don't queue a work struct up twice
  IB/ipoib: deserialize multicast joins
  IB/ipoib: drop mcast_mutex usage

 drivers/infiniband/ulp/ipoib/ipoib.h           |  20 +-
 drivers/infiniband/ulp/ipoib/ipoib_cm.c        |  18 +-
 drivers/infiniband/ulp/ipoib/ipoib_ib.c        |  69 ++--
 drivers/infiniband/ulp/ipoib/ipoib_main.c      |  51 ++-
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 479 -
 drivers/infiniband/ulp/ipoib/ipoib_verbs.c     |  24 +-
 6 files changed, 356 insertions(+), 305 deletions(-)

-- 
2.1.0
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 21/22] IB/ipoib: deserialize multicast joins
Allow the ipoib layer to attempt to join all outstanding multicast groups at once. The ib_sa layer will serialize multiple attempts to join the same group, but will process attempts to join different groups in parallel. Take advantage of that.

To make this happen, change the mcast_join_thread to loop through all needed joins, sending a join request for each one that we still need to join. There are a few special cases we handle though:

1) Don't attempt to join anything but the broadcast group until the join of the broadcast group has succeeded.

2) No longer restart the join task at the end of completion handling, as the join task will fire off everything itself. The exceptions are when we are the broadcast group, or when we need to initiate a retry.

3) No longer use separate join/completion routines for regular and sendonly joins; pass them all through the same routine and just do the right thing based on the SENDONLY join flag.

4) Only attempt a SENDONLY join twice, then drop the packets and quit trying. We leave the mcast group in the list so that if we get a new packet, all we have to do is queue up the packet and restart the join task, and it will automatically try to join twice and then either send or flush the queue again.
Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 264 + 1 file changed, 88 insertions(+), 176 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 68b47bed33a..355c320dc4d 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -295,113 +295,6 @@ static int ipoib_mcast_join_finish(struct ipoib_mcast *mcast, return 0; } -static int -ipoib_mcast_sendonly_join_complete(int status, - struct ib_sa_multicast *multicast) -{ - struct ipoib_mcast *mcast = multicast-context; - struct net_device *dev = mcast-dev; - struct ipoib_dev_priv *priv = netdev_priv(dev); - - /* -* We have to take the mutex to force mcast_sendonly_join to -* return from ib_sa_multicast_join and set mcast-mc to a -* valid value. Otherwise we were racing with ourselves in -* that we might fail here, but get a valid return from -* ib_sa_multicast_join after we had cleared mcast-mc here, -* resulting in mis-matched joins and leaves and a deadlock -*/ - mutex_lock(mcast_mutex); - - /* We trap for port events ourselves. 
*/ - if (status == -ENETRESET) { - status = 0; - goto out; - } - - if (!status) - status = ipoib_mcast_join_finish(mcast, multicast-rec); - - if (status) { - if (mcast-logcount++ 20) - ipoib_dbg_mcast(netdev_priv(dev), sendonly multicast - join failed for %pI6, status %d\n, - mcast-mcmember.mgid.raw, status); - - /* Flush out any queued packets */ - netif_tx_lock_bh(dev); - while (!skb_queue_empty(mcast-pkt_queue)) { - ++dev-stats.tx_dropped; - dev_kfree_skb_any(skb_dequeue(mcast-pkt_queue)); - } - netif_tx_unlock_bh(dev); - __ipoib_mcast_continue_join_thread(priv, mcast, 1); - } else { - /* Join completed, so reset any backoff parameters */ - mcast-backoff = 1; - mcast-delay_until = jiffies; - __ipoib_mcast_continue_join_thread(priv, NULL, 0); - } -out: - clear_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags); - if (status) - mcast-mc = NULL; - complete(mcast-done); - mutex_unlock(mcast_mutex); - return status; -} - -static int ipoib_mcast_sendonly_join(struct ipoib_mcast *mcast) -{ - struct net_device *dev = mcast-dev; - struct ipoib_dev_priv *priv = netdev_priv(dev); - struct ib_sa_mcmember_rec rec = { -#if 0 /* Some SMs don't support send-only yet */ - .join_state = 4 -#else - .join_state = 1 -#endif - }; - int ret = 0; - - if (!test_bit(IPOIB_FLAG_OPER_UP, priv-flags)) { - ipoib_dbg_mcast(priv, device shutting down, no sendonly - multicast joins\n); - clear_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags); - complete(mcast-done); - return -ENODEV; - } - - rec.mgid = mcast-mcmember.mgid; - rec.port_gid = priv-local_gid; - rec.pkey = cpu_to_be16(priv-pkey); - - mutex_lock(mcast_mutex); - mcast-mc = ib_sa_join_multicast(ipoib_sa_client, priv-ca, -priv-port, rec, -
[PATCH 17/22] IB/ipoib: flush the ipoib_workqueue on unregister
When we unregister our event handler as a prelude to removing our device, we need to go ahead and flush the ipoib_workqueue to make sure there are no delayed flushes for the device we are getting ready to remove. Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 2 ++ 1 file changed, 2 insertions(+) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 6bad17d4d58..8810514cf40 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -1643,6 +1643,7 @@ sysfs_failed: register_failed: ib_unregister_event_handler(priv-event_handler); + flush_workqueue(ipoib_workqueue); /* Stop GC if started before flush */ set_bit(IPOIB_STOP_NEIGH_GC, priv-flags); cancel_delayed_work(priv-neigh_reap_task); @@ -1709,6 +1710,7 @@ static void ipoib_remove_one(struct ib_device *device) list_for_each_entry_safe(priv, tmp, dev_list, list) { ib_unregister_event_handler(priv-event_handler); + flush_workqueue(ipoib_workqueue); rtnl_lock(); dev_change_flags(priv-dev, priv-dev-flags ~IFF_UP); -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 14/22] IB/ipoib: remove unneeded locks
During the various work I've done on this, some extra locking has crept in when things weren't working right. This is one of those spots. Remove the unneeded spinlocks. The mutex is enough to protect against what we need to protect against.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 2 --
 1 file changed, 2 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index a03b6a04203..424c602d5a0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -497,12 +497,10 @@ static int ipoib_mcast_join_complete(int status,
 		__ipoib_mcast_continue_join_thread(priv, mcast, 1);
 	}
 out:
-	spin_lock_irq(&priv->lock);
 	clear_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags);
 	if (status)
 		mcast->mc = NULL;
 	complete(&mcast->done);
-	spin_unlock_irq(&priv->lock);
 	mutex_unlock(&mcast_mutex);
 
 	return status;
-- 
2.1.0
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 06/22] IB/ipoib: Use dedicated workqueues per interface
During my recent work on the rtnl lock deadlock in the IPoIB driver, I saw that even once I fixed the apparent races for a single device, as soon as that device had any children, new races popped up. It turns out that this is because no matter how well we protect against races on a single device, all devices share the same workqueue, and flush_workqueue() flushes *everything* from that workqueue. So we can have one device in the middle of a down, holding the rtnl lock, and another totally unrelated device needing to run mcast_restart_task, which wants the rtnl lock and will loop trying to take it unless it sees its own FLAG_ADMIN_UP flag go away. Because the unrelated interface will never see its own ADMIN_UP flag drop, the interface going down will deadlock trying to flush the queue.

There are several possible solutions to this problem:

- Make carrier_on_task and mcast_restart_task try to take the rtnl for some set period of time and bail if they fail. This runs the real risk of dropping work on the floor, which can end up being its own separate kind of deadlock.

- Set some global flag in the driver that says some device is in the middle of going down, letting all tasks know to bail. Again, this can drop work on the floor. I suppose if our own ADMIN_UP flag doesn't go away, then maybe after a few tries on the rtnl lock we can queue our own task back up as delayed work and return, avoiding dropping work on the floor that way. But I'm not 100% convinced that we won't cause other problems.

- Or the method this patch uses: when we bring an interface up, create a workqueue specifically for that interface, so that when we take it back down, we flush only those tasks associated with our interface. In addition, keep the global workqueue, but now limit it to flush tasks only. That way, the flush tasks can always flush the device-specific workqueues without running into deadlock issues.
Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib.h | 1 + drivers/infiniband/ulp/ipoib/ipoib_cm.c| 18 +- drivers/infiniband/ulp/ipoib/ipoib_ib.c| 6 +++--- drivers/infiniband/ulp/ipoib/ipoib_main.c | 19 --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 26 -- drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 22 +- 6 files changed, 58 insertions(+), 34 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index f4c1b20b23b..45fd10a72ec 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -323,6 +323,7 @@ struct ipoib_dev_priv { struct list_head multicast_list; struct rb_root multicast_tree; + struct workqueue_struct *wq; struct delayed_work mcast_task; struct work_struct carrier_on_task; struct work_struct flush_light; diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 933efcea0d0..56959adb6c7 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -474,7 +474,7 @@ static int ipoib_cm_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *even } spin_lock_irq(priv-lock); - queue_delayed_work(ipoib_workqueue, + queue_delayed_work(priv-wq, priv-cm.stale_task, IPOIB_CM_RX_DELAY); /* Add this entry to passive ids list head, but do not re-add it * if IB_EVENT_QP_LAST_WQE_REACHED has moved it to flush list. 
*/ @@ -576,7 +576,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) spin_lock_irqsave(priv-lock, flags); list_splice_init(priv-cm.rx_drain_list, priv-cm.rx_reap_list); ipoib_cm_start_rx_drain(priv); - queue_work(ipoib_workqueue, priv-cm.rx_reap_task); + queue_work(priv-wq, priv-cm.rx_reap_task); spin_unlock_irqrestore(priv-lock, flags); } else ipoib_warn(priv, cm recv completion event with wrid %d ( %d)\n, @@ -603,7 +603,7 @@ void ipoib_cm_handle_rx_wc(struct net_device *dev, struct ib_wc *wc) spin_lock_irqsave(priv-lock, flags); list_move(p-list, priv-cm.rx_reap_list); spin_unlock_irqrestore(priv-lock, flags); - queue_work(ipoib_workqueue, priv-cm.rx_reap_task); + queue_work(priv-wq, priv-cm.rx_reap_task); } return; } @@ -827,7 +827,7 @@ void ipoib_cm_handle_tx_wc(struct net_device *dev, struct ib_wc *wc) if (test_and_clear_bit(IPOIB_FLAG_INITIALIZED, tx-flags)) {
[PATCH 16/22] IB/ipoib: fix ipoib_mcast_restart_task
There are two intertwined problems here. First, if we are downing the interface, we will get called as part of the process. When that happens, we don't really want to do anything here, as the next thing to happen is that ipoib_mcast_dev_flush will get called. We just want to let it do all the work, since we don't need to do any address matches and such. So we should test whether or not the interface is up at the beginning of this function and return if it isn't.

Second, the attempt to grab the rtnl_lock to prevent racing against ipoib_mcast_dev_flush was mistaken. The removed mcast groups are on a stack-local list head, so once we've removed them from the mcast rbtree, ipoib_mcast_dev_flush can no longer find them and race with us on removing them. However, because we attempt to take the lock and then check the status of the interface, if our interface has gone down we simply return. That is absolutely the wrong thing to do, as it means we just leaked all of the mcast entries on our remove list. The fix is to check for the interface being up at the very beginning of the function, to always remove our entries on the remove list no matter what (they would be leaked otherwise), and to skip taking the rtnl_lock since it isn't needed anyway.
Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 19 +++ 1 file changed, 3 insertions(+), 16 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 972fa15a225..10e05437c83 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -871,6 +871,9 @@ void ipoib_mcast_restart_task(struct work_struct *work) unsigned long flags; struct ib_sa_mcmember_rec rec; + if (!test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) + return; + ipoib_dbg_mcast(priv, restarting multicast task\n); local_irq_save(flags); @@ -965,21 +968,6 @@ void ipoib_mcast_restart_task(struct work_struct *work) if (test_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags)) wait_for_completion(mcast-done); - /* -* We have to cancel outside of the spinlock, but we have to -* take the rtnl lock or else we race with the removal of -* entries from the remove list in mcast_dev_flush as part -* of ipoib_stop(). We detect the drop of the ADMIN_UP flag -* to signal that we have hit this particular race, and we -* return since we know we don't need to do anything else -* anyway. -*/ - while (!rtnl_trylock()) { - if (!test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) - return; - else - msleep(20); - } list_for_each_entry_safe(mcast, tmcast, remove_list, list) { ipoib_mcast_leave(mcast-dev, mcast); ipoib_mcast_free(mcast); @@ -988,7 +976,6 @@ void ipoib_mcast_restart_task(struct work_struct *work) * Restart our join task thread if needed */ __ipoib_mcast_continue_join_thread(priv, NULL, 0); - rtnl_unlock(); } #ifdef CONFIG_INFINIBAND_IPOIB_DEBUG -- 2.1.0 -- To unsubscribe from this list: send the line unsubscribe linux-rdma in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 20/22] IB/ipoib: don't queue a work struct up twice
In b2d408f78b3 (IB/ipoib: make delayed tasks not hold up everything) I added the ability to have some joins processed around delayed joins. Because the join task processes one join at a time, we needed to queue the task to run again immediately and then again at the appropriate delay time. I didn't think about the fact that our work struct can only be on the workqueue once at a given time and instead tried to queue the same struct up twice. Instead, kick the queue again immediately on a delayed restart attempt, when the task sees that it has nothing to do but delayed work, it will pick the shortest work item and queue a delay equal to its expiration. Finally, in case we have delayed work, in the send_mcast routine we need to cancel any delayed work before kicking the thread immediately. Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 49 -- 1 file changed, 31 insertions(+), 18 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index ad2bcd105e4..68b47bed33a 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -71,29 +71,28 @@ static void __ipoib_mcast_continue_join_thread(struct ipoib_dev_priv *priv, int delay) { if (test_bit(IPOIB_MCAST_RUN, priv-flags)) { - /* -* Mark this mcast for its delay and set a timer to kick the -* thread when the delay completes -*/ if (mcast delay) { mcast-backoff *= 2; if (mcast-backoff IPOIB_MAX_BACKOFF_SECONDS) mcast-backoff = IPOIB_MAX_BACKOFF_SECONDS; mcast-delay_until = jiffies + (mcast-backoff * HZ); - queue_delayed_work(priv-wq, priv-mcast_task, - mcast-backoff * HZ); + /* +* Mark this mcast for its delay, but restart the +* task immediately. 
It will make sure to clear +* out all entries without delays, and then +* schedule itself for run again when the delay +* expires +*/ + queue_delayed_work(priv-wq, priv-mcast_task, 0); } else if (delay) { - /* Special case of retrying after a failure to + /* +* Special case of retrying after a failure to * allocate the broadcast multicast group, wait * 1 second and try again */ queue_delayed_work(priv-wq, priv-mcast_task, HZ); - } - /* -* But also rekick the thread immediately for any other -* tasks in queue behind this one -*/ - queue_delayed_work(priv-wq, priv-mcast_task, delay); + } else + queue_delayed_work(priv-wq, priv-mcast_task, 0); } } @@ -568,6 +567,7 @@ void ipoib_mcast_join_task(struct work_struct *work) struct ib_port_attr port_attr; struct ipoib_mcast *mcast = NULL; int create = 1; + unsigned long delay_until = 0; if (!test_bit(IPOIB_MCAST_RUN, priv-flags)) return; @@ -636,11 +636,14 @@ void ipoib_mcast_join_task(struct work_struct *work) list_for_each_entry(mcast, priv-multicast_list, list) { if (IS_ERR_OR_NULL(mcast-mc) !test_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags) - !test_bit(IPOIB_MCAST_FLAG_ATTACHED, mcast-flags) - (mcast-backoff == 1 || -time_after_eq(jiffies, mcast-delay_until))) { - /* Found the next unjoined group */ - break; + !test_bit(IPOIB_MCAST_FLAG_ATTACHED, mcast-flags)) { + if (mcast-backoff == 1 || + time_after_eq(jiffies, mcast-delay_until)) + /* Found the next unjoined group */ + break; + else if (!delay_until || +time_before(mcast-delay_until, delay_until)) + delay_until = mcast-delay_until; } } @@ -653,6 +656,9 @@ void ipoib_mcast_join_task(struct work_struct *work) mcast = NULL; ipoib_dbg_mcast(priv, successfully joined all multicast groups\n); + if (delay_until) + queue_delayed_work(priv-wq, priv-mcast_task, +
[PATCH 19/22] IB/ipoib: make sure we reap all our ah on shutdown
The introduction of garbage collection for neighbors in the ipoib stack caused a leak of resources. In particular, an ah can have a refcount from a mcast struct, from a neigh struct, and from a path struct. When we put an ah ref, the ah gets moved to a dead list to be reaped later if the refcount goes to 0. During ipoib_ib_dev_stop we attempt to reap these ah entries, but since that function does not force the neighbor garbage collection to run, it can leak any entries held by a neighbor. And even though the code attempts to warn us that we will leak entries, that warning is broken: the entries we are leaking will not yet be on the dead_ahs list because they still have valid references held elsewhere. It is during shutdown, when we finally kill the neighbor garbage collection, that our final ahs get put on the dead_ahs list, but after the point at which we run the neighbor garbage collection, no attempt is made to reap the newly added ahs, and because ipoib_ib_dev_stop killed the ah_reap task, no timed attempt will be made to clear them either. Instead, create ipoib_flush_ah() and ipoib_stop_ah() routines to use at appropriate times to flush out all remaining ah entries before we shut the device down.
This is done to prevent resource leaks on shutdown, which manifest with this message on ib_ipoib module remove: ibdev: ib_dealloc_pd failed Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_ib.c| 44 ++ drivers/infiniband/ulp/ipoib/ipoib_verbs.c | 4 ++- 2 files changed, 30 insertions(+), 18 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index fe65abb5150..e144d07d53c 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -659,6 +659,23 @@ void ipoib_reap_ah(struct work_struct *work) round_jiffies_relative(HZ)); } +static void ipoib_flush_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + cancel_delayed_work(priv-ah_reap_task); + flush_workqueue(priv-wq); + ipoib_reap_ah(priv-ah_reap_task.work); +} + +static void ipoib_stop_ah(struct net_device *dev) +{ + struct ipoib_dev_priv *priv = netdev_priv(dev); + + set_bit(IPOIB_STOP_REAPER, priv-flags); + ipoib_flush_ah(dev); +} + static void ipoib_ib_tx_timer_func(unsigned long ctx) { drain_tx_cq((struct net_device *)ctx); @@ -877,23 +894,7 @@ timeout: if (ib_modify_qp(priv-qp, qp_attr, IB_QP_STATE)) ipoib_warn(priv, Failed to modify QP to RESET state\n); - /* Wait for all AHs to be reaped */ - set_bit(IPOIB_STOP_REAPER, priv-flags); - cancel_delayed_work(priv-ah_reap_task); - flush_workqueue(priv-wq); - - begin = jiffies; - - while (!list_empty(priv-dead_ahs)) { - __ipoib_reap_ah(dev); - - if (time_after(jiffies, begin + HZ)) { - ipoib_warn(priv, timing out; will leak address handles\n); - break; - } - - msleep(1); - } + ipoib_flush_ah(dev); ib_req_notify_cq(priv-recv_cq, IB_CQ_NEXT_COMP); @@ -1036,6 +1037,7 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, if (level == IPOIB_FLUSH_LIGHT) { ipoib_mark_paths_invalid(dev); ipoib_mcast_dev_flush(dev); + ipoib_flush_ah(dev); } if (level = IPOIB_FLUSH_NORMAL) @@ -1099,6 +1101,14 @@ void 
ipoib_ib_dev_cleanup(struct net_device *dev) ipoib_mcast_stop_thread(dev); ipoib_mcast_dev_flush(dev); + /* +* All of our ah references aren't free until after +* ipoib_mcast_dev_flush(), ipoib_flush_paths, and +* the neighbor garbage collection is stopped and reaped. +* That should all be done now, so make a final ah flush. +*/ + ipoib_stop_ah(dev); + ipoib_transport_dev_cleanup(dev); } diff --git a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c index b72a753eb41..66df78382a4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_verbs.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_verbs.c @@ -263,6 +263,9 @@ void ipoib_transport_dev_cleanup(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); + if (priv-wq) + flush_workqueue(priv-wq); + if (priv-qp) { if (ib_destroy_qp(priv-qp)) ipoib_warn(priv, ib_qp_destroy failed\n); @@ -286,7 +289,6 @@ void ipoib_transport_dev_cleanup(struct net_device *dev) ipoib_warn(priv, ib_dealloc_pd failed\n); if (priv-wq) { - flush_workqueue(priv-wq); destroy_workqueue(priv-wq); priv-wq = NULL; } -- 2.1.0 -- To
[PATCH 22/22] IB/ipoib: drop mcast_mutex usage
We needed the mcast_mutex when we had to protect two different workqueue threads against stomping on each others work. By no longer using mcast-mc to directly store the return value of ib_sa_join_multicast, we can switch all of the mcast flag/completion locking to being only the priv-lock spinlock. Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 55 +- 1 file changed, 19 insertions(+), 36 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c index 355c320dc4d..e1fab519f96 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c @@ -55,8 +55,6 @@ MODULE_PARM_DESC(mcast_debug_level, Enable multicast debug tracing if 0); #endif -static DEFINE_MUTEX(mcast_mutex); - struct ipoib_mcast_iter { struct net_device *dev; union ib_gid mgid; @@ -340,16 +338,6 @@ static int ipoib_mcast_join_complete(int status, sendonly : , mcast-mcmember.mgid.raw, status); - /* -* We have to take the mutex to force mcast_join to -* return from ib_sa_multicast_join and set mcast-mc to a -* valid value. Otherwise we were racing with ourselves in -* that we might fail here, but get a valid return from -* ib_sa_multicast_join after we had cleared mcast-mc here, -* resulting in mis-matched joins and leaves and a deadlock -*/ - mutex_lock(mcast_mutex); - /* We trap for port events ourselves. 
*/ if (status == -ENETRESET) { status = 0; @@ -410,11 +398,14 @@ static int ipoib_mcast_join_complete(int status, __ipoib_mcast_continue_join_thread(priv, mcast, 1); } out: + spin_lock_irq(priv-lock); clear_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags); if (status) mcast-mc = NULL; + else + mcast-mc = multicast; complete(mcast-done); - mutex_unlock(mcast_mutex); + spin_unlock_irq(priv-lock); return status; } @@ -423,6 +414,7 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, int create) { struct ipoib_dev_priv *priv = netdev_priv(dev); + struct ib_sa_multicast *multicast; struct ib_sa_mcmember_rec rec = { .join_state = 1 }; @@ -464,19 +456,20 @@ static void ipoib_mcast_join(struct net_device *dev, struct ipoib_mcast *mcast, rec.hop_limit = priv-broadcast-mcmember.hop_limit; } - mutex_lock(mcast_mutex); - mcast-mc = ib_sa_join_multicast(ipoib_sa_client, priv-ca, priv-port, + multicast = ib_sa_join_multicast(ipoib_sa_client, priv-ca, priv-port, rec, comp_mask, GFP_KERNEL, ipoib_mcast_join_complete, mcast); - if (IS_ERR(mcast-mc)) { - clear_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags); - complete(mcast-done); - ret = PTR_ERR(mcast-mc); + if (IS_ERR(multicast)) { + ret = PTR_ERR(multicast); + mcast-mc = NULL; ipoib_warn(priv, ib_sa_join_multicast failed, status %d\n, ret); + spin_lock_irq(priv-lock); + clear_bit(IPOIB_MCAST_FLAG_BUSY, mcast-flags); /* Requeue this join task with a backoff delay */ __ipoib_mcast_continue_join_thread(priv, mcast, 1); + complete(mcast-done); + spin_unlock_irq(priv-lock); } - mutex_unlock(mcast_mutex); } void ipoib_mcast_join_task(struct work_struct *work) @@ -505,15 +498,6 @@ void ipoib_mcast_join_task(struct work_struct *work) else memcpy(priv-dev-dev_addr + 4, priv-local_gid.raw, sizeof (union ib_gid)); - /* -* We have to hold the mutex to keep from racing with the join -* completion threads on setting flags on mcasts, and we have -* to hold the priv-lock because dev_flush will remove entries -* out from underneath 
us, so at a minimum we need the lock -* through the time that we do the for_each loop of the mcast -* list or else dev_flush can make us oops. -*/ - mutex_lock(mcast_mutex); spin_lock_irq(priv-lock); if (!test_bit(IPOIB_FLAG_ADMIN_UP, priv-flags)) goto out; @@ -573,9 +557,7 @@ void ipoib_mcast_join_task(struct work_struct *work) else create = 1; spin_unlock_irq(priv-lock); - mutex_unlock(mcast_mutex); ipoib_mcast_join(dev, mcast, create); -
[PATCH 08/22] IB/ipoib: No longer use flush as a parameter
Various places in the IPoIB code had a deadlock related to flushing the ipoib workqueue. Now that we have per device workqueues and a specific flush workqueue, there is no longer a deadlock issue with flushing the device specific workqueues and we can do so unilaterally. Signed-off-by: Doug Ledford dledf...@redhat.com --- drivers/infiniband/ulp/ipoib/ipoib.h | 6 +++--- drivers/infiniband/ulp/ipoib/ipoib_ib.c | 19 +-- drivers/infiniband/ulp/ipoib/ipoib_main.c | 8 3 files changed, 16 insertions(+), 17 deletions(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib.h b/drivers/infiniband/ulp/ipoib/ipoib.h index 28dc927c0e8..8ba80a6d3a4 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib.h +++ b/drivers/infiniband/ulp/ipoib/ipoib.h @@ -484,10 +484,10 @@ void ipoib_ib_dev_flush_heavy(struct work_struct *work); void ipoib_pkey_event(struct work_struct *work); void ipoib_ib_dev_cleanup(struct net_device *dev); -int ipoib_ib_dev_open(struct net_device *dev, int flush); +int ipoib_ib_dev_open(struct net_device *dev); int ipoib_ib_dev_up(struct net_device *dev); -int ipoib_ib_dev_down(struct net_device *dev, int flush); -int ipoib_ib_dev_stop(struct net_device *dev, int flush); +int ipoib_ib_dev_down(struct net_device *dev); +int ipoib_ib_dev_stop(struct net_device *dev); void ipoib_pkey_dev_check_presence(struct net_device *dev); int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_ib.c b/drivers/infiniband/ulp/ipoib/ipoib_ib.c index 66096787119..fe65abb5150 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_ib.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_ib.c @@ -664,7 +664,7 @@ static void ipoib_ib_tx_timer_func(unsigned long ctx) drain_tx_cq((struct net_device *)ctx); } -int ipoib_ib_dev_open(struct net_device *dev, int flush) +int ipoib_ib_dev_open(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); int ret; @@ -706,7 +706,7 @@ int ipoib_ib_dev_open(struct net_device *dev, int flush) 
dev_stop: if (!test_and_set_bit(IPOIB_FLAG_INITIALIZED, priv-flags)) napi_enable(priv-napi); - ipoib_ib_dev_stop(dev, flush); + ipoib_ib_dev_stop(dev); return -1; } @@ -738,7 +738,7 @@ int ipoib_ib_dev_up(struct net_device *dev) return ipoib_mcast_start_thread(dev); } -int ipoib_ib_dev_down(struct net_device *dev, int flush) +int ipoib_ib_dev_down(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); @@ -807,7 +807,7 @@ void ipoib_drain_cq(struct net_device *dev) local_bh_enable(); } -int ipoib_ib_dev_stop(struct net_device *dev, int flush) +int ipoib_ib_dev_stop(struct net_device *dev) { struct ipoib_dev_priv *priv = netdev_priv(dev); struct ib_qp_attr qp_attr; @@ -880,8 +880,7 @@ timeout: /* Wait for all AHs to be reaped */ set_bit(IPOIB_STOP_REAPER, priv-flags); cancel_delayed_work(priv-ah_reap_task); - if (flush) - flush_workqueue(priv-wq); + flush_workqueue(priv-wq); begin = jiffies; @@ -918,7 +917,7 @@ int ipoib_ib_dev_init(struct net_device *dev, struct ib_device *ca, int port) (unsigned long) dev); if (dev-flags IFF_UP) { - if (ipoib_ib_dev_open(dev, 1)) { + if (ipoib_ib_dev_open(dev)) { ipoib_transport_dev_cleanup(dev); return -ENODEV; } @@ -1040,12 +1039,12 @@ static void __ipoib_ib_dev_flush(struct ipoib_dev_priv *priv, } if (level = IPOIB_FLUSH_NORMAL) - ipoib_ib_dev_down(dev, 0); + ipoib_ib_dev_down(dev); if (level == IPOIB_FLUSH_HEAVY) { if (test_bit(IPOIB_FLAG_INITIALIZED, priv-flags)) - ipoib_ib_dev_stop(dev, 0); - if (ipoib_ib_dev_open(dev, 0) != 0) + ipoib_ib_dev_stop(dev); + if (ipoib_ib_dev_open(dev) != 0) return; if (netif_queue_stopped(dev)) netif_start_queue(dev); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 42e5c278f48..6bad17d4d58 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -108,7 +108,7 @@ int ipoib_open(struct net_device *dev) set_bit(IPOIB_FLAG_ADMIN_UP, priv-flags); - if (ipoib_ib_dev_open(dev, 
1)) { + if (ipoib_ib_dev_open(dev)) { if (!test_bit(IPOIB_PKEY_ASSIGNED, priv-flags)) return 0; goto err_disable; @@ -139,7 +139,7 @@ int ipoib_open(struct net_device *dev) return 0; err_stop: - ipoib_ib_dev_stop(dev, 1); + ipoib_ib_dev_stop(dev); err_disable: clear_bit(IPOIB_FLAG_ADMIN_UP, priv-flags); @@ -157,8
[PATCH 05/22] IB/ipoib: change init sequence ordering
In preparation for using per device work queues, we need to move the start of the neighbor thread task to after ipoib_ib_dev_init and move the destruction of the neighbor task to before ipoib_ib_dev_cleanup. Otherwise we will end up freeing our workqueue with work possibly still on it.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_main.c | 24 +++++++++++++++++-------
 1 file changed, 17 insertions(+), 7 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c
index 58b5aa3b6f2..2cf81ef5141 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_main.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c
@@ -1262,15 +1262,13 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 {
 	struct ipoib_dev_priv *priv = netdev_priv(dev);
 
-	if (ipoib_neigh_hash_init(priv) < 0)
-		goto out;
 	/* Allocate RX/TX "rings" to hold queued skbs */
 	priv->rx_ring = kzalloc(ipoib_recvq_size * sizeof *priv->rx_ring,
 				GFP_KERNEL);
 	if (!priv->rx_ring) {
 		printk(KERN_WARNING "%s: failed to allocate RX ring (%d entries)\n",
 		       ca->name, ipoib_recvq_size);
-		goto out_neigh_hash_cleanup;
+		goto out;
 	}
 
 	priv->tx_ring = vzalloc(ipoib_sendq_size * sizeof *priv->tx_ring);
@@ -1285,16 +1283,24 @@ int ipoib_dev_init(struct net_device *dev, struct ib_device *ca, int port)
 	if (ipoib_ib_dev_init(dev, ca, port))
 		goto out_tx_ring_cleanup;
 
+	/*
+	 * Must be after ipoib_ib_dev_init so we can allocate a per
+	 * device wq there and use it here
+	 */
+	if (ipoib_neigh_hash_init(priv) < 0)
+		goto out_dev_uninit;
+
 	return 0;
 
+out_dev_uninit:
+	ipoib_ib_dev_cleanup();
+
 out_tx_ring_cleanup:
 	vfree(priv->tx_ring);
 
 out_rx_ring_cleanup:
 	kfree(priv->rx_ring);
 
-out_neigh_hash_cleanup:
-	ipoib_neigh_hash_uninit(dev);
 out:
 	return -ENOMEM;
 }
@@ -1317,6 +1323,12 @@ void ipoib_dev_cleanup(struct net_device *dev)
 	}
 	unregister_netdevice_many(&head);
 
+	/*
+	 * Must be before ipoib_ib_dev_cleanup or we delete an in use
+	 * work queue
+	 */
+	ipoib_neigh_hash_uninit(dev);
+
 	ipoib_ib_dev_cleanup(dev);
 
 	kfree(priv->rx_ring);
@@ -1324,8 +1336,6 @@ void ipoib_dev_cleanup(struct net_device *dev)
 	priv->rx_ring = NULL;
 	priv->tx_ring = NULL;
-
-	ipoib_neigh_hash_uninit(dev);
 }
 
 static const struct header_ops ipoib_header_ops = {
-- 
2.1.0
[PATCH 18/22] IB/ipoib: cleanup a couple debug messages
The one I removed in particular can prevent progress in the driver from happening because it isn't rate limited. In my testing, I have seen the flood of messages from that debug message delay multicast joins for in excess of 30 seconds. The added debug message was just so I would know when we were leaving a multicast group that we didn't fully join.

Signed-off-by: Doug Ledford dledf...@redhat.com
---
 drivers/infiniband/ulp/ipoib/ipoib_multicast.c | 11 +++++------
 1 file changed, 5 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
index 10e05437c83..ad2bcd105e4 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_multicast.c
@@ -711,6 +711,8 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 
 	if (!IS_ERR_OR_NULL(mcast->mc))
 		ib_sa_free_multicast(mcast->mc);
+	else
+		ipoib_dbg(priv, "ipoib_mcast_leave with mcast->mc invalid\n");
 
 	if (test_and_clear_bit(IPOIB_MCAST_FLAG_ATTACHED, &mcast->flags)) {
 		ipoib_dbg_mcast(priv, "leaving MGID %pI6\n",
@@ -721,7 +723,9 @@ static int ipoib_mcast_leave(struct net_device *dev, struct ipoib_mcast *mcast)
 				be16_to_cpu(mcast->mcmember.mlid));
 		if (ret)
 			ipoib_warn(priv, "ib_detach_mcast failed (result = %d)\n", ret);
-	}
+	} else if (!test_bit(IPOIB_MCAST_FLAG_SENDONLY, &mcast->flags))
+		ipoib_dbg(priv, "leaving with no mcmember but not a "
+			  "SENDONLY join\n");
 
 	return 0;
 }
@@ -772,11 +776,6 @@ void ipoib_mcast_send(struct net_device *dev, u8 *daddr, struct sk_buff *skb)
 			++dev->stats.tx_dropped;
 			dev_kfree_skb_any(skb);
 		}
-
-		if (test_bit(IPOIB_MCAST_FLAG_BUSY, &mcast->flags))
-			ipoib_dbg_mcast(priv, "no address vector, "
-					"but multicast join already started\n");
-
 		/*
 		 * If lookup completes between here and out:, don't
 		 * want to send packet twice.
-- 
2.1.0
Re: [PATCH 00/22] IB/ipoib: Fixups for multicast issues
On 2/11/2015 8:43 PM, Doug Ledford wrote:
> This patchset revives the 8 patches that were reverted from 3.19, the 11 patches that fixed the problems with the first 8, the single patch that was related to reaping of ah's and failure to dealloc resources on shutdown, and then adds two new patches that would have been enhancements and not bugfixes and hence weren't appropriate to post in the 3.19 tussle.

Doug, as asked a few times... please provide a public git tree where this series can be pulled from.

So no squashes for patches which actually represent a fix to a downstream patch within the series? Any specific reason for that?
Re: mlx4: IB_EVENT_PATH_MIG not generated on path migration
Hello Jason,

thanks for your answer.

On Tue, Feb 10, 2015 at 11:00:39AM -0700, Jason Gunthorpe wrote:
> On Tue, Feb 10, 2015 at 04:56:43PM +0100, Fabian Holler wrote:
> > Does anybody have an idea what could be wrong? Are the PATH_MIG* notifications with the mlx4 driver working for anybody?
>
> IIRC rdmacm does not forward these events. There is a lot missing to support APM with RDMA CM, it doesn't do any of the CM state machine operations to coordinate loading alternate paths, for instance. Alternate paths can be set directly at the QP.

As soon as the migration happens I expect an IB_EVENT_PATH_MIG* event to be generated on the QP event handler, but I don't receive any QP events.

> qp_event_handler is a RDMA CM event handler, it handles RDMA_CM_EVENT_* and there is no RDMA_CM_EVENT_PATH_MIGRATED. You'd need to listen to the verbs level events directly, I don't recall if this is possible with RDMA CM or not, sorry.

The RDMA CM event handler is registered by the rdma_create_id() call. The event handler specified in the struct ib_qp_init_attr * parameter of rdma_create_qp() is registered as the QP event handler: rdma_create_qp() calls ib_create_qp() and sets qp->event_handler to ib_qp_init_attr->event_handler.

Fabian
[PATCH opensm] osm_ucast_mgr.c: Support diverse paths for LMC 0 when scatter_ports is enabled
From: Vladimir Koushnir vladim...@mellanox.com

This patch adds the ability to calculate diverse paths for all LIDs (LMC > 0) derived from the same base LID when scatter_ports is enabled (different from 0). We apply randomization during selection of the out-ports only when routing is calculated for the base LID. For LIDs different from the base LID, we don't randomize the out-port but apply the procedure for calculating a diverse path as if scatter_ports were 0. The patch is relevant for minhop/updn routing.

Signed-off-by: Vladimir Koushnir vladim...@mellanox.com
Signed-off-by: Shlomi Nimrodi shlo...@mellanox.com
---
 opensm/osm_ucast_mgr.c | 2 +-
 1 files changed, 1 insertions(+), 1 deletions(-)

diff --git a/opensm/osm_ucast_mgr.c b/opensm/osm_ucast_mgr.c
index 0b24e73..219e080 100644
--- a/opensm/osm_ucast_mgr.c
+++ b/opensm/osm_ucast_mgr.c
@@ -260,7 +260,7 @@ static void ucast_mgr_process_port(IN osm_ucast_mgr_t * p_mgr,
 					p_mgr->p_subn->opt.lmc,
 					p_mgr->is_dor,
 					p_mgr->p_subn->opt.port_shifting,
-					p_port->use_scatter,
+					!lid_offset && p_port->use_scatter,
 					OSM_LFT);
 	if (port == OSM_NO_PATH) {
-- 
1.7.8.2
Re: [PATCH 00/22] IB/ipoib: Fixups for multicast issues
On 2/11/2015 11:55 PM, Doug Ledford wrote:
> I tried, but the most appropriate squashes were things like patch 18 and 19 squashed into patch 3, and some other similar jumping around of patches. The number of conflicts that created in both the original squash and the follow-on patches was perverse. So, since the original 20 patches as I submitted previously are what has gone through our QA, I left them as they were. If I fixed them up and squashed them, and if some overlooked fixup failure snuck in there, it would invalidate all of that testing.

I see. Is this sequence bisectable? Namely, does the driver build and function after applying each patch in turn? Note this is a hard requirement when modifying existing upstream drivers.

Or.
Re: [PATCH 00/22] IB/ipoib: Fixups for multicast issues
On Wed, 2015-02-11 at 23:46 -0500, Or Gerlitz wrote:
> On 2/11/2015 8:43 PM, Doug Ledford wrote:
> > This patchset revives the 8 patches that were reverted from 3.19, the 11 patches that fixed the problems with the first 8, the single patch that was related to reaping of ah's and failure to dealloc resources on shutdown, and then adds two new patches that would have been enhancements and not bugfixes and hence weren't appropriate to post in the 3.19 tussle.
>
> Doug, as asked a few times... please provide a public git tree where this series can be pulled from.

git://github.com/dledford/linux.git for-3.20

> So no squashes for patches which actually represent a fix to a downstream patch within the series? Any specific reason for that?

I tried, but the most appropriate squashes were things like patch 18 and 19 squashed into patch 3, and some other similar jumping around of patches. The number of conflicts that created in both the original squash and the follow-on patches was perverse. So, since the original 20 patches as I submitted previously are what has gone through our QA, I left them as they were. If I fixed them up and squashed them, and if some overlooked fixup failure snuck in there, it would invalidate all of that testing.

-- 
Doug Ledford dledf...@redhat.com
GPG KeyID: 0E572FDD