Re: OVS VXLAN decap rule has full match on TTL for the outer headers?
On 14/11/2015 08:45, Joe Stringer wrote: > On 13 November 2015 at 06:46, Or Gerlitz wrote: >> > On Fri, Nov 13, 2015 at 10:14 AM, Joe Stringer >> > wrote: >> > >>> >> I don't follow the logic. You observed one flow which matched on >>> >> TTL=64, therefore all vxlan packets terminated at OVS have TTL=64? >> > >>> >> If OVS received packets with different TTLs, they would miss and >>> >> ovs-vswitchd would generate flows to match that traffic too. >> > >> > ok, that makes things a bit better, but (see next) >> > >>> >> If that becomes an issue, presumably the wildcard generation can be >>> >> improved. >> > >> > is there a deep reason for vlxan "learned flows" to actually match w >> > or w.o wild cards on TTLs?? for non-tunneled flow I don't see this >> > happening. > No deep reason I'm aware of. Hi, We looked into the OVS kernel module, and apparently there's a check that rejects new tunnel flows if they don't have the TTL mask set [1]. I was able to trace it to this commit [2] on the OVS tree, but I don't quite understand why the check was added. There was some discussion about the patch on the mailing list [3] that hints this was about catching zero TTL, but it has too little context for me to understand. I'm adding the author and reviewer of the patch, perhaps they can help explain this requirement. Regards, Haggai [1] http://lxr.free-electrons.com/source/net/openvswitch/flow_netlink.c?v=4.3#L660 [2] datapath: More flexible kernel/userspace tunneling attribute. 9b405f1aa8d175dc63ad3ffe5d0fe05d5ee09162 [3] http://openvswitch.org/pipermail/dev/2013-January/024573.html -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 08/12] IB/cma: Add net_dev and private data checks to RDMA CM
On 13/07/2015 21:14, Jason Gunthorpe wrote: > On Mon, Jun 22, 2015 at 03:42:37PM +0300, Haggai Eran wrote: >> +switch (ib_event->event) { >> +case IB_CM_REQ_RECEIVED: >> +req->device = req_param->listen_id->device; >> +req->port = req_param->port; >> +req->local_gid = &req_param->primary_path->sgid; >> +req->service_id = req_param->primary_path->service_id; >> +req->pkey = be16_to_cpu(req_param->primary_path->pkey); > > I feel pretty strongly that we should be using the pkey from the work > completion, not the pkey in the message. > > The reason, if someone is using pkey like vlan, and expecting a > container to never receive packets outside the assigned pkey, then we > need to check each and every packet for the correct pkey before > associating it with that container. The way I see it is that you have one RDMA CM agent in the system, and the header parameters address it. This agent allows addressing several namespaces, and they are de-muxed according to the parameters in the payload. So a container never receives a packet outside of its assigned pkeys, but the pkey you look at (as well as the GID, and possibly the IP address) all come from the payload. > When doing the namespace patches you should probably also look at > other CM GMPs than just the REQ and how the paths are setup and > consider what to do with the pkey. I'd probably suggest that the pkey > should be forced throughout the entire process to ensure it always > matches the ip device - at least for containers that is the right > thing.. I probably wouldn't turn it on for the root namespace though.. Once a connection has been established, following GMPs use a unique ID to address this connection, so no more de-muxing is needed. What is really missing here I guess is a mechanism that would enforce containers to only use certain pkeys - perhaps with something like an RDMA cgroup. It could force containers to only use approved pkeys not only with RDMA CM, but through uverbs, and other user-space interfaces. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 01/12] IB/core: pass client data to remove() callbacks
On 09/07/2015 00:34, Jason Gunthorpe wrote: > On Wed, Jul 08, 2015 at 02:29:10PM -0600, Jason Gunthorpe wrote: >> On Mon, Jun 22, 2015 at 03:42:30PM +0300, Haggai Eran wrote: >>> An ib_client callback that is called with the lists_rwsem locked only for >>> read is protected from changes to the IB client lists, but not from >>> ib_unregister_device() freeing its client data. This is because >>> ib_unregister_device() will remove the device from the device list with >>> lists_rwsem locked for write, but perform the rest of the cleanup, >>> including the call to remove() without that lock. >> >> I was going to look at this, but, uh.. it seems mangled, doesn't >> apply, doesn't seem fixable from here. > > Okay, I see, it sits on top of the patch from Matan's last > posting.. My bad. No problem. > Hum... I have to say I don't really like this, changing the ordering > of client_data = NULL with respect to client->remove doesn't seem like > a great idea - and the rds changes look scary to me, at least I > couldn't confidently say they were OK.. > > And that isn't really the issue - this has nothing to do with > client_data, it is all about not having a callback running when doing > remove. > > It looks like the way out of this is to have ib_get_net_dev_by_params > iterate over the client_data_list and use a dedicated flag in that > struct to indicate that client&device combination is > remove-in-progress. > > This would be a bit more efficient as well, and I would suggest > passing the context in as an arg to the callback. > > client_data_list would change a bit to become write locked first by > write(lists_rwsem), and then second by the spin lock, so holding > read(lists_rwsem) while iterating is enough locking, and you'd hold > lists_rwsem while kfreeing. So, I don't want to keep lists_rwsem for write while calling add() and remove(). This would cause the deadlock that required the lists_rwsem patch in the first place. I guess I can drop lists_rwsem before the add/remove call and take it afterwards. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 11/12] IB/cma: Share ib_cm_ids between rdma_cm_ids
On 13/07/2015 21:06, Jason Gunthorpe wrote: > On Mon, Jun 22, 2015 at 03:42:40PM +0300, Haggai Eran wrote: >> Use ib_cm_id_create_and_listen to create listening IB CM IDs or share > ^^^ > Is that the wrong name? ib_cm_insert_listen perhaps? Yes, I missed that. Thanks. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 05/12] IB/cm: Share listening CM IDs
On 13/07/2015 20:48, Jason Gunthorpe wrote: > On Mon, Jun 22, 2015 at 03:42:34PM +0300, Haggai Eran wrote: >> spin_lock_irq(&cm.lock); >> +if (--cm_id_priv->listen_sharecount > 0) { >> +/* The id is still shared. */ >> +atomic_dec(&cm_id_priv->refcount); > > Nit: This looks very strange not to be cm_deref_id .. Looks OK as is > because we are sure refcount cannot be 0 here? Yes, and because of the spin lock. But I'll change it to cm_deref_id() so that it's more uniform. > >> @@ -958,8 +988,10 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 >> service_id, __be64 service_mask, >> } >> >> cm_id->state = IB_CM_LISTEN; >> +++cm_id_priv->listen_sharecount; >> >> -spin_lock_irqsave(&cm.lock, flags); >> +if (lock) >> +spin_lock_irqsave(&cm.lock, flags); > > Hmm, I'd like to see the listen_sharecount consistently locked, so it > should be manipulated only while cm.lock is held.. You're right. I'll move it inside the lock. > >> if (service_id == IB_CM_ASSIGN_SERVICE_ID) { >> cm_id->service_id = cpu_to_be64(cm.listen_service_id++); >> cm_id->service_mask = ~cpu_to_be64(0); >> @@ -968,18 +1000,98 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 >> service_id, __be64 service_mask, >> cm_id->service_mask = service_mask; >> } >> cur_cm_id_priv = cm_insert_listen(cm_id_priv); >> -spin_unlock_irqrestore(&cm.lock, flags); >> +if (lock) >> +spin_unlock_irqrestore(&cm.lock, flags); >> >> if (cur_cm_id_priv) { >> cm_id->state = IB_CM_IDLE; >> +--cm_id_priv->listen_sharecount; > > Ditto Okay. I'll move the lock release to the end of the function. > > Otherwise I don't see any other mechanical problems with this. Sean > said he was happy with the idea right? I assume he is, since he reviewed the patch and found that listen_sharecount leak, but he can speak for himself. > > Reviewed-By: Jason Gunthorpe Thanks. Can I add it with the modifications above? Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 02/12] IB/core: Find the network device matching connection parameters
On 08/07/2015 23:33, Jason Gunthorpe wrote: > On Mon, Jun 22, 2015 at 03:42:31PM +0300, Haggai Eran wrote: >> +/** >> + * ib_get_net_dev_by_params() - Return the appropriate net_dev >> + * for a received CM request >> + * @dev:An RDMA device on which the request has been received. >> + * @port: Port number on the RDMA device. >> + * @pkey: The Pkey the request came on. >> + * @gid:A GID that the net_dev uses to communicate. >> + * @addr: Contains the IP address that the request specified as its >> + * destination. >> + */ >> +struct net_device *ib_get_net_dev_by_params(struct ib_device *dev, u8 port, >> +u16 pkey, const union ib_gid *gid, >> +const struct sockaddr *addr); > > I feel like this has been repated a few times now, but kdocs should be > with the function body, not in the header. Right. I fixed it in the IB/addr patch, but missed it here. I'll move it to the function's definition. > Reviewed-By: Jason Gunthorpe Thanks, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v1 03/12] IB/ipoib: Return IPoIB devices matching connection parameters
On 09/07/2015 02:41, Jason Gunthorpe wrote: > On Mon, Jun 22, 2015 at 03:42:32PM +0300, Haggai Eran wrote: >> +if (net_dev) { >> +ipoib_warn(priv, "matching net_dev found: %s\n", >> + net_dev->name); > > Is that a debug print? Yes, I'm afraid it slipped in somehow. I'll remove it. > >> +default: >> +dev_warn(&dev->dev, "duplicate IP address detected\n"); >> +/* Fall through */ > > Surely that needs some kind of rate limit? I'll use dev_warn_ratelimited(). Thanks, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v1 10/12] IB/cma: use found net_dev for passive connections
When receiving a new connection in cma_req_handler, we actually already know the net_dev that is used for the connection's creation. Instead of calling cma_translate_addr to resolve the new connection id's source address, just use the net_dev that was found. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cma.c | 64 ++- 1 file changed, 33 insertions(+), 31 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 759697d6e627..729511d3ec64 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -1251,27 +1251,25 @@ static struct rdma_id_private *cma_find_listener( } static struct rdma_id_private *cma_id_from_event(struct ib_cm_id *cm_id, -struct ib_cm_event *ib_event) +struct ib_cm_event *ib_event, +struct net_device **net_dev) { struct cma_req_info req; struct rdma_bind_list *bind_list; struct rdma_id_private *id_priv; - struct net_device *net_dev; int err; err = cma_save_req_info(ib_event, &req); if (err) return ERR_PTR(err); - net_dev = cma_get_net_dev(ib_event, &req); - if (IS_ERR(net_dev)) - return ERR_PTR(PTR_ERR(net_dev)); + *net_dev = cma_get_net_dev(ib_event, &req); + if (IS_ERR(*net_dev)) + return ERR_PTR(PTR_ERR(*net_dev)); bind_list = cma_ps_find(rdma_ps_from_service_id(req.service_id), cma_port_from_service_id(req.service_id)); - id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, net_dev); - - dev_put(net_dev); + id_priv = cma_find_listener(bind_list, cm_id, ib_event, &req, *net_dev); return id_priv; } @@ -1521,7 +1519,8 @@ out: } static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id, - struct ib_cm_event *ib_event) + struct ib_cm_event *ib_event, + struct net_device *net_dev) { struct rdma_id_private *id_priv; struct rdma_cm_id *id; @@ -1553,15 +1552,9 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id, if (rt->num_paths == 2) rt->path_rec[1] = *ib_event->param.req_rcvd.alternate_path; - if (cma_any_addr(cma_src_addr(id_priv))) { - rt->addr.dev_addr.dev_type = ARPHRD_INFINIBAND; - rdma_addr_set_sgid(&rt->addr.dev_addr, &rt->path_rec[0].sgid); - ib_addr_set_pkey(&rt->addr.dev_addr, be16_to_cpu(rt->path_rec[0].pkey)); - } else { - ret = cma_translate_addr(cma_src_addr(id_priv), &rt->addr.dev_addr); - if (ret) - goto err; - } + ret = rdma_copy_addr(&rt->addr.dev_addr, net_dev, NULL); + if (ret) + goto err; rdma_addr_set_dgid(&rt->addr.dev_addr, &rt->path_rec[0].dgid); id_priv->state = RDMA_CM_CONNECT; @@ -1573,7 +1566,8 @@ err: } static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id, - struct ib_cm_event *ib_event) + struct ib_cm_event *ib_event, + struct net_device *net_dev) { struct rdma_id_private *id_priv; struct rdma_cm_id *id; @@ -1592,11 +1586,9 @@ static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id, ib_event->param.sidr_req_rcvd.service_id)) goto err; - if (!cma_any_addr((struct sockaddr *) &id->route.addr.src_addr)) { - ret = cma_translate_addr(cma_src_addr(id_priv), &id->route.addr.dev_addr); - if (ret) - goto err; - } + ret = rdma_copy_addr(&id->route.addr.dev_addr, net_dev, NULL); + if (ret) + goto err; id_priv->state = RDMA_CM_CONNECT; return id_priv; @@ -1633,28 +1625,33 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) { struct rdma_id_private *listen_id, *conn_id; struct rdma_cm_event event; + struct net_device *net_dev; int offset, ret; - listen_id = cma_id_from_event(cm_id, ib_event); + listen_id = cma_id_from_event(cm_id, ib_event, &net_dev); if (IS_ERR(listen_id)) return PTR_ERR(listen_id); - if (!cma_check_req_qp_type(&listen_id->id, ib_event)) - return -EINVAL; + if (!cma_check_req_qp_type(&listen_id->id, ib_eve
[PATCH v1 08/12] IB/cma: Add net_dev and private data checks to RDMA CM
Instead of relying on a the ib_cm module to check an incoming CM request's private data header, add these checks to the RDMA CM module. This allows a following patch to to clean up the ib_cm interface and remove the code that looks into the private headers. It will also allow supporting namespaces in RDMA CM by making these checks namespace aware later on. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cma.c | 170 +- 1 file changed, 168 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 02914bf1e548..4701167f5117 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -300,7 +300,7 @@ static enum rdma_cm_state cma_exch(struct rdma_id_private *id_priv, return old; } -static inline u8 cma_get_ip_ver(struct cma_hdr *hdr) +static inline u8 cma_get_ip_ver(const struct cma_hdr *hdr) { return hdr->ip_version >> 4; } @@ -1024,6 +1024,169 @@ static int cma_save_net_info(struct sockaddr *src_addr, return cma_save_ip_info(src_addr, dst_addr, ib_event, service_id); } +struct cma_req_info { + struct ib_device *device; + int port; + const union ib_gid *local_gid; + __be64 service_id; + u16 pkey; +}; + +static int cma_save_req_info(const struct ib_cm_event *ib_event, +struct cma_req_info *req) +{ + const struct ib_cm_req_event_param *req_param = + &ib_event->param.req_rcvd; + const struct ib_cm_sidr_req_event_param *sidr_param = + &ib_event->param.sidr_req_rcvd; + + switch (ib_event->event) { + case IB_CM_REQ_RECEIVED: + req->device = req_param->listen_id->device; + req->port = req_param->port; + req->local_gid = &req_param->primary_path->sgid; + req->service_id = req_param->primary_path->service_id; + req->pkey = be16_to_cpu(req_param->primary_path->pkey); + break; + case IB_CM_SIDR_REQ_RECEIVED: + req->device = sidr_param->listen_id->device; + req->port = sidr_param->port; + req->local_gid = NULL; + req->service_id = sidr_param->service_id; + req->pkey = sidr_param->pkey; + break; + default: + return -EINVAL; + } + + return 0; +} + +static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event, + const struct cma_req_info *req) +{ + struct sockaddr_storage listen_addr_storage; + struct sockaddr *listen_addr = (struct sockaddr *)&listen_addr_storage; + struct net_device *net_dev; + int err; + + err = cma_save_ip_info(listen_addr, NULL, ib_event, req->service_id); + if (err) + return ERR_PTR(err); + + net_dev = ib_get_net_dev_by_params(req->device, req->port, req->pkey, + req->local_gid, listen_addr); + if (!net_dev) + return ERR_PTR(-ENODEV); + + return net_dev; +} + +static enum rdma_port_space rdma_ps_from_service_id(__be64 service_id) +{ + return (be64_to_cpu(service_id) >> 16) & 0x; +} + +static bool cma_match_private_data(struct rdma_id_private *id_priv, + const struct cma_hdr *hdr) +{ + struct sockaddr *addr = cma_src_addr(id_priv); + __be32 ip4_addr; + struct in6_addr ip6_addr; + + if (cma_any_addr(addr) && !id_priv->afonly) + return true; + + switch (addr->sa_family) { + case AF_INET: + ip4_addr = ((struct sockaddr_in *)addr)->sin_addr.s_addr; + if (cma_get_ip_ver(hdr) != 4) + return false; + if (!cma_any_addr(addr) && + hdr->dst_addr.ip4.addr != ip4_addr) + return false; + break; + case AF_INET6: + ip6_addr = ((struct sockaddr_in6 *)addr)->sin6_addr; + if (cma_get_ip_ver(hdr) != 6) + return false; + if (!cma_any_addr(addr) && + memcmp(&hdr->dst_addr.ip6, &ip6_addr, sizeof(ip6_addr))) + return false; + break; + default: + return false; + } + + return true; +} + +static bool cma_match_net_dev(const struct rdma_id_private *id_priv, + const struct net_device *net_dev) +{ + const struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; + + return !addr->bound_dev_if || + (net_eq(dev_net(net_dev), &init_net) &&
[PATCH v1 03/12] IB/ipoib: Return IPoIB devices matching connection parameters
From: Guy Shapiro Implement the get_net_device_by_port_pkey_ip callback that returns network device to ib_core according to connection parameters. Check the ipoib device and iterate over all child devices to look for a match. For each IPoIB device we iterate through all upper devices when searching for a matching IP, in order to support bonding. Signed-off-by: Guy Shapiro Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 227 +- 1 file changed, 226 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index d128432f226d..eb33a7b2be04 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -48,6 +48,9 @@ #include #include +#include +#include +#include #define DRV_VERSION "1.0.0" @@ -91,11 +94,15 @@ struct ib_sa_client ipoib_sa_client; static void ipoib_add_one(struct ib_device *device); static void ipoib_remove_one(struct ib_device *device, void *client_data); static void ipoib_neigh_reclaim(struct rcu_head *rp); +static struct net_device *ipoib_get_net_dev_by_params( + struct ib_device *dev, u8 port, u16 pkey, + const union ib_gid *gid, const struct sockaddr *addr); static struct ib_client ipoib_client = { .name = "ipoib", .add= ipoib_add_one, - .remove = ipoib_remove_one + .remove = ipoib_remove_one, + .get_net_dev_by_params = ipoib_get_net_dev_by_params, }; int ipoib_open(struct net_device *dev) @@ -222,6 +229,224 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu) return 0; } +/* Called with an RCU read lock taken */ +static bool ipoib_is_dev_match_addr_rcu(const struct sockaddr *addr, + struct net_device *dev) +{ + struct net *net = dev_net(dev); + struct in_device *in_dev; + struct sockaddr_in *addr_in = (struct sockaddr_in *)addr; + struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr; + __be32 ret_addr; + + switch (addr->sa_family) { + case AF_INET: + in_dev = in_dev_get(dev); + if (!in_dev) + return false; + + ret_addr = inet_confirm_addr(net, in_dev, 0, +addr_in->sin_addr.s_addr, +RT_SCOPE_HOST); + in_dev_put(in_dev); + if (ret_addr) + return true; + + break; + case AF_INET6: + if (IS_ENABLED(CONFIG_IPV6) && + ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1)) + return true; + + break; + } + return false; +} + +/** + * Find the master net_device on top of the given net_device. + * @dev: base IPoIB net_device + * + * Returns the master net_device with a reference held, or the same net_device + * if no master exists. + */ +static struct net_device *ipoib_get_master_net_dev(struct net_device *dev) +{ + struct net_device *master; + + rcu_read_lock(); + master = netdev_master_upper_dev_get_rcu(dev); + if (master) + dev_hold(master); + rcu_read_unlock(); + + if (master) + return master; + + dev_hold(dev); + return dev; +} + +/** + * Find a net_device matching the given address, which is an upper device of + * the given net_device. + * @addr: IP address to look for. + * @dev: base IPoIB net_device + * + * If found, returns the net_device with a reference held. Otherwise return + * NULL. + */ +static struct net_device *ipoib_get_net_dev_match_addr( + const struct sockaddr *addr, struct net_device *dev) +{ + struct net_device *upper, + *result = NULL; + struct list_head *iter; + + rcu_read_lock(); + if (ipoib_is_dev_match_addr_rcu(addr, dev)) { + dev_hold(dev); + result = dev; + goto out; + } + + netdev_for_each_all_upper_dev_rcu(dev, upper, iter) { + if (ipoib_is_dev_match_addr_rcu(addr, upper)) { + dev_hold(upper); + result = upper; + break; + } + } +out: + rcu_read_unlock(); + return result; +} + +/* returns the number of IPoIB netdevs on top a given ipoib device matching a + * pkey_index and address, if one exists. + * + * @found_net_dev: contains a matching net_device if the return value >= 1, + * with a reference held. */ +static int ipoib_match_gid_pkey_addr(struct ipoib_dev_priv *priv, +const union ib_gid *gid, +u16 pkey_index, +
[PATCH v1 11/12] IB/cma: Share ib_cm_ids between rdma_cm_ids
Use ib_cm_id_create_and_listen to create listening IB CM IDs or share existing ones if needed. When given a request on a specific CM ID, the code now matches the request to the RDMA CM ID based on the request parameters, so it no longer needs to rely on the ib_cm's private data matching capabilities. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cma.c | 60 --- 1 file changed, 5 insertions(+), 55 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 729511d3ec64..322858842bae 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -1719,42 +1719,6 @@ __be64 rdma_get_service_id(struct rdma_cm_id *id, struct sockaddr *addr) } EXPORT_SYMBOL(rdma_get_service_id); -static void cma_set_compare_data(enum rdma_port_space ps, struct sockaddr *addr, -struct ib_cm_compare_data *compare) -{ - struct cma_hdr *cma_data, *cma_mask; - __be32 ip4_addr; - struct in6_addr ip6_addr; - - memset(compare, 0, sizeof *compare); - cma_data = (void *) compare->data; - cma_mask = (void *) compare->mask; - - switch (addr->sa_family) { - case AF_INET: - ip4_addr = ((struct sockaddr_in *) addr)->sin_addr.s_addr; - cma_set_ip_ver(cma_data, 4); - cma_set_ip_ver(cma_mask, 0xF); - if (!cma_any_addr(addr)) { - cma_data->dst_addr.ip4.addr = ip4_addr; - cma_mask->dst_addr.ip4.addr = htonl(~0); - } - break; - case AF_INET6: - ip6_addr = ((struct sockaddr_in6 *) addr)->sin6_addr; - cma_set_ip_ver(cma_data, 6); - cma_set_ip_ver(cma_mask, 0xF); - if (!cma_any_addr(addr)) { - cma_data->dst_addr.ip6 = ip6_addr; - memset(&cma_mask->dst_addr.ip6, 0xFF, - sizeof cma_mask->dst_addr.ip6); - } - break; - default: - break; - } -} - static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event) { struct rdma_id_private *id_priv = iw_id->context; @@ -1908,33 +1872,19 @@ out: static int cma_ib_listen(struct rdma_id_private *id_priv) { - struct ib_cm_compare_data compare_data; struct sockaddr *addr; struct ib_cm_id *id; __be64 svc_id; - int ret; - id = ib_create_cm_id(id_priv->id.device, cma_req_handler, id_priv); + addr = cma_src_addr(id_priv); + svc_id = rdma_get_service_id(&id_priv->id, addr); + id = ib_cm_insert_listen(id_priv->id.device, cma_req_handler, svc_id, +0); if (IS_ERR(id)) return PTR_ERR(id); - id_priv->cm_id.ib = id; - addr = cma_src_addr(id_priv); - svc_id = rdma_get_service_id(&id_priv->id, addr); - if (cma_any_addr(addr) && !id_priv->afonly) - ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL); - else { - cma_set_compare_data(id_priv->id.ps, addr, &compare_data); - ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, &compare_data); - } - - if (ret) { - ib_destroy_cm_id(id_priv->cm_id.ib); - id_priv->cm_id.ib = NULL; - } - - return ret; + return 0; } static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog) -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in
[PATCH v1 12/12] IB/cm: Remove compare_data checks
Now that there are no ib_cm clients using the compare_data feature for matching IB CM requests' private data, remove the compare_data parameter of ib_cm_listen and remove the code implementing the feature. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cm.c| 105 ++-- drivers/infiniband/core/ucm.c | 3 +- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +- drivers/infiniband/ulp/srpt/ib_srpt.c | 2 +- include/rdma/ib_cm.h| 14 + 5 files changed, 22 insertions(+), 104 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 6d91bf69b11a..411c08b0c743 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -221,7 +221,6 @@ struct cm_id_private { /* todo: use alternate port on send failure */ struct cm_av av; struct cm_av alt_av; - struct ib_cm_compare_data *compare_data; void *private_data; __be64 tid; @@ -442,40 +441,6 @@ static struct cm_id_private * cm_acquire_id(__be32 local_id, __be32 remote_id) return cm_id_priv; } -static void cm_mask_copy(u32 *dst, const u32 *src, const u32 *mask) -{ - int i; - - for (i = 0; i < IB_CM_COMPARE_SIZE; i++) - dst[i] = src[i] & mask[i]; -} - -static int cm_compare_data(struct ib_cm_compare_data *src_data, - struct ib_cm_compare_data *dst_data) -{ - u32 src[IB_CM_COMPARE_SIZE]; - u32 dst[IB_CM_COMPARE_SIZE]; - - if (!src_data || !dst_data) - return 0; - - cm_mask_copy(src, src_data->data, dst_data->mask); - cm_mask_copy(dst, dst_data->data, src_data->mask); - return memcmp(src, dst, sizeof(src)); -} - -static int cm_compare_private_data(u32 *private_data, - struct ib_cm_compare_data *dst_data) -{ - u32 src[IB_CM_COMPARE_SIZE]; - - if (!dst_data) - return 0; - - cm_mask_copy(src, private_data, dst_data->mask); - return memcmp(src, dst_data->data, sizeof(src)); -} - /* * Trivial helpers to strip endian annotation and compare; the * endianness doesn't actually matter since we just need a stable @@ -508,18 +473,14 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) struct cm_id_private *cur_cm_id_priv; __be64 service_id = cm_id_priv->id.service_id; __be64 service_mask = cm_id_priv->id.service_mask; - int data_cmp; while (*link) { parent = *link; cur_cm_id_priv = rb_entry(parent, struct cm_id_private, service_node); - data_cmp = cm_compare_data(cm_id_priv->compare_data, - cur_cm_id_priv->compare_data); if ((cur_cm_id_priv->id.service_mask & service_id) == (service_mask & cur_cm_id_priv->id.service_id) && - (cm_id_priv->id.device == cur_cm_id_priv->id.device) && - !data_cmp) + (cm_id_priv->id.device == cur_cm_id_priv->id.device)) return cur_cm_id_priv; if (cm_id_priv->id.device < cur_cm_id_priv->id.device) @@ -530,8 +491,6 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) link = &(*link)->rb_left; else if (be64_gt(service_id, cur_cm_id_priv->id.service_id)) link = &(*link)->rb_right; - else if (data_cmp < 0) - link = &(*link)->rb_left; else link = &(*link)->rb_right; } @@ -541,20 +500,16 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) } static struct cm_id_private * cm_find_listen(struct ib_device *device, -__be64 service_id, -u32 *private_data) +__be64 service_id) { struct rb_node *node = cm.listen_service_table.rb_node; struct cm_id_private *cm_id_priv; - int data_cmp; while (node) { cm_id_priv = rb_entry(node, struct cm_id_private, service_node); - data_cmp = cm_compare_private_data(private_data, - cm_id_priv->compare_data); if ((cm_id_priv->id.service_mask & service_id) == cm_id_priv->id.service_id && - (cm_id_priv->id.device == device) && !data_cmp) + (cm_id_priv->id.device == device)) return cm_id_priv; if (d
[PATCH v1 07/12] IB/cma: Helper functions to access port space IDRs
Add helper functions to access the IDRs by port-space and port number. Pass around the port-space enum in cma.c instead of using pointers to port-space IDRs. Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel Signed-off-by: Guy Shapiro --- drivers/infiniband/core/cma.c | 81 --- 1 file changed, 60 insertions(+), 21 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 21cc6c99008b..02914bf1e548 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -113,6 +113,22 @@ static DEFINE_IDR(udp_ps); static DEFINE_IDR(ipoib_ps); static DEFINE_IDR(ib_ps); +static struct idr *cma_idr(enum rdma_port_space ps) +{ + switch (ps) { + case RDMA_PS_TCP: + return &tcp_ps; + case RDMA_PS_UDP: + return &udp_ps; + case RDMA_PS_IPOIB: + return &ipoib_ps; + case RDMA_PS_IB: + return &ib_ps; + default: + return NULL; + } +} + struct cma_device { struct list_headlist; struct ib_device*device; @@ -122,11 +138,33 @@ struct cma_device { }; struct rdma_bind_list { - struct idr *ps; + enum rdma_port_spaceps; struct hlist_head owners; unsigned short port; }; +static int cma_ps_alloc(enum rdma_port_space ps, + struct rdma_bind_list *bind_list, int snum) +{ + struct idr *idr = cma_idr(ps); + + return idr_alloc(idr, bind_list, snum, snum + 1, GFP_KERNEL); +} + +static struct rdma_bind_list *cma_ps_find(enum rdma_port_space ps, int snum) +{ + struct idr *idr = cma_idr(ps); + + return idr_find(idr, snum); +} + +static void cma_ps_remove(enum rdma_port_space ps, int snum) +{ + struct idr *idr = cma_idr(ps); + + idr_remove(idr, snum); +} + enum { CMA_OPTION_AFONLY, }; @@ -1053,7 +1091,7 @@ static void cma_release_port(struct rdma_id_private *id_priv) mutex_lock(&lock); hlist_del(&id_priv->node); if (hlist_empty(&bind_list->owners)) { - idr_remove(bind_list->ps, bind_list->port); + cma_ps_remove(bind_list->ps, bind_list->port); kfree(bind_list); } mutex_unlock(&lock); @@ -2349,8 +2387,8 @@ static void cma_bind_port(struct rdma_bind_list *bind_list, hlist_add_head(&id_priv->node, &bind_list->owners); } -static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv, - unsigned short snum) +static int cma_alloc_port(enum rdma_port_space ps, + struct rdma_id_private *id_priv, unsigned short snum) { struct rdma_bind_list *bind_list; int ret; @@ -2359,7 +2397,7 @@ static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv, if (!bind_list) return -ENOMEM; - ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL); + ret = cma_ps_alloc(ps, bind_list, snum); if (ret < 0) goto err; @@ -2372,7 +2410,8 @@ err: return ret == -ENOSPC ? -EADDRNOTAVAIL : ret; } -static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv) +static int cma_alloc_any_port(enum rdma_port_space ps, + struct rdma_id_private *id_priv) { static unsigned int last_used_port; int low, high, remaining; @@ -2383,7 +2422,7 @@ static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv) rover = prandom_u32() % remaining + low; retry: if (last_used_port != rover && - !idr_find(ps, (unsigned short) rover)) { + !cma_ps_find(ps, (unsigned short)rover)) { int ret = cma_alloc_port(ps, id_priv, rover); /* * Remember previously used port number in order to avoid @@ -2438,7 +2477,8 @@ static int cma_check_port(struct rdma_bind_list *bind_list, return 0; } -static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv) +static int cma_use_port(enum rdma_port_space ps, + struct rdma_id_private *id_priv) { struct rdma_bind_list *bind_list; unsigned short snum; @@ -2448,7 +2488,7 @@ static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv) if (snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE)) return -EACCES; - bind_list = idr_find(ps, snum); + bind_list = cma_ps_find(ps, snum); if (!bind_list) { ret = cma_alloc_port(ps, id_priv, snum); } else { @@ -2471,25 +2511,24 @@ static int cma_bind_listen(struct rdma_id_private *id_priv) return ret; } -static struct idr *cma_select_inet_ps(struct rdma_id_pri
[PATCH v1 06/12] IB/cma: Refactor RDMA IP CM private-data parsing code
When receiving a connection request, rdma_cm needs to associate the request with a network device, in order to disambiguate requests. To do this, it needs to know the request's destination IP. For this the module needs to allow getting this information from the private data in the request packet, instead of relying on the information already being in the listening RDMA CM ID. When creating a new incoming connection ID, the code in cma_save_ip{4,6}_info can no longer rely on the listener's private data to find the port number, so it reads it from the requested service ID. Signed-off-by: Guy Shapiro Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel --- drivers/infiniband/core/cma.c | 150 ++ 1 file changed, 92 insertions(+), 58 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index df456b6beb40..21cc6c99008b 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -870,97 +870,122 @@ static inline int cma_any_port(struct sockaddr *addr) return !cma_port(addr); } -static void cma_save_ib_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id, +static void cma_save_ib_info(struct sockaddr *src_addr, +struct sockaddr *dst_addr, struct ib_sa_path_rec *path) { - struct sockaddr_ib *listen_ib, *ib; + struct sockaddr_ib *ib; - listen_ib = (struct sockaddr_ib *) &listen_id->route.addr.src_addr; - ib = (struct sockaddr_ib *) &id->route.addr.src_addr; - ib->sib_family = listen_ib->sib_family; - ib->sib_pkey = path->pkey; - ib->sib_flowinfo = path->flow_label; - memcpy(&ib->sib_addr, &path->sgid, 16); - ib->sib_sid = listen_ib->sib_sid; - ib->sib_sid_mask = cpu_to_be64(0xULL); - ib->sib_scope_id = listen_ib->sib_scope_id; - - ib = (struct sockaddr_ib *) &id->route.addr.dst_addr; - ib->sib_family = listen_ib->sib_family; - ib->sib_pkey = path->pkey; - ib->sib_flowinfo = path->flow_label; - memcpy(&ib->sib_addr, &path->dgid, 16); -} - -static __be16 ss_get_port(const struct sockaddr_storage *ss) -{ - if (ss->ss_family == AF_INET) - return ((struct sockaddr_in *)ss)->sin_port; - else if (ss->ss_family == AF_INET6) - return ((struct sockaddr_in6 *)ss)->sin6_port; - BUG(); + if (src_addr) { + ib = (struct sockaddr_ib *)src_addr; + ib->sib_family = AF_IB; + ib->sib_pkey = path->pkey; + ib->sib_flowinfo = path->flow_label; + memcpy(&ib->sib_addr, &path->sgid, 16); + ib->sib_sid = path->service_id; + ib->sib_sid_mask = cpu_to_be64(0xULL); + ib->sib_scope_id = 0; + } + if (dst_addr) { + ib = (struct sockaddr_ib *)dst_addr; + ib->sib_family = AF_IB; + ib->sib_pkey = path->pkey; + ib->sib_flowinfo = path->flow_label; + memcpy(&ib->sib_addr, &path->dgid, 16); + } } -static void cma_save_ip4_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id, - struct cma_hdr *hdr) +static void cma_save_ip4_info(struct sockaddr *src_addr, + struct sockaddr *dst_addr, + struct cma_hdr *hdr, + __be16 local_port) { struct sockaddr_in *ip4; - ip4 = (struct sockaddr_in *) &id->route.addr.src_addr; - ip4->sin_family = AF_INET; - ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr; - ip4->sin_port = ss_get_port(&listen_id->route.addr.src_addr); + if (src_addr) { + ip4 = (struct sockaddr_in *)src_addr; + ip4->sin_family = AF_INET; + ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr; + ip4->sin_port = local_port; + } - ip4 = (struct sockaddr_in *) &id->route.addr.dst_addr; - ip4->sin_family = AF_INET; - ip4->sin_addr.s_addr = hdr->src_addr.ip4.addr; - ip4->sin_port = hdr->port; + if (dst_addr) { + ip4 = (struct sockaddr_in *)dst_addr; + ip4->sin_family = AF_INET; + ip4->sin_addr.s_addr = hdr->src_addr.ip4.addr; + ip4->sin_port = hdr->port; + } } -static void cma_save_ip6_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id, - struct cma_hdr *hdr) +static void cma_save_ip6_info(struct sockaddr *src_addr, + struct sockaddr *dst_addr, +
[PATCH v1 09/12] IB/cma: validate routing of incoming requests
Pass incoming request parameters through the relevant IPv4/IPv6 routing tables and make sure the network stack is configured to handle such requests. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cma.c | 95 +-- 1 file changed, 92 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 4701167f5117..759697d6e627 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -46,6 +46,8 @@ #include #include +#include +#include #include #include @@ -1062,15 +1064,97 @@ static int cma_save_req_info(const struct ib_cm_event *ib_event, return 0; } +static bool validate_ipv4_net_dev(struct net_device *net_dev, + const struct sockaddr_in *dst_addr, + const struct sockaddr_in *src_addr) +{ + __be32 daddr = dst_addr->sin_addr.s_addr, + saddr = src_addr->sin_addr.s_addr; + struct fib_result res; + struct flowi4 fl4; + int err; + bool ret; + + if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr) || + ipv4_is_lbcast(daddr) || ipv4_is_zeronet(saddr) || + ipv4_is_zeronet(daddr) || ipv4_is_loopback(daddr) || + ipv4_is_loopback(saddr)) + return false; + + memset(&fl4, 0, sizeof(fl4)); + fl4.flowi4_iif = net_dev->ifindex; + fl4.daddr = daddr; + fl4.saddr = saddr; + + rcu_read_lock(); + err = fib_lookup(dev_net(net_dev), &fl4, &res); + if (err) + return false; + + ret = FIB_RES_DEV(res) == net_dev; + rcu_read_unlock(); + + return ret; +} + +static bool validate_ipv6_net_dev(struct net_device *net_dev, + const struct sockaddr_in6 *dst_addr, + const struct sockaddr_in6 *src_addr) +{ +#if IS_ENABLED(CONFIG_IPV6) + const int strict = ipv6_addr_type(&dst_addr->sin6_addr) & + IPV6_ADDR_LINKLOCAL; + struct rt6_info *rt = rt6_lookup(dev_net(net_dev), &dst_addr->sin6_addr, +&src_addr->sin6_addr, net_dev->ifindex, +strict); + bool ret; + + if (!rt) + return false; + + ret = rt->rt6i_idev->dev == net_dev; + ip6_rt_put(rt); + + return ret; +#else + return false; +#endif +} + +static bool validate_net_dev(struct net_device *net_dev, +const struct sockaddr *daddr, +const struct sockaddr *saddr) +{ + const struct sockaddr_in *daddr4 = (const struct sockaddr_in *)daddr; + const struct sockaddr_in *saddr4 = (const struct sockaddr_in *)saddr; + const struct sockaddr_in6 *daddr6 = (const struct sockaddr_in6 *)daddr; + const struct sockaddr_in6 *saddr6 = (const struct sockaddr_in6 *)saddr; + + switch (daddr->sa_family) { + case AF_INET: + return saddr->sa_family == AF_INET && + validate_ipv4_net_dev(net_dev, daddr4, saddr4); + + case AF_INET6: + return saddr->sa_family == AF_INET6 && + validate_ipv6_net_dev(net_dev, daddr6, saddr6); + + default: + return false; + } +} + static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event, const struct cma_req_info *req) { - struct sockaddr_storage listen_addr_storage; - struct sockaddr *listen_addr = (struct sockaddr *)&listen_addr_storage; + struct sockaddr_storage listen_addr_storage, src_addr_storage; + struct sockaddr *listen_addr = (struct sockaddr *)&listen_addr_storage, + *src_addr = (struct sockaddr *)&src_addr_storage; struct net_device *net_dev; int err; - err = cma_save_ip_info(listen_addr, NULL, ib_event, req->service_id); + err = cma_save_ip_info(listen_addr, src_addr, ib_event, + req->service_id); if (err) return ERR_PTR(err); @@ -1079,6 +1163,11 @@ static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event, if (!net_dev) return ERR_PTR(-ENODEV); + if (!validate_net_dev(net_dev, listen_addr, src_addr)) { + dev_put(net_dev); + return ERR_PTR(-EHOSTUNREACH); + } + return net_dev; } -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in
[PATCH v1 05/12] IB/cm: Share listening CM IDs
Enabling network namespaces for RDMA CM will allow processes on different namespaces to listen on the same port. In order to leave namespace support out of the CM layer, this requires that multiple RDMA CM IDs will be able to share a single CM ID. This patch adds infrastructure to retrieve an existing listening ib_cm_id, based on its device and service ID, or create a new one if one does not already exist. It also adds a reference count for such instances (cm_id_private.listen_sharecount), and prevents cm_destroy_id from destroying a CM if it is still shared. See the relevant discussion [1]. [1] Re: [PATCH v3 for-next 05/13] IB/cm: Reference count ib_cm_ids http://www.spinics.net/lists/netdev/msg328860.html Cc: Jason Gunthorpe Signed-off-by: Haggai Eran --- drivers/infiniband/core/cm.c | 124 --- include/rdma/ib_cm.h | 4 ++ 2 files changed, 122 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index dd92c999e6e9..6d91bf69b11a 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -212,6 +212,9 @@ struct cm_id_private { spinlock_t lock;/* Do not acquire inside cm.lock */ struct completion comp; atomic_t refcount; + /* Number of clients sharing this ib_cm_id. Only valid for listeners. +* Protected by the cm.lock spinlock. */ + int listen_sharecount; struct ib_mad_send_buf *msg; struct cm_timewait_info *timewait_info; @@ -847,9 +850,15 @@ retest: spin_lock_irq(&cm_id_priv->lock); switch (cm_id->state) { case IB_CM_LISTEN: - cm_id->state = IB_CM_IDLE; spin_unlock_irq(&cm_id_priv->lock); + spin_lock_irq(&cm.lock); + if (--cm_id_priv->listen_sharecount > 0) { + /* The id is still shared. */ + atomic_dec(&cm_id_priv->refcount); + spin_unlock_irq(&cm.lock); + return; + } rb_erase(&cm_id_priv->service_node, &cm.listen_service_table); spin_unlock_irq(&cm.lock); break; @@ -929,11 +938,32 @@ void ib_destroy_cm_id(struct ib_cm_id *cm_id) } EXPORT_SYMBOL(ib_destroy_cm_id); -int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, -struct ib_cm_compare_data *compare_data) +/** + * __ib_cm_listen - Initiates listening on the specified service ID for + * connection and service ID resolution requests. + * @cm_id: Connection identifier associated with the listen request. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. The service ID should be specified + * network-byte order. If set to IB_CM_ASSIGN_SERVICE_ID, the CM will + * assign a service ID to the caller. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. If set to 0, the service ID is matched + * exactly. This parameter is ignored if %service_id is set to + * IB_CM_ASSIGN_SERVICE_ID. + * @compare_data: This parameter is optional. It specifies data that must + * appear in the private data of a connection request for the specified + * listen request. + * @lock: If set, lock the cm.lock spin-lock when adding the id to the + * listener tree. When false, the caller must already hold the spin-lock, + * and compare_data must be NULL. + */ +static int __ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, + __be64 service_mask, + struct ib_cm_compare_data *compare_data, + bool lock) { struct cm_id_private *cm_id_priv, *cur_cm_id_priv; - unsigned long flags; + unsigned long flags = 0; int ret = 0; service_mask = service_mask ? service_mask : ~cpu_to_be64(0); @@ -958,8 +988,10 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, } cm_id->state = IB_CM_LISTEN; + ++cm_id_priv->listen_sharecount; - spin_lock_irqsave(&cm.lock, flags); + if (lock) + spin_lock_irqsave(&cm.lock, flags); if (service_id == IB_CM_ASSIGN_SERVICE_ID) { cm_id->service_id = cpu_to_be64(cm.listen_service_id++); cm_id->service_mask = ~cpu_to_be64(0); @@ -968,18 +1000,98 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, cm_id->service_mask = service_mask; } cur_cm_id_priv = cm_insert_listen(cm_id_priv); - spin_unlock_irqrestore(&cm.lock, flags); + if (lock) + spin_unlock_irqrestore(&cm.lock, flags); if (cur_cm_id_priv) { cm_id->state = IB_CM_IDLE; +
[PATCH v1 02/12] IB/core: Find the network device matching connection parameters
From: Yotam Kenneth In the case of IPoIB, and maybe in other cases, the network device is managed by an upper-layer protocol (ULP). In order to expose this network device to other users of the IB device, let ULPs implement a callback that returns network device according to connection parameters. The IB device and port, together with the P_Key and the GID should be enough to uniquely identify the ULP net device. However, in current kernels there can be multiple IPoIB interfaces created with the same GID. Furthermore, such configuration may be desireable to support ipvlan-like configurations for RDMA CM with IPoIB. To resolve the device in these cases the code will also take the IP address as an additional input. Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel Signed-off-by: Guy Shapiro --- drivers/infiniband/core/device.c | 29 + include/rdma/ib_verbs.h | 35 +++ 2 files changed, 64 insertions(+) diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index c62974187b5c..2d26eb383400 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include "core_priv.h" @@ -749,6 +750,34 @@ int ib_find_pkey(struct ib_device *device, } EXPORT_SYMBOL(ib_find_pkey); +struct net_device *ib_get_net_dev_by_params(struct ib_device *dev, + u8 port, + u16 pkey, + const union ib_gid *gid, + const struct sockaddr *addr) +{ + struct net_device *net_dev = NULL; + struct ib_client *client; + + if (!rdma_protocol_ib(dev, port)) + return NULL; + + down_read(&lists_rwsem); + + list_for_each_entry(client, &client_list, list) + if (client->get_net_dev_by_params) { + net_dev = client->get_net_dev_by_params(dev, port, pkey, + gid, addr); + if (net_dev) + break; + } + + up_read(&lists_rwsem); + + return net_dev; +} +EXPORT_SYMBOL(ib_get_net_dev_by_params); + static int __init ib_core_init(void) { int ret; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 1767a11e86fd..1d3c418778ad 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -48,6 +48,7 @@ #include #include #include +#include #include #include @@ -1762,6 +1763,26 @@ struct ib_client { void (*add) (struct ib_device *); void (*remove)(struct ib_device *, void *client_data); + /* Returns the net_dev belonging to this ib_client and matching the +* given parameters. +* @dev:An RDMA device that the net_dev use for communication. +* @port: A physical port number on the RDMA device. +* @pkey: P_Key that the net_dev uses if applicable. +* @gid:A GID that the net_dev uses to communicate. +* @addr: An IP address the net_dev is configured with. +* +* An ib_client that implements a net_dev on top of RDMA devices +* (such as IP over IB) should implement this callback, allowing the +* rdma_cm module to find the right net_dev for a given request. +* +* The caller is responsible for calling dev_put on the returned +* netdev. */ + struct net_device *(*get_net_dev_by_params)( + struct ib_device *dev, + u8 port, + u16 pkey, + const union ib_gid *gid, + const struct sockaddr *addr); struct list_head list; }; @@ -3030,4 +3051,18 @@ static inline int ib_check_mr_access(int flags) int ib_check_mr_status(struct ib_mr *mr, u32 check_mask, struct ib_mr_status *mr_status); +/** + * ib_get_net_dev_by_params() - Return the appropriate net_dev + * for a received CM request + * @dev: An RDMA device on which the request has been received. + * @port: Port number on the RDMA device. + * @pkey: The Pkey the request came on. + * @gid: A GID that the net_dev uses to communicate. + * @addr: Contains the IP address that the request specified as its + * destination. + */ +struct net_device *ib_get_net_dev_by_params(struct ib_device *dev, u8 port, + u16 pkey, const union ib_gid *gid, + const struct sockaddr *addr); + #endif /* IB_VERBS_H */ -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in
[PATCH v1 00/12] Demux IB CM requests in the rdma_cm module
Thanks everyone for the review comments. I've updated the patch set accordingly. The changes are listed below. Changes from v0: - Added a patch to prevent a race between ib_unregister_device() and ib_get_net_dev_by_params(). - Removed the patch that exported a UD GMP packet's GID from the GRH, and related code. - Patch 3: * Add _rcu suffix to ipoib_is_dev_match_addr(). * Add helper function to get the master netdev for bonding support. * Scan for matching net devices in two phases: first without looking at * the IP address, and then looking at the IP address only when the first phase did not find a unique net device. - Patch 5: * Do not init listen_sharecount = 1 for non-listening ib_cm_ids. * Remove code that sets a CM ID's state to IB_CM_IDLE right before destruction. * Rename ib_cm_id_create_and_listen() to ib_cm_insert_listen(). * Do not increase reference counts when failing to add a shared CM ID due to having a different handler callback. - Patch 9: Clean IPv4 net_dev validation function. - Added patch 10: new patch to use the found net_dev in IB/cma for eliminating unneeded calls to cma_translate_addr. - Patch 12: Remove the lock argument to __ib_cm_listen(). The rdma_cm module relies today on the ib_cm module to demux incoming requests based on their service ID and IP address. The ib_cm module is the wrong place to perform this task, as it can also be used with services that do not adhere to the RDMA IP CM service as defined in the IBA specifications. It is forced to use an opaque private data struct and mask to compare incoming requests against. This series moves that demux task responsibility to the rdma_cm module. The rdma_cm module can look into the private data attached to a CM request, containing the IP addresses related to the request. It uses the details of the request to find the net device associated with the request, and use that net device to find the correct listening rdma_cm_id. The series applies against Doug's for-v4.2 tree with the patch adding a rwsem to IB core [2] applied. The series is structured as follows: Patch 1 prevents a possible race between ib_client.remove() callbacks from ib_unregister_device(), and ib_client callbacks that rely on the lists_rwsem locked for read, such as ib_get_net_dev_by_params(). Both callbacks may call ib_get_client_data(), and the patch makes sure that the remove callback doesn't free the client data while it is being used by the other callback. Patches 2-3 add the ability to lookup a network device according to the IB device, port, P_Key, GID and IP address. They find the matching IPoIB interfaces, and return a matching net_device if one exists. Patches 4-5 make necessary changes in ib_cm to allow RDMA CM get the information it needs out of CM and SIDR requests, and share a single ib_cm_id with multiple RDMA CM listeners. Patches 6-7 do some preliminary refactoring to the rdma_cm module. They allow extracting information out of incoming requests instead of retrieving them from a listening CM ID, and add helper functions to access the port space IDRs. Finally, patches 8-11 change rdma_cm to demultiplex requests on its own, and patch 12 cleans up the now unneeded code in ib_cm to compare against the private data. This series contains a subset of the RDMA CM namespaces patches [1]. The changes from v4 of the relevant patches are: - Patch 1 * in addition to the IB device, port, P_Key and IP address, pass also the GID, to make future IPoIB devices with alias GIDs to unique. * return the matching net_device instead of a network namespace. - Patch 2: use IS_ENABLED(CONFIG_IPV6) without ifdefs. - Patch 5: * rename sharecount -> listen_sharecount. * use a regular int instead of atomic for the share count, protected by the cm.lock spinlock. * change id destruction and shared listener creation to prevent the case where an id is found but it is under destruction. [1] [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM http://www.spinics.net/lists/linux-rdma/msg25244.html [2] [PATCH for-next V5 02/12] IB/core: Add rwsem to allow reading device list or client list http://www.spinics.net/lists/linux-rdma/msg25931.html Guy Shapiro (1): IB/ipoib: Return IPoIB devices matching connection parameters Haggai Eran (10): IB/core: pass client data to remove() callbacks IB/cm: Expose service ID in request events IB/cm: Share listening CM IDs IB/cma: Refactor RDMA IP CM private-data parsing code IB/cma: Helper functions to access port space IDRs IB/cma: Add net_dev and private data checks to RDMA CM IB/cma: validate routing of incoming requests IB/cma: use found net_dev for passive connections IB/cma: Share ib_cm_ids between rdma_cm_ids IB/cm: Remove compare_data checks Yotam Kenneth (1): IB/core: Find the network device matching connection parameters drivers/infiniband/core/cache.c | 2 +-
[PATCH v1 04/12] IB/cm: Expose service ID in request events
Expose the service ID on an incoming CM or SIDR request to the event handler. This will allow the RDMA CM module to de-multiplex connection requests based on the information encoded in the service ID. Acked-by: Sean Hefty Signed-off-by: Haggai Eran --- drivers/infiniband/core/cm.c | 3 +++ include/rdma/ib_cm.h | 1 + 2 files changed, 4 insertions(+) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 0103534472e0..dd92c999e6e9 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1256,6 +1256,7 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, primary_path->packet_life_time = cm_req_get_primary_local_ack_timeout(req_msg); primary_path->packet_life_time -= (primary_path->packet_life_time > 0); + primary_path->service_id = req_msg->service_id; if (req_msg->alt_local_lid) { memset(alt_path, 0, sizeof *alt_path); @@ -1277,6 +1278,7 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, alt_path->packet_life_time = cm_req_get_alt_local_ack_timeout(req_msg); alt_path->packet_life_time -= (alt_path->packet_life_time > 0); + alt_path->service_id = req_msg->service_id; } } @@ -2980,6 +2982,7 @@ static void cm_format_sidr_req_event(struct cm_work *work, param = &work->cm_event.param.sidr_req_rcvd; param->pkey = __be16_to_cpu(sidr_req_msg->pkey); param->listen_id = listen_id; + param->service_id = sidr_req_msg->service_id; param->port = work->port->port_num; work->cm_event.private_data = &sidr_req_msg->private_data; } diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index 39ed2d2fbd51..1b567bbc3ad4 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -223,6 +223,7 @@ struct ib_cm_apr_event_param { struct ib_cm_sidr_req_event_param { struct ib_cm_id *listen_id; + __be64 service_id; u8 port; u16 pkey; }; -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in
[PATCH v1 01/12] IB/core: pass client data to remove() callbacks
An ib_client callback that is called with the lists_rwsem locked only for read is protected from changes to the IB client lists, but not from ib_unregister_device() freeing its client data. This is because ib_unregister_device() will remove the device from the device list with lists_rwsem locked for write, but perform the rest of the cleanup, including the call to remove() without that lock. This can cause the remove() callback and the second callback to race when accessing the private client data. To solve this, make sure ib_get_client_data() returns NULL before releasing lists_rwsem in ib_unregister_device(). Since remove() also needs access to the client data, change the remove callback to accept it as a parameter. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cache.c | 2 +- drivers/infiniband/core/cm.c | 7 +++ drivers/infiniband/core/cma.c | 7 +++ drivers/infiniband/core/device.c | 27 +++ drivers/infiniband/core/mad.c | 2 +- drivers/infiniband/core/multicast.c | 7 +++ drivers/infiniband/core/sa_query.c| 6 +++--- drivers/infiniband/core/ucm.c | 6 +++--- drivers/infiniband/core/user_mad.c| 6 +++--- drivers/infiniband/core/uverbs_main.c | 6 +++--- drivers/infiniband/ulp/ipoib/ipoib_main.c | 7 +++ drivers/infiniband/ulp/srp/ib_srp.c | 6 +++--- drivers/infiniband/ulp/srpt/ib_srpt.c | 5 ++--- include/rdma/ib_verbs.h | 6 +- net/rds/ib.c | 8 ++-- net/rds/ib_send.c | 3 --- net/rds/iw.c | 11 ++- net/rds/iw_cm.c | 18 ++ net/rds/iw_send.c | 4 19 files changed, 77 insertions(+), 67 deletions(-) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index 871da832d016..c93af66cc091 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -394,7 +394,7 @@ err: kfree(device->cache.lmc_cache); } -static void ib_cache_cleanup_one(struct ib_device *device) +static void ib_cache_cleanup_one(struct ib_device *device, void *client_data) { int p; diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 32063add9280..0103534472e0 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -58,7 +58,7 @@ MODULE_DESCRIPTION("InfiniBand CM"); MODULE_LICENSE("Dual BSD/GPL"); static void cm_add_one(struct ib_device *device); -static void cm_remove_one(struct ib_device *device); +static void cm_remove_one(struct ib_device *device, void *client_data); static struct ib_client cm_client = { .name = "cm", @@ -3846,9 +3846,9 @@ free: kfree(cm_dev); } -static void cm_remove_one(struct ib_device *ib_device) +static void cm_remove_one(struct ib_device *ib_device, void *client_data) { - struct cm_device *cm_dev; + struct cm_device *cm_dev = client_data; struct cm_port *port; struct ib_port_modify port_modify = { .clr_port_cap_mask = IB_PORT_CM_SUP @@ -3856,7 +3856,6 @@ static void cm_remove_one(struct ib_device *ib_device) unsigned long flags; int i; - cm_dev = ib_get_client_data(ib_device, &cm_client); if (!cm_dev) return; diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 3b943b700a63..df456b6beb40 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -94,7 +94,7 @@ const char *rdma_event_msg(enum rdma_cm_event_type event) EXPORT_SYMBOL(rdma_event_msg); static void cma_add_one(struct ib_device *device); -static void cma_remove_one(struct ib_device *device); +static void cma_remove_one(struct ib_device *device, void *client_data); static struct ib_client cma_client = { .name = "cma", @@ -3541,11 +3541,10 @@ static void cma_process_remove(struct cma_device *cma_dev) wait_for_completion(&cma_dev->comp); } -static void cma_remove_one(struct ib_device *device) +static void cma_remove_one(struct ib_device *device, void *client_data) { - struct cma_device *cma_dev; + struct cma_device *cma_dev = client_data; - cma_dev = ib_get_client_data(device, &cma_client); if (!cma_dev) return; diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index f08d438205ed..c62974187b5c 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -340,7 +340,7 @@ EXPORT_SYMBOL(ib_register_device); */ void ib_unregister_device(struct ib_device *device) { - struct ib_client *client; + struct list_head client_data_list; struct ib_client_data *context, *tmp; unsigned long flags
Re: [PATCH 07/11] IB/cma: Helper functions to access port space IDRs
On 16/06/2015 01:36, Hefty, Sean wrote: >> Add helper functions to access the IDRs by port-space and port number. >> >> Pass around the port-space enum in cma.c instead of using pointers to >> port-space IDRs. > > What is the motivation for this change? In the next patch ("IB/cma: Add net_dev and private data checks to RDMA CM"), in cma_id_from_event(), I pass the port-space that is extracted from the service ID to cma_ps_find(). Without this patch, I would need to find the correct port-space IDR, effectively duplicating the code in cma_select_inet_ps(). Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in
Re: [PATCH 04/11] IB/cm: Expose DGID in SIDR request events
On 17/06/2015 20:06, Jason Gunthorpe wrote: > On Tue, Jun 16, 2015 at 02:25:07PM +0300, Haggai Eran wrote: >> Regarding APM, currently the ib_cm code always sends the GMP to the >> primary path anyway, right? And in any case, one would expect the >> primary path's GID to have a valid net_device and local routing rules, >> so I think for the purpose of demuxing and validating the request using >> the primary path should be fine. > > The current code works that way, but it is not what I'd expect > generally. > > For instance, future APM support will be able to drive dual-rail and > policy will decide which rail is the current best rail for data > transfer. So the GMP may be directed to the IPoIB device with port 1, > but the data transfer may happen on the RDMA port 2. [Note, I already > have very rough patches that do this de-coupling] > >> Why do you think the GMP's net_device should be used over the one of the >> future RDMA channel? > > The code needs to match the incoming GMP with the logical netdev that > rx's *that GMP*. The fact that goes on to setup an RDMA channel is not > relevant, the nature of the future RDMA channel should not impact how > the GMP is recieved. >From what I understand, ib_cm and rdma_cm keeps their own addresses. I thought that ib_cm's addresses would be used to handle GMPs, and the rdma_cm addresses (id.route.addr) to represent the created RDMA channel. After all, that is what ucma_query_addr returns. So are you proposing that we use the logical netdev that was resolved by the GMP to fill up the source address returned to user-space? It sounds like it would prevent the APM usage you described above. > >> So far we can work without GRH for CM requests, and also without GRH for >> SIDR requests if we rely on layer 3 for the interface resolution. I'm >> not against adding a LLADDR to the protocol somehow, but I don't think >> we should abandon all these use cases and the interoperability with >> existing software. > > Well, there is a middle ground. Lets say we get the LLADDR in the GMP > somehow, then we get 100% correct operation when it is present. > > For degraded operation we have the (device,port,pkey) and possibly > (device,port,pkey,gid) if there was a GRH. We also have the IP address > hack. > > So, I'd say, search in this sequence: > - If the LLADDR is present, just find the right netdev > - Otherwise search for (device,port,pkey) / (device,port,pkey,gid) >If there is only one match then that is unambiguously the correct >device to use. > - Repeat the above search, but add the IP address. This is the hack >we perform for compatibility. > > There is no reason to hackily look at the GMP path parameters if we are > relying on #3 above as the hack to save us in the legacy ambiguous case. > > .. and to answer your question in the other email, I think we should > keep the hack clearly distinct from the proper operation (in fact, > perhaps it should be user configurable). So #3 should be a distinct > step taken when all else fails, not integrated into earlier steps. > > So, this series as it stands just needs to do #2/#3 above and you guys > can figure out the LLADDR business later. Okay. I can add a first search without the IP address. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 08/11] IB/cma: Add net_dev and private data checks to RDMA CM
On 17/06/2015 20:18, Jason Gunthorpe wrote: > On Tue, Jun 16, 2015 at 08:26:26AM +0300, Haggai Eran wrote: >> On 15/06/2015 20:08, Jason Gunthorpe wrote: >>> On Mon, Jun 15, 2015 at 11:47:13AM +0300, Haggai Eran wrote: >>>> Instead of relying on a the ib_cm module to check an incoming CM request's >>>> private data header, add these checks to the RDMA CM module. This allows a >>>> following patch to to clean up the ib_cm interface and remove the code that >>>> looks into the private headers. It will also allow supporting namespaces in >>>> RDMA CM by making these checks namespace aware later on. >>> >>> I was expecting one of these patches to flow the net_device from here: >>> >>>> +static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event, >>>> +const struct cma_req_info *req) >>>> +{ >>> >>> Down through cma_req_handler and cma_new_conn_id so that we get rid of >>> the cma_translate_addr on the ingress side. >>> >>> Having the ingress side use one ingress net_device for all processing >>> seems very important to me... >> >> Is it really very important? I thought the bound_dev_if of a passive >> connection id is only used by the netlink statistics mechanism. > > I mean 'very important' in the sense it makes the RDMA-CM *make > logical sense*, not so much in the 'can user space tell'. > > So yes, cleaning this seems very important to establish that logical > narrative of how the packet flows through this code. > > Plus, there is an init_net in the cma_translate_addr path that needs to > be addressed - so purging cma_translate_addr is a great way to handle > that. That would leave only the call in rdma_bind_addr, and for bind, > the process net namespace is the correct thing to use. Okay, I'll add a patch that cleans these cma_translate_addr calls. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/11] IB/cm: Expose DGID in SIDR request events
On Tuesday, June 16, 2015 7:47 PM, Hefty, Sean wrote: > To: Haggai Eran; Doug Ledford > Cc: linux-r...@vger.kernel.org; netdev@vger.kernel.org; Liran Liss; Guy > Shapiro; Shachar Raindel; Yotam Kenneth; Jason Gunthorpe > Subject: RE: [PATCH 04/11] IB/cm: Expose DGID in SIDR request events > >> The idea is to allow SIDR request to be sorted by the GID, when we will >> have alias GIDs for IPoIB. > > Please limit this series, or at least the early patches in this series, to > simply moving the de-mux out of the ib_cm and into the rdma_cm. Fair enough. I'll remove this patch and the related code from this series. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 06/11] IB/cma: Refactor RDMA IP CM private-data parsing code
On 16/06/2015 01:33, Hefty, Sean wrote: >> -static int cma_save_net_info(struct rdma_cm_id *id, struct rdma_cm_id >> *listen_id, >> - struct ib_cm_event *ib_event) >> +static u16 cma_port_from_service_id(__be64 service_id) >> { >> -struct cma_hdr *hdr; >> +return be64_to_cpu(service_id); >> +} > > Nit - Does the compiler not complain about the cast from u64 to u16? > Apparently it does, but only with W=3 (-Wconversion is included there). W=3 produces about 6k warnings when compiling cma.c, so I don't usually enable it. I'll add a cast there to prevent the warning. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 05/11] IB/cm: Share listening CM IDs
On 16/06/2015 01:13, Hefty, Sean wrote: >> @@ -722,6 +725,7 @@ struct ib_cm_id *ib_create_cm_id(struct ib_device >> *device, >> INIT_LIST_HEAD(&cm_id_priv->work_list); >> atomic_set(&cm_id_priv->work_count, -1); >> atomic_set(&cm_id_priv->refcount, 1); >> +cm_id_priv->listen_sharecount = 1; > > This is setting the listen count before we know whether the cm_id will > actually be used to listen. Right. I'll move it to the new_id case in ib_cm_id_create_and_listen. > > >> return &cm_id_priv->id; >> >> error: >> @@ -847,11 +851,21 @@ retest: >> spin_lock_irq(&cm_id_priv->lock); >> switch (cm_id->state) { >> case IB_CM_LISTEN: >> -cm_id->state = IB_CM_IDLE; >> spin_unlock_irq(&cm_id_priv->lock); >> + >> spin_lock_irq(&cm.lock); >> +if (--cm_id_priv->listen_sharecount > 0) { >> +/* The id is still shared. */ >> +atomic_dec(&cm_id_priv->refcount); >> +spin_unlock_irq(&cm.lock); >> +return; >> +} >> rb_erase(&cm_id_priv->service_node, &cm.listen_service_table); >> spin_unlock_irq(&cm.lock); >> + >> +spin_lock_irq(&cm_id_priv->lock); >> +cm_id->state = IB_CM_IDLE; >> +spin_unlock_irq(&cm_id_priv->lock); > > Why is the state being changed? The cm_id is about to be freed anyway. It matches the rest of the code, but I don't think it is actually being used for listening ids. I will drop it. > > >> break; >> case IB_CM_SIDR_REQ_SENT: >> cm_id->state = IB_CM_IDLE; >> @@ -929,11 +943,32 @@ void ib_destroy_cm_id(struct ib_cm_id *cm_id) >> } >> EXPORT_SYMBOL(ib_destroy_cm_id); >> >> -int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 >> service_mask, >> - struct ib_cm_compare_data *compare_data) >> +/** >> + * __ib_cm_listen - Initiates listening on the specified service ID for >> + * connection and service ID resolution requests. >> + * @cm_id: Connection identifier associated with the listen request. >> + * @service_id: Service identifier matched against incoming connection >> + * and service ID resolution requests. The service ID should be >> specified >> + * network-byte order. If set to IB_CM_ASSIGN_SERVICE_ID, the CM will >> + * assign a service ID to the caller. >> + * @service_mask: Mask applied to service ID used to listen across a >> + * range of service IDs. If set to 0, the service ID is matched >> + * exactly. This parameter is ignored if %service_id is set to >> + * IB_CM_ASSIGN_SERVICE_ID. >> + * @compare_data: This parameter is optional. It specifies data that >> must >> + * appear in the private data of a connection request for the specified >> + * listen request. >> + * @lock: If set, lock the cm.lock spin-lock when adding the id to the >> + * listener tree. When false, the caller must already hold the spin- >> lock, >> + * and compare_data must be NULL. >> + */ >> +static int __ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, >> + __be64 service_mask, >> + struct ib_cm_compare_data *compare_data, >> + bool lock) >> { >> struct cm_id_private *cm_id_priv, *cur_cm_id_priv; >> -unsigned long flags; >> +unsigned long flags = 0; >> int ret = 0; >> >> service_mask = service_mask ? service_mask : ~cpu_to_be64(0); >> @@ -959,7 +994,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 >> service_id, __be64 service_mask, >> >> cm_id->state = IB_CM_LISTEN; >> >> -spin_lock_irqsave(&cm.lock, flags); >> +if (lock) >> +spin_lock_irqsave(&cm.lock, flags); > > I'm not a fan of this sort of locking structure. Why not just move the > locking into the outside calls completely? I.e. move to ib_cm_listen() > instead of passing in true. The reason is that this function can sleep when called compare_data != NULL, allocating the id's compare_data with GFP_KERNEL. But, since the compare_data is going away in a later patch, I can actually fix the locking at that point. I'll change the patch that removes compare_data to also remove the lock parameter. > > >> if (service_id == IB_CM_ASSIGN_SERVICE_ID) { >> cm_id->service_id = cpu_to_be64(cm.listen_service_id++); >> cm_id->service_mask = ~cpu_to_be64(0); >> @@ -968,7 +1004,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 >> service_id, __be64 service_mask, >> cm_id->service_mask = service_mask; >> } >> cur_cm_id_priv = cm_insert_listen(cm_id_priv); >> -spin_unlock_irqrestore(&cm.lock, flags); >> +if (lock) >> +spin_unlock_irqrestore(&cm.lock, flags); >> >> if (cur_cm_id_priv) { >> cm_id->state = IB_CM_IDLE; >> @@ -978,8 +1015,86 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 >> service_id, __be64 service_mask, >> } >>
Re: [PATCH 04/11] IB/cm: Expose DGID in SIDR request events
On 16/06/2015 01:08, Jason Gunthorpe wrote: > On Mon, Jun 15, 2015 at 09:32:53PM +, Hefty, Sean wrote: >>> drivers/infiniband/core/cm.c | 7 +++ >>> include/rdma/ib_cm.h | 2 ++ >>> 2 files changed, 9 insertions(+) >>> >>> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c >>> index c5f5f89e274a..46f99ec4080a 100644 >>> +++ b/drivers/infiniband/core/cm.c >>> @@ -2983,6 +2983,13 @@ static void cm_format_sidr_req_event(struct cm_work >>> *work, >>> param->pkey = __be16_to_cpu(sidr_req_msg->pkey); >>> param->listen_id = listen_id; >>> param->service_id = sidr_req_msg->service_id; >>> + if (work->mad_recv_wc->wc->wc_flags & IB_WC_GRH) { >>> + param->grh = 1; >>> + memcpy(¶m->dgid, &work->mad_recv_wc->recv_buf.grh->dgid, >>> + sizeof(param->dgid)); >>> + } else { >>> + param->grh = 0; >> >> What is the use case here? Are you trying to sort by device? How does the >> GID of the GMP relate to the listen? > > Ouch, that is getting fugly, it used by cma_save_req_info, which feeds > the data into ib_get_net_dev_by_params - basically it chooses the > alias GUID'd netdev to use. > > But how is that going to work? How is the sender to know it should be > sending a GRH with the CM message? If the admin wants to use SIDR with alias GIDs, they will need to configure the system to enable GRH for such GMPs. (This series doesn't include such a patch). > > Falling back to use the primary_path sgid seems like a poor > substitute, if APM is being used that might be a totally different > port than the CM message. Note that the patchset uses primary_path for CM requests always (not as a fallback), and uses GRH as a fallback for SIDR requests. Regarding APM, currently the ib_cm code always sends the GMP to the primary path anyway, right? And in any case, one would expect the primary path's GID to have a valid net_device and local routing rules, so I think for the purpose of demuxing and validating the request using the primary path should be fine. > > I'm also not sure about the pkey, it seems to me that the pkey used to > select the ingress netdevice should be the pkey of the rx'd CM GMP, > not the pkey of the future RDMA channel, so this looks like it should > change: > > +static int cma_save_req_info(const struct ib_cm_event *ib_event, > + struct cma_req_info *req) > [..] > + switch (ib_event->event) { > + case IB_CM_REQ_RECEIVED: > + req->pkey = be16_to_cpu(req_param->primary_path->pkey); > [..] > + case IB_CM_SIDR_REQ_RECEIVED: > + req->pkey = sidr_param->pkey; > > Some comment for the GID, if the GRH is present, then the DGID from > there should alwas be used, not the content of the REQ. Why do you think the GMP's net_device should be used over the one of the future RDMA channel? Thinking about the network namespaces implications, I'm having trouble thinking of a good use case where a port redirector is in one namespace and the future RDMA channel is in another namespace. > > All this is because the CM IP protocol didn't include the LLADDR of > the target's IPoIB interface.. If we are already looking at a breaking > change to force GRH, how hard would it be to add on the LLADDR > somehow instead? So far we can work without GRH for CM requests, and also without GRH for SIDR requests if we rely on layer 3 for the interface resolution. I'm not against adding a LLADDR to the protocol somehow, but I don't think we should abandon all these use cases and the interoperability with existing software. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 04/11] IB/cm: Expose DGID in SIDR request events
On 16/06/2015 00:32, Hefty, Sean wrote: >> drivers/infiniband/core/cm.c | 7 +++ >> include/rdma/ib_cm.h | 2 ++ >> 2 files changed, 9 insertions(+) >> >> diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c >> index c5f5f89e274a..46f99ec4080a 100644 >> --- a/drivers/infiniband/core/cm.c >> +++ b/drivers/infiniband/core/cm.c >> @@ -2983,6 +2983,13 @@ static void cm_format_sidr_req_event(struct cm_work >> *work, >> param->pkey = __be16_to_cpu(sidr_req_msg->pkey); >> param->listen_id = listen_id; >> param->service_id = sidr_req_msg->service_id; >> +if (work->mad_recv_wc->wc->wc_flags & IB_WC_GRH) { >> +param->grh = 1; >> +memcpy(¶m->dgid, &work->mad_recv_wc->recv_buf.grh->dgid, >> + sizeof(param->dgid)); >> +} else { >> +param->grh = 0; > > What is the use case here? Are you trying to sort by device? How does the > GID of the GMP relate to the listen? The idea is to allow SIDR request to be sorted by the GID, when we will have alias GIDs for IPoIB. Unlike the CM requests, SIDR requests do not contain the remote GID, so I thought we could use the GID from the GRH and turn on GRH on such systems. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 02/11] IB/ipoib: Return IPoIB devices matching connection parameters
On 15/06/2015 20:22, Jason Gunthorpe wrote: > On Mon, Jun 15, 2015 at 11:47:07AM +0300, Haggai Eran wrote: > >> +/* Called with an RCU read lock taken */ > > Add _rcu to the name? That is the standard convention. Sure, I'll change that. > >> +/* returns an IPoIB netdev on top a given ipoib device matching a pkey_index >> + * and address, if one exists. */ >> +static struct net_device *ipoib_match_gid_pkey_addr(struct ipoib_dev_priv >> *priv, >> +const union ib_gid *gid, >> +u16 pkey_index, >> +const struct sockaddr *addr) >> +{ >> +struct ipoib_dev_priv *child_priv; >> +struct net_device *net_dev = NULL; >> + >> +if (priv->pkey_index == pkey_index && >> +(!gid || !memcmp(gid, &priv->local_gid, sizeof(*gid { >> +net_dev = ipoib_get_net_dev_match_addr(addr, priv->dev); >> +if (net_dev) >> +return net_dev; > > As I said already, this should not even look at the sockaddr unless > there are multiple possible hits on the other parameters, What is the goal here? The only difference omitting the IP check will make is when sending a request to a matching GID but with the wrong IP. Is it important that we pass these requests here so that they will be dropped at the rdma_cm module? Also, note that ipoib_get_net_dev_match_addr can return a different net_dev from the one ipoib created. When using bonding, it will find the IP address on the bonding device, and return the bonding net_dev instead. > and there > should be a comment explaining the sockaddr is only a hack to make up > for having an incomplete LLADDR. Sure, I'll add a comment. > > That way people not using same guid children do not get incorrect > functionality.. > >> +static struct net_device *ipoib_get_net_dev_by_params( >> +struct ib_device *dev, u8 port, u16 pkey, >> +const union ib_gid *gid, const struct sockaddr *addr) > > [..] > >> +ret = ib_find_cached_pkey(dev, port, pkey, &pkey_index); >> +if (ret) >> +return NULL; >> + >> +if (!rdma_protocol_ib(dev, port)) >> +return NULL; > > This if should be first I'd think. Okay. > > >> +dev_list = ib_get_client_data(dev, &ipoib_client); >> +if (!dev_list) >> +return NULL; > > Is the locking OK here? This access protected by lists_rwsem - > but for instance ib_unregister_device holds only the device_mutex when > calling client->remove, which kfree's dev_list. Looks wrong to me. I think you're right. Perhaps we can switch the client data to NULL in ib_unregister_device under the lists_rwsem. Then the ipoib_get_net_dev_by_params call will know to skip it. The remove() callback will need to be augmented with the client data as a parameter, because it won't be able to retrieve it using ib_get_client_data anymore. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 08/11] IB/cma: Add net_dev and private data checks to RDMA CM
On 15/06/2015 20:08, Jason Gunthorpe wrote: > On Mon, Jun 15, 2015 at 11:47:13AM +0300, Haggai Eran wrote: >> Instead of relying on a the ib_cm module to check an incoming CM request's >> private data header, add these checks to the RDMA CM module. This allows a >> following patch to to clean up the ib_cm interface and remove the code that >> looks into the private headers. It will also allow supporting namespaces in >> RDMA CM by making these checks namespace aware later on. > > I was expecting one of these patches to flow the net_device from here: > >> +static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event, >> + const struct cma_req_info *req) >> +{ > > Down through cma_req_handler and cma_new_conn_id so that we get rid of > the cma_translate_addr on the ingress side. > > Having the ingress side use one ingress net_device for all processing > seems very important to me... Is it really very important? I thought the bound_dev_if of a passive connection id is only used by the netlink statistics mechanism. If you insist, I can set bound_dev_if based on the results of cma_get_net_dev. That would cause the rdma_translate_ip call in cma_translate_addr to exit early after calling rdma_copy_addr with the chosen net_dev. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 07/11] IB/cma: Helper functions to access port space IDRs
Add helper functions to access the IDRs by port-space and port number. Pass around the port-space enum in cma.c instead of using pointers to port-space IDRs. Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel Signed-off-by: Guy Shapiro --- drivers/infiniband/core/cma.c | 81 --- 1 file changed, 60 insertions(+), 21 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index bc061987485d..bb90d462f819 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -113,6 +113,22 @@ static DEFINE_IDR(udp_ps); static DEFINE_IDR(ipoib_ps); static DEFINE_IDR(ib_ps); +static struct idr *cma_idr(enum rdma_port_space ps) +{ + switch (ps) { + case RDMA_PS_TCP: + return &tcp_ps; + case RDMA_PS_UDP: + return &udp_ps; + case RDMA_PS_IPOIB: + return &ipoib_ps; + case RDMA_PS_IB: + return &ib_ps; + default: + return NULL; + } +} + struct cma_device { struct list_headlist; struct ib_device*device; @@ -122,11 +138,33 @@ struct cma_device { }; struct rdma_bind_list { - struct idr *ps; + enum rdma_port_spaceps; struct hlist_head owners; unsigned short port; }; +static int cma_ps_alloc(enum rdma_port_space ps, + struct rdma_bind_list *bind_list, int snum) +{ + struct idr *idr = cma_idr(ps); + + return idr_alloc(idr, bind_list, snum, snum + 1, GFP_KERNEL); +} + +static struct rdma_bind_list *cma_ps_find(enum rdma_port_space ps, int snum) +{ + struct idr *idr = cma_idr(ps); + + return idr_find(idr, snum); +} + +static void cma_ps_remove(enum rdma_port_space ps, int snum) +{ + struct idr *idr = cma_idr(ps); + + idr_remove(idr, snum); +} + enum { CMA_OPTION_AFONLY, }; @@ -1053,7 +1091,7 @@ static void cma_release_port(struct rdma_id_private *id_priv) mutex_lock(&lock); hlist_del(&id_priv->node); if (hlist_empty(&bind_list->owners)) { - idr_remove(bind_list->ps, bind_list->port); + cma_ps_remove(bind_list->ps, bind_list->port); kfree(bind_list); } mutex_unlock(&lock); @@ -2349,8 +2387,8 @@ static void cma_bind_port(struct rdma_bind_list *bind_list, hlist_add_head(&id_priv->node, &bind_list->owners); } -static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv, - unsigned short snum) +static int cma_alloc_port(enum rdma_port_space ps, + struct rdma_id_private *id_priv, unsigned short snum) { struct rdma_bind_list *bind_list; int ret; @@ -2359,7 +2397,7 @@ static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv, if (!bind_list) return -ENOMEM; - ret = idr_alloc(ps, bind_list, snum, snum + 1, GFP_KERNEL); + ret = cma_ps_alloc(ps, bind_list, snum); if (ret < 0) goto err; @@ -2372,7 +2410,8 @@ err: return ret == -ENOSPC ? -EADDRNOTAVAIL : ret; } -static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv) +static int cma_alloc_any_port(enum rdma_port_space ps, + struct rdma_id_private *id_priv) { static unsigned int last_used_port; int low, high, remaining; @@ -2383,7 +2422,7 @@ static int cma_alloc_any_port(struct idr *ps, struct rdma_id_private *id_priv) rover = prandom_u32() % remaining + low; retry: if (last_used_port != rover && - !idr_find(ps, (unsigned short) rover)) { + !cma_ps_find(ps, (unsigned short)rover)) { int ret = cma_alloc_port(ps, id_priv, rover); /* * Remember previously used port number in order to avoid @@ -2438,7 +2477,8 @@ static int cma_check_port(struct rdma_bind_list *bind_list, return 0; } -static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv) +static int cma_use_port(enum rdma_port_space ps, + struct rdma_id_private *id_priv) { struct rdma_bind_list *bind_list; unsigned short snum; @@ -2448,7 +2488,7 @@ static int cma_use_port(struct idr *ps, struct rdma_id_private *id_priv) if (snum < PROT_SOCK && !capable(CAP_NET_BIND_SERVICE)) return -EACCES; - bind_list = idr_find(ps, snum); + bind_list = cma_ps_find(ps, snum); if (!bind_list) { ret = cma_alloc_port(ps, id_priv, snum); } else { @@ -2471,25 +2511,24 @@ static int cma_bind_listen(struct rdma_id_private *id_priv) return ret; } -static struct idr *cma_select_inet_ps(struct rdma_id_pri
[PATCH 06/11] IB/cma: Refactor RDMA IP CM private-data parsing code
When receiving a connection request, rdma_cm needs to associate the request with a network device, in order to disambiguate requests. To do this, it needs to know the request's destination IP. For this the module needs to allow getting this information from the private data in the request packet, instead of relying on the information already being in the listening RDMA CM ID. When creating a new incoming connection ID, the code in cma_save_ip{4,6}_info can no longer rely on the listener's private data to find the port number, so it reads it from the requested service ID. Signed-off-by: Guy Shapiro Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel --- drivers/infiniband/core/cma.c | 150 ++ 1 file changed, 92 insertions(+), 58 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 3b943b700a63..bc061987485d 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -870,97 +870,122 @@ static inline int cma_any_port(struct sockaddr *addr) return !cma_port(addr); } -static void cma_save_ib_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id, +static void cma_save_ib_info(struct sockaddr *src_addr, +struct sockaddr *dst_addr, struct ib_sa_path_rec *path) { - struct sockaddr_ib *listen_ib, *ib; + struct sockaddr_ib *ib; - listen_ib = (struct sockaddr_ib *) &listen_id->route.addr.src_addr; - ib = (struct sockaddr_ib *) &id->route.addr.src_addr; - ib->sib_family = listen_ib->sib_family; - ib->sib_pkey = path->pkey; - ib->sib_flowinfo = path->flow_label; - memcpy(&ib->sib_addr, &path->sgid, 16); - ib->sib_sid = listen_ib->sib_sid; - ib->sib_sid_mask = cpu_to_be64(0xULL); - ib->sib_scope_id = listen_ib->sib_scope_id; - - ib = (struct sockaddr_ib *) &id->route.addr.dst_addr; - ib->sib_family = listen_ib->sib_family; - ib->sib_pkey = path->pkey; - ib->sib_flowinfo = path->flow_label; - memcpy(&ib->sib_addr, &path->dgid, 16); -} - -static __be16 ss_get_port(const struct sockaddr_storage *ss) -{ - if (ss->ss_family == AF_INET) - return ((struct sockaddr_in *)ss)->sin_port; - else if (ss->ss_family == AF_INET6) - return ((struct sockaddr_in6 *)ss)->sin6_port; - BUG(); + if (src_addr) { + ib = (struct sockaddr_ib *)src_addr; + ib->sib_family = AF_IB; + ib->sib_pkey = path->pkey; + ib->sib_flowinfo = path->flow_label; + memcpy(&ib->sib_addr, &path->sgid, 16); + ib->sib_sid = path->service_id; + ib->sib_sid_mask = cpu_to_be64(0xULL); + ib->sib_scope_id = 0; + } + if (dst_addr) { + ib = (struct sockaddr_ib *)dst_addr; + ib->sib_family = AF_IB; + ib->sib_pkey = path->pkey; + ib->sib_flowinfo = path->flow_label; + memcpy(&ib->sib_addr, &path->dgid, 16); + } } -static void cma_save_ip4_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id, - struct cma_hdr *hdr) +static void cma_save_ip4_info(struct sockaddr *src_addr, + struct sockaddr *dst_addr, + struct cma_hdr *hdr, + __be16 local_port) { struct sockaddr_in *ip4; - ip4 = (struct sockaddr_in *) &id->route.addr.src_addr; - ip4->sin_family = AF_INET; - ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr; - ip4->sin_port = ss_get_port(&listen_id->route.addr.src_addr); + if (src_addr) { + ip4 = (struct sockaddr_in *)src_addr; + ip4->sin_family = AF_INET; + ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr; + ip4->sin_port = local_port; + } - ip4 = (struct sockaddr_in *) &id->route.addr.dst_addr; - ip4->sin_family = AF_INET; - ip4->sin_addr.s_addr = hdr->src_addr.ip4.addr; - ip4->sin_port = hdr->port; + if (dst_addr) { + ip4 = (struct sockaddr_in *)dst_addr; + ip4->sin_family = AF_INET; + ip4->sin_addr.s_addr = hdr->src_addr.ip4.addr; + ip4->sin_port = hdr->port; + } } -static void cma_save_ip6_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_id, - struct cma_hdr *hdr) +static void cma_save_ip6_info(struct sockaddr *src_addr, + struct sockaddr *dst_addr, +
[PATCH 05/11] IB/cm: Share listening CM IDs
Enabling network namespaces for RDMA CM will allow processes on different namespaces to listen on the same port. In order to leave namespace support out of the CM layer, this requires that multiple RDMA CM IDs will be able to share a single CM ID. This patch adds infrastructure to retrieve an existing listening ib_cm_id, based on its device and service ID, or create a new one if one does not already exist. It also adds a reference count for such instances (cm_id_private.listen_sharecount), and prevents cm_destroy_id from destroying a CM if it is still shared. See the relevant discussion [1]. [1] Re: [PATCH v3 for-next 05/13] IB/cm: Reference count ib_cm_ids http://www.spinics.net/lists/netdev/msg328860.html Cc: Jason Gunthorpe Signed-off-by: Haggai Eran --- drivers/infiniband/core/cm.c | 127 +-- include/rdma/ib_cm.h | 4 ++ 2 files changed, 125 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 46f99ec4080a..d124a891430b 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -212,6 +212,9 @@ struct cm_id_private { spinlock_t lock;/* Do not acquire inside cm.lock */ struct completion comp; atomic_t refcount; + /* Number of clients sharing this ib_cm_id. Only valid for listeners. +* Protected by the cm.lock spinlock. */ + int listen_sharecount; struct ib_mad_send_buf *msg; struct cm_timewait_info *timewait_info; @@ -722,6 +725,7 @@ struct ib_cm_id *ib_create_cm_id(struct ib_device *device, INIT_LIST_HEAD(&cm_id_priv->work_list); atomic_set(&cm_id_priv->work_count, -1); atomic_set(&cm_id_priv->refcount, 1); + cm_id_priv->listen_sharecount = 1; return &cm_id_priv->id; error: @@ -847,11 +851,21 @@ retest: spin_lock_irq(&cm_id_priv->lock); switch (cm_id->state) { case IB_CM_LISTEN: - cm_id->state = IB_CM_IDLE; spin_unlock_irq(&cm_id_priv->lock); + spin_lock_irq(&cm.lock); + if (--cm_id_priv->listen_sharecount > 0) { + /* The id is still shared. */ + atomic_dec(&cm_id_priv->refcount); + spin_unlock_irq(&cm.lock); + return; + } rb_erase(&cm_id_priv->service_node, &cm.listen_service_table); spin_unlock_irq(&cm.lock); + + spin_lock_irq(&cm_id_priv->lock); + cm_id->state = IB_CM_IDLE; + spin_unlock_irq(&cm_id_priv->lock); break; case IB_CM_SIDR_REQ_SENT: cm_id->state = IB_CM_IDLE; @@ -929,11 +943,32 @@ void ib_destroy_cm_id(struct ib_cm_id *cm_id) } EXPORT_SYMBOL(ib_destroy_cm_id); -int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, -struct ib_cm_compare_data *compare_data) +/** + * __ib_cm_listen - Initiates listening on the specified service ID for + * connection and service ID resolution requests. + * @cm_id: Connection identifier associated with the listen request. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. The service ID should be specified + * network-byte order. If set to IB_CM_ASSIGN_SERVICE_ID, the CM will + * assign a service ID to the caller. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. If set to 0, the service ID is matched + * exactly. This parameter is ignored if %service_id is set to + * IB_CM_ASSIGN_SERVICE_ID. + * @compare_data: This parameter is optional. It specifies data that must + * appear in the private data of a connection request for the specified + * listen request. + * @lock: If set, lock the cm.lock spin-lock when adding the id to the + * listener tree. When false, the caller must already hold the spin-lock, + * and compare_data must be NULL. + */ +static int __ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, + __be64 service_mask, + struct ib_cm_compare_data *compare_data, + bool lock) { struct cm_id_private *cm_id_priv, *cur_cm_id_priv; - unsigned long flags; + unsigned long flags = 0; int ret = 0; service_mask = service_mask ? service_mask : ~cpu_to_be64(0); @@ -959,7 +994,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, cm_id->state = IB_CM_LISTEN; - spin_lock_irqsave(&cm.lock, flags); + if (lock) + spin_lock_irqsave(&cm.lock, flags); if (service_id == IB_CM_ASSIGN_SERVICE_ID) { cm_id->service_id = cpu_to_b
[PATCH 10/11] IB/cma: Share ib_cm_ids between rdma_cm_ids
Use ib_cm_id_create_and_listen to create listening IB CM IDs or share existing ones if needed. When given a request on a specific CM ID, the code now matches the request to the RDMA CM ID based on the request parameters, so it no longer needs to rely on the ib_cm's private data matching capabilities. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cma.c | 60 --- 1 file changed, 5 insertions(+), 55 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 74e562ec5b93..c6d5fa7ab64a 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -1722,42 +1722,6 @@ __be64 rdma_get_service_id(struct rdma_cm_id *id, struct sockaddr *addr) } EXPORT_SYMBOL(rdma_get_service_id); -static void cma_set_compare_data(enum rdma_port_space ps, struct sockaddr *addr, -struct ib_cm_compare_data *compare) -{ - struct cma_hdr *cma_data, *cma_mask; - __be32 ip4_addr; - struct in6_addr ip6_addr; - - memset(compare, 0, sizeof *compare); - cma_data = (void *) compare->data; - cma_mask = (void *) compare->mask; - - switch (addr->sa_family) { - case AF_INET: - ip4_addr = ((struct sockaddr_in *) addr)->sin_addr.s_addr; - cma_set_ip_ver(cma_data, 4); - cma_set_ip_ver(cma_mask, 0xF); - if (!cma_any_addr(addr)) { - cma_data->dst_addr.ip4.addr = ip4_addr; - cma_mask->dst_addr.ip4.addr = htonl(~0); - } - break; - case AF_INET6: - ip6_addr = ((struct sockaddr_in6 *) addr)->sin6_addr; - cma_set_ip_ver(cma_data, 6); - cma_set_ip_ver(cma_mask, 0xF); - if (!cma_any_addr(addr)) { - cma_data->dst_addr.ip6 = ip6_addr; - memset(&cma_mask->dst_addr.ip6, 0xFF, - sizeof cma_mask->dst_addr.ip6); - } - break; - default: - break; - } -} - static int cma_iw_handler(struct iw_cm_id *iw_id, struct iw_cm_event *iw_event) { struct rdma_id_private *id_priv = iw_id->context; @@ -1911,33 +1875,19 @@ out: static int cma_ib_listen(struct rdma_id_private *id_priv) { - struct ib_cm_compare_data compare_data; struct sockaddr *addr; struct ib_cm_id *id; __be64 svc_id; - int ret; - id = ib_create_cm_id(id_priv->id.device, cma_req_handler, id_priv); + addr = cma_src_addr(id_priv); + svc_id = rdma_get_service_id(&id_priv->id, addr); + id = ib_cm_id_create_and_listen(id_priv->id.device, cma_req_handler, + svc_id, 0); if (IS_ERR(id)) return PTR_ERR(id); - id_priv->cm_id.ib = id; - addr = cma_src_addr(id_priv); - svc_id = rdma_get_service_id(&id_priv->id, addr); - if (cma_any_addr(addr) && !id_priv->afonly) - ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL); - else { - cma_set_compare_data(id_priv->id.ps, addr, &compare_data); - ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, &compare_data); - } - - if (ret) { - ib_destroy_cm_id(id_priv->cm_id.ib); - id_priv->cm_id.ib = NULL; - } - - return ret; + return 0; } static int cma_iw_listen(struct rdma_id_private *id_priv, int backlog) -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 04/11] IB/cm: Expose DGID in SIDR request events
The destination GID can be used to uniquely resolve the request. Expose the GID in SIDR request events when it is available, so that the rdma_cm module can use that information. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cm.c | 7 +++ include/rdma/ib_cm.h | 2 ++ 2 files changed, 9 insertions(+) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index c5f5f89e274a..46f99ec4080a 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -2983,6 +2983,13 @@ static void cm_format_sidr_req_event(struct cm_work *work, param->pkey = __be16_to_cpu(sidr_req_msg->pkey); param->listen_id = listen_id; param->service_id = sidr_req_msg->service_id; + if (work->mad_recv_wc->wc->wc_flags & IB_WC_GRH) { + param->grh = 1; + memcpy(¶m->dgid, &work->mad_recv_wc->recv_buf.grh->dgid, + sizeof(param->dgid)); + } else { + param->grh = 0; + } param->port = work->port->port_num; work->cm_event.private_data = &sidr_req_msg->private_data; } diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index 1b567bbc3ad4..3a5d70d79790 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -224,6 +224,8 @@ struct ib_cm_apr_event_param { struct ib_cm_sidr_req_event_param { struct ib_cm_id *listen_id; __be64 service_id; + union ib_giddgid; + u8 grh:1; u8 port; u16 pkey; }; -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 00/11] Demux IB CM requests in the rdma_cm module
The rdma_cm module relies today on the ib_cm module to demux incoming requests based on their service ID and IP address. The ib_cm module is the wrong place to perform this task, as it can also be used with services that do not adhere to the RDMA IP CM service as defined in the IBA specifications. It is forced to use an opaque private data struct and mask to compare incoming requests against. This series moves that demux task responsibility to the rdma_cm module. The rdma_cm module can look into the private data attached to a CM request, containing the IP addresses related to the request. It uses the details of the request to find the net device associated with the request, and use that net device to find the correct listening rdma_cm_id. The series applies against Doug's for-v4.2 tree with the patch adding a rwsem to IB core [2] applied. The series is structured as follows: Patches 1-2 add the ability to lookup a network device according to the IB device, port, P_Key, GID and IP address. They find the matching IPoIB interfaces, and return a matching net_device if one exists. Patches 3-5 make necessary changes in ib_cm to allow RDMA CM get the information it needs out of CM and SIDR requests, and share a single ib_cm_id with multiple RDMA CM listeners. Patches 6-7 do some preliminary refactoring to the rdma_cm module. They allow extracting information out of incoming requests instead of retrieving them from a listening CM ID, and add helper functions to access the port space IDRs. Finally, patches 8-10 change rdma_cm to demultiplex requests on its own, and patch 11 cleans up the now unneeded code in ib_cm to compare against the private data. This series contains a subset of the RDMA CM namespaces patches [1]. The changes from v4 of the relevant patches are: - Patch 1 * in addition to the IB device, port, P_Key and IP address, pass also the GID, to make future IPoIB devices with alias GIDs to unique. * return the matching net_device instead of a network namespace. - Patch 2: use IS_ENABLED(CONFIG_IPV6) without ifdefs. - Patch 5: * rename sharecount -> listen_sharecount. * use a regular int instead of atomic for the share count, protected by the cm.lock spinlock. * change id destruction and shared listener creation to prevent the case where an id is found but it is under destruction. [1] [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM http://www.spinics.net/lists/linux-rdma/msg25244.html [2] [PATCH for-next V5 02/12] IB/core: Add rwsem to allow reading device list or client list http://www.spinics.net/lists/linux-rdma/msg25931.html Guy Shapiro (1): IB/ipoib: Return IPoIB devices matching connection parameters Haggai Eran (9): IB/cm: Expose service ID in request events IB/cm: Expose DGID in SIDR request events IB/cm: Share listening CM IDs IB/cma: Refactor RDMA IP CM private-data parsing code IB/cma: Helper functions to access port space IDRs IB/cma: Add net_dev and private data checks to RDMA CM IB/cma: validate routing of incoming requests IB/cma: Share ib_cm_ids between rdma_cm_ids IB/cm: Remove compare_data checks Yotam Kenneth (1): IB/core: Find the network device matching connection parameters drivers/infiniband/core/cm.c | 204 +++ drivers/infiniband/core/cma.c | 555 ++ drivers/infiniband/core/device.c | 29 ++ drivers/infiniband/core/ucm.c | 3 +- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +- drivers/infiniband/ulp/ipoib/ipoib_main.c | 141 +++- drivers/infiniband/ulp/srpt/ib_srpt.c | 2 +- include/rdma/ib_cm.h | 21 +- include/rdma/ib_verbs.h | 35 ++ 9 files changed, 765 insertions(+), 227 deletions(-) -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 02/11] IB/ipoib: Return IPoIB devices matching connection parameters
From: Guy Shapiro Implement the get_net_device_by_port_pkey_ip callback that returns network device to ib_core according to connection parameters. Check the ipoib device and iterate over all child devices to look for a match. For each IPoIB device we iterate through all upper devices when searching for a matching IP, in order to support bonding. Signed-off-by: Guy Shapiro Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 141 +- 1 file changed, 140 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index da149c278cb8..15af32665a87 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -48,6 +48,9 @@ #include #include +#include +#include +#include #define DRV_VERSION "1.0.0" @@ -91,11 +94,15 @@ struct ib_sa_client ipoib_sa_client; static void ipoib_add_one(struct ib_device *device); static void ipoib_remove_one(struct ib_device *device); static void ipoib_neigh_reclaim(struct rcu_head *rp); +static struct net_device *ipoib_get_net_dev_by_params( + struct ib_device *dev, u8 port, u16 pkey, + const union ib_gid *gid, const struct sockaddr *addr); static struct ib_client ipoib_client = { .name = "ipoib", .add= ipoib_add_one, - .remove = ipoib_remove_one + .remove = ipoib_remove_one, + .get_net_dev_by_params = ipoib_get_net_dev_by_params, }; int ipoib_open(struct net_device *dev) @@ -222,6 +229,138 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu) return 0; } +/* Called with an RCU read lock taken */ +static bool ipoib_is_dev_match_addr(const struct sockaddr *addr, + struct net_device *dev) +{ + struct net *net = dev_net(dev); + struct in_device *in_dev; + struct sockaddr_in *addr_in = (struct sockaddr_in *)addr; + struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr; + __be32 ret_addr; + + switch (addr->sa_family) { + case AF_INET: + in_dev = in_dev_get(dev); + if (!in_dev) + return false; + + ret_addr = inet_confirm_addr(net, in_dev, 0, +addr_in->sin_addr.s_addr, +RT_SCOPE_HOST); + in_dev_put(in_dev); + if (ret_addr) + return true; + + break; + case AF_INET6: + if (IS_ENABLED(CONFIG_IPV6) && + ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1)) + return true; + + break; + } + return false; +} + +/** + * Find a net_device matching the given address, which is an upper device of + * the given net_device. + * @addr: IP address to look for. + * @dev: base IPoIB net_device + * + * If found, returns the net_device with a reference held. Otherwise return + * NULL. + */ +static struct net_device *ipoib_get_net_dev_match_addr( + const struct sockaddr *addr, struct net_device *dev) +{ + struct net_device *upper, + *result = NULL; + struct list_head *iter; + + rcu_read_lock(); + if (ipoib_is_dev_match_addr(addr, dev)) { + dev_hold(dev); + result = dev; + goto out; + } + + netdev_for_each_all_upper_dev_rcu(dev, upper, iter) { + if (ipoib_is_dev_match_addr(addr, upper)) { + dev_hold(upper); + result = upper; + break; + } + } +out: + rcu_read_unlock(); + return result; +} + +/* returns an IPoIB netdev on top a given ipoib device matching a pkey_index + * and address, if one exists. */ +static struct net_device *ipoib_match_gid_pkey_addr(struct ipoib_dev_priv *priv, + const union ib_gid *gid, + u16 pkey_index, + const struct sockaddr *addr) +{ + struct ipoib_dev_priv *child_priv; + struct net_device *net_dev = NULL; + + if (priv->pkey_index == pkey_index && + (!gid || !memcmp(gid, &priv->local_gid, sizeof(*gid { + net_dev = ipoib_get_net_dev_match_addr(addr, priv->dev); + if (net_dev) + return net_dev; + } + + /* Check child interfaces */ + down_read(&priv->vlan_rwsem); + list_for_each_entry(child_priv, &priv->child_intfs, list) { + net_dev = ipoib_match_gid_pkey_addr(child_priv, gid, +
[PATCH 03/11] IB/cm: Expose service ID in request events
Expose the service ID on an incoming CM or SIDR request to the event handler. This will allow the RDMA CM module to de-multiplex connection requests based on the information encoded in the service ID. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cm.c | 3 +++ include/rdma/ib_cm.h | 1 + 2 files changed, 4 insertions(+) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 32063add9280..c5f5f89e274a 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -1256,6 +1256,7 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, primary_path->packet_life_time = cm_req_get_primary_local_ack_timeout(req_msg); primary_path->packet_life_time -= (primary_path->packet_life_time > 0); + primary_path->service_id = req_msg->service_id; if (req_msg->alt_local_lid) { memset(alt_path, 0, sizeof *alt_path); @@ -1277,6 +1278,7 @@ static void cm_format_paths_from_req(struct cm_req_msg *req_msg, alt_path->packet_life_time = cm_req_get_alt_local_ack_timeout(req_msg); alt_path->packet_life_time -= (alt_path->packet_life_time > 0); + alt_path->service_id = req_msg->service_id; } } @@ -2980,6 +2982,7 @@ static void cm_format_sidr_req_event(struct cm_work *work, param = &work->cm_event.param.sidr_req_rcvd; param->pkey = __be16_to_cpu(sidr_req_msg->pkey); param->listen_id = listen_id; + param->service_id = sidr_req_msg->service_id; param->port = work->port->port_num; work->cm_event.private_data = &sidr_req_msg->private_data; } diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index 39ed2d2fbd51..1b567bbc3ad4 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -223,6 +223,7 @@ struct ib_cm_apr_event_param { struct ib_cm_sidr_req_event_param { struct ib_cm_id *listen_id; + __be64 service_id; u8 port; u16 pkey; }; -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 11/11] IB/cm: Remove compare_data checks
Now that there are no ib_cm clients using the compare_data feature for matching IB CM requests' private data, remove the compare_data parameter of ib_cm_listen and remove the code implementing the feature. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cm.c| 87 - drivers/infiniband/core/ucm.c | 3 +- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 2 +- drivers/infiniband/ulp/srpt/ib_srpt.c | 2 +- include/rdma/ib_cm.h| 14 +- 5 files changed, 14 insertions(+), 94 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index d124a891430b..2d769aad5c4f 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -221,7 +221,6 @@ struct cm_id_private { /* todo: use alternate port on send failure */ struct cm_av av; struct cm_av alt_av; - struct ib_cm_compare_data *compare_data; void *private_data; __be64 tid; @@ -442,40 +441,6 @@ static struct cm_id_private * cm_acquire_id(__be32 local_id, __be32 remote_id) return cm_id_priv; } -static void cm_mask_copy(u32 *dst, const u32 *src, const u32 *mask) -{ - int i; - - for (i = 0; i < IB_CM_COMPARE_SIZE; i++) - dst[i] = src[i] & mask[i]; -} - -static int cm_compare_data(struct ib_cm_compare_data *src_data, - struct ib_cm_compare_data *dst_data) -{ - u32 src[IB_CM_COMPARE_SIZE]; - u32 dst[IB_CM_COMPARE_SIZE]; - - if (!src_data || !dst_data) - return 0; - - cm_mask_copy(src, src_data->data, dst_data->mask); - cm_mask_copy(dst, dst_data->data, src_data->mask); - return memcmp(src, dst, sizeof(src)); -} - -static int cm_compare_private_data(u32 *private_data, - struct ib_cm_compare_data *dst_data) -{ - u32 src[IB_CM_COMPARE_SIZE]; - - if (!dst_data) - return 0; - - cm_mask_copy(src, private_data, dst_data->mask); - return memcmp(src, dst_data->data, sizeof(src)); -} - /* * Trivial helpers to strip endian annotation and compare; the * endianness doesn't actually matter since we just need a stable @@ -508,18 +473,14 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) struct cm_id_private *cur_cm_id_priv; __be64 service_id = cm_id_priv->id.service_id; __be64 service_mask = cm_id_priv->id.service_mask; - int data_cmp; while (*link) { parent = *link; cur_cm_id_priv = rb_entry(parent, struct cm_id_private, service_node); - data_cmp = cm_compare_data(cm_id_priv->compare_data, - cur_cm_id_priv->compare_data); if ((cur_cm_id_priv->id.service_mask & service_id) == (service_mask & cur_cm_id_priv->id.service_id) && - (cm_id_priv->id.device == cur_cm_id_priv->id.device) && - !data_cmp) + (cm_id_priv->id.device == cur_cm_id_priv->id.device)) return cur_cm_id_priv; if (cm_id_priv->id.device < cur_cm_id_priv->id.device) @@ -530,8 +491,6 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) link = &(*link)->rb_left; else if (be64_gt(service_id, cur_cm_id_priv->id.service_id)) link = &(*link)->rb_right; - else if (data_cmp < 0) - link = &(*link)->rb_left; else link = &(*link)->rb_right; } @@ -541,20 +500,16 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) } static struct cm_id_private * cm_find_listen(struct ib_device *device, -__be64 service_id, -u32 *private_data) +__be64 service_id) { struct rb_node *node = cm.listen_service_table.rb_node; struct cm_id_private *cm_id_priv; - int data_cmp; while (node) { cm_id_priv = rb_entry(node, struct cm_id_private, service_node); - data_cmp = cm_compare_private_data(private_data, - cm_id_priv->compare_data); if ((cm_id_priv->id.service_mask & service_id) == cm_id_priv->id.service_id && - (cm_id_priv->id.device == device) && !data_cmp) + (cm_id_priv->id.device == device)) return cm_id_priv; if (device < cm_id_priv->
[PATCH 08/11] IB/cma: Add net_dev and private data checks to RDMA CM
Instead of relying on a the ib_cm module to check an incoming CM request's private data header, add these checks to the RDMA CM module. This allows a following patch to to clean up the ib_cm interface and remove the code that looks into the private headers. It will also allow supporting namespaces in RDMA CM by making these checks namespace aware later on. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cma.c | 170 +- 1 file changed, 168 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index bb90d462f819..a43bbd57400c 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -300,7 +300,7 @@ static enum rdma_cm_state cma_exch(struct rdma_id_private *id_priv, return old; } -static inline u8 cma_get_ip_ver(struct cma_hdr *hdr) +static inline u8 cma_get_ip_ver(const struct cma_hdr *hdr) { return hdr->ip_version >> 4; } @@ -1024,6 +1024,169 @@ static int cma_save_net_info(struct sockaddr *src_addr, return cma_save_ip_info(src_addr, dst_addr, ib_event, service_id); } +struct cma_req_info { + struct ib_device *device; + int port; + const union ib_gid *local_gid; + __be64 service_id; + u16 pkey; +}; + +static int cma_save_req_info(const struct ib_cm_event *ib_event, +struct cma_req_info *req) +{ + const struct ib_cm_req_event_param *req_param = + &ib_event->param.req_rcvd; + const struct ib_cm_sidr_req_event_param *sidr_param = + &ib_event->param.sidr_req_rcvd; + + switch (ib_event->event) { + case IB_CM_REQ_RECEIVED: + req->device = req_param->listen_id->device; + req->port = req_param->port; + req->local_gid = &req_param->primary_path->sgid; + req->service_id = req_param->primary_path->service_id; + req->pkey = be16_to_cpu(req_param->primary_path->pkey); + break; + case IB_CM_SIDR_REQ_RECEIVED: + req->device = sidr_param->listen_id->device; + req->port = sidr_param->port; + req->local_gid = sidr_param->grh ? &sidr_param->dgid : NULL; + req->service_id = sidr_param->service_id; + req->pkey = sidr_param->pkey; + break; + default: + return -EINVAL; + } + + return 0; +} + +static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event, + const struct cma_req_info *req) +{ + struct sockaddr_storage listen_addr_storage; + struct sockaddr *listen_addr = (struct sockaddr *)&listen_addr_storage; + struct net_device *net_dev; + int err; + + err = cma_save_ip_info(listen_addr, NULL, ib_event, req->service_id); + if (err) + return ERR_PTR(err); + + net_dev = ib_get_net_dev_by_params(req->device, req->port, req->pkey, + req->local_gid, listen_addr); + if (!net_dev) + return ERR_PTR(-ENODEV); + + return net_dev; +} + +static enum rdma_port_space rdma_ps_from_service_id(__be64 service_id) +{ + return (be64_to_cpu(service_id) >> 16) & 0x; +} + +static bool cma_match_private_data(struct rdma_id_private *id_priv, + const struct cma_hdr *hdr) +{ + struct sockaddr *addr = cma_src_addr(id_priv); + __be32 ip4_addr; + struct in6_addr ip6_addr; + + if (cma_any_addr(addr) && !id_priv->afonly) + return true; + + switch (addr->sa_family) { + case AF_INET: + ip4_addr = ((struct sockaddr_in *)addr)->sin_addr.s_addr; + if (cma_get_ip_ver(hdr) != 4) + return false; + if (!cma_any_addr(addr) && + hdr->dst_addr.ip4.addr != ip4_addr) + return false; + break; + case AF_INET6: + ip6_addr = ((struct sockaddr_in6 *)addr)->sin6_addr; + if (cma_get_ip_ver(hdr) != 6) + return false; + if (!cma_any_addr(addr) && + memcmp(&hdr->dst_addr.ip6, &ip6_addr, sizeof(ip6_addr))) + return false; + break; + default: + return false; + } + + return true; +} + +static bool cma_match_net_dev(const struct rdma_id_private *id_priv, + const struct net_device *net_dev) +{ + const struct rdma_dev_addr *addr = &id_priv->id.route.addr.dev_addr; + + return !addr->bound_dev_if || + (
[PATCH 09/11] IB/cma: validate routing of incoming requests
Pass incoming request parameters through the relevant IPv4/IPv6 routing tables and make sure the network stack is configured to handle such requests. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cma.c | 100 -- 1 file changed, 97 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index a43bbd57400c..74e562ec5b93 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -46,6 +46,8 @@ #include #include +#include +#include #include #include @@ -1062,15 +1064,102 @@ static int cma_save_req_info(const struct ib_cm_event *ib_event, return 0; } +static bool validate_ipv4_net_dev(struct net_device *net_dev, + const struct sockaddr_in *dst_addr, + const struct sockaddr_in *src_addr) +{ + struct in_device *in_dev = in_dev_get(net_dev); + __be32 daddr = dst_addr->sin_addr.s_addr, + saddr = src_addr->sin_addr.s_addr; + struct fib_result res; + struct flowi4 fl4; + int err; + bool ret = false; + + if (!in_dev) + return false; + + if (ipv4_is_multicast(saddr) || ipv4_is_lbcast(saddr) || + ipv4_is_lbcast(daddr) || ipv4_is_zeronet(saddr) || + ipv4_is_zeronet(daddr) || ipv4_is_loopback(daddr) || + ipv4_is_loopback(saddr)) + goto out; + + memset(&fl4, 0, sizeof(fl4)); + fl4.flowi4_iif = net_dev->ifindex; + fl4.daddr = daddr; + fl4.saddr = saddr; + err = fib_lookup(dev_net(net_dev), &fl4, &res); + if (err) + goto out; + + if (res.fi->fib_dev != net_dev) + goto out; + + ret = true; +out: + in_dev_put(in_dev); + return ret; +} + +static bool validate_ipv6_net_dev(struct net_device *net_dev, + const struct sockaddr_in6 *dst_addr, + const struct sockaddr_in6 *src_addr) +{ +#if IS_ENABLED(CONFIG_IPV6) + const int strict = ipv6_addr_type(&dst_addr->sin6_addr) & + IPV6_ADDR_LINKLOCAL; + struct rt6_info *rt = rt6_lookup(dev_net(net_dev), &dst_addr->sin6_addr, +&src_addr->sin6_addr, net_dev->ifindex, +strict); + bool ret; + + if (!rt) + return false; + + ret = rt->rt6i_idev->dev == net_dev; + ip6_rt_put(rt); + + return ret; +#else + return false; +#endif +} + +static bool validate_net_dev(struct net_device *net_dev, +const struct sockaddr *daddr, +const struct sockaddr *saddr) +{ + const struct sockaddr_in *daddr4 = (const struct sockaddr_in *)daddr; + const struct sockaddr_in *saddr4 = (const struct sockaddr_in *)saddr; + const struct sockaddr_in6 *daddr6 = (const struct sockaddr_in6 *)daddr; + const struct sockaddr_in6 *saddr6 = (const struct sockaddr_in6 *)saddr; + + switch (daddr->sa_family) { + case AF_INET: + return saddr->sa_family == AF_INET && + validate_ipv4_net_dev(net_dev, daddr4, saddr4); + + case AF_INET6: + return saddr->sa_family == AF_INET6 && + validate_ipv6_net_dev(net_dev, daddr6, saddr6); + + default: + return false; + } +} + static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event, const struct cma_req_info *req) { - struct sockaddr_storage listen_addr_storage; - struct sockaddr *listen_addr = (struct sockaddr *)&listen_addr_storage; + struct sockaddr_storage listen_addr_storage, src_addr_storage; + struct sockaddr *listen_addr = (struct sockaddr *)&listen_addr_storage, + *src_addr = (struct sockaddr *)&src_addr_storage; struct net_device *net_dev; int err; - err = cma_save_ip_info(listen_addr, NULL, ib_event, req->service_id); + err = cma_save_ip_info(listen_addr, src_addr, ib_event, + req->service_id); if (err) return ERR_PTR(err); @@ -1079,6 +1168,11 @@ static struct net_device *cma_get_net_dev(struct ib_cm_event *ib_event, if (!net_dev) return ERR_PTR(-ENODEV); + if (!validate_net_dev(net_dev, listen_addr, src_addr)) { + dev_put(net_dev); + return ERR_PTR(-EHOSTUNREACH); + } + return net_dev; } -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH 01/11] IB/core: Find the network device matching connection parameters
From: Yotam Kenneth In the case of IPoIB, and maybe in other cases, the network device is managed by an upper-layer protocol (ULP). In order to expose this network device to other users of the IB device, let ULPs implement a callback that returns network device according to connection parameters. The IB device and port, together with the P_Key and the GID should be enough to uniquely identify the ULP net device. However, in current kernels there can be multiple IPoIB interfaces created with the same GID. Furthermore, such configuration may be desireable to support ipvlan-like configurations for RDMA CM with IPoIB. To resolve the device in these cases the code will also take the IP address as an additional input. Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel Signed-off-by: Guy Shapiro --- drivers/infiniband/core/device.c | 29 + include/rdma/ib_verbs.h | 35 +++ 2 files changed, 64 insertions(+) diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index f08d438205ed..300609e8f841 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include "core_priv.h" @@ -746,6 +747,34 @@ int ib_find_pkey(struct ib_device *device, } EXPORT_SYMBOL(ib_find_pkey); +struct net_device *ib_get_net_dev_by_params(struct ib_device *dev, + u8 port, + u16 pkey, + const union ib_gid *gid, + const struct sockaddr *addr) +{ + struct net_device *net_dev = NULL; + struct ib_client *client; + + if (!rdma_protocol_ib(dev, port)) + return NULL; + + down_read(&lists_rwsem); + + list_for_each_entry(client, &client_list, list) + if (client->get_net_dev_by_params) { + net_dev = client->get_net_dev_by_params(dev, port, pkey, + gid, addr); + if (net_dev) + break; + } + + up_read(&lists_rwsem); + + return net_dev; +} +EXPORT_SYMBOL(ib_get_net_dev_by_params); + static int __init ib_core_init(void) { int ret; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 986fddb08579..03899eebf53c 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -48,6 +48,7 @@ #include #include #include +#include #include #include @@ -1762,6 +1763,26 @@ struct ib_client { void (*add) (struct ib_device *); void (*remove)(struct ib_device *); + /* Returns the net_dev belonging to this ib_client and matching the +* given parameters. +* @dev:An RDMA device that the net_dev use for communication. +* @port: A physical port number on the RDMA device. +* @pkey: P_Key that the net_dev uses if applicable. +* @gid:A GID that the net_dev uses to communicate. +* @addr: An IP address the net_dev is configured with. +* +* An ib_client that implements a net_dev on top of RDMA devices +* (such as IP over IB) should implement this callback, allowing the +* rdma_cm module to find the right net_dev for a given request. +* +* The caller is responsible for calling dev_put on the returned +* netdev. */ + struct net_device *(*get_net_dev_by_params)( + struct ib_device *dev, + u8 port, + u16 pkey, + const union ib_gid *gid, + const struct sockaddr *addr); struct list_head list; }; @@ -3026,4 +3047,18 @@ static inline int ib_check_mr_access(int flags) int ib_check_mr_status(struct ib_mr *mr, u32 check_mask, struct ib_mr_status *mr_status); +/** + * ib_get_net_dev_by_params() - Return the appropriate net_dev + * for a received CM request + * @dev: An RDMA device on which the request has been received. + * @port: Port number on the RDMA device. + * @pkey: The Pkey the request came on. + * @gid: A GID that the net_dev uses to communicate. + * @addr: Contains the IP address that the request specified as its + * destination. + */ +struct net_device *ib_get_net_dev_by_params(struct ib_device *dev, u8 port, + u16 pkey, const union ib_gid *gid, + const struct sockaddr *addr); + #endif /* IB_VERBS_H */ -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.k
Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
On 04/06/2015 19:40, Jason Gunthorpe wrote: > Discussion idea: Can we actually use the netstack to process the > RDMA-CM packets? It looks like the netstack wants a skb to do this > mid-layer work, so rdma-cm would have to synthesize a skb for the CM > packets and pass it through netdev to apply all the transformations > and access the various internal states (eg from ipvlan, bonding, > etc). rdma-cm would have to 'catch' the skb once it is done traveling > and resume its normal processing. Very similar to your notion of using > UDP, but without any on-the-wire change. > > This would fit in that same ingress spot I suggested adding the > routing lookup, instead of routing we want the full stack to have a go > at figuring out the final netdev. > > This seems the most general because it will work for all the *vlan > type drivers, bonding, and all of the RDMA technologies. (each would > have a slightly different way to make the skb, but same basic idea) > > Lots and lots of details to do that, but conceptually it seems pretty > solid? The problem is that the network stack can do all sort of changes to the packets (like NAT), and it may be the case that the hardware can't reflect these changes later on when creating a QP. I think it would be best to stick with resolving the net_dev using the request parameters, and the simpler routing lookup. This way RDMA CM remains in control, and if the user configures routing in an unexpected way, it can just block the request. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
On 04/06/2015 00:45, Jason Gunthorpe wrote: > On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote: >> As Haggai wrote, if we let the using IP address thing to fly up, we have >> support for RDMA in containers using the RDMA-CM at IPoIB environments. >> This will let people test, use, experiment, fix, interact (and even >> production-it when static IP address assignment scheme is used). > > Sure, I think we all understand the goal, and you've explained some > reasonable use cases for the child support. > >> Later, usage of alias GUIDs for IPoIB RTNL childs would allow to >> remove the IP thing. > > How do we remove it? Along with same-guid child support? What is your > idea here? > >>> Also, now that this has been brought up, I think you need to make a >>> patch to fix the IPv6 SLAAC breakage this caused. It looks trivial to >>> modify addrconf_ifid_infiniband to return error if the IPoIB child is >>> sharing a guid. It was not good at all to push the child patches >>> forward to 3.6/3.7 if you knew that IPv6 SLAAC was broken by them. >> >> Till the alias GUID thing is introduced, maybe we can patch >> addrconf_ifid_infiniband to use the QPN value from the device HW >> address to come up with unique IPv6 link local address, agree? where >> you think we can place the 24 bits QPN? > > I don't know if that is a good idea, an unstable SLAAC is not in > spirit with the RFCs. The safest bet is to return error and disable > SLAAC completely. Maybe this is a silly question, but doesn't DAD already disable SLAAC addresses when there's a conflict? Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
On 04/06/2015 02:48, Jason Gunthorpe wrote: > On Wed, Jun 03, 2015 at 11:07:37PM +0300, Or Gerlitz wrote: > >>> I'm mostly fine with it as an optional capability, similar to macvlan, >>> I just don't see how to cleanly integrate it with RDMA CM and >>> namespaces. And I don't see what RDMA CM is supposed to do when >>> it hits this case. >>> >>> So, any ideas that don't involve the searching for IP hack?? >>> >>> [And yes, as discussed with Haggie, it is not the worst hack in the >>> world, and maybe we can live with it, but lets understand the trade >>> offs carefully] >> >> As Haggai wrote, if we let the using IP address thing to fly up, we have >> support for RDMA in containers using the RDMA-CM at IPoIB environments. >> This will let people test, use, experiment, fix, interact (and even >> production-it when static IP address assignment scheme is used). > > I just noticed ipvlan got merged a few months ago.. That certainly > changed my view on this topic. It is basically a software > version of the same-guid ipoib children scheme. Similar issues: Same MAC > address as the parent, IPv6 SLAAC is disabled (?), DHCP has similar > issue (solved with RFC4361, and broadcasting fallback, it seems).. > > The l2/l3 distinction in ipvlan is also very interesting. The L3 mode > solves some of the security type issues. What do you think Haggi? I think some issues ipvlan is trying to solve would also affect us using the alias GUIDs solution. ipvlan tries to solve among other the problem of a limited MAC filter table in NICs, and avoid using promiscuous mode. But the GID table is also limited, and we don't have something like promiscuous mode for GIDs in InfiniBand. For large scale use of containers we would need to also allow the current model. As for L3 mode, it does seem more restrictive, as all routing decisions are done in the controlling namespace. Our current ipoib child interface implementation is more like the L2 version of ipvlan. > > Is there any chance standard things like ipvlan and macvlan could be > used with rdma-cm if their master devices are IPoIB? These standard interfaces seem very much connected with Ethernet (both have an ARPHDR_ETHER-only check for their upper devices). I think macvlan's functionality would be covered by adding alias GUIDs to ipoib, and ipvlan L2 is covered by the current behavior. Perhaps it would be beneficial to try and make ipvlan more generic so that it would work over ipoib, giving us support for L3 mode. As for rdma-cm support, the patch I had for ipoib attempts to scan each child's upper devices in order to support such topologies. We only tested it with bonding, but I think it would also work with such devices. > Are we even on > the right path to do that someday? Is that the plan for roce? Yes, for RoCE our goal for the start was to support namespaces in RDMA CM through macvlan devices. As long as we can update the RoCE gid table correctly for macvlan and ipvlan devices, the RDMA CM implementation shouldn't care where the details come from. > Any thoughts on the idea we still need ipoib same-guid children if > ipvlan is available? If we port ipvlan to work over IPoIB interfaces and not just Ethernet, then ipvlan L2 would provide exactly the same functionality. There onyl difference I can think of is that ipvlan would use a single UD QP for all devices (and in connected-mode, a single RC QP between a pair of hosts), while ipoib would use a QP per child device, and multiple RC QPs for such pairs. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
On 28/05/2015 18:46, Jason Gunthorpe wrote: > On Thu, May 28, 2015 at 04:22:36PM +0300, Haggai Eran wrote: >> wouldn't care if they share the "QP number namespace", etc. RDMA CM >> ports are different because they are chosen by the applications, but >> they map directly to the network namespace, so they don't require their >> own namespace. > > Different containers should have restricted access to the PKey and GID > tables, and the presence device itself. Just like in the SRIOV > case. > > That is what the 'RDMA Namespace' would control. We were thinking here that there is a room for an RDMA cgroup. It would limit the amount of RDMA resources a container can use. It can also be used for the restrictions you mentioned, but maybe they are more suitable for a namespace. I'm not sure. In RoCE for instance, a restricted access to the GID table can be derived from the network namespace directly, but perhaps not in InfiniBand. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
On 29/05/2015 00:55, Doug Ledford wrote: > On Thu, 2015-05-28 at 22:05 +0300, Or Gerlitz wrote: >> So what's your concrete saying here? where should we go from here? > > This excerpt is from the commit log of patch 3/12: > > The IB device and port, together with the P_Key and the IP address should > be enough to uniquely identify the ULP net device. > > The problem here is that this is wrong. If we allow more than one > device per pkey with the same GUID, then DHCP breaks, which is bad in > and of itself, but it also breaks ipv6 link local addressing. Which > means that this hunk in patch 4/12: > > +#if IS_ENABLED(CONFIG_IPV6) > + case AF_INET6: > + if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1)) > + return true; > + > + break; > +#endif > > can now be tricked into returning true for incorrect devices. > > Where do we go from here? > > First, I'm inclined to say we should modify the add_child portion of > IPoIB to refuse to add links to a PKey if that GUID is already present > on that PKey. You could then use different PKeys on the default GUID > for separate namespaces. If you need separate namespaces on the same > PKey, then enable alias GUIDs for use on the local adapter and require > one GUID per namespace on the same PKey. I don't think blocking the current add_child implementation is needed. I agree IPv6 SLAAC and DHCP currently don't work well, and adding alias GUID for child interfaces is important, but the current implementation can be used with static IPv4 addresses, so I don't think it must be disabled. > Then I'm inclined to say that we should map for namespaces using device, > port, guid/gid, pkey. And in this situation, since a unique guid/gid on > any given pkey maps to a unique dhcp identifier and a unique ipv6 > lladdr, this becomes freely interchangeable with device, port, pkey, > address mappings that this patchset was built around. What if we change the namespaces patches to map (device, port, GID, P_Key, IP) to netdev / namespace? That is, to use both the GID and the IP address. This would allow people to use namespaces with the current implementation (provided they have a valid configuration with no conflicting IP addresses), and once alias GUIDs are added, the GUIDs will be used to uniquely resolve the namespace even with such misconfigurations. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
On 26/05/2015 20:46, Doug Ledford wrote: >> Remember, this isn't RDMA namespaces, this is netdev namespace support >> > for RDMA-CM -> very different things. > That was the point of my email. This is a very myopic view of the > feature. It *should* at least have an idea of these other things too. We did give some thought to the question of whether an RDMA namespace is needed, and concluded that it isn't. RDMA resources such as QP numbers, memory keys, etc. are allocated by the devices. So different containers wouldn't care if they share the "QP number namespace", etc. RDMA CM ports are different because they are chosen by the applications, but they map directly to the network namespace, so they don't require their own namespace. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
On 26/05/2015 19:59, Jason Gunthorpe wrote: > The big open question for ethernet is how to work without relying on > VLAN to create delgated netdevs - typically one would use a bridge and > veth's, which do not seem very RDMA compatible. But that doesn't need > to be answered right now. I think in Ethernet the first step would be to support macvlan devices. Like IPoIB child devices, they are directly attached to an RDMA device, so they don't require handling a complex virtual bridging topology as veths do. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 00/12] Add network namespace support in the RDMA-CM
On 26/05/2015 16:34, Doug Ledford wrote: > On Sun, 2015-05-17 at 08:50 +0300, Haggai Eran wrote: >> Thanks again everyone for the review comments. I've updated the patch set >> accordingly. The main changes are in the first patch to use a read-write >> semaphore instead of an SRCU, and with the reference counting of shared >> ib_cm_ids. >> Please let me know if I missed anything, or if there are other issues with >> the series. > > Hi Haggai, > > I know you are probably busy reworking this right now on the basis of > Jason's comments. However, my biggest issue with this patch set right > now is not technical (well, it is, but it's only partially technical). Hi, I'm sorry about the late reply. We had a holiday here, and then some other tasks took precedence. I've only got back to working on this today. > > This is a core feature more than anything else. Namespaces for RDMA > devices is not unique to IB or RoCE in any way. Yet no thought has been > given to how this will work universally across all of the RDMA capable > devices (mainly I'm talking about iWARP here... I don't agree. It is true we have are not planning to provide an iWarp implementation for network namespaces, as we lack the capacity and the expertise. However, I think that the changes we proposed to the rdma_cm module will work with iWarp too. Perhaps with some of Jason's suggestions it will be smoother, but even in the current design, I think that if iWarp drivers can provide iw_cm with the network device on which a request is received, then it should be simple to modify it for namespace support without significant change to rdma_cm. > I don't think this is an > issue for usNIC as if you want namespace support there, you just start > the user space app in a given namespace and you are probably 90% of the > way there since the user space application gets its own device and so > its own MAC/IP and all of the RDMA transfers are UDP, so the > application's namespace should get inherited by all the rest, but Cisco > would need to confirm that, hence why I say 90% of the way there, it > needs confirmed). > > So, while you are reworking things right now, you would ideally contact > Steve Wise and/or Tatyana Nikolova and discuss the iWARP story on this. > I know there won't be a lot of overlap between IB and iWARP, but last > time you were asked you didn't even know if this setup could be extended > to iWARP. > > For this next statement, I know I'm directing this to you Haggai, but > please don't take it that way. I'm really using your patch set to make > a broader point to everyone on the list. > > When I look at patches for support for a given feature, one of the > things I'm going to look at is whether or not that feature is specific > to a given hardware type, or if it's a generic feature. If it's a > generic feature, then I'm going to want to know that the person > submitting it has designed it well. A pre-requisite of designing a > generic feature well is that it considers all hardware types, not just > your specific hardware type. So when you come back with the next > version of this patch set, please have an answer for how it should work > on each hardware type even if you don't have implementation patches for > each hardware type. Well, because the RDMA subsystem supports a very diverse set of devices, I think there are few people who know the details of all hardware types well. If we are going to evolve the generic parts of the stack, we have to cooperate. We have to rely on the knowledge of people on the mailing list to say whether the feature is well designed for all hardware types, or whether changes are warranted. In this specific case, the patches has been on the list since February. I think it is enough time to allow anyone who is interested in network namespace support to chime in. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
On 21/05/2015 20:43, Jason Gunthorpe wrote: > On Thu, May 21, 2015 at 08:33:53AM +0300, Haggai Eran wrote: > >> To create a new child interface on the default P_Key, its possible to >> use iproute: >> # ip link add link ib0 name ib0.1 type ipoib > > Uh.. > > A key invariant of the IP stack is that is it possible to uniquely > identify the ingress device. > > So the above scheme is fine for IPoIB, because it uses the interface > unique QPN to uniquely ID the netdevice: (Device,Port,Pkey,QPN) > is the unique ID tuple. The world is happy. > > But RDMA CM doesn't provide the QPN. So when RDMA CM searches the > netdevs for an address it cannot *uniquely* map to a IPoIB interface. This is technically true, but if someone configures their system that way, they will also have ARP conflicts in addition. I don't see why we should support such a configuration. > This is bad, and *completely wrong*, but today, nobody is going to > really notice or care. The cases where it does something you don't > want are not very significant. > > But with containers.. Think this through for a minute: 'In some cases > the RDMA CM selecs the wrong child' - that goes from being a minor > annoyance to a violation of containment! Worse the criteria for > 'selects the wrong child' can be triggered by the contained users. Eg > the contained user adds a IP to their child that duplicates another > container. Now we've lost control. No, this is exactly what would happen in the Ethernet world. If you create a conflicting configuration between two containers on the same Ethernet segment, then one of them could get the traffic that was intended for the other. I don't see this as a violation of containment, because these containers are assigned net_devs that communicate on the same segment, so they behave just as two different hosts would, with or without conflicts. > > The very idea of ib_get_net_dev_by_port_pkey_ip is broken. > > So, I don't know what to say here.. Ideas? > > 1) Forbid creating more than one pkey per ipoib interface? You probably mean more than one IP on the same pkey. The pkey is actually part of the request, so its not an issue. > 2) Somehow extend the RDMA CM to send the IPoIB qpn too? > 3) ?? We can do something crazy in the future like moving all CM requests to run over UDP as in RoCEv2. But both adding the QPN or moving to UDP require a wire protocol change and won't be compatible with today's systems. > > Right now the only case that comes to mind is duplicating IPs, that is > already going to cause an ARP collision, so maybe having the RDMA CM > randomly select an IP is not the end of the world... But with > containers and security, who knows? I'm not confident I've > exhaustively thought of all possibilities here. > > -- > > Anyhow.. looking again through this series and the existing code, the > flow is wrong, and really needs to be changed before this starts to > make sense to anyone, and is no doubt part of how we got here.. > > When a REQ arrives RDMA CM needs to run down these steps (this is identical > to what ip_input.c does) > > 1) Locate the netdev associated with the ingress of the packet, > in a sane world this is done by only checking the > unique (Device,Port,Pkey,QPN) tuple. > If we keep our brokeness, we'd do this based on > (Device,Port,Pkey,IP) - if there are IP collisions then randomly > select a netdev (similar to how ARP collision is handled). That's what ib_get_net_dev_by_port_pkey_ip intends to do. > 2) Then we do the ip_route_input_noref step, this will set skb_dst to > the netdev that will handle the packet, or tell us to drop it. > This is not always the same as the netdev that accepts the > packet!!! > > NOTE: This route step is missing today, it does critical things > like check that the node is actually listening on the dest IP! Isn't this a little over-engineered? If all you want is to make sure the net dev is up, can't we use something like netif_running()? Also, this sounds like a major change in behavior even for applications that do not use containers. I think today RDMA CM will accept connections even if the ipoib interface is down. > > 3) Now we can use skb_dst to iterate over the set of all RDMA CM listens: > 1) Bound to the skb_dst netdev > 2) Unbound in the same namespace as skb_dst netdev > The first to match the dst IP + port is the listen that will accept the > connection, now we go into the cma_new_conn_id path, and we don't > need rdma_translate_ip because we already have the handling netdev. You shouldn't be able to bind one listener to a netdev in a namespace an
Re: [PATCH] staging: r8712u: Fix kernel warning for improper call of del_timer_sync()
On 24 May 2015 at 00:16, Larry Finger wrote: > The driver is reporting a warning at kernel/time/timer.c:1096 due to calling > del_timer_sync() while in interrupt mode. Such warnings are fixed by calling > del_timer() instead. > > Signed-off-by: Larry Finger > Cc: Stable > Cc: Haggi Eran Hi, I haven't been using kernel v4.1 so I haven't seen this warning, but looking at the code it seems to originate from the two recent patches to remove _cancel_timer and _cancel_timer_ex. I see that there's another patch in lkml [1] that changes del_timer_sync back to del_timer in more places. Perhaps it could prevent other warnings like this in the future. Regards, Haggai [1] https://lkml.org/lkml/2015/5/15/226 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs
On 20/05/2015 01:35, Jason Gunthorpe wrote: > On Tue, May 19, 2015 at 12:35:45PM -0600, Jason Gunthorpe wrote: >> On Sun, May 17, 2015 at 08:51:01AM +0300, Haggai Eran wrote: >>> @@ -212,6 +212,8 @@ struct cm_id_private { >>> spinlock_t lock;/* Do not acquire inside cm.lock */ >>> struct completion comp; >>> atomic_t refcount; >>> + /* Number of clients sharing this ib_cm_id. Only valid for listeners. */ >>> + atomic_t sharecount; >> >> No need for this atomic, hold the lock >> >> The use of the atomic looks racy: >> >>> + if (!atomic_dec_and_test(&cm_id_priv->sharecount)) { >>> + /* The id is still shared. */ >>> + return; >>> + } >> >> Might race with this: >> >>> + if (atomic_inc_return(&cm_id_priv->sharecount) == 1) { >>> + /* This ID is already being destroyed */ >>> + atomic_dec(&cm_id_priv->sharecount); >>> + goto new_id; >>> + } >>> + >> >> Resulting in use-after-free of cm_id_priv->sharecount > > Actually, there is something else odd here.. I mentioned the above > because there wasn't obvious ref'ing on the cm_id_priv. Looking closer > the cm.lock should prevent use-after-free, but there is still no ref. > > The more I look at this, the more I think it is sketchy. Don't try and > merge sharecount and refcount together, I'm not sure what you mean here. The way I was thinking about it was that sharecount = num of rdma_cm_ids sharing this listener, while refcount = num of active internal uses of this ID (work items, timers, etc.) Do you want refcount to also include the sharecount? > after cm_find_listen is called > you have to increment the refcount before dropping cm.lock. > > Decrement the refcount when destroying a shared listen. You mean to decrement event if listen_sharecount > 0, and the id isn't destroyed, right? The code already calls cm_deref_id when listen_sharecount = 0 of course. > I also don't see how the 'goto new_id' can work, if cm_find_listen > succeeds then __ib_cm_listen is guarenteed to fail. > > Fix the locking to make that impossible, associate sharecount with the > cm.lock and, rework how cm_destroy_id grabs the cm_id_priv->lock spinlock: > > case IB_CM_LISTEN: > spin_lock_irq(&cm.lock); > if (cm_id_priv->sharecount != 0) { >cm_id_prv->sharecount--; >// paired with in in ib_cm_id_create_and_listen >atomic_dec(&cm_id_priv->refcount); >spin_unlock_irq(&cm.lock); >return; > } > rb_erase(&cm_id_priv->service_node, &cm.listen_service_table); > spin_unlock_irq(&cm.lock); > > spin_lock_irq(&cm_id_priv->lock); > cm_id->state = IB_CM_IDLE; > spin_unlock_irq(&cm_id_priv->lock); > break; > > Now that condition is eliminated, the unneeded atomic is gone, and > refcount still acts like a proper kref should. Thanks, that looks like a better solution. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 05/12] IB/cm: Share listening CM IDs
On 19/05/2015 21:35, Jason Gunthorpe wrote: ... > The share count should be 'listen_sharecount' because it *only* works > for listen. > > The above test in cm_destroy_id should only be in the listen branch of > the if. Okay. > >> + * Create a new listening ib_cm_id and listen on the given service ID. >> + * >> + * If there's an existing ID listening on that same device and service ID, >> + * return it. >> + * > > .. Callers should call cm_destroy_id when done with the listen .. I'll add that (with ib_destroy_cm_id of course). -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v3 for-next 05/13] IB/cm: Reference count ib_cm_ids
On 20/05/2015 03:49, Hefty, Sean wrote: >>> I wonder if the existing ib_cm interface is even what we want. >>> Currently, the rdma_cm pushes the private data (i.e. IP address) >>> comparison into the ib_cm. This is only used by the rdma_cm. >>> Should that instead be moved out of the ib_cm and handled in the >>> rdma_cm? And then update the ib_cm to support multiple listens on >>> the same service id. >> >> Well, that really is the problem here, the private data compare >> doesn't really do enough because rdma cm wildcard's the IP address and >> does it's own check anyhow. So I see all of this as different >> proposals on how to fix that. rdma_cm basically already does most of >> the private data compare itself. > > Maybe the first thing to fix is this. Since the private data compare is no > longer sufficient, remove it and update the rdma_cm to handle the change. > Regardless of how the code ends up structured, the namespace support should > then easily fit into that solution. Patch 8 in this series (IB/cma: Add compare_data checks to the RDMA CM module) basically does that (change rdma_cm so that it doesn't rely on private_data checks in ib_cm). Indeed, it is positioned in the series before adding the rdma_cm namespace support. It remains to clean up ib_cm's ib_cm_listen interface now that compare_data isn't used, but I'm not sure this belongs in this series. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
On 21/05/2015 08:48, Or Gerlitz wrote: > On Thu, May 21, 2015 at 8:33 AM, Haggai Eran wrote: >> On 20/05/2015 02:55, Jason Gunthorpe wrote: >>> On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote: >>>>> From: Guy Shapiro >>>>> >>>>> Implement the get_net_device_by_port_pkey_ip callback that returns network >>>>> device to ib_core according to connection parameters. Check the ipoib >>>>> device and iterate over all child devices to look for a match. >>> Can you give a run down on how to actually set this up? Like what >>> shell command do you execute? >> >> Sure. >> >> There are two methods to create new child interface for IPoIB. >> For a specific P_Key, write the desired P_Key to the create_child sysfs >> file: >> # echo 0x8000 > /sys/class/net/ib0/create_child >> This creates a new interface ib0.8000 operating with P_Key 0x8000. > > 0x8000 is practically zero (bits 15-0), right? not sure this is a valid pkey. Right, bad example :) > > >> To create a new child interface on the default P_Key, its possible to >> use iproute: >> # ip link add link ib0 name ib0.1 type ipoib > > you can use non default pkeys as well here. > > $ ip link add link ib0 name ib0.1 type ipoib pkey 0x8001 > Right. I think when we started development of the namespaces patches these child interfaces (rtnetlink with pkey) didn't work for us, but I checked now on an updated kernel and they do. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
On 20/05/2015 02:55, Jason Gunthorpe wrote: > On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote: >> For each ipoib device we iterate through all upper devices when searching >> for a matching IP, in order to support bonding. > > Checking an IP address in a packet against a device without consulting > the routing table is a big red flag for me. Can you elaborate? We want to match the incoming CM request to a specific net_dev, and use the network namespace of that net_dev to find the right rdma_cm_id. Because there can be multiple network namespaces, there would also be multiple routing tables, so I'm not sure how we could use them. The solution we used is to walk all the upper net_devs of all the IPoIB child devices of a given IB device and port, and find one that matches the request's IB device, port number, P_Key and IP address. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
On 20/05/2015 02:55, Jason Gunthorpe wrote: > On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote: >> > From: Guy Shapiro >> > >> > Implement the get_net_device_by_port_pkey_ip callback that returns network >> > device to ib_core according to connection parameters. Check the ipoib >> > device and iterate over all child devices to look for a match. > Can you give a run down on how to actually set this up? Like what > shell command do you execute? Sure. There are two methods to create new child interface for IPoIB. For a specific P_Key, write the desired P_Key to the create_child sysfs file: # echo 0x8000 > /sys/class/net/ib0/create_child This creates a new interface ib0.8000 operating with P_Key 0x8000. To create a new child interface on the default P_Key, its possible to use iproute: # ip link add link ib0 name ib0.1 type ipoib In order to create a new network namespace: # ip netns add ns1 Then, you can assign the new netdev to the namespace: # ip link set ib0.1 netns ns1 You can then set an IP address in the network namespace, and try some RDMA CM applications: # ip netns exec ns1 ip addr add dev ib0.1 192.168.0.1/24 # ip netns exec ns1 rdma_server Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v4 for-next 04/12] IB/ipoib: Return IPoIB devices matching connection parameters
On 19/05/2015 21:28, Jason Gunthorpe wrote: > On Sun, May 17, 2015 at 08:51:00AM +0300, Haggai Eran wrote: > >> +#if IS_ENABLED(CONFIG_IPV6) >> +struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr; >> +#endif >> +__be32 ret_addr; >> + >> +switch (addr->sa_family) { >> +case AF_INET: >> +in_dev = in_dev_get(dev); >> +if (!in_dev) >> +return false; >> + >> +ret_addr = inet_confirm_addr(net, in_dev, 0, >> + addr_in->sin_addr.s_addr, >> + RT_SCOPE_HOST); >> +in_dev_put(in_dev); >> +if (ret_addr) >> +return true; >> + >> +break; >> +#if IS_ENABLED(CONFIG_IPV6) >> +case AF_INET6: >> +if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1)) >> +return true; >> + >> +break; >> +#endif > > Can you use > > if (IS_ENABLED(CONFIG_IPV6)) > > At the call site instead of the #if guards? Sure, I'll do that in the next revision of the patch-set. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 for-next 09/13] IB/cma: Add compare_data checks to the RDMA CM module
Previously RDMA CM relied on the CM module to check the incoming requests against "compare data" struct to dispatch events for different RDMA CM IDs based on the request parameters (IP address, address family, etc.). With namespace support, multiple namespaces in RDMA CM will need to share a single CM ID. Such an ID cannot be associated with a specific compare data, because that could create conflicts with other namespaces. The patch adds checks to verify that incoming requests match their RDMA CM ID destination inside the RDMA CM module itself. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cm.c | 5 +++-- drivers/infiniband/core/cma.c | 12 +--- include/rdma/ib_cm.h | 3 +++ 3 files changed, 15 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index fc33fb215e55..d023639a9f75 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -459,8 +459,8 @@ static int cm_compare_data(struct ib_cm_compare_data *src_data, return memcmp(src, dst, sizeof(src)); } -static int cm_compare_private_data(u32 *private_data, - struct ib_cm_compare_data *dst_data) +int cm_compare_private_data(u32 *private_data, + struct ib_cm_compare_data *dst_data) { u32 src[IB_CM_COMPARE_SIZE]; @@ -470,6 +470,7 @@ static int cm_compare_private_data(u32 *private_data, cm_mask_copy(src, private_data, dst_data->mask); return memcmp(src, dst_data->data, sizeof(src)); } +EXPORT_SYMBOL(cm_compare_private_data); /* * Trivial helpers to strip endian annotation and compare; the diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index b5d3321e613e..1f591458b4de 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -146,6 +146,7 @@ struct rdma_id_private { u8 tos; u8 reuseaddr; u8 afonly; + struct ib_cm_compare_data compare_data; }; struct cma_multicast { @@ -1319,6 +1320,10 @@ static int cma_req_handler(struct ib_cm_id *cm_id, struct ib_cm_event *ib_event) int offset, ret; listen_id = cm_id->context; + if (cm_compare_private_data(ib_event->private_data, + &listen_id->compare_data)) + return -EINVAL; + if (!cma_check_req_qp_type(&listen_id->id, ib_event)) return -EINVAL; @@ -1586,7 +1591,6 @@ out: static int cma_ib_listen(struct rdma_id_private *id_priv) { - struct ib_cm_compare_data compare_data; struct sockaddr *addr; struct ib_cm_id *id; __be64 svc_id; @@ -1603,8 +1607,10 @@ static int cma_ib_listen(struct rdma_id_private *id_priv) if (cma_any_addr(addr) && !id_priv->afonly) ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, NULL); else { - cma_set_compare_data(id_priv->id.ps, addr, &compare_data); - ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, &compare_data); + cma_set_compare_data(id_priv->id.ps, addr, +&id_priv->compare_data); + ret = ib_cm_listen(id_priv->cm_id.ib, svc_id, 0, + &id_priv->compare_data); } if (ret) { diff --git a/include/rdma/ib_cm.h b/include/rdma/ib_cm.h index 17e072d1677c..4c767b5daf50 100644 --- a/include/rdma/ib_cm.h +++ b/include/rdma/ib_cm.h @@ -347,6 +347,9 @@ struct ib_cm_compare_data { u32 mask[IB_CM_COMPARE_SIZE]; }; +int cm_compare_private_data(u32 *private_data, + struct ib_cm_compare_data *dst_data); + /** * ib_cm_listen - Initiates listening on the specified service ID for * connection and service ID resolution requests. -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v3 for-next 06/13] IB/cm: API to retrieve existing listening CM IDs
Enabling network namespaces for RDMA CM will allow processes on different namespaces to listen on the same port. In order to leave namespace support out of the CM layer, this requires that multiple RDMA CM IDs will be able to share a single CM ID. This patch adds infrastructure to retrieve an existing listening ib_cm_id, based on its device and service ID, or create a new one if one does not already exist. Signed-off-by: Haggai Eran --- drivers/infiniband/core/cm.c | 104 --- include/rdma/ib_cm.h | 4 ++ 2 files changed, 103 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 6b68402fd6df..0be96a19734d 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -960,11 +960,32 @@ void ib_destroy_cm_id(struct ib_cm_id *cm_id) } EXPORT_SYMBOL(ib_destroy_cm_id); -int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, -struct ib_cm_compare_data *compare_data) +/** + * __ib_cm_listen - Initiates listening on the specified service ID for + * connection and service ID resolution requests. + * @cm_id: Connection identifier associated with the listen request. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. The service ID should be specified + * network-byte order. If set to IB_CM_ASSIGN_SERVICE_ID, the CM will + * assign a service ID to the caller. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. If set to 0, the service ID is matched + * exactly. This parameter is ignored if %service_id is set to + * IB_CM_ASSIGN_SERVICE_ID. + * @compare_data: This parameter is optional. It specifies data that must + * appear in the private data of a connection request for the specified + * listen request. + * @lock: If set, lock the cm.lock spin-lock when adding the id to the + * listener tree. When false, the caller must already hold the spin-lock, + * and compare_data must be NULL. + */ +static int __ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, + __be64 service_mask, + struct ib_cm_compare_data *compare_data, + bool lock) { struct cm_id_private *cm_id_priv, *cur_cm_id_priv; - unsigned long flags; + unsigned long flags = 0; int ret = 0; service_mask = service_mask ? service_mask : ~cpu_to_be64(0); @@ -990,7 +1011,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, cm_id->state = IB_CM_LISTEN; - spin_lock_irqsave(&cm.lock, flags); + if (lock) + spin_lock_irqsave(&cm.lock, flags); if (service_id == IB_CM_ASSIGN_SERVICE_ID) { cm_id->service_id = cpu_to_be64(cm.listen_service_id++); cm_id->service_mask = ~cpu_to_be64(0); @@ -999,7 +1021,8 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, cm_id->service_mask = service_mask; } cur_cm_id_priv = cm_insert_listen(cm_id_priv); - spin_unlock_irqrestore(&cm.lock, flags); + if (lock) + spin_unlock_irqrestore(&cm.lock, flags); if (cur_cm_id_priv) { cm_id->state = IB_CM_IDLE; @@ -1009,8 +1032,79 @@ int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, } return ret; } + +int ib_cm_listen(struct ib_cm_id *cm_id, __be64 service_id, __be64 service_mask, +struct ib_cm_compare_data *compare_data) +{ + return __ib_cm_listen(cm_id, service_id, service_mask, compare_data, + true); +} EXPORT_SYMBOL(ib_cm_listen); +/** + * Create a new listening ib_cm_id and listen on the given service ID. + * + * If there's an existing ID listening on that same device and service ID, + * return it. + * + * @device: Device associated with the cm_id. All related communication will + * be associated with the specified device. + * @cm_handler: Callback invoked to notify the user of CM events. + * @service_id: Service identifier matched against incoming connection + * and service ID resolution requests. The service ID should be specified + * network-byte order. If set to IB_CM_ASSIGN_SERVICE_ID, the CM will + * assign a service ID to the caller. + * @service_mask: Mask applied to service ID used to listen across a + * range of service IDs. If set to 0, the service ID is matched + * exactly. This parameter is ignored if %service_id is set to + * IB_CM_ASSIGN_SERVICE_ID. + */ +struct ib_cm_id *ib_cm_id_create_and_listen( + struct ib_device *device, + ib_cm_handler cm_handler, + __be64 service_id, + __be64 service_mask) +{ + struct cm_id_private *cm_id_
[PATCH v3 for-next 03/13] IB/core: Find the network namespace matching connection parameters
From: Yotam Kenneth In the case of IPoIB, and maybe in other cases, the network device is managed by an upper-layer protocol (ULP). In order to expose this network device to other users of the IB device, let ULPs implement a callback that returns network device according to connection parameters. The IB device and port, together with the P_Key and the IP address should be enough to uniquely identify the ULP net device. This function is passed to ib_core as part of the ib_client registration. Using this functionality, add a way to get the network namespace corresponding to a work completion. This is needed so that responses to CM requests can be sent from the same network namespace as the request. Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel Signed-off-by: Guy Shapiro --- drivers/infiniband/core/device.c | 53 include/rdma/ib_verbs.h | 33 + 2 files changed, 86 insertions(+) diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 7d90b2ca2eba..3b9e80ce0b42 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include "core_priv.h" @@ -760,6 +761,58 @@ int ib_find_pkey(struct ib_device *device, } EXPORT_SYMBOL(ib_find_pkey); +static struct net_device *ib_get_net_dev_by_port_pkey_ip(struct ib_device *dev, +u8 port, +u16 pkey, +struct sockaddr *addr) +{ + struct net_device *ret = NULL; + struct ib_client *client; + int id; + + id = srcu_read_lock(&device_srcu); + + list_for_each_entry_rcu(client, &client_list, list) + if (client->get_net_device_by_port_pkey_ip) { + ret = client->get_net_device_by_port_pkey_ip(dev, port, +pkey, +addr); + if (ret) + break; + } + + srcu_read_unlock(&device_srcu, id); + + return ret; +} + +struct net *ib_get_net_ns_by_port_pkey_ip(struct ib_device *dev, + u8 port, + u16 pkey, + struct sockaddr *addr) +{ + struct net_device *ndev = NULL; + struct net *ns; + + if (rdma_protocol_ib(dev, port)) + ndev = ib_get_net_dev_by_port_pkey_ip(dev, port, pkey, addr); + + if (!ndev) + goto not_found; + + rcu_read_lock(); + ns = maybe_get_net(dev_net(ndev)); + dev_put(ndev); + rcu_read_unlock(); + if (!ns) + goto not_found; + return ns; + +not_found: + return get_net(&init_net); +} +EXPORT_SYMBOL(ib_get_net_ns_by_port_pkey_ip); + static int __init ib_core_init(void) { int ret; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index c7241149f142..7baf3c40ef97 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -48,6 +48,7 @@ #include #include #include +#include #include #include @@ -1691,6 +1692,24 @@ struct ib_client { void (*add) (struct ib_device *); void (*remove)(struct ib_device *); + /* Returns the net_dev belonging to this ib_client and matching the +* given parameters. +* @dev:An RDMA device that the net_dev use for communication. +* @port: A physical port number on the RDMA device. +* @pkey: P_Key that the net_dev uses if applicable. +* @addr: An IP address the net_dev is configured with. +* +* An ib_client that implements a net_dev on top of RDMA devices +* (such as IP over IB) should implement this callback, allowing the +* rdma_cm module to find the right net_dev for a given request. +* +* The caller is responsible for calling dev_put on the returned +* netdev. */ + struct net_device *(*get_net_device_by_port_pkey_ip)( + struct ib_device *dev, + u8 port, + u16 pkey, + struct sockaddr *addr); struct list_head list; }; @@ -2834,4 +2853,18 @@ static inline int ib_check_mr_access(int flags) int ib_check_mr_status(struct ib_mr *mr, u32 check_mask, struct ib_mr_status *mr_status); +/** + * ib_get_net_ns_by_port_pkey_ip() - Return the appropriate net namespace + * for a received CM request + * @dev: An RDMA device on which the request has been received. + * @port: Port number on the RDMA devi
[PATCH v3 for-next 01/13] IB/core: Use SRCU when reading client_list or device_list
Currently the RDMA subsystem's device list and client list are protected by a single mutex. This prevents adding user-facing APIs that iterate these lists, since using them may cause a deadlock. The patch attempts to solve this problem by adding an SRCU to protect the lists. Readers now don't need the mutex, and are safe just by using srcu_read_lock/unlock. The ib_register_device, ib_register_client, and ib_unregister_client functions are modified to only lock the device_mutex during their respective list modification, and use the SRCU for iteration on the other list. In ib_unregister_device, the client list iteration remains in the mutex critical section as it is done in reverse order. This patch attempts to solve a similar need [1] that was seen in the RoCE v2 patch series. [1] http://www.spinics.net/lists/linux-rdma/msg24733.html Cc: Matan Barak Cc: Jason Gunthorpe Signed-off-by: Haggai Eran --- drivers/infiniband/core/device.c | 75 ++-- 1 file changed, 56 insertions(+), 19 deletions(-) diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index b360350a0b20..7d90b2ca2eba 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -58,12 +58,11 @@ EXPORT_SYMBOL_GPL(ib_wq); static LIST_HEAD(device_list); static LIST_HEAD(client_list); +/* device_srcu protects access to both device_list and client_list. */ +static struct srcu_struct device_srcu; + /* - * device_mutex protects access to both device_list and client_list. - * There's no real point to using multiple locks or something fancier - * like an rwsem: we always access both lists, and we're always - * modifying one list or the other list. In any case this is not a - * hot path so there's no point in trying to optimize. + * device_mutex protects writer access to both device_list and client_list. */ static DEFINE_MUTEX(device_mutex); @@ -276,6 +275,7 @@ int ib_register_device(struct ib_device *device, u8, struct kobject *)) { int ret; + int id; mutex_lock(&device_mutex); @@ -315,13 +315,19 @@ int ib_register_device(struct ib_device *device, device->reg_state = IB_DEV_REGISTERED; + mutex_unlock(&device_mutex); + + id = srcu_read_lock(&device_srcu); { struct ib_client *client; - list_for_each_entry(client, &client_list, list) + list_for_each_entry_rcu(client, &client_list, list) if (client->add && !add_client_context(device, client)) client->add(device); } + srcu_read_unlock(&device_srcu, id); + + return 0; out: mutex_unlock(&device_mutex); @@ -338,6 +344,7 @@ EXPORT_SYMBOL(ib_register_device); void ib_unregister_device(struct ib_device *device) { struct ib_client *client; + LIST_HEAD(contexts); struct ib_client_data *context, *tmp; unsigned long flags; @@ -347,21 +354,26 @@ void ib_unregister_device(struct ib_device *device) if (client->remove) client->remove(device); - list_del(&device->core_list); + list_del_rcu(&device->core_list); + + mutex_unlock(&device_mutex); + + synchronize_srcu(&device_srcu); kfree(device->gid_tbl_len); kfree(device->pkey_tbl_len); - mutex_unlock(&device_mutex); - ib_device_unregister_sysfs(device); spin_lock_irqsave(&device->client_data_lock, flags); - list_for_each_entry_safe(context, tmp, &device->client_data_list, list) - kfree(context); + list_cut_position(&contexts, &device->client_data_list, + device->client_data_list.prev); spin_unlock_irqrestore(&device->client_data_lock, flags); device->reg_state = IB_DEV_UNREGISTERED; + + list_for_each_entry_safe(context, tmp, &contexts, list) + kfree(context); } EXPORT_SYMBOL(ib_unregister_device); @@ -381,15 +393,19 @@ EXPORT_SYMBOL(ib_unregister_device); int ib_register_client(struct ib_client *client) { struct ib_device *device; + int id; mutex_lock(&device_mutex); + list_add_tail_rcu(&client->list, &client_list); + mutex_unlock(&device_mutex); - list_add_tail(&client->list, &client_list); - list_for_each_entry(device, &device_list, core_list) + id = srcu_read_lock(&device_srcu); + + list_for_each_entry_rcu(device, &device_list, core_list) if (client->add && !add_client_context(device, client)) client->add(device); - mutex_unlock(&device_mutex); + srcu_read_unlock(&device_src
Re: [PATCH v2 01/11] RDMA/CMA: Mark IPv4 addresses correctly when the listener is IPv6
On Tuesday, April 21, 2015 1:15 PM, Haggai Eran wrote: > On 20/04/2015 23:01, Jason Gunthorpe wrote: >> This should take care of it, testing, and figuring the fixes tag is >> left as an exercise to the reader.. > > Fixes: e51060f08a61 ("IB: IP address based RDMA connection manager") > Tested-by: Haggai Eran > Roland, Doug, Could you pick Jason's patch without the rest of the series? It seems the namespace series will need more work, but I don't think this patch should be delayed as well. Thanks, Haggai-- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/11] Add network namespace support in the RDMA-CM
On 21/04/2015 17:11, Steve Wise wrote: > On 4/21/2015 1:36 AM, Haggai Eran wrote: >> On 20/04/2015 17:53, Steve Wise wrote: >>> Hey Haggai, >>> >>> Did you check for changes needed in drivers/infiniband/core/iwcm.c? >> We focused on namespace support for InfiniBand alone in this series. We >> didn't handle iWARP, nor did we implement support for RoCE or other >> transports. >> >>> I notice that it uses init_net here: >>> >>> static int __init iw_cm_init(void) >>> { >>> iwcm_wq = create_singlethread_workqueue("iw_cm_wq"); >>> if (!iwcm_wq) >>> return -ENOMEM; >>> >>> iwcm_ctl_table_hdr = register_net_sysctl(&init_net, >>> "net/iw_cm", >>> iwcm_ctl_table); >>> if (!iwcm_ctl_table_hdr) { >>> pr_err("iw_cm: couldn't register sysctl paths\n"); >>> destroy_workqueue(iwcm_wq); >>> return -ENOMEM; >>> } >>> >>> return 0; >>> } >>> >> I see the only thing in the iWARP sysctl registered here is the default >> backlog. If you want to control this parameter per namespace, we could >> store it per network namespace, and add a namespace parameter to >> iw_cm_listen. I'm not sure how important this is though. > > I don't think it needs to be per namespace, as long as it still applies > across all name spaces. It will, but it will currently only be visible and controllable through init's namespace. -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 07/11] IB/cm: Add network namespace support
On 20/04/2015 20:06, Jason Gunthorpe wrote: > On Mon, Apr 20, 2015 at 12:03:38PM +0300, Haggai Eran wrote: >> From: Guy Shapiro >> >> Add namespace support to the IB-CM layer. > >> - Each CM-ID now has a network namespace it is associated with, assigned at >> creation. This namespace is used as needed during subsequent action on the >> CM-ID or related objects. > > There is really something weird about this layering. At the CM layer > there should be no concept of an IP address, it only deals with GIDs. Using the GID alone is not enough to distinguish between namespaces, because you can have multiple IPoIB interfaces, all using the GID (and possibly the same P_Key), and each belonging to a different namespace. > So how can a CM object have a network namespace associated with it? The listener rbtree's key is currently the service ID, for instance. Now with namespaces, you can have multiple listeners listening on the same service ID, so we need to use (service ID, namespace) as the key. > >> { >> av->port = port; >> av->pkey_index = wc->pkey_index; >> ib_init_ah_from_wc(port->cm_dev->ib_device, port->port_num, wc, >> - grh, &av->ah_attr, &init_net); >> + grh, &av->ah_attr, net); > > There is something deeply wrong with adding network namespace > arguments to verbs. > > For rocee the gid index clearly specifies the network namespace > to use, so much of this should go away and have rocee get the > namespace from the gid index. > > Ie in ib_init_ah_from_wc we have the ib_wc which contains the sgid > index. I don't see it there. The code seem to fetch the GID from the GRH. Because the IP address in the source GID can be the same for different namespaces, this is not enough to pick the right namespace. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 02/11] IB/addr: Pass network namespace as a parameter
On 21/04/2015 01:05, Doug Ledford wrote: > On Mon, 2015-04-20 at 12:03 +0300, Haggai Eran wrote: >> From: Guy Shapiro >> >> Add network namespace support to the ib_addr module. For that, all the >> address >> resolution and matching should be done using the appropriate namespace >> instead >> of init_net. >> >> This is achieved by: >> >> 1. Adding an explicit network namespace argument to exported function that >>require a namespace. >> 2. Saving the namespace in the rdma_addr_client structure. >> 3. Using it when calling networking functions. >> >> In order to preserve the behavior of calling modules, &init_net is >> passed as the parameter in calls from other modules. This is modified as >> namespace support is added on more levels. >> >> Signed-off-by: Haggai Eran >> Signed-off-by: Yotam Kenneth >> Signed-off-by: Shachar Raindel >> Signed-off-by: Guy Shapiro >> --- >> drivers/infiniband/core/addr.c | 31 -- >> drivers/infiniband/core/cma.c| 4 ++- >> drivers/infiniband/core/verbs.c | 14 +++--- >> drivers/infiniband/hw/ocrdma/ocrdma_ah.c | 3 ++- >> include/rdma/ib_addr.h | 44 >> >> 5 files changed, 72 insertions(+), 24 deletions(-) >> >> diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c >> index f80da50d84a5..95beaef6b66d 100644 >> --- a/drivers/infiniband/core/addr.c >> +++ b/drivers/infiniband/core/addr.c >> @@ -128,7 +128,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct >> rdma_dev_addr *dev_addr, >> int ret = -EADDRNOTAVAIL; >> >> if (dev_addr->bound_dev_if) { >> -dev = dev_get_by_index(&init_net, dev_addr->bound_dev_if); >> +dev = dev_get_by_index(dev_addr->net, dev_addr->bound_dev_if); >> if (!dev) >> return -ENODEV; >> ret = rdma_copy_addr(dev_addr, dev, NULL); >> @@ -137,9 +137,10 @@ int rdma_translate_ip(struct sockaddr *addr, struct >> rdma_dev_addr *dev_addr, >> } >> >> switch (addr->sa_family) { >> -case AF_INET: >> -dev = ip_dev_find(&init_net, >> -((struct sockaddr_in *) addr)->sin_addr.s_addr); >> +case AF_INET: { >^ Please don't add brackets just so you can > convert a cast into a variable declaration that's unnecessary > >> +struct sockaddr_in *addr_in = (struct sockaddr_in *)addr; >> + >> +dev = ip_dev_find(dev_addr->net, addr_in->sin_addr.s_addr); >> >> if (!dev) >> return ret; >> @@ -149,12 +150,12 @@ int rdma_translate_ip(struct sockaddr *addr, struct >> rdma_dev_addr *dev_addr, >> *vlan_id = rdma_vlan_dev_vlan_id(dev); >> dev_put(dev); >> break; >> - >> +} >> #if IS_ENABLED(CONFIG_IPV6) >> case AF_INET6: >> rcu_read_lock(); >> -for_each_netdev_rcu(&init_net, dev) { >> -if (ipv6_chk_addr(&init_net, >> +for_each_netdev_rcu(dev_addr->net, dev) { >> +if (ipv6_chk_addr(dev_addr->net, >>&((struct sockaddr_in6 *) >> addr)->sin6_addr, >>dev, 1)) { >> ret = rdma_copy_addr(dev_addr, dev, NULL); >> @@ -236,7 +237,7 @@ static int addr4_resolve(struct sockaddr_in *src_in, >> fl4.daddr = dst_ip; >> fl4.saddr = src_ip; >> fl4.flowi4_oif = addr->bound_dev_if; >> -rt = ip_route_output_key(&init_net, &fl4); >> +rt = ip_route_output_key(addr->net, &fl4); >> if (IS_ERR(rt)) { >> ret = PTR_ERR(rt); >> goto out; >> @@ -278,12 +279,13 @@ static int addr6_resolve(struct sockaddr_in6 *src_in, >> fl6.saddr = src_in->sin6_addr; >> fl6.flowi6_oif = addr->bound_dev_if; >> >> -dst = ip6_route_output(&init_net, NULL, &fl6); >> +dst = ip6_route_output(addr->net, NULL, &fl6); >> if ((ret = dst->error)) >> goto put; >> >> if (ipv6_addr_any(&fl6.saddr)) { >> -ret = ipv6_dev_get_saddr(&init_net, ip6_dst_idev(dst)->dev, >> +ret = ipv6_dev_get_saddr(addr->net, &
Re: [PATCH v2 02/11] IB/addr: Pass network namespace as a parameter
On 20/04/2015 20:09, Jason Gunthorpe wrote: > On Mon, Apr 20, 2015 at 12:03:33PM +0300, Haggai Eran wrote: >> +/** rdma_addr_find_smac_by_sgid() - Find the src MAC and VLAN ID for a src >> GID >> + * @sgid: Source GID to find the MAC and VLAN for. >> + * @smac: A buffer to contain the resulting MAC address. >> + * @vlan_id:Will contain the resulting VLAN ID. >> + * @net:Network namespace to use for the address resolution. >> + * >> + * It is the caller's responsibility to keep the network namespace alive >> until >> + * the function returns. >> + */ >> +int rdma_addr_find_smac_by_sgid(union ib_gid *sgid, u8 *smac, u16 *vlan_id, >> +struct net *net); > > kdocs are typically placed with the body of the function, not at the > prototype. I'll move it in the next revision. We did that because other functions (rdma_translate_ip, rdma_resolve_ip) are documented inside ib_addr.h this way. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 01/11] RDMA/CMA: Mark IPv4 addresses correctly when the listener is IPv6
On 20/04/2015 23:01, Jason Gunthorpe wrote: > On Mon, Apr 20, 2015 at 09:38:02PM +0300, Or Gerlitz wrote: >> On Mon, Apr 20, 2015 at 7:41 PM, Jason Gunthorpe >> wrote: >>> On Mon, Apr 20, 2015 at 12:03:32PM +0300, Haggai Eran wrote: >>>> From: Yotam Kenneth >>>> >>>> When accepting a new connection with the listener being IPv6, the >>>> family of the new connection is set as IPv6. This causes cma_zero_addr >>>> function to return true on an non-zero address. As a result, the wrong >>>> code path is taken. This causes the connection request to be rejected, >>>> as the RDMA-CM code looks for the wrong type of device. >>> >>> This description doesn't really make sense as to what the problem is. > >> Jason, could you take a look @ this thread >> http://marc.info/?t=14158939504&r=1&w=2 where the authors >> addressed some comments from Sean and he eventually Acked the patch? > > Please actually read my comments: > > If listen_id->route.addr.src_addr.ss_family != AF_INET then it is > invalid to cast to sockaddr_in. That's correct. We didn't address it because it was part of the existing code. Anyway, in a later patch in this series we move this code from the CMA to the CM module. Then we get the port number from the service ID instead of from the listener ID, since the listener ID's port isn't available. > > Sean asked basically the same thing, and his question was ignored too. > > This should take care of it, testing, and figuring the fixes tag is > left as an exercise to the reader.. > Fixes: e51060f08a61 ("IB: IP address based RDMA connection manager") Tested-by: Haggai Eran Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH v2 00/11] Add network namespace support in the RDMA-CM
On 20/04/2015 17:53, Steve Wise wrote: > > Hey Haggai, > > Did you check for changes needed in drivers/infiniband/core/iwcm.c? We focused on namespace support for InfiniBand alone in this series. We didn't handle iWARP, nor did we implement support for RoCE or other transports. > I notice that it uses init_net here: > > static int __init iw_cm_init(void) > { > iwcm_wq = create_singlethread_workqueue("iw_cm_wq"); > if (!iwcm_wq) > return -ENOMEM; > > iwcm_ctl_table_hdr = register_net_sysctl(&init_net, "net/iw_cm", > iwcm_ctl_table); > if (!iwcm_ctl_table_hdr) { > pr_err("iw_cm: couldn't register sysctl paths\n"); > destroy_workqueue(iwcm_wq); > return -ENOMEM; > } > > return 0; > } > I see the only thing in the iWARP sysctl registered here is the default backlog. If you want to control this parameter per namespace, we could store it per network namespace, and add a namespace parameter to iw_cm_listen. I'm not sure how important this is though. Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 08/11] IB/cma: Separate port allocation to network namespaces
From: Yotam Kenneth Keep a radix-tree for the network namespaces we support for each port-space. Dynamically allocate idr for network namespace upon first bind request for a port in the (ps, net) tuple. Destroy the idr when the (ps, net) tuple does not contain any bounded ports. This patch is internal infrastructure work for the following patch. In this patch, init_net is statically used as the network namespace for the new port-space API. The radix-tree is protected under the same locking that protects the rest of the port space data. This locking is practically a big, static mutex lock for the entire module. Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel Signed-off-by: Guy Shapiro --- drivers/infiniband/core/cma.c | 122 ++ 1 file changed, 99 insertions(+), 23 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 1ce84a03c883..022b0d0a51cc 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -39,11 +39,13 @@ #include #include #include +#include #include #include #include #include +#include #include #include @@ -80,10 +82,83 @@ static LIST_HEAD(dev_list); static LIST_HEAD(listen_any_list); static DEFINE_MUTEX(lock); static struct workqueue_struct *cma_wq; -static DEFINE_IDR(tcp_ps); -static DEFINE_IDR(udp_ps); -static DEFINE_IDR(ipoib_ps); -static DEFINE_IDR(ib_ps); +static RADIX_TREE(tcp_ps, GFP_KERNEL); +static RADIX_TREE(udp_ps, GFP_KERNEL); +static RADIX_TREE(ipoib_ps, GFP_KERNEL); +static RADIX_TREE(ib_ps, GFP_KERNEL); + +static LIST_HEAD(idrs_list); + +struct idr_ll { + unsigned net_val; + struct net *net; + struct radix_tree_root *ps; + struct idr idr; +}; + +static void zap_ps_idr(struct idr_ll *idr_ll) +{ + radix_tree_delete(idr_ll->ps, idr_ll->net_val); + idr_destroy(&idr_ll->idr); + kfree(idr_ll); +} + +static int cma_ps_alloc(struct radix_tree_root *ps, struct net *net, void *ptr, + int snum) +{ + struct idr_ll *idr_ll; + int err; + int res; + + idr_ll = radix_tree_lookup(ps, net_hash_mix(net)); + if (!idr_ll) { + idr_ll = kmalloc(sizeof(*idr_ll), GFP_KERNEL); + if (!idr_ll) + return -ENOMEM; + idr_init(&idr_ll->idr); + idr_ll->net_val = net_hash_mix(net); + idr_ll->net = net; + idr_ll->ps = ps; + err = radix_tree_insert(ps, idr_ll->net_val, idr_ll); + if (err) { + idr_destroy(&idr_ll->idr); + kfree(idr_ll); + return err; + } + } + res = idr_alloc(&idr_ll->idr, ptr, snum, snum + 1, GFP_KERNEL); + if (unlikely((res < 0) && idr_is_empty(&idr_ll->idr))) { + zap_ps_idr(idr_ll); + return res; + } + return res; +} + +static void *cma_ps_find(struct radix_tree_root *ps, struct net *net, int snum) +{ + struct idr_ll *idr_ll; + + idr_ll = radix_tree_lookup(ps, net_hash_mix(net)); + if (!idr_ll) + return NULL; + return idr_find(&idr_ll->idr, snum); +} + +static void cma_ps_remove(struct radix_tree_root *ps, struct net *net, int snum) +{ + struct idr_ll *idr_ll; + + idr_ll = radix_tree_lookup(ps, net_hash_mix(net)); + if (unlikely(!idr_ll)) { + WARN(1, "cma_ps_removed can't find expected net ns 0x%lx\n", +(unsigned long)net); + return; + } + idr_remove(&idr_ll->idr, snum); + if (idr_is_empty(&idr_ll->idr)) { + zap_ps_idr(idr_ll); + } +} struct cma_device { struct list_headlist; @@ -94,9 +169,9 @@ struct cma_device { }; struct rdma_bind_list { - struct idr *ps; - struct hlist_head owners; - unsigned short port; + struct radix_tree_root *ps; + struct hlist_head owners; + unsigned short port; }; enum { @@ -885,7 +960,7 @@ static void cma_release_port(struct rdma_id_private *id_priv) mutex_lock(&lock); hlist_del(&id_priv->node); if (hlist_empty(&bind_list->owners)) { - idr_remove(bind_list->ps, bind_list->port); + cma_ps_remove(bind_list->ps, &init_net, bind_list->port); kfree(bind_list); } mutex_unlock(&lock); @@ -2198,8 +2273,8 @@ static void cma_bind_port(struct rdma_bind_list *bind_list, hlist_add_head(&id_priv->node, &bind_list->owners); } -static int cma_alloc_port(struct idr *ps, struct rdma_id_private *id_priv, - un
[PATCH v2 06/11] IB/cm, cma: Move RDMA IP CM private-data parsing code from ib_cma to ib_cm
From: Guy Shapiro When receiving a connection request, ib_cm needs to associate the request with a network namespace. To do this, it needs to know the request's destination IP. For this the RDMA IP CM packet formatting functionality needs to be exposed to ib_cm. This patch merely moves the RDMA IP CM data formatting and parsing functions to be part of ib_cm. The following patch will utilize the new knowledge to look-up the appropriate namespace. Each namespace maintains an independent table of RDMA CM service IDs, allowing isolation and separation between the network namespaces. When creating a new incoming connection ID, the code in cm_save_ip_info can no longer rely on the listener's private data to find the port number, so it reads it from the requested service ID. This required saving the service ID in cm_format_paths_from_req. Signed-off-by: Guy Shapiro Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel --- drivers/infiniband/core/cm.c | 156 +++ drivers/infiniband/core/cma.c | 166 +- include/rdma/ib_cm.h | 56 ++ 3 files changed, 230 insertions(+), 148 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index 5a45cb76c43e..efc5cffb675a 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -51,6 +51,7 @@ #include #include +#include #include "cm_msgs.h" MODULE_AUTHOR("Sean Hefty"); @@ -701,6 +702,159 @@ static void cm_reject_sidr_req(struct cm_id_private *cm_id_priv, ib_send_cm_sidr_rep(&cm_id_priv->id, ¶m); } +int cm_format_hdr(void *hdr, int family, + struct sockaddr *src_addr, + struct sockaddr *dst_addr) +{ + struct cm_hdr *cm_hdr; + + cm_hdr = hdr; + cm_hdr->cm_version = RDMA_IP_CM_VERSION; + if (family == AF_INET) { + struct sockaddr_in *src4, *dst4; + + src4 = (struct sockaddr_in *)src_addr; + dst4 = (struct sockaddr_in *)dst_addr; + + cm_set_ip_ver(cm_hdr, 4); + cm_hdr->src_addr.ip4.addr = src4->sin_addr.s_addr; + cm_hdr->dst_addr.ip4.addr = dst4->sin_addr.s_addr; + cm_hdr->port = src4->sin_port; + } else if (family == AF_INET6) { + struct sockaddr_in6 *src6, *dst6; + + src6 = (struct sockaddr_in6 *)src_addr; + dst6 = (struct sockaddr_in6 *)dst_addr; + + cm_set_ip_ver(cm_hdr, 6); + cm_hdr->src_addr.ip6 = src6->sin6_addr; + cm_hdr->dst_addr.ip6 = dst6->sin6_addr; + cm_hdr->port = src6->sin6_port; + } + return 0; +} +EXPORT_SYMBOL(cm_format_hdr); + +static void cm_save_ib_info(struct sockaddr *src_addr, + struct sockaddr *dst_addr, + struct ib_sa_path_rec *path) +{ + struct sockaddr_ib *ib; + + if (src_addr) { + ib = (struct sockaddr_ib *)src_addr; + ib->sib_family = AF_IB; + ib->sib_pkey = path->pkey; + ib->sib_flowinfo = path->flow_label; + memcpy(&ib->sib_addr, &path->sgid, 16); + ib->sib_sid = path->service_id; + ib->sib_sid_mask = cpu_to_be64(0xULL); + ib->sib_scope_id = 0; + } + if (dst_addr) { + ib = (struct sockaddr_ib *)dst_addr; + ib->sib_family = AF_IB; + ib->sib_pkey = path->pkey; + ib->sib_flowinfo = path->flow_label; + memcpy(&ib->sib_addr, &path->dgid, 16); + } +} + +static void cm_save_ip6_info(struct sockaddr *src_addr, +struct sockaddr *dst_addr, +struct cm_hdr *hdr, +__be16 local_port) +{ + struct sockaddr_in6 *ip6; + + if (src_addr) { + ip6 = (struct sockaddr_in6 *)src_addr; + ip6->sin6_family = AF_INET6; + ip6->sin6_addr = hdr->dst_addr.ip6; + ip6->sin6_port = local_port; + } + + if (dst_addr) { + ip6 = (struct sockaddr_in6 *)dst_addr; + ip6->sin6_family = AF_INET6; + ip6->sin6_addr = hdr->src_addr.ip6; + ip6->sin6_port = hdr->port; + } +} + +static void cm_save_ip4_info(struct sockaddr *src_addr, +struct sockaddr *dst_addr, +struct cm_hdr *hdr, +__be16 local_port) +{ + struct sockaddr_in *ip4; + + if (src_addr) { + ip4 = (struct sockaddr_in *)src_addr; + ip4->sin_family = AF_INET; +
[PATCH v2 07/11] IB/cm: Add network namespace support
From: Guy Shapiro Add namespace support to the IB-CM layer. - Each CM-ID now has a network namespace it is associated with, assigned at creation. This namespace is used as needed during subsequent action on the CM-ID or related objects. - All of the relevant calls to ib_addr and ib_core were updated to use the namespace from the CM-ID. External APIs were extended as needed to allow specifying the namespace where relevant. - The listening service ID table is now also indexed by the CM-ID namespace. - For incoming connection requests, we use the connection parameters to select namespace. The namespace is matched when looking for listening service ID. To preserve current behavior pass init_net to ib_cm wherever network namespace function parameters were added. The ib_cm_create_id interface now takes a reference to the relevant network namespace. CM-IDs created by accepting a connection for a listening CM-ID will also take a reference to the namespace. When the ID is destroyed, the namespace reference is released. Signed-off-by: Guy Shapiro Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel --- drivers/infiniband/core/cm.c| 124 drivers/infiniband/core/cma.c | 8 ++- drivers/infiniband/core/ucm.c | 3 +- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 21 +- drivers/infiniband/ulp/srp/ib_srp.c | 2 +- drivers/infiniband/ulp/srpt/ib_srpt.c | 2 +- include/rdma/ib_cm.h| 7 +- 7 files changed, 130 insertions(+), 37 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index efc5cffb675a..75c6ac9a4aee 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -241,6 +241,8 @@ struct cm_id_private { u8 service_timeout; u8 target_ack_delay; + struct net *net; /* A network namespace that the ID belongs to */ + struct list_head work_list; atomic_t work_count; }; @@ -347,12 +349,13 @@ static void cm_set_private_data(struct cm_id_private *cm_id_priv, } static void cm_init_av_for_response(struct cm_port *port, struct ib_wc *wc, - struct ib_grh *grh, struct cm_av *av) + struct ib_grh *grh, struct cm_av *av, + struct net *net) { av->port = port; av->pkey_index = wc->pkey_index; ib_init_ah_from_wc(port->cm_dev->ib_device, port->port_num, wc, - grh, &av->ah_attr, &init_net); + grh, &av->ah_attr, net); } static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av) @@ -521,10 +524,15 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) if ((cur_cm_id_priv->id.service_mask & service_id) == (service_mask & cur_cm_id_priv->id.service_id) && (cm_id_priv->id.device == cur_cm_id_priv->id.device) && - !data_cmp) + !data_cmp && + net_eq(cm_id_priv->net, cur_cm_id_priv->net)) return cur_cm_id_priv; - if (cm_id_priv->id.device < cur_cm_id_priv->id.device) + if (cm_id_priv->net < cur_cm_id_priv->net) + link = &(*link)->rb_left; + else if (cm_id_priv->net > cur_cm_id_priv->net) + link = &(*link)->rb_right; + else if (cm_id_priv->id.device < cur_cm_id_priv->id.device) link = &(*link)->rb_left; else if (cm_id_priv->id.device > cur_cm_id_priv->id.device) link = &(*link)->rb_right; @@ -544,7 +552,8 @@ static struct cm_id_private * cm_insert_listen(struct cm_id_private *cm_id_priv) static struct cm_id_private * cm_find_listen(struct ib_device *device, __be64 service_id, -u8 *private_data) +u8 *private_data, +struct net *net) { struct rb_node *node = cm.listen_service_table.rb_node; struct cm_id_private *cm_id_priv; @@ -556,10 +565,14 @@ static struct cm_id_private * cm_find_listen(struct ib_device *device, cm_id_priv->compare_data); if ((cm_id_priv->id.service_mask & service_id) == cm_id_priv->id.service_id && - (cm_id_priv->id.device == device) && !data_cmp) + (cm_id_priv->id.device == device) && !data_cmp && + net_
[PATCH v2 04/11] IB/core: Find the network namespace matching connection parameters
From: Yotam Kenneth In the case of IPoIB, and maybe in other cases, the network device is managed by an upper-layer protocol (ULP). In order to expose this network device to other users of the IB device, let ULPs implement a callback that returns network device according to connection parameters. The IB device and port, together with the P_Key and the IP address should be enough to uniquely identify the ULP net device. This function is passed to ib_core as part of the ib_client registration. Using this functionality, add a way to get the network namespace corresponding to a work completion. This is needed so that responses to CM requests can be sent from the same network namespace as the request. Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel Signed-off-by: Guy Shapiro --- drivers/infiniband/core/device.c | 57 include/rdma/ib_verbs.h | 29 2 files changed, 86 insertions(+) diff --git a/drivers/infiniband/core/device.c b/drivers/infiniband/core/device.c index 18c1ece765f2..2f06be5b0b59 100644 --- a/drivers/infiniband/core/device.c +++ b/drivers/infiniband/core/device.c @@ -38,6 +38,7 @@ #include #include #include +#include #include #include "core_priv.h" @@ -733,6 +734,62 @@ int ib_find_pkey(struct ib_device *device, } EXPORT_SYMBOL(ib_find_pkey); +static struct net_device *ib_get_net_dev_by_port_pkey_ip(struct ib_device *dev, +u8 port, +u16 pkey, +struct sockaddr *addr) +{ + struct net_device *ret = NULL; + struct ib_client *client; + + mutex_lock(&device_mutex); + list_for_each_entry(client, &client_list, list) + if (client->get_net_device_by_port_pkey_ip) { + ret = client->get_net_device_by_port_pkey_ip(dev, port, +pkey, +addr); + if (ret) + break; + } + + mutex_unlock(&device_mutex); + return ret; +} + +struct net *ib_get_net_ns_by_port_pkey_ip(struct ib_device *dev, + u8 port, + u16 pkey, + struct sockaddr *addr) +{ + struct net_device *ndev = NULL; + struct net *ns; + + switch (rdma_port_get_link_layer(dev, port)) { + case IB_LINK_LAYER_INFINIBAND: + if (!addr) + goto not_found; + ndev = ib_get_net_dev_by_port_pkey_ip(dev, port, pkey, addr); + break; + default: + goto not_found; + } + + if (!ndev) + goto not_found; + + rcu_read_lock(); + ns = maybe_get_net(dev_net(ndev)); + dev_put(ndev); + rcu_read_unlock(); + if (!ns) + goto not_found; + return ns; + +not_found: + return get_net(&init_net); +} +EXPORT_SYMBOL(ib_get_net_ns_by_port_pkey_ip); + static int __init ib_core_init(void) { int ret; diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index f4a85decc60f..74b239410562 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1683,6 +1683,21 @@ struct ib_client { void (*add) (struct ib_device *); void (*remove)(struct ib_device *); + /* Returns the net_dev belonging to this ib_client and matching the +* given parameters. +* @dev:An RDMA device that the net_dev use for communication. +* @port: A physical port number on the RDMA device. +* @pkey: P_Key that the net_dev uses if applicable. +* @addr: An IP address the net_dev is configured with. +* +* An ib_client that implements a net_dev on top of RDMA devices +* (such as IP over IB) should implement this callback, allowing the +* rdma_cm module to find the right net_dev for a given request. */ + struct net_device *(*get_net_device_by_port_pkey_ip)( + struct ib_device *dev, + u8 port, + u16 pkey, + struct sockaddr *addr); struct list_head list; }; @@ -2679,4 +2694,18 @@ static inline int ib_check_mr_access(int flags) int ib_check_mr_status(struct ib_mr *mr, u32 check_mask, struct ib_mr_status *mr_status); +/** + * ib_get_net_ns_by_port_pkey_ip() - Return the appropriate net namespace + * for a received CM request + * @dev: An RDMA device on which the request has been received. + * @port: Port number on the RDMA device. + * @pkey: The Pke
[PATCH v2 11/11] IB/ucm: Add partial support for network namespaces
From: Shachar Raindel It is impossible to completely support network namespaces for UCM, as we cannot identify the target IPoIB device. However, we add support which will work if the user is following the IB-Spec Annex 11 (RDMA IP CM Services) with the service ID and private data formatting. Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel Signed-off-by: Guy Shapiro --- drivers/infiniband/core/ucm.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/ucm.c b/drivers/infiniband/core/ucm.c index 9604ab068984..424421091dae 100644 --- a/drivers/infiniband/core/ucm.c +++ b/drivers/infiniband/core/ucm.c @@ -45,6 +45,7 @@ #include #include #include +#include #include @@ -490,7 +491,7 @@ static ssize_t ib_ucm_create_id(struct ib_ucm_file *file, ctx->uid = cmd.uid; ctx->cm_id = ib_create_cm_id(file->device->ib_dev, ib_ucm_event_handler, ctx, -&init_net); +current->nsproxy->net_ns); if (IS_ERR(ctx->cm_id)) { result = PTR_ERR(ctx->cm_id); goto err1; -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 09/11] IB/cma: Add support for network namespaces
From: Guy Shapiro Add support for network namespaces in the ib_cma module. This is accomplished by: 1. Adding network namespace parameter for rdma_create_id. This parameter is used to populate the network namespace field in rdma_id_private. rdma_create_id keeps a reference on the network namespace. 2. Using the network namespace from the rdma_id instead of init_net inside of ib_cma. 3. Decrementing the reference count for the appropriate network namespace when calling rdma_destroy_id. In order to preserve the current behavior init_net is passed when calling from other modules. Signed-off-by: Guy Shapiro Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel --- drivers/infiniband/core/cma.c | 52 +- drivers/infiniband/core/ucma.c | 3 +- drivers/infiniband/ulp/iser/iser_verbs.c | 2 +- drivers/infiniband/ulp/isert/ib_isert.c| 2 +- .../staging/lustre/lnet/klnds/o2iblnd/o2iblnd.h| 4 +- include/rdma/rdma_cm.h | 6 ++- net/9p/trans_rdma.c| 2 +- net/rds/ib.c | 2 +- net/rds/ib_cm.c| 2 +- net/rds/iw.c | 2 +- net/rds/iw_cm.c| 2 +- net/rds/rdma_transport.c | 2 +- net/sunrpc/xprtrdma/svc_rdma_transport.c | 2 +- net/sunrpc/xprtrdma/verbs.c| 3 +- 14 files changed, 52 insertions(+), 34 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 022b0d0a51cc..9ea42fe2853b 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -540,7 +540,8 @@ static int cma_disable_callback(struct rdma_id_private *id_priv, struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, void *context, enum rdma_port_space ps, - enum ib_qp_type qp_type) + enum ib_qp_type qp_type, + struct net *net) { struct rdma_id_private *id_priv; @@ -562,7 +563,7 @@ struct rdma_cm_id *rdma_create_id(rdma_cm_event_handler event_handler, INIT_LIST_HEAD(&id_priv->listen_list); INIT_LIST_HEAD(&id_priv->mc_list); get_random_bytes(&id_priv->seq_num, sizeof id_priv->seq_num); - id_priv->id.route.addr.dev_addr.net = &init_net; + id_priv->id.route.addr.dev_addr.net = get_net(net); return &id_priv->id; } @@ -689,7 +690,7 @@ static int cma_modify_qp_rtr(struct rdma_id_private *id_priv, rdma_port_get_link_layer(id_priv->id.device, id_priv->id.port_num) == IB_LINK_LAYER_ETHERNET) { ret = rdma_addr_find_smac_by_sgid(&sgid, qp_attr.smac, NULL, - &init_net); + id_priv->id.route.addr.dev_addr.net); if (ret) goto out; @@ -953,6 +954,7 @@ static void cma_cancel_operation(struct rdma_id_private *id_priv, static void cma_release_port(struct rdma_id_private *id_priv) { struct rdma_bind_list *bind_list = id_priv->bind_list; + struct net *net = id_priv->id.route.addr.dev_addr.net; if (!bind_list) return; @@ -960,7 +962,7 @@ static void cma_release_port(struct rdma_id_private *id_priv) mutex_lock(&lock); hlist_del(&id_priv->node); if (hlist_empty(&bind_list->owners)) { - cma_ps_remove(bind_list->ps, &init_net, bind_list->port); + cma_ps_remove(bind_list->ps, net, bind_list->port); kfree(bind_list); } mutex_unlock(&lock); @@ -1029,6 +1031,7 @@ void rdma_destroy_id(struct rdma_cm_id *id) cma_deref_id(id_priv->id.context); kfree(id_priv->id.route.path_rec); + put_net(id_priv->id.route.addr.dev_addr.net); kfree(id_priv); } EXPORT_SYMBOL(rdma_destroy_id); @@ -1156,7 +1159,8 @@ static struct rdma_id_private *cma_new_conn_id(struct rdma_cm_id *listen_id, int ret; id = rdma_create_id(listen_id->event_handler, listen_id->context, - listen_id->ps, ib_event->param.req_rcvd.qp_type); + listen_id->ps, ib_event->param.req_rcvd.qp_type, + listen_id->route.addr.dev_addr.net); if (IS_ERR(id)) return NULL; @@ -1201,10 +1205,11 @@ static struct rdma_id_private *cma_new_udp_id(struct rdma_cm_id *listen_id, { struct rdma_id_private *id_priv; struct rdma_cm_id *id; + struct net *net = listen_id->route.addr.dev_addr.
[PATCH v2 01/11] RDMA/CMA: Mark IPv4 addresses correctly when the listener is IPv6
From: Yotam Kenneth When accepting a new connection with the listener being IPv6, the family of the new connection is set as IPv6. This causes cma_zero_addr function to return true on an non-zero address. As a result, the wrong code path is taken. This causes the connection request to be rejected, as the RDMA-CM code looks for the wrong type of device. Since copying the ip address is done in different function depending on the family (cma_save_ip4_info/cma_save_ip6_info) this is fixed by hard coding the family of the IP address according to the function. Signed-off-by: Yotam Kenneth Signed-off-by: Or Gerlitz --- drivers/infiniband/core/cma.c | 8 1 file changed, 4 insertions(+), 4 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index d570030d899c..6e5e11ca7702 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -866,12 +866,12 @@ static void cma_save_ip4_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_i listen4 = (struct sockaddr_in *) &listen_id->route.addr.src_addr; ip4 = (struct sockaddr_in *) &id->route.addr.src_addr; - ip4->sin_family = listen4->sin_family; + ip4->sin_family = AF_INET; ip4->sin_addr.s_addr = hdr->dst_addr.ip4.addr; ip4->sin_port = listen4->sin_port; ip4 = (struct sockaddr_in *) &id->route.addr.dst_addr; - ip4->sin_family = listen4->sin_family; + ip4->sin_family = AF_INET; ip4->sin_addr.s_addr = hdr->src_addr.ip4.addr; ip4->sin_port = hdr->port; } @@ -883,12 +883,12 @@ static void cma_save_ip6_info(struct rdma_cm_id *id, struct rdma_cm_id *listen_i listen6 = (struct sockaddr_in6 *) &listen_id->route.addr.src_addr; ip6 = (struct sockaddr_in6 *) &id->route.addr.src_addr; - ip6->sin6_family = listen6->sin6_family; + ip6->sin6_family = AF_INET6; ip6->sin6_addr = hdr->dst_addr.ip6; ip6->sin6_port = listen6->sin6_port; ip6 = (struct sockaddr_in6 *) &id->route.addr.dst_addr; - ip6->sin6_family = listen6->sin6_family; + ip6->sin6_family = AF_INET6; ip6->sin6_addr = hdr->src_addr.ip6; ip6->sin6_port = hdr->port; } -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 10/11] IB/ucma: Take the network namespace from the process
From: Guy Shapiro Add support for network namespaces from user space. This is done by passing the network namespace of the process instead of init_net. Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel Signed-off-by: Guy Shapiro --- drivers/infiniband/core/ucma.c | 3 ++- 1 file changed, 2 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c index 2f7fad84f933..0ccdf2b05153 100644 --- a/drivers/infiniband/core/ucma.c +++ b/drivers/infiniband/core/ucma.c @@ -42,6 +42,7 @@ #include #include #include +#include #include #include @@ -392,7 +393,7 @@ static ssize_t ucma_create_id(struct ucma_file *file, const char __user *inbuf, ctx->uid = cmd.uid; ctx->cm_id = rdma_create_id(ucma_event_handler, ctx, cmd.ps, qp_type, - &init_net); + current->nsproxy->net_ns); if (IS_ERR(ctx->cm_id)) { ret = PTR_ERR(ctx->cm_id); goto err1; -- 1.7.11.2 -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH v2 03/11] IB/core: Pass network namespace as a parameter to relevant functions
From: Guy Shapiro Add network namespace parameters for the address related ib_core functions. The parameter is passed to lower level function, instead of &init_net, so things are done in the correct namespace. For now pass &init_net on every caller. Callers that will pass &init_net permanently are marked with an appropriate comment. Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel Signed-off-by: Guy Shapiro --- drivers/infiniband/core/agent.c | 4 +++- drivers/infiniband/core/cm.c | 9 +++-- drivers/infiniband/core/mad_rmpp.c| 10 -- drivers/infiniband/core/user_mad.c| 4 +++- drivers/infiniband/core/verbs.c | 10 ++ drivers/infiniband/ulp/srpt/ib_srpt.c | 3 ++- include/rdma/ib_verbs.h | 15 +-- 7 files changed, 42 insertions(+), 13 deletions(-) diff --git a/drivers/infiniband/core/agent.c b/drivers/infiniband/core/agent.c index f6d29614cb01..539378d64041 100644 --- a/drivers/infiniband/core/agent.c +++ b/drivers/infiniband/core/agent.c @@ -99,7 +99,9 @@ void agent_send_response(struct ib_mad *mad, struct ib_grh *grh, } agent = port_priv->agent[qpn]; - ah = ib_create_ah_from_wc(agent->qp->pd, wc, grh, port_num); + /* Physical devices (and their MAD replies) always reside in the host +* network namespace */ + ah = ib_create_ah_from_wc(agent->qp->pd, wc, grh, port_num, &init_net); if (IS_ERR(ah)) { dev_err(&device->dev, "ib_create_ah_from_wc error %ld\n", PTR_ERR(ah)); diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index e28a494e2a3a..5a45cb76c43e 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -290,8 +290,13 @@ static int cm_alloc_response_msg(struct cm_port *port, struct ib_mad_send_buf *m; struct ib_ah *ah; + /* For IB, the network namespace doesn't affect the created address +* handle, so we use &init_net. In the future, RoCE support will +* require finding a specific network namespace to send the response +* from. */ ah = ib_create_ah_from_wc(port->mad_agent->qp->pd, mad_recv_wc->wc, - mad_recv_wc->recv_buf.grh, port->port_num); + mad_recv_wc->recv_buf.grh, port->port_num, + &init_net); if (IS_ERR(ah)) return PTR_ERR(ah); @@ -346,7 +351,7 @@ static void cm_init_av_for_response(struct cm_port *port, struct ib_wc *wc, av->port = port; av->pkey_index = wc->pkey_index; ib_init_ah_from_wc(port->cm_dev->ib_device, port->port_num, wc, - grh, &av->ah_attr); + grh, &av->ah_attr, &init_net); } static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av) diff --git a/drivers/infiniband/core/mad_rmpp.c b/drivers/infiniband/core/mad_rmpp.c index f37878c9c06e..6c1576202965 100644 --- a/drivers/infiniband/core/mad_rmpp.c +++ b/drivers/infiniband/core/mad_rmpp.c @@ -157,8 +157,11 @@ static struct ib_mad_send_buf *alloc_response_msg(struct ib_mad_agent *agent, struct ib_ah *ah; int hdr_len; + /* Physical devices (and their MAD replies) always reside in the host +* network namespace */ ah = ib_create_ah_from_wc(agent->qp->pd, recv_wc->wc, - recv_wc->recv_buf.grh, agent->port_num); + recv_wc->recv_buf.grh, agent->port_num, + &init_net); if (IS_ERR(ah)) return (void *) ah; @@ -287,10 +290,13 @@ create_rmpp_recv(struct ib_mad_agent_private *agent, if (!rmpp_recv) return NULL; + /* Physical devices (and their MAD replies) always reside in the host +* network namespace */ rmpp_recv->ah = ib_create_ah_from_wc(agent->agent.qp->pd, mad_recv_wc->wc, mad_recv_wc->recv_buf.grh, -agent->agent.port_num); +agent->agent.port_num, +&init_net); if (IS_ERR(rmpp_recv->ah)) goto error; diff --git a/drivers/infiniband/core/user_mad.c b/drivers/infiniband/core/user_mad.c index 928cdd20e2d1..f34c6077759d 100644 --- a/drivers/infiniband/core/user_mad.c +++ b/drivers/infiniband/core/user_mad.c @@ -239,7 +239,9 @@ static void recv_handler(struct ib_mad_agent *agent, ib_init_ah_from_wc(agent->device, agent->port_num, mad
[PATCH v2 00/11] Add network namespace support in the RDMA-CM
On 4/15/2015 3:39 PM, Doug Ledford wrote: > For instance, the namespace patches aren't included, and that's at least > partially because they didn't apply cleanly any more. Here's an updated series on top of your tree. I've also included the fix for IPv4 connections to IPv6 listeners. Regards, Haggai Changes from v1: - Include patch 1 in this series. - Rebase for v4.1. Changes from v0: - Fix code review comments by Yann - Rebase on top of linux-3.19 RDMA-CM uses IP based addressing and routing to setup RDMA connections between hosts. Currently, all of the IP interfaces and addresses used by the RDMA-CM must reside in the init_net namespace. This restricts the usage of containers with RDMA to only work with host network namespace (aka the kernel init_net NS instance). This patchset allows using network namespaces with the RDMA-CM. Each RDMA-CM and CM id is keeping a reference to a network namespace. This reference is based on the process network namespace at the time of the creation of the object or inherited from the listener. This network namespace is used to perform all IP and network related operations. Specifically, the local device lookup, as well as the remote GID address resolution are done in the context of the RDMA-CM object's namespace. This allows outgoing connections to reach the right target, even if the same IP address exists in multiple network namespaces. This can happen if each network namespace resides on a different pkey. Additionally, the network namespace is used to split the listener service ID table. From the user point of view, each network namespace has a unique, completely independent table of service IDs. This allows running multiple instances of a single service on the same machine, using containers. To implement this, the CM layer now parses the IP address from the CM connect requests, and searches for the matching networking device. The namespace of the device found is used when looking up the service ID in the listener table. The functionnality introduced by this series would come into play when the transport is InfiniBand and IPoIB interfaces are assigned to each namespace. Multiple IPoIB interfaces can be created and assigned to different RDMA-CM capable containers, for example using pipework [1]. Full support for RoCE will be introduced in a later stage. The patches apply against Roland's/Doug's tree for v4.1. The patchset is structured as follows: Patch 1 is a resend of patch to fix IPv4 connections to an IPv4/IPv6 listener. Patches 2 and 4 are relatively trivial API extensions, requiring the callers of certain ib_addr and ib_core functions to provide a network namespace, as needed. Patches 4 and 5 adds the ability to lookup a network namespace according to the IP address, device and pkey. It finds the matching IPoIB interfaces, and safely takes a reference on the network namespace before returning to the caller. Patch 6 moves the logic that extracts the IP address from a connect request into the CM layer. This is needed for the upcoming listener lookup by namespace. Patch 7 adds support for network namespaces in the CM layer. All callers are still passing init_net as the namespace, to maintain backward compatibility. For incoming requests, the namespace of the relevant IPoIB device is used. Patches 8 and 9 add proper namespace support to the RDMA-CM module. Patches 10 and 11 add namespace support to the relevant user facing modules in the IB stack. [1] https://github.com/jpetazzo/pipework/pull/108 Guy Shapiro (7): IB/addr: Pass network namespace as a parameter IB/core: Pass network namespace as a parameter to relevant functions IB/ipoib: Return IPoIB devices as possible matches to get_net_device_by_port_pkey_ip IB/cm, cma: Move RDMA IP CM private-data parsing code from ib_cma to ib_cm IB/cm: Add network namespace support IB/cma: Add support for network namespaces IB/ucma: Take the network namespace from the process Shachar Raindel (1): IB/ucm: Add partial support for network namespaces Yotam Kenneth (3): RDMA/CMA: Mark IPv4 addresses correctly when the listener is IPv6 IB/core: Find the network namespace matching connection parameters IB/cma: Separate port allocation to network namespaces drivers/infiniband/core/addr.c | 31 +- drivers/infiniband/core/agent.c| 4 +- drivers/infiniband/core/cm.c | 287 -- drivers/infiniband/core/cma.c | 332 + drivers/infiniband/core/device.c | 57 drivers/infiniband/core/mad_rmpp.c | 10 +- drivers/infiniband/core/ucm.c | 4 +- drivers/infiniband/core/ucma.c | 4 +- drivers/infiniband/core/user_mad.c | 4 +- drivers/infiniband/core/verbs.c| 22 +- drivers/infiniband/hw/ocrdma/ocrdma_ah.c | 3 +- drivers/infiniband/ulp/ipoib/ip
[PATCH v2 02/11] IB/addr: Pass network namespace as a parameter
From: Guy Shapiro Add network namespace support to the ib_addr module. For that, all the address resolution and matching should be done using the appropriate namespace instead of init_net. This is achieved by: 1. Adding an explicit network namespace argument to exported function that require a namespace. 2. Saving the namespace in the rdma_addr_client structure. 3. Using it when calling networking functions. In order to preserve the behavior of calling modules, &init_net is passed as the parameter in calls from other modules. This is modified as namespace support is added on more levels. Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel Signed-off-by: Guy Shapiro --- drivers/infiniband/core/addr.c | 31 -- drivers/infiniband/core/cma.c| 4 ++- drivers/infiniband/core/verbs.c | 14 +++--- drivers/infiniband/hw/ocrdma/ocrdma_ah.c | 3 ++- include/rdma/ib_addr.h | 44 5 files changed, 72 insertions(+), 24 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index f80da50d84a5..95beaef6b66d 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -128,7 +128,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, int ret = -EADDRNOTAVAIL; if (dev_addr->bound_dev_if) { - dev = dev_get_by_index(&init_net, dev_addr->bound_dev_if); + dev = dev_get_by_index(dev_addr->net, dev_addr->bound_dev_if); if (!dev) return -ENODEV; ret = rdma_copy_addr(dev_addr, dev, NULL); @@ -137,9 +137,10 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, } switch (addr->sa_family) { - case AF_INET: - dev = ip_dev_find(&init_net, - ((struct sockaddr_in *) addr)->sin_addr.s_addr); + case AF_INET: { + struct sockaddr_in *addr_in = (struct sockaddr_in *)addr; + + dev = ip_dev_find(dev_addr->net, addr_in->sin_addr.s_addr); if (!dev) return ret; @@ -149,12 +150,12 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, *vlan_id = rdma_vlan_dev_vlan_id(dev); dev_put(dev); break; - + } #if IS_ENABLED(CONFIG_IPV6) case AF_INET6: rcu_read_lock(); - for_each_netdev_rcu(&init_net, dev) { - if (ipv6_chk_addr(&init_net, + for_each_netdev_rcu(dev_addr->net, dev) { + if (ipv6_chk_addr(dev_addr->net, &((struct sockaddr_in6 *) addr)->sin6_addr, dev, 1)) { ret = rdma_copy_addr(dev_addr, dev, NULL); @@ -236,7 +237,7 @@ static int addr4_resolve(struct sockaddr_in *src_in, fl4.daddr = dst_ip; fl4.saddr = src_ip; fl4.flowi4_oif = addr->bound_dev_if; - rt = ip_route_output_key(&init_net, &fl4); + rt = ip_route_output_key(addr->net, &fl4); if (IS_ERR(rt)) { ret = PTR_ERR(rt); goto out; @@ -278,12 +279,13 @@ static int addr6_resolve(struct sockaddr_in6 *src_in, fl6.saddr = src_in->sin6_addr; fl6.flowi6_oif = addr->bound_dev_if; - dst = ip6_route_output(&init_net, NULL, &fl6); + dst = ip6_route_output(addr->net, NULL, &fl6); if ((ret = dst->error)) goto put; if (ipv6_addr_any(&fl6.saddr)) { - ret = ipv6_dev_get_saddr(&init_net, ip6_dst_idev(dst)->dev, + ret = ipv6_dev_get_saddr(addr->net, +ip6_dst_idev(dst)->dev, &fl6.daddr, 0, &fl6.saddr); if (ret) goto put; @@ -458,7 +460,7 @@ static void resolve_cb(int status, struct sockaddr *src_addr, } int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac, - u16 *vlan_id) + u16 *vlan_id, struct net *net) { int ret = 0; struct rdma_dev_addr dev_addr; @@ -481,6 +483,7 @@ int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac, return ret; memset(&dev_addr, 0, sizeof(dev_addr)); + dev_addr.net = net; ctx.addr = &dev_addr; init_completion(&ctx.comp); @@ -492,7 +495,7 @@ int rdma_addr_find_dmac_by_grh(union ib_gid *sgid, union ib_gid *dgid, u8 *dmac, wait_for_completion(&ctx.comp); memcpy(
[PATCH v2 05/11] IB/ipoib: Return IPoIB devices as possible matches to get_net_device_by_port_pkey_ip
From: Guy Shapiro Implement callback that returns network device to ib_core according to connection parameters. Check the ipoib device and iterate over all child devices to look for a match. For each ipoib device we iterate through all upper devices when searching for a matching IP, in order to support bonding. Signed-off-by: Guy Shapiro Signed-off-by: Haggai Eran Signed-off-by: Yotam Kenneth Signed-off-by: Shachar Raindel --- drivers/infiniband/ulp/ipoib/ipoib_main.c | 122 +- 1 file changed, 121 insertions(+), 1 deletion(-) diff --git a/drivers/infiniband/ulp/ipoib/ipoib_main.c b/drivers/infiniband/ulp/ipoib/ipoib_main.c index 7cad4dd87469..89a59a0e17e6 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_main.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_main.c @@ -48,6 +48,9 @@ #include #include +#include +#include +#include #define DRV_VERSION "1.0.0" @@ -91,11 +94,15 @@ struct ib_sa_client ipoib_sa_client; static void ipoib_add_one(struct ib_device *device); static void ipoib_remove_one(struct ib_device *device); static void ipoib_neigh_reclaim(struct rcu_head *rp); +static struct net_device *ipoib_get_net_device_by_port_pkey_ip( + struct ib_device *dev, u8 port, u16 pkey, + struct sockaddr *addr); static struct ib_client ipoib_client = { .name = "ipoib", .add= ipoib_add_one, - .remove = ipoib_remove_one + .remove = ipoib_remove_one, + .get_net_device_by_port_pkey_ip = ipoib_get_net_device_by_port_pkey_ip, }; int ipoib_open(struct net_device *dev) @@ -222,6 +229,119 @@ static int ipoib_change_mtu(struct net_device *dev, int new_mtu) return 0; } +static bool ipoib_is_dev_match_addr(struct sockaddr *addr, + struct net_device *dev) +{ + struct net *net = dev_net(dev); + + if (addr->sa_family == AF_INET) { + struct in_device *in_dev = in_dev_get(dev); + struct sockaddr_in *addr_in = (struct sockaddr_in *)addr; + __be32 ret_addr; + + if (!in_dev) + return false; + + ret_addr = inet_confirm_addr(net, in_dev, 0, +addr_in->sin_addr.s_addr, +RT_SCOPE_HOST); + in_dev_put(in_dev); + if (ret_addr) + return true; + } +#if IS_ENABLED(CONFIG_IPV6) + else if (addr->sa_family == AF_INET6) { + struct sockaddr_in6 *addr_in6 = (struct sockaddr_in6 *)addr; + + if (ipv6_chk_addr(net, &addr_in6->sin6_addr, dev, 1)) + return true; + } +#endif + return false; +} + +/** + * Find a net_device matching the given address, which is an upper device of + * the given net_device. + * @addr: IP address to look for. + * @dev: base IPoIB net_device + * + * If found, returns the net_device with a reference held. Otherwise return + * NULL. + */ +static struct net_device *ipoib_get_net_dev_match_addr(struct sockaddr *addr, + struct net_device *dev) +{ + struct net_device *upper, + *result = NULL; + struct list_head *iter; + + if (ipoib_is_dev_match_addr(addr, dev)) { + dev_hold(dev); + return dev; + } + + rcu_read_lock(); + netdev_for_each_all_upper_dev_rcu(dev, upper, iter) { + if (ipoib_is_dev_match_addr(addr, upper)) { + dev_hold(upper); + result = upper; + break; + } + } + rcu_read_unlock(); + return result; +} + +static struct net_device *ipoib_get_net_device_by_port_pkey_ip( + struct ib_device *dev, u8 port, u16 pkey, struct sockaddr *addr) +{ + struct ipoib_dev_priv *priv; + struct list_head *dev_list; + u16 pkey_index; + + ib_find_cached_pkey(dev, port, pkey, &pkey_index); + if (pkey_index == (u16)-1) + return NULL; + + if (rdma_node_get_transport(dev->node_type) != RDMA_TRANSPORT_IB) + return NULL; + + dev_list = ib_get_client_data(dev, &ipoib_client); + if (!dev_list) + return NULL; + + list_for_each_entry(priv, dev_list, list) { + struct net_device *net_dev = NULL; + struct ipoib_dev_priv *child_priv; + + if (priv->port != port) + continue; + + if (priv->pkey_index == pkey_index) { + net_dev = ipoib_get_net_dev_match_addr(addr, priv->dev); + if (net_dev) + return net_dev; + } + + down_read(&priv->vlan_rwsem); + list_for_each_entry(child_pri
Re: [PATCH V1 net-next] IB/ipoib: Fix ndo_get_iflink
On 17/04/2015 22:21, David Miller wrote: > From: Erez Shitrit > Date: Thu, 16 Apr 2015 16:34:34 +0300 > >> Currently, iflink of the parent interface was always accessed, even >> when interface didn't have a parent and hence we crashed there. >> >> Handle the interface types properly: for a child interface, return >> the ifindex of the parent, for parent interface, return its ifindex. >> >> For child devices, make sure to set the parent pointer prior to >> invoking register_netdevice(), this allows the new ndo to be called >> by the stack immediately after the child device is registered. >> >> Fixes: 5aa7add8f14b ('infiniband/ipoib: implement ndo_get_iflink') >> Reported-by: Honggang Li >> Signed-off-by: Erez Shitrit >> Signed-off-by: Honggang Li > > Applied, thanks. Doug, Roland, You might want to include this patch in your for-next / for-4.1 trees, or merge net-next again. Currently they contain the issue it fixes, and it can prevent some systems with IPoIB from booting. Regards, Haggai -- To unsubscribe from this list: send the line "unsubscribe netdev" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html