Re: stalled again
Hi Doug, not having any maintainer available for an extended time is a problem, and we actually had long discussions about that at kernel summit, with a clear hint with a cluebat from Linus that he'd prefer maintainer teams. So I'd really love to know who was so dead set against them. I personally don't really care if there is a real team (active/active) or just a standby (active/passive), but I think we really need a coherent tree of pending patches to avoid last-minute rebases due to conflicts, as well as some feedback for submitters. I'd be happy to volunteer to collect all patches that were properly reviewed into a queue for you to consider, to at least sort out these mechanics, although I'd be even happier if someone with longer experience in the subsystem would volunteer instead. -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
Re: [PATCH 5/6] IB core: Fix ib_sg_to_pages()
Hello Sagi, Hmm ... why would it be unacceptable to return 0 if sg_nents == 0 ? Regarding which component to modify if mapping the first page fails: for almost every kernel function I know, a negative return value means failure and a return value >= 0 means success. Hence my proposal to change the return value of the ib_map_mr_sg() function if mapping the first page fails. I'm fine with that. How about the patch below ? Looks fine.
Re: [PATCH 6/6] IB/srp: Fix srp_map_sg_fr()
On Thu, Dec 03, 2015 at 10:46:10AM +0200, Sagi Grimberg wrote: > >> If entries 2 and 3 could be merged dma_len for 2 would span 2 and 3, > >> and then entry 3 would actually have the dma addr and len for entry 4. > > So what would be in the last entry {dma_addr, dma_len}? zeros? > > >>I'm not sure anyone still does that, but the first spot to check would > >>be the Parisc IOMMU drivers. > > So how does that sit with the fact that dma_unmap requires the > same sg_nents as in dma_map and not the actual value of dma entries? Take a look at drivers/parisc/iommu-helpers.h:iommu_coalesce_chunks() and drivers/parisc/sba_iommu.c:sba_unmap_sg() for example. The first fills out the sglist, and zeroes all unused entries past what it fills in. The unmap side then simply exits the loop if the entries are zeroed.
Re: [PATCH 5/6] IB core: Fix ib_sg_to_pages()
> How about the patch below ? The patch looks good to me, but while we touch this area, how about throwing in a few cosmetic fixes as well? > - if (i && page_addr != dma_addr) { > + if (i && (page_addr != dma_addr || last_page_off != 0)) { > if (last_end_dma_addr != dma_addr) { How about we add one or two sentences for each of the conditions here? > /* gap */ > - goto done; > - > + break; > } else if (last_page_off + dma_len <= mr->page_size) { > /* chunk this fragment with the last */ > mr->length += dma_len; It would be great to avoid the else clauses if we already do a break/continue/goto, to make the code flow more clear, e.g. /* * Gap to the previous segment, we'll need to return * and use another FR to map the remainder. */ if (last_end_dma_addr != dma_addr) break; /* * See if this segment is contiguous to the * previous one and just merge it in that case. */ if (last_page_off + dma_len <= mr->page_size) { last_end_dma_addr += dma_len; last_page_off += dma_len; mr->length += dma_len; continue; } /* * New page-aligned segment to map: */ page_addr = last_page_addr + mr->page_size; dma_len -= mr->page_size - last_page_off;
Re: [PATCH 6/6] IB/srp: Fix srp_map_sg_fr()
Replying to my own email, dma_map_sg returns the actual number of entries to iterate. At least historically some IOMMU implementations would do strange tricks like: If entries 2 and 3 could be merged dma_len for 2 would span 2 and 3, and then entry 3 would actually have the dma addr and len for entry 4. So what would be in the last entry {dma_addr, dma_len}? zeros? I'm not sure anyone still does that, but the first spot to check would be the Parisc IOMMU drivers. So how does that sit with the fact that dma_unmap requires the same sg_nents as in dma_map and not the actual value of dma entries?
[PATCH for-next V2 02/11] IB/cm: Use the source GID index type
Previously, cm and cma modules supported only IB and RoCE v1 GID type. In order to support multiple GID types, the gid_type is passed to cm_init_av_by_path and stored in the path record. The rdma cm client would use a default GID type that will be saved in rdma_id_private. Signed-off-by: Matan Barak--- drivers/infiniband/core/cm.c | 25 - drivers/infiniband/core/cma.c | 2 ++ 2 files changed, 22 insertions(+), 5 deletions(-) diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c index af8b907..5ea78ab 100644 --- a/drivers/infiniband/core/cm.c +++ b/drivers/infiniband/core/cm.c @@ -364,7 +364,7 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av) read_lock_irqsave(_lock, flags); list_for_each_entry(cm_dev, _list, list) { if (!ib_find_cached_gid(cm_dev->ib_device, >sgid, - IB_GID_TYPE_IB, ndev, , NULL)) { + path->gid_type, ndev, , NULL)) { port = cm_dev->port[p-1]; break; } @@ -1600,6 +1600,8 @@ static int cm_req_handler(struct cm_work *work) struct ib_cm_id *cm_id; struct cm_id_private *cm_id_priv, *listen_cm_id_priv; struct cm_req_msg *req_msg; + union ib_gid gid; + struct ib_gid_attr gid_attr; int ret; req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad; @@ -1639,11 +1641,24 @@ static int cm_req_handler(struct cm_work *work) cm_format_paths_from_req(req_msg, >path[0], >path[1]); memcpy(work->path[0].dmac, cm_id_priv->av.ah_attr.dmac, ETH_ALEN); - ret = cm_init_av_by_path(>path[0], _id_priv->av); + ret = ib_get_cached_gid(work->port->cm_dev->ib_device, + work->port->port_num, + cm_id_priv->av.ah_attr.grh.sgid_index, + , _attr); + if (!ret) { + if (gid_attr.ndev) + dev_put(gid_attr.ndev); + work->path[0].gid_type = gid_attr.gid_type; + ret = cm_init_av_by_path(>path[0], _id_priv->av); + } if (ret) { - ib_get_cached_gid(work->port->cm_dev->ib_device, - work->port->port_num, 0, >path[0].sgid, - NULL); + int err = ib_get_cached_gid(work->port->cm_dev->ib_device, + work->port->port_num, 0, + >path[0].sgid, + _attr); 
+ if (!err && gid_attr.ndev) + dev_put(gid_attr.ndev); + work->path[0].gid_type = gid_attr.gid_type; ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID, >path[0].sgid, sizeof work->path[0].sgid, NULL, 0); diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index c19f822..2914e08 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -228,6 +228,7 @@ struct rdma_id_private { u8 tos; u8 reuseaddr; u8 afonly; + enum ib_gid_typegid_type; }; struct cma_multicast { @@ -2325,6 +2326,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv) ndev = dev_get_by_index(_net, addr->dev_addr.bound_dev_if); route->path_rec->net = _net; route->path_rec->ifindex = addr->dev_addr.bound_dev_if; + route->path_rec->gid_type = id_priv->gid_type; } if (!ndev) { ret = -ENODEV; -- 2.1.0
[PATCH for-next V2 03/11] IB/core: Add gid attributes to sysfs
This patch set adds attributes of net device and gid type to each GID in the GID table. Users that use verbs directly need to specify the GID index. Since the same GID could have different types or associated net devices, users should have the ability to query the associated GID attributes. Adding these attributes to sysfs. Signed-off-by: Matan Barak--- Documentation/ABI/testing/sysfs-class-infiniband | 16 ++ drivers/infiniband/core/sysfs.c | 184 ++- 2 files changed, 198 insertions(+), 2 deletions(-) create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband diff --git a/Documentation/ABI/testing/sysfs-class-infiniband b/Documentation/ABI/testing/sysfs-class-infiniband new file mode 100644 index 000..a86abe6 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-infiniband @@ -0,0 +1,16 @@ +What: /sys/class/infiniband/<device>/ports/<port-num>/gid_attrs/ndevs/<gid-index> +Date: November 29, 2015 +KernelVersion: 4.4.0 +Contact: linux-rdma@vger.kernel.org +Description: The net-device's name associated with the GID resides + at index <gid-index>. + +What: /sys/class/infiniband/<device>/ports/<port-num>/gid_attrs/types/<gid-index> +Date: November 29, 2015 +KernelVersion: 4.4.0 +Contact: linux-rdma@vger.kernel.org +Description: The RoCE type of the associated GID resides at index <gid-index>. + This could either be "IB/RoCE v1" for IB and RoCE v1 based GIDs + or "RoCE v2" for RoCE v2 based GIDs. 
+ + diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index b1f37d4..4d5d87a 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -37,12 +37,22 @@ #include #include #include +#include #include +struct ib_port; + +struct gid_attr_group { + struct ib_port *port; + struct kobject kobj; + struct attribute_group ndev; + struct attribute_group type; +}; struct ib_port { struct kobject kobj; struct ib_device *ibdev; + struct gid_attr_group *gid_attr_group; struct attribute_group gid_group; struct attribute_group pkey_group; u8 port_num; @@ -84,6 +94,24 @@ static const struct sysfs_ops port_sysfs_ops = { .show = port_attr_show }; +static ssize_t gid_attr_show(struct kobject *kobj, +struct attribute *attr, char *buf) +{ + struct port_attribute *port_attr = + container_of(attr, struct port_attribute, attr); + struct ib_port *p = container_of(kobj, struct gid_attr_group, +kobj)->port; + + if (!port_attr->show) + return -EIO; + + return port_attr->show(p, port_attr, buf); +} + +static const struct sysfs_ops gid_attr_sysfs_ops = { + .show = gid_attr_show +}; + static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, char *buf) { @@ -281,6 +309,46 @@ static struct attribute *port_default_attrs[] = { NULL }; +static size_t print_ndev(struct ib_gid_attr *gid_attr, char *buf) +{ + if (!gid_attr->ndev) + return -EINVAL; + + return sprintf(buf, "%s\n", gid_attr->ndev->name); +} + +static size_t print_gid_type(struct ib_gid_attr *gid_attr, char *buf) +{ + return sprintf(buf, "%s\n", ib_cache_gid_type_str(gid_attr->gid_type)); +} + +static ssize_t _show_port_gid_attr(struct ib_port *p, + struct port_attribute *attr, + char *buf, + size_t (*print)(struct ib_gid_attr *gid_attr, + char *buf)) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + struct ib_gid_attr gid_attr = {}; + ssize_t ret; + va_list args; + + ret = 
ib_query_gid(p->ibdev, p->port_num, tab_attr->index, , + _attr); + if (ret) + goto err; + + ret = print(_attr, buf); + +err: + if (gid_attr.ndev) + dev_put(gid_attr.ndev); + va_end(args); + return ret; +} + static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, char *buf) { @@ -296,6 +364,19 @@ static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, return sprintf(buf, "%pI6\n", gid.raw); } +static ssize_t show_port_gid_attr_ndev(struct ib_port *p, + struct port_attribute *attr, char *buf) +{ + return _show_port_gid_attr(p, attr, buf, print_ndev); +} + +static ssize_t show_port_gid_attr_gid_type(struct ib_port *p, + struct port_attribute *attr, + char *buf) +{
[PATCH for-next V2 01/11] IB/core: Add gid_type to gid attribute
In order to support multiple GID types, we need to store the gid_type with each GID. This is also aligned with the RoCE v2 annex "RoCEv2 PORT GID table entries shall have a "GID type" attribute that denotes the L3 Address type". The currently supported GID is IB_GID_TYPE_IB which is also RoCE v1 GID type. This implies that gid_type should be added to roce_gid_table meta-data. Signed-off-by: Matan Barak--- drivers/infiniband/core/cache.c | 144 -- drivers/infiniband/core/cm.c | 2 +- drivers/infiniband/core/cma.c | 3 +- drivers/infiniband/core/core_priv.h | 4 + drivers/infiniband/core/device.c | 9 +- drivers/infiniband/core/multicast.c | 2 +- drivers/infiniband/core/roce_gid_mgmt.c | 60 +++-- drivers/infiniband/core/sa_query.c| 5 +- drivers/infiniband/core/uverbs_marshall.c | 1 + drivers/infiniband/core/verbs.c | 1 + include/rdma/ib_cache.h | 4 + include/rdma/ib_sa.h | 1 + include/rdma/ib_verbs.h | 11 ++- 13 files changed, 185 insertions(+), 62 deletions(-) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index 097e9df..566fd8f 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -64,6 +64,7 @@ enum gid_attr_find_mask { GID_ATTR_FIND_MASK_GID = 1UL << 0, GID_ATTR_FIND_MASK_NETDEV = 1UL << 1, GID_ATTR_FIND_MASK_DEFAULT = 1UL << 2, + GID_ATTR_FIND_MASK_GID_TYPE = 1UL << 3, }; enum gid_table_entry_props { @@ -125,6 +126,19 @@ static void dispatch_gid_change_event(struct ib_device *ib_dev, u8 port) } } +static const char * const gid_type_str[] = { + [IB_GID_TYPE_IB]= "IB/RoCE v1", +}; + +const char *ib_cache_gid_type_str(enum ib_gid_type gid_type) +{ + if (gid_type < ARRAY_SIZE(gid_type_str) && gid_type_str[gid_type]) + return gid_type_str[gid_type]; + + return "Invalid GID type"; +} +EXPORT_SYMBOL(ib_cache_gid_type_str); + /* This function expects that rwlock will be write locked in all * scenarios and that lock will be locked in sleep-able (RoCE) * scenarios. 
@@ -233,6 +247,10 @@ static int find_gid(struct ib_gid_table *table, const union ib_gid *gid, if (found >=0) continue; + if (mask & GID_ATTR_FIND_MASK_GID_TYPE && + attr->gid_type != val->gid_type) + continue; + if (mask & GID_ATTR_FIND_MASK_GID && memcmp(gid, >gid, sizeof(*gid))) continue; @@ -296,6 +314,7 @@ int ib_cache_gid_add(struct ib_device *ib_dev, u8 port, write_lock_irq(>rwlock); ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID | + GID_ATTR_FIND_MASK_GID_TYPE | GID_ATTR_FIND_MASK_NETDEV, ); if (ix >= 0) goto out_unlock; @@ -329,6 +348,7 @@ int ib_cache_gid_del(struct ib_device *ib_dev, u8 port, ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID | + GID_ATTR_FIND_MASK_GID_TYPE | GID_ATTR_FIND_MASK_NETDEV | GID_ATTR_FIND_MASK_DEFAULT, NULL); @@ -427,11 +447,13 @@ static int _ib_cache_gid_table_find(struct ib_device *ib_dev, static int ib_cache_gid_find(struct ib_device *ib_dev, const union ib_gid *gid, +enum ib_gid_type gid_type, struct net_device *ndev, u8 *port, u16 *index) { - unsigned long mask = GID_ATTR_FIND_MASK_GID; - struct ib_gid_attr gid_attr_val = {.ndev = ndev}; + unsigned long mask = GID_ATTR_FIND_MASK_GID | +GID_ATTR_FIND_MASK_GID_TYPE; + struct ib_gid_attr gid_attr_val = {.ndev = ndev, .gid_type = gid_type}; if (ndev) mask |= GID_ATTR_FIND_MASK_NETDEV; @@ -442,14 +464,16 @@ static int ib_cache_gid_find(struct ib_device *ib_dev, int ib_find_cached_gid_by_port(struct ib_device *ib_dev, const union ib_gid *gid, + enum ib_gid_type gid_type, u8 port, struct net_device *ndev, u16 *index) { int local_index; struct ib_gid_table **ports_table = ib_dev->cache.gid_cache; struct ib_gid_table *table; - unsigned long mask = GID_ATTR_FIND_MASK_GID; - struct ib_gid_attr val = {.ndev = ndev}; + unsigned long mask = GID_ATTR_FIND_MASK_GID | +GID_ATTR_FIND_MASK_GID_TYPE; + struct ib_gid_attr val = {.ndev = ndev, .gid_type = gid_type}; unsigned long flags;
[PATCH for-next V2 00/11] Add RoCE v2 support
Hi Doug, This series adds the support for RoCE v2. In order to support RoCE v2, we add a gid_type attribute to every GID. When the RoCE GID management populates the GID table, it duplicates each GID with all supported types. This gives the user the ability to communicate over each supported type. Patches 0001, 0002 and 0003 add support for multiple GID types to the cache and related APIs. The third patch exposes the GID attribute information in sysfs. Patch 0004 adds the RoCE v2 GID type and the capabilities required from the vendor in order to implement RoCE v2. These capabilities are grouped together as RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP. RoCE v2 can work over IPv4 and IPv6 networks. When receiving an ib_wc, this information should come from the vendor's driver. In case the vendor doesn't supply this information, we parse the packet headers and resolve its network type. Patch 0005 adds this information and the required utilities. Patches 0006 and 0007 add route validation. This is mandatory to ensure that we send packets using GIDs which correspond to a net-device that can be routed to the destination. Patches 0008 and 0009 add configfs support (and the required infrastructure) for CMA. The administrator should be able to set the default RoCE type. This is done through a new per-port default_roce_mode configfs file. Patch 0010 formats a QP1 packet in order to support RoCE v2 CM packets. This is required for vendors which implement their QP1 as a Raw QP. Patch 0011 adds support for IPv4 multicast, as an IPv4 network requires IGMP to be sent in order to join multicast groups. Vendor code isn't part of this patch-set. Soft-RoCE will be sent soon and depends on these patches. Other vendors, like mlx4, ocrdma and mlx5, will follow. This patch set is applied on top of "Change per-entry locks in GID cache to table lock" which was sent to the mailing list. Thanks, Matan Changes from V1: - Rebased against Linux 4.4-rc2 master branch. 
- Add route validation - ConfigFS - avoid compiling INFINIBAND=y and CONFIGFS_FS=m - Add documentation for configfs and sysfs ABI - Remove ifindex and gid_type from mcmember Changes from V0: - Rebased patches against Doug's latest k.o/for-4.4 tree. - Fixed a bug in configfs (rmdir caused an incorrect free). Matan Barak (8): IB/core: Add gid_type to gid attribute IB/cm: Use the source GID index type IB/core: Add gid attributes to sysfs IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type IB/core: Move rdma_is_upper_dev_rcu to header file IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path IB/rdma_cm: Add wrapper for cma reference count IB/cma: Add configfs for rdma_cm Moni Shoua (2): IB/core: Initialize UD header structure with IP and UDP headers IB/cma: Join and leave multicast groups with IGMP Somnath Kotur (1): IB/core: Add rdma_network_type to wc Documentation/ABI/testing/configfs-rdma_cm | 22 ++ Documentation/ABI/testing/sysfs-class-infiniband | 16 ++ drivers/infiniband/Kconfig | 9 + drivers/infiniband/core/Makefile | 2 + drivers/infiniband/core/addr.c | 185 + drivers/infiniband/core/cache.c | 169 drivers/infiniband/core/cm.c | 31 ++- drivers/infiniband/core/cma.c| 261 -- drivers/infiniband/core/cma_configfs.c | 321 +++ drivers/infiniband/core/core_priv.h | 45 drivers/infiniband/core/device.c | 10 +- drivers/infiniband/core/multicast.c | 17 +- drivers/infiniband/core/roce_gid_mgmt.c | 81 -- drivers/infiniband/core/sa_query.c | 76 +- drivers/infiniband/core/sysfs.c | 184 - drivers/infiniband/core/ud_header.c | 155 ++- drivers/infiniband/core/uverbs_marshall.c| 1 + drivers/infiniband/core/verbs.c | 170 ++-- drivers/infiniband/hw/mlx4/qp.c | 7 +- drivers/infiniband/hw/mthca/mthca_qp.c | 2 +- drivers/infiniband/hw/ocrdma/ocrdma_ah.c | 2 +- include/rdma/ib_addr.h | 11 +- include/rdma/ib_cache.h | 4 + include/rdma/ib_pack.h | 45 +++- include/rdma/ib_sa.h | 3 + include/rdma/ib_verbs.h | 78 +- 26 files changed, 1704 insertions(+), 203 deletions(-) create mode 
100644 Documentation/ABI/testing/configfs-rdma_cm create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband create mode 100644 drivers/infiniband/core/cma_configfs.c -- 2.1.0
[PATCH for-next V2 09/11] IB/cma: Add configfs for rdma_cm
Users would like to control the behaviour of rdma_cm. For example, old applications which don't set the required RoCE gid type could be executed on RoCE V2 network types. In order to support this configuration, we implement a configfs for rdma_cm. In order to use the configfs, one needs to mount it and mkdir inside the rdma_cm directory. The patch adds support for a single configuration file, default_roce_mode. The mode can either be "IB/RoCE v1" or "RoCE v2". Signed-off-by: Matan Barak--- Documentation/ABI/testing/configfs-rdma_cm | 22 ++ drivers/infiniband/Kconfig | 9 + drivers/infiniband/core/Makefile | 2 + drivers/infiniband/core/cache.c| 24 +++ drivers/infiniband/core/cma.c | 108 +- drivers/infiniband/core/cma_configfs.c | 321 + drivers/infiniband/core/core_priv.h| 24 +++ 7 files changed, 503 insertions(+), 7 deletions(-) create mode 100644 Documentation/ABI/testing/configfs-rdma_cm create mode 100644 drivers/infiniband/core/cma_configfs.c diff --git a/Documentation/ABI/testing/configfs-rdma_cm b/Documentation/ABI/testing/configfs-rdma_cm new file mode 100644 index 000..5c389aa --- /dev/null +++ b/Documentation/ABI/testing/configfs-rdma_cm @@ -0,0 +1,22 @@ +What: /config/rdma_cm +Date: November 29, 2015 +KernelVersion: 4.4.0 +Description: This interface is used to configure RDMA-capable HCAs with respect to + RDMA-CM attributes. + + Attributes are visible only when configfs is mounted. To mount + configfs in the /config directory use: + # mount -t configfs none /config/ + + In order to set parameters related to a specific HCA, a directory + for this HCA has to be created: + mkdir -p /config/rdma_cm/<hca> + + +What: /config/rdma_cm/<hca>/ports/<port-num>/default_roce_mode +Date: November 29, 2015 +KernelVersion: 4.4.0 +Description: RDMA-CM based connections from HCA <hca> at port <port-num> + will be initiated with this RoCE type as default. + The possible RoCE types are either "IB/RoCE v1" or "RoCE v2". + This parameter has RW access. 
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig index aa26f3c..f5312da 100644 --- a/drivers/infiniband/Kconfig +++ b/drivers/infiniband/Kconfig @@ -54,6 +54,15 @@ config INFINIBAND_ADDR_TRANS depends on INFINIBAND default y +config INFINIBAND_ADDR_TRANS_CONFIGFS + bool + depends on INFINIBAND_ADDR_TRANS && !(INFINIBAND=y && CONFIGFS_FS=m) + default y + ---help--- + ConfigFS support for RDMA communication manager (CM). + This allows the user to configure the default GID type that the CM + uses for each device, when initiating new connections. + source "drivers/infiniband/hw/mthca/Kconfig" source "drivers/infiniband/hw/qib/Kconfig" source "drivers/infiniband/hw/cxgb3/Kconfig" diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile index d43a899..7922fa7 100644 --- a/drivers/infiniband/core/Makefile +++ b/drivers/infiniband/core/Makefile @@ -24,6 +24,8 @@ iw_cm-y :=iwcm.o iwpm_util.o iwpm_msg.o rdma_cm-y := cma.o +rdma_cm-$(CONFIG_INFINIBAND_ADDR_TRANS_CONFIGFS) += cma_configfs.o + rdma_ucm-y := ucma.o ib_addr-y := addr.o diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index 88b4b6f..4aada52 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -140,6 +140,30 @@ const char *ib_cache_gid_type_str(enum ib_gid_type gid_type) } EXPORT_SYMBOL(ib_cache_gid_type_str); +int ib_cache_gid_parse_type_str(const char *buf) +{ + unsigned int i; + size_t len; + int err = -EINVAL; + + len = strlen(buf); + if (len == 0) + return -EINVAL; + + if (buf[len - 1] == '\n') + len--; + + for (i = 0; i < ARRAY_SIZE(gid_type_str); ++i) + if (gid_type_str[i] && !strncmp(buf, gid_type_str[i], len) && + len == strlen(gid_type_str[i])) { + err = i; + break; + } + + return err; +} +EXPORT_SYMBOL(ib_cache_gid_parse_type_str); + /* This function expects that rwlock will be write locked in all * scenarios and that lock will be locked in sleep-able (RoCE) * scenarios. 
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index f78088a..8fab267 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -152,6 +152,7 @@ struct cma_device { struct completion comp; atomic_trefcount; struct list_headid_list; + enum ib_gid_type*default_gid_type; }; struct rdma_bind_list { @@ -192,6 +193,62 @@ void
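Assuming the layout described in the configfs ABI document above, the administrator-side steps amount to something like the following (the device name mlx4_0 and port 1 are examples; substitute an HCA listed under /sys/class/infiniband/):

```shell
# Mount configfs and create a per-device directory for the HCA.
mount -t configfs none /config
mkdir -p /config/rdma_cm/mlx4_0

# Read and set the default RoCE type for port 1 of that HCA.
cat /config/rdma_cm/mlx4_0/ports/1/default_roce_mode
echo "RoCE v2" > /config/rdma_cm/mlx4_0/ports/1/default_roce_mode
```

These commands require root and a kernel built with INFINIBAND_ADDR_TRANS_CONFIGFS enabled; the written string must match one of the types accepted by ib_cache_gid_parse_type_str(), trailing newline included or not.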
[PATCH for-next V2 10/11] IB/core: Initialize UD header structure with IP and UDP headers
From: Moni Shouaib_ud_header_init() is used to format InfiniBand headers in a buffer up to (but not with) BTH. For RoCE UDP ENCAP it is required that this function would be able to build also IP and UDP headers. Signed-off-by: Moni Shoua Signed-off-by: Matan Barak --- drivers/infiniband/core/ud_header.c| 155 ++--- drivers/infiniband/hw/mlx4/qp.c| 7 +- drivers/infiniband/hw/mthca/mthca_qp.c | 2 +- include/rdma/ib_pack.h | 45 -- 4 files changed, 188 insertions(+), 21 deletions(-) diff --git a/drivers/infiniband/core/ud_header.c b/drivers/infiniband/core/ud_header.c index 72feee6..96697e7 100644 --- a/drivers/infiniband/core/ud_header.c +++ b/drivers/infiniband/core/ud_header.c @@ -35,6 +35,7 @@ #include #include #include +#include #include @@ -116,6 +117,72 @@ static const struct ib_field vlan_table[] = { .size_bits= 16 } }; +static const struct ib_field ip4_table[] = { + { STRUCT_FIELD(ip4, ver), + .offset_words = 0, + .offset_bits = 0, + .size_bits= 4 }, + { STRUCT_FIELD(ip4, hdr_len), + .offset_words = 0, + .offset_bits = 4, + .size_bits= 4 }, + { STRUCT_FIELD(ip4, tos), + .offset_words = 0, + .offset_bits = 8, + .size_bits= 8 }, + { STRUCT_FIELD(ip4, tot_len), + .offset_words = 0, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, id), + .offset_words = 1, + .offset_bits = 0, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, frag_off), + .offset_words = 1, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, ttl), + .offset_words = 2, + .offset_bits = 0, + .size_bits= 8 }, + { STRUCT_FIELD(ip4, protocol), + .offset_words = 2, + .offset_bits = 8, + .size_bits= 8 }, + { STRUCT_FIELD(ip4, check), + .offset_words = 2, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, saddr), + .offset_words = 3, + .offset_bits = 0, + .size_bits= 32 }, + { STRUCT_FIELD(ip4, daddr), + .offset_words = 4, + .offset_bits = 0, + .size_bits= 32 } +}; + +static const struct ib_field udp_table[] = { + { STRUCT_FIELD(udp, sport), + .offset_words = 0, + .offset_bits = 
0, + .size_bits= 16 }, + { STRUCT_FIELD(udp, dport), + .offset_words = 0, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(udp, length), + .offset_words = 1, + .offset_bits = 0, + .size_bits= 16 }, + { STRUCT_FIELD(udp, csum), + .offset_words = 1, + .offset_bits = 16, + .size_bits= 16 } +}; + static const struct ib_field grh_table[] = { { STRUCT_FIELD(grh, ip_version), .offset_words = 0, @@ -213,26 +280,57 @@ static const struct ib_field deth_table[] = { .size_bits= 24 } }; +__be16 ib_ud_ip4_csum(struct ib_ud_header *header) +{ + struct iphdr iph; + + iph.ihl = 5; + iph.version = 4; + iph.tos = header->ip4.tos; + iph.tot_len = header->ip4.tot_len; + iph.id = header->ip4.id; + iph.frag_off= header->ip4.frag_off; + iph.ttl = header->ip4.ttl; + iph.protocol= header->ip4.protocol; + iph.check = 0; + iph.saddr = header->ip4.saddr; + iph.daddr = header->ip4.daddr; + + return ip_fast_csum((u8 *), iph.ihl); +} +EXPORT_SYMBOL(ib_ud_ip4_csum); + /** * ib_ud_header_init - Initialize UD header structure * @payload_bytes:Length of packet payload * @lrh_present: specify if LRH is present * @eth_present: specify if Eth header is present * @vlan_present: packet is tagged vlan - * @grh_present:GRH flag (if non-zero, GRH will be included) + * @grh_present: GRH flag (if non-zero, GRH will be included) + * @ip_version: if non-zero, IP header, V4 or V6, will be included + * @udp_present :if non-zero, UDP header will be included * @immediate_present: specify if immediate data is present * @header:Structure to initialize */ -void ib_ud_header_init(int payload_bytes, - int lrh_present, - int eth_present, - int vlan_present, - int grh_present, - int immediate_present, - struct ib_ud_header *header) +int ib_ud_header_init(int payload_bytes, + intlrh_present, + inteth_present, +
[PATCH for-next V2 07/11] IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path
In order to make sure API users don't try to use SGIDs which don't conform to the routing table, validate the route before searching the RoCE GID table. Signed-off-by: Matan Barak--- drivers/infiniband/core/addr.c | 175 ++- drivers/infiniband/core/cm.c | 10 +- drivers/infiniband/core/cma.c| 30 +- drivers/infiniband/core/sa_query.c | 75 +++-- drivers/infiniband/core/verbs.c | 48 ++--- drivers/infiniband/hw/ocrdma/ocrdma_ah.c | 2 +- include/rdma/ib_addr.h | 10 +- 7 files changed, 270 insertions(+), 80 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index 6e35299..57eda11 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -121,7 +121,8 @@ int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, } EXPORT_SYMBOL(rdma_copy_addr); -int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, +int rdma_translate_ip(const struct sockaddr *addr, + struct rdma_dev_addr *dev_addr, u16 *vlan_id) { struct net_device *dev; @@ -139,7 +140,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, switch (addr->sa_family) { case AF_INET: dev = ip_dev_find(dev_addr->net, - ((struct sockaddr_in *) addr)->sin_addr.s_addr); + ((const struct sockaddr_in *)addr)->sin_addr.s_addr); if (!dev) return ret; @@ -154,7 +155,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, rcu_read_lock(); for_each_netdev_rcu(dev_addr->net, dev) { if (ipv6_chk_addr(dev_addr->net, - &((struct sockaddr_in6 *) addr)->sin6_addr, + &((const struct sockaddr_in6 *)addr)->sin6_addr, dev, 1)) { ret = rdma_copy_addr(dev_addr, dev, NULL); if (vlan_id) @@ -198,7 +199,8 @@ static void queue_req(struct addr_req *req) mutex_unlock(); } -static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr, void *daddr) +static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr, + const void *daddr) { struct neighbour *n; int ret; @@ 
-222,8 +224,9 @@ static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr, v } static int addr4_resolve(struct sockaddr_in *src_in, -struct sockaddr_in *dst_in, -struct rdma_dev_addr *addr) +const struct sockaddr_in *dst_in, +struct rdma_dev_addr *addr, +struct rtable **prt) { __be32 src_ip = src_in->sin_addr.s_addr; __be32 dst_ip = dst_in->sin_addr.s_addr; @@ -243,36 +246,23 @@ static int addr4_resolve(struct sockaddr_in *src_in, src_in->sin_family = AF_INET; src_in->sin_addr.s_addr = fl4.saddr; - if (rt->dst.dev->flags & IFF_LOOPBACK) { - ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL); - if (!ret) - memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN); - goto put; - } - - /* If the device does ARP internally, return 'done' */ - if (rt->dst.dev->flags & IFF_NOARP) { - ret = rdma_copy_addr(addr, rt->dst.dev, NULL); - goto put; - } - /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't * routable) and we could set the network type accordingly. */ if (rt->rt_uses_gateway) addr->network = RDMA_NETWORK_IPV4; - ret = dst_fetch_ha(>dst, addr, ); -put: - ip_rt_put(rt); + *prt = rt; + return 0; out: return ret; } #if IS_ENABLED(CONFIG_IPV6) static int addr6_resolve(struct sockaddr_in6 *src_in, -struct sockaddr_in6 *dst_in, -struct rdma_dev_addr *addr) +const struct sockaddr_in6 *dst_in, +struct rdma_dev_addr *addr, +struct dst_entry **pdst) { struct flowi6 fl6; struct dst_entry *dst; @@ -299,49 +289,109 @@ static int addr6_resolve(struct sockaddr_in6 *src_in, src_in->sin6_addr = fl6.saddr; } - if (dst->dev->flags & IFF_LOOPBACK) { - ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL); - if (!ret) - memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN); - goto put; -
[PATCH for-next V2 08/11] IB/rdma_cm: Add wrapper for cma reference count
Currently, cma users can't increase or decrease the cma reference count. This is necessary when setting cma attributes (like the default GID type) in order to avoid use-after-free errors. Adding cma_ref_dev and cma_deref_dev APIs.

Signed-off-by: Matan Barak
---
 drivers/infiniband/core/cma.c | 11 +--
 drivers/infiniband/core/core_priv.h | 4
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index cf52b65..f78088a 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -60,6 +60,8 @@
 #include
 #include
+#include "core_priv.h"
+
 MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
@@ -185,6 +187,11 @@ enum {
 	CMA_OPTION_AFONLY,
 };
+void cma_ref_dev(struct cma_device *cma_dev)
+{
+	atomic_inc(&cma_dev->refcount);
+}
+
 /*
  * Device removal can occur at anytime, so we need extra handling to
  * serialize notifying the user of device removal with other callbacks.
@@ -339,7 +346,7 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 ip_ver) static void cma_attach_to_dev(struct rdma_id_private *id_priv, struct cma_device *cma_dev) { - atomic_inc(_dev->refcount); + cma_ref_dev(cma_dev); id_priv->cma_dev = cma_dev; id_priv->id.device = cma_dev->device; id_priv->id.route.addr.dev_addr.transport = @@ -347,7 +354,7 @@ static void cma_attach_to_dev(struct rdma_id_private *id_priv, list_add_tail(_priv->list, _dev->id_list); } -static inline void cma_deref_dev(struct cma_device *cma_dev) +void cma_deref_dev(struct cma_device *cma_dev) { if (atomic_dec_and_test(_dev->refcount)) complete(_dev->comp); diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h index 3b250a2..1945b4e 100644 --- a/drivers/infiniband/core/core_priv.h +++ b/drivers/infiniband/core/core_priv.h @@ -38,6 +38,10 @@ #include +struct cma_device; +void cma_ref_dev(struct cma_device *cma_dev); +void cma_deref_dev(struct cma_device *cma_dev); + int ib_device_register_sysfs(struct ib_device *device, int (*port_callback)(struct ib_device *, u8, struct kobject *)); -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next V2 11/11] IB/cma: Join and leave multicast groups with IGMP
From: Moni Shoua

Since RoCE v2 is a protocol over an IP header, it is required to send IGMP join and leave requests to the network when joining and leaving multicast groups.

Signed-off-by: Moni Shoua
---
 drivers/infiniband/core/cma.c | 96 ++---
 drivers/infiniband/core/multicast.c | 17 ++-
 include/rdma/ib_sa.h | 2 +
 3 files changed, 106 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 8fab267..c30bfe3 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -38,6 +38,7 @@
 #include
 #include
 #include
+#include
 #include
 #include
 #include
@@ -304,6 +305,7 @@ struct cma_multicast {
 	void *context;
 	struct sockaddr_storage addr;
 	struct kref mcref;
+	bool igmp_joined;
 };
 struct cma_work {
@@ -400,6 +402,26 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 ip_ver)
 	hdr->ip_version = (ip_ver << 4) | (hdr->ip_version & 0xF);
 }
+static int cma_igmp_send(struct net_device *ndev, union ib_gid *mgid, bool join)
+{
+	struct in_device *in_dev = NULL;
+
+	if (ndev) {
+		rtnl_lock();
+		in_dev = __in_dev_get_rtnl(ndev);
+		if (in_dev) {
+			if (join)
+				ip_mc_inc_group(in_dev,
+						*(__be32 *)(mgid->raw + 12));
+			else
+				ip_mc_dec_group(in_dev,
+						*(__be32 *)(mgid->raw + 12));
+		}
+		rtnl_unlock();
+	}
+	return (in_dev) ?
0 : -ENODEV; +} + static void _cma_attach_to_dev(struct rdma_id_private *id_priv, struct cma_device *cma_dev) { @@ -1535,8 +1557,24 @@ static void cma_leave_mc_groups(struct rdma_id_private *id_priv) id_priv->id.port_num)) { ib_sa_free_multicast(mc->multicast.ib); kfree(mc); - } else + } else { + if (mc->igmp_joined) { + struct rdma_dev_addr *dev_addr = + _priv->id.route.addr.dev_addr; + struct net_device *ndev = NULL; + + if (dev_addr->bound_dev_if) + ndev = dev_get_by_index(_net, + dev_addr->bound_dev_if); + if (ndev) { + cma_igmp_send(ndev, + >multicast.ib->rec.mgid, + false); + dev_put(ndev); + } + } kref_put(>mcref, release_mc); + } } } @@ -3656,12 +3694,23 @@ static int cma_ib_mc_handler(int status, struct ib_sa_multicast *multicast) event.status = status; event.param.ud.private_data = mc->context; if (!status) { + struct rdma_dev_addr *dev_addr = + _priv->id.route.addr.dev_addr; + struct net_device *ndev = + dev_get_by_index(_net, dev_addr->bound_dev_if); + enum ib_gid_type gid_type = + id_priv->cma_dev->default_gid_type[id_priv->id.port_num - + rdma_start_port(id_priv->cma_dev->device)]; + event.event = RDMA_CM_EVENT_MULTICAST_JOIN; ib_init_ah_from_mcmember(id_priv->id.device, id_priv->id.port_num, >rec, +ndev, gid_type, _attr); event.param.ud.qp_num = 0xFF; event.param.ud.qkey = be32_to_cpu(multicast->rec.qkey); + if (ndev) + dev_put(ndev); } else event.event = RDMA_CM_EVENT_MULTICAST_ERROR; @@ -3794,9 +3843,10 @@ static int cma_iboe_join_multicast(struct rdma_id_private *id_priv, { struct iboe_mcast_work *work; struct rdma_dev_addr *dev_addr = _priv->id.route.addr.dev_addr; - int err; + int err = 0; struct sockaddr *addr = (struct sockaddr *)>addr; struct net_device *ndev = NULL; + enum ib_gid_type gid_type; if (cma_zero_addr((struct sockaddr *)>addr)) return -EINVAL; @@ -3826,9 +3876,25 @@ static int cma_iboe_join_multicast(struct rdma_id_private *id_priv, mc->multicast.ib->rec.rate = iboe_get_rate(ndev);
[PATCH for-next V2 04/11] IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type
Adding RoCE v2 GID type and port type. Vendors which support this type will get their GID table populated with RoCE v2 GIDs automatically. Signed-off-by: Matan Barak--- drivers/infiniband/core/cache.c | 1 + drivers/infiniband/core/roce_gid_mgmt.c | 3 ++- include/rdma/ib_verbs.h | 23 +-- 3 files changed, 24 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index 566fd8f..88b4b6f 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -128,6 +128,7 @@ static void dispatch_gid_change_event(struct ib_device *ib_dev, u8 port) static const char * const gid_type_str[] = { [IB_GID_TYPE_IB]= "IB/RoCE v1", + [IB_GID_TYPE_ROCE_UDP_ENCAP]= "RoCE v2", }; const char *ib_cache_gid_type_str(enum ib_gid_type gid_type) diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c index 61c27a7..1e3673f 100644 --- a/drivers/infiniband/core/roce_gid_mgmt.c +++ b/drivers/infiniband/core/roce_gid_mgmt.c @@ -71,7 +71,8 @@ static const struct { bool (*is_supported)(const struct ib_device *device, u8 port_num); enum ib_gid_type gid_type; } PORT_CAP_TO_GID_TYPE[] = { - {rdma_protocol_roce, IB_GID_TYPE_ROCE}, + {rdma_protocol_roce_eth_encap, IB_GID_TYPE_ROCE}, + {rdma_protocol_roce_udp_encap, IB_GID_TYPE_ROCE_UDP_ENCAP}, }; #define CAP_TO_GID_TABLE_SIZE ARRAY_SIZE(PORT_CAP_TO_GID_TYPE) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 2933aeb..87df931 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -71,6 +71,7 @@ enum ib_gid_type { /* If link layer is Ethernet, this is RoCE V1 */ IB_GID_TYPE_IB= 0, IB_GID_TYPE_ROCE = 0, + IB_GID_TYPE_ROCE_UDP_ENCAP = 1, IB_GID_TYPE_SIZE }; @@ -401,6 +402,7 @@ union rdma_protocol_stats { #define RDMA_CORE_CAP_PROT_IB 0x0010 #define RDMA_CORE_CAP_PROT_ROCE 0x0020 #define RDMA_CORE_CAP_PROT_IWARP0x0040 +#define RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP 0x0080 #define RDMA_CORE_PORT_IBA_IB 
(RDMA_CORE_CAP_PROT_IB \ | RDMA_CORE_CAP_IB_MAD \ @@ -413,6 +415,12 @@ union rdma_protocol_stats { | RDMA_CORE_CAP_IB_CM \ | RDMA_CORE_CAP_AF_IB \ | RDMA_CORE_CAP_ETH_AH) +#define RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP \ + (RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP \ + | RDMA_CORE_CAP_IB_MAD \ + | RDMA_CORE_CAP_IB_CM \ + | RDMA_CORE_CAP_AF_IB \ + | RDMA_CORE_CAP_ETH_AH) #define RDMA_CORE_PORT_IWARP (RDMA_CORE_CAP_PROT_IWARP \ | RDMA_CORE_CAP_IW_CM) #define RDMA_CORE_PORT_INTEL_OPA (RDMA_CORE_PORT_IBA_IB \ @@ -1975,6 +1983,17 @@ static inline bool rdma_protocol_ib(const struct ib_device *device, u8 port_num) static inline bool rdma_protocol_roce(const struct ib_device *device, u8 port_num) { + return device->port_immutable[port_num].core_cap_flags & + (RDMA_CORE_CAP_PROT_ROCE | RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP); +} + +static inline bool rdma_protocol_roce_udp_encap(const struct ib_device *device, u8 port_num) +{ + return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP; +} + +static inline bool rdma_protocol_roce_eth_encap(const struct ib_device *device, u8 port_num) +{ return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_ROCE; } @@ -1985,8 +2004,8 @@ static inline bool rdma_protocol_iwarp(const struct ib_device *device, u8 port_n static inline bool rdma_ib_or_roce(const struct ib_device *device, u8 port_num) { - return device->port_immutable[port_num].core_cap_flags & - (RDMA_CORE_CAP_PROT_IB | RDMA_CORE_CAP_PROT_ROCE); + return rdma_protocol_ib(device, port_num) || + rdma_protocol_roce(device, port_num); } /** -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next V2 02/11] IB/cm: Use the source GID index type
Previously, the cm and cma modules supported only the IB and RoCE v1 GID type. In order to support multiple GID types, the gid_type is passed to cm_init_av_by_path and stored in the path record. The rdma cm client would use a default GID type that will be saved in rdma_id_private.

Signed-off-by: Matan Barak
---
 drivers/infiniband/core/cm.c | 25 -
 drivers/infiniband/core/cma.c | 2 ++
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index af8b907..5ea78ab 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -364,7 +364,7 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
 	read_lock_irqsave(_lock, flags);
 	list_for_each_entry(cm_dev, _list, list) {
 		if (!ib_find_cached_gid(cm_dev->ib_device, >sgid,
-					IB_GID_TYPE_IB, ndev, , NULL)) {
+					path->gid_type, ndev, , NULL)) {
 			port = cm_dev->port[p-1];
 			break;
 		}
@@ -1600,6 +1600,8 @@ static int cm_req_handler(struct cm_work *work)
 	struct ib_cm_id *cm_id;
 	struct cm_id_private *cm_id_priv, *listen_cm_id_priv;
 	struct cm_req_msg *req_msg;
+	union ib_gid gid;
+	struct ib_gid_attr gid_attr;
 	int ret;
 	req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
@@ -1639,11 +1641,24 @@ static int cm_req_handler(struct cm_work *work)
 	cm_format_paths_from_req(req_msg, >path[0], >path[1]);
 	memcpy(work->path[0].dmac, cm_id_priv->av.ah_attr.dmac, ETH_ALEN);
-	ret = cm_init_av_by_path(>path[0], _id_priv->av);
+	ret = ib_get_cached_gid(work->port->cm_dev->ib_device,
+				work->port->port_num,
+				cm_id_priv->av.ah_attr.grh.sgid_index,
+				, _attr);
+	if (!ret) {
+		if (gid_attr.ndev)
+			dev_put(gid_attr.ndev);
+		work->path[0].gid_type = gid_attr.gid_type;
+		ret = cm_init_av_by_path(>path[0], _id_priv->av);
+	}
 	if (ret) {
-		ib_get_cached_gid(work->port->cm_dev->ib_device,
-				  work->port->port_num, 0, >path[0].sgid,
-				  NULL);
+		int err = ib_get_cached_gid(work->port->cm_dev->ib_device,
+					    work->port->port_num, 0,
+					    >path[0].sgid,
+					    _attr);
+ if (!err && gid_attr.ndev) + dev_put(gid_attr.ndev); + work->path[0].gid_type = gid_attr.gid_type; ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID, >path[0].sgid, sizeof work->path[0].sgid, NULL, 0); diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index c19f822..2914e08 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -228,6 +228,7 @@ struct rdma_id_private { u8 tos; u8 reuseaddr; u8 afonly; + enum ib_gid_typegid_type; }; struct cma_multicast { @@ -2325,6 +2326,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv) ndev = dev_get_by_index(_net, addr->dev_addr.bound_dev_if); route->path_rec->net = _net; route->path_rec->ifindex = addr->dev_addr.bound_dev_if; + route->path_rec->gid_type = id_priv->gid_type; } if (!ndev) { ret = -ENODEV; -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next V2 06/11] IB/core: Move rdma_is_upper_dev_rcu to header file
In order to validate the route, we need an easy way to check if a net-device belongs to our RDMA device. Move this helper function to a header file in order to make this check easier. Signed-off-by: Matan BarakReviewed-by: Haggai Eran --- drivers/infiniband/core/core_priv.h | 13 + drivers/infiniband/core/roce_gid_mgmt.c | 20 2 files changed, 17 insertions(+), 16 deletions(-) diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h index d531f91..3b250a2 100644 --- a/drivers/infiniband/core/core_priv.h +++ b/drivers/infiniband/core/core_priv.h @@ -96,4 +96,17 @@ int ib_cache_setup_one(struct ib_device *device); void ib_cache_cleanup_one(struct ib_device *device); void ib_cache_release_one(struct ib_device *device); +static inline bool rdma_is_upper_dev_rcu(struct net_device *dev, +struct net_device *upper) +{ + struct net_device *_upper = NULL; + struct list_head *iter; + + netdev_for_each_all_upper_dev_rcu(dev, _upper, iter) + if (_upper == upper) + break; + + return _upper == upper; +} + #endif /* _CORE_PRIV_H */ diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c index 1e3673f..06556c3 100644 --- a/drivers/infiniband/core/roce_gid_mgmt.c +++ b/drivers/infiniband/core/roce_gid_mgmt.c @@ -139,18 +139,6 @@ static enum bonding_slave_state is_eth_active_slave_of_bonding_rcu(struct net_de return BONDING_SLAVE_STATE_NA; } -static bool is_upper_dev_rcu(struct net_device *dev, struct net_device *upper) -{ - struct net_device *_upper = NULL; - struct list_head *iter; - - netdev_for_each_all_upper_dev_rcu(dev, _upper, iter) - if (_upper == upper) - break; - - return _upper == upper; -} - #define REQUIRED_BOND_STATES (BONDING_SLAVE_STATE_ACTIVE | \ BONDING_SLAVE_STATE_NA) static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port, @@ -168,7 +156,7 @@ static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port, if (!real_dev) real_dev = event_ndev; - res = ((is_upper_dev_rcu(rdma_ndev, 
event_ndev) && + res = ((rdma_is_upper_dev_rcu(rdma_ndev, event_ndev) && (is_eth_active_slave_of_bonding_rcu(rdma_ndev, real_dev) & REQUIRED_BOND_STATES)) || real_dev == rdma_ndev); @@ -214,7 +202,7 @@ static int upper_device_filter(struct ib_device *ib_dev, u8 port, return 1; rcu_read_lock(); - res = is_upper_dev_rcu(rdma_ndev, event_ndev); + res = rdma_is_upper_dev_rcu(rdma_ndev, event_ndev); rcu_read_unlock(); return res; @@ -244,7 +232,7 @@ static void enum_netdev_default_gids(struct ib_device *ib_dev, rcu_read_lock(); if (!rdma_ndev || ((rdma_ndev != event_ndev && - !is_upper_dev_rcu(rdma_ndev, event_ndev)) || + !rdma_is_upper_dev_rcu(rdma_ndev, event_ndev)) || is_eth_active_slave_of_bonding_rcu(rdma_ndev, netdev_master_upper_dev_get_rcu(rdma_ndev)) == BONDING_SLAVE_STATE_INACTIVE)) { @@ -274,7 +262,7 @@ static void bond_delete_netdev_default_gids(struct ib_device *ib_dev, rcu_read_lock(); - if (is_upper_dev_rcu(rdma_ndev, event_ndev) && + if (rdma_is_upper_dev_rcu(rdma_ndev, event_ndev) && is_eth_active_slave_of_bonding_rcu(rdma_ndev, real_dev) == BONDING_SLAVE_STATE_INACTIVE) { unsigned long gid_type_mask; -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next V2 00/11] Add RoCE v2 support
Hi Doug,

This series adds support for RoCE v2. In order to support RoCE v2, we add a gid_type attribute to every GID. When the RoCE GID management populates the GID table, it duplicates each GID with all supported types. This gives the user the ability to communicate over each supported type.

Patches 0001, 0002 and 0003 add support for multiple GID types to the cache and related APIs. The third patch exposes the GID attributes information in sysfs.

Patch 0004 adds the RoCE v2 GID type and the capabilities required from the vendor in order to implement RoCE v2. These capabilities are grouped together as RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP.

RoCE v2 works over both IPv4 and IPv6 networks. When receiving an ib_wc, this information should come from the vendor's driver. In case the vendor doesn't supply this information, we parse the packet headers and resolve its network type.

Patch 0005 adds this information and the required utilities.

Patches 0006 and 0007 add route validation. This is mandatory to ensure that we send packets using GIDs which correspond to a net-device that can be routed to the destination.

Patches 0008 and 0009 add configfs support (and the required infrastructure) for CMA. The administrator should be able to set the default RoCE type. This is done through a new per-port default_roce_mode configfs file.

Patch 0010 formats a QP1 packet in order to support RoCE v2 CM packets. This is required for vendors which implement their QP1 as a Raw QP.

Patch 0011 adds support for IPv4 multicast, as an IPv4 network requires IGMP to be sent in order to join multicast groups.

Vendor code isn't part of this patch-set. Soft-RoCE will be sent soon and depends on these patches. Other vendors, like mlx4, ocrdma and mlx5, will follow.

This patch-set is applied on top of "Change per-entry locks in GID cache to table lock" which was sent to the mailing list.

Thanks,
Matan

Changed from V1:
- Rebased against Linux 4.4-rc2 master branch.
- Add route validation - ConfigFS - avoid compiling INFINIBAND=y and CONFIGFS_FS=m - Add documentation for configfs and sysfs ABI - Remove ifindex and gid_type from mcmember Changes from V0: - Rebased patches against Doug's latest k.o/for-4.4 tree. - Fixed a bug in configfs (rmdir caused an incorrect free). Matan Barak (8): IB/core: Add gid_type to gid attribute IB/cm: Use the source GID index type IB/core: Add gid attributes to sysfs IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type IB/core: Move rdma_is_upper_dev_rcu to header file IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path IB/rdma_cm: Add wrapper for cma reference count IB/cma: Add configfs for rdma_cm Moni Shoua (2): IB/core: Initialize UD header structure with IP and UDP headers IB/cma: Join and leave multicast groups with IGMP Somnath Kotur (1): IB/core: Add rdma_network_type to wc Documentation/ABI/testing/configfs-rdma_cm | 22 ++ Documentation/ABI/testing/sysfs-class-infiniband | 16 ++ drivers/infiniband/Kconfig | 9 + drivers/infiniband/core/Makefile | 2 + drivers/infiniband/core/addr.c | 185 + drivers/infiniband/core/cache.c | 169 drivers/infiniband/core/cm.c | 31 ++- drivers/infiniband/core/cma.c| 261 -- drivers/infiniband/core/cma_configfs.c | 321 +++ drivers/infiniband/core/core_priv.h | 45 drivers/infiniband/core/device.c | 10 +- drivers/infiniband/core/multicast.c | 17 +- drivers/infiniband/core/roce_gid_mgmt.c | 81 -- drivers/infiniband/core/sa_query.c | 76 +- drivers/infiniband/core/sysfs.c | 184 - drivers/infiniband/core/ud_header.c | 155 ++- drivers/infiniband/core/uverbs_marshall.c| 1 + drivers/infiniband/core/verbs.c | 170 ++-- drivers/infiniband/hw/mlx4/qp.c | 7 +- drivers/infiniband/hw/mthca/mthca_qp.c | 2 +- drivers/infiniband/hw/ocrdma/ocrdma_ah.c | 2 +- include/rdma/ib_addr.h | 11 +- include/rdma/ib_cache.h | 4 + include/rdma/ib_pack.h | 45 +++- include/rdma/ib_sa.h | 3 + include/rdma/ib_verbs.h | 78 +- 26 files changed, 1704 insertions(+), 203 deletions(-) create mode 
100644 Documentation/ABI/testing/configfs-rdma_cm create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband create mode 100644 drivers/infiniband/core/cma_configfs.c -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next V2 01/11] IB/core: Add gid_type to gid attribute
In order to support multiple GID types, we need to store the gid_type with each GID. This is also aligned with the RoCE v2 annex "RoCEv2 PORT GID table entries shall have a "GID type" attribute that denotes the L3 Address type". The currently supported GID is IB_GID_TYPE_IB which is also RoCE v1 GID type. This implies that gid_type should be added to roce_gid_table meta-data. Signed-off-by: Matan Barak--- drivers/infiniband/core/cache.c | 144 -- drivers/infiniband/core/cm.c | 2 +- drivers/infiniband/core/cma.c | 3 +- drivers/infiniband/core/core_priv.h | 4 + drivers/infiniband/core/device.c | 9 +- drivers/infiniband/core/multicast.c | 2 +- drivers/infiniband/core/roce_gid_mgmt.c | 60 +++-- drivers/infiniband/core/sa_query.c| 5 +- drivers/infiniband/core/uverbs_marshall.c | 1 + drivers/infiniband/core/verbs.c | 1 + include/rdma/ib_cache.h | 4 + include/rdma/ib_sa.h | 1 + include/rdma/ib_verbs.h | 11 ++- 13 files changed, 185 insertions(+), 62 deletions(-) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index 097e9df..566fd8f 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -64,6 +64,7 @@ enum gid_attr_find_mask { GID_ATTR_FIND_MASK_GID = 1UL << 0, GID_ATTR_FIND_MASK_NETDEV = 1UL << 1, GID_ATTR_FIND_MASK_DEFAULT = 1UL << 2, + GID_ATTR_FIND_MASK_GID_TYPE = 1UL << 3, }; enum gid_table_entry_props { @@ -125,6 +126,19 @@ static void dispatch_gid_change_event(struct ib_device *ib_dev, u8 port) } } +static const char * const gid_type_str[] = { + [IB_GID_TYPE_IB]= "IB/RoCE v1", +}; + +const char *ib_cache_gid_type_str(enum ib_gid_type gid_type) +{ + if (gid_type < ARRAY_SIZE(gid_type_str) && gid_type_str[gid_type]) + return gid_type_str[gid_type]; + + return "Invalid GID type"; +} +EXPORT_SYMBOL(ib_cache_gid_type_str); + /* This function expects that rwlock will be write locked in all * scenarios and that lock will be locked in sleep-able (RoCE) * scenarios. 
@@ -233,6 +247,10 @@ static int find_gid(struct ib_gid_table *table, const union ib_gid *gid, if (found >=0) continue; + if (mask & GID_ATTR_FIND_MASK_GID_TYPE && + attr->gid_type != val->gid_type) + continue; + if (mask & GID_ATTR_FIND_MASK_GID && memcmp(gid, >gid, sizeof(*gid))) continue; @@ -296,6 +314,7 @@ int ib_cache_gid_add(struct ib_device *ib_dev, u8 port, write_lock_irq(>rwlock); ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID | + GID_ATTR_FIND_MASK_GID_TYPE | GID_ATTR_FIND_MASK_NETDEV, ); if (ix >= 0) goto out_unlock; @@ -329,6 +348,7 @@ int ib_cache_gid_del(struct ib_device *ib_dev, u8 port, ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID | + GID_ATTR_FIND_MASK_GID_TYPE | GID_ATTR_FIND_MASK_NETDEV | GID_ATTR_FIND_MASK_DEFAULT, NULL); @@ -427,11 +447,13 @@ static int _ib_cache_gid_table_find(struct ib_device *ib_dev, static int ib_cache_gid_find(struct ib_device *ib_dev, const union ib_gid *gid, +enum ib_gid_type gid_type, struct net_device *ndev, u8 *port, u16 *index) { - unsigned long mask = GID_ATTR_FIND_MASK_GID; - struct ib_gid_attr gid_attr_val = {.ndev = ndev}; + unsigned long mask = GID_ATTR_FIND_MASK_GID | +GID_ATTR_FIND_MASK_GID_TYPE; + struct ib_gid_attr gid_attr_val = {.ndev = ndev, .gid_type = gid_type}; if (ndev) mask |= GID_ATTR_FIND_MASK_NETDEV; @@ -442,14 +464,16 @@ static int ib_cache_gid_find(struct ib_device *ib_dev, int ib_find_cached_gid_by_port(struct ib_device *ib_dev, const union ib_gid *gid, + enum ib_gid_type gid_type, u8 port, struct net_device *ndev, u16 *index) { int local_index; struct ib_gid_table **ports_table = ib_dev->cache.gid_cache; struct ib_gid_table *table; - unsigned long mask = GID_ATTR_FIND_MASK_GID; - struct ib_gid_attr val = {.ndev = ndev}; + unsigned long mask = GID_ATTR_FIND_MASK_GID | +GID_ATTR_FIND_MASK_GID_TYPE; + struct ib_gid_attr val = {.ndev = ndev, .gid_type = gid_type}; unsigned long flags;
[PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc
From: Somnath KoturProviders should tell IB core the wc's network type. This is used in order to search for the proper GID in the GID table. When using HCAs that can't provide this info, IB core tries to deep examine the packet and extract the GID type by itself. We choose sgid_index and type from all the matching entries in RDMA-CM based on hint from the IP stack and we set hop_limit for the IP packet based on above hint from IP stack. Signed-off-by: Matan Barak Signed-off-by: Somnath Kotur --- drivers/infiniband/core/addr.c | 14 + drivers/infiniband/core/cma.c | 11 +++- drivers/infiniband/core/verbs.c | 123 ++-- include/rdma/ib_addr.h | 1 + include/rdma/ib_verbs.h | 44 ++ 5 files changed, 187 insertions(+), 6 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index 34b1ada..6e35299 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -256,6 +256,12 @@ static int addr4_resolve(struct sockaddr_in *src_in, goto put; } + /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't +* routable) and we could set the network type accordingly. +*/ + if (rt->rt_uses_gateway) + addr->network = RDMA_NETWORK_IPV4; + ret = dst_fetch_ha(>dst, addr, ); put: ip_rt_put(rt); @@ -270,6 +276,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in, { struct flowi6 fl6; struct dst_entry *dst; + struct rt6_info *rt; int ret; memset(, 0, sizeof fl6); @@ -281,6 +288,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in, if ((ret = dst->error)) goto put; + rt = (struct rt6_info *)dst; if (ipv6_addr_any()) { ret = ipv6_dev_get_saddr(addr->net, ip6_dst_idev(dst)->dev, , 0, ); @@ -304,6 +312,12 @@ static int addr6_resolve(struct sockaddr_in6 *src_in, goto put; } + /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't +* routable) and we could set the network type accordingly. 
+*/ + if (rt->rt6i_flags & RTF_GATEWAY) + addr->network = RDMA_NETWORK_IPV6; + ret = dst_fetch_ha(dst, addr, ); put: dst_release(dst); diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 2914e08..5dc853c 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -2302,6 +2302,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv) { struct rdma_route *route = _priv->id.route; struct rdma_addr *addr = >addr; + enum ib_gid_type network_gid_type; struct cma_work *work; int ret; struct net_device *ndev = NULL; @@ -2340,7 +2341,15 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv) rdma_ip2gid((struct sockaddr *)_priv->id.route.addr.dst_addr, >path_rec->dgid); - route->path_rec->hop_limit = 1; + /* Use the hint from IP Stack to select GID Type */ + network_gid_type = ib_network_to_gid_type(addr->dev_addr.network); + if (addr->dev_addr.network != RDMA_NETWORK_IB) { + route->path_rec->gid_type = network_gid_type; + /* TODO: get the hoplimit from the inet/inet6 device */ + route->path_rec->hop_limit = IPV6_DEFAULT_HOPLIMIT; + } else { + route->path_rec->hop_limit = 1; + } route->path_rec->reversible = 1; route->path_rec->pkey = cpu_to_be16(0x); route->path_rec->mtu_selector = IB_SA_EQ; diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c index 4263c4c..c564131 100644 --- a/drivers/infiniband/core/verbs.c +++ b/drivers/infiniband/core/verbs.c @@ -311,8 +311,61 @@ struct ib_ah *ib_create_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr) } EXPORT_SYMBOL(ib_create_ah); +static int ib_get_header_version(const union rdma_network_hdr *hdr) +{ + const struct iphdr *ip4h = (struct iphdr *)>roce4grh; + struct iphdr ip4h_checked; + const struct ipv6hdr *ip6h = (struct ipv6hdr *)>ibgrh; + + /* If it's IPv6, the version must be 6, otherwise, the first +* 20 bytes (before the IPv4 header) are garbled. +*/ + if (ip6h->version != 6) + return (ip4h->version == 4) ? 
4 : 0; + /* version may be 6 or 4 because the first 20 bytes could be garbled */ + + /* RoCE v2 requires no options, thus header length +* must be 5 words +*/ + if (ip4h->ihl != 5) + return 6; + + /* Verify checksum. +* We can't write on scattered buffers so we need to copy to +* temp buffer. +*/ +
[PATCH for-next V2 06/11] IB/core: Move rdma_is_upper_dev_rcu to header file
In order to validate the route, we need an easy way to check if a net-device belongs to our RDMA device. Move this helper function to a header file in order to make this check easier. Signed-off-by: Matan Barak Reviewed-by: Haggai Eran --- drivers/infiniband/core/core_priv.h | 13 + drivers/infiniband/core/roce_gid_mgmt.c | 20 2 files changed, 17 insertions(+), 16 deletions(-) diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h index d531f91..3b250a2 100644 --- a/drivers/infiniband/core/core_priv.h +++ b/drivers/infiniband/core/core_priv.h @@ -96,4 +96,17 @@ int ib_cache_setup_one(struct ib_device *device); void ib_cache_cleanup_one(struct ib_device *device); void ib_cache_release_one(struct ib_device *device); +static inline bool rdma_is_upper_dev_rcu(struct net_device *dev, +struct net_device *upper) +{ + struct net_device *_upper = NULL; + struct list_head *iter; + + netdev_for_each_all_upper_dev_rcu(dev, _upper, iter) + if (_upper == upper) + break; + + return _upper == upper; +} + #endif /* _CORE_PRIV_H */ diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c index 1e3673f..06556c3 100644 --- a/drivers/infiniband/core/roce_gid_mgmt.c +++ b/drivers/infiniband/core/roce_gid_mgmt.c @@ -139,18 +139,6 @@ static enum bonding_slave_state is_eth_active_slave_of_bonding_rcu(struct net_de return BONDING_SLAVE_STATE_NA; } -static bool is_upper_dev_rcu(struct net_device *dev, struct net_device *upper) -{ - struct net_device *_upper = NULL; - struct list_head *iter; - - netdev_for_each_all_upper_dev_rcu(dev, _upper, iter) - if (_upper == upper) - break; - - return _upper == upper; -} - #define REQUIRED_BOND_STATES (BONDING_SLAVE_STATE_ACTIVE | \ BONDING_SLAVE_STATE_NA) static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port, @@ -168,7 +156,7 @@ static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port, if (!real_dev) real_dev = event_ndev; - res = ((is_upper_dev_rcu(rdma_ndev, 
event_ndev) && + res = ((rdma_is_upper_dev_rcu(rdma_ndev, event_ndev) && (is_eth_active_slave_of_bonding_rcu(rdma_ndev, real_dev) & REQUIRED_BOND_STATES)) || real_dev == rdma_ndev); @@ -214,7 +202,7 @@ static int upper_device_filter(struct ib_device *ib_dev, u8 port, return 1; rcu_read_lock(); - res = is_upper_dev_rcu(rdma_ndev, event_ndev); + res = rdma_is_upper_dev_rcu(rdma_ndev, event_ndev); rcu_read_unlock(); return res; @@ -244,7 +232,7 @@ static void enum_netdev_default_gids(struct ib_device *ib_dev, rcu_read_lock(); if (!rdma_ndev || ((rdma_ndev != event_ndev && - !is_upper_dev_rcu(rdma_ndev, event_ndev)) || + !rdma_is_upper_dev_rcu(rdma_ndev, event_ndev)) || is_eth_active_slave_of_bonding_rcu(rdma_ndev, netdev_master_upper_dev_get_rcu(rdma_ndev)) == BONDING_SLAVE_STATE_INACTIVE)) { @@ -274,7 +262,7 @@ static void bond_delete_netdev_default_gids(struct ib_device *ib_dev, rcu_read_lock(); - if (is_upper_dev_rcu(rdma_ndev, event_ndev) && + if (rdma_is_upper_dev_rcu(rdma_ndev, event_ndev) && is_eth_active_slave_of_bonding_rcu(rdma_ndev, real_dev) == BONDING_SLAVE_STATE_INACTIVE) { unsigned long gid_type_mask; -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next V2 07/11] IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path
In order to make sure API users don't try to use SGIDs which don't conform to the routing table, validate the route before searching the RoCE GID table. Signed-off-by: Matan Barak--- drivers/infiniband/core/addr.c | 175 ++- drivers/infiniband/core/cm.c | 10 +- drivers/infiniband/core/cma.c| 30 +- drivers/infiniband/core/sa_query.c | 75 +++-- drivers/infiniband/core/verbs.c | 48 ++--- drivers/infiniband/hw/ocrdma/ocrdma_ah.c | 2 +- include/rdma/ib_addr.h | 10 +- 7 files changed, 270 insertions(+), 80 deletions(-) diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c index 6e35299..57eda11 100644 --- a/drivers/infiniband/core/addr.c +++ b/drivers/infiniband/core/addr.c @@ -121,7 +121,8 @@ int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev, } EXPORT_SYMBOL(rdma_copy_addr); -int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, +int rdma_translate_ip(const struct sockaddr *addr, + struct rdma_dev_addr *dev_addr, u16 *vlan_id) { struct net_device *dev; @@ -139,7 +140,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, switch (addr->sa_family) { case AF_INET: dev = ip_dev_find(dev_addr->net, - ((struct sockaddr_in *) addr)->sin_addr.s_addr); + ((const struct sockaddr_in *)addr)->sin_addr.s_addr); if (!dev) return ret; @@ -154,7 +155,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr, rcu_read_lock(); for_each_netdev_rcu(dev_addr->net, dev) { if (ipv6_chk_addr(dev_addr->net, - &((struct sockaddr_in6 *) addr)->sin6_addr, + &((const struct sockaddr_in6 *)addr)->sin6_addr, dev, 1)) { ret = rdma_copy_addr(dev_addr, dev, NULL); if (vlan_id) @@ -198,7 +199,8 @@ static void queue_req(struct addr_req *req) mutex_unlock(); } -static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr, void *daddr) +static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr, + const void *daddr) { struct neighbour *n; int ret; @@ 
-222,8 +224,9 @@ static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr, v } static int addr4_resolve(struct sockaddr_in *src_in, -struct sockaddr_in *dst_in, -struct rdma_dev_addr *addr) +const struct sockaddr_in *dst_in, +struct rdma_dev_addr *addr, +struct rtable **prt) { __be32 src_ip = src_in->sin_addr.s_addr; __be32 dst_ip = dst_in->sin_addr.s_addr; @@ -243,36 +246,23 @@ static int addr4_resolve(struct sockaddr_in *src_in, src_in->sin_family = AF_INET; src_in->sin_addr.s_addr = fl4.saddr; - if (rt->dst.dev->flags & IFF_LOOPBACK) { - ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL); - if (!ret) - memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN); - goto put; - } - - /* If the device does ARP internally, return 'done' */ - if (rt->dst.dev->flags & IFF_NOARP) { - ret = rdma_copy_addr(addr, rt->dst.dev, NULL); - goto put; - } - /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't * routable) and we could set the network type accordingly. */ if (rt->rt_uses_gateway) addr->network = RDMA_NETWORK_IPV4; - ret = dst_fetch_ha(>dst, addr, ); -put: - ip_rt_put(rt); + *prt = rt; + return 0; out: return ret; } #if IS_ENABLED(CONFIG_IPV6) static int addr6_resolve(struct sockaddr_in6 *src_in, -struct sockaddr_in6 *dst_in, -struct rdma_dev_addr *addr) +const struct sockaddr_in6 *dst_in, +struct rdma_dev_addr *addr, +struct dst_entry **pdst) { struct flowi6 fl6; struct dst_entry *dst; @@ -299,49 +289,109 @@ static int addr6_resolve(struct sockaddr_in6 *src_in, src_in->sin6_addr = fl6.saddr; } - if (dst->dev->flags & IFF_LOOPBACK) { - ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL); - if (!ret) - memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN); - goto put; -
[PATCH for-next V2 03/11] IB/core: Add gid attributes to sysfs
This patch set adds attributes of net device and gid type to each GID in the GID table. Users that use verbs directly need to specify the GID index. Since the same GID could have different types or associated net devices, users should have the ability to query the associated GID attributes. Adding these attributes to sysfs. Signed-off-by: Matan Barak --- Documentation/ABI/testing/sysfs-class-infiniband | 16 ++ drivers/infiniband/core/sysfs.c | 184 ++- 2 files changed, 198 insertions(+), 2 deletions(-) create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband diff --git a/Documentation/ABI/testing/sysfs-class-infiniband b/Documentation/ABI/testing/sysfs-class-infiniband new file mode 100644 index 000..a86abe6 --- /dev/null +++ b/Documentation/ABI/testing/sysfs-class-infiniband @@ -0,0 +1,16 @@ +What: /sys/class/infiniband//ports//gid_attrs/ndevs/ +Date: November 29, 2015 +KernelVersion: 4.4.0 +Contact: linux-rdma@vger.kernel.org +Description: The net-device's name associated with the GID resides + at index . + +What: /sys/class/infiniband//ports//gid_attrs/types/ +Date: November 29, 2015 +KernelVersion: 4.4.0 +Contact: linux-rdma@vger.kernel.org +Description: The RoCE type of the associated GID resides at index . + This could either be "IB/RoCE v1" for IB and RoCE v1 based GIDs + or "RoCE v2" for RoCE v2 based GIDs. 
+ + diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c index b1f37d4..4d5d87a 100644 --- a/drivers/infiniband/core/sysfs.c +++ b/drivers/infiniband/core/sysfs.c @@ -37,12 +37,22 @@ #include #include #include +#include #include +struct ib_port; + +struct gid_attr_group { + struct ib_port *port; + struct kobject kobj; + struct attribute_group ndev; + struct attribute_group type; +}; struct ib_port { struct kobject kobj; struct ib_device *ibdev; + struct gid_attr_group *gid_attr_group; struct attribute_group gid_group; struct attribute_group pkey_group; u8 port_num; @@ -84,6 +94,24 @@ static const struct sysfs_ops port_sysfs_ops = { .show = port_attr_show }; +static ssize_t gid_attr_show(struct kobject *kobj, +struct attribute *attr, char *buf) +{ + struct port_attribute *port_attr = + container_of(attr, struct port_attribute, attr); + struct ib_port *p = container_of(kobj, struct gid_attr_group, +kobj)->port; + + if (!port_attr->show) + return -EIO; + + return port_attr->show(p, port_attr, buf); +} + +static const struct sysfs_ops gid_attr_sysfs_ops = { + .show = gid_attr_show +}; + static ssize_t state_show(struct ib_port *p, struct port_attribute *unused, char *buf) { @@ -281,6 +309,46 @@ static struct attribute *port_default_attrs[] = { NULL }; +static size_t print_ndev(struct ib_gid_attr *gid_attr, char *buf) +{ + if (!gid_attr->ndev) + return -EINVAL; + + return sprintf(buf, "%s\n", gid_attr->ndev->name); +} + +static size_t print_gid_type(struct ib_gid_attr *gid_attr, char *buf) +{ + return sprintf(buf, "%s\n", ib_cache_gid_type_str(gid_attr->gid_type)); +} + +static ssize_t _show_port_gid_attr(struct ib_port *p, + struct port_attribute *attr, + char *buf, + size_t (*print)(struct ib_gid_attr *gid_attr, + char *buf)) +{ + struct port_table_attribute *tab_attr = + container_of(attr, struct port_table_attribute, attr); + union ib_gid gid; + struct ib_gid_attr gid_attr = {}; + ssize_t ret; + va_list args; + + ret = 
ib_query_gid(p->ibdev, p->port_num, tab_attr->index, , + _attr); + if (ret) + goto err; + + ret = print(_attr, buf); + +err: + if (gid_attr.ndev) + dev_put(gid_attr.ndev); + va_end(args); + return ret; +} + static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, char *buf) { @@ -296,6 +364,19 @@ static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr, return sprintf(buf, "%pI6\n", gid.raw); } +static ssize_t show_port_gid_attr_ndev(struct ib_port *p, + struct port_attribute *attr, char *buf) +{ + return _show_port_gid_attr(p, attr, buf, print_ndev); +} + +static ssize_t show_port_gid_attr_gid_type(struct ib_port *p, + struct port_attribute *attr, + char *buf) +{
[PATCH for-next V2 08/11] IB/rdma_cm: Add wrapper for cma reference count
Currently, cma users can't increase or decrease the cma reference count. This is necessary when setting cma attributes (like the default GID type) in order to avoid use-after-free errors. Adding cma_ref_dev and cma_deref_dev APIs. Signed-off-by: Matan Barak --- drivers/infiniband/core/cma.c | 11 +-- drivers/infiniband/core/core_priv.h | 4 2 files changed, 13 insertions(+), 2 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index cf52b65..f78088a 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -60,6 +60,8 @@ #include #include +#include "core_priv.h" + MODULE_AUTHOR("Sean Hefty"); MODULE_DESCRIPTION("Generic RDMA CM Agent"); MODULE_LICENSE("Dual BSD/GPL"); @@ -185,6 +187,11 @@ enum { CMA_OPTION_AFONLY, }; +void cma_ref_dev(struct cma_device *cma_dev) +{ + atomic_inc(_dev->refcount); +} + /* * Device removal can occur at anytime, so we need extra handling to * serialize notifying the user of device removal with other callbacks. 
@@ -339,7 +346,7 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 ip_ver) static void cma_attach_to_dev(struct rdma_id_private *id_priv, struct cma_device *cma_dev) { - atomic_inc(_dev->refcount); + cma_ref_dev(cma_dev); id_priv->cma_dev = cma_dev; id_priv->id.device = cma_dev->device; id_priv->id.route.addr.dev_addr.transport = @@ -347,7 +354,7 @@ static void cma_attach_to_dev(struct rdma_id_private *id_priv, list_add_tail(_priv->list, _dev->id_list); } -static inline void cma_deref_dev(struct cma_device *cma_dev) +void cma_deref_dev(struct cma_device *cma_dev) { if (atomic_dec_and_test(_dev->refcount)) complete(_dev->comp); diff --git a/drivers/infiniband/core/core_priv.h b/drivers/infiniband/core/core_priv.h index 3b250a2..1945b4e 100644 --- a/drivers/infiniband/core/core_priv.h +++ b/drivers/infiniband/core/core_priv.h @@ -38,6 +38,10 @@ #include +struct cma_device; +void cma_ref_dev(struct cma_device *cma_dev); +void cma_deref_dev(struct cma_device *cma_dev); + int ib_device_register_sysfs(struct ib_device *device, int (*port_callback)(struct ib_device *, u8, struct kobject *)); -- 2.1.0
[PATCH for-next V2 10/11] IB/core: Initialize UD header structure with IP and UDP headers
From: Moni Shoua ib_ud_header_init() is used to format InfiniBand headers in a buffer up to (but not with) BTH. For RoCE UDP ENCAP it is required that this function also be able to build IP and UDP headers. Signed-off-by: Moni Shoua Signed-off-by: Matan Barak --- drivers/infiniband/core/ud_header.c| 155 ++--- drivers/infiniband/hw/mlx4/qp.c| 7 +- drivers/infiniband/hw/mthca/mthca_qp.c | 2 +- include/rdma/ib_pack.h | 45 -- 4 files changed, 188 insertions(+), 21 deletions(-) diff --git a/drivers/infiniband/core/ud_header.c b/drivers/infiniband/core/ud_header.c index 72feee6..96697e7 100644 --- a/drivers/infiniband/core/ud_header.c +++ b/drivers/infiniband/core/ud_header.c @@ -35,6 +35,7 @@ #include #include #include +#include #include @@ -116,6 +117,72 @@ static const struct ib_field vlan_table[] = { .size_bits= 16 } }; +static const struct ib_field ip4_table[] = { + { STRUCT_FIELD(ip4, ver), + .offset_words = 0, + .offset_bits = 0, + .size_bits= 4 }, + { STRUCT_FIELD(ip4, hdr_len), + .offset_words = 0, + .offset_bits = 4, + .size_bits= 4 }, + { STRUCT_FIELD(ip4, tos), + .offset_words = 0, + .offset_bits = 8, + .size_bits= 8 }, + { STRUCT_FIELD(ip4, tot_len), + .offset_words = 0, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, id), + .offset_words = 1, + .offset_bits = 0, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, frag_off), + .offset_words = 1, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, ttl), + .offset_words = 2, + .offset_bits = 0, + .size_bits= 8 }, + { STRUCT_FIELD(ip4, protocol), + .offset_words = 2, + .offset_bits = 8, + .size_bits= 8 }, + { STRUCT_FIELD(ip4, check), + .offset_words = 2, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(ip4, saddr), + .offset_words = 3, + .offset_bits = 0, + .size_bits= 32 }, + { STRUCT_FIELD(ip4, daddr), + .offset_words = 4, + .offset_bits = 0, + .size_bits= 32 } +}; + +static const struct ib_field udp_table[] = { + { STRUCT_FIELD(udp, sport), + .offset_words = 0, + .offset_bits = 
0, + .size_bits= 16 }, + { STRUCT_FIELD(udp, dport), + .offset_words = 0, + .offset_bits = 16, + .size_bits= 16 }, + { STRUCT_FIELD(udp, length), + .offset_words = 1, + .offset_bits = 0, + .size_bits= 16 }, + { STRUCT_FIELD(udp, csum), + .offset_words = 1, + .offset_bits = 16, + .size_bits= 16 } +}; + static const struct ib_field grh_table[] = { { STRUCT_FIELD(grh, ip_version), .offset_words = 0, @@ -213,26 +280,57 @@ static const struct ib_field deth_table[] = { .size_bits= 24 } }; +__be16 ib_ud_ip4_csum(struct ib_ud_header *header) +{ + struct iphdr iph; + + iph.ihl = 5; + iph.version = 4; + iph.tos = header->ip4.tos; + iph.tot_len = header->ip4.tot_len; + iph.id = header->ip4.id; + iph.frag_off= header->ip4.frag_off; + iph.ttl = header->ip4.ttl; + iph.protocol= header->ip4.protocol; + iph.check = 0; + iph.saddr = header->ip4.saddr; + iph.daddr = header->ip4.daddr; + + return ip_fast_csum((u8 *), iph.ihl); +} +EXPORT_SYMBOL(ib_ud_ip4_csum); + /** * ib_ud_header_init - Initialize UD header structure * @payload_bytes:Length of packet payload * @lrh_present: specify if LRH is present * @eth_present: specify if Eth header is present * @vlan_present: packet is tagged vlan - * @grh_present:GRH flag (if non-zero, GRH will be included) + * @grh_present: GRH flag (if non-zero, GRH will be included) + * @ip_version: if non-zero, IP header, V4 or V6, will be included + * @udp_present :if non-zero, UDP header will be included * @immediate_present: specify if immediate data is present * @header:Structure to initialize */ -void ib_ud_header_init(int payload_bytes, - int lrh_present, - int eth_present, - int vlan_present, - int grh_present, - int immediate_present, - struct ib_ud_header *header) +int ib_ud_header_init(int payload_bytes, + intlrh_present, + inteth_present, +
[PATCH for-next V2 04/11] IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type
Adding RoCE v2 GID type and port type. Vendors which support this type will get their GID table populated with RoCE v2 GIDs automatically. Signed-off-by: Matan Barak--- drivers/infiniband/core/cache.c | 1 + drivers/infiniband/core/roce_gid_mgmt.c | 3 ++- include/rdma/ib_verbs.h | 23 +-- 3 files changed, 24 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c index 566fd8f..88b4b6f 100644 --- a/drivers/infiniband/core/cache.c +++ b/drivers/infiniband/core/cache.c @@ -128,6 +128,7 @@ static void dispatch_gid_change_event(struct ib_device *ib_dev, u8 port) static const char * const gid_type_str[] = { [IB_GID_TYPE_IB]= "IB/RoCE v1", + [IB_GID_TYPE_ROCE_UDP_ENCAP]= "RoCE v2", }; const char *ib_cache_gid_type_str(enum ib_gid_type gid_type) diff --git a/drivers/infiniband/core/roce_gid_mgmt.c b/drivers/infiniband/core/roce_gid_mgmt.c index 61c27a7..1e3673f 100644 --- a/drivers/infiniband/core/roce_gid_mgmt.c +++ b/drivers/infiniband/core/roce_gid_mgmt.c @@ -71,7 +71,8 @@ static const struct { bool (*is_supported)(const struct ib_device *device, u8 port_num); enum ib_gid_type gid_type; } PORT_CAP_TO_GID_TYPE[] = { - {rdma_protocol_roce, IB_GID_TYPE_ROCE}, + {rdma_protocol_roce_eth_encap, IB_GID_TYPE_ROCE}, + {rdma_protocol_roce_udp_encap, IB_GID_TYPE_ROCE_UDP_ENCAP}, }; #define CAP_TO_GID_TABLE_SIZE ARRAY_SIZE(PORT_CAP_TO_GID_TYPE) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 2933aeb..87df931 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -71,6 +71,7 @@ enum ib_gid_type { /* If link layer is Ethernet, this is RoCE V1 */ IB_GID_TYPE_IB= 0, IB_GID_TYPE_ROCE = 0, + IB_GID_TYPE_ROCE_UDP_ENCAP = 1, IB_GID_TYPE_SIZE }; @@ -401,6 +402,7 @@ union rdma_protocol_stats { #define RDMA_CORE_CAP_PROT_IB 0x0010 #define RDMA_CORE_CAP_PROT_ROCE 0x0020 #define RDMA_CORE_CAP_PROT_IWARP0x0040 +#define RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP 0x0080 #define RDMA_CORE_PORT_IBA_IB 
(RDMA_CORE_CAP_PROT_IB \ | RDMA_CORE_CAP_IB_MAD \ @@ -413,6 +415,12 @@ union rdma_protocol_stats { | RDMA_CORE_CAP_IB_CM \ | RDMA_CORE_CAP_AF_IB \ | RDMA_CORE_CAP_ETH_AH) +#define RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP \ + (RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP \ + | RDMA_CORE_CAP_IB_MAD \ + | RDMA_CORE_CAP_IB_CM \ + | RDMA_CORE_CAP_AF_IB \ + | RDMA_CORE_CAP_ETH_AH) #define RDMA_CORE_PORT_IWARP (RDMA_CORE_CAP_PROT_IWARP \ | RDMA_CORE_CAP_IW_CM) #define RDMA_CORE_PORT_INTEL_OPA (RDMA_CORE_PORT_IBA_IB \ @@ -1975,6 +1983,17 @@ static inline bool rdma_protocol_ib(const struct ib_device *device, u8 port_num) static inline bool rdma_protocol_roce(const struct ib_device *device, u8 port_num) { + return device->port_immutable[port_num].core_cap_flags & + (RDMA_CORE_CAP_PROT_ROCE | RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP); +} + +static inline bool rdma_protocol_roce_udp_encap(const struct ib_device *device, u8 port_num) +{ + return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP; +} + +static inline bool rdma_protocol_roce_eth_encap(const struct ib_device *device, u8 port_num) +{ return device->port_immutable[port_num].core_cap_flags & RDMA_CORE_CAP_PROT_ROCE; } @@ -1985,8 +2004,8 @@ static inline bool rdma_protocol_iwarp(const struct ib_device *device, u8 port_n static inline bool rdma_ib_or_roce(const struct ib_device *device, u8 port_num) { - return device->port_immutable[port_num].core_cap_flags & - (RDMA_CORE_CAP_PROT_IB | RDMA_CORE_CAP_PROT_ROCE); + return rdma_protocol_ib(device, port_num) || + rdma_protocol_roce(device, port_num); } /** -- 2.1.0
Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc
Bloating the WC with a field that's not really useful for the ULPs seems pretty sad..
Re: [PATCH v2 0/2] Handle mlx4 max_sge_rd correctly
On Tue, Nov 10, 2015 at 12:36:44PM +0200, Sagi Grimberg wrote: > Any reply on this patchset? Did we ever make progress on this?
Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc
On Thu, Dec 03, 2015 at 03:47:12PM +0200, Matan Barak wrote: > From: Somnath Kotur > > Providers should tell IB core the wc's network type. > This is used in order to search for the proper GID in the > GID table. When using HCAs that can't provide this info, > IB core tries to deep examine the packet and extract > the GID type by itself. Eh? A wc has a sgid_index, and in this brave new world a gid has the network type. Why do we need to specify it again? > memset(ah_attr, 0, sizeof *ah_attr); > if (rdma_cap_eth_ah(device, port_num)) { > + if (wc->wc_flags & IB_WC_WITH_NETWORK_HDR_TYPE) > + net_type = wc->network_hdr_type; > + else > + net_type = ib_get_net_type_by_grh(device, port_num, > grh); > + gid_type = ib_network_to_gid_type(net_type); Like here for instance. ... and I keep saying this is all wrong, once you get into IP land this entire process needs a route/neighbour lookup. > - ret = rdma_addr_find_dmac_by_grh(>dgid, >sgid, > + ret = rdma_addr_find_dmac_by_grh(, , >ah_attr->dmac, >wc->wc_flags & > IB_WC_WITH_VLAN ? >NULL : _id, ie no to this. > + if (sgid_attr.gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP) > + /* TODO: get the hoplimit from the inet/inet6 > + * device > + */ And no again, please fix this and all other missing route lookups before sending another version. > + struct { > + /* The IB spec states that if it's IPv4, the header RoCEv2 spec, surely Jason
Re: [PATCH v2 0/2] Handle mlx4 max_sge_rd correctly
> Did we ever make progress on this? Just up to Doug to pull it in.
[PATCH for-next V1 1/5] IB/mlx5: Add create_cq extended command
In order to create a CQ that supports timestamps, mlx5 needs to support the extended create CQ command with the timestamp flag. Signed-off-by: Matan Barak Reviewed-by: Eli Cohen --- drivers/infiniband/hw/mlx5/cq.c | 7 +++ 1 file changed, 7 insertions(+) diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c index 3ce5cfa7..a9a7921 100644 --- a/drivers/infiniband/hw/mlx5/cq.c +++ b/drivers/infiniband/hw/mlx5/cq.c @@ -760,6 +760,10 @@ static void destroy_cq_kernel(struct mlx5_ib_dev *dev, struct mlx5_ib_cq *cq) mlx5_db_free(dev->mdev, >db); } +enum { + CQ_CREATE_FLAGS_SUPPORTED = IB_CQ_FLAGS_TIMESTAMP_COMPLETION +}; + struct ib_cq *mlx5_ib_create_cq(struct ib_device *ibdev, const struct ib_cq_init_attr *attr, struct ib_ucontext *context, @@ -783,6 +787,9 @@ struct ib_cq *mlx5_ib_create_cq(struct ib_device *ibdev, if (entries < 0) return ERR_PTR(-EINVAL); + if (attr->flags & ~CQ_CREATE_FLAGS_SUPPORTED) + return ERR_PTR(-EOPNOTSUPP); + entries = roundup_pow_of_two(entries + 1); if (entries > (1 << MLX5_CAP_GEN(dev->mdev, log_max_cq_sz))) return ERR_PTR(-EINVAL); -- 2.1.0
[PATCH for-next V1 0/5] User-space time-stamping support for mlx5_ib
Hi Eli, This patch-set adds user-space support for time-stamping in mlx5_ib. It implements the necessary API: (a) ib_create_cq_ex - Add support for CQ creation flags (b) ib_query_device - return timestamp_mask and hca_core_clock. We also add support for mmaping the HCA's free running clock. In order to do so, we use the response of the vendor's extended part in init_ucontext. This allows us to pass the page offset of the free running clock register to the user-space driver. In order to implement it in a future extensible manner, we use the same mechanism of verbs extensions to the mlx5 vendor part as well. Regards, Matan Changes from v0: * Limit mmap PAGE_SIZE to 4K (security wise). * Optimize ib_is_udata_cleared. * Pass hca_core_clock_offset in the vendor's response part of init_ucontext. Matan Barak (5): IB/mlx5: Add create_cq extended command IB/core: Add ib_is_udata_cleared IB/mlx5: Add support for hca_core_clock and timestamp_mask IB/mlx5: Add hca_core_clock_offset to udata in init_ucontext IB/mlx5: Mmap the HCA's core clock register to user-space drivers/infiniband/hw/mlx5/cq.c | 7 drivers/infiniband/hw/mlx5/main.c| 67 +++- drivers/infiniband/hw/mlx5/mlx5_ib.h | 7 +++- drivers/infiniband/hw/mlx5/user.h| 12 +-- include/linux/mlx5/device.h | 7 ++-- include/linux/mlx5/mlx5_ifc.h| 9 +++-- include/rdma/ib_verbs.h | 67 7 files changed, 160 insertions(+), 16 deletions(-) -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
[PATCH for-next V1 3/5] IB/mlx5: Add support for hca_core_clock and timestamp_mask
Reporting the hca_core_clock (in kHZ) and the timestamp_mask in query_device extended verb. timestamp_mask is used by users in order to know what is the valid range of the raw timestamps, while hca_core_clock reports the clock frequency that is used for timestamps. Signed-off-by: Matan BarakReviewed-by: Moshe Lazer --- drivers/infiniband/hw/mlx5/main.c | 2 ++ include/linux/mlx5/mlx5_ifc.h | 9 ++--- 2 files changed, 8 insertions(+), 3 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index 9b77058..8aa0330 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -504,6 +504,8 @@ static int mlx5_ib_query_device(struct ib_device *ibdev, props->max_total_mcast_qp_attach = props->max_mcast_qp_attach * props->max_mcast_grp; props->max_map_per_fmr = INT_MAX; /* no limit in ConnectIB */ + props->hca_core_clock = MLX5_CAP_GEN(mdev, device_frequency_khz); + props->timestamp_mask = 0x7FFFULL; #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING if (MLX5_CAP_GEN(mdev, pg)) diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h index af51cd2..c57e975 100644 --- a/include/linux/mlx5/mlx5_ifc.h +++ b/include/linux/mlx5/mlx5_ifc.h @@ -792,15 +792,18 @@ struct mlx5_ifc_cmd_hca_cap_bits { u8 reserved_63[0x8]; u8 log_uar_page_sz[0x10]; - u8 reserved_64[0x100]; + u8 reserved_64[0x20]; + u8 device_frequency_mhz[0x20]; + u8 device_frequency_khz[0x20]; + u8 reserved_65[0xa0]; - u8 reserved_65[0x1f]; + u8 reserved_66[0x1f]; u8 cqe_zip[0x1]; u8 cqe_zip_timeout[0x10]; u8 cqe_zip_max_num[0x10]; - u8 reserved_66[0x220]; + u8 reserved_67[0x220]; }; enum { -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
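To illustrate what userspace does with these two values: hca_core_clock (in kHz) converts raw timestamp deltas into wall-clock units, and timestamp_mask bounds how long the free-running counter runs before wrapping. A hedged sketch of the arithmetic (helper names are made up for illustration):

```c
#include <stdint.h>

/* Convert a raw timestamp delta to nanoseconds, given the core clock
 * frequency in kHz as reported by query_device. */
static uint64_t hca_delta_to_ns(uint64_t delta_cycles, uint64_t clock_khz)
{
	return delta_cycles * 1000000ULL / clock_khz;
}

/* Seconds until the timestamp counter wraps, derived from the mask.
 * (Assumes mask + 1 does not overflow, i.e. the mask is < 64 bits.) */
static uint64_t hca_wrap_seconds(uint64_t timestamp_mask, uint64_t clock_khz)
{
	return (timestamp_mask + 1) / (clock_khz * 1000ULL);
}
```

For example, a 32-bit mask at a 1 MHz core clock wraps in roughly 4295 seconds, which is why consumers comparing two raw timestamps must account for wrap-around.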
[PATCH for-next V1 4/5] IB/mlx5: Add hca_core_clock_offset to udata in init_ucontext
Pass hca_core_clock_offset to user-space is mandatory in order to let the user-space read the free-running clock register from the right offset in the memory mapped page. Passing this value is done by changing the vendor's command and response of init_ucontext to be in extensible form. Signed-off-by: Matan BarakReviewed-By: Moshe Lazer --- drivers/infiniband/hw/mlx5/main.c| 37 drivers/infiniband/hw/mlx5/mlx5_ib.h | 3 +++ drivers/infiniband/hw/mlx5/user.h| 12 ++-- include/linux/mlx5/device.h | 7 +-- 4 files changed, 47 insertions(+), 12 deletions(-) diff --git a/drivers/infiniband/hw/mlx5/main.c b/drivers/infiniband/hw/mlx5/main.c index 8aa0330..e4ce010 100644 --- a/drivers/infiniband/hw/mlx5/main.c +++ b/drivers/infiniband/hw/mlx5/main.c @@ -796,8 +796,8 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev, struct ib_udata *udata) { struct mlx5_ib_dev *dev = to_mdev(ibdev); - struct mlx5_ib_alloc_ucontext_req_v2 req; - struct mlx5_ib_alloc_ucontext_resp resp; + struct mlx5_ib_alloc_ucontext_req_v2 req = {}; + struct mlx5_ib_alloc_ucontext_resp resp = {}; struct mlx5_ib_ucontext *context; struct mlx5_uuar_info *uuari; struct mlx5_uar *uars; @@ -812,20 +812,19 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev, if (!dev->ib_active) return ERR_PTR(-EAGAIN); - memset(, 0, sizeof(req)); reqlen = udata->inlen - sizeof(struct ib_uverbs_cmd_hdr); if (reqlen == sizeof(struct mlx5_ib_alloc_ucontext_req)) ver = 0; - else if (reqlen == sizeof(struct mlx5_ib_alloc_ucontext_req_v2)) + else if (reqlen >= sizeof(struct mlx5_ib_alloc_ucontext_req_v2)) ver = 2; else return ERR_PTR(-EINVAL); - err = ib_copy_from_udata(, udata, reqlen); + err = ib_copy_from_udata(, udata, min(reqlen, sizeof(req))); if (err) return ERR_PTR(err); - if (req.flags || req.reserved) + if (req.flags) return ERR_PTR(-EINVAL); if (req.total_num_uuars > MLX5_MAX_UUARS) @@ -834,6 +833,14 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device 
*ibdev, if (req.total_num_uuars == 0) return ERR_PTR(-EINVAL); + if (req.comp_mask) + return ERR_PTR(-EOPNOTSUPP); + + if (reqlen > sizeof(req) && + !ib_is_udata_cleared(udata, '\0', sizeof(req), +udata->inlen - sizeof(req))) + return ERR_PTR(-EOPNOTSUPP); + req.total_num_uuars = ALIGN(req.total_num_uuars, MLX5_NON_FP_BF_REGS_PER_PAGE); if (req.num_low_latency_uuars > req.total_num_uuars - 1) @@ -849,6 +856,8 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev, resp.max_send_wqebb = 1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz); resp.max_recv_wr = 1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz); resp.max_srq_recv_wr = 1 << MLX5_CAP_GEN(dev->mdev, log_max_srq_sz); + resp.response_length = min(offsetof(typeof(resp), response_length) + + sizeof(resp.response_length), udata->outlen); context = kzalloc(sizeof(*context), GFP_KERNEL); if (!context) @@ -899,8 +908,20 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct ib_device *ibdev, resp.tot_uuars = req.total_num_uuars; resp.num_ports = MLX5_CAP_GEN(dev->mdev, num_ports); - err = ib_copy_to_udata(udata, , - sizeof(resp) - sizeof(resp.reserved)); + + if (field_avail(typeof(resp), reserved2, udata->outlen)) + resp.response_length += sizeof(resp.reserved2); + + if (field_avail(typeof(resp), hca_core_clock_offset, udata->outlen)) { + resp.comp_mask |= + MLX5_IB_ALLOC_UCONTEXT_RESP_MASK_CORE_CLOCK_OFFSET; + resp.hca_core_clock_offset = + offsetof(struct mlx5_init_seg, internal_timer_h) % + PAGE_SIZE; + resp.response_length += sizeof(resp.hca_core_clock_offset); + } + + err = ib_copy_to_udata(udata, , resp.response_length); if (err) goto out_uars; diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h b/drivers/infiniband/hw/mlx5/mlx5_ib.h index b0deeb3..b2a6643 100644 --- a/drivers/infiniband/hw/mlx5/mlx5_ib.h +++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h @@ -55,6 +55,9 @@ pr_err("%s:%s:%d:(pid %d): " format, (dev)->ib_dev.name, __func__,\ pr_warn("%s:%s:%d:(pid %d): " format, (dev)->ib_dev.name, 
__func__,\ __LINE__, current->pid, ##arg) +#define field_avail(type, fld, sz) (offsetof(type, fld) +
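The field_avail() macro, whose definition is cut off above, captures the core rule of the extensible-response scheme: a field may be copied to userspace only when the user's output buffer is large enough to hold it in full, and response_length grows to match what was actually reported. A stand-alone sketch (the response struct below is illustrative, not the real mlx5 ABI):

```c
#include <stddef.h>
#include <stdint.h>

/* A field is available when it ends within the user's buffer length. */
#define field_avail(type, fld, sz) \
	(offsetof(type, fld) + sizeof(((type *)0)->fld) <= (sz))

/* Illustrative response layout -- not the actual mlx5 ABI. */
struct demo_resp {
	uint32_t response_length;
	uint32_t tot_uuars;
	uint64_t hca_core_clock_offset;
};
```

An old userspace with an 8-byte response buffer simply never receives hca_core_clock_offset, while a new one gets it along with the comp_mask bit saying it is valid.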
[PATCH libmlx5 V1 6/6] Add always_inline check
Always inline isn't supported by every compiler. Adding it to configure.ac in order to support it only when possible. Inline other poll_one data path functions in order to eliminate "ifs". Signed-off-by: Matan Barak--- configure.ac | 17 + src/cq.c | 42 +- src/mlx5.h | 6 ++ 3 files changed, 52 insertions(+), 13 deletions(-) diff --git a/configure.ac b/configure.ac index fca0b46..50b4f9c 100644 --- a/configure.ac +++ b/configure.ac @@ -65,6 +65,23 @@ AC_CHECK_FUNC(ibv_read_sysfs_file, [], AC_MSG_ERROR([ibv_read_sysfs_file() not found. libmlx5 requires libibverbs >= 1.0.3.])) AC_CHECK_FUNCS(ibv_dontfork_range ibv_dofork_range ibv_register_driver) +AC_MSG_CHECKING("always inline") +CFLAGS_BAK="$CFLAGS" +CFLAGS="$CFLAGS -Werror" +AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[ + static inline int f(void) + __attribute__((always_inline)); + static inline int f(void) + { + return 1; + } +]],[[ + int a = f(); + a = a; +]])], [AC_MSG_RESULT([yes]) AC_DEFINE([HAVE_ALWAYS_INLINE], [1], [Define if __attribute((always_inline)).])], +[AC_MSG_RESULT([no])]) +CFLAGS="$CFLAGS_BAK" + dnl Now check if for libibverbs 1.0 vs 1.1 dummy=if$$ cat < $dummy.c diff --git a/src/cq.c b/src/cq.c index fcb4237..41751b7 100644 --- a/src/cq.c +++ b/src/cq.c @@ -218,6 +218,14 @@ static inline void handle_good_req_ex(struct ibv_wc_ex *wc_ex, uint64_t wc_flags_yes, uint64_t wc_flags_no, uint32_t qpn, uint64_t *wc_flags_out) + ALWAYS_INLINE; +static inline void handle_good_req_ex(struct ibv_wc_ex *wc_ex, + union wc_buffer *pwc_buffer, + struct mlx5_cqe64 *cqe, + uint64_t wc_flags, + uint64_t wc_flags_yes, + uint64_t wc_flags_no, + uint32_t qpn, uint64_t *wc_flags_out) { union wc_buffer wc_buffer = *pwc_buffer; @@ -367,6 +375,14 @@ static inline int handle_responder_ex(struct ibv_wc_ex *wc_ex, uint64_t wc_flags, uint64_t wc_flags_yes, uint64_t wc_flags_no, uint32_t qpn, uint64_t *wc_flags_out) + ALWAYS_INLINE; +static inline int handle_responder_ex(struct ibv_wc_ex *wc_ex, + union wc_buffer *pwc_buffer, + struct 
mlx5_cqe64 *cqe, + struct mlx5_qp *qp, struct mlx5_srq *srq, + uint64_t wc_flags, uint64_t wc_flags_yes, + uint64_t wc_flags_no, uint32_t qpn, + uint64_t *wc_flags_out) { uint16_t wqe_ctr; struct mlx5_wq *wq; @@ -573,7 +589,7 @@ static void mlx5_get_cycles(uint64_t *cycles) static inline struct mlx5_qp *get_req_context(struct mlx5_context *mctx, struct mlx5_resource **cur_rsc, uint32_t rsn, int cqe_ver) - __attribute__((always_inline)); + ALWAYS_INLINE; static inline struct mlx5_qp *get_req_context(struct mlx5_context *mctx, struct mlx5_resource **cur_rsc, uint32_t rsn, int cqe_ver) @@ -589,7 +605,7 @@ static inline int get_resp_cxt_v1(struct mlx5_context *mctx, struct mlx5_resource **cur_rsc, struct mlx5_srq **cur_srq, uint32_t uidx, int *is_srq) - __attribute__((always_inline)); + ALWAYS_INLINE; static inline int get_resp_cxt_v1(struct mlx5_context *mctx, struct mlx5_resource **cur_rsc, struct mlx5_srq **cur_srq, @@ -625,7 +641,7 @@ static inline int get_resp_cxt_v1(struct mlx5_context *mctx, static inline int get_resp_ctx(struct mlx5_context *mctx, struct mlx5_resource **cur_rsc, uint32_t qpn) - __attribute__((always_inline)); + ALWAYS_INLINE; static inline int get_resp_ctx(struct mlx5_context *mctx, struct mlx5_resource **cur_rsc, uint32_t qpn) @@ -647,7 +663,7 @@ static inline int get_resp_ctx(struct mlx5_context *mctx, static inline int get_srq_ctx(struct mlx5_context *mctx,
[PATCH libmlx5 V1 4/6] Add ibv_query_values support
In order to query the current HCA's core clock, libmlx5 should support ibv_query_values verb. Querying the hardware's cycles register is done by mmaping this register to user-space. Therefore, when libmlx5 initializes we mmap the cycles register. This assumes the machine's architecture places the PCI and memory in the same address space. The page offset is passed through init_context vendor's data. Signed-off-by: Matan Barak--- src/mlx5-abi.h | 10 +- src/mlx5.c | 37 + src/mlx5.h | 10 +- src/verbs.c| 46 ++ 4 files changed, 101 insertions(+), 2 deletions(-) diff --git a/src/mlx5-abi.h b/src/mlx5-abi.h index 769ea81..43d4906 100644 --- a/src/mlx5-abi.h +++ b/src/mlx5-abi.h @@ -55,7 +55,11 @@ struct mlx5_alloc_ucontext { __u32 total_num_uuars; __u32 num_low_latency_uuars; __u32 flags; - __u32 reserved; + __u32 comp_mask; +}; + +enum mlx5_ib_alloc_ucontext_resp_mask { + MLX5_IB_ALLOC_UCONTEXT_RESP_MASK_CORE_CLOCK_OFFSET = 1UL << 0, }; struct mlx5_alloc_ucontext_resp { @@ -72,6 +76,10 @@ struct mlx5_alloc_ucontext_resp { __u16 num_ports; __u8cqe_version; __u8reserved; + __u32 comp_mask; + __u32 response_length; + __u32 reserved2; + __u64 hca_core_clock_offset; }; struct mlx5_alloc_pd_resp { diff --git a/src/mlx5.c b/src/mlx5.c index 229d99d..c455c08 100644 --- a/src/mlx5.c +++ b/src/mlx5.c @@ -524,6 +524,30 @@ static int single_threaded_app(void) return 0; } +static int mlx5_map_internal_clock(struct mlx5_device *mdev, + struct ibv_context *ibv_ctx) +{ + struct mlx5_context *context = to_mctx(ibv_ctx); + void *hca_clock_page; + off_t offset = 0; + + set_command(MLX5_MMAP_GET_CORE_CLOCK_CMD, ); + hca_clock_page = mmap(NULL, mdev->page_size, + PROT_READ, MAP_SHARED, ibv_ctx->cmd_fd, + mdev->page_size * offset); + + if (hca_clock_page == MAP_FAILED) { + fprintf(stderr, PFX + "Warning: Timestamp available,\n" + "but failed to mmap() hca core clock page.\n"); + return -1; + } + + context->hca_core_clock = hca_clock_page + + (context->core_clock.offset & (mdev->page_size - 
1)); + return 0; +} + static int mlx5_init_context(struct verbs_device *vdev, struct ibv_context *ctx, int cmd_fd) { @@ -647,6 +671,15 @@ static int mlx5_init_context(struct verbs_device *vdev, context->bfs[j].uuarn = j; } + context->hca_core_clock = NULL; + if (resp.response_length + sizeof(resp.ibv_resp) >= + offsetof(struct mlx5_alloc_ucontext_resp, hca_core_clock_offset) + + sizeof(resp.hca_core_clock_offset) && + resp.comp_mask & MLX5_IB_ALLOC_UCONTEXT_RESP_MASK_CORE_CLOCK_OFFSET) { + context->core_clock.offset = resp.hca_core_clock_offset; + mlx5_map_internal_clock(mdev, ctx); + } + mlx5_spinlock_init(>lock32); context->prefer_bf = get_always_bf(); @@ -664,6 +697,7 @@ static int mlx5_init_context(struct verbs_device *vdev, verbs_set_ctx_op(v_ctx, create_srq_ex, mlx5_create_srq_ex); verbs_set_ctx_op(v_ctx, get_srq_num, mlx5_get_srq_num); verbs_set_ctx_op(v_ctx, query_device_ex, mlx5_query_device_ex); + verbs_set_ctx_op(v_ctx, query_values, mlx5_query_values); verbs_set_ctx_op(v_ctx, create_cq_ex, mlx5_create_cq_ex); if (context->cqe_version && context->cqe_version == 1) verbs_set_ctx_op(v_ctx, poll_cq_ex, mlx5_poll_cq_v1_ex); @@ -697,6 +731,9 @@ static void mlx5_cleanup_context(struct verbs_device *device, if (context->uar[i]) munmap(context->uar[i], page_size); } + if (context->hca_core_clock) + munmap(context->hca_core_clock - context->core_clock.offset, + page_size); close_debug_file(context); } diff --git a/src/mlx5.h b/src/mlx5.h index 0c0b027..b5bcfaa 100644 --- a/src/mlx5.h +++ b/src/mlx5.h @@ -117,7 +117,8 @@ enum { enum { MLX5_MMAP_GET_REGULAR_PAGES_CMD= 0, - MLX5_MMAP_GET_CONTIGUOUS_PAGES_CMD = 1 + MLX5_MMAP_GET_CONTIGUOUS_PAGES_CMD = 1, + MLX5_MMAP_GET_CORE_CLOCK_CMD= 5 }; #define MLX5_CQ_PREFIX "MLX_CQ" @@ -307,6 +308,11 @@ struct mlx5_context { struct mlx5_spinlockhugetlb_lock;
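A note on the offset arithmetic in mlx5_map_internal_clock(): mmap() always returns the start of a page, so only the within-page part of the kernel-reported register offset is added back to locate the clock register inside the mapping. That masking step can be isolated as a tiny helper (name illustrative):

```c
#include <stddef.h>
#include <stdint.h>

/* Byte offset of the clock register within its mapped page; the caller
 * adds this to the page-aligned address returned by mmap(). */
static size_t clock_page_offset(size_t page_size, uint64_t reported_offset)
{
	return (size_t)(reported_offset & (page_size - 1));
}
```

This assumes page_size is a power of two, which holds for every architecture libmlx5 supports.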
[PATCH libmlx5 V1 5/6] Optimize poll_cq
The current ibv_poll_cq_ex mechanism needs to query every field for its existence. In order to avoid this penalty at runtime, add optimized functions for special cases. Signed-off-by: Matan Barak--- src/cq.c| 363 +--- src/mlx5.h | 10 ++ src/verbs.c | 9 +- 3 files changed, 310 insertions(+), 72 deletions(-) diff --git a/src/cq.c b/src/cq.c index 5e06990..fcb4237 100644 --- a/src/cq.c +++ b/src/cq.c @@ -41,6 +41,7 @@ #include #include #include +#include #include #include @@ -207,73 +208,91 @@ union wc_buffer { uint64_t*b64; }; +#define IS_IN_WC_FLAGS(yes, no, maybe, flag) (((yes) & (flag)) ||\ + (!((no) & (flag)) && \ + ((maybe) & (flag static inline void handle_good_req_ex(struct ibv_wc_ex *wc_ex, union wc_buffer *pwc_buffer, struct mlx5_cqe64 *cqe, uint64_t wc_flags, - uint32_t qpn) + uint64_t wc_flags_yes, + uint64_t wc_flags_no, + uint32_t qpn, uint64_t *wc_flags_out) { union wc_buffer wc_buffer = *pwc_buffer; switch (ntohl(cqe->sop_drop_qpn) >> 24) { case MLX5_OPCODE_RDMA_WRITE_IMM: - wc_ex->wc_flags |= IBV_WC_EX_IMM; + *wc_flags_out |= IBV_WC_EX_IMM; case MLX5_OPCODE_RDMA_WRITE: wc_ex->opcode= IBV_WC_RDMA_WRITE; - if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_BYTE_LEN)) wc_buffer.b32++; - if (wc_flags & IBV_WC_EX_WITH_IMM) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_IMM)) wc_buffer.b32++; break; case MLX5_OPCODE_SEND_IMM: - wc_ex->wc_flags |= IBV_WC_EX_IMM; + *wc_flags_out |= IBV_WC_EX_IMM; case MLX5_OPCODE_SEND: case MLX5_OPCODE_SEND_INVAL: wc_ex->opcode= IBV_WC_SEND; - if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_BYTE_LEN)) wc_buffer.b32++; - if (wc_flags & IBV_WC_EX_WITH_IMM) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_IMM)) wc_buffer.b32++; break; case MLX5_OPCODE_RDMA_READ: wc_ex->opcode= IBV_WC_RDMA_READ; - if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) { + if 
(IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_BYTE_LEN)) { *wc_buffer.b32++ = ntohl(cqe->byte_cnt); - wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN; + *wc_flags_out |= IBV_WC_EX_WITH_BYTE_LEN; } - if (wc_flags & IBV_WC_EX_WITH_IMM) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_IMM)) wc_buffer.b32++; break; case MLX5_OPCODE_ATOMIC_CS: wc_ex->opcode= IBV_WC_COMP_SWAP; - if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) { + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_BYTE_LEN)) { *wc_buffer.b32++ = 8; - wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN; + *wc_flags_out |= IBV_WC_EX_WITH_BYTE_LEN; } - if (wc_flags & IBV_WC_EX_WITH_IMM) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_IMM)) wc_buffer.b32++; break; case MLX5_OPCODE_ATOMIC_FA: wc_ex->opcode= IBV_WC_FETCH_ADD; - if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) { + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, + IBV_WC_EX_WITH_BYTE_LEN)) { *wc_buffer.b32++ = 8; - wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN; + *wc_flags_out |= IBV_WC_EX_WITH_BYTE_LEN; } - if (wc_flags & IBV_WC_EX_WITH_IMM) + if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags, +
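The IS_IN_WC_FLAGS() macro is the heart of this optimization: when a flag is compile-time known to be present (yes) or absent (no), the runtime test against the maybe mask folds away after inlining, eliminating the branch entirely. Its logic restated as a plain function for clarity (illustrative; outside the always-inline context it obviously keeps the branch):

```c
#include <stdbool.h>
#include <stdint.h>

/* A flag counts as requested if it is statically known present, or is
 * not statically known absent and is set in the runtime mask. */
static bool in_wc_flags(uint64_t yes, uint64_t no, uint64_t maybe,
			uint64_t flag)
{
	return (yes & flag) || (!(no & flag) && (maybe & flag));
}
```

The specialized poll_one variants then call the shared body with constant yes/no masks, so each variant compiles down to straight-line stores for its field combination.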
[PATCH libmlx5 V1 3/6] Add ibv_create_cq_ex support
In order to create a CQ which supports timestamp, the user needs to specify the timestamp flag for ibv_create_cq_ex. Adding support for ibv_create_cq_ex in the mlx5's vendor library. Signed-off-by: Matan Barak--- src/mlx5.c | 1 + src/mlx5.h | 2 ++ src/verbs.c | 72 + 3 files changed, 66 insertions(+), 9 deletions(-) diff --git a/src/mlx5.c b/src/mlx5.c index eac332b..229d99d 100644 --- a/src/mlx5.c +++ b/src/mlx5.c @@ -664,6 +664,7 @@ static int mlx5_init_context(struct verbs_device *vdev, verbs_set_ctx_op(v_ctx, create_srq_ex, mlx5_create_srq_ex); verbs_set_ctx_op(v_ctx, get_srq_num, mlx5_get_srq_num); verbs_set_ctx_op(v_ctx, query_device_ex, mlx5_query_device_ex); + verbs_set_ctx_op(v_ctx, create_cq_ex, mlx5_create_cq_ex); if (context->cqe_version && context->cqe_version == 1) verbs_set_ctx_op(v_ctx, poll_cq_ex, mlx5_poll_cq_v1_ex); else diff --git a/src/mlx5.h b/src/mlx5.h index 91aafbe..0c0b027 100644 --- a/src/mlx5.h +++ b/src/mlx5.h @@ -600,6 +600,8 @@ int mlx5_dereg_mr(struct ibv_mr *mr); struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe, struct ibv_comp_channel *channel, int comp_vector); +struct ibv_cq *mlx5_create_cq_ex(struct ibv_context *context, +struct ibv_create_cq_attr_ex *cq_attr); int mlx5_poll_cq_ex(struct ibv_cq *ibcq, struct ibv_wc_ex *wc, struct ibv_poll_cq_ex_attr *attr); int mlx5_poll_cq_v1_ex(struct ibv_cq *ibcq, struct ibv_wc_ex *wc, diff --git a/src/verbs.c b/src/verbs.c index 92f273d..1dbee60 100644 --- a/src/verbs.c +++ b/src/verbs.c @@ -240,9 +240,21 @@ static int qp_sig_enabled(void) return 0; } -struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe, - struct ibv_comp_channel *channel, - int comp_vector) +enum { + CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS| + IBV_WC_EX_WITH_COMPLETION_TIMESTAMP +}; + +enum { + CREATE_CQ_SUPPORTED_COMP_MASK = IBV_CREATE_CQ_ATTR_FLAGS +}; + +enum { + CREATE_CQ_SUPPORTED_FLAGS = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP +}; + +static struct ibv_cq *create_cq(struct 
ibv_context *context, + const struct ibv_create_cq_attr_ex *cq_attr) { struct mlx5_create_cq cmd; struct mlx5_create_cq_resp resp; @@ -254,12 +266,33 @@ struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe, FILE *fp = to_mctx(context)->dbg_fp; #endif - if (!cqe) { - mlx5_dbg(fp, MLX5_DBG_CQ, "\n"); + if (!cq_attr->cqe) { + mlx5_dbg(fp, MLX5_DBG_CQ, "CQE invalid\n"); + errno = EINVAL; + return NULL; + } + + if (cq_attr->comp_mask & ~CREATE_CQ_SUPPORTED_COMP_MASK) { + mlx5_dbg(fp, MLX5_DBG_CQ, +"Unsupported comp_mask for create_cq\n"); + errno = EINVAL; + return NULL; + } + + if (cq_attr->comp_mask & IBV_CREATE_CQ_ATTR_FLAGS && + cq_attr->flags & ~CREATE_CQ_SUPPORTED_FLAGS) { + mlx5_dbg(fp, MLX5_DBG_CQ, +"Unsupported creation flags requested for create_cq\n"); errno = EINVAL; return NULL; } + if (cq_attr->wc_flags & ~CREATE_CQ_SUPPORTED_WC_FLAGS) { + mlx5_dbg(fp, MLX5_DBG_CQ, "\n"); + errno = ENOTSUP; + return NULL; + } + cq = calloc(1, sizeof *cq); if (!cq) { mlx5_dbg(fp, MLX5_DBG_CQ, "\n"); @@ -273,14 +306,14 @@ struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe, goto err; /* The additional entry is required for resize CQ */ - if (cqe <= 0) { + if (cq_attr->cqe <= 0) { mlx5_dbg(fp, MLX5_DBG_CQ, "\n"); errno = EINVAL; goto err_spl; } - ncqe = align_queue_size(cqe + 1); - if ((ncqe > (1 << 24)) || (ncqe < (cqe + 1))) { + ncqe = align_queue_size(cq_attr->cqe + 1); + if ((ncqe > (1 << 24)) || (ncqe < (cq_attr->cqe + 1))) { mlx5_dbg(fp, MLX5_DBG_CQ, "ncqe %d\n", ncqe); errno = EINVAL; goto err_spl; @@ -313,7 +346,8 @@ struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe, cmd.db_addr = (uintptr_t) cq->dbrec; cmd.cqe_size = cqe_sz; - ret = ibv_cmd_create_cq(context, ncqe - 1, channel, comp_vector, + ret = ibv_cmd_create_cq(context, ncqe - 1, cq_attr->channel, + cq_attr->comp_vector, >ibv_cq, _cmd, sizeof cmd, _resp, sizeof resp);
[PATCH libmlx5 V1 2/6] Add timestamp support for ibv_poll_cq_ex
Add support for filling the timestamp field in ibv_poll_cq_ex (if it's
required by the user).

Signed-off-by: Matan Barak
---
 src/cq.c | 5 +++++
 1 file changed, 5 insertions(+)

diff --git a/src/cq.c b/src/cq.c
index 0185696..5e06990 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -913,6 +913,11 @@ inline int mlx5_poll_one_ex(struct mlx5_cq *cq,
 	wc_ex->wc_flags = 0;
 	wc_ex->reserved = 0;
 
+	if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) {
+		*wc_buffer.b64++ = ntohll(cqe64->timestamp);
+		wc_ex->wc_flags |= IBV_WC_EX_WITH_COMPLETION_TIMESTAMP;
+	}
+
 	switch (opcode) {
 	case MLX5_CQE_REQ:
 		err = mlx5_poll_one_cqe_req(cq, cur_rsc, cqe, qpn, cqe_ver,
-- 
2.1.0
[PATCH libmlx5 V1 0/6] Completion timestamping
Hi Eli,

This series adds support for completion timestamps. In order to support
this feature, several extended verbs were implemented (as instructed in
libibverbs).

ibv_query_device_ex was extended to support reading the hca_core_clock
and the timestamp mask.

The vendor-specific data of the init_context verb was changed so that it
conforms to the verbs-extensions form. This is done in order to easily
extend the response data for passing the page offset of the free-running
clock register, which is mandatory for mapping this register to user
space. This mapping is done when libmlx5 initializes.

In order to support CQ completion timestamp reporting, we implement the
ibv_create_cq_ex verb. This verb is used both for creating a CQ which
supports timestamps and for stating which fields should be returned via
the WC. Returning this data is done by implementing ibv_poll_cq_ex: we
query the CQ's requested wc_flags for every field the user has requested
and populate it according to the carried network operation and WC
status.

Last but not least, ibv_poll_cq_ex was optimized in order to eliminate
the if statements and or-operations for common combinations of WC
fields. This is done by inlining and using a custom poll_one_ex function
for these fields.

This series depends on '[PATCH libibverbs 0/5] Completion timestamping'
and is rebased above '[PATCH libmlx5 v1 0/5] Support CQE

Thanks,
Matan

Changes from V0:
 * Use mlx5_init_context in order to pass hca_core_clock_offset.
Matan Barak (6):
  Add ibv_poll_cq_ex support
  Add timestamp support for ibv_poll_cq_ex
  Add ibv_create_cq_ex support
  Add ibv_query_values support
  Optimize poll_cq
  Add always_inline check

 configure.ac   |  17 +
 src/cq.c       | 959 +-
 src/mlx5-abi.h |  10 +-
 src/mlx5.c     |  43 +++
 src/mlx5.h     |  42 ++-
 src/verbs.c    | 115 ++-
 6 files changed, 1037 insertions(+), 149 deletions(-)

-- 
2.1.0
[PATCH libmlx5 V1 1/6] Add ibv_poll_cq_ex support
Extended poll_cq supports writing only user's required work completion fields. Adding support for this extended verb. Signed-off-by: Matan Barak--- src/cq.c | 699 + src/mlx5.c | 5 + src/mlx5.h | 14 ++ 3 files changed, 584 insertions(+), 134 deletions(-) diff --git a/src/cq.c b/src/cq.c index 32f0dd4..0185696 100644 --- a/src/cq.c +++ b/src/cq.c @@ -200,6 +200,85 @@ static void handle_good_req(struct ibv_wc *wc, struct mlx5_cqe64 *cqe) } } +union wc_buffer { + uint8_t *b8; + uint16_t*b16; + uint32_t*b32; + uint64_t*b64; +}; + +static inline void handle_good_req_ex(struct ibv_wc_ex *wc_ex, + union wc_buffer *pwc_buffer, + struct mlx5_cqe64 *cqe, + uint64_t wc_flags, + uint32_t qpn) +{ + union wc_buffer wc_buffer = *pwc_buffer; + + switch (ntohl(cqe->sop_drop_qpn) >> 24) { + case MLX5_OPCODE_RDMA_WRITE_IMM: + wc_ex->wc_flags |= IBV_WC_EX_IMM; + case MLX5_OPCODE_RDMA_WRITE: + wc_ex->opcode= IBV_WC_RDMA_WRITE; + if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) + wc_buffer.b32++; + if (wc_flags & IBV_WC_EX_WITH_IMM) + wc_buffer.b32++; + break; + case MLX5_OPCODE_SEND_IMM: + wc_ex->wc_flags |= IBV_WC_EX_IMM; + case MLX5_OPCODE_SEND: + case MLX5_OPCODE_SEND_INVAL: + wc_ex->opcode= IBV_WC_SEND; + if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) + wc_buffer.b32++; + if (wc_flags & IBV_WC_EX_WITH_IMM) + wc_buffer.b32++; + break; + case MLX5_OPCODE_RDMA_READ: + wc_ex->opcode= IBV_WC_RDMA_READ; + if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) { + *wc_buffer.b32++ = ntohl(cqe->byte_cnt); + wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN; + } + if (wc_flags & IBV_WC_EX_WITH_IMM) + wc_buffer.b32++; + break; + case MLX5_OPCODE_ATOMIC_CS: + wc_ex->opcode= IBV_WC_COMP_SWAP; + if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) { + *wc_buffer.b32++ = 8; + wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN; + } + if (wc_flags & IBV_WC_EX_WITH_IMM) + wc_buffer.b32++; + break; + case MLX5_OPCODE_ATOMIC_FA: + wc_ex->opcode= IBV_WC_FETCH_ADD; + if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) { + *wc_buffer.b32++ = 8; + wc_ex->wc_flags |= 
IBV_WC_EX_WITH_BYTE_LEN; + } + if (wc_flags & IBV_WC_EX_WITH_IMM) + wc_buffer.b32++; + break; + case MLX5_OPCODE_BIND_MW: + wc_ex->opcode= IBV_WC_BIND_MW; + if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) + wc_buffer.b32++; + if (wc_flags & IBV_WC_EX_WITH_IMM) + wc_buffer.b32++; + break; + } + + if (wc_flags & IBV_WC_EX_WITH_QP_NUM) { + *wc_buffer.b32++ = qpn; + wc_ex->wc_flags |= IBV_WC_EX_WITH_QP_NUM; + } + + *pwc_buffer = wc_buffer; +} + static int handle_responder(struct ibv_wc *wc, struct mlx5_cqe64 *cqe, struct mlx5_qp *qp, struct mlx5_srq *srq) { @@ -262,6 +341,103 @@ static int handle_responder(struct ibv_wc *wc, struct mlx5_cqe64 *cqe, return IBV_WC_SUCCESS; } +static inline int handle_responder_ex(struct ibv_wc_ex *wc_ex, + union wc_buffer *pwc_buffer, + struct mlx5_cqe64 *cqe, + struct mlx5_qp *qp, struct mlx5_srq *srq, + uint64_t wc_flags, uint32_t qpn) +{ + uint16_t wqe_ctr; + struct mlx5_wq *wq; + uint8_t g; + union wc_buffer wc_buffer = *pwc_buffer; + int err = 0; + uint32_t byte_len = ntohl(cqe->byte_cnt); + + if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) { + *wc_buffer.b32++ = byte_len; + wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN; + } + if (srq) { + wqe_ctr = ntohs(cqe->wqe_counter); + wc_ex->wr_id = srq->wrid[wqe_ctr]; + mlx5_free_srq_wqe(srq, wqe_ctr); + if (cqe->op_own & MLX5_INLINE_SCATTER_32) + err = mlx5_copy_to_recv_srq(srq, wqe_ctr, cqe, + byte_len); + else if (cqe->op_own &
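The union wc_buffer cursor in this patch packs only the requested fields back to back into the extended WC, advancing by each field's size; the consumer walks the same layout using the returned wc_flags. A compact, testable restatement of the idea (the field set and flag values are illustrative, and 64-bit fields are assumed to land on aligned offsets):

```c
#include <stddef.h>
#include <stdint.h>

union wc_buffer {
	uint8_t  *b8;
	uint32_t *b32;
	uint64_t *b64;
};

/* Write only the fields selected by flags; return bytes consumed. */
static size_t pack_wc(void *out, uint32_t byte_len, uint32_t qp_num,
		      uint64_t timestamp, unsigned int flags)
{
	union wc_buffer cur = { .b8 = out };

	if (flags & 1)		/* e.g. WITH_BYTE_LEN */
		*cur.b32++ = byte_len;
	if (flags & 2)		/* e.g. WITH_QP_NUM */
		*cur.b32++ = qp_num;
	if (flags & 4)		/* e.g. WITH_TIMESTAMP */
		*cur.b64++ = timestamp;
	return (size_t)(cur.b8 - (uint8_t *)out);
}
```

The payoff is density: an application that only asks for two fields pays for two fields per completion, not for the full fixed-size struct ibv_wc.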
[PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared
Extending core and vendor verb commands require us to check that the unknown part of the user's given command is all zeros. Adding ib_is_udata_cleared in order to do so. Signed-off-by: Matan BarakReviewed-by: Moshe Lazer --- include/rdma/ib_verbs.h | 67 + 1 file changed, 67 insertions(+) diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h index 31fb409..0ad89e3 100644 --- a/include/rdma/ib_verbs.h +++ b/include/rdma/ib_verbs.h @@ -1947,6 +1947,73 @@ static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len return copy_to_user(udata->outbuf, src, len) ? -EFAULT : 0; } +#define IB_UDATA_ELEMENT_CLEARED(type, ptr, len, expected) \ + ({type v; \ + typeof(ptr) __ptr = ptr; \ + \ + ptr = (void *)ptr + sizeof(type); \ + len -= sizeof(type); \ + !copy_from_user(, __ptr, sizeof(v)) && (v == expected); }) + +static inline bool ib_is_udata_cleared(struct ib_udata *udata, + u8 cleared_char, + size_t offset, + size_t len) +{ + const void __user *p = udata->inbuf + offset; +#ifdef CONFIG_64BIT + u64 expected = cleared_char; +#else + u32 expected = cleared_char; +#endif + + if (len > USHRT_MAX) + return false; + + if (len && (uintptr_t)p & 1) + if (!IB_UDATA_ELEMENT_CLEARED(u8, p, len, expected)) + return false; + + expected = expected << 8 | expected; + if (len >= 2 && (uintptr_t)p & 2) + if (!IB_UDATA_ELEMENT_CLEARED(u16, p, len, expected)) + return false; + + expected = expected << 16 | expected; +#ifdef CONFIG_64BIT + if (len >= 4 && (uintptr_t)p & 4) + if (!IB_UDATA_ELEMENT_CLEARED(u32, p, len, expected)) + return false; + + expected = expected << 32 | expected; +#define IB_UDATA_CLEAR_LOOP_TYPE u64 +#else +#define IB_UDATA_CLEAR_LOOP_TYPE u32 +#endif + while (len >= sizeof(IB_UDATA_CLEAR_LOOP_TYPE)) + if (!IB_UDATA_ELEMENT_CLEARED(IB_UDATA_CLEAR_LOOP_TYPE, p, len, + expected)) + return false; + +#ifdef CONFIG_64BIT + expected = expected >> 32; + if (len >= 4 && (uintptr_t)p & 4) + if (!IB_UDATA_ELEMENT_CLEARED(u32, p, len, expected)) + 
return false; +#endif + expected = expected >> 16; + if (len >= 2 && (uintptr_t)p & 2) + if (!IB_UDATA_ELEMENT_CLEARED(u16, p, len, expected)) + return false; + + expected = expected >> 8; + if (len) + if (!IB_UDATA_ELEMENT_CLEARED(u8, p, len, expected)) + return false; + + return true; +} + /** * ib_modify_qp_is_ok - Check that the supplied attribute mask * contains all required attributes and no attributes not allowed for -- 2.1.0 -- To unsubscribe from this list: send the line "unsubscribe linux-rdma" in the body of a message to majord...@vger.kernel.org More majordomo info at http://vger.kernel.org/majordomo-info.html
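The word-at-a-time walk above is purely an optimization; functionally it is equivalent to checking that every byte past the part of the request the kernel understands equals the cleared character. A byte-wise reference version of the same check (userspace sketch, so plain memory access instead of copy_from_user):

```c
#include <stdbool.h>
#include <stddef.h>

/* Return true if buf[offset .. offset+len) is entirely cleared_char. */
static bool is_buf_cleared(const unsigned char *buf,
			   unsigned char cleared_char,
			   size_t offset, size_t len)
{
	size_t i;

	for (i = 0; i < len; i++)
		if (buf[offset + i] != cleared_char)
			return false;
	return true;
}
```

This is the property that lets an old kernel safely reject a newer userspace: if the unknown tail of the command is not all zeros, the user asked for something the kernel cannot honor.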
Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc
On Thu, Dec 3, 2015 at 4:05 PM, Christoph Hellwig wrote:
> Bloating the WC with a field that's not really useful for the ULPs
> seems pretty sad..

The network header type is mandatory in order to find the GID type and
get the GIDs correctly from the header. I realize ULPs might have
preferred to get the GID itself, but resolving the GID costs time, and
most of the time you don't really need that when you poll a CQ. This
could be refactored later to use wc_flags instead of a new field when we
approach the cache-line limit.
[PATCH] staging/rdma/hfi1: fix pio progress routine race with allocator
The allocation code assumes that the shadow ring cannot be overrun
because the credits will limit the allocation. Unfortunately, the
progress mechanism in sc_release_update() updates the free count prior
to processing the shadow ring, allowing the shadow ring to be overrun
by an allocation.

Reviewed-by: Mark Debbage
Signed-off-by: Mike Marciniszyn
---
 drivers/staging/rdma/hfi1/pio.c |    9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/pio.c b/drivers/staging/rdma/hfi1/pio.c
index eab58c1..8e10857 100644
--- a/drivers/staging/rdma/hfi1/pio.c
+++ b/drivers/staging/rdma/hfi1/pio.c
@@ -1565,6 +1565,7 @@ void sc_release_update(struct send_context *sc)
 	u64 hw_free;
 	u32 head, tail;
 	unsigned long old_free;
+	unsigned long free;
 	unsigned long extra;
 	unsigned long flags;
 	int code;
@@ -1579,7 +1580,7 @@ void sc_release_update(struct send_context *sc)
 	extra = (((hw_free & CR_COUNTER_SMASK) >> CR_COUNTER_SHIFT)
 		 - (old_free & CR_COUNTER_MASK)) & CR_COUNTER_MASK;
-	sc->free = old_free + extra;
+	free = old_free + extra;
 	trace_hfi1_piofree(sc, extra);
 
 	/* call sent buffer callbacks */
@@ -1589,7 +1590,7 @@ void sc_release_update(struct send_context *sc)
 	while (head != tail) {
 		pbuf = &sc->sr[tail].pbuf;
-		if (sent_before(sc->free, pbuf->sent_at)) {
+		if (sent_before(free, pbuf->sent_at)) {
 			/* not sent yet */
 			break;
 		}
@@ -1603,8 +1604,10 @@ void sc_release_update(struct send_context *sc)
 		if (tail >= sc->sr_size)
 			tail = 0;
 	}
-	/* update tail, in case we moved it */
 	sc->sr_tail = tail;
+	/* make sure tail is updated before free */
+	smp_wmb();
+	sc->free = free;
 	spin_unlock_irqrestore(&sc->release_lock, flags);
 	sc_piobufavail(sc);
 }
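The fix is a publish-ordering pattern: compute the new free count into a local, retire shadow-ring entries, move the tail, and only then store the free count, with a write barrier in between so an allocator that observes the new free count also observes the new tail. A minimal userspace sketch of that pattern using C11 atomics in place of smp_wmb() (struct and function names are illustrative, not the hfi1 API):

```c
#include <stdatomic.h>

/* Illustrative stand-ins for the send-context fields involved. */
struct sketch_ctx {
	atomic_ulong sr_tail;	/* shadow ring consumer index */
	atomic_ulong free_cnt;	/* credits known to be free */
};

/*
 * Release side: the bug was storing the shared free count before the
 * ring was processed. Here the tail is stored first, then the free
 * count is published with release ordering, which forbids reordering
 * the two stores (the role smp_wmb() plays in the patch).
 */
static void sketch_release_update(struct sketch_ctx *c,
				  unsigned long new_tail,
				  unsigned long new_free)
{
	atomic_store_explicit(&c->sr_tail, new_tail, memory_order_relaxed);
	atomic_store_explicit(&c->free_cnt, new_free, memory_order_release);
}

/*
 * Allocator side: acquire-load the free count first; if it pairs with
 * the release store above, the matching tail is guaranteed visible, so
 * the allocator cannot overrun entries the release path has not yet
 * retired.
 */
static unsigned long sketch_observe(struct sketch_ctx *c,
				    unsigned long *tail_out)
{
	unsigned long f = atomic_load_explicit(&c->free_cnt,
					       memory_order_acquire);
	*tail_out = atomic_load_explicit(&c->sr_tail, memory_order_relaxed);
	return f;
}
```

The kernel patch additionally holds release_lock, so only the cross-CPU ordering against the lock-free allocator needs the barrier.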
Re: [PATCH 5/6] IB core: Fix ib_sg_to_pages()
On 12/03/2015 01:18 AM, Christoph Hellwig wrote:
> The patch looks good to me, but while we touch this area, how about
> throwing in a few cosmetic fixes as well?

How about the patch below? In that version of the ib_sg_to_pages() fix
these concerns have been addressed, and additionally two more bugs have
been fixed.

[PATCH] IB core: Fix ib_sg_to_pages()

Fix the code for detecting gaps. A gap occurs not only if the second or
later scatterlist element is not aligned but also if any scatterlist
element other than the last does not end at a page boundary. In the code
for coalescing contiguous elements, ensure that mr->length is correct and
that last_page_addr is up-to-date. Ensure that this function returns a
negative error code instead of zero if the first set_page() call fails.

Fixes: commit 4c67e2bfc8b7 ("IB/core: Introduce new fast registration API")
Reported-by: Christoph Hellwig
---
 drivers/infiniband/core/verbs.c | 43 +
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 043a60e..545906d 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1516,7 +1516,7 @@ EXPORT_SYMBOL(ib_map_mr_sg);
  * @sg_nents: number of entries in sg
  * @set_page: driver page assignment function pointer
  *
- * Core service helper for drivers to covert the largest
+ * Core service helper for drivers to convert the largest
  * prefix of given sg list to a page vector. The sg list
  * prefix converted is the prefix that meet the requirements
  * of ib_map_mr_sg.
@@ -1533,7 +1533,7 @@ int ib_sg_to_pages(struct ib_mr *mr,
 	u64 last_end_dma_addr = 0, last_page_addr = 0;
 	unsigned int last_page_off = 0;
 	u64 page_mask = ~((u64)mr->page_size - 1);
-	int i;
+	int i, ret;
 
 	mr->iova = sg_dma_address(&sgl[0]);
 	mr->length = 0;
@@ -1544,27 +1544,29 @@ int ib_sg_to_pages(struct ib_mr *mr,
 		u64 end_dma_addr = dma_addr + dma_len;
 		u64 page_addr = dma_addr & page_mask;
 
-		if (i && page_addr != dma_addr) {
-			if (last_end_dma_addr != dma_addr) {
-				/* gap */
-				goto done;
-
-			} else if (last_page_off + dma_len <= mr->page_size) {
-				/* chunk this fragment with the last */
-				mr->length += dma_len;
-				last_end_dma_addr += dma_len;
-				last_page_off += dma_len;
-				continue;
-			} else {
-				/* map starting from the next page */
-				page_addr = last_page_addr + mr->page_size;
-				dma_len -= mr->page_size - last_page_off;
-			}
+		/*
+		 * For the second and later elements, check whether either the
+		 * end of element i-1 or the start of element i is not aligned
+		 * on a page boundary.
+		 */
+		if (i && (last_page_off != 0 || page_addr != dma_addr)) {
+			/* Stop mapping if there is a gap. */
+			if (last_end_dma_addr != dma_addr)
+				break;
+
+			/*
+			 * Coalesce this element with the last. If it is small
+			 * enough just update mr->length. Otherwise start
+			 * mapping from the next page.
+			 */
+			goto next_page;
 		}
 
 		do {
-			if (unlikely(set_page(mr, page_addr)))
-				goto done;
+			ret = set_page(mr, page_addr);
+			if (unlikely(ret < 0))
+				return i ? : ret;
+next_page:
 			page_addr += mr->page_size;
 		} while (page_addr < end_dma_addr);
 
@@ -1574,7 +1576,6 @@ int ib_sg_to_pages(struct ib_mr *mr,
 		last_page_off = end_dma_addr & ~page_mask;
 	}
 
-done:
 	return i;
 }
 EXPORT_SYMBOL(ib_sg_to_pages);
--
2.1.4
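The inner do/while loop in the patch walks one DMA region a page at a time, handing each page-aligned address to the driver's set_page() callback. A standalone sketch of just that page-splitting step (the function name and the array-full error are hypothetical stand-ins; mr->page_size must be a power of two, as ib_map_mr_sg requires):

```c
#include <stdint.h>

/*
 * Split [dma_addr, dma_addr + dma_len) into page-aligned addresses the
 * way ib_sg_to_pages()'s inner loop feeds set_page(). Returns the number
 * of pages emitted, or -1 when the vector is full (the analogue of a
 * set_page() failure). Hypothetical helper, not a kernel API.
 */
static int region_to_pages(uint64_t dma_addr, uint64_t dma_len,
			   uint64_t page_size, uint64_t *pages, int max)
{
	uint64_t page_mask = ~(page_size - 1);
	uint64_t page_addr = dma_addr & page_mask; /* round down to page */
	uint64_t end_dma_addr = dma_addr + dma_len;
	int n = 0;

	do {
		if (n == max)
			return -1;
		pages[n++] = page_addr;
		page_addr += page_size;
	} while (page_addr < end_dma_addr);

	return n;
}
```

For example, a region of 0x1000 bytes starting at 0x1800 with 4 KiB pages straddles two pages, so the page addresses 0x1000 and 0x2000 are produced; this is also why a gap check must look at both the start alignment of element i and the end offset of element i-1, as the rewritten condition in the patch does.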