Re: stalled again

2015-12-03 Thread Christoph Hellwig
Hi Doug,

not having any maintainer available for an extended time is a problem,
and we actually had long discussions about that at kernel summit, with
a clear hint with a cluebat from Linus that he'd prefer maintainer
teams.  So I'd really love to know who was so dead set against them.

I personally don't really care if there is a real team (active/active)
or just a standby (active/passive), but I think we really need a
coherent tree of pending patches to avoid last minute rebases due to
conflicts, as well as some feedback for submitters.

I'd be happy to volunteer to collect all patches that were properly
reviewed into a queue for you to consider, to at least sort out these
mechanics, although I'd be even more happy if someone with longer
experience with the subsystem would volunteer instead.


Re: [PATCH 5/6] IB core: Fix ib_sg_to_pages()

2015-12-03 Thread Sagi Grimberg

> Hello Sagi,
>
> Hmm ... why would it be unacceptable to return 0 if sg_nents == 0 ?
>
> Regarding which component to modify if mapping the first page fails:
> for almost every kernel function I know a negative return value means
> failure and a return value >= 0 means success. Hence my proposal to
> change the return value of the ib_map_mr_sg() function if mapping the
> first page fails.


I'm fine with that.

> How about the patch below ?

Looks fine.
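
For illustration, the convention being settled on here (negative errno if
mapping the first page fails, otherwise the number of mapped SG entries,
which may be smaller than sg_nents) would look like this on the caller
side; a rough sketch, the function name is made up:

	static int map_whole_sg(struct ib_mr *mr, struct scatterlist *sg,
				int sg_nents)
	{
		int n = ib_map_mr_sg(mr, sg, sg_nents, PAGE_SIZE);

		if (n < 0)
			return n;	/* mapping the first page failed */
		if (n < sg_nents) {
			/* only a prefix was mapped, e.g. because of a
			 * gap; the remainder needs another MR */
		}
		return n;
	}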


Re: [PATCH 6/6] IB/srp: Fix srp_map_sg_fr()

2015-12-03 Thread Christoph Hellwig
On Thu, Dec 03, 2015 at 10:46:10AM +0200, Sagi Grimberg wrote:
> >>   If entries 2 and 3 could be merged dma_len for 2 would span 2 and 3,
> >>   and then entry 3 would actually have the dma addr and len for entry 4.
> 
> So what would be in the last entry {dma_addr, dma_len}? zeros?
> 
> >>I'm not sure anyone still does that, but the first spot to check would
> >>be the Parisc IOMMU drivers.
> 
> So how does that sit with the fact that dma_unmap requires the
> same sg_nents as in dma_map and not the actual value of dma entries?

Take a look at drivers/parisc/iommu-helpers.h:iommu_coalesce_chunks()
and drivers/parisc/sba_iommu.c:sba_unmap_sg() for example.

The first fills out the sglist, and zeroes all unused entries past what
it fills in.  The unmap side then simply exits the loop if the entries
are zeroed.
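
The unmap loop there is roughly (a simplified sketch, not the literal
source):

	/* stop at the first zeroed entry instead of walking all of nents */
	while (nents && sg_dma_len(sglist)) {
		sba_unmap_single(dev, sg_dma_address(sglist),
				 sg_dma_len(sglist), direction);
		sglist = sg_next(sglist);
		nents--;
	}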


Re: [PATCH 5/6] IB core: Fix ib_sg_to_pages()

2015-12-03 Thread Christoph Hellwig
> How about the patch below ?

The patch looks good to me, but while we touch this area, how about
throwing in a few cosmetic fixes as well?

> - if (i && page_addr != dma_addr) {
> + if (i && (page_addr != dma_addr || last_page_off != 0)) {
>   if (last_end_dma_addr != dma_addr) {

How about we add one or two sentences for each of the conditions here?

>   /* gap */
> - goto done;
> -
> + break;
>   } else if (last_page_off + dma_len <= mr->page_size) {
>   /* chunk this fragment with the last */
>   mr->length += dma_len;

It would be great to avoid the else clauses if we already do a
break/continue/goto to make the code flow more clear, e.g.


/*
 * Gap to the previous segment, we'll need to return
 * and use another FR to map the remainder.
 */
if (last_end_dma_addr != dma_addr)
break;

/*
 * See if this segment is contiguous to the
 * previous one and just merge it in that case.
 */
if (last_page_off + dma_len <= mr->page_size) {
last_end_dma_addr += dma_len;
last_page_off += dma_len;
mr->length += dma_len;
continue;
}

/*
 * New page-aligned segment to map:
 */
page_addr = last_page_addr + mr->page_size;
dma_len -= mr->page_size - last_page_off;




Re: [PATCH 6/6] IB/srp: Fix srp_map_sg_fr()

2015-12-03 Thread Sagi Grimberg

Replying to my own email,

> dma_map_sg returns the actual number of entries to iterate.  At least
> historically some IOMMU implementations would do strange tricks like:
>
>    If entries 2 and 3 could be merged dma_len for 2 would span 2 and 3,
>    and then entry 3 would actually have the dma addr and len for entry 4.

So what would be in the last entry {dma_addr, dma_len}? zeros?

> I'm not sure anyone still does that, but the first spot to check would
> be the Parisc IOMMU drivers.

So how does that sit with the fact that dma_unmap requires the
same sg_nents as in dma_map and not the actual value of dma entries?
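
For reference, DMA-API.txt documents the pairing as: unmap with the
original nents, and program the hardware with the returned count only.
A sketch:

	count = dma_map_sg(dev, sg, nents, DMA_TO_DEVICE);
	if (!count)
		return -ENOMEM;
	/* ... hand only the first 'count' entries to the HW ... */
	dma_unmap_sg(dev, sg, nents, DMA_TO_DEVICE);	/* original nents */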


[PATCH for-next V2 02/11] IB/cm: Use the source GID index type

2015-12-03 Thread Matan Barak
Previously, the cm and cma modules supported only the IB and RoCE v1 GID type.
In order to support multiple GID types, the gid_type is passed to
cm_init_av_by_path and stored in the path record.

The rdma cm client would use a default GID type that will be saved in
rdma_id_private.

Signed-off-by: Matan Barak 
---
 drivers/infiniband/core/cm.c  | 25 -
 drivers/infiniband/core/cma.c |  2 ++
 2 files changed, 22 insertions(+), 5 deletions(-)

diff --git a/drivers/infiniband/core/cm.c b/drivers/infiniband/core/cm.c
index af8b907..5ea78ab 100644
--- a/drivers/infiniband/core/cm.c
+++ b/drivers/infiniband/core/cm.c
@@ -364,7 +364,7 @@ static int cm_init_av_by_path(struct ib_sa_path_rec *path, struct cm_av *av)
	read_lock_irqsave(&cm.lock, flags);
	list_for_each_entry(cm_dev, &cm.device_list, list) {
		if (!ib_find_cached_gid(cm_dev->ib_device, &path->sgid,
-					IB_GID_TYPE_IB, ndev, &p, NULL)) {
+					path->gid_type, ndev, &p, NULL)) {
port = cm_dev->port[p-1];
break;
}
@@ -1600,6 +1600,8 @@ static int cm_req_handler(struct cm_work *work)
struct ib_cm_id *cm_id;
struct cm_id_private *cm_id_priv, *listen_cm_id_priv;
struct cm_req_msg *req_msg;
+   union ib_gid gid;
+   struct ib_gid_attr gid_attr;
int ret;
 
req_msg = (struct cm_req_msg *)work->mad_recv_wc->recv_buf.mad;
@@ -1639,11 +1641,24 @@ static int cm_req_handler(struct cm_work *work)
	cm_format_paths_from_req(req_msg, &work->path[0], &work->path[1]);
 
memcpy(work->path[0].dmac, cm_id_priv->av.ah_attr.dmac, ETH_ALEN);
-	ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
+   ret = ib_get_cached_gid(work->port->cm_dev->ib_device,
+   work->port->port_num,
+   cm_id_priv->av.ah_attr.grh.sgid_index,
+				&gid, &gid_attr);
+   if (!ret) {
+   if (gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   work->path[0].gid_type = gid_attr.gid_type;
+		ret = cm_init_av_by_path(&work->path[0], &cm_id_priv->av);
+   }
if (ret) {
-   ib_get_cached_gid(work->port->cm_dev->ib_device,
-			  work->port->port_num, 0, &work->path[0].sgid,
- NULL);
+   int err = ib_get_cached_gid(work->port->cm_dev->ib_device,
+   work->port->port_num, 0,
+					    &work->path[0].sgid,
+					    &gid_attr);
+   if (!err && gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   work->path[0].gid_type = gid_attr.gid_type;
ib_send_cm_rej(cm_id, IB_CM_REJ_INVALID_GID,
		       &work->path[0].sgid, sizeof work->path[0].sgid,
   NULL, 0);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index c19f822..2914e08 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -228,6 +228,7 @@ struct rdma_id_private {
u8  tos;
u8  reuseaddr;
u8  afonly;
+   enum ib_gid_typegid_type;
 };
 
 struct cma_multicast {
@@ -2325,6 +2326,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
		ndev = dev_get_by_index(&init_net, addr->dev_addr.bound_dev_if);
		route->path_rec->net = &init_net;
route->path_rec->ifindex = addr->dev_addr.bound_dev_if;
+   route->path_rec->gid_type = id_priv->gid_type;
}
if (!ndev) {
ret = -ENODEV;
-- 
2.1.0



[PATCH for-next V2 03/11] IB/core: Add gid attributes to sysfs

2015-12-03 Thread Matan Barak
This patch set adds attributes of net device and gid type to each GID
in the GID table. Users that use verbs directly need to specify
the GID index. Since the same GID could have different types or
associated net devices, users should have the ability to query the
associated GID attributes. Adding these attributes to sysfs.
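
For example, user space could then query the type of GID 0 with a plain
sysfs read; an illustrative sketch (the device name mlx5_0 and port 1
are placeholders):

	#include <fcntl.h>
	#include <stdio.h>
	#include <unistd.h>

	int main(void)
	{
		char buf[64];
		int n, fd;

		fd = open("/sys/class/infiniband/mlx5_0/ports/1/gid_attrs/types/0",
			  O_RDONLY);
		if (fd < 0)
			return 1;
		n = read(fd, buf, sizeof(buf) - 1);
		if (n > 0) {
			buf[n] = '\0';
			/* prints "IB/RoCE v1" or "RoCE v2" */
			printf("GID 0 type: %s", buf);
		}
		close(fd);
		return 0;
	}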

Signed-off-by: Matan Barak 
---
 Documentation/ABI/testing/sysfs-class-infiniband |  16 ++
 drivers/infiniband/core/sysfs.c  | 184 ++-
 2 files changed, 198 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband

diff --git a/Documentation/ABI/testing/sysfs-class-infiniband 
b/Documentation/ABI/testing/sysfs-class-infiniband
new file mode 100644
index 000..a86abe6
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-infiniband
@@ -0,0 +1,16 @@
+What:		/sys/class/infiniband/<device>/ports/<port-num>/gid_attrs/ndevs/<gid-index>
+Date:		November 29, 2015
+KernelVersion:	4.4.0
+Contact:	linux-rdma@vger.kernel.org
+Description:	The net-device's name associated with the GID resides
+		at index <gid-index>.
+
+What:		/sys/class/infiniband/<device>/ports/<port-num>/gid_attrs/types/<gid-index>
+Date:		November 29, 2015
+KernelVersion:	4.4.0
+Contact:	linux-rdma@vger.kernel.org
+Description:	The RoCE type of the associated GID resides at index
+		<gid-index>. This could either be "IB/RoCE v1" for IB and
+		RoCE v1 based GIDs or "RoCE v2" for RoCE v2 based GIDs.
+
+
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index b1f37d4..4d5d87a 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -37,12 +37,22 @@
 #include 
 #include 
 #include 
+#include <linux/netdevice.h>
 
 #include 
 
+struct ib_port;
+
+struct gid_attr_group {
+   struct ib_port  *port;
+   struct kobject  kobj;
+   struct attribute_group  ndev;
+   struct attribute_group  type;
+};
 struct ib_port {
struct kobject kobj;
struct ib_device  *ibdev;
+   struct gid_attr_group *gid_attr_group;
struct attribute_group gid_group;
struct attribute_group pkey_group;
u8 port_num;
@@ -84,6 +94,24 @@ static const struct sysfs_ops port_sysfs_ops = {
.show = port_attr_show
 };
 
+static ssize_t gid_attr_show(struct kobject *kobj,
+struct attribute *attr, char *buf)
+{
+   struct port_attribute *port_attr =
+   container_of(attr, struct port_attribute, attr);
+   struct ib_port *p = container_of(kobj, struct gid_attr_group,
+kobj)->port;
+
+   if (!port_attr->show)
+   return -EIO;
+
+   return port_attr->show(p, port_attr, buf);
+}
+
+static const struct sysfs_ops gid_attr_sysfs_ops = {
+   .show = gid_attr_show
+};
+
 static ssize_t state_show(struct ib_port *p, struct port_attribute *unused,
  char *buf)
 {
@@ -281,6 +309,46 @@ static struct attribute *port_default_attrs[] = {
NULL
 };
 
+static size_t print_ndev(struct ib_gid_attr *gid_attr, char *buf)
+{
+   if (!gid_attr->ndev)
+   return -EINVAL;
+
+   return sprintf(buf, "%s\n", gid_attr->ndev->name);
+}
+
+static size_t print_gid_type(struct ib_gid_attr *gid_attr, char *buf)
+{
+   return sprintf(buf, "%s\n", ib_cache_gid_type_str(gid_attr->gid_type));
+}
+
+static ssize_t _show_port_gid_attr(struct ib_port *p,
+  struct port_attribute *attr,
+  char *buf,
+  size_t (*print)(struct ib_gid_attr *gid_attr,
+  char *buf))
+{
+   struct port_table_attribute *tab_attr =
+   container_of(attr, struct port_table_attribute, attr);
+   union ib_gid gid;
+   struct ib_gid_attr gid_attr = {};
+   ssize_t ret;
+   va_list args;
+
+	ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, &gid,
+			   &gid_attr);
+   if (ret)
+   goto err;
+
+	ret = print(&gid_attr, buf);
+
+err:
+   if (gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   va_end(args);
+   return ret;
+}
+
 static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr,
 char *buf)
 {
@@ -296,6 +364,19 @@ static ssize_t show_port_gid(struct ib_port *p, struct 
port_attribute *attr,
return sprintf(buf, "%pI6\n", gid.raw);
 }
 
+static ssize_t show_port_gid_attr_ndev(struct ib_port *p,
+  struct port_attribute *attr, char *buf)
+{
+   return _show_port_gid_attr(p, attr, buf, print_ndev);
+}
+
+static ssize_t show_port_gid_attr_gid_type(struct ib_port *p,
+  struct port_attribute *attr,
+  char *buf)
+{

[PATCH for-next V2 01/11] IB/core: Add gid_type to gid attribute

2015-12-03 Thread Matan Barak
In order to support multiple GID types, we need to store the gid_type
with each GID. This is also aligned with the RoCE v2 annex "RoCEv2 PORT
GID table entries shall have a "GID type" attribute that denotes the L3
Address type". The currently supported GID is IB_GID_TYPE_IB which is
also RoCE v1 GID type.

This implies that gid_type should be added to roce_gid_table meta-data.
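
Conceptually, the per-entry meta-data grows from just a net-device to a
pair; a sketch (field order is illustrative, the initializers in the
diff below show both members):

	struct ib_gid_attr {
		enum ib_gid_type	gid_type;	/* e.g. IB_GID_TYPE_IB */
		struct net_device	*ndev;
	};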

Signed-off-by: Matan Barak 
---
 drivers/infiniband/core/cache.c   | 144 --
 drivers/infiniband/core/cm.c  |   2 +-
 drivers/infiniband/core/cma.c |   3 +-
 drivers/infiniband/core/core_priv.h   |   4 +
 drivers/infiniband/core/device.c  |   9 +-
 drivers/infiniband/core/multicast.c   |   2 +-
 drivers/infiniband/core/roce_gid_mgmt.c   |  60 +++--
 drivers/infiniband/core/sa_query.c|   5 +-
 drivers/infiniband/core/uverbs_marshall.c |   1 +
 drivers/infiniband/core/verbs.c   |   1 +
 include/rdma/ib_cache.h   |   4 +
 include/rdma/ib_sa.h  |   1 +
 include/rdma/ib_verbs.h   |  11 ++-
 13 files changed, 185 insertions(+), 62 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 097e9df..566fd8f 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -64,6 +64,7 @@ enum gid_attr_find_mask {
GID_ATTR_FIND_MASK_GID  = 1UL << 0,
GID_ATTR_FIND_MASK_NETDEV   = 1UL << 1,
GID_ATTR_FIND_MASK_DEFAULT  = 1UL << 2,
+   GID_ATTR_FIND_MASK_GID_TYPE = 1UL << 3,
 };
 
 enum gid_table_entry_props {
@@ -125,6 +126,19 @@ static void dispatch_gid_change_event(struct ib_device 
*ib_dev, u8 port)
}
 }
 
+static const char * const gid_type_str[] = {
+   [IB_GID_TYPE_IB]= "IB/RoCE v1",
+};
+
+const char *ib_cache_gid_type_str(enum ib_gid_type gid_type)
+{
+   if (gid_type < ARRAY_SIZE(gid_type_str) && gid_type_str[gid_type])
+   return gid_type_str[gid_type];
+
+   return "Invalid GID type";
+}
+EXPORT_SYMBOL(ib_cache_gid_type_str);
+
 /* This function expects that rwlock will be write locked in all
  * scenarios and that lock will be locked in sleep-able (RoCE)
  * scenarios.
@@ -233,6 +247,10 @@ static int find_gid(struct ib_gid_table *table, const 
union ib_gid *gid,
if (found >=0)
continue;
 
+   if (mask & GID_ATTR_FIND_MASK_GID_TYPE &&
+   attr->gid_type != val->gid_type)
+   continue;
+
if (mask & GID_ATTR_FIND_MASK_GID &&
memcmp(gid, >gid, sizeof(*gid)))
continue;
@@ -296,6 +314,7 @@ int ib_cache_gid_add(struct ib_device *ib_dev, u8 port,
	write_lock_irq(&table->rwlock);
 
ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID |
+ GID_ATTR_FIND_MASK_GID_TYPE |
		      GID_ATTR_FIND_MASK_NETDEV, &empty);
if (ix >= 0)
goto out_unlock;
@@ -329,6 +348,7 @@ int ib_cache_gid_del(struct ib_device *ib_dev, u8 port,
 
ix = find_gid(table, gid, attr, false,
  GID_ATTR_FIND_MASK_GID  |
+ GID_ATTR_FIND_MASK_GID_TYPE |
  GID_ATTR_FIND_MASK_NETDEV   |
  GID_ATTR_FIND_MASK_DEFAULT,
  NULL);
@@ -427,11 +447,13 @@ static int _ib_cache_gid_table_find(struct ib_device 
*ib_dev,
 
 static int ib_cache_gid_find(struct ib_device *ib_dev,
 const union ib_gid *gid,
+enum ib_gid_type gid_type,
 struct net_device *ndev, u8 *port,
 u16 *index)
 {
-   unsigned long mask = GID_ATTR_FIND_MASK_GID;
-   struct ib_gid_attr gid_attr_val = {.ndev = ndev};
+   unsigned long mask = GID_ATTR_FIND_MASK_GID |
+GID_ATTR_FIND_MASK_GID_TYPE;
+   struct ib_gid_attr gid_attr_val = {.ndev = ndev, .gid_type = gid_type};
 
if (ndev)
mask |= GID_ATTR_FIND_MASK_NETDEV;
@@ -442,14 +464,16 @@ static int ib_cache_gid_find(struct ib_device *ib_dev,
 
 int ib_find_cached_gid_by_port(struct ib_device *ib_dev,
   const union ib_gid *gid,
+  enum ib_gid_type gid_type,
   u8 port, struct net_device *ndev,
   u16 *index)
 {
int local_index;
struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
struct ib_gid_table *table;
-   unsigned long mask = GID_ATTR_FIND_MASK_GID;
-   struct ib_gid_attr val = {.ndev = ndev};
+   unsigned long mask = GID_ATTR_FIND_MASK_GID |
+GID_ATTR_FIND_MASK_GID_TYPE;
+   struct ib_gid_attr val = {.ndev = ndev, .gid_type = gid_type};
unsigned long flags;
 

[PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-03 Thread Matan Barak
Hi Doug,

This series adds support for RoCE v2. In order to support RoCE v2,
we add gid_type attribute to every GID. When the RoCE GID management
populates the GID table, it duplicates each GID with all supported types.
This gives the user the ability to communicate over each supported
type.
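
For illustration, on a port that supports both types the same GID then
appears twice in the table, once per type (made-up values):

	index 0: fe80::0202:c9ff:fe4a:b2f1   type "IB/RoCE v1"
	index 1: fe80::0202:c9ff:fe4a:b2f1   type "RoCE v2"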

Patches 0001, 0002 and 0003 add support for multiple GID types to the
cache and related APIs. The third patch exposes the GID attributes
information in sysfs.

Patch 0004 adds the RoCE v2 GID type and the capabilities required
from the vendor in order to implement RoCE v2. These capabilities
are grouped together as RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP.

RoCE v2 can work over IPv4 and IPv6 networks. When receiving ib_wc, this
information should come from the vendor's driver. In case the vendor
doesn't supply this information, we parse the packet headers and resolve
its network type. Patch 0005 adds this information and required utilities.

Patches 0006 and 0007 add route validation. This is mandatory to ensure
that we send packets using GIDs which correspond to a net-device that
can be routed to the destination.

Patches 0008 and 0009 add configfs support (and the required
infrastructure) for CMA. The administrator should be able to set the
default RoCE type. This is done through a new per-port
default_roce_mode configfs file.

Patch 0010 formats a QP1 packet in order to support RoCE v2 CM
packets. This is required for vendors which implement their
QP1 as a Raw QP.

Patch 0011 adds support for IPv4 multicast as an IPv4 network
requires IGMP to be sent in order to join multicast groups.

Vendor code isn't part of this patch-set. Soft-RoCE will be
sent soon and depends on these patches. Other vendors, like
mlx4, ocrdma and mlx5, will follow.

This patch-set applies on top of "Change per-entry locks in GID cache to
table lock", which was sent to the mailing list.

Thanks,
Matan

Changed from V1:
 - Rebased against Linux 4.4-rc2 master branch.
 - Add route validation
 - ConfigFS - avoid compiling INFINIBAND=y and CONFIGFS_FS=m
 - Add documentation for configfs and sysfs ABI
 - Remove ifindex and gid_type from mcmember

Changes from V0:
 - Rebased patches against Doug's latest k.o/for-4.4 tree.
 - Fixed a bug in configfs (rmdir caused an incorrect free).

Matan Barak (8):
  IB/core: Add gid_type to gid attribute
  IB/cm: Use the source GID index type
  IB/core: Add gid attributes to sysfs
  IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type
  IB/core: Move rdma_is_upper_dev_rcu to header file
  IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path
  IB/rdma_cm: Add wrapper for cma reference count
  IB/cma: Add configfs for rdma_cm

Moni Shoua (2):
  IB/core: Initialize UD header structure with IP and UDP headers
  IB/cma: Join and leave multicast groups with IGMP

Somnath Kotur (1):
  IB/core: Add rdma_network_type to wc

 Documentation/ABI/testing/configfs-rdma_cm   |  22 ++
 Documentation/ABI/testing/sysfs-class-infiniband |  16 ++
 drivers/infiniband/Kconfig   |   9 +
 drivers/infiniband/core/Makefile |   2 +
 drivers/infiniband/core/addr.c   | 185 +
 drivers/infiniband/core/cache.c  | 169 
 drivers/infiniband/core/cm.c |  31 ++-
 drivers/infiniband/core/cma.c| 261 --
 drivers/infiniband/core/cma_configfs.c   | 321 +++
 drivers/infiniband/core/core_priv.h  |  45 
 drivers/infiniband/core/device.c |  10 +-
 drivers/infiniband/core/multicast.c  |  17 +-
 drivers/infiniband/core/roce_gid_mgmt.c  |  81 --
 drivers/infiniband/core/sa_query.c   |  76 +-
 drivers/infiniband/core/sysfs.c  | 184 -
 drivers/infiniband/core/ud_header.c  | 155 ++-
 drivers/infiniband/core/uverbs_marshall.c|   1 +
 drivers/infiniband/core/verbs.c  | 170 ++--
 drivers/infiniband/hw/mlx4/qp.c  |   7 +-
 drivers/infiniband/hw/mthca/mthca_qp.c   |   2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |   2 +-
 include/rdma/ib_addr.h   |  11 +-
 include/rdma/ib_cache.h  |   4 +
 include/rdma/ib_pack.h   |  45 +++-
 include/rdma/ib_sa.h |   3 +
 include/rdma/ib_verbs.h  |  78 +-
 26 files changed, 1704 insertions(+), 203 deletions(-)
 create mode 100644 Documentation/ABI/testing/configfs-rdma_cm
 create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband
 create mode 100644 drivers/infiniband/core/cma_configfs.c

-- 
2.1.0



[PATCH for-next V2 09/11] IB/cma: Add configfs for rdma_cm

2015-12-03 Thread Matan Barak
Users would like to control the behaviour of rdma_cm.
For example, old applications which don't set the
required RoCE GID type could then be executed on RoCE v2
networks. In order to support this configuration,
we implement a configfs for rdma_cm.

In order to use the configfs, one needs to mount it and
mkdir <IB device name> inside the rdma_cm directory.

The patch adds support for a single configuration file,
default_roce_mode. The mode can either be "IB/RoCE v1" or
"RoCE v2".

Signed-off-by: Matan Barak 
---
 Documentation/ABI/testing/configfs-rdma_cm |  22 ++
 drivers/infiniband/Kconfig |   9 +
 drivers/infiniband/core/Makefile   |   2 +
 drivers/infiniband/core/cache.c|  24 +++
 drivers/infiniband/core/cma.c  | 108 +-
 drivers/infiniband/core/cma_configfs.c | 321 +
 drivers/infiniband/core/core_priv.h|  24 +++
 7 files changed, 503 insertions(+), 7 deletions(-)
 create mode 100644 Documentation/ABI/testing/configfs-rdma_cm
 create mode 100644 drivers/infiniband/core/cma_configfs.c

diff --git a/Documentation/ABI/testing/configfs-rdma_cm 
b/Documentation/ABI/testing/configfs-rdma_cm
new file mode 100644
index 000..5c389aa
--- /dev/null
+++ b/Documentation/ABI/testing/configfs-rdma_cm
@@ -0,0 +1,22 @@
+What:  /config/rdma_cm
+Date:  November 29, 2015
+KernelVersion:  4.4.0
+Description:   Interface is used to configure RDMA-capable HCAs with respect to
+   RDMA-CM attributes.
+
+   Attributes are visible only when configfs is mounted. To mount
+   configfs in /config directory use:
+   # mount -t configfs none /config/
+
+   In order to set parameters related to a specific HCA, a
+   directory for this HCA has to be created:
+   mkdir -p /config/rdma_cm/<hca name>/
+
+
+What:  /config/rdma_cm/<hca name>/ports/<port num>/default_roce_mode
+Date:  November 29, 2015
+KernelVersion:  4.4.0
+Description:   RDMA-CM based connections from HCA <hca name> at port
+   <port num> will be initiated with this RoCE type as default.
+   The possible RoCE types are either "IB/RoCE v1" or "RoCE v2".
+   This parameter has RW access.
diff --git a/drivers/infiniband/Kconfig b/drivers/infiniband/Kconfig
index aa26f3c..f5312da 100644
--- a/drivers/infiniband/Kconfig
+++ b/drivers/infiniband/Kconfig
@@ -54,6 +54,15 @@ config INFINIBAND_ADDR_TRANS
depends on INFINIBAND
default y
 
+config INFINIBAND_ADDR_TRANS_CONFIGFS
+   bool
+   depends on INFINIBAND_ADDR_TRANS && !(INFINIBAND=y && CONFIGFS_FS=m)
+   default y
+   ---help---
+ ConfigFS support for RDMA communication manager (CM).
+ This allows the user to config the default GID type that the CM
	  uses for each device, when initiating new connections.
+
 source "drivers/infiniband/hw/mthca/Kconfig"
 source "drivers/infiniband/hw/qib/Kconfig"
 source "drivers/infiniband/hw/cxgb3/Kconfig"
diff --git a/drivers/infiniband/core/Makefile b/drivers/infiniband/core/Makefile
index d43a899..7922fa7 100644
--- a/drivers/infiniband/core/Makefile
+++ b/drivers/infiniband/core/Makefile
@@ -24,6 +24,8 @@ iw_cm-y :=iwcm.o iwpm_util.o iwpm_msg.o
 
 rdma_cm-y :=   cma.o
 
+rdma_cm-$(CONFIG_INFINIBAND_ADDR_TRANS_CONFIGFS) += cma_configfs.o
+
 rdma_ucm-y :=  ucma.o
 
 ib_addr-y :=   addr.o
diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 88b4b6f..4aada52 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -140,6 +140,30 @@ const char *ib_cache_gid_type_str(enum ib_gid_type 
gid_type)
 }
 EXPORT_SYMBOL(ib_cache_gid_type_str);
 
+int ib_cache_gid_parse_type_str(const char *buf)
+{
+   unsigned int i;
+   size_t len;
+   int err = -EINVAL;
+
+   len = strlen(buf);
+   if (len == 0)
+   return -EINVAL;
+
+   if (buf[len - 1] == '\n')
+   len--;
+
+   for (i = 0; i < ARRAY_SIZE(gid_type_str); ++i)
+   if (gid_type_str[i] && !strncmp(buf, gid_type_str[i], len) &&
+   len == strlen(gid_type_str[i])) {
+   err = i;
+   break;
+   }
+
+   return err;
+}
+EXPORT_SYMBOL(ib_cache_gid_parse_type_str);
+
 /* This function expects that rwlock will be write locked in all
  * scenarios and that lock will be locked in sleep-able (RoCE)
  * scenarios.
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index f78088a..8fab267 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -152,6 +152,7 @@ struct cma_device {
struct completion   comp;
atomic_trefcount;
struct list_headid_list;
+   enum ib_gid_type*default_gid_type;
 };
 
 struct rdma_bind_list {
@@ -192,6 +193,62 @@ void 

[PATCH for-next V2 10/11] IB/core: Initialize UD header structure with IP and UDP headers

2015-12-03 Thread Matan Barak
From: Moni Shoua 

ib_ud_header_init() is used to format InfiniBand headers
in a buffer up to (but not including) BTH. For RoCE UDP ENCAP it is
required that this function also be able to build IP and UDP
headers.
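
With the new flags, setting up a RoCE v2 over IPv4 UD packet would look
roughly like this (a sketch based on the argument list this patch
introduces):

	ret = ib_ud_header_init(payload_bytes,
				0 /* lrh */, 1 /* eth */, vlan_present,
				0 /* grh */, 4 /* ip_version */,
				1 /* udp */, 0 /* immediate */, &header);
	if (ret)
		return ret;
	/* ... fill the ip4/udp fields, then: */
	header.ip4.check = ib_ud_ip4_csum(&header);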

Signed-off-by: Moni Shoua 
Signed-off-by: Matan Barak 
---
 drivers/infiniband/core/ud_header.c| 155 ++---
 drivers/infiniband/hw/mlx4/qp.c|   7 +-
 drivers/infiniband/hw/mthca/mthca_qp.c |   2 +-
 include/rdma/ib_pack.h |  45 --
 4 files changed, 188 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/ud_header.c 
b/drivers/infiniband/core/ud_header.c
index 72feee6..96697e7 100644
--- a/drivers/infiniband/core/ud_header.c
+++ b/drivers/infiniband/core/ud_header.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -116,6 +117,72 @@ static const struct ib_field vlan_table[]  = {
  .size_bits= 16 }
 };
 
+static const struct ib_field ip4_table[]  = {
+   { STRUCT_FIELD(ip4, ver),
+ .offset_words = 0,
+ .offset_bits  = 0,
+ .size_bits= 4 },
+   { STRUCT_FIELD(ip4, hdr_len),
+ .offset_words = 0,
+ .offset_bits  = 4,
+ .size_bits= 4 },
+   { STRUCT_FIELD(ip4, tos),
+ .offset_words = 0,
+ .offset_bits  = 8,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, tot_len),
+ .offset_words = 0,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, id),
+ .offset_words = 1,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, frag_off),
+ .offset_words = 1,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, ttl),
+ .offset_words = 2,
+ .offset_bits  = 0,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, protocol),
+ .offset_words = 2,
+ .offset_bits  = 8,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, check),
+ .offset_words = 2,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, saddr),
+ .offset_words = 3,
+ .offset_bits  = 0,
+ .size_bits= 32 },
+   { STRUCT_FIELD(ip4, daddr),
+ .offset_words = 4,
+ .offset_bits  = 0,
+ .size_bits= 32 }
+};
+
+static const struct ib_field udp_table[]  = {
+   { STRUCT_FIELD(udp, sport),
+ .offset_words = 0,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, dport),
+ .offset_words = 0,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, length),
+ .offset_words = 1,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, csum),
+ .offset_words = 1,
+ .offset_bits  = 16,
+ .size_bits= 16 }
+};
+
 static const struct ib_field grh_table[]  = {
{ STRUCT_FIELD(grh, ip_version),
  .offset_words = 0,
@@ -213,26 +280,57 @@ static const struct ib_field deth_table[] = {
  .size_bits= 24 }
 };
 
+__be16 ib_ud_ip4_csum(struct ib_ud_header *header)
+{
+   struct iphdr iph;
+
+   iph.ihl = 5;
+   iph.version = 4;
+   iph.tos = header->ip4.tos;
+   iph.tot_len = header->ip4.tot_len;
+   iph.id  = header->ip4.id;
+   iph.frag_off= header->ip4.frag_off;
+   iph.ttl = header->ip4.ttl;
+   iph.protocol= header->ip4.protocol;
+   iph.check   = 0;
+   iph.saddr   = header->ip4.saddr;
+   iph.daddr   = header->ip4.daddr;
+
+	return ip_fast_csum((u8 *)&iph, iph.ihl);
+}
+EXPORT_SYMBOL(ib_ud_ip4_csum);
+
 /**
  * ib_ud_header_init - Initialize UD header structure
  * @payload_bytes:Length of packet payload
  * @lrh_present: specify if LRH is present
  * @eth_present: specify if Eth header is present
  * @vlan_present: packet is tagged vlan
- * @grh_present:GRH flag (if non-zero, GRH will be included)
+ * @grh_present: GRH flag (if non-zero, GRH will be included)
+ * @ip_version: if non-zero, IP header, V4 or V6, will be included
+ * @udp_present :if non-zero, UDP header will be included
  * @immediate_present: specify if immediate data is present
  * @header:Structure to initialize
  */
-void ib_ud_header_init(int payload_bytes,
-  int  lrh_present,
-  int  eth_present,
-  int  vlan_present,
-  int  grh_present,
-  int  immediate_present,
-  struct ib_ud_header *header)
+int ib_ud_header_init(int payload_bytes,
+ intlrh_present,
+ inteth_present,
+ 

[PATCH for-next V2 07/11] IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path

2015-12-03 Thread Matan Barak
In order to make sure API users don't try to use SGIDs which don't
conform to the routing table, validate the route before searching
the RoCE GID table.
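
The idea in simplified sketch form: resolve the route to the destination
and only accept a GID whose net-device is (or is related to, e.g. via
bonding) the egress device; names below are illustrative:

	rt = ip_route_output(net, dst_ip, src_ip, 0, 0);
	if (IS_ERR(rt))
		return PTR_ERR(rt);
	rcu_read_lock();
	ok = (rt->dst.dev == ndev) ||
	     rdma_is_upper_dev_rcu(rt->dst.dev, ndev);
	rcu_read_unlock();
	ip_rt_put(rt);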

Signed-off-by: Matan Barak 
---
 drivers/infiniband/core/addr.c   | 175 ++-
 drivers/infiniband/core/cm.c |  10 +-
 drivers/infiniband/core/cma.c|  30 +-
 drivers/infiniband/core/sa_query.c   |  75 +++--
 drivers/infiniband/core/verbs.c  |  48 ++---
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |   2 +-
 include/rdma/ib_addr.h   |  10 +-
 7 files changed, 270 insertions(+), 80 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 6e35299..57eda11 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -121,7 +121,8 @@ int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct 
net_device *dev,
 }
 EXPORT_SYMBOL(rdma_copy_addr);
 
-int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
+int rdma_translate_ip(const struct sockaddr *addr,
+ struct rdma_dev_addr *dev_addr,
  u16 *vlan_id)
 {
struct net_device *dev;
@@ -139,7 +140,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct 
rdma_dev_addr *dev_addr,
switch (addr->sa_family) {
case AF_INET:
dev = ip_dev_find(dev_addr->net,
-   ((struct sockaddr_in *) addr)->sin_addr.s_addr);
+   ((const struct sockaddr_in *)addr)->sin_addr.s_addr);
 
if (!dev)
return ret;
@@ -154,7 +155,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct 
rdma_dev_addr *dev_addr,
rcu_read_lock();
for_each_netdev_rcu(dev_addr->net, dev) {
if (ipv6_chk_addr(dev_addr->net,
- &((struct sockaddr_in6 *) 
addr)->sin6_addr,
+ &((const struct sockaddr_in6 
*)addr)->sin6_addr,
  dev, 1)) {
ret = rdma_copy_addr(dev_addr, dev, NULL);
if (vlan_id)
@@ -198,7 +199,8 @@ static void queue_req(struct addr_req *req)
	mutex_unlock(&lock);
 }
 
-static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr, 
void *daddr)
+static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr,
+   const void *daddr)
 {
struct neighbour *n;
int ret;
@@ -222,8 +224,9 @@ static int dst_fetch_ha(struct dst_entry *dst, struct 
rdma_dev_addr *dev_addr, v
 }
 
 static int addr4_resolve(struct sockaddr_in *src_in,
-struct sockaddr_in *dst_in,
-struct rdma_dev_addr *addr)
+const struct sockaddr_in *dst_in,
+struct rdma_dev_addr *addr,
+struct rtable **prt)
 {
__be32 src_ip = src_in->sin_addr.s_addr;
__be32 dst_ip = dst_in->sin_addr.s_addr;
@@ -243,36 +246,23 @@ static int addr4_resolve(struct sockaddr_in *src_in,
src_in->sin_family = AF_INET;
src_in->sin_addr.s_addr = fl4.saddr;
 
-   if (rt->dst.dev->flags & IFF_LOOPBACK) {
-   ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL);
-   if (!ret)
-   memcpy(addr->dst_dev_addr, addr->src_dev_addr, 
MAX_ADDR_LEN);
-   goto put;
-   }
-
-   /* If the device does ARP internally, return 'done' */
-   if (rt->dst.dev->flags & IFF_NOARP) {
-   ret = rdma_copy_addr(addr, rt->dst.dev, NULL);
-   goto put;
-   }
-
/* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
 * routable) and we could set the network type accordingly.
 */
if (rt->rt_uses_gateway)
addr->network = RDMA_NETWORK_IPV4;
 
-	ret = dst_fetch_ha(&rt->dst, addr, &fl4.daddr);
-put:
-   ip_rt_put(rt);
+   *prt = rt;
+   return 0;
 out:
return ret;
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
 static int addr6_resolve(struct sockaddr_in6 *src_in,
-struct sockaddr_in6 *dst_in,
-struct rdma_dev_addr *addr)
+const struct sockaddr_in6 *dst_in,
+struct rdma_dev_addr *addr,
+struct dst_entry **pdst)
 {
struct flowi6 fl6;
struct dst_entry *dst;
@@ -299,49 +289,109 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
src_in->sin6_addr = fl6.saddr;
}
 
-   if (dst->dev->flags & IFF_LOOPBACK) {
-   ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL);
-   if (!ret)
-   memcpy(addr->dst_dev_addr, addr->src_dev_addr, 
MAX_ADDR_LEN);
-   goto put;
-   

[PATCH for-next V2 08/11] IB/rdma_cm: Add wrapper for cma reference count

2015-12-03 Thread Matan Barak
Currently, cma users can't increase or decrease the cma reference
count. This is necessary when setting cma attributes (like the
default GID type) in order to avoid use-after-free errors.
Adding cma_ref_dev and cma_deref_dev APIs.

Signed-off-by: Matan Barak 
---
 drivers/infiniband/core/cma.c   | 11 +--
 drivers/infiniband/core/core_priv.h |  4 
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index cf52b65..f78088a 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -60,6 +60,8 @@
 #include 
 #include 
 
+#include "core_priv.h"
+
 MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
@@ -185,6 +187,11 @@ enum {
CMA_OPTION_AFONLY,
 };
 
+void cma_ref_dev(struct cma_device *cma_dev)
+{
+	atomic_inc(&cma_dev->refcount);
+}
+
 /*
  * Device removal can occur at anytime, so we need extra handling to
  * serialize notifying the user of device removal with other callbacks.
@@ -339,7 +346,7 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 
ip_ver)
 static void cma_attach_to_dev(struct rdma_id_private *id_priv,
  struct cma_device *cma_dev)
 {
-	atomic_inc(&cma_dev->refcount);
+   cma_ref_dev(cma_dev);
id_priv->cma_dev = cma_dev;
id_priv->id.device = cma_dev->device;
id_priv->id.route.addr.dev_addr.transport =
@@ -347,7 +354,7 @@ static void cma_attach_to_dev(struct rdma_id_private 
*id_priv,
	list_add_tail(&id_priv->list, &cma_dev->id_list);
 }
 
-static inline void cma_deref_dev(struct cma_device *cma_dev)
+void cma_deref_dev(struct cma_device *cma_dev)
 {
	if (atomic_dec_and_test(&cma_dev->refcount))
		complete(&cma_dev->comp);
diff --git a/drivers/infiniband/core/core_priv.h 
b/drivers/infiniband/core/core_priv.h
index 3b250a2..1945b4e 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -38,6 +38,10 @@
 
 #include 
 
+struct cma_device;
+void cma_ref_dev(struct cma_device *cma_dev);
+void cma_deref_dev(struct cma_device *cma_dev);
+
 int  ib_device_register_sysfs(struct ib_device *device,
  int (*port_callback)(struct ib_device *,
   u8, struct kobject *));
-- 
2.1.0



[PATCH for-next V2 11/11] IB/cma: Join and leave multicast groups with IGMP

2015-12-03 Thread Matan Barak
From: Moni Shoua 

Since RoCEv2 runs over an IP header, it is required to send IGMP
join and leave requests to the network when joining and leaving
multicast groups.

Signed-off-by: Moni Shoua 
---
 drivers/infiniband/core/cma.c   | 96 ++---
 drivers/infiniband/core/multicast.c | 17 ++-
 include/rdma/ib_sa.h|  2 +
 3 files changed, 106 insertions(+), 9 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 8fab267..c30bfe3 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -38,6 +38,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 #include 
 #include 
@@ -304,6 +305,7 @@ struct cma_multicast {
void*context;
struct sockaddr_storage addr;
struct kref mcref;
+   booligmp_joined;
 };
 
 struct cma_work {
@@ -400,6 +402,26 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 
ip_ver)
hdr->ip_version = (ip_ver << 4) | (hdr->ip_version & 0xF);
 }
 
+static int cma_igmp_send(struct net_device *ndev, union ib_gid *mgid, bool 
join)
+{
+   struct in_device *in_dev = NULL;
+
+   if (ndev) {
+   rtnl_lock();
+   in_dev = __in_dev_get_rtnl(ndev);
+   if (in_dev) {
+   if (join)
+   ip_mc_inc_group(in_dev,
+   *(__be32 *)(mgid->raw + 12));
+   else
+   ip_mc_dec_group(in_dev,
+   *(__be32 *)(mgid->raw + 12));
+   }
+   rtnl_unlock();
+   }
+   return (in_dev) ? 0 : -ENODEV;
+}
+
 static void _cma_attach_to_dev(struct rdma_id_private *id_priv,
   struct cma_device *cma_dev)
 {
@@ -1535,8 +1557,24 @@ static void cma_leave_mc_groups(struct rdma_id_private 
*id_priv)
  id_priv->id.port_num)) {
ib_sa_free_multicast(mc->multicast.ib);
kfree(mc);
-   } else
+   } else {
+   if (mc->igmp_joined) {
+   struct rdma_dev_addr *dev_addr =
+					&id_priv->id.route.addr.dev_addr;
+   struct net_device *ndev = NULL;
+
+   if (dev_addr->bound_dev_if)
+					ndev = dev_get_by_index(&init_net,
+								dev_addr->bound_dev_if);
+   if (ndev) {
+   cma_igmp_send(ndev,
+ 
>multicast.ib->rec.mgid,
+ false);
+   dev_put(ndev);
+   }
+   }
			kref_put(&mc->mcref, release_mc);
+   }
}
 }
 
@@ -3656,12 +3694,23 @@ static int cma_ib_mc_handler(int status, struct 
ib_sa_multicast *multicast)
event.status = status;
event.param.ud.private_data = mc->context;
if (!status) {
+		struct rdma_dev_addr *dev_addr =
+			&id_priv->id.route.addr.dev_addr;
+		struct net_device *ndev =
+			dev_get_by_index(&init_net, dev_addr->bound_dev_if);
+		enum ib_gid_type gid_type =
+			id_priv->cma_dev->default_gid_type[id_priv->id.port_num -
+			rdma_start_port(id_priv->cma_dev->device)];
+
event.event = RDMA_CM_EVENT_MULTICAST_JOIN;
ib_init_ah_from_mcmember(id_priv->id.device,
					 id_priv->id.port_num, &multicast->rec,
+					 ndev, gid_type,
					 &event.param.ud.ah_attr);
event.param.ud.qp_num = 0xFF;
event.param.ud.qkey = be32_to_cpu(multicast->rec.qkey);
+   if (ndev)
+   dev_put(ndev);
} else
event.event = RDMA_CM_EVENT_MULTICAST_ERROR;
 
@@ -3794,9 +3843,10 @@ static int cma_iboe_join_multicast(struct 
rdma_id_private *id_priv,
 {
struct iboe_mcast_work *work;
struct rdma_dev_addr *dev_addr = _priv->id.route.addr.dev_addr;
-   int err;
+   int err = 0;
	struct sockaddr *addr = (struct sockaddr *)&mc->addr;
struct net_device *ndev = NULL;
+   enum ib_gid_type gid_type;
 
	if (cma_zero_addr((struct sockaddr *)&mc->addr))
return -EINVAL;
@@ -3826,9 +3876,25 @@ static int cma_iboe_join_multicast(struct 
rdma_id_private *id_priv,
mc->multicast.ib->rec.rate = iboe_get_rate(ndev);

[PATCH for-next V2 04/11] IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type

2015-12-03 Thread Matan Barak
Adding RoCE v2 GID type and port type. Vendors
which support this type will get their GID table
populated with RoCE v2 GIDs automatically.
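
A vendor opts in by advertising the new capability group in its
immutable port data; a sketch (mydrv is a made-up driver):

	static int mydrv_port_immutable(struct ib_device *ibdev, u8 port_num,
					struct ib_port_immutable *immutable)
	{
		/* the core then populates RoCE v2 GIDs for this port */
		immutable->core_cap_flags = RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP;
		immutable->gid_tbl_len = 16;	/* made-up size */
		return 0;
	}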

Signed-off-by: Matan Barak 
---
 drivers/infiniband/core/cache.c |  1 +
 drivers/infiniband/core/roce_gid_mgmt.c |  3 ++-
 include/rdma/ib_verbs.h | 23 +--
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 566fd8f..88b4b6f 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -128,6 +128,7 @@ static void dispatch_gid_change_event(struct ib_device 
*ib_dev, u8 port)
 
 static const char * const gid_type_str[] = {
[IB_GID_TYPE_IB]= "IB/RoCE v1",
+   [IB_GID_TYPE_ROCE_UDP_ENCAP]= "RoCE v2",
 };
 
 const char *ib_cache_gid_type_str(enum ib_gid_type gid_type)
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
b/drivers/infiniband/core/roce_gid_mgmt.c
index 61c27a7..1e3673f 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -71,7 +71,8 @@ static const struct {
bool (*is_supported)(const struct ib_device *device, u8 port_num);
enum ib_gid_type gid_type;
 } PORT_CAP_TO_GID_TYPE[] = {
-   {rdma_protocol_roce,   IB_GID_TYPE_ROCE},
+   {rdma_protocol_roce_eth_encap, IB_GID_TYPE_ROCE},
+   {rdma_protocol_roce_udp_encap, IB_GID_TYPE_ROCE_UDP_ENCAP},
 };
 
 #define CAP_TO_GID_TABLE_SIZE  ARRAY_SIZE(PORT_CAP_TO_GID_TYPE)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 2933aeb..87df931 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -71,6 +71,7 @@ enum ib_gid_type {
/* If link layer is Ethernet, this is RoCE V1 */
IB_GID_TYPE_IB= 0,
IB_GID_TYPE_ROCE  = 0,
+   IB_GID_TYPE_ROCE_UDP_ENCAP = 1,
IB_GID_TYPE_SIZE
 };
 
@@ -401,6 +402,7 @@ union rdma_protocol_stats {
 #define RDMA_CORE_CAP_PROT_IB   0x0010
 #define RDMA_CORE_CAP_PROT_ROCE 0x0020
 #define RDMA_CORE_CAP_PROT_IWARP0x0040
+#define RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP 0x0080
 
 #define RDMA_CORE_PORT_IBA_IB  (RDMA_CORE_CAP_PROT_IB  \
| RDMA_CORE_CAP_IB_MAD \
@@ -413,6 +415,12 @@ union rdma_protocol_stats {
| RDMA_CORE_CAP_IB_CM   \
| RDMA_CORE_CAP_AF_IB   \
| RDMA_CORE_CAP_ETH_AH)
+#define RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP  \
+   (RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP \
+   | RDMA_CORE_CAP_IB_MAD  \
+   | RDMA_CORE_CAP_IB_CM   \
+   | RDMA_CORE_CAP_AF_IB   \
+   | RDMA_CORE_CAP_ETH_AH)
 #define RDMA_CORE_PORT_IWARP   (RDMA_CORE_CAP_PROT_IWARP \
| RDMA_CORE_CAP_IW_CM)
 #define RDMA_CORE_PORT_INTEL_OPA   (RDMA_CORE_PORT_IBA_IB  \
@@ -1975,6 +1983,17 @@ static inline bool rdma_protocol_ib(const struct 
ib_device *device, u8 port_num)
 
 static inline bool rdma_protocol_roce(const struct ib_device *device, u8 
port_num)
 {
+   return device->port_immutable[port_num].core_cap_flags &
+   (RDMA_CORE_CAP_PROT_ROCE | RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP);
+}
+
+static inline bool rdma_protocol_roce_udp_encap(const struct ib_device 
*device, u8 port_num)
+{
+   return device->port_immutable[port_num].core_cap_flags & 
RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP;
+}
+
+static inline bool rdma_protocol_roce_eth_encap(const struct ib_device 
*device, u8 port_num)
+{
return device->port_immutable[port_num].core_cap_flags & 
RDMA_CORE_CAP_PROT_ROCE;
 }
 
@@ -1985,8 +2004,8 @@ static inline bool rdma_protocol_iwarp(const struct 
ib_device *device, u8 port_n
 
 static inline bool rdma_ib_or_roce(const struct ib_device *device, u8 port_num)
 {
-   return device->port_immutable[port_num].core_cap_flags &
-   (RDMA_CORE_CAP_PROT_IB | RDMA_CORE_CAP_PROT_ROCE);
+   return rdma_protocol_ib(device, port_num) ||
+   rdma_protocol_roce(device, port_num);
 }
 
 /**
-- 
2.1.0



[PATCH for-next V2 06/11] IB/core: Move rdma_is_upper_dev_rcu to header file

2015-12-03 Thread Matan Barak
In order to validate the route, we need an easy way to check if a
net-device belongs to our RDMA device. Move this helper function
to a header file in order to make this check easier.

Signed-off-by: Matan Barak 
Reviewed-by: Haggai Eran 
---
 drivers/infiniband/core/core_priv.h | 13 +
 drivers/infiniband/core/roce_gid_mgmt.c | 20 
 2 files changed, 17 insertions(+), 16 deletions(-)

diff --git a/drivers/infiniband/core/core_priv.h 
b/drivers/infiniband/core/core_priv.h
index d531f91..3b250a2 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -96,4 +96,17 @@ int ib_cache_setup_one(struct ib_device *device);
 void ib_cache_cleanup_one(struct ib_device *device);
 void ib_cache_release_one(struct ib_device *device);
 
+static inline bool rdma_is_upper_dev_rcu(struct net_device *dev,
+struct net_device *upper)
+{
+   struct net_device *_upper = NULL;
+   struct list_head *iter;
+
+   netdev_for_each_all_upper_dev_rcu(dev, _upper, iter)
+   if (_upper == upper)
+   break;
+
+   return _upper == upper;
+}
+
 #endif /* _CORE_PRIV_H */
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
b/drivers/infiniband/core/roce_gid_mgmt.c
index 1e3673f..06556c3 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -139,18 +139,6 @@ static enum bonding_slave_state 
is_eth_active_slave_of_bonding_rcu(struct net_de
return BONDING_SLAVE_STATE_NA;
 }
 
-static bool is_upper_dev_rcu(struct net_device *dev, struct net_device *upper)
-{
-   struct net_device *_upper = NULL;
-   struct list_head *iter;
-
-   netdev_for_each_all_upper_dev_rcu(dev, _upper, iter)
-   if (_upper == upper)
-   break;
-
-   return _upper == upper;
-}
-
 #define REQUIRED_BOND_STATES   (BONDING_SLAVE_STATE_ACTIVE |   \
 BONDING_SLAVE_STATE_NA)
 static int is_eth_port_of_netdev(struct ib_device *ib_dev, u8 port,
@@ -168,7 +156,7 @@ static int is_eth_port_of_netdev(struct ib_device *ib_dev, 
u8 port,
if (!real_dev)
real_dev = event_ndev;
 
-   res = ((is_upper_dev_rcu(rdma_ndev, event_ndev) &&
+   res = ((rdma_is_upper_dev_rcu(rdma_ndev, event_ndev) &&
   (is_eth_active_slave_of_bonding_rcu(rdma_ndev, real_dev) &
REQUIRED_BOND_STATES)) ||
   real_dev == rdma_ndev);
@@ -214,7 +202,7 @@ static int upper_device_filter(struct ib_device *ib_dev, u8 
port,
return 1;
 
rcu_read_lock();
-   res = is_upper_dev_rcu(rdma_ndev, event_ndev);
+   res = rdma_is_upper_dev_rcu(rdma_ndev, event_ndev);
rcu_read_unlock();
 
return res;
@@ -244,7 +232,7 @@ static void enum_netdev_default_gids(struct ib_device 
*ib_dev,
rcu_read_lock();
if (!rdma_ndev ||
((rdma_ndev != event_ndev &&
- !is_upper_dev_rcu(rdma_ndev, event_ndev)) ||
+ !rdma_is_upper_dev_rcu(rdma_ndev, event_ndev)) ||
 is_eth_active_slave_of_bonding_rcu(rdma_ndev,

netdev_master_upper_dev_get_rcu(rdma_ndev)) ==
 BONDING_SLAVE_STATE_INACTIVE)) {
@@ -274,7 +262,7 @@ static void bond_delete_netdev_default_gids(struct 
ib_device *ib_dev,
 
rcu_read_lock();
 
-   if (is_upper_dev_rcu(rdma_ndev, event_ndev) &&
+   if (rdma_is_upper_dev_rcu(rdma_ndev, event_ndev) &&
is_eth_active_slave_of_bonding_rcu(rdma_ndev, real_dev) ==
BONDING_SLAVE_STATE_INACTIVE) {
unsigned long gid_type_mask;
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V2 00/11] Add RoCE v2 support

2015-12-03 Thread Matan Barak
Hi Doug,

This series adds the support for RoCE v2. In order to support RoCE v2,
we add gid_type attribute to every GID. When the RoCE GID management
populates the GID table, it duplicates each GID with all supported types.
This gives the user the ability to communicate over each supported
type.

Patches 0001, 0002 and 0003 add support for multiple GID types to the
cache and related APIs. The third patch exposes the GID attribute
information in sysfs.

Patch 0004 adds the RoCE v2 GID type and the capabilities required
from the vendor in order to implement RoCE v2. These capabilities
are grouped together as RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP.

RoCE v2 can work over IPv4 and IPv6 networks. When receiving an ib_wc, this
information should come from the vendor's driver. In case the vendor
doesn't supply this information, we parse the packet headers and resolve
its network type. Patch 0005 adds this information and the required utilities.

Patches 0006 and 0007 add route validation. This is mandatory to ensure
that we send packets using GIDs that correspond to a net-device that
can be routed to the destination.

Patches 0008 and 0009 add configfs support (and the required
infrastructure) for CMA. The administrator should be able to set the
default RoCE type. This is done through a new per-port
default_roce_mode configfs file.

Patch 0010 formats a QP1 packet in order to support RoCE v2 CM
packets. This is required for vendors which implement their
QP1 as a Raw QP.

Patch 0011 adds support for IPv4 multicast as an IPv4 network
requires IGMP to be sent in order to join multicast groups.

Vendor code isn't part of this patch-set. Soft-RoCE will be
sent soon and depends on these patches. Other vendors, like
mlx4, ocrdma and mlx5, will follow.

This patch-set is applied on top of "Change per-entry locks in GID cache
to table lock", which was sent to the mailing list.

Thanks,
Matan

Changed from V1:
 - Rebased against Linux 4.4-rc2 master branch.
 - Add route validation
 - ConfigFS - avoid compiling INFINIBAND=y and CONFIGFS_FS=m
 - Add documentation for configfs and sysfs ABI
 - Remove ifindex and gid_type from mcmember

Changes from V0:
 - Rebased patches against Doug's latest k.o/for-4.4 tree.
 - Fixed a bug in configfs (rmdir caused an incorrect free).

Matan Barak (8):
  IB/core: Add gid_type to gid attribute
  IB/cm: Use the source GID index type
  IB/core: Add gid attributes to sysfs
  IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type
  IB/core: Move rdma_is_upper_dev_rcu to header file
  IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path
  IB/rdma_cm: Add wrapper for cma reference count
  IB/cma: Add configfs for rdma_cm

Moni Shoua (2):
  IB/core: Initialize UD header structure with IP and UDP headers
  IB/cma: Join and leave multicast groups with IGMP

Somnath Kotur (1):
  IB/core: Add rdma_network_type to wc

 Documentation/ABI/testing/configfs-rdma_cm   |  22 ++
 Documentation/ABI/testing/sysfs-class-infiniband |  16 ++
 drivers/infiniband/Kconfig   |   9 +
 drivers/infiniband/core/Makefile |   2 +
 drivers/infiniband/core/addr.c   | 185 +
 drivers/infiniband/core/cache.c  | 169 
 drivers/infiniband/core/cm.c |  31 ++-
 drivers/infiniband/core/cma.c| 261 --
 drivers/infiniband/core/cma_configfs.c   | 321 +++
 drivers/infiniband/core/core_priv.h  |  45 
 drivers/infiniband/core/device.c |  10 +-
 drivers/infiniband/core/multicast.c  |  17 +-
 drivers/infiniband/core/roce_gid_mgmt.c  |  81 --
 drivers/infiniband/core/sa_query.c   |  76 +-
 drivers/infiniband/core/sysfs.c  | 184 -
 drivers/infiniband/core/ud_header.c  | 155 ++-
 drivers/infiniband/core/uverbs_marshall.c|   1 +
 drivers/infiniband/core/verbs.c  | 170 ++--
 drivers/infiniband/hw/mlx4/qp.c  |   7 +-
 drivers/infiniband/hw/mthca/mthca_qp.c   |   2 +-
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |   2 +-
 include/rdma/ib_addr.h   |  11 +-
 include/rdma/ib_cache.h  |   4 +
 include/rdma/ib_pack.h   |  45 +++-
 include/rdma/ib_sa.h |   3 +
 include/rdma/ib_verbs.h  |  78 +-
 26 files changed, 1704 insertions(+), 203 deletions(-)
 create mode 100644 Documentation/ABI/testing/configfs-rdma_cm
 create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband
 create mode 100644 drivers/infiniband/core/cma_configfs.c

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V2 01/11] IB/core: Add gid_type to gid attribute

2015-12-03 Thread Matan Barak
In order to support multiple GID types, we need to store the gid_type
with each GID. This is also aligned with the RoCE v2 annex "RoCEv2 PORT
GID table entries shall have a "GID type" attribute that denotes the L3
Address type". The currently supported GID type is IB_GID_TYPE_IB, which is
also the RoCE v1 GID type.

This implies that gid_type should be added to roce_gid_table meta-data.
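
As an illustration (a hedged sketch; ib_dev and ndev stand for a device and
net-device the caller already holds, and port 1 is arbitrary), a consumer
of the extended cache API added here can look up a GID together with its
type:

	union ib_gid gid = {};	/* GID to search for */
	u16 index;
	int ret;

	/* find the RoCE v1 entry for this GID on port 1 */
	ret = ib_find_cached_gid_by_port(ib_dev, &gid, IB_GID_TYPE_IB,
					 1, ndev, &index);
	if (!ret)
		pr_debug("GID found at index %u\n", index);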

Signed-off-by: Matan Barak 
---
 drivers/infiniband/core/cache.c   | 144 --
 drivers/infiniband/core/cm.c  |   2 +-
 drivers/infiniband/core/cma.c |   3 +-
 drivers/infiniband/core/core_priv.h   |   4 +
 drivers/infiniband/core/device.c  |   9 +-
 drivers/infiniband/core/multicast.c   |   2 +-
 drivers/infiniband/core/roce_gid_mgmt.c   |  60 +++--
 drivers/infiniband/core/sa_query.c|   5 +-
 drivers/infiniband/core/uverbs_marshall.c |   1 +
 drivers/infiniband/core/verbs.c   |   1 +
 include/rdma/ib_cache.h   |   4 +
 include/rdma/ib_sa.h  |   1 +
 include/rdma/ib_verbs.h   |  11 ++-
 13 files changed, 185 insertions(+), 62 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 097e9df..566fd8f 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -64,6 +64,7 @@ enum gid_attr_find_mask {
GID_ATTR_FIND_MASK_GID  = 1UL << 0,
GID_ATTR_FIND_MASK_NETDEV   = 1UL << 1,
GID_ATTR_FIND_MASK_DEFAULT  = 1UL << 2,
+   GID_ATTR_FIND_MASK_GID_TYPE = 1UL << 3,
 };
 
 enum gid_table_entry_props {
@@ -125,6 +126,19 @@ static void dispatch_gid_change_event(struct ib_device 
*ib_dev, u8 port)
}
 }
 
+static const char * const gid_type_str[] = {
+   [IB_GID_TYPE_IB]= "IB/RoCE v1",
+};
+
+const char *ib_cache_gid_type_str(enum ib_gid_type gid_type)
+{
+   if (gid_type < ARRAY_SIZE(gid_type_str) && gid_type_str[gid_type])
+   return gid_type_str[gid_type];
+
+   return "Invalid GID type";
+}
+EXPORT_SYMBOL(ib_cache_gid_type_str);
+
 /* This function expects that rwlock will be write locked in all
  * scenarios and that lock will be locked in sleep-able (RoCE)
  * scenarios.
@@ -233,6 +247,10 @@ static int find_gid(struct ib_gid_table *table, const 
union ib_gid *gid,
 	if (found >= 0)
continue;
 
+   if (mask & GID_ATTR_FIND_MASK_GID_TYPE &&
+   attr->gid_type != val->gid_type)
+   continue;
+
if (mask & GID_ATTR_FIND_MASK_GID &&
memcmp(gid, >gid, sizeof(*gid)))
continue;
@@ -296,6 +314,7 @@ int ib_cache_gid_add(struct ib_device *ib_dev, u8 port,
 	write_lock_irq(&table->rwlock);
 
 	ix = find_gid(table, gid, attr, false, GID_ATTR_FIND_MASK_GID |
+		      GID_ATTR_FIND_MASK_GID_TYPE |
 		      GID_ATTR_FIND_MASK_NETDEV, &empty);
if (ix >= 0)
goto out_unlock;
@@ -329,6 +348,7 @@ int ib_cache_gid_del(struct ib_device *ib_dev, u8 port,
 
ix = find_gid(table, gid, attr, false,
  GID_ATTR_FIND_MASK_GID  |
+ GID_ATTR_FIND_MASK_GID_TYPE |
  GID_ATTR_FIND_MASK_NETDEV   |
  GID_ATTR_FIND_MASK_DEFAULT,
  NULL);
@@ -427,11 +447,13 @@ static int _ib_cache_gid_table_find(struct ib_device 
*ib_dev,
 
 static int ib_cache_gid_find(struct ib_device *ib_dev,
 const union ib_gid *gid,
+enum ib_gid_type gid_type,
 struct net_device *ndev, u8 *port,
 u16 *index)
 {
-   unsigned long mask = GID_ATTR_FIND_MASK_GID;
-   struct ib_gid_attr gid_attr_val = {.ndev = ndev};
+   unsigned long mask = GID_ATTR_FIND_MASK_GID |
+GID_ATTR_FIND_MASK_GID_TYPE;
+   struct ib_gid_attr gid_attr_val = {.ndev = ndev, .gid_type = gid_type};
 
if (ndev)
mask |= GID_ATTR_FIND_MASK_NETDEV;
@@ -442,14 +464,16 @@ static int ib_cache_gid_find(struct ib_device *ib_dev,
 
 int ib_find_cached_gid_by_port(struct ib_device *ib_dev,
   const union ib_gid *gid,
+  enum ib_gid_type gid_type,
   u8 port, struct net_device *ndev,
   u16 *index)
 {
int local_index;
struct ib_gid_table **ports_table = ib_dev->cache.gid_cache;
struct ib_gid_table *table;
-   unsigned long mask = GID_ATTR_FIND_MASK_GID;
-   struct ib_gid_attr val = {.ndev = ndev};
+   unsigned long mask = GID_ATTR_FIND_MASK_GID |
+GID_ATTR_FIND_MASK_GID_TYPE;
+   struct ib_gid_attr val = {.ndev = ndev, .gid_type = gid_type};
unsigned long flags;
 

[PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-03 Thread Matan Barak
From: Somnath Kotur 

Providers should tell the IB core the WC's network type.
This is used in order to search for the proper GID in the
GID table. When using HCAs that can't provide this info,
the IB core deep-examines the packet headers and extracts
the GID type by itself.

We choose the sgid_index and type from all the matching entries in
RDMA-CM based on a hint from the IP stack, and we set hop_limit for
the IP packet based on the same hint.
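
For illustration (a hedged sketch, not any specific driver's code): a
provider that already knows the encapsulation at poll time would report it
in the WC rather than let the core parse the headers:

	/* hypothetical provider poll-CQ path */
	wc->network_hdr_type = RDMA_NETWORK_IPV4;
	wc->wc_flags |= IB_WC_WITH_NETWORK_HDR_TYPE;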

Signed-off-by: Matan Barak 
Signed-off-by: Somnath Kotur 
---
 drivers/infiniband/core/addr.c  |  14 +
 drivers/infiniband/core/cma.c   |  11 +++-
 drivers/infiniband/core/verbs.c | 123 ++--
 include/rdma/ib_addr.h  |   1 +
 include/rdma/ib_verbs.h |  44 ++
 5 files changed, 187 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 34b1ada..6e35299 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -256,6 +256,12 @@ static int addr4_resolve(struct sockaddr_in *src_in,
goto put;
}
 
+   /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
+* routable) and we could set the network type accordingly.
+*/
+   if (rt->rt_uses_gateway)
+   addr->network = RDMA_NETWORK_IPV4;
+
 	ret = dst_fetch_ha(&rt->dst, addr, &fl4.daddr);
 put:
ip_rt_put(rt);
@@ -270,6 +276,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
 {
struct flowi6 fl6;
struct dst_entry *dst;
+   struct rt6_info *rt;
int ret;
 
 	memset(&fl6, 0, sizeof fl6);
@@ -281,6 +288,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
if ((ret = dst->error))
goto put;
 
+   rt = (struct rt6_info *)dst;
 	if (ipv6_addr_any(&fl6.saddr)) {
 		ret = ipv6_dev_get_saddr(addr->net, ip6_dst_idev(dst)->dev,
 					 &fl6.daddr, 0, &fl6.saddr);
@@ -304,6 +312,12 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
goto put;
}
 
+   /* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
+* routable) and we could set the network type accordingly.
+*/
+   if (rt->rt6i_flags & RTF_GATEWAY)
+   addr->network = RDMA_NETWORK_IPV6;
+
 	ret = dst_fetch_ha(dst, addr, &fl6.daddr);
 put:
dst_release(dst);
diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 2914e08..5dc853c 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -2302,6 +2302,7 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
 {
 	struct rdma_route *route = &id_priv->id.route;
 	struct rdma_addr *addr = &route->addr;
+   enum ib_gid_type network_gid_type;
struct cma_work *work;
int ret;
struct net_device *ndev = NULL;
@@ -2340,7 +2341,15 @@ static int cma_resolve_iboe_route(struct rdma_id_private 
*id_priv)
 	rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.dst_addr,
 		    &route->path_rec->dgid);
 
-   route->path_rec->hop_limit = 1;
+   /* Use the hint from IP Stack to select GID Type */
+   network_gid_type = ib_network_to_gid_type(addr->dev_addr.network);
+   if (addr->dev_addr.network != RDMA_NETWORK_IB) {
+   route->path_rec->gid_type = network_gid_type;
+   /* TODO: get the hoplimit from the inet/inet6 device */
+   route->path_rec->hop_limit = IPV6_DEFAULT_HOPLIMIT;
+   } else {
+   route->path_rec->hop_limit = 1;
+   }
route->path_rec->reversible = 1;
 	route->path_rec->pkey = cpu_to_be16(0xffff);
route->path_rec->mtu_selector = IB_SA_EQ;
diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 4263c4c..c564131 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -311,8 +311,61 @@ struct ib_ah *ib_create_ah(struct ib_pd *pd, struct 
ib_ah_attr *ah_attr)
 }
 EXPORT_SYMBOL(ib_create_ah);
 
+static int ib_get_header_version(const union rdma_network_hdr *hdr)
+{
+	const struct iphdr *ip4h = (struct iphdr *)&hdr->roce4grh;
+	struct iphdr ip4h_checked;
+	const struct ipv6hdr *ip6h = (struct ipv6hdr *)&hdr->ibgrh;
+
+   /* If it's IPv6, the version must be 6, otherwise, the first
+* 20 bytes (before the IPv4 header) are garbled.
+*/
+   if (ip6h->version != 6)
+   return (ip4h->version == 4) ? 4 : 0;
+   /* version may be 6 or 4 because the first 20 bytes could be garbled */
+
+   /* RoCE v2 requires no options, thus header length
+* must be 5 words
+*/
+   if (ip4h->ihl != 5)
+   return 6;
+
+   /* Verify checksum.
+* We can't write on scattered buffers so we need to copy to
+* temp buffer.
+*/
+   

[PATCH for-next V2 07/11] IB/core: Validate route in ib_init_ah_from_wc and ib_init_ah_from_path

2015-12-03 Thread Matan Barak
In order to make sure API users don't try to use SGIDs which don't
conform to the routing table, validate the route before searching
the RoCE GID table.
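
The shape of the check (a hedged sketch; the series implements it inside
the GID-table search, reusing the helper from patch 0006): a candidate GID
entry is acceptable only if the route resolves to its net-device or to an
upper device of it:

	/* hypothetical per-entry filter; resolved_dev comes from the
	 * route lookup, gid_attr describes the candidate table entry
	 */
	if (resolved_dev != gid_attr->ndev &&
	    !rdma_is_upper_dev_rcu(gid_attr->ndev, resolved_dev))
		continue;	/* this GID can't reach the destination */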

Signed-off-by: Matan Barak 
---
 drivers/infiniband/core/addr.c   | 175 ++-
 drivers/infiniband/core/cm.c |  10 +-
 drivers/infiniband/core/cma.c|  30 +-
 drivers/infiniband/core/sa_query.c   |  75 +++--
 drivers/infiniband/core/verbs.c  |  48 ++---
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |   2 +-
 include/rdma/ib_addr.h   |  10 +-
 7 files changed, 270 insertions(+), 80 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index 6e35299..57eda11 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -121,7 +121,8 @@ int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct 
net_device *dev,
 }
 EXPORT_SYMBOL(rdma_copy_addr);
 
-int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
+int rdma_translate_ip(const struct sockaddr *addr,
+ struct rdma_dev_addr *dev_addr,
  u16 *vlan_id)
 {
struct net_device *dev;
@@ -139,7 +140,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct 
rdma_dev_addr *dev_addr,
switch (addr->sa_family) {
case AF_INET:
dev = ip_dev_find(dev_addr->net,
-   ((struct sockaddr_in *) addr)->sin_addr.s_addr);
+   ((const struct sockaddr_in *)addr)->sin_addr.s_addr);
 
if (!dev)
return ret;
@@ -154,7 +155,7 @@ int rdma_translate_ip(struct sockaddr *addr, struct 
rdma_dev_addr *dev_addr,
rcu_read_lock();
for_each_netdev_rcu(dev_addr->net, dev) {
if (ipv6_chk_addr(dev_addr->net,
-			  &((struct sockaddr_in6 *) addr)->sin6_addr,
+			  &((const struct sockaddr_in6 *)addr)->sin6_addr,
  dev, 1)) {
ret = rdma_copy_addr(dev_addr, dev, NULL);
if (vlan_id)
@@ -198,7 +199,8 @@ static void queue_req(struct addr_req *req)
 	mutex_unlock(&lock);
 }
 
-static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr, void *daddr)
+static int dst_fetch_ha(struct dst_entry *dst, struct rdma_dev_addr *dev_addr,
+			const void *daddr)
 {
struct neighbour *n;
int ret;
@@ -222,8 +224,9 @@ static int dst_fetch_ha(struct dst_entry *dst, struct 
rdma_dev_addr *dev_addr, v
 }
 
 static int addr4_resolve(struct sockaddr_in *src_in,
-struct sockaddr_in *dst_in,
-struct rdma_dev_addr *addr)
+const struct sockaddr_in *dst_in,
+struct rdma_dev_addr *addr,
+struct rtable **prt)
 {
__be32 src_ip = src_in->sin_addr.s_addr;
__be32 dst_ip = dst_in->sin_addr.s_addr;
@@ -243,36 +246,23 @@ static int addr4_resolve(struct sockaddr_in *src_in,
src_in->sin_family = AF_INET;
src_in->sin_addr.s_addr = fl4.saddr;
 
-   if (rt->dst.dev->flags & IFF_LOOPBACK) {
-   ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL);
-   if (!ret)
-			memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN);
-   goto put;
-   }
-
-   /* If the device does ARP internally, return 'done' */
-   if (rt->dst.dev->flags & IFF_NOARP) {
-   ret = rdma_copy_addr(addr, rt->dst.dev, NULL);
-   goto put;
-   }
-
/* If there's a gateway, we're definitely in RoCE v2 (as RoCE v1 isn't
 * routable) and we could set the network type accordingly.
 */
if (rt->rt_uses_gateway)
addr->network = RDMA_NETWORK_IPV4;
 
-	ret = dst_fetch_ha(&rt->dst, addr, &fl4.daddr);
-put:
-   ip_rt_put(rt);
+   *prt = rt;
+   return 0;
 out:
return ret;
 }
 
 #if IS_ENABLED(CONFIG_IPV6)
 static int addr6_resolve(struct sockaddr_in6 *src_in,
-struct sockaddr_in6 *dst_in,
-struct rdma_dev_addr *addr)
+const struct sockaddr_in6 *dst_in,
+struct rdma_dev_addr *addr,
+struct dst_entry **pdst)
 {
struct flowi6 fl6;
struct dst_entry *dst;
@@ -299,49 +289,109 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
src_in->sin6_addr = fl6.saddr;
}
 
-   if (dst->dev->flags & IFF_LOOPBACK) {
-   ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL);
-   if (!ret)
-			memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN);
-   goto put;
-   

[PATCH for-next V2 03/11] IB/core: Add gid attributes to sysfs

2015-12-03 Thread Matan Barak
This patch set adds attributes of net device and gid type to each GID
in the GID table. Users that use verbs directly need to specify
the GID index. Since the same GID could have different types or
associated net devices, users should have the ability to query the
associated GID attributes. Adding these attributes to sysfs.

Signed-off-by: Matan Barak 
---
 Documentation/ABI/testing/sysfs-class-infiniband |  16 ++
 drivers/infiniband/core/sysfs.c  | 184 ++-
 2 files changed, 198 insertions(+), 2 deletions(-)
 create mode 100644 Documentation/ABI/testing/sysfs-class-infiniband

diff --git a/Documentation/ABI/testing/sysfs-class-infiniband 
b/Documentation/ABI/testing/sysfs-class-infiniband
new file mode 100644
index 000..a86abe6
--- /dev/null
+++ b/Documentation/ABI/testing/sysfs-class-infiniband
@@ -0,0 +1,16 @@
+What:		/sys/class/infiniband/<device>/ports/<port-num>/gid_attrs/ndevs/<gid-index>
+Date:		November 29, 2015
+KernelVersion:	4.4.0
+Contact:	linux-rdma@vger.kernel.org
+Description:	The net-device's name associated with the GID resides
+		at index <gid-index>.
+
+What:		/sys/class/infiniband/<device>/ports/<port-num>/gid_attrs/types/<gid-index>
+Date:		November 29, 2015
+KernelVersion:	4.4.0
+Contact:	linux-rdma@vger.kernel.org
+Description:	The RoCE type of the associated GID resides at index <gid-index>.
+		This could either be "IB/RoCE v1" for IB and RoCE v1 based GIDs
+		or "RoCE v2" for RoCE v2 based GIDs.
+
+
diff --git a/drivers/infiniband/core/sysfs.c b/drivers/infiniband/core/sysfs.c
index b1f37d4..4d5d87a 100644
--- a/drivers/infiniband/core/sysfs.c
+++ b/drivers/infiniband/core/sysfs.c
@@ -37,12 +37,22 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
+struct ib_port;
+
+struct gid_attr_group {
+   struct ib_port  *port;
+   struct kobject  kobj;
+   struct attribute_group  ndev;
+   struct attribute_group  type;
+};
 struct ib_port {
struct kobject kobj;
struct ib_device  *ibdev;
+   struct gid_attr_group *gid_attr_group;
struct attribute_group gid_group;
struct attribute_group pkey_group;
u8 port_num;
@@ -84,6 +94,24 @@ static const struct sysfs_ops port_sysfs_ops = {
.show = port_attr_show
 };
 
+static ssize_t gid_attr_show(struct kobject *kobj,
+struct attribute *attr, char *buf)
+{
+   struct port_attribute *port_attr =
+   container_of(attr, struct port_attribute, attr);
+   struct ib_port *p = container_of(kobj, struct gid_attr_group,
+kobj)->port;
+
+   if (!port_attr->show)
+   return -EIO;
+
+   return port_attr->show(p, port_attr, buf);
+}
+
+static const struct sysfs_ops gid_attr_sysfs_ops = {
+   .show = gid_attr_show
+};
+
 static ssize_t state_show(struct ib_port *p, struct port_attribute *unused,
  char *buf)
 {
@@ -281,6 +309,46 @@ static struct attribute *port_default_attrs[] = {
NULL
 };
 
+static size_t print_ndev(struct ib_gid_attr *gid_attr, char *buf)
+{
+   if (!gid_attr->ndev)
+   return -EINVAL;
+
+   return sprintf(buf, "%s\n", gid_attr->ndev->name);
+}
+
+static size_t print_gid_type(struct ib_gid_attr *gid_attr, char *buf)
+{
+   return sprintf(buf, "%s\n", ib_cache_gid_type_str(gid_attr->gid_type));
+}
+
+static ssize_t _show_port_gid_attr(struct ib_port *p,
+  struct port_attribute *attr,
+  char *buf,
+  size_t (*print)(struct ib_gid_attr *gid_attr,
+  char *buf))
+{
+   struct port_table_attribute *tab_attr =
+   container_of(attr, struct port_table_attribute, attr);
+   union ib_gid gid;
+   struct ib_gid_attr gid_attr = {};
+   ssize_t ret;
+
+   ret = ib_query_gid(p->ibdev, p->port_num, tab_attr->index, ,
+  _attr);
+   if (ret)
+   goto err;
+
+   ret = print(_attr, buf);
+
+err:
+   if (gid_attr.ndev)
+   dev_put(gid_attr.ndev);
+   return ret;
+}
+
 static ssize_t show_port_gid(struct ib_port *p, struct port_attribute *attr,
 char *buf)
 {
@@ -296,6 +364,19 @@ static ssize_t show_port_gid(struct ib_port *p, struct 
port_attribute *attr,
return sprintf(buf, "%pI6\n", gid.raw);
 }
 
+static ssize_t show_port_gid_attr_ndev(struct ib_port *p,
+  struct port_attribute *attr, char *buf)
+{
+   return _show_port_gid_attr(p, attr, buf, print_ndev);
+}
+
+static ssize_t show_port_gid_attr_gid_type(struct ib_port *p,
+  struct port_attribute *attr,
+  char *buf)
+{

[PATCH for-next V2 08/11] IB/rdma_cm: Add wrapper for cma reference count

2015-12-03 Thread Matan Barak
Currently, cma users can't increase or decrease the cma device reference
count. This is necessary when setting cma attributes (like the
default GID type) in order to avoid use-after-free errors.
Add cma_ref_dev and cma_deref_dev APIs for this.
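
A hedged usage sketch (the configfs patch later in the series is the real
consumer; the field access is illustrative): an attribute writer pins the
device across the update,

	cma_ref_dev(cma_dev);
	cma_dev->default_gid_type[port - rdma_start_port(cma_dev->device)] =
		gid_type;
	cma_deref_dev(cma_dev);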

Signed-off-by: Matan Barak 
---
 drivers/infiniband/core/cma.c   | 11 +--
 drivers/infiniband/core/core_priv.h |  4 
 2 files changed, 13 insertions(+), 2 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index cf52b65..f78088a 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -60,6 +60,8 @@
 #include 
 #include 
 
+#include "core_priv.h"
+
 MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
@@ -185,6 +187,11 @@ enum {
CMA_OPTION_AFONLY,
 };
 
+void cma_ref_dev(struct cma_device *cma_dev)
+{
+	atomic_inc(&cma_dev->refcount);
+}
+
 /*
  * Device removal can occur at anytime, so we need extra handling to
  * serialize notifying the user of device removal with other callbacks.
@@ -339,7 +346,7 @@ static inline void cma_set_ip_ver(struct cma_hdr *hdr, u8 
ip_ver)
 static void cma_attach_to_dev(struct rdma_id_private *id_priv,
  struct cma_device *cma_dev)
 {
-	atomic_inc(&cma_dev->refcount);
+   cma_ref_dev(cma_dev);
id_priv->cma_dev = cma_dev;
id_priv->id.device = cma_dev->device;
id_priv->id.route.addr.dev_addr.transport =
@@ -347,7 +354,7 @@ static void cma_attach_to_dev(struct rdma_id_private 
*id_priv,
 	list_add_tail(&id_priv->list, &cma_dev->id_list);
 }
 
-static inline void cma_deref_dev(struct cma_device *cma_dev)
+void cma_deref_dev(struct cma_device *cma_dev)
 {
 	if (atomic_dec_and_test(&cma_dev->refcount))
 		complete(&cma_dev->comp);
diff --git a/drivers/infiniband/core/core_priv.h 
b/drivers/infiniband/core/core_priv.h
index 3b250a2..1945b4e 100644
--- a/drivers/infiniband/core/core_priv.h
+++ b/drivers/infiniband/core/core_priv.h
@@ -38,6 +38,10 @@
 
 #include 
 
+struct cma_device;
+void cma_ref_dev(struct cma_device *cma_dev);
+void cma_deref_dev(struct cma_device *cma_dev);
+
 int  ib_device_register_sysfs(struct ib_device *device,
  int (*port_callback)(struct ib_device *,
   u8, struct kobject *));
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V2 10/11] IB/core: Initialize UD header structure with IP and UDP headers

2015-12-03 Thread Matan Barak
From: Moni Shoua 

ib_ud_header_init() is used to format InfiniBand headers
in a buffer up to (but not including) the BTH. For RoCE UDP encapsulation
(RoCE v2) it is required that this function also be able to build IP and
UDP headers.
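
For example (a hedged sketch based on the ib_ud_ip4_csum() helper added
below), a provider building a RoCE v2 packet fills in the header fields
and then sets the IPv4 checksum:

	/* header is a struct ib_ud_header with its ip4 fields populated */
	header->ip4.check = ib_ud_ip4_csum(header);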

Signed-off-by: Moni Shoua 
Signed-off-by: Matan Barak 
---
 drivers/infiniband/core/ud_header.c| 155 ++---
 drivers/infiniband/hw/mlx4/qp.c|   7 +-
 drivers/infiniband/hw/mthca/mthca_qp.c |   2 +-
 include/rdma/ib_pack.h |  45 --
 4 files changed, 188 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/ud_header.c 
b/drivers/infiniband/core/ud_header.c
index 72feee6..96697e7 100644
--- a/drivers/infiniband/core/ud_header.c
+++ b/drivers/infiniband/core/ud_header.c
@@ -35,6 +35,7 @@
 #include 
 #include 
 #include 
+#include 
 
 #include 
 
@@ -116,6 +117,72 @@ static const struct ib_field vlan_table[]  = {
  .size_bits= 16 }
 };
 
+static const struct ib_field ip4_table[]  = {
+   { STRUCT_FIELD(ip4, ver),
+ .offset_words = 0,
+ .offset_bits  = 0,
+ .size_bits= 4 },
+   { STRUCT_FIELD(ip4, hdr_len),
+ .offset_words = 0,
+ .offset_bits  = 4,
+ .size_bits= 4 },
+   { STRUCT_FIELD(ip4, tos),
+ .offset_words = 0,
+ .offset_bits  = 8,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, tot_len),
+ .offset_words = 0,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, id),
+ .offset_words = 1,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, frag_off),
+ .offset_words = 1,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, ttl),
+ .offset_words = 2,
+ .offset_bits  = 0,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, protocol),
+ .offset_words = 2,
+ .offset_bits  = 8,
+ .size_bits= 8 },
+   { STRUCT_FIELD(ip4, check),
+ .offset_words = 2,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(ip4, saddr),
+ .offset_words = 3,
+ .offset_bits  = 0,
+ .size_bits= 32 },
+   { STRUCT_FIELD(ip4, daddr),
+ .offset_words = 4,
+ .offset_bits  = 0,
+ .size_bits= 32 }
+};
+
+static const struct ib_field udp_table[]  = {
+   { STRUCT_FIELD(udp, sport),
+ .offset_words = 0,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, dport),
+ .offset_words = 0,
+ .offset_bits  = 16,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, length),
+ .offset_words = 1,
+ .offset_bits  = 0,
+ .size_bits= 16 },
+   { STRUCT_FIELD(udp, csum),
+ .offset_words = 1,
+ .offset_bits  = 16,
+ .size_bits= 16 }
+};
+
 static const struct ib_field grh_table[]  = {
{ STRUCT_FIELD(grh, ip_version),
  .offset_words = 0,
@@ -213,26 +280,57 @@ static const struct ib_field deth_table[] = {
  .size_bits= 24 }
 };
 
+__be16 ib_ud_ip4_csum(struct ib_ud_header *header)
+{
+   struct iphdr iph;
+
+   iph.ihl = 5;
+   iph.version = 4;
+   iph.tos = header->ip4.tos;
+   iph.tot_len = header->ip4.tot_len;
+   iph.id  = header->ip4.id;
+   iph.frag_off= header->ip4.frag_off;
+   iph.ttl = header->ip4.ttl;
+   iph.protocol= header->ip4.protocol;
+   iph.check   = 0;
+   iph.saddr   = header->ip4.saddr;
+   iph.daddr   = header->ip4.daddr;
+
+	return ip_fast_csum((u8 *)&iph, iph.ihl);
+}
+EXPORT_SYMBOL(ib_ud_ip4_csum);
+
 /**
  * ib_ud_header_init - Initialize UD header structure
  * @payload_bytes:Length of packet payload
  * @lrh_present: specify if LRH is present
  * @eth_present: specify if Eth header is present
  * @vlan_present: packet is tagged vlan
- * @grh_present:GRH flag (if non-zero, GRH will be included)
+ * @grh_present: GRH flag (if non-zero, GRH will be included)
+ * @ip_version: if non-zero, IP header, V4 or V6, will be included
+ * @udp_present: if non-zero, UDP header will be included
  * @immediate_present: specify if immediate data is present
  * @header:Structure to initialize
  */
-void ib_ud_header_init(int payload_bytes,
-  int  lrh_present,
-  int  eth_present,
-  int  vlan_present,
-  int  grh_present,
-  int  immediate_present,
-  struct ib_ud_header *header)
+int ib_ud_header_init(int payload_bytes,
+ intlrh_present,
+ inteth_present,
+ 

[PATCH for-next V2 04/11] IB/core: Add ROCE_UDP_ENCAP (RoCE V2) type

2015-12-03 Thread Matan Barak
Adding RoCE v2 GID type and port type. Vendors
which support this type will get their GID table
populated with RoCE v2 GIDs automatically.
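
For illustration (a minimal sketch using the helpers this patch adds),
core code can now tell the two encapsulations apart per port:

	if (rdma_protocol_roce_udp_encap(device, port_num))
		pr_debug("port %u supports RoCE v2 (UDP encap)\n", port_num);
	else if (rdma_protocol_roce_eth_encap(device, port_num))
		pr_debug("port %u supports RoCE v1 only\n", port_num);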

Signed-off-by: Matan Barak 
---
 drivers/infiniband/core/cache.c |  1 +
 drivers/infiniband/core/roce_gid_mgmt.c |  3 ++-
 include/rdma/ib_verbs.h | 23 +--
 3 files changed, 24 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/core/cache.c b/drivers/infiniband/core/cache.c
index 566fd8f..88b4b6f 100644
--- a/drivers/infiniband/core/cache.c
+++ b/drivers/infiniband/core/cache.c
@@ -128,6 +128,7 @@ static void dispatch_gid_change_event(struct ib_device 
*ib_dev, u8 port)
 
 static const char * const gid_type_str[] = {
[IB_GID_TYPE_IB]= "IB/RoCE v1",
+   [IB_GID_TYPE_ROCE_UDP_ENCAP]= "RoCE v2",
 };
 
 const char *ib_cache_gid_type_str(enum ib_gid_type gid_type)
diff --git a/drivers/infiniband/core/roce_gid_mgmt.c 
b/drivers/infiniband/core/roce_gid_mgmt.c
index 61c27a7..1e3673f 100644
--- a/drivers/infiniband/core/roce_gid_mgmt.c
+++ b/drivers/infiniband/core/roce_gid_mgmt.c
@@ -71,7 +71,8 @@ static const struct {
bool (*is_supported)(const struct ib_device *device, u8 port_num);
enum ib_gid_type gid_type;
 } PORT_CAP_TO_GID_TYPE[] = {
-   {rdma_protocol_roce,   IB_GID_TYPE_ROCE},
+   {rdma_protocol_roce_eth_encap, IB_GID_TYPE_ROCE},
+   {rdma_protocol_roce_udp_encap, IB_GID_TYPE_ROCE_UDP_ENCAP},
 };
 
 #define CAP_TO_GID_TABLE_SIZE  ARRAY_SIZE(PORT_CAP_TO_GID_TYPE)
diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 2933aeb..87df931 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -71,6 +71,7 @@ enum ib_gid_type {
/* If link layer is Ethernet, this is RoCE V1 */
IB_GID_TYPE_IB= 0,
IB_GID_TYPE_ROCE  = 0,
+   IB_GID_TYPE_ROCE_UDP_ENCAP = 1,
IB_GID_TYPE_SIZE
 };
 
@@ -401,6 +402,7 @@ union rdma_protocol_stats {
 #define RDMA_CORE_CAP_PROT_IB   0x0010
 #define RDMA_CORE_CAP_PROT_ROCE 0x0020
 #define RDMA_CORE_CAP_PROT_IWARP0x0040
+#define RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP 0x0080
 
 #define RDMA_CORE_PORT_IBA_IB  (RDMA_CORE_CAP_PROT_IB  \
| RDMA_CORE_CAP_IB_MAD \
@@ -413,6 +415,12 @@ union rdma_protocol_stats {
| RDMA_CORE_CAP_IB_CM   \
| RDMA_CORE_CAP_AF_IB   \
| RDMA_CORE_CAP_ETH_AH)
+#define RDMA_CORE_PORT_IBA_ROCE_UDP_ENCAP  \
+   (RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP \
+   | RDMA_CORE_CAP_IB_MAD  \
+   | RDMA_CORE_CAP_IB_CM   \
+   | RDMA_CORE_CAP_AF_IB   \
+   | RDMA_CORE_CAP_ETH_AH)
 #define RDMA_CORE_PORT_IWARP   (RDMA_CORE_CAP_PROT_IWARP \
| RDMA_CORE_CAP_IW_CM)
 #define RDMA_CORE_PORT_INTEL_OPA   (RDMA_CORE_PORT_IBA_IB  \
@@ -1975,6 +1983,17 @@ static inline bool rdma_protocol_ib(const struct 
ib_device *device, u8 port_num)
 
 static inline bool rdma_protocol_roce(const struct ib_device *device, u8 
port_num)
 {
+   return device->port_immutable[port_num].core_cap_flags &
+   (RDMA_CORE_CAP_PROT_ROCE | RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP);
+}
+
+static inline bool rdma_protocol_roce_udp_encap(const struct ib_device 
*device, u8 port_num)
+{
+   return device->port_immutable[port_num].core_cap_flags & 
RDMA_CORE_CAP_PROT_ROCE_UDP_ENCAP;
+}
+
+static inline bool rdma_protocol_roce_eth_encap(const struct ib_device 
*device, u8 port_num)
+{
return device->port_immutable[port_num].core_cap_flags & 
RDMA_CORE_CAP_PROT_ROCE;
 }
 
@@ -1985,8 +2004,8 @@ static inline bool rdma_protocol_iwarp(const struct 
ib_device *device, u8 port_n
 
 static inline bool rdma_ib_or_roce(const struct ib_device *device, u8 port_num)
 {
-   return device->port_immutable[port_num].core_cap_flags &
-   (RDMA_CORE_CAP_PROT_IB | RDMA_CORE_CAP_PROT_ROCE);
+   return rdma_protocol_ib(device, port_num) ||
+   rdma_protocol_roce(device, port_num);
 }
 
 /**
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-03 Thread Christoph Hellwig
Bloating the WC with a field that's not really useful for the ULPs
seems pretty sad..
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/2] Handle mlx4 max_sge_rd correctly

2015-12-03 Thread Christoph Hellwig
On Tue, Nov 10, 2015 at 12:36:44PM +0200, Sagi Grimberg wrote:
> Any reply on this patchset?

Did we ever make progress on this?

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-03 Thread Jason Gunthorpe
On Thu, Dec 03, 2015 at 03:47:12PM +0200, Matan Barak wrote:
> From: Somnath Kotur 
> 
> Providers should tell IB core the wc's network type.
> This is used in order to search for the proper GID in the
> GID table. When using HCAs that can't provide this info,
> IB core tries to deep examine the packet and extract
> the GID type by itself.

Eh? A wc has a sgid_index, and in this brave new world a gid has the
network type. Why do we need to specify it again?

>   memset(ah_attr, 0, sizeof *ah_attr);
>   if (rdma_cap_eth_ah(device, port_num)) {
> + if (wc->wc_flags & IB_WC_WITH_NETWORK_HDR_TYPE)
> + net_type = wc->network_hdr_type;
> + else
> + net_type = ib_get_net_type_by_grh(device, port_num, 
> grh);
> + gid_type = ib_network_to_gid_type(net_type);

Like here for instance.

... and I keep saying this is all wrong, once you get into IP land
this entire process needs a route/neighbour lookup.


> -	ret = rdma_addr_find_dmac_by_grh(&grh->dgid, &grh->sgid,
> +	ret = rdma_addr_find_dmac_by_grh(&dgid, &sgid,
>ah_attr->dmac,
>wc->wc_flags & 
> IB_WC_WITH_VLAN ?
>NULL : _id,

ie no to this.

> + if (sgid_attr.gid_type == IB_GID_TYPE_ROCE_UDP_ENCAP)
> + /* TODO: get the hoplimit from the inet/inet6
> +  * device
> +  */

And no again, please fix this and all other missing route lookups
before sending another version.

> + struct {
> + /* The IB spec states that if it's IPv4, the header

rocev2 spec, surely

Jason
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH v2 0/2] Handle mlx4 max_sge_rd correctly

2015-12-03 Thread Sagi Grimberg



> Did we ever make progress on this?


Just up to Doug to pull it in.
--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V1 1/5] IB/mlx5: Add create_cq extended command

2015-12-03 Thread Matan Barak
In order to create a CQ that supports timestamp, mlx5 needs to
support the extended create CQ command with the timestamp flag.
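
A hedged sketch of the ULP side (the CQ depth and the handler arguments
are illustrative, caller-supplied values): requesting a timestamping CQ
through the in-kernel verbs looks like

	struct ib_cq_init_attr cq_attr = {
		.cqe	= 256,	/* illustrative depth */
		.flags	= IB_CQ_FLAGS_TIMESTAMP_COMPLETION,
	};
	struct ib_cq *cq = ib_create_cq(ibdev, comp_handler, event_handler,
					cq_context, &cq_attr);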

Signed-off-by: Matan Barak 
Reviewed-by: Eli Cohen 
---
 drivers/infiniband/hw/mlx5/cq.c | 7 +++
 1 file changed, 7 insertions(+)

diff --git a/drivers/infiniband/hw/mlx5/cq.c b/drivers/infiniband/hw/mlx5/cq.c
index 3ce5cfa7..a9a7921 100644
--- a/drivers/infiniband/hw/mlx5/cq.c
+++ b/drivers/infiniband/hw/mlx5/cq.c
@@ -760,6 +760,10 @@ static void destroy_cq_kernel(struct mlx5_ib_dev *dev, 
struct mlx5_ib_cq *cq)
mlx5_db_free(dev->mdev, >db);
 }
 
+enum {
+   CQ_CREATE_FLAGS_SUPPORTED = IB_CQ_FLAGS_TIMESTAMP_COMPLETION
+};
+
 struct ib_cq *mlx5_ib_create_cq(struct ib_device *ibdev,
const struct ib_cq_init_attr *attr,
struct ib_ucontext *context,
@@ -783,6 +787,9 @@ struct ib_cq *mlx5_ib_create_cq(struct ib_device *ibdev,
if (entries < 0)
return ERR_PTR(-EINVAL);
 
+   if (attr->flags & ~CQ_CREATE_FLAGS_SUPPORTED)
+   return ERR_PTR(-EOPNOTSUPP);
+
entries = roundup_pow_of_two(entries + 1);
if (entries > (1 << MLX5_CAP_GEN(dev->mdev, log_max_cq_sz)))
return ERR_PTR(-EINVAL);
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V1 0/5] User-space time-stamping support for mlx5_ib

2015-12-03 Thread Matan Barak
Hi Eli,

This patch-set adds user-space support for time-stamping in mlx5_ib.
It implements the necessary API:
(a) ib_create_cq_ex - Add support for CQ creation flags
(b) ib_query_device - return timestamp_mask and hca_core_clock.

We also add support for mmapping the HCA's free-running clock.
In order to do so, we use the response of the vendor's extended
part in init_ucontext. This allows us to pass the page offset
of the free-running clock register to the user-space driver.
In order to implement this in a future-extensible manner, we apply the
same verbs-extension mechanism to the mlx5 vendor part as well.

Regards,
Matan

Changes from v0:
 * Limit mmap PAGE_SIZE to 4K (security wise).
 * Optimize ib_is_udata_cleared.
 * Pass hca_core_clock_offset in the vendor's response part of init_ucontext.

Matan Barak (5):
  IB/mlx5: Add create_cq extended command
  IB/core: Add ib_is_udata_cleared
  IB/mlx5: Add support for hca_core_clock and timestamp_mask
  IB/mlx5: Add hca_core_clock_offset to udata in init_ucontext
  IB/mlx5: Mmap the HCA's core clock register to user-space

 drivers/infiniband/hw/mlx5/cq.c  |  7 
 drivers/infiniband/hw/mlx5/main.c| 67 +++-
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  7 +++-
 drivers/infiniband/hw/mlx5/user.h| 12 +--
 include/linux/mlx5/device.h  |  7 ++--
 include/linux/mlx5/mlx5_ifc.h|  9 +++--
 include/rdma/ib_verbs.h  | 67 
 7 files changed, 160 insertions(+), 16 deletions(-)

-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V1 3/5] IB/mlx5: Add support for hca_core_clock and timestamp_mask

2015-12-03 Thread Matan Barak
Report the hca_core_clock (in kHz) and the timestamp_mask in the
extended query_device verb. timestamp_mask is used by users in order
to know the valid range of the raw timestamps, while
hca_core_clock reports the clock frequency that is used for
timestamps.
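
A hedged arithmetic sketch of how the two values combine (assuming both
raw samples are taken within a single wrap of timestamp_mask):

	u64 delta_cycles = (end_ts - start_ts) & timestamp_mask;
	/* hca_core_clock is in kHz, so cycles * 10^6 / kHz gives ns */
	u64 delta_ns = delta_cycles * 1000000ULL / hca_core_clock;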

Signed-off-by: Matan Barak 
Reviewed-by: Moshe Lazer 
---
 drivers/infiniband/hw/mlx5/main.c | 2 ++
 include/linux/mlx5/mlx5_ifc.h | 9 ++---
 2 files changed, 8 insertions(+), 3 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 9b77058..8aa0330 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -504,6 +504,8 @@ static int mlx5_ib_query_device(struct ib_device *ibdev,
props->max_total_mcast_qp_attach = props->max_mcast_qp_attach *
   props->max_mcast_grp;
props->max_map_per_fmr = INT_MAX; /* no limit in ConnectIB */
+   props->hca_core_clock = MLX5_CAP_GEN(mdev, device_frequency_khz);
+	props->timestamp_mask = 0x7FFFFFFFFFFFFFFFULL;
 
 #ifdef CONFIG_INFINIBAND_ON_DEMAND_PAGING
if (MLX5_CAP_GEN(mdev, pg))
diff --git a/include/linux/mlx5/mlx5_ifc.h b/include/linux/mlx5/mlx5_ifc.h
index af51cd2..c57e975 100644
--- a/include/linux/mlx5/mlx5_ifc.h
+++ b/include/linux/mlx5/mlx5_ifc.h
@@ -792,15 +792,18 @@ struct mlx5_ifc_cmd_hca_cap_bits {
u8 reserved_63[0x8];
u8 log_uar_page_sz[0x10];
 
-   u8 reserved_64[0x100];
+   u8 reserved_64[0x20];
+   u8 device_frequency_mhz[0x20];
+   u8 device_frequency_khz[0x20];
+   u8 reserved_65[0xa0];
 
-   u8 reserved_65[0x1f];
+   u8 reserved_66[0x1f];
u8 cqe_zip[0x1];
 
u8 cqe_zip_timeout[0x10];
u8 cqe_zip_max_num[0x10];
 
-   u8 reserved_66[0x220];
+   u8 reserved_67[0x220];
 };
 
 enum {
-- 
2.1.0

--
To unsubscribe from this list: send the line "unsubscribe linux-rdma" in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


[PATCH for-next V1 4/5] IB/mlx5: Add hca_core_clock_offset to udata in init_ucontext

2015-12-03 Thread Matan Barak
Passing hca_core_clock_offset to user-space is mandatory in order to
let user-space read the free-running clock register from the
right offset in the memory-mapped page.
Passing this value is done by changing the vendor's init_ucontext
command and response to an extensible form.

Signed-off-by: Matan Barak 
Reviewed-By: Moshe Lazer 
---
 drivers/infiniband/hw/mlx5/main.c| 37 
 drivers/infiniband/hw/mlx5/mlx5_ib.h |  3 +++
 drivers/infiniband/hw/mlx5/user.h| 12 ++--
 include/linux/mlx5/device.h  |  7 +--
 4 files changed, 47 insertions(+), 12 deletions(-)

diff --git a/drivers/infiniband/hw/mlx5/main.c 
b/drivers/infiniband/hw/mlx5/main.c
index 8aa0330..e4ce010 100644
--- a/drivers/infiniband/hw/mlx5/main.c
+++ b/drivers/infiniband/hw/mlx5/main.c
@@ -796,8 +796,8 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
  struct ib_udata *udata)
 {
struct mlx5_ib_dev *dev = to_mdev(ibdev);
-   struct mlx5_ib_alloc_ucontext_req_v2 req;
-   struct mlx5_ib_alloc_ucontext_resp resp;
+   struct mlx5_ib_alloc_ucontext_req_v2 req = {};
+   struct mlx5_ib_alloc_ucontext_resp resp = {};
struct mlx5_ib_ucontext *context;
struct mlx5_uuar_info *uuari;
struct mlx5_uar *uars;
@@ -812,20 +812,19 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
if (!dev->ib_active)
return ERR_PTR(-EAGAIN);
 
-	memset(&req, 0, sizeof(req));
reqlen = udata->inlen - sizeof(struct ib_uverbs_cmd_hdr);
if (reqlen == sizeof(struct mlx5_ib_alloc_ucontext_req))
ver = 0;
-   else if (reqlen == sizeof(struct mlx5_ib_alloc_ucontext_req_v2))
+   else if (reqlen >= sizeof(struct mlx5_ib_alloc_ucontext_req_v2))
ver = 2;
else
return ERR_PTR(-EINVAL);
 
-	err = ib_copy_from_udata(&req, udata, reqlen);
+	err = ib_copy_from_udata(&req, udata, min(reqlen, sizeof(req)));
if (err)
return ERR_PTR(err);
 
-   if (req.flags || req.reserved)
+   if (req.flags)
return ERR_PTR(-EINVAL);
 
if (req.total_num_uuars > MLX5_MAX_UUARS)
@@ -834,6 +833,14 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
if (req.total_num_uuars == 0)
return ERR_PTR(-EINVAL);
 
+   if (req.comp_mask)
+   return ERR_PTR(-EOPNOTSUPP);
+
+   if (reqlen > sizeof(req) &&
+   !ib_is_udata_cleared(udata, '\0', sizeof(req),
+udata->inlen - sizeof(req)))
+   return ERR_PTR(-EOPNOTSUPP);
+
req.total_num_uuars = ALIGN(req.total_num_uuars,
MLX5_NON_FP_BF_REGS_PER_PAGE);
if (req.num_low_latency_uuars > req.total_num_uuars - 1)
@@ -849,6 +856,8 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
resp.max_send_wqebb = 1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz);
resp.max_recv_wr = 1 << MLX5_CAP_GEN(dev->mdev, log_max_qp_sz);
resp.max_srq_recv_wr = 1 << MLX5_CAP_GEN(dev->mdev, log_max_srq_sz);
+   resp.response_length = min(offsetof(typeof(resp), response_length) +
+  sizeof(resp.response_length), udata->outlen);
 
context = kzalloc(sizeof(*context), GFP_KERNEL);
if (!context)
@@ -899,8 +908,20 @@ static struct ib_ucontext *mlx5_ib_alloc_ucontext(struct 
ib_device *ibdev,
 
resp.tot_uuars = req.total_num_uuars;
resp.num_ports = MLX5_CAP_GEN(dev->mdev, num_ports);
-	err = ib_copy_to_udata(udata, &resp,
-			       sizeof(resp) - sizeof(resp.reserved));
+
+   if (field_avail(typeof(resp), reserved2, udata->outlen))
+   resp.response_length += sizeof(resp.reserved2);
+
+   if (field_avail(typeof(resp), hca_core_clock_offset, udata->outlen)) {
+   resp.comp_mask |=
+   MLX5_IB_ALLOC_UCONTEXT_RESP_MASK_CORE_CLOCK_OFFSET;
+   resp.hca_core_clock_offset =
+   offsetof(struct mlx5_init_seg, internal_timer_h) %
+   PAGE_SIZE;
+   resp.response_length += sizeof(resp.hca_core_clock_offset);
+   }
+
+	err = ib_copy_to_udata(udata, &resp, resp.response_length);
if (err)
goto out_uars;
 
diff --git a/drivers/infiniband/hw/mlx5/mlx5_ib.h 
b/drivers/infiniband/hw/mlx5/mlx5_ib.h
index b0deeb3..b2a6643 100644
--- a/drivers/infiniband/hw/mlx5/mlx5_ib.h
+++ b/drivers/infiniband/hw/mlx5/mlx5_ib.h
@@ -55,6 +55,9 @@ pr_err("%s:%s:%d:(pid %d): " format, (dev)->ib_dev.name, 
__func__,\
 pr_warn("%s:%s:%d:(pid %d): " format, (dev)->ib_dev.name, __func__,\
__LINE__, current->pid, ##arg)
 
+#define field_avail(type, fld, sz) (offsetof(type, fld) +  

[PATCH libmlx5 V1 6/6] Add always_inline check

2015-12-03 Thread Matan Barak
__attribute__((always_inline)) isn't supported by every compiler. Add a
check to configure.ac so that it is used only when available.
Also inline the other poll_one data-path functions in order to eliminate
"ifs".

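The src/mlx5.h side is presumably along these lines (a sketch; that hunk
isn't quoted below, only the configure.ac check and the cq.c users are):

	#ifdef HAVE_ALWAYS_INLINE
	#define ALWAYS_INLINE __attribute__((always_inline))
	#else
	#define ALWAYS_INLINE
	#endif
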
Signed-off-by: Matan Barak 
---
 configure.ac | 17 +
 src/cq.c | 42 +-
 src/mlx5.h   |  6 ++
 3 files changed, 52 insertions(+), 13 deletions(-)

diff --git a/configure.ac b/configure.ac
index fca0b46..50b4f9c 100644
--- a/configure.ac
+++ b/configure.ac
@@ -65,6 +65,23 @@ AC_CHECK_FUNC(ibv_read_sysfs_file, [],
 AC_MSG_ERROR([ibv_read_sysfs_file() not found.  libmlx5 requires 
libibverbs >= 1.0.3.]))
 AC_CHECK_FUNCS(ibv_dontfork_range ibv_dofork_range ibv_register_driver)
 
+AC_MSG_CHECKING("always inline")
+CFLAGS_BAK="$CFLAGS"
+CFLAGS="$CFLAGS -Werror"
+AC_COMPILE_IFELSE([AC_LANG_PROGRAM([[
+   static inline int f(void)
+   __attribute__((always_inline));
+   static inline int f(void)
+   {
+   return 1;
+   }
+]],[[
+   int a = f();
+   a = a;
+]])], [AC_MSG_RESULT([yes]) AC_DEFINE([HAVE_ALWAYS_INLINE], [1], [Define if 
__attribute((always_inline)).])],
+[AC_MSG_RESULT([no])])
+CFLAGS="$CFLAGS_BAK"
+
 dnl Now check if for libibverbs 1.0 vs 1.1
 dummy=if$$
 cat < $dummy.c
diff --git a/src/cq.c b/src/cq.c
index fcb4237..41751b7 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -218,6 +218,14 @@ static inline void handle_good_req_ex(struct ibv_wc_ex 
*wc_ex,
  uint64_t wc_flags_yes,
  uint64_t wc_flags_no,
  uint32_t qpn, uint64_t *wc_flags_out)
+   ALWAYS_INLINE;
+static inline void handle_good_req_ex(struct ibv_wc_ex *wc_ex,
+ union wc_buffer *pwc_buffer,
+ struct mlx5_cqe64 *cqe,
+ uint64_t wc_flags,
+ uint64_t wc_flags_yes,
+ uint64_t wc_flags_no,
+ uint32_t qpn, uint64_t *wc_flags_out)
 {
union wc_buffer wc_buffer = *pwc_buffer;
 
@@ -367,6 +375,14 @@ static inline int handle_responder_ex(struct ibv_wc_ex 
*wc_ex,
  uint64_t wc_flags, uint64_t wc_flags_yes,
  uint64_t wc_flags_no, uint32_t qpn,
  uint64_t *wc_flags_out)
+   ALWAYS_INLINE;
+static inline int handle_responder_ex(struct ibv_wc_ex *wc_ex,
+ union wc_buffer *pwc_buffer,
+ struct mlx5_cqe64 *cqe,
+ struct mlx5_qp *qp, struct mlx5_srq *srq,
+ uint64_t wc_flags, uint64_t wc_flags_yes,
+ uint64_t wc_flags_no, uint32_t qpn,
+ uint64_t *wc_flags_out)
 {
uint16_t wqe_ctr;
struct mlx5_wq *wq;
@@ -573,7 +589,7 @@ static void mlx5_get_cycles(uint64_t *cycles)
 static inline struct mlx5_qp *get_req_context(struct mlx5_context *mctx,
  struct mlx5_resource **cur_rsc,
  uint32_t rsn, int cqe_ver)
- __attribute__((always_inline));
+ ALWAYS_INLINE;
 static inline struct mlx5_qp *get_req_context(struct mlx5_context *mctx,
  struct mlx5_resource **cur_rsc,
  uint32_t rsn, int cqe_ver)
@@ -589,7 +605,7 @@ static inline int get_resp_cxt_v1(struct mlx5_context *mctx,
  struct mlx5_resource **cur_rsc,
  struct mlx5_srq **cur_srq,
  uint32_t uidx, int *is_srq)
- __attribute__((always_inline));
+ ALWAYS_INLINE;
 static inline int get_resp_cxt_v1(struct mlx5_context *mctx,
  struct mlx5_resource **cur_rsc,
  struct mlx5_srq **cur_srq,
@@ -625,7 +641,7 @@ static inline int get_resp_cxt_v1(struct mlx5_context *mctx,
 static inline int get_resp_ctx(struct mlx5_context *mctx,
   struct mlx5_resource **cur_rsc,
   uint32_t qpn)
-  __attribute__((always_inline));
+  ALWAYS_INLINE;
 static inline int get_resp_ctx(struct mlx5_context *mctx,
   struct mlx5_resource **cur_rsc,
   uint32_t qpn)
@@ -647,7 +663,7 @@ static inline int get_resp_ctx(struct mlx5_context *mctx,
 static inline int get_srq_ctx(struct mlx5_context *mctx,
 

[PATCH libmlx5 V1 4/6] Add ibv_query_values support

2015-12-03 Thread Matan Barak
In order to query the current HCA's core clock, libmlx5 should
support the ibv_query_values verb. Querying the hardware's cycles
register is done by mmapping this register to user-space.
Therefore, when libmlx5 initializes we mmap the cycles register.
This assumes the machine's architecture places the PCI and memory in
the same address space.
The page offset is passed through the init_context vendor data.
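
For illustration (a hedged sketch, not part of this patch): a raw
completion timestamp in core clock cycles could be converted to
nanoseconds using the hca_core_clock value, assuming it is reported
in kHz by the extended ibv_query_device_ex from the companion
libibverbs series:

	struct ibv_device_attr_ex attr;
	uint64_t ns;

	/* raw_cycles: timestamp taken from a work completion */
	if (!ibv_query_device_ex(ctx, NULL, &attr) && attr.hca_core_clock)
		ns = raw_cycles * 1000000ULL / attr.hca_core_clock;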

Signed-off-by: Matan Barak 
---
 src/mlx5-abi.h | 10 +-
 src/mlx5.c | 37 +
 src/mlx5.h | 10 +-
 src/verbs.c| 46 ++
 4 files changed, 101 insertions(+), 2 deletions(-)

diff --git a/src/mlx5-abi.h b/src/mlx5-abi.h
index 769ea81..43d4906 100644
--- a/src/mlx5-abi.h
+++ b/src/mlx5-abi.h
@@ -55,7 +55,11 @@ struct mlx5_alloc_ucontext {
__u32   total_num_uuars;
__u32   num_low_latency_uuars;
__u32   flags;
-   __u32   reserved;
+   __u32   comp_mask;
+};
+
+enum mlx5_ib_alloc_ucontext_resp_mask {
+   MLX5_IB_ALLOC_UCONTEXT_RESP_MASK_CORE_CLOCK_OFFSET = 1UL << 0,
 };
 
 struct mlx5_alloc_ucontext_resp {
@@ -72,6 +76,10 @@ struct mlx5_alloc_ucontext_resp {
__u16   num_ports;
__u8cqe_version;
__u8reserved;
+   __u32   comp_mask;
+   __u32   response_length;
+   __u32   reserved2;
+   __u64   hca_core_clock_offset;
 };
 
 struct mlx5_alloc_pd_resp {
diff --git a/src/mlx5.c b/src/mlx5.c
index 229d99d..c455c08 100644
--- a/src/mlx5.c
+++ b/src/mlx5.c
@@ -524,6 +524,30 @@ static int single_threaded_app(void)
return 0;
 }
 
+static int mlx5_map_internal_clock(struct mlx5_device *mdev,
+  struct ibv_context *ibv_ctx)
+{
+   struct mlx5_context *context = to_mctx(ibv_ctx);
+   void *hca_clock_page;
+   off_t offset = 0;
+
+   set_command(MLX5_MMAP_GET_CORE_CLOCK_CMD, &offset);
+   hca_clock_page = mmap(NULL, mdev->page_size,
+ PROT_READ, MAP_SHARED, ibv_ctx->cmd_fd,
+ mdev->page_size * offset);
+
+   if (hca_clock_page == MAP_FAILED) {
+   fprintf(stderr, PFX
+   "Warning: Timestamp available,\n"
+   "but failed to mmap() hca core clock page.\n");
+   return -1;
+   }
+
+   context->hca_core_clock = hca_clock_page +
+   (context->core_clock.offset & (mdev->page_size - 1));
+   return 0;
+}
+
 static int mlx5_init_context(struct verbs_device *vdev,
 struct ibv_context *ctx, int cmd_fd)
 {
@@ -647,6 +671,15 @@ static int mlx5_init_context(struct verbs_device *vdev,
context->bfs[j].uuarn = j;
}
 
+   context->hca_core_clock = NULL;
+   if (resp.response_length + sizeof(resp.ibv_resp) >=
+   offsetof(struct mlx5_alloc_ucontext_resp, hca_core_clock_offset) +
+   sizeof(resp.hca_core_clock_offset) &&
+   resp.comp_mask & MLX5_IB_ALLOC_UCONTEXT_RESP_MASK_CORE_CLOCK_OFFSET) {
+   context->core_clock.offset = resp.hca_core_clock_offset;
+   mlx5_map_internal_clock(mdev, ctx);
+   }
+
	mlx5_spinlock_init(&context->lock32);
 
context->prefer_bf = get_always_bf();
@@ -664,6 +697,7 @@ static int mlx5_init_context(struct verbs_device *vdev,
verbs_set_ctx_op(v_ctx, create_srq_ex, mlx5_create_srq_ex);
verbs_set_ctx_op(v_ctx, get_srq_num, mlx5_get_srq_num);
verbs_set_ctx_op(v_ctx, query_device_ex, mlx5_query_device_ex);
+   verbs_set_ctx_op(v_ctx, query_values, mlx5_query_values);
verbs_set_ctx_op(v_ctx, create_cq_ex, mlx5_create_cq_ex);
if (context->cqe_version && context->cqe_version == 1)
verbs_set_ctx_op(v_ctx, poll_cq_ex, mlx5_poll_cq_v1_ex);
@@ -697,6 +731,9 @@ static void mlx5_cleanup_context(struct verbs_device *device,
if (context->uar[i])
munmap(context->uar[i], page_size);
}
+   if (context->hca_core_clock)
+   munmap(context->hca_core_clock - context->core_clock.offset,
+  page_size);
close_debug_file(context);
 }
 
diff --git a/src/mlx5.h b/src/mlx5.h
index 0c0b027..b5bcfaa 100644
--- a/src/mlx5.h
+++ b/src/mlx5.h
@@ -117,7 +117,8 @@ enum {
 
 enum {
MLX5_MMAP_GET_REGULAR_PAGES_CMD= 0,
-   MLX5_MMAP_GET_CONTIGUOUS_PAGES_CMD = 1
+   MLX5_MMAP_GET_CONTIGUOUS_PAGES_CMD = 1,
+   MLX5_MMAP_GET_CORE_CLOCK_CMD= 5
 };
 
 #define MLX5_CQ_PREFIX "MLX_CQ"
@@ -307,6 +308,11 @@ struct mlx5_context {
	struct mlx5_spinlock	hugetlb_lock;

[PATCH libmlx5 V1 5/6] Optimize poll_cq

2015-12-03 Thread Matan Barak
The current ibv_poll_cq_ex mechanism needs to query every field
for its existence. In order to avoid this penalty at runtime,
add optimized functions for special cases.
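
For illustration: with the IS_IN_WC_FLAGS() helper added below, a
specialized poll_one variant can pass compile-time constant yes/no
masks, so once the function is inlined a test such as (a sketch):

	if (IS_IN_WC_FLAGS(IBV_WC_EX_WITH_COMPLETION_TIMESTAMP, /* yes */
			   0,                                   /* no */
			   wc_flags,
			   IBV_WC_EX_WITH_COMPLETION_TIMESTAMP))

folds to a constant and the runtime flag check disappears.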

Signed-off-by: Matan Barak 
---
 src/cq.c| 363 +---
 src/mlx5.h  |  10 ++
 src/verbs.c |   9 +-
 3 files changed, 310 insertions(+), 72 deletions(-)

diff --git a/src/cq.c b/src/cq.c
index 5e06990..fcb4237 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -41,6 +41,7 @@
 #include 
 #include 
 #include 
+#include 
 #include 
 
 #include 
@@ -207,73 +208,91 @@ union wc_buffer {
uint64_t*b64;
 };
 
+#define IS_IN_WC_FLAGS(yes, no, maybe, flag) (((yes) & (flag)) ||\
+ (!((no) & (flag)) && \
+  ((maybe) & (flag))))
 static inline void handle_good_req_ex(struct ibv_wc_ex *wc_ex,
  union wc_buffer *pwc_buffer,
  struct mlx5_cqe64 *cqe,
  uint64_t wc_flags,
- uint32_t qpn)
+ uint64_t wc_flags_yes,
+ uint64_t wc_flags_no,
+ uint32_t qpn, uint64_t *wc_flags_out)
 {
union wc_buffer wc_buffer = *pwc_buffer;
 
switch (ntohl(cqe->sop_drop_qpn) >> 24) {
case MLX5_OPCODE_RDMA_WRITE_IMM:
-   wc_ex->wc_flags |= IBV_WC_EX_IMM;
+   *wc_flags_out |= IBV_WC_EX_IMM;
case MLX5_OPCODE_RDMA_WRITE:
wc_ex->opcode= IBV_WC_RDMA_WRITE;
-   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_BYTE_LEN))
wc_buffer.b32++;
-   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_IMM))
wc_buffer.b32++;
break;
case MLX5_OPCODE_SEND_IMM:
-   wc_ex->wc_flags |= IBV_WC_EX_IMM;
+   *wc_flags_out |= IBV_WC_EX_IMM;
case MLX5_OPCODE_SEND:
case MLX5_OPCODE_SEND_INVAL:
wc_ex->opcode= IBV_WC_SEND;
-   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_BYTE_LEN))
wc_buffer.b32++;
-   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_IMM))
wc_buffer.b32++;
break;
case MLX5_OPCODE_RDMA_READ:
wc_ex->opcode= IBV_WC_RDMA_READ;
-   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) {
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_BYTE_LEN)) {
*wc_buffer.b32++ = ntohl(cqe->byte_cnt);
-   wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN;
+   *wc_flags_out |= IBV_WC_EX_WITH_BYTE_LEN;
}
-   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_IMM))
wc_buffer.b32++;
break;
case MLX5_OPCODE_ATOMIC_CS:
wc_ex->opcode= IBV_WC_COMP_SWAP;
-   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) {
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_BYTE_LEN)) {
*wc_buffer.b32++ = 8;
-   wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN;
+   *wc_flags_out |= IBV_WC_EX_WITH_BYTE_LEN;
}
-   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_IMM))
wc_buffer.b32++;
break;
case MLX5_OPCODE_ATOMIC_FA:
wc_ex->opcode= IBV_WC_FETCH_ADD;
-   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) {
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  IBV_WC_EX_WITH_BYTE_LEN)) {
*wc_buffer.b32++ = 8;
-   wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN;
+   *wc_flags_out |= IBV_WC_EX_WITH_BYTE_LEN;
}
-   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   if (IS_IN_WC_FLAGS(wc_flags_yes, wc_flags_no, wc_flags,
+  

[PATCH libmlx5 V1 3/6] Add ibv_create_cq_ex support

2015-12-03 Thread Matan Barak
In order to create a CQ which supports timestamps, the user needs
to specify the timestamp flag for ibv_create_cq_ex.
Add support for ibv_create_cq_ex in the mlx5 vendor library.
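
For illustration, a caller would request a timestamping CQ roughly as
follows (a hedged sketch built from the flags this patch accepts; the
ibv_create_cq_attr_ex fields come from the companion libibverbs
series):

	struct ibv_create_cq_attr_ex attr = {
		.cqe       = 256,
		.comp_mask = IBV_CREATE_CQ_ATTR_FLAGS,
		.flags     = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP,
		.wc_flags  = IBV_WC_STANDARD_FLAGS |
			     IBV_WC_EX_WITH_COMPLETION_TIMESTAMP,
	};
	struct ibv_cq *cq = ibv_create_cq_ex(ctx, &attr);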

Signed-off-by: Matan Barak 
---
 src/mlx5.c  |  1 +
 src/mlx5.h  |  2 ++
 src/verbs.c | 72 +
 3 files changed, 66 insertions(+), 9 deletions(-)

diff --git a/src/mlx5.c b/src/mlx5.c
index eac332b..229d99d 100644
--- a/src/mlx5.c
+++ b/src/mlx5.c
@@ -664,6 +664,7 @@ static int mlx5_init_context(struct verbs_device *vdev,
verbs_set_ctx_op(v_ctx, create_srq_ex, mlx5_create_srq_ex);
verbs_set_ctx_op(v_ctx, get_srq_num, mlx5_get_srq_num);
verbs_set_ctx_op(v_ctx, query_device_ex, mlx5_query_device_ex);
+   verbs_set_ctx_op(v_ctx, create_cq_ex, mlx5_create_cq_ex);
if (context->cqe_version && context->cqe_version == 1)
verbs_set_ctx_op(v_ctx, poll_cq_ex, mlx5_poll_cq_v1_ex);
else
diff --git a/src/mlx5.h b/src/mlx5.h
index 91aafbe..0c0b027 100644
--- a/src/mlx5.h
+++ b/src/mlx5.h
@@ -600,6 +600,8 @@ int mlx5_dereg_mr(struct ibv_mr *mr);
 struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe,
   struct ibv_comp_channel *channel,
   int comp_vector);
+struct ibv_cq *mlx5_create_cq_ex(struct ibv_context *context,
+struct ibv_create_cq_attr_ex *cq_attr);
 int mlx5_poll_cq_ex(struct ibv_cq *ibcq, struct ibv_wc_ex *wc,
struct ibv_poll_cq_ex_attr *attr);
 int mlx5_poll_cq_v1_ex(struct ibv_cq *ibcq, struct ibv_wc_ex *wc,
diff --git a/src/verbs.c b/src/verbs.c
index 92f273d..1dbee60 100644
--- a/src/verbs.c
+++ b/src/verbs.c
@@ -240,9 +240,21 @@ static int qp_sig_enabled(void)
return 0;
 }
 
-struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe,
- struct ibv_comp_channel *channel,
- int comp_vector)
+enum {
+   CREATE_CQ_SUPPORTED_WC_FLAGS = IBV_WC_STANDARD_FLAGS|
+  IBV_WC_EX_WITH_COMPLETION_TIMESTAMP
+};
+
+enum {
+   CREATE_CQ_SUPPORTED_COMP_MASK = IBV_CREATE_CQ_ATTR_FLAGS
+};
+
+enum {
+   CREATE_CQ_SUPPORTED_FLAGS = IBV_CREATE_CQ_ATTR_COMPLETION_TIMESTAMP
+};
+
+static struct ibv_cq *create_cq(struct ibv_context *context,
+   const struct ibv_create_cq_attr_ex *cq_attr)
 {
struct mlx5_create_cq   cmd;
struct mlx5_create_cq_resp  resp;
@@ -254,12 +266,33 @@ struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe,
FILE *fp = to_mctx(context)->dbg_fp;
 #endif
 
-   if (!cqe) {
-   mlx5_dbg(fp, MLX5_DBG_CQ, "\n");
+   if (!cq_attr->cqe) {
+   mlx5_dbg(fp, MLX5_DBG_CQ, "CQE invalid\n");
+   errno = EINVAL;
+   return NULL;
+   }
+
+   if (cq_attr->comp_mask & ~CREATE_CQ_SUPPORTED_COMP_MASK) {
+   mlx5_dbg(fp, MLX5_DBG_CQ,
+"Unsupported comp_mask for create_cq\n");
+   errno = EINVAL;
+   return NULL;
+   }
+
+   if (cq_attr->comp_mask & IBV_CREATE_CQ_ATTR_FLAGS &&
+   cq_attr->flags & ~CREATE_CQ_SUPPORTED_FLAGS) {
+   mlx5_dbg(fp, MLX5_DBG_CQ,
+"Unsupported creation flags requested for 
create_cq\n");
errno = EINVAL;
return NULL;
}
 
+   if (cq_attr->wc_flags & ~CREATE_CQ_SUPPORTED_WC_FLAGS) {
+   mlx5_dbg(fp, MLX5_DBG_CQ, "\n");
+   errno = ENOTSUP;
+   return NULL;
+   }
+
cq =  calloc(1, sizeof *cq);
if (!cq) {
mlx5_dbg(fp, MLX5_DBG_CQ, "\n");
@@ -273,14 +306,14 @@ struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe,
goto err;
 
/* The additional entry is required for resize CQ */
-   if (cqe <= 0) {
+   if (cq_attr->cqe <= 0) {
mlx5_dbg(fp, MLX5_DBG_CQ, "\n");
errno = EINVAL;
goto err_spl;
}
 
-   ncqe = align_queue_size(cqe + 1);
-   if ((ncqe > (1 << 24)) || (ncqe < (cqe + 1))) {
+   ncqe = align_queue_size(cq_attr->cqe + 1);
+   if ((ncqe > (1 << 24)) || (ncqe < (cq_attr->cqe + 1))) {
mlx5_dbg(fp, MLX5_DBG_CQ, "ncqe %d\n", ncqe);
errno = EINVAL;
goto err_spl;
@@ -313,7 +346,8 @@ struct ibv_cq *mlx5_create_cq(struct ibv_context *context, int cqe,
cmd.db_addr  = (uintptr_t) cq->dbrec;
cmd.cqe_size = cqe_sz;
 
-   ret = ibv_cmd_create_cq(context, ncqe - 1, channel, comp_vector,
+   ret = ibv_cmd_create_cq(context, ncqe - 1, cq_attr->channel,
+   cq_attr->comp_vector,
			&cq->ibv_cq, &cmd.ibv_cmd, sizeof cmd,
			&resp.ibv_resp, sizeof resp);

[PATCH libmlx5 V1 2/6] Add timestamp support for ibv_poll_cq_ex

2015-12-03 Thread Matan Barak
Add support for filling the timestamp field in ibv_poll_cq_ex
(if it's required by the user).

Signed-off-by: Matan Barak 
---
 src/cq.c | 5 +
 1 file changed, 5 insertions(+)

diff --git a/src/cq.c b/src/cq.c
index 0185696..5e06990 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -913,6 +913,11 @@ inline int mlx5_poll_one_ex(struct mlx5_cq *cq,
wc_ex->wc_flags = 0;
wc_ex->reserved = 0;
 
+   if (wc_flags & IBV_WC_EX_WITH_COMPLETION_TIMESTAMP) {
+   *wc_buffer.b64++ = ntohll(cqe64->timestamp);
+   wc_ex->wc_flags |= IBV_WC_EX_WITH_COMPLETION_TIMESTAMP;
+   }
+
switch (opcode) {
case MLX5_CQE_REQ:
err = mlx5_poll_one_cqe_req(cq, cur_rsc, cqe, qpn, cqe_ver,
-- 
2.1.0



[PATCH libmlx5 V1 0/6] Completion timestamping

2015-12-03 Thread Matan Barak
Hi Eli,

This series adds support for completion timestamp. In order to
support this feature, several extended verbs were implemented
(as instructed in libibverbs).

ibv_query_device_ex was extended to support reading the
hca_core_clock and timestamp mask.

The init_context verb vendor specific data was changed so
it'll conform to the verbs extensions form. This is done in
order to easily extend the response data for passing the page
offset of the free running clock register. This is mandatory
for mapping this register to the user space. This mapping
is done when libmlx5 initializes.

In order to support CQ completion timestamp reporting, we implement
the ibv_create_cq_ex verb. This verb is used both for creating a CQ
which supports timestamps and for stating which fields should be
returned via the WC. Returning this data is done by implementing
ibv_poll_cq_ex. We query the CQ's requested wc_flags for every field
the user has requested and populate it according to the carried
network operation and WC status.

Last but not least, ibv_poll_cq_ex was optimized in order to eliminate
the if statements and OR operations for common combinations of wc
fields. This is done by inlining and using a custom poll_one_ex
function for these fields.

This series depends on '[PATCH libibverbs 0/5] Completion timestamping'
and is rebased above '[PATCH libmlx5 v1 0/5] Support CQE

Thanks,
Matan

Changes from V0:
 * Use mlx5_init_context in order to pass hca_core_clock_offset.

Matan Barak (6):
  Add ibv_poll_cq_ex support
  Add timestamp support for ibv_poll_cq_ex
  Add ibv_create_cq_ex support
  Add ibv_query_values support
  Optimize poll_cq
  Add always_inline check

 configure.ac   |  17 +
 src/cq.c   | 959 -
 src/mlx5-abi.h |  10 +-
 src/mlx5.c |  43 +++
 src/mlx5.h |  42 ++-
 src/verbs.c| 115 ++-
 6 files changed, 1037 insertions(+), 149 deletions(-)

-- 
2.1.0



[PATCH libmlx5 V1 1/6] Add ibv_poll_cq_ex support

2015-12-03 Thread Matan Barak
Extended poll_cq supports writing only the user's required work
completion fields. Add support for this extended verb.
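
For illustration, polling with the extended verb looks roughly like
this (a hedged sketch; ibv_poll_cq_ex and its attr structure come from
the companion libibverbs series, and the max_entries field name is an
assumption):

	struct ibv_poll_cq_ex_attr attr = { .max_entries = 16 };
	struct ibv_wc_ex *wc_ex = wc_buf;	/* caller-provided buffer */
	int n = ibv_poll_cq_ex(cq, wc_ex, &attr);

Only the fields requested at CQ creation time are written for each
work completion, which is what keeps the fast path small.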

Signed-off-by: Matan Barak 
---
 src/cq.c   | 699 +
 src/mlx5.c |   5 +
 src/mlx5.h |  14 ++
 3 files changed, 584 insertions(+), 134 deletions(-)

diff --git a/src/cq.c b/src/cq.c
index 32f0dd4..0185696 100644
--- a/src/cq.c
+++ b/src/cq.c
@@ -200,6 +200,85 @@ static void handle_good_req(struct ibv_wc *wc, struct mlx5_cqe64 *cqe)
}
 }
 
+union wc_buffer {
+   uint8_t *b8;
+   uint16_t*b16;
+   uint32_t*b32;
+   uint64_t*b64;
+};
+
+static inline void handle_good_req_ex(struct ibv_wc_ex *wc_ex,
+ union wc_buffer *pwc_buffer,
+ struct mlx5_cqe64 *cqe,
+ uint64_t wc_flags,
+ uint32_t qpn)
+{
+   union wc_buffer wc_buffer = *pwc_buffer;
+
+   switch (ntohl(cqe->sop_drop_qpn) >> 24) {
+   case MLX5_OPCODE_RDMA_WRITE_IMM:
+   wc_ex->wc_flags |= IBV_WC_EX_IMM;
+   case MLX5_OPCODE_RDMA_WRITE:
+   wc_ex->opcode= IBV_WC_RDMA_WRITE;
+   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN)
+   wc_buffer.b32++;
+   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   wc_buffer.b32++;
+   break;
+   case MLX5_OPCODE_SEND_IMM:
+   wc_ex->wc_flags |= IBV_WC_EX_IMM;
+   case MLX5_OPCODE_SEND:
+   case MLX5_OPCODE_SEND_INVAL:
+   wc_ex->opcode= IBV_WC_SEND;
+   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN)
+   wc_buffer.b32++;
+   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   wc_buffer.b32++;
+   break;
+   case MLX5_OPCODE_RDMA_READ:
+   wc_ex->opcode= IBV_WC_RDMA_READ;
+   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) {
+   *wc_buffer.b32++ = ntohl(cqe->byte_cnt);
+   wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN;
+   }
+   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   wc_buffer.b32++;
+   break;
+   case MLX5_OPCODE_ATOMIC_CS:
+   wc_ex->opcode= IBV_WC_COMP_SWAP;
+   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) {
+   *wc_buffer.b32++ = 8;
+   wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN;
+   }
+   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   wc_buffer.b32++;
+   break;
+   case MLX5_OPCODE_ATOMIC_FA:
+   wc_ex->opcode= IBV_WC_FETCH_ADD;
+   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) {
+   *wc_buffer.b32++ = 8;
+   wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN;
+   }
+   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   wc_buffer.b32++;
+   break;
+   case MLX5_OPCODE_BIND_MW:
+   wc_ex->opcode= IBV_WC_BIND_MW;
+   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN)
+   wc_buffer.b32++;
+   if (wc_flags & IBV_WC_EX_WITH_IMM)
+   wc_buffer.b32++;
+   break;
+   }
+
+   if (wc_flags & IBV_WC_EX_WITH_QP_NUM) {
+   *wc_buffer.b32++ = qpn;
+   wc_ex->wc_flags |= IBV_WC_EX_WITH_QP_NUM;
+   }
+
+   *pwc_buffer = wc_buffer;
+}
+
 static int handle_responder(struct ibv_wc *wc, struct mlx5_cqe64 *cqe,
struct mlx5_qp *qp, struct mlx5_srq *srq)
 {
@@ -262,6 +341,103 @@ static int handle_responder(struct ibv_wc *wc, struct mlx5_cqe64 *cqe,
return IBV_WC_SUCCESS;
 }
 
+static inline int handle_responder_ex(struct ibv_wc_ex *wc_ex,
+ union wc_buffer *pwc_buffer,
+ struct mlx5_cqe64 *cqe,
+ struct mlx5_qp *qp, struct mlx5_srq *srq,
+ uint64_t wc_flags, uint32_t qpn)
+{
+   uint16_t wqe_ctr;
+   struct mlx5_wq *wq;
+   uint8_t g;
+   union wc_buffer wc_buffer = *pwc_buffer;
+   int err = 0;
+   uint32_t byte_len = ntohl(cqe->byte_cnt);
+
+   if (wc_flags & IBV_WC_EX_WITH_BYTE_LEN) {
+   *wc_buffer.b32++ = byte_len;
+   wc_ex->wc_flags |= IBV_WC_EX_WITH_BYTE_LEN;
+   }
+   if (srq) {
+   wqe_ctr = ntohs(cqe->wqe_counter);
+   wc_ex->wr_id = srq->wrid[wqe_ctr];
+   mlx5_free_srq_wqe(srq, wqe_ctr);
+   if (cqe->op_own & MLX5_INLINE_SCATTER_32)
+   err = mlx5_copy_to_recv_srq(srq, wqe_ctr, cqe,
+   byte_len);
+   else if (cqe->op_own & 

[PATCH for-next V1 2/5] IB/core: Add ib_is_udata_cleared

2015-12-03 Thread Matan Barak
Extending core and vendor verb commands requires us to check that the
unknown part of the user's given command is all zeros.
Add ib_is_udata_cleared in order to do so.
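
For illustration, a typical caller would look roughly like this (a
hedged sketch of an assumed caller, using the signature added below):

	/* Accept a command longer than this kernel knows only if the
	 * unknown trailing bytes are all zero. */
	if (udata->inlen > sizeof(cmd) &&
	    !ib_is_udata_cleared(udata, 0, sizeof(cmd),
				 udata->inlen - sizeof(cmd)))
		return -EOPNOTSUPP;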

Signed-off-by: Matan Barak 
Reviewed-by: Moshe Lazer 
---
 include/rdma/ib_verbs.h | 67 +
 1 file changed, 67 insertions(+)

diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
index 31fb409..0ad89e3 100644
--- a/include/rdma/ib_verbs.h
+++ b/include/rdma/ib_verbs.h
@@ -1947,6 +1947,73 @@ static inline int ib_copy_to_udata(struct ib_udata *udata, void *src, size_t len
return copy_to_user(udata->outbuf, src, len) ? -EFAULT : 0;
 }
 
+#define IB_UDATA_ELEMENT_CLEARED(type, ptr, len, expected) \
+   ({type v;   \
+ typeof(ptr) __ptr = ptr;  \
+   \
+ ptr = (void *)ptr + sizeof(type); \
+ len -= sizeof(type);  \
+ !copy_from_user(&v, __ptr, sizeof(v)) && (v == expected); })
+
+static inline bool ib_is_udata_cleared(struct ib_udata *udata,
+  u8 cleared_char,
+  size_t offset,
+  size_t len)
+{
+   const void __user *p = udata->inbuf + offset;
+#ifdef CONFIG_64BIT
+   u64 expected = cleared_char;
+#else
+   u32 expected = cleared_char;
+#endif
+
+   if (len > USHRT_MAX)
+   return false;
+
+   if (len && (uintptr_t)p & 1)
+   if (!IB_UDATA_ELEMENT_CLEARED(u8, p, len, expected))
+   return false;
+
+   expected = expected << 8 | expected;
+   if (len >= 2 && (uintptr_t)p & 2)
+   if (!IB_UDATA_ELEMENT_CLEARED(u16, p, len, expected))
+   return false;
+
+   expected = expected << 16 | expected;
+#ifdef CONFIG_64BIT
+   if (len >= 4 && (uintptr_t)p & 4)
+   if (!IB_UDATA_ELEMENT_CLEARED(u32, p, len, expected))
+   return false;
+
+   expected = expected << 32 | expected;
+#define IB_UDATA_CLEAR_LOOP_TYPE   u64
+#else
+#define IB_UDATA_CLEAR_LOOP_TYPE   u32
+#endif
+   while (len >= sizeof(IB_UDATA_CLEAR_LOOP_TYPE))
+   if (!IB_UDATA_ELEMENT_CLEARED(IB_UDATA_CLEAR_LOOP_TYPE, p, len,
+ expected))
+   return false;
+
+#ifdef CONFIG_64BIT
+   expected = expected >> 32;
+   if (len >= 4 && (uintptr_t)p & 4)
+   if (!IB_UDATA_ELEMENT_CLEARED(u32, p, len, expected))
+   return false;
+#endif
+   expected = expected >> 16;
+   if (len >= 2 && (uintptr_t)p & 2)
+   if (!IB_UDATA_ELEMENT_CLEARED(u16, p, len, expected))
+   return false;
+
+   expected = expected >> 8;
+   if (len)
+   if (!IB_UDATA_ELEMENT_CLEARED(u8, p, len, expected))
+   return false;
+
+   return true;
+}
+
 /**
  * ib_modify_qp_is_ok - Check that the supplied attribute mask
  * contains all required attributes and no attributes not allowed for
-- 
2.1.0



Re: [PATCH for-next V2 05/11] IB/core: Add rdma_network_type to wc

2015-12-03 Thread Matan Barak
On Thu, Dec 3, 2015 at 4:05 PM, Christoph Hellwig  wrote:
> Bloating the WC with a field that's not really useful for the ULPs
> seems pretty sad..

Network header type is mandatory in order to find the GID type and get
the GIDs correctly from the header.
I realize ULPs might have preferred to get the GID itself, but
resolving the GID costs time and most of the time you don't really
need that when you poll a CQ.
This could be refactored later to use wc_flags instead of a new field
when we approach the cache line limit.



[PATCH] staging/rdma/hfi1: fix pio progress routine race with allocator

2015-12-03 Thread Mike Marciniszyn
The allocation code assumes that the shadow ring cannot
be overrun because the credits will limit the allocation.

Unfortunately, the progress mechanism in sc_release_update() updates
the free count prior to processing the shadow ring, allowing the
shadow ring to be overrun by an allocation.
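
For illustration, the write barrier added below pairs with the
allocator roughly as follows (a sketch; the allocator side is assumed
and not part of this patch):

	/* releaser (sc_release_update)    allocator (assumed)
	 *   sc->sr_tail = tail;             free = sc->free;
	 *   smp_wmb();                      smp_rmb();
	 *   sc->free = free;                tail = sc->sr_tail;
	 *
	 * An allocator that observes the new free count therefore also
	 * observes the advanced shadow-ring tail, so it cannot allocate
	 * over a shadow entry the releaser has not yet processed.
	 */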

Reviewed-by: Mark Debbage 
Signed-off-by: Mike Marciniszyn 
---
 drivers/staging/rdma/hfi1/pio.c |9 ++---
 1 file changed, 6 insertions(+), 3 deletions(-)

diff --git a/drivers/staging/rdma/hfi1/pio.c b/drivers/staging/rdma/hfi1/pio.c
index eab58c1..8e10857 100644
--- a/drivers/staging/rdma/hfi1/pio.c
+++ b/drivers/staging/rdma/hfi1/pio.c
@@ -1565,6 +1565,7 @@ void sc_release_update(struct send_context *sc)
u64 hw_free;
u32 head, tail;
unsigned long old_free;
+   unsigned long free;
unsigned long extra;
unsigned long flags;
int code;
@@ -1579,7 +1580,7 @@ void sc_release_update(struct send_context *sc)
extra = (((hw_free & CR_COUNTER_SMASK) >> CR_COUNTER_SHIFT)
- (old_free & CR_COUNTER_MASK))
& CR_COUNTER_MASK;
-   sc->free = old_free + extra;
+   free = old_free + extra;
trace_hfi1_piofree(sc, extra);
 
/* call sent buffer callbacks */
@@ -1589,7 +1590,7 @@ void sc_release_update(struct send_context *sc)
while (head != tail) {
	pbuf = &sc->sr[tail].pbuf;
 
-   if (sent_before(sc->free, pbuf->sent_at)) {
+   if (sent_before(free, pbuf->sent_at)) {
/* not sent yet */
break;
}
@@ -1603,8 +1604,10 @@ void sc_release_update(struct send_context *sc)
if (tail >= sc->sr_size)
tail = 0;
}
-   /* update tail, in case we moved it */
sc->sr_tail = tail;
+   /* make sure tail is updated before free */
+   smp_wmb();
+   sc->free = free;
	spin_unlock_irqrestore(&sc->release_lock, flags);
sc_piobufavail(sc);
 }



Re: [PATCH 5/6] IB core: Fix ib_sg_to_pages()

2015-12-03 Thread Bart Van Assche
On 12/03/2015 01:18 AM, Christoph Hellwig wrote:
> The patch looks good to me, but while we touch this area, how about
> throwing in a few cosmetic fixes as well?
 
How about the patch below? In that version of the ib_sg_to_pages() fix
these concerns have been addressed, and additionally two more bugs have
been fixed.



[PATCH] IB core: Fix ib_sg_to_pages()

Fix the code for detecting gaps. A gap occurs not only if the
second or later scatterlist element is not aligned but also if
any scatterlist element other than the last does not end at a
page boundary.

In the code for coalescing contiguous elements, ensure that
mr->length is correct and that last_page_addr is up-to-date.

Ensure that this function returns a negative
error code instead of zero if the first set_page() call fails.
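
For illustration, with page_size = 4096 and hypothetical addresses:

	sg[0]: dma_addr = 0x10000, dma_len = 0x900 (ends at 0x10900,
	       i.e. in the middle of a page)
	sg[1]: dma_addr = 0x12000 (page aligned)

The old check only fired when an element's start was unaligned; here
page_addr == dma_addr for sg[1], so the gap went undetected and the
elements were mapped as if contiguous. With the additional
last_page_off != 0 test, the gap (0x10900 != 0x12000) is caught and
mapping stops at sg[1].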

Fixes: commit 4c67e2bfc8b7 ("IB/core: Introduce new fast registration API")
Reported-by: Christoph Hellwig 
---
 drivers/infiniband/core/verbs.c | 43 +
 1 file changed, 22 insertions(+), 21 deletions(-)

diff --git a/drivers/infiniband/core/verbs.c b/drivers/infiniband/core/verbs.c
index 043a60e..545906d 100644
--- a/drivers/infiniband/core/verbs.c
+++ b/drivers/infiniband/core/verbs.c
@@ -1516,7 +1516,7 @@ EXPORT_SYMBOL(ib_map_mr_sg);
  * @sg_nents:  number of entries in sg
  * @set_page:  driver page assignment function pointer
  *
- * Core service helper for drivers to covert the largest
+ * Core service helper for drivers to convert the largest
  * prefix of given sg list to a page vector. The sg list
  * prefix converted is the prefix that meet the requirements
  * of ib_map_mr_sg.
@@ -1533,7 +1533,7 @@ int ib_sg_to_pages(struct ib_mr *mr,
u64 last_end_dma_addr = 0, last_page_addr = 0;
unsigned int last_page_off = 0;
u64 page_mask = ~((u64)mr->page_size - 1);
-   int i;
+   int i, ret;
 
	mr->iova = sg_dma_address(&sgl[0]);
mr->length = 0;
@@ -1544,27 +1544,29 @@ int ib_sg_to_pages(struct ib_mr *mr,
u64 end_dma_addr = dma_addr + dma_len;
u64 page_addr = dma_addr & page_mask;
 
-   if (i && page_addr != dma_addr) {
-   if (last_end_dma_addr != dma_addr) {
-   /* gap */
-   goto done;
-
-   } else if (last_page_off + dma_len <= mr->page_size) {
-   /* chunk this fragment with the last */
-   mr->length += dma_len;
-   last_end_dma_addr += dma_len;
-   last_page_off += dma_len;
-   continue;
-   } else {
-   /* map starting from the next page */
-   page_addr = last_page_addr + mr->page_size;
-   dma_len -= mr->page_size - last_page_off;
-   }
+   /*
+* For the second and later elements, check whether either the
+* end of element i-1 or the start of element i is not aligned
+* on a page boundary.
+*/
+   if (i && (last_page_off != 0 || page_addr != dma_addr)) {
+   /* Stop mapping if there is a gap. */
+   if (last_end_dma_addr != dma_addr)
+   break;
+
+   /*
+* Coalesce this element with the last. If it is small
+* enough just update mr->length. Otherwise start
+* mapping from the next page.
+*/
+   goto next_page;
}
 
do {
-   if (unlikely(set_page(mr, page_addr)))
-   goto done;
+   ret = set_page(mr, page_addr);
+   if (unlikely(ret < 0))
+   return i ? : ret;
+next_page:
page_addr += mr->page_size;
} while (page_addr < end_dma_addr);
 
@@ -1574,7 +1576,6 @@ int ib_sg_to_pages(struct ib_mr *mr,
last_page_off = end_dma_addr & ~page_mask;
}
 
-done:
return i;
 }
 EXPORT_SYMBOL(ib_sg_to_pages);
-- 
2.1.4
