[PATCH v2] opensm: Clean up event subscriptions if a port goes away

2013-09-10 Thread Line Holen
Event subscriptions need to be cleaned up if a port goes away.
If the port comes online again later it may no longer want to
receive the events on the same QPN. If the old QPN is used for
something else the SM forwarding events may cause QKey violations.

This behavior is made configurable and it needs to be explicitly
enabled (default is off).

Signed-off-by: Line Holen line.ho...@oracle.com

---

diff --git a/include/opensm/osm_inform.h b/include/opensm/osm_inform.h
index f737441..8cefc20 100644
--- a/include/opensm/osm_inform.h
+++ b/include/opensm/osm_inform.h
@@ -2,6 +2,7 @@
  * Copyright (c) 2004-2009 Voltaire, Inc. All rights reserved.
  * Copyright (c) 2002-2005 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
+ * Copyright (c) 2013 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -200,6 +201,35 @@ void osm_infr_insert_to_db(IN osm_subn_t * p_subn, IN osm_log_t * p_log,
 void osm_infr_remove_from_db(IN osm_subn_t * p_subn, IN osm_log_t * p_log,
 IN osm_infr_t * p_infr);
 
+/****f* OpenSM: Inform Record/osm_infr_remove_subscriptions
+* NAME
+*  osm_infr_remove_subscriptions
+*
+* DESCRIPTION
+*  Remove all event subscriptions of a port
+*
+* SYNOPSIS
+*/
+ib_api_status_t
+osm_infr_remove_subscriptions(IN osm_subn_t * p_subn, IN osm_log_t * p_log,
+ IN ib_net64_t port_guid);
+/*
+* PARAMETERS
+*  p_subn
+*  [in] Pointer to the subnet object
+*
+*  p_log
+*  [in] Pointer to the log object
+*
+*  port_guid
+*  [in] PortGUID of the subscriber that should be removed
+*
+* RETURN
+*  CL_SUCCESS if port_guid had any subscriptions being removed
+*  CL_NOT_FOUND if port_guid did not have any active subscriptions
+* SEE ALSO
+*/
+
 /****f* OpenSM: Inform Record/osm_report_notice
 * NAME
 *  osm_report_notice
diff --git a/include/opensm/osm_subnet.h b/include/opensm/osm_subnet.h
index 463ffea..19f2079 100644
--- a/include/opensm/osm_subnet.h
+++ b/include/opensm/osm_subnet.h
@@ -5,6 +5,7 @@
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
  * Copyright (c) 2009 System Fabric Works, Inc. All rights reserved.
  * Copyright (c) 2009 HNR Consulting. All rights reserved.
+ * Copyright (c) 2013 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -325,6 +326,7 @@ typedef struct osm_subn_opt {
 	boolean_t daemon;
 	boolean_t sm_inactive;
 	boolean_t babbling_port_policy;
+	boolean_t drop_event_subscriptions;
 	boolean_t use_optimized_slvl;
 	osm_qos_options_t qos_options;
 	osm_qos_options_t qos_ca_options;
@@ -595,6 +597,9 @@ typedef struct osm_subn_opt {
 *  babbling_port_policy
 *  OpenSM will enforce its babbling port policy.
 *
+*  drop_event_subscriptions
+*  OpenSM will drop event subscriptions if the port goes away.
+*
 *  use_optimized_slvl
 *  Use optimized SLtoVLMappingTable programming if
 *  device indicates it supports this.
diff --git a/opensm/osm_drop_mgr.c b/opensm/osm_drop_mgr.c
index b309273..4276fe9 100644
--- a/opensm/osm_drop_mgr.c
+++ b/opensm/osm_drop_mgr.c
@@ -3,6 +3,7 @@
  * Copyright (c) 2002-2012 Mellanox Technologies LTD. All rights reserved.
  * Copyright (c) 1996-2003 Intel Corporation. All rights reserved.
  * Copyright (c) 2008 Xsigo Systems Inc.  All rights reserved.
+ * Copyright (c) 2013 Oracle and/or its affiliates. All rights reserved.
  *
  * This software is available to you under a choice of one of two
  * licenses.  You may choose to be licensed under the terms of the GNU
@@ -276,6 +277,15 @@ static void drop_mgr_remove_port(osm_sm_t * sm, IN osm_port_t * p_port)
 
 	drop_mgr_clean_physp(sm, p_port->p_physp);
 
+	/* Delete event forwarding subscriptions */
+	if (sm->p_subn->opt.drop_event_subscriptions) {
+		if (osm_infr_remove_subscriptions(sm->p_subn, sm->p_log, port_guid)
+		    == CL_SUCCESS)
+			OSM_LOG(sm->p_log, OSM_LOG_DEBUG,
+			    "Removed event subscriptions for port 0x%016" PRIx64 "\n",
+			    cl_ntoh64(port_guid));
+	}
+
 	/* initialize the p_node - may need to get node_desc later */
 	p_node = p_port->p_node;
 
diff --git a/opensm/osm_inform.c b/opensm/osm_inform.c
index 804c414..716a124 100644
--- a/opensm/osm_inform.c
+++ b/opensm/osm_inform.c
@@ -282,6 +282,37 @@ void osm_infr_remove_from_db(IN osm_subn_t * p_subn, IN osm_log_t * p_log,
 	OSM_LOG_EXIT(p_log);
 }
 
+ib_api_status_t osm_infr_remove_subscriptions(IN osm_subn_t * p_subn,
+ 

Re: [PATCH 5/6] staging/et131x: Use cached pci_dev->pcie_mpss and pcie_set_readrq() to simplify code

2013-09-10 Thread Mark Einon
On Mon, Sep 09, 2013 at 09:13:07PM +0800, Yijing Wang wrote:
> The PCI core caches the PCI-E Max Payload Size Supported in
> pci_dev->pcie_mpss, so use that instead of pcie_capability_read_dword().
> Also use pcie_set_readrq() instead of pcie_capability_clear_and_set_word()
> to simplify code.
>
> Signed-off-by: Yijing Wang wangyij...@huawei.com

Acked-by: Mark Einon mark.ei...@gmail.com
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [ceph-users] Help needed porting Ceph to RSockets

2013-09-10 Thread Andreas Bluemle
Hi,

after some more analysis and debugging, I found
workarounds for my problems; I have added these workarounds
to the last version of the patch for the poll problem by Sean;
see the attachment to this posting.

The shutdown() operations below are all SHUT_RDWR.

1. shutdown() on side A of a connection waits for close() on side B

   With rsockets, when a shutdown is done on side A of a socket
   connection, then the shutdown will only return after side B
   has done a close() on its end of the connection.

   This is different from TCP/IP sockets: there a shutdown will cause
   the other end to terminate the connection at the TCP level
   instantly. The socket changes state into CLOSE_WAIT, which indicates
   that the application level close is outstanding.

   In the attached patch, the workaround is in rs_poll_cq(),
   case RS_OP_CTRL, where for an RS_CTRL_DISCONNECT the rshutdown()
   is called on side B; this will cause the termination of the
   socket connection to be acknowledged to side A and the shutdown()
   there can now terminate.

2. double (multiple) shutdown on side A: delay on 2nd shutdown

   When an application does a shutdown() of side A and does a 2nd
   shutdown() shortly after (for whatever reason) then the
   return of the 2nd shutdown() is delayed by 2 seconds.

   The delay happens in rdma_disconnect(), when this is called
   from rshutdown() in the case that the rsocket state is
   rs_disconnected.

   Even if it could be considered as a bug if an application
   calls shutdown() twice on the same socket, it still
   does not make sense to delay that 2nd call to shutdown().

   To work around this, I have
   - introduced an additional rsocket state: rs_shutdown
   - switched to that new state in rshutdown() at the very end
     of the function.

   The first call to shutdown() will therefore switch to the new
   rsocket state rs_shutdown, and any further call to rshutdown()
   will not do anything any more, because every effect of rshutdown()
   will only happen if the rsocket state is either rs_connected or
   rs_disconnected. Hence it would be better to explicitly check
   the rsocket state at the beginning of the function and return
   immediately if the state is rs_shutdown.
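
   The state guard from item 2 could be sketched roughly as below.
   This is a stand-alone illustration only, not the librdmacm code:
   the names sketch_state, sketch_socket, sketch_shutdown and
   slow_disconnect are hypothetical, and the real rsocket state
   values and the body of rshutdown() differ.

```c
/* Hypothetical stand-ins for the rsocket states named in this mail. */
enum sketch_state {
	SK_CONNECTED,
	SK_DISCONNECTED,
	SK_SHUTDOWN	/* the additional state proposed above */
};

struct sketch_socket {
	enum sketch_state state;
	int disconnect_calls;	/* counts entries into the slow path */
};

/* Placeholder for the expensive part, e.g. the rdma_disconnect() call. */
static void slow_disconnect(struct sketch_socket *s)
{
	s->disconnect_calls++;
}

/* Always returns 0; a repeated call becomes a cheap no-op. */
static int sketch_shutdown(struct sketch_socket *s)
{
	/* Explicit early check, as suggested at the end of item 2. */
	if (s->state == SK_SHUTDOWN)
		return 0;

	if (s->state == SK_CONNECTED || s->state == SK_DISCONNECTED)
		slow_disconnect(s);

	/* Switch to the new state at the very end of the function. */
	s->state = SK_SHUTDOWN;
	return 0;
}
```

   Calling sketch_shutdown() twice on the same object enters the slow
   path only once; the second call returns right away.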

Since I have added these workarounds to my version of the librdmacm
library, I can at least start up ceph using LD_PRELOAD and end up in
a healthy ceph cluster state.

I would not call these workarounds a real fix, but they should point
out the problems which I am trying to solve.


Regards

Andreas Bluemle




On Fri, 23 Aug 2013 00:35:22 +
Hefty, Sean sean.he...@intel.com wrote:

> > I tested out the patch and unfortunately had the same results as
> > Andreas. About 50% of the time the rpoll() thread in Ceph still
> > hangs when rshutdown() is called. I saw a similar behaviour when
> > increasing the poll time on the pre-patched version if that's of
> > any relevance.
>
> I'm not optimistic, but here's an updated patch.  I attempted to
> handle more shutdown conditions, but I can't say that any of those
> would prevent the hang that you see.
>
> I have a couple of questions:
>
> Is there any chance that the code would call rclose while rpoll
> is still running?  Also, can you verify that the thread is in the
> real poll() call when the hang occurs?
>
> Signed-off-by: Sean Hefty sean.he...@intel.com
> ---
>  src/rsocket.c |   35 +--
>  1 files changed, 25 insertions(+), 10 deletions(-)
>
> diff --git a/src/rsocket.c b/src/rsocket.c
> index d544dd0..f94ddf3 100644
> --- a/src/rsocket.c
> +++ b/src/rsocket.c
> @@ -1822,7 +1822,12 @@ static int rs_poll_cq(struct rsocket *rs)
>  					rs->state = rs_disconnected;
>  					return 0;
>  				} else if (rs_msg_data(msg) == RS_CTRL_SHUTDOWN) {
> -					rs->state &= ~rs_readable;
> +					if (rs->state & rs_writable) {
> +						rs->state &= ~rs_readable;
> +					} else {
> +						rs->state = rs_disconnected;
> +						return 0;
> +					}
>  				}
>  				break;
>  			case RS_OP_WRITE:
> @@ -2948,10 +2953,12 @@ static int rs_poll_events(struct pollfd *rfds, struct pollfd *fds, nfds_t nfds)
>  		rs = idm_lookup(&idm, fds[i].fd);
>  		if (rs) {
> +			fastlock_acquire(&rs->cq_wait_lock);
>  			if (rs->type == SOCK_STREAM)
>  				rs_get_cq_event(rs);
>  			else
>  				ds_get_cq_event(rs);
> +			fastlock_release(&rs->cq_wait_lock);
>  			fds[i].revents = rs_poll_rs(rs, fds[i].events, 1, rs_poll_all);
>  		} else {
   

[PATCH V4 8/9] IB/core: Add RoCE IP based addressing extensions for rdma_ucm

2013-09-10 Thread Or Gerlitz
From: Matan Barak mat...@mellanox.com

Add rdma_ucm support for RoCE (IBoE) IP based addressing extensions
towards librdmacm

Extend INIT_QP_ATTR and QUERY_ROUTE ucma commands.

INIT_QP_ATTR_EX uses struct ib_uverbs_qp_attr_ex

QUERY_ROUTE_EX uses struct rdma_ucm_query_route_resp_ex which in turn
uses ib_user_path_rec_ex

Signed-off-by: Matan Barak mat...@mellanox.com
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/core/ucma.c   |  175 -
 include/uapi/rdma/rdma_user_cm.h |   29 ++-
 2 files changed, 198 insertions(+), 6 deletions(-)

diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index 7e7da86..4d59e88 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -652,6 +652,35 @@ static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
 	}
 }
 
+static void ucma_copy_ib_route_ex(struct rdma_ucm_query_route_resp_ex *resp,
+				  struct rdma_route *route)
+{
+	struct rdma_dev_addr *dev_addr;
+
+	resp->num_paths = route->num_paths;
+	switch (route->num_paths) {
+	case 0:
+		dev_addr = &route->addr.dev_addr;
+		rdma_addr_get_dgid(dev_addr,
+				   (union ib_gid *)&resp->ib_route[0].dgid);
+		rdma_addr_get_sgid(dev_addr,
+				   (union ib_gid *)&resp->ib_route[0].sgid);
+		resp->ib_route[0].pkey =
+			cpu_to_be16(ib_addr_get_pkey(dev_addr));
+		break;
+	case 2:
+		ib_copy_path_rec_to_user_ex(&resp->ib_route[1],
+					    &route->path_rec[1]);
+		/* fall through */
+	case 1:
+		ib_copy_path_rec_to_user_ex(&resp->ib_route[0],
+					    &route->path_rec[0]);
+		break;
+	default:
+		break;
+	}
+}
+
 static void ucma_copy_iboe_route(struct rdma_ucm_query_route_resp *resp,
 				 struct rdma_route *route)
 {
@@ -678,14 +707,39 @@ static void ucma_copy_iboe_route(struct rdma_ucm_query_route_resp *resp,
 	}
 }
 
-static void ucma_copy_iw_route(struct rdma_ucm_query_route_resp *resp,
+static void ucma_copy_iboe_route_ex(struct rdma_ucm_query_route_resp_ex *resp,
+				    struct rdma_route *route)
+{
+	resp->num_paths = route->num_paths;
+	switch (route->num_paths) {
+	case 0:
+		rdma_ip2gid((struct sockaddr *)&route->addr.dst_addr,
+			    (union ib_gid *)&resp->ib_route[0].dgid);
+		rdma_ip2gid((struct sockaddr *)&route->addr.src_addr,
+			    (union ib_gid *)&resp->ib_route[0].sgid);
+		resp->ib_route[0].pkey = cpu_to_be16(0xffff);
+		break;
+	case 2:
+		ib_copy_path_rec_to_user_ex(&resp->ib_route[1],
+					    &route->path_rec[1]);
+		/* fall through */
+	case 1:
+		ib_copy_path_rec_to_user_ex(&resp->ib_route[0],
+					    &route->path_rec[0]);
+		break;
+	default:
+		break;
+	}
+}
+
+static void ucma_copy_iw_route(struct ib_user_path_rec *resp_path,
 			       struct rdma_route *route)
 {
 	struct rdma_dev_addr *dev_addr;
 
 	dev_addr = &route->addr.dev_addr;
-	rdma_addr_get_dgid(dev_addr, (union ib_gid *) &resp->ib_route[0].dgid);
-	rdma_addr_get_sgid(dev_addr, (union ib_gid *) &resp->ib_route[0].sgid);
+	rdma_addr_get_dgid(dev_addr, (union ib_gid *)&resp_path->dgid);
+	rdma_addr_get_sgid(dev_addr, (union ib_gid *)&resp_path->sgid);
 }
 
 static ssize_t ucma_query_route(struct ucma_file *file,
@@ -737,7 +791,74 @@ static ssize_t ucma_query_route(struct ucma_file *file,
 		}
 		break;
 	case RDMA_TRANSPORT_IWARP:
-		ucma_copy_iw_route(&resp, &ctx->cm_id->route);
+		ucma_copy_iw_route(&resp.ib_route[0], &ctx->cm_id->route);
+		break;
+	default:
+		break;
+	}
+
+out:
+	if (copy_to_user((void __user *)(unsigned long)cmd.response,
+			 &resp, sizeof(resp)))
+		ret = -EFAULT;
+
+	ucma_put_ctx(ctx);
+	return ret;
+}
+
+static ssize_t ucma_query_route_ex(struct ucma_file *file,
+				   const char __user *inbuf,
+				   int in_len, int out_len)
+{
+	struct rdma_ucm_query_route_ex cmd;
+	struct rdma_ucm_query_route_resp_ex resp;
+	struct ucma_context *ctx;
+	struct sockaddr *addr;
+	int ret = 0;
+
+	if (out_len < sizeof(resp))
+		return -ENOSPC;
+
+	if (copy_from_user(&cmd, inbuf, sizeof(cmd)))
+		return -EFAULT;
+
+	ctx = ucma_get_ctx(file, cmd.id);
+   if 

[PATCH V4 6/9] IB/ocrdma: Handle Ethernet L2 parameters for IP based GID addressing

2013-09-10 Thread Or Gerlitz
From: Moni Shoua mo...@mellanox.com

This patch is similar in spirit to the "IB/mlx4: Handle Ethernet L2
parameters for IP based GID addressing" patch from this series. It
handles the fact that IP based RoCE gids don't store the Ethernet L2
parameters, MAC and VLAN.

When building an address handle, instead of parsing the dgid to
get the MAC and VLAN, take them from the address handle attributes.
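
As a rough user-space illustration of the destination MAC selection
described above (not the driver code): the struct and function names
below are hypothetical, and the multicast branch assumes the usual IPv6
multicast to Ethernet mapping, 33:33 followed by the low four bytes of
the group address.

```c
#include <stdint.h>
#include <string.h>

#define SKETCH_ETH_ALEN 6

/* Hypothetical subset of the address handle attributes. */
struct sketch_ah_attr {
	uint8_t dgid[16];		/* destination GID, raw bytes */
	uint8_t dmac[SKETCH_ETH_ALEN];	/* destination MAC supplied by the caller */
};

/* Fill mac[] with the Ethernet destination address for this handle. */
static void sketch_resolve_dmac(const struct sketch_ah_attr *attr,
				uint8_t *mac)
{
	if (attr->dgid[0] == 0xff) {
		/* Multicast GID: derive the MAC from the group address. */
		mac[0] = 0x33;
		mac[1] = 0x33;
		memcpy(&mac[2], &attr->dgid[12], 4);
	} else {
		/* Unicast: the GID no longer encodes the MAC, so copy it. */
		memcpy(mac, attr->dmac, SKETCH_ETH_ALEN);
	}
}
```

For a unicast GID the result is simply the dmac carried in the
attributes; nothing is parsed out of the GID any more.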

Cc: Naresh Gottumukkala bgottumukk...@emulex.com
Signed-off-by: Moni Shoua mo...@mellanox.com
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/hw/ocrdma/ocrdma.h|   12 
 drivers/infiniband/hw/ocrdma/ocrdma_ah.c |5 +++--
 drivers/infiniband/hw/ocrdma/ocrdma_hw.c |   21 ++---
 drivers/infiniband/hw/ocrdma/ocrdma_hw.h |1 -
 4 files changed, 17 insertions(+), 22 deletions(-)

diff --git a/drivers/infiniband/hw/ocrdma/ocrdma.h b/drivers/infiniband/hw/ocrdma/ocrdma.h
index adc11d1..d544292 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma.h
@@ -422,5 +422,17 @@ static inline int is_cqe_wr_imm(struct ocrdma_cqe *cqe)
 			OCRDMA_CQE_WRITE_IMM) ? 1 : 0;
 }
 
+static inline int ocrdma_resolve_dmac(struct ocrdma_dev *dev,
+		struct ib_ah_attr *ah_attr, u8 *mac_addr)
+{
+	struct in6_addr in6;
+
+	memcpy(&in6, ah_attr->grh.dgid.raw, sizeof(in6));
+	if (rdma_is_multicast_addr(&in6))
+		rdma_get_mcast_mac(&in6, mac_addr);
+	else
+		memcpy(mac_addr, ah_attr->dmac, ETH_ALEN);
+	return 0;
+}
 
 #endif
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
index ee499d9..bbb7962 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_ah.c
@@ -49,7 +49,7 @@ static inline int set_av_attr(struct ocrdma_dev *dev, struct ocrdma_ah *ah,
 
 	ah->sgid_index = attr->grh.sgid_index;
 
-	vlan_tag = rdma_get_vlan_id(&attr->grh.dgid);
+	vlan_tag = attr->vlan_id;
 	if (!vlan_tag || (vlan_tag > 0xFFF))
 		vlan_tag = dev->pvid;
 	if (vlan_tag && (vlan_tag < 0x1000)) {
@@ -64,7 +64,8 @@ static inline int set_av_attr(struct ocrdma_dev *dev, struct ocrdma_ah *ah,
 		eth_sz = sizeof(struct ocrdma_eth_basic);
 	}
 	memcpy(&eth.smac[0], &dev->nic_info.mac_addr[0], ETH_ALEN);
-	status = ocrdma_resolve_dgid(dev, &attr->grh.dgid, &eth.dmac[0]);
+	memcpy(&eth.dmac[0], attr->dmac, ETH_ALEN);
+	status = ocrdma_resolve_dmac(dev, attr, &eth.dmac[0]);
 	if (status)
 		return status;
 	status = ocrdma_query_gid(&dev->ibdev, 1, attr->grh.sgid_index,
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
index 4ed8235..69c82b1 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.c
@@ -2076,23 +2076,6 @@ mbx_err:
 	return status;
 }
 
-int ocrdma_resolve_dgid(struct ocrdma_dev *dev, union ib_gid *dgid,
-			u8 *mac_addr)
-{
-	struct in6_addr in6;
-
-	memcpy(&in6, dgid, sizeof in6);
-	if (rdma_is_multicast_addr(&in6)) {
-		rdma_get_mcast_mac(&in6, mac_addr);
-	} else if (rdma_link_local_addr(&in6)) {
-		rdma_get_ll_mac(&in6, mac_addr);
-	} else {
-		pr_err("%s() fail to resolve mac_addr.\n", __func__);
-		return -EINVAL;
-	}
-	return 0;
-}
-
 static int ocrdma_set_av_params(struct ocrdma_qp *qp,
 				struct ocrdma_modify_qp *cmd,
 				struct ib_qp_attr *attrs)
@@ -2126,14 +2109,14 @@ static int ocrdma_set_av_params(struct ocrdma_qp *qp,
 
 	qp->sgid_idx = ah_attr->grh.sgid_index;
 	memcpy(&cmd->params.sgid[0], &sgid.raw[0], sizeof(cmd->params.sgid));
-	ocrdma_resolve_dgid(qp->dev, &ah_attr->grh.dgid, &mac_addr[0]);
+	ocrdma_resolve_dmac(qp->dev, ah_attr, &mac_addr[0]);
 	cmd->params.dmac_b0_to_b3 = mac_addr[0] | (mac_addr[1] << 8) |
 				(mac_addr[2] << 16) | (mac_addr[3] << 24);
 	/* convert them to LE format. */
 	ocrdma_cpu_to_le32(&cmd->params.dgid[0], sizeof(cmd->params.dgid));
 	ocrdma_cpu_to_le32(&cmd->params.sgid[0], sizeof(cmd->params.sgid));
 	cmd->params.vlan_dmac_b4_to_b5 = mac_addr[4] | (mac_addr[5] << 8);
-	vlan_id = rdma_get_vlan_id(&sgid);
+	vlan_id = ah_attr->vlan_id;
 	if (vlan_id && (vlan_id < 0x1000)) {
 		cmd->params.vlan_dmac_b4_to_b5 |=
 		    vlan_id << OCRDMA_QP_PARAMS_VLAN_SHIFT;
diff --git a/drivers/infiniband/hw/ocrdma/ocrdma_hw.h b/drivers/infiniband/hw/ocrdma/ocrdma_hw.h
index f2a89d4..82fe332 100644
--- a/drivers/infiniband/hw/ocrdma/ocrdma_hw.h
+++ b/drivers/infiniband/hw/ocrdma/ocrdma_hw.h
@@ -94,7 +94,6 @@ void ocrdma_ring_cq_db(struct ocrdma_dev *, u16 cq_id, bool armed,
 int ocrdma_mbx_get_link_speed(struct ocrdma_dev *dev, u8 *lnk_speed);
 int 

[PATCH V4 3/9] IB/mlx4: Use RoCE IP based GIDs in the port GID table

2013-09-10 Thread Or Gerlitz
From: Moni Shoua mo...@mellanox.com

Currently, the mlx4 driver sets RoCE (IBoE) gids to encode the related
Ethernet netdevice interface MAC address and possibly VLAN id.

Change this scheme such that gids encode interface IP addresses
(both IPv4 and IPv6).

This requires learning which IP addresses are in use by a netdevice
associated with the HCA port, formatting them to gids
and adding them to the port gid table. Further, events of add and
delete address are caught to maintain the gid table accordingly.

Associated IP addresses may belong to a master of an Ethernet netdevice
on top of that port so this should be considered when building and
maintaining the gid table.
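
As a rough user-space sketch of "formatting them to gids" (not the
driver code): it assumes that an IPv4 address is carried as an
IPv4-mapped IPv6 address (::ffff:a.b.c.d) and that an IPv6 address is
copied unchanged. Both the mapping and the names sketch_gid and
sketch_ip_to_gid are assumptions made for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Hypothetical 16-byte GID container for this sketch. */
struct sketch_gid {
	uint8_t raw[16];
};

/* Returns 0 on success, -1 for an unsupported address family. */
static int sketch_ip_to_gid(const struct sockaddr *addr,
			    struct sketch_gid *gid)
{
	switch (addr->sa_family) {
	case AF_INET: {
		const struct sockaddr_in *in4 =
			(const struct sockaddr_in *)addr;

		/* Assumed layout: 80 zero bits, 16 one bits, IPv4 address. */
		memset(gid->raw, 0, 10);
		gid->raw[10] = 0xff;
		gid->raw[11] = 0xff;
		/* s_addr is already in network byte order. */
		memcpy(&gid->raw[12], &in4->sin_addr.s_addr, 4);
		return 0;
	}
	case AF_INET6: {
		const struct sockaddr_in6 *in6 =
			(const struct sockaddr_in6 *)addr;

		memcpy(gid->raw, in6->sin6_addr.s6_addr, 16);
		return 0;
	}
	default:
		return -1;
	}
}
```

With this layout, one gid table entry per interface address is enough
for both address families.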

Signed-off-by: Moni Shoua mo...@mellanox.com
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/hw/mlx4/main.c|  474 --
 drivers/infiniband/hw/mlx4/mlx4_ib.h |3 +
 2 files changed, 334 insertions(+), 143 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index d6c5a73..7a29ad5 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -39,6 +39,8 @@
 #include <linux/inetdevice.h>
 #include <linux/rtnetlink.h>
 #include <linux/if_vlan.h>
+#include <net/ipv6.h>
+#include <net/addrconf.h>
 
 #include <rdma/ib_smi.h>
 #include <rdma/ib_user_verbs.h>
@@ -790,7 +792,6 @@ static int add_gid_entry(struct ib_qp *ibqp, union ib_gid *gid)
 int mlx4_ib_add_mc(struct mlx4_ib_dev *mdev, struct mlx4_ib_qp *mqp,
 		   union ib_gid *gid)
 {
-	u8 mac[6];
 	struct net_device *ndev;
 	int ret = 0;
 
@@ -804,11 +805,7 @@ int mlx4_ib_add_mc(struct mlx4_ib_dev *mdev, struct mlx4_ib_qp *mqp,
 	spin_unlock(&mdev->iboe.lock);
 
 	if (ndev) {
-		rdma_get_mcast_mac((struct in6_addr *)gid, mac);
-		rtnl_lock();
-		dev_mc_add(mdev->iboe.netdevs[mqp->port - 1], mac);
 		ret = 1;
-		rtnl_unlock();
 		dev_put(ndev);
 	}
 
@@ -1031,6 +1028,8 @@ static int mlx4_ib_mcg_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 	struct mlx4_ib_qp *mqp = to_mqp(ibqp);
 	u64 reg_id;
 	struct mlx4_ib_steering *ib_steering = NULL;
+	enum mlx4_protocol prot = (gid->raw[1] == 0x0e) ?
+		MLX4_PROT_IB_IPV4 : MLX4_PROT_IB_IPV6;
 
 	if (mdev->dev->caps.steering_mode ==
 	    MLX4_STEERING_MODE_DEVICE_MANAGED) {
@@ -1042,7 +1041,7 @@ static int mlx4_ib_mcg_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 	err = mlx4_multicast_attach(mdev->dev, &mqp->mqp, gid->raw, mqp->port,
 				    !!(mqp->flags &
 				       MLX4_IB_QP_BLOCK_MULTICAST_LOOPBACK),
-				    MLX4_PROT_IB_IPV6, &reg_id);
+				    prot, &reg_id);
 	if (err)
 		goto err_malloc;
 
@@ -1061,7 +1060,7 @@ static int mlx4_ib_mcg_attach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 
 err_add:
 	mlx4_multicast_detach(mdev->dev, &mqp->mqp, gid->raw,
-			      MLX4_PROT_IB_IPV6, reg_id);
+			      prot, reg_id);
 err_malloc:
 	kfree(ib_steering);
 
@@ -1089,10 +1088,11 @@ static int mlx4_ib_mcg_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 	int err;
 	struct mlx4_ib_dev *mdev = to_mdev(ibqp->device);
 	struct mlx4_ib_qp *mqp = to_mqp(ibqp);
-	u8 mac[6];
 	struct net_device *ndev;
 	struct mlx4_ib_gid_entry *ge;
 	u64 reg_id = 0;
+	enum mlx4_protocol prot = (gid->raw[1] == 0x0e) ?
+		MLX4_PROT_IB_IPV4 : MLX4_PROT_IB_IPV6;
 
 	if (mdev->dev->caps.steering_mode ==
 	    MLX4_STEERING_MODE_DEVICE_MANAGED) {
@@ -1115,7 +1115,7 @@ static int mlx4_ib_mcg_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 	}
 
 	err = mlx4_multicast_detach(mdev->dev, &mqp->mqp, gid->raw,
-				    MLX4_PROT_IB_IPV6, reg_id);
+				    prot, reg_id);
 	if (err)
 		return err;
 
@@ -1127,13 +1127,8 @@ static int mlx4_ib_mcg_detach(struct ib_qp *ibqp, union ib_gid *gid, u16 lid)
 		if (ndev)
 			dev_hold(ndev);
 		spin_unlock(&mdev->iboe.lock);
-		rdma_get_mcast_mac((struct in6_addr *)gid, mac);
-		if (ndev) {
-			rtnl_lock();
-			dev_mc_del(mdev->iboe.netdevs[ge->port - 1], mac);
-			rtnl_unlock();
+		if (ndev)
 			dev_put(ndev);
-		}
 		list_del(&ge->list);
 		kfree(ge);
 	} else
@@ -1229,20 +1224,6 @@ static struct device_attribute *mlx4_class_attributes[] = {
 	&dev_attr_board_id
 };
 
-static void mlx4_addrconf_ifid_eui48(u8 *eui, u16 vlan_id, struct net_device *dev)
-{
-	memcpy(eui, dev->dev_addr, 3);

[PATCH V4 1/9] IB/core: Ethernet L2 attributes in verbs/cm structures

2013-09-10 Thread Or Gerlitz
From: Matan Barak mat...@mellanox.com

This patch adds support for Ethernet L2 attributes in the
verbs/cm/cma structures.

When dealing with L2 Ethernet, we should use smac, dmac, vlan ID and priority
in a similar manner that the IB L2 (and the L4 PKEY) attributes are used.

Thus, those attributes were added to the following structures:

* ib_ah_attr - added dmac
* ib_qp_attr - added smac and vlan_id, (sl remains vlan priority)
* ib_wc - added smac, vlan_id
* ib_sa_path_rec - added smac, dmac, vlan_id
* cm_av - added smac and vlan_id

For the path record structure, extra care was taken to avoid the new fields when
packing it into wire format, to avoid breaking the IB CM and SA wire protocol.

On the active side, the CM fills its internal structures from the path provided
by the ULP; code was added there to take the ETH L2 attributes and place them
into the CM Address Handle (struct cm_av).

On the passive side, the CM fills its internal structures from the WC associated
with the REQ message; code was added there to take the ETH L2 attributes from
the WC.

When the HW driver provides the required ETH L2 attributes in the WC, they
set the IB_WC_WITH_SMAC and IB_WC_WITH_VLAN flags. The IB core code checks
for the presence of these flags, and in their absence does address
resolution from the ib_init_ah_from_wc helper function.

ib_modify_qp_is_ok is also updated to consider the link layer. Some parameters
are mandatory for Ethernet link layer, while they are irrelevant for IB.
Vendor drivers are modified to support the new function signature.

Signed-off-by: Matan Barak mat...@mellanox.com
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/core/addr.c  |   97 ++-
 drivers/infiniband/core/cm.c|   55 +++
 drivers/infiniband/core/cma.c   |   60 +++--
 drivers/infiniband/core/sa_query.c  |   12 +++-
 drivers/infiniband/core/verbs.c |   45 -
 drivers/infiniband/hw/ehca/ehca_qp.c|2 +-
 drivers/infiniband/hw/ipath/ipath_qp.c  |2 +-
 drivers/infiniband/hw/mlx4/qp.c |9 ++-
 drivers/infiniband/hw/mlx5/qp.c |3 +-
 drivers/infiniband/hw/mthca/mthca_qp.c  |3 +-
 drivers/infiniband/hw/ocrdma/ocrdma_verbs.c |3 +-
 drivers/infiniband/hw/qib/qib_qp.c  |2 +-
 include/linux/mlx4/device.h |1 +
 include/rdma/ib_addr.h  |   42 +++-
 include/rdma/ib_cm.h|1 +
 include/rdma/ib_pack.h  |1 +
 include/rdma/ib_sa.h|3 +
 include/rdma/ib_verbs.h |   23 ++-
 18 files changed, 340 insertions(+), 24 deletions(-)

diff --git a/drivers/infiniband/core/addr.c b/drivers/infiniband/core/addr.c
index e90f2b2..8172d37 100644
--- a/drivers/infiniband/core/addr.c
+++ b/drivers/infiniband/core/addr.c
@@ -86,6 +86,8 @@ int rdma_addr_size(struct sockaddr *addr)
 }
 EXPORT_SYMBOL(rdma_addr_size);
 
+static struct rdma_addr_client self;
+
 void rdma_addr_register_client(struct rdma_addr_client *client)
 {
 	atomic_set(&client->refcount, 1);
@@ -119,7 +121,8 @@ int rdma_copy_addr(struct rdma_dev_addr *dev_addr, struct net_device *dev,
 }
 EXPORT_SYMBOL(rdma_copy_addr);
 
-int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
+int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr,
+		      u16 *vlan_id)
 {
 	struct net_device *dev;
 	int ret = -EADDRNOTAVAIL;
@@ -142,6 +145,8 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
 			return ret;
 
 		ret = rdma_copy_addr(dev_addr, dev, NULL);
+		if (vlan_id)
+			*vlan_id = rdma_vlan_dev_vlan_id(dev);
 		dev_put(dev);
 		break;
 
@@ -153,6 +158,8 @@ int rdma_translate_ip(struct sockaddr *addr, struct rdma_dev_addr *dev_addr)
 					  &((struct sockaddr_in6 *) addr)->sin6_addr,
 					  dev, 1)) {
 				ret = rdma_copy_addr(dev_addr, dev, NULL);
+				if (vlan_id)
+					*vlan_id = rdma_vlan_dev_vlan_id(dev);
 				break;
 			}
 		}
@@ -238,7 +245,7 @@ static int addr4_resolve(struct sockaddr_in *src_in,
 	src_in->sin_addr.s_addr = fl4.saddr;
 
 	if (rt->dst.dev->flags & IFF_LOOPBACK) {
-		ret = rdma_translate_ip((struct sockaddr *) dst_in, addr);
+		ret = rdma_translate_ip((struct sockaddr *)dst_in, addr, NULL);
 		if (!ret)
 			memcpy(addr->dst_dev_addr, addr->src_dev_addr, MAX_ADDR_LEN);
 		goto put;
@@ -286,7 +293,7 @@ static int addr6_resolve(struct sockaddr_in6 *src_in,
 	}
 

[PATCH V4 9/9] IB/mlx4: Enable mlx4_ib support for MODIFY_QP_EX

2013-09-10 Thread Or Gerlitz
From: Matan Barak mat...@mellanox.com

mlx4_ib driver should indicate that it supports
MODIFY_QP_EX user verbs extended command.

Signed-off-by: Matan Barak mat...@mellanox.com
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/hw/mlx4/main.c |3 ++-
 1 files changed, 2 insertions(+), 1 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/main.c b/drivers/infiniband/hw/mlx4/main.c
index 7a29ad5..77c87d0 100644
--- a/drivers/infiniband/hw/mlx4/main.c
+++ b/drivers/infiniband/hw/mlx4/main.c
@@ -1755,7 +1755,8 @@ static void *mlx4_ib_add(struct mlx4_dev *dev)
 		(1ull << IB_USER_VERBS_CMD_QUERY_SRQ)		|
 		(1ull << IB_USER_VERBS_CMD_DESTROY_SRQ)		|
 		(1ull << IB_USER_VERBS_CMD_CREATE_XSRQ)		|
-		(1ull << IB_USER_VERBS_CMD_OPEN_QP);
+		(1ull << IB_USER_VERBS_CMD_OPEN_QP)		|
+		(1ull << IB_USER_VERBS_CMD_MODIFY_QP_EX);
 
 	ibdev->ib_dev.query_device	= mlx4_ib_query_device;
 	ibdev->ib_dev.query_port	= mlx4_ib_query_port;
-- 
1.7.1



[PATCH V4 2/9] IB/CMA: RoCE IP based GID addressing

2013-09-10 Thread Or Gerlitz
From: Moni Shoua mo...@mellanox.com

Currently, the IB core, and specifically the RDMA-CM, assumes that
RoCE (IBoE) gids encode the related Ethernet netdevice interface
MAC address and possibly VLAN id.

Change gids to be treated as encoding the interface IP address.

Since Ethernet layer 2 address parameters are no longer encoded
within gids, the InfiniBand address structures (e.g. ib_ah_attr)
had to be extended with layer 2 address parameters, namely mac and vlan.

Signed-off-by: Moni Shoua mo...@mellanox.com
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/core/cma.c  |   22 +
 drivers/infiniband/core/ucma.c |   18 +++---
 include/rdma/ib_addr.h |   50 +--
 3 files changed, 28 insertions(+), 62 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 27acdec..2497031 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -386,7 +386,9 @@ static int cma_acquire_dev(struct rdma_id_private *id_priv)
 		return -EINVAL;
 
 	mutex_lock(&lock);
-	iboe_addr_get_sgid(dev_addr, &iboe_gid);
+	rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.src_addr,
+		    &iboe_gid);
+
 	memcpy(&gid, dev_addr->src_dev_addr +
 	       rdma_addr_gid_offset(dev_addr), sizeof gid);
 	list_for_each_entry(cma_dev, &dev_list, list) {
@@ -1925,10 +1927,10 @@ static int cma_resolve_iboe_route(struct rdma_id_private *id_priv)
 	memcpy(route->path_rec->dmac, addr->dev_addr.dst_dev_addr, ETH_ALEN);
 	memcpy(route->path_rec->smac, ndev->dev_addr, ndev->addr_len);
 
-	iboe_mac_vlan_to_ll(&route->path_rec->sgid, addr->dev_addr.src_dev_addr,
-			    route->path_rec->vlan_id);
-	iboe_mac_vlan_to_ll(&route->path_rec->dgid, addr->dev_addr.dst_dev_addr,
-			    route->path_rec->vlan_id);
+	rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.src_addr,
+		    &route->path_rec->sgid);
+	rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.dst_addr,
+		    &route->path_rec->dgid);
 
 	route->path_rec->hop_limit = 1;
 	route->path_rec->reversible = 1;
@@ -2095,6 +2097,7 @@ static void addr_handler(int status, struct sockaddr *src_addr,
 			   RDMA_CM_ADDR_RESOLVED))
 		goto out;
 
+	memcpy(cma_src_addr(id_priv), src_addr, rdma_addr_size(src_addr));
 	if (!status && !id_priv->cma_dev)
 		status = cma_acquire_dev(id_priv);
 
@@ -2104,10 +2107,8 @@ static void addr_handler(int status, struct sockaddr *src_addr,
 			goto out;
 		event.event = RDMA_CM_EVENT_ADDR_ERROR;
 		event.status = status;
-	} else {
-		memcpy(cma_src_addr(id_priv), src_addr, rdma_addr_size(src_addr));
+	} else
 		event.event = RDMA_CM_EVENT_ADDR_RESOLVED;
-	}
 
 	if (id_priv->id.event_handler(&id_priv->id, &event)) {
 		cma_exch(id_priv, RDMA_CM_DESTROYING);
@@ -2588,6 +2589,7 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr)
 	if (ret)
 		goto err1;
 
+	memcpy(cma_src_addr(id_priv), addr, rdma_addr_size(addr));
 	if (!cma_any_addr(addr)) {
 		ret = cma_translate_addr(addr, &id->route.addr.dev_addr);
 		if (ret)
@@ -2598,7 +2600,6 @@ int rdma_bind_addr(struct rdma_cm_id *id, struct sockaddr *addr)
 			goto err1;
 	}
 
-	memcpy(cma_src_addr(id_priv), addr, rdma_addr_size(addr));
 	if (!(id_priv->options & (1 << CMA_OPTION_AFONLY))) {
 		if (addr->sa_family == AF_INET)
 			id_priv->afonly = 1;
@@ -3327,7 +3328,8 @@ static int cma_iboe_join_multicast(struct rdma_id_private *id_priv,
 		err = -EINVAL;
 		goto out2;
 	}
-	iboe_addr_get_sgid(dev_addr, &mc->multicast.ib->rec.port_gid);
+	rdma_ip2gid((struct sockaddr *)&id_priv->id.route.addr.src_addr,
+		    &mc->multicast.ib->rec.port_gid);
 	work->id = id_priv;
 	work->mc = mc;
 	INIT_WORK(&work->work, iboe_mcast_work_handler);
diff --git a/drivers/infiniband/core/ucma.c b/drivers/infiniband/core/ucma.c
index b0f189b..7e7da86 100644
--- a/drivers/infiniband/core/ucma.c
+++ b/drivers/infiniband/core/ucma.c
@@ -655,24 +655,14 @@ static void ucma_copy_ib_route(struct rdma_ucm_query_route_resp *resp,
 static void ucma_copy_iboe_route(struct rdma_ucm_query_route_resp *resp,
 				 struct rdma_route *route)
 {
-	struct rdma_dev_addr *dev_addr;
-	struct net_device *dev;
-	u16 vid = 0;
 
 	resp->num_paths = route->num_paths;
 	switch (route->num_paths) {
 	case 0:
-		dev_addr = &route->addr.dev_addr;
-		dev = dev_get_by_index(&init_net, dev_addr->bound_dev_if);
-		if (dev) {
- 

[PATCH V4 4/9] IB/mlx4: Handle Ethernet L2 parameters for IP based GID addressing

2013-09-10 Thread Or Gerlitz
From: Moni Shoua mo...@mellanox.com

IP based RoCE GIDs don't store the Ethernet L2 parameters, MAC and VLAN.

Hence, we now need to extract them from the CQE and place them in struct
ib_wc (to be used in cases where they were previously taken from the GID).

Also, when modifying a QP or building an address handle, instead of
parsing the dgid to get the MAC and VLAN, take them from the
address handle attributes.
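
As a rough illustration of the CQE handling described above, the following
user-space sketch mirrors the logic of the cq.c hunk in this patch. The mask
values, the "no VLAN" marker, struct l2_info and cqe_to_l2() are assumptions
made for the example (the real definitions belong to include/linux/mlx4/cq.h
and struct ib_wc), and the CQE fields are taken in host byte order to keep
the example free of endianness helpers.

```c
#include <stdint.h>
#include <string.h>

#define ETH_ALEN 6

/* Assumed values, for illustration only. */
#define CQE_VLAN_PRESENT_MASK (1u << 29)
#define CQE_VID_MASK          0xfffu
#define VLAN_ID_NONE          0xffffu   /* marker for "frame carried no VLAN tag" */

/* Hypothetical stand-in for the L2 fields added to struct ib_wc. */
struct l2_info {
	uint16_t vlan_id;
	uint8_t  smac[ETH_ALEN];
};

/*
 * Fill the L2 fields from CQE words that are already in host byte order.
 * The VLAN id is the low 12 bits of sl_vid, valid only when the
 * "VLAN present" bit is set in vlan_my_qpn.
 */
static void cqe_to_l2(uint32_t vlan_my_qpn, uint16_t sl_vid,
		      const uint8_t *cqe_smac, struct l2_info *out)
{
	if (vlan_my_qpn & CQE_VLAN_PRESENT_MASK)
		out->vlan_id = sl_vid & CQE_VID_MASK;
	else
		out->vlan_id = VLAN_ID_NONE;

	memcpy(out->smac, cqe_smac, ETH_ALEN);
}
```

The point of the sketch is that the consumer of the completion no longer has
to parse a GID to learn the source MAC and VLAN; both come straight from the
completion entry.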

Signed-off-by: Moni Shoua mo...@mellanox.com
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/hw/mlx4/ah.c   |   40 +++-
 drivers/infiniband/hw/mlx4/cq.c   |9 +++
 drivers/infiniband/hw/mlx4/mlx4_ib.h  |3 -
 drivers/infiniband/hw/mlx4/qp.c   |  105 ++---
 drivers/net/ethernet/mellanox/mlx4/port.c |   20 ++
 include/linux/mlx4/cq.h   |   15 +++-
 6 files changed, 130 insertions(+), 62 deletions(-)

diff --git a/drivers/infiniband/hw/mlx4/ah.c b/drivers/infiniband/hw/mlx4/ah.c
index a251bec..170dca6 100644
--- a/drivers/infiniband/hw/mlx4/ah.c
+++ b/drivers/infiniband/hw/mlx4/ah.c
@@ -39,25 +39,6 @@
 
 #include "mlx4_ib.h"
 
-int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr,
-			u8 *mac, int *is_mcast, u8 port)
-{
-	struct in6_addr in6;
-
-	*is_mcast = 0;
-
-	memcpy(&in6, ah_attr->grh.dgid.raw, sizeof in6);
-	if (rdma_link_local_addr(&in6))
-		rdma_get_ll_mac(&in6, mac);
-	else if (rdma_is_multicast_addr(&in6)) {
-		rdma_get_mcast_mac(&in6, mac);
-		*is_mcast = 1;
-	} else
-		return -EINVAL;
-
-	return 0;
-}
-
 static struct ib_ah *create_ib_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr,
  struct mlx4_ib_ah *ah)
 {
@@ -92,21 +73,18 @@ static struct ib_ah *create_iboe_ah(struct ib_pd *pd, struct ib_ah_attr *ah_attr
 {
 	struct mlx4_ib_dev *ibdev = to_mdev(pd->device);
 	struct mlx4_dev *dev = ibdev->dev;
-	union ib_gid sgid;
-	u8 mac[6];
-	int err;
 	int is_mcast;
+	struct in6_addr in6;
 	u16 vlan_tag;
 
-	err = mlx4_ib_resolve_grh(ibdev, ah_attr, mac, &is_mcast, ah_attr->port_num);
-	if (err)
-		return ERR_PTR(err);
-
-	memcpy(ah->av.eth.mac, mac, 6);
-	err = ib_get_cached_gid(pd->device, ah_attr->port_num, ah_attr->grh.sgid_index, &sgid);
-	if (err)
-		return ERR_PTR(err);
-	vlan_tag = rdma_get_vlan_id(&sgid);
+	memcpy(&in6, ah_attr->grh.dgid.raw, sizeof(in6));
+	if (rdma_is_multicast_addr(&in6)) {
+		is_mcast = 1;
+		rdma_get_mcast_mac(&in6, ah->av.eth.mac);
+	} else {
+		memcpy(ah->av.eth.mac, ah_attr->dmac, ETH_ALEN);
+	}
+	vlan_tag = ah_attr->vlan_id;
 	if (vlan_tag < 0x1000)
 		vlan_tag |= (ah_attr->sl & 7) << 13;
 	ah->av.eth.port_pd = cpu_to_be32(to_mpd(pd)->pdn | (ah_attr->port_num << 24));
diff --git a/drivers/infiniband/hw/mlx4/cq.c b/drivers/infiniband/hw/mlx4/cq.c
index d5e60f4..5f6113b 100644
--- a/drivers/infiniband/hw/mlx4/cq.c
+++ b/drivers/infiniband/hw/mlx4/cq.c
@@ -793,6 +793,15 @@ repoll:
 			wc->sl  = be16_to_cpu(cqe->sl_vid) >> 13;
 		else
 			wc->sl  = be16_to_cpu(cqe->sl_vid) >> 12;
+		if (be32_to_cpu(cqe->vlan_my_qpn) & MLX4_CQE_VLAN_PRESENT_MASK) {
+			wc->vlan_id = be16_to_cpu(cqe->sl_vid) &
+				MLX4_CQE_VID_MASK;
+		} else {
+			wc->vlan_id = 0xffff;
+		}
+		wc->wc_flags |= IB_WC_WITH_VLAN;
+		memcpy(wc->smac, cqe->smac, ETH_ALEN);
+		wc->wc_flags |= IB_WC_WITH_SMAC;
 	}
 
 	return 0;
diff --git a/drivers/infiniband/hw/mlx4/mlx4_ib.h b/drivers/infiniband/hw/mlx4/mlx4_ib.h
index 133f41f..c06f571 100644
--- a/drivers/infiniband/hw/mlx4/mlx4_ib.h
+++ b/drivers/infiniband/hw/mlx4/mlx4_ib.h
@@ -678,9 +678,6 @@ int __mlx4_ib_query_pkey(struct ib_device *ibdev, u8 port, u16 index,
 int __mlx4_ib_query_gid(struct ib_device *ibdev, u8 port, int index,
 			union ib_gid *gid, int netw_view);
 
-int mlx4_ib_resolve_grh(struct mlx4_ib_dev *dev, const struct ib_ah_attr *ah_attr,
-			u8 *mac, int *is_mcast, u8 port);
-
 static inline bool mlx4_ib_ah_grh_present(struct mlx4_ib_ah *ah)
 {
 	u8 port = be32_to_cpu(ah->av.ib.port_pd) >> 24 & 3;
diff --git a/drivers/infiniband/hw/mlx4/qp.c b/drivers/infiniband/hw/mlx4/qp.c
index da6f5fa..e0c2186 100644
--- a/drivers/infiniband/hw/mlx4/qp.c
+++ b/drivers/infiniband/hw/mlx4/qp.c
@@ -90,6 +90,21 @@ enum {
 	MLX4_RAW_QP_MSGMAX	= 31,
 };
 
+#ifndef ETH_ALEN
+#define ETH_ALEN	6
+#endif
+static inline u64 mlx4_mac_to_u64(u8 *addr)
+{
+	u64 mac = 0;
+	int i;
+
+	for (i = 0; i < ETH_ALEN; i++) {
+		mac <<= 8;
+   mac |= 

[PATCH V4 0/9] IP based RoCE GID Addressing

2013-09-10 Thread Or Gerlitz
changes from V3:

  - dropped the uverbs infrastructure patch for extensions which is now upstream
400dbc9 IB/core: Infrastructure for extensible uverbs commands

  - added ocrdma patch to handle Ethernet L2 parameters, similar to the mlx4 
patch.
   
  - removed the assumption that the low level driver can provide the source mac
and vlan in the struct ib_wc returned by ib_poll_cq, and adjusted the 
ib_init_ah_from_wc helper of the IB core accordingly.

  - fixed some vlan related issues in the mlx4 driver

See below full listing of change-history.

Currently, the IB stack (core + drivers) treats RoCE (IBoE) GIDs as
encoding the MAC address, and possibly the VLAN id, of the related
Ethernet net-device interface.

This series changes RoCE GIDs to encode the IP addresses (IPv4 + IPv6)
of that Ethernet interface, under the following reasoning:

1. There are environments where the compute entity that runs the RoCE
stack is not aware that its traffic is VLAN-tagged. As a result, that
node creates/assumes GIDs that are wrong from the viewpoint of a peer
node which is aware of VLANs.

Note that the node here can be a physical node connected to an Ethernet
switch acting in access mode, talking to another node which does VLAN
insertion/stripping by itself.

Another example is an SRIOV Virtual Function configured to work in VST
(Virtual-Switch-Tagging) mode, such that the hypervisor configures the HW
eSwitch to do VLAN insertion for the vPORT representing that function.

2. RoCE traffic is inspected (mirrored/trapped) in Ethernet switches for
monitoring and security purposes. It is much more natural for both humans and
automated utilities (...) to observe IP addresses at a certain offset into the
L3 header of RoCE frames than MAC/VLANs (which are there anyway in the L2
header of that frame, so they are not lost by this change).

3. Some advanced Bonding/Teaming modes, such as balance-alb and balance-tlb,
use multiple underlying devices in parallel, and hence packets always
carry the bond IP address but different streams have different source MACs.
The approach brought by this series is part of what would allow
supporting that for RoCE traffic too.

The 1st patch adds explicit handling of the Ethernet L2 attributes (source/dest
MAC and vlan_id) to the kernel IB core, in data structures and CMA/CM code.
Previously, with MAC/VLAN based addressing, they were encoded in the GIDs,
whereas now they have to be resolved and stored separately from the IP based GIDs.
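
To make the new GID format concrete, here is a minimal user-space sketch of
how an IP address could be turned into a 16-byte GID. It assumes that IPv4
addresses are carried in the IPv4-mapped IPv6 form (::ffff:a.b.c.d) and that
IPv6 addresses are copied as-is. struct gid and ip_to_gid() are hypothetical
stand-ins for the kernel's union ib_gid and the rdma_ip2gid() helper used by
the CMA patch; this is not the code from the series.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Stand-in for the kernel's union ib_gid: 16 raw bytes. */
struct gid {
	uint8_t raw[16];
};

/* Returns 0 on success, -1 for an unsupported address family. */
static int ip_to_gid(const struct sockaddr *addr, struct gid *gid)
{
	switch (addr->sa_family) {
	case AF_INET: {
		const struct sockaddr_in *sin =
			(const struct sockaddr_in *)addr;

		/* Assumed encoding: IPv4-mapped IPv6, ::ffff:a.b.c.d */
		memset(gid->raw, 0, 10);
		gid->raw[10] = 0xff;
		gid->raw[11] = 0xff;
		memcpy(&gid->raw[12], &sin->sin_addr.s_addr, 4);
		return 0;
	}
	case AF_INET6: {
		const struct sockaddr_in6 *sin6 =
			(const struct sockaddr_in6 *)addr;

		/* An IPv6 address already has the size of a GID. */
		memcpy(gid->raw, &sin6->sin6_addr, 16);
		return 0;
	}
	default:
		return -1;
	}
}
```

Note that nothing in such a GID identifies the MAC or VLAN, which is why the
L2 attributes have to travel separately.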

The 2nd patch modifies the CMA to cope with IP based GIDs, the 3rd/4th ones do
that for the mlx4_ib driver, and the 5th patch for the ocrdma driver.

The 6th patch sets the foundation for extending uverbs to the new scheme which
was introduced lately, and the 7th/8th patches add two extended uverbs and,
respectively, two extended ucma commands which are now exported to user space.
The last patch denotes mlx4 support for the uverbs extended modify QP command.

These extended verbs will allow enhancing user space libraries such that they
work OK over the modified scheme. RC applications using librdmacm will not
need to be modified at all, since the change will be encapsulated in that
library.

Or.

Full listing of change-history:

changes from V3:

  - dropped the uverbs Infrastructure patch for extensions which is now upstream
400dbc9 IB/core: Infrastructure for extensible uverbs commands

  - added ocrdma patch to handle Ethernet L2 parameters, similar to the mlx4 
patch.
   
  - removed the assumption that the low level driver can provide the source mac
and vlan in the struct ib_wc returned by ib_poll_cq, and adjusted the 
ib_init_ah_from_wc helper of the IB core accordingly.

  - fixed some vlan related issues in the mlx4 driver

changes from V2:

  - added handling of IP based GIDs in the ocrdma driver - patch #5, 
as a result patches #5-8 of V1 became patches #6-9
  
changes from V1:

 - rebased the series against the latest kernel bits, which include Sean's 
   AF_IB changes to the rdma-cm
 
 - fixed bug in mlx4_ib where reset of the gid table was done for IB ports too
 
 - fixed build warnings and issues pointed by sparse

 - introduced patch #1 which does the explicit handling of Ethernet L2
   attributes, source/dest mac and vlan_id in the kernel data-structures
   and CMA/CM code.

 - use smac when modifying a QP -- find smac in passive side + additional
   fields to address structures

 - add support for the new QP attr in ib_modify_qp_is_ok(), specific to
   ll = ETH, and modified all low-level drivers to keep working after that
   change

 -- changes around uverbs:
 - use ah_ext as pointer in qp_attr passed from user space, so this
   field by itself can be extended in the future
 - for kernel to user command responses comp_mask is moved into the
   right place which is after the non-extended command response fields
 - fixed bug in copy_qp_attr_ex under which some fields were copied to
   wrong locations
 - use new 

[PATCH V4 7/9] IB/core: Add RoCE IP based addressing extensions for uverbs

2013-09-10 Thread Or Gerlitz
From: Matan Barak mat...@mellanox.com

Add uverbs support for RoCE (IBoE) IP based addressing extensions
towards user space libraries.

Under IP based GID addressing, for RC QPs, the QP attributes should contain the
Ethernet L2 destination. Until now, indicating the GID was sufficient. When
using IP addresses encoded in GIDs, the QP attributes should contain an extended
destination, indicating the VLAN and dmac as well. This is done via a new struct
ib_uverbs_qp_dest_ex.
This new structure is contained in a new struct ib_uverbs_modify_qp_ex that is
used by the MODIFY_QP_EX command. In order to make those changes seamlessly, the
extended structures were added at the bottom of the current structures.
The new command also gets smac/alt_smac/vlan_id/alt_vlan_id. Those parameters
are fixed in the QP context in order to enhance security.
The extended dest is used as a pointer rather than as an inline fixed field
for the sake of future extensibility.
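
The idea of appending extensions at the bottom of an existing command can be
sketched as follows. Everything here is hypothetical and only illustrates the
layout principle: struct modify_cmd, struct modify_cmd_ex, the EX_HAS_* bits
and parse_modify() are invented for the example and are not the
ib_uverbs_modify_qp_ex definitions from this patch.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical legacy command layout. */
struct modify_cmd {
	uint32_t qp_handle;
	uint32_t attr_mask;
};

/* Hypothetical extended layout: legacy fields first, extensions appended. */
struct modify_cmd_ex {
	struct modify_cmd base;
	uint32_t comp_mask;	/* says which extension fields are valid */
	uint16_t vlan_id;
	uint8_t  smac[6];
};

#define EX_HAS_VLAN (1u << 0)
#define EX_HAS_SMAC (1u << 1)

/*
 * Accept a buffer holding either layout. Returns 0 on success and -1 if the
 * buffer is too short even for the legacy command. A buffer that does not
 * hold the full extended layout is treated as legacy (comp_mask = 0).
 */
static int parse_modify(const void *buf, size_t len, struct modify_cmd_ex *out)
{
	if (len < sizeof(struct modify_cmd))
		return -1;

	memset(out, 0, sizeof(*out));
	memcpy(out, buf, len < sizeof(*out) ? len : sizeof(*out));
	if (len < sizeof(*out))
		out->comp_mask = 0;

	return 0;
}
```

Because the legacy fields keep their offsets, a consumer that only knows the
old layout still reads them correctly, while comp_mask lets a newer consumer
tell which of the appended fields were actually supplied.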

Also, when the GID encodes an IP address, the AH attributes should also contain
the dmac. Therefore, ib_uverbs_create_ah was extended to contain those fields.
When creating an AH, the user indicates the exact L2 Ethernet destination
parameters. This is done by a new CREATE_AH_EX command that uses a new struct
ib_uverbs_create_ah_ex.

struct ib_user_path_rec was extended too, to contain the source and destination
MAC and VLAN ID; this structure is used by the rdma_ucm driver.

Signed-off-by: Matan Barak mat...@mellanox.com
Signed-off-by: Or Gerlitz ogerl...@mellanox.com
---
 drivers/infiniband/core/uverbs.h  |2 +
 drivers/infiniband/core/uverbs_cmd.c  |  359 ++---
 drivers/infiniband/core/uverbs_main.c |4 +-
 drivers/infiniband/core/uverbs_marshall.c |  128 ++-
 include/rdma/ib_marshall.h|   12 +
 include/uapi/rdma/ib_user_sa.h|   34 +++-
 include/uapi/rdma/ib_user_verbs.h |  160 +-
 7 files changed, 608 insertions(+), 91 deletions(-)

diff --git a/drivers/infiniband/core/uverbs.h b/drivers/infiniband/core/uverbs.h
index d040b87..b0fcb0b 100644
--- a/drivers/infiniband/core/uverbs.h
+++ b/drivers/infiniband/core/uverbs.h
@@ -202,11 +202,13 @@ IB_UVERBS_DECLARE_CMD(create_qp);
 IB_UVERBS_DECLARE_CMD(open_qp);
 IB_UVERBS_DECLARE_CMD(query_qp);
 IB_UVERBS_DECLARE_CMD(modify_qp);
+IB_UVERBS_DECLARE_CMD(modify_qp_ex);
 IB_UVERBS_DECLARE_CMD(destroy_qp);
 IB_UVERBS_DECLARE_CMD(post_send);
 IB_UVERBS_DECLARE_CMD(post_recv);
 IB_UVERBS_DECLARE_CMD(post_srq_recv);
 IB_UVERBS_DECLARE_CMD(create_ah);
+IB_UVERBS_DECLARE_CMD(create_ah_ex);
 IB_UVERBS_DECLARE_CMD(destroy_ah);
 IB_UVERBS_DECLARE_CMD(attach_mcast);
 IB_UVERBS_DECLARE_CMD(detach_mcast);
diff --git a/drivers/infiniband/core/uverbs_cmd.c 
b/drivers/infiniband/core/uverbs_cmd.c
index f2b81b9..9a0c5d7 100644
--- a/drivers/infiniband/core/uverbs_cmd.c
+++ b/drivers/infiniband/core/uverbs_cmd.c
@@ -1900,6 +1900,60 @@ static int modify_qp_mask(enum ib_qp_type qp_type, int mask)
 	}
 }
 
+static void ib_uverbs_modify_qp_assign(struct ib_uverbs_modify_qp *cmd,
+				       struct ib_qp_attr *attr,
+				       struct ib_uverbs_qp_dest *dest,
+				       struct ib_uverbs_qp_dest *alt_dest) {
+	attr->qp_state            = cmd->qp_state;
+	attr->cur_qp_state        = cmd->cur_qp_state;
+	attr->path_mtu            = cmd->path_mtu;
+	attr->path_mig_state      = cmd->path_mig_state;
+	attr->qkey                = cmd->qkey;
+	attr->rq_psn              = cmd->rq_psn;
+	attr->sq_psn              = cmd->sq_psn;
+	attr->dest_qp_num         = cmd->dest_qp_num;
+	attr->qp_access_flags     = cmd->qp_access_flags;
+	attr->pkey_index          = cmd->pkey_index;
+	attr->alt_pkey_index      = cmd->alt_pkey_index;
+	attr->en_sqd_async_notify = cmd->en_sqd_async_notify;
+	attr->max_rd_atomic       = cmd->max_rd_atomic;
+	attr->max_dest_rd_atomic  = cmd->max_dest_rd_atomic;
+	attr->min_rnr_timer       = cmd->min_rnr_timer;
+	attr->port_num            = cmd->port_num;
+	attr->timeout             = cmd->timeout;
+	attr->retry_cnt           = cmd->retry_cnt;
+	attr->rnr_retry           = cmd->rnr_retry;
+	attr->alt_port_num        = cmd->alt_port_num;
+	attr->alt_timeout         = cmd->alt_timeout;
+
+	memcpy(attr->ah_attr.grh.dgid.raw, dest->dgid, 16);
+	attr->ah_attr.grh.flow_label    = dest->flow_label;
+	attr->ah_attr.grh.sgid_index    = dest->sgid_index;
+	attr->ah_attr.grh.hop_limit     = dest->hop_limit;
+	attr->ah_attr.grh.traffic_class = dest->traffic_class;
+	attr->ah_attr.dlid              = dest->dlid;
+	attr->ah_attr.sl                = dest->sl;
+	attr->ah_attr.src_path_bits     = dest->src_path_bits;
+	attr->ah_attr.static_rate       = dest->static_rate;
+	attr->ah_attr.ah_flags          = dest->is_global ?
+ 

Re: Strange NFS client ACK behaviour

2013-09-10 Thread Wendy Cheng
On Mon, Sep 9, 2013 at 11:51 PM, Markus Stockhausen
stockhau...@collogia.de wrote:
 From: Wendy Cheng [s.wendy.ch...@gmail.com]
 Sent: Monday, September 9, 2013 22:03
 To: Markus Stockhausen
 Cc: linux-rdma@vger.kernel.org
 Subject: Re: Strange NFS client ACK behaviour

 On Sun, Sep 8, 2013 at 11:24 AM, Markus Stockhausen
 stockhau...@collogia.de wrote:

   we observed a performance drop in our IPoIB NFS backup
   infrastructure since we switched to machines with newer
   kernels.
 

 Not sure how your backup infrastructure works but the symptoms seem to
 match with this discussion:
 http://www.spinics.net/lists/linux-nfs/msg38980.html

 If you know how to recompile the nfs kmod, Trond's patch is worth a try.
 Or open an Ubuntu support ticket and let them build you a test kmod.

 -- Wendy

 Thanks for pointing me in that direction. From my understanding this
 patch goes into the NFS client side. I built a patched module for my
 Fedora 19 client (3.10 kernel). Nevertheless the behaviour is still
 the same.  If I get the patch right it is about forked children that
 access a page of a mmapped file round robin, so that the kernel issues
 tons of write requests to the file.

 My case is only about ACK transmissions for a single writer.

 Markus


So you have to go back to the drawing board :(. Have you tried to profile it ?
http://oprofile.sourceforge.net/about/

-- Wendy
--
To unsubscribe from this list: send the line unsubscribe linux-rdma in
the body of a message to majord...@vger.kernel.org
More majordomo info at  http://vger.kernel.org/majordomo-info.html


Re: [PATCH 8/8] IB/srp: Make queue size configurable

2013-09-10 Thread Bart Van Assche

On 09/10/13 05:01, David Dillow wrote:

On Tue, 2013-08-20 at 14:50 +0200, Bart Van Assche wrote:

@@ -2227,6 +2270,7 @@ static const match_table_t srp_opt_tokens = {
 	{ SRP_OPT_SG_TABLESIZE,		"sg_tablesize=%u"	},
 	{ SRP_OPT_COMP_VECTOR,		"comp_vector=%u"	},
 	{ SRP_OPT_TL_RETRY_COUNT,	"tl_retry_count=%u"	},
+	{ SRP_OPT_CAN_QUEUE,		"can_queue=%d"		},


I'm pretty much OK with the patch, but since we're stuck with it going
forward, I'd like to have a better externally visible name here --
queue_depth? max_queue? queue_size?

Otherwise,
Acked-by: David Dillow dillo...@ornl.gov


Hello Dave,

If this name was not yet in use in any interface that is visible in user 
space, I would agree that we should come up with a better name. However, 
the SCSI mid-layer already uses that name today to export the queue 
size. To me this looks like a good reason to use the name can_queue ? 
An example:


$ cat /sys/class/scsi_host/host93/can_queue
62

See also the shost_rd_attr(can_queue, "%hd\n") statement in
drivers/scsi/scsi_sysfs.c.


Bart.




Re: [PATCH v2] opensm: Clean up event subscriptions if a port goes away

2013-09-10 Thread Hal Rosenstock
On 9/10/2013 6:50 AM, Line Holen wrote:
 Event subscriptions needs to be cleaned up if a port goes away.
 If the port comes online again later it may no longer want to
 receive the events on the same QPN. If the old QPN is used for
 something else the SM forwarding events may cause QKey violations.
 
 This behavior is made configurable and it needs to be explicitly
 enabled (default is off).
 
 Signed-off-by: Line Holen line.ho...@oracle.com

Thanks. Applied.

-- Hal


RE: [PATCH] IB/cma: use cached gids

2013-09-10 Thread Hefty, Sean
 The cma_acquire_dev function was changed by commit 3c86aa70bf67
 to use find_gid_port because multiport devices might have
 either IB or IBoE formatted gids.  The old function assumed that
 all ports on the same device used the same GID format.  However,
 when it was changed to use find_gid_port, we inadvertently lost
 usage of the GID cache.  This turned out to be a very costly
 change.  In our testing, each iteration through each index of
 the GID table takes roughly 35us.  When you have multiple
 devices in a system, and the GID you are looking for is on one
 of the later devices, the code loops through all of the GID
 indexes on all of the early devices before it finally succeeds
 on the target device.  This pathological search behavior, combined
 with 35us per GID table index retrieval, produces results such
 as the following from the cmtime application that's part of the
 latest librdmacm git repo:
 
 cmtime -b card1, port1, mthca -c 1
 
 step              total ms      max ms      min us   us / conn
 create id    :       33.88        0.06        1.00        3.39
 bind addr    :     1029.22        0.42       85.00      102.92
 resolve addr :       50.40       25.93    23244.00        5.04
 resolve route:      578.06      551.67    26457.00       57.81
 create qp    :      603.69        0.33       51.00       60.37
 connect      :     6461.23     6417.50    43963.00      646.12
 disconnect   :      877.99      659.96   162985.00       87.80
 destroy      :       38.67        0.03        2.00        3.87
 
 cmtime -b card1, port2, mthca -c 1
 
 step              total ms      max ms      min us   us / conn
 create id    :       34.74        0.07        1.00        3.47
 bind addr    :    21759.39        2.75     1874.00     2175.94
 resolve addr :       50.67       26.30    23962.00        5.07
 resolve route:      622.68      594.80    27952.00       62.27
 create qp    :      599.82        0.23       49.00       59.98
 connect      :    24761.36    24709.28    49183.00     2476.14
 disconnect   :      904.57      652.34   187201.00       90.46
 destroy      :       38.94        0.04        2.00        3.89
 
 cmtime -b card2, port1, mlx4, IB -c 1
 
 step              total ms      max ms      min us   us / conn
 create id    :       35.13        0.05        1.00        3.51
 bind addr    :    47421.04        6.38     3896.00     4742.10
 resolve addr :       50.60       25.54    24248.00        5.06
 resolve route:      524.76      498.97    25861.00       52.48
 create qp    :     3137.70        5.68      251.00      313.77
 connect      :    48959.76    48894.49    31841.00     4895.98
 disconnect   :   101926.72    98431.12   538689.00    10192.67
 destroy      :       37.63        0.04        2.00        3.76
 
 cmtime -b card2, port2, mlx4, IBoE -c 5000
 
 step              total ms      max ms      min us   us / conn
 create id    :       28.04        0.05        1.00        5.61
 bind addr    :      235.03        0.17       41.00       47.01
 resolve addr :       27.45       14.97    12308.00        5.49
 resolve route:      556.26      540.88    15514.00      111.25
 create qp    :     1323.23        5.73      210.00      264.65
 connect      :    84025.30    83960.46    61319.00    16805.06
 disconnect   :     2273.15     1734.22   417534.00      454.63
 destroy      :       21.28        0.06        2.00        4.26
 
 Clearly, both the bind address and connect operations suffer
 a huge penalty for being anything other than the default
 GID on the first port in the system.  Note: I had to reduce
 the number of connections to 5000 to get the IBoE test to
 complete, so its numbers aren't fully comparable to the
 rest of the tests.
 
 After applying this patch, the numbers now look like this:
 
 cmtime -b card1, port1, mthca -c 1
 
 step              total ms      max ms      min us   us / conn
 create id    :       30.30        0.04        1.00        3.03
 bind addr    :       26.15        0.03        1.00        2.62
 resolve addr :       47.18       24.62    22336.00        4.72
 resolve route:      642.78      617.61    25242.00       64.28
 create qp    :      610.06        0.61       52.00       61.01
 connect      :    43362.32    43303.70    59353.00     4336.23
 disconnect   :      877.59      658.70   165291.00       87.76
 destroy      :       40.03        0.05        2.00        4.00
 
 cmtime -b card1, port2, mthca -c 1
 
 step              total ms      max ms      min us   us / conn
 create id    :       31.34        0.07        1.00        3.13
 bind addr    :       42.37        0.03        3.00        4.24
 resolve addr :       47.19       24.92    22003.00        4.72
 resolve route:      580.25      553.65    26680.00       58.03
 create qp    :      687.45        0.30       52.00       68.74
 connect      :    37457.12    37384.62    73015.00     3745.71
 disconnect   :      900.72      648.67   183825.00       90.07
 destroy      :       39.05        0.05        2.00        3.90
 
 cmtime -b card2, port1, mlx4, IB -c 1
 
 step  total ms max ms min us  us / conn
 create id:   

Re: [PATCH] IB/cma: use cached gids

2013-09-10 Thread Roland Dreier
On Sun, Sep 8, 2013 at 2:44 PM, Doug Ledford dledf...@redhat.com wrote:

 -	ret = find_gid_port(cma_dev->device, &iboe_gid, port);
 +	ret = ib_find_cached_gid(cma_dev->device, &iboe_gid, &found_port, NULL);

This type of change is kind of unfortunate -- I've been on a
multi-year quest to get rid of the whole ib...cached... set of APIs,
since the part of the code that handles maintaining that cache is racy
and hard to get the locking right with.  It would be better if we
could declare that the device driver's query GID methods had to be
fast (non-sleeping) and use them directly (and device drivers could
maintain a cache if they need to, maybe with a library to help them).

But I guess that's a bigger project than fixing this immediate issue.


Re: [PATCH] IB/cma: use cached gids

2013-09-10 Thread Doug Ledford
On 09/10/2013 09:17 PM, Roland Dreier wrote:
 On Sun, Sep 8, 2013 at 2:44 PM, Doug Ledford dledf...@redhat.com wrote:

 -	ret = find_gid_port(cma_dev->device, &iboe_gid, port);
 +	ret = ib_find_cached_gid(cma_dev->device, &iboe_gid, &found_port, NULL);
 
 This type of change is kind of unfortunate -- I've been on a
 multi-year quest to get rid of the whole ib...cached... set of APIs,

So I was forewarned.

 since the part of the code that handles maintaining that cache is racy
 and hard to get the locking right with.  It would be better if we
 could declare that the device driver's query GID methods had to be
 fast (non-sleeping) and use them directly (and device drivers could
 maintain a cache if they need to, maybe with a library to help them).
 
 But I guess that's a bigger project than fixing this immediate issue.

There are only two solutions to this that I can see:

1) Don't go to the card
2) Go to the card, but get the entire GID table in one operation (or at
most two if you need one to get the GID table length and then a second
to get the entire GID table in one go)

The current implementation goes to the card once to get the table
length, and then once for each table entry.  That's what's killing us.
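
To illustrate the difference in the number of trips to the card, here is a
small user-space simulation. It is only a sketch: struct fake_port,
query_gid(), fill_cache() and the two find functions are made-up names, the
table length is arbitrary, and a counter stands in for the cost of a real
hardware query.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define GID_LEN   16
#define TABLE_LEN 8	/* hypothetical GID table length */

struct fake_port {
	uint8_t gids[TABLE_LEN][GID_LEN];
	unsigned long hw_queries;	/* counts simulated trips to the card */
};

/* Simulated per-index hardware query. */
static void query_gid(struct fake_port *port, size_t index, uint8_t *gid)
{
	port->hw_queries++;
	memcpy(gid, port->gids[index], GID_LEN);
}

/* Uncached search: one hardware query per table index until a match. */
static int find_gid_uncached(struct fake_port *port, const uint8_t *gid)
{
	uint8_t tmp[GID_LEN];

	for (size_t i = 0; i < TABLE_LEN; i++) {
		query_gid(port, i, tmp);
		if (!memcmp(tmp, gid, GID_LEN))
			return (int)i;
	}
	return -1;
}

/* Populate a software copy of the table: TABLE_LEN queries, paid once. */
static void fill_cache(struct fake_port *port, uint8_t (*cache)[GID_LEN])
{
	for (size_t i = 0; i < TABLE_LEN; i++)
		query_gid(port, i, cache[i]);
}

/* Cached search: compares against the software copy, no hardware queries. */
static int find_gid_cached(uint8_t (*cache)[GID_LEN], const uint8_t *gid)
{
	for (size_t i = 0; i < TABLE_LEN; i++)
		if (!memcmp(cache[i], gid, GID_LEN))
			return (int)i;
	return -1;
}
```

In this model every uncached lookup of the last entry costs TABLE_LEN queries,
while the cached variant pays that price once when the copy is filled and
nothing afterwards. Keeping that copy in sync with the card is the hard part
and is not modelled here.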

Personally, I don't want it ever to go to the card for this sort of
operation.  We should be able to keep the card's idea of GIDs and our
idea of the same in sync.  Getting this right might be hard, but it's
still the right thing to do.
