Re: SoftiWARP: new patchset

2012-09-05 Thread Stefan Metzmacher
Hi Bernard,

 many thanks for commenting on the software iWARP RDMA
 driver code I sent about 5 weeks ago. I hope I have now
 incorporated all recent suggestions and fixes.
 
 These are the main changes:
 
 o changing siw device attachment to be dynamic, based on
   netlink events
 o enabling inline data for kernel clients
   (inline data are now stored within the wqe structure)
 o bitmask access to packet headers, removing the
   '#if defined(__LITTLE_ENDIAN_BITFIELD)' style (see the sketch
   after this list)
 o moving debug stuff to debugfs
 o shrinking the stack size of the core tx
   function
 o updates to documentation
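 
 To illustrate that last bitmask point in general terms (a hypothetical
 sketch, not the actual siw code): a header field is read by byte-swapping
 once and then masking/shifting, so no endian-dependent bitfields are needed.
 
 #include <linux/types.h>
 #include <asm/byteorder.h>
 
 /* hypothetical layout: opcode in the low 4 bits of a big-endian word */
 #define HDR_OPCODE_MASK		0x0000000f
 #define HDR_OPCODE_SHIFT	0
 
 static inline u8 hdr_get_opcode(__be32 ctrl)
 {
 	/* convert once, then mask/shift -- no bitfield ordering involved */
 	return (be32_to_cpu(ctrl) & HDR_OPCODE_MASK) >> HDR_OPCODE_SHIFT;
 }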
 
 
 Due to the number of lines of code changed, it might be
 appropriate if I send a complete new patchset. I'll
 keep patch packaging as before.
 
 I made the current siw code available at
 www.gitorious.org/softiwarp
 and will keep it up-to-date. This code is now free
 of kernel version dependencies. It has been tested
 on different kernel versions from version 2.6.36.2
 up to 3.0.0-rc1+. I tested on both big and little
 endian machines.
 
 I would be very happy to get further input. I work on
 this project only part of my time - all your input helps
 the code mature faster. I would be happy if the code reached an
 acceptable state soon. Thank you.

I'm wondering: what is the status of this patchset?

Did you get any feedback?

It would be very good if the softiwarp module could be
included in the mainline kernel.

Hopefully Samba will get SMB-Direct (SMB2 over RDMA) support
in the future. And having RDMA support without special hardware would be very
useful for testing, and also for use on the client side against a server
which has hardware RDMA support.

For me, at least, rping tests work fine using siw.ko on a 3.2.x kernel on
Ubuntu 12.04.
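
For reference, a basic rping smoke test against a siw device looks roughly
like this (address and port are placeholders, not my exact setup):

# server
rping -s -a 192.168.0.1 -p 9999 -v
# client
rping -c -a 192.168.0.1 -p 9999 -C 10 -v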

metze




Re: [patch v2 00/37] add rxe (soft RoCE)

2012-09-05 Thread Stefan (metze) Metzmacher
Hi Bob,

On 24.07.2011 21:43,
rpearson-klaocwyjdxkshymvu7je4pqqe7ycj...@public.gmane.org wrote:
 Changes in v2 include:
 
   - Updated to Roland's tree as of 7/24/2011
 
   - Moved the crc32 algorithm into a patch (slice-by-8-for_crc32.c.diff)
 that goes into the mainline kernel. It has been submitted upstream
 but is also included in here since it is required to build the driver.
 
   - renamed rxe_sb8.c to rxe_icrc.c since that is all it now does.
 
   - Cleaned up warnings from checkpatch, C=2 and __CHECK_ENDIAN__.
 
   - moved small .h files into rxe_loc.h
 
   - rewrote the Kconfig text to be a little friendlier
 
   - Changed the patch names to make them easier to handle.
 
   - the quilt patch series is online at:
 http://support.systemfabricworks.com/downloads/rxe/patches-v2.tgz
 
   - librxe is online at:
 http://support.systemfabricworks.com/downloads/rxe/librxe-1.0.0.tar.gz
 
   Thanks to Roland Dreier, Bart van Assche and David Dillow for helpful
   suggestions.

I'm wondering: what is the status of this patchset?

It would be very good if the rxe modules could be
included in the mainline kernel.

Hopefully Samba will get SMB-Direct (SMB2 over RDMA) support
in the future. And having RDMA support without special hardware would be very
useful for testing, and also for use on the client side against a server
which has hardware RDMA support.

Are there git repositories yet (for the kernel and userspace)?

metze





[PATCH V2 2/2] cxgb4: Remove duplicate register definitions

2012-09-05 Thread Vipul Pandya
Removed duplicate definitions for the SGE_PF_KDOORBELL, SGE_INT_ENABLE3 and
PCIE_MEM_ACCESS_OFFSET registers.
Moved the register field definitions next to their register definitions.

Signed-off-by: Santosh Rastapur sant...@chelsio.com
Signed-off-by: Vipul Pandya vi...@chelsio.com
Reviewed-by: Sivakumar Subramani siv...@chelsio.com
---
V2: Changed the order of the patches in the series to avoid a build failure
between the two changes.

 drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c |   18 
 drivers/net/ethernet/chelsio/cxgb4/sge.c|4 +-
 drivers/net/ethernet/chelsio/cxgb4/t4_hw.c  |   12 +++---
 drivers/net/ethernet/chelsio/cxgb4/t4_regs.h|   54 +-
 4 files changed, 30 insertions(+), 58 deletions(-)

diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c 
b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
index 4a20821..5497eaa 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c
@@ -2470,8 +2470,8 @@ int cxgb4_sync_txq_pidx(struct net_device *dev, u16 qid, 
u16 pidx,
else
delta = size - hw_pidx + pidx;
wmb();
-   t4_write_reg(adap, MYPF_REG(A_SGE_PF_KDOORBELL),
-V_QID(qid) | V_PIDX(delta));
+   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL),
+QID(qid) | PIDX(delta));
}
 out:
return ret;
@@ -2579,8 +2579,8 @@ static void sync_txq_pidx(struct adapter *adap, struct 
sge_txq *q)
else
delta = q->size - hw_pidx + q->db_pidx;
wmb();
-   t4_write_reg(adap, MYPF_REG(A_SGE_PF_KDOORBELL),
-   V_QID(q->cntxt_id) | V_PIDX(delta));
+   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL),
+QID(q->cntxt_id) | PIDX(delta));
}
 out:
q->db_disabled = 0;
@@ -2617,9 +2617,9 @@ static void process_db_full(struct work_struct *work)
 
notify_rdma_uld(adap, CXGB4_CONTROL_DB_FULL);
drain_db_fifo(adap, dbfifo_drain_delay);
-   t4_set_reg_field(adap, A_SGE_INT_ENABLE3,
-   F_DBFIFO_HP_INT | F_DBFIFO_LP_INT,
-   F_DBFIFO_HP_INT | F_DBFIFO_LP_INT);
+   t4_set_reg_field(adap, SGE_INT_ENABLE3,
+DBFIFO_HP_INT | DBFIFO_LP_INT,
+DBFIFO_HP_INT | DBFIFO_LP_INT);
notify_rdma_uld(adap, CXGB4_CONTROL_DB_EMPTY);
 }
 
@@ -2639,8 +2639,8 @@ static void process_db_drop(struct work_struct *work)
 
 void t4_db_full(struct adapter *adap)
 {
-   t4_set_reg_field(adap, A_SGE_INT_ENABLE3,
-   F_DBFIFO_HP_INT | F_DBFIFO_LP_INT, 0);
+   t4_set_reg_field(adap, SGE_INT_ENABLE3,
+DBFIFO_HP_INT | DBFIFO_LP_INT, 0);
queue_work(workq, &adap->db_full_task);
 }
 
diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c 
b/drivers/net/ethernet/chelsio/cxgb4/sge.c
index d49933e..1fde57d 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/sge.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c
@@ -769,8 +769,8 @@ static inline void ring_tx_db(struct adapter *adap, struct 
sge_txq *q, int n)
wmb();/* write descriptors before telling HW */
spin_lock(&q->db_lock);
if (!q->db_disabled) {
-   t4_write_reg(adap, MYPF_REG(A_SGE_PF_KDOORBELL),
-V_QID(q->cntxt_id) | V_PIDX(n));
+   t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL),
+QID(q->cntxt_id) | PIDX(n));
}
q->db_pidx = q->pidx;
spin_unlock(&q->db_lock);
diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c 
b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
index af16013..dccecdc 100644
--- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
+++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c
@@ -1018,9 +1018,9 @@ static void sge_intr_handler(struct adapter *adapter)
{ ERR_INVALID_CIDX_INC,
  "SGE GTS CIDX increment too large", -1, 0 },
{ ERR_CPL_OPCODE_0, "SGE received 0-length CPL", -1, 0 },
-   { F_DBFIFO_LP_INT, NULL, -1, 0, t4_db_full },
-   { F_DBFIFO_HP_INT, NULL, -1, 0, t4_db_full },
-   { F_ERR_DROPPED_DB, NULL, -1, 0, t4_db_dropped },
+   { DBFIFO_LP_INT, NULL, -1, 0, t4_db_full },
+   { DBFIFO_HP_INT, NULL, -1, 0, t4_db_full },
+   { ERR_DROPPED_DB, NULL, -1, 0, t4_db_dropped },
{ ERR_DATA_CPL_ON_HIGH_QID1 | ERR_DATA_CPL_ON_HIGH_QID0,
  "SGE IQID > 1023 received CPL for FL", -1, 0 },
{ ERR_BAD_DB_PIDX3, "SGE DBP 3 pidx increment too large", -1,
@@ -1520,7 +1520,7 @@ void t4_intr_enable(struct adapter *adapter)
 ERR_BAD_DB_PIDX2 | ERR_BAD_DB_PIDX1 |
 ERR_BAD_DB_PIDX0 | ERR_ING_CTXT_PRIO |
 ERR_EGR_CTXT_PRIO | INGRESS_SIZE_ERR |
-   

Re: [PATCH] opensm: improve search common pkeys.

2012-09-05 Thread Alex Netes
Hi Daniel,

On 17:07 Wed 18 Jul , Daniel Klein wrote:
 improving runtime of the common pkey search code to O(n).
 
 Signed-off-by: Daniel Klein dani...@mellanox.com
 ---

Applied after removing unused variables. Thanks.


Re: [PATCH][MINOR] opensm/osm_vendor_ibumad.c: Add management class to error log message

2012-09-05 Thread Alex Netes
Hi Hal,

On 02:27 Thu 09 Aug , Hal Rosenstock wrote:
 
 Signed-off-by: Hal Rosenstock h...@mellanox.com
 ---

Applied, thanks.


Re: [PATCH] opensm/osm_sw_info_rcv.c: Fixed locking issue on osm_get_node_by_guid error

2012-09-05 Thread Alex Netes
Hi Hal,

On 14:10 Tue 28 Aug , Hal Rosenstock wrote:
 
 Signed-off-by: Hal Rosenstock h...@mellanox.com
 ---

Applied, thanks.


Re: [PATCH] OpenSM: Add new Mellanox OUI

2012-09-05 Thread Alex Netes
Hi Hal,

On 08:18 Tue 04 Sep , Hal Rosenstock wrote:
 
 Signed-off-by: Hal Rosenstock h...@mellanox.com
 ---

Applied, thanks.


Re: [PATCH for-next V2 01/22] IB/core: Reserve bits in enum ib_qp_create_flags for low-level driver use

2012-09-05 Thread Doug Ledford
On 8/3/2012 4:40 AM, Jack Morgenstein wrote:
 Reserve bits 26-31 for internal use by low-level drivers. Two
 such bits are used in the mlx4 driver SRIOV IB implementation.
 
 These enum additions guarantee that the core layer will never use
 these bits, so that low level drivers may safely make use of them.
 
 Signed-off-by: Jack Morgenstein ja...@dev.mellanox.co.il
 ---
  include/rdma/ib_verbs.h |3 +++
  1 files changed, 3 insertions(+), 0 deletions(-)
 
 diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
 index 07996af..46bc045 100644
 --- a/include/rdma/ib_verbs.h
 +++ b/include/rdma/ib_verbs.h
 @@ -614,6 +614,9 @@ enum ib_qp_type {
  enum ib_qp_create_flags {
  IB_QP_CREATE_IPOIB_UD_LSO   = 1 << 0,
  IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK   = 1 << 1,
 + /* reserve bits 26-31 for low level drivers' internal use */
 + IB_QP_CREATE_RESERVED_START = 1 << 26,
 + IB_QP_CREATE_RESERVED_END   = 1 << 31,
  };
  
  struct ib_qp_init_attr {
 

Reserving 6 bits for driver use out of 32 seems reasonable.
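
For illustration, a low-level driver would then carve its private flags out
of that reserved window; a made-up example (not the actual mlx4 flag names):

enum {
	/* hypothetical driver-private QP create flags, bits 26-31 only */
	MYDRV_QP_CREATE_TUNNEL	= 1 << 30,
	MYDRV_QP_CREATE_PROXY	= 1 << 31,
};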

Acked-by: Doug Ledford dledf...@redhat.com

-- 
Doug Ledford dledf...@redhat.com
  GPG KeyID: 0E572FDD
  http://people.redhat.com/dledford

Infiniband specific RPMs available at
  http://people.redhat.com/dledford/Infiniband





Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 29 Aug 2012, Atchley, Scott wrote:

 I am benchmarking a sockets based application and I want a sanity check
 on IPoIB performance expectations when using connected mode (65520 MTU).
 I am using the tuning tips in Documentation/infiniband/ipoib.txt. The
 machines have Mellanox QDR cards (see below for the verbose ibv_devinfo
 output). I am using a 2.6.36 kernel. The hosts have single socket Intel
 E5520 (4 core with hyper-threading on) at 2.27 GHz.

 I am using netperf's TCP_STREAM and binding cores. The best I have seen
 is ~13 Gbps. Is this the best I can expect from these cards?

Sounds about right. This is not a hardware limitation but
a limitation of the socket I/O layer / PCI-E bus. The cards generally can
process more data than the PCI bus and the OS can handle.

PCI-E 2.0 should give you up to about 2.3 GBytes/sec with these
NICs. So there is likely something that the network layer does to you that
limits the bandwidth.

 What should I expect as a max for ipoib with FDR cards?

More of the same. You may want to:

A) increase the block size handled by the socket layer

B) Increase the bandwidth by using PCI-E 3 or more PCI-E lanes.

C) Bypass the socket layer. Look at Sean's rsockets layer, e.g. via the
preload shim sketched below.
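
A rough illustration of the rsockets preload route (the library name and
whether your benchmark runs unchanged under it are assumptions):

# run an unmodified sockets benchmark over rsockets via the preload shim
LD_PRELOAD=librspreload.so netperf -H <server> -t TCP_STREAM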


[PATCH] IB: new module params. cm_response_timeout, max_cm_retries

2012-09-05 Thread Dongsu Park
Create two kernel module parameters in order to make these variables
configurable, i.e. cma_cm_response_timeout for the CM response timeout
and cma_max_cm_retries for the number of retries.

They can now be configured on the module command line.
For example:
# modprobe ib_srp cma_cm_response_timeout=30 cma_max_cm_retries=60
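
Since the parameters are registered read-only (mode 0444), the effective
values should also be readable back from sysfs, along these lines (exact
module/parameter paths assumed, not verified):

# cat /sys/module/ib_srp/parameters/cma_cm_response_timeout
# cat /sys/module/rdma_cm/parameters/cma_max_cm_retries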

Signed-off-by: Dongsu Park dongsu.p...@profitbricks.com
Reviewed-by: Sebastian Riemer sebastian.rie...@profitbricks.com
---
 drivers/infiniband/core/cma.c   | 21 ++---
 drivers/infiniband/ulp/ipoib/ipoib_cm.c | 15 ---
 drivers/infiniband/ulp/srp/ib_srp.c | 14 +++---
 include/rdma/ib_cm.h|  3 +++
 4 files changed, 40 insertions(+), 13 deletions(-)

diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c
index 7172559..1d7771f 100644
--- a/drivers/infiniband/core/cma.c
+++ b/drivers/infiniband/core/cma.c
@@ -59,11 +59,18 @@ MODULE_AUTHOR("Sean Hefty");
 MODULE_DESCRIPTION("Generic RDMA CM Agent");
 MODULE_LICENSE("Dual BSD/GPL");
 
-#define CMA_CM_RESPONSE_TIMEOUT 20
-#define CMA_MAX_CM_RETRIES 15
 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24)
 #define CMA_IBOE_PACKET_LIFETIME 18
 
+static unsigned int cma_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
+static unsigned int cma_max_cm_retries = CMA_MAX_CM_RETRIES;
+
+module_param(cma_cm_response_timeout, uint, 0444);
+MODULE_PARM_DESC(cma_cm_response_timeout, "Response timeout for the RDMA Connection Manager. (default is 20)");
+
+module_param(cma_max_cm_retries, uint, 0444);
+MODULE_PARM_DESC(cma_max_cm_retries, "Max number of retries for the RDMA Connection Manager. (default is 15)");
+
 static void cma_add_one(struct ib_device *device);
 static void cma_remove_one(struct ib_device *device);
 
@@ -2587,8 +2594,8 @@ static int cma_resolve_ib_udp(struct rdma_id_private 
*id_priv,
req.path = route->path_rec;
req.service_id = cma_get_service_id(id_priv->id.ps,
(struct sockaddr *) &route->addr.dst_addr);
-   req.timeout_ms = 1 << (CMA_CM_RESPONSE_TIMEOUT - 8);
-   req.max_cm_retries = CMA_MAX_CM_RETRIES;
+   req.timeout_ms = 1 << (cma_cm_response_timeout - 8);
+   req.max_cm_retries = cma_max_cm_retries;
 
ret = ib_send_cm_sidr_req(id_priv->cm_id.ib, &req);
if (ret) {
@@ -2650,9 +2657,9 @@ static int cma_connect_ib(struct rdma_id_private *id_priv,
req.flow_control = conn_param->flow_control;
req.retry_count = conn_param->retry_count;
req.rnr_retry_count = conn_param->rnr_retry_count;
-   req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
-   req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
-   req.max_cm_retries = CMA_MAX_CM_RETRIES;
+   req.remote_cm_response_timeout = cma_cm_response_timeout;
+   req.local_cm_response_timeout = cma_cm_response_timeout;
+   req.max_cm_retries = cma_max_cm_retries;
req.srq = id_priv->srq ? 1 : 0;
 
ret = ib_send_cm_req(id_priv->cm_id.ib, &req);
diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c 
b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
index 24683fd..3b41ab0 100644
--- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c
+++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c
MODULE_PARM_DESC(cm_data_debug_level,
 "Enable data path debug tracing for connected mode if > 0");
 #endif
 
+static unsigned int cma_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT;
+static unsigned int cma_max_cm_retries = CMA_MAX_CM_RETRIES;
+
+module_param(cma_cm_response_timeout, uint, 0444);
+MODULE_PARM_DESC(cma_cm_response_timeout, "Response timeout for the RDMA Connection Manager. (default is 20)");
+
+module_param(cma_max_cm_retries, uint, 0444);
+MODULE_PARM_DESC(cma_max_cm_retries, "Max number of retries for the RDMA Connection Manager. (default is 15)");
+
 #define IPOIB_CM_IETF_ID 0x1000ULL
 
 #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ)
@@ -1055,11 +1064,11 @@ static int ipoib_cm_send_req(struct net_device *dev,
 * module parameters if anyone cared about setting them.
 */
req.responder_resources = 4;
-   req.remote_cm_response_timeout  = 20;
-   req.local_cm_response_timeout   = 20;
+   req.remote_cm_response_timeout  = cma_cm_response_timeout;
+   req.local_cm_response_timeout   = cma_cm_response_timeout;
req.retry_count = 0; /* RFC draft warns against retries */
req.rnr_retry_count = 0; /* RFC draft warns against retries */
-   req.max_cm_retries  = 15;
+   req.max_cm_retries  = cma_max_cm_retries;
req.srq = ipoib_cm_has_srq(dev);
return ib_send_cm_req(id, &req);
 }
diff --git a/drivers/infiniband/ulp/srp/ib_srp.c 
b/drivers/infiniband/ulp/srp/ib_srp.c
index ba7bbfd..13536da 100644
--- a/drivers/infiniband/ulp/srp/ib_srp.c
+++ b/drivers/infiniband/ulp/srp/ib_srp.c
@@ -66,6 +66,8 @@ static 

[PATCH] osmtest/osmt_multicast.c: Fix 02BF error

2012-09-05 Thread Hal Rosenstock

when running osmtest -f m -M 2

Reported-by: Daniel Klein dani...@mellanox.com

Sep 04 20:27:28 920578 [D2499700] 0x02 - osmt_run_mcast_flow: Checking partial 
JoinState delete request - removing NonMember (o15.0.1.14)...
Sep 04 20:27:28 920863 [D2499700] 0x02 - osmt_run_mcast_flow: Validating Join 
State removal of Non Member bit (o15.0.1.14)...
Sep 04 20:27:28 921510 [D2499700] 0x02 - osmt_run_mcast_flow: Validating Join 
State update remove (o15.0.1.14)...
Sep 04 20:27:28 921518 [D2499700] 0x01 - osmt_run_mcast_flow: ERR 02BF: 
Validating JoinState update failed. Expected 0x25 got: 0x20
Sep 04 20:27:28 921527 [D2499700] 0x01 - osmtest_run: ERR 0152: Multicast Flow 
failed: (IB_ERROR)
OSMTEST: TEST Multicast FAIL

Signed-off-by: Hal Rosenstock h...@mellanox.com
---
diff --git a/osmtest/osmt_multicast.c b/osmtest/osmt_multicast.c
index b861ad4..052ab1a 100644
--- a/osmtest/osmt_multicast.c
+++ b/osmtest/osmt_multicast.c
@@ -2096,7 +2096,7 @@ ib_api_status_t osmt_run_mcast_flow(IN osmtest_t * const 
p_osmt)
OSM_LOG(&p_osmt->log, OSM_LOG_INFO,
"Validating Join State update remove (o15.0.1.14)...\n");
 
-   if (p_mc_res->scope_state != 0x25) {    /* scope is MSB - now only 0x0 so port is removed from MCG */
+   if (p_mc_res->scope_state != 0x20) {    /* scope is MSB - now only 0x0 so port is removed from MCG */
OSM_LOG(&p_osmt->log, OSM_LOG_ERROR, "ERR 02BF: "
"Validating JoinState update failed. Expected 0x25 got: 0x%02X\n",
p_mc_res->scope_state);


Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 11:51 AM, Christoph Lameter wrote:

 On Wed, 29 Aug 2012, Atchley, Scott wrote:
 
 I am benchmarking a sockets based application and I want a sanity check
 on IPoIB performance expectations when using connected mode (65520 MTU).
 I am using the tuning tips in Documentation/infiniband/ipoib.txt. The
 machines have Mellanox QDR cards (see below for the verbose ibv_devinfo
 output). I am using a 2.6.36 kernel. The hosts have single socket Intel
 E5520 (4 core with hyper-threading on) at 2.27 GHz.
 
 I am using netperf's TCP_STREAM and binding cores. The best I have seen
 is ~13 Gbps. Is this the best I can expect from these cards?
 
 Sounds about right. This is not a hardware limitation but
 a limitation of the socket I/O layer / PCI-E bus. The cards generally can
 process more data than the PCI bus and the OS can handle.
 
 PCI-E 2.0 should give you up to about 2.3 GBytes/sec with these
 NICs. So there is likely something that the network layer does to you that
 limits the bandwidth.

First, thanks for the reply.

I am not sure where you are getting the 2.3 GB/s value. When using verbs
natively, I can get ~3.4 GB/s. I am assuming that these HCAs lack certain TCP
offloads that might allow higher socket performance. ethtool reports:

# ethtool -k ib0
Offload parameters for ib0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: off

There is no checksum support which I would expect to lower performance. Since 
checksums need to be calculated in the host, I would expect faster processors 
to help performance some.

So basically, am I in the ball park given this hardware?

 
 What should I expect as a max for ipoib with FDR cards?
 
 More of the same. You may want to
 
 A) increase the block size handled by the socket layer

Do you mean altering sysctl with something like:

# increase TCP max buffer size setable using setsockopt()
net.core.rmem_max = 16777216 
net.core.wmem_max = 16777216 
# increase Linux autotuning TCP buffer limit 
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# increase the length of the processor input queue
net.core.netdev_max_backlog = 3

or increasing the SO_SNDBUF and SO_RCVBUF sizes, or something else?
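
For concreteness, the per-socket variant I have in mind is something like
the sketch below (the buffer size is just an example value):

/* sketch only: bump per-socket buffer sizes before connect()/accept() */
#include <sys/socket.h>

static int set_socket_bufs(int fd, int bytes)
{
	if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &bytes, sizeof(bytes)) < 0)
		return -1;
	return setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &bytes, sizeof(bytes));
}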

 B) Increase the bandwidth by using PCI-E 3 or more PCI-E lanes.
 
 C) Bypass the socket layer. Look at Sean's rsockets layer f.e.

We actually want to test the socket stack and not bypass it.

Thanks again!

Scott



Re: IPoIB performance

2012-09-05 Thread Reeted

On 08/29/12 21:35, Atchley, Scott wrote:

Hi all,

I am benchmarking a sockets based application and I want a sanity check on 
IPoIB performance expectations when using connected mode (65520 MTU).


I have read that with newer cards the datagram (unconnected) mode is 
faster at IPoIB than connected mode. Do you want to check?


What benchmark program are you using?


Re: IPoIB performance

2012-09-05 Thread Reeted

On 09/05/12 17:51, Christoph Lameter wrote:

PCI-E 2.0 should give you up to about 2.3 GBytes/sec with these
NICs. So there is likely something that the network layer does to you that
limits the bandwidth.


I think those are 8-lane PCI-e 2.0, so that would be 500 MB/sec x 8, i.e.
4 GBytes/sec. Or do you really mean there is almost 50% overhead?



Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 1:50 PM, Reeted wrote:

 On 08/29/12 21:35, Atchley, Scott wrote:
 Hi all,
 
 I am benchmarking a sockets based application and I want a sanity check on 
 IPoIB performance expectations when using connected mode (65520 MTU).
 
 I have read that with newer cards the datagram (unconnected) mode is 
 faster at IPoIB than connected mode. Do you want to check?

I have read that the latency is lower (better) but the bandwidth is lower.

Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on
these machines/cards. Connected mode at the same MTU performs roughly the same.
The win in connected mode comes with larger MTUs. With a 9000 MTU, I see ~6
Gb/s. Pushing the MTU to 65520 (the maximum for IPoIB), I can get ~13 Gb/s.
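
(For reference, the connected-mode / large-MTU switch is just the usual
ipoib.txt recipe, roughly:)

echo connected > /sys/class/net/ib0/mode
ip link set ib0 mtu 65520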

 What benchmark program are you using?

netperf with process binding (-T). I tune sysctl per the DOE FasterData specs:

http://fasterdata.es.net/host-tuning/linux/
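
The invocation is along these lines (the CPU numbers come from the binding
sweep, not a recommendation):

netperf -H <server> -T 4,4 -t TCP_STREAM -l 30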

Scott


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

 # ethtool -k ib0
 Offload parameters for ib0:
 rx-checksumming: off
 tx-checksumming: off
 scatter-gather: off
 tcp segmentation offload: off
 udp fragmentation offload: off
 generic segmentation offload: on
 generic-receive-offload: off

 There is no checksum support which I would expect to lower performance.
 Since checksums need to be calculated in the host, I would expect faster
 processors to help performance some.

K that is a major problem. Both are on by default here. What NIC is this?

  A) increase the block size handled by the socket layer

 Do you mean altering sysctl with something like:

Nope, increase the MTU. Connected mode supports up to a 64K MTU size, I believe.

 or increasing the SO_SNDBUF and SO_RCVBUF sizes, or something else?

That does nothing for performance. The problem is that the handling of the
data by the kernel causes too much latency so that you cannot reach the
full bw of the hardware.

 We actually want to test the socket stack and not bypass it.

AFAICT the network stack is useful up to 1Gbps and
after that more and more band-aid comes into play.


Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 2:20 PM, Christoph Lameter wrote:

 On Wed, 5 Sep 2012, Atchley, Scott wrote:
 
 # ethtool -k ib0
 Offload parameters for ib0:
 rx-checksumming: off
 tx-checksumming: off
 scatter-gather: off
 tcp segmentation offload: off
 udp fragmentation offload: off
 generic segmentation offload: on
 generic-receive-offload: off
 
 There is no checksum support which I would expect to lower performance.
 Since checksums need to be calculated in the host, I would expect faster
 processors to help performance some.
 
 K that is a major problem. Both are on by default here. What NIC is this?

These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of 
ibv_devinfo is in my original post.

 A) increase the block size handled by the socket layer
 
 Do you mean altering sysctl with something like:
 
 Nope, increase the MTU. Connected mode supports up to a 64K MTU size, I believe.

Yes, I am using the max MTU (65520).

 or increasing the SO_SNDBUF and SO_RCVBUF sizes, or something else?
 
 That does nothing for performance. The problem is that the handling of the
 data by the kernel causes too much latency so that you cannot reach the
 full bw of the hardware.
 
 We actually want to test the socket stack and not bypass it.
 
 AFAICT the network stack is useful up to 1Gbps and
 after that more and more band-aid comes into play.

Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 40G 
Ethernet NICs, but I hope that they will get close to line rate. If not, what 
is the point? ;-)

Scott


Re: IPoIB performance

2012-09-05 Thread Reeted

On 09/05/12 19:59, Atchley, Scott wrote:

On Sep 5, 2012, at 1:50 PM, Reeted wrote:



I have read that with newer cards the datagram (unconnected) mode is
faster at IPoIB than connected mode. Do you want to check?

I have read that the latency is lower (better) but the bandwidth is lower.

Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on
these machines/cards. Connected mode at the same MTU performs roughly the same.
The win in connected mode comes with larger MTUs. With a 9000 MTU, I see ~6
Gb/s. Pushing the MTU to 65520 (the maximum for IPoIB), I can get ~13 Gb/s.



Have a look at an old thread in this ML by Sebastien Dugue, "IPoIB to
Ethernet routing performance".
He had numbers much higher than yours on similar hardware, and was
advised to use datagram mode to achieve offloading and even higher speeds.
Keep me informed if you can fix this; I am interested but can't test
InfiniBand myself right now.



Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

  AFAICT the network stack is useful up to 1Gbps and
  after that more and more band-aid comes into play.

 Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 
 40G Ethernet NICs, but I hope that they will get close to line rate. If not, 
 what is the point? ;-)

Oh yes they can, under restricted circumstances: large packets, multiple
cores etc. With the band-aids...



Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

 These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of 
 ibv_devinfo is in my original post.

Hmmm... You are running an old kernel. What version of OFED do you use?




Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 3:04 PM, Reeted wrote:

 On 09/05/12 19:59, Atchley, Scott wrote:
 On Sep 5, 2012, at 1:50 PM, Reeted wrote:
 
 
 I have read that with newer cards the datagram (unconnected) mode is
 faster at IPoIB than connected mode. Do you want to check?
 I have read that the latency is lower (better) but the bandwidth is lower.
 
 Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on
 these machines/cards. Connected mode at the same MTU performs roughly the
 same. The win in connected mode comes with larger MTUs. With a 9000 MTU, I
 see ~6 Gb/s. Pushing the MTU to 65520 (the maximum for IPoIB), I can get
 ~13 Gb/s.
 
 
 Have a look at an old thread in this ML by Sebastien Dugue, "IPoIB to
 Ethernet routing performance".
 He had numbers much higher than yours on similar hardware, and was
 advised to use datagram mode to achieve offloading and even higher speeds.
 Keep me informed if you can fix this; I am interested but can't test
 InfiniBand myself right now.

He claims 20 Gb/s, and Or replies that one should also get near 20 Gb/s using
datagram mode. I checked, and datagram mode shows support for more offloads via
ethtool. In my case, I still see better performance with connected mode.

Thanks,

Scott


Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:

 On Wed, 5 Sep 2012, Atchley, Scott wrote:
 
 AFAICT the network stack is useful up to 1Gbps and
 after that more and more band-aid comes into play.
 
 Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 
 40G Ethernet NICs, but I hope that they will get close to line rate. If not, 
 what is the point? ;-)
 
 Oh yes they can under restricted circumstances. Large packets, multiple
 cores etc. With the band-aids….

With Myricom 10G NICs, for example, you just need one core and it can do line 
rate with 1500 byte MTU. Do you count the stateless offloads as band-aids? Or 
something else?

I have not tested any 40G NICs yet, but I imagine that one core will not be 
enough.

Thanks,

Scott


Re: IPoIB performance

2012-09-05 Thread Atchley, Scott

On Sep 5, 2012, at 3:13 PM, Christoph Lameter wrote:

 On Wed, 5 Sep 2012, Atchley, Scott wrote:
 
 These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of 
 ibv_devinfo is in my original post.
 
 Hmmm... You are running an old kernel. What version of OFED do you use?

Hah, if you think my kernel is old, you should see my userland (RHEL5.5). ;-)

Does the version of OFED impact the kernel modules? I am using the modules that 
came with the kernel. I don't believe that libibverbs or librdmacm are used by 
the kernel's socket stack. That said, I am using source builds with tags 
libibverbs-1.1.6 and v1.0.16 (librdmacm).

Scott


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

 With Myricom 10G NICs, for example, you just need one core and it can do
 line rate with 1500 byte MTU. Do you count the stateless offloads as
 band-aids? Or something else?

The stateless aids also have certain limitations. It's a grey zone whether you
want to call them band-aids. It gets there at some point, because stateless
offload can only get you so far. The need to send larger-sized packets
through the kernel increases the latency and forces the app to do larger
batching. It's not very useful if you need to send small packets to a
variety of receivers.



Re: IPoIB performance

2012-09-05 Thread Ezra Kissel

On 9/5/2012 3:48 PM, Atchley, Scott wrote:

On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:


On Wed, 5 Sep 2012, Atchley, Scott wrote:


AFAICT the network stack is useful up to 1Gbps and
after that more and more band-aid comes into play.


Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 40G 
Ethernet NICs, but I hope that they will get close to line rate. If not, what 
is the point? ;-)


Oh yes they can under restricted circumstances. Large packets, multiple
cores etc. With the band-aids….


With Myricom 10G NICs, for example, you just need one core and it can do line 
rate with 1500 byte MTU. Do you count the stateless offloads as band-aids? Or 
something else?

I have not tested any 40G NICs yet, but I imagine that one core will not be 
enough.

Since you are using netperf, you might also consider experimenting
with the TCP_SENDFILE test.  Using sendfile/splice calls can have a
significant impact for sockets-based apps.
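
Something along these lines, for example (the fill file is just any large
file for netperf to source data from; path and sizes are arbitrary):

dd if=/dev/zero of=/tmp/fill bs=1M count=1024
netperf -H <server> -t TCP_SENDFILE -l 30 -F /tmp/fill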


Using 40G NICs (Mellanox ConnectX-3 EN), I've seen our applications hit 
22Gb/s single core/stream while fully CPU bound.  With sendfile/splice, 
there is no issue saturating a 40G link with about 40-50% core 
utilization.  That being said, binding to the right core/node, message 
size and memory alignment, interrupt handling, and proper host/NIC 
tuning all have an impact on the performance.  The state of 
high-performance networking is certainly not plug-and-play.
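
For the curious, the sender side of that path boils down to something like
this simplified sendfile(2) sketch (error reporting omitted):

/* stream a whole file over an already-connected TCP socket */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static int send_whole_file(int sock, const char *path)
{
	struct stat st;
	off_t off = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	while (off < st.st_size) {
		ssize_t n = sendfile(sock, fd, &off, st.st_size - off);
		if (n <= 0)
			break;
	}
	close(fd);
	return off == st.st_size ? 0 : -1;
}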


- ezra


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

  Hmmm... You are running an old kernel. What version of OFED do you
  use?

 Hah, if you think my kernel is old, you should see my userland
 (RHEL5.5). ;-)

My condolences.

 Does the version of OFED impact the kernel modules? I am using the
 modules that came with the kernel. I don't believe that libibverbs or
 librdmacm are used by the kernel's socket stack. That said, I am using
 source builds with tags libibverbs-1.1.6 and v1.0.16 (librdmacm).

OFED includes kernel modules which provide the drivers that you need.
Installing a new OFED release on RHEL5 is possible and would give you
up-to-date drivers. Check with RH: they may have them somewhere easy to
install for your version of RHEL.



Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 4:12 PM, Ezra Kissel wrote:

 On 9/5/2012 3:48 PM, Atchley, Scott wrote:
 On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:
 
 On Wed, 5 Sep 2012, Atchley, Scott wrote:
 
 AFAICT the network stack is useful up to 1Gbps and
 after that more and more band-aid comes into play.
 
 Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 
 40G Ethernet NICs, but I hope that they will get close to line rate. If 
 not, what is the point? ;-)
 
 Oh yes they can under restricted circumstances. Large packets, multiple
 cores etc. With the band-aids….
 
 With Myricom 10G NICs, for example, you just need one core and it can do 
 line rate with 1500 byte MTU. Do you count the stateless offloads as 
 band-aids? Or something else?
 
 I have not tested any 40G NICs yet, but I imagine that one core will not be 
 enough.
 
 Since you are using netperf, you might also consider experimenting
 with the TCP_SENDFILE test.  Using sendfile/splice calls can have a 
 significant impact for sockets-based apps.
 
 Using 40G NICs (Mellanox ConnectX-3 EN), I've seen our applications hit 
 22Gb/s single core/stream while fully CPU bound.  With sendfile/splice, 
 there is no issue saturating a 40G link with about 40-50% core 
 utilization.  That being said, binding to the right core/node, message 
 size and memory alignment, interrupt handling, and proper host/NIC 
 tuning all have an impact on the performance.  The state of 
 high-performance networking is certainly not plug-and-play.

Thanks for the tip. The app we want to test does not use sendfile() or splice().

I do bind to the best core (determined by testing all combinations on client 
and server).

I have heard others within DOE reach ~16 Gb/s on a 40G Mellanox NIC. I'm glad 
to hear that you got to 22 Gb/s for a single stream. That is more reassuring.

Scott


RE: rsocket library and dup2()

2012-09-05 Thread Hefty, Sean
 I found the following code in dup2():
 
   oldfdi = idm_lookup(idm, oldfd);
   if (oldfdi && oldfdi->type == fd_fork)
   fork_passive(oldfd);
 
 In that code the file descriptor type (type) is compared with a fork
 state enum value (fd_fork). Is that on purpose ??

On purpose?  I'll have to go with no.

It's a bug resulting from working on different patches at the same time and not 
updating the dup2 patch to account for the change where I added a state to the 
fd.  Thanks for catching this.  I'll fix it up later this week or early next 
week and push out an update.

- Sean