Re: SoftiWARP: new patchset
Hi Bernard,

> many thanks for commenting on the software iWARP RDMA driver code I sent
> about 5 weeks ago. I hope I have now incorporated all recent suggestions
> and fixes. These are the main changes:
>
> o changing siw device attachment to be dynamic based on netlink events
> o enabling inline data for kernel clients (now inline data are stored
>   within the wqe structure)
> o bitmask access to packet headers, removing the
>   '#if defined(__LITTLE_ENDIAN_BITFIELD)' style
> o moving debug stuff to debugfs
> o shrinking the stack size of the core tx function
> o updates to documentation
>
> Due to the number of lines of code changed, it might be appropriate if I
> send a complete new patchset. I'll keep patch packaging as before.
>
> I made the current siw code available at www.gitorious.org/softiwarp and
> will keep it up to date. This code is now free of kernel version
> dependencies. It has been tested on different kernel versions from
> 2.6.36.2 up to 3.0.0-rc1+. I tested on both big and little endian
> machines.
>
> I would be very happy to get further input. I work on this project only
> part of my time - all your input speeds up code maturing. I would be
> happy if the code reaches an acceptable status soon. Thank you.

I'm wondering what the status of this patchset is? Did you get any feedback?

It would be very good if the softiwarp module could be included in the
mainline kernel.

Hopefully Samba will get SMB-Direct (SMB2 over RDMA) support in future. And
having RDMA support without RDMA hardware would be very good for testing,
and also for usage on the client side against a server which has hardware
RDMA support.

For me at least, rping tests work fine using siw.ko on a 3.2.x kernel on
Ubuntu 12.04.

metze
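The exact rping invocation used for the test above is not shown; for anyone
who wants to reproduce it against a siw device, a typical run looks roughly
like this (the address is a placeholder for the IP bound to the siw-attached
interface):

# server side, verbose output, 10 iterations
rping -s -a 192.168.0.1 -v -C 10
# client side
rping -c -a 192.168.0.1 -v -C 10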
Re: [patch v2 00/37] add rxe (soft RoCE)
Hi Bob,

Am 24.07.2011 21:43, schrieb rpearson-klaocwyjdxkshymvu7je4pqqe7ycj...@public.gmane.org:
> Changes in v2 include:
> - Updated to Roland's tree as of 7/24/2011
> - Moved the crc32 algorithm into a patch (slice-by-8-for_crc32.c.diff)
>   that goes into the mainline kernel. It has been submitted upstream but
>   is also included here since it is required to build the driver.
> - Renamed rxe_sb8.c to rxe_icrc.c since that is all it now does.
> - Cleaned up warnings from checkpatch, C=2 and __CHECK_ENDIAN__.
> - Moved small .h files into rxe_loc.h
> - Rewrote the Kconfig text to be a little friendlier
> - Changed the patch names to make them easier to handle.
> - The quilt patch series is online at:
>   http://support.systemfabricworks.com/downloads/rxe/patches-v2.tgz
> - librxe is online at:
>   http://support.systemfabricworks.com/downloads/rxe/librxe-1.0.0.tar.gz
>
> Thanks to Roland Dreier, Bart van Assche and David Dillow for helpful
> suggestions.

I'm wondering what the status of this patchset is?

It would be very good if the rxe modules could be included in the mainline
kernel.

Hopefully Samba will get SMB-Direct (SMB2 over RDMA) support in future. And
having RDMA support without RDMA hardware would be very good for testing,
and also for usage on the client side against a server which has hardware
RDMA support.

Are there git repositories yet (for the kernel and userspace)?

metze
[PATCH V2 2/2] cxgb4: Remove duplicate register definitions
Removed duplicate definition for SGE_PF_KDOORBELL, SGE_INT_ENABLE3, PCIE_MEM_ACCESS_OFFSET registers. Moved the register field definitions around the register definition. Signed-off-by: Santosh Rastapur sant...@chelsio.com Signed-off-by: Vipul Pandya vi...@chelsio.com Reviewed-by: Sivakumar Subramani siv...@chelsio.com --- V2: Changed the order of the patch in patch series to avoid build failure between two changes. drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c | 18 drivers/net/ethernet/chelsio/cxgb4/sge.c|4 +- drivers/net/ethernet/chelsio/cxgb4/t4_hw.c | 12 +++--- drivers/net/ethernet/chelsio/cxgb4/t4_regs.h| 54 +- 4 files changed, 30 insertions(+), 58 deletions(-) diff --git a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c index 4a20821..5497eaa 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c +++ b/drivers/net/ethernet/chelsio/cxgb4/cxgb4_main.c @@ -2470,8 +2470,8 @@ int cxgb4_sync_txq_pidx(struct net_device *dev, u16 qid, u16 pidx, else delta = size - hw_pidx + pidx; wmb(); - t4_write_reg(adap, MYPF_REG(A_SGE_PF_KDOORBELL), -V_QID(qid) | V_PIDX(delta)); + t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL), +QID(qid) | PIDX(delta)); } out: return ret; @@ -2579,8 +2579,8 @@ static void sync_txq_pidx(struct adapter *adap, struct sge_txq *q) else delta = q-size - hw_pidx + q-db_pidx; wmb(); - t4_write_reg(adap, MYPF_REG(A_SGE_PF_KDOORBELL), - V_QID(q-cntxt_id) | V_PIDX(delta)); + t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL), +QID(q-cntxt_id) | PIDX(delta)); } out: q-db_disabled = 0; @@ -2617,9 +2617,9 @@ static void process_db_full(struct work_struct *work) notify_rdma_uld(adap, CXGB4_CONTROL_DB_FULL); drain_db_fifo(adap, dbfifo_drain_delay); - t4_set_reg_field(adap, A_SGE_INT_ENABLE3, - F_DBFIFO_HP_INT | F_DBFIFO_LP_INT, - F_DBFIFO_HP_INT | F_DBFIFO_LP_INT); + t4_set_reg_field(adap, SGE_INT_ENABLE3, +DBFIFO_HP_INT | DBFIFO_LP_INT, +DBFIFO_HP_INT | DBFIFO_LP_INT); notify_rdma_uld(adap, CXGB4_CONTROL_DB_EMPTY); } @@ -2639,8 +2639,8 @@ static void process_db_drop(struct work_struct *work) void t4_db_full(struct adapter *adap) { - t4_set_reg_field(adap, A_SGE_INT_ENABLE3, - F_DBFIFO_HP_INT | F_DBFIFO_LP_INT, 0); + t4_set_reg_field(adap, SGE_INT_ENABLE3, +DBFIFO_HP_INT | DBFIFO_LP_INT, 0); queue_work(workq, adap-db_full_task); } diff --git a/drivers/net/ethernet/chelsio/cxgb4/sge.c b/drivers/net/ethernet/chelsio/cxgb4/sge.c index d49933e..1fde57d 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/sge.c +++ b/drivers/net/ethernet/chelsio/cxgb4/sge.c @@ -769,8 +769,8 @@ static inline void ring_tx_db(struct adapter *adap, struct sge_txq *q, int n) wmb();/* write descriptors before telling HW */ spin_lock(q-db_lock); if (!q-db_disabled) { - t4_write_reg(adap, MYPF_REG(A_SGE_PF_KDOORBELL), -V_QID(q-cntxt_id) | V_PIDX(n)); + t4_write_reg(adap, MYPF_REG(SGE_PF_KDOORBELL), +QID(q-cntxt_id) | PIDX(n)); } q-db_pidx = q-pidx; spin_unlock(q-db_lock); diff --git a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c index af16013..dccecdc 100644 --- a/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c +++ b/drivers/net/ethernet/chelsio/cxgb4/t4_hw.c @@ -1018,9 +1018,9 @@ static void sge_intr_handler(struct adapter *adapter) { ERR_INVALID_CIDX_INC, SGE GTS CIDX increment too large, -1, 0 }, { ERR_CPL_OPCODE_0, SGE received 0-length CPL, -1, 0 }, - { F_DBFIFO_LP_INT, NULL, -1, 0, t4_db_full }, - { F_DBFIFO_HP_INT, NULL, -1, 0, t4_db_full }, - { F_ERR_DROPPED_DB, NULL, -1, 0, t4_db_dropped }, + { DBFIFO_LP_INT, NULL, -1, 0, 
t4_db_full }, + { DBFIFO_HP_INT, NULL, -1, 0, t4_db_full }, + { ERR_DROPPED_DB, NULL, -1, 0, t4_db_dropped }, { ERR_DATA_CPL_ON_HIGH_QID1 | ERR_DATA_CPL_ON_HIGH_QID0, SGE IQID 1023 received CPL for FL, -1, 0 }, { ERR_BAD_DB_PIDX3, SGE DBP 3 pidx increment too large, -1, @@ -1520,7 +1520,7 @@ void t4_intr_enable(struct adapter *adapter) ERR_BAD_DB_PIDX2 | ERR_BAD_DB_PIDX1 | ERR_BAD_DB_PIDX0 | ERR_ING_CTXT_PRIO | ERR_EGR_CTXT_PRIO | INGRESS_SIZE_ERR | -
Re: [PATCH] opensm: improve search common pkeys.
Hi Daniel,

On 17:07 Wed 18 Jul, Daniel Klein wrote:
> Improving runtime of the common pkeys search code to O(n).
>
> Signed-off-by: Daniel Klein dani...@mellanox.com
> ---

Applied after removing unused variables. Thanks.
Re: [PATCH][MINOR] opensm/osm_vendor_ibumad.c: Add management class to error log message
Hi Hal,

On 02:27 Thu 09 Aug, Hal Rosenstock wrote:
> Signed-off-by: Hal Rosenstock h...@mellanox.com
> ---

Applied, thanks.
Re: [PATCH] opensm/osm_sw_info_rcv.c: Fixed locking issue on osm_get_node_by_guid error
Hi Hal,

On 14:10 Tue 28 Aug, Hal Rosenstock wrote:
> Signed-off-by: Hal Rosenstock h...@mellanox.com
> ---

Applied, thanks.
Re: [PATCH] OpenSM: Add new Mellanox OUI
Hi Hal,

On 08:18 Tue 04 Sep, Hal Rosenstock wrote:
> Signed-off-by: Hal Rosenstock h...@mellanox.com
> ---

Applied, thanks.
Re: [PATCH for-next V2 01/22] IB/core: Reserve bits in enum ib_qp_create_flags for low-level driver use
On 8/3/2012 4:40 AM, Jack Morgenstein wrote:
> Reserve bits 26-31 for internal use by low-level drivers. Two such bits
> are used in the mlx4 driver SRIOV IB implementation.
>
> These enum additions guarantee that the core layer will never use these
> bits, so that low level drivers may safely make use of them.
>
> Signed-off-by: Jack Morgenstein ja...@dev.mellanox.co.il
> ---
>  include/rdma/ib_verbs.h |    3 +++
>  1 files changed, 3 insertions(+), 0 deletions(-)
>
> diff --git a/include/rdma/ib_verbs.h b/include/rdma/ib_verbs.h
> index 07996af..46bc045 100644
> --- a/include/rdma/ib_verbs.h
> +++ b/include/rdma/ib_verbs.h
> @@ -614,6 +614,9 @@ enum ib_qp_type {
>  enum ib_qp_create_flags {
>  	IB_QP_CREATE_IPOIB_UD_LSO		= 1 << 0,
>  	IB_QP_CREATE_BLOCK_MULTICAST_LOOPBACK	= 1 << 1,
> +	/* reserve bits 26-31 for low level drivers' internal use */
> +	IB_QP_CREATE_RESERVED_START		= 1 << 26,
> +	IB_QP_CREATE_RESERVED_END		= 1 << 31,
>  };
>
>  struct ib_qp_init_attr {

Reserving 6 bits for driver use out of 32 seems reasonable.

Acked-by: Doug Ledford dledf...@redhat.com

--
Doug Ledford dledf...@redhat.com
GPG KeyID: 0E572FDD
http://people.redhat.com/dledford

Infiniband specific RPMs available at
http://people.redhat.com/dledford/Infiniband
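To illustrate how a low-level driver is expected to consume the reserved
range (an illustration only, not code from the mlx4 patches; the flag names
here are made up):

	/* hypothetical driver-private QP create flags, placed in the range
	 * that the core now reserves for low-level drivers */
	enum my_drv_qp_create_flags {
		MY_DRV_QP_CREATE_TUNNEL_QP = IB_QP_CREATE_RESERVED_START,      /* bit 26 */
		MY_DRV_QP_CREATE_SQP_TYPE  = IB_QP_CREATE_RESERVED_START << 1, /* bit 27 */
	};

Since the core layer guarantees it will never set or interpret bits 26-31,
the driver can test these flags in its create_qp handler without risk of
colliding with future core flags.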
Re: IPoIB performance
On Wed, 29 Aug 2012, Atchley, Scott wrote:

> I am benchmarking a sockets based application and I want a sanity check on
> IPoIB performance expectations when using connected mode (65520 MTU). I am
> using the tuning tips in Documentation/infiniband/ipoib.txt.
>
> The machines have Mellanox QDR cards (see below for the verbose
> ibv_devinfo output). I am using a 2.6.36 kernel. The hosts have single
> socket Intel E5520 (4 core with hyper-threading on) at 2.27 GHz.
>
> I am using netperf's TCP_STREAM and binding cores. The best I have seen is
> ~13 Gbps. Is this the best I can expect from these cards?

Sounds about right. This is not a hardware limitation but a limitation of
the socket I/O layer / PCI-E bus. The cards generally can process more data
than the PCI bus and the OS can handle. PCI-E on PCI 2.0 should give you up
to about 2.3 Gbytes/sec with these nics. So there is likely something that
the network layer does to you that limits the bandwidth.

> What should I expect as a max for ipoib with FDR cards?

More of the same. You may want to

A) increase the block size handled by the socket layer

B) Increase the bandwidth by using PCI-E 3 or more PCI-E lanes.

C) Bypass the socket layer. Look at Sean's rsockets layer f.e.
[PATCH] IB: new module params. cm_response_timeout, max_cm_retries
Create two kernel parameters, in order to make variables configurable. i.e. cma_cm_response_timeout for CM response timeout, and cma_max_cm_retries for the number of retries. They can now be configured via command line for the kernel modules. For example: # modprobe ib_srp cma_cm_response_timeout=30 cma_max_cm_retries=60 Signed-off-by: Dongsu Park dongsu.p...@profitbricks.com Reviewed-by: Sebastian Riemer sebastian.rie...@profitbricks.com --- drivers/infiniband/core/cma.c | 21 ++--- drivers/infiniband/ulp/ipoib/ipoib_cm.c | 15 --- drivers/infiniband/ulp/srp/ib_srp.c | 14 +++--- include/rdma/ib_cm.h| 3 +++ 4 files changed, 40 insertions(+), 13 deletions(-) diff --git a/drivers/infiniband/core/cma.c b/drivers/infiniband/core/cma.c index 7172559..1d7771f 100644 --- a/drivers/infiniband/core/cma.c +++ b/drivers/infiniband/core/cma.c @@ -59,11 +59,18 @@ MODULE_AUTHOR(Sean Hefty); MODULE_DESCRIPTION(Generic RDMA CM Agent); MODULE_LICENSE(Dual BSD/GPL); -#define CMA_CM_RESPONSE_TIMEOUT 20 -#define CMA_MAX_CM_RETRIES 15 #define CMA_CM_MRA_SETTING (IB_CM_MRA_FLAG_DELAY | 24) #define CMA_IBOE_PACKET_LIFETIME 18 +static unsigned int cma_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; +static unsigned int cma_max_cm_retries = CMA_MAX_CM_RETRIES; + +module_param(cma_cm_response_timeout, uint, 0444); +MODULE_PARM_DESC(cma_cm_response_timeout, Response timeout for the RDMA Connection Manager. (default is 20)); + +module_param(cma_max_cm_retries, uint, 0444); +MODULE_PARM_DESC(cma_max_cm_retries, Max number of retries for the RDMA Connection Manager. (default is 15)); + static void cma_add_one(struct ib_device *device); static void cma_remove_one(struct ib_device *device); @@ -2587,8 +2594,8 @@ static int cma_resolve_ib_udp(struct rdma_id_private *id_priv, req.path = route-path_rec; req.service_id = cma_get_service_id(id_priv-id.ps, (struct sockaddr *) route-addr.dst_addr); - req.timeout_ms = 1 (CMA_CM_RESPONSE_TIMEOUT - 8); - req.max_cm_retries = CMA_MAX_CM_RETRIES; + req.timeout_ms = 1 (cma_cm_response_timeout - 8); + req.max_cm_retries = cma_max_cm_retries; ret = ib_send_cm_sidr_req(id_priv-cm_id.ib, req); if (ret) { @@ -2650,9 +2657,9 @@ static int cma_connect_ib(struct rdma_id_private *id_priv, req.flow_control = conn_param-flow_control; req.retry_count = conn_param-retry_count; req.rnr_retry_count = conn_param-rnr_retry_count; - req.remote_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; - req.local_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; - req.max_cm_retries = CMA_MAX_CM_RETRIES; + req.remote_cm_response_timeout = cma_cm_response_timeout; + req.local_cm_response_timeout = cma_cm_response_timeout; + req.max_cm_retries = cma_max_cm_retries; req.srq = id_priv-srq ? 1 : 0; ret = ib_send_cm_req(id_priv-cm_id.ib, req); diff --git a/drivers/infiniband/ulp/ipoib/ipoib_cm.c b/drivers/infiniband/ulp/ipoib/ipoib_cm.c index 24683fd..3b41ab0 100644 --- a/drivers/infiniband/ulp/ipoib/ipoib_cm.c +++ b/drivers/infiniband/ulp/ipoib/ipoib_cm.c @@ -56,6 +56,15 @@ MODULE_PARM_DESC(cm_data_debug_level, Enable data path debug tracing for connected mode if 0); #endif +static unsigned int cma_cm_response_timeout = CMA_CM_RESPONSE_TIMEOUT; +static unsigned int cma_max_cm_retries = CMA_MAX_CM_RETRIES; + +module_param(cma_cm_response_timeout, uint, 0444); +MODULE_PARM_DESC(cma_cm_response_timeout, Response timeout for the RDMA Connection Manager. (default is 20)); + +module_param(cma_max_cm_retries, uint, 0444); +MODULE_PARM_DESC(cma_max_cm_retries, Max number of retries for the RDMA Connection Manager. 
(default is 15)); + #define IPOIB_CM_IETF_ID 0x1000ULL #define IPOIB_CM_RX_UPDATE_TIME (256 * HZ) @@ -1055,11 +1064,11 @@ static int ipoib_cm_send_req(struct net_device *dev, * module parameters if anyone cared about setting them. */ req.responder_resources = 4; - req.remote_cm_response_timeout = 20; - req.local_cm_response_timeout = 20; + req.remote_cm_response_timeout = cma_cm_response_timeout; + req.local_cm_response_timeout = cma_cm_response_timeout; req.retry_count = 0; /* RFC draft warns against retries */ req.rnr_retry_count = 0; /* RFC draft warns against retries */ - req.max_cm_retries = 15; + req.max_cm_retries = cma_max_cm_retries; req.srq = ipoib_cm_has_srq(dev); return ib_send_cm_req(id, req); } diff --git a/drivers/infiniband/ulp/srp/ib_srp.c b/drivers/infiniband/ulp/srp/ib_srp.c index ba7bbfd..13536da 100644 --- a/drivers/infiniband/ulp/srp/ib_srp.c +++ b/drivers/infiniband/ulp/srp/ib_srp.c @@ -66,6 +66,8 @@ static
[PATCH] osmtest/osmt_multicast.c: Fix 02BF error
when running osmtest -f m -M 2

Reported-by: Daniel Klein dani...@mellanox.com

Sep 04 20:27:28 920578 [D2499700] 0x02 -> osmt_run_mcast_flow: Checking partial JoinState delete request - removing NonMember (o15.0.1.14)...
Sep 04 20:27:28 920863 [D2499700] 0x02 -> osmt_run_mcast_flow: Validating Join State removal of Non Member bit (o15.0.1.14)...
Sep 04 20:27:28 921510 [D2499700] 0x02 -> osmt_run_mcast_flow: Validating Join State update remove (o15.0.1.14)...
Sep 04 20:27:28 921518 [D2499700] 0x01 -> osmt_run_mcast_flow: ERR 02BF: Validating JoinState update failed. Expected 0x25 got: 0x20
Sep 04 20:27:28 921527 [D2499700] 0x01 -> osmtest_run: ERR 0152: Multicast Flow failed: (IB_ERROR)
OSMTEST: TEST Multicast FAIL

Signed-off-by: Hal Rosenstock h...@mellanox.com
---
diff --git a/osmtest/osmt_multicast.c b/osmtest/osmt_multicast.c
index b861ad4..052ab1a 100644
--- a/osmtest/osmt_multicast.c
+++ b/osmtest/osmt_multicast.c
@@ -2096,7 +2096,7 @@ ib_api_status_t osmt_run_mcast_flow(IN osmtest_t * const p_osmt)
 	OSM_LOG(p_osmt->log, OSM_LOG_INFO,
 		"Validating Join State update remove (o15.0.1.14)...\n");

-	if (p_mc_res->scope_state != 0x25) {	/* scope is MSB - now only 0x0 so port is removed from MCG */
+	if (p_mc_res->scope_state != 0x20) {	/* scope is MSB - now only 0x0 so port is removed from MCG */
 		OSM_LOG(p_osmt->log, OSM_LOG_ERROR,
 			"ERR 02BF: Validating JoinState update failed. Expected 0x25 got: 0x%02X\n",
 			p_mc_res->scope_state);
Re: IPoIB performance
On Sep 5, 2012, at 11:51 AM, Christoph Lameter wrote:

> On Wed, 29 Aug 2012, Atchley, Scott wrote:
>
>> I am benchmarking a sockets based application and I want a sanity check
>> on IPoIB performance expectations when using connected mode (65520 MTU).
>> I am using the tuning tips in Documentation/infiniband/ipoib.txt.
>>
>> The machines have Mellanox QDR cards (see below for the verbose
>> ibv_devinfo output). I am using a 2.6.36 kernel. The hosts have single
>> socket Intel E5520 (4 core with hyper-threading on) at 2.27 GHz.
>>
>> I am using netperf's TCP_STREAM and binding cores. The best I have seen
>> is ~13 Gbps. Is this the best I can expect from these cards?
>
> Sounds about right. This is not a hardware limitation but a limitation of
> the socket I/O layer / PCI-E bus. The cards generally can process more
> data than the PCI bus and the OS can handle. PCI-E on PCI 2.0 should give
> you up to about 2.3 Gbytes/sec with these nics. So there is likely
> something that the network layer does to you that limits the bandwidth.

First, thanks for the reply. I am not sure where you are getting the
2.3 GB/s value. When using verbs natively, I can get ~3.4 GB/s.

I am assuming that these HCAs lack certain TCP offloads that might allow
higher socket performance. Ethtool reports:

# ethtool -k ib0
Offload parameters for ib0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: off

There is no checksum support, which I would expect to lower performance.
Since checksums need to be calculated in the host, I would expect faster
processors to help performance some.

So basically, am I in the ball park given this hardware?

>> What should I expect as a max for ipoib with FDR cards?
>
> More of the same. You may want to
>
> A) increase the block size handled by the socket layer

Do you mean altering sysctl with something like:

# increase TCP max buffer size setable using setsockopt()
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limit
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# increase the length of the processor input queue
net.core.netdev_max_backlog = 3

or something increasing the SO_SNDBUF and SO_RCVBUF sizes, or something
else?

> B) Increase the bandwidth by using PCI-E 3 or more PCI-E lanes.
>
> C) Bypass the socket layer. Look at Sean's rsockets layer f.e.

We actually want to test the socket stack and not bypass it.

Thanks again!

Scott
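For completeness, since SO_SNDBUF/SO_RCVBUF came up: requesting larger
socket buffers from the application is a single setsockopt() per direction.
A minimal sketch (the 16 MB value simply mirrors the sysctl limits quoted
above; the kernel caps the effective size at net.core.rmem_max/wmem_max):

/* Sketch: bump socket buffer sizes before connecting. */
#include <sys/types.h>
#include <sys/socket.h>

static int set_big_buffers(int sock)
{
	int sz = 16 * 1024 * 1024;	/* mirrors the 16777216 sysctl limits */

	if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz)) < 0)
		return -1;
	return setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz));
}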
Re: IPoIB performance
On 08/29/12 21:35, Atchley, Scott wrote:

> Hi all,
>
> I am benchmarking a sockets based application and I want a sanity check on
> IPoIB performance expectations when using connected mode (65520 MTU).

I have read that with newer cards the datagram (unconnected) mode is faster
at IPoIB than connected mode. Do you want to check?

What benchmark program are you using?
Re: IPoIB performance
On 09/05/12 17:51, Christoph Lameter wrote:

> PCI-E on PCI 2.0 should give you up to about 2.3 Gbytes/sec with these
> nics. So there is likely something that the network layer does to you that
> limits the bandwidth.

I think those are 8 lane PCI-e 2.0, so that would be 500 MB/sec x 8, i.e.
4 GBytes/sec. Or do you really mean there is almost 50% overhead?
Re: IPoIB performance
On Sep 5, 2012, at 1:50 PM, Reeted wrote:

> On 08/29/12 21:35, Atchley, Scott wrote:
>> Hi all,
>>
>> I am benchmarking a sockets based application and I want a sanity check
>> on IPoIB performance expectations when using connected mode (65520 MTU).
>
> I have read that with newer cards the datagram (unconnected) mode is
> faster at IPoIB than connected mode. Do you want to check?

I have read that the latency is lower (better) but the bandwidth is lower.
Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on
these machines/cards. Connected mode at the same MTU performs roughly the
same.

The win in connected mode comes with larger MTUs. With a 9000 MTU, I see
~6 Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I can get
~13 Gb/s.

> What benchmark program are you using?

netperf with process binding (-T). I tune sysctl per the DOE FasterData
specs:

http://fasterdata.es.net/host-tuning/linux/

Scott
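The exact command line is not shown in the thread; a run like the one
described would look roughly like this (the hostname and core numbers are
placeholders):

# 60 second TCP_STREAM test, netperf/netserver each bound to core 0
netperf -H ib0-peer -T 0,0 -l 60 -t TCP_STREAM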
Re: IPoIB performance
On Wed, 5 Sep 2012, Atchley, Scott wrote:

> # ethtool -k ib0
> Offload parameters for ib0:
> rx-checksumming: off
> tx-checksumming: off
> scatter-gather: off
> tcp segmentation offload: off
> udp fragmentation offload: off
> generic segmentation offload: on
> generic-receive-offload: off
>
> There is no checksum support, which I would expect to lower performance.
> Since checksums need to be calculated in the host, I would expect faster
> processors to help performance some.

Ok, that is a major problem. Both are on by default here. What NIC is this?

>> A) increase the block size handled by the socket layer
>
> Do you mean altering sysctl with something like:

Nope, increase the MTU. Connected mode supports up to 64k MTU size I believe.

> or something increasing the SO_SNDBUF and SO_RCVBUF sizes, or something
> else?

That does nothing for performance. The problem is that the handling of the
data by the kernel causes too much latency so that you cannot reach the full
bw of the hardware.

> We actually want to test the socket stack and not bypass it.

AFAICT the network stack is useful up to 1Gbps and after that more and more
band-aid comes into play.
Re: IPoIB performance
On Sep 5, 2012, at 2:20 PM, Christoph Lameter wrote:

> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>
>> # ethtool -k ib0
>> Offload parameters for ib0:
>> rx-checksumming: off
>> tx-checksumming: off
>> scatter-gather: off
>> tcp segmentation offload: off
>> udp fragmentation offload: off
>> generic segmentation offload: on
>> generic-receive-offload: off
>>
>> There is no checksum support, which I would expect to lower performance.
>> Since checksums need to be calculated in the host, I would expect faster
>> processors to help performance some.
>
> Ok, that is a major problem. Both are on by default here. What NIC is this?

These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of
ibv_devinfo is in my original post.

>>> A) increase the block size handled by the socket layer
>>
>> Do you mean altering sysctl with something like:
>
> Nope, increase the MTU. Connected mode supports up to 64k MTU size I
> believe.

Yes, I am using the max MTU (65520).

>> or something increasing the SO_SNDBUF and SO_RCVBUF sizes, or something
>> else?
>
> That does nothing for performance. The problem is that the handling of the
> data by the kernel causes too much latency so that you cannot reach the
> full bw of the hardware.

>> We actually want to test the socket stack and not bypass it.
>
> AFAICT the network stack is useful up to 1Gbps and after that more and
> more band-aid comes into play.

Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any
40G Ethernet NICs, but I hope that they will get close to line rate. If not,
what is the point? ;-)

Scott
Re: IPoIB performance
On 09/05/12 19:59, Atchley, Scott wrote:

> On Sep 5, 2012, at 1:50 PM, Reeted wrote:
>
>> I have read that with newer cards the datagram (unconnected) mode is
>> faster at IPoIB than connected mode. Do you want to check?
>
> I have read that the latency is lower (better) but the bandwidth is lower.
> Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s
> on these machines/cards. Connected mode at the same MTU performs roughly
> the same.
>
> The win in connected mode comes with larger MTUs. With a 9000 MTU, I see
> ~6 Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I can get
> ~13 Gb/s.

Have a look at an old thread in this ML by Sebastien Dugue, "IPoIB to
Ethernet routing performance". He had numbers much higher than yours on
similar hardware, and was suggested to use datagram mode to achieve
offloading and even higher speeds.

Keep me informed if you can fix this, I am interested but can't test
infiniband myself right now.
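For anyone wanting to run the comparison being suggested here, switching an
IPoIB interface between the two modes is done through sysfs (per
Documentation/infiniband/ipoib.txt); roughly:

# datagram mode (stateless offloads available, MTU limited to 2044)
echo datagram > /sys/class/net/ib0/mode
# connected mode with the large MTU discussed above
echo connected > /sys/class/net/ib0/mode
ifconfig ib0 mtu 65520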
Re: IPoIB performance
On Wed, 5 Sep 2012, Atchley, Scott wrote:

>> AFAICT the network stack is useful up to 1Gbps and after that more and
>> more band-aid comes into play.
>
> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any
> 40G Ethernet NICs, but I hope that they will get close to line rate. If
> not, what is the point? ;-)

Oh yes they can under restricted circumstances. Large packets, multiple
cores etc. With the band-aids....
Re: IPoIB performance
On Wed, 5 Sep 2012, Atchley, Scott wrote:

> These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output
> of ibv_devinfo is in my original post.

Hmmm... You are running an old kernel. What version of OFED do you use?
Re: IPoIB performance
On Sep 5, 2012, at 3:04 PM, Reeted wrote:

> On 09/05/12 19:59, Atchley, Scott wrote:
>> On Sep 5, 2012, at 1:50 PM, Reeted wrote:
>>
>>> I have read that with newer cards the datagram (unconnected) mode is
>>> faster at IPoIB than connected mode. Do you want to check?
>>
>> I have read that the latency is lower (better) but the bandwidth is
>> lower. Using datagram mode limits the MTU to 2044 and the throughput to
>> ~3 Gb/s on these machines/cards. Connected mode at the same MTU performs
>> roughly the same.
>>
>> The win in connected mode comes with larger MTUs. With a 9000 MTU, I see
>> ~6 Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I can get
>> ~13 Gb/s.
>
> Have a look at an old thread in this ML by Sebastien Dugue, "IPoIB to
> Ethernet routing performance". He had numbers much higher than yours on
> similar hardware, and was suggested to use datagram mode to achieve
> offloading and even higher speeds.
>
> Keep me informed if you can fix this, I am interested but can't test
> infiniband myself right now.

He claims 20 Gb/s and Or replies that one should also get near 20 Gb/s using
datagram mode. I checked, and datagram mode shows support via ethtool for
more offloads. In my case, I still see better performance with connected
mode.

Thanks,

Scott
Re: IPoIB performance
On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:

> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>
>>> AFAICT the network stack is useful up to 1Gbps and after that more and
>>> more band-aid comes into play.
>>
>> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested
>> any 40G Ethernet NICs, but I hope that they will get close to line rate.
>> If not, what is the point? ;-)
>
> Oh yes they can under restricted circumstances. Large packets, multiple
> cores etc. With the band-aids....

With Myricom 10G NICs, for example, you just need one core and it can do
line rate with a 1500 byte MTU. Do you count the stateless offloads as
band-aids? Or something else?

I have not tested any 40G NICs yet, but I imagine that one core will not be
enough.

Thanks,

Scott
Re: IPoIB performance
On Sep 5, 2012, at 3:13 PM, Christoph Lameter wrote:

> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>
>> These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output
>> of ibv_devinfo is in my original post.
>
> Hmmm... You are running an old kernel. What version of OFED do you use?

Hah, if you think my kernel is old, you should see my userland (RHEL5.5).
;-)

Does the version of OFED impact the kernel modules? I am using the modules
that came with the kernel. I don't believe that libibverbs or librdmacm are
used by the kernel's socket stack. That said, I am using source builds with
tags libibverbs-1.1.6 and v1.0.16 (librdmacm).

Scott
Re: IPoIB performance
On Wed, 5 Sep 2012, Atchley, Scott wrote:

> With Myricom 10G NICs, for example, you just need one core and it can do
> line rate with a 1500 byte MTU. Do you count the stateless offloads as
> band-aids? Or something else?

The stateless aids also have certain limitations. It's a grey zone if you
want to call them band-aids. It gets there at some point because stateless
offload can only get you so far.

The need to send larger sized packets through the kernel increases the
latency and forces the app to do larger batching. It's not very useful if
you need to send small packets to a variety of receivers.
Re: IPoIB performance
On 9/5/2012 3:48 PM, Atchley, Scott wrote:
> On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:
>
>> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>>
>>>> AFAICT the network stack is useful up to 1Gbps and after that more and
>>>> more band-aid comes into play.
>>>
>>> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested
>>> any 40G Ethernet NICs, but I hope that they will get close to line rate.
>>> If not, what is the point? ;-)
>>
>> Oh yes they can under restricted circumstances. Large packets, multiple
>> cores etc. With the band-aids....
>
> With Myricom 10G NICs, for example, you just need one core and it can do
> line rate with a 1500 byte MTU. Do you count the stateless offloads as
> band-aids? Or something else?
>
> I have not tested any 40G NICs yet, but I imagine that one core will not
> be enough.

Since you are using netperf, you might also consider experimenting with the
TCP_SENDFILE test. Using sendfile/splice calls can have a significant impact
for sockets-based apps. Using 40G NICs (Mellanox ConnectX-3 EN), I've seen
our applications hit 22Gb/s single core/stream while fully CPU bound. With
sendfile/splice, there is no issue saturating a 40G link with about 40-50%
core utilization.

That being said, binding to the right core/node, message size and memory
alignment, interrupt handling, and proper host/NIC tuning all have an impact
on the performance. The state of high-performance networking is certainly
not plug-and-play.

- ezra
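For readers unfamiliar with the call being suggested here, a minimal sender
loop looks roughly like this (a sketch under the usual assumptions, not
Ezra's code; error handling trimmed to the essentials):

/* Sketch: push a whole file out a connected TCP socket with sendfile(2),
 * letting the kernel move pages without copying through userspace. */
#include <sys/sendfile.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <unistd.h>

static int send_whole_file(int sock, const char *path)
{
	struct stat st;
	off_t off = 0;
	int fd = open(path, O_RDONLY);

	if (fd < 0)
		return -1;
	if (fstat(fd, &st) < 0) {
		close(fd);
		return -1;
	}
	while (off < st.st_size) {
		ssize_t n = sendfile(sock, fd, &off, st.st_size - off);
		if (n <= 0)
			break;		/* error or peer closed */
	}
	close(fd);
	return off == st.st_size ? 0 : -1;
}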
Re: IPoIB performance
On Wed, 5 Sep 2012, Atchley, Scott wrote:

>> Hmmm... You are running an old kernel. What version of OFED do you use?
>
> Hah, if you think my kernel is old, you should see my userland (RHEL5.5).
> ;-)

My condolences.

> Does the version of OFED impact the kernel modules? I am using the modules
> that came with the kernel. I don't believe that libibverbs or librdmacm
> are used by the kernel's socket stack. That said, I am using source builds
> with tags libibverbs-1.1.6 and v1.0.16 (librdmacm).

OFED includes kernel modules which provide the drivers that you need.
Installing a new OFED release on RHEL 5 is possible and would give you up to
date drivers. Check with RH: they may have them somewhere easy to install
for your version of RH.
Re: IPoIB performance
On Sep 5, 2012, at 4:12 PM, Ezra Kissel wrote:

> On 9/5/2012 3:48 PM, Atchley, Scott wrote:
>> On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:
>>
>>> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>>>
>>>>> AFAICT the network stack is useful up to 1Gbps and after that more and
>>>>> more band-aid comes into play.
>>>>
>>>> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested
>>>> any 40G Ethernet NICs, but I hope that they will get close to line
>>>> rate. If not, what is the point? ;-)
>>>
>>> Oh yes they can under restricted circumstances. Large packets, multiple
>>> cores etc. With the band-aids....
>>
>> With Myricom 10G NICs, for example, you just need one core and it can do
>> line rate with a 1500 byte MTU. Do you count the stateless offloads as
>> band-aids? Or something else?
>>
>> I have not tested any 40G NICs yet, but I imagine that one core will not
>> be enough.
>
> Since you are using netperf, you might also consider experimenting with
> the TCP_SENDFILE test. Using sendfile/splice calls can have a significant
> impact for sockets-based apps. Using 40G NICs (Mellanox ConnectX-3 EN),
> I've seen our applications hit 22Gb/s single core/stream while fully CPU
> bound. With sendfile/splice, there is no issue saturating a 40G link with
> about 40-50% core utilization.
>
> That being said, binding to the right core/node, message size and memory
> alignment, interrupt handling, and proper host/NIC tuning all have an
> impact on the performance. The state of high-performance networking is
> certainly not plug-and-play.

Thanks for the tip. The app we want to test does not use sendfile() or
splice(). I do bind to the best core (determined by testing all combinations
on client and server).

I have heard others within DOE reach ~16 Gb/s on a 40G Mellanox NIC. I'm
glad to hear that you got to 22 Gb/s for a single stream. That is more
reassuring.

Scott
RE: rsocket library and dup2()
> I found the following code in dup2():
>
>	oldfdi = idm_lookup(idm, oldfd);
>	if (oldfdi && oldfdi->type == fd_fork)
>		fork_passive(oldfd);
>
> In that code the file descriptor type (type) is compared with a fork state
> enum value (fd_fork). Is that on purpose?

On purpose? I'll have to go with no. It's a bug resulting from working on
different patches at the same time and not updating the dup2 patch to
account for the change where I added a state to the fd.

Thanks for catching this. I'll fix it up later this week or early next week
and push out an update.

- Sean