Re: [ewg] IPoIB to Ethernet routing performance
Hi Matthieu,

On Thu, 16 Dec 2010 23:20:35 +0100, matthieu hautreux matthieu.hautr...@gmail.com wrote:

> The router is fitted with one ConnectX2 QDR HCA and one dual-port
> Myricom 10G Ethernet adapter. ... Here are some numbers:
> - 1 IPoIB stream between client and router: 20 Gbit/s

Looks OK.

> - 2 Ethernet streams between router and server: 19.5 Gbit/s

Looks OK.

Actually, I am amazed you can get such speed with IPoIB. Trying with NPtcp on my DDR InfiniBand I can only obtain about 4.6 Gbit/s at the best packet size (that is 1/4 of the InfiniBand bandwidth) with this chip embedded on the mainboard: "InfiniBand: Mellanox Technologies MT25204 [InfiniHost III Lx HCA]", and dual E5430 Xeons (not Nehalem). That's with a 2.6.37 kernel and the vanilla ib_ipoib module. What's wrong with my setup? I always assumed that such a slow speed was due to the lack of the offloading capabilities you get with Ethernet cards, but maybe I was wrong...?

---

Hi,

I ran the same kind of experiments as Sébastien and got results similar to yours, Jabe, with about ~4.6 Gbit/s. I am using a QDR HCA and IPoIB in connected mode on the InfiniBand part of the testbed, and 2 x 10GbE cards in bonding on the Ethernet side of the router.

To get better results, I had to increase the MTU on the Ethernet side from 1500 to 9000. Indeed, because of TCP path MTU discovery, during routed exchanges the MTU used on the IPoIB link for TCP messages was automatically set to the path minimum of 1500. This small but very standard MTU value does not seem to be handled well by the ipoib_cm layer. This may be due to the fact that the IB MTU is 2048: every 1500-byte packet would be padded to 2048 bytes before being sent over the wire, so you would lose roughly 25% of the bandwidth compared to an IPoIB MTU that is a multiple of 2048.

Is this issue already known and/or reported? It would be really interesting to understand why a small MTU value is such a problem for ipoib_cm. After a quick look at the code, it seems that ipoib packet processing is single-threaded and that each IP packet is transmitted/received and processed as a single unit. If that turns out to be the bottleneck, do you think that packet aggregation and/or processing parallelization could be feasible in a future ipoib module? A large share of Ethernet networks are configured with an MTU of 1500, and 10GbE cards currently employ parallelization strategies in their kernel modules to cope with this problem. Clearly a bigger MTU is better, but it is not always achievable given existing equipment and machines. IMHO, this is a real problem for InfiniBand/Ethernet interoperability.

Sébastien, concerning your poor result of 9.3 Gbit/s when routing 2 streams from your InfiniBand client to your Ethernet server: what was the bonding mode on the Ethernet side during the test? Were you using balance-rr or LACP?

> I did not use any Ethernet teaming; I only declared 2 aliases on the
> clients' ib0 and set the routing tables accordingly.
>
> Sébastien.

I got this kind of result with LACP, since only one link is really used during the transmissions, and which link that is depends on the layer-2 information of the peers involved in the communication (as long as you use the default xmit_hash_policy).

HTH.
Regards,
Matthieu

---

Also, what application did you use for the benchmark?

Thank you
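[For anyone wanting to reproduce the jumbo-frame workaround and the bonding check described above, here is a minimal sketch of the relevant Linux knobs. It is an illustration only: the device names (eth2, ib0, bond0) are placeholders for your own interfaces, and the exact sysfs paths can vary with kernel and OFED versions.]

    # Raise the MTU on the Ethernet side of the router so that TCP path
    # MTU discovery no longer clamps routed IPoIB connections to 1500.
    ip link set dev eth2 mtu 9000

    # On the IB side, keep IPoIB in connected mode with its large MTU.
    echo connected > /sys/class/net/ib0/mode
    ip link set dev ib0 mtu 65520

    # For the LACP single-link symptom: the default xmit_hash_policy
    # (layer2) hashes on MAC addresses only, so a given pair of peers
    # always lands on the same slave. Hashing on L3/L4 headers lets
    # distinct TCP streams (or IP aliases) use different slaves.
    echo layer3+4 > /sys/class/net/bond0/bonding/xmit_hash_policy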
[ewg] ofa_1_5_kernel 20101217-0200 daily build status
This email was generated automatically, please do not reply

git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters:

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.54.5-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-194.el5
Passed on x86_64 with linux-2.6.18-164.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27.19-5-smp
Passed on x86_64 with linux-2.6.28
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ia64 with linux-2.6.28
Passed on ia64 with linux-2.6.25
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:

Build failed on x86_64 with linux-2.6.33
Log:
    from include/linux/mutex.h:13,
    from /home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.33_x86_64_check/drivers/infiniband/core/addr.c:36:
    include/linux/compiler-gcc4.h:8:4: error: #error Your version of gcc miscompiles the __weak directive
    make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.33_x86_64_check/drivers/infiniband/core/addr.o] Error 1
    make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.33_x86_64_check/drivers/infiniband/core] Error 2
    make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.33_x86_64_check/drivers/infiniband] Error 2
    make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.33_x86_64_check] Error 2
    make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.33'
    make: *** [kernel] Error 2
--
Build failed on x86_64 with linux-2.6.34
Log:
    from include/linux/mutex.h:13,
    from /home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.34_x86_64_check/drivers/infiniband/core/addr.c:36:
    include/linux/compiler-gcc4.h:8:4: error: #error Your version of gcc miscompiles the __weak directive
    make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.34_x86_64_check/drivers/infiniband/core/addr.o] Error 1
    make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.34_x86_64_check/drivers/infiniband/core] Error 2
    make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.34_x86_64_check/drivers/infiniband] Error 2
    make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.34_x86_64_check] Error 2
    make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.34'
    make: *** [kernel] Error 2
--
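[Both failures come from the kernel's own compile-time guard: include/linux/compiler-gcc4.h in 2.6.33/2.6.34 rejects the GCC releases known to miscompile the __weak attribute (4.1.0 and 4.1.1). A hypothetical pre-flight check for the build host, assuming that guard is the only obstacle:]

    # Abort early if the host compiler is one of the gcc releases that
    # 2.6.33+ refuses for miscompiling __weak (see compiler-gcc4.h).
    case "$(gcc -dumpversion)" in
        4.1.0|4.1.1)
            echo "gcc $(gcc -dumpversion) miscompiles __weak; use another compiler" >&2
            exit 1
            ;;
    esac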
Re: [ewg] IPoIB to Ethernet routing performance
> This may be due to the fact that the IB MTU is 2048: every 1500-byte
> packet would be padded to 2048 bytes before being sent over the wire,
> so you would lose roughly 25% of the bandwidth compared to an IPoIB
> MTU that is a multiple of 2048.

This isn't true. IB packets are only padded to a multiple of 4 bytes. However, there's no point in using IPoIB connected mode to pass packets smaller than the IB MTU -- you might as well use datagram mode.

 - R.
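[To put a number on the padding point: the IPoIB encapsulation header is 4 bytes, so a 1500-byte IP packet goes on the wire as 1504 bytes, which is already a multiple of 4; the padding overhead is essentially zero, nowhere near 25%. For comparing the two modes directly, a sketch of the usual sysfs/iproute2 toggles follows (ib0 is a placeholder for your IPoIB interface):]

    # Show the current IPoIB mode ("datagram" or "connected").
    cat /sys/class/net/ib0/mode

    # Datagram mode: the IP MTU is bounded by the IB MTU minus the
    # 4-byte IPoIB header, e.g. 2044 for a 2048-byte IB MTU.
    echo datagram > /sys/class/net/ib0/mode
    ip link set dev ib0 mtu 2044

    # Connected mode: the RC queue pair allows a much larger IP MTU.
    echo connected > /sys/class/net/ib0/mode
    ip link set dev ib0 mtu 65520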
Re: [ewg] IPoIB to Ethernet routing performance
2010/12/17 Roland Dreier rdre...@cisco.com:

> > This may be due to the fact that the IB MTU is 2048: every 1500-byte
> > packet would be padded to 2048 bytes before being sent over the wire,
> > so you would lose roughly 25% of the bandwidth compared to an IPoIB
> > MTU that is a multiple of 2048.
>
> This isn't true. IB packets are only padded to a multiple of 4 bytes.
> However, there's no point in using IPoIB connected mode to pass packets
> smaller than the IB MTU -- you might as well use datagram mode.

We are using InfiniBand as an HPC cluster interconnect: our compute nodes use it to exchange data over IPoIB with an MTU of 65520, to do RDMA MPI communications, and to access Lustre filesystems. On top of that, some nodes are connected to both the IB interconnect and an external Ethernet network. These nodes act as IP routers and let compute nodes reach site-wide resources (home directories over NFS, LDAP, ...). Compute nodes use IPoIB with the large MTU to contact the router nodes, so we get really good performance as long as we only talk to the routers. However, as soon as the compute nodes communicate with the external Ethernet world, TCP path MTU discovery automatically reduces the effective IPoIB MTU to 1500, the Ethernet MTU, and we hit this 4.6 Gbit/s wall.

Using datagram mode in our scenario is not possible, as it would reduce the cluster-internal IPoIB performance. What we would rather have is an ipoib_cm that handles small packets better. Do you think this limitation is a hardware limitation of the HCA (packets per second on the wire) or a software one (packets processed per second by the driver)? I would guess software, since better results are achieved in datagram mode with the same 1500-byte MTU. IPoIB in connected mode seems to use a single completion queue with a single MSI vector for all the queue pairs it creates. Perhaps multiplying the number of completion queues and MSI vectors could help spread/parallelize the load and get better results. What is your feeling about that?

Regards,
Matthieu
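[If the bottleneck really is single-threaded receive processing above a single completion queue, one stopgap that needs no driver changes is Receive Packet Steering (merged in 2.6.35): it hashes incoming flows above the driver and fans the protocol processing out across CPUs, even when the device exposes only one RX queue and one interrupt. A sketch only; ib0 and the CPU mask are placeholders, and whether this actually lifts the 4.6 Gbit/s wall for ipoib_cm would have to be measured:]

    # Spread receive protocol processing for ib0's single RX queue
    # across CPUs 0-3 (hex bitmask f). This parallelizes the IP stack
    # work even though the ipoib driver itself stays single-queue.
    echo f > /sys/class/net/ib0/queues/rx-0/rps_cpus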