Re: [ewg] IPoIB to Ethernet routing performance

2010-12-17 Thread sebastien dugue

  Hi Matthieu,

On Thu, 16 Dec 2010 23:20:35 +0100
matthieu hautreux matthieu.hautr...@gmail.com wrote:

   The router is fitted with one ConnectX2 QDR HCA and one dual-port
   Myricom 10G Ethernet adapter.
  
   ...
  
  Here are some numbers:
  
  - 1 IPoIB stream between client and router: 20 Gbits/sec
  
Looks OK.
  
  - 2 Ethernet streams between router and server: 19.5 Gbits/sec
  
Looks OK.
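  
  (For reference, measurements like these could be reproduced with something
  like iperf; the tool actually used for the numbers above is not stated here,
  and all addresses below are placeholders: 192.168.10.1 = router IPoIB
  address, 10.0.0.1 / 10.0.1.1 = server addresses on its two 10GbE ports.)
  
    # one IPoIB TCP stream, client -> router:
    iperf -c 192.168.10.1 -t 60
  
    # two Ethernet TCP streams, router -> server, one per port:
    iperf -c 10.0.0.1 -t 60 &
    iperf -c 10.0.1.1 -t 60 &
    wait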
  
 
 
  Actually I am amazed you can get such a speed with IPoIB. Trying with
  NPtcp on my DDR InfiniBand I can only obtain about 4.6 Gbit/s at the
  best packet size (that is 1/4 of the InfiniBand bandwidth) with this
  chip embedded in the mainboard: InfiniBand: Mellanox Technologies
  MT25204 [InfiniHost III Lx HCA]; and dual E5430 Xeons (not Nehalem).
  That's with a 2.6.37 kernel and the vanilla ib_ipoib module. What's
  wrong with my setup?
  I always assumed that such a slow speed was due to the lack of
  offloading capabilities you get with Ethernet cards, but maybe I was
  wrong...?
 
 Hi,
 
 I ran the same kind of experiments as Sebastien and got results similar to
 yours, Jabe: about 4.6 Gbit/s.
 
 I am using a QDR HCA with IPoIB in connected mode on the InfiniBand part of
 the testbed and 2 * 10GbE Ethernet cards in bonding on the Ethernet side of
 the router.
 To get better results, I had to increase the MTU on the Ethernet side from
 1500 to 9000. Indeed, because of TCP path MTU discovery, the MTU used on the
 IPoIB link for routed TCP exchanges was automatically reduced to the path
 minimum of 1500. This small but very standard MTU value does not seem to be
 handled well by the ipoib_cm layer.
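 For example, the kind of change involved looks roughly like this (interface
 names here are only examples, not our actual configuration):
 
   # raise the MTU on the Ethernet ports and on the bond itself, so that
   # path MTU discovery no longer clamps routed TCP connections to 1500
   # (the Ethernet switch must accept jumbo frames as well)
   ip link set dev eth2 mtu 9000
   ip link set dev eth3 mtu 9000
   ip link set dev bond0 mtu 9000
 
   # then check the path MTU actually seen from a compute node
   tracepath <some-external-host>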

  This may be due to the fact that the IB MTU is 2048. Every 1500-byte packet
is padded to 2048 bytes before being sent over the wire, so you're losing
roughly 25% bandwidth compared to an IPoIB MTU which is a multiple of 2048.

 
 Is this issue already known and/or reported? It would be really interesting
 to understand why a small MTU is such a problem for ipoib_cm. After a quick
 look at the code, it seems that IPoIB packet processing is single threaded
 and that each IP packet is transmitted/received and processed as a single
 unit. If that turns out to be the bottleneck, do you think that packet
 aggregation and/or parallelized processing could be feasible in a future
 ipoib module? A large share of Ethernet networks are configured with an MTU
 of 1500, and 10GbE cards already employ parallelization strategies in their
 kernel drivers to cope with this problem. It is clear that a bigger MTU is
 better, but it is not always achievable with existing equipment and
 machines. IMHO, that is a real problem for InfiniBand/Ethernet
 interoperability.
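 To give a rough idea of the per-packet load implied by the MTU (a
 back-of-the-envelope figure only, ignoring headers):
 
   # packets per second needed to sustain 10 Gbit/s at various MTUs
   for mtu in 1500 9000 65520; do
       echo "MTU $mtu: $(( 10 * 1000 * 1000 * 1000 / (mtu * 8) )) packets/s"
   done
   # MTU 1500:  ~833,000 packets/s
   # MTU 9000:  ~139,000 packets/s
   # MTU 65520: ~19,000 packets/s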
 
 Sebastien, concerning your bad performance of 9.3 Gbit/s when routing 2
 streams from your InfiniBand client to your Ethernet server, what was the
 bonding mode on the Ethernet side during the test? Are you using balance-rr
 or LACP?

  I did not use any Ethernet teaming; I only declared 2 aliases on the
clients' ib0 and set the routing tables accordingly.
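
  Roughly along these lines, with made-up addresses rather than the real ones:

    # two addresses on the client's ib0, and one host route per stream so
    # that each stream is pinned to a different source alias
    # (illustration only, not the actual configuration)
    ip addr add 192.168.10.11/24 dev ib0 label ib0:0
    ip addr add 192.168.10.12/24 dev ib0 label ib0:1
    ip route add 10.0.0.1/32 via 192.168.10.1 src 192.168.10.11
    ip route add 10.0.1.1/32 via 192.168.10.1 src 192.168.10.12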

  Sébastien.

 I got this kind of result with LACP because only one link is really used
 during the transmissions, and that link depends on the layer 2 information
 of the peers involved in the communication (as long as you use the default
 xmit_hash_policy).
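 
 For example, with the standard bonding module options (the values here are
 only illustrations, not our actual setup):
 
   # 802.3ad (LACP) hashes each flow onto one slave; the default
   # xmit_hash_policy is layer2, so all traffic between the same pair of
   # MAC addresses stays on a single link. Hashing on L3/L4 information,
   # or round-robin striping, spreads the streams across the slaves:
   modprobe bonding mode=802.3ad miimon=100 xmit_hash_policy=layer3+4
   # or
   modprobe bonding mode=balance-rr miimon=100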
 
 HTH
 Regards,
 Matthieu
 
  Also what application did you use for the benchmark?
  Thank you
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg

[ewg] ofa_1_5_kernel 20101217-0200 daily build status

2010-12-17 Thread Vladimir Sokolovsky (Mellanox)
This email was generated automatically, please do not reply


git_url: git://git.openfabrics.org/ofed_1_5/linux-2.6.git
git_branch: ofed_kernel_1_5

Common build parameters: 

Passed:
Passed on i686 with linux-2.6.18
Passed on i686 with linux-2.6.21.1
Passed on i686 with linux-2.6.19
Passed on i686 with linux-2.6.26
Passed on i686 with linux-2.6.27
Passed on i686 with linux-2.6.24
Passed on i686 with linux-2.6.22
Passed on x86_64 with linux-2.6.18
Passed on x86_64 with linux-2.6.16.60-0.54.5-smp
Passed on x86_64 with linux-2.6.16.60-0.21-smp
Passed on x86_64 with linux-2.6.18-194.el5
Passed on x86_64 with linux-2.6.18-164.el5
Passed on x86_64 with linux-2.6.21.1
Passed on x86_64 with linux-2.6.19
Passed on x86_64 with linux-2.6.20
Passed on x86_64 with linux-2.6.22
Passed on x86_64 with linux-2.6.26
Passed on x86_64 with linux-2.6.27
Passed on x86_64 with linux-2.6.24
Passed on x86_64 with linux-2.6.25
Passed on x86_64 with linux-2.6.27.19-5-smp
Passed on x86_64 with linux-2.6.28
Passed on ia64 with linux-2.6.23
Passed on ia64 with linux-2.6.18
Passed on ia64 with linux-2.6.19
Passed on ia64 with linux-2.6.21.1
Passed on ia64 with linux-2.6.22
Passed on ia64 with linux-2.6.24
Passed on ia64 with linux-2.6.26
Passed on ia64 with linux-2.6.28
Passed on ia64 with linux-2.6.25
Passed on ppc64 with linux-2.6.18
Passed on ppc64 with linux-2.6.19

Failed:
Build failed on x86_64 with linux-2.6.33
Log:
 from include/linux/mutex.h:13,
 from /home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.33_x86_64_check/drivers/infiniband/core/addr.c:36:
include/linux/compiler-gcc4.h:8:4: error: #error Your version of gcc miscompiles the __weak directive
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.33_x86_64_check/drivers/infiniband/core/addr.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.33_x86_64_check/drivers/infiniband/core] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.33_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.33_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.33'
make: *** [kernel] Error 2
--
Build failed on x86_64 with linux-2.6.34
Log:
 from include/linux/mutex.h:13,
 from /home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.34_x86_64_check/drivers/infiniband/core/addr.c:36:
include/linux/compiler-gcc4.h:8:4: error: #error Your version of gcc miscompiles the __weak directive
make[4]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.34_x86_64_check/drivers/infiniband/core/addr.o] Error 1
make[3]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.34_x86_64_check/drivers/infiniband/core] Error 2
make[2]: *** [/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.34_x86_64_check/drivers/infiniband] Error 2
make[1]: *** [_module_/home/vlad/tmp/ofa_1_5_kernel-20101217-0200_linux-2.6.34_x86_64_check] Error 2
make[1]: Leaving directory `/home/vlad/kernel.org/x86_64/linux-2.6.34'
make: *** [kernel] Error 2
--
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg


Re: [ewg] IPoIB to Ethernet routing performance

2010-12-17 Thread Roland Dreier
  This may be due to the fact that the IB MTU is 2048. Every 1500-byte packet
  is padded to 2048 bytes before being sent over the wire, so you're losing
  roughly 25% bandwidth compared to an IPoIB MTU which is a multiple of 2048.

This isn't true.  IB packets are only padded to a multiple of 4 bytes.

However there's no point in using IPoIB connected mode to pass packets
smaller than the IB MTU -- you might as well use datagram mode.
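
To put numbers on it (just arithmetic):

  # padding a 1500-byte packet up to a multiple of 4 bytes vs. 2048 bytes
  for align in 4 2048; do
      padded=$(( (1500 + align - 1) / align * align ))
      echo "align $align: padded to $padded bytes," \
           "$(( (padded - 1500) * 100 / padded ))% of wire bytes wasted"
  done
  # align 4:    padded to 1500 bytes -> 0% wasted
  # align 2048: padded to 2048 bytes -> ~26% wasted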

 - R.
___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg



Re: [ewg] IPoIB to Ethernet routing performance

2010-12-17 Thread matthieu hautreux
2010/12/17 Roland Dreier rdre...@cisco.com

    This may be due to the fact that the IB MTU is 2048. Every 1500-byte packet
    is padded to 2048 bytes before being sent over the wire, so you're losing
    roughly 25% bandwidth compared to an IPoIB MTU which is a multiple of 2048.

 This isn't true.  IB packets are only padded to a multiple of 4 bytes.

 However there's no point in using IPoIB connected mode to pass packets
 smaller than the IB MTU -- you might as well use datagram mode.


We are using InfiniBand as an HPC cluster interconnect, and our compute nodes
use it to exchange data over IPoIB with an MTU of 65520, do RDMA MPI
communications and access Lustre filesystems. On top of that, some nodes are
connected to both the IB interconnect and an external Ethernet network. These
nodes act as IP routers and enable compute nodes to access site-wide
resources (home directories over NFS, LDAP, ...). Compute nodes use IPoIB
with a large MTU to contact the router nodes, so we get really good
performance when we only communicate with the routers. However, as soon as
the compute nodes communicate with the external Ethernet world, TCP path MTU
discovery automatically reduces the IPoIB MTU to 1500, the Ethernet MTU, and
we hit this 4.6 Gbit/s wall.
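
This is easy to see from a compute node (the host name below is only a
placeholder):

  ip link show ib0                   # mtu 65520 in connected mode
  ip route get <some-external-host>  # once traffic has flowed, the cached
                                     # entry reports the discovered path MTU
                                     # of 1500
  tracepath <some-external-host>     # shows where the path MTU drops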

Using datagram mode in our scenario is not possible, as it would reduce the
cluster's internal IPoIB performance. What we'd rather have is an ipoib_cm
that handles small packets better. Do you think this limitation is an HCA
hardware limitation (packets per second on the wire) or a software limitation
(packets processed per second)? I would think it is a software limitation,
since better results are achieved in datagram mode with the same 1500-byte
MTU. IPoIB in connected mode seems to use a single completion queue with a
single MSI vector for all the queue pairs it creates to communicate. Perhaps
multiplying the number of completion queues and MSI vectors could help
spread/parallelize the load and get better results. What is your feeling
about that?
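
For example, one quick way to check that on a router node (the device and
IRQ names below are only guesses and depend on the driver):

  # see which IRQs the HCA raises and how they are spread across CPUs
  grep -i 'mlx4\|ib0\|ipoib' /proc/interrupts

  # inspect or change the CPU affinity of one of those IRQs
  # (replace 42 with an IRQ number reported above)
  cat /proc/irq/42/smp_affinity
  echo 4 > /proc/irq/42/smp_affinity   # e.g. pin it to CPU2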

Regards,
Matthieu



___
ewg mailing list
ewg@lists.openfabrics.org
http://lists.openfabrics.org/cgi-bin/mailman/listinfo/ewg