Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 29 Aug 2012, Atchley, Scott wrote:

 I am benchmarking a sockets based application and I want a sanity check
 on IPoIB performance expectations when using connected mode (65520 MTU).
 I am using the tuning tips in Documentation/infiniband/ipoib.txt. The
 machines have Mellanox QDR cards (see below for the verbose ibv_devinfo
 output). I am using a 2.6.36 kernel. The hosts have single socket Intel
 E5520 (4 core with hyper-threading on) at 2.27 GHz.

 I am using netperf's TCP_STREAM and binding cores. The best I have seen
 is ~13 Gbps. Is this the best I can expect from these cards?

Sounds about right. This is not a hardware limitation but
a limitation of the socket I/O layer / PCI-E bus. The cards generally can
process more data than the PCI bus and the OS can handle.

PCI-E 2.0 should give you up to about 2.3 GBytes/sec with these
NICs. So there is likely something that the network layer does that
limits the bandwidth.

 What should I expect as a max for ipoib with FDR cards?

More of the same. You may want to

A) increase the block size handled by the socket layer

B) Increase the bandwidth by using PCI-E 3 or more PCI-E lanes.

C) Bypass the socket layer. Look at Sean's rsockets layer f.e.
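
For C), the substitution is mostly mechanical; a minimal client sketch, assuming
librdmacm's <rdma/rsocket.h> (the address, port, and payload size below are only
placeholders):

/* Hypothetical rsockets client sketch: the flow mirrors socket()/connect()/
 * send(), but the r-prefixed calls from librdmacm put the data path on RDMA
 * instead of the kernel TCP stack. Link with -lrdmacm. */
#include <rdma/rsocket.h>
#include <sys/socket.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <string.h>

int main(void)
{
    static char buf[65536];                           /* placeholder payload */
    struct sockaddr_in peer;

    memset(&peer, 0, sizeof(peer));
    peer.sin_family = AF_INET;
    peer.sin_port = htons(7471);                      /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr);  /* placeholder address */

    int fd = rsocket(AF_INET, SOCK_STREAM, 0);
    if (fd < 0)
        return 1;
    if (rconnect(fd, (struct sockaddr *)&peer, sizeof(peer)) == 0)
        rsend(fd, buf, sizeof(buf), 0);               /* behaves like send() */
    rclose(fd);
    return 0;
}

If I remember right, librdmacm also ships a preload library, so an unmodified
sockets app can be run over rsockets without recompiling.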


Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 11:51 AM, Christoph Lameter wrote:

 On Wed, 29 Aug 2012, Atchley, Scott wrote:
 
 I am benchmarking a sockets based application and I want a sanity check
 on IPoIB performance expectations when using connected mode (65520 MTU).
 I am using the tuning tips in Documentation/infiniband/ipoib.txt. The
 machines have Mellanox QDR cards (see below for the verbose ibv_devinfo
 output). I am using a 2.6.36 kernel. The hosts have single socket Intel
 E5520 (4 core with hyper-threading on) at 2.27 GHz.
 
 I am using netperf's TCP_STREAM and binding cores. The best I have seen
 is ~13 Gbps. Is this the best I can expect from these cards?
 
 Sounds about right. This is not a hardware limitation but
 a limitation of the socket I/O layer / PCI-E bus. The cards generally can
 process more data than the PCI bus and the OS can handle.
 
 PCI-E 2.0 should give you up to about 2.3 GBytes/sec with these
 NICs. So there is likely something that the network layer does that
 limits the bandwidth.

First, thanks for the reply.

I am not sure where you are getting the 2.3 GB/s value. When using verbs 
natively, I can get ~3.4 GB/s. I am assuming that these HCAs lack certain TCP 
offloads that might allow higher socket performance. Ethtool reports:

# ethtool -k ib0
Offload parameters for ib0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: off

There is no checksum support which I would expect to lower performance. Since 
checksums need to be calculated in the host, I would expect faster processors 
to help performance some.

So basically, am I in the ball park given this hardware?

 
 What should I expect as a max for ipoib with FDR cards?
 
 More of the same. You may want to
 
 A) increase the block size handled by the socket layer

Do you mean altering sysctl with something like:

# increase TCP max buffer size settable using setsockopt()
net.core.rmem_max = 16777216 
net.core.wmem_max = 16777216 
# increase Linux autotuning TCP buffer limit 
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# increase the length of the processor input queue
net.core.netdev_max_backlog = 3

or increasing the SO_SNDBUF and SO_RCVBUF sizes via setsockopt(), or something else?
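
For reference, here is what I mean by the setsockopt() variant, as a minimal
sketch (the 16 MB figure just mirrors the sysctl maximums above; as I understand
it, setting these explicitly also disables the tcp_rmem/tcp_wmem autotuning for
that socket):

/* Sketch: request larger per-socket buffers; the kernel caps the request at
 * net.core.wmem_max / net.core.rmem_max and reports back double the value it
 * actually granted. */
#include <stdio.h>
#include <sys/socket.h>
#include <netinet/in.h>

int main(void)
{
    int fd = socket(AF_INET, SOCK_STREAM, 0);
    int sz = 16 * 1024 * 1024;               /* 16 MB, matching rmem_max/wmem_max */
    socklen_t len = sizeof(sz);

    if (fd < 0)
        return 1;
    if (setsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sz, sizeof(sz)) < 0)
        perror("SO_SNDBUF");
    if (setsockopt(fd, SOL_SOCKET, SO_RCVBUF, &sz, sizeof(sz)) < 0)
        perror("SO_RCVBUF");

    getsockopt(fd, SOL_SOCKET, SO_SNDBUF, &sz, &len);
    printf("effective send buffer: %d bytes\n", sz);  /* typically 2x the request */
    return 0;
}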

 B) Increase the bandwidth by using PCI-E 3 or more PCI-E lanes.
 
 C) Bypass the socket layer. Look at Sean's rsockets layer f.e.

We actually want to test the socket stack and not bypass it.

Thanks again!

Scott



Re: IPoIB performance

2012-09-05 Thread Reeted

On 08/29/12 21:35, Atchley, Scott wrote:

Hi all,

I am benchmarking a sockets based application and I want a sanity check on 
IPoIB performance expectations when using connected mode (65520 MTU).


I have read that with newer cards the datagram (unconnected) mode is 
faster at IPoIB than connected mode. Do you want to check?


What benchmark program are you using?


Re: IPoIB performance

2012-09-05 Thread Reeted

On 09/05/12 17:51, Christoph Lameter wrote:

PCI-E 2.0 should give you up to about 2.3 GBytes/sec with these
NICs. So there is likely something that the network layer does that
limits the bandwidth.


I think those are 8-lane PCI-E 2.0, so that would be 500 MB/sec x 8, which is 
4 GBytes/sec. Or do you really mean there is almost 50% overhead?



Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 1:50 PM, Reeted wrote:

 On 08/29/12 21:35, Atchley, Scott wrote:
 Hi all,
 
 I am benchmarking a sockets based application and I want a sanity check on 
 IPoIB performance expectations when using connected mode (65520 MTU).
 
 I have read that with newer cards the datagram (unconnected) mode is 
 faster at IPoIB than connected mode. Do you want to check?

I have read that the latency is lower (better) but the bandwidth is also lower.

Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on 
these machines/cards. Connected mode at the same MTU performs roughly the same. 
The win in connected mode comes with larger MTUs. With a 9000 MTU, I see ~6 
Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I can get ~13 Gb/s.

 What benchmark program are you using?

netperf with process binding (-T). I tune sysctl per the DOE FasterData specs:

http://fasterdata.es.net/host-tuning/linux/

Scott


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

 # ethtool -k ib0
 Offload parameters for ib0:
 rx-checksumming: off
 tx-checksumming: off
 scatter-gather: off
 tcp segmentation offload: off
 udp fragmentation offload: off
 generic segmentation offload: on
 generic-receive-offload: off

 There is no checksum support which I would expect to lower performance.
 Since checksums need to be calculated in the host, I would expect faster
 processors to help performance some.

OK, that is a major problem. Both are on by default here. What NIC is this?

  A) increase the block size handled by the socket layer

 Do you mean altering sysctl with something like:

Nope, increase the MTU. Connected mode supports up to a 64k MTU size, I believe.

 or increasing the SO_SNDBUF and SO_RCVBUF sizes via setsockopt(), or something else?

That does nothing for performance. The problem is that the handling of the
data by the kernel causes too much latency so that you cannot reach the
full bw of the hardware.

 We actually want to test the socket stack and not bypass it.

AFAICT the network stack is useful up to 1Gbps and
after that more and more band-aid comes into play.


Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 2:20 PM, Christoph Lameter wrote:

 On Wed, 5 Sep 2012, Atchley, Scott wrote:
 
 # ethtool -k ib0
 Offload parameters for ib0:
 rx-checksumming: off
 tx-checksumming: off
 scatter-gather: off
 tcp segmentation offload: off
 udp fragmentation offload: off
 generic segmentation offload: on
 generic-receive-offload: off
 
 There is no checksum support which I would expect to lower performance.
 Since checksums need to be calculated in the host, I would expect faster
 processors to help performance some.
 
 OK, that is a major problem. Both are on by default here. What NIC is this?

These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of 
ibv_devinfo is in my original post.

 A) increase the block size handled by the socket layer
 
 Do you mean altering sysctl with something like:
 
 Nope, increase the MTU. Connected mode supports up to a 64k MTU size, I believe.

Yes, I am using the max MTU (65520).

 or increasing the SO_SNDBUF and SO_RCVBUF sizes via setsockopt(), or something else?
 
 That does nothing for performance. The problem is that the handling of the
 data by the kernel causes too much latency so that you cannot reach the
 full bw of the hardware.
 
 We actually want to test the socket stack and not bypass it.
 
 AFAICT the network stack is useful up to 1Gbps and
 after that more and more band-aid comes into play.

Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 40G 
Ethernet NICs, but I hope that they will get close to line rate. If not, what 
is the point? ;-)

Scott


Re: IPoIB performance

2012-09-05 Thread Reeted

On 09/05/12 19:59, Atchley, Scott wrote:

On Sep 5, 2012, at 1:50 PM, Reeted wrote:



I have read that with newer cards the datagram (unconnected) mode is
faster at IPoIB than connected mode. Do you want to check?

I have read that the latency is lower (better) but the bandwidth is also lower.

Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on 
these machines/cards. Connected mode at the same MTU performs roughly the same. 
The win in connected mode comes with larger MTUs. With a 9000 MTU, I see ~6 
Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I can get ~13 Gb/s.



Have a look at an old thread on this ML by Sebastien Dugue, "IPoIB to 
Ethernet routing performance".
He had numbers much higher than yours on similar hardware, and was 
advised to use datagram mode to achieve offloading and even higher speeds.
Keep me informed if you can fix this; I am interested but can't test 
InfiniBand myself right now.



Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

  AFAICT the network stack is useful up to 1Gbps and
  after that more and more band-aid comes into play.

 Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 
 40G Ethernet NICs, but I hope that they will get close to line rate. If not, 
 what is the point? ;-)

Oh yes, they can under restricted circumstances: large packets, multiple
cores, etc. With the band-aids...



Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

 These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of 
 ibv_devinfo is in my original post.

Hmmm... You are running an old kernel. What version of OFED do you use?




Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 3:04 PM, Reeted wrote:

 On 09/05/12 19:59, Atchley, Scott wrote:
 On Sep 5, 2012, at 1:50 PM, Reeted wrote:
 
 
 I have read that with newer cards the datagram (unconnected) mode is
 faster at IPoIB than connected mode. Do you want to check?
 I have read that the latency is lower (better) but the bandwidth is also lower.
 
 Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s on 
 these machines/cards. Connected mode at the same MTU performs roughly the 
 same. The win in connected mode comes with larger MTUs. With a 9000 MTU, I 
 see ~6 Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I can get 
 ~13 Gb/s.
 
 
 Have a look at an old thread on this ML by Sebastien Dugue, "IPoIB to 
 Ethernet routing performance".
 He had numbers much higher than yours on similar hardware, and was 
 advised to use datagram mode to achieve offloading and even higher speeds.
 Keep me informed if you can fix this; I am interested but can't test 
 InfiniBand myself right now.

He claims 20 Gb/s, and Or replies that one should also get near 20 Gb/s using 
datagram mode. I checked, and datagram mode shows support via ethtool for more 
offloads. In my case, I still see better performance with connected mode.

Thanks,

Scott


Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:

 On Wed, 5 Sep 2012, Atchley, Scott wrote:
 
 AFAICT the network stack is useful up to 1Gbps and
 after that more and more band-aid comes into play.
 
 Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 
 40G Ethernet NICs, but I hope that they will get close to line rate. If not, 
 what is the point? ;-)
 
 Oh yes they can under restricted circumstances. Large packets, multiple
 cores etc. With the band-aids….

With Myricom 10G NICs, for example, you just need one core and it can do line 
rate with 1500 byte MTU. Do you count the stateless offloads as band-aids? Or 
something else?

I have not tested any 40G NICs yet, but I imagine that one core will not be 
enough.

Thanks,

Scott


Re: IPoIB performance

2012-09-05 Thread Atchley, Scott

On Sep 5, 2012, at 3:13 PM, Christoph Lameter wrote:

 On Wed, 5 Sep 2012, Atchley, Scott wrote:
 
 These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output of 
 ibv_devinfo is in my original post.
 
 Hmmm... You are running an old kernel. What version of OFED do you use?

Hah, if you think my kernel is old, you should see my userland (RHEL5.5). ;-)

Does the version of OFED impact the kernel modules? I am using the modules that 
came with the kernel. I don't believe that libibverbs or librdmacm are used by 
the kernel's socket stack. That said, I am using source builds with tags 
libibverbs-1.1.6 and v1.0.16 (librdmacm).

Scott


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

 With Myricom 10G NICs, for example, you just need one core and it can do
 line rate with 1500 byte MTU. Do you count the stateless offloads as
 band-aids? Or something else?

The stateless aids also have certain limitations. It's a grey zone if you
want to call them band-aids. It gets there at some point, because stateless
offload can only get you so far. The need to send larger-sized packets
through the kernel increases the latency and forces the app to do larger
batching. It's not very useful if you need to send small packets to a
variety of receivers.



Re: IPoIB performance

2012-09-05 Thread Ezra Kissel

On 9/5/2012 3:48 PM, Atchley, Scott wrote:

On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:


On Wed, 5 Sep 2012, Atchley, Scott wrote:


AFAICT the network stack is useful up to 1Gbps and
after that more and more band-aid comes into play.


Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 40G 
Ethernet NICs, but I hope that they will get close to line rate. If not, what 
is the point? ;-)


Oh yes they can under restricted circumstances. Large packets, multiple
cores etc. With the band-aids….


With Myricom 10G NICs, for example, you just need one core and it can do line 
rate with 1500 byte MTU. Do you count the stateless offloads as band-aids? Or 
something else?

I have not tested any 40G NICs yet, but I imagine that one core will not be 
enough.

Since you are using netperf, you might also consider experimenting 
with the TCP_SENDFILE test.  Using sendfile/splice calls can have a 
significant impact for sockets-based apps.
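
The sender side of that path is roughly the following minimal sketch (the file,
address, and port are placeholders, and error handling is trimmed):

/* Sketch of a sendfile() sender: the kernel pushes file pages straight to
 * the socket, skipping the user-space copy that a read()/send() loop pays. */
#include <fcntl.h>
#include <sys/sendfile.h>
#include <sys/socket.h>
#include <sys/stat.h>
#include <netinet/in.h>
#include <arpa/inet.h>
#include <unistd.h>

int main(void)
{
    struct sockaddr_in peer = { 0 };
    peer.sin_family = AF_INET;
    peer.sin_port = htons(5001);                     /* placeholder port */
    inet_pton(AF_INET, "192.0.2.1", &peer.sin_addr); /* placeholder address */

    int sock = socket(AF_INET, SOCK_STREAM, 0);
    if (sock < 0 || connect(sock, (struct sockaddr *)&peer, sizeof(peer)) < 0)
        return 1;

    int fd = open("/tmp/payload", O_RDONLY);         /* placeholder file */
    struct stat st;
    fstat(fd, &st);

    off_t off = 0;
    while (off < st.st_size) {
        ssize_t n = sendfile(sock, fd, &off, st.st_size - off);
        if (n <= 0)                                  /* error, or nothing left */
            break;
    }
    close(fd);
    close(sock);
    return 0;
}

netperf's TCP_SENDFILE test exercises the same code path, so it is a cheap way
to see the upper bound before touching the app.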


Using 40G NICs (Mellanox ConnectX-3 EN), I've seen our applications hit 
22Gb/s single core/stream while fully CPU bound.  With sendfile/splice, 
there is no issue saturating a 40G link with about 40-50% core 
utilization.  That being said, binding to the right core/node, message 
size and memory alignment, interrupt handling, and proper host/NIC 
tuning all have an impact on the performance.  The state of 
high-performance networking is certainly not plug-and-play.


- ezra


Re: IPoIB performance

2012-09-05 Thread Christoph Lameter
On Wed, 5 Sep 2012, Atchley, Scott wrote:

  Hmmm... You are running an old kernel. What version of OFED do you
  use?

 Hah, if you think my kernel is old, you should see my userland
 (RHEL5.5). ;-)

My condolences.

 Does the version of OFED impact the kernel modules? I am using the
 modules that came with the kernel. I don't believe that libibverbs or
 librdmacm are used by the kernel's socket stack. That said, I am using
 source builds with tags libibverbs-1.1.6 and v1.0.16 (librdmacm).

OFED includes kernel modules which provide the drivers that you need.
Installing a new OFED release on RH5 is possible and would give you up-to-date
drivers. Check with RH: they may have them somewhere easy to install
for your version of RH.



Re: IPoIB performance

2012-09-05 Thread Atchley, Scott
On Sep 5, 2012, at 4:12 PM, Ezra Kissel wrote:

 On 9/5/2012 3:48 PM, Atchley, Scott wrote:
 On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:
 
 On Wed, 5 Sep 2012, Atchley, Scott wrote:
 
 AFAICT the network stack is useful up to 1Gbps and
 after that more and more band-aid comes into play.
 
 Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any 
 40G Ethernet NICs, but I hope that they will get close to line rate. If 
 not, what is the point? ;-)
 
 Oh yes they can under restricted circumstances. Large packets, multiple
 cores etc. With the band-aids….
 
 With Myricom 10G NICs, for example, you just need one core and it can do 
 line rate with 1500 byte MTU. Do you count the stateless offloads as 
 band-aids? Or something else?
 
 I have not tested any 40G NICs yet, but I imagine that one core will not be 
 enough.
 
 Since you are using netperf, you might also consider experimenting 
 with the TCP_SENDFILE test.  Using sendfile/splice calls can have a 
 significant impact for sockets-based apps.
 
 Using 40G NICs (Mellanox ConnectX-3 EN), I've seen our applications hit 
 22Gb/s single core/stream while fully CPU bound.  With sendfile/splice, 
 there is no issue saturating a 40G link with about 40-50% core 
 utilization.  That being said, binding to the right core/node, message 
 size and memory alignment, interrupt handling, and proper host/NIC 
 tuning all have an impact on the performance.  The state of 
 high-performance networking is certainly not plug-and-play.

Thanks for the tip. The app we want to test does not use sendfile() or splice().

I do bind to the best core (determined by testing all combinations on client 
and server).
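
For the app itself (as opposed to netperf's -T), the pinning is just a
sched_setaffinity() call; a minimal sketch, with core 2 as an arbitrary example:

/* Sketch: pin the calling thread to one core so the benchmark thread and the
 * NIC interrupt handler can be placed deliberately. */
#define _GNU_SOURCE
#include <sched.h>
#include <stdio.h>

int main(void)
{
    cpu_set_t set;

    CPU_ZERO(&set);
    CPU_SET(2, &set);                     /* example core; pick per host by testing */

    if (sched_setaffinity(0, sizeof(set), &set) != 0) {   /* 0 = this thread */
        perror("sched_setaffinity");
        return 1;
    }
    printf("pinned to core 2\n");
    /* ... run the send/recv loop here ... */
    return 0;
}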

I have heard others within DOE reach ~16 Gb/s on a 40G Mellanox NIC. I'm glad 
to hear that you got to 22 Gb/s for a single stream. That is more reassuring.

Scott


IPoIB performance

2012-08-29 Thread Atchley, Scott
Hi all,

I am benchmarking a sockets based application and I want a sanity check on 
IPoIB performance expectations when using connected mode (65520 MTU). I am 
using the tuning tips in Documentation/infiniband/ipoib.txt. The machines have 
Mellanox QDR cards (see below for the verbose ibv_devinfo output). I am using a 
2.6.36 kernel. The hosts have single socket Intel E5520 (4 core with 
hyper-threading on) at 2.27 GHz.

I am using netperf's TCP_STREAM and binding cores. The best I have seen is ~13 
Gbps. Is this the best I can expect from these cards?

What should I expect as a max for ipoib with FDR cards?

Thanks,

Scott



hca_id: mlx4_0
transport:  InfiniBand (0)
fw_ver: 2.7.626
node_guid:  0002:c903:000b:6520
sys_image_guid: 0002:c903:000b:6523
vendor_id:  0x02c9
vendor_part_id: 26428
hw_ver: 0xB0
board_id:   MT_0D90110009
phys_port_cnt:  1
max_mr_size:0x
page_size_cap:  0xfe00
max_qp: 65464
max_qp_wr:  16384
device_cap_flags:   0x006c9c76
max_sge:32
max_sge_rd: 0
max_cq: 65408
max_cqe:4194303
max_mr: 131056
max_pd: 32764
max_qp_rd_atom: 16
max_ee_rd_atom: 0
max_res_rd_atom:1047424
max_qp_init_rd_atom:128
max_ee_init_rd_atom:0
atomic_cap: ATOMIC_HCA (1)
max_ee: 0
max_rdd:0
max_mw: 0
max_raw_ipv6_qp:0
max_raw_ethy_qp:0
max_mcast_grp:  8192
max_mcast_qp_attach:56
max_total_mcast_qp_attach:  458752
max_ah: 0
max_fmr:0
max_srq:65472
max_srq_wr: 16383
max_srq_sge:31
max_pkeys:  128
local_ca_ack_delay: 15
port:   1
state:  PORT_ACTIVE (4)
max_mtu:2048 (4)
active_mtu: 2048 (4)
sm_lid: 6
port_lid:   8
port_lmc:   0x00
link_layer: InfiniBand
max_msg_sz: 0x4000
port_cap_flags: 0x02510868
max_vl_num: 8 (4)
bad_pkey_cntr:  0x0
qkey_viol_cntr: 0x0
sm_sl:  0
pkey_tbl_len:   128
gid_tbl_len:128
subnet_timeout: 18
init_type_reply:0
active_width:   4X (2)
active_speed:   10.0 Gbps (4)
phys_state: LINK_UP (5)
GID[  0]:   
fe80:0000:0000:0000:0002:c903:000b:6521



IPoIB performance benchmarking

2010-04-12 Thread Tom Ammon

Hi,

I'm trying to do some performance benchmarking of IPoIB on a DDR IB 
cluster, and I am having a hard time understanding what I am seeing.


When I do a simple netperf, I get results like these:

[r...@gateway3 ~]# netperf -H 192.168.23.252
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.23.252 
(192.168.23.252) port 0 AF_INET

Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  65536  65536    10.01    4577.70


Which is disappointing since it is simply two DDR IB-connected nodes 
plugged in to a DDR switch - I would expect much higher throughput than 
that. When I do a test with ibv_srq_pingpong (using the same message 
size reported above), here's what I get:


[r...@gateway3 ~]# ibv_srq_pingpong 192.168.23.252 -m 4096 -s 65536
  local address:  LID 0x012b, QPN 0x000337, PSN 0x19cc85
  local address:  LID 0x012b, QPN 0x000338, PSN 0x956fc2
...
[output omitted]
...
  remote address: LID 0x0129, QPN 0x00032e, PSN 0x891ce3
131072000 bytes in 0.08 seconds = 12763.08 Mbit/sec
1000 iters in 0.08 seconds = 82.16 usec/iter

Which is much closer to what I would expect with DDR.

The MTU on both of the QLogic DDR HCAs is set to 4096, as it is on the 
QLogic switch.


I know the above is not completely apples-to-apples, since the 
ibv_srq_pingpong is layer 2 and is using 16 QPs. So I ran it again with 
only a single QP, to make it roughly equivalent to my single-stream 
netperf test, and I still get almost double the performance:


[r...@gateway3 ~]# ibv_srq_pingpong 192.168.23.252 -m 4096 -s 65536 -q 1
  local address:  LID 0x012b, QPN 0x000347, PSN 0x65fb56
  remote address: LID 0x0129, QPN 0x00032f, PSN 0x5e52f9
131072000 bytes in 0.13 seconds = 8323.22 Mbit/sec
1000 iters in 0.13 seconds = 125.98 usec/iter


Is there something that I am not understanding, here? Is there any way 
to make single-stream TCP IPoIB performance better than 4.5Gb/s on a DDR 
network? Am I just not using the benchmarking tools correctly?


Thanks,

Tom

--


Tom Ammon
Network Engineer
Office: 801.587.0976
Mobile: 801.674.9273

Center for High Performance Computing
University of Utah
http://www.chpc.utah.edu



Re: IPoIB performance benchmarking

2010-04-12 Thread Tom Ammon

Dave,

Thanks for the pointer. I thought it was running in connected mode, and 
looking at that variable that you mentioned confirms it:


[r...@gateway3 ~]# cat /sys/class/net/ib0/mode
connected

And the IP MTU shows up as:

[r...@gateway3 ~]# ifconfig ib0
ib0   Link encap:InfiniBand  HWaddr 
80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
  inet addr:192.168.23.253  Bcast:192.168.23.255  
Mask:255.255.254.0

  inet6 addr: fe80::211:7500:ff:6edc/64 Scope:Link
  UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
  RX packets:2319010 errors:0 dropped:0 overruns:0 frame:0
  TX packets:4512605 errors:0 dropped:33011 overruns:0 carrier:0
  collisions:0 txqueuelen:256
  RX bytes:5450805352 (5.0 GiB)  TX bytes:154353169896 (143.7 GiB)


This is partly why I'm stumped - I've seen threads about how connected 
mode is supposed to improve IPoIB performance, but I'm not seeing as 
much performance as I'd like.


Tom

On 04/12/2010 02:19 PM, Dave Olson wrote:

On Mon, 12 Apr 2010, Tom Ammon wrote:
| I'm trying to do some performance benchmarking of IPoIB on a DDR IB
| cluster, and I am having a hard time understanding what I am seeing.
|
| When I do a simple netperf, I get results like these:
|
| [r...@gateway3 ~]# netperf -H 192.168.23.252
| TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to 192.168.23.252
| (192.168.23.252) port 0 AF_INET
| Recv   Send    Send
| Socket Socket  Message  Elapsed
| Size   Size    Size     Time     Throughput
| bytes  bytes   bytes    secs.    10^6bits/sec
|
|  87380  65536  65536    10.01    4577.70

Are you using connected mode, or UD?  Since you say you have a 4K MTU,
I'm guessing you are using UD.  Change to use connected mode (edit
/etc/infiniband/openib.conf), or as a quick test

     echo connected > /sys/class/net/ib0/mode

and then the mtu should show as 65520.  That should help
the bandwidth a fair amount.


Dave Olson
dave.ol...@qlogic.com
   


--

Tom Ammon
Network Engineer
Office: 801.587.0976
Mobile: 801.674.9273

Center for High Performance Computing
University of Utah
http://www.chpc.utah.edu



Re: IPoIB performance benchmarking

2010-04-12 Thread Dave Olson
On Mon, 12 Apr 2010, Tom Ammon wrote:
| Thanks for the pointer. I thought it was running in connected mode, and 
| looking at that variable that you mentioned confirms it:


| [r...@gateway3 ~]# ifconfig ib0
| ib0   Link encap:InfiniBand  HWaddr 
| 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
|inet addr:192.168.23.253  Bcast:192.168.23.255  Mask:255.255.254.0
|RX packets:2319010 errors:0 dropped:0 overruns:0 frame:0
|TX packets:4512605 errors:0 dropped:33011 overruns:0 carrier:0

That's a lot of packets dropped on the tx side.

If you have the QLogic software installed, running ipathstats -c1 while
you are running the test would be useful; otherwise, perfquery -r at the
start and another perfquery at the end on both nodes might point to
something.

Oh, and depending on your tcp stack tuning, setting the receive and/or
send buffer size might help.   These are all ddr results, on a more
or less OFED 1.5.1 stack (completely unofficial, blah blah).

And yes, multi-thread will bring the results up (iperf, rather than netperf).

# netperf -H ib-host TCP_STREAM -- -m 65536  
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) 
port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  65536  65536    10.03    5150.24
# netperf -H ib-host TCP_STREAM -- -m 65536 -S 131072
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) 
port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262144  65536  65536    10.03    5401.83

# netperf -H ib-host TCP_STREAM -- -m 65536 -S 262144
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) 
port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

524288  65536  65536    10.01    5478.28


Dave Olson
dave.ol...@qlogic.com