Re: IPoIB performance
On Wed, 29 Aug 2012, Atchley, Scott wrote:

> I am benchmarking a sockets based application and I want a sanity check
> on IPoIB performance expectations when using connected mode (65520 MTU).
> I am using the tuning tips in Documentation/infiniband/ipoib.txt. The
> machines have Mellanox QDR cards (see below for the verbose ibv_devinfo
> output). I am using a 2.6.36 kernel. The hosts have single socket Intel
> E5520 (4 core with hyper-threading on) at 2.27 GHz. I am using netperf's
> TCP_STREAM and binding cores. The best I have seen is ~13 Gbps. Is this
> the best I can expect from these cards?

Sounds about right. This is not a hardware limitation but a limitation of
the socket I/O layer / PCI-E bus. The cards can generally process more
data than the PCI bus and the OS can handle. PCI-E 2.0 should give you up
to about 2.3 GBytes/sec with these NICs, so there is likely something the
network layer does that limits the bandwidth.

> What should I expect as a max for ipoib with FDR cards?

More of the same. You may want to:

A) increase the block size handled by the socket layer
B) increase the bandwidth by using PCI-E 3 or more PCI-E lanes
C) bypass the socket layer; look at Sean's rsockets layer, for example.
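[What the slot actually negotiated can be checked with lspci; the PCI address below is a placeholder, so find the real one with the first command:]

lspci | grep -i mellanox            # note the HCA's address, e.g. 03:00.0
lspci -s 03:00.0 -vv | grep -i 'lnkcap\|lnksta'
# "LnkSta: Speed 5GT/s, Width x8" would mean PCIe 2.0 x8: roughly
# 4 GB/s of payload bandwidth after 8b/10b encoding, before
# per-packet (TLP) protocol overhead reduces it further.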
Re: IPoIB performance
On Sep 5, 2012, at 11:51 AM, Christoph Lameter wrote:

> On Wed, 29 Aug 2012, Atchley, Scott wrote:
>
>> I am benchmarking a sockets based application and I want a sanity check
>> on IPoIB performance expectations when using connected mode (65520
>> MTU). I am using the tuning tips in Documentation/infiniband/ipoib.txt.
>> The machines have Mellanox QDR cards (see below for the verbose
>> ibv_devinfo output). I am using a 2.6.36 kernel. The hosts have single
>> socket Intel E5520 (4 core with hyper-threading on) at 2.27 GHz. I am
>> using netperf's TCP_STREAM and binding cores. The best I have seen is
>> ~13 Gbps. Is this the best I can expect from these cards?
>
> Sounds about right. This is not a hardware limitation but a limitation
> of the socket I/O layer / PCI-E bus. The cards can generally process
> more data than the PCI bus and the OS can handle. PCI-E 2.0 should give
> you up to about 2.3 GBytes/sec with these NICs, so there is likely
> something the network layer does that limits the bandwidth.

First, thanks for the reply.

I am not sure where you are getting the 2.3 GB/s value. When using verbs
natively, I can get ~3.4 GB/s. I am assuming that these HCAs lack certain
TCP offloads that might allow higher socket performance. Ethtool reports:

# ethtool -k ib0
Offload parameters for ib0:
rx-checksumming: off
tx-checksumming: off
scatter-gather: off
tcp segmentation offload: off
udp fragmentation offload: off
generic segmentation offload: on
generic-receive-offload: off

There is no checksum support, which I would expect to lower performance.
Since checksums need to be calculated on the host, I would expect faster
processors to help performance some. So basically, am I in the ballpark
given this hardware?

>> What should I expect as a max for ipoib with FDR cards?
>
> More of the same. You may want to:
>
> A) increase the block size handled by the socket layer

Do you mean altering sysctl with something like:

# increase TCP max buffer size settable using setsockopt()
net.core.rmem_max = 16777216
net.core.wmem_max = 16777216
# increase Linux autotuning TCP buffer limits
net.ipv4.tcp_rmem = 4096 87380 16777216
net.ipv4.tcp_wmem = 4096 65536 16777216
# increase the length of the processor input queue
net.core.netdev_max_backlog = 30000

or increasing the SO_SNDBUF and SO_RCVBUF sizes, or something else?

> B) increase the bandwidth by using PCI-E 3 or more PCI-E lanes
> C) bypass the socket layer; look at Sean's rsockets layer, for example.

We actually want to test the socket stack and not bypass it.

Thanks again!

Scott
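[The two options differ roughly as follows; hostname and buffer sizes here are illustrative placeholders, not values from the thread. The sysctl route raises the system-wide caps and lets TCP autotuning grow the windows, while explicit SO_SNDBUF/SO_RCVBUF sizes can be requested per netperf run with the test-specific -s and -S options:]

# raise the system-wide caps (values from the FasterData guide)
sysctl -w net.core.rmem_max=16777216
sysctl -w net.core.wmem_max=16777216

# or request explicit socket buffers for a single netperf run;
# -s sets the local SO_SNDBUF/SO_RCVBUF, -S the remote ones
netperf -H $SERVER -t TCP_STREAM -- -m 65536 -s 1048576 -S 1048576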
Re: IPoIB performance
On 08/29/12 21:35, Atchley, Scott wrote:

> Hi all,
>
> I am benchmarking a sockets based application and I want a sanity check
> on IPoIB performance expectations when using connected mode (65520 MTU).

I have read that with newer cards the datagram (unconnected) mode is
faster at IPoIB than connected mode. Do you want to check?

What benchmark program are you using?
Re: IPoIB performance
On 09/05/12 17:51, Christoph Lameter wrote:

> PCI-E 2.0 should give you up to about 2.3 GBytes/sec with these NICs, so
> there is likely something the network layer does that limits the
> bandwidth.

I think those are 8-lane PCI-E 2.0 cards, so that would be 500 MB/sec x 8,
i.e. 4 GBytes/sec. Or do you really mean there is almost 50% overhead?
Re: IPoIB performance
On Sep 5, 2012, at 1:50 PM, Reeted wrote:

> On 08/29/12 21:35, Atchley, Scott wrote:
>
>> Hi all,
>>
>> I am benchmarking a sockets based application and I want a sanity check
>> on IPoIB performance expectations when using connected mode (65520
>> MTU).
>
> I have read that with newer cards the datagram (unconnected) mode is
> faster at IPoIB than connected mode. Do you want to check?

I have read that the latency is lower (better) but the bandwidth is lower.
Using datagram mode limits the MTU to 2044 and the throughput to ~3 Gb/s
on these machines/cards. Connected mode at the same MTU performs roughly
the same. The win in connected mode comes with larger MTUs. With a 9000
MTU, I see ~6 Gb/s. Pushing the MTU to 65520 (the maximum for ipoib), I
can get ~13 Gb/s.

> What benchmark program are you using?

netperf with process binding (-T). I tune sysctl per the DOE FasterData
specs:

http://fasterdata.es.net/host-tuning/linux/

Scott
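[Such a run would look roughly like this; hostname and core numbers are placeholders. netperf's -T pins the local and remote netperf processes to the given cores, so the result does not depend on where the scheduler happens to place them:]

# 30-second TCP_STREAM run, 64 KB sends, both sides pinned to core 2
netperf -H $SERVER -T 2,2 -l 30 -t TCP_STREAM -- -m 65536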
Re: IPoIB performance
On Wed, 5 Sep 2012, Atchley, Scott wrote:

> # ethtool -k ib0
> Offload parameters for ib0:
> rx-checksumming: off
> tx-checksumming: off
> scatter-gather: off
> tcp segmentation offload: off
> udp fragmentation offload: off
> generic segmentation offload: on
> generic-receive-offload: off
>
> There is no checksum support, which I would expect to lower performance.
> Since checksums need to be calculated on the host, I would expect faster
> processors to help performance some.

OK, that is a major problem. Both are on by default here. What NIC is
this?

>> A) increase the block size handled by the socket layer
>
> Do you mean altering sysctl with something like:

No, increase the MTU. Connected mode supports up to a 64K MTU size, I
believe.

> or increasing the SO_SNDBUF and SO_RCVBUF sizes, or something else?

That does nothing for performance. The problem is that the handling of the
data by the kernel causes too much latency, so you cannot reach the full
bandwidth of the hardware.

> We actually want to test the socket stack and not bypass it.

AFAICT the network stack is useful up to 1 Gbps; after that, more and more
band-aid comes into play.
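[Switching to connected mode and raising the MTU takes two commands; the interface name ib0 is assumed, and OFED's /etc/infiniband/openib.conf is the usual place to make the mode persistent, as noted later in the thread:]

echo connected > /sys/class/net/ib0/mode
ifconfig ib0 mtu 65520
# verify:
cat /sys/class/net/ib0/mode      # should print "connected"
ifconfig ib0 | grep MTU          # should report MTU:65520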
Re: IPoIB performance
On Sep 5, 2012, at 2:20 PM, Christoph Lameter wrote:

> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>
>> # ethtool -k ib0
>> Offload parameters for ib0:
>> rx-checksumming: off
>> tx-checksumming: off
>> scatter-gather: off
>> tcp segmentation offload: off
>> udp fragmentation offload: off
>> generic segmentation offload: on
>> generic-receive-offload: off
>>
>> There is no checksum support, which I would expect to lower
>> performance. Since checksums need to be calculated on the host, I would
>> expect faster processors to help performance some.
>
> OK, that is a major problem. Both are on by default here. What NIC is
> this?

These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output
of ibv_devinfo is in my original post.

>> Do you mean altering sysctl with something like:
>
> No, increase the MTU. Connected mode supports up to a 64K MTU size, I
> believe.

Yes, I am using the max MTU (65520).

>> or increasing the SO_SNDBUF and SO_RCVBUF sizes, or something else?
>
> That does nothing for performance. The problem is that the handling of
> the data by the kernel causes too much latency, so you cannot reach the
> full bandwidth of the hardware.
>
>> We actually want to test the socket stack and not bypass it.
>
> AFAICT the network stack is useful up to 1 Gbps; after that, more and
> more band-aid comes into play.

Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested any
40G Ethernet NICs, but I hope that they will get close to line rate. If
not, what is the point? ;-)

Scott
Re: IPoIB performance
On 09/05/12 19:59, Atchley, Scott wrote:

> On Sep 5, 2012, at 1:50 PM, Reeted wrote:
>
>> I have read that with newer cards the datagram (unconnected) mode is
>> faster at IPoIB than connected mode. Do you want to check?
>
> I have read that the latency is lower (better) but the bandwidth is
> lower. Using datagram mode limits the MTU to 2044 and the throughput to
> ~3 Gb/s on these machines/cards. Connected mode at the same MTU performs
> roughly the same. The win in connected mode comes with larger MTUs. With
> a 9000 MTU, I see ~6 Gb/s. Pushing the MTU to 65520 (the maximum for
> ipoib), I can get ~13 Gb/s.

Have a look at an old thread in this ML by Sebastien Dugue, "IPoIB to
Ethernet routing performance". He had numbers much higher than yours on
similar hardware, and was advised to use datagram mode to get offloading
and even higher speeds.

Keep me informed if you can fix this; I am interested but can't test
InfiniBand myself right now.
Re: IPoIB performance
On Wed, 5 Sep 2012, Atchley, Scott wrote:

>> AFAICT the network stack is useful up to 1 Gbps; after that, more and
>> more band-aid comes into play.
>
> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested
> any 40G Ethernet NICs, but I hope that they will get close to line rate.
> If not, what is the point? ;-)

Oh yes, they can, under restricted circumstances: large packets, multiple
cores, etc. With the band-aids...
Re: IPoIB performance
On Wed, 5 Sep 2012, Atchley, Scott wrote:

> These are Mellanox QDR HCAs (board id is MT_0D90110009). The full output
> of ibv_devinfo is in my original post.

Hmmm... You are running an old kernel. What version of OFED do you use?
Re: IPoIB performance
On Sep 5, 2012, at 3:04 PM, Reeted wrote:

> On 09/05/12 19:59, Atchley, Scott wrote:
>
>> On Sep 5, 2012, at 1:50 PM, Reeted wrote:
>>
>>> I have read that with newer cards the datagram (unconnected) mode is
>>> faster at IPoIB than connected mode. Do you want to check?
>>
>> I have read that the latency is lower (better) but the bandwidth is
>> lower. Using datagram mode limits the MTU to 2044 and the throughput to
>> ~3 Gb/s on these machines/cards. Connected mode at the same MTU
>> performs roughly the same. The win in connected mode comes with larger
>> MTUs. With a 9000 MTU, I see ~6 Gb/s. Pushing the MTU to 65520 (the
>> maximum for ipoib), I can get ~13 Gb/s.
>
> Have a look at an old thread in this ML by Sebastien Dugue, "IPoIB to
> Ethernet routing performance". He had numbers much higher than yours on
> similar hardware, and was advised to use datagram mode to get offloading
> and even higher speeds.
>
> Keep me informed if you can fix this; I am interested but can't test
> InfiniBand myself right now.

He claims 20 Gb/s, and Or replies that one should also get near 20 Gb/s
using datagram mode. I checked, and datagram mode shows support via
ethtool for more offloads. In my case, I still see better performance with
connected mode.

Thanks,

Scott
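[The offload difference between the two modes can be seen by toggling the mode and re-reading the offload flags; the interface name is assumed, and on some setups the interface may need to be quiesced before the mode switch takes effect:]

echo datagram > /sys/class/net/ib0/mode
ethtool -k ib0    # checksum/segmentation offloads typically show "on"
echo connected > /sys/class/net/ib0/mode
ethtool -k ib0    # offloads off, but the 65520 MTU becomes available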
Re: IPoIB performance
On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:

> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>
>>> AFAICT the network stack is useful up to 1 Gbps; after that, more and
>>> more band-aid comes into play.
>>
>> Hmm, many 10G Ethernet NICs can reach line rate. I have not yet tested
>> any 40G Ethernet NICs, but I hope that they will get close to line
>> rate. If not, what is the point? ;-)
>
> Oh yes, they can, under restricted circumstances: large packets,
> multiple cores, etc. With the band-aids...

With Myricom 10G NICs, for example, you just need one core and it can do
line rate with a 1500-byte MTU. Do you count the stateless offloads as
band-aids? Or something else?

I have not tested any 40G NICs yet, but I imagine that one core will not
be enough.

Thanks,

Scott
Re: IPoIB performance
On Sep 5, 2012, at 3:13 PM, Christoph Lameter wrote:

> On Wed, 5 Sep 2012, Atchley, Scott wrote:
>
>> These are Mellanox QDR HCAs (board id is MT_0D90110009). The full
>> output of ibv_devinfo is in my original post.
>
> Hmmm... You are running an old kernel. What version of OFED do you use?

Hah, if you think my kernel is old, you should see my userland (RHEL 5.5).
;-)

Does the version of OFED impact the kernel modules? I am using the modules
that came with the kernel. I don't believe that libibverbs or librdmacm
are used by the kernel's socket stack. That said, I am using source builds
with tags libibverbs-1.1.6 and v1.0.16 (librdmacm).

Scott
Re: IPoIB performance
On Wed, 5 Sep 2012, Atchley, Scott wrote:

> With Myricom 10G NICs, for example, you just need one core and it can do
> line rate with a 1500-byte MTU. Do you count the stateless offloads as
> band-aids? Or something else?

The stateless aids also have certain limitations. It's a grey zone whether
you want to call them band-aids; it gets there at some point, because
stateless offload can only get you so far. The need to send larger-sized
packets through the kernel increases latency and forces the app to do
larger batching. That is not very useful if you need to send small packets
to a variety of receivers.
Re: IPoIB performance
On 9/5/2012 3:48 PM, Atchley, Scott wrote:

> On Sep 5, 2012, at 3:06 PM, Christoph Lameter wrote:
>
>> Oh yes, they can, under restricted circumstances: large packets,
>> multiple cores, etc. With the band-aids...
>
> With Myricom 10G NICs, for example, you just need one core and it can do
> line rate with a 1500-byte MTU. Do you count the stateless offloads as
> band-aids? Or something else?
>
> I have not tested any 40G NICs yet, but I imagine that one core will not
> be enough.

Since you are using netperf, you might also consider experimenting with
the TCP_SENDFILE test. Using sendfile/splice calls can have a significant
impact for sockets-based apps. Using 40G NICs (Mellanox ConnectX-3 EN),
I've seen our applications hit 22 Gb/s single core/stream while fully CPU
bound. With sendfile/splice, there is no issue saturating a 40G link with
about 40-50% core utilization.

That being said, binding to the right core/node, message size and memory
alignment, interrupt handling, and proper host/NIC tuning all have an
impact on the performance. The state of high-performance networking is
certainly not plug-and-play.

- ezra
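[The comparison Ezra suggests can be run directly from netperf; hostname and fill file below are placeholders. TCP_SENDFILE transmits the file given with -F via sendfile() instead of copying it through user space, so the difference in CPU utilization against TCP_STREAM shows the cost of the extra user-space copy:]

# create a file for netperf to transmit, then compare the two tests
dd if=/dev/zero of=/tmp/netperf.fill bs=1M count=1024
netperf -H $SERVER -t TCP_STREAM -- -m 65536
netperf -H $SERVER -F /tmp/netperf.fill -t TCP_SENDFILE -- -m 65536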
Re: IPoIB performance
On Wed, 5 Sep 2012, Atchley, Scott wrote:

>> Hmmm... You are running an old kernel. What version of OFED do you use?
>
> Hah, if you think my kernel is old, you should see my userland (RHEL
> 5.5). ;-)

My condolences.

> Does the version of OFED impact the kernel modules? I am using the
> modules that came with the kernel. I don't believe that libibverbs or
> librdmacm are used by the kernel's socket stack. That said, I am using
> source builds with tags libibverbs-1.1.6 and v1.0.16 (librdmacm).

OFED includes kernel modules which provide the drivers that you need.
Installing a new OFED release on RHEL 5 is possible and would give you
up-to-date drivers. Check with RH: they may have them somewhere easy to
install for your version of RH.
Re: IPoIB performance
On Sep 5, 2012, at 4:12 PM, Ezra Kissel wrote:

> Since you are using netperf, you might also consider experimenting with
> the TCP_SENDFILE test. Using sendfile/splice calls can have a
> significant impact for sockets-based apps. Using 40G NICs (Mellanox
> ConnectX-3 EN), I've seen our applications hit 22 Gb/s single
> core/stream while fully CPU bound. With sendfile/splice, there is no
> issue saturating a 40G link with about 40-50% core utilization.
>
> That being said, binding to the right core/node, message size and memory
> alignment, interrupt handling, and proper host/NIC tuning all have an
> impact on the performance. The state of high-performance networking is
> certainly not plug-and-play.

Thanks for the tip. The app we want to test does not use sendfile() or
splice(). I do bind to the best core (determined by testing all
combinations on client and server).

I have heard others within DOE reach ~16 Gb/s on a 40G Mellanox NIC. I'm
glad to hear that you got to 22 Gb/s for a single stream. That is more
reassuring.

Scott
Re: IPoIB performance benchmarking
Dave,

Thanks for the pointer. I thought it was running in connected mode, and
looking at that variable that you mentioned confirms it:

[r...@gateway3 ~]# cat /sys/class/net/ib0/mode
connected

And the IP MTU shows up as:

[r...@gateway3 ~]# ifconfig ib0
ib0       Link encap:InfiniBand  HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
          inet addr:192.168.23.253  Bcast:192.168.23.255  Mask:255.255.254.0
          inet6 addr: fe80::211:7500:ff:6edc/64 Scope:Link
          UP BROADCAST RUNNING MULTICAST  MTU:65520  Metric:1
          RX packets:2319010 errors:0 dropped:0 overruns:0 frame:0
          TX packets:4512605 errors:0 dropped:33011 overruns:0 carrier:0
          collisions:0 txqueuelen:256
          RX bytes:5450805352 (5.0 GiB)  TX bytes:154353169896 (143.7 GiB)

This is partly why I'm stumped - I've seen threads about how connected
mode is supposed to improve IPoIB performance, but I'm not seeing as much
performance as I'd like.

Tom

On 04/12/2010 02:19 PM, Dave Olson wrote:
> On Mon, 12 Apr 2010, Tom Ammon wrote:
> | I'm trying to do some performance benchmarking of IPoIB on a DDR IB
> | cluster, and I am having a hard time understanding what I am seeing.
> |
> | When I do a simple netperf, I get results like these:
> |
> | [r...@gateway3 ~]# netperf -H 192.168.23.252
> | TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to
> | 192.168.23.252 (192.168.23.252) port 0 AF_INET
> | Recv   Send    Send
> | Socket Socket  Message  Elapsed
> | Size   Size    Size     Time     Throughput
> | bytes  bytes   bytes    secs.    10^6bits/sec
> |
> |  87380  65536   65536    10.01    4577.70
>
> Are you using connected mode, or UD? Since you say you have a 4K MTU,
> I'm guessing you are using UD. Change to use connected mode (edit
> /etc/infiniband/openib.conf), or as a quick test
>
>     echo connected > /sys/class/net/ib0/mode
>
> and then the MTU should show as 65520. That should help the bandwidth a
> fair amount.
>
> Dave Olson
> dave.ol...@qlogic.com

--
Tom Ammon
Network Engineer
Office: 801.587.0976
Mobile: 801.674.9273
Center for High Performance Computing
University of Utah
http://www.chpc.utah.edu
Re: IPoIB performance benchmarking
On Mon, 12 Apr 2010, Tom Ammon wrote:
| Thanks for the pointer. I thought it was running in connected mode, and
| looking at that variable that you mentioned confirms it:
|
| [r...@gateway3 ~]# ifconfig ib0
| ib0       Link encap:InfiniBand  HWaddr 80:00:00:02:FE:80:00:00:00:00:00:00:00:00:00:00:00:00:00:00
|           inet addr:192.168.23.253  Bcast:192.168.23.255  Mask:255.255.254.0
|           RX packets:2319010 errors:0 dropped:0 overruns:0 frame:0
|           TX packets:4512605 errors:0 dropped:33011 overruns:0 carrier:0

That's a lot of packets dropped on the tx side. If you have the QLogic
software installed, running ipathstats -c1 while you are running the test
would be useful; otherwise perfquery -r at the start and another perfquery
at the end on both nodes might point to something.

Oh, and depending on your TCP stack tuning, setting the receive and/or
send buffer size might help.

These are all DDR results, on a more or less OFED 1.5.1 stack (completely
unofficial, blah blah). And yes, multi-threading will bring the results up
(iperf, rather than netperf).

# netperf -H ib-host TCP_STREAM -- -m 65536
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

 87380  65536   65536    10.03    5150.24

# netperf -H ib-host TCP_STREAM -- -m 65536 -S 131072
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

262144  65536   65536    10.03    5401.83

# netperf -H ib-host TCP_STREAM -- -m 65536 -S 262144
TCP STREAM TEST from 0.0.0.0 (0.0.0.0) port 0 AF_INET to ib-host (172.29.9.46) port 0 AF_INET
Recv   Send    Send
Socket Socket  Message  Elapsed
Size   Size    Size     Time     Throughput
bytes  bytes   bytes    secs.    10^6bits/sec

524288  65536   65536    10.01    5478.28

Dave Olson
dave.ol...@qlogic.com
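[The counter check Dave suggests could look like this; the exact perfquery arguments vary per fabric, so the bare form shown here is an assumption:]

perfquery -r          # read and reset the local port counters
netperf -H ib-host -t TCP_STREAM -- -m 65536
perfquery             # non-zero error/discard counters hint at where drops occur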