Hello Jesse, thanks for all the advice! I experimented some more...
Our usage scenario is an NFS server serving 16 blades via the two ports of the 10GbE adapter, connected directly to the blade center switch (Cisco 6220). Maximum disk read speed is 400 MBytes/sec, but we also ran tests where the server reads from its buffer cache while the disk stays idle, so we can saturate the network. Aside from the standard TCP_MAERTS netperf test, the test is "cat /nfs/<longfile> > /dev/null" on each of the blades, where longfile is around 800 MB.

Over NFSv4 we get around 500 MBytes/sec aggregate read throughput; when we distribute the IRQs evenly over the 32 cores, it goes up to 1.1 GBytes/sec. By default all 64 IRQs (32 per port, 2 ports) are assigned to CPU#0, which is then pretty much exhausted. These tests were all with MTU=1500, and we are quite happy with that.

In the meantime, we successfully achieved 989 MBit/s with MTU=9000 for a single blade, so there seems to be no general problem here. With MTU=1500 we get only 938 MBit/s, so this is a considerable increase. The trick was apparently to set the netperf socket buffer size to 128K so that TCP segmentation offload can be used efficiently. The packet dump shows no dropped or out-of-order packets, not even a TCP window full, so in principle it should be running smoothly.

However, when we run the distributed test with MTU=9000, we only get 700 MBytes/sec instead of the 1.1 GBytes/sec. The trace shows many TCP out-of-order segments, and maybe also lost ones (tcpdump itself always drops packets when we capture on the server). On the receiving (Broadcom) side, I see no sign of packet drops: ethtool -S reports "rx_error_bytes: 0", there is no "dropped" counter, and the counters from ifconfig are 0. According to netstat -s, within 10 seconds there were 140 "packet rejects in established connections because of timestamp" on the server; might this be a hint at some specific problem? There were also 98 TCPSlowStartRetrans, which should hurt quite a bit.
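In case it helps others on the list, here is roughly what the IRQ spreading looks like. The interface name "eth2" and the 32-core count below are illustrative, not necessarily the exact names on our box; the real IRQ numbers come from /proc/interrupts, and the writes need root, so they are shown commented out:

```shell
#!/bin/sh
# Sketch: spread the NIC IRQs round-robin over the cores by writing a
# one-bit hex CPU mask to /proc/irq/<n>/smp_affinity.

NCORES=32

cpu_mask() {
    # One-bit hex affinity mask for core $1, e.g. core 5 -> "20"
    printf '%x\n' $((1 << $1))
}

# Needs root; uncomment to actually apply (eth2 is a placeholder):
# cpu=0
# for irq in $(awk '/eth2/ { sub(":", "", $1); print $1 }' /proc/interrupts); do
#     cpu_mask $cpu > /proc/irq/$irq/smp_affinity
#     cpu=$(( (cpu + 1) % NCORES ))
# done

cpu_mask 0    # core 0  -> mask 1
cpu_mask 5    # core 5  -> mask 20
cpu_mask 31   # core 31 -> mask 80000000
```

On the netperf side, if I recall the flags correctly, the 128K socket buffers are set with the test-specific options, along the lines of "netperf -H <server> -t TCP_MAERTS -- -s 128K -S 128K".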
Of course we don't want to switch off TSO at the sender, as it is quite busy enough already. I also don't think buffering should be a problem, since in such a simple network no packets should be lost, and I don't want to touch the switch; I think it's fine just as it came out of the box. A short test with txqueuelen=100 shows the same bad performance.

So we'll stick with MTU=1500 for now, as we can easily reach disk read speed with it. In a year, when we get the second RAID controller and evaluate speeds again, we'll look at it once more and will have to dig into where the packets might be getting lost.

Thanks again for all the helpful feedback!

Regards, Walter

--
Walter Zimmer
German Aerospace Center (DLR)
Earth Observation Center (EOC)
Remote Sensing Technology Institute (IMF)
Department Atmospheric Processors (AP)
Oberpfaffenhofen
82234 Wessling
Tel.: +49 (8153) 28 1492
Fax: +49 (8153) 28 1446
email: [email protected]
Internet: http://www.dlr.de/eoc
_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired
