Hello Jesse,

thanks for all the advice! I experimented some more...

Our usage scenario is an NFS server serving 16 blades via the two
ports of the 10GbE adapter, connected directly to the blade center
switch (Cisco 6220). The maximum disk read speed is 400 MBytes/sec,
but we also ran tests where the server reads from its buffer cache and
the disk stays idle, so that we can saturate the network.

Aside from the standard netperf TCP_MAERTS test, the test is
"cat /nfs/<longfile> > /dev/null" on each of the blades, where <longfile>
is around 800 MB.

Over NFSv4 we get around 500 MBytes/sec aggregate read throughput;
when we distribute the IRQs evenly over the 32 cores it goes up to
1.1 GBytes/sec. By default, all 64 IRQs (32 per port for the two
10GbE ports) are assigned to CPU#0, which then becomes the bottleneck.
These tests were all with MTU=1500. We are quite happy with that.
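For reference, the pinning is roughly the following sketch. The IRQ
range 64..127 is made up for illustration (check /proc/interrupts for
the real numbers), and irqbalance should be stopped first so it does
not undo the pinning:

```shell
# Pin the 64 adapter IRQs (hypothetical numbers 64..127) round-robin
# onto the 32 cores. Each mask is a one-hot hex CPU bitmask, as
# expected by /proc/irq/<n>/smp_affinity.
cpu=0
for irq in $(seq 64 127); do
    mask=$(printf '%x' $((1 << cpu)))   # e.g. cpu=5 -> mask "20"
    # Only write if the IRQ actually exists and we may write to it;
    # ignore IRQs the kernel refuses to move.
    if [ -w "/proc/irq/$irq/smp_affinity" ]; then
        echo "$mask" > "/proc/irq/$irq/smp_affinity" 2>/dev/null || :
    fi
    cpu=$(( (cpu + 1) % 32 ))
done
```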

In the meantime we successfully reached 989 MBit/s with MTU=9000
for a single blade, so there seems to be no general problem here.
With MTU=1500 we get only 938 MBit/s, so this is a considerable
increase. The trick was apparently to set the netperf socket buffer
size to 128K, so that TCP segmentation offload can be used efficiently.
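The invocation that made the difference looked roughly like this
(the host name is a placeholder and the 30-second run length is just
an example; -s sets the local and -S the remote socket buffer size):

```shell
# 128 KiB socket buffers on both ends were the trick that let TSO
# build large segments with MTU=9000.
SOCKBUF=131072    # 128K
if command -v netperf >/dev/null 2>&1; then
    netperf -H nfsserver -t TCP_MAERTS -l 30 -- -s "$SOCKBUF" -S "$SOCKBUF"
fi
```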

The packet dump shows no dropped or out-of-order packets, and not even
a TCP window full, so in principle it should be running smoothly.

However, when we run the distributed test, we only get 700 MBytes/sec
instead of the 1.1 GBytes/sec. The trace shows many TCP out-of-order
segments, and possibly also lost ones (tcpdump always drops packets
when we capture on the server itself).

On the (Broadcom) receiving side I see no signs of packet drops:
ethtool -S reports "rx_error_bytes: 0", there is no incrementing
"dropped" counter, and the error counters from ifconfig are 0.

netstat -s:
Within 10 seconds there were 140 "packet rejects in established
connections because of timestamp" on the server; might this be a hint
at some specific problem? There were also 98 TCPSlowStartRetrans,
which should hurt quite a bit.
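In case it helps anyone reproducing this: a tiny helper we could use
to pull those counters out of `netstat -s` so the deltas can be
logged periodically. The grep patterns below match the counter wording
quoted above; adjust them if your netstat (or nstat) phrases them
differently:

```shell
# Extract the first number from the first counter line matching a
# pattern, e.g.:
#   netstat -s | get_counter 'timestamp'
#   netstat -s | get_counter 'SlowStartRetrans'
get_counter() {
    grep -i "$1" | grep -oE '[0-9]+' | head -n 1
}
```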

Of course we don't want to switch off TSO on the sender, as it is
quite busy enough already.

I also don't think that buffering should be a problem, as in such a
simple network no packets should be lost. I also don't want to touch
the switch; I think it is fine just as it came out of the box.
A short test with txqueuelen=100 showed the same bad performance.

So we'll stick with MTU 1500 for now, as we can easily reach the
disk read speed. In a year, however, when we get the second RAID
controller and evaluate speeds again, we'll look at it once more
and dig into where packets might be getting lost.

Thanks again for all the helpful feedback!

Regards,
Walter
-- 
Walter Zimmer

German Aerospace Center (DLR)
Earth Observation Center (EOC)
Remote Sensing Technology Institute (IMF)
Department Atmospheric Processors (AP)

Oberpfaffenhofen
82234 Wessling

Tel.: +49 (8153) 28 1492
Fax: +49 (8153) 28 1446
email: [email protected]
Internet: http://www.dlr.de/eoc

_______________________________________________
E1000-devel mailing list
[email protected]
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit
http://communities.intel.com/community/wired
