Re: [Beowulf] Help with inconsistent network performance

Patrick Geoffray Wed, 19 Dec 2007 00:37:30 -0800

Greg Lindahl wrote:

On Tue, Dec 18, 2007 at 09:05:41PM -0500, Patrick Geoffray wrote:

No, it just means the NIC supports it.


Well, then how about ethtool -S? That looks like an actual count of
flow control events, so rx flow control events means the switch
must support it in some fashion.

If this counter is nonzero, then you can say the switch does support RX
flow control, which is the most important. However, the NIC driver may
not report these events to ethtool, and you may need to generate some
contention in the switch. A simple test is to run a simple MPI code
where several senders stream to a single receiver. If you see an
aggregate bandwidth equal to the receiver link bandwidth, then flow
control works. If you see that all senders have the same bandwidth, then
the switch is fair on top of that.
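
To make the two checks above concrete, here is a rough Python sketch of
how one might read the numbers from such an N->1 run. It does not run
the MPI test itself; the function name, the dictionary keys and the 10%
tolerance are assumptions for illustration, not something from the test
described above.

```python
def interpret_incast(per_sender_mbps, link_mbps, tolerance=0.1):
    """Interpret per-sender bandwidths from an N->1 streaming test.

    per_sender_mbps: measured bandwidth of each sender (any unit)
    link_mbps:       receiver link bandwidth, same unit
    tolerance:       assumed relative slack for "equal" (hypothetical)
    """
    if not per_sender_mbps:
        raise ValueError("need at least one sender measurement")

    total = sum(per_sender_mbps)
    # Flow control works if the senders together fill the receiver link.
    flow_control_ok = abs(total - link_mbps) <= tolerance * link_mbps

    # The switch is fair if every sender is close to the mean share.
    mean = total / len(per_sender_mbps)
    fair = all(abs(b - mean) <= tolerance * mean for b in per_sender_mbps)

    return {"total": total, "flow_control_ok": flow_control_ok, "fair": fair}


if __name__ == "__main__":
    # Made-up numbers for a 1000 Mb/s receiver link.
    print(interpret_incast([240, 235, 245, 230], 1000))
    print(interpret_incast([700, 100, 100, 80], 1000))
```

The second call shows the case where the link is full but one sender
gets most of it: flow control works, but the switch is not fair.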

Well, we know it can be done perfectly, it's done in InfiniBand
switches, and that other 10 gig non-ethernet switch, what's it called?
Oh yeah, Myrinet. They do it, too.

In Ethernet, the sender has to finish sending the current packet before
stopping, so your switch buffers should be able to store a full frame
in addition to the wire delay. In Myrinet (and I presume in IB), the
hardware flow control can stop a sender in the middle of a packet, so
you only have to buffer the wire delay. It's 4 KB per port versus 12
to 16 KB per port. Not trivial, and some corners may be cut to save
space/money in the switch chips.
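
The arithmetic behind those figures can be sketched as follows. The
4 KB wire-delay figure is the one quoted above; the 9000-byte jumbo
frame is an assumption added here to get a number, and the function
name is made up for the example.

```python
def headroom_bytes(wire_delay_bytes, max_frame_bytes, can_stop_mid_packet):
    """Per-port buffer needed to absorb data in flight after a pause."""
    if can_stop_mid_packet:
        # Myrinet-style: only the data already on the wire must be stored.
        return wire_delay_bytes
    # Ethernet-style: the sender also finishes its current frame first.
    return wire_delay_bytes + max_frame_bytes


WIRE_DELAY = 4 * 1024   # 4 KB, figure from the paragraph above
JUMBO_FRAME = 9000      # assumed maximum frame size, not from the post

if __name__ == "__main__":
    print(headroom_bytes(WIRE_DELAY, JUMBO_FRAME, True))    # 4096
    print(headroom_bytes(WIRE_DELAY, JUMBO_FRAME, False))   # 13096
```

Under that assumption the Ethernet case comes out a bit under 13 KB,
which is in line with the 12 to 16 KB range.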

Flow control is not for everyone, and that's why it is often turned off
by default. When a sender is paused, it will stop sending anything,
including packets for different destinations. Dropping packets is
expensive to recover from, but it keeps things moving.
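
A toy Python model of that trade-off, under heavy simplification: one
sender, one switch buffer that fills up with packets for a destination
that never drains, and everything else delivered at once. The function
name, the "congested"/"free" labels and the buffer size are all
hypothetical; this is not how any real switch is implemented.

```python
def send_burst(packets, buffer_slots, flow_control):
    """Send a list of packets, each labelled "congested" or "free".

    Packets for the congested destination pile up in the switch buffer.
    With flow control, a full buffer pauses the sender for everything.
    Without it, overflowing packets are dropped and the rest go through.
    """
    stuck = 0       # congested packets sitting in the switch buffer
    delivered = 0   # packets that reached a free destination
    dropped = 0     # packets discarded by the switch
    held = 0        # packets the paused sender could not send at all
    for i, dst in enumerate(packets):
        if flow_control and stuck >= buffer_slots:
            held = len(packets) - i   # paused: nothing else leaves
            break
        if dst == "congested":
            if stuck < buffer_slots:
                stuck += 1
            else:
                dropped += 1
        else:
            delivered += 1
    return {"delivered": delivered, "dropped": dropped, "held": held}


if __name__ == "__main__":
    burst = ["congested", "congested", "free", "congested", "free", "free"]
    print(send_burst(burst, buffer_slots=2, flow_control=True))
    print(send_burst(burst, buffer_slots=2, flow_control=False))
```

With flow control on, nothing is lost but the packets for the free
destination are held back too; with it off, one packet is dropped and
the other traffic keeps moving.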

Can Myrinet even disable flow control? Odd that Ethernet is any
different; dropping any packets is an utter disaster for TCP.

I think it's technically possible to disable flow control in the switch
crossbars in Myrinet, but you would not want to. The NICs can change
routes quickly when they sense contention on a specific path (Quadrics
does the same thing, others can't). That helps a lot with the internal
hot spots that are frequent in HPC, but it does nothing against the N->1
communication pattern of death. As Mark pointed out, the best way around
it is to not have it in the first place.

Ethernet switches are often used in more hostile environments where you
cannot prevent such N->1 traffic: I could flood a particular machine on
a campus from a couple of hosts to produce contention. That would
saturate some internal links in the switch, which would propagate the
contention to other ports, more links are blocked, etc. If you can
sustain the contention for a few seconds on a busy switch, then you can
block the whole thing: complete meltdown.

That's why high-end switches/routers are super expensive: they are way
over-dimensioned inside to be able to handle contention. That's also
why the FCoE folks are pushing for per-priority flow control in
Ethernet, so that untrusted/misbehaving traffic can be dropped and not
affect trusted/important FCoE traffic that should not be dropped. And
that's why switch flow control is turned off by default most of the time.


Patrick