Greg Lindahl wrote:
On Tue, Dec 18, 2007 at 09:05:41PM -0500, Patrick Geoffray wrote:

No, it just means the NIC supports it.

Well, then how about ethtool -S? That looks like an actual count of
flow control events, so rx flow control events means the switch
must support it in some fashion.
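
A minimal sketch of pulling those counters out of `ethtool -S` text, for anyone who wants to script the check. Counter names are driver-specific, so the substrings in `FLOW_HINTS` and the sample output are assumptions for illustration, not output captured from a real NIC.

```python
import re

# Substrings that commonly show up in pause/flow-control counter names.
# ASSUMPTION: names vary by driver; adjust this tuple for your NIC.
FLOW_HINTS = ("pause", "flow_control", "xon", "xoff")


def flow_control_counters(ethtool_output):
    """Return {counter_name: value} for counters that look flow-control related."""
    counters = {}
    for line in ethtool_output.splitlines():
        match = re.match(r"\s*([^:]+):\s*(\d+)\s*$", line)
        if match is None:
            continue  # header line such as "NIC statistics:" or a non-numeric value
        name, value = match.group(1).strip(), int(match.group(2))
        if any(hint in name.lower() for hint in FLOW_HINTS):
            counters[name] = value
    return counters


# Illustrative sample only (hypothetical values).
SAMPLE = """NIC statistics:
     rx_packets: 1048576
     tx_packets: 917504
     rx_flow_control_xon: 12
     rx_flow_control_xoff: 12
     tx_flow_control_xon: 0
     tx_flow_control_xoff: 0
"""

if __name__ == "__main__":
    for name, value in flow_control_counters(SAMPLE).items():
        print(f"{name}: {value}")
```

In practice the text would come from running `ethtool -S` on the interface in question; a nonzero rx counter is the interesting case.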

If this counter is nonzero, then you can say the switch does support RX flow control, which is the most important part. However, the NIC driver may not report these events to ethtool, and you may first need to generate some contention in the switch. A simple test is to run a small MPI code where several senders stream to a single receiver. If the aggregate bandwidth equals the receiver's link bandwidth, then flow control works. If all senders also see the same bandwidth, then the switch is fair on top of that.
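
The interpretation of that N->1 test can be sketched as follows. This covers only the arithmetic on the measured numbers, not the MPI benchmark itself; the function name, the 10% `tolerance`, and the bandwidth figures in the usage lines are all hypothetical.

```python
def assess_incast(sender_mbps, link_mbps, tolerance=0.10):
    """Interpret per-sender bandwidths measured while all senders stream to one receiver.

    tolerance is an assumed slack (10% by default) for protocol overhead and noise.
    """
    if not sender_mbps:
        raise ValueError("need at least one sender measurement")
    aggregate = sum(sender_mbps)
    # Flow control works if the receiver's link stays full despite the contention.
    flow_control_ok = aggregate >= (1.0 - tolerance) * link_mbps
    # The switch is fair if every sender gets about the same share.
    mean = aggregate / len(sender_mbps)
    fair = all(abs(bw - mean) <= tolerance * mean for bw in sender_mbps)
    return {"aggregate_mbps": aggregate, "flow_control_ok": flow_control_ok, "fair": fair}


if __name__ == "__main__":
    # Made-up measurements on an assumed 1000 Mb/s receiver link.
    print(assess_incast([240, 235, 238, 236], 1000))  # link full, even shares
    print(assess_incast([700, 100, 80, 70], 1000))    # link full, uneven shares
    print(assess_incast([90, 80, 85, 95], 1000))      # link far from full
```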

Well, we know it can be done perfectly, it's done in InfiniBand
switches, and that other 10 gig non-ethernet switch, what's it called?
Oh yeah, Myrinet. They do it, too.

In Ethernet, the sender has to finish sending the current packet before stopping, so your switch buffers should be able to store a full frame in addition to the data in flight during the wire delay. In Myrinet (and I presume in IB), the hardware flow control can stop a sender in the middle of a packet, so you only have to buffer the wire delay. It's 4 KB per port versus 12 to 16 KB per port. That's not trivial, and some corners may be cut to save space/money in the switch chips.
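
A rough worked version of that arithmetic, as a sketch only. It reads the 4 KB figure as the wire-delay share and assumes 9000-byte jumbo frames for the "full frame" term; neither assumption is stated in the text above, and real switch chips may budget differently.

```python
KB = 1024


def pause_buffer_bytes(wire_delay_bytes, max_frame_bytes, can_stop_mid_packet):
    """Per-port buffering needed to absorb data after asking the sender to stop."""
    if can_stop_mid_packet:
        # Myrinet-style: only the data already on the wire must be absorbed.
        return wire_delay_bytes
    # Ethernet-style: the sender also finishes the frame it is currently sending.
    return wire_delay_bytes + max_frame_bytes


WIRE_DELAY = 4 * KB  # the 4 KB figure, read here as the wire-delay share
JUMBO_FRAME = 9000   # ASSUMPTION: 9000-byte jumbo frames

if __name__ == "__main__":
    mid_packet = pause_buffer_bytes(WIRE_DELAY, JUMBO_FRAME, True)
    full_frame = pause_buffer_bytes(WIRE_DELAY, JUMBO_FRAME, False)
    print(mid_packet / KB)            # 4.0
    print(round(full_frame / KB, 1))  # 12.8
```

Under these assumptions the Ethernet case lands at about 12.8 KB, at the low end of the 12 to 16 KB range quoted above.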

Flow-control is not for everyone, and that's why it is often turned off by default. When a sender is paused, it stops sending anything, including packets for other destinations. Dropping packets is expensive to recover from, but it keeps things moving.
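
A toy model of that trade-off, under heavy simplifying assumptions: one sender, one FIFO, one packet per tick, and a destination that stays congested for the whole run. The `drain` function and the destination names are hypothetical; this is not how any real NIC or switch is implemented.

```python
from collections import deque


def drain(queue, congested, ticks, pause):
    """Send at most one packet per tick from a FIFO of destination names.

    With pause=True a packet for a congested destination stalls the whole queue;
    with pause=False it is dropped and the next packet moves up.
    Returns (delivered, dropped).
    """
    pending = deque(queue)
    delivered, dropped = [], []
    for _ in range(ticks):
        if not pending:
            break
        if pending[0] in congested:
            if pause:
                continue  # sender is paused: nothing leaves, whatever its destination
            dropped.append(pending.popleft())
        else:
            delivered.append(pending.popleft())
    return delivered, dropped


if __name__ == "__main__":
    packets = ["B", "A", "C", "A", "D"]  # destination of each queued packet
    print(drain(packets, {"A"}, ticks=5, pause=True))   # (['B'], [])
    print(drain(packets, {"A"}, ticks=5, pause=False))  # (['B', 'C', 'D'], ['A', 'A'])
```

With pause, the packets for C and D wait behind the one for A; with drop, they get through, and the two packets for A must be recovered later by a higher layer.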

Can Myrinet even disable flow control? Odd that Ethernet is any
different; dropping any packets is an utter disaster for TCP.

I think it's technically possible to disable flow control in the switch crossbars in Myrinet, but you would not want to. The NICs can change routes quickly when they sense contention on a specific path (Quadrics does the same thing, others can't). That helps a lot for internal hot spots that are frequent in HPC, but it does nothing against the N->1 communication pattern of death. As Mark pointed out, the best way around it is to not have it in the first place.

Ethernet switches are often used in more hostile environments where you cannot prevent such N->1 traffic. I could flood a particular machine on a campus from a couple of hosts to produce contention. That would saturate some internal links in the switch, which would propagate the contention to other ports, more links get blocked, and so on. If you can sustain the contention for a few seconds on a busy switch, then you can block the whole thing: complete meltdown.

That's why high-end switches/routers are super expensive: they are heavily over-dimensioned inside to be able to handle contention. That's also why the FCoE folks are pushing for per-priority flow control in Ethernet, so that untrusted/misbehaving traffic can be dropped without affecting trusted/important FCoE traffic that should not be dropped. And that's why switch flow control is turned off by default most of the time.

Patrick
_______________________________________________
Beowulf mailing list, [email protected]
To change your subscription (digest mode or unsubscribe) visit 
http://www.beowulf.org/mailman/listinfo/beowulf
