>>>>> "re" == Richard Elling <richard.ell...@gmail.com> writes:
re> Please don't confuse Ethernet with IP.

Okay, but I'm not. Seriously, if you'll look into it. Did you misread where I said FC can exert back-pressure? I was contrasting with Ethernet.

Ethernet output queues are either FIFO or RED, and they are large compared to FC and IB. FC uses buffer credits, which HOL-block to prevent its small buffers from overflowing, and IB is...blocking (almost no buffer at all: about 2KB per port, with a bandwidth*delay product of about 1KB for the whole mesh, compared to an Arista switch with about 48MB per port, so except to pedants IB is bufferless, i.e. it does not even buffer one full frame). Unlike Ethernet, both are lossless fabrics (sounds good) and have an HOL-blocking character (sounds bad). They're fundamentally different at L2, so this is not about IP. If you run IP over IB, it is still blocking and lossless. It does not magically start buffering when you use IP, because the fabric is simply unable to buffer---there is no RAM in the mesh anywhere.

Both L2 and L3 switches have output queues, and both can be FIFO or RED, because the output buffer lives in the same piece of silicon whether the switch is forwarding in L2 or L3 mode. So L2 and L3 Ethernet switches are like each other and unlike FC and IB. This is not about IP. It's about Ethernet.

One congestion difference between L3 and L2 switches where confusing Ethernet with IP would actually matter is ECN, because only an L3 switch can mark the ECN bits. But I don't think anyone actually uses ECN. It's disabled by default in Solaris and, I think, all other Unixes. AFAICT my Extreme switches, a very old L3 flow-forwarding platform, are not able to flip the bit. I think the 6500 can, but I'm not certain.

re> no back-off other than that required for the link. Since
re> GbE and higher speeds are all implemented as switched fabrics,
re> the ability of the switch to manage contention is paramount.
re> You can observe this on a Solaris system by looking at the NIC
re> flow control kstats.

You're really confused, though I'm sure you're going to deny it. Ethernet flow control mostly isn't used at all, and it is never used to manage output queue congestion except in hardware that everyone agrees is defective. I almost feel like I've written all this stuff already, even the part about ECN.

Ethernet flow control is never correctly used to signal output queue congestion. The Ethernet signal for congestion is a dropped packet. Flow control / PAUSE frames are *not* part of some magic mesh-wide mechanism by which switches ``manage'' congestion. PAUSE frames are used, when they're used at all, for oversubscribed backplanes: for congestion on *input*, which in Ethernet is something you want to avoid. You want to switch Ethernet frames to the output port, where they may or may not encounter congestion, so that you don't hold up input frames headed toward other output ports. If you did hold them up, you'd have something like HOL blocking.

IB takes a different approach: you simply accept the HOL blocking, but you tend to design a mesh with little or no oversubscription, unlike Ethernet LANs, which are heavily oversubscribed on their trunk ports. So the HOL blocking happens, but not as much as it would with a typical Ethernet topology, and it happens in a way that in practice probably increases the performance of storage networks.
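To put rough numbers on the ``bufferless'' claim above, here is a back-of-the-envelope sketch. The link rate and latency are assumptions I'm making for illustration; the buffer figures are the ones quoted above, and the jumbo-frame size is only there for comparison:

    # Back-of-the-envelope: why IB is "bufferless" for practical purposes.
    # Link rate and latency are illustrative assumptions, not measurements.

    def bdp_bytes(link_rate_bps, one_way_latency_s):
        """Bandwidth*delay product: bytes in flight on a path at full rate."""
        return link_rate_bps * one_way_latency_s / 8

    ib_rate = 40e9            # assumed 4x QDR-class signalling rate, bits/s
    ib_mesh_latency = 200e-9  # assumed few-hop cut-through latency, seconds

    ib_bdp = bdp_bytes(ib_rate, ib_mesh_latency)
    print("IB mesh BDP: %.0f bytes" % ib_bdp)   # ~1 KB in flight for the whole mesh

    ib_port_buffer = 2 * 1024         # ~2 KB per port, per the figures above
    arista_port_buffer = 48 * 2**20   # ~48 MB per port, per the figures above
    jumbo_frame = 9000                # jumbo Ethernet frame, for scale

    print("IB port buffer holds %.2f jumbo frames" % (ib_port_buffer / jumbo_frame))
    print("Arista port buffer holds %.0f jumbo frames" % (arista_port_buffer / jumbo_frame))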
This is interesting for storage because when you try to shove a 128kByte write into an Ethernet fabric, part of it may get dropped in an output queue somewhere along the way. In IB, part of the write will never get dropped, but sometimes you can't shove it into the network at all---it just won't go, at L2.

With Ethernet you rely on TCP to emulate this can't-shove-in condition, and it does not work perfectly. First, it can introduce huge jitter and link underuse (the ``incast'' problem: http://www.pdl.cmu.edu/PDL-FTP/Storage/FASTIncast.pdf ). Second, it leaves many kilobytes in transit within the mesh or in TCP buffers (tens of megabytes and milliseconds per hop), requiring large TCP buffers on both ends to match the bandwidth*jitter and frustrating storage QoS by queueing commands on the link instead of in the storage device. In exchange you get from Ethernet no HOL blocking and the possibility of end-to-end network QoS. It's a fair tradeoff, but arguably the wrong one for storage, based on the experience of iSCSI sucking so far.

But the point is, looking at those ``flow control'' kstats will only warn you if your switches are shit, and shit in one particular way that even cheap switches rarely are. The relevant metrics are how many packets are being dropped, in what pattern (a big bucket of them at once, like FIFO, or a scattering, like RED), and how TCP is adapting to those drops. For that you might look at the TCP stats in Solaris, or at output-queue drop and output-queue depth stats on managed switches, or simply at the overall bandwidth (the ``goodput'' in the incast paper). The flow control kstats will never be activated by normal congestion unless you have some $20 gamer switch that is misdesigned:

  http://www.networkworld.com/netresources/0913flow2.html
  http://www.smallnetbuilder.com/content/view/30212/54/
  http://virtualthreads.blogspot.com/2006/02/beware-ethernet-flow-control.html

I said PAUSE frames are mostly never used, but Cisco's Nexus FCoE gear supposedly does send PAUSE frames within a CoS when its link partner wants to play the Cisco FCoE game, so the PAUSE applies to that CoS and not to the whole link, and these frames have a completely different purpose from the original PAUSE frames. I'm speculating from limited information, because I'm not interested in Nexus and have not read much about it, much less used any. Cisco has a lot of slick talk that makes it sound like you're getting the best of every buzzword, but AIUI the point is to create a lossless, low-jitter, HOL-blocking VLAN for storage only, so that storage traffic can be carried without eating huge amounts of switch output buffer and without provoking TCP (and TCP-like protocols) with congestion-signal packet drops, while at the same time running the non-storage VLANs in lossful, non-HOL-blocking mode, where nothing blocks on input, the fabric signals congestion by dropping packets from output queues, and color-marking diffserv-style QoS is possible, which is what most TCP app developers are accustomed to.

I know some FCoE stuff got checked into Solaris, but I don't think FCoE support necessarily implies support for Nexus's per-CoS PAUSE, so I don't know whether Solaris even supports this type of weird PAUSE frame. I do think it would need to for FCoE to work well, because otherwise you just push the incast problem out to the edge, to the first switch facing the packet source. Anyway, FCoE's not on the table for any of this discussion so far. I only mention it so you won't try to make my whole post sound wrong by bringing up some pedantic nit-picky detail.
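Going back to the incast point: here's a toy model, with every parameter assumed purely for illustration, of why several servers answering a striped request at the same instant overflow a shallow output queue no matter how well each individual TCP connection behaves:

    # Toy incast model: N servers each burst S bytes toward one client port
    # at the same instant.  The switch drains the output port at line rate
    # but fills it at N * line rate, so the queue grows at (N-1) * line rate.
    # All figures are assumptions for illustration, not measurements.

    def incast_overflow(n_senders, burst_bytes, output_buffer_bytes):
        """Bytes that cannot fit in the output queue during a synchronized burst."""
        peak_queue = (n_senders - 1) * burst_bytes   # worst case: fully simultaneous arrival
        return max(0, peak_queue - output_buffer_bytes)

    n = 8                  # stripe width / number of storage servers
    burst = 128 * 1024     # one 128 KB block from each sender
    buffer = 512 * 1024    # assumed output-queue space on a shallow-buffered switch

    dropped = incast_overflow(n, burst, buffer)
    print("peak queue: %d KB, dropped: %d KB" % ((n - 1) * burst // 1024, dropped // 1024))
    # -> ~896 KB offered against a 512 KB buffer: ~384 KB dropped, the affected
    #    TCP connections stall on retransmit timers, and goodput collapses
    #    (the FASTIncast paper's point).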
re> The latest OpenSolaris release is 2009.06 which treats all
re> Zvol-backed COMSTAR iSCSI writes as sync.  This was changed in
re> the developer releases in summer 2009, b114.  For a release
re> such as NexentaStor 3.0.2, which is based on b140 (+/-), the
re> initiator's write cache enable/disable request is respected,
re> by default.

That helps a little, but it's far from a full enough picture to be useful to anyone, IMHO. In fact it's pretty close to ``it varies and is confusing,'' which I already knew:

* How do I control the write cache from the initiator? Though I think I already know the answer: ``it depends on which initiator,'' and ``oh, you're using that one? well, I don't know how to do it with THAT initiator'' == YOU DON'T.

* Once the setting has been changed, how long does it persist? Where can it be inspected?

* ``by default'' == there is a way to make the target ignore the initiator's setting and, through a target shell command, force it to one setting or the other, persistently? (A guess at what that might look like is sketched below.)

* Is the behavior different for file-backed LUNs than for zvols?

I guess there is less point in figuring this out until the behavior is settled.
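On that third bullet, the target-side override: here is a rough guess at what it might look like with COMSTAR's stmfadm, assuming the LU really does expose a write-cache-disable (``wcd'') property on these builds. I have not verified the property name, the list-lu output format, or whether the setting persists, so treat this as a sketch of the question rather than an answer:

    # Sketch: inspect and force the write-cache setting on a COMSTAR LU from
    # the target side.  Assumes stmfadm exposes a "wcd" (write-cache disabled)
    # LU property; property name and output format are unverified assumptions.
    import subprocess

    def show_lu_cache(guid):
        # "stmfadm list-lu -v" prints per-LU properties; I believe one of them
        # is a "Writeback Cache Disabled" line.
        out = subprocess.run(["stmfadm", "list-lu", "-v", guid],
                             capture_output=True, text=True, check=True).stdout
        for line in out.splitlines():
            if "Cache" in line:
                print(line.strip())

    def force_write_cache(guid, enabled):
        # wcd means "write cache disabled", so enabling the cache is wcd=false.
        prop = "wcd=%s" % ("false" if enabled else "true")
        subprocess.run(["stmfadm", "modify-lu", "-p", prop, guid], check=True)

    # Example with a hypothetical GUID:
    # show_lu_cache("600144f0...")
    # force_write_cache("600144f0...", enabled=False)   # force write-through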