>>>>> "re" == Richard Elling <richard.ell...@gmail.com> writes:

    re> Please don't confuse Ethernet with IP.

Okay, but I'm not, as you'll see if you look into it seriously.

Did you misread where I said FC can exert back-pressure?  I was
contrasting with Ethernet.

Ethernet output queues are either FIFO or RED, and they are large
compared to FC and IB.  FC uses buffer credits and HOL-blocks to keep
its small buffers from overflowing.  IB is...blocking, with almost no
buffer at all: about 2KB per port and a bandwidth*delay product of
about 1KB for the whole mesh, versus something like 48MB per port on
an Arista, so except to the pedantic IB is bufferless, i.e., it does
not even buffer one full frame.  Unlike Ethernet, both are lossless
fabrics (sounds good) and have an HOL-blocking character (sounds bad).
They're fundamentally different at L2, so this is not about IP.  If
you run IP over IB, it is still blocking and lossless.  It does not
magically start buffering when you use IP, because the fabric is
simply unable to buffer: there is no RAM in the mesh anywhere.  Both
L2 and L3 switches have output queues, and both can be FIFO or RED,
because the output buffer lives in the same piece of silicon whether
the switch is set to forward in L2 or L3 mode.  So L2 and L3 switches
are like each other and unlike FC and IB.  This is not about IP.
It's about Ethernet.
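
To put rough numbers on that ``bufferless'' claim (the link speed and
per-hop latency below are my own back-of-envelope assumptions, not
measurements):

  # QDR IB carries roughly 32 Gbit/s of data and a switch hop costs on
  # the order of 200 ns, so the in-flight data per hop is about:
  echo '32*10^9 / 8 * 0.0000002' | bc -l    # ~800 bytes, i.e. ~1KB

so one hop holds less than a single full-size frame in flight, which
is why the fabric has to block at the source instead of buffering.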

A congestion difference between L3 and L2 switches that would actually
be relevant (where confusing Ethernet with IP would matter) is ECN,
because only an L3 switch can flip the ECN bit.  But I don't think
anyone actually uses ECN.  It's disabled by default in Solaris and, I
think, in all other Unixes.  AFAICT my Extreme switches, a very old L3
flow-forwarding platform, are not able to flip the bit.  I think a
6500 can, but I'm not certain.
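
If you do want to poke at ECN from the Solaris end, the knob is, I
believe, the tcp_ecn_permitted ndd tunable; this is a sketch from
memory, not something I've verified on every release:

  # 0 = never use ECN, 1 = answer a peer that negotiates it (the
  # default, IIRC), 2 = actively negotiate ECN on outbound connections
  ndd /dev/tcp tcp_ecn_permitted
  ndd -set /dev/tcp tcp_ecn_permitted 2

and of course it only helps if something in the path will mark packets
instead of dropping them.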

    re> no back-off other than that required for the link. Since
    re> GbE and higher speeds are all implemented as switched fabrics,
    re> the ability of the switch to manage contention is paramount.
    re> You can observe this on a Solaris system by looking at the NIC
    re> flow control kstats.

You're really confused, though I'm sure you're going to deny it.
Ethernet flow control mostly isn't used at all, and it is never used
to manage output queue congestion except in hardware that everyone
agrees is defective.  I almost feel like I've written all this stuff
already, even the part about ECN.

Ethernet flow control is never correctly used to signal output queue
congestion.  The Ethernet signal for congestion is a dropped packet.
Flow control / PAUSE frames are *not* part of some magic mesh-wide
mechanism by which switches ``manage'' congestion.  PAUSE frames are
used, when they're used at all, for oversubscribed backplanes: for
congestion on *input*, which in Ethernet is something you want to
avoid.  You want to switch Ethernet frames through to the output port,
where they may or may not encounter congestion, so that you don't hold
up input frames headed toward other output ports.  If you did hold
them up, you'd have something like HOL blocking.  IB takes a different
approach: you simply accept the HOL blocking, but you tend to design a
mesh with little or no oversubscription, unlike Ethernet LANs, which
are heavily oversubscribed on their trunk ports.  So the HOL blocking
happens, but not as much as it would with a typical Ethernet topology,
and it happens in a way that in practice probably increases the
performance of storage networks.
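
Incidentally, if you want to see (or turn off) whether a Solaris NIC
will honor or emit PAUSE frames at all, recent builds expose it as a
dladm link property; the property values and the e1000g0 interface
below are from memory and for illustration only:

  # show the current flow-control setting on the link
  dladm show-linkprop -p flowctrl e1000g0
  # values are no / tx / rx / bi; turn PAUSE handling off entirely:
  dladm set-linkprop -p flowctrl=no e1000g0

None of which changes the point above: even where it's enabled, PAUSE
is about input pressure, not about managing output-queue congestion.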

This is interesting for storage because when you try to shove a
128kByte write into an Ethernet fabric, part of it may get dropped in
an output queue somewhere along the way.  In IB, part of the write
will never get dropped, but sometimes you can't shove it into the
network at all: it just won't go, at L2.  With Ethernet you rely on
TCP to emulate this can't-shove-in condition, and it does not do so
perfectly.  First, it can introduce huge jitter and link underuse (the
``incast'' problem:

 http://www.pdl.cmu.edu/PDL-FTP/Storage/FASTIncast.pdf

).  Second, it leaves many kilobytes in transit within the mesh or in
TCP buffers, like tens of megabytes and milliseconds per hop, which
requires large TCP buffers on both ends to match the bandwidth*jitter
product and frustrates storage QoS by queueing commands on the link
instead of in the storage device.  In exchange, Ethernet gives you no
HOL blocking and the possibility of end-to-end network QoS.  It is a
fair tradeoff, but arguably the wrong one for storage, judging by how
much iSCSI has sucked so far.
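
To make the buffer-matching arithmetic concrete (the bandwidth and
jitter figures are illustrative assumptions, not measurements from
anybody's network):

  # keeping a 1 Gbit/s path busy through 10 ms of queueing jitter
  # takes roughly bandwidth*jitter of socket buffer per connection:
  echo '10^9 / 8 * 0.010' | bc -l      # ~1.25 MB
  # the Solaris-side knobs are the usual ndd /dev/tcp tunables, e.g.:
  ndd -set /dev/tcp tcp_max_buf 4194304
  ndd -set /dev/tcp tcp_recv_hiwat 1048576
  ndd -set /dev/tcp tcp_xmit_hiwat 1048576

which is exactly the kind of buffering you don't have to carry around
on a fabric that blocks at L2.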

But the point is, looking at those ``flow control'' kstats will only
warn you if your switches are shit, and shit in one particular way
that even cheap switches rarely are.  The metric that's relevant is
how many packets are being dropped, in what pattern (a big bucket of
them at once like FIFO, or a scattering like RED), and how TCP is
adapting to those drops.  For this you might look at TCP stats in
Solaris, at output-queue drop and depth stats on managed switches, or
simply at the overall bandwidth, the ``goodput'' of the incast paper
(a couple of commands are sketched after the links below).  The flow
control kstats will never be activated by normal congestion, unless
you have some $20 gamer switch that is misdesigned:

  http://www.networkworld.com/netresources/0913flow2.html
  http://www.smallnetbuilder.com/content/view/30212/54/
  http://virtualthreads.blogspot.com/2006/02/beware-ethernet-flow-control.html
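
The things I'd actually look at, sketched (exact counter names vary by
driver and release, so adjust the grep patterns to taste):

  # TCP's view of loss: retransmit counters
  netstat -s -P tcp | egrep -i 'retrans'
  # any PAUSE / flow-control counters the NIC driver happens to export
  kstat -p | egrep -i 'pause|flowctrl'

plus whatever output-queue drop counters your managed switch will show
you.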

I said PAUSE frames are mostly never used, but Cisco's Nexus FCoE
supposedly does send PAUSE frames within a CoS when it has a link
partner willing to play its Cisco-FCoE game, so the PAUSEs apply to
that CoS and not to the whole link, and they have a completely
different purpose unrelated to the original PAUSE frames.  I'm
speculating from limited information, because I'm not interested in
Nexus and have not read much about it, much less own any.  Cisco has a
lot of slick talk about them that makes it sound like you're getting
the best of every buzzword, but AIUI the point is to create a
lossless, low-jitter, HOL-blocking VLAN for storage only, so that
storage traffic can be carried without eating huge amounts of switch
output buffer and without provoking TCP{,-like protocols} with
congestion-signal packet drops.  At the same time the other,
non-storage VLANs run in lossy, non-HOL-blocking mode, where nothing
blocks on input, the fabric signals congestion by dropping packets
from output queues, and color-marking diffserv-style QoS is possible,
which is what most TCP app developers are accustomed to.  I know some
FCoE stuff got checked into Solaris, but I don't think FCoE support
necessarily implies Nexus CoS-PAUSE support, so I don't know whether
Solaris even supports this type of weird PAUSE frame.  I do think it
would need to support these frames for FCoE to work well, because
otherwise you just push the incast problem out to the edge, to the
first switch facing the packet source.  Anyway, FCoE is not on the
table for any of this discussion so far.  I only mention it so you
won't try to make my whole post sound wrong by citing some pedantic
nit-picky detail.

    re> The latest OpenSolaris release is 2009.06 which treats all
    re> Zvol-backed COMSTAR iSCSI writes as sync. This was changed in
    re> the developer releases in summer 2009, b114.  For a release
    re> such as NexentaStor 3.0.2, which is based on b140 (+/-), the
    re> initiator's write cache enable/disable request is respected,
    re> by default.

That helps a little, but it's far from a full enough picture to be
useful to anyone, IMHO.  In fact it's pretty close to ``it varies and
is confusing,'' which I already knew:

 * How do I control the write cache from the initiator?  Though I
   think I already know the answer: ``it depends on which initiator,''
   and ``oh, you're using that one?  well, I don't know how to do it
   with THAT initiator'' == YOU DON'T.

 * Once the setting has been controlled, how long does it persist?
   Where can it be inspected?

 * ``by default'' == there is a way to make the target ignore the
   initiator's setting and, through a target shell command, force one
   setting or the other persistently?  (My guess at such a command is
   sketched after this list.)

 * Is the behavior different for file-backed LUNs than for zvols?
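
For what it's worth, my untested guess at a target-side knob for that
third question, based on the stmfadm LU properties (the wcd property
and the exact syntax are assumptions I have not verified on b140):

  # show the LU and its current writeback-cache-disable state
  stmfadm list-lu -v
  # force writeback cache off (wcd=true) or on (wcd=false) for one LU,
  # regardless of what the initiator asked for:
  stmfadm modify-lu -p wcd=true <lu-guid>

but whether that overrides or merely seeds the initiator-negotiated
setting, and whether it persists, is exactly what I can't tell from
the docs.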

I guess there is less point to figuring this out until the behavior is
settled.
