>>>>> "rw" == Ross Walker <rswwal...@gmail.com> writes:

    rw> you can create a LAG which does redundancy and load balancing.

be careful---these aggregators are all hash-based, so the question
is: what is the hash taken over?  The widest scale on which the hash
can be taken is L4 (TCP source/dest port numbers), because this type
of aggregation only preserves packet order within a single link, and
reordering packets is ``bad'' (not sure why exactly, but I presume it
hurts TCP performance), so the way around that problem is to keep
each TCP flow nailed to a particular physical link.  There's a
'dladm -P L4' option, so I imagine L4 hashing is supported on the
transmit side *iff* you explicitly ask for it.  Things like that are
sometimes more or less performant depending on the NIC you buy, but I
can't imagine a convincing story why that would matter in this case.
So that handles the TRANSMIT direction.
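
For concreteness, something like this ought to do it (a rough sketch:
the link names are made up, and the create-aggr syntax differs a bit
between the older key-based dladm and the newer Clearview-style one,
so check dladm(1M) on your build):

   # create the aggregation with an L4 (TCP/UDP port) hash policy
   dladm create-aggr -P L4 -l nxge0 -l nxge1 aggr0

   # or switch an existing aggregation over to L4 hashing
   dladm modify-aggr -P L4 aggr0

   # confirm the POLICY column now says L4
   dladm show-aggr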

The RECEIVE direction is another story.  Application-layer multipath
uses a different source IP address for the two sessions, so both sent
and received traffic will be automatically spread over the two NIC's.
With LACP-style aggregation it's entirely at the discretion of each
end of the link how it divides up the traffic it transmits.
Typically switches hash on L2 MAC only, which is useless here because
all the traffic between two hosts shares a single MAC pair: that
default is meant for switch-to-switch trunks with many end systems on
either side.  The host->switch direction is covered by dladm above,
but if you want L4 hashing for packets in the switch->host direction
you must buy an L3 switch and configure it ``appropriately'', which
seems to be described here for the Cisco 6500:

 
http://www.cisco.com/en/US/docs/switches/lan/catalyst6500/ios/12.1E/native/configuration/guide/channel.html#wp1020804

I believe it's a layer-violating feature (the L2 port channel peeks
at the L3/L4 headers to pick a member link), so it works fine on a
port channel in an L2 VLAN.  You don't have to configure a /30,
router-style, non-VLAN, two-host-subnet interface on the 6500 to use
L4 hashing, I think.
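
On the 6500 that boils down to a one-line global config change
(sketch only; the exact keywords available depend on the supervisor
and PFC, so check the guide above):

   Switch# configure terminal
   Switch(config)# port-channel load-balance src-dst-port
   Switch(config)# end
   ! verify which hash the switch is now using
   Switch# show etherchannel load-balance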

However, the Cisco command applies to all port channels on the entire
switch (!), including trunks to other switches, so the network team
is likely to give lots of push-back when you ask them to turn this
knob.  IMHO it's not harmful and they should do it for you, but maybe
they will complain about SYN-flood vulnerability and TCAM wastage and
``wait, how does it interact with dCEF?'' and all the other FUD they
usually bring up whenever you want to actually use a feature of the
6500 instead of just bragging about its theoretical availability.

Finally, *ALL THIS IS COMPLETELY USELESS FOR NFS*, because L4 hashing
can only split up separate TCP flows.  I checked with a Linux client
and a Solaris server, and all the NFSv3 mounts end up on a single TCP
flow, not one flow per mount.  iSCSI seems to do one flow per
session, while I'd bet multiple LUN's exported COMSTAR-style share a
single TCP flow.
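
If you want to repeat the check, counting established TCP connections
on the client is enough (the server address below is made up):

   # NFS: expect one line total, no matter how many mounts you have
   netstat -tn | grep 192.168.10.5:2049

   # iSCSI: expect one line per session to the target portal
   netstat -tn | grep 192.168.10.5:3260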

So, as elegant as network-layer multipath is, I think you'll need
SCSI-layer multipath to squeeze more performance out of an aggregated
link between two physical hosts.
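
On Solaris that would be MPxIO over two iSCSI sessions; roughly like
this (a sketch only: stmsboot(1M) will want a reboot, and the details
vary by release):

   # enable MPxIO (STMS) for supported controllers
   stmsboot -e

   # once both sessions to the target are up, the LUN should show
   # two operational paths
   mpathadm list lu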

And if you are using network-layer multipath (such as a
port-aggregated trunk) to carry iSCSI, it might work better to (a)
make sure the equal-cost-multipath hash you're using is L4, not L3 or
L2, and (b) use a single LUN per session, so you get multiple flows
per target.  Multiple flows might also help on very recent Solaris
builds (something later than snv_105) with 10Gbit network cards, even
without any network ECMP, because the TCP stack can supposedly spread
TCP flows across the CPU's:

  http://www.opensolaris.org/os/project/crossbow/topics/nic/

I'm not sure, though.  The data path is getting really advanced, and
there are so many optimisations conflicting with each other at this
point.  Maybe it's better to worry about this optimisation for http
clients, forget about it entirely for iSCSI and so on, and instead
try to scheme for a NIC that can do SRP or iSER/iWARP.

There's a downside to multiple flows, too.  When going from a faster
link onto a slower or shared link, multiple TCP flows will use more
switch buffer space than a single flow would, so if you have a 3560
or some other switch with small output queues, reading a wide RAID
stripe could in theory overwhelm the switch when all the targets
answer at once.  If this happens, you should be able to see dropped
packet counters incrementing in the switch.  FC and IB are both
lossless and do not have this problem.
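
On a 3560 something like this should show it (the interface name is
made up):

   Switch# show interfaces GigabitEthernet0/24 | include output drops
   Switch# show interfaces GigabitEthernet0/24 counters errors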

If you're not using any port-aggregated trunks and don't have
10Gbit/s, TCP's flow control might do a better job of avoiding this
``microbursting'' if you multiplex all the LUN's onto a single TCP
flow per initiator/target pair, COMSTAR-style (or, well, NFS-style).

(all pretty speculative, though. YMMV.)
