> I think we found your smoking gun.  You're getting ping loss on a local 
> network, and you're using 4x 10Gb LACP bonded network.  And for some reason 
> you say "should be pretty solid."  What you've described is basically the 
> definition of unstable, if you ask me.

No, we're not getting any ping loss; that's the thing.  The network looks 
entirely faultless.  We've run pings for 24 hours with no loss.


> Before anything else, know this:  In LACP, only one network interface can be 
> used per data stream.  So if you have a server with LACP, then each client 
> can go up to 10Gb, but if you have 4 clients simultaneously, they can each go 
> up to 10Gb.  You cannot push 40Gb to a single client.

Each storage server has 5 clients.
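
A rough sketch of what that means for link usage, assuming a simple
hash-per-flow policy.  The CRC32 hash and the addresses below are made up
for illustration; the real hash policy depends on the switch and the bond
configuration.

```python
import zlib

LINKS = 4  # 4x 10Gb members in the bond


def pick_link(src_ip: str, dst_ip: str, links: int = LINKS) -> int:
    """Map a flow to one bond member; every packet of that flow uses it."""
    key = f"{src_ip}->{dst_ip}".encode()
    return zlib.crc32(key) % links


# Hypothetical addresses: one storage server and its five clients.
server = "10.0.0.1"
clients = [f"10.0.0.{n}" for n in range(11, 16)]

assignment = {client: pick_link(client, server) for client in clients}
for client, link in assignment.items():
    print(f"{client} -> link {link}")

# Five flows over four links: at least two clients must share a link.
busiest = max(list(assignment.values()).count(link) for link in range(LINKS))
print("most flows on one link:", busiest)
```

Whatever the hash, with 5 clients on 4 links at least two clients end up
sharing one 10Gb member.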

> Also, your hard disks are all 1Gbit.  So every 10 disks you have in the 
> server add up to a single 10Gb network interface.  It is absolutely pointless 
> to use LACP in this situation unless you have a huge honking server.  
> (Meaning >40 disks).

They've got 38 disks.
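
Running the numbers with the 1Gbit-per-disk rule of thumb from the quoted
message (an assumption taken from that message, not something we've
measured):

```python
import math

DISKS = 38
GBIT_PER_DISK = 1.0    # rule of thumb from the quoted message, not measured
LINKS = 4
GBIT_PER_LINK = 10.0

disk_total = DISKS * GBIT_PER_DISK    # aggregate disk bandwidth, Gbit/s
bond_total = LINKS * GBIT_PER_LINK    # aggregate bond bandwidth, Gbit/s
links_needed = math.ceil(disk_total / GBIT_PER_LINK)

print(f"disks: {disk_total} Gbit/s, bond: {bond_total} Gbit/s")
print(f"10Gb links needed to carry the disks: {links_needed}")
```

By that rule the disks come to 38 Gbit/s against a 40 Gbit/s bond, so the
four links are not wildly oversized.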

> In my experience, LACP is usually unstable, unless you buy a really expensive 
> switch

The switches are pretty expensive: we've got Arista switches and SolarFlare 
NICs in the servers (well, the bond is across a SolarFlare NIC and an Intel 
NIC).

> and QA test the hell out of your configuration before using it.  I hear lots 
> of people say their LACP is stable and reliable where they are - but it's 
> only because they have never tested it and haven't noticed the problems.  The 
> problems are specifically as you've described.  Occasional packet loss, which 
> people tend to think is ok, but in reality, the only acceptable level of 
> packet loss is 0%.

Yep, 0% packet loss.  Sorry if I've misworded something somewhere, but 
there are definitely no dropped packets.

> 
> Figure out how to observe & clear the error counters on all the network 
> interfaces.  Login to the switch to measure them there ...  Login to the 
> server to measure them there ...  Login to each client to measure them there. 
>  Reset them all to 0.  And then start hammering the shit out of the whole 
> system.  Get all the clients to drive the network hard, both transmit and 
> receive.  If you see error counters increasing, you have a problem.


I'll double-check, but I'm pretty sure that we've reset the counters and 
witnessed no CRC errors over test periods, even when hammering the system.
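
For the double check, the comparison itself is simple enough to script.  A
minimal sketch, assuming the counters have already been collected into
dictionaries before and after a load test; the counter names are
hypothetical, and how you read them differs per switch, server and client.

```python
from typing import Dict

Counters = Dict[str, int]


def increased(before: Counters, after: Counters) -> Counters:
    """Return the counters that grew between two snapshots, with the delta."""
    return {
        name: after[name] - before.get(name, 0)
        for name in after
        if after[name] > before.get(name, 0)
    }


# Hypothetical snapshots taken before and after hammering the system.
before = {"ierrors": 0, "oerrors": 0, "crc_errors": 0}
after = {"ierrors": 0, "oerrors": 0, "crc_errors": 0}

problems = increased(before, after)
print("clean run" if not problems else f"counters increased: {problems}")
```

Any non-empty result on any interface would mean there is a problem to
chase.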

James.

_______________________________________________
OpenIndiana-discuss mailing list
OpenIndiana-discuss@openindiana.org
http://openindiana.org/mailman/listinfo/openindiana-discuss
