> After installation and configuration, I observed all kinds of bad behavior
> in the network traffic between the hosts and the server. All of this bad
> behavior is traced to the ixgbe driver on the storage server. Without going
> into the full troubleshooting process, here are my takeaways:
[...]

 For what it's worth, we managed to achieve much better line rates on
copper 10G ixgbe hardware of various descriptions between OmniOS and
CentOS 7 (I don't think we ever tested OmniOS to OmniOS). I don't believe
OmniOS could do TCP at full line rate, but I think we managed 700+
Mbytes/sec on both transmit and receive, and we got basically
disk-limited speeds with iSCSI (across multiple disks on multi-disk
mirrored pools, OmniOS iSCSI initiator, Linux iSCSI targets).

 I don't believe we did any specific kernel tuning (and in fact some of
our attempts to fiddle with ixgbe driver parameters blew up in our
faces). We did tune iSCSI connection parameters to increase various
buffer sizes so that ZFS could do even large single operations in single
iSCSI transactions. (More details available if people are interested.)

> 10: At the wire level, the speed problems are clearly due to pauses in
> response time by omnios. At 9000 byte frame sizes, I see a good number
> of duplicate ACKs and fast retransmits during read operations (when
> omnios is transmitting). But below about a 4100-byte MTU on omnios
> (which seems to correlate to 4096-byte iSCSI block transfers), the
> transmission errors fade away and we only see the transmission pause
> problem.

 This is what really attracted my attention. In our OmniOS setup, our
specific Intel hardware had ixgbe driver issues that could cause
activity stalls during once-a-second link heartbeat checks. This
obviously had an effect at the TCP and iSCSI layers. My initial message
to illumos-developer sparked a potentially interesting discussion:

	http://www.listbox.com/member/archive/182179/2014/10/sort/time_rev/page/16/entry/6:405/20141003125035:6357079A-4B1D-11E4-A39C-D534381BA44D/

If you think this is a possibility in your setup, I've put the DTrace
script I used to hunt for this up on the web:

	http://www.cs.toronto.edu/~cks/src/omnios-ixgbe/ixgbe_delay.d

This isn't the only potential source of driver stalls by any means; it's
just the one I found.
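
 For anyone who wants to check a packet capture for this sort of
periodic stall before reaching for DTrace, here is a minimal Python
sketch. It is not the DTrace script linked above. It assumes you have
already exported per-packet timestamps (in seconds) from a capture into
a list; the function names and the 50 ms threshold are arbitrary choices
for illustration, and the trace here is synthetic.

```python
def find_stalls(timestamps, min_gap=0.05):
    """Return (start_time, gap_length) for each inter-packet gap >= min_gap seconds."""
    stalls = []
    for prev, cur in zip(timestamps, timestamps[1:]):
        gap = cur - prev
        if gap >= min_gap:
            stalls.append((prev, gap))
    return stalls


def stall_spacing(stalls):
    """Return the time between the starts of consecutive stalls."""
    starts = [start for start, _ in stalls]
    return [b - a for a, b in zip(starts, starts[1:])]


# Synthetic trace (made-up data): a packet every 1 ms for the first
# 900 ms of each second, then silence until the next second begins.
trace = []
for second in range(3):
    trace.extend(second + i / 1000 for i in range(900))

stalls = find_stalls(trace)
spacing = stall_spacing(stalls)

for start, gap in stalls:
    print(f"stall of {gap * 1000:.0f} ms starting at t={start:.3f}s")
print("spacing between stalls (s):", [round(s, 3) for s in spacing])
```

If the spacing between stalls on a real capture clusters around one
second, that is at least consistent with a once-a-second driver event,
although it does not by itself say which one.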

 You may also want to look at lockstat in general, as the information it
reported is what led us to look specifically at the ixgbe code here.
(If you suspect kernel/driver issues, lockstat combined with kernel
source is a really excellent resource.)

	- cks