Dear all,

we recently brought a few new machines online as nodes in a virtualization cluster; they provide 10GbE connectivity via Intel X710 adapters. We're running Linux 3.16, as found in Debian Jessie's archives as of today (3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08)).
We have two interfaces per machine. One is configured as a bridge to provide bridged networking for the guest machines (qemu/kvm) to hook into; the other provides a dedicated storage-networking link in a distinct VLAN, with an MTU of 9000. We did not set any special parameters on either of these interfaces (apart from the MTU on the non-bridged one), so all hardware offloading mechanisms that are enabled by default are still enabled - if this is wrong, and maybe even the cause of the troubles described below, please let me know.

Yesterday, we registered worrying errors being logged to the kernel ring buffer on one machine; they read like this:

[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 16386, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 546, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 7778, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 370, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 366, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 1842, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error

and are repeated multiple times per second.

I tried to read up on these NICs' hardware-assisted packet filtering support, and tried to disable the feature whose backing storage we are apparently overflowing, by issuing

# ethtool -K eth0 ntuple off

(eth0 is the interface acting as the base for the bridge for the guest machines.)
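As an aside, to gauge how often we were actually hitting this, I put together a small sketch that tallies the "FD filter programming error" lines per timestamp - assuming the messages keep the exact format shown above; on a live system you would pipe `dmesg` output into it:

```shell
#!/bin/sh
# Tally i40e "FD filter programming error" messages per kernel-log
# timestamp, to estimate the error rate. Reads log lines on stdin.
# Usage on a live system (assumption: messages match the format above):
#   dmesg | fd_error_rate
fd_error_rate() {
    grep 'FD filter programming error' |   # keep only the FD error lines
    sed 's/^\[\(.*\)\].*/\1/' |            # extract the bracketed timestamp
    sort | uniq -c                         # count occurrences per timestamp
}
```

For the excerpt above, this reports six errors for the 09:50:32 timestamp alone.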
Disabling ntuple had a catastrophic effect on the bridged networking: host-to-guest connectivity kept working, and so did inter-guest connectivity - but we lost all connectivity between the guest systems and other hosts on our LAN. Turning the feature back on via the inverse of the command above did not fix the loss of connectivity, and we experienced a service interruption because of that. Since I had made the `ethtool` change on all nodes in the cluster simultaneously, I was in a hurry to get things working again asap - which prompted me to fail over all guests to a single node and reboot the other nodes (restoring default NIC parameters, as well as connectivity between bridged guests and other hosts on the LAN).

I have several questions now, and would really appreciate your advice:

Is the ntuple-related error state and message a serious problem, and could it adversely affect our setup? I dug up a patch from 2014 (https://patchwork.ozlabs.org/patch/383396/) that seems to handle this error condition more gracefully, but I could not tell whether it is a functional fix or a purely cosmetic one. Since we don't have this exact equipment in our staging environment, I'm reluctant to have another go at disabling the ntuple feature on our production systems.

Did I do something obviously wrong? Would I perhaps have to disable that very same feature on the bridge that uses the 10GbE interface as its underlying networking device? (Unfortunately, I did not think to try this when the shit hit the fan, before implementing the plan outlined above.)

Would you generally recommend upgrading to the Jessie backports kernel? 3.16, as packaged by Debian, provides the i40e module in version "0.4.10-k", while the Linux 4.5-based kernel image in backports has version "1.4.8-k".
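(The running driver version can be read with `ethtool -i eth0`, the packaged one with `modinfo -F version i40e`. For comparing two such dotted version strings, a small sketch - the "-k" in-kernel suffix and the helper name are my own convention, not anything the driver provides:)

```shell
#!/bin/sh
# Compare two i40e driver version strings such as "0.4.10-k" and
# "1.4.8-k" (as reported by `ethtool -i` / `modinfo -F version`).
# Prints "older", "newer", or "same" for the first argument
# relative to the second. Relies on GNU sort's -V (version sort).
compare_driver_versions() {
    a=${1%-k}; b=${2%-k}   # strip the "-k" in-tree build suffix
    if [ "$a" = "$b" ]; then
        echo same
    elif [ "$(printf '%s\n%s\n' "$a" "$b" | sort -V | head -n1)" = "$a" ]; then
        echo older
    else
        echo newer
    fi
}
```

e.g. `compare_driver_versions 0.4.10-k 1.4.8-k` reports `older`.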
The version numbers suggest that the driver has matured quite a bit since 3.16 was released, but since the module wasn't in staging back then, I'm not sure we want to give up (future) support via Jessie-LTS for the kernel image we run.

Can you recommend an authoritative and comprehensive source of documentation for these (X710) NICs and their driver's configuration on a GNU/Linux system?

Thanks very much for reading this far - I very much look forward to your comments and any insight you want to share. Have a nice day!

-- 
Kind regards,
Johannes Truschnigg
Technik / Senior System Administrator

Geizhals (R) - Preisvergleich
Preisvergleich Internet Services AG
Obere Donaustraße 63/2
A-1020 Wien
Tel: +43 1 5811609/87
Fax: +43 1 5811609/55
http://www.geizhals.at | http://www.geizhals.de | http://www.geizhals.eu
Handelsgericht Wien | FN 197241K | Firmensitz Wien

_______________________________________________
E1000-devel mailing list
E1000-devel@lists.sourceforge.net
https://lists.sourceforge.net/lists/listinfo/e1000-devel
To learn more about Intel® Ethernet, visit http://communities.intel.com/community/wired