Dear all,

We recently brought a few new machines online as nodes in a
virtualization cluster; they provide 10GbE connectivity via Intel X710
adapters. We're on Linux 3.16 as found in Debian Jessie's archives as
of today (3.16.0-4-amd64 #1 SMP Debian 3.16.7-ckt25-2 (2016-04-08)).

We have two interfaces per machine. One of them serves as the base of
a bridge that provides bridged networking for the Guest machines
(qemu/kvm) to hook into; the other provides a dedicated
storage-networking link in a distinct VLAN, with an MTU of 9000. We
did not set any special parameters on either of these interfaces
(apart from the MTU on the non-bridged one), so all hardware
offloading mechanisms that are enabled by default are still enabled -
if this is wrong, and maybe even the cause of the troubles described
below, please let me know.
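
For reference, our setup looks roughly like the following
/etc/network/interfaces stanzas (eth0, eth1, br0 and the addresses are
simplified placeholders; the storage VLAN's details are omitted):

auto br0
iface br0 inet static
        bridge_ports eth0
        address 192.0.2.10
        netmask 255.255.255.0

auto eth1
iface eth1 inet static
        address 198.51.100.10
        netmask 255.255.255.0
        mtu 9000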

Yesterday, we registered worrying errors being logged to the kernel's
debug ringbuffer on one machine; they read like this:

[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 16386, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 546, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 7778, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 370, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 366, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: ntuple filter loc = 1842, could not be added
[Mon Jun 13 09:50:32 2016] i40e 0000:05:00.1: FD filter programming error

These messages repeat multiple times per second. I tried to read up on
these NICs' hardware-assisted packet filtering support, and tried to
shut down the feature whose backing storage we're apparently
overflowing by issuing

# ethtool -K eth0 ntuple off

(eth0 is the interface acting as the base of the bridge for the Guest
machines.)
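
For what it's worth, this is how the flag's state can be checked
before and after toggling (it appears as "ntuple-filters" in ethtool's
feature list):

# ethtool -k eth0 | grep ntuple
ntuple-filters: on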

This had a catastrophic effect on the bridged networking: Host-to-Guest
connectivity kept working, and so did Inter-Guest connectivity - but we
lost all connectivity between the Guest systems and other hosts on our
LAN. Turning the feature back on via the inverse of the command above
did not restore connectivity, and we suffered a service interruption
because of that.
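
In hindsight, I should probably have captured the bridge's forwarding
table and the interface counters while the problem was visible,
roughly like this (br0 again stands in for our bridge's name):

# brctl showmacs br0             # are the Guests' MACs still learned?
# ip -s link show eth0           # any rising error/drop counters?
# ethtool -S eth0 | grep -i drop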

Since I had made the `ethtool` change on all nodes in the cluster
simultaneously, I was in a hurry to get things working again ASAP -
which prompted me to fail over all Guests to a single node and reboot
the other nodes (restoring both the default NIC parameters and the
connectivity of bridged Guests with other hosts on the LAN).

I have several questions now, and would really appreciate your advice:

Is the ntuple-related error state and its message a serious problem
that could adversely affect our setup? I dug up a patch from 2014
(https://patchwork.ozlabs.org/patch/383396/) that seems to handle this
error condition more gracefully, but I could not tell whether it is a
functional fix or merely a cosmetic one.
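
For now, I am keeping an eye on the adapter's Flow Director statistics
to gauge how often we hit this; I am assuming the counters with "fdir"
in their names are the relevant ones:

# ethtool -S eth0 | grep -i fdir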

Since we don't have this exact equipment in our staging environment,
I'm reluctant to have another go at disabling the ntuple feature on our
production systems. Did I do something obviously wrong? Would I perhaps
also have to disable that same feature on the bridge that uses the
10GbE interface as its underlying networking device? (Unfortunately, I
did not think to try this when the shit hit the fan, before
implementing the plan outlined above.)
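
If disabling it on the bridge is sensible, I assume it would amount to
something like the following (br0 is a placeholder; I don't even know
whether the flag is toggleable on a bridge device):

# ethtool -k br0 | grep ntuple
# ethtool -K br0 ntuple off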

Would you generally recommend upgrading to the Jessie backports kernel?
3.16, as packaged by Debian, provides the i40e module in version
"0.4.10-k", while the Linux 4.5-based kernel image in backports ships
version "1.4.8-k". The version numbers suggest that the driver has
matured quite a bit since 3.16 was released, but since the module
wasn't in staging back then, I'm not sure whether we want to give up
(future) support via Jessie-LTS for the kernel image we run.
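
(For reference, the driver version on a running interface can be
queried like this:)

# ethtool -i eth0 | grep '^version'
version: 0.4.10-k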

Can you recommend an authoritative and comprehensive source of
documentation for these (X710) NICs and their (driver's) configuration
on a GNU/Linux system?

Thanks very much for reading this far - I look forward to your
comments and any insight you care to share.

Have a nice day!
-- 
Kind regards

Johannes Truschnigg
Engineering / Senior System Administrator

Geizhals (R) - Preisvergleich
Preisvergleich Internet Services AG
Obere Donaustraße 63/2
A-1020 Wien
Tel: +43 1 5811609/87
Fax: +43 1 5811609/55

http://www.geizhals.at | http://www.geizhals.de | http://www.geizhals.eu
http://www.facebook.com/geizhals              => Geizhals on Facebook!
http://twitter.com/geizhals                   => Geizhals on Twitter!
http://blog.geizhals.at                       => The Geizhals blog!
http://unternehmen.geizhals.at/about/de/apps/ => The Geizhals mobile app

Commercial Court of Vienna | FN 197241K | Registered office: Vienna
