Re: [e1000-devel] Intel E810 100Gb goes down sporadically

Brandeburg, Jesse Thu, 07 Dec 2023 11:57:28 -0800

Hi Assaf, and thanks Don for mentioning the Cisco link.

I had a further look at the stats and see this:
     mac_local_faults.nic: 0
     mac_remote_faults.nic: 1

on both the sender and receiver stats. Remote fault means the switch RX PCS 
failed to maintain locked state (far end of the cable away from our adapter). 
This might help your switch team or Cisco figure out what is going on.
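
As a rough illustration of the check described above, here is a minimal Python
sketch that reads 'ethtool -S' style text and reports which side the fault
counters point at. The counter names are the ones quoted in this message; the
function names and the classification rule are assumptions made for this
example, not something the driver provides.

```python
# Sketch only: classify link faults from `ethtool -S <iface>` text output.
# Counter names come from the stats quoted in this thread; everything else
# (function names, the "remote"/"local" rule) is a hypothetical simplification.

SAMPLE = """NIC statistics:
     mac_local_faults.nic: 0
     mac_remote_faults.nic: 1
"""


def parse_ethtool_stats(text):
    """Turn 'name: value' lines into a dict of integer counters."""
    stats = {}
    for line in text.splitlines():
        name, sep, value = line.strip().rpartition(":")
        if not sep:
            continue
        try:
            stats[name.strip()] = int(value.strip())
        except ValueError:
            # Lines without an integer value (e.g. the header) are skipped.
            continue
    return stats


def classify_fault(stats):
    """Return 'remote', 'local', 'both' or 'none' based on the MAC fault counters."""
    local = stats.get("mac_local_faults.nic", 0)
    remote = stats.get("mac_remote_faults.nic", 0)
    if local and remote:
        return "both"
    if remote:
        return "remote"
    if local:
        return "local"
    return "none"


print(classify_fault(parse_ethtool_stats(SAMPLE)))  # prints "remote"
```

A "remote" result only says where to look first (switch port, optics, cable);
it does not by itself prove the adapter is healthy.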

In this case I don’t think it’s the driver or the local end firmware, but I 
would strongly suggest that you update the firmware to a newer version on (some 
of) your cards, and you can get the updated firmware from Cisco.

So, I’d be asking, why is the switch cycling or dropping the link? Hope this 
helps!

Jesse

From: Buchholz, Donald <[email protected]>
Sent: Thursday, December 7, 2023 11:05 AM
To: Assaf Albo <[email protected]>
Cc: Brandeburg, Jesse <[email protected]>; 
[email protected]; Matan Levy <[email protected]>; Itamar Maron 
<[email protected]>
Subject: RE: [e1000-devel] Intel E810 100Gb goes down sporadically

Hi Assaf,

Thank you for the data.  I see from the data files you included that
you are working with a Cisco-branded E810-CQDA2 NIC.

As this is a Cisco supported NIC, have you consulted Cisco support
and configured your system with Cisco-approved firmware/vendor
versions?

I do not support the Cisco products, but I see immediately that the
NIC FW is revision 2.25.  The ice driver v1.9.11 was developed at
Intel for use with 4.xx firmware.
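
To make the version mismatch easy to spot across many servers, a small Python
sketch along these lines could parse 'ethtool -i' output and compare the
firmware major version with the one the driver was developed against. The
sample text contains only the values mentioned in this thread (real output has
more fields), the expected major version of 4 is taken from the "4.xx" remark
above, and the function names are hypothetical. It assumes the
firmware-version field begins with a major.minor number.

```python
# Sketch only: flag a firmware/driver mismatch from `ethtool -i <iface>` text.
# EXPECTED_FW_MAJOR reflects the "4.xx" remark in this thread; it is an
# assumption for this example, not a value read from the driver.

EXPECTED_FW_MAJOR = 4

# Abbreviated, illustrative sample using only values named in the thread.
SAMPLE = """driver: ice
version: 1.9.11
firmware-version: 2.25
"""


def parse_ethtool_info(text):
    """Turn 'key: value' lines into a dict of strings."""
    info = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            info[key.strip()] = value.strip()
    return info


def firmware_major(info):
    """Return the firmware major version as an int, or None if it cannot be read."""
    fields = info.get("firmware-version", "").split()
    if not fields:
        return None
    major = fields[0].split(".")[0]
    return int(major) if major.isdigit() else None


info = parse_ethtool_info(SAMPLE)
major = firmware_major(info)
if major is not None and major < EXPECTED_FW_MAJOR:
    print(f"firmware {info['firmware-version']} is older than "
          f"{EXPECTED_FW_MAJOR}.xx expected by ice {info['version']}")
```

For a vendor-branded card, the result is only a prompt to ask the vendor which
firmware they support with a given driver.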

Please contact Cisco.  If they cannot resolve the matter, they will
reach out to the appropriate Intel support team for this product.

Best regards,
- Don

From: Assaf Albo <[email protected]>
Sent: Wednesday, December 6, 2023 3:34 AM
To: Buchholz, Donald <[email protected]>
Cc: Brandeburg, Jesse <[email protected]>; 
[email protected]; Matan Levy <[email protected]>; Itamar Maron 
<[email protected]>
Subject: Re: [e1000-devel] Intel E810 100Gb goes down sporadically

Hey guys,

Firstly, I'd like to thank you all for helping us out.
Attached to this mail are two files with all the statistics (client machine + 
server machine).

"The passthrough device shouldn't be any problem but I do recommend that
if you're passing through the device to a VM, you try to match the
destination PCIe function number to the origination ID to prevent odd
issues.

like if your host device is:
01:00.1 then (I'm not sure you can do this) I'd hope the VM device is
00:06.1, and not 00:06.0"

That is exactly what we are doing: the function numbers match.
You can see in the attached files that one of the machines is working with eth0 
at 00:06.0 and the other with eth1 at 00:06.1.

"Also, do you see any stats or events on the switch side when link is lost?"
We use Cisco Nexus switches, and our network engineer said that he sees 
link-down events on the ports.

On Wed, Dec 6, 2023 at 6:42 AM Buchholz, Donald <[email protected]> wrote:
Hi Assaf,

In addition to the commands listed by Jesse,
please also provide "ethtool -i <eth#>" output.
This will assist us in identifying the NIC and
Firmware revision you are using.

- Don

> -----Original Message-----
> From: Jesse Brandeburg <[email protected]>
> Sent: Tuesday, December 5, 2023 10:47 AM
> To: Assaf Albo <[email protected]>; [email protected]; Matan
> Levy <[email protected]>
> Subject: Re: [e1000-devel] Intel E810 100Gb goes down sporadically
>
> On 12/3/2023 1:26 AM, Assaf Albo via E1000-devel wrote:
> > Hello guys,
> >
> > We are having constant network issues in production in that the link goes
> > down, waits *exactly* 7-8 seconds, and goes up again.
> > This can happen zero to a few times a day on all our servers; they are not
> > in the same location and are connected to different network devices.
> >
> > Each server runs as a KVM virtual machine with 60 CPUs (Pinning) and 224Gi
> > (Huge pages) - overall performance is excellent.
> > The NIC is PCI passed through to the KVM machine AS IS.
> > OS Rocky Linux 8.5, kernel 4.18.0-348.23.1.el8_5.x86_64 with Intel ice
> > 1.9.11 built and installed using rpm.
> > We have a traffic generator between two servers (our app: client+server)
> > that is reaching 94Gb and can replicate this issue.
> >
> > The dmesg output when the issue occurs:
> > Nov 28 16:01:27 SERVER kernel: ice 0000:00:06.0 eth0: NIC Link is Down
> > Nov 28 16:01:35 SERVER kernel: ice 0000:00:06.0 eth0: NIC Link is up 100
> > Gbps Full Duplex, Requested FEC: RS-FEC, Negotiated FEC: RS-FEC, Autoneg
> > Advertised: Off, Autoneg Negotiated: False, Flow Control: None
>
> Hi Assaf, sorry to hear you're having problems.
>
> w.r.t. the link down events we need to determine if it is a local down
> or remote.
>
> Please gather the 'ethtool -S eth0' statistics for a system that has had
> some problems, and send to the list as text.
>
> also, 'ethtool -m eth0'
>
> The passthrough device shouldn't be any problem but I do recommend that
> if you're passing through the device to a VM, you try to match the
> destination PCIe function number to the origination ID to prevent odd
> issues.
>
> like if your host device is:
> 01:00.1 then (I'm not sure you can do this) I'd hope the VM device is
> 00:06.1, and not 00:06.0
>
> So I guess with that statement I'd ask do you ever see the problem on
> systems with
> 3b:00.0 (ice PF PCIe in host)
> 00:06.0 (ice PF in VM)
>
> having the link down issues?
>
> Please include output from devlink dev info, and if you know it, what
> switch you're connected to.
>
> Also, do you see any stats or events on the switch side when link is lost?
>
> - Jesse
>
>
> _______________________________________________
> E1000-devel mailing list
> [email protected]
> https://lists.sourceforge.net/lists/listinfo/e1000-devel
> To learn more about Intel Ethernet, visit
> https://community.intel.com/t5/Ethernet-Products/bd-p/ethernet-products