Hi,

Intel 82576 is.... bad. I've seen quite a few problems with these older
igb-family NICs, but losing the PCIe link is a new one.
I usually see them getting stuck with a message like "tx queue X hung,
resetting device..."

Try disabling the offloading features with ethtool; that sometimes helps with
the problems I've seen. Maybe yours is just a variant of the stuck-queue
problem?
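Something like this is what I have in mind -- a sketch only; the interface
names are taken from your log, so adjust them to your actual bond members:

```shell
# Disable common hardware offloads on the igb ports (run as root).
# Interface names below are from the log in this thread; substitute your own.
for iface in enp3s0f0 enp3s0f1 enp4s0f0 enp4s0f1; do
    ethtool -K "$iface" tso off gso off gro off rx off tx off sg off
done

# Verify the resulting offload settings on one port:
ethtool -k enp3s0f0
```

Note these settings don't persist across reboots, so you'd want to hook them
into your network configuration (e.g. a pre-up script on the bond members).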


Paul

-- 
Paul Emmerich

Looking for help with your Ceph cluster? Contact us at https://croit.io

croit GmbH
Freseniusstr. 31h
81247 München
www.croit.io
Tel: +49 89 1896585 90


On Thu, Jul 18, 2019 at 12:47 PM Geoffrey Rhodes <geoff...@rhodes.org.za>
wrote:

> Hi Cephers,
>
> I've been having an issue since upgrading my cluster to Mimic 6 months ago
> (previously installed with Luminous 12.2.1).
> All nodes that have the same PCIe network card seem to lose network
> connectivity randomly. (frequency ranges from a few days to weeks per host
> node)
> The affected nodes only have the Intel 82576 LAN Card in common, different
> motherboards, installed drives, RAM and even PSUs.
> Nodes that have the Intel I350 cards are not affected by the Mimic upgrade.
> Each host node has recommended RAM installed and has between 4 and 6 OSDs
> / sata hard drives installed.
> The cluster operated for over a year (Luminous) without a single issue,
> only after the Mimic upgrade did the issues begin with these nodes.
> The cluster is only used for CephFS (file storage, low-intensity usage)
> and makes use of an erasure-coded data pool (k=4, m=2).
>
> I've tested many things: different kernel versions, different Ubuntu LTS
> releases, re-installation, even CentOS 7, different releases of Mimic, and
> different igb drivers.
> If I stop the ceph-osd daemons the issue does not occur.  If I swap out
> the Intel 82576 card with the Intel I350 the issue is resolved.
> I don't have any more ideas other than replacing the cards, but I feel the
> issue is linked to the ceph-osd daemon and a change in the Mimic release.
> Below are the various software versions and drivers I've tried and a log
> extract from a node that lost network connectivity. - Any help or
> suggestions would be greatly appreciated.
>
> *OS:*                          Ubuntu 16.04 / 18.04 and recently CentOS 7
> *Ceph Version:*        Mimic (currently 13.2.6)
> *Network card:*        4-PORT 1GB INTEL 82576 LAN CARD (AOC-SG-I4)
> *Driver:*                  igb
> *Driver Versions:*     5.3.0-k / 5.3.5.22s / 5.4.0-k
> *Network Config:*     2 x bonded (LACP) 1GbE NICs for public net,   2 x
> bonded (LACP) 1GbE NICs for private net
> *Log errors:*
> Jun 27 12:10:28 cephnode5 kernel: [497346.638608] igb 0000:03:00.0
> enp3s0f0: PCIe link lost, device now detached
> Jun 27 12:10:28 cephnode5 kernel: [497346.686752] igb 0000:04:00.1
> enp4s0f1: PCIe link lost, device now detached
> Jun 27 12:10:29 cephnode5 kernel: [497347.550473] igb 0000:03:00.1
> enp3s0f1: PCIe link lost, device now detached
> Jun 27 12:10:29 cephnode5 kernel: [497347.646785] igb 0000:04:00.0
> enp4s0f0: PCIe link lost, device now detached
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
> 10.100.4.1:6809 osd.16 since back 2019-06
> -27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
> 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
> 10.100.6.1:6804 osd.20 since back 2019-06
> -27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
> 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
> 10.100.7.1:6803 osd.25 since back 2019-06
> -27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
> 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
> 10.100.8.1:6803 osd.30 since back 2019-06
> -27 12:10:27.438961 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
> 12:10:23.796726)
> Jun 27 12:10:43 cephnode5 ceph-osd[2575]: 2019-06-27 12:10:43.793
> 7f73ca637700 -1 osd.15 28497 heartbeat_check: no reply from
> 10.100.9.1:6808 osd.43 since back 2019-06
> -27 12:10:23.338012 front 2019-06-27 12:10:23.338012 (cutoff 2019-06-27
> 12:10:23.796726)
>
>
> Kind regards
> Geoffrey Rhodes
> _______________________________________________
> ceph-users mailing list
> ceph-users@lists.ceph.com
> http://lists.ceph.com/listinfo.cgi/ceph-users-ceph.com
>