Hi, I'd love some ideas to debug what has become a frequent annoyance for us.  
At a high level, we're observing fairly frequent OSS hangs, with absolutely 
no console or logging activity.  Our BMC watchdogs then reboot the OSS and ~6 
minutes later everything is back in line.  This has been an infrequent 
occurrence on this system for a couple of years, but has become much more frequent 
in recent months.

I'd love any suggestions for either Lustre/LNet or overall kernel tricks to up 
the logging level, if possible, to see if we can get some more useful output. 
Right now we're blind.

More details below, and also what I'd characterize as uninformed speculation:

-) overall system is (2x) MDS, (12x) OSS, (2x) monitoring nodes, all with 
identical servers, network cards, etc. 

-) only difference is JBOD types: the OSSes are connected to Supermicro 90-bay 
SC946ED-R2KJBOD. All other server hardware is identical. 

-) only the OSSes hang in this manner. Looking back, some seem more prone 
than others, but it's not obviously only a few.

-) CentOS 7.6, Lustre 2.10.8, ZFS 0.7.9

-) 2 active file systems, one is pure ZFS and the other ZFS/OSS with ldiskfs mdt

-) Mellanox ConnectX3 FDR IB & 40GbE

-) LSI 9300-8e HBA

-) Lustre servers are triple-homed, they live on (2x) IB and (1x) 40GbE networks

-) previously, when we first moved to 2.10, we were bitten hard and frequently by 
LU-10163 (which may or may not be relevant)

-) The hangs don't correlate with any discrete event, as best I can tell.  
Importantly, we get no LBUGs or anything, which is different from the previous 
signature.

-) We have definitely stepped up the traffic on the ethernet network this year. 
 Whereas the primary I/O was previously just on the two IB networks, we are now 
taxing the ethernet as well with some regularity.

Any thoughts are most welcome, and thanks!

-Ben




_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
