Hi Sid,

If you are using a CentOS 7.9 kernel newer than 3.10.0-1160.6.1.el7.x86_64, then 
check out LU-14341, as these kernel versions cause a timer-related regression:

https://jira.whamcloud.com/browse/LU-14341

We learnt this the hard way over the last couple of days and downgraded to 
kernel-3.10.0-1160.2.1.el7.x86_64 (the officially supported kernel version 
for Lustre 2.12.6). We use ZFS. YMMV.
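A quick way to check whether your node is in the affected range is to compare
the running kernel string against the known-good one. This is just a sketch
using sort -V for the comparison (assumed adequate for these RHEL release
strings; rpm's own version comparison is more thorough):

```shell
# Known-good kernel for Lustre 2.12.6, per the note above.
good="3.10.0-1160.2.1.el7.x86_64"
# Hard-coded here for illustration; on a real node use: current=$(uname -r)
current="3.10.0-1160.6.1.el7.x86_64"

# sort -V orders version strings; the last line is the newer of the two.
newest=$(printf '%s\n%s\n' "$good" "$current" | sort -V | tail -n1)
if [ "$newest" != "$good" ]; then
    echo "kernel newer than $good -- LU-14341 may apply"
fi
```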

--
Karsten Weiss


From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> On Behalf Of Sid 
Young via lustre-discuss
Sent: Tuesday, March 2, 2021 02:37
To: lustre-discuss <lustre-discuss@lists.lustre.org>
Subject: [lustre-discuss] OSS node crash/high CPU latency when deleting 100's 
of empty test files


G'Day all,

I've been doing some file create/delete testing on our new Lustre storage, 
which results in the OSS nodes crashing and rebooting due to high-latency issues.

I can reproduce it by running "dd" commands on the /lustre file system in a for 
loop and then running rm -f testfile-*.text at the end.
This results in console errors on our DL385 OSS nodes (running CentOS 7.9) 
which basically show a stack of mlx5_core and bnxt_en error messages (mlx5 
being the Mellanox driver for the 100G ConnectX-5 cards), followed by a stack of:
"NMI watchdog: BUG: soft lockup - CPU#N stuck for XXs"
where the CPU number varies across roughly four different CPUs and XX is 
typically 20-24 seconds... then the boxes reboot!

Before I log a support ticket with HPE, I'm going to try disabling the 100G 
cards to see whether the problem is reproducible via the 10G interfaces on the 
motherboards. But before I do that: does anyone use Mellanox ConnectX-5 cards 
(Ethernet only) on their Lustre storage nodes? If so, which driver are you 
using, and on which OS?

Thanks in advance!

Sid Young

_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
