Hello Ger,

Can you share the full stack trace from the log output for the hung thread? 
That will be helpful for diagnosing the issue. Some other clues: do you get any 
stack traces or error output on clients where you observe the hang? Does every 
client hang, or only some? Does it hang on any access to the FS at all, or only 
on certain files? 

When looking for such error output, it's also worth checking the logs during 
periods when no errors are occurring, since Lustre writes a lot of messages 
that are "normal". Once you recognize these, you can filter them out as noise 
when the actual problems are happening.
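
For example, a quick way to pull just the Lustre kernel messages from a 
server's log (the time window and grep pattern here are only an illustration):

    # recent Lustre messages from the kernel log
    journalctl -k --since "1 hour ago" | grep -E 'Lustre(Error)?:'

    # or, via dmesg with human-readable timestamps
    dmesg -T | grep -E 'Lustre(Error)?:'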

To diagnose whether bad I/O from some particular application is causing the 
problem, jobstats is very helpful (a minimal example of enabling and reading it 
is sketched after the links below). Here are some pages with information on 
Lustre jobstats:

https://wiki.lustre.org/Lustre_Monitoring_and_Statistics_Guide
https://doc.lustre.org/lustre_manual.xhtml#jobstats
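
As a rough sketch, assuming you want to tag I/O with SLURM job IDs (use 
procname_uid instead if you don't run SLURM), enabling and reading jobstats 
looks something like:

    # on the MGS: tag client RPCs with the job scheduler's job ID
    lctl set_param -P jobid_var=SLURM_JOB_ID

    # on each OSS: per-job read/write counters for every OST
    lctl get_param obdfilter.*.job_stats

    # on each MDS: per-job metadata operation counters
    lctl get_param mdt.*.job_stats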

Using jobstats, you can often correlate the errors with the job(s) doing the 
most I/O on the filesystem at the time. It's useful to have a script 
periodically send jobstats output to a monitoring/logging service, so that you 
can also compare historical data against past errors (a rough sketch of such a 
collector follows). We've been able to identify many "problem apps" with bad 
I/O patterns this way. Of course, if the problem doesn't come from a client 
application but from something else like a hardware failure, jobstats won't 
help identify that.
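
A minimal version of that collector might look like the following; 
push_to_monitoring is just a placeholder for however you ship data to your 
monitoring service:

    #!/bin/bash
    # every 5 minutes, dump per-job OST counters and ship them off-host
    while true; do
        lctl get_param -n obdfilter.*.job_stats | push_to_monitoring
        sleep 300
    done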

- Thomas Bertschinger

________________________________________
From: lustre-discuss <lustre-discuss-boun...@lists.lustre.org> on behalf of 
vaibhav pol via lustre-discuss <lustre-discuss@lists.lustre.org>
Sent: Monday, December 18, 2023 3:36 AM
To: Strikwerda, Ger
Cc: Lustre discussion
Subject: [EXTERNAL] Re: [lustre-discuss] hanging threads

iotop can be used to debug I/O performance, and lfs health_check / lctl 
get_param can be used to check Lustre health status.
"scratch-OST0084_UUID: not available for connect from 172.23.15.246@tcp30 (no 
target)" can also indicate a network issue, so check the network as well.
Also verify the health of the storage devices on the OSS reporting the hung 
ll_ost00_036 thread; smartctl can be used for that.
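
For reference, a couple of the checks mentioned above (the device path below 
is just an example):

    # overall health on a Lustre server; prints "healthy" when all is well
    lctl get_param health_check

    # client-side check of which servers are reachable
    lfs check servers

    # SMART health summary for a suspect backing device
    smartctl -H /dev/sdX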



On Mon, 18 Dec 2023 at 15:28, Strikwerda, Ger via lustre-discuss 
<lustre-discuss@lists.lustre.org> wrote:

Dear all,

Since last week we have been seeing 'hanging kernel threads' that cause our 
Lustre environment (Rocky 8.7 / Lustre 2.15.2) to hang.

errors:

Dec 18 10:36:04 hb-oss01 kernel: LustreError: 137-5: scratch-OST0084_UUID: not 
available for connect from 172.23.15.246@tcp30 (no target). If you are running 
an HA pair check that the target is mounted on the other server.
Dec 18 10:36:04 hb-oss01 kernel: LustreError: Skipped 330 previous similar 
messages
Dec 18 10:36:04 hb-oss01 kernel: ptlrpc_watchdog_fire: 1 callbacks suppressed
Dec 18 10:36:04 hb-oss01 kernel: Lustre: ll_ost00_036: service thread pid 85609 
was inactive for 1062.652 seconds. The thread might be hung, or it might only 
be slow and will resume later. Dumping the stack trace for debugging purposes:

At that moment 231 jobs were running, with not particularly high I/O. Normally 
we run many more jobs, and much more I/O.

Our environment:

2 MDS
4 OSS
160 OSTs
250 clients

network is tcp

According to the internet, this could be caused by 'bad I/O'. Are there any 
useful things to check in order to isolate where this bad I/O is coming from? 
How do others pinpoint these issues?

Any feedback is very welcome,

--

Vriendelijke groet,

Ger Strikwerda
senior expert multidisciplinary enabler
simple solution architect
Rijksuniversiteit Groningen
CIT/RDMS/HPC

Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276

"God is hard, God is fair
 some men he gave brains, others he gave hair"
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org
