Dear all,

Since last week we have been seeing 'hanging kernel threads' that cause our Lustre environment (Rocky 8.7 / Lustre 2.15.2) to hang.
The errors look like this:

Dec 18 10:36:04 hb-oss01 kernel: LustreError: 137-5: scratch-OST0084_UUID: not available for connect from 172.23.15.246@tcp30 (no target). If you are running an HA pair check that the target is mounted on the other server.
Dec 18 10:36:04 hb-oss01 kernel: LustreError: Skipped 330 previous similar messages
Dec 18 10:36:04 hb-oss01 kernel: ptlrpc_watchdog_fire: 1 callbacks suppressed
Dec 18 10:36:04 hb-oss01 kernel: Lustre: ll_ost00_036: service thread pid 85609 was inactive for 1062.652 seconds. The thread might be hung, or it might only be slow and will resume later. Dumping the stack trace for debugging purposes:

At that moment there were 231 jobs running and the I/O load was not particularly high. Normally we run many more jobs with much more I/O.

The environment is:

2 MDS
4 OSS
160 OSTs
250 clients
network is tcp

According to the internet, this could be caused by 'bad I/O'. Are there any useful things to check in order to isolate where this bad I/O is coming from? How do others pinpoint these issues?

Any feedback is very welcome,

--
Kind regards,

Ger Strikwerda
senior expert multidisciplinary enabler simple solution architect

Rijksuniversiteit Groningen
CIT/RDMS/HPC

Smitsborg
Nettelbosje 1
9747 AJ Groningen
Tel. 050 363 9276

"God is hard, God is fair
some men he gave brains, others he gave hair"
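
One possible way to start narrowing this down is to tally the watchdog messages per service thread, to see whether the stalls cluster on particular threads or servers. The following is a minimal sketch, assuming syslog lines in the same format as the excerpt in this message; the function name `summarize_watchdog` and the regular expression are illustrative and not part of Lustre.

```python
import re

# Matches the Lustre watchdog message format quoted in this post, e.g.
# "... hb-oss01 kernel: Lustre: ll_ost00_036: service thread pid 85609
#  was inactive for 1062.652 seconds."
# Assumption: other Lustre versions may word this message differently.
WATCHDOG_RE = re.compile(
    r"(?P<host>\S+) kernel: Lustre: (?P<thread>[^\s:]+): "
    r"service thread pid (?P<pid>\d+) "
    r"was inactive for (?P<secs>\d+(?:\.\d+)?) seconds"
)


def summarize_watchdog(lines):
    """Return {(host, thread, pid): longest reported inactive time in seconds}."""
    worst = {}
    for line in lines:
        m = WATCHDOG_RE.search(line)
        if m is None:
            continue  # not a watchdog message (e.g. LustreError lines)
        key = (m.group("host"), m.group("thread"), int(m.group("pid")))
        secs = float(m.group("secs"))
        # Keep only the longest stall seen for each thread.
        worst[key] = max(secs, worst.get(key, 0.0))
    return worst
```

The function accepts any iterable of log lines, so it could be fed an open log file from each OSS (the log location depends on the local syslog configuration). Sorting the result by inactive time would show which service threads stall longest.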
_______________________________________________
lustre-discuss mailing list
lustre-discuss@lists.lustre.org
http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org