Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511
This bug has likely hit twice in the last 24 hours. The complaint from the site is that Lustre isn't cleaning up and evicting nids, which is preventing new jobs from starting. This state has persisted for more than 2 hours in the current "hit" of the problem.

As for the issue itself: the trouble starts when a job queues up lots of FLK (flock) locks on a common resource and then dies. We come in with llrd and try to evict the nids, and somehow get into a loop that keeps sending completion ASTs to the (now dead) liblustre clients to clean up the FLK locks. These ASTs are, of course, timing out, at a rate of one every 2 seconds. The current instance appears to be caused by a 6000-node job; left alone, that works out to 6000 * 2 sec == 12000 sec, or 200 minutes (just under 3.5 hours), for all of the lock timeouts to complete.

The messages from ldlm_server_completion_ast() also indicate that these locks have been waiting for an extremely long time:

  ldlm_server_completion_ast()) ### enqueue wait took 9401808080us from 1168351505 ns

In the Lustre logs that will be attached, the Python pid 4501 is llrd. I have a few -1 debug logs that show this processing loop quite nicely.

Is this just another example of bug 11330, where the processing code should take the obd_timeout into account when processing and cleaning up locks?
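
To make the arithmetic concrete, here is a rough sketch (plain C, not Lustre source) of the behaviour I think we are seeing, next to a bug-11330-style alternative. All of the names here (flk_lock, send_completion_ast(), client_dead(), cancel_lock_locally()) and the 100-second obd_timeout value are made up for illustration; only the 2-second per-AST timeout and the 6000-lock count come from the observations above.

    /* Illustrative sketch only -- not Lustre code. */
    #include <stdbool.h>
    #include <stdio.h>

    #define AST_TIMEOUT_SEC 2      /* observed: ~1 AST timeout every 2 seconds */
    #define OBD_TIMEOUT_SEC 100    /* assumed obd_timeout, for illustration */

    struct flk_lock {
        int client_nid;
        struct flk_lock *next;
    };

    /* Hypothetical helpers; in the real code these would be RPCs / export checks. */
    static bool client_dead(int nid)          { (void)nid; return true;  }
    static bool send_completion_ast(int nid)  { (void)nid; return false; /* times out */ }
    static void cancel_lock_locally(struct flk_lock *lk) { (void)lk; }

    /* Suspected current behaviour: every lock held by a dead client costs
     * a full AST timeout before we move on to the next one. */
    static long cleanup_serial(struct flk_lock *head)
    {
        long elapsed = 0;
        for (struct flk_lock *lk = head; lk; lk = lk->next) {
            if (!send_completion_ast(lk->client_nid))
                elapsed += AST_TIMEOUT_SEC;   /* AST to a dead client: wait it out */
            cancel_lock_locally(lk);
        }
        return elapsed;                       /* 6000 locks -> ~12000 s (200 min) */
    }

    /* Bug-11330-style idea: once the client is known dead, or the accumulated
     * wait has exceeded obd_timeout, stop sending ASTs and just cancel the
     * remaining locks locally. */
    static long cleanup_bounded(struct flk_lock *head)
    {
        long elapsed = 0;
        for (struct flk_lock *lk = head; lk; lk = lk->next) {
            if (elapsed < OBD_TIMEOUT_SEC && !client_dead(lk->client_nid)) {
                if (!send_completion_ast(lk->client_nid))
                    elapsed += AST_TIMEOUT_SEC;
            }
            cancel_lock_locally(lk);          /* no further per-lock waits */
        }
        return elapsed;
    }

    int main(void)
    {
        /* One FLK lock per node of the 6000-node dead job. */
        static struct flk_lock locks[6000];
        for (int i = 0; i < 6000; i++) {
            locks[i].client_nid = i;
            locks[i].next = (i + 1 < 6000) ? &locks[i + 1] : NULL;
        }
        printf("serial cleanup:  ~%ld s\n", cleanup_serial(locks));   /* ~12000 s */
        printf("bounded cleanup: ~%ld s\n", cleanup_bounded(locks));  /* ~0 s */
        return 0;
    }

With 6000 dead clients the serial loop spends roughly 12000 seconds doing nothing but waiting for AST timeouts, whereas the bounded variant stops sending RPCs once the client is known to be gone (or the total wait exceeds obd_timeout) and simply cancels the remaining locks locally.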
