Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11511



This bug has likely hit twice in the last 24 hours. The complaint from the site
is that Lustre isn't cleaning up and evicting NIDs, which is preventing new jobs
from starting. In the current occurrence, this state has persisted for more than
2 hours.


As for the issue itself:

We seem to have an issue evicting a job that queues up lots of FLK (flock)
locks on a common resource and then dies. We come in with llrd, try to evict
the NIDs, and somehow get into a loop that keeps sending completion ASTs to the
(now dead) liblustre clients to clean up the FLK locks. These ASTs are
obviously timing out, at a rate of one every 2 seconds. The current instance
appears to be caused by a 6000-node job, so if left alone the cleanup works out
to 6000 * 2 sec == 12000 sec == 200 min, or just under 3.5 hours, before the
lock timeouts complete.
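
For reference, a back-of-envelope check of that estimate, using only the
figures from the paragraph above (6000 dead clients, one AST timeout every
~2 seconds, handled serially):

    # Rough estimate of how long serialized completion-AST timeouts take to drain.
    num_locks = 6000          # one FLK lock per dead liblustre client (per the report)
    ast_timeout_s = 2         # observed rate: one AST timing out every ~2 seconds

    total_s = num_locks * ast_timeout_s
    print(f"{total_s} s == {total_s / 60:.0f} min == {total_s / 3600:.2f} h")
    # -> 12000 s == 200 min == 3.33 h, i.e. just under 3.5 hours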

The messages from ldlm_server_completion_ast() also indicate that these locks
have been waiting for an extremely long time:
ldlm_server_completion_ast()) ### enqueue wait took 9401808080us from 1168351505 ns
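
Just converting the units in that message for scale (the 100 s obd_timeout
used for comparison is an assumed default, not a value taken from this
system):

    enqueue_wait_us = 9401808080          # value from the message above
    obd_timeout_s = 100                   # assumed default obd_timeout, in seconds

    wait_s = enqueue_wait_us / 1e6
    print(f"{wait_s:.0f} s == {wait_s / 3600:.1f} h (~{wait_s / obd_timeout_s:.0f}x obd_timeout)")
    # -> 9402 s == 2.6 h, consistent with the >2 hours this state has persisted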
 
In the Lustre logs that will be attached, the python process with PID 4501 is
llrd. I have a few debug=-1 logs that show this processing loop quite nicely.

Is this just another example of bug 11330, where the processing code should
take obd_timeout into account when processing and cleaning up locks?
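
If that is the direction, here is a minimal sketch of the idea, purely as an
illustration and not Lustre code: the names (cleanup_flk_locks,
send_completion_ast, cancel_lock_locally) are hypothetical stubs, and the
100 s obd_timeout default is an assumption.

    import time

    OBD_TIMEOUT_S = 100   # assumed default obd_timeout; the real value is a tunable

    def send_completion_ast(lock, timeout):
        """Stub standing in for sending a completion AST; with a dead client,
        every attempt just burns its full timeout."""
        time.sleep(timeout)
        return False  # timed out

    def cancel_lock_locally(lock):
        """Stub standing in for dropping the lock server-side without waiting
        on the (dead) client."""
        pass

    def cleanup_flk_locks(locks, per_ast_timeout=2):
        """Give the whole cleanup a budget of obd_timeout instead of letting
        per-lock AST timeouts serialize into hours."""
        deadline = time.monotonic() + OBD_TIMEOUT_S
        for i, lock in enumerate(locks):
            if time.monotonic() >= deadline:
                # Budget spent: stop waiting on dead clients, drop the rest locally.
                for remaining in locks[i:]:
                    cancel_lock_locally(remaining)
                break
            send_completion_ast(lock, timeout=per_ast_timeout)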

