Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511
I don't have a crystal clear reproducer -- other than to say an application (obviously) gets a bunch of flock locks and then dies. The common thread seems to be that they are all flock'ing the same resource on the MDS -- so probably only one client gets a granted lock and the rest are waiting. Once the application is dead, we come in with llrd to clean these nids up and do the evictions. I am sure we are only going to see more of these. It should be quite easy to write an MPI test app that does a bunch of flock enqueues on a single resource and then falls over dead (segfault, etc.); a rough sketch of such a test is below.

It does seem that when we kill a node holding the lock, the completion AST gets sent to the next waiting client (which seems silly, given that we _know_ one of the clients is dead), and then when that AST times out, we release that lock and reprocess the queue of pending locks for that resource.

I understand there isn't much we can do, given that llrd only gives us a single nid at a time. We *could* utilize the "evict nid by list" changes that are floating around somewhere in Bugzilla and update llrd to use them. I do not know if there is a limit to the number of nids we can write into this proc file -- but we certainly need to know. This would give Lustre a single look at all the nids we are trying to kill. If Lustre could then mark each as "ADMIN_EVICTION_IN_PROGRESS" before it started cleaning up granted locks, etc., the various paths that would send RPCs to these clients could be prevented from taking too much time.

Also -- it should be possible to look at the time spent waiting for the flock locks and, if it was > obd_timeout (from request sent to being actually granted), dump the request as old. I believe this is similar to the approach for bug 11330. (A very rough sketch of that check is at the end of this mail.)
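For what it's worth, here is the kind of reproducer I have in mind. This is only a sketch, not an actual test we have run: the mount point, file name, timings, and the way the job dies are all assumptions for illustration, and the clients would need flock support enabled on the mount for the enqueues to reach the MDS at all.

/*
 * Sketch of a possible reproducer: every rank opens the same file on a
 * Lustre mount (path is an assumption) and asks for an exclusive flock,
 * so one rank is granted and the rest enqueue waiters on the MDS for a
 * single resource.  The job then falls over dead with the locks still
 * queued; cleaning that up is what llrd later has to do.
 */
#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <sys/file.h>
#include <unistd.h>

int main(int argc, char **argv)
{
	const char *path = "/mnt/lustre/flock_test";	/* assumed mount point */
	int rank, fd;

	MPI_Init(&argc, &argv);
	MPI_Comm_rank(MPI_COMM_WORLD, &rank);

	fd = open(path, O_RDWR | O_CREAT, 0644);
	if (fd < 0) {
		perror("open");
		MPI_Abort(MPI_COMM_WORLD, 1);
	}

	if (rank == 0) {
		/* Rank 0 takes the lock first, so everyone else queues up. */
		if (flock(fd, LOCK_EX) < 0)
			perror("flock");
		MPI_Barrier(MPI_COMM_WORLD);
		sleep(30);	/* give the other ranks time to enqueue */
		abort();	/* fall over dead with the lock still held */
	} else {
		MPI_Barrier(MPI_COMM_WORLD);
		/* Blocks behind rank 0's lock; these are the waiters that
		 * pile up on the single resource on the MDS. */
		if (flock(fd, LOCK_EX) < 0)
			perror("flock");
	}

	MPI_Finalize();
	return 0;
}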

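And a very rough sketch of the obd_timeout check on flock waiters mentioned above. The types and helpers here (flock_waiter, drop_stale_waiter, try_grant) are made up for illustration and are not the actual ldlm code; the point is just to record when the enqueue arrived and, when the resource is reprocessed, dump waiters that have already sat in the queue longer than obd_timeout instead of sending them completion ASTs.

/* Illustration only -- hypothetical types and helpers, not the real
 * ldlm structures. */
#include <stddef.h>
#include <time.h>

extern unsigned int obd_timeout;		/* server timeout, seconds */

struct flock_waiter {
	time_t			enqueue_time;	/* when the enqueue arrived */
	struct flock_waiter	*next;
};

extern void drop_stale_waiter(struct flock_waiter *w);	/* hypothetical */
extern void try_grant(struct flock_waiter *w);		/* hypothetical */

/*
 * When the resource's waiting queue is reprocessed (e.g. after an eviction
 * releases the granted lock), dump any waiter that has been queued longer
 * than obd_timeout rather than granting it and waiting for yet another
 * completion AST to time out -- its client has almost certainly died with
 * the rest of the job.
 */
static void reprocess_flock_queue(struct flock_waiter *head)
{
	time_t now = time(NULL);
	struct flock_waiter *w = head, *next;

	while (w != NULL) {
		next = w->next;
		if (now - w->enqueue_time > obd_timeout)
			drop_stale_waiter(w);	/* may unlink/free w */
		else
			try_grant(w);
		w = next;
	}
}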