The situation that Nic is describing is not obscure. While the application doing this type of access may not be ideal, I can imagine several cases where this might happen. Also, even if we correct it in this app, another user will have the same problem down the road. Just to make it clear: typically, when one task dies in an MPI job, the entire application stops and all tasks exit.
This type of scenario is a typical example of what I think of when we talk about scalable recovery. This situation is actually an easier case, because LLRD can provide you a list of all the nodes/nids that should be cleaned up. A two-stage process (which would probably require seconds) is fine given the alternative of it taking hours.

I'm still unclear as to why we are seeing this now. Nic: is this a "new" application or one we have run a good bit? Has anything changed in Lustre that could have caused this to become an issue?

--Shane

-----Original Message-----
From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of [EMAIL PROTECTED]
Sent: Wednesday, January 10, 2007 2:29 PM
To: [email protected]
Subject: [Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock ast processing loop

Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11511

(In reply to comment #10)
> I don't have a crystal clear reproducer -- other than to say an application
> (obviously) gets a bunch of flock locks and then dies. It seems the common

Note that only one app can get a lock if all the locks are conflicting.

> thread is that they are all flock'ing the same resource on the MDS - so
> probably only one client gets a granted lock, the rest are waiting. Once the
> application is dead, we come in with llrd to clean these nids up and do the
> evictions. I am sure we are only going to see more of these. It should be quite

Yes, this sounds possible, though it is quite a stupid thing to do on something like an XT3. If you have a 6000-node job and 5999 nodes just wait until one node releases a lock on a file (in who knows how much time), that is a very unproductive use of resources.

> easy to write an MPI test app that would do a bunch of flock enqueues on a
> single resource and then fall over dead (segfault, etc)

Does a single node exiting mean the rest of the nodes are forcefully killed too?

> It does seem that we are killing a node with the lock held, which gets the
> completion AST sent to the client (which seems silly, given that we _know_ one
> of the clients is dead) and then when that AST times out, we release that lock

This is not silly, because we are killing ONE client and we are granting the lock to ANOTHER that is not killed yet.

> and reprocess the queue of pending locks for that resource.

Yes, because we killed one lock and now we need to see if something was waiting for it to go away and needs to be granted. If you kill all the processes that do not have granted locks first, this won't happen, of course.

> I understand there isn't much we can do, given that llrd only gives us a single
> nid at a time. We *could* utilize the evict-nid-by-list changes that are floating
> around somewhere in Bugzilla and update llrd to use them. I do not know if there
> is a limit to the number of nids we can write into this proc file -- but we
> certainly need to know. This would give Lustre a single look at all the nids we
> are trying to kill. If Lustre could then mark each as
> "ADMIN_EVICTION_IN_PROGRESS" before it started cleaning up granted locks, etc.,
> the various paths that would send RPCs to these clients could be prevented from
> taking too much time. This would make such an eviction a two-stage process, I think.

First go and mark all of them as eviction pending, then go and evict everybody. That is twice as much work for an obscure case.
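
For reference, a minimal sketch of the MPI test app described above -- every rank flocks the same file, one rank holds the lock, the rest enqueue and wait, and then the job falls over dead -- might look something like this. The path, mount point, and sleep time are placeholders, and it assumes the clients are mounted with -o flock:

    /* flock_crash.c -- rough reproducer sketch (not from the bug report):
     * every rank opens the same Lustre file, rank 0 takes an exclusive
     * flock, the other ranks enqueue conflicting flocks, and then rank 0
     * dies with the lock still held. */
    #include <mpi.h>
    #include <stdio.h>
    #include <stdlib.h>
    #include <unistd.h>
    #include <fcntl.h>
    #include <sys/file.h>

    int main(int argc, char **argv)
    {
        int rank, fd;
        const char *path = "/mnt/lustre/flock_test";   /* placeholder path */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        fd = open(path, O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            MPI_Abort(MPI_COMM_WORLD, 1);
        }

        if (rank == 0) {
            /* Grab the exclusive lock before anyone else asks for it. */
            flock(fd, LOCK_EX);
            MPI_Barrier(MPI_COMM_WORLD);
            /* Give the other ranks time to get their enqueues to the MDS,
             * then die with the lock still held. */
            sleep(30);
            abort();
        } else {
            MPI_Barrier(MPI_COMM_WORLD);
            /* Everyone else blocks here on the same resource. */
            flock(fd, LOCK_EX);
            flock(fd, LOCK_UN);
        }

        close(fd);
        MPI_Finalize();
        return 0;
    }

Run across a few thousand nodes, this should leave the MDS with one granted flock and a long queue of waiters when the launcher kills the job -- the state that llrd then has to clean up one nid at a time.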
> Also -- it should be possible to look at the time spent waiting for the flock
> locks and if it was > obd_timeout (from request sent to being actually granted),
> dump the request as old. I believe this is similar to the approach for bug 11330.

This won't work. There is absolutely no limit on the amount of time a flock lock can be held. So with what you propose, if one node gets a lock and another node waits for the conflicting lock, and the first node holds the lock for, say, obd_timeout+1, then the second node won't get its lock at all because the timeout expired?
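
Just to make that last point concrete (this is only an illustration; the path is a placeholder): something as simple as the following is a perfectly legal use of flock, and a second node blocked in flock() on the same file is supposed to wait for as long as this takes, so its request cannot simply be dumped as "old" after obd_timeout:

    /* hold_flock.c -- node A legitimately holds a flock for much longer
     * than obd_timeout; a node B doing flock(fd, LOCK_EX) on the same
     * file is supposed to block until the unlock, however long that takes. */
    #include <stdio.h>
    #include <fcntl.h>
    #include <unistd.h>
    #include <sys/file.h>

    int main(void)
    {
        int fd = open("/mnt/lustre/coord_file", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
            perror("open");
            return 1;
        }

        flock(fd, LOCK_EX);   /* take the lock ...                        */
        sleep(4 * 3600);      /* ... and hold it for hours, which is fine */
        flock(fd, LOCK_UN);

        close(fd);
        return 0;
    }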
