Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11511



(In reply to comment #10)
> I don't have a crystal clear reproducer -- other than to say an application
> (obviously) gets a bunch of flock locks and then dies. It seems the common

Note that only one app can get a lock if all the locks are conflicting.

> thread is that they are all flock'ing the same resource on the MDS - so
> probably only one client gets a granted lock, the rest are waiting. Once the
> application is dead, we come in with llrd to clean these nids up and do the
> evictions. I am sure we are only going to see more of these. It should be quite

Yes, this sounds possible, though it is quite a stupid thing to do on something
like an XT3. If you have a 6000-node job and 5999 nodes just sit waiting until
1 node releases its lock on a file (in who knows how much time), that is a
pretty unproductive use of resources.

> easy to write an MPI test app that would do a bunch of flock enqueues on a
> single resource and then fall over dead (segfault, etc)

Does a single node exiting mean the rest of the nodes would be forcefully killed too?
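
For concreteness, the test being described would be roughly the sketch below.
It is only an illustration: the mount path and the delay are placeholders, and
it assumes the clients are mounted with flock support enabled.

#include <mpi.h>
#include <stdio.h>
#include <stdlib.h>
#include <fcntl.h>
#include <unistd.h>

int main(int argc, char **argv)
{
        int rank, fd;
        struct flock fl = {
                .l_type   = F_WRLCK,
                .l_whence = SEEK_SET,
                .l_start  = 0,
                .l_len    = 0,          /* whole file */
        };

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        /* Placeholder path on the Lustre mount. */
        fd = open("/mnt/lustre/flock-test", O_RDWR | O_CREAT, 0644);
        if (fd < 0) {
                perror("open");
                MPI_Abort(MPI_COMM_WORLD, 1);
        }

        /* Blocking enqueue: one rank gets granted, the rest wait on the
         * MDS for the conflicting lock to go away. */
        if (fcntl(fd, F_SETLKW, &fl) == 0) {
                fprintf(stderr, "rank %d got the lock, dying with it held\n", rank);
                sleep(5);
                abort();        /* die without unlocking -- whether the
                                 * launcher then kills the other ranks is
                                 * exactly the question above */
        }

        MPI_Finalize();
        return 0;
}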

> It does seem that we are killing a node with the lock held, which gets the
> completion AST sent to the client (which seems silly, given that we _know_ one
> of the clients is dead) and then when that AST times out, we release that lock

This is not silly, because we are killing ONE client and granting the lock to
ANOTHER that has not been killed yet.

> and reprocess the queue of pending locks for that resource. 

Yes, because we killed one lock and now we need to see if something was waiting
for it to go away and needs to be granted.
If you first kill all the processes that do not hold granted locks, this won't
happen, of course.
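
To make the ordering concrete, here is a toy user-space sketch (NOT the real
ldlm code; the structures and names are made up). All the locks conflict, so
cancelling the holder makes the reprocess grant the next waiter and send it a
completion AST -- which just times out if that waiter is dead too.

#include <stdio.h>
#include <stdbool.h>

struct lock {
        const char *client;
        bool granted;
        bool cancelled;         /* dropped by eviction */
        bool dead;              /* client already known to be dead */
};

static void send_completion_ast(struct lock *l)
{
        printf("completion AST -> %s%s\n", l->client,
               l->dead ? " (dead: this RPC only times out)" : "");
}

static void reprocess(struct lock q[], int n)
{
        for (int i = 0; i < n; i++)
                if (q[i].granted && !q[i].cancelled)
                        return;                 /* still held, nothing to grant */
        for (int i = 0; i < n; i++)
                if (!q[i].granted && !q[i].cancelled) {
                        q[i].granted = true;    /* grant the first waiter */
                        send_completion_ast(&q[i]);
                        return;
                }
}

int main(void)
{
        struct lock q[] = {
                { "holder",   true,  false, true },  /* all four clients dead */
                { "waiter-1", false, false, true },
                { "waiter-2", false, false, true },
                { "waiter-3", false, false, true },
        };

        q[0].cancelled = true;  /* evict the holder first... */
        reprocess(q, 4);        /* ...and an AST goes to dead waiter-1 */

        /* If the dead waiters had been evicted first instead, this
         * reprocess would find nothing left to grant and no AST would
         * be sent to a dead client. */
        return 0;
}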

> I understand there isn't much we can do, given that llrd only gives us a single
> nid at once. We *could* utilize the evict-nid-by-list changes that are floating
> around somewhere in Bugzilla and update llrd to use them. I do not know if there
> is a limit to the number of nids we can write into this proc file -- but we
> certainly need to know. This would give Lustre a single look at all the nids we
> are trying to kill. If Lustre could then mark each as
> "ADMIN_EVICTION_IN_PROGRESS" before it started cleaning up granted locks, etc.
> the various paths that would send RPCs to these clients could be prevented from
> taking too much time.

This would make such an eviction a two-stage process, I think:
first go and mark all of them as eviction-pending, then go and evict everybody.
That is twice as much work for an obscure case.
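
In other words, something along these lines (hypothetical helper names, purely
to show the doubled pass over the nid list):

#include <stdio.h>

static void mark_eviction_pending(const char *nid)
{
        printf("mark %s ADMIN_EVICTION_IN_PROGRESS\n", nid);
}

static void evict_client(const char *nid)
{
        printf("evict %s (clean up locks, skip RPCs to pending nids)\n", nid);
}

int main(void)
{
        const char *nids[] = { "nid-1", "nid-2", "nid-3" };
        const int n = sizeof(nids) / sizeof(nids[0]);

        /* Stage 1: walk the whole list once, marking everybody. */
        for (int i = 0; i < n; i++)
                mark_eviction_pending(nids[i]);

        /* Stage 2: walk it again to actually evict. */
        for (int i = 0; i < n; i++)
                evict_client(nids[i]);

        return 0;
}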

> Also -- it should be possible to look at the time spent waiting for the flock
> locks and if it was > obd_timeout (from request sent to being actually granted),
> dump the request as old. I believe this is similar to the approach for bug
> 11330.

This won't work. There is absolutely no limit on the amount of time a flock lock
can be held.
So with what you propose, if one node gets a lock, another node waits for the
conflicting lock, and the first node holds the lock for, say, obd_timeout+1,
then the second node won't get its lock at all because the timeout expired?
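
A user-space toy shows the problem (the /tmp path and the scaled-down timeout
are placeholders):

#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/wait.h>

#define PRETEND_TIMEOUT 5       /* stand-in for obd_timeout, scaled down */

int main(void)
{
        int fd = open("/tmp/flock-demo", O_RDWR | O_CREAT, 0644);
        struct flock fl = { .l_type = F_WRLCK, .l_whence = SEEK_SET };

        if (fd < 0) {
                perror("open");
                return 1;
        }

        if (fork() == 0) {
                /* Holder: perfectly healthy, just keeps the lock longer
                 * than the timeout. */
                fcntl(fd, F_SETLKW, &fl);
                sleep(PRETEND_TIMEOUT + 1);
                exit(0);                /* lock released on exit */
        }

        sleep(1);                       /* let the child take the lock first */

        /* Waiter: blocks here for longer than the timeout through no fault
         * of anyone's; dropping its request as "old" at this point would
         * simply deny it the lock. */
        fcntl(fd, F_SETLKW, &fl);
        printf("waiter got the lock eventually -- nothing was broken\n");

        wait(NULL);
        return 0;
}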

_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel
