The situation that Nic is describing is not obscure.  While the
application doing this type of access may not be ideal, I can imagine
several cases where this might happen.  Also, even if we correct it in
this app, another user will have the same problem down the road.  Just
to make it clear: typically when one task dies in an MPI job, the
entire application stops and all tasks exit.

This type of scenario is a typical example of what I think of when we
talk about scalable recovery.  This situation is actually an easier
case, because LLRD can provide a list of all the nodes/nids that
should be cleaned up.  A two-stage process (which would probably take
seconds) is fine given the alternative of the cleanup taking hours.

I'm still unclear as to why we are seeing this now.  Nic: is this a
"new" application or one we have run a good bit?  Has anything changed
in Lustre that could have caused this to become an issue?

--Shane

-----Original Message-----
From: [EMAIL PROTECTED]
[mailto:[EMAIL PROTECTED] On Behalf Of
[EMAIL PROTECTED]
Sent: Wednesday, January 10, 2007 2:29 PM
To: [email protected]
Subject: [Lustre-devel] [Bug 11511] can't evict nodes; stuck in flock
ast processing loop

Please don't reply to lustre-devel. Instead, comment in Bugzilla by
using the following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11511



(In reply to comment #10)
> I don't have a crystal clear reproducer -- other than to say an
> application (obviously) gets a bunch of flock locks and then dies.
> It seems the common

Note that only one app can get a lock if all the locks are conflicting.

> threads is that they are all flock'ing the same resource on the MDS -
> so probably only one client gets a granted lock, the rest are
> waiting. Once the application is dead, we come in with llrd to clean
> these nids up and do the evictions. I am sure we are only going to
> see more of these. It should be quite

Yes, this sounds possible, though it is quite a stupid thing to do on
something like xt3.
If you have a 6000-node job and 5999 nodes just wait until 1 node
releases a lock on a file (who knows when), that is quite an
unproductive use of resources.

> easy to write an MPI test app that would do a bunch of flock
> enqueues on a single resource and then fall over dead (segfault, etc)
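
For what it's worth, a minimal sketch of what such a test app might
look like (this is not from the bug report; the file path, the choice
of which rank dies, and the sleep are arbitrary, and the client would
need to be mounted with flock support):

#include <mpi.h>
#include <stdlib.h>
#include <unistd.h>
#include <fcntl.h>
#include <sys/file.h>

int main(int argc, char **argv)
{
    int rank, fd;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    /* every rank opens the same file on the Lustre mount */
    fd = open("/lustre/scratch/flock_test", O_RDWR | O_CREAT, 0644);
    if (fd < 0)
        MPI_Abort(MPI_COMM_WORLD, 1);

    if (rank == 0) {
        flock(fd, LOCK_EX);             /* rank 0 gets the lock */
        MPI_Barrier(MPI_COMM_WORLD);    /* let the others line up */
        sleep(5);                       /* give them time to block */
        abort();                        /* die with the lock held */
    } else {
        MPI_Barrier(MPI_COMM_WORLD);
        flock(fd, LOCK_EX);             /* blocks, queued behind rank 0 */
    }

    close(fd);
    MPI_Finalize();
    return 0;
}

When rank 0 dies abnormally, the launcher typically takes down the
rest of the job as well, leaving one granted and many waiting flock
locks behind for llrd to clean up.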

Does a single node exiting mean the rest of the nodes would be
forcefully killed too?

> It does seem that we are killing a node with the lock held, which
> gets the completion AST sent to the client (which seems silly, given
> that we _know_ one of the clients is dead) and then when that AST
> times out, we release that lock

This is not silly, because we are killing ONE client and we are
granting the lock to ANOTHER that is not killed yet.

> and reprocess the queue of pending locks for that resource. 

Yes, because we killed one lock and now we need to see if something
was waiting for it to go away and needs to be granted.
If you first kill all the processes that do not hold granted locks,
this won't happen, of course.
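
To illustrate the idea only (this is not the actual Lustre ldlm code;
all the names here are made up): cancelling a granted lock triggers a
pass over the waiting queue, granting whatever no longer conflicts.

/* Hypothetical sketch of reprocessing a resource's lock queues after
 * a granted lock goes away; not the real ldlm implementation. */
struct lock {
    struct lock *next;
    int          mode;      /* e.g. shared vs. exclusive flock */
};

struct resource {
    struct lock *granted;   /* currently granted locks */
    struct lock *waiting;   /* locks queued behind them */
};

/* placeholders for the real conflict check and grant path */
extern int  conflicts_with_granted(struct resource *res, struct lock *lck);
extern void grant_lock(struct resource *res, struct lock *lck);

/* Called once a granted lock has been cancelled (e.g. its client was
 * evicted): see whether any waiter can now be granted.  Granting is
 * what sends the completion AST to the (still live) waiting client. */
static void reprocess_resource(struct resource *res)
{
    struct lock *lck = res->waiting;

    while (lck != NULL) {
        struct lock *next = lck->next;

        if (!conflicts_with_granted(res, lck))
            grant_lock(res, lck);
        lck = next;
    }
}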

> I understand there isn't much we can do, given that llrd only gives
> us a single nid at once. We *could* utilize the "evict nid by list"
> changes that are floating around somewhere in Bugzilla and update
> llrd to use them. I do not know if there is a limit to the number of
> nids we can write into this proc file -- but we certainly need to
> know. This would give Lustre a single look at all the nids we are
> trying to kill. If Lustre could then mark each as
> "ADMIN_EVICTION_IN_PROGRESS" before it started cleaning up granted
> locks, etc., the various paths that would send RPCs to these clients
> could be prevented from taking too much time.

This would make such an eviction a two-stage process, I think:
first go and mark all of them as eviction pending, then go and evict
everybody.
That is twice as much work for an obscure case.
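
Roughly, the two-stage flow being proposed would look like this
(purely illustrative; the nid list and both helpers are hypothetical,
not an existing Lustre or llrd interface):

/* Hypothetical two-stage eviction over a list of nids; neither helper
 * exists as written, this only shows the ordering being discussed. */
struct nid_list {
    const char **nids;
    int          count;
};

extern void mark_eviction_pending(const char *nid);   /* stage 1 op */
extern void evict_client(const char *nid);            /* stage 2 op */

static void evict_nid_list(const struct nid_list *list)
{
    int i;

    /* Stage 1: mark every listed client first, so AST/RPC paths can
     * skip them instead of waiting for a timeout. */
    for (i = 0; i < list->count; i++)
        mark_eviction_pending(list->nids[i]);

    /* Stage 2: now do the actual evictions and lock cleanup. */
    for (i = 0; i < list->count; i++)
        evict_client(list->nids[i]);
}

The idea is that by the time locks are cancelled in the second pass,
nothing tries to send an AST to a client that is already known to be
dead.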

> Also -- it should be possible to look at the time spent waiting for
> the flock locks and if it was > obd_timeout (from request sent to
> being actually granted), dump the request as old. I believe this is
> similar to the approach for bug 11330.

This won't work. There is absolutely no limit on the amount of time a
flock lock can be held.
So with what you propose: one node gets a lock and another node waits
for a conflicting lock; the first node holds the lock for, say,
obd_timeout+1 seconds, and then the second node won't get its lock at
all because the timeout expired?

_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel
