Oral, H. Sarp wrote:
<snip>
"I'm still unclear as to why we are seeing this now. Nic: is this a
"new" application or one we have run a good bit? Has anything changed
in Lustre that could have caused this to become an issue?"
<snip>
I wonder this myself, since what has been described is not such an
obscure case.
Good darn question, folks. I really can't explain this, other than to say
it might be a factor of scale in a specific job?
I think we need to have the application killed at just the *wrong* time
for this to happen -- I'm guessing that usually the nodes process this
flock'd file quite quickly and the application never gets killed while the
locks are held.
Here is a possible solution we are thinking about. Warning -- this
isn't fully tested yet. It does look good on paper though :)
The basic issue is that with Portals, liblustre & Catamount, one cannot
do anything but set a timer when sending an RPC to a client node. Once
the application is dead, that timer will always run to completion: QK
looks at the destination pid, sees that nobody exists to receive the
message, and drops it on the floor.
Now, there exist some paths in Lustre node eviction that will result in
RPC traffic to nodes -- given that we evict nodes one-by-one (and
evicting the whole list at once is problematic for a host of reasons),
we can get into the situation where we are sending RPCs to a node that
llrd knows is dead, but we've not gotten that information into Lustre
yet. I will grant that these are probably due to some varying level of
application Evil Quotient -- but in the end, the system needs to protect
against this.
Consider the case:
nids 1-10 all do a flock on MDS inode 123456
nid 1 is granted the lock, nids 2-10 are put into the pending list.
The job ups and dies (how rude!)
llrd evicts nid 1, causing Lustre to delete its lock and the list of
pending locks to be reprocessed
Lustre sends a Completion AST to nid2 informing it that it now has the
lock -- this times out after 2s
Lustre repeats this process for nids 3-10
total time spent waiting for nids 2-10 == 9 * 2s, or 18s.
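To put that in general form, here is a quick back-of-the-envelope sketch
(the variable names are mine, purely illustrative, not Lustre tunables):

    # Worst-case serial wait when dead lock waiters are evicted one at a time.
    NLOCKERS=10        # nids 1-10 contending on the flock'd file
    AST_TIMEOUT=2      # seconds before each completion AST to a dead nid times out
    echo "worst case: $(( (NLOCKERS - 1) * AST_TIMEOUT ))s"    # 9 * 2s = 18s

Scale NLOCKERS up to a full-machine job and it is easy to see how this turns
into the hours-long cleanup mentioned below.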
The following is a snippet of an idea that we have to deal with this
problem.
Eric B. came up with the idea that we can use lctl --net ptl del_peer
<nid> for every nid we are evicting, to delete the LNET-level information
for that nid -- in effect preventing any future communication with that
node. This should cause these RPC requests to fail immediately
(something I'll be testing to verify), avoiding the long and arduous
serial 2-second-per-nid cleanup that can otherwise drag on for hours and
hours.
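In shell terms the rough shape of it would be something like the following --
untested, and the list of dead nids ($DEAD_NIDS) and how llrd hands it to us
are assumptions on my part; only the lctl invocation itself is the part we
are actually proposing:

    # For every nid llrd is evicting, drop its LNET peer state so any
    # in-flight or future RPCs to it fail immediately instead of timing out.
    for nid in $DEAD_NIDS; do
        lctl --net ptl del_peer $nid
    done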
Note that this only works because LNET on the servers will *not* try to
reconnect to a libclient. Deleting the peer has the effect of failing
the dead-node-bound RPC immediately on the server rather than after a 2s
timeout.
Nic