Oral, H. Sarp wrote:
<snip>

"I'm still unclear as to why we are seeing this now.  Nic: is this a
"new" application or one we have run a good bit?  Has anything changed
in Lustre that could have caused this to become an issue?"

<snip>



I wonder this myself, since what has been described is not such an
obscure case.

Good darn question, folks. I really can't explain this, other than to say it might be a factor of scale in a specific job?

I think the application needs to be killed at just the *wrong* time for this to happen -- I'm guessing that usually the nodes process this flock'd file quite quickly and are never killed while the locks are held.

Here is a possible solution we are thinking about. Warning -- this isn't fully tested yet. It does look good on paper though :)

The basic issue is that with Portals, liblustre & Catamount, one cannot do anything but set a timer when sending an RPC to a client node. Once the application is dead, this timer will always run to completion: QK looks at the destination pid, sees that nobody exists to receive the message, and drops it on the floor.

Now, there exist some paths in Lustre node eviction that result in RPC traffic to nodes. Given that we evict nodes one by one (and evicting the whole list at once is problematic for a host of reasons), we can get into a situation where we are sending RPCs to a node that llrd knows is dead, but that information has not made it into Lustre yet. I will grant that these cases are probably due to some varying level of application Evil Quotient -- but in the end, the system needs to protect against this.

Consider the case:

nids 1-10 all take a flock on MDS inode 123456

nid 1 is granted the lock; nids 2-10 are put in the pending list.

The job up and dies (how rude!)

llrd evicts nid 1, causing Lustre to delete its lock and reprocess the list of pending locks

Lustre sends a Completion AST to nid 2, informing it that it now has the lock -- this times out after 2s

Lustre repeats this process for nids 3-10

Total time spent waiting for nids 2-10 == 9 * 2s, or 18s.
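
As a back-of-the-envelope sketch (my numbers, not anything measured), the serial wait just scales linearly with the number of waiters -- something like this, assuming the 2s per-RPC timeout above:

    # one lock holder plus (NIDS - 1) waiters, each costing one 2s timeout
    NIDS=10        # nodes contending for the flock (pick your job size)
    TIMEOUT=2      # per-RPC timeout in seconds
    echo "$(( (NIDS - 1) * TIMEOUT ))s spent timing out Completion ASTs"

Plug in a few thousand nodes instead of 10 and the serial wait runs into hours, which matches the "factor of scale" hunch above.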

The following is a sketch of an idea we have for dealing with this problem.
Eric B. came up with the idea that we can use lctl --net ptl del_peer <nid> for every nid we are evicting, deleting the LNET-level information for that nid -- in effect preventing any future communication with that node. This should cause these RPC requests to fail immediately (something I'll be testing to verify), avoiding the long and arduous serial 2-second cleanup that can drag on for hours and hours.
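
To make that concrete, here is a minimal, untested sketch of the loop we have in mind -- the nid list is a placeholder, and where exactly llrd would invoke it is still an open question:

    #!/bin/sh
    # For every nid llrd is evicting, drop the LNET peer on the server so
    # that any RPC still bound for that dead node fails immediately
    # instead of waiting out its 2s timer.
    EVICTED_NIDS="1 2 3 4 5 6 7 8 9 10"    # placeholder list from llrd

    for nid in $EVICTED_NIDS; do
        lctl --net ptl del_peer $nid
    done

This is just the command from above wrapped in a loop; presumably the del_peer needs to happen before (or at least alongside) the Lustre-level eviction so the Completion ASTs fail rather than time out.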


Note that this only works because LNET on the servers will *not* try to reconnect to a libclient. Deleting the peer has the effect of failing the dead-node-bound RPC immediately on the server rather than after a 2s timeout.

Nic

