Oral, H. Sarp wrote:
<snip>
"I'm still unclear as to why we are seeing this now. Nic: is this a
"new" application or one we have run a good bit? Has anything changed
in Lustre that could have caused this to become an issue?"
<snip>
I wonder this myself, since what has been described is not such an
obscure case.
Good darn question, folks. I really can't explain this, other than to say
it might be a factor of scale in a specific job?
I think we need to have the application killed at just the *wrong* time
for this to happen -- I'm guessing that usually the nodes process this
flock'd file quite quickly and the application never gets killed while the
locks are held.
Here is a possible solution we are thinking about. Warning -- this
isn't fully tested yet. It does look good on paper though :)
The basic issue is that with Portals, liblustre & Catamount, one cannot
do anything but set a timer when sending an RPC to a client node. Once
the application is dead, that timer will always run to completion: QK
looks at the destination pid, sees that nobody exists to receive the
message, and drops it on the floor.
Now, there exist some paths in Lustre node eviction that will result in
RPC traffic to nodes -- given that we evict nodes one-by-one (and
evicting the whole list at once is problematic for a host of reasons),
we can get into the situation where we are sending RPCs to a node that
llrd knows is dead, but we've not gotten that information into Lustre
yet. I will grant that these are probably due to some varying level of
application Evil Quotient -- but in the end, the system needs to protect
against this.
Consider the case:
nids 1-10 all do a flock on MDS inode 123456
nid 1 is granted the lock, nids 2-10 are put into the pending list.
The job ups and dies (how rude!)
llrd evicts nid 1, causing Lustre to delete its lock and the list of
pending locks to be reprocessed
Lustre sends a Completion AST to nid2 informing it that it now has the
lock -- this times out after 2s
Lustre repeats this process for nids 3-10
total time spent waiting for nids 2-10 == 9 * 2s, or 18s.
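To put that in general form, here is a quick back-of-the-envelope sketch
(the variable names are mine, purely illustrative, not Lustre tunables):

    # Worst-case serial wait when dead lock waiters are evicted one at a time.
    NLOCKERS=10        # nids 1-10 contending on the flock'd file
    AST_TIMEOUT=2      # seconds before each completion AST to a dead nid times out
    echo "worst case: $(( (NLOCKERS - 1) * AST_TIMEOUT ))s"    # 9 * 2s = 18s

Scale NLOCKERS up to a full-machine job and it is easy to see how this turns
into the hours-long cleanup mentioned below.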
The following is a snippet of an idea that we have to deal with this
problem.
Eric B. came up with the idea that we can use lctl --net ptl del_peer
<nid> for every nid we are evicting, to delete the LNET-level information
for that nid -- in effect preventing any future communication with that
node. This should cause these RPC requests to fail immediately
(something I'll be testing to verify), avoiding the long and arduous
serial 2-second-per-nid cleanup that can otherwise drag on for hours and
hours.
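In shell terms the rough shape of it would be something like the following --
untested, and the list of dead nids ($DEAD_NIDS) and how llrd hands it to us
are assumptions on my part; only the lctl invocation itself is the part we
are actually proposing:

    # For every nid llrd is evicting, drop its LNET peer state so any
    # in-flight or future RPCs to it fail immediately instead of timing out.
    for nid in $DEAD_NIDS; do
        lctl --net ptl del_peer $nid
    done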
Note that this only works because LNET on the servers will *not* try to
reconnect to a libclient. Deleting the peer has the effect of failing
the dead-node-bound RPC immediately on the server rather than after a 2s
timeout.
Nic