Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the 
following link:
https://bugzilla.lustre.org/show_bug.cgi?id=11511



Eric, Oleg & I talked about possible solutions to the bug 11511 issue that is
plaguing ORNL. I explored the possibility of something on the Cray Portals side
that would immediately NAK an LNET message on a node where no application was
active, but nothing was going to work or pass muster.

Eric then came up with the idea that we can use lctl --net ptl del_peer <nid>
for every nid we are evicting to delete the LNET-level information for that nid
-- in effect preventing any future communication with that node. This should
cause RPC requests to that node to fail immediately (something I'll be testing
to verify) -- preventing the long and arduous serial 2-second cleanup that goes
on for hours and hours.
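
To make the per-nid step concrete, here is a rough sketch of how it could be
scripted. It only wraps the lctl --net ptl del_peer <nid> command quoted above;
the helper names, the dry-run default and the example nids are made up for
illustration, and none of this has been tested against a real MDS.

```python
import subprocess


def del_peer_commands(nids, net="ptl"):
    """Build one 'lctl --net <net> del_peer <nid>' argv per evicted nid."""
    return [["lctl", "--net", net, "del_peer", nid] for nid in nids]


def delete_evicted_peers(nids, net="ptl", dry_run=True):
    """Print (dry run) or execute the del_peer command for each nid.

    Hypothetical helper: how the list of evicted nids is obtained is
    outside the scope of this sketch.
    """
    commands = del_peer_commands(nids, net)
    for argv in commands:
        if dry_run:
            # Show what would be run without touching LNET state.
            print(" ".join(argv))
        else:
            # check=True raises CalledProcessError if lctl exits non-zero.
            subprocess.run(argv, check=True)
    return commands


if __name__ == "__main__":
    # Example nids are placeholders, not real ORNL nodes.
    delete_evicted_peers(["12@ptl", "13@ptl"])
```

The dry run is the default so the commands can be reviewed before anything is
actually removed from the peer table.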

I know this needs to be done for certain on the MDS, but is there any benefit to
doing this on the OSTs as well? Can it only help?

I think we need to explore this before looking at a code change -- given the
nature of the flock.

_______________________________________________
Lustre-devel mailing list
[email protected]
https://mail.clusterfs.com/mailman/listinfo/lustre-devel
