Please don't reply to lustre-devel. Instead, comment in Bugzilla by using the following link: https://bugzilla.lustre.org/show_bug.cgi?id=11511
Eric, Oleg, and I talked about possible solutions to the bug 11511 issue that is plaguing ORNL. I explored the possibility of doing something on the Cray Portals side that would immediately NAK an LNET message on a node where no application was active, but nothing was going to work or pass muster.

Eric then came up with the idea that we can use lctl --net ptl del_peer <nid> for every nid we are evicting to delete the LNET-level information for that nid, in effect preventing any future communication with that node. This should cause these RPC requests to fail immediately (something I'll be testing to verify), avoiding the long and arduous serial 2-second cleanup that goes on for hours and hours.

I know this needs to be done for certain on the MDS, but is there any benefit to doing this on the OSTs as well? Can it only help? I think we need to explore this before looking at a code change, given the nature of the flock.
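As a rough illustration of the per-nid del_peer idea above, here is a minimal sketch that builds one lctl --net ptl del_peer <nid> command for each evicted nid. The function names, the dry-run default, and the example nids (12@ptl, 13@ptl) are all hypothetical and only for illustration; only the lctl command line itself comes from the proposal, and whether it really makes the RPCs fail immediately is still to be tested.

```python
import shlex
import subprocess


def del_peer_commands(nids, net="ptl"):
    """Build one 'lctl --net <net> del_peer <nid>' argv list per evicted nid."""
    return [["lctl", "--net", net, "del_peer", nid] for nid in nids]


def delete_evicted_peers(nids, net="ptl", dry_run=True):
    """Print (dry run) or execute the del_peer command for each nid.

    Returns the command lines in the order they were issued.
    """
    issued = []
    for argv in del_peer_commands(nids, net):
        line = " ".join(shlex.quote(arg) for arg in argv)
        issued.append(line)
        if dry_run:
            # Dry run: only show what would be executed.
            print(line)
        else:
            # check=False so one failing del_peer does not stop the loop.
            subprocess.run(argv, check=False)
    return issued


if __name__ == "__main__":
    # Hypothetical nids, used only to show the shape of the commands.
    delete_evicted_peers(["12@ptl", "13@ptl"])
```

Running it as-is only prints the commands; passing dry_run=False would actually invoke lctl, so that should be done only on a node where deleting those peers is intended.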
