Hi Alejandro! This makes me think of an asymmetric routing problem. It could be addressed by implementing something like reverse path filtering (http://tldp.org/HOWTO/Adv-Routing-HOWTO/lartc.kernel.rpf.html) in LNet: nodes would not accept requests from peers through router B when they are configured to talk to those peers through router A only.
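To make the idea concrete, here is a minimal sketch of a strict reverse-path check. It is not LNet code and does not use any LNet API: the `Route` type, the `accepts` function, and the network and router names are all hypothetical, chosen only to illustrate the rule "accept a message through a router only if I would use that router to reach the sender".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    """One configured route: reach `remote_net` through `gateway` (hypothetical type)."""
    remote_net: str
    gateway: str


def accepts(routes, peer_net, arrived_via):
    """Strict reverse-path check.

    Accept a message from a peer on `peer_net` only if this node is itself
    configured to reach `peer_net` through the router it arrived from.
    """
    allowed = {r.gateway for r in routes if r.remote_net == peer_net}
    return arrived_via in allowed


# Hypothetical configuration: this node reaches "clients_net" through router A only.
routes = [Route("clients_net", "routerA")]

print(accepts(routes, "clients_net", "routerA"))  # True
print(accepts(routes, "clients_net", "routerB"))  # False
```

A real implementation would also have to decide what to do when the preferred router is down, so that failover routes are still honoured.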
That is, if there is no other ready-to-use solution and you are willing to contribute code :)

Cheers,
Sebastien.

> On 13 Oct 2017, at 15:20, LOPEZ, ALEXANDRE <alexandre.lo...@atos.net> wrote:
>
> Hi everyone,
>
> I’d like to have your opinion on a problem I’m facing. Sorry for the long
> mail but I failed to make it shorter without removing some important
> information.
>
> Each islet on my cluster has a dedicated Lustre router connected to the
> interconnect and to a dedicated network where Lustre servers are reachable.
> Lustre servers are NOT on the main interconnect, thus the need for routers.
> Any router is reachable thru the interconnect from any node but, when the
> node and the router aren’t on the same islet, several switches (hops) need to
> be crossed. The idea is to use the shortest path to the servers thru the
> islet-local router.
>
> I created the appropriate routes on each compute node to contact the
> islet-local Lustre router. There is also a lower-priority route to fail over
> to a router on another islet in case the local Lustre router fails. (This
> could have also been done with the route’s hops, but my understanding is that
> the final result is the same.) I also created the routes on the Lustre servers
> for the responses to reach the clients thru the routes.
>
> This seems to work as expected, but it actually does not.
>
> Although the filesystem is mounted on the clients and works, there is a
> problem when there is no failure (all routers are up). The problem is rooted
> in the routes used to deliver the responses from the servers. If I assign
> priorities to the routes on the servers, the higher-priority route will
> always be used to send the responses. So, if a compute node sent a request
> thru its islet’s router (the shortest path), the response will not return
> thru the same router but thru the one designated by the higher-priority
> route, making the return path longer. Using hops is the same thing: the route
> with the lower hop value is chosen, but the same set of routes applies to all
> the nodes on all the islets and a valid value for an islet is not valid for
> all the others. If I assign neither priority nor hops, round-robin will be
> used and the next route on the list is selected.
>
> The ideal solution would be for the response to follow the reverse path
> followed by the request (thru the same router) but I found no way to do it.
>
> Is there any way to make the responses go the reverse (shortest) path?
>
> Any other way to solve this?
>
> I considered assigning a separate Lustre network to each islet but, although
> this solves this problem, it adds new ones; so I ended up discarding it.
>
> I’m currently using Lustre 2.7 but I found nothing suggesting that 2.10 will
> solve the problem.
>
> Thanks for your time and answers.
>
> Alexandre Lopez
> Big Data & Security – Data Management
> Bull SAS – Atos Technologies
>
> _______________________________________________
> lustre-discuss mailing list
> lustre-discuss@lists.lustre.org
> http://lists.lustre.org/listinfo.cgi/lustre-discuss-lustre.org