Hi all,

We experienced our first asymmetric router failure today: one interface on a subset of our LNET routers went down, but only the clients connected directly via that interface detected it. Our Lustre servers, which reach the routers via a different, still-functional interface, did not mark these routers as down. I thought our configuration mitigated this problem, since all our clients and servers set the LNET module parameter:

avoid_asym_router_failure=1

As I understand it, the router pinger on our clients and servers should detect that one of the router's NIDs is down in this scenario and then mark the router down.

Does this parameter also need to be set on the routers to be effective? From my understanding of what it does, that shouldn't be necessary.

In my case, the router's IB interface was showing state 'DOWN' in ibstatus and wasn't flapping, so I would have expected the router pinger on the servers to detect this.

My lnet configurations are below for further information:

-------------------------------------------
Servers
-------
lustre-2.7.21 (Not quite the latest IEEL 3.X release)

options lnet networks="o2ib1(ib0)" routes="tcp2 1 10.47.240.[161-168]@o2ib1; tcp4 1 10.47.240.[161-168]@o2ib1; o2ib0 1 10.47.240.[165-168]@o2ib1; o2ib2 1 10.47.240.[161-168]@o2ib1" auto_down =1 avoid_asym_router_failure=1 check_routers_before_use=1 dead_router_check_interval=60 live_router_check_interval=60 router_ping_timeout=60


Routers
-------
lustre-2.7.21

options lnet networks="o2ib1(ib0), o2ib2(ib1), tcp2(em1.43), tcp4(em1.40)"

Clients
-------
lustre-client-2.10.3-1

options lnet networks=o2ib2(ib0) routes="o2ib1 1 10.44.240.[161-168]@o2ib2; o2ib0 1 10.44.240.[165-168]@o2ib2" auto_down=1 avoid_asym_router_failure=1 check_routers_before_use=1 dead_router_check_interval=60 live_router_check_interval=60 router_ping_timeout=60

-------------------------------------------

Has anyone else experienced this problem, and has this parameter worked properly for you?

Could anyone suggest ways to observe what the router pinger is doing? Would that mean looking for messages in the Lustre debug logs, perhaps with just 'net' set in the debug mask?
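
For concreteness, here is a rough sketch of the kind of check I have in mind on a server. It is untested on these versions, and it assumes that 'lctl get_param -n routers' exposes the router table on this release and that /tmp/lnet-debug.log is an acceptable dump location. expand_nids is just a hypothetical helper that expands the bracketed range from the routes option above.

```shell
#!/bin/sh
# Sketch only, not a verified procedure.
# expand_nids is a hypothetical helper: it expands a single bracketed
# range such as "10.47.240.[161-168]@o2ib1" into one NID per line.
# (POSIX sh has no 'local', so the variables below are global.)
expand_nids() {
    prefix=${1%%\[*}      # text before '[', e.g. "10.47.240."
    rest=${1#*\[}         # "161-168]@o2ib1"
    range=${rest%%\]*}    # "161-168"
    suffix=${rest#*\]}    # "@o2ib1"
    first=${range%-*}
    last=${range#*-}
    i=$first
    while [ "$i" -le "$last" ]; do
        printf '%s%s%s\n' "$prefix" "$i" "$suffix"
        i=$((i + 1))
    done
}

# Only touch LNet if lctl is actually present on this node.
if command -v lctl >/dev/null 2>&1; then
    # Add 'net' to the current debug mask.
    lctl set_param debug=+net
    # Router state as this node sees it (assumption: the parameter is
    # reachable under this name on the installed release).
    lctl get_param -n routers
    # Ping each router NID from the routes option shown above.
    for nid in $(expand_nids '10.47.240.[161-168]@o2ib1'); do
        lctl ping "$nid" || echo "no reply from $nid"
    done
    # Dump the kernel debug buffer to a file (path is an assumption).
    lctl dk /tmp/lnet-debug.log
fi
```

Note that a successful ping here only shows the router's o2ib1 NID is reachable from the server; whether the reply is actually used to mark the router down is exactly what I'm unsure about.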

Thanks,
Matt

--
Matt Rásó-Barnett
Research Computing Platforms
University Information Services
University of Cambridge