Hi,

Eric Barton wrote:
> It's expected that peers will crash and therefore the low-level
> network should not clutter the logs with noise and the upper
> layers should handle the problem by retrying or doing actual
> recovery.

Ok, so I can read those errors as meaning something like:
- my IB network is not so clean
- but the Lustre upper layers will retry, so this is transparent for them as long as I do not have too many issues of this kind.

> "RDMA failed" should really only occur when a peer node crashes.
> However it could be a sign that there are deeper problems with
> the network setup or hardware.

Ok, but in my case we have situations where nodes do not crash and we still get this kind of error, for example (this occurs on LNET routers):

Tx -> ... cookie ... sending 1 waiting 0: failed 12
Closing conn to ... : error -5 (waiting)

This happens even though the corresponding node is responding and Lustre works for it.

> If you suspect the network is
> misbehaving, I'd run an LNET self-test. This is well documented
> in the manual (at least to people who already know how it works ;)
> and lets you soak-test the network from any convenient node.

Ok :) I use it often, so that's fine. But lnet_selftest has difficulty working properly if you are using different OFED stacks (at least v1.4.2 against v1.5.1), so it is difficult to use it as a test for my current issue.

Thanks

Aurélien

>
> Cheers,
> Eric
>
>
>
>> -----Original Message-----
>> From: lustre-devel-boun...@lists.lustre.org
>> [mailto:lustre-devel-boun...@lists.lustre.org] On Behalf
>> Of Aurelien Degremont
>> Sent: 22 September 2010 5:20 PM
>> To: lustre-de...@lists.lustre.org
>> Subject: [Lustre-devel] Meaning of LND/neterrors ?
>>
>> Hello
>>
>> I've noticed that Lustre network error, especially LND errors, are
>> considered as maskable errors.
>> That means that on a production node, where debug mask is 0, those specific
>> errors won't be displayed
>> if they happened.
>>
>> Does that mean that they are harmless?
>> Do upper-layers resend their RPC/packet if LNDs report an error?
>>
>> When, in my case, o2iblnd says something like "RDMA failed" (neterror). It
>> is a big issue? Some RPC
>> were lost or not?
>>
>> Thanks in advance
>>
>> --
>> Aurelien Degremont
>> _______________________________________________
>> Lustre-devel mailing list
>> lustre-de...@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-devel
>
>

--
Aurelien Degremont
CEA
_______________________________________________
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss