Re: [Lustre-discuss] Meaning of LND/neterrors ?
Hi

Eric Barton wrote:
> It's expected that peers will crash and therefore the low-level network should not clutter the logs with noise and the upper layers should handle the problem by retrying or doing actual recovery.

OK, so I can understand those errors as something like:
- my IB network is not so clean
- but the Lustre upper layers will retry, so this is transparent for them as long as I do not have too many issues of this kind.

> RDMA failed should really only occur when a peer node crashes. However it could be a sign that there are deeper problems with the network setup or hardware.

OK, but in our case nodes do not crash and we still get this kind of error, for example (this occurs on LNET routers):

Tx - ... cookie ... sending 1 waiting 0: failed 12
Closing conn to ... : error -5 (waiting)

This happens even though the corresponding node is responding and Lustre works for it.

> If you suspect the network is misbehaving, I'd run an LNET self-test. This is well documented in the manual (at least to people who already know how it works ;) and lets you soak-test the network from any convenient node.

OK :) I use it often, so that's fine. But lnet_selftest has difficulty working properly if you are using different OFED stacks (at least v1.4.2 against v1.5.1), so it is difficult to use it as a test for my current issue.

Thanks
Aurélien

> Cheers,
> Eric
>
> -----Original Message-----
> From: lustre-devel-boun...@lists.lustre.org [mailto:lustre-devel-boun...@lists.lustre.org] On Behalf Of Aurelien Degremont
> Sent: 22 September 2010 5:20 PM
> To: lustre-de...@lists.lustre.org
> Subject: [Lustre-devel] Meaning of LND/neterrors ?
>
> Hello
>
> I've noticed that Lustre network errors, especially LND errors, are considered maskable errors. That means that on a production node, where the debug mask is 0, those specific errors won't be displayed if they happen.
>
> Does that mean that they are harmless? Do the upper layers resend their RPC/packet if LNDs report an error?
>
> When, in my case, o2iblnd says something like "RDMA failed" (neterror), is it a big issue? Were some RPCs lost or not?
>
> Thanks in advance
>
> --
> Aurelien Degremont
> ___
> Lustre-devel mailing list
> lustre-de...@lists.lustre.org
> http://lists.lustre.org/mailman/listinfo/lustre-devel

--
Aurelien Degremont
CEA
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss
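For readers following the thread, the "error -5" in the "Closing conn" line quoted above can be decoded with a few lines of Python. This is only a sketch: it assumes the number is a negated Linux errno (as is usual for kernel code) and that the line looks like the excerpt in the message. The regular expression, the function name `decode_close` and the sample NID are illustrative and do not come from Lustre itself.

```python
import errno
import os
import re

# Pattern for the "Closing conn" line quoted in the message above. The exact
# console format may differ between Lustre versions, so this is an assumption.
CLOSE_RE = re.compile(r"Closing conn to (?P<peer>\S+?)\s*: error (?P<rc>-?\d+)")


def decode_close(line):
    """Return (peer, rc, errno_name, description), or None if the line does not match."""
    match = CLOSE_RE.search(line)
    if match is None:
        return None
    rc = int(match.group("rc"))
    code = abs(rc)  # assumed to be a negated errno value
    name = errno.errorcode.get(code, "UNKNOWN")
    return match.group("peer"), rc, name, os.strerror(code)


if __name__ == "__main__":
    # 192.168.0.10@o2ib is a made-up NID; the original message elides the peer.
    sample = "Closing conn to 192.168.0.10@o2ib : error -5 (waiting)"
    print(decode_close(sample))
```

On Linux, errno 5 is EIO (a generic I/O error), which fits a connection being torn down after a failed transfer but does not by itself say why the transfer failed.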
Re: [Lustre-discuss] [Lustre-devel] Meaning of LND/neterrors ?
Alexey Lyashkov wrote:
> Hi Aurelien,
>
> You can see that message in two cases:
> 1) A low-level network error. That is bad, because the client will reconnect and resend its requests after the error, which adds extra load to the service nodes.
> 2) A service node (MDS, OSS) is restarted or hung; in that case the transfer is aborted.

In our case the nodes were not restarted, so the InfiniBand network seems to have issues. But these errors can be ignored as long as they do not appear too often.

--
Aurelien Degremont
CEA
[Lustre-discuss] need help debuggin an access permission problem
Hello List,

I'm after debugging hints...

I have a couple of users that intermittently get I/O errors when trying to ls a directory (as in, within half an hour: works - doesn't work - works...).

Users/groups are kept in LDAP; as far as I can see/check, the LDAP information is consistent everywhere (i.e. no replication failure or anything).

I am trying to figure out what is going on here and where this is going wrong. Can someone give me a hint on how to debug this? Specifically, how does the MDS look up this sort of information? Could there be a 'list too long' type of error involved, something like that?

Thanks,
Tina

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442

--
This e-mail and any attachments may contain confidential, copyright and or privileged material, and are for the use of the intended addressee only. If you are not the intended addressee or an authorised recipient of the addressee please notify us of receipt by returning the e-mail and do not use, copy, retain, distribute or disclose the information in or attached to the e-mail. Any opinions expressed within this e-mail are those of the individual and not necessarily of Diamond Light Source Ltd. Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments are free from viruses and we cannot accept liability for any damage which you may sustain as a result of software viruses which may be transmitted in or with the message. Diamond Light Source Limited (company no. 4375679). Registered in England and Wales with its registered office at Diamond House, Harwell Science and Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom
Re: [Lustre-discuss] Meaning of LND/neterrors ?
Aurelien,

Could you give us some details about those difficulties with lnet_selftest over different OFED stacks when you see them again? It would be interesting to know, because I think lnet_selftest should be stack independent.

Thanks
Liang

On 9/23/10 3:57 PM, Aurelien Degremont wrote:
>> If you suspect the network is misbehaving, I'd run an LNET self-test. This is well documented in the manual (at least to people who already know how it works ;) and lets you soak-test the network from any convenient node.
>
> OK :) I use it often, so that's fine. But lnet_selftest has difficulty working properly if you are using different OFED stacks (at least v1.4.2 against v1.5.1), so it is difficult to use it as a test for my current issue.
>
> Thanks
> Aurélien
>
>> Cheers,
>> Eric

--
Cheers
Liang
Re: [Lustre-discuss] need help debuggin an access permission problem
Hi,

thanks for the answer. I found it in the meantime; one of our LDAP servers had a wrong size limit entry.

I had of course already looked at the logs. They didn't yield much in terms of why, only what: I could see there were permission errors, but they do not really tell you why you are getting them. There weren't any log entries that hinted at 'size limit exceeded' or anything.

Still - could someone point me to the bit in the documentation that best describes how the MDS queries that sort of information (group/passwd info, I mean)? Or how best to test that its mechanisms are working? For example, in this case, I always thought one would only hit the size limit when doing a bulk 'transfer' of data, not when doing a lookup on one user - plus I could do these sorts of lookups fine on all machines involved (against all LDAP servers).

Tina

On 23/09/10 11:20, Ashley Pittman wrote:
> On 23 Sep 2010, at 10:46, Tina Friedrich wrote:
>> Hello List,
>>
>> I'm after debugging hints...
>>
>> I have a couple of users that intermittently get I/O errors when trying to ls a directory (as in, within half an hour: works - doesn't work - works...).
>>
>> Users/groups are kept in LDAP; as far as I can see/check, the LDAP information is consistent everywhere (i.e. no replication failure or anything).
>>
>> I am trying to figure out what is going on here and where this is going wrong. Can someone give me a hint on how to debug this? Specifically, how does the MDS look up this sort of information? Could there be a 'list too long' type of error involved, something like that?
>
> Could you give an indication as to the number of files in the directory concerned? What is the full ls command issued (allowing for shell aliases), and in the case where it works, is there a large variation in the time it takes?
>
> In terms of debugging it, I'd say the log files for the client in question and the MDS would be the most likely place to start.
>
> Ashley,

--
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
Re: [Lustre-discuss] need help debuggin an access permission problem
On 9/23/10 10:03 PM, Tina Friedrich wrote:
> Hi,
>
> thanks for the answer. I found it in the meantime; one of our LDAP servers had a wrong size limit entry.
>
> I had of course already looked at the logs. They didn't yield much in terms of why, only what: I could see there were permission errors, but they do not really tell you why you are getting them. There weren't any log entries that hinted at 'size limit exceeded' or anything.
>
> Still - could someone point me to the bit in the documentation that best describes how the MDS queries that sort of information (group/passwd info, I mean)? Or how best to test that its mechanisms are working? For example, in this case, I always thought one would only hit the size limit when doing a bulk 'transfer' of data, not when doing a lookup on one user - plus I could do these sorts of lookups fine on all machines involved (against all LDAP servers).

The topic on User/Group Cache Upcall may be helpful for you. For lustre-1.8.x, it is chapter 28.1; for lustre-2.0.x, it is chapter 29.1.

Good Luck!

Cheers,
Nasf

> Tina
>
> On 23/09/10 11:20, Ashley Pittman wrote:
>> On 23 Sep 2010, at 10:46, Tina Friedrich wrote:
>>> Hello List,
>>>
>>> I'm after debugging hints...
>>>
>>> I have a couple of users that intermittently get I/O errors when trying to ls a directory (as in, within half an hour: works - doesn't work - works...).
>>>
>>> Users/groups are kept in LDAP; as far as I can see/check, the LDAP information is consistent everywhere (i.e. no replication failure or anything).
>>>
>>> I am trying to figure out what is going on here and where this is going wrong. Can someone give me a hint on how to debug this? Specifically, how does the MDS look up this sort of information? Could there be a 'list too long' type of error involved, something like that?
>>
>> Could you give an indication as to the number of files in the directory concerned? What is the full ls command issued (allowing for shell aliases), and in the case where it works, is there a large variation in the time it takes?
>>
>> In terms of debugging it, I'd say the log files for the client in question and the MDS would be the most likely place to start.
>>
>> Ashley,
Re: [Lustre-discuss] need help debuggin an access permission problem
On 2010-09-23, at 08:03, Tina Friedrich wrote:
> Still - could someone point me to the bit in the documentation that best describes how the MDS queries that sort of information (group/passwd info, I mean)? Or how best to test that its mechanisms are working? For example, in this case, I always thought one would only hit the size limit when doing a bulk 'transfer' of data, not when doing a lookup on one user - plus I could do these sorts of lookups fine on all machines involved (against all LDAP servers).

You can run l_getgroups -d {uid} (the utility that the MDS uses to query the groups database/LDAP) from the command line.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
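Alongside l_getgroups -d {uid}, a cross-check of the name service on the MDS could look like the Python sketch below. It does not reproduce how l_getgroups works internally, which this thread does not describe. It only assumes that a lookup which walks the entire group database can be truncated by an LDAP size limit, while a per-user query may not be. The function names are hypothetical. If the two lists differ on a machine, enumeration of the group database is probably incomplete there.

```python
import grp
import os
import pwd
import sys


def groups_by_enumeration(username, primary_gid, all_groups=None):
    """Collect the user's gids by walking the whole group database."""
    if all_groups is None:
        all_groups = grp.getgrall()  # enumerates every group the name service returns
    gids = {primary_gid}
    for entry in all_groups:
        if username in entry.gr_mem:
            gids.add(entry.gr_gid)
    return sorted(gids)


def compare_group_lookups(uid):
    """Return (enumerated, direct) gid lists for the given numeric uid."""
    pw = pwd.getpwuid(uid)  # raises KeyError if the uid is unknown
    enumerated = groups_by_enumeration(pw.pw_name, pw.pw_gid)
    # os.getgrouplist asks the name service for this one user's groups.
    direct = sorted(set(os.getgrouplist(pw.pw_name, pw.pw_gid)))
    return enumerated, direct


if __name__ == "__main__":
    uid = int(sys.argv[1]) if len(sys.argv) > 1 else os.getuid()
    enumerated, direct = compare_group_lookups(uid)
    print("by enumeration:", enumerated)
    print("per-user query:", direct)
    if enumerated != direct:
        print("lookups disagree; group enumeration may be truncated")
```

Running it with the affected user's uid on the MDS and on a client, several times over the half-hour window in which the failures come and go, would show whether the two machines ever see different group lists.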