Re: [Lustre-discuss] Meaning of LND/neterrors ?

2010-09-23 Thread Aurelien Degremont
Hi

Eric Barton wrote:
 It's expected that peers will crash and therefore the low-level
 network should not clutter the logs with noise and the upper
 layers should handle the problem by retrying or doing actual
 recovery.

Ok, so I can read those errors as something like:
  - my IB network is not so clean
  - but the Lustre upper layers will retry, so this is transparent for them
as long as I do not have too many issues of this kind.


 RDMA failed should really only occur when a peer node crashes.
 However it could be a sign that there are deeper problems with
 the network setup or hardware. 

Ok, but in our case we have situations where nodes do not crash, yet we still
get errors of this kind, for example (this occurs on LNET routers):
Tx - ... cookie ... sending 1 waiting 0: failed  12
Closing conn to ... : error -5 (waiting)

This happens even though the corresponding node is responding and Lustre works for it.
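
For readers decoding the two numbers in the messages above, here is a minimal Python sketch. It assumes (this is not stated in the thread) that the trailing 12 in the o2iblnd Tx line is an InfiniBand work-completion status from the Linux kernel's `enum ib_wc_status`, and that -5 is a negated Linux errno. The table is only a partial subset, and the helper names are hypothetical.

```python
import errno

# Partial subset of the Linux kernel's `enum ib_wc_status`.
# Assumption: the number after "failed" in the o2iblnd Tx message is one of these.
IB_WC_STATUS = {
    0: "IB_WC_SUCCESS",
    5: "IB_WC_WR_FLUSH_ERR",
    12: "IB_WC_RETRY_EXC_ERR",
    13: "IB_WC_RNR_RETRY_EXC_ERR",
}


def describe_wc_status(code):
    """Return the symbolic name of a work-completion status, if known."""
    return IB_WC_STATUS.get(code, f"unknown status {code}")


def describe_errno(code):
    """Map a kernel-style negative errno (e.g. -5) to its symbolic name."""
    return errno.errorcode.get(abs(code), f"unknown errno {code}")


if __name__ == "__main__":
    print(describe_wc_status(12))
    print(describe_errno(-5))
```

Under those assumptions, 12 maps to IB_WC_RETRY_EXC_ERR (the transport retry counter was exceeded, i.e. the remote side did not acknowledge in time) and -5 maps to EIO. That reading would be consistent with a fabric-level problem rather than a crashed peer, but it is an interpretation, not something confirmed in this thread.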


 If you suspect the network is
 misbehaving, I'd run an LNET self-test.  This is well documented
 in the manual (at least to people who already know how it works ;)
 and lets you soak-test the network from any convenient node.

Ok :) I use it often, so that's fine.
But lnet_selftest has difficulty working properly if you are using different OFED 
stacks (at least v1.4.2 against v1.5.1).
So it is difficult to use it as a test for my current issue.


Thanks

Aurélien


 
   Cheers,
Eric
 
 
 
 -Original Message-
 From: lustre-devel-boun...@lists.lustre.org 
 [mailto:lustre-devel-boun...@lists.lustre.org] On Behalf
 Of Aurelien Degremont
 Sent: 22 September 2010 5:20 PM
 To: lustre-de...@lists.lustre.org
 Subject: [Lustre-devel] Meaning of LND/neterrors ?

 Hello

 I've noticed that Lustre network errors, especially LND errors, are 
 considered maskable errors.
 That means that on a production node, where the debug mask is 0, those specific 
 errors won't be displayed if they happen.

 Does that mean that they are harmless?
 Do upper layers resend their RPC/packet if LNDs report an error?

 When, in my case, o2iblnd says something like RDMA failed (neterror), is it 
 a big issue? Were some RPCs lost or not?

 Thanks in advance

 --
 Aurelien Degremont
 ___
 Lustre-devel mailing list
 lustre-de...@lists.lustre.org
 http://lists.lustre.org/mailman/listinfo/lustre-devel
 
 


-- 
Aurelien Degremont
CEA
___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] [Lustre-devel] Meaning of LND/neterrors ?

2010-09-23 Thread Aurelien Degremont
Alexey Lyashkov wrote:
 Hi Aurelien,
 
 You can see that message in two cases:
 1) a low-level network error. That is bad, because the client will reconnect and 
 resend its requests after that error,
 which adds extra load to the service nodes.
 
 2) a service node (MDS, OSS) has restarted or hung, in which case the transfer 
 is aborted.

In our case the nodes were not restarted, so the InfiniBand network seems to have 
issues.
But these errors can be ignored as long as they do not appear too often.



-- 
Aurelien Degremont
CEA


[Lustre-discuss] need help debuggin an access permission problem

2010-09-23 Thread Tina Friedrich
Hello List,

I'm after debugging hints...

I have a couple of users that intermittently get I/O errors when trying 
to ls a directory (as in, within half an hour, works - doesn't work - 
works...).

Users/groups are kept in ldap; as far as I can see/check, the ldap 
information is consistent everywhere (i.e. no replication failure or 
anything).

I am trying to figure out what is going on here/where this is going 
wrong. Can someone give me a hint on how to debug this? Specifically, 
how does the MDS look up this sort of information? Could there be a 
'list too long' type of error involved, or something like that?

Thanks,
Tina

-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442

-- 
This e-mail and any attachments may contain confidential, copyright and or 
privileged material, and are for the use of the intended addressee only. If you 
are not the intended addressee or an authorised recipient of the addressee 
please notify us of receipt by returning the e-mail and do not use, copy, 
retain, distribute or disclose the information in or attached to the e-mail.
Any opinions expressed within this e-mail are those of the individual and not 
necessarily of Diamond Light Source Ltd. 
Diamond Light Source Ltd. cannot guarantee that this e-mail or any attachments 
are free from viruses and we cannot accept liability for any damage which you 
may sustain as a result of software viruses which may be transmitted in or with 
the message.
Diamond Light Source Limited (company no. 4375679). Registered in England and 
Wales with its registered office at Diamond House, Harwell Science and 
Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom
 





Re: [Lustre-discuss] Meaning of LND/neterrors ?

2010-09-23 Thread liang.whamcloud
Aurelien,
could you give us some details about the difficulties lnet_selftest has 
over different OFED stacks when you see them again? It would be 
interesting to know, because I think lnet_selftest should be stack 
independent.

Thanks
Liang

On 9/23/10 3:57 PM, Aurelien Degremont wrote:

 If you suspect the network is
 misbehaving, I'd run an LNET self-test.  This is well documented
 in the manual (at least to people who already know how it works ;)
 and lets you soak-test the network from any convenient node.
 Ok :) I use it often, so that's fine.
 But lnet_selftest has difficulty working properly if you are using different 
 OFED stacks (at least v1.4.2 against v1.5.1).
 So it is difficult to use it as a test for my current issue.


 Thanks

 Aurélien


Cheers,
 Eric






-- 
Cheers
Liang



Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-23 Thread Tina Friedrich
Hi,

thanks for the answer. I found it in the meantime; one of our ldap 
servers had a wrong size limit entry.

I had of course already looked at the logs - they didn't yield much in 
terms of why, only what (as in, I could see it was permission errors, 
but they do not really tell you why you are getting them. 
There weren't any log entries that hinted at 'size limit exceeded' or 
anything).

Still - could someone point me to the bit in the documentation that best 
describes how the MDS queries that sort of information (group/passwd 
info, I mean)? Or how best to test that its mechanisms are working? For 
example, in this case, I always thought one would only hit the size 
limit if doing a bulk 'transfer' of data, not doing a lookup on one user 
- plus I could do these sorts of lookups fine on all machines involved 
(against all ldap servers).

Tina

On 23/09/10 11:20, Ashley Pittman wrote:

 On 23 Sep 2010, at 10:46, Tina Friedrich wrote:

 Hello List,

 I'm after debugging hints...

 I have a couple of users that intermittently get I/O errors when trying
 to ls a directory (as in, within half an hour, works -  doesn't work -
 works...).

 Users/groups are kept in ldap; as far as I can see/check, the ldap
 information is consistent everywhere (i.e. no replication failure or
 anything).

 I am trying to figure out what is going on here/where this is going
 wrong. Can someone give me a hint on how to debug this? Specifically,
 how does the MDS look up this sort of information? Could there be a
 'list too long' type of error involved, or something like that?

 Could you give an indication as to the number of files in the directory 
 concerned?  What is the full ls command issued (allowing for shell aliases) 
 and in the case where it works is there a large variation in the time it 
 takes when it does work?

 In terms of debugging it I'd say the log files for the client in question and 
 the MDS would be the most likely place to start.

 Ashley,



-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442



Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-23 Thread Fan Yong
  On 9/23/10 10:03 PM, Tina Friedrich wrote:
 Hi,

 thanks for the answer. I found it in the meantime; one of our ldap
 servers had a wrong size limit entry.

 The logs I had of course already looked at - they didn't yield much in
 terms of why, only what (as in, I could see it was permission errors,
 but they do of course not really tell you why you are getting them.
 There weren't any log entries that hinted at 'size limit exceeded' or
 anything.).

 Still - could someone point me to the bit in the documentation that best
 describes how the MDS queries that sort of information (group/passwd
 info, I mean)? Or how best to test that its mechanisms are working? For
 example, in this case, I always thought one would only hit the size
 limit if doing a bulk 'transfer' of data, not doing a lookup on one user
 - plus I could do these sorts of lookups fine on all machines involved
 (against all ldap servers).
The topic on the User/Group Cache Upcall may be helpful for you.
For lustre-1.8.x, it is chapter 28.1; for lustre-2.0.x, it is chapter 
29.1.
Good luck!

Cheers,
Nasf
 Tina

 On 23/09/10 11:20, Ashley Pittman wrote:
 On 23 Sep 2010, at 10:46, Tina Friedrich wrote:

 Hello List,

 I'm after debugging hints...

 I have a couple of users that intermittently get I/O errors when trying
 to ls a directory (as in, within half an hour, works -   doesn't work -
 works...).

 Users/groups are kept in ldap; as far as I can see/check, the ldap
 information is consistent everywhere (i.e. no replication failure or
 anything).

 I am trying to figure out what is going on here/where this is going
 wrong. Can someone give me a hint on how to debug this? Specifically,
 how does the MDS look up this sort of information? Could there be a
 'list too long' type of error involved, or something like that?
 Could you give an indication as to the number of files in the directory 
 concerned?  What is the full ls command issued (allowing for shell aliases) 
 and in the case where it works is there a large variation in the time it 
 takes when it does work?

 In terms of debugging it I'd say the log files for the client in question 
 and the MDS would be the most likely place to start.

 Ashley,





Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-23 Thread Andreas Dilger
On 2010-09-23, at 08:03, Tina Friedrich wrote:
 Still - could someone point me to the bit in the documentation that best 
 describes how the MDS queries that sort of information (group/passwd 
 info, I mean)? Or how best to test that its mechanisms are working? For 
 example, in this case, I always thought one would only hit the size 
 limit if doing a bulk 'transfer' of data, not doing a lookup on one user 
 - plus I could do these sorts of lookups fine on all machines involved 
 (against all ldap servers).

You can run l_getgroups -d {uid} (the utility that the MDS uses to query the 
groups database/LDAP) from the command-line.

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.
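
As a complement to l_getgroups, the following minimal Python sketch shows one way to see what a node's own name service (NSS, and hence LDAP where so configured) reports for a user, so that results can be compared between nodes. It is only a cross-check built on the standard `pwd`, `grp` and `os` modules, not the mechanism the MDS itself uses; `groups_for_uid` is a hypothetical helper name.

```python
import grp
import os
import pwd


def groups_for_uid(uid):
    """Return the sorted names of all groups NSS reports for a numeric uid."""
    user = pwd.getpwuid(uid)  # raises KeyError if the uid is unknown
    gids = os.getgrouplist(user.pw_name, user.pw_gid)
    names = []
    for gid in gids:
        try:
            names.append(grp.getgrgid(gid).gr_name)
        except KeyError:
            # A gid with no matching group entry: keep the number visible.
            names.append(str(gid))
    return sorted(names)


if __name__ == "__main__":
    # Print the groups of the user running the script.
    print(groups_for_uid(os.getuid()))
```

Running this for the same uid on the MDS and on a client, and comparing the two lists with the l_getgroups output, may help spot a node whose directory lookups are being truncated or are failing.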
