Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-24 Thread Fan Yong
  In fact, the issue occurs when the MDS performs the upcall (handled by
default by the user-space helper "l_getgroups") to fetch the user/group
information for this RPC. Each upcall covers one UID, and all the
supplementary groups of that UID (at most sysconf(_SC_NGROUPS_MAX) of
them) are returned. The process has nothing to do with whether a single
user is involved or not. If the failure was caused by an improper (LDAP)
configuration for some user(s), you have to verify all the users one by
one.
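
To make the per-UID lookup concrete, here is a minimal Python sketch. It
is not the l_getgroups source. It assumes that supplementary groups are
found by scanning the group entries for the user's name; the function
name and the size_limit parameter are hypothetical. Under that assumption,
a truncated group listing loses memberships even when only one UID is
being resolved.

```python
import grp
import os
import pwd


def groups_for_uid(uid, group_entries=None, size_limit=None):
    """Return the sorted supplementary GIDs for one UID.

    group_entries defaults to the whole group database (grp.getgrall()).
    size_limit is a hypothetical knob that imitates a server which
    truncates the listing after that many entries.
    """
    user = pwd.getpwuid(uid)  # raises KeyError if the UID is unknown
    entries = list(grp.getgrall() if group_entries is None else group_entries)
    if size_limit is not None:
        entries = entries[:size_limit]  # what a size-limited server returns
    gids = {g.gr_gid for g in entries if user.pw_name in g.gr_mem}
    gids.discard(user.pw_gid)  # the primary group is not supplementary
    result = sorted(gids)
    max_groups = os.sysconf("SC_NGROUPS_MAX")
    # Cap the result the way the upcall is described above; skip the cap
    # if the limit is reported as indeterminate (a negative value).
    return result[:max_groups] if max_groups > 0 else result
```

Calling groups_for_uid(uid) on the MDS and on a client and comparing the
two lists is one way to spot a disagreement between their name services.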


Cheers,
Nasf

On 9/24/10 9:58 PM, Tina Friedrich wrote:
> Actually, what I hit was that one of the LDAP servers private to the MDS
> erroneously had a size limit set where the others are unlimited. They're
> round robin'd, which is why I was seeing an intermittent effect. So not a
> client issue; the clients would not have used this server for their
> lookups.
>
> Which is why I'm puzzled as to how this works, and trying to understand
> it a bit better; as I understand it, a size limit should not affect
> lookups on single users, only 'bulk' transfers of data?
>
> Tina
>
> On 24/09/10 12:35, Daniel Kobras wrote:
>> Hi!
>>
>> On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote:
>>> Cheers Andreas. I had actually found that, but there doesn't seem to be
>>> that much documentation about it. Or I didn't find it :) Plus it
>>> appeared to find the users that were problematic whenever I tried it, so
>>> I wondered if that is all there is, or if there's some other mechanism I
>>> could test for.
>> Mind that access to cached files is no longer authorized by the MDS, but
>> by the client itself. I wouldn't call it documentation, but
>> http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an
>> illustration of why this is a problem when nameservices become out of sync
>> between MDS and Lustre clients (slides 23/24). Sounds like you hit a very
>> similar issue.
>>
>> Regards,
>>
>> Daniel.
>> ___
>> Lustre-discuss mailing list
>> Lustre-discuss@lists.lustre.org
>> http://lists.lustre.org/mailman/listinfo/lustre-discuss
>>
>

___
Lustre-discuss mailing list
Lustre-discuss@lists.lustre.org
http://lists.lustre.org/mailman/listinfo/lustre-discuss


Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-24 Thread Andreas Dilger
I think there is a bit of confusion here. The MDS is doing the initial 
authorization for the file, using l_getgroups to access the group information 
from LDAP (or whatever database is used).

Daniel's point was that after the client has gotten access to the file, it will 
cache this file locally until the lock is dropped from the client. 
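
As a rough illustration of that split, here is a toy Python model. It is
not Lustre code: the class and method names are hypothetical, and real
locking and permission checks are far more involved. It only shows how a
user can be allowed or denied depending on whether the file happens to be
cached, when the MDS and the client see different group memberships.

```python
class ToyMDS:
    """Server side: decides from its own view of group membership."""

    def __init__(self, groups_by_uid):
        self.groups_by_uid = groups_by_uid

    def authorize(self, uid, file_gid):
        return file_gid in self.groups_by_uid.get(uid, set())


class ToyClient:
    """Client side: asks the MDS unless it holds a lock on the file."""

    def __init__(self, mds, groups_by_uid):
        self.mds = mds
        self.groups_by_uid = groups_by_uid
        self.locked = set()  # paths currently cached under a lock

    def can_access(self, uid, path, file_gid):
        if path in self.locked:
            # Cached: the client decides from its own name service.
            return file_gid in self.groups_by_uid.get(uid, set())
        allowed = self.mds.authorize(uid, file_gid)
        if allowed:
            self.locked.add(path)  # cache until the lock is dropped
        return allowed

    def drop_lock(self, path):
        self.locked.discard(path)
```

If the MDS view is missing a group for one user, that user is denied on an
uncached file, allowed once another user has pulled the file into the
client cache, and denied again after the lock is dropped.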

Cheers, Andreas

On 2010-09-24, at 7:58, Tina Friedrich  wrote:

> Actually, what I hit was that one of the LDAP servers private to the MDS
> erroneously had a size limit set where the others are unlimited. They're
> round robin'd, which is why I was seeing an intermittent effect. So not a
> client issue; the clients would not have used this server for their
> lookups.
>
> Which is why I'm puzzled as to how this works, and trying to understand
> it a bit better; as I understand it, a size limit should not affect
> lookups on single users, only 'bulk' transfers of data?
> 
> Tina
> 
> On 24/09/10 12:35, Daniel Kobras wrote:
>> Hi!
>> 
>> On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote:
>>> Cheers Andreas. I had actually found that, but there doesn't seem to be
>>> that much documentation about it. Or I didn't find it :) Plus it
>>> appeared to find the users that were problematic whenever I tried it, so
>>> I wondered if that is all there is, or if there's some other mechanism I
>>> could test for.
>> 
>> Mind that access to cached files is no longer authorized by the MDS, but
>> by the client itself. I wouldn't call it documentation, but
>> http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an
>> illustration of why this is a problem when nameservices become out of sync
>> between MDS and Lustre clients (slides 23/24). Sounds like you hit a very
>> similar issue.
>> 
>> Regards,
>> 
>> Daniel.
>> 
> 
> 
> -- 
> Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
> Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
> 
> -- 
> This e-mail and any attachments may contain confidential, copyright and or 
> privileged material, and are for the use of the intended addressee only. If 
> you are not the intended addressee or an authorised recipient of the 
> addressee please notify us of receipt by returning the e-mail and do not use, 
> copy, retain, distribute or disclose the information in or attached to the 
> e-mail.
> Any opinions expressed within this e-mail are those of the individual and not 
> necessarily of Diamond Light Source Ltd. 
> Diamond Light Source Ltd. cannot guarantee that this e-mail or any 
> attachments are free from viruses and we cannot accept liability for any 
> damage which you may sustain as a result of software viruses which may be 
> transmitted in or with the message.
> Diamond Light Source Limited (company no. 4375679). Registered in England and 
> Wales with its registered office at Diamond House, Harwell Science and 
> Innovation Campus, Didcot, Oxfordshire, OX11 0DE, United Kingdom


Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-24 Thread Tina Friedrich
Actually, what I hit was that one of the LDAP servers private to the MDS
erroneously had a size limit set where the others are unlimited. They're
round robin'd, which is why I was seeing an intermittent effect. So not a
client issue; the clients would not have used this server for their
lookups.

Which is why I'm puzzled as to how this works, and trying to understand
it a bit better; as I understand it, a size limit should not affect
lookups on single users, only 'bulk' transfers of data?

Tina

On 24/09/10 12:35, Daniel Kobras wrote:
> Hi!
>
> On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote:
>> Cheers Andreas. I had actually found that, but there doesn't seem to be
>> that much documentation about it. Or I didn't find it :) Plus it
>> appeared to find the users that were problematic whenever I tried it, so
>> I wondered if that is all there is, or if there's some other mechanism I
>> could test for.
>
> Mind that access to cached files is no longer authorized by the MDS, but
> by the client itself. I wouldn't call it documentation, but
> http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an
> illustration of why this is a problem when nameservices become out of sync
> between MDS and Lustre clients (slides 23/24). Sounds like you hit a very
> similar issue.
>
> Regards,
>
> Daniel.


-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442



Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-24 Thread Daniel Kobras
Hi!

On Fri, Sep 24, 2010 at 09:18:15AM +0100, Tina Friedrich wrote:
> Cheers Andreas. I had actually found that, but there doesn't seem to be 
> that much documentation about it. Or I didn't find it :) Plus it 
> appeared to find the users that were problematic whenever I tried it, so 
> I wondered if that is all there is, or if there's some other mechanism I 
> could test for.

Mind that access to cached files is no longer authorized by the MDS, but by the
client itself. I wouldn't call it documentation, but
http://wiki.lustre.org/images/b/ba/Tuesday_lustre_automotive.pdf has an
illustration of why this is a problem when nameservices become out of sync
between MDS and Lustre clients (slides 23/24). Sounds like you hit a very
similar issue.

Regards,

Daniel.


Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-24 Thread Tina Friedrich
Cheers Andreas. I had actually found that, but there doesn't seem to be 
that much documentation about it. Or I didn't find it :) Plus it 
appeared to find the users that were problematic whenever I tried it, so 
I wondered if that is all there is, or if there's some other mechanism I 
could test for.

Tina

On 23/09/10 22:25, Andreas Dilger wrote:
> On 2010-09-23, at 08:03, Tina Friedrich wrote:
>> Still - could someone point me to the bit in the documentation that best
>> describes how the MDS queries that sort of information (group/passwd
>> info, I mean)? Or how to best test that its mechanisms are working? For
>> example, in this case, I always thought one would only hit the size
>> limit if doing a bulk 'transfer' of data, not doing a lookup on one user
>> - plus I could do these sorts of lookups fine on all machines involved
>> (against all ldap servers).
>
> You can run "l_getgroups -d {uid}" (the utility that the MDS uses to query 
> the groups database/LDAP) from the command-line.
>
> Cheers, Andreas
> --
> Andreas Dilger
> Lustre Technical Lead
> Oracle Corporation Canada Inc.
>
>


-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442



Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-23 Thread Andreas Dilger
On 2010-09-23, at 08:03, Tina Friedrich wrote:
> Still - could someone point me to the bit in the documentation that best
> describes how the MDS queries that sort of information (group/passwd
> info, I mean)? Or how to best test that its mechanisms are working? For
> example, in this case, I always thought one would only hit the size
> limit if doing a bulk 'transfer' of data, not doing a lookup on one user
> - plus I could do these sorts of lookups fine on all machines involved
> (against all ldap servers).

You can run "l_getgroups -d {uid}" (the utility that the MDS uses to query the 
groups database/LDAP) from the command-line.
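
Because the symptom is intermittent, a single successful lookup may not
prove much. Below is a small Python sketch that repeats a lookup and counts
the distinct answers. It does not call l_getgroups; it resolves groups
through the host's normal name service with os.getgrouplist(), and both
function names are hypothetical. Any name-service cache on the host could
hide differences between back-end servers.

```python
import os
import pwd
from collections import Counter


def nss_groups(uid):
    """Resolve one UID's group list through the host's name service."""
    user = pwd.getpwuid(uid)  # raises KeyError if the UID cannot be resolved
    return tuple(sorted(set(os.getgrouplist(user.pw_name, user.pw_gid))))


def count_distinct_answers(lookup, uid, attempts=20):
    """Call lookup(uid) repeatedly and count each distinct answer."""
    answers = Counter()
    for _ in range(attempts):
        try:
            answers[lookup(uid)] += 1
        except KeyError:
            answers["lookup failed"] += 1
    return answers
```

Running count_distinct_answers(nss_groups, uid) on the MDS for an affected
user should yield a single entry if every lookup agrees; more than one
entry suggests that the answer depends on which server handled the query.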

Cheers, Andreas
--
Andreas Dilger
Lustre Technical Lead
Oracle Corporation Canada Inc.



Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-23 Thread Fan Yong
  On 9/23/10 10:03 PM, Tina Friedrich wrote:
> Hi,
>
> thanks for the answer. I found it in the meantime; one of our ldap
> servers had a wrong size limit entry.
>
> I had of course already looked at the logs - they didn't yield much in
> terms of why, only what (as in, I could see it was permission errors,
> but they of course don't really tell you why you are getting them;
> there weren't any log entries that hinted at 'size limit exceeded' or
> anything).
>
> Still - could someone point me to the bit in the documentation that best
> describes how the MDS queries that sort of information (group/passwd
> info, I mean)? Or how to best test that its mechanisms are working? For
> example, in this case, I always thought one would only hit the size
> limit if doing a bulk 'transfer' of data, not doing a lookup on one user
> - plus I could do these sorts of lookups fine on all machines involved
> (against all ldap servers).
The topic "User/Group Cache Upcall" may be helpful for you.
For lustre-1.8.x it is chapter 28.1; for lustre-2.0.x it is chapter
29.1.
Good luck!

Cheers,
Nasf
> Tina
>
> On 23/09/10 11:20, Ashley Pittman wrote:
>> On 23 Sep 2010, at 10:46, Tina Friedrich wrote:
>>
>>> Hello List,
>>>
>>> I'm after debugging hints...
>>>
>>> I have a couple of users that intermittently get I/O errors when trying
>>> to ls a directory (as in, within half an hour, works ->   doesn't work ->
>>> works...).
>>>
>>> Users/groups are kept in ldap; as far as I can see/check, the ldap
>>> information is consistent everywhere (i.e. no replication failure or
>>> anything).
>>>
>>> I am trying to figure out what is going on here/where this is going
>>> wrong. Can someone give me a hint on how to debug this? Specifically,
>>> how does the MDS look up this sort of information, could there be a
>>> 'list too long' type of error involved, something like that?
>> Could you give an indication as to the number of files in the directory 
>> concerned?  What is the full ls command issued (allowing for shell aliases) 
>> and in the case where it works is there a large variation in the time it 
>> takes when it does work?
>>
>> In terms of debugging it I'd say the log files for the client in question 
>> and the MDS would be the most likely place to start.
>>
>> Ashley,
>>
>



Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-23 Thread Tina Friedrich
Hi,

thanks for the answer. I found it in the meantime; one of our ldap 
servers had a wrong size limit entry.

I had of course already looked at the logs - they didn't yield much in
terms of why, only what (as in, I could see it was permission errors,
but they of course don't really tell you why you are getting them;
there weren't any log entries that hinted at 'size limit exceeded' or
anything).

Still - could someone point me to the bit in the documentation that best
describes how the MDS queries that sort of information (group/passwd
info, I mean)? Or how to best test that its mechanisms are working? For
example, in this case, I always thought one would only hit the size
limit if doing a bulk 'transfer' of data, not doing a lookup on one user
- plus I could do these sorts of lookups fine on all machines involved
(against all ldap servers).

Tina

On 23/09/10 11:20, Ashley Pittman wrote:
>
> On 23 Sep 2010, at 10:46, Tina Friedrich wrote:
>
>> Hello List,
>>
>> I'm after debugging hints...
>>
>> I have a couple of users that intermittently get I/O errors when trying
>> to ls a directory (as in, within half an hour, works ->  doesn't work ->
>> works...).
>>
>> Users/groups are kept in ldap; as far as I can see/check, the ldap
>> information is consistent everywhere (i.e. no replication failure or
>> anything).
>>
>> I am trying to figure out what is going on here/where this is going
>> wrong. Can someone give me a hint on how to debug this? Specifically,
>> how does the MDS look up this sort of information, could there be a
>> 'list too long' type of error involved, something like that?
>
> Could you give an indication as to the number of files in the directory 
> concerned?  What is the full ls command issued (allowing for shell aliases) 
> and in the case where it works is there a large variation in the time it 
> takes when it does work?
>
> In terms of debugging it I'd say the log files for the client in question and 
> the MDS would be the most likely place to start.
>
> Ashley,
>


-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442



Re: [Lustre-discuss] need help debuggin an access permission problem

2010-09-23 Thread Ashley Pittman

On 23 Sep 2010, at 10:46, Tina Friedrich wrote:

> Hello List,
> 
> I'm after debugging hints...
> 
> I have a couple of users that intermittently get I/O errors when trying 
> to ls a directory (as in, within half an hour, works -> doesn't work -> 
> works...).
> 
> Users/groups are kept in ldap; as far as I can see/check, the ldap 
> information is consistent everywhere (i.e. no replication failure or
> anything).
> 
> I am trying to figure out what is going on here/where this is going 
> wrong. Can someone give me a hint on how to debug this? Specifically, 
> how does the MDS look up this sort of information, could there be a 
> 'list too long' type of error involved, something like that?

Could you give an indication as to the number of files in the directory 
concerned?  What is the full ls command issued (allowing for shell aliases) and 
in the case where it works is there a large variation in the time it takes when 
it does work?

In terms of debugging it I'd say the log files for the client in question and 
the MDS would be the most likely place to start.

Ashley,

-- 

Ashley Pittman, Bath, UK.

Padb - A parallel job inspection tool for cluster computing
http://padb.pittman.org.uk



[Lustre-discuss] need help debuggin an access permission problem

2010-09-23 Thread Tina Friedrich
Hello List,

I'm after debugging hints...

I have a couple of users that intermittently get I/O errors when trying 
to ls a directory (as in, within half an hour, works -> doesn't work -> 
works...).

Users/groups are kept in ldap; as far as I can see/check, the ldap 
information is consistent everywhere (i.e. no replication failure or
anything).

I am trying to figure out what is going on here/where this is going 
wrong. Can someone give me a hint on how to debug this? Specifically, 
how does the MDS look up this sort of information, could there be a 
'list too long' type of error involved, something like that?

Thanks,
Tina

-- 
Tina Friedrich, Computer Systems Administrator, Diamond Light Source Ltd
Diamond House, Harwell Science and Innovation Campus - 01235 77 8442
