Is qrsh using the SSH subsystem? Or straight rsh/rlogin?

Does this happen with all users? Or a specific one?

Have you tried -verbose or setting SGE_DEBUG_LEVEL?
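
Since the quoted messages below suspect lookups that only fail some of the time, it might also be worth hammering the name service directly on one of the bad hosts. Here is a rough sketch in Python. It uses only the standard pwd module; probe_user and its parameters are names I made up for illustration, and it does not claim to reproduce whatever lookup path sge_execd actually takes. It only shows whether repeated getpwnam calls on that host ever come back empty or slow.

```python
import pwd
import time


def probe_user(name, attempts=100, delay=0.0, lookup=pwd.getpwnam):
    """Look up `name` repeatedly and return (failures, slowest_seconds).

    `lookup` defaults to pwd.getpwnam, which goes through the host's
    normal NSS configuration. It is a parameter only so the function
    can be exercised with a stand-in (hypothetical helper, not SGE code).
    """
    failures = 0
    slowest = 0.0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            lookup(name)
        except KeyError:
            # pwd.getpwnam raises KeyError when no passwd entry is returned.
            failures += 1
        slowest = max(slowest, time.monotonic() - start)
        if delay:
            time.sleep(delay)
    return failures, slowest
```

For example, probe_user("xxxx", attempts=500) on a problem host versus a good one. Any nonzero failure count, or a much larger slowest time on the bad hosts, would point at the name-service side rather than at GE.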

Ian

On Tue, Apr 22, 2014 at 7:53 AM, Prentice Bisbal
<prentice.bis...@rutgers.edu> wrote:
> On 04/22/2014 03:13 AM, Mikael Brandström Durling wrote:
>>
>> On 21 Apr 2014, at 19:59, Prentice Bisbal <prentice.bis...@rutgers.edu> wrote:
>>
>>> After one of these qrsh jobs fails, I get the following e-mail:
>>>
>>> Job 5326173 caused action: Job 5326173 set to ERROR
>>> User        = xxxx
>>> Queue       =pow1...@yyyy.zzzz
>>> Start Time  = <unknown>
>>> End Time    = <unknown>
>>> failed assumedly before job:can't get password entry for user "xxxx".
>>> Either the user does not exist or NIS error!
>>>
>>>
>>> This error indicates there's something wrong with getting user
>>> information. However, I can ssh into the problematic execution hosts just
>>> fine, and when I do a 'getent passwd <username>', I get the correct results.
>>> I've gone over my PAM configuration, and my /etc/nsswitch.conf
>>> configuration, but I don't see anything obviously wrong. It appears to me
>>> that sge_execd is using some other mechanism for getting user information
>>> that is not configured correctly on these hosts.
>>
>> What password/account backend are you using for your system? I have been
>> seeing it occasionally on our system, where users are authenticated using
>> winbind against an AD. My best guess after some debugging is that those
>> errors are generated when winbind for some reason returns a negative answer
>> when it can’t find the user in the cache and the lookup takes too long due
>> to network latency and/or a slow AD server. In particular, it was triggered by
>> a system we run that occasionally submits largish array jobs. When I changed
>> the user running the job to a user found in /etc/passwd, the errors were gone.
>
>
> This system is using OpenLDAP to get user information. The nodes are all
> running Scientific Linux 6, so they are using SSSD on the client-side to
> provide name services and authentication.
>
>
>> To conclude, I don’t believe GridEngine is at fault for this.
>>
>
> Yes and no.  This is clearly a client-side configuration error that is
> impacting GE, but if GE is the only application/service that can't get user
> information correctly, maybe there is a problem with GE. Ultimately, I was
> just asking for help in figuring out what is misconfigured on these two
> systems that is preventing GE from working on them when it works everywhere
> else.
>
> Prentice
>
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users



-- 
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering ikaufman AT ucsd DOT edu
