Is qrsh using the SSH subsystem? Or straight rsh/rlogin? Does this happen with all users? Or a specific one?
Have you tried -verbose or setting SGE_DEBUG_LEVEL?

Ian

On Tue, Apr 22, 2014 at 7:53 AM, Prentice Bisbal <prentice.bis...@rutgers.edu> wrote:
> On 04/22/2014 03:13 AM, Mikael Brandström Durling wrote:
>>
>> On 21 Apr 2014, at 19:59, Prentice Bisbal <prentice.bis...@rutgers.edu> wrote:
>>
>>> After one of these qrsh jobs fails, I get the following e-mail:
>>>
>>> Job 5326173 caused action: Job 5326173 set to ERROR
>>> User = xxxx
>>> Queue =pow1...@yyyy.zzzz
>>> Start Time = <unknown>
>>> End Time = <unknown>
>>> failed assumedly before job:can't get password entry for user "xxxx".
>>> Either the user does not exist or NIS error!
>>>
>>>
>>> This error indicates there's something wrong with getting user
>>> information. However, I can ssh into the problematic execution hosts just
>>> fine, and when I do a 'getent passwd <username>', I get the correct results.
>>> I've gone over my PAM configuration and my /etc/nsswitch.conf
>>> configuration, but I don't see anything obviously wrong. It appears to me
>>> that sge_execd is using some other mechanism for getting user information
>>> that is not configured correctly on these hosts.
>>
>> What password/account backend are you using for your system? I have been
>> seeing it occasionally on our system, where users are authenticated using
>> winbind against an AD. My best guess after some debugging is that those
>> errors are generated when winbind for some reason returns a negative answer
>> when it can’t find the user in the cache and the lookup takes too long due
>> to network latency and/or a slow AD server. In particular, it was triggered by
>> a system we run that occasionally submits largish array jobs. When I changed
>> the user running the job to a user found in /etc/passwd, the errors were gone.
>
>
> This system is using OpenLDAP to get user information. The nodes are all
> running Scientific Linux 6, so they are using SSSD on the client side to
> provide name services and authentication.
>
>
>> To conclude, I don’t believe GridEngine is at fault for this.
>>
>
> Yes and no. This is clearly a client-side configuration error that is
> impacting GE, but if GE is the only application/service that can't get user
> information correctly, maybe there is a problem with GE. Ultimately, I was
> just asking for help in figuring out what is misconfigured on these two
> systems that is preventing GE from working on them when it works everywhere
> else.
>
> Prentice
>
> _______________________________________________
> users mailing list
> users@gridengine.org
> https://gridengine.org/mailman/listinfo/users

--
Ian Kaufman
Research Systems Administrator
UC San Diego, Jacobs School of Engineering
ikaufman AT ucsd DOT edu
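
One way to test the intermittent-lookup theory discussed in this thread is to repeat the passwd lookup many times on an affected execution host and count how often it fails or stalls, rather than relying on a single 'getent passwd'. The following is only a rough sketch: the function name probe_user_lookup and its parameters are made up for illustration, it uses Python's standard pwd module (which goes through the same NSS stack as getent), and it is not a reproduction of how sge_execd itself performs the lookup.

```python
import pwd
import time


def probe_user_lookup(username, attempts=20, delay=0.0, lookup=pwd.getpwnam):
    """Repeat a passwd lookup and report (failures, slowest_seconds).

    probe_user_lookup is a hypothetical helper, not part of Grid Engine.
    pwd.getpwnam raises KeyError when the name service returns no entry,
    which is the condition counted as a failure here.
    """
    failures = 0
    slowest = 0.0
    for _ in range(attempts):
        start = time.monotonic()
        try:
            lookup(username)
        except KeyError:
            # The name service answered "no such user" for this attempt.
            failures += 1
        # Track the slowest single lookup, successful or not.
        slowest = max(slowest, time.monotonic() - start)
        if delay:
            time.sleep(delay)
    return failures, slowest
```

Calling probe_user_lookup("xxxx", attempts=200) on a good host and on a failing host would give two numbers to compare. Any failures, or a slowest lookup of several seconds, would point at the name-service side (SSSD, winbind, LDAP) rather than at Grid Engine. Since the daemon may run with a different environment than an interactive shell, a clean result here does not rule the name service out.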