Hi,

We hit the same issue: up to 30,000 entries per day in the slurmctld log.

When we first used SL6 (Scientific Linux 6), we had massive problems with sssd, which crashed often. We therefore decided to get rid of sssd and fill /etc/passwd and /etc/group ourselves via a cron job.

So yes, we have an LDAP server, but it can't be the issue in our case, since user and group lookups are done locally.
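
One quick way to check that claim is to time the lookups directly. This is only a minimal sketch, assuming Python 3 on the slurmctld host. The helper names time_lookups and slow_lookups are made up for illustration, and the 10-second threshold simply mirrors the default message timeout mentioned in the quoted mail below:

```python
import grp
import os
import pwd
import time


def time_lookups(lookup, key, repeats=3):
    """Call lookup(key) `repeats` times and return each call's duration in seconds."""
    durations = []
    for _ in range(repeats):
        start = time.perf_counter()
        try:
            lookup(key)
        except KeyError:
            # Unknown user/group: the resolver was still consulted,
            # so the measured time is still meaningful.
            pass
        durations.append(time.perf_counter() - start)
    return durations


def slow_lookups(durations, threshold=10.0):
    """Return the durations at or above `threshold` seconds."""
    return [d for d in durations if d >= threshold]


if __name__ == "__main__":
    # The current user and primary group stand in for whatever
    # users and groups slurmctld actually has to resolve.
    for label, lookup, key in (
        ("passwd", pwd.getpwuid, os.getuid()),
        ("group", grp.getgrgid, os.getgid()),
    ):
        times = time_lookups(lookup, key)
        print(label, ["%.4fs" % t for t in times], "slow:", len(slow_lookups(times)))
```

If lookups really are served from local files, each call should return almost instantly; durations approaching the threshold would point back at the name service.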

Best
Marcus

On 6/12/19 3:36 PM, Christopher Benjamin Coffey wrote:
Hi, you may want to look into increasing the sssd cache length on the nodes,
and improving the network connectivity to your ldap directory. I recall when
playing with sssd in the past that it wasn't actually caching. Verify with
tcpdump, and "ls -l" through a directory. Once the uid/gid is resolved, it
shouldn't be hitting the directory anymore till the cache expires.

Do the nodes NAT through the head node?

Best,
Chris

--
Christopher Coffey
High-Performance Computing
Northern Arizona University
928-523-1167

On 6/12/19, 1:56 AM, "slurm-users on behalf of Bjørn-Helge Mevik"
<slurm-users-boun...@lists.schedmd.com on behalf of b.h.me...@usit.uio.no> wrote:

     Another possible cause (we currently see it on one of our clusters):
     delays in ldap lookups.

     We have sssd on the machines, and occasionally, when sssd contacts the
     ldap server, it takes 5 or 10 seconds (or even 15) before it gets an
     answer.  If that happens because slurmctld is trying to look up some
     user or group, etc, client commands depending on it will hang.  The
     default message timeout is 10 seconds, so if the delay is more than
     that, you get the timeout error.

     We don't know why the delays are happening, but while we are debugging
     it, we've increased the MessageTimeout, which seems to have reduced the
     problem a bit.  We're also experimenting with GroupUpdateForce and
     GroupUpdateTime to reduce the number of times slurmctld needs to ask
     about groups, but I'm unsure how much that helps.

     --
     Bjørn-Helge Mevik, dr. scient,
     Department for Research Computing, University of Oslo

--
Marcus Wagner, Dipl.-Inf.

IT Center
Abteilung: Systeme und Betrieb
RWTH Aachen University
Seffenter Weg 23
52074 Aachen
Tel: +49 241 80-24383
Fax: +49 241 80-624383
wag...@itc.rwth-aachen.de
www.itc.rwth-aachen.de

