[slurm-users] Sdiag: when does the counting of rpc start?

2019-06-12 Thread Marcelo Garcia
Hi, how should one interpret the output of "sdiag"? For example:

    [root@teta2 ~]# sdiag
    ***
    sdiag output at Wed Jun 12 17:29:38 2019
    Data since Wed Jun 12 00:00:00 2019
    ***
    (...)
    Remote Procedure
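(The "Data since" line answers the subject question: the RPC counters run from slurmctld startup, or from the last time the statistics were reset. A midnight timestamp like the one above suggests the counters are being cleared nightly. A minimal sketch of checking and resetting them, assuming sufficient privileges:)

    # Show the current statistics; "Data since" marks when counting started
    sdiag
    # Clear the counters (SlurmUser/root only); subsequent sdiag output
    # will report "Data since" as the reset time
    sdiag --reset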

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Marcus Wagner
Hi, we hit the same issue, up to 30,000 entries per day in the slurmctld log. When we first used SL6 (Scientific Linux 6), we had massive problems with sssd, which crashed often. We therefore decided to get rid of sssd and instead fill /etc/passwd and /etc/group manually via a cronjob. So, yes we
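(A minimal sketch of such a cronjob, assuming a host that can still enumerate the directory through getent; the script name and merging policy are illustrative only, not the poster's actual setup:)

    #!/bin/bash
    # /etc/cron.hourly/sync-accounts -- illustrative name
    set -euo pipefail
    # Dump users and groups as resolved through NSS (local files + directory);
    # enumeration must be enabled in the NSS backend for this to list everything
    getent passwd > /tmp/passwd.new
    getent group  > /tmp/group.new
    # Replace the live files only if the dumps are non-empty, so a failed
    # directory query does not wipe out the local account database
    [ -s /tmp/passwd.new ] && cp /tmp/passwd.new /etc/passwd
    [ -s /tmp/group.new  ] && cp /tmp/group.new  /etc/group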

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Christopher Benjamin Coffey
Hi, you may want to look into increasing the sssd cache lifetime on the nodes and improving the network connectivity to your LDAP directory. I recall from playing with sssd in the past that it wasn't actually caching. Verify with tcpdump and an "ls -l" through a directory. Once the uid/gid is
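(For reference, the cache lifetime is set in sssd.conf; a sketch with illustrative values, where the domain name and timeout are examples, not recommendations:)

    # /etc/sssd/sssd.conf
    [domain/example.com]
    # Serve identity lookups from the local cache for 24h before
    # re-querying LDAP (the default is 5400 seconds)
    entry_cache_timeout = 86400
    # Keep authenticating from cache if the LDAP server is unreachable
    cache_credentials = true

To check that caching actually works, run tcpdump on the LDAP port (389) on a node while doing the "ls -l": once entries are cached, no traffic should appear.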

[slurm-users] Rename account or move user from one account to another

2019-06-12 Thread Christoph Brüning
Hi everyone, is it somehow possible to move a user between accounts together with his/her usage, i.e., transfer the historical resource consumption from one association to another? A related question: is it possible to rename an account? While I could, of course, tamper with the
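(For the association itself, without the historical usage, sacctmgr can move a user between accounts; the names below are placeholders. To my knowledge there is no supported command for transferring accumulated usage or renaming an account, which is why the question usually ends at the database:)

    # Create the new association, then drop the old one;
    # accumulated usage stays attached to the old association
    sacctmgr add user alice account=newacct
    sacctmgr delete user name=alice account=oldacct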

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Bjørn-Helge Mevik
Another possible cause (we currently see it on one of our clusters): delays in LDAP lookups. We have sssd on the machines, and occasionally, when sssd contacts the LDAP server, it takes 5 or 10 seconds (or even 15) before it gets an answer. If that happens when slurmctld is trying to look up
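(An easy way to see whether this is happening is to time a lookup the same way slurmctld would make it, i.e. through NSS; "someuser" is a placeholder:)

    # The first call may go to LDAP; the repeat should come from the sssd cache
    time getent passwd someuser
    time getent passwd someuser
    # Optionally expire the sssd cache to measure the uncached worst case
    sss_cache -E

Several seconds on an uncached lookup is enough to push slurmctld RPCs past their timeout.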

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Marcelo Garcia
Hi Steffen, we are using Lustre as the underlying file system:

    [root@teta2 ~]# cat /proc/fs/lustre/version
    lustre: 2.7.19.11

Nothing has changed. I think this has been happening for a long time, but it used to be very sporadic and only recently became more frequent. Best Regards mg. -Original