Hi Steffen,

We are using Lustre as the underlying file system:

[root@teta2 ~]# cat /proc/fs/lustre/version
lustre: 2.7.19.11

Nothing has changed. I think this has been happening for a long time, but it used to be very sporadic and only recently became more frequent.

Best Regards

mg.

-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of Steffen Grunewald
Sent: Tuesday, 11 June 2019 16:28
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

On Tue, 2019-06-11 at 13:56:34 +0000, Marcelo Garcia wrote:
> Hi
>
> Since mid-March 2019 we have been having a strange problem with slurm.
> Sometimes, the command "sbatch" fails:
>
> + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p
> operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
> sbatch: error: Batch job submission failed: Socket timed out on send/recv
> operation

I've seen such an error message from the underlying file system. Is there
anything special (e.g. non-NFS) in your setup that may have changed in the
past few months?

Just a shot in the dark, of course...

> Ecflow runs preprocessing on the script, which generates a second script that
> is submitted to slurm. In our case, the submission script is called
> "42.job1".
>
> The problem we have is that sometimes the "sbatch" command fails with the
> message above. We couldn't find any hint in the logs. Hardware and software
> logs are clean. I increased the debug level of slurm to
> # scontrol show config
> (..._)
> SlurmctldDebug = info
>
> But still no clue about what is happening. Maybe the next thing to try is to
> use "sdiag" to inspect the server. Another complication is that the problem
> is random, so should we put "sdiag" in a cronjob? Is there a better way to run
> "sdiag" periodically?
>
> Thanks for your attention.
>
> Best Regards
>
> mg.
>

- S

--
Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
~~~
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
~~~
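
On the quoted question about running "sdiag" periodically: a cron job is probably the simplest option. The following is only a minimal sketch, not something taken from this thread. The script name, the log path, the 60-second timeout and the helper name `snapshot` are all assumptions made for illustration. It runs a command, appends the output with a UTC timestamp to a log file, and also records the case where the command itself hangs or is missing. A hanging "sdiag" may be informative in its own right, since it could coincide with the "sbatch" timeouts.

```python
#!/usr/bin/env python3
"""Append a timestamped snapshot of a command's output to a log file."""
import subprocess
import sys
from datetime import datetime, timezone
from pathlib import Path

TIMEOUT_S = 60  # assumption: give up on the command after one minute


def snapshot(cmd, log_file, timeout=TIMEOUT_S):
    """Run cmd, append its output (or the failure) to log_file, return a status code."""
    stamp = datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    try:
        proc = subprocess.run(cmd, capture_output=True, text=True, timeout=timeout)
        body, rc = proc.stdout + proc.stderr, proc.returncode
    except subprocess.TimeoutExpired:
        # The command did not answer in time; record that instead of its output.
        body, rc = f"timed out after {timeout}s\n", -1
    except FileNotFoundError:
        body, rc = f"command not found: {cmd[0]}\n", -2
    log_file = Path(log_file)
    log_file.parent.mkdir(parents=True, exist_ok=True)
    with log_file.open("a") as fh:
        fh.write(f"=== {stamp} rc={rc} cmd={' '.join(cmd)} ===\n{body}\n")
    return rc


if __name__ == "__main__" and len(sys.argv) == 2:
    # Usage: sdiag_snapshot.py LOG_FILE   (script name is hypothetical)
    sys.exit(0 if snapshot(["sdiag"], sys.argv[1]) == 0 else 1)
```

A crontab entry such as `*/5 * * * * /usr/bin/python3 /root/sdiag_snapshot.py /var/log/sdiag_snapshots.log` (both paths hypothetical) would then take a snapshot every five minutes, and the timestamps can be compared with the times of the failed "sbatch" calls.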