Hi Steffen

We are using Lustre as underlying file system:
[root@teta2 ~]# cat /proc/fs/lustre/version

Nothing has changed. I think this has been happening for a long time, but it 
used to be very sporadic and only recently became more frequent. 

Best Regards


-----Original Message-----
From: slurm-users [mailto:slurm-users-boun...@lists.schedmd.com] On Behalf Of 
Steffen Grunewald
Sent: Tuesday, 11 June 2019 16:28
To: Slurm User Community List <slurm-users@lists.schedmd.com>
Subject: Re: [slurm-users] Random "sbatch" failure: "Socket timed out on 
send/recv operation"

On Tue, 2019-06-11 at 13:56:34 +0000, Marcelo Garcia wrote:
> Hi 
> Since mid-March 2019 we have been having a strange problem with slurm. 
> Sometimes, the command "sbatch" fails:
> + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p 
> operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1
> sbatch: error: Batch job submission failed: Socket timed out on send/recv 
> operation

I've seen such an error message from the underlying file system.
Is there anything special (e.g. non-NFS) in your setup that may have changed
in the past few months?

Just a shot in the dark, of course...

> Ecflow runs preprocessing on the script which generates a second script that 
> is submitted to slurm. In our case, the submission script is called 
> "42.job1". 
> The problem we have is that sometimes the "sbatch" command fails with the 
> message above. We couldn't find any hint in the logs. Hardware and software 
> logs are clean. I increased the debug level of slurm to 
> # scontrol show config
> (...)
> SlurmctldDebug          = info
> But we still have no clue about what is happening. Maybe the next thing to 
> try is to use "sdiag" to inspect the server. Another complication is that the 
> problem is random, so should we put "sdiag" in a cron job? Is there a better 
> way to run "sdiag" periodically?
> Thanks for your attention.
> Best Regards
> mg.
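
The periodic "sdiag" run asked about above could be sketched roughly as
follows. This is only a minimal sketch, not a tested recipe for this cluster:
the helper name log_snapshot, the SDIAG_LOG variable, the log path and the
script path in the crontab comment are all assumptions made up for
illustration. The idea is to append a timestamped snapshot to a log file so
that the controller statistics around a failed "sbatch" can be looked up later.

```shell
#!/bin/sh
# Hypothetical helper: append a timestamped snapshot of a command's output
# to a log file. Usage: log_snapshot LOGFILE COMMAND [ARGS...]
log_snapshot() {
    log_file=$1
    shift
    {
        # Header line with the local time, so snapshots can be matched
        # against the time of a failed submission.
        printf '=== %s ===\n' "$(date '+%Y-%m-%dT%H:%M:%S%z')"
        # Run the command; keep its error output in the same log.
        "$@" 2>&1
        printf '\n'
    } >> "$log_file"
}

# Only take a snapshot where sdiag is actually installed.
# The default log path is an assumption; adjust it to the local setup.
if command -v sdiag >/dev/null 2>&1; then
    log_snapshot "${SDIAG_LOG:-/var/log/slurm/sdiag.log}" sdiag
fi

# Example crontab entry (every 5 minutes), assuming the script is saved
# under the hypothetical path /usr/local/sbin/sdiag-snapshot.sh:
# */5 * * * * /usr/local/sbin/sdiag-snapshot.sh
```

The interval is a trade-off: a shorter one gives a snapshot closer to the
failure, at the cost of a larger log and one more RPC to slurmctld per run.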

- S

Steffen Grunewald, Cluster Administrator
Max Planck Institute for Gravitational Physics (Albert Einstein Institute)
Am Mühlenberg 1 * D-14476 Potsdam-Golm * Germany
Fon: +49-331-567 7274
Mail: steffen.grunewald(at)aei.mpg.de
