Re: [slurm-users] Slurm: Socket timed out on send/recv operation - slurm 17.02.2

2019-06-25 Thread Marcelo Garcia
Hi, this looks like the problem we discussed a few days ago: https://lists.schedmd.com/pipermail/slurm-users/2019-June/003524.html But in that thread I think we were using Slurm with workflow managers. It's interesting that you have the problem after adding the second server and with an NFS share. Do you

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-14 Thread Marcelo Garcia
Hi Chris, you are right in pointing out that the job actually runs, despite the error from sbatch. The customer mentions that: === start === Problem had usual scenario - job script was submitted and executed, but sbatch command returned non-zero exit status to ecflow, which thus assumed job to b
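
The scenario described above (sbatch exits non-zero although the job was accepted) suggests that a caller should not treat this particular error as a definite failure. A minimal sketch of that idea follows, assuming the error text matches the thread subject; the function names `classify_sbatch_result` and `job_is_queued`, and the idea of looking the job up by name, are hypothetical illustrations and not something ecflow or Slurm provides.

```python
import subprocess

# Error text taken from the thread subject; assumed to appear in sbatch's stderr.
TIMEOUT_MARKER = "Socket timed out on send/recv operation"


def classify_sbatch_result(returncode: int, stderr: str) -> str:
    """Classify an sbatch invocation (hypothetical helper).

    Returns "submitted" on exit status 0, "uncertain" when the timeout
    message is present (the job may be running anyway), else "failed".
    """
    if returncode == 0:
        return "submitted"
    if TIMEOUT_MARKER in stderr:
        return "uncertain"
    return "failed"


def job_is_queued(job_name: str, user: str) -> bool:
    """Ask squeue whether a job with this name exists for this user.

    Assumes the job name is known to the caller, for example because it
    was set explicitly at submission time.
    """
    result = subprocess.run(
        ["squeue", "--noheader", "--user", user,
         "--name", job_name, "--format", "%i"],
        capture_output=True, text=True, check=False,
    )
    return result.returncode == 0 and bool(result.stdout.strip())
```

A caller could resubmit only when the result is "failed", and consult `job_is_queued` before acting when it is "uncertain", so that a job that did start is not submitted twice.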

[slurm-users] Sdiag: when does the counting of rpc start?

2019-06-12 Thread Marcelo Garcia
Hi, how should the output of "sdiag" be interpreted? For example: [root@teta2 ~]# sdiag *** sdiag output at Wed Jun 12 17:29:38 2019 Data since Wed Jun 12 00:00:00 2019 *** (...) Remote Procedure Cal
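
The header quoted above already states when the counting window began ("Data since ..."). As a small sketch, the two timestamps in that header can be parsed to get the length of the window the counters cover. The function name `sdiag_window` is hypothetical, and the regular expression assumes the header format exactly as quoted in the message (English day and month abbreviations).

```python
import re
from datetime import datetime

# Matches timestamps such as "Wed Jun 12 17:29:38 2019".
_STAMP = r"(\w{3}\s+\w{3}\s+\d{1,2}\s+\d{2}:\d{2}:\d{2}\s+\d{4})"
_HEADER_RE = re.compile(r"sdiag output at\s+" + _STAMP + r"\s+Data since\s+" + _STAMP)


def _parse(stamp: str) -> datetime:
    # Collapse repeated whitespace (e.g. "Jun  5") before parsing.
    return datetime.strptime(" ".join(stamp.split()), "%a %b %d %H:%M:%S %Y")


def sdiag_window(text: str):
    """Return (data_since, output_at) from an sdiag header, or None."""
    match = _HEADER_RE.search(text)
    if match is None:
        return None
    output_at = _parse(match.group(1))
    data_since = _parse(match.group(2))
    return data_since, output_at


header = ("*** sdiag output at Wed Jun 12 17:29:38 2019 "
          "Data since Wed Jun 12 00:00:00 2019 ***")
since, now = sdiag_window(header)
print(since)        # 2019-06-12 00:00:00
print(now - since)  # 17:29:38
```

For the header in the message, the counters therefore cover 17 hours, 29 minutes and 38 seconds.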

Re: [slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-12 Thread Marcelo Garcia
19-06-11 at 13:56:34 +, Marcelo Garcia wrote: > Hi > > Since mid-March 2019 we have been having a strange problem with Slurm. Sometimes, > the command "sbatch" fails: > > + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p > operw /home2/mma0

[slurm-users] Random "sbatch" failure: "Socket timed out on send/recv operation"

2019-06-11 Thread Marcelo Garcia
Hi, since mid-March 2019 we have been having a strange problem with Slurm. Sometimes, the command "sbatch" fails: + sbatch -o /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.1 -p operw /home2/mma002/ecf/home/Aos/Prod/Main/Postproc/Lfullpos/50.job1 sbatch: error: Batch job submission failed: