I would guess that your machine can communicate with the cluster's head node (where the slurmctld daemon executes and creates the job allocation), but not with the compute nodes (where the slurmd daemons execute and spawn your tasks). It's probably a network issue, such as a firewall or routing problem between your machine and the compute nodes.
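One rough way to test this guess from the submitting machine is to try opening TCP connections to the head node and to a compute node. The sketch below is only an illustration: the host names `head-node` and `compute-01` are hypothetical placeholders, and the ports 6817 and 6818 are Slurm's defaults for slurmctld and slurmd, so substitute whatever SlurmctldPort and SlurmdPort show in your "scontrol show config" output. It only checks the direction from your machine to the cluster. The compute nodes also need to connect back to srun, which this does not test.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


def check_targets(targets, timeout=3.0):
    """Map each (host, port) pair to True/False reachability."""
    return {(host, port): can_connect(host, port, timeout)
            for host, port in targets}


# Example usage (hypothetical host names; the ports are Slurm's defaults,
# assumed here and not taken from the poster's configuration):
#
#   targets = [("head-node", 6817),    # slurmctld
#              ("compute-01", 6818)]   # slurmd
#   for (host, port), ok in check_targets(targets).items():
#       print(f"{host}:{port} {'reachable' if ok else 'NOT reachable'}")
```

If the head node is reachable but the compute node is not, that would be consistent with the explanation above. If both are reachable, the blocked direction may be the one from the compute nodes back to your machine.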
Quoting Reza Ramazani-Rend:

> Hi,
>
> I am trying to set up a machine for submitting jobs to a cluster that uses
> slurm. But, when I try to submit a job, for example, using srun command,
> despite the job being allocated resources (for example using squeue shows
> the job running with the correct amount of resources allocated), it fails
> to run the application, and I have to terminate the srun process by a kill
> command on the local machine or use scancel to cancel the job and free the
> resources for other users. I tried to follow the instructions given on the
> mailing list for similar problems, and it seems that the machine that
> submits the job fails to receive signals from the compute node. I am
> attaching the output from “scontrol show config”, the srun command log
> (logsrunlocal from “srun -vvvvvvvvv -p partitionname date 2>&1 | tee log”),
> and the output of strace (from “strace -r -f -o logfile srun …”).
>
> Other machines on the network with similar configurations can submit jobs
> without a problem. The log file from the “srun -vvvvv…” command does not
> indicate any problems that I could see until I terminate the job to free
> the resources (for comparison, logsrun301 is the log file from a successful
> run from one of the compute nodes). The strace log, however, shows that the
> client is waiting for a signal that it never receives (line 744,
> futex(0x4724ba4,
> FUTEX_WAIT_BITSET_PRIVATE|FUTEX_CLOCK_REALTIME, 1, {1355239853, 0},
> ffffffff <unfinished ...>, and line 745, <... rt_sigtimedwait
> resumed> ) = 15).
>
> The munge daemon is running on the client, and the permissions to all the
> directories and files are set up as instructed in the installation
> document. I also thought selinux might be blocking the communications, but
> disabling it didn’t help.
>
> I was wondering if you can identify any problems that I have overlooked or
> if anything is wrong with the set-up.
>
> Thank you.
>
