Slurm-users,

My users here have developed a GUI application that serves as a front end to various physics codes they use. From this GUI, they can submit jobs to Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to 18.08.6-2, and a user has reported a problem when submitting Slurm jobs through this GUI app that does not occur when the same script is submitted with sbatch on the command line.

The GUI application generates the following sbatch script (non-essential information redacted or omitted):

#!/bin/tcsh

#SBATCH --job-name=XXXXXXXX

#SBATCH --ntasks=32
#SBATCH --mem=2G
#SBATCH --time=00-2:00:00
#SBATCH --partition=YYYYYYYY
#SBATCH --export=ALL
#SBATCH --output=ZZZZZZZ
#SBATCH --mail-type=FAIL
#SBATCH --mail-user=xxx...@example.com

echo "The job's id is: $SLURM_JOBID"
echo "The master node of this job is: $SLURM_SUBMIT_HOST"
echo -n 'Started job at : ' ; date
echo " "

cd .
echo "Working directory : "$PWD

module purge
module use /p/focus/modules
module load mystellopt
module list

mpirun --verbose /path/to/command /path/to/input input.stellopt

echo " "
echo -n 'Ended job at  : ' ; date
echo " "
exit

When this job is submitted from the GUI application, every line is executed except the mpirun line, which can easily be verified by looking at the output file.  When this job is submitted with sbatch on the command-line, everything works as desired.

There is no error in the output file like "mpirun: command not found:", so it appears that the mpirun command is in the PATH in the job's environment. When I added the line "which mpirun" to the sbatch script, it found the correct mpirun command to use.
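A further check along the same lines would be to dump the whole job environment for each submission path (for example by adding "env > some_file" to the script) and compare the two dumps. Below is a minimal sketch of such a comparison, run afterwards from a POSIX shell on the login node. The function name env_diff and the dump file names are my own placeholders, not anything provided by Slurm or the GUI app, and it assumes every variable fits on a single line:

```shell
# Hypothetical helper: list environment variables that differ between two
# dumps created with `env > file` inside each job.
# Usage (file names are assumptions): env_diff env.gui env.cli
env_diff() {
    a=$(mktemp) || return 1
    b=$(mktemp) || return 1
    # The job id always differs between two jobs, so drop it, then sort
    # both dumps so that comm(1) can compare them line by line.
    grep -v -e '^SLURM_JOB_ID=' -e '^SLURM_JOBID=' "$1" | LC_ALL=C sort > "$a"
    grep -v -e '^SLURM_JOB_ID=' -e '^SLURM_JOBID=' "$2" | LC_ALL=C sort > "$b"
    # -3 hides lines common to both dumps. Lines found only in the first
    # dump are printed unindented; lines found only in the second dump
    # are indented by one tab.
    LC_ALL=C comm -3 "$a" "$b"
    rm -f "$a" "$b"
}
```

Any variable that shows up here (other than the usual per-job values) would be a candidate for explaining why mpirun behaves differently under the GUI.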

When I replaced the mpirun command with an equivalent srun command, everything worked as desired, so the user can get back to work and be productive.

While srun is a suitable workaround, and is arguably the correct way to run an MPI job, I'd like to understand what is going on here. Any idea what is going wrong, or additional steps I can take to get more debug information?

The user does acknowledge that the GUI app itself could have been updated, and that this could be the cause. However, since Slurm accepts the job, the output of squeue and scontrol looks normal, and the job is submitted and runs, it looks to me like the interactions between Slurm and the GUI app are fine.

Thanks in advance for your help.

--
Prentice

