On 3/28/19 1:25 PM, Reuti wrote:

Hi,

On 22.03.2019 at 16:20, Prentice Bisbal <pbis...@pppl.gov> wrote:

On 3/21/19 6:56 PM, Reuti wrote:
On 21.03.2019 at 23:43, Prentice Bisbal wrote:

Slurm-users,

My users here have developed a GUI application that serves as a front end to 
the various physics codes they use. From this GUI, they can submit jobs to 
Slurm. On Tuesday, we upgraded Slurm from 18.08.5-2 to 18.08.6-2, and a user has 
reported problems when submitting Slurm jobs through this GUI app that do not 
occur when the same sbatch script is submitted with sbatch on the command line.

[…]
When I replaced the mpirun command with an equivalent srun command, everything 
worked as desired, so the user could get back to work and be productive.
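
Concretely, the change in the batch script was along these lines (the binary 
name here is only illustrative, not the user's actual code):

    # old launch line:
    # mpirun ./mpi_app
    # replaced with:
    srun ./mpi_app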

While srun is a suitable workaround, and is arguably the correct way to run an 
MPI job, I'd like to understand what is going on here. Any idea what is going 
wrong, or additional steps I can take to get more debug information?
Was an alias for `mpirun` introduced? It may mask the real application: even 
though `which mpirun` returns the correct path, the real binary would never be 
executed.

$ type mpirun
$ alias mpirun

run inside the job script may tell.

Unfortunately, the script is in tcsh, so the 'type' command doesn't work since 
it's a bash built-in. I did use the 'alias' command to see all the defined 
aliases, and mpirun and mpiexec are not aliased. Any other ideas?
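
For the record, the closest tcsh equivalents I know of are:

    % alias mpirun    # prints the definition if an alias named mpirun exists, nothing otherwise
    % which mpirun    # a tcsh builtin, so it does resolve aliases
    % where mpirun    # lists every alias, builtin and PATH match for mpirun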
What was the outcome of this issue – could it be solved?


The user added

#SBATCH --export=none

to his submission script to prevent any environment variables in the GUI's environment from being applied to his job. After making that change, his job worked as expected, which confirmed it was an environment issue. We compared the differences in 'env' between GUI-submitted and manually submitted jobs, and found a handful of variables that were set in the GUI environment but not present in the manual-submission environment. If memory serves me correctly, they were all Open MPI parameters.
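
In case it helps anyone else: the comparison was nothing fancier than having each job dump its environment and diffing the output. The file names below are only placeholders:

    # added temporarily to the job script, submitted once from the GUI and once by hand:
    env | sort > env.$SLURM_JOB_ID.txt

    # then, on the submit host:
    diff env.<gui_job_id>.txt env.<manual_job_id>.txt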

The user was happy using "--export=none" to fix this problem, so we didn't bother going through the tedious task of removing the environment variables one by one until we found the offending one. While still testing/debugging, I did do one run where I thought I had removed all the offending variables by unsetting them in the sbatch script, but the error still occurred, so I must have missed the one that was causing the issue.
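
That run looked roughly like the following in the tcsh job script (the variable names are illustrative; I no longer have the exact list the GUI was setting):

    # clear the Open MPI settings inherited from the GUI before launching:
    unsetenv OMPI_MCA_btl
    unsetenv OMPI_MCA_plm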

Since the user was happy with the --export=none fix, and I had other issues to fix in my queue, that's where we left it.

--
Prentice

