Try:

export SLURM_OVERLAP=1
export SLURM_WHOLE=1

before your salloc and see if that helps. I have seen some MPI issues
that were resolved with that.
You can also try it using just the regular mpirun on the allocated
nodes. That will give you another data point as well.
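For example, the full sequence would look roughly like this (a sketch
only; the node and task counts match the example in your message below,
and the plain mpirun line assumes your OpenMPI was built with slurm
support so it picks up the allocation on its own):

export SLURM_OVERLAP=1
export SLURM_WHOLE=1
salloc -N 4 --ntasks-per-node=96    # request the allocation after setting the variables
mpirun -np 384 python testmpi.py    # regular mpirun inside the allocation, no -H list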
Brian Andrus
On 2/4/2021 4:55 PM, Andrej Prsa wrote:
Hi Brian,
Thanks for your response!
Did you compile slurm with mpi support?
Yep:
andrej@terra:~$ srun --mpi=list
srun: MPI types are...
srun: cray_shasta
srun: none
srun: pmi2
srun: pmix
srun: pmix_v4
Your MPI libraries should be the same as that version and they should
be available in the same locations on all nodes. Also, ensure they
are accessible (PATH, LD_LIBRARY_PATH, etc. are set).
They are: I have openmpi-4.1.0 installed cluster-wide. Running jobs
via rsh across multiple nodes works just fine, but running them
through slurm (within salloc) does not:
mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py    # works
mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py  # doesn't work
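For reference, the slurm-native equivalent inside the same allocation
would be something along these lines (a sketch, using the pmix plugin
from the --mpi=list output above):

srun --mpi=pmix -n 384 python testmpi.py    # launch through slurm's pmix plugin instead of mpirun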
Thus, I believe that MPI itself works just fine. I ran this by the
ompi-devel folks and they are convinced that the issue lies in the
slurm configuration. I'm trying to figure out what is causing this
error to pop up in the logs:
mpi/pmix: ERROR: Cannot bind() UNIX socket
/var/spool/slurmd/stepd.slurm.pmix.841.0: Address already in use (98)
I wonder if the culprit is how srun calls openmpi's --bind-to?
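One thing worth checking (a diagnostic sketch; the spool path comes
straight from the log line above) is whether a stale PMIx socket from
an earlier step is still present on the affected node:

ls -l /var/spool/slurmd/ | grep pmix    # a leftover stepd.slurm.pmix.* file could explain "Address already in use"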
Thanks again,
Andrej