Hi John,

Does it work with `srun --overlap ...`, or if you do `export SLURM_OVERLAP=1`
before running your interactive job?
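As a concrete sketch (the resource arguments below are placeholders, not your
exact job; `--overlap` and the `SLURM_OVERLAP` input variable exist as of
Slurm 20.11):

```shell
# Variant 1: request the interactive step with --overlap, so that the
# steps mpirun launches later may share the allocation's resources:
srun --overlap -n 2 -N 2 -t 00:05:00 --pty /bin/bash

# Variant 2: set SLURM_OVERLAP in the environment before launching;
# srun then behaves as if --overlap had been given on the command line:
export SLURM_OVERLAP=1
srun -n 2 -N 2 -t 00:05:00 --pty /bin/bash
```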

Best regards
Jürgen


* John DeSantis <desan...@usf.edu> [210428 09:41]:
> Hello all,
> 
> Just an update, the following URL almost mirrors the issue we're seeing: 
> https://github.com/open-mpi/ompi/issues/8378
> 
> However, SLURM 20.11.3 shipped with the fix; I've verified that the changes 
> are present in the source code.
> 
> We don't want to have to downgrade SLURM to 20.02.x, but it seems that this 
> behaviour still exists.  Is no other site on a fresh install of SLURM >= 
> 20.11.3 experiencing this problem?
> 
> I was aware of the changes in 20.11.{0..2}, which received a lot of 
> scrutiny; that is why 20.11.3 was selected.
> 
> Thanks,
> John DeSantis
> 
> On 4/26/21 5:12 PM, John DeSantis wrote:
> > Hello all,
> > 
> > We've recently (don't laugh!) updated two of our SLURM installations from 
> > 16.05.10-2 to 20.11.3 and 17.11.9, respectively.  Now, on the latest 
> > version (20.11.3), OpenMPI no longer functions in interactive mode across 
> > multiple nodes as it did previously; using `srun` and `mpirun` on a single 
> > node gives the desired results, while using multiple nodes causes a hang. 
> > Jobs submitted via `sbatch` _work as expected_.
> > 
> > [desantis@sclogin0 ~]$ scontrol show config |grep VERSION; srun -n 2 -N 2-2 
> > -t 00:05:00 --pty /bin/bash
> > SLURM_VERSION           = 17.11.9
> > [desantis@sccompute0 ~]$ for OPENMPI in mpi/openmpi/1.8.5 mpi/openmpi/2.0.4 
> > mpi/openmpi/2.0.4-psm2 mpi/openmpi/2.1.6 mpi/openmpi/3.1.6 
> > compilers/intel/2020_cluster_xe; do module load $OPENMPI ; which mpirun; 
> > mpirun hostname; module purge; echo; done
> > /apps/openmpi/1.8.5/bin/mpirun
> > sccompute0
> > sccompute1
> > 
> > /apps/openmpi/2.0.4/bin/mpirun
> > sccompute1
> > sccompute0
> > 
> > /apps/openmpi/2.0.4-psm2/bin/mpirun
> > sccompute1
> > sccompute0
> > 
> > /apps/openmpi/2.1.6/bin/mpirun
> > sccompute0
> > sccompute1
> > 
> > /apps/openmpi/3.1.6/bin/mpirun
> > sccompute0
> > sccompute1
> > 
> > /apps/intel/2020_u2/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun
> > sccompute1
> > sccompute0
> > 
> > 
> > 15:58:28 Mon Apr 26 <0>
> > desantis@itn0
> > [~] $ scontrol show config|grep VERSION; srun -n 2 -N 2-2 --qos=devel 
> > --partition=devel -t 00:05:00 --pty /bin/bash
> > SLURM_VERSION           = 20.11.3
> > srun: job 1019599 queued and waiting for resources
> > srun: job 1019599 has been allocated resources
> > 15:58:46 Mon Apr 26 <0>
> > desantis@mdc-1057-30-1
> > [~] $ for OPENMPI in mpi/openmpi/1.8.5 mpi/openmpi/2.0.4 
> > mpi/openmpi/2.0.4-psm2 mpi/openmpi/2.1.6 mpi/openmpi/3.1.6 
> > compilers/intel/2020_cluster_xe; do module load $OPENMPI ; which mpirun; 
> > mpirun hostname; module purge; echo; done
> > /apps/openmpi/1.8.5/bin/mpirun
> > ^C
> > /apps/openmpi/2.0.4/bin/mpirun
> > ^C
> > /apps/openmpi/2.0.4-psm2/bin/mpirun
> > ^C
> > /apps/openmpi/2.1.6/bin/mpirun
> > ^C
> > /apps/openmpi/3.1.6/bin/mpirun
> > ^C
> > /apps/intel/2020_u2/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun
> > ^C[mpiexec@mdc-1057-30-1] Sending Ctrl-C to processes as requested
> > [mpiexec@mdc-1057-30-1] Press Ctrl-C again to force abort
> > ^C
> > 
> > Our SLURM installations are fairly straightforward.  We `rpmbuild` 
> > directly from the bzip2 files without any additional arguments.  We've 
> > done this since we first started using SLURM with version 14.03.3-2 and 
> > through all upgrades.  Due to SLURM's awesomeness(!), we've simply reused 
> > the same configuration files between versions, with the only changes 
> > being made to parameters which have been deprecated or renamed.  Our 
> > "Mpi{Default,Params}" have always been set to "none".  The only real 
> > difference we're able to ascertain is that the MPI plugin for openmpi has 
> > been removed.
> > 
> > svc-3024-5-2: SLURM_VERSION           = 16.05.10-2
> > svc-3024-5-2: srun: MPI types are...
> > svc-3024-5-2: srun: mpi/openmpi
> > svc-3024-5-2: srun: mpi/mpich1_shmem
> > svc-3024-5-2: srun: mpi/mpichgm
> > svc-3024-5-2: srun: mpi/mvapich
> > svc-3024-5-2: srun: mpi/mpich1_p4
> > svc-3024-5-2: srun: mpi/lam
> > svc-3024-5-2: srun: mpi/none
> > svc-3024-5-2: srun: mpi/mpichmx
> > svc-3024-5-2: srun: mpi/pmi2
> > 
> > viking: SLURM_VERSION           = 20.11.3
> > viking: srun: MPI types are...
> > viking: srun: cray_shasta
> > viking: srun: pmi2
> > viking: srun: none
> > 
> > sclogin0: SLURM_VERSION           = 17.11.9
> > sclogin0: srun: MPI types are...
> > sclogin0: srun: openmpi
> > sclogin0: srun: none
> > sclogin0: srun: pmi2
> > sclogin0: 
> > 
> > As far as building OpenMPI, we've always withheld any SLURM-specific 
> > flags, e.g. "--with-slurm", although SLURM is detected during the build 
> > process.
> > 
> > Because OpenMPI was always built this way, we never had to recompile 
> > OpenMPI after subsequent SLURM upgrades, and no cluster-ready 
> > applications had to be rebuilt.  The only time OpenMPI had to be rebuilt 
> > was for OPA hardware, which required only the addition of the 
> > "--with-psm2" flag.
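Just to make sure I follow, that build would amount to roughly the following
(my reconstruction, not your actual build script; the prefix path is taken
from your module output, and configure picks up SLURM support automatically):

```shell
# Hypothetical sketch of the described OpenMPI build: no explicit
# --with-slurm (autodetected); --with-psm2 added only for the OPA machines.
./configure --prefix=/apps/openmpi/3.1.6 --with-psm2
make -j8 && make install
```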
> > 
> > It is my understanding that the openmpi plugin "never really did 
> > anything" (per perusing the mailing list), which is why it was removed.  
> > Furthermore, searching the mailing list suggests that the appropriate 
> > method is to use `salloc` first, even though version 17.11.9 does not 
> > need `salloc` for an "interactive" session.
> > 
> > Before we go further down this rabbit hole, were other sites affected by 
> > the transition from SLURM versions 16.x/17.x/18.x(?) to 20.x?  If so, did 
> > the methodology for multi-node interactive MPI sessions change?
> > 
> > Thanks!
> > John DeSantis
> > 
> > 
> > 
> > 
> 

-- 
GPG A997BA7A | 87FC DA31 5F00 C885 0DC3  E28F BD0D 4B33 A997 BA7A
