Hi John,

does it work with `srun --overlap ...` or if you do `export SLURM_OVERLAP=1` before running your interactive job?
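
Something along these lines, reusing the srun line from your 20.11.3 example below (I haven't reproduced your environment here, so this is just a sketch):

  # either request an overlapping step explicitly ...
  srun --overlap -n 2 -N 2-2 --qos=devel --partition=devel -t 00:05:00 --pty /bin/bash

  # ... or set the equivalent srun input environment variable beforehand
  export SLURM_OVERLAP=1
  srun -n 2 -N 2-2 --qos=devel --partition=devel -t 00:05:00 --pty /bin/bash

If I understand the 20.11 step-exclusivity changes correctly, the interactive step would then no longer block the allocation, so the `srun`s that `mpirun` launches under the hood should be able to start on the second node instead of hanging. You could then run your `module load ... ; mpirun hostname` loop inside that shell as before.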

Best regards
Jürgen

* John DeSantis <desan...@usf.edu> [210428 09:41]:
> Hello all,
> 
> Just an update, the following URL almost mirrors the issue we're seeing:
> https://github.com/open-mpi/ompi/issues/8378
> 
> But, SLURM 20.11.3 was shipped with the fix. I've verified that the changes
> are in the source code.
> 
> We don't want to have to downgrade SLURM to 20.02.x, but it seems that this
> behaviour still exists. Are no other sites on fresh installs of
> SLURM >= 20.11.3 experiencing this problem?
> 
> I was aware of the changes in 20.11.{0..2}, which received a lot of scrutiny,
> which is why 20.11.3 was selected.
> 
> Thanks,
> John DeSantis
> 
> On 4/26/21 5:12 PM, John DeSantis wrote:
> > Hello all,
> > 
> > We've recently (don't laugh!) updated two of our SLURM installations from
> > 16.05.10-2 to 20.11.3 and 17.11.9, respectively. Now, on the latest
> > version 20.11.3, OpenMPI doesn't seem to function in interactive mode
> > across multiple nodes as it did previously; using `srun` and `mpirun` on a
> > single node gives the desired results, while using multiple nodes causes a
> > hang. Jobs submitted via `sbatch` do _work as expected_.
> > 
> > [desantis@sclogin0 ~]$ scontrol show config |grep VERSION; srun -n 2 -N 2-2
> > -t 00:05:00 --pty /bin/bash
> > SLURM_VERSION = 17.11.9
> > [desantis@sccompute0 ~]$ for OPENMPI in mpi/openmpi/1.8.5 mpi/openmpi/2.0.4
> > mpi/openmpi/2.0.4-psm2 mpi/openmpi/2.1.6 mpi/openmpi/3.1.6
> > compilers/intel/2020_cluster_xe; do module load $OPENMPI ; which mpirun;
> > mpirun hostname; module purge; echo; done
> > /apps/openmpi/1.8.5/bin/mpirun
> > sccompute0
> > sccompute1
> > 
> > /apps/openmpi/2.0.4/bin/mpirun
> > sccompute1
> > sccompute0
> > 
> > /apps/openmpi/2.0.4-psm2/bin/mpirun
> > sccompute1
> > sccompute0
> > 
> > /apps/openmpi/2.1.6/bin/mpirun
> > sccompute0
> > sccompute1
> > 
> > /apps/openmpi/3.1.6/bin/mpirun
> > sccompute0
> > sccompute1
> > 
> > /apps/intel/2020_u2/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun
> > sccompute1
> > sccompute0
> > 
> > 
> > 15:58:28 Mon Apr 26 <0>
> > desantis@itn0
> > [~] $ scontrol show config|grep VERSION; srun -n 2 -N 2-2 --qos=devel
> > --partition=devel -t 00:05:00 --pty /bin/bash
> > SLURM_VERSION = 20.11.3
> > srun: job 1019599 queued and waiting for resources
> > srun: job 1019599 has been allocated resources
> > 15:58:46 Mon Apr 26 <0>
> > desantis@mdc-1057-30-1
> > [~] $ for OPENMPI in mpi/openmpi/1.8.5 mpi/openmpi/2.0.4
> > mpi/openmpi/2.0.4-psm2 mpi/openmpi/2.1.6 mpi/openmpi/3.1.6
> > compilers/intel/2020_cluster_xe; do module load $OPENMPI ; which mpirun;
> > mpirun hostname; module purge; echo; done
> > /apps/openmpi/1.8.5/bin/mpirun
> > ^C
> > /apps/openmpi/2.0.4/bin/mpirun
> > ^C
> > /apps/openmpi/2.0.4-psm2/bin/mpirun
> > ^C
> > /apps/openmpi/2.1.6/bin/mpirun
> > ^C
> > /apps/openmpi/3.1.6/bin/mpirun
> > ^C
> > /apps/intel/2020_u2/compilers_and_libraries_2020.2.254/linux/mpi/intel64/bin/mpirun
> > ^C[mpiexec@mdc-1057-30-1] Sending Ctrl-C to processes as requested
> > [mpiexec@mdc-1057-30-1] Press Ctrl-C again to force abort
> > ^C
> > 
> > Our SLURM installations are fairly straightforward. We `rpmbuild` directly
> > from the bzip2 files without any additional arguments. We've done this
> > since we first started using SLURM with version 14.03.3-2 and through all
> > upgrades. Due to SLURM's awesomeness(!), we've simply used the same
> > configuration files between version changes, with the only changes being
> > made to parameters which have been deprecated/renamed. Our
> > "Mpi{Default,Params}" have always been set to "none". The only real
> > difference we're able to ascertain is that the MPI plugin for openmpi has
> > been removed.
> > 
> > svc-3024-5-2: SLURM_VERSION = 16.05.10-2
> > svc-3024-5-2: srun: MPI types are...
> > svc-3024-5-2: srun: mpi/openmpi
> > svc-3024-5-2: srun: mpi/mpich1_shmem
> > svc-3024-5-2: srun: mpi/mpichgm
> > svc-3024-5-2: srun: mpi/mvapich
> > svc-3024-5-2: srun: mpi/mpich1_p4
> > svc-3024-5-2: srun: mpi/lam
> > svc-3024-5-2: srun: mpi/none
> > svc-3024-5-2: srun: mpi/mpichmx
> > svc-3024-5-2: srun: mpi/pmi2
> > 
> > viking: SLURM_VERSION = 20.11.3
> > viking: srun: MPI types are...
> > viking: srun: cray_shasta
> > viking: srun: pmi2
> > viking: srun: none
> > 
> > sclogin0: SLURM_VERSION = 17.11.9
> > sclogin0: srun: MPI types are...
> > sclogin0: srun: openmpi
> > sclogin0: srun: none
> > sclogin0: srun: pmi2
> > sclogin0:
> > 
> > As far as building OpenMPI, we've always withheld any SLURM-specific flags,
> > i.e. "--with-slurm", although during the build process SLURM is detected.
> > 
> > Because OpenMPI was always built using this method, we never had to
> > recompile OpenMPI after subsequent SLURM upgrades, and no cluster-ready
> > applications had to be rebuilt. The only time OpenMPI had to be rebuilt
> > was for OPA hardware, which was a simple addition of the "--with-psm2"
> > flag.
> > 
> > It is my understanding that the openmpi plugin "never really did anything"
> > (per perusing the mailing list), which is why it was removed. Furthermore,
> > searching the mailing list suggests that the appropriate method is to use
> > `salloc` first, despite version 17.11.9 not needing `salloc` for an
> > "interactive" session.
> > 
> > Before we go further down this rabbit hole, were other sites affected by
> > a transition from SLURM versions 16.x, 17.x, 18.x(?) to versions 20.x? If
> > so, did the methodology for multi-node interactive MPI sessions change?
> > 
> > Thanks!
> > John DeSantis

-- 
GPG A997BA7A | 87FC DA31 5F00 C885 0DC3 E28F BD0D 4B33 A997 BA7A