Hello list,
I just upgraded openmpi from 4.0.3 to 4.1.0 to see if it would solve a
weird openpmix problem we've been having; I configured it using:
./configure --prefix=/usr/local --with-pmix=internal --with-slurm
--without-tm --without-moab --without-singularity --without-fca
--without-hcol
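For reference, one rough way I can double-check which pmix support actually got built (assuming the freshly installed ompi_info is the one on my PATH) is:
$ ompi_info | grep -i pmix
which should list the pmix MCA component(s) the new build will use.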
Hi Ralph,
Just trying to understand - why are you saying this is a pmix problem?
Obviously, something to do with mpirun is failing, but I don't see any
indication here that it has to do with pmix.
No -- 4.0.3 had the pmix problem -- whenever I tried to submit jobs
across multiple nodes usin
Hi Gilles,
I invite you to do some cleanup
sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix
and then
sudo make install
and try again.
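As a quick check afterwards (just a sketch, assuming /usr/local/bin comes first in your PATH), make sure the freshly installed binaries are the ones being picked up:
$ which mpirun
$ mpirun --version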
Good catch! Alright, I deleted /usr/local/lib/openmpi and
/usr/local/lib/pmix, then I rebuilt (make clean; make) and installed
pmix from the latest master
Hi Gilles,
what is your mpirun command line?
is mpirun invoked from a batch allocation?
I call mpirun directly; here's a full output:
andrej@terra:~/system/tests/MPI$ mpirun --mca ess_base_verbose 10 --mca
pmix_base_verbose 10 -np 4 python testmpi.py
[terra:203257] mca: base: components_regi
Hi Gilles,
it seems only flux is a PMIx option, which is very suspicious.
can you check other components are available?
ls -l /usr/local/lib/openmpi/mca_pmix_*.so
andrej@terra:~/system/tests/MPI$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so
-rwxr-xr-x 1 root root 97488 Feb  1 08:20
/usr/local
Hi Gilles,
that's odd, there should be a mca_pmix_pmix3x.so (assuming you built
with the internal pmix)
Ah, I didn't -- I linked against the latest git pmix; here's the
configure line:
./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm
--without-tm --without-moab --without
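For what it's worth, one rough way to see which libpmix that component actually links against (assuming it landed under the default prefix) is:
$ ldd /usr/local/lib/openmpi/mca_pmix_*.so | grep -i pmix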
Alright, I rebuilt mpirun and it's working on a local machine. But now
I'm back to my original problem: running this works:
mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96
python testmpi.py
but running this doesn't:
mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96
python testmpi.py
Hi Gilles,
srun -N 1 -n 1 orted
that is expected to fail, but it should at least find all its
dependencies and start
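As a rough check of the dependencies themselves (assuming orted is on the PATH that srun inherits), you can also run:
$ ldd $(which orted) | grep 'not found'
If that prints nothing, all of orted's shared libraries resolve.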
This was quite illuminating!
andrej@terra:~/system/tests/MPI$ srun -N 1 -n 1 orted
srun: /usr/local/lib/slurm/switch_generic.so: Incompatible Slurm plugin
version (20.02.6)
s
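One rough way to compare the installed Slurm version against the plugins in that directory (paths taken from the error above; just a sketch) would be:
$ srun --version
$ ls -l /usr/local/lib/slurm/switch_*.so
A 20.02.6 plugin next to a newer Slurm would suggest leftovers from an older install.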
The saga continues.
I managed to build slurm with pmix by first patching slurm using this
patch and manually building the plugin:
https://bugs.schedmd.com/show_bug.cgi?id=10683
Now srun shows pmix as an option:
andrej@terra:~/system/tests/MPI$ srun --mpi=list
srun: MPI types are...
srun: cra
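Assuming pmix now appears in that list, a minimal smoke test before involving mpirun would be something like:
$ srun --mpi=pmix -N 2 -n 2 hostname
just to confirm the plugin actually loads.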
Hi Ralph, Gilles,
I fail to understand why you continue to think that PMI has anything to do with
this problem. I see no indication of a PMIx-related issue in anything you have
provided to date.
Oh, I went off the traceback that yelled about pmix, and slurm not being
able to find it until I
Hi Gilles,
I can reproduce this behavior ... when running outside of a slurm allocation.
I just tried from slurm (sbatch run.sh) and I get the exact same error.
What does
$ env | grep ^SLURM_
report?
Empty; no environment variables have been defined.
Thanks,
Andrej
try (and send the logs if that fails)
$ salloc -N 4 -n 384
and once you get the allocation
$ env | grep ^SLURM_
$ mpirun --mca plm_base_verbose 10 --mca plm slurm true
Cheers,
Gilles
Hi Gilles,
Here is what you can try
$ salloc -N 4 -n 384
/* and then from the allocation */
$ srun -n 1 orted
/* that should fail, but the error message can be helpful */
$ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true
andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 3
Hi Ralph,
Andrej - what version of Slurm are you using here?
It's slurm 20.11.3, i.e. the latest release afaik.
But Gilles is correct; the proposed test failed:
andrej@terra:~/system/tests/MPI$ salloc -N 2 -n 2
salloc: Granted job allocation 838
andrej@terra:~/system/tests/MPI$ srun hostname
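One more rough check, not PMIx-specific: confirm that the controller and the slurmd daemons on the compute nodes report the same version (node15 below is just taken from the host list earlier in the thread):
$ sinfo --version
$ scontrol show node node15 | grep -i version
If those disagree, the incompatible 20.02.6 plugin message earlier would be easier to explain.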