[OMPI devel] mpirun 4.1.0 segmentation fault

2021-01-31 Thread Andrej Prsa via devel
Hello list, I just upgraded openmpi from 4.0.3 to 4.1.0 to see if it would solve a weird openpmix problem we've been having; I configured it using: ./configure --prefix=/usr/local --with-pmix=internal --with-slurm --without-tm --without-moab --without-singularity --without-fca --without-hcol
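
For reference, a minimal sketch of the full build sequence implied by that configure line (the final, truncated flag is presumably --without-hcoll, and the make invocation is an assumed standard flow, not quoted from the message):
$ ./configure --prefix=/usr/local --with-pmix=internal --with-slurm \
      --without-tm --without-moab --without-singularity --without-fca --without-hcoll
$ make -j $(nproc)
$ sudo make install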

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-01-31 Thread Andrej Prsa via devel
Hi Ralph, Just trying to understand - why are you saying this is a pmix problem? Obviously, something to do with mpirun is failing, but I don't see any indication here that it has to do with pmix. No -- 4.0.3 had the pmix problem -- whenever I tried to submit jobs across multiple nodes usin

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, I invite you to do some cleanup sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix and then sudo make install and try again. Good catch! Alright, I deleted /usr/local/lib/openmpi and /usr/local/lib/pmix, then I rebuilt (make clean; make) and installed pmix from the latest mast
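
Spelled out, the cleanup and reinstall steps being described here (all paths are the /usr/local prefix used throughout the thread):
$ sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix   # remove stale components
$ make clean
$ make
$ sudo make install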

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, what is your mpirun command line? is mpirun invoked from a batch allocation? I call mpirun directly; here's a full output: andrej@terra:~/system/tests/MPI$ mpirun --mca ess_base_verbose 10 --mca pmix_base_verbose 10 -np 4 python testmpi.py [terra:203257] mca: base: components_regi

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, it seems only flux is a PMIx option, which is very suspicious. Can you check other components are available? ls -l /usr/local/lib/openmpi/mca_pmix_*.so andrej@terra:~/system/tests/MPI$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so -rwxr-xr-x 1 root root 97488 Feb  1 08:20 /usr/local
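
A sketch of how the installed PMIx components can be cross-checked (the ls is from the message; using ompi_info to list the built MCA components is an added suggestion, not something quoted in the thread):
$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so    # dynamically loadable pmix components
$ ompi_info | grep -i "MCA pmix"                # what ompi_info believes is available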

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, that's odd, there should be a mca_pmix_pmix3x.so (assuming you built with the internal pmix). Ah, I didn't -- I linked against the latest git pmix; here's the configure line: ./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm --without-tm --without-moab --without
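
A sketch contrasting the two configure choices discussed here; only the prefix and the quoted flags come from the thread, and the external-PMIx component name (mca_pmix_ext*.so) is stated from memory of the 4.1 source tree, so treat it as an assumption:
# internal PMIx -> builds mca_pmix_pmix3x.so
$ ./configure --prefix=/usr/local --with-pmix=internal --with-slurm ...
# external PMIx installed under /usr/local -> builds mca_pmix_ext*.so instead
$ ./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm ...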

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
On Mon, Feb 1, 2021 at 11:05 PM Andrej Prsa via devel wrote: Hi Gilles, it seems only flux is a PMIx option, which is very suspicious. can you check other components are available? ls -l /usr/local/lib/openmpi/mca_pmix_*.so andrej@terra:~/system/tests/MPI$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so -rw

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Alright, I rebuilt mpirun and it's working on a local machine. But now I'm back to my original problem: running this works: mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py but running this doesn't: mpirun -mca plm slurm -np 384 -H node15:96,node16:96,
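
Spelled out, the two launch attempts being compared (both commands are from the message; the plm_base_verbose flag is an added debugging suggestion, the same one used later in the thread):
# works: rsh launcher
$ mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
# fails: slurm launcher; plm_base_verbose 10 shows what the launcher is doing
$ mpirun -mca plm slurm -mca plm_base_verbose 10 -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py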

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, "srun -N 1 -n 1 orted" -- that is expected to fail, but it should at least find all its dependencies and start. This was quite illuminating! andrej@terra:~/system/tests/MPI$ srun -N 1 -n 1 orted srun: /usr/local/lib/slurm/switch_generic.so: Incompatible Slurm plugin version (20.02.6) s
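
One generic way (not from the thread) to check whether orted can actually resolve all of its dependencies on a node:
$ which orted
$ ldd $(which orted) | grep "not found"     # any unresolved shared libraries?
$ echo $LD_LIBRARY_PATH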

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
The saga continues. I managed to build slurm with pmix by first applying the patch from this bug report and manually building the plugin: https://bugs.schedmd.com/show_bug.cgi?id=10683 Now srun shows pmix as an option: andrej@terra:~/system/tests/MPI$ srun --mpi=list srun: MPI types are... srun: cra
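
A rough sketch of the usual way to build Slurm's PMIx plugin and verify it is picked up; the exact patch and the manual plugin build referenced in the bug report are not reproduced here, and the PMIx path is an assumption:
$ ./configure --prefix=/usr/local --with-pmix=/usr/local   # from the Slurm source tree
$ make && sudo make install
$ srun --mpi=list                                          # should now list pmix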

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Ralph, Gilles, I fail to understand why you continue to think that PMI has anything to do with this problem. I see no indication of a PMIx-related issue in anything you have provided to date. Oh, I went off the traceback that yelled about pmix, and slurm not being able to find it until I

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, I can reproduce this behavior ... when running outside of a slurm allocation. I just tried from slurm (sbatch run.sh) and I get the exact same error. What does $ env | grep ^SLURM_ report? Empty; no environment variables have been defined. Thanks, Andrej

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
try (and send the logs if that fails) $ salloc -N 4 -n 384 and once you get the allocation $ env | grep ^SLURM_ $ mpirun --mca plm_base_verbose 10 --mca plm slurm true Cheers, Gilles On Tue, Feb 2, 2021 at 9:27 AM Andrej Prsa via devel wrote: Hi Gilles, I can reproduce this behavior ... whe
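
The suggested test, written out in order (all three commands appear in the message above):
$ salloc -N 4 -n 384
$ env | grep ^SLURM_                                    # confirms we are inside the allocation
$ mpirun --mca plm_base_verbose 10 --mca plm slurm true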

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Gilles, Here is what you can try $ salloc -N 4 -n 384 /* and then from the allocation */ $ srun -n 1 orted /* that should fail, but the error message can be helpful */ $ /usr/local/bin/mpirun --mca plm slurm --mca plm_base_verbose 10 true andrej@terra:~/system/tests/MPI$ salloc -N 4 -n 3

Re: [OMPI devel] mpirun 4.1.0 segmentation fault

2021-02-01 Thread Andrej Prsa via devel
Hi Ralph, Andrej - what version of Slurm are you using here? It's slurm 20.11.3, i.e. the latest release afaik. But Gilles is correct; the proposed test failed: andrej@terra:~/system/tests/MPI$ salloc -N 2 -n 2 salloc: Granted job allocation 838 andrej@terra:~/system/tests/MPI$ srun hostnam
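
A minimal Slurm sanity check along the lines of the failing test quoted here (salloc and srun hostname are from the message; the sinfo call is an added, assumed cross-check):
$ salloc -N 2 -n 2
$ srun hostname          # should print one hostname per task
$ sinfo                  # node states, in case nodes are down or drained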