The Slurm launch component only disqualifies itself when it doesn't see a Slurm allocation, i.e., when there is no SLURM_JOBID in the environment. If you want to use mpirun in a Slurm cluster, you need to:
1. get an allocation from Slurm using "salloc"
2. then run "mpirun"

Did you remember to get the allocation first?

> On Feb 1, 2021, at 4:04 PM, Andrej Prsa via devel <devel@lists.open-mpi.org> wrote:
>
> Hi Ralph, Gilles,
>
>> I fail to understand why you continue to think that PMI has anything to do
>> with this problem. I see no indication of a PMIx-related issue in anything
>> you have provided to date.
>
> Oh, I went off the traceback that yelled about pmix, and slurm not being able
> to find it until I patched the latest version; I'm an astrophysicist
> pretending to be a sys admin for our research cluster, so while I can hold my
> ground with c, python and technical computing, I'm out of my depths when it
> comes to mpi, pmix, slurm and all that good stuff. So I appreciate your
> patience. I am trying though. :)
>
>> In the output below, it is clear what the problem is - you locked it to the
>> "slurm" launcher (with -mca plm slurm) and the "slurm" launcher was not
>> found. Try adding "--mca plm_base_verbose 10" to your cmd line and let's see
>> why that launcher wasn't accepted.
>
> andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
> [terra:168998] mca: base: components_register: registering framework plm components
> [terra:168998] mca: base: components_register: found loaded component slurm
> [terra:168998] mca: base: components_register: component slurm register function successful
> [terra:168998] mca: base: components_open: opening plm components
> [terra:168998] mca: base: components_open: found loaded component slurm
> [terra:168998] mca: base: components_open: component slurm open function successful
> [terra:168998] mca:base:select: Auto-selecting plm components
> [terra:168998] mca:base:select:( plm) Querying component [slurm]
> [terra:168998] mca:base:select:( plm) No component selected!
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort. There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems. This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>   orte_plm_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
>
> Gilles, I did try all the suggestions from the previous email but that led me
> to think that slurm is the culprit, and now I'm back to openmpi.
>
> Cheers,
> Andrej
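
P.S. If it helps, the condition described at the top of this message (no SLURM_JOBID in the environment) can be checked from the same shell before calling mpirun. This is only a minimal sketch: the function name in_slurm_allocation is made up for illustration, and it merely approximates the launcher's check by looking for the variable; it is not the Open MPI code itself.

```python
import os


def in_slurm_allocation(env=None):
    """Return True if SLURM_JOBID is present in the given environment.

    Hypothetical helper: it only approximates the launcher's check by
    testing whether the variable is set, using os.environ by default.
    """
    env = os.environ if env is None else env
    return "SLURM_JOBID" in env


if __name__ == "__main__":
    if in_slurm_allocation():
        # Set by Slurm inside an allocation, e.g. a shell started by salloc.
        print("SLURM_JOBID=%s: running inside a Slurm allocation"
              % os.environ["SLURM_JOBID"])
    else:
        print("No SLURM_JOBID found: get an allocation with salloc, "
              "then run mpirun")
```

Running it once on the login node and once inside a shell started by salloc should show the difference.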