The Slurm launch component only disqualifies itself if it doesn't see a 
Slurm allocation, i.e., if there is no SLURM_JOBID in the environment. If you 
want to use mpirun in a Slurm cluster, you need to:

1. get an allocation from Slurm using "salloc"

2. then run "mpirun"

Did you remember to get the allocation first?
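
If you want to check this yourself before launching, you can look for that variable in the shell where you run mpirun. The snippet below is only a rough sketch of the check described above, not the actual Open MPI code; the function name slurm_job_id is made up for this example, and it only looks at SLURM_JOBID.

```python
import os


def slurm_job_id(env=None):
    """Return the Slurm job ID found in the environment, or None if absent.

    Hypothetical helper for illustration: it only mirrors the idea that a
    missing SLURM_JOBID means "no allocation", as described in this message.
    """
    env = os.environ if env is None else env
    job_id = env.get("SLURM_JOBID", "")
    # Treat an unset or empty variable as "not inside an allocation".
    return job_id or None


if __name__ == "__main__":
    job_id = slurm_job_id()
    if job_id is None:
        print("No SLURM_JOBID set: get an allocation with salloc, then run mpirun.")
    else:
        print(f"Running inside Slurm allocation {job_id}.")
```

If this reports no SLURM_JOBID in the shell where you invoke mpirun, the "slurm" launcher will not be selected there.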


> On Feb 1, 2021, at 4:04 PM, Andrej Prsa via devel <devel@lists.open-mpi.org> 
> wrote:
> 
> Hi Ralph, Gilles,
> 
>> I fail to understand why you continue to think that PMI has anything to do 
>> with this problem. I see no indication of a PMIx-related issue in anything 
>> you have provided to date.
> 
> Oh, I went off the traceback that yelled about PMIx, and Slurm not being able 
> to find it until I patched the latest version; I'm an astrophysicist 
> pretending to be a sys admin for our research cluster, so while I can hold my 
> ground with C, Python and technical computing, I'm out of my depth when it 
> comes to MPI, PMIx, Slurm and all that good stuff. So I appreciate your 
> patience. I am trying though. :)
> 
>> In the output below, it is clear what the problem is - you locked it to the 
>> "slurm" launcher (with -mca plm slurm) and the "slurm" launcher was not 
>> found. Try adding "--mca plm_base_verbose 10" to your cmd line and let's see 
>> why that launcher wasn't accepted.
> 
> andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
> [terra:168998] mca: base: components_register: registering framework plm components
> [terra:168998] mca: base: components_register: found loaded component slurm
> [terra:168998] mca: base: components_register: component slurm register function successful
> [terra:168998] mca: base: components_open: opening plm components
> [terra:168998] mca: base: components_open: found loaded component slurm
> [terra:168998] mca: base: components_open: component slurm open function successful
> [terra:168998] mca:base:select: Auto-selecting plm components
> [terra:168998] mca:base:select:(  plm) Querying component [slurm]
> [terra:168998] mca:base:select:(  plm) No component selected!
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_plm_base_select failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
> 
> Gilles, I did try all the suggestions from the previous email but that led me 
> to think that slurm is the culprit, and now I'm back to openmpi.
> 
> Cheers,
> Andrej
> 

