Andrej,

I can reproduce this behavior when running outside of a Slurm allocation.
My guess is that the slurm launcher only selects itself when it detects a
Slurm environment (e.g. SLURM_JOBID), which would explain the
"No component selected!" message.

What does
$ env | grep ^SLURM_
report?
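
For reference (assuming a fairly standard Slurm setup, so treat this as a
sketch rather than gospel): inside an allocation that command should print
variables such as SLURM_JOB_ID, SLURM_JOB_NODELIST, and SLURM_NTASKS, and
the slurm launcher should then be able to select itself; an empty result
means mpirun is running outside of Slurm. A quick sanity check from a login
node, with the sizing guessed to match your job, might look like:

$ salloc -N 4 --ntasks-per-node=96   # starts an allocation and a shell in it
$ env | grep ^SLURM_                 # should now list SLURM_* variables
$ mpirun -np 384 python testmpi.py   # the slurm plm should now be selectable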

Cheers,

Gilles

On Tue, Feb 2, 2021 at 9:06 AM Andrej Prsa via devel
<devel@lists.open-mpi.org> wrote:
>
> Hi Ralph, Gilles,
>
> > I fail to understand why you continue to think that PMI has anything to do 
> > with this problem. I see no indication of a PMIx-related issue in anything 
> > you have provided to date.
>
> Oh, I went off the traceback that yelled about PMIx, and Slurm not being
> able to find it until I patched the latest version; I'm an
> astrophysicist pretending to be a sysadmin for our research cluster, so
> while I can hold my ground with C, Python, and technical computing, I'm
> out of my depth when it comes to MPI, PMIx, Slurm, and all that good
> stuff. So I appreciate your patience. I am trying, though. :)
>
> > In the output below, it is clear what the problem is - you locked it to the 
> > "slurm" launcher (with -mca plm slurm) and the "slurm" launcher was not 
> > found. Try adding "--mca plm_base_verbose 10" to your cmd line and let's 
> > see why that launcher wasn't accepted.
>
> andrej@terra:~/system/tests/MPI$ mpirun -mca plm_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
> [terra:168998] mca: base: components_register: registering framework plm components
> [terra:168998] mca: base: components_register: found loaded component slurm
> [terra:168998] mca: base: components_register: component slurm register function successful
> [terra:168998] mca: base: components_open: opening plm components
> [terra:168998] mca: base: components_open: found loaded component slurm
> [terra:168998] mca: base: components_open: component slurm open function successful
> [terra:168998] mca:base:select: Auto-selecting plm components
> [terra:168998] mca:base:select:(  plm) Querying component [slurm]
> [terra:168998] mca:base:select:(  plm) No component selected!
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>    orte_plm_base_select failed
>    --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
>
> Gilles, I did try all the suggestions from the previous email, but they
> led me to think that Slurm was the culprit; now I'm back to Open MPI.
>
> Cheers,
> Andrej
>
