Andrej,

you are now invoking mpirun from a slurm allocation, right?
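
a quick way to double-check, assuming the allocation was created with
salloc or sbatch, is that the job id should be set in the environment:

echo $SLURM_JOB_ID    # should print a job id inside the allocation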

you can try this:

/usr/local/bin/mpirun -mca plm slurm -np 384 \
    -H node15:96,node16:96,node17:96,node18:96 \
    python testmpi.py

if it does not work, you can collect more relevant logs with

mpirun -mca plm slurm -mca plm_base_verbose 10 -np 384 \
    -H node15:96,node16:96,node17:96,node18:96 \
    python testmpi.py
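
one possible cause of the orte_plm_base_select "Not found" error quoted
below is that the slurm plm component is not available at runtime; you
can list the plm components Open MPI sees with:

ompi_info | grep plm    # slurm should appear among the plm components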

another test you can do is
srun -N 1 -n 1 orted

that is expected to fail, but it should at least find all its
dependencies and start
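
if instead orted dies right away because of missing shared libraries, an
ldd check on a compute node should show what it cannot resolve (using the
/usr/local prefix from the configure line quoted below):

srun -N 1 -n 1 ldd /usr/local/bin/orted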


Cheers,

Gilles

On Tue, Feb 2, 2021 at 12:32 AM Andrej Prsa via devel
<devel@lists.open-mpi.org> wrote:
>
> Alright, I rebuilt mpirun and it's working on a local machine. But now
> I'm back to my original problem: running this works:
>
> mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96
> python testmpi.py
>
> but running this doesn't:
>
> mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96
> python testmpi.py
>
> Here's the verbose output from the latter command:
>
> andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca
> pmix_base_verbose 10 -mca plm slurm -np 384 -H
> node15:96,node16:96,node17:96,node18:96 python testmpi.py
> [terra:387112] mca: base: components_register: registering framework ess
> components
> [terra:387112] mca: base: components_register: found loaded component slurm
> [terra:387112] mca: base: components_register: component slurm has no
> register or open function
> [terra:387112] mca: base: components_register: found loaded component env
> [terra:387112] mca: base: components_register: component env has no
> register or open function
> [terra:387112] mca: base: components_register: found loaded component pmi
> [terra:387112] mca: base: components_register: component pmi has no
> register or open function
> [terra:387112] mca: base: components_register: found loaded component tool
> [terra:387112] mca: base: components_register: component tool register
> function successful
> [terra:387112] mca: base: components_register: found loaded component hnp
> [terra:387112] mca: base: components_register: component hnp has no
> register or open function
> [terra:387112] mca: base: components_register: found loaded component
> singleton
> [terra:387112] mca: base: components_register: component singleton
> register function successful
> [terra:387112] mca: base: components_open: opening ess components
> [terra:387112] mca: base: components_open: found loaded component slurm
> [terra:387112] mca: base: components_open: component slurm open function
> successful
> [terra:387112] mca: base: components_open: found loaded component env
> [terra:387112] mca: base: components_open: component env open function
> successful
> [terra:387112] mca: base: components_open: found loaded component pmi
> [terra:387112] mca: base: components_open: component pmi open function
> successful
> [terra:387112] mca: base: components_open: found loaded component tool
> [terra:387112] mca: base: components_open: component tool open function
> successful
> [terra:387112] mca: base: components_open: found loaded component hnp
> [terra:387112] mca: base: components_open: component hnp open function
> successful
> [terra:387112] mca: base: components_open: found loaded component singleton
> [terra:387112] mca: base: components_open: component singleton open
> function successful
> [terra:387112] mca:base:select: Auto-selecting ess components
> [terra:387112] mca:base:select:(  ess) Querying component [slurm]
> [terra:387112] mca:base:select:(  ess) Querying component [env]
> [terra:387112] mca:base:select:(  ess) Querying component [pmi]
> [terra:387112] mca:base:select:(  ess) Querying component [tool]
> [terra:387112] mca:base:select:(  ess) Querying component [hnp]
> [terra:387112] mca:base:select:(  ess) Query of component [hnp] set
> priority to 100
> [terra:387112] mca:base:select:(  ess) Querying component [singleton]
> [terra:387112] mca:base:select:(  ess) Selected component [hnp]
> [terra:387112] mca: base: close: component slurm closed
> [terra:387112] mca: base: close: unloading component slurm
> [terra:387112] mca: base: close: component env closed
> [terra:387112] mca: base: close: unloading component env
> [terra:387112] mca: base: close: component pmi closed
> [terra:387112] mca: base: close: unloading component pmi
> [terra:387112] mca: base: close: component tool closed
> [terra:387112] mca: base: close: unloading component tool
> [terra:387112] mca: base: close: component singleton closed
> [terra:387112] mca: base: close: unloading component singleton
> --------------------------------------------------------------------------
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
>
>    orte_plm_base_select failed
>    --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --------------------------------------------------------------------------
>
> This was the exact problem that prompted me to try to upgrade from
> 4.0.3 to 4.1.0. Open MPI 4.1.0 (in debug mode, with internal pmix) is now
> installed on the head node and on all compute nodes.
>
> I'd appreciate any ideas on what to try to overcome this.
>
> Cheers,
> Andrej
>
>
> On 2/1/21 9:57 AM, Andrej Prsa wrote:
> > Hi Gilles,
> >
> >> that's odd, there should be a mca_pmix_pmix3x.so (assuming you built
> >> with the internal pmix)
> >
> > Ah, I didn't -- I linked against the latest git pmix; here's the
> > configure line:
> >
> > ./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm
> > --without-tm --without-moab --without-singularity --without-fca
> > --without-hcoll --without-ime --without-lustre --without-psm
> > --without-psm2 --without-mxm --with-gnu-ld --enable-debug
> >
> > I'll try nuking the install again and configuring it to use internal
> > pmix.
> >
> > Cheers,
> > Andrej
> >
>
