Alright, I rebuilt mpirun and it's working on a local machine. But now I'm back to my original problem: running this works:

mpirun -mca plm rsh -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py

but running this doesn't:

mpirun -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py

Here's the verbose output from the latter command:

andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca pmix_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
[terra:387112] mca: base: components_register: registering framework ess components
[terra:387112] mca: base: components_register: found loaded component slurm
[terra:387112] mca: base: components_register: component slurm has no register or open function
[terra:387112] mca: base: components_register: found loaded component env
[terra:387112] mca: base: components_register: component env has no register or open function
[terra:387112] mca: base: components_register: found loaded component pmi
[terra:387112] mca: base: components_register: component pmi has no register or open function
[terra:387112] mca: base: components_register: found loaded component tool
[terra:387112] mca: base: components_register: component tool register function successful
[terra:387112] mca: base: components_register: found loaded component hnp
[terra:387112] mca: base: components_register: component hnp has no register or open function
[terra:387112] mca: base: components_register: found loaded component singleton
[terra:387112] mca: base: components_register: component singleton register function successful
[terra:387112] mca: base: components_open: opening ess components
[terra:387112] mca: base: components_open: found loaded component slurm
[terra:387112] mca: base: components_open: component slurm open function successful
[terra:387112] mca: base: components_open: found loaded component env
[terra:387112] mca: base: components_open: component env open function successful
[terra:387112] mca: base: components_open: found loaded component pmi
[terra:387112] mca: base: components_open: component pmi open function successful
[terra:387112] mca: base: components_open: found loaded component tool
[terra:387112] mca: base: components_open: component tool open function successful
[terra:387112] mca: base: components_open: found loaded component hnp
[terra:387112] mca: base: components_open: component hnp open function successful
[terra:387112] mca: base: components_open: found loaded component singleton
[terra:387112] mca: base: components_open: component singleton open function successful
[terra:387112] mca:base:select: Auto-selecting ess components
[terra:387112] mca:base:select:(  ess) Querying component [slurm]
[terra:387112] mca:base:select:(  ess) Querying component [env]
[terra:387112] mca:base:select:(  ess) Querying component [pmi]
[terra:387112] mca:base:select:(  ess) Querying component [tool]
[terra:387112] mca:base:select:(  ess) Querying component [hnp]
[terra:387112] mca:base:select:(  ess) Query of component [hnp] set priority to 100
[terra:387112] mca:base:select:(  ess) Querying component [singleton]
[terra:387112] mca:base:select:(  ess) Selected component [hnp]
[terra:387112] mca: base: close: component slurm closed
[terra:387112] mca: base: close: unloading component slurm
[terra:387112] mca: base: close: component env closed
[terra:387112] mca: base: close: unloading component env
[terra:387112] mca: base: close: component pmi closed
[terra:387112] mca: base: close: unloading component pmi
[terra:387112] mca: base: close: component tool closed
[terra:387112] mca: base: close: unloading component tool
[terra:387112] mca: base: close: component singleton closed
[terra:387112] mca: base: close: unloading component singleton
--------------------------------------------------------------------------
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_plm_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--------------------------------------------------------------------------
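
For readers skimming the verbose output above, here is a minimal sketch (not part of Open MPI) that summarizes which components each framework queried and selected. It assumes the `mca:base:select` line format shown in this log; the function name `summarize` and the shortened `sample` text are illustrative only.

```python
import re
from collections import defaultdict

# Matches lines such as:
#   [terra:387112] mca:base:select:(  ess) Querying component [slurm]
QUERIED = re.compile(r"mca:base:select:\(\s*(\w+)\) Querying component \[(\w+)\]")
# Matches lines such as:
#   [terra:387112] mca:base:select:(  ess) Selected component [hnp]
SELECTED = re.compile(r"mca:base:select:\(\s*(\w+)\) Selected component \[(\w+)\]")


def summarize(log_text):
    """Return {framework: {"queried": [names], "selected": name or None}}."""
    summary = defaultdict(lambda: {"queried": [], "selected": None})
    for line in log_text.splitlines():
        m = QUERIED.search(line)
        if m:
            summary[m.group(1)]["queried"].append(m.group(2))
            continue
        m = SELECTED.search(line)
        if m:
            summary[m.group(1)]["selected"] = m.group(2)
    return dict(summary)


# Shortened excerpt of the log above (hypothetical input for illustration).
sample = """\
[terra:387112] mca:base:select:(  ess) Querying component [slurm]
[terra:387112] mca:base:select:(  ess) Querying component [hnp]
[terra:387112] mca:base:select:(  ess) Query of component [hnp] set priority to 100
[terra:387112] mca:base:select:(  ess) Selected component [hnp]
"""
result = summarize(sample)
print(result)
```

Run against the full output above, this would report only the ess framework, with hnp selected. No plm selection lines appear, since only ess_base_verbose and pmix_base_verbose were raised; raising plm_base_verbose in the same way would presumably show why the slurm plm component was not selected.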

This is the exact problem that prompted me to upgrade from 4.0.3 to 4.1.0. Open MPI 4.1.0 (in debug mode, with internal pmix) is now installed on the head node and on all compute nodes.

I'd appreciate any ideas on what to try to overcome this.

Cheers,
Andrej


On 2/1/21 9:57 AM, Andrej Prsa wrote:
Hi Gilles,

> that's odd, there should be a mca_pmix_pmix3x.so (assuming you built
> with the internal pmix)

Ah, I didn't -- I linked against the latest git pmix; here's the configure line:

./configure --prefix=/usr/local --with-pmix=/usr/local --with-slurm --without-tm --without-moab --without-singularity --without-fca --without-hcoll --without-ime --without-lustre --without-psm --without-psm2 --without-mxm --with-gnu-ld --enable-debug

I'll try nuking the install again and configuring it to use internal pmix.

Cheers,
Andrej

