Andrej,

I fail to understand why you continue to think that PMI has anything to do with 
this problem. I see no indication of a PMIx-related issue in anything you have 
provided to date.

The output below shows the actual problem: you restricted mpirun to the 
"slurm" launcher (with -mca plm slurm), and that launcher was not found, which 
is why orte_plm_base_select returned "Not found (-13)". Try adding 
"--mca plm_base_verbose 10" to your command line so we can see why that 
launcher wasn't accepted.
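
As a rough sketch, reusing the host list and test script from your message, the 
invocation could look something like the following. The variable name 
MPIRUN_ARGS and the log file name plm_verbose.log are placeholders of mine, not 
anything Open MPI requires; the guard only avoids a confusing shell error if 
mpirun is not in PATH.

```shell
# Options copied from the quoted command, with plm_base_verbose added.
# MPIRUN_ARGS is just an illustrative variable name.
MPIRUN_ARGS="--mca plm_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96"

if command -v mpirun >/dev/null 2>&1; then
    # $MPIRUN_ARGS is intentionally unquoted so the shell splits it into words.
    # plm_verbose.log is a hypothetical file name for keeping the output.
    mpirun $MPIRUN_ARGS python testmpi.py 2>&1 | tee plm_verbose.log
else
    echo "mpirun not found in PATH"
fi
```

The lines mentioning "plm" in that output should show which launcher 
components were queried and why "slurm" was not selected.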


> On Feb 1, 2021, at 2:47 PM, Gilles Gouaillardet via devel 
> <devel@lists.open-mpi.org> wrote:
> 
> Andrej,
> 
> My previous email listed other things to try
> 
> Cheers,
> 
> Gilles
> 
> Sent from my iPod
> 
>> On Feb 2, 2021, at 6:23, Andrej Prsa via devel <devel@lists.open-mpi.org> 
>> wrote:
>> 
>> The saga continues.
>> 
>> I managed to build slurm with pmix by first patching slurm using this patch 
>> and manually building the plugin:
>> 
>> https://bugs.schedmd.com/show_bug.cgi?id=10683
>> 
>> Now srun shows pmix as an option:
>> 
>> andrej@terra:~/system/tests/MPI$ srun --mpi=list
>> srun: MPI types are...
>> srun: cray_shasta
>> srun: none
>> srun: pmi2
>> srun: pmix
>> srun: pmix_v4
>> 
>> But when I try to run mpirun with slurm plugin, it still fails:
>> 
>> andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca 
>> pmix_base_verbose 10 -mca plm slurm -np 384 -H 
>> node15:96,node16:96,node17:96,node18:96 python testmpi.py
>> [terra:149214] mca: base: components_register: registering framework ess 
>> components
>> [terra:149214] mca: base: components_register: found loaded component slurm
>> [terra:149214] mca: base: components_register: component slurm has no 
>> register or open function
>> [terra:149214] mca: base: components_register: found loaded component env
>> [terra:149214] mca: base: components_register: component env has no register 
>> or open function
>> [terra:149214] mca: base: components_register: found loaded component pmi
>> [terra:149214] mca: base: components_register: component pmi has no register 
>> or open function
>> [terra:149214] mca: base: components_register: found loaded component tool
>> [terra:149214] mca: base: components_register: component tool register 
>> function successful
>> [terra:149214] mca: base: components_register: found loaded component hnp
>> [terra:149214] mca: base: components_register: component hnp has no register 
>> or open function
>> [terra:149214] mca: base: components_register: found loaded component 
>> singleton
>> [terra:149214] mca: base: components_register: component singleton register 
>> function successful
>> [terra:149214] mca: base: components_open: opening ess components
>> [terra:149214] mca: base: components_open: found loaded component slurm
>> [terra:149214] mca: base: components_open: component slurm open function 
>> successful
>> [terra:149214] mca: base: components_open: found loaded component env
>> [terra:149214] mca: base: components_open: component env open function 
>> successful
>> [terra:149214] mca: base: components_open: found loaded component pmi
>> [terra:149214] mca: base: components_open: component pmi open function 
>> successful
>> [terra:149214] mca: base: components_open: found loaded component tool
>> [terra:149214] mca: base: components_open: component tool open function 
>> successful
>> [terra:149214] mca: base: components_open: found loaded component hnp
>> [terra:149214] mca: base: components_open: component hnp open function 
>> successful
>> [terra:149214] mca: base: components_open: found loaded component singleton
>> [terra:149214] mca: base: components_open: component singleton open function 
>> successful
>> [terra:149214] mca:base:select: Auto-selecting ess components
>> [terra:149214] mca:base:select:(  ess) Querying component [slurm]
>> [terra:149214] mca:base:select:(  ess) Querying component [env]
>> [terra:149214] mca:base:select:(  ess) Querying component [pmi]
>> [terra:149214] mca:base:select:(  ess) Querying component [tool]
>> [terra:149214] mca:base:select:(  ess) Querying component [hnp]
>> [terra:149214] mca:base:select:(  ess) Query of component [hnp] set priority 
>> to 100
>> [terra:149214] mca:base:select:(  ess) Querying component [singleton]
>> [terra:149214] mca:base:select:(  ess) Selected component [hnp]
>> [terra:149214] mca: base: close: component slurm closed
>> [terra:149214] mca: base: close: unloading component slurm
>> [terra:149214] mca: base: close: component env closed
>> [terra:149214] mca: base: close: unloading component env
>> [terra:149214] mca: base: close: component pmi closed
>> [terra:149214] mca: base: close: unloading component pmi
>> [terra:149214] mca: base: close: component tool closed
>> [terra:149214] mca: base: close: unloading component tool
>> [terra:149214] mca: base: close: component singleton closed
>> [terra:149214] mca: base: close: unloading component singleton
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort.  There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems.  This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>> 
>>  orte_plm_base_select failed
>>  --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>> 
>> I'm at my wits' end what to try, and all ears if anyone has any leads or 
>> suggestions.
>> 
>> Thanks,
>> Andrej
>> 

