Andrej,

I fail to understand why you continue to think that PMI has anything to do with this problem. I see no indication of a PMIx-related issue in anything you have provided to date.

In the output below, it is clear what the problem is - you locked it to the "slurm" launcher (with -mca plm slurm) and the "slurm" launcher was not found. Try adding "--mca plm_base_verbose 10" to your cmd line and let's see why that launcher wasn't accepted.

> On Feb 1, 2021, at 2:47 PM, Gilles Gouaillardet via devel <[email protected]> wrote:
>
> Andrej,
>
> My previous email listed other things to try
>
> Cheers,
>
> Gilles
>
> Sent from my iPod
>
>> On Feb 2, 2021, at 6:23, Andrej Prsa via devel <[email protected]> wrote:
>>
>> The saga continues.
>>
>> I managed to build slurm with pmix by first patching slurm using this patch and manually building the plugin:
>>
>> https://bugs.schedmd.com/show_bug.cgi?id=10683
>>
>> Now srun shows pmix as an option:
>>
>> andrej@terra:~/system/tests/MPI$ srun --mpi=list
>> srun: MPI types are...
>> srun: cray_shasta
>> srun: none
>> srun: pmi2
>> srun: pmix
>> srun: pmix_v4
>>
>> But when I try to run mpirun with slurm plugin, it still fails:
>>
>> andrej@terra:~/system/tests/MPI$ mpirun -mca ess_base_verbose 10 --mca pmix_base_verbose 10 -mca plm slurm -np 384 -H node15:96,node16:96,node17:96,node18:96 python testmpi.py
>> [terra:149214] mca: base: components_register: registering framework ess components
>> [terra:149214] mca: base: components_register: found loaded component slurm
>> [terra:149214] mca: base: components_register: component slurm has no register or open function
>> [terra:149214] mca: base: components_register: found loaded component env
>> [terra:149214] mca: base: components_register: component env has no register or open function
>> [terra:149214] mca: base: components_register: found loaded component pmi
>> [terra:149214] mca: base: components_register: component pmi has no register or open function
>> [terra:149214] mca: base: components_register: found loaded component tool
>> [terra:149214] mca: base: components_register: component tool register function successful
>> [terra:149214] mca: base: components_register: found loaded component hnp
>> [terra:149214] mca: base: components_register: component hnp has no register or open function
>> [terra:149214] mca: base: components_register: found loaded component singleton
>> [terra:149214] mca: base: components_register: component singleton register function successful
>> [terra:149214] mca: base: components_open: opening ess components
>> [terra:149214] mca: base: components_open: found loaded component slurm
>> [terra:149214] mca: base: components_open: component slurm open function successful
>> [terra:149214] mca: base: components_open: found loaded component env
>> [terra:149214] mca: base: components_open: component env open function successful
>> [terra:149214] mca: base: components_open: found loaded component pmi
>> [terra:149214] mca: base: components_open: component pmi open function successful
>> [terra:149214] mca: base: components_open: found loaded component tool
>> [terra:149214] mca: base: components_open: component tool open function successful
>> [terra:149214] mca: base: components_open: found loaded component hnp
>> [terra:149214] mca: base: components_open: component hnp open function successful
>> [terra:149214] mca: base: components_open: found loaded component singleton
>> [terra:149214] mca: base: components_open: component singleton open function successful
>> [terra:149214] mca:base:select: Auto-selecting ess components
>> [terra:149214] mca:base:select:( ess) Querying component [slurm]
>> [terra:149214] mca:base:select:( ess) Querying component [env]
>> [terra:149214] mca:base:select:( ess) Querying component [pmi]
>> [terra:149214] mca:base:select:( ess) Querying component [tool]
>> [terra:149214] mca:base:select:( ess) Querying component [hnp]
>> [terra:149214] mca:base:select:( ess) Query of component [hnp] set priority to 100
>> [terra:149214] mca:base:select:( ess) Querying component [singleton]
>> [terra:149214] mca:base:select:( ess) Selected component [hnp]
>> [terra:149214] mca: base: close: component slurm closed
>> [terra:149214] mca: base: close: unloading component slurm
>> [terra:149214] mca: base: close: component env closed
>> [terra:149214] mca: base: close: unloading component env
>> [terra:149214] mca: base: close: component pmi closed
>> [terra:149214] mca: base: close: unloading component pmi
>> [terra:149214] mca: base: close: component tool closed
>> [terra:149214] mca: base: close: unloading component tool
>> [terra:149214] mca: base: close: component singleton closed
>> [terra:149214] mca: base: close: unloading component singleton
>> --------------------------------------------------------------------------
>> It looks like orte_init failed for some reason; your parallel process is
>> likely to abort. There are many reasons that a parallel process can
>> fail during orte_init; some of which are due to configuration or
>> environment problems. This failure appears to be an internal failure;
>> here's some additional information (which may only be relevant to an
>> Open MPI developer):
>>
>> orte_plm_base_select failed
>> --> Returned value Not found (-13) instead of ORTE_SUCCESS
>> --------------------------------------------------------------------------
>>
>> I'm at my wits' end what to try, and all ears if anyone has any leads or suggestions.
>>
>> Thanks,
>> Andrej
>>
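
For reference, the suggestion in the reply at the top of this message could look roughly like the sketch below: the mpirun invocation quoted above, unchanged except for the added "--mca plm_base_verbose 10". The variable name cmd is only an illustrative choice and does not come from the thread. The script prints the command for review and does not launch anything.

```shell
#!/bin/sh
# Sketch only: the original mpirun command from the quoted message, with
# the one flag suggested in the reply (--mca plm_base_verbose 10) added.
# "cmd" is a hypothetical variable name used for illustration.
cmd="mpirun -mca ess_base_verbose 10 --mca pmix_base_verbose 10 \
-mca plm slurm --mca plm_base_verbose 10 -np 384 \
-H node15:96,node16:96,node17:96,node18:96 python testmpi.py"

# Print the assembled command so it can be checked before use.
echo "$cmd"

# To launch it on the cluster (requires Open MPI on the host), run:
#   sh -c "$cmd"
```

With the extra flag in place, the verbose output should include the plm framework's own registration and selection messages, which is where the reason for rejecting the "slurm" launcher would be expected to show up.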
