Bug#978022: libopenmpi3 Runtime failure opal_pmix_base_select failed
Hi,

On 24/12/20 at 17:16 +0100, Michael Banck wrote:
> Package: libopenmpi3
> Version: 3.1.3-11
> Severity: serious
>
> Even with the fixed libpmix2_4.0.0~rc1-2, I am getting runtime failures
> trying to run MPI programs, e.g. the nwchem autopkgtests all fail like
> this:

A simple way to reproduce is:

$ mpiexec -n 1 true
[groff:16932] [[40958,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 320
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort. There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems. This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  opal_pmix_base_select failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--

It happens with those versions:

$ dpkg -l |grep -e openmpi -e pmi
ii  libopenmpi3:amd64  4.1.0-1      amd64  high performance message passing library -- shared library
ii  libpmix2:amd64     4.0.0~rc1-2  amd64  Process Management Interface (Exascale) library
ii  openmpi-bin        4.1.0-1      amd64  high performance message passing library -- binaries
ii  openmpi-common     4.1.0-1      all    high performance message passing library -- common files

It doesn't fail after downgrading openmpi to the version in testing
(4.0.5-7)

Lucas

Bug#978022: libopenmpi3 Runtime failure opal_pmix_base_select failed
Package: libopenmpi3
Version: 3.1.3-11
Severity: serious

Even with the fixed libpmix2_4.0.0~rc1-2, I am getting runtime failures
trying to run MPI programs, e.g. the nwchem autopkgtests all fail like
this:

| Running tests/water/water_md
|
| cleaning scratch
| copying input and verified output files
| running nwchem (/usr/bin/nwchem) with 1 processors
|
| NWChem execution failed
|[kohn:13218] [[5127,0],0] ORTE_ERROR_LOG: Not found in file ../../../../../../orte/mca/ess/hnp/ess_hnp_module.c at line 320
|--
|It looks like orte_init failed for some reason; your parallel process is
|likely to abort. There are many reasons that a parallel process can
|fail during orte_init; some of which are due to configuration or
|environment problems. This failure appears to be an internal failure;
|here's some additional information (which may only be relevant to an
|Open MPI developer):
|
|  opal_pmix_base_select failed
|  --> Returned value Not found (-13) instead of ORTE_SUCCESS
|--

Not sure whether this is libopenmpi3, openmpi-bin, libpmix2 or something
else, so please reassign as needed. But at least the openmpi excuses is
full of ci.debian.net regressions:

https://qa.debian.org/excuses.php?package=openmpi

Or is there something needed on the application side, like a new
environment variable or library to be linked in?

Michael