Andrej,

The log you just posted strongly suggests that a previously built internal PMIx (built without --enable-debug) is still being used.
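
The reasoning can be read off the gdb backtrace quoted below: frames #0 and #1 are reported as "from /usr/local/lib/openmpi/mca_pmix_pmix3x.so" with no file and line, while the libopen-rte frames carry source locations. gdb generally prints a frame that way when it finds no debug info for the library, which is consistent with a component left over from a build without --enable-debug. A rough Python sketch of that check follows; the helper name frames_without_source and the regular expression are illustrative assumptions, not part of any Open MPI or gdb tooling, and the sample frames are copied from the quoted log.

```python
import re

# Frames copied from the gdb backtrace quoted in this thread.
BACKTRACE = """\
#0 0x00007ffff3302b3c in opal_pmix_pmix3x_check_evars () from /usr/local/lib/openmpi/mca_pmix_pmix3x.so
#1 0x00007ffff33042e6 in pmix3x_server_init () from /usr/local/lib/openmpi/mca_pmix_pmix3x.so
#2 0x00007ffff7ef15ec in pmix_server_init () at orted/pmix/pmix_server.c:296
#3 0x00007ffff78cc8d5 in rte_init () at ess_hnp_module.c:329
"""

# Hypothetical pattern for a one-line gdb frame:
#   #N [0xADDR in] function (args) from <library>    (no debug info)
#   #N [0xADDR in] function (args) at <file>:<line>  (debug info present)
FRAME = re.compile(
    r"^#(\d+)\s+(?:0x[0-9a-f]+ in )?(\S+) \(.*?\)\s+(from|at)\s+(\S+)$"
)


def frames_without_source(backtrace):
    """Return (frame number, function, library) for frames lacking file:line."""
    missing = []
    for line in backtrace.splitlines():
        match = FRAME.match(line.strip())
        if match and match.group(3) == "from":
            missing.append((int(match.group(1)), match.group(2), match.group(4)))
    return missing


for num, func, lib in frames_without_source(BACKTRACE):
    print(f"#{num} {func}: no source line, loaded from {lib}")
```

Both frames it reports live in mca_pmix_pmix3x.so, which is why the installed component is the first thing to suspect.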

I invite you to do some cleanup

sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix

and then

sudo make install

and try again.

If the issue persists, please post the output of the following commands

$ env | grep ^OPAL_
$ env | grep ^PMIX_

Cheers,

Gilles

On Mon, Feb 1, 2021 at 2:11 PM Andrej Prsa via devel <devel@lists.open-mpi.org> wrote:
>
> Hi Ralph,
>
> > Just trying to understand - why are you saying this is a pmix problem?
> > Obviously, something to do with mpirun is failing, but I don't see any
> > indication here that it has to do with pmix.
>
> No -- 4.0.3 had the pmix problem -- whenever I tried to submit jobs
> across multiple nodes using slurm (i.e. -mca plm slurm), I'd get this:
>
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
>
> But the same would work if I submitted it with rsh (i.e. -mca plm rsh).
> I read online that there were issues with cpu bind so I thought 4.1.0
> might have resolved it.
>
> So, back to the problem at hand. I reconfigured with --enable-debug and
> this is what I get:
>
> andrej@terra:~/system/openmpi-4.1.0$ mpirun
> [terra:4145441] *** Process received signal ***
> [terra:4145441] Signal: Segmentation fault (11)
> [terra:4145441] Signal code: (128)
> [terra:4145441] Failing at address: (nil)
> [terra:4145441] [ 0]
> /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f487ebf4210]
> [terra:4145441] [ 1]
> /usr/local/lib/openmpi/mca_pmix_pmix3x.so(opal_pmix_pmix3x_check_evars+0x15c)[0x7f487a340b3c]
> [terra:4145441] [ 2]
> /usr/local/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_server_init+0x496)[0x7f487a3422e6]
> [terra:4145441] [ 3]
> /usr/local/lib/libopen-rte.so.40(pmix_server_init+0x5da)[0x7f487ef2f5ec]
> [terra:4145441] [ 4]
> /usr/local/lib/openmpi/mca_ess_hnp.so(+0x58d5)[0x7f487e90a8d5]
> [terra:4145441] [ 5]
> /usr/local/lib/libopen-rte.so.40(orte_init+0x354)[0x7f487efab836]
> [terra:4145441] [ 6]
> /usr/local/lib/libopen-rte.so.40(orte_submit_init+0x123b)[0x7f487efad0cd]
> [terra:4145441] [ 7] mpirun(+0x16bc)[0x55d26c3bb6bc]
> [terra:4145441] [ 8] mpirun(+0x134d)[0x55d26c3bb34d]
> [terra:4145441] [ 9]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f487ebd50b3]
> [terra:4145441] [10] mpirun(+0x126e)[0x55d26c3bb26e]
> [terra:4145441] *** End of error message ***
> Segmentation fault (core dumped)
>
> gdb backtrace:
>
> (gdb) r
> Starting program: /usr/local/bin/mpirun
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007ffff3302b3c in opal_pmix_pmix3x_check_evars () from
> /usr/local/lib/openmpi/mca_pmix_pmix3x.so
> (gdb) bt
> #0 0x00007ffff3302b3c in opal_pmix_pmix3x_check_evars () from
> /usr/local/lib/openmpi/mca_pmix_pmix3x.so
> #1 0x00007ffff33042e6 in pmix3x_server_init () from
> /usr/local/lib/openmpi/mca_pmix_pmix3x.so
> #2 0x00007ffff7ef15ec in pmix_server_init () at
> orted/pmix/pmix_server.c:296
> #3 0x00007ffff78cc8d5 in rte_init () at ess_hnp_module.c:329
> #4 0x00007ffff7f6d836 in orte_init (pargc=0x7fffffffddbc,
> pargv=0x7fffffffddb0, flags=4) at runtime/orte_init.c:271
> #5 0x00007ffff7f6f0cd in orte_submit_init (argc=1, argv=0x7fffffffe478,
> opts=0x0) at orted/orted_submit.c:570
> #6 0x00005555555556bc in orterun (argc=1, argv=0x7fffffffe478) at
> orterun.c:136
> #7 0x000055555555534d in main (argc=1, argv=0x7fffffffe478) at main.c:13
>
> This build is using the latest openpmix from github master.
>
> Thanks,
> Andrej
>