Andrej,

The log you just posted strongly suggests that a previously built
internal PMIx (one built without --enable-debug) is still being picked up.
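
A quick way to confirm what is actually on disk (paths below assume your
default /usr/local prefix, which is what the backtrace shows):
$ ls -l /usr/local/lib/openmpi/mca_pmix_*.so
$ ldd /usr/local/lib/openmpi/mca_pmix_pmix3x.so | grep -i pmix
If the timestamps predate your --enable-debug rebuild, that stale
component is the likely culprit.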

I invite you to do some cleanup:
$ sudo rm -rf /usr/local/lib/openmpi /usr/local/lib/pmix
and then
$ sudo make install
and try again.
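
After reinstalling, it is also worth double checking that the mpirun you
invoke is the freshly installed one, for example:
$ which mpirun
$ ldd $(which mpirun) | grep -E 'open-rte|open-pal'
Both libraries should resolve under /usr/local/lib.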

If the issue persists, please post the output of the following commands:
$ env | grep ^OPAL_
$ env | grep ^PMIX_
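
Stale OPAL_* or PMIX_* variables left over from a previous install can
also influence which components get picked up (note the
opal_pmix_pmix3x_check_evars frame in your backtrace). If any show up,
you can clear them in the current shell before retrying (bash assumed):
$ unset $(env | grep -E '^(OPAL|PMIX)_' | cut -d= -f1)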

Cheers,

Gilles


On Mon, Feb 1, 2021 at 2:11 PM Andrej Prsa via devel
<devel@lists.open-mpi.org> wrote:
>
> Hi Ralph,
>
> > Just trying to understand - why are you saying this is a pmix problem? 
> > Obviously, something to do with mpirun is failing, but I don't see any 
> > indication here that it has to do with pmix.
>
> No -- 4.0.3 had the pmix problem -- whenever I tried to submit jobs
> across multiple nodes using slurm (i.e. -mca plm slurm), I'd get this:
>
> --------------------------------------------------------------------------
> An ORTE daemon has unexpectedly failed after launch and before
> communicating back to mpirun. This could be caused by a number
> of factors, including an inability to create a connection back
> to mpirun due to a lack of common network interfaces and/or no
> route found between them. Please check network connectivity
> (including firewalls and network routing requirements).
> --------------------------------------------------------------------------
>
> But the same job would work if I submitted it with rsh (i.e. -mca plm rsh).
> I read online that there were issues with CPU binding, so I thought 4.1.0
> might have resolved it.
>
> So, back to the problem at hand. I reconfigured with --enable-debug and
> this is what I get:
>
> andrej@terra:~/system/openmpi-4.1.0$ mpirun
> [terra:4145441] *** Process received signal ***
> [terra:4145441] Signal: Segmentation fault (11)
> [terra:4145441] Signal code:  (128)
> [terra:4145441] Failing at address: (nil)
> [terra:4145441] [ 0]
> /lib/x86_64-linux-gnu/libc.so.6(+0x46210)[0x7f487ebf4210]
> [terra:4145441] [ 1]
> /usr/local/lib/openmpi/mca_pmix_pmix3x.so(opal_pmix_pmix3x_check_evars+0x15c)[0x7f487a340b3c]
> [terra:4145441] [ 2]
> /usr/local/lib/openmpi/mca_pmix_pmix3x.so(pmix3x_server_init+0x496)[0x7f487a3422e6]
> [terra:4145441] [ 3]
> /usr/local/lib/libopen-rte.so.40(pmix_server_init+0x5da)[0x7f487ef2f5ec]
> [terra:4145441] [ 4]
> /usr/local/lib/openmpi/mca_ess_hnp.so(+0x58d5)[0x7f487e90a8d5]
> [terra:4145441] [ 5]
> /usr/local/lib/libopen-rte.so.40(orte_init+0x354)[0x7f487efab836]
> [terra:4145441] [ 6]
> /usr/local/lib/libopen-rte.so.40(orte_submit_init+0x123b)[0x7f487efad0cd]
> [terra:4145441] [ 7] mpirun(+0x16bc)[0x55d26c3bb6bc]
> [terra:4145441] [ 8] mpirun(+0x134d)[0x55d26c3bb34d]
> [terra:4145441] [ 9]
> /lib/x86_64-linux-gnu/libc.so.6(__libc_start_main+0xf3)[0x7f487ebd50b3]
> [terra:4145441] [10] mpirun(+0x126e)[0x55d26c3bb26e]
> [terra:4145441] *** End of error message ***
> Segmentation fault (core dumped)
>
> gdb backtrace:
>
> (gdb) r
> Starting program: /usr/local/bin/mpirun
> [Thread debugging using libthread_db enabled]
> Using host libthread_db library "/lib/x86_64-linux-gnu/libthread_db.so.1".
>
> Program received signal SIGSEGV, Segmentation fault.
> 0x00007ffff3302b3c in opal_pmix_pmix3x_check_evars () from
> /usr/local/lib/openmpi/mca_pmix_pmix3x.so
> (gdb) bt
> #0  0x00007ffff3302b3c in opal_pmix_pmix3x_check_evars () from
> /usr/local/lib/openmpi/mca_pmix_pmix3x.so
> #1  0x00007ffff33042e6 in pmix3x_server_init () from
> /usr/local/lib/openmpi/mca_pmix_pmix3x.so
> #2  0x00007ffff7ef15ec in pmix_server_init () at
> orted/pmix/pmix_server.c:296
> #3  0x00007ffff78cc8d5 in rte_init () at ess_hnp_module.c:329
> #4  0x00007ffff7f6d836 in orte_init (pargc=0x7fffffffddbc,
> pargv=0x7fffffffddb0, flags=4) at runtime/orte_init.c:271
> #5  0x00007ffff7f6f0cd in orte_submit_init (argc=1, argv=0x7fffffffe478,
> opts=0x0) at orted/orted_submit.c:570
> #6  0x00005555555556bc in orterun (argc=1, argv=0x7fffffffe478) at
> orterun.c:136
> #7  0x000055555555534d in main (argc=1, argv=0x7fffffffe478) at main.c:13
>
> This build is using the latest openpmix from github master.
>
> Thanks,
> Andrej
>