Re: [OMPI devel] Segfault on MPI init

2017-02-14 Thread Jeff Squyres (jsquyres)
You should also check your paths for non-interactive remote logins and ensure 
that you are not accidentally mixing versions of Open MPI (e.g., the new 
version on your local machine and some other version on the remote machines). 
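
For example, a quick way to see what a non-interactive remote shell picks up 
(node1 here is just a placeholder hostname):

    ssh node1 'echo $PATH ; which mpirun orted'

mpirun and the orted daemon should resolve to the same Open MPI installation 
on every node.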

Sent from my phone. No type good. 

> On Feb 13, 2017, at 8:14 AM, Gilles Gouaillardet wrote:
> 
> Cyril,
> 
> Are you running your jobs via a batch manager?
> If yes, was support for it built correctly?
> 
> If you were able to get a core dump, can you post the gdb stack trace?
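> 
> If no core file shows up, check that core dumps are enabled on the compute
> nodes (e.g., in the shell or batch script that launches the job):
> 
>     ulimit -c unlimited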
> 
> I guess your nodes have several IP interfaces; you might want to try
> mpirun --mca oob_tcp_if_include eth0 ...
> (replace eth0 with something appropriate if needed)
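> 
> For example, you can first list the interfaces each node actually has
> (names vary from system to system):
> 
>     ip addr show
> 
> and, if that helps, also pin the MPI traffic itself to the same interface;
> btl_tcp_if_include is the analogous parameter for the TCP BTL:
> 
>     mpirun --mca oob_tcp_if_include eth0 --mca btl_tcp_if_include eth0 -np 2 ./a.out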
> 
> Cheers,
> 
> Gilles
> 
> Cyril Bordage wrote:
>> Unfortunately this does not complete this thread: the problem is not
>> solved! It is not an installation problem. I have no previous
>> installation, since I use separate directories, and there is nothing
>> MPI-specific in my environment; I just use the complete paths to mpicc
>> and mpirun.
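>> 
>> For example (assuming the usual bin/ layout under my install prefix, the
>> one visible in the crash output below), I run something like:
>> 
>>     /home/bordage/modules/openmpi/openmpi-debug/bin/mpicc -o hello hello.c
>>     /home/bordage/modules/openmpi/openmpi-debug/bin/mpirun -np 3 --host node1,node2,node3 ./hello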
>> 
>> The error depends on which nodes I run on. For example, I can run on node1
>> and node2, or node1 and node3, or node2 and node3, but not on node1,
>> node2, and node3 together. With the platform's official version (1.8.1) it
>> works like a charm.
>> 
>> George, maybe you could see it for yourself by connecting to our
>> platform (plafrim), since you have an account. That should make it easier
>> to understand and see our problem.
>> 
>> 
>> Cyril.
>> 
>>> On 10/02/2017 at 18:15, George Bosilca wrote:
>>> To complete this thread, the problem is now solved. Some .so files were 
>>> lingering around from a previous installation, causing startup problems.
>>> 
>>>  George.
>>> 
>>> 
 On Feb 10, 2017, at 05:38, Cyril Bordage wrote:
 
 Thank you for your answer.
 I am running the git master version (last tested was cad4c03).
 
 FYI, Clément Foyer is talking with George Bosilca about this problem.
 
 
 Cyril.
 
> On 08/02/2017 at 16:46, Jeff Squyres (jsquyres) wrote:
> What version of Open MPI are you running?
> 
> The error indicates that Open MPI is trying to start a user-level 
> helper daemon on the remote node, and that the daemon is segfaulting 
> (which is unusual).
> 
> One thing to be aware of:
> 
>https://www.open-mpi.org/faq/?category=building#install-overwrite
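> 
> In short: installing a new build over an older one in the same prefix can
> leave stale plugins (.so files) behind. A safe sequence is roughly (with
> /path/to/prefix standing in for your install prefix):
> 
>     rm -rf /path/to/prefix
>     make install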
> 
> 
> 
>> On Feb 6, 2017, at 8:14 AM, Cyril Bordage wrote:
>> 
>> Hello,
>> 
>> I cannot run a program with MPI when I compile it myself.
>> On some nodes I have the following error:
>> 
>> [mimi012:17730] *** Process received signal ***
>> [mimi012:17730] Signal: Segmentation fault (11)
>> [mimi012:17730] Signal code: Address not mapped (1)
>> [mimi012:17730] Failing at address: 0xf8
>> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x766c0500]
>> [mimi012:17730] [ 1]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7781fcb9]
>> [mimi012:17730] [ 2]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7197fbcd]
>> [mimi012:17730] [ 3]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x71981e34]
>> [mimi012:17730] [ 4]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7197bb1d]
>> [mimi012:17730] [ 5]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7782323c]
>> [mimi012:17730] [ 6]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x777c534c]
>> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x766b8851]
>> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7640694d]
>> [mimi012:17730] *** End of error message ***
>> --------------------------------------------------------------------------
>> ORTE has lost communication with its daemon located on node:
>> 
>> hostname:  mimi012
>> 
>> This is usually due to either a failure of the TCP network
>> connection to the node, or possibly an internal failure of
>> the daemon itself. We cannot recover from this failure, and
>> therefore will terminate the job.
>> --------------------------------------------------------------------------
>> 
>> 
>> The error does not appear with the official MPI installed on the
>> platform. I asked the admins about their compilation options, but there
>> is nothing particular.
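>> 
>> For reference, my build is configured for debugging; the configure line is
>> roughly along these lines (a sketch; --enable-debug is Open MPI's standard
>> debug-build flag):
>> 
>>     ./configure --prefix=$HOME/modules/openmpi/openmpi-debug --enable-debug
>>     make -j8 && make install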
>> 
>> Moreover, it appears only for some node lists. Still, the nodes seem to
>> be fine, since it works with the official version of MPI of the platform.

Re: [OMPI devel] Segfault on MPI init

2017-02-14 Thread Cyril Bordage
I have no MPI installation in my environment.
And even if there were one, would it cause this error, given that I use
the complete path for mpirun?

I finally managed to get a backtrace:
#0  0x77533f18 in _exit () from /lib64/libc.so.6
#1  0x75169d68 in rte_abort (status=-51, report=true) at
../../../../../src/orte/mca/ess/pmi/ess_pmi_module.c:494
#2  0x77b4fb9d in ompi_rte_abort (error_code=-51, fmt=0x0) at
../../../../../src/ompi/mca/rte/orte/rte_orte_module.c:85
#3  0x77a927a3 in ompi_mpi_abort (comm=0x601280
, errcode=-51) at
../../src/ompi/runtime/ompi_mpi_abort.c:206
#4  0x77a77c6b in ompi_errhandler_callback (status=-51,
source=0x7fffe8003494, info=0x7fffe8003570, results=0x7fffe80034c8,
cbfunc=0x74058ee8 , cbdata=0x7fffe80033d0)
at ../../src/ompi/errhandler/errhandler.c:250
#5  0x740594f7 in _event_hdlr (sd=-1, args=4,
cbdata=0x7fffe80033d0) at
../../../../../src/opal/mca/pmix/pmix2x/pmix2x.c:216
#6  0x76ed2bdc in event_process_active_single_queue
(activeq=0x667cb0, base=0x668410) at
../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1370
#7  event_process_active (base=<optimized out>) at
../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1440
#8  opal_libevent2022_event_base_loop (base=0x668410, flags=1) at
../../../../../../src/opal/mca/event/libevent2022/libevent/event.c:1644
#9  0x76e78263 in progress_engine (obj=0x667c68) at
../../src/opal/runtime/opal_progress_threads.c:105
#10 0x77821851 in start_thread () from /lib64/libpthread.so.0
#11 0x7756f94d in clone () from /lib64/libc.so.6


Cyril.


Re: [OMPI devel] Segfault on MPI init

2017-02-14 Thread Gilles Gouaillardet
Cyril,

Your first post mentions a crash in orted, but this stack trace is from
an MPI task.

I would expect orted to generate a core file; you can then use gdb post
mortem to get its stack trace. There should be several threads, so you
can run info threads and then bt; you might have to switch to another
thread first.
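
For example, assuming a core file was written (paths here are placeholders):

    gdb /path/to/orted /path/to/core
    (gdb) info threads
    (gdb) thread apply all bt

thread apply all bt prints every thread's backtrace in one go, which is often
easier than switching threads by hand.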

Cheers,

Gilles
