Thank you for your answer.
I am running the git master version (last tested was cad4c03).
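
For reference, a clean reinstall along the lines of the FAQ quoted below
would be roughly the following (the prefix comes from the paths in my
backtrace; the configure flags are only indicative):

    # remove the old install tree instead of installing a new build over it
    rm -rf $HOME/modules/openmpi/openmpi-debug

    # rebuild from a clean git checkout
    git clean -dfx            # discards every previous build artifact
    ./autogen.pl
    ./configure --prefix=$HOME/modules/openmpi/openmpi-debug --enable-debug
    make -j 8 all install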

FYI, Clément Foyer is talking with George Bosilca about this problem.


Cyril.

On 08/02/2017 at 16:46, Jeff Squyres (jsquyres) wrote:
> What version of Open MPI are you running?
> 
> The error indicates that Open MPI is trying to start a user-level helper 
> daemon on the remote node, and that the daemon is segfaulting (which is unusual).
> 
> One thing to be aware of:
> 
>      https://www.open-mpi.org/faq/?category=building#install-overwrite
> 
> 
> 
>> On Feb 6, 2017, at 8:14 AM, Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>
>> Hello,
>>
>> I cannot run a program with MPI when I compile Open MPI myself.
>> On some nodes I have the following error:
>> ================================================================================
>> [mimi012:17730] *** Process received signal ***
>> [mimi012:17730] Signal: Segmentation fault (11)
>> [mimi012:17730] Signal code: Address not mapped (1)
>> [mimi012:17730] Failing at address: 0xf8
>> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ffff66c0500]
>> [mimi012:17730] [ 1]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7ffff781fcb9]
>> [mimi012:17730] [ 2]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7ffff197fbcd]
>> [mimi012:17730] [ 3]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x7ffff1981e34]
>> [mimi012:17730] [ 4]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7ffff197bb1d]
>> [mimi012:17730] [ 5]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7ffff782323c]
>> [mimi012:17730] [ 6]
>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x7ffff77c534c]
>> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x7ffff66b8851]
>> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7ffff640694d]
>> [mimi012:17730] *** End of error message ***
>> --------------------------------------------------------------------------
>> ORTE has lost communication with its daemon located on node:
>>
>>  hostname:  mimi012
>>
>> This is usually due to either a failure of the TCP network
>> connection to the node, or possibly an internal failure of
>> the daemon itself. We cannot recover from this failure, and
>> therefore will terminate the job.
>> --------------------------------------------------------------------------
>> ================================================================================
>>
>> The error does not appear with the official Open MPI installation on the
>> platform. I asked the admins about their build options, but there is
>> nothing unusual about them.
>>
>> Moreover, it appears only for some node lists. Still, the nodes themselves
>> seem fine, since the platform's official version works on them.
>>
>> To rule out a network problem, I tried "-mca btl tcp,sm,self" and
>> "-mca btl openib,sm,self", with no change.
>>
>> Do you have any idea where this error may come from?
>>
>> Thank you.
>>
>>
>> Cyril Bordage.
> 
> 
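
Since the backtrace is in mca_oob_tcp (the runtime's out-of-band channel)
rather than in a BTL, I suppose the "-mca btl ..." runs quoted above could
not have changed anything. If it is useful I can rerun with more verbosity
on the OOB/PLM side, along the lines of the following (parameter names
assumed from the usual <framework>_base_verbose convention; the host list
and program name are placeholders):

    mpirun --mca oob_base_verbose 100 --mca plm_base_verbose 100 \
           --host <node list> -np 2 ./your_program
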
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
