Hello,

When I went back to this problem, the segfault did not happen again...
I do not know why, but I am glad about that.


Cyril.

On 13/02/2017 at 10:15, Cyril Bordage wrote:
> Unfortunately this does not complete the thread: the problem is not
> solved! It is not an installation problem either. There is no previous
> installation to conflict with, since I use a separate directory for
> each install, and there is nothing MPI-specific in my environment; I
> just use the full paths to mpicc and mpirun.
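(As a side note, a quick way to double-check that no stale install is
being picked up might look like the sketch below; the prefix is the one
from the backtrace further down, and "node1" is just an example node.)

    # Which mpirun would be found, and which libraries the helper daemon links against
    type -a mpirun
    ldd /home/bordage/modules/openmpi/openmpi-debug/bin/orted | grep open
    # Repeat on a compute node to make sure it resolves to the same install
    ssh node1 'ldd /home/bordage/modules/openmpi/openmpi-debug/bin/orted | grep open'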
> 
> The error depends on which nodes I run on. For example, I can run on
> node1 and node2, on node1 and node3, or on node2 and node3, but not on
> node1, node2 and node3 together. With the version officially installed
> on the platform (1.8.1) it works like a charm.
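(For reference, and assuming the crash really is in the helper daemon as
Jeff suggested below, the pattern should be reproducible even with a
trivial non-MPI command; $PREFIX stands for the install prefix and the
node names match the example above.)

    # Any pair of nodes works:
    $PREFIX/bin/mpirun --host node1,node2 hostname
    $PREFIX/bin/mpirun --host node2,node3 hostname
    # All three together trigger the daemon segfault:
    $PREFIX/bin/mpirun --host node1,node2,node3 hostname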
> 
> George, since you have an account on our platform (plafrim), maybe you
> could connect and see the problem for yourself. That should make it
> easier to understand what is going on.
> 
> 
> Cyril.
> 
> On 10/02/2017 at 18:15, George Bosilca wrote:
>> To complete this thread: the problem is now solved. Some .so files were
>> lingering around from a previous installation, causing the startup problem.
>>
>>   George.
>>
>>
>>> On Feb 10, 2017, at 05:38, Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>>
>>> Thank you for your answer.
>>> I am running the git master version (last tested was cad4c03).
>>>
>>> FYI, Clément Foyer is talking with George Bosilca about this problem.
>>>
>>>
>>> Cyril.
>>>
>>> On 08/02/2017 at 16:46, Jeff Squyres (jsquyres) wrote:
>>>> What version of Open MPI are you running?
>>>>
>>>> The error indicates that Open MPI is trying to start a user-level
>>>> helper daemon on the remote node, and that daemon is segfaulting
>>>> (which is unusual).
>>>>
>>>> One thing to be aware of:
>>>>
>>>>     https://www.open-mpi.org/faq/?category=building#install-overwrite
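(In practice, that FAQ entry is essentially about not installing one
build on top of another; a minimal sketch of the two safe options, with
purely illustrative prefixes:)

    # Either wipe the old tree before installing again at the same prefix...
    rm -rf $HOME/modules/openmpi/openmpi-debug
    make install
    # ...or give each build its own prefix when configuring
    ./configure --prefix=$HOME/modules/openmpi/openmpi-debug-cad4c03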
>>>>
>>>>
>>>>
>>>>> On Feb 6, 2017, at 8:14 AM, Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>>>>
>>>>> Hello,
>>>>>
>>>>> I cannot run a program with MPI when I compile MPI myself.
>>>>> On some nodes I have the following error:
>>>>> ================================================================================
>>>>> [mimi012:17730] *** Process received signal ***
>>>>> [mimi012:17730] Signal: Segmentation fault (11)
>>>>> [mimi012:17730] Signal code: Address not mapped (1)
>>>>> [mimi012:17730] Failing at address: 0xf8
>>>>> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ffff66c0500]
>>>>> [mimi012:17730] [ 1]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7ffff781fcb9]
>>>>> [mimi012:17730] [ 2]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7ffff197fbcd]
>>>>> [mimi012:17730] [ 3]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x7ffff1981e34]
>>>>> [mimi012:17730] [ 4]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7ffff197bb1d]
>>>>> [mimi012:17730] [ 5]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7ffff782323c]
>>>>> [mimi012:17730] [ 6]
>>>>> /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x7ffff77c534c]
>>>>> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x7ffff66b8851]
>>>>> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7ffff640694d]
>>>>> [mimi012:17730] *** End of error message ***
>>>>> --------------------------------------------------------------------------
>>>>> ORTE has lost communication with its daemon located on node:
>>>>>
>>>>> hostname:  mimi012
>>>>>
>>>>> This is usually due to either a failure of the TCP network
>>>>> connection to the node, or possibly an internal failure of
>>>>> the daemon itself. We cannot recover from this failure, and
>>>>> therefore will terminate the job.
>>>>> --------------------------------------------------------------------------
>>>>> ================================================================================
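(If the daemon crash comes back, one way to get more information out of
it is to keep the daemons attached to the terminal and raise the runtime
verbosity; the program name and host list below are placeholders, and
the verbosity levels are just a guess.)

    mpirun --debug-daemons --leave-session-attached \
           --mca plm_base_verbose 5 --mca oob_base_verbose 10 \
           --host node1,node2,node3 ./my_mpi_prog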
>>>>>
>>>>> The error does not appear with the official MPI installation on the
>>>>> platform. I asked the admins about their build options, but there is
>>>>> nothing special about them.
>>>>>
>>>>> Moreover, it appears only for some node lists. Still, the nodes seem
>>>>> to be fine, since everything works with the platform's official MPI.
>>>>>
>>>>> To make sure it is not a network problem, I tried "-mca btl
>>>>> tcp,sm,self" and "-mca btl openib,sm,self", with no change.
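(For completeness, those runs would look roughly like this; program name
and host list are placeholders. Note that the backtrace above points at
mca_oob_tcp, the runtime's out-of-band channel, which as far as I can
tell is selected independently of the btl list, so changing btls would
not be expected to affect it.)

    mpirun -mca btl tcp,sm,self    --host node1,node2,node3 ./my_mpi_prog
    mpirun -mca btl openib,sm,self --host node1,node2,node3 ./my_mpi_prog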
>>>>>
>>>>> Do you have any idea where this error may come from?
>>>>>
>>>>> Thank you.
>>>>>
>>>>>
>>>>> Cyril Bordage.
_______________________________________________
devel mailing list
devel@lists.open-mpi.org
https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
