Thank you for your answer. I am running the git master version (the last commit I tested was cad4c03).
FYI, Clément Foyer is talking with George Bosilca about this problem.

Cyril.

On 08/02/2017 at 16:46, Jeff Squyres (jsquyres) wrote:
> What version of Open MPI are you running?
>
> The error is indicating that Open MPI is trying to start a user-level helper
> daemon on the remote node, and the daemon is seg faulting (which is unusual).
>
> One thing to be aware of:
>
> https://www.open-mpi.org/faq/?category=building#install-overwrite
>
>
>> On Feb 6, 2017, at 8:14 AM, Cyril Bordage <cyril.bord...@inria.fr> wrote:
>>
>> Hello,
>>
>> I cannot run a program with MPI when I compile it myself.
>> On some nodes I have the following error:
>> ================================================================================
>> [mimi012:17730] *** Process received signal ***
>> [mimi012:17730] Signal: Segmentation fault (11)
>> [mimi012:17730] Signal code: Address not mapped (1)
>> [mimi012:17730] Failing at address: 0xf8
>> [mimi012:17730] [ 0] /lib64/libpthread.so.0(+0xf500)[0x7ffff66c0500]
>> [mimi012:17730] [ 1] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_priority_set+0xa9)[0x7ffff781fcb9]
>> [mimi012:17730] [ 2] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xebcd)[0x7ffff197fbcd]
>> [mimi012:17730] [ 3] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(mca_oob_tcp_peer_accept+0xa1)[0x7ffff1981e34]
>> [mimi012:17730] [ 4] /home/bordage/modules/openmpi/openmpi-debug/lib/openmpi/mca_oob_tcp.so(+0xab1d)[0x7ffff197bb1d]
>> [mimi012:17730] [ 5] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(opal_libevent2022_event_base_loop+0x53c)[0x7ffff782323c]
>> [mimi012:17730] [ 6] /home/bordage/modules/openmpi/openmpi-debug/lib/libopen-pal.so.0(+0x3d34c)[0x7ffff77c534c]
>> [mimi012:17730] [ 7] /lib64/libpthread.so.0(+0x7851)[0x7ffff66b8851]
>> [mimi012:17730] [ 8] /lib64/libc.so.6(clone+0x6d)[0x7ffff640694d]
>> [mimi012:17730] *** End of error message ***
>> --------------------------------------------------------------------------
>> ORTE has lost communication with its daemon located on node:
>>
>>   hostname: mimi012
>>
>> This is usually due to either a failure of the TCP network
>> connection to the node, or possibly an internal failure of
>> the daemon itself. We cannot recover from this failure, and
>> therefore will terminate the job.
>> --------------------------------------------------------------------------
>> ================================================================================
>>
>> The error does not appear with the official MPI installation on the
>> platform. I asked the admins about their compilation options, but there
>> is nothing particular.
>>
>> Moreover, it appears only for some node lists. Still, the nodes seem to
>> be fine, since everything works with the platform's official MPI version.
>>
>> To be sure it is not a network problem, I tried to use "-mca btl
>> tcp,sm,self" or "-mca btl openib,sm,self", with no change.
>>
>> Do you have any idea where this error may come from?
>>
>> Thank you.
>>
>>
>> Cyril Bordage.
>> _______________________________________________
>> devel mailing list
>> devel@lists.open-mpi.org
>> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
>
> _______________________________________________
> devel mailing list
> devel@lists.open-mpi.org
> https://rfd.newmexicoconsortium.org/mailman/listinfo/devel
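In case it helps reproduce the problem, here is a rough sketch of the steps suggested by the FAQ link above (wipe the prefix before reinstalling) plus a way to get more detail out of the crashing daemon. The prefix path and node name mimi012 are taken from this thread; the second host (mimi013), the test binary ./a.out, the "-j 8" parallelism, and the core file name are placeholders I chose for illustration, not confirmed values.

    # Remove the old install prefix completely before reinstalling
    # (installing over an older build can leave stale plugins in lib/openmpi/,
    # which is exactly what the install-overwrite FAQ entry warns about).
    rm -rf $HOME/modules/openmpi/openmpi-debug
    ./configure --prefix=$HOME/modules/openmpi/openmpi-debug --enable-debug
    make -j 8 && make install

    # Reproduce over TCP only, with extra verbosity from the OOB framework
    # (the component that appears in the backtrace above).
    mpirun --mca btl tcp,sm,self --mca oob_base_verbose 100 \
           -host mimi012,mimi013 ./a.out

    # If the daemon still segfaults, enable core dumps on the remote node and
    # open the orted core in gdb to get file/line information from the
    # --enable-debug build.
    ulimit -c unlimited
    gdb $HOME/modules/openmpi/openmpi-debug/bin/orted core.<pid>

Since the crash only shows up with the self-compiled build and only for some node lists, ruling out a half-overwritten install and capturing a symbolized backtrace from orted seem like reasonable first steps before digging into the oob/tcp code itself.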