Hi All, I am seeing the following segfault with openmpi-master.

[root@maneybhanjang ~]# /usr/mpi/gcc/openmpi-2.0-dev/bin/mpirun --allow-run-as-root --hostfile /root/mpd.hosts -np 8 --prefix /usr/mpi/gcc/openmpi-2.0-dev/ --map-by node --display-allocation --oversubscribe --mca btl openib,sm,self /usr/mpi/gcc/openmpi-2.0-dev/tests/IMB/IMB-MPI1

====================== ALLOCATED NODES ======================
        maneybhanjang: flags=0x01 slots=8 max_slots=0 slots_inuse=0 state=UP
        10.193.184.162: flags=0x03 slots=4 max_slots=0 slots_inuse=0 state=UNKNOWN
=================================================================
[maneybhanjang:28532] *** Process received signal ***
[maneybhanjang:28532] Signal: Segmentation fault (11)
[maneybhanjang:28532] Signal code: Invalid permissions (2)
[maneybhanjang:28532] Failing at address: 0x106ca70
[maneybhanjang:28532] [ 0] /lib64/libpthread.so.0[0x3aea40f710]
[maneybhanjang:28532] [ 1] [0x106ca70]
[maneybhanjang:28532] *** End of error message ***
[tonglu:02068] *** Process received signal ***
[tonglu:02068] Signal: Segmentation fault (11)
[tonglu:02068] Signal code: Invalid permissions (2)
[tonglu:02068] Failing at address: 0x2478500
[tonglu:02068] [ 0] /lib64/libpthread.so.0[0x3ef5c0f710]
[tonglu:02068] [ 1] [0x2478500]
[tonglu:02068] *** End of error message ***
bash: line 1: 2068 Segmentation fault (core dumped) /usr/mpi/gcc/openmpi-2.0-dev/bin/orted --hnp-topo-sig 0N:2S:0L3:4L2:8L1:8C:8H:x86_64 -mca ess "env" -mca ess_base_jobid "3921674240" -mca ess_base_vpid 1 -mca ess_base_num_procs "2" -mca orte_hnp_uri "3921674240.0;usock;tcp://10.193.184.161,102.1.1.161,102.2.2.161:43160" --mca btl "openib,sm,self" -mca plm "rsh" -mca rmaps_base_mapping_policy "node" -mca orte_display_alloc "1" -mca rmaps_base_oversubscribe "1"
Segmentation fault (core dumped)

[root@maneybhanjang ~]# dmesg
mpirun[28532]: segfault at 106ca70 ip 000000000106ca70 sp 00007fffc00a7f28 error 15

Segfault is seen on the other peer too.

[root@tonglu ~]# dmesg
orted[2068]: segfault at 2478500 ip 0000000002478500 sp 00007fff521c2e68 error 15

gdb on the coredump points me to orted/pmix/pmix_server_gen.c:80. Following is the backtrace.

[root@maneybhanjang ~]# gdb /usr/mpi/gcc/openmpi-2.0-dev/bin/mpirun core.28532
Program terminated with signal 11, Segmentation fault.
#0  0x000000000106ca70 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64 libudev-147-2.57.el6.x86_64
(gdb) bt
#0  0x000000000106ca70 in ?? ()
#1  0x00002b217f7a43aa in _client_conn (sd=-1, args=4, cbdata=0x2b2188022260) at orted/pmix/pmix_server_gen.c:80
#2  0x00002b217fad5a7c in event_process_active_single_queue (base=0xfcc730, flags=1) at event.c:1370
#3  event_process_active (base=0xfcc730, flags=1) at event.c:1440
#4  opal_libevent2022_event_base_loop (base=0xfcc730, flags=1) at event.c:1644
#5  0x00000000004014d3 in orterun (argc=16, argv=0x7fffc00a81e8) at orterun.c:192
#6  0x0000000000400f04 in main (argc=16, argv=0x7fffc00a81e8) at main.c:13
(gdb) frame
#0  0x000000000106ca70 in ?? ()
(gdb) up
#1  0x00002b217f7a43aa in _client_conn (sd=-1, args=4, cbdata=0x2b2188022260) at orted/pmix/pmix_server_gen.c:80
80          cd->cbfunc(OPAL_SUCCESS, cd->cbdata);

Here is the backtrace from the peer machine, pointing to the same line:

[root@tonglu ~]# gdb /usr/mpi/gcc/openmpi-2.0-dev/bin/orted core.2068
Program terminated with signal 11, Segmentation fault.
#0  0x0000000002478500 in ?? ()
Missing separate debuginfos, use: debuginfo-install glibc-2.12-1.149.el6.x86_64 libgcc-4.4.7-11.el6.x86_64 libudev-147-2.57.el6.x86_64 numactl-2.0.9-2.el6.x86_64
(gdb) bt
#0  0x0000000002478500 in ?? ()
#1  0x00002af4511433ba in _client_conn (sd=-1, args=4, cbdata=0x2af458022260) at orted/pmix/pmix_server_gen.c:80
#2  0x00002af451474cac in event_process_active_single_queue (base=0x2408e90, flags=1) at event.c:1370
#3  event_process_active (base=0x2408e90, flags=1) at event.c:1440
#4  opal_libevent2022_event_base_loop (base=0x2408e90, flags=1) at event.c:1644
#5  0x00002af451123c57 in orte_daemon (argc=33, argv=0x7fff521c33d8) at orted/orted_main.c:859
#6  0x000000000040081a in main (argc=33, argv=0x7fff521c33d8) at orted.c:60
(gdb) frame
#0  0x0000000002478500 in ?? ()
(gdb) up
#1  0x00002af4511433ba in _client_conn (sd=-1, args=4, cbdata=0x2af458022260) at orted/pmix/pmix_server_gen.c:80
80          cd->cbfunc(OPAL_SUCCESS, cd->cbdata);

I am using the top of tree (ToT) of openmpi-master:

commit 5795682aa56ce8f22e518462b22cfee49d407216
Merge: 5d32282 1bb7788
Author: Joshua Ladd <jladd.m...@gmail.com>
Date:   Mon Jun 27 12:59:20 2016 -0400

    Merge pull request #1817 from shamisp/topic/oshmem_init

    OSHMEM: Removing erroneous initialization check

I am happy to provide any further information and would appreciate any suggestions regarding the issue.

Thanks,
Bharat.
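
P.S. For anyone reading the dmesg lines in this report, here is a minimal sketch of how the "error 15" field could be decoded. It assumes the standard x86 page-fault error-code bit layout and that the kernel prints the code in hexadecimal; decode_segfault is a hypothetical helper written for this note, not part of Open MPI or any library.

```python
import re

# x86 page-fault error-code bits (hardware-defined layout).
PF_BITS = {
    0: "protection violation (page present)",
    1: "write access",
    2: "user mode",
    3: "reserved bit set",
    4: "instruction fetch",
}

# Matches the kernel's "segfault at <addr> ip <ip> sp <sp> error <code>" message.
SEGFAULT_RE = re.compile(
    r"segfault at ([0-9a-f]+) ip ([0-9a-f]+) sp ([0-9a-f]+) error ([0-9a-f]+)"
)


def decode_segfault(line):
    """Parse one dmesg segfault line and decode its error code (all fields are hex)."""
    match = SEGFAULT_RE.search(line)
    if match is None:
        raise ValueError("not a segfault line: %r" % line)
    addr, ip, sp, err = (int(group, 16) for group in match.groups())
    flags = [name for bit, name in sorted(PF_BITS.items()) if err & (1 << bit)]
    return {
        "addr": addr,
        "ip": ip,
        "sp": sp,
        "error": err,
        "flags": flags,
        # True when the faulting address is the instruction pointer itself,
        # i.e. the CPU faulted while trying to execute at that address.
        "ip_is_fault_addr": addr == ip,
    }


if __name__ == "__main__":
    line = ("mpirun[28532]: segfault at 106ca70 ip 000000000106ca70 "
            "sp 00007fffc00a7f28 error 15")
    info = decode_segfault(line)
    print(hex(info["error"]), info["flags"], info["ip_is_fault_addr"])
```

Under those assumptions, 0x15 reads as a user-mode instruction fetch from a present page, with the fault address equal to ip on both hosts. That would be consistent with cd->cbfunc holding an address that is not executable code, though the sketch only interprets the log line and does not show where the pointer came from.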