Hmm, someone else recently reported this same issue: the Absoft folks hit it occasionally in their MTT runs. I'm in the process of replacing that code path, so I don't plan to pursue it right now. We'll have to see whether the revised path resolves it.
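Since the failure shows up only once in tens or hundreds of runs, checking the revised path will need a repeat-until-failure loop. A minimal sketch in Python follows; the helper name `run_until_failure` is hypothetical, and it assumes mpirun exits with a non-zero status when a rank aborts.

```python
import subprocess

def run_until_failure(cmd, max_trials):
    """Run cmd up to max_trials times.

    Returns (trial_number, combined_output) for the first run that exits
    with a non-zero status, or None if every run succeeds.
    """
    for trial in range(1, max_trials + 1):
        result = subprocess.run(
            cmd,
            stdout=subprocess.PIPE,
            stderr=subprocess.STDOUT,  # keep assertion text and backtrace together
            text=True,
            errors="replace",          # tolerate stray bytes in the backtrace
        )
        if result.returncode != 0:
            return trial, result.stdout
    return None
```

For the case reported here the call would be something like `run_until_failure(["mpirun", "-mca", "btl", "sm,self", "-np", "2", "examples/ring_c"], 500)`. The trial count is arbitrary, and a clean run of 500 would not prove the problem is gone given the failure rates quoted below.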
> On Feb 26, 2015, at 5:45 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
>
> Initially I was testing Jeff's tarball for PR 410, on Mac OS X 10.8 where cc
> is clang, I have configured with
> --prefix=[...] --enable-debug --enable-osx-builtin-atomics CC=cc CXX=c++
>
> I passed "make check", but when I try to run ring_c I get the first failure
> shown (far) below.
> HOWEVER, I tried 50 times to reproduce the failure and could not do so.
> Since Jeff's tarball is not "official" I turned my attention to the current
> master tarball instead.
>
> I next tried FIVE HUNDRED times with the current master tarball, and was able
> to reproduce the failure ONCE.
> The failed assertion and backtrace are different than what I saw before, so
> they also appear below.
>
> Next, I tried with the master tarball without the builtin-atomics configure
> option.
> In that case my 95th trial failed and I didn't continue trying.
> The failure output was (to me) indistinguishable from the one with
> builtin-atomics, but it is also included below for completeness.
>
> Finally, I tried w/o clang leaving only "--prefix=[...] --enable-debug" on
> the configure command line.
> However, note that "gcc" is really "i686-apple-darwin11-llvm-gcc-4.2" and
> thus shares MUCH in common with clang on the same system.
> This configuration failed too, and the failure output is also provided below.
>
> I hope somebody knows how to proceed from here.
> I don't really have any reason to believe this is specific to Mac OS X, but
> don't have the spare cycles to dedicate to additional testing.
>
> -Paul
>
> Seen w/ Jeff's tarball:
>
> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
> Warning :: opal_list_remove_item - the item 0x7fc092a0cb50 is not on the list 0x7fc0928006a0
> Assertion failed: (OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (kv))->obj_magic_id), function store, file /Users/Paul/OMPI/openmpi-pr410-v4-macos10.8-x86-clang-atomics/openmpi-gitclone/opal/mca/dstore/hash/dstore_hash.c, line 143.
> [tesuji:26399] *** Process received signal ***
> [tesuji:26399] Signal: Abort trap: 6 (6)
> [tesuji:26399] Signal code: (0)
> [tesuji:26399] [ 0] 0 libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
> [tesuji:26399] [ 1] 0 ??? 0x00000000ffffffff 0x0 + 4294967295
> [tesuji:26399] [ 2] 0 libsystem_c.dylib 0x00007fff91e82f61 abort + 143
> [tesuji:26399] [ 3] 0 libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
> [tesuji:26399] [ 4] 0 mca_dstore_hash.so 0x000000010180803c store + 972
> [tesuji:26399] [ 5] 0 libopen-pal.0.dylib 0x00000001016860c6 opal_dstore_base_store + 278
> [tesuji:26399] [ 6] 0 mca_pmix_native.so 0x0000000101825795 native_get + 4709
> [tesuji:26399] [ 7] 0 libmpi.0.dylib 0x000000010111f6a4 ompi_proc_complete_init + 980
> [tesuji:26399] [ 8] 0 libmpi.0.dylib 0x0000000101126f24 ompi_mpi_init + 2372
> [tesuji:26399] [ 9] 0 libmpi.0.dylib 0x00000001011744c0 MPI_Init + 480
> [tesuji:26399] [10] 0 ring_c 0x00000001010e9c25 main + 53
> [tesuji:26399] [11] 0 libdyld.dylib 0x00007fff8e03a7e1 start + 0
> [tesuji:26399] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 6 (Abort trap: 6).
> --------------------------------------------------------------------------
>
> Seen with master tarball and builtin-atomics:
>
> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
> Warning :: opal_list_remove_item - the item 0x7fc6d1900130 is not on the list 0x7fc6d0c30df0
> Assertion failed: (0 == item->opal_list_item_refcount), function opal_list_item_destruct, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-clang-atomics/openmpi-dev-1118-gdc80863/opal/class/opal_list.c, line 69.
> [tesuji:62565] *** Process received signal ***
> [tesuji:62565] Signal: Abort trap: 6 (6)
> [tesuji:62565] Signal code: (0)
> [tesuji:62565] [ 0] 0 libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
> [tesuji:62565] [ 1] 0 ??? 0x0000000000000000 0x0 + 0
> [tesuji:62565] [ 2] 0 libsystem_c.dylib 0x00007fff91e82f61 abort + 143
> [tesuji:62565] [ 3] 0 libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
> [tesuji:62565] [ 4] 0 libopen-pal.0.dylib 0x0000000107d54dd5 opal_list_item_destruct + 85
> [tesuji:62565] [ 5] 0 mca_dstore_hash.so 0x0000000107f67e21 opal_obj_run_destructors + 145
> [tesuji:62565] [ 6] 0 mca_dstore_hash.so 0x0000000107f6707e store + 1054
> [tesuji:62565] [ 7] 0 libopen-pal.0.dylib 0x0000000107de0336 opal_dstore_base_store + 278
> [tesuji:62565] [ 8] 0 mca_pmix_native.so 0x0000000107f8aaa3 fencenb_cbfunc + 851
> [tesuji:62565] [ 9] 0 mca_pmix_native.so 0x0000000107f8bf97 pmix_usock_process_msg + 695
> [tesuji:62565] [10] 0 libopen-pal.0.dylib 0x0000000107dea38d event_process_active_single_queue + 493
> [tesuji:62565] [11] 0 libopen-pal.0.dylib 0x0000000107de5f7c event_process_active + 140
> [tesuji:62565] [12] 0 libopen-pal.0.dylib 0x0000000107de502e opal_libevent2022_event_base_loop + 830
> [tesuji:62565] [13] 0 libopen-pal.0.dylib 0x0000000107d66532 progress_engine + 66
> [tesuji:62565] [14] 0 libsystem_c.dylib 0x00007fff91e3d772 _pthread_start + 327
> [tesuji:62565] [15] 0 libsystem_c.dylib 0x00007fff91e2a1a1 thread_start + 13
> [tesuji:62565] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 6 (Abort trap: 6).
> --------------------------------------------------------------------------
>
> Seen with master tarball and without builtin-atomics:
>
> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
> Warning :: opal_list_remove_item - the item 0x7f8ae2464f00 is not on the list 0x7f8ae2600690
> Assertion failed: (0 == item->opal_list_item_refcount), function opal_list_item_destruct, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-clang/openmpi-dev-1118-gdc80863/opal/class/opal_list.c, line 69.
> [tesuji:86550] *** Process received signal ***
> [tesuji:86550] Signal: Abort trap: 6 (6)
> [tesuji:86550] Signal code: (0)
> [tesuji:86550] [ 0] 0 libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
> [tesuji:86550] [ 1] 0 ??? 0x0000000000000000 0x0 + 0
> [tesuji:86550] [ 2] 0 libsystem_c.dylib 0x00007fff91e82f61 abort + 143
> [tesuji:86550] [ 3] 0 libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
> [tesuji:86550] [ 4] 0 libopen-pal.0.dylib 0x0000000104e41365 opal_list_item_destruct + 85
> [tesuji:86550] [ 5] 0 mca_dstore_hash.so 0x0000000105039fc1 opal_obj_run_destructors + 145
> [tesuji:86550] [ 6] 0 mca_dstore_hash.so 0x000000010503921e store + 1054
> [tesuji:86550] [ 7] 0 libopen-pal.0.dylib 0x0000000104ec8306 opal_dstore_base_store + 278
> [tesuji:86550] [ 8] 0 mca_pmix_native.so 0x000000010505bef3 fencenb_cbfunc + 851
> [tesuji:86550] [ 9] 0 mca_pmix_native.so 0x000000010505d337 pmix_usock_process_msg + 695
> [tesuji:86550] [10] 0 libopen-pal.0.dylib 0x0000000104ed214d event_process_active_single_queue + 493
> [tesuji:86550] [11] 0 libopen-pal.0.dylib 0x0000000104ecdd3c event_process_active + 140
> [tesuji:86550] [12] 0 libopen-pal.0.dylib 0x0000000104eccdee opal_libevent2022_event_base_loop + 830
> [tesuji:86550] [13] 0 libopen-pal.0.dylib 0x0000000104e521d2 progress_engine + 66
> [tesuji:86550] [14] 0 libsystem_c.dylib 0x00007fff91e3d772 _pthread_start + 327
> [tesuji:86550] [15] 0 libsystem_c.dylib 0x00007fff91e2a1a1 thread_start + 13
> [tesuji:86550] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 6 (Abort trap: 6).
> --------------------------------------------------------------------------
>
> Seen on master configured with only --prefix= and --enable-debug:
>
> $ mpirun -mca btl sm,self -np 2 examples/ring_c'
> Warning :: opal_list_remove_item - the item 0x7fd104200130 is not on the list 0x7fd10342b1e0
> Assertion failed: (OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (kv))->obj_magic_id), function store, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-gcc/openmpi-dev-1118-gdc80863/opal/mca/dstore/hash/dstore_hash.c, line 143.
> [tesuji:12056] *** Process received signal ***
> [tesuji:12056] Signal: Abort trap: 6 (6)
> [tesuji:12056] Signal code: (0)
> [tesuji:12056] [ 0] 0 libsystem_c.dylib 0x00007fff91e2b90a _sigtramp + 26
> [tesuji:12056] [ 1] 0 ??? 0x20656874202d206d 0x0 + 2334386829826793581
> [tesuji:12056] [ 2] 0 libsystem_c.dylib 0x00007fff91e82f61 abort + 143
> [tesuji:12056] [ 3] 0 libsystem_c.dylib 0x00007fff91e83cb9 __assert_rtn + 146
> [tesuji:12056] [ 4] 0 mca_dstore_hash.so 0x000000010b22cf99 store + 873
> [tesuji:12056] [ 5] 0 libopen-pal.0.dylib 0x000000010b0c1160 opal_dstore_base_store + 368
> [tesuji:12056] [ 6] 0 mca_pmix_native.so 0x000000010b250b6f native_get + 6303
> [tesuji:12056] [ 7] 0 libmpi.0.dylib 0x000000010ac32a9b ompi_proc_complete_init + 1659
> [tesuji:12056] [ 8] 0 libmpi.0.dylib 0x000000010ac3be8d ompi_mpi_init + 3117
> [tesuji:12056] [ 9] 0 libmpi.0.dylib 0x000000010ac881c1 MPI_Init + 609
> [tesuji:12056] [10] 0 ring_c 0x000000010abe6bee main + 46
> [tesuji:12056] [11] 0 libdyld.dylib 0x00007fff8e03a7e1 start + 0
> [tesuji:12056] [12] 0 ??? 0x0000000000000001 0x0 + 1
> [tesuji:12056] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 6 (Abort trap: 6).
> --------------------------------------------------------------------------
>
> --
> Paul H. Hargrove phhargr...@lbl.gov
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory Fax: +1-510-486-6900
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: http://www.open-mpi.org/community/lists/devel/2015/02/17071.php
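Two of the failures above were judged indistinguishable by eye. One way to check that mechanically is to compare only the symbol names in each backtrace, ignoring load addresses and offsets. The sketch below assumes the frame layout shown in this report (hex address, symbol, `+`, offset); the names `frame_symbols` and `same_failure` are hypothetical, and it is not part of Open MPI.

```python
import re

# Tail of one frame, e.g. "0x0000000107d54dd5 opal_list_item_destruct + 85".
FRAME_RE = re.compile(r"0x[0-9a-fA-F]+\s+(\S+)\s+\+\s+\d+")

def frame_symbols(backtrace_text):
    """Return the symbol names in frame order.

    Addresses and offsets are ignored. Unresolved frames, which print as
    a bare "0x0" where the symbol would be, are dropped.
    """
    symbols = FRAME_RE.findall(backtrace_text)
    return [s for s in symbols if not s.startswith("0x")]

def same_failure(trace_a, trace_b):
    """True when both backtraces pass through the same sequence of symbols."""
    return frame_symbols(trace_a) == frame_symbols(trace_b)
```

Dropping offsets makes the comparison tolerant of builds with different compilers or configure options, at the cost of treating two different call sites inside the same function as equal.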