Hmmm…someone else recently reported this same issue (it was the Absoft folks 
hitting it occasionally on their MTT runs). I’m in the process of replacing 
that code path, so I don’t plan on pursuing it right now. However, we’ll have 
to see if the revised path resolves it.
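
For anyone trying to reproduce an intermittent failure like this in the 
meantime, a retry loop saves a lot of manual re-running. Below is a minimal 
sketch, assuming a POSIX shell and assuming mpirun exits with a non-zero 
status when a rank aborts; the function name run_until_failure and the 
trial.log file are made up for illustration.

```shell
#!/bin/sh
# Hypothetical helper: re-run a command up to MAX times and stop at the
# first non-zero exit status, keeping that trial's output for inspection.
# Usage: run_until_failure MAX command [args...]
run_until_failure() {
    max=$1
    shift
    i=1
    while [ "$i" -le "$max" ]; do
        # trial.log is overwritten each trial, so it holds the failing run.
        if ! "$@" >trial.log 2>&1; then
            echo "trial $i failed; output kept in trial.log"
            return 1
        fi
        i=$((i + 1))
    done
    echo "no failure in $max trials"
    return 0
}
```

For example, "run_until_failure 500 mpirun -mca btl sm,self -np 2 
examples/ring_c" would mirror the 500-trial run described below, stopping at 
the first failing trial.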


> On Feb 26, 2015, at 5:45 PM, Paul Hargrove <phhargr...@lbl.gov> wrote:
> 
> Initially I was testing Jeff's tarball for PR 410 on Mac OS X 10.8, where cc 
> is clang. I configured with
>     --prefix=[...] --enable-debug --enable-osx-builtin-atomics CC=cc CXX=c++
> 
> The build passed "make check", but when I tried to run ring_c I got the first 
> failure shown (far) below.
> HOWEVER, I tried 50 times to reproduce the failure and could not do so.
> Since Jeff's tarball is not "official" I turned my attention to the current 
> master tarball instead.
> 
> I next tried FIVE HUNDRED times with the current master tarball, and was able 
> to reproduce the failure ONCE.
> The failed assertion and backtrace are different than what I saw before, so 
> they also appear below.
> 
> Next, I tried with the master tarball without the builtin-atomics configure 
> option.
> In that case my 95th trial failed and I didn't continue trying.
> The failure output was (to me) indistinguishable from the one with 
> builtin-atomics, but it is also included below for completeness.
> 
> Finally, I tried w/o clang, leaving only "--prefix=[...] --enable-debug" on 
> the configure command line.
> However, note that "gcc" is really "i686-apple-darwin11-llvm-gcc-4.2" and 
> thus shares MUCH in common with clang on the same system.
> This configuration failed too, and the failure output is also provided below.
> 
> I hope somebody knows how to proceed from here.
> I don't really have any reason to believe this is specific to Mac OS X, but 
> don't have the spare cycles to dedicate to additional testing.
> 
> -Paul
> 
> Seen w/ Jeff's tarball:
> 
> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>  Warning :: opal_list_remove_item - the item 0x7fc092a0cb50 is not on the list 0x7fc0928006a0
> Assertion failed: (OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (kv))->obj_magic_id), function store, file /Users/Paul/OMPI/openmpi-pr410-v4-macos10.8-x86-clang-atomics/openmpi-gitclone/opal/mca/dstore/hash/dstore_hash.c, line 143.
> [tesuji:26399] *** Process received signal ***
> [tesuji:26399] Signal: Abort trap: 6 (6)
> [tesuji:26399] Signal code:  (0)
> [tesuji:26399] [ 0] 0   libsystem_c.dylib                   0x00007fff91e2b90a _sigtramp + 26
> [tesuji:26399] [ 1] 0   ???                                 0x00000000ffffffff 0x0 + 4294967295
> [tesuji:26399] [ 2] 0   libsystem_c.dylib                   0x00007fff91e82f61 abort + 143
> [tesuji:26399] [ 3] 0   libsystem_c.dylib                   0x00007fff91e83cb9 __assert_rtn + 146
> [tesuji:26399] [ 4] 0   mca_dstore_hash.so                  0x000000010180803c store + 972
> [tesuji:26399] [ 5] 0   libopen-pal.0.dylib                 0x00000001016860c6 opal_dstore_base_store + 278
> [tesuji:26399] [ 6] 0   mca_pmix_native.so                  0x0000000101825795 native_get + 4709
> [tesuji:26399] [ 7] 0   libmpi.0.dylib                      0x000000010111f6a4 ompi_proc_complete_init + 980
> [tesuji:26399] [ 8] 0   libmpi.0.dylib                      0x0000000101126f24 ompi_mpi_init + 2372
> [tesuji:26399] [ 9] 0   libmpi.0.dylib                      0x00000001011744c0 MPI_Init + 480
> [tesuji:26399] [10] 0   ring_c                              0x00000001010e9c25 main + 53
> [tesuji:26399] [11] 0   libdyld.dylib                       0x00007fff8e03a7e1 start + 0
> [tesuji:26399] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 
> 6 (Abort trap: 6).
> --------------------------------------------------------------------------
> 
> Seen with master tarball and builtin-atomics:
> 
> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>  Warning :: opal_list_remove_item - the item 0x7fc6d1900130 is not on the list 0x7fc6d0c30df0
> Assertion failed: (0 == item->opal_list_item_refcount), function opal_list_item_destruct, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-clang-atomics/openmpi-dev-1118-gdc80863/opal/class/opal_list.c, line 69.
> [tesuji:62565] *** Process received signal ***
> [tesuji:62565] Signal: Abort trap: 6 (6)
> [tesuji:62565] Signal code:  (0)
> [tesuji:62565] [ 0] 0   libsystem_c.dylib                   0x00007fff91e2b90a _sigtramp + 26
> [tesuji:62565] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
> [tesuji:62565] [ 2] 0   libsystem_c.dylib                   0x00007fff91e82f61 abort + 143
> [tesuji:62565] [ 3] 0   libsystem_c.dylib                   0x00007fff91e83cb9 __assert_rtn + 146
> [tesuji:62565] [ 4] 0   libopen-pal.0.dylib                 0x0000000107d54dd5 opal_list_item_destruct + 85
> [tesuji:62565] [ 5] 0   mca_dstore_hash.so                  0x0000000107f67e21 opal_obj_run_destructors + 145
> [tesuji:62565] [ 6] 0   mca_dstore_hash.so                  0x0000000107f6707e store + 1054
> [tesuji:62565] [ 7] 0   libopen-pal.0.dylib                 0x0000000107de0336 opal_dstore_base_store + 278
> [tesuji:62565] [ 8] 0   mca_pmix_native.so                  0x0000000107f8aaa3 fencenb_cbfunc + 851
> [tesuji:62565] [ 9] 0   mca_pmix_native.so                  0x0000000107f8bf97 pmix_usock_process_msg + 695
> [tesuji:62565] [10] 0   libopen-pal.0.dylib                 0x0000000107dea38d event_process_active_single_queue + 493
> [tesuji:62565] [11] 0   libopen-pal.0.dylib                 0x0000000107de5f7c event_process_active + 140
> [tesuji:62565] [12] 0   libopen-pal.0.dylib                 0x0000000107de502e opal_libevent2022_event_base_loop + 830
> [tesuji:62565] [13] 0   libopen-pal.0.dylib                 0x0000000107d66532 progress_engine + 66
> [tesuji:62565] [14] 0   libsystem_c.dylib                   0x00007fff91e3d772 _pthread_start + 327
> [tesuji:62565] [15] 0   libsystem_c.dylib                   0x00007fff91e2a1a1 thread_start + 13
> [tesuji:62565] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 
> 6 (Abort trap: 6).
> --------------------------------------------------------------------------
> 
> Seen with master tarball and without builtin-atomics:
> 
> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>  Warning :: opal_list_remove_item - the item 0x7f8ae2464f00 is not on the list 0x7f8ae2600690
> Assertion failed: (0 == item->opal_list_item_refcount), function opal_list_item_destruct, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-clang/openmpi-dev-1118-gdc80863/opal/class/opal_list.c, line 69.
> [tesuji:86550] *** Process received signal ***
> [tesuji:86550] Signal: Abort trap: 6 (6)
> [tesuji:86550] Signal code:  (0)
> [tesuji:86550] [ 0] 0   libsystem_c.dylib                   0x00007fff91e2b90a _sigtramp + 26
> [tesuji:86550] [ 1] 0   ???                                 0x0000000000000000 0x0 + 0
> [tesuji:86550] [ 2] 0   libsystem_c.dylib                   0x00007fff91e82f61 abort + 143
> [tesuji:86550] [ 3] 0   libsystem_c.dylib                   0x00007fff91e83cb9 __assert_rtn + 146
> [tesuji:86550] [ 4] 0   libopen-pal.0.dylib                 0x0000000104e41365 opal_list_item_destruct + 85
> [tesuji:86550] [ 5] 0   mca_dstore_hash.so                  0x0000000105039fc1 opal_obj_run_destructors + 145
> [tesuji:86550] [ 6] 0   mca_dstore_hash.so                  0x000000010503921e store + 1054
> [tesuji:86550] [ 7] 0   libopen-pal.0.dylib                 0x0000000104ec8306 opal_dstore_base_store + 278
> [tesuji:86550] [ 8] 0   mca_pmix_native.so                  0x000000010505bef3 fencenb_cbfunc + 851
> [tesuji:86550] [ 9] 0   mca_pmix_native.so                  0x000000010505d337 pmix_usock_process_msg + 695
> [tesuji:86550] [10] 0   libopen-pal.0.dylib                 0x0000000104ed214d event_process_active_single_queue + 493
> [tesuji:86550] [11] 0   libopen-pal.0.dylib                 0x0000000104ecdd3c event_process_active + 140
> [tesuji:86550] [12] 0   libopen-pal.0.dylib                 0x0000000104eccdee opal_libevent2022_event_base_loop + 830
> [tesuji:86550] [13] 0   libopen-pal.0.dylib                 0x0000000104e521d2 progress_engine + 66
> [tesuji:86550] [14] 0   libsystem_c.dylib                   0x00007fff91e3d772 _pthread_start + 327
> [tesuji:86550] [15] 0   libsystem_c.dylib                   0x00007fff91e2a1a1 thread_start + 13
> [tesuji:86550] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 
> 6 (Abort trap: 6).
> --------------------------------------------------------------------------
> 
> Seen on master configured with only --prefix= and --enable-debug:
> 
> $ mpirun -mca btl sm,self -np 2 examples/ring_c
>  Warning :: opal_list_remove_item - the item 0x7fd104200130 is not on the list 0x7fd10342b1e0
> Assertion failed: (OPAL_OBJ_MAGIC_ID == ((opal_object_t *) (kv))->obj_magic_id), function store, file /Users/Paul/OMPI/openmpi-master-macos10.8-x86-gcc/openmpi-dev-1118-gdc80863/opal/mca/dstore/hash/dstore_hash.c, line 143.
> [tesuji:12056] *** Process received signal ***
> [tesuji:12056] Signal: Abort trap: 6 (6)
> [tesuji:12056] Signal code:  (0)
> [tesuji:12056] [ 0] 0   libsystem_c.dylib                   0x00007fff91e2b90a _sigtramp + 26
> [tesuji:12056] [ 1] 0   ???                                 0x20656874202d206d 0x0 + 2334386829826793581
> [tesuji:12056] [ 2] 0   libsystem_c.dylib                   0x00007fff91e82f61 abort + 143
> [tesuji:12056] [ 3] 0   libsystem_c.dylib                   0x00007fff91e83cb9 __assert_rtn + 146
> [tesuji:12056] [ 4] 0   mca_dstore_hash.so                  0x000000010b22cf99 store + 873
> [tesuji:12056] [ 5] 0   libopen-pal.0.dylib                 0x000000010b0c1160 opal_dstore_base_store + 368
> [tesuji:12056] [ 6] 0   mca_pmix_native.so                  0x000000010b250b6f native_get + 6303
> [tesuji:12056] [ 7] 0   libmpi.0.dylib                      0x000000010ac32a9b ompi_proc_complete_init + 1659
> [tesuji:12056] [ 8] 0   libmpi.0.dylib                      0x000000010ac3be8d ompi_mpi_init + 3117
> [tesuji:12056] [ 9] 0   libmpi.0.dylib                      0x000000010ac881c1 MPI_Init + 609
> [tesuji:12056] [10] 0   ring_c                              0x000000010abe6bee main + 46
> [tesuji:12056] [11] 0   libdyld.dylib                       0x00007fff8e03a7e1 start + 0
> [tesuji:12056] [12] 0   ???                                 0x0000000000000001 0x0 + 1
> [tesuji:12056] *** End of error message ***
> --------------------------------------------------------------------------
> mpirun noticed that process rank 1 with PID 0 on node tesuji exited on signal 
> 6 (Abort trap: 6).
> --------------------------------------------------------------------------
> 
> -- 
> Paul H. Hargrove                          phhargr...@lbl.gov 
> Computer Languages & Systems Software (CLaSS) Group
> Computer Science Department               Tel: +1-510-495-2352
> Lawrence Berkeley National Laboratory     Fax: +1-510-486-6900
> _______________________________________________
> devel mailing list
> de...@open-mpi.org
> Subscription: http://www.open-mpi.org/mailman/listinfo.cgi/devel
> Link to this post: 
> http://www.open-mpi.org/community/lists/devel/2015/02/17071.php
