Hi - over the last few days we’ve started seeing crashes and hangs in openmpi, 
in a code that hasn’t been touched in months, with an openmpi installation (v. 
1.8.5) that also hasn’t been touched in months.  The symptom is either a 
hang, with a stack trace (from attaching to the one running process that’s at 
0% CPU usage) that looks like this:
(gdb) where
#0  0x000000358980f00d in nanosleep () from /lib64/libpthread.so.0
#1  0x00002af19a8758de in opal_memory_ptmalloc2_free () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#2  0x0000000002bca106 in for__free_vm ()
#3  0x0000000002b8cf62 in for__exit_handler ()
#4  0x0000000002b89782 in for__issue_diagnostic ()
#5  0x0000000002b90a50 in for__signal_handler ()
#6  <signal handler called>
#7  0x00002af19a8746fc in malloc_consolidate () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#8  0x00002af19a876e69 in opal_memory_ptmalloc2_int_malloc () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#9  0x00002af19a877c4f in opal_memory_ptmalloc2_int_memalign () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#10 0x00002af19a8788a3 in opal_memory_ptmalloc2_memalign () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#11 0x00002af19a29e0f4 in ompi_free_list_grow () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#12 0x00002af1a0718546 in append_frag_to_list () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#13 0x00002af1a0718cbe in match_one () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#14 0x00002af1a07190f3 in mca_pml_ob1_recv_frag_callback_match () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#15 0x00002af19fab4a48 in btl_openib_handle_incoming () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#16 0x00002af19fab5e1f in poll_device () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#17 0x00002af19fab618c in btl_openib_component_progress () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#18 0x00002af19a801f8a in opal_progress () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#19 0x00002af19a2b7a0d in ompi_request_default_wait_all () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#20 0x00002af1a17afef2 in ompi_coll_tuned_sendrecv_nonzero_actual () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_coll_tuned.so
#21 0x00002af1a17b7542 in ompi_coll_tuned_alltoallv_intra_pairwise () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_coll_tuned.so
#22 0x00002af19a2c9419 in PMPI_Alltoallv () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#23 0x00002af19a05f2a2 in pmpi_alltoallv__ () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.2
#24 0x0000000000416213 in m_alltoall_i (comm=..., xsnd=..., psnd=Cannot access 
memory at address 0x51
) at mpi.F:1906
#25 0x00000000029ca135 in mapset (grid=...) at fftmpi_map.F:267
#26 0x0000000002a15c62 in vamp () at main.F:2002
#27 0x000000000041281e in main ()
#28 0x000000358941ed1d in __libc_start_main () from /lib64/libc.so.6
#29 0x0000000000412729 in _start ()
(gdb) quit

Or a segfault that looks like this:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
vasp.gamma_para.i  0000000002C7B031  Unknown               Unknown  Unknown
vasp.gamma_para.i  0000000002C7916B  Unknown               Unknown  Unknown
vasp.gamma_para.i  0000000002BECFF4  Unknown               Unknown  Unknown
vasp.gamma_para.i  0000000002BECE06  Unknown               Unknown  Unknown
vasp.gamma_para.i  0000000002B89827  Unknown               Unknown  Unknown
vasp.gamma_para.i  0000000002B90A50  Unknown               Unknown  Unknown
libpthread-2.12.s  0000003FED60F7E0  Unknown               Unknown  Unknown
libopen-pal.so.6.  00002AF7775346FC  Unknown               Unknown  Unknown
libopen-pal.so.6.  00002AF777536E69  opal_memory_ptmal     Unknown  Unknown
libopen-pal.so.6.  00002AF777537C4F  opal_memory_ptmal     Unknown  Unknown
libopen-pal.so.6.  00002AF7775388A3  opal_memory_ptmal     Unknown  Unknown
libmlx4-rdmav2.so  00002AF77EE87242  Unknown               Unknown  Unknown
libmlx4-rdmav2.so  00002AF77EE8979F  Unknown               Unknown  Unknown
libmlx4-rdmav2.so  00002AF77EE89AD6  Unknown               Unknown  Unknown
libibverbs.so.1.0  00002AF77CBFFDD2  ibv_create_qp         Unknown  Unknown
mca_btl_openib.so  00002AF77C7D15C5  Unknown               Unknown  Unknown
mca_btl_openib.so  00002AF77C7D4088  Unknown               Unknown  Unknown
mca_btl_openib.so  00002AF77C7C6CAD  mca_btl_openib_en     Unknown  Unknown
mca_pml_ob1.so     00002AF77D42D7F6  mca_pml_ob1_send_     Unknown  Unknown
mca_pml_ob1.so     00002AF77D424279  mca_pml_ob1_isend     Unknown  Unknown
mca_coll_tuned.so  00002AF77E4BDECB  ompi_coll_tuned_s     Unknown  Unknown
mca_coll_tuned.so  00002AF77E4C5542  ompi_coll_tuned_a     Unknown  Unknown
libmpi.so.1.6.0    00002AF776F89419  PMPI_Alltoallv        Unknown  Unknown
libmpi_mpifh.so.2  00002AF776D1F2A2  pmpi_alltoallv_       Unknown  Unknown
vasp.gamma_para.i  0000000000416213  m_alltoall_i_            1906  mpi.F
vasp.gamma_para.i  00000000029CA135  mapset_.R                 267  fftmpi_map.F
vasp.gamma_para.i  0000000002A15C62  MAIN__                   2002  main.F
vasp.gamma_para.i  000000000041281E  Unknown               Unknown  Unknown
libc-2.12.so       0000003FED21ED1D  __libc_start_main     Unknown  Unknown
vasp.gamma_para.i  0000000000412729  Unknown               Unknown  Unknown
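
A note on reading the two traces, offered as a sketch rather than a diagnosis: 
in the hang, frame #7 (malloc_consolidate) is the one that was running when the 
signal arrived, and the libopen-pal addresses in the segfault trace end in the 
same offsets (...46FC, ...6E69, ...7C4F, ...88A3), so both failures appear to 
start in the same allocator routine.  The small Python helper below pulls the 
interrupted frame out of gdb "where" output, which may help when comparing 
traces from several ranks.  The function name interrupted_frame and the regular 
expression are illustrative assumptions, not part of gdb or openmpi.

```python
import re

# Start of a gdb backtrace frame, e.g.
#   "#7  0x00002af19a8746fc in malloc_consolidate () from"
#   "#6  <signal handler called>"
# The address part is optional because gdb omits it for some frames.
FRAME_RE = re.compile(
    r"^#(\d+)\s+(?:0x[0-9a-fA-F]+\s+in\s+)?"
    r"(<signal handler called>|[A-Za-z_][\w.]*)"
)


def interrupted_frame(backtrace_text):
    """Return (frame_number, function_name) for the frame that was
    executing when the signal arrived, or None if the trace contains
    no "<signal handler called>" marker."""
    frames = []
    for line in backtrace_text.splitlines():
        m = FRAME_RE.match(line.strip())
        if m:  # wrapped continuation lines (library paths) are skipped
            frames.append((int(m.group(1)), m.group(2)))
    for i, (_, name) in enumerate(frames):
        if name == "<signal handler called>" and i + 1 < len(frames):
            return frames[i + 1]
    return None


# Excerpt of the hang trace above, including one wrapped path line.
sample = """\
#5  0x0000000002b90a50 in for__signal_handler ()
#6  <signal handler called>
#7  0x00002af19a8746fc in malloc_consolidate () from
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#8  0x00002af19a876e69 in opal_memory_ptmalloc2_int_malloc () from
"""

print(interrupted_frame(sample))
```

Running it on the saved output of "where" from each hung process shows whether 
they all stopped in the same routine.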

This is on a Linux InfiniBand system, using CentOS 6 and CentOS’s built-in 
OFED.  It’s possible that the crashes only started after a recent kernel update.

I’m in the process of recompiling openmpi 1.8.8 and the MPI-using code (vasp 
5.4.1), just to make sure everything’s clean, but I was wondering if anyone 
had ideas as to what might be causing this kind of behavior, or what other 
information would be useful for me to gather to figure out what’s going on.  
As I implied at the top, this setup has been working well for years, and I 
believe it has been entirely untouched (the openmpi library and executable, I 
mean, since we did just have a kernel update) since long before these crashes 
started.
        
                                                                thanks,
                                                                Noam
