[OMPI users] malloc related crash inside openmpi

2016-11-17 Thread Noam Bernstein
Hi - over the last few days we’ve started seeing crashes and hangs in openmpi, 
in a code that hasn’t been touched in months, and an openmpi installation (v. 
1.8.5) that also hasn’t been touched in months.  The symptoms are either a 
hang, with a stack trace (from attaching gdb to the one running process that’s 
at 0% CPU usage) that looks like this:
(gdb) where
#0  0x00358980f00d in nanosleep () from /lib64/libpthread.so.0
#1  0x2af19a8758de in opal_memory_ptmalloc2_free () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#2  0x02bca106 in for__free_vm ()
#3  0x02b8cf62 in for__exit_handler ()
#4  0x02b89782 in for__issue_diagnostic ()
#5  0x02b90a50 in for__signal_handler ()
#6  <signal handler called>
#7  0x2af19a8746fc in malloc_consolidate () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#8  0x2af19a876e69 in opal_memory_ptmalloc2_int_malloc () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#9  0x2af19a877c4f in opal_memory_ptmalloc2_int_memalign () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#10 0x2af19a8788a3 in opal_memory_ptmalloc2_memalign () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#11 0x2af19a29e0f4 in ompi_free_list_grow () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#12 0x2af1a0718546 in append_frag_to_list () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#13 0x2af1a0718cbe in match_one () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#14 0x2af1a07190f3 in mca_pml_ob1_recv_frag_callback_match () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_pml_ob1.so
#15 0x2af19fab4a48 in btl_openib_handle_incoming () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#16 0x2af19fab5e1f in poll_device () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#17 0x2af19fab618c in btl_openib_component_progress () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_btl_openib.so
#18 0x2af19a801f8a in opal_progress () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libopen-pal.so.6
#19 0x2af19a2b7a0d in ompi_request_default_wait_all () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#20 0x2af1a17afef2 in ompi_coll_tuned_sendrecv_nonzero_actual () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_coll_tuned.so
#21 0x2af1a17b7542 in ompi_coll_tuned_alltoallv_intra_pairwise () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/openmpi/mca_coll_tuned.so
#22 0x2af19a2c9419 in PMPI_Alltoallv () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi.so.1
#23 0x2af19a05f2a2 in pmpi_alltoallv__ () from 
/usr/local/openmpi/1.8.5/x86_64/ib/intel/12.1.6/lib/libmpi_mpifh.so.2
#24 0x00416213 in m_alltoall_i (comm=..., xsnd=..., psnd=Cannot access 
memory at address 0x51
) at mpi.F:1906
#25 0x029ca135 in mapset (grid=...) at fftmpi_map.F:267
#26 0x02a15c62 in vamp () at main.F:2002
#27 0x0041281e in main ()
#28 0x00358941ed1d in __libc_start_main () from /lib64/libc.so.6
#29 0x00412729 in _start ()
(gdb) quit

Or a segfault that looks like this:

forrtl: severe (174): SIGSEGV, segmentation fault occurred
Image              PC                Routine            Line        Source
vasp.gamma_para.i  02C7B031  Unknown   Unknown  Unknown
vasp.gamma_para.i  02C7916B  Unknown   Unknown  Unknown
vasp.gamma_para.i  02BECFF4  Unknown   Unknown  Unknown
vasp.gamma_para.i  02BECE06  Unknown   Unknown  Unknown
vasp.gamma_para.i  02B89827  Unknown   Unknown  Unknown
vasp.gamma_para.i  02B90A50  Unknown   Unknown  Unknown
libpthread-2.12.s  003FED60F7E0  Unknown   Unknown  Unknown
libopen-pal.so.6.  2AF7775346FC  Unknown   Unknown  Unknown
libopen-pal.so.6.  2AF777536E69  opal_memory_ptmal Unknown  Unknown
libopen-pal.so.6.  2AF777537C4F  opal_memory_ptmal Unknown  Unknown
libopen-pal.so.6.  2AF7775388A3  opal_memory_ptmal Unknown  Unknown
libmlx4-rdmav2.so  2AF77EE87242  Unknown   Unknown  Unknown
libmlx4-rdmav2.so  2AF77EE8979F  Unknown   Unknown  Unknown
libmlx4-rdmav2.so  2AF77EE89AD6  Unknown   Unknown  Unknown
libibverbs.so.1.0  2AF77CBFFDD2  ibv_create_qp Unknown  Unknown
mca_btl_openib.so  2AF77C7D15C5  Unknown   Unknown  Unknown
mca_btl_openib.so  2AF77C7D4088  Unknown   Unknown  Unknown
mca_btl_openib.so  2AF77C7C6CAD  mca_btl_openib_en Unknown  Unknown
mca_pml_ob1.so 2AF77D42D7F6  mca_pml_ob1_send_ Unk
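
For reference, a back trace like the one above can be captured from a hung rank 
roughly as follows (a minimal sketch; the process name comes from the traceback 
above, and the PID placeholder is specific to the run):

# identify the rank that is sitting at 0% CPU, then attach gdb to it
top -b -n 1 | grep vasp
gdb -batch -p <pid-of-idle-rank> -ex "bt" -ex detach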

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
> On Nov 17, 2016, at 3:22 PM, Noam Bernstein  
> wrote:
> 
> Hi - we’ve started seeing over the last few days crashes and hangs in 
> openmpi, in a code that hasn’t been touched in months, and an openmpi 
> installation (v. 1.8.5) that also hasn’t been touched in months.  The 
> symptoms are either a hang, with a stack trace (from attaching to the one 
> running process that’s got 0% CPU usage) that looks like this:
> .
> .
> .
> .
> I’m in the process of recompiling openmpi 1.8.8 and the mpi-using code (vasp 
> 5.4.1), just to make sure everything’s clean, but I was just wondering if 
> anyone had any ideas as to what might even be causing this kind of behavior, 
> or what other information might be useful for me to gather to figure out 
> what’s going on.  As I implied at the top, this setup’s been working well for 
> years, and I believe entirely untouched (the openmpi library and executable, 
> I mean, since we did just have a kernel update) for far longer than these 
> crashes.
>   


No one has any suggestions about this problem?  I tried openmpi 1.8.8, and a 
newer version of Mellanox’s OFED, and behavior is the same.  

Does anyone who knows the guts of MPI have any idea whether this even looks 
like an openmpi problem (as opposed to lower level, e.g. the InfiniBand drivers, 
or higher level, i.e. the calling code), from the stack traces I posted earlier?
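
One way to test whether the ptmalloc2 hooks visible in these traces are involved 
at all is to build a copy of the same Open MPI release with its internal memory 
manager disabled and rerun the job.  This is only a diagnostic sketch (the prefix 
is illustrative; the other options match the configure line quoted later in the 
thread), and disabling the memory manager can cost registered-memory performance 
over InfiniBand:

# rebuild the same release without the ptmalloc2-based memory hooks
./configure --prefix=$HOME/openmpi-1.8.8-nomm \
    --without-memory-manager \
    --with-tm=/usr/local/torque \
    --with-verbs=/usr --with-verbs-libdir=/usr/lib64 \
    --enable-mpirun-prefix-by-default
make && make install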


Noam


Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread George Bosilca
Noam,

I do not recall exactly which version of Open MPI was affected, but we had
some issues with the non-reentrancy of our memory allocator. More recent
versions (1.10 and 2.0) will not have this issue. Can you update to a newer
version of Open MPI (1.10 or maybe 2.0) and see if you can reproduce it?
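
Before re-testing with a different release it is also worth confirming which 
installation the job actually resolves to; a rough check (the executable name 
is taken from the traceback in the first message):

which mpirun
mpirun --version                                      # or: ompi_info | head
ldd ./vasp.gamma_para.i | grep -E 'libmpi|open-pal'   # which MPI libraries the binary picks up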

Thanks,
  George.



On Wed, Nov 23, 2016 at 11:44 AM, Noam Bernstein <
noam.bernst...@nrl.navy.mil> wrote:

> On Nov 17, 2016, at 3:22 PM, Noam Bernstein 
> wrote:
>
> Hi - we’ve started seeing over the last few days crashes and hangs in
> openmpi, in a code that hasn’t been touched in months, and an openmpi
> installation (v. 1.8.5) that also hasn’t been touched in months.  The
> symptoms are either a hang, with a stack trace (from attaching to the one
> running process that’s got 0% CPU usage) that looks like this:
>
> .
>
> .
> .
> .
>
> I’m in the process of recompiling openmpi 1.8.8 and the mpi-using code
> (vasp 5.4.1), just to make sure everything’s clean, but I was just
> wondering if anyone had any ideas as to what might even be causing this
> kind of behavior, or what other information might be useful for me to
> gather to figure out what’s going on.  As I implied at the top, this
> setup’s been working well for years, and I believe entirely untouched (the
> openmpi library and executable, I mean, since we did just have a kernel
> update) for far longer than these crashes.
>
>
>
> No one has any suggestions about this problem?  I tried openmpi 1.8.8, and
> a newer version of Mellanox’s OFED, and behavior is the same.
>
> Does anyone who knows the guts of mpi have any ideas whether this even
> looks like an openmpi problem (as opposed to lower level, i.e. infiniband
> drivers, or higher level, i.e. calling code), from the stack traces I
> posted earlier?
>
> Noam
>

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
> On Nov 23, 2016, at 3:02 PM, George Bosilca  wrote:
> 
> Noam,
> 
> I do not recall exactly which version of Open MPI was affected, but we had 
> some issues with the non-reentrancy of our memory allocator. More recent 
> versions (1.10 and 2.0) will not have this issue. Can you update to a newer 
> version of Open MPI (1.10 or maybe 2.0) and see if you can reproduce it?

Interesting.  I just tried 2.0.1 and it does seem to have fixed the problem, 
although it’s so far from deterministic that I can’t say this with full 
confidence yet.

Is there any general advice on the merits of going to 1.10 vs. 2.0 (from 1.8)?

thanks,
Noam



Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein
> On Nov 23, 2016, at 3:08 PM, Noam Bernstein  
> wrote:
> 
>> On Nov 23, 2016, at 3:02 PM, George Bosilca wrote:
>> 
>> Noam,
>> 
>> I do not recall exactly which version of Open MPI was affected, but we had 
>> some issues with the non-reentrancy of our memory allocator. More recent 
>> versions (1.10 and 2.0) will not have this issue. Can you update to a newer 
>> version of Open MPI (1.10 or maybe 2.0) and see if you can reproduce it?
> 
> Interesting.  I just tried 2.0.1 and it does seem to have fixed the problem, 
> although it’s so far from deterministic that I can’t say this with full 
> confidence yet.

No, I spoke too soon.  It fails in the same way with 2.0.1.  I guess I’ll try 
1.10 just in case.

Noam



Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread George Bosilca
Thousands of reasons ;)

https://raw.githubusercontent.com/open-mpi/ompi/v2.x/NEWS

  George.



On Wed, Nov 23, 2016 at 1:08 PM, Noam Bernstein  wrote:

> On Nov 23, 2016, at 3:02 PM, George Bosilca  wrote:
>
> Noam,
>
> I do not recall exactly which version of Open MPI was affected, but we had
> some issues with the non-reentrancy of our memory allocator. More recent
> versions (1.10 and 2.0) will not have this issue. Can you update to a newer
> version of Open MPI (1.10 or maybe 2.0) and see if you can reproduce it?
>
>
> Interesting.  I just tried 2.0.1 and it does seem to have fixed the
> problem, although it’s so far from deterministic that I can’t say this with
> full confidence yet.
>
> Is there any general advice on the merits of going to 1.10 vs. 2.0 (from
> 1.8)?
>
> thanks,
> Noam
>

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein

> On Nov 23, 2016, at 3:45 PM, George Bosilca  wrote:
> 
> Thousands of reasons ;)

Still trying to check if 2.0.1 fixes the problem, and discovered that earlier 
runs weren’t actually using the version I intended.  When I do use 2.0.1, I get 
the following errors:
--
A requested component was not found, or was unable to be opened.  This
means that this component is either not installed or is unable to be
used on your system (e.g., sometimes this means that shared libraries
that the component requires are unable to be found/loaded).  Note that
Open MPI stopped checking at the first component that it did not find.

Host:  compute-1-35
Framework: ess
Component: pmi
--
--
It looks like orte_init failed for some reason; your parallel process is
likely to abort.  There are many reasons that a parallel process can
fail during orte_init; some of which are due to configuration or
environment problems.  This failure appears to be an internal failure;
here's some additional information (which may only be relevant to an
Open MPI developer):

  orte_ess_base_open failed
  --> Returned value Not found (-13) instead of ORTE_SUCCESS
--

I’ve confirmed that mpirun PATH and LD_LIBRARY_PATH are pointing to 2.0.1 
version of things within the job script.  Configure line is as I’ve used for 
1.8.x, i.e.
export CC=gcc
export CXX=g++
export F77=ifort
export FC=ifort 

./configure \
--prefix=${DEST} \
--with-tm=/usr/local/torque \
--enable-mpirun-prefix-by-default \
--with-verbs=/usr \
--with-verbs-libdir=/usr/lib64
Followed by “make install”.  Any suggestions for getting 2.0.1 working?
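
To rule out an incomplete install, a rough check of what actually landed under 
the new prefix (${DEST} as in the configure line above) might be:

ls ${DEST}/bin/mpirun ${DEST}/lib/libmpi.so*
ls ${DEST}/lib/openmpi | grep ess    # MCA components; the error above refers to framework "ess"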

thanks,
Noam


Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 

Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread r...@open-mpi.org
It looks like the library may not have been fully installed on that node - can 
you see if the prefix location is present, and that the LD_LIBRARY_PATH on that 
node is correctly set? The referenced component did not exist prior to the 2.0 
series, so I’m betting that your LD_LIBRARY_PATH isn’t correct on that node.
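
A rough way to check that directly on the node named in the error, assuming ssh 
access to the compute nodes (the 2.0.1 prefix below is a placeholder):

# note: a non-interactive ssh shell may not pick up the same startup files as the job environment
ssh compute-1-35 'echo $LD_LIBRARY_PATH'
ssh compute-1-35 'ls /path/to/2.0.1-prefix/lib/openmpi | grep ess'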


> On Nov 23, 2016, at 2:21 PM, Noam Bernstein  
> wrote:
> 
> 
>> On Nov 23, 2016, at 3:45 PM, George Bosilca wrote:
>> 
>> Thousands reasons ;)
> 
> Still trying to check if 2.0.1 fixes the problem, and discovered that earlier 
> runs weren’t actually using the version I intended.  When I do use 2.0.1, I 
> get the following errors:
> --
> A requested component was not found, or was unable to be opened.  This
> means that this component is either not installed or is unable to be
> used on your system (e.g., sometimes this means that shared libraries
> that the component requires are unable to be found/loaded).  Note that
> Open MPI stopped checking at the first component that it did not find.
> 
> Host:  compute-1-35
> Framework: ess
> Component: pmi
> --
> --
> It looks like orte_init failed for some reason; your parallel process is
> likely to abort.  There are many reasons that a parallel process can
> fail during orte_init; some of which are due to configuration or
> environment problems.  This failure appears to be an internal failure;
> here's some additional information (which may only be relevant to an
> Open MPI developer):
> 
>   orte_ess_base_open failed
>   --> Returned value Not found (-13) instead of ORTE_SUCCESS
> --
> 
> I’ve confirmed that mpirun PATH and LD_LIBRARY_PATH are pointing to 2.0.1 
> version of things within the job script.  Configure line is as I’ve used for 
> 1.8.x, i.e.
> export CC=gcc
> export CXX=g++
> export F77=ifort
> export FC=ifort 
> 
> ./configure \
> --prefix=${DEST} \
> --with-tm=/usr/local/torque \
> --enable-mpirun-prefix-by-default \
> --with-verbs=/usr \
> --with-verbs-libdir=/usr/lib64
> Followed by “make install”.  Any suggestions for getting 2.0.1 working?
> 
>   thanks,
>   Noam
> 
> 


Re: [OMPI users] malloc related crash inside openmpi

2016-11-23 Thread Noam Bernstein

> On Nov 23, 2016, at 5:26 PM, r...@open-mpi.org wrote:
> 
> It looks like the library may not have been fully installed on that node - 
> can you see if the prefix location is present, and that the LD_LIBRARY_PATH 
> on that node is correctly set? The referenced component did not exist prior 
> to the 2.0 series, so I’m betting that your LD_LIBRARY_PATH isn’t correct on 
> that node.

The LD_LIBRARY_PATH is definitely correct on the node that’s running the 
mpirun, I checked that, and the openmpi directory is supposedly NFS mounted 
everywhere.  I suppose the installation may not have fully worked and I didn’t 
notice.  What’s the name of the library it’s looking for?


Noam



Noam Bernstein, Ph.D.
Center for Materials Physics and Technology
U.S. Naval Research Laboratory
T +1 202 404 8628  F +1 202 404 7546
https://www.nrl.navy.mil 

Re: [OMPI users] malloc related crash inside openmpi

2016-11-24 Thread r...@open-mpi.org
Just to be clear: are you saying that mpirun exits with that message? Or is 
your application process exiting with it?

There is no reason for mpirun to be looking for that library.

The library in question is in the <prefix>/lib/openmpi directory, and is named 
mca_ess_pmi.[la,so]
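
A quick sketch for checking that the file is present on every node of a Torque 
job (the prefix is a placeholder, and PBS_NODEFILE is assumed to be set by the 
batch system):

for h in $(sort -u $PBS_NODEFILE); do
    ssh $h "ls -l /path/to/2.0.1-prefix/lib/openmpi/mca_ess_pmi.so" || echo "missing on $h"
done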


> On Nov 23, 2016, at 2:31 PM, Noam Bernstein  
> wrote:
> 
> 
>> On Nov 23, 2016, at 5:26 PM, r...@open-mpi.org  
>> wrote:
>> 
>> It looks like the library may not have been fully installed on that node - 
>> can you see if the prefix location is present, and that the LD_LIBRARY_PATH 
>> on that node is correctly set? The referenced component did not exist prior 
>> to the 2.0 series, so I’m betting that your LD_LIBRARY_PATH isn’t correct on 
>> that node.
> 
> The LD_LIBRARY_PATH is definitely correct on the node that’s running the 
> mpirun, I checked that, and the openmpi directory is supposedly NFS mounted 
> everywhere.  I suppose the installation may not have fully worked and I didn’t 
> notice.  What’s the name of the library it’s looking for?
> 
>   
> Noam
> 
> 
> 


Re: [OMPI users] malloc related crash inside openmpi

2016-11-25 Thread Noam Bernstein
> On Nov 24, 2016, at 10:52 AM, r...@open-mpi.org wrote:
> 
> Just to be clear: are you saying that mpirun exits with that message? Or is 
> your application process exiting with it?
> 
> There is no reason for mpirun to be looking for that library.
> 
> The library in question is in the <prefix>/lib/openmpi directory, and is 
> named mca_ess_pmi.[la,so]
> 

Looks like this openmpi 2 crash was a matter of not using the correctly linked 
executable on all nodes.  Now that that’s straightened out, I think it’s all 
working, and it apparently even fixed my malloc-related crash, so perhaps the 
allocator fix in 2.0.1 really is addressing the problem.

Thank you all for the help.

Noam

Re: [OMPI users] malloc related crash inside openmpi

2016-11-28 Thread Jeff Squyres (jsquyres)
> On Nov 25, 2016, at 11:20 AM, Noam Bernstein  
> wrote:
> 
> Looks like this openmpi 2 crash was a matter of not using the correctly 
> linked executable on all nodes.  Now that that’s straightened out, I think it’s 
> all working, and it apparently even fixed my malloc-related crash, so perhaps 
> the allocator fix in 2.0.1 really is addressing the problem.

Glad you got it working!

One final note: the error message you saw is typical when there's more than one 
version of Open MPI installed into the same directory tree.  Check out this FAQ 
item for more detail:

https://www.open-mpi.org/faq/?category=building#install-overwrite
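
In practice that usually means either wiping the old copy before running "make 
install" again, or giving each build its own prefix; a sketch, with illustrative 
paths:

# either remove the previous copy before reinstalling into the same prefix
rm -rf /usr/local/openmpi/2.0.1
make install
# or keep each rebuild in its own prefix so the trees can never mix,
# e.g. --prefix=/usr/local/openmpi/2.0.1-rebuild on the configure line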

-- 
Jeff Squyres
jsquy...@cisco.com
For corporate legal information go to: 
http://www.cisco.com/web/about/doing_business/legal/cri/
