Re: [OMPI users] Hang in MPI_Abort

Orion Poplawski Thu, 30 Jun 2016 16:55:36 -0400 (EDT)

valgrind output:

$ valgrind mpiexec -n 6 ./testphdf5
==8518== Memcheck, a memory error detector
==8518== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==8518== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==8518== Command: mpiexec -n 6 ./testphdf5
==8518==
==8518== Conditional jump or move depends on uninitialised value(s)
==8518==    at 0x401C724: index (in /usr/lib/ld-2.23.90.so)
==8518==
==8518== Conditional jump or move depends on uninitialised value(s)
==8518==    at 0x401C728: index (in /usr/lib/ld-2.23.90.so)
==8518==
==8518== Conditional jump or move depends on uninitialised value(s)
==8518==    at 0x4008C04: fillin_rpath (in /usr/lib/ld-2.23.90.so)
==8518==    by 0x4009423: _dl_init_paths (in /usr/lib/ld-2.23.90.so)
==8518==
==8518== Conditional jump or move depends on uninitialised value(s)
==8518==    at 0x4016D48: dl_open_worker (in /usr/lib/ld-2.23.90.so)
==8518==
==8518== Conditional jump or move depends on uninitialised value(s)
==8518==    at 0x4009858: _dl_map_object (in /usr/lib/ld-2.23.90.so)
==8518==    by 0x4016DA3: dl_open_worker (in /usr/lib/ld-2.23.90.so)
==8518==
==8518== Invalid read of size 4
==8518==    at 0x401C724: index (in /usr/lib/ld-2.23.90.so)
==8518==  Address 0x4d1b7bc is 1 bytes after a block of size 43 alloc'd
==8518==    at 0x4849584: malloc (vg_replace_malloc.c:299)
==8518==    by 0x4BCB75F: __vasprintf_chk (in /usr/lib/libc-2.23.90.so)
==8518==    by 0x4BCB633: __asprintf_chk (in /usr/lib/libc-2.23.90.so)
==8518==    by 0x49393E3: UnknownInlinedFun (stdio2.h:178)
==8518==    by 0x49393E3: dlopen_open (dl_dlopen_module.c:77)
==8518==    by 0x491B22B: open_component (mca_base_component_find.c:558)
==8518==    by 0x491C6C3: find_dyn_components (mca_base_component_find.c:446)
==8518==    by 0x491C6C3: mca_base_component_find (mca_base_component_find.c:190)
==8518==    by 0x4926D5F: mca_base_framework_components_register (mca_base_components_register.c:57)
==8518==    by 0x4927253: mca_base_framework_register (mca_base_framework.c:115)
==8518==    by 0x49272BB: mca_base_framework_open (mca_base_framework.c:134)
==8518==    by 0x48735D3: orte_init (orte_init.c:128)
==8518==    by 0x10C3F3: orterun (orterun.c:908)
==8518==    by 0x10B25F: main (main.c:13)
==8518==


I think these are mostly harmless, or at least not caused by Open MPI itself.

Then:

aborting MPI processes
[arm03-packager00.cloud.fedoraproject.org:08518] 4 more processes have sent help message help-mpi-api.txt / mpi-abort
[arm03-packager00.cloud.fedoraproject.org:08518] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
==8518== Syscall param write(buf) points to uninitialised byte(s)
==8518==    at 0x4ABA888: write (in /usr/lib/libpthread-2.23.90.so)
==8518==    by 0x50FAC9B: component_shutdown (oob_tcp_component.c:658)
==8518==    by 0x48A9F67: orte_oob_base_close (oob_base_frame.c:73)
==8518==    by 0x49273EF: mca_base_framework_close (mca_base_framework.c:198)
==8518==    by 0x50BC647: rte_finalize (ess_hnp_module.c:882)
==8518==    by 0x4873433: orte_finalize (orte_finalize.c:65)
==8518==    by 0x10D257: orterun (orterun.c:1151)
==8518==    by 0x10B25F: main (main.c:13)
==8518==  Address 0xbd828898 is on thread 1's stack
==8518==  in frame #1, created by component_shutdown (oob_tcp_component.c:647)
==8518==
==8518==
==8518== HEAP SUMMARY:
==8518==     in use at exit: 244,487 bytes in 773 blocks
==8518==   total heap usage: 14,898 allocs, 14,125 frees, 4,150,667 bytes allocated
==8518==
==8518== LEAK SUMMARY:
==8518==    definitely lost: 33,337 bytes in 23 blocks
==8518==    indirectly lost: 130,972 bytes in 20 blocks
==8518==      possibly lost: 2,368 bytes in 32 blocks
==8518==    still reachable: 77,810 bytes in 698 blocks
==8518==         suppressed: 0 bytes in 0 blocks
==8518== Rerun with --leak-check=full to see details of leaked memory
==8518==
==8518== For counts of detected and suppressed errors, rerun with: -v
==8518== Use --track-origins=yes to see where uninitialised values come from
==8518== ERROR SUMMARY: 310 errors from 8 contexts (suppressed: 0 from 0)
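
A note on the write(buf) report above: Memcheck prints "Syscall param write(buf) points to uninitialised byte(s)" whenever any byte of the buffer handed to write() has never been assigned, and that includes struct padding. The sketch below is a generic C illustration of that class of warning and one common way to avoid it. It is not taken from oob_tcp_component.c; the names wake_msg, send_wake and recv_wake are hypothetical, and whether the same pattern is behind the report in component_shutdown is an assumption that would need checking against the source.

```c
#include <string.h>
#include <sys/types.h>
#include <unistd.h>

/* Hypothetical wake-up message, for illustration only
 * (not the structure Open MPI actually writes). */
struct wake_msg {
    char kind;   /* 1 byte; on common ABIs followed by 3 padding bytes */
    int  value;
};

/* Send one wake-up message down fd. Returns 0 on success, -1 on error. */
int send_wake(int fd, int value)
{
    struct wake_msg msg;

    /* Zero the whole object, padding included, before filling it in.
     * Without this memset the padding bytes stay uninitialised and
     * Memcheck flags the write() below. */
    memset(&msg, 0, sizeof msg);
    msg.kind = 'w';
    msg.value = value;

    return write(fd, &msg, sizeof msg) == (ssize_t)sizeof msg ? 0 : -1;
}

/* Read one wake-up message from fd. Returns 0 on success, -1 on error. */
int recv_wake(int fd, int *value)
{
    struct wake_msg msg;

    if (read(fd, &msg, sizeof msg) != (ssize_t)sizeof msg)
        return -1;
    if (msg.kind != 'w')
        return -1;
    *value = msg.value;
    return 0;
}
```

Called on the two ends of a pipe(), send_wake(fds[1], 42) followed by recv_wake(fds[0], &v) leaves v set to 42. Removing the memset should bring the Memcheck report back on ABIs where the struct has padding, which is an easy way to confirm that a given warning is of this kind.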


On 06/30/2016 11:59 AM, Ralph Castain wrote:
> So the application procs are all gone, but mpiexec isn’t exiting? I’d suggest 
> running valgrind, given the corruption.
> 
>> On Jun 30, 2016, at 10:21 AM, Orion Poplawski <or...@cora.nwra.com> wrote:
>>
>> On 06/30/2016 10:33 AM, Orion Poplawski wrote:
>>> No, just mpiexec is running, on a single node.  I only see it when the
>>> test is executed with "make check"; I don't see it if I just run
>>> mpiexec -n 6 ./testphdf5 by hand.
>>
>>
>> Hmm, now I'm seeing it when running mpiexec by hand as well.  Checking it
>> with gdb suggests a corrupted stack:
>>
>>
>> (gdb) bt
>> #0  0xb6cd8ac4 in poll () from /lib/libc.so.6
>> #1  0x00000000 in ?? ()
>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>
>> Any other tracing I can turn on?
>>
>> -- 
>> Orion Poplawski
>> Technical Manager                     303-415-9701 x222
>> NWRA, Boulder/CoRA Office             FAX: 303-415-9702
>> 3380 Mitchell Lane                       or...@nwra.com
>> Boulder, CO 80301                   http://www.nwra.com
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post: 
>> http://www.open-mpi.org/community/lists/users/2016/06/29578.php
> 
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post: 
> http://www.open-mpi.org/community/lists/users/2016/06/29579.php
> 


-- 
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       or...@nwra.com
Boulder, CO 80301                   http://www.nwra.com
