valgrind output:

$ valgrind mpiexec -n 6 ./testphdf5
==8518== Memcheck, a memory error detector
==8518== Copyright (C) 2002-2015, and GNU GPL'd, by Julian Seward et al.
==8518== Using Valgrind-3.11.0 and LibVEX; rerun with -h for copyright info
==8518== Command: mpiexec -n 6 ./testphdf5
==8518==
==8518== Conditional jump or move depends on uninitialised value(s)
==8518==    at 0x401C724: index (in /usr/lib/ld-2.23.90.so)
==8518==
==8518== Conditional jump or move depends on uninitialised value(s)
==8518==    at 0x401C728: index (in /usr/lib/ld-2.23.90.so)
==8518==
==8518== Conditional jump or move depends on uninitialised value(s)
==8518==    at 0x4008C04: fillin_rpath (in /usr/lib/ld-2.23.90.so)
==8518==    by 0x4009423: _dl_init_paths (in /usr/lib/ld-2.23.90.so)
==8518==
==8518== Conditional jump or move depends on uninitialised value(s)
==8518==    at 0x4016D48: dl_open_worker (in /usr/lib/ld-2.23.90.so)
==8518==
==8518== Conditional jump or move depends on uninitialised value(s)
==8518==    at 0x4009858: _dl_map_object (in /usr/lib/ld-2.23.90.so)
==8518==    by 0x4016DA3: dl_open_worker (in /usr/lib/ld-2.23.90.so)
==8518==
==8518== Invalid read of size 4
==8518==    at 0x401C724: index (in /usr/lib/ld-2.23.90.so)
==8518==  Address 0x4d1b7bc is 1 bytes after a block of size 43 alloc'd
==8518==    at 0x4849584: malloc (vg_replace_malloc.c:299)
==8518==    by 0x4BCB75F: __vasprintf_chk (in /usr/lib/libc-2.23.90.so)
==8518==    by 0x4BCB633: __asprintf_chk (in /usr/lib/libc-2.23.90.so)
==8518==    by 0x49393E3: UnknownInlinedFun (stdio2.h:178)
==8518==    by 0x49393E3: dlopen_open (dl_dlopen_module.c:77)
==8518==    by 0x491B22B: open_component (mca_base_component_find.c:558)
==8518==    by 0x491C6C3: find_dyn_components (mca_base_component_find.c:446)
==8518==    by 0x491C6C3: mca_base_component_find (mca_base_component_find.c:190)
==8518==    by 0x4926D5F: mca_base_framework_components_register (mca_base_components_register.c:57)
==8518==    by 0x4927253: mca_base_framework_register (mca_base_framework.c:115)
==8518==    by 0x49272BB: mca_base_framework_open (mca_base_framework.c:134)
==8518==    by 0x48735D3: orte_init (orte_init.c:128)
==8518==    by 0x10C3F3: orterun (orterun.c:908)
==8518==    by 0x10B25F: main (main.c:13)
==8518==

I think this is mainly harmless, or at least not a problem in Open MPI itself. Then:

aborting MPI processes
[arm03-packager00.cloud.fedoraproject.org:08518] 4 more processes have sent help message help-mpi-api.txt / mpi-abort
[arm03-packager00.cloud.fedoraproject.org:08518] Set MCA parameter "orte_base_help_aggregate" to 0 to see all help / error messages
==8518== Syscall param write(buf) points to uninitialised byte(s)
==8518==    at 0x4ABA888: write (in /usr/lib/libpthread-2.23.90.so)
==8518==    by 0x50FAC9B: component_shutdown (oob_tcp_component.c:658)
==8518==    by 0x48A9F67: orte_oob_base_close (oob_base_frame.c:73)
==8518==    by 0x49273EF: mca_base_framework_close (mca_base_framework.c:198)
==8518==    by 0x50BC647: rte_finalize (ess_hnp_module.c:882)
==8518==    by 0x4873433: orte_finalize (orte_finalize.c:65)
==8518==    by 0x10D257: orterun (orterun.c:1151)
==8518==    by 0x10B25F: main (main.c:13)
==8518==  Address 0xbd828898 is on thread 1's stack
==8518==  in frame #1, created by component_shutdown (oob_tcp_component.c:647)
==8518==
==8518==
==8518== HEAP SUMMARY:
==8518==     in use at exit: 244,487 bytes in 773 blocks
==8518==   total heap usage: 14,898 allocs, 14,125 frees, 4,150,667 bytes allocated
==8518==
==8518== LEAK SUMMARY:
==8518==    definitely lost: 33,337 bytes in 23 blocks
==8518==    indirectly lost: 130,972 bytes in 20 blocks
==8518==      possibly lost: 2,368 bytes in 32 blocks
==8518==    still reachable: 77,810 bytes in 698 blocks
==8518==         suppressed: 0 bytes in 0 blocks
==8518== Rerun with --leak-check=full to see details of leaked memory
==8518==
==8518== For counts of detected and suppressed errors, rerun with: -v
==8518== Use --track-origins=yes to see where uninitialised values come from
==8518== ERROR SUMMARY: 310 errors from 8 contexts (suppressed: 0 from 0)

On 06/30/2016 11:59 AM, Ralph Castain wrote:
> So the application procs are all gone, but mpiexec isn’t exiting? I’d suggest
> running valgrind, given the corruption.
>
>> On Jun 30, 2016, at 10:21 AM, Orion Poplawski <or...@cora.nwra.com> wrote:
>>
>> On 06/30/2016 10:33 AM, Orion Poplawski wrote:
>>> No, just mpiexec is running. single node. Only see it when the test is
>>> executed with "make check", not seeing it if I just run mpiexec -n 6
>>> ./testphdf5 by hand.
>>
>>
>> Hmm, now I'm seeing it running mpiexec by hand. Trying to check it via gdb
>> indicates a corrupted stack:
>>
>>
>> (gdb) bt
>> #0  0xb6cd8ac4 in poll () from /lib/libc.so.6
>> #1  0x00000000 in ?? ()
>> Backtrace stopped: previous frame identical to this frame (corrupt stack?)
>>
>> Any other tracing I can turn on?
>>
>> --
>> Orion Poplawski
>> Technical Manager                     303-415-9701 x222
>> NWRA, Boulder/CoRA Office             FAX: 303-415-9702
>> 3380 Mitchell Lane                       or...@nwra.com
>> Boulder, CO 80301                   http://www.nwra.com
>> _______________________________________________
>> users mailing list
>> us...@open-mpi.org
>> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
>> Link to this post:
>> http://www.open-mpi.org/community/lists/users/2016/06/29578.php
>
> _______________________________________________
> users mailing list
> us...@open-mpi.org
> Subscription: https://www.open-mpi.org/mailman/listinfo.cgi/users
> Link to this post:
> http://www.open-mpi.org/community/lists/users/2016/06/29579.php
>

--
Orion Poplawski
Technical Manager                     303-415-9701 x222
NWRA, Boulder/CoRA Office             FAX: 303-415-9702
3380 Mitchell Lane                       or...@nwra.com
Boulder, CO 80301                   http://www.nwra.com