On Wed, Jul 22, 2015 at 8:39 AM, Michael Augspurger < [email protected]> wrote:
> Hello:
>
> I'm having a problem that I'm having a rough time diagnosing. My CFD
> simulation code will run for a long time, sometimes up to 10K steps, and
> then suddenly I'll get a SEGV error (If I run the same simulation again,
> I'll get the same error, but always at a different time step, sometimes
> thousands of steps different). There's nothing obvious going wrong in the
> simulation at the time. Valgrind points to various internal petsc
> operations, but the trail doesn't lead back to any part of my code, so I'm
> not sure where to go next.
>
> Any advice or experience about where I can continue my investigation into
> this failure? Thanks for any help,
>
> Mike Augspurger
>
> Here's part of the error code with valgrind:
>
> Residual norms for pres_redistribute_ solve.
> 0 KSP Residual norm 2.343992292214e+00
> 1 KSP Residual norm 3.714369184378e-01
> 2 KSP Residual norm 5.045817070946e-02
> [2]PETSC ERROR:
> ------------------------------------------------------------------------
> [2]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> [2]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [2]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [2]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS
> X to find memory corruption errors
> [2]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and
> run
> [2]PETSC ERROR: to get more information on the crash.
> [2]PETSC ERROR: User provided function() line 0 in unknown file
> ==11381==
> ==11381== Process terminating with default action of signal 11 (SIGSEGV)
> ==11381==  General Protection Fault
> ==11381==    at 0x926047B: __intel_sse2_strcat (in
> /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
> ==11381==    by 0x817475E: opal_os_path (os_path.c:99)
> ==11381==    by 0x817B9B0: opal_show_help_vstring (show_help.c:153)
> ==11381==    by 0x80F7878: orte_show_help (show_help.c:566)
> ==11381==    by 0x80A7FFC: warn_fork_cb (ompi_mpi_init.c:139)
> ==11381==    by 0x3E4549A285: fork (in /lib64/libc-2.5.so)
> ==11381==    by 0x4D6A793: PetscAttachDebugger (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x4D6B93E: PetscAttachDebuggerErrorHandler (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x4D6E5BC: PetscError (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x4D70024: PetscSignalHandlerDefault (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x4D6F9F3: PetscSignalHandler_Private (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
>

Can you rerun this without -on_error_attach_debugger? And send ALL the
output. We need to see what valgrind thinks is the real problem.

  Thanks,

     Matt

> ==11381==    by 0x3E4543002F: ??? (in /lib64/libc-2.5.so)
> ==11381==
> ==11381== HEAP SUMMARY:
> ==11381==     in use at exit: 50,117,187 bytes in 90,510 blocks
> ==11381==   total heap usage: 136,533,481 allocs, 136,442,971 frees,
> 85,265,006,726 bytes allocated
> ==11381==
> ==11381== 2 bytes in 1 blocks are definitely lost in loss record 5 of 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x926098D: __intel_sse2_strdup (in
> /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
> ==11381==    by 0x5F6574726F5F4142: ???
> ==11381==    by 0x747365725F6D756D: ???
> ==11381==    by 0x4F00303D73747260: ???
> ==11381==    by 0x5054554F5F4C414F: ???
> ==11381==    by 0x52454454535F5454: ???
> ==11381==    by 0x32333D44465F51: ???
> ==11381==    by 0x41434D5F49504D4E: ???
> ==11381==    by 0x696E69666661705E: ???
> ==11381==    by 0x5F657361625F7973: ???
> ==11381==    by 0x313D646E756F61: ???
> ==11381==
> ==11381== 9 bytes in 1 blocks are definitely lost in loss record 430 of
> 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x926098D: __intel_sse2_strdup (in
> /opt/intel/composer_xe_2013.2.146/compiler/lib/intel64/libintlc.so.5)
> ==11381==    by 0x2020200A3E746364: ???
> ==11381==    by 0x3C2020202020201F: ???
> ==11381==    by 0x3E7463656A626F2E: ???
> ==11381==    by 0x2020202020202009: ???
> ==11381==    by 0xF3: ???
> ==11381==    by 0xF3: ???
> ==11381==    by 0x3: ???
> ==11381==    by 0x3: ???
> ==11381==    by 0xE4: ???
> ==11381==    by 0xE5: ???
> ==11381==
> ==11381== 11 bytes in 1 blocks are definitely lost in loss record 472 of
> 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x812B188: opal_argv_join (argv.c:269)
> ==11381==    by 0xD4F5370: ompi_btl_openib_connect_base_register
> (btl_openib_connect_base.c:72)
> ==11381==    by 0xD4F0CB0: btl_openib_register_mca_params
> (btl_openib_mca.c:652)
> ==11381==    by 0xD4E24B5: btl_openib_component_register
> (btl_openib_component.c:166)
> ==11381==    by 0x815DCC5: mca_base_components_open
> (mca_base_components_open.c:387)
> ==11381==    by 0x80D7140: mca_btl_base_open (btl_base_open.c:115)
> ==11381==    by 0xC4612C6: ???
> ==11381==    by 0x815DD37: mca_base_components_open
> (mca_base_components_open.c:427)
> ==11381==    by 0x80E4CCA: mca_pml_base_open (pml_base_open.c:126)
> ==11381==    by 0x80A7594: ompi_mpi_init (ompi_mpi_init.c:485)
> ==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
> ==11381==
> ==11381== 16 bytes in 1 blocks are definitely lost in loss record 710 of
> 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x813EE92: opal_dss_unpack_byte_object (dss_unpack.c:490)
> ==11381==    by 0x813F3AE: opal_dss_unpack_buffer (dss_unpack.c:120)
> ==11381==    by 0x813DBD9: opal_dss_unpack (dss_unpack.c:84)
> ==11381==    by 0x81021EC: orte_util_nidmap_init (nidmap.c:117)
> ==11381==    by 0xAA17573: rte_init (ess_env_module.c:173)
> ==11381==    by 0x80E75CA: orte_init (orte_init.c:127)
> ==11381==    by 0x80A7005: ompi_mpi_init (ompi_mpi_init.c:357)
> ==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
> ==11381==    by 0x71F3956: MPI_INIT (pinit_f.c:75)
> ==11381==    by 0x4D0F99F: petscinitialize_ (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x60B1AB: elafintstartmpi_ (in
> /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> ==11381==
> ==11381== 16 bytes in 1 blocks are definitely lost in loss record 711 of
> 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x813EE92: opal_dss_unpack_byte_object (dss_unpack.c:490)
> ==11381==    by 0x813F3AE: opal_dss_unpack_buffer (dss_unpack.c:120)
> ==11381==    by 0x813DBD9: opal_dss_unpack (dss_unpack.c:84)
> ==11381==    by 0x810222C: orte_util_nidmap_init (nidmap.c:130)
> ==11381==    by 0xAA17573: rte_init (ess_env_module.c:173)
> ==11381==    by 0x80E75CA: orte_init (orte_init.c:127)
> ==11381==    by 0x80A7005: ompi_mpi_init (ompi_mpi_init.c:357)
> ==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
> ==11381==    by 0x71F3956: MPI_INIT (pinit_f.c:75)
> ==11381==    by 0x4D0F99F: petscinitialize_ (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x60B1AB: elafintstartmpi_ (in
> /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> ==11381==
> ==11381== 16 bytes in 16 blocks are definitely lost in loss record 712 of
> 4,925
> ==11381==    at 0x4A0646F: malloc (vg_replace_malloc.c:236)
> ==11381==    by 0x810F37B: orte_grpcomm_base_get_proc_attr
> (grpcomm_base_modex.c:801)
> ==11381==    by 0x8098A44: ompi_comm_cid_init (comm_cid.c:139)
> ==11381==    by 0x80A7C52: ompi_mpi_init (ompi_mpi_init.c:846)
> ==11381==    by 0x80BF902: PMPI_Init (pinit.c:84)
> ==11381==    by 0x71F3956: MPI_INIT (pinit_f.c:75)
> ==11381==    by 0x4D0F99F: petscinitialize_ (in
> /Users/augspurger/NumericalLibraries/petsc/intel-opt/lib/libpetsc.so.3.6.0)
> ==11381==    by 0x60B1AB: elafintstartmpi_ (in
> /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> ==11381==    by 0x60A052: MAIN__ (in
> /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> ==11381==    by 0x42412B: main (in
> /nfsscratch/Users/augspurger/PAPER2/PELAFINT3D_EXE)
> ==11381==
>

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which their
experiments lead.
-- Norbert Wiener
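
P.S. In the quoted trace, the fault valgrind reports sits under fork, called from PetscAttachDebugger, so it most likely describes the debugger attach rather than the original crash. A rough sketch of how the rerun command could be assembled with the debugger options stripped out is below. The launcher name (mpiexec), the process count, the relative executable path, and the -ksp_monitor option are assumptions for illustration, not taken from the original run; the valgrind flags are standard memcheck options.

```python
import shlex

# PETSc options that start or attach a debugger. With these set, valgrind may
# end up reporting a fault in the debugger fork instead of the original one.
DEBUGGER_OPTIONS = {"-on_error_attach_debugger", "-start_in_debugger"}


def valgrind_command(executable, nprocs, app_args=(), launcher="mpiexec"):
    """Build an MPI + valgrind command line with debugger options removed."""
    kept = []
    skip_value = False
    for arg in app_args:
        # A debugger option may carry a value (e.g. a debugger name); drop it too.
        if skip_value and not arg.startswith("-"):
            skip_value = False
            continue
        skip_value = arg in DEBUGGER_OPTIONS
        if not skip_value:
            kept.append(arg)
    return [
        launcher, "-n", str(nprocs),
        "valgrind", "--tool=memcheck", "-q", "--num-callers=20",
        "--log-file=valgrind.log.%p",  # %p expands to the PID: one log per rank
        executable, *kept,
    ]


if __name__ == "__main__":
    # Hypothetical invocation: 4 ranks and -ksp_monitor are placeholders.
    cmd = valgrind_command(
        "./PELAFINT3D_EXE", 4,
        ["-ksp_monitor", "-on_error_attach_debugger"],
    )
    print(" ".join(shlex.quote(part) for part in cmd))
```

Writing one log file per process keeps the ranks' output from interleaving, which makes it easier to send ALL the output for the rank that crashed.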
