https://github.com/spack/spack/pull/11132
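One way to test the change from this PR before it is merged is to fetch the
pull-request ref into an existing Spack checkout and rebuild. A minimal
sketch (the local branch name is arbitrary, the clone's origin is assumed to
be github.com/spack/spack, and the spec assumes Spack's petsc package exposes
a +mumps variant):

  # fetch and check out the PR branch in the Spack clone
  git -C $SPACK_ROOT fetch origin pull/11132/head:mumps-5.1.2-pr
  git -C $SPACK_ROOT checkout mumps-5.1.2-pr

  # rebuild petsc against the patched mumps
  spack install petsc+mumps ^mumps@5.1.2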
If this works, please add a comment on the PR.

Satish

On Mon, 8 Apr 2019, Balay, Satish via petsc-users wrote:

> The spack update to use mumps-5.1.2 - with this patch - is in branch
> 'balay/mumps-5.1.2'.
>
> Satish
>
> On Mon, 8 Apr 2019, Satish Balay wrote:
>
> > Yes - mumps via spack is unlikely to have this patch - but it can be
> > added.
> >
> > https://bitbucket.org/petsc/pkg-mumps/commits/5fe5b9e56f78de2b7b1c199688f6c73ff3ff4c2d
> >
> > Satish
> >
> > On Mon, 8 Apr 2019, Manav Bhatia wrote:
> >
> > > This is helpful, Thibaut. Thanks!
> > >
> > > For reference: all my Linux installs are built with Spack, while my
> > > Mac install is through a PETSc configure where I let it download and
> > > install MUMPS. Could this be a source of the difference in patch
> > > level for MUMPS?
> > >
> > > > On Apr 8, 2019, at 1:56 PM, Appel, Thibaut
> > > > <t.appe...@imperial.ac.uk> wrote:
> > > >
> > > > Hi Manav,
> > > >
> > > > This seems to be the bug in MUMPS that I reported to their
> > > > developers last summer. But I thought Satish Balay had issued a
> > > > patch in the maint branch of PETSc to correct that a few months ago?
> > > >
> > > > The temporary workaround was to disable the ScaLAPACK root node,
> > > > ICNTL(13)=1. One of the developers said later:
> > > >
> > > >> A workaround consists in modifying the file src/dtype3_root.F
> > > >> near line 808 and replacing the lines:
> > > >>
> > > >>       SUBROUTINE DMUMPS_INIT_ROOT_FAC( N, root, FILS, IROOT,
> > > >>      &                                 KEEP, INFO )
> > > >>       IMPLICIT NONE
> > > >>       INCLUDE 'dmumps_root.h'
> > > >>
> > > >> by:
> > > >>
> > > >>       SUBROUTINE DMUMPS_INIT_ROOT_FAC( N, root, FILS, IROOT,
> > > >>      &                                 KEEP, INFO )
> > > >>       USE DMUMPS_STRUC_DEF
> > > >>       IMPLICIT NONE
> > > >
> > > > Weird that you're getting this now if it has been corrected in
> > > > PETSc?
> > > >
> > > > Thibaut
> > > >
> > > >>> On Apr 8, 2019, at 1:33 PM, Mark Adams <mfadams at lbl.gov> wrote:
> > > >>>
> > > >>> Are you able to run the exact same job on your Mac? I.e., same
> > > >>> number of processes, etc.
> > > >>
> > > >> This is what I am trying to dig into now.
> > > >>
> > > >> My Mac has 4 cores. I have used several different Linux machines
> > > >> with different numbers of processors: 4, 12, 10, 20. They all
> > > >> eventually crash. I am trying to establish whether the point of
> > > >> crash is the same across machines.
> > > >>
> > > >> -Manav
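For reference, the ICNTL(13)=1 workaround Thibaut mentions above can also be
applied at runtime through PETSc's MUMPS interface, without patching or
rebuilding anything. A minimal sketch (the executable name and process count
are illustrative, and MUMPS must be the selected factorization package):

  # disable the ScaLAPACK root node in MUMPS (ICNTL(13)=1)
  mpiexec -n 8 ./structural_example_5 -pc_factor_mat_solver_type mumps \
      -mat_mumps_icntl_13 1

The same control is available programmatically through MatMumpsSetIcntl().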
------------------------------

Message: 1
Date: Mon, 8 Apr 2019 12:12:06 -0500
From: Manav Bhatia <bhatiama...@gmail.com>
To: Evan Um via petsc-users <petsc-users@mcs.anl.gov>
Subject: [petsc-users] Error with parallel solve

Hi,

I am running a nonlinear simulation using mesh refinement on libMesh. The
code runs without issues on a Mac (it can run for days), but crashes on
Linux (CentOS 6). On Linux I am using PETSc 3.11 with Open MPI 3.1.3 and
GCC 8.2.

I tried to use -on_error_attach_debugger, but it only gave me the message
below. Does this message imply something to more experienced eyes?

I am going to build a debug version of PETSc to figure out what is going
wrong, and will get and share more detailed logs in a bit.

Regards,
Manav

[8]PETSC ERROR: ------------------------------------------------------------------------
[8]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, probably memory access out of range
[8]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
[8]PETSC ERROR: or see http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
[8]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to find memory corruption errors
[8]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and run
[8]PETSC ERROR: to get more information on the crash.
[8]PETSC ERROR: User provided function() line 0 in unknown file
PETSC: Attaching gdb to /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5 of pid 2108 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
PETSC: Attaching gdb to /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5 of pid 2112 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
 0 :INTERNAL Error: recvd root arrowhead
 0 :not belonging to me. IARR,JARR= 67525 67525
 0 :IROW_GRID,JCOL_GRID= 0 4
 0 :MYROW, MYCOL= 0 0
 0 :IPOSROOT,JPOSROOT= 92264688 92264688
--------------------------------------------------------------------------
MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
with errorcode -99.

NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
You may or may not see output from other processes, depending on
exactly when Open MPI kills them.
--------------------------------------------------------------------------
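On a batch cluster, -on_error_attach_debugger only helps if the compute
nodes can open an xterm on a reachable X display. A sketch of the two usual
variants (the display string and process counts are illustrative):

  # attach gdb on the ranks that trap the error, using a forwarded X display
  mpiexec -n 8 ./structural_example_5 -on_error_attach_debugger gdb -display localhost:10.0

  # or, for a small run, start every rank under gdb in the launching terminal
  mpiexec -n 2 ./structural_example_5 -start_in_debugger noxterm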
------------------------------

Message: 2
Date: Mon, 8 Apr 2019 17:36:53 +0000
From: "Smith, Barry F." <bsm...@mcs.anl.gov>
To: Manav Bhatia <bhatiama...@gmail.com>
Cc: Evan Um via petsc-users <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] Error with parallel solve

Difficult to tell what is going on.

The message "User provided function() line 0 in unknown file" indicates
that the crash took place OUTSIDE of PETSc code, and the error message
"INTERNAL Error: recvd root arrowhead" is definitely not coming from PETSc.

Yes, debug with the debug version and also try valgrind.

Barry
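A typical way to follow the valgrind suggestion under Open MPI, along the
lines of the PETSc FAQ entry cited in the error output (the process count,
application arguments, and log-file name are illustrative):

  mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20 \
      --log-file=valgrind-%p.log ./structural_example_5 <application args>

Running this against a debug build of PETSc and MUMPS makes the reported
stacks far more useful.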
------------------------------

Message: 3
Date: Mon, 8 Apr 2019 13:58:55 -0400
From: Mark Adams <mfad...@lbl.gov>
To: "Smith, Barry F." <bsm...@mcs.anl.gov>
Cc: Manav Bhatia <bhatiama...@gmail.com>, Evan Um via petsc-users <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] Error with parallel solve

This looks like an error in MUMPS:

      IF ( IROW_GRID .NE. root%MYROW .OR.
     &     JCOL_GRID .NE. root%MYCOL ) THEN
        WRITE(*,*) MYID,':INTERNAL Error: recvd root arrowhead '
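To see where MUMPS performs this check in the sources that were actually
built, one can grep for the string from the log. A sketch, assuming a
--download-mumps build (the externalpackages layout varies with PETSC_ARCH
and the MUMPS version):

  grep -rn "recvd root arrowhead" $PETSC_DIR/$PETSC_ARCH/externalpackages/*/src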
------------------------------

Message: 4
Date: Mon, 8 Apr 2019 13:23:14 -0500
From: Manav Bhatia <bhatiama...@gmail.com>
To: Mark Adams <mfad...@lbl.gov>
Cc: "Smith, Barry F." <bsm...@mcs.anl.gov>, Evan Um via petsc-users <petsc-users@mcs.anl.gov>
Subject: Re: [petsc-users] Error with parallel solve

Thanks for identifying this, Mark.

If I compile the debug version of PETSc, will it also build a debug version
of MUMPS?
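Regarding the question above: when PETSc downloads and builds MUMPS itself,
configuring with --with-debugging=yes typically propagates the debug flags
to the downloaded packages as well. A sketch of such a configure line (the
exact set of --download-* options should mirror whatever the original build
used):

  ./configure PETSC_ARCH=arch-linux-dbg --with-debugging=yes \
      --download-mumps --download-scalapack --download-metis --download-parmetis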
------------------------------

End of petsc-users Digest, Vol 124, Issue 31
********************************************