Re: [petsc-users] Error with parallel solve

2019-04-08 Thread Balay, Satish via petsc-users
On Mon, 8 Apr 2019, Manav Bhatia via petsc-users wrote:

> 
> 
> > On Apr 8, 2019, at 2:19 PM, Stefano Zampini wrote:
> > 
> > You can circumvent the problem by using a sequential solver for it. There's 
> > a command-line option in PETSc, as well as an API call, that allows you to do so: 
> > -mat_mumps_icntl_13 1
> 
> Stefano, 
> 
>   Do you know if there is a performance penalty to using this option as 
> opposed to fixing it with the patch? 

I would suggest first trying both fixes and seeing if either of them works for you.

Satish
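
For reference, the -mat_mumps_icntl_13 1 option that Stefano mentioned can also
be set through PETSc's API. Below is a minimal sketch, assuming the KSP has
already been created and configured to use an LU factorization with the MUMPS
solver; the variable names are illustrative and error checking is omitted:

    #include <petscksp.h>

    KSP ksp;   /* assumed to be created and configured elsewhere */
    PC  pc;
    Mat F;

    KSPGetPC(ksp, &pc);
    PCSetType(pc, PCLU);
    PCFactorSetMatSolverType(pc, MATSOLVERMUMPS);
    PCFactorSetUpMatSolverType(pc);   /* create the factor matrix so ICNTL can be set */
    PCFactorGetMatrix(pc, &F);
    MatMumpsSetIcntl(F, 13, 1);       /* same effect as -mat_mumps_icntl_13 1 */

ICNTL(13) = 1 tells MUMPS not to use ScaLAPACK on the root node, i.e. the root
is factored sequentially. The command-line option is usually the simpler route,
since it requires no code changes.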



Re: [petsc-users] Error with parallel solve

2019-04-08 Thread Mark Adams via petsc-users
On Mon, Apr 8, 2019 at 2:23 PM Manav Bhatia  wrote:

> Thanks for identifying this, Mark.
>
> If I compile the debug version of PETSc, will it also build a debug
> version of MUMPS?
>

The debug compiler flags will get passed down to MUMPS if you are downloading
MUMPS through PETSc's configure (a sample configure line is sketched below).
Otherwise, yes, build a debug version of MUMPS as well.

Are you able to run the exact same job on your Mac? That is, the same number of
processes, etc.?
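
As a rough sketch of what that configure might look like (the exact package
list is an assumption based on a typical MUMPS setup; adjust it to match your
installation):

    ./configure --with-debugging=yes --download-mumps --download-scalapack \
        --download-metis --download-parmetis

If MUMPS instead comes from a separate, hand-built installation, it would need
to be rebuilt with debug flags on its own.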


>
> On Apr 8, 2019, at 12:58 PM, Mark Adams  wrote:
>
> This looks like an error in MUMPS:
>
> IF ( IROW_GRID .NE. root%MYROW .OR.
>  &   JCOL_GRID .NE. root%MYCOL ) THEN
> WRITE(*,*) MYID,':INTERNAL Error: recvd root arrowhead '
>
>
> On Mon, Apr 8, 2019 at 1:37 PM Smith, Barry F. via petsc-users <
> petsc-users@mcs.anl.gov> wrote:
>
>>   Difficult to tell what is going on.
>>
>>   The message User provided function() line 0 in  unknown file  indicates
>> the crash took place OUTSIDE of PETSc code and error message INTERNAL
>> Error: recvd root arrowhead  is definitely not coming from PETSc.
>>
>>Yes, debug with the debug version and also try valgrind.
>>
>>Barry
>>
>>
>> > On Apr 8, 2019, at 12:12 PM, Manav Bhatia via petsc-users <
>> petsc-users@mcs.anl.gov> wrote:
>> >
>> >
>> > Hi,
>> >
>> > I am running a nonlinear simulation using mesh refinement on
>> libMesh. The code runs without issues on a Mac (can run for days without
>> issues), but crashes on Linux (CentOS 6). I am using PETSc 3.11 on Linux
>> with Open MPI 3.1.3 and gcc 8.2.
>> >
>> > I tried to use the -on_error_attach_debugger, but it only gave me
>> this message. Does this message imply something to the more experienced
>> eyes?
>> >
>> > I am going to try to build a debug version of petsc to figure out
>> what is going wrong. I will get and share more detailed logs in a bit.
>> >
>> > Regards,
>> > Manav
>> >
>> > [8]PETSC ERROR:
>> 
>> > [8]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
>> probably memory access out of range
>> > [8]PETSC ERROR: Try option -start_in_debugger or
>> -on_error_attach_debugger
>> > [8]PETSC ERROR: or see
>> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
>> > [8]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
>> OS X to find memory corruption errors
>> > [8]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
>> and run
>> > [8]PETSC ERROR: to get more information on the crash.
>> > [8]PETSC ERROR: User provided function() line 0 in  unknown file
>> > PETSC: Attaching gdb to
>> /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5
>> of pid 2108 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
>> 
>> > PETSC: Attaching gdb to
>> /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5
>> of pid 2112 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
>> 
>> >0 :INTERNAL Error: recvd root arrowhead
>> >0 :not belonging to me. IARR,JARR=   67525   67525
>> >0 :IROW_GRID,JCOL_GRID=   0   4
>> >0 :MYROW, MYCOL=   0   0
>> >0 :IPOSROOT,JPOSROOT=9226468892264688
>> >
>> --
>> > MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
>> > with errorcode -99.
>> >
>> > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
>> > You may or may not see output from other processes, depending on
>> > exactly when Open MPI kills them.
>> >
>> --
>> >
>>
>>
>


Re: [petsc-users] Error with parallel solve

2019-04-08 Thread Manav Bhatia via petsc-users
Thanks for identifying this, Mark. 

If I compile the debug version of PETSc, will it also build a debug version of 
MUMPS? 

> On Apr 8, 2019, at 12:58 PM, Mark Adams  wrote:
> 
> This looks like an error in MUMPS:
> 
> IF ( IROW_GRID .NE. root%MYROW .OR.
>  &   JCOL_GRID .NE. root%MYCOL ) THEN
> WRITE(*,*) MYID,':INTERNAL Error: recvd root arrowhead '
> 
> On Mon, Apr 8, 2019 at 1:37 PM Smith, Barry F. via petsc-users 
> <petsc-users@mcs.anl.gov> wrote:
>   Difficult to tell what is going on. 
> 
>   The message User provided function() line 0 in  unknown file  indicates the 
> crash took place OUTSIDE of PETSc code and error message INTERNAL Error: 
> recvd root arrowhead  is definitely not coming from PETSc. 
> 
>Yes, debug with the debug version and also try valgrind.
> 
>Barry
> 
> 
> > On Apr 8, 2019, at 12:12 PM, Manav Bhatia via petsc-users 
> > <petsc-users@mcs.anl.gov> wrote:
> > 
> > 
> > Hi,
> >   
> > I am running a nonlinear simulation using mesh refinement on 
> > libMesh. The code runs without issues on a Mac (can run for days without 
> > issues), but crashes on Linux (CentOS 6). I am using PETSc 3.11 on Linux 
> > with Open MPI 3.1.3 and gcc 8.2. 
> > 
> > I tried to use the -on_error_attach_debugger, but it only gave me this 
> > message. Does this message imply something to the more experienced eyes? 
> > 
> > I am going to try to build a debug version of petsc to figure out what 
> > is going wrong. I will get and share more detailed logs in a bit. 
> > 
> > Regards,
> > Manav
> > 
> > [8]PETSC ERROR: 
> > 
> > [8]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, 
> > probably memory access out of range
> > [8]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> > [8]PETSC ERROR: or see 
> > http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind 
> > 
> > [8]PETSC ERROR: or try http://valgrind.org  on 
> > GNU/linux and Apple Mac OS X to find memory corruption errors
> > [8]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and 
> > run 
> > [8]PETSC ERROR: to get more information on the crash.
> > [8]PETSC ERROR: User provided function() line 0 in  unknown file  
> > PETSC: Attaching gdb to 
> > /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5
> >  of pid 2108 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu 
> > 
> > PETSC: Attaching gdb to 
> > /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5
> >  of pid 2112 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu 
> > 
> >0 :INTERNAL Error: recvd root arrowhead 
> >0 :not belonging to me. IARR,JARR=   67525   67525
> >0 :IROW_GRID,JCOL_GRID=   0   4
> >0 :MYROW, MYCOL=   0   0
> >0 :IPOSROOT,JPOSROOT=9226468892264688
> > --
> > MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> > with errorcode -99.
> > 
> > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> > You may or may not see output from other processes, depending on
> > exactly when Open MPI kills them.
> > --
> > 
> 



Re: [petsc-users] Error with parallel solve

2019-04-08 Thread Mark Adams via petsc-users
This looks like an error in MUMPS:

IF ( IROW_GRID .NE. root%MYROW .OR.
 &   JCOL_GRID .NE. root%MYCOL ) THEN
WRITE(*,*) MYID,':INTERNAL Error: recvd root arrowhead '
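
As an aside, if more context from MUMPS around this abort would help, its
diagnostic verbosity is controlled by ICNTL(4), which PETSc exposes as a
runtime option (a suggestion, not something required to reproduce the error):

    -mat_mumps_icntl_4 2

which asks MUMPS to print error messages, warnings, and its main statistics
during analysis and factorization.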


On Mon, Apr 8, 2019 at 1:37 PM Smith, Barry F. via petsc-users <
petsc-users@mcs.anl.gov> wrote:

>   Difficult to tell what is going on.
>
>   The message User provided function() line 0 in  unknown file  indicates
> the crash took place OUTSIDE of PETSc code and error message INTERNAL
> Error: recvd root arrowhead  is definitely not coming from PETSc.
>
>Yes, debug with the debug version and also try valgrind.
>
>Barry
>
>
> > On Apr 8, 2019, at 12:12 PM, Manav Bhatia via petsc-users <
> petsc-users@mcs.anl.gov> wrote:
> >
> >
> > Hi,
> >
> > I am running a nonlinear simulation using mesh refinement on
> libMesh. The code runs without issues on a Mac (can run for days without
> issues), but crashes on Linux (CentOS 6). I am using PETSc 3.11 on Linux
> with Open MPI 3.1.3 and gcc 8.2.
> >
> > I tried to use the -on_error_attach_debugger, but it only gave me
> this message. Does this message imply something to the more experienced
> eyes?
> >
> > I am going to try to build a debug version of petsc to figure out
> what is going wrong. I will get and share more detailed logs in a bit.
> >
> > Regards,
> > Manav
> >
> > [8]PETSC ERROR:
> 
> > [8]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation,
> probably memory access out of range
> > [8]PETSC ERROR: Try option -start_in_debugger or
> -on_error_attach_debugger
> > [8]PETSC ERROR: or see
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> > [8]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac
> OS X to find memory corruption errors
> > [8]PETSC ERROR: configure using --with-debugging=yes, recompile, link,
> and run
> > [8]PETSC ERROR: to get more information on the crash.
> > [8]PETSC ERROR: User provided function() line 0 in  unknown file
> > PETSC: Attaching gdb to
> /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5
> of pid 2108 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
> > PETSC: Attaching gdb to
> /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5
> of pid 2112 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
> >0 :INTERNAL Error: recvd root arrowhead
> >0 :not belonging to me. IARR,JARR=   67525   67525
> >0 :IROW_GRID,JCOL_GRID=   0   4
> >0 :MYROW, MYCOL=   0   0
> >0 :IPOSROOT,JPOSROOT=9226468892264688
> >
> --
> > MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> > with errorcode -99.
> >
> > NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> > You may or may not see output from other processes, depending on
> > exactly when Open MPI kills them.
> >
> --
> >
>
>


Re: [petsc-users] Error with parallel solve

2019-04-08 Thread Smith, Barry F. via petsc-users
  Difficult to tell what is going on. 

  The message "User provided function() line 0 in unknown file" indicates the 
crash took place OUTSIDE of PETSc code, and the error message "INTERNAL Error: 
recvd root arrowhead" is definitely not coming from PETSc. 

  Yes, debug with the debug version and also try valgrind.

  Barry

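A minimal sketch of the kind of valgrind run Barry is suggesting, following
the recipe in the PETSc FAQ linked in the error output (the process count and
executable name are placeholders taken from the log above):

    mpiexec -n 8 valgrind --tool=memcheck -q --num-callers=20 \
        --log-file=valgrind.log.%p ./structural_example_5 -malloc off [petsc options]

Here -malloc off turns off PETSc's own allocator so that valgrind sees the raw
allocations, and %p gives each MPI rank its own log file.
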

> On Apr 8, 2019, at 12:12 PM, Manav Bhatia via petsc-users wrote:
> 
> 
> Hi,
>   
> I am running a nonlinear simulation using mesh refinement on 
> libMesh. The code runs without issues on a Mac (can run for days without 
> issues), but crashes on Linux (CentOS 6). I am using PETSc 3.11 on Linux 
> with Open MPI 3.1.3 and gcc 8.2. 
> 
> I tried to use the -on_error_attach_debugger, but it only gave me this 
> message. Does this message imply something to the more experienced eyes? 
> 
> I am going to try to build a debug version of petsc to figure out what is 
> going wrong. I will get and share more detailed logs in a bit. 
> 
> Regards,
> Manav
> 
> [8]PETSC ERROR: 
> 
> [8]PETSC ERROR: Caught signal number 11 SEGV: Segmentation Violation, 
> probably memory access out of range
> [8]PETSC ERROR: Try option -start_in_debugger or -on_error_attach_debugger
> [8]PETSC ERROR: or see 
> http://www.mcs.anl.gov/petsc/documentation/faq.html#valgrind
> [8]PETSC ERROR: or try http://valgrind.org on GNU/linux and Apple Mac OS X to 
> find memory corruption errors
> [8]PETSC ERROR: configure using --with-debugging=yes, recompile, link, and 
> run 
> [8]PETSC ERROR: to get more information on the crash.
> [8]PETSC ERROR: User provided function() line 0 in  unknown file  
> PETSC: Attaching gdb to 
> /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5
>  of pid 2108 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
> PETSC: Attaching gdb to 
> /cavs/projects/brg_codes/users/bhatia/mast/mast_topology/opt/examples/structural/example_5/structural_example_5
>  of pid 2112 on display localhost:10.0 on machine Warhawk1.HPC.MsState.Edu
>0 :INTERNAL Error: recvd root arrowhead 
>0 :not belonging to me. IARR,JARR=   67525   67525
>0 :IROW_GRID,JCOL_GRID=   0   4
>0 :MYROW, MYCOL=   0   0
>0 :IPOSROOT,JPOSROOT=9226468892264688
> --
> MPI_ABORT was invoked on rank 0 in communicator MPI_COMM_WORLD
> with errorcode -99.
> 
> NOTE: invoking MPI_ABORT causes Open MPI to kill all MPI processes.
> You may or may not see output from other processes, depending on
> exactly when Open MPI kills them.
> --
>