Re: [petsc-users] Memory growth issue

2019-05-29 Thread Smith, Barry F. via petsc-users


   This is indeed worrisome. 

Would it be possible to put PetscMemoryGetCurrentUsage() around each call 
to KSPSolve() and each call to your data exchange? See if at each step they 
increase? 
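
   For instance, a minimal Fortran sketch of that instrumentation (the
subroutine name is made up here, and ksp, b, x are assumed to already exist)
could look like:

#include <petsc/finclude/petscksp.h>
      subroutine solve_with_memcheck(ksp,b,x,ierr)
      use petscksp
      implicit none
      KSP            ksp
      Vec            b,x
      PetscErrorCode ierr
      PetscLogDouble mem0,mem1

      call PetscMemoryGetCurrentUsage(mem0,ierr)
      call KSPSolve(ksp,b,x,ierr)
      call PetscMemoryGetCurrentUsage(mem1,ierr)
!     report only the calls where the resident set actually grew
      if (mem1-mem0 .gt. 0.d0) then
         print *, 'KSPSolve RSS delta (bytes):', mem1-mem0
      endif
      end subroutine solve_with_memcheck

The same two calls can bracket the data exchange routine; summing the
per-call deltas across processes gives the per-step numbers asked for above.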

One thing to be aware of with "max resident set size" is that it measures 
the number of pages that have been set up, not the amount of memory allocated. 
So if, for example, you allocate a very large array but don't actually read or 
write the memory in that array until later in the code, it won't appear in the 
"resident set size" until you read or write the memory (because Unix doesn't 
set up pages until it needs to). 

   You should also try another MPI. Both OpenMPI and MPICH can be installed 
with brew or you can use --download-mpich or --download-openmpi to see if the 
MPI implementation is making a difference.

For now I would focus on the PETSc-only solvers to eliminate one variable 
from the equation; once that is understood you can go back to the question of 
the memory management of the other solvers.

  Barry


> On May 29, 2019, at 11:51 PM, Sanjay Govindjee via petsc-users 
>  wrote:
> 
> I am trying to track down a memory issue with my code; apologies in advance 
> for the longish message.
> 
> I am solving a FEA problem with a number of load steps involving about 3000
> right hand side and tangent assemblies and solves.  The program is mainly 
> Fortran, with a C memory allocator.
> 
> When I run my code in strictly serial mode (no Petsc or MPI routines) the 
> memory stays constant over the whole run.
> 
> When I run it in parallel mode with petsc solvers with num_processes=1, the 
> memory (max resident set size) also stays constant:
> 
> PetscMalloc = 28,976, ProgramNativeMalloc = constant, Resident Size = 
> 24,854,528 (constant) [CG/JACOBI]
> 
> [PetscMalloc and Resident Size as reported by PetscMallocGetCurrentUsage and 
> PetscMemoryGetCurrentUsage (and summed across processes as needed);
> ProgramNativeMalloc reported by program memory allocator.]
> 
> When I run it in parallel mode with petsc solvers but num_processes=2, the 
> resident memory grows steadily during the run:
> 
> PetscMalloc = 3,039,072 (constant), ProgramNativeMalloc = constant, Resident 
> Size = (finish) 31,313,920 (start) 24,698,880 [CG/JACOBI]
> 
> When I run it in parallel mode with petsc solvers but num_processes=4, the 
> resident memory grows steadily during the run:
> 
> PetscMalloc = 3,307,888 (constant), ProgramNativeMalloc = 1,427,584 
> (constant), Resident Size = (finish) 70,787,072  (start) 45,801,472 
> [CG/JACOBI]
> PetscMalloc = 5,903,808 (constant), ProgramNativeMalloc = 1,427,584 
> (constant), Resident Size = (finish) 112,410,624 (start) 52,076,544 
> [GMRES/BJACOBI]
> PetscMalloc = 3,188,944 (constant), ProgramNativeMalloc = 1,427,584 
> (constant), Resident Size = (finish) 712,798,208 (start) 381,480,960 [SUPERLU]
> PetscMalloc = 6,539,408 (constant), ProgramNativeMalloc = 1,427,584 
> (constant), Resident Size = (finish) 591,048,704 (start) 278,671,360 [MUMPS]
> 
> The memory growth feels alarming but maybe I do not understand the values in 
> ru_maxrss from getrusage().
> 
> My box (MacBook Pro) has a broken Valgrind so I need to get to a system with 
> a functional one; notwithstanding, the code has always been Valgrind clean.
> There are no Fortran Pointers or Fortran Allocatable arrays in the part of 
> the code being used.  The program's C memory allocator keeps track of its own 
> usage, so I do not see the problem there.  The Petsc malloc is also 
> steady.
> 
> Other random hints:
> 
> 1) If I comment out the call to KSPSolve and to my MPI data-exchange routine 
> (for passing solution values between processes after each solve; it uses 
> MPI_Isend, MPI_Recv, and MPI_Barrier), the memory growth essentially goes 
> away.
> 
> 2) If I comment out the call to my MPI data-exchange routine but leave the 
> call to KSPSolve the problem remains but is substantially reduced
> for CG/JACOBI, and is marginally reduced for the GMRES/BJACOBI, SUPERLU, and 
> MUMPS runs.
> 
> 3) If I comment out the call to KSPSolve but leave the call to my MPI 
> data-exchange routine the problem remains.
> 
> Any suggestions/hints of where to look will be great.
> 
> -sanjay
> 
> 



Re: [petsc-users] Memory growth issue

2019-05-30 Thread Smith, Barry F. via petsc-users


  Let us know how it goes with MPICH


> On May 30, 2019, at 2:01 AM, Sanjay Govindjee  wrote:
> 
> I put in calls to PetscMemoryGetCurrentUsage( ) around KSPSolve and my data 
> exchange routine.  The problem is clearly mostly in my data exchange routine.
> Attached are graphs of the change in memory for each call.  Lots of calls 
> have zero change, but periodically the memory goes up from the data exchange; 
> much less so with the KSPSolve calls (and then mostly on the first calls).
> 
> For the CG/Jacobi  data_exchange_total = 21,311,488; kspsolve_total = 
> 2,625,536
> For the GMRES/BJACOBI data_exchange_total = 6,619,136; kspsolve_total = 
> 54,403,072 (dominated by initial calls)
> 
> I will try to switch up my MPI to see if anything changes; right now my 
> configure is with  --download-openmpi.
> I've also attached the data exchange routine in case there is something 
> obviously wrong.
> 
> NB: Graphs/Data are from just one run each.
> 
> -sanjay
> 
> On 5/29/19 10:17 PM, Smith, Barry F. wrote:
>>This is indeed worrisome.
>> 
>> Would it be possible to put PetscMemoryGetCurrentUsage() around each 
>> call to KSPSolve() and each call to your data exchange? See if at each step 
>> they increase?
>> 
>> One thing to be aware of with "max resident set size" is that it 
>> measures the number of pages that have been set up. Not the amount of memory 
>> allocated. So, if, for example, you allocate a very large array but don't 
>> actually read or write the memory in that array until later in the code it 
>> won't appear in the "resident set size" until you read or write the memory 
>> (because Unix doesn't set up pages until it needs to).
>> 
>>You should also try another MPI. Both OpenMPI and MPICH can be installed 
>> with brew or you can use --download-mpich or --download-openmpi to see if the 
>> MPI implementation is making a difference.
>> 
>> For now I would focus on the PETSc only solvers to eliminate one 
>> variable from the equation; once that is understood you can go back to the 
>> question of the memory management of the other solvers
>> 
>>   Barry
>> 
>> 
>>> On May 29, 2019, at 11:51 PM, Sanjay Govindjee via petsc-users 
>>>  wrote:
>>> 
>>> I am trying to track down a memory issue with my code; apologies in advance 
>>> for the longish message.
>>> 
>>> I am solving a FEA problem with a number of load steps involving about 3000
>>> right hand side and tangent assemblies and solves.  The program is mainly 
>>> Fortran, with a C memory allocator.
>>> 
>>> When I run my code in strictly serial mode (no Petsc or MPI routines) the 
>>> memory stays constant over the whole run.
>>> 
>>> When I run it in parallel mode with petsc solvers with num_processes=1, the 
>>> memory (max resident set size) also stays constant:
>>> 
>>> PetscMalloc = 28,976, ProgramNativeMalloc = constant, Resident Size = 
>>> 24,854,528 (constant) [CG/JACOBI]
>>> 
>>> [PetscMalloc and Resident Size as reported by PetscMallocGetCurrentUsage 
>>> and PetscMemoryGetCurrentUsage (and summed across processes as needed);
>>> ProgramNativeMalloc reported by program memory allocator.]
>>> 
>>> When I run it in parallel mode with petsc solvers but num_processes=2, the 
>>> resident memory grows steadily during the run:
>>> 
>>> PetscMalloc = 3,039,072 (constant), ProgramNativeMalloc = constant, 
>>> Resident Size = (finish) 31,313,920 (start) 24,698,880 [CG/JACOBI]
>>> 
>>> When I run it in parallel mode with petsc solvers but num_processes=4, the 
>>> resident memory grows steadily during the run:
>>> 
>>> PetscMalloc = 3,307,888 (constant), ProgramNativeMalloc = 1,427,584 
>>> (constant), Resident Size = (finish) 70,787,072  (start) 45,801,472 
>>> [CG/JACOBI]
>>> PetscMalloc = 5,903,808 (constant), ProgramNativeMalloc = 1,427,584 
>>> (constant), Resident Size = (finish) 112,410,624 (start) 52,076,544 
>>> [GMRES/BJACOBI]
>>> PetscMalloc = 3,188,944 (constant), ProgramNativeMalloc = 1,427,584 
>>> (constant), Resident Size = (finish) 712,798,208 (start) 381,480,960 
>>> [SUPERLU]
>>> PetscMalloc = 6,539,408 (constant), ProgramNativeMalloc = 1,427,584 
>>> (constant), Resident Size = (finish) 591,048,704 (start) 278,671,360 [MUMPS]
>>> 
>>> The memory growth feels alarming but maybe I do not understand the values 
>>> in ru_maxrss from getrusage().
>>> 
>>> My box (MacBook Pro) has a broken Valgrind so I need to get to a system 
>>> with a functional one; notwithstanding, the code has always been Valgrind 
>>> clean.
>>> There are no Fortran Pointers or Fortran Allocatable arrays in the part of 
>>> the code being used.  The program's C memory allocator keeps track of
>>> itself so I do not see that the problem is there.  The Petsc malloc is also 
>>> steady.
>>> 
>>> Other random hints:
>>> 
>>> 1) If I comment out the call to KSPSolve and to my MPI data-exchange 
>>> routine (for passing solution values between processes after each solve,
>>> use MPI_Isend, MPI_Recv, MPI_Barrier) the memory growth essentially goes away. [...]

Re: [petsc-users] Memory growth issue

2019-05-30 Thread Sanjay Govindjee via petsc-users
The problem seems to persist but with a different signature.  Graphs 
attached as before.


Totals with MPICH (NB: single run)

For the CG/Jacobi  data_exchange_total = 41,385,984; kspsolve_total = 
38,289,408
For the GMRES/BJACOBI  data_exchange_total = 41,324,544; kspsolve_total = 
41,324,544

Just reading the MPI docs I am wondering if I need some sort of 
MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange routine?
I would have thought that with the blocking receives and the MPI_Barrier 
everything would have fully completed and cleaned up before all processes 
exited the routine, but perhaps I am wrong on that.

-sanjay

On 5/30/19 12:14 AM, Smith, Barry F. wrote:

   Let us know how it goes with MPICH



On May 30, 2019, at 2:01 AM, Sanjay Govindjee  wrote:

I put in calls to PetscMemoryGetCurrentUsage( ) around KSPSolve and my data 
exchange routine.  The problem is clearly mostly in my data exchange routine.
Attached are graphs of the change in memory  for each call.  Lots of calls have 
zero change, but on a periodic regular basis the memory goes up from the data 
exchange; much less
so with the KSPSolve calls (and then mostly on the first calls).

For the CG/Jacobi  data_exchange_total = 21,311,488; kspsolve_total = 
2,625,536
For the GMRES/BJACOBI data_exchange_total = 6,619,136; kspsolve_total = 
54,403,072 (dominated by initial calls)

I will try to switch up my MPI to see if anything changes; right now my 
configure is with  --download-openmpi.
I've also attached the data exchange routine in case there is something 
obviously wrong.

NB: Graphs/Data are from just one run each.

-sanjay

On 5/29/19 10:17 PM, Smith, Barry F. wrote:

This is indeed worrisome.

 Would it be possible to put PetscMemoryGetCurrentUsage() around each call 
to KSPSolve() and each call to your data exchange? See if at each step they 
increase?

 One thing to be aware of with "max resident set size" is that it measures the number 
of pages that have been set up. Not the amount of memory allocated. So, if, for example, you 
allocate a very large array but don't actually read or write the memory in that array until later 
in the code it won't appear in the "resident set size" until you read or write the memory 
(because Unix doesn't set up pages until it needs to).

You should also try another MPI. Both OpenMPI and MPICH can be installed 
with brew or you can use --download-mpich or --download-openmpi to see if the 
MPI implementation is making a difference.

 For now I would focus on the PETSc only solvers to eliminate one variable 
from the equation; once that is understood you can go back to the question of 
the memory management of the other solvers

   Barry



On May 29, 2019, at 11:51 PM, Sanjay Govindjee via petsc-users 
 wrote:

I am trying to track down a memory issue with my code; apologies in advance for 
the longish message.

I am solving a FEA problem with a number of load steps involving about 3000
right hand side and tangent assemblies and solves.  The program is mainly 
Fortran, with a C memory allocator.

When I run my code in strictly serial mode (no Petsc or MPI routines) the 
memory stays constant over the whole run.

When I run it in parallel mode with petsc solvers with num_processes=1, the 
memory (max resident set size) also stays constant:

PetscMalloc = 28,976, ProgramNativeMalloc = constant, Resident Size = 
24,854,528 (constant) [CG/JACOBI]

[PetscMalloc and Resident Size as reported by PetscMallocGetCurrentUsage and 
PetscMemoryGetCurrentUsage (and summed across processes as needed);
ProgramNativeMalloc reported by program memory allocator.]

When I run it in parallel mode with petsc solvers but num_processes=2, the 
resident memory grows steadily during the run:

PetscMalloc = 3,039,072 (constant), ProgramNativeMalloc = constant, Resident 
Size = (finish) 31,313,920 (start) 24,698,880 [CG/JACOBI]

When I run it in parallel mode with petsc solvers but num_processes=4, the 
resident memory grows steadily during the run:

PetscMalloc = 3,307,888 (constant), ProgramNativeMalloc = 1,427,584 (constant), 
Resident Size = (finish) 70,787,072  (start) 45,801,472 [CG/JACOBI]
PetscMalloc = 5,903,808 (constant), ProgramNativeMalloc = 1,427,584 (constant), 
Resident Size = (finish) 112,410,624 (start) 52,076,544 [GMRES/BJACOBI]
PetscMalloc = 3,188,944 (constant), ProgramNativeMalloc = 1,427,584 (constant), 
Resident Size = (finish) 712,798,208 (start) 381,480,960 [SUPERLU]
PetscMalloc = 6,539,408 (constant), ProgramNativeMalloc = 1,427,584 (constant), 
Resident Size = (finish) 591,048,704 (start) 278,671,360 [MUMPS]

The memory growth feels alarming but maybe I do not understand the values in 
ru_maxrss from getrusage().

My box (MacBook Pro) has a broken Valgrind so I need to get to a system with a 
functional one; notwithstanding, the code has always been Valgrind clean.
There are no Fortran Pointers or Fortran Allocatable arrays in the part of the code being used. [...]

Re: [petsc-users] Memory growth issue

2019-05-30 Thread Lawrence Mitchell via petsc-users
Hi Sanjay,

> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users 
>  wrote:
> 
> The problem seems to persist but with a different signature.  Graphs attached 
> as before.
> 
> Totals with MPICH (NB: single run)
> 
> For the CG/Jacobi  data_exchange_total = 41,385,984; kspsolve_total = 
> 38,289,408
> For the GMRES/BJACOBI  data_exchange_total = 41,324,544; kspsolve_total = 
> 41,324,544
> 
> Just reading the MPI docs I am wondering if I need some sort of 
> MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange routine?
> I would have thought that with the blocking receives and the MPI_Barrier that 
> everything will have fully completed and cleaned up before
> all processes exited the routine, but perhaps I am wrong on that.


Skimming the fortran code you sent you do:

for i in ...:
   call MPI_Isend(..., req, ierr)

for i in ...:
   call MPI_Recv(..., ierr)

But you never call MPI_Wait on the request you got back from the Isend. So the 
MPI library will never free the data structures it created.

The usual pattern for these non-blocking communications is to allocate an array 
for the requests of length nsend+nrecv and then do:

for i in nsend:
   call MPI_Isend(..., req[i], ierr)
for j in nrecv:
   call MPI_Irecv(..., req[nsend+j], ierr)

call MPI_Waitall(req, ..., ierr)

I note also there's no need for the Barrier at the end of the routine, this 
kind of communication does neighbourwise synchronisation, no need to add 
(unnecessary) global synchronisation too.
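
To make that concrete, a minimal Fortran sketch of the pattern might look like
the following (every name here is a placeholder, not taken from the routine in
the thread, and it assumes fixed-length messages of msglen doubles):

      subroutine exchange_sketch(msglen,nsend,nrecv,dest,src,
     &                           sbuf,rbuf,comm,ierr)
!     sketch of the Isend/Irecv/Waitall pattern described above
      implicit none
      include 'mpif.h'
      integer          msglen,nsend,nrecv,comm,ierr,i,j,tag
      integer          dest(nsend),src(nrecv)
      double precision sbuf(msglen,nsend),rbuf(msglen,nrecv)
      integer          req(nsend+nrecv)
      integer          stats(MPI_STATUS_SIZE,nsend+nrecv)
      parameter       (tag = 99)

!     post all nonblocking sends and receives
      do i = 1,nsend
         call MPI_Isend(sbuf(1,i),msglen,MPI_DOUBLE_PRECISION,dest(i),
     &                  tag,comm,req(i),ierr)
      end do
      do j = 1,nrecv
         call MPI_Irecv(rbuf(1,j),msglen,MPI_DOUBLE_PRECISION,src(j),
     &                  tag,comm,req(nsend+j),ierr)
      end do
!     completes and frees every request; after this the send buffers can
!     be reused and the received data is valid -- no MPI_Barrier needed
      call MPI_Waitall(nsend+nrecv,req,stats,ierr)
      end subroutine exchange_sketch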

As an aside, is there a reason you don't use PETSc's VecScatter to manage this 
global to local exchange?

Cheers,

Lawrence

Re: [petsc-users] Memory growth issue

2019-05-30 Thread Smith, Barry F. via petsc-users


  Great observation Lawrence. 
https://www.slideshare.net/jsquyres/friends-dont-let-friends-leak-mpirequests

  You can add the following option to --download-mpich 

--download-mpich-configure-arguments="--enable-error-messages=all --enable-g"

  then MPICH will report all MPI resources that have not been freed during the 
run. This helps catch missing waits, etc. We have a nightly test that 
utilizes this for the PETSc libraries.

   Barry
 



> On May 30, 2019, at 6:48 AM, Lawrence Mitchell  wrote:
> 
> Hi Sanjay,
> 
>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users 
>>  wrote:
>> 
>> The problem seems to persist but with a different signature.  Graphs 
>> attached as before.
>> 
>> Totals with MPICH (NB: single run)
>> 
>> For the CG/Jacobi  data_exchange_total = 41,385,984; kspsolve_total 
>> = 38,289,408
>> For the GMRES/BJACOBI  data_exchange_total = 41,324,544; kspsolve_total 
>> = 41,324,544
>> 
>> Just reading the MPI docs I am wondering if I need some sort of 
>> MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange routine?
>> I would have thought that with the blocking receives and the MPI_Barrier 
>> that everything will have fully completed and cleaned up before
>> all processes exited the routine, but perhaps I am wrong on that.
> 
> 
> Skimming the fortran code you sent you do:
> 
> for i in ...:
>   call MPI_Isend(..., req, ierr)
> 
> for i in ...:
>   call MPI_Recv(..., ierr)
> 
> But you never call MPI_Wait on the request you got back from the Isend. So 
> the MPI library will never free the data structures it created.
> 
> The usual pattern for these non-blocking communications is to allocate an 
> array for the requests of length nsend+nrecv and then do:
> 
> for i in nsend:
>   call MPI_Isend(..., req[i], ierr)
> for j in nrecv:
>   call MPI_Irecv(..., req[nsend+j], ierr)
> 
> call MPI_Waitall(req, ..., ierr)
> 
> I note also there's no need for the Barrier at the end of the routine, this 
> kind of communication does neighbourwise synchronisation, no need to add 
> (unnecessary) global synchronisation too.
> 
> As an aside, is there a reason you don't use PETSc's VecScatter to manage 
> this global to local exchange?
> 
> Cheers,
> 
> Lawrence



Re: [petsc-users] Memory growth issue

2019-05-30 Thread Smith, Barry F. via petsc-users


   Thanks for the update. So the current conclusions are that using the Waitall 
in your code

1) solves the memory issue with OpenMPI in your code

2) does not solve the memory issue with PETSc KSPSolve 

3) MPICH has memory issues both for your code and PETSc KSPSolve, despite the 
Waitall fix?

If you literally just comment out the call to KSPSolve() with OpenMPI, is there 
no growth in memory usage?


Both 2 and 3 are concerning; they indicate possible memory leak bugs in MPICH and 
MPI resources not being freed in KSPSolve().

Junchao, can you please investigate 2 and 3 with, for example, a TS example 
that uses the linear solver (like with -ts_type beuler)? Thanks


  Barry



> On May 30, 2019, at 1:47 PM, Sanjay Govindjee  wrote:
> 
> Lawrence,
> Thanks for taking a look!  This is what I had been wondering about -- my 
> knowledge of MPI is pretty minimal and
> the origins of the routine are from a programmer we hired a decade+ back 
> from NERSC.  I'll have to look into
> VecScatter.  It will be great to dispense with our roll-your-own routines (we 
> even have our own reduceALL scattered around the code).
> 
> Interestingly, the MPI_WaitALL has solved the problem when using OpenMPI but 
> it still persists with MPICH.  Graphs attached.
> I'm going to run with openmpi for now (but I guess I really still need to 
> figure out what is wrong with MPICH and WaitALL;
> I'll try Barry's suggestion of 
> --download-mpich-configure-arguments="--enable-error-messages=all --enable-g" 
> later today and report back).
> 
> Regarding MPI_Barrier, it was put in due to a problem that some processes were 
> finishing up sending and receiving and exiting the subroutine
> before the receiving processes had completed (which resulted in data loss as 
> the buffers are freed after the call to the routine). MPI_Barrier was the 
> solution proposed
> to us.  I don't think I can dispense with it, but will think about some more.
> 
> I'm not so sure about using MPI_IRecv as it will require a bit of rewriting 
> since right now I process the received
> data sequentially after each blocking MPI_Recv -- clearly slower but easier 
> to code.
> 
> Thanks again for the help.
> 
> -sanjay
> 
> On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
>> Hi Sanjay,
>> 
>>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users 
>>>  wrote:
>>> 
>>> The problem seems to persist but with a different signature.  Graphs 
>>> attached as before.
>>> 
>>> Totals with MPICH (NB: single run)
>>> 
>>> For the CG/Jacobi  data_exchange_total = 41,385,984; kspsolve_total 
>>> = 38,289,408
>>> For the GMRES/BJACOBI  data_exchange_total = 41,324,544; kspsolve_total 
>>> = 41,324,544
>>> 
>>> Just reading the MPI docs I am wondering if I need some sort of 
>>> MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange routine?
>>> I would have thought that with the blocking receives and the MPI_Barrier 
>>> that everything will have fully completed and cleaned up before
>>> all processes exited the routine, but perhaps I am wrong on that.
>> 
>> Skimming the fortran code you sent you do:
>> 
>> for i in ...:
>>call MPI_Isend(..., req, ierr)
>> 
>> for i in ...:
>>call MPI_Recv(..., ierr)
>> 
>> But you never call MPI_Wait on the request you got back from the Isend. So 
>> the MPI library will never free the data structures it created.
>> 
>> The usual pattern for these non-blocking communications is to allocate an 
>> array for the requests of length nsend+nrecv and then do:
>> 
>> for i in nsend:
>>call MPI_Isend(..., req[i], ierr)
>> for j in nrecv:
>>call MPI_Irecv(..., req[nsend+j], ierr)
>> 
>> call MPI_Waitall(req, ..., ierr)
>> 
>> I note also there's no need for the Barrier at the end of the routine, this 
>> kind of communication does neighbourwise synchronisation, no need to add 
>> (unnecessary) global synchronisation too.
>> 
>> As an aside, is there a reason you don't use PETSc's VecScatter to manage 
>> this global to local exchange?
>> 
>> Cheers,
>> 
>> Lawrence
> 
> 



Re: [petsc-users] Memory growth issue

2019-05-30 Thread Sanjay Govindjee via petsc-users
1) Correct: Placing a Waitall before the MPI_Barrier solves the problem 
in our send-get routine for OpenMPI.

2) Correct: The problem persists with KSPSolve.
3) Correct: Waitall did not fix the problem in our send-get routine nor in 
KSPSolve when using MPICH.


Also correct: commenting out the call to KSPSolve results in zero 
memory growth with OpenMPI.


On 5/30/19 11:59 AM, Smith, Barry F. wrote:

Thanks for the update. So the current conclusions are that using the 
Waitall in your code

1) solves the memory issue with OpenMPI in your code

2) does not solve the memory issue with PETSc KSPSolve

3) MPICH has memory issues both for your code and PETSc KSPSolve (despite) the 
wait all fix?

If you literately just comment out the call to KSPSolve() with OpenMPI is there 
no growth in memory usage?


Both 2 and 3 are concerning, indicate possible memory leak bugs in MPICH and 
not freeing all MPI resources in KSPSolve()

Junchao, can you please investigate 2 and 3 with, for example, a TS example 
that uses the linear solver (like with -ts_type beuler)? Thanks


   Barry




On May 30, 2019, at 1:47 PM, Sanjay Govindjee  wrote:

Lawrence,
Thanks for taking a look!  This is what I had been wondering about -- my 
knowledge of MPI is pretty minimal and
this origins of the routine were from a programmer we hired a decade+ back from 
NERSC.  I'll have to look into
VecScatter.  It will be great to dispense with our roll-your-own routines (we 
even have our own reduceALL scattered around the code).

Interestingly, the MPI_WaitALL has solved the problem when using OpenMPI but it 
still persists with MPICH.  Graphs attached.
I'm going to run with openmpi for now (but I guess I really still need to 
figure out what is wrong with MPICH and WaitALL;
I'll try Barry's suggestion of 
--download-mpich-configure-arguments="--enable-error-messages=all --enable-g" 
later today and report back).

Regarding MPI_Barrier, it was put in due a problem that some processes were 
finishing up sending and receiving and exiting the subroutine
before the receiving processes had completed (which resulted in data loss as 
the buffers are freed after the call to the routine). MPI_Barrier was the 
solution proposed
to us.  I don't think I can dispense with it, but will think about some more.

I'm not so sure about using MPI_IRecv as it will require a bit of rewriting 
since right now I process the received
data sequentially after each blocking MPI_Recv -- clearly slower but easier to 
code.

Thanks again for the help.

-sanjay

On 5/30/19 4:48 AM, Lawrence Mitchell wrote:

Hi Sanjay,


On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users 
 wrote:

The problem seems to persist but with a different signature.  Graphs attached 
as before.

Totals with MPICH (NB: single run)

For the CG/Jacobi  data_exchange_total = 41,385,984; kspsolve_total = 
38,289,408
For the GMRES/BJACOBI  data_exchange_total = 41,324,544; kspsolve_total = 
41,324,544

Just reading the MPI docs I am wondering if I need some sort of 
MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange routine?
I would have thought that with the blocking receives and the MPI_Barrier that 
everything will have fully completed and cleaned up before
all processes exited the routine, but perhaps I am wrong on that.

Skimming the fortran code you sent you do:

for i in ...:
call MPI_Isend(..., req, ierr)

for i in ...:
call MPI_Recv(..., ierr)

But you never call MPI_Wait on the request you got back from the Isend. So the 
MPI library will never free the data structures it created.

The usual pattern for these non-blocking communications is to allocate an array 
for the requests of length nsend+nrecv and then do:

for i in nsend:
call MPI_Isend(..., req[i], ierr)
for j in nrecv:
call MPI_Irecv(..., req[nsend+j], ierr)

call MPI_Waitall(req, ..., ierr)

I note also there's no need for the Barrier at the end of the routine, this 
kind of communication does neighbourwise synchronisation, no need to add 
(unnecessary) global synchronisation too.

As an aside, is there a reason you don't use PETSc's VecScatter to manage this 
global to local exchange?

Cheers,

Lawrence






Re: [petsc-users] Memory growth issue

2019-05-30 Thread Zhang, Junchao via petsc-users

Hi, Sanjay,
  Could you send your modified data exchange code (psetb.F) with MPI_Waitall? 
See other inlined comments below. Thanks.

On Thu, May 30, 2019 at 1:49 PM Sanjay Govindjee via petsc-users wrote:
Lawrence,
Thanks for taking a look!  This is what I had been wondering about -- my
knowledge of MPI is pretty minimal and
this origins of the routine were from a programmer we hired a decade+
back from NERSC.  I'll have to look into
VecScatter.  It will be great to dispense with our roll-your-own
routines (we even have our own reduceALL scattered around the code).
Petsc VecScatter has a very simple interface and you definitely should go with. 
 With VecScatter, you can think in familiar vectors and indices instead of the 
low level MPI_Send/Recv. Besides that, PETSc has optimized VecScatter so that 
communication is efficient.

Interestingly, the MPI_WaitALL has solved the problem when using OpenMPI
but it still persists with MPICH.  Graphs attached.
I'm going to run with openmpi for now (but I guess I really still need
to figure out what is wrong with MPICH and WaitALL;
I'll try Barry's suggestion of
--download-mpich-configure-arguments="--enable-error-messages=all
--enable-g" later today and report back).

Regarding MPI_Barrier, it was put in due a problem that some processes
were finishing up sending and receiving and exiting the subroutine
before the receiving processes had completed (which resulted in data
loss as the buffers are freed after the call to the routine).
MPI_Barrier was the solution proposed
to us.  I don't think I can dispense with it, but will think about some
more.
After MPI_Send(), or after MPI_Isend(..,req) and MPI_Wait(req), you can safely 
free the send buffer without worry that the receive has not completed. MPI 
guarantees the receiver can get the data, for example, through internal 
buffering.

I'm not so sure about using MPI_IRecv as it will require a bit of
rewriting since right now I process the received
data sequentially after each blocking MPI_Recv -- clearly slower but
easier to code.

Thanks again for the help.

-sanjay

On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
> Hi Sanjay,
>
>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users wrote:
>>
>> The problem seems to persist but with a different signature.  Graphs 
>> attached as before.
>>
>> Totals with MPICH (NB: single run)
>>
>> For the CG/Jacobi  data_exchange_total = 41,385,984; kspsolve_total 
>> = 38,289,408
>> For the GMRES/BJACOBI  data_exchange_total = 41,324,544; kspsolve_total 
>> = 41,324,544
>>
>> Just reading the MPI docs I am wondering if I need some sort of 
>> MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange routine?
>> I would have thought that with the blocking receives and the MPI_Barrier 
>> that everything will have fully completed and cleaned up before
>> all processes exited the routine, but perhaps I am wrong on that.
>
> Skimming the fortran code you sent you do:
>
> for i in ...:
> call MPI_Isend(..., req, ierr)
>
> for i in ...:
> call MPI_Recv(..., ierr)
>
> But you never call MPI_Wait on the request you got back from the Isend. So 
> the MPI library will never free the data structures it created.
>
> The usual pattern for these non-blocking communications is to allocate an 
> array for the requests of length nsend+nrecv and then do:
>
> for i in nsend:
> call MPI_Isend(..., req[i], ierr)
> for j in nrecv:
> call MPI_Irecv(..., req[nsend+j], ierr)
>
> call MPI_Waitall(req, ..., ierr)
>
> I note also there's no need for the Barrier at the end of the routine, this 
> kind of communication does neighbourwise synchronisation, no need to add 
> (unnecessary) global synchronisation too.
>
> As an aside, is there a reason you don't use PETSc's VecScatter to manage 
> this global to local exchange?
>
> Cheers,
>
> Lawrence



Re: [petsc-users] Memory growth issue

2019-05-30 Thread Sanjay Govindjee via petsc-users

Hi Junchao,
 Thanks for the hints below; they will take some time to absorb, as the 
vectors that are being moved around are actually partly PETSc vectors and 
partly local process vectors.

Attached is the modified routine that now works (i.e., no longer leaks 
memory) with OpenMPI.


-sanjay


On 5/30/19 8:41 PM, Zhang, Junchao wrote:


Hi, Sanjay,
  Could you send your modified data exchange code (psetb.F) with 
MPI_Waitall? See other inlined comments below. Thanks.


On Thu, May 30, 2019 at 1:49 PM Sanjay Govindjee via petsc-users wrote:


Lawrence,
Thanks for taking a look!  This is what I had been wondering about
-- my
knowledge of MPI is pretty minimal and
this origins of the routine were from a programmer we hired a decade+
back from NERSC.  I'll have to look into
VecScatter.  It will be great to dispense with our roll-your-own
routines (we even have our own reduceALL scattered around the code).

Petsc VecScatter has a very simple interface and you definitely should 
go with.  With VecScatter, you can think in familiar vectors and 
indices instead of the low level MPI_Send/Recv. Besides that, PETSc 
has optimized VecScatter so that communication is efficient.



Interestingly, the MPI_WaitALL has solved the problem when using
OpenMPI
but it still persists with MPICH.  Graphs attached.
I'm going to run with openmpi for now (but I guess I really still
need
to figure out what is wrong with MPICH and WaitALL;
I'll try Barry's suggestion of
--download-mpich-configure-arguments="--enable-error-messages=all
--enable-g" later today and report back).

Regarding MPI_Barrier, it was put in due a problem that some
processes
were finishing up sending and receiving and exiting the subroutine
before the receiving processes had completed (which resulted in data
loss as the buffers are freed after the call to the routine).
MPI_Barrier was the solution proposed
to us.  I don't think I can dispense with it, but will think about
some
more.

After MPI_Send(), or after MPI_Isend(..,req) and MPI_Wait(req), you 
can safely free the send buffer without worry that the receive has not 
completed. MPI guarantees the receiver can get the data, for example, 
through internal buffering.



I'm not so sure about using MPI_IRecv as it will require a bit of
rewriting since right now I process the received
data sequentially after each blocking MPI_Recv -- clearly slower but
easier to code.

Thanks again for the help.

-sanjay

On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
> Hi Sanjay,
>
>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users wrote:
>>
>> The problem seems to persist but with a different signature. 
Graphs attached as before.
>>
>> Totals with MPICH (NB: single run)
>>
>> For the CG/Jacobi          data_exchange_total = 41,385,984;
kspsolve_total = 38,289,408
>> For the GMRES/BJACOBI      data_exchange_total = 41,324,544;
kspsolve_total = 41,324,544
>>
>> Just reading the MPI docs I am wondering if I need some sort of
MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange
routine?
>> I would have thought that with the blocking receives and the
MPI_Barrier that everything will have fully completed and cleaned
up before
>> all processes exited the routine, but perhaps I am wrong on that.
>
> Skimming the fortran code you sent you do:
>
> for i in ...:
>     call MPI_Isend(..., req, ierr)
>
> for i in ...:
>     call MPI_Recv(..., ierr)
>
> But you never call MPI_Wait on the request you got back from the
Isend. So the MPI library will never free the data structures it
created.
>
> The usual pattern for these non-blocking communications is to
allocate an array for the requests of length nsend+nrecv and then do:
>
> for i in nsend:
>     call MPI_Isend(..., req[i], ierr)
> for j in nrecv:
>     call MPI_Irecv(..., req[nsend+j], ierr)
>
> call MPI_Waitall(req, ..., ierr)
>
> I note also there's no need for the Barrier at the end of the
routine, this kind of communication does neighbourwise
synchronisation, no need to add (unnecessary) global
synchronisation too.
>
> As an aside, is there a reason you don't use PETSc's VecScatter
to manage this global to local exchange?
>
> Cheers,
>
> Lawrence



!$Id:$
  subroutine psetb(b,getp,getv,senp,senv,eq, ndf, rdatabuf,sdatabuf)

!  * * F E A P * * A Finite Element Analysis Program

!  Copyright (c) 1984-2017: Regents of the University of California
!   All rights reserved

!     Modification log                  Date (dd/mm/year)
!     [remainder of attached routine truncated in the archive]

Re: [petsc-users] Memory growth issue

2019-05-31 Thread Sanjay Govindjee via petsc-users

Matt,
  Here is the process as it currently stands:

1) I have a PETSc Vec (sol), which comes from a KSPSolve.

2) Each processor grabs its section of sol via VecGetOwnershipRange and 
VecGetArrayReadF90
and inserts parts of its section of sol in a local array (locarr) using 
a complex but easily computable mapping.


3) The routine you are looking at then exchanges various parts of the 
locarr between the processors.


4) Each processor then does computations using its updated locarr.

Typing it out this way, I guess the answer to your question is "yes."  I 
have a global Vec and I want its values sent in a complex but computable 
way to local vectors on each process.

-sanjay

On 5/31/19 3:37 AM, Matthew Knepley wrote:
On Thu, May 30, 2019 at 11:55 PM Sanjay Govindjee via petsc-users wrote:


Hi Juanchao,
Thanks for the hints below, they will take some time to absorb as
the vectors that are being  moved around
are actually partly petsc vectors and partly local process vectors.


Is this code just doing a global-to-local map? Meaning, does it just 
map all the local unknowns to some global unknown on some process? We have 
an even simpler interface for that, where we make the VecScatter 
automatically:

https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/IS/ISLocalToGlobalMappingCreate.html#ISLocalToGlobalMappingCreate

Then you can use it with Vecs, Mats, etc.

  Thanks,

     Matt

Attached is the modified routine that now works (on leaking
memory) with openmpi.

-sanjay
On 5/30/19 8:41 PM, Zhang, Junchao wrote:


Hi, Sanjay,
  Could you send your modified data exchange code (psetb.F) with
MPI_Waitall? See other inlined comments below. Thanks.

On Thu, May 30, 2019 at 1:49 PM Sanjay Govindjee via petsc-users wrote:

Lawrence,
Thanks for taking a look!  This is what I had been wondering
about -- my
knowledge of MPI is pretty minimal and
this origins of the routine were from a programmer we hired a
decade+
back from NERSC.  I'll have to look into
VecScatter.  It will be great to dispense with our roll-your-own
routines (we even have our own reduceALL scattered around the
code).

Petsc VecScatter has a very simple interface and you definitely
should go with.  With VecScatter, you can think in familiar
vectors and indices instead of the low level MPI_Send/Recv.
Besides that, PETSc has optimized VecScatter so that
communication is efficient.


Interestingly, the MPI_WaitALL has solved the problem when
using OpenMPI
but it still persists with MPICH.  Graphs attached.
I'm going to run with openmpi for now (but I guess I really
still need
to figure out what is wrong with MPICH and WaitALL;
I'll try Barry's suggestion of
--download-mpich-configure-arguments="--enable-error-messages=all

--enable-g" later today and report back).

Regarding MPI_Barrier, it was put in due a problem that some
processes
were finishing up sending and receiving and exiting the
subroutine
before the receiving processes had completed (which resulted
in data
loss as the buffers are freed after the call to the routine).
MPI_Barrier was the solution proposed
to us.  I don't think I can dispense with it, but will think
about some
more.

After MPI_Send(), or after MPI_Isend(..,req) and MPI_Wait(req),
you can safely free the send buffer without worry that the
receive has not completed. MPI guarantees the receiver can get
the data, for example, through internal buffering.


I'm not so sure about using MPI_IRecv as it will require a
bit of
rewriting since right now I process the received
data sequentially after each blocking MPI_Recv -- clearly
slower but
easier to code.

Thanks again for the help.

-sanjay

On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
> Hi Sanjay,
>
>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users wrote:
>>
>> The problem seems to persist but with a different
signature.  Graphs attached as before.
>>
>> Totals with MPICH (NB: single run)
>>
>> For the CG/Jacobi data_exchange_total = 41,385,984;
kspsolve_total = 38,289,408
>> For the GMRES/BJACOBI data_exchange_total = 41,324,544;
kspsolve_total = 41,324,544
>>
>> Just reading the MPI docs I am wondering if I need some
sort of MPI_Wait/MPI_Waitall before my MPI_Barrier in the
data exchange routine?
>> I would have thought that with the blocking receives and
the MPI_Barrier that everything will have fully completed and cleaned up
before all processes exited the routine, but perhaps I am wrong on that. [...]

Re: [petsc-users] Memory growth issue

2019-05-31 Thread Zhang, Junchao via petsc-users
Sanjay,
I tried petsc with MPICH and OpenMPI on my Macbook. I inserted 
PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and end 
of KSPSolve and then computed the delta and summed over processes. Then I 
tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
With OpenMPI,
mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler -ts_max_steps 
500 > 128.log
grep -n -v "RSS Delta= 0, Malloc Delta= 0" 128.log
1:RSS Delta= 69632, Malloc Delta= 0
2:RSS Delta= 69632, Malloc Delta= 0
3:RSS Delta= 69632, Malloc Delta= 0
4:RSS Delta= 69632, Malloc Delta= 0
9:RSS Delta=9.25286e+06, Malloc Delta= 0
22:RSS Delta= 49152, Malloc Delta= 0
44:RSS Delta= 20480, Malloc Delta= 0
53:RSS Delta= 49152, Malloc Delta= 0
66:RSS Delta=  4096, Malloc Delta= 0
97:RSS Delta= 16384, Malloc Delta= 0
119:RSS Delta= 20480, Malloc Delta= 0
141:RSS Delta= 53248, Malloc Delta= 0
176:RSS Delta= 16384, Malloc Delta= 0
308:RSS Delta= 16384, Malloc Delta= 0
352:RSS Delta= 16384, Malloc Delta= 0
550:RSS Delta= 16384, Malloc Delta= 0
572:RSS Delta= 16384, Malloc Delta= 0
669:RSS Delta= 40960, Malloc Delta= 0
924:RSS Delta= 32768, Malloc Delta= 0
1694:RSS Delta= 20480, Malloc Delta= 0
2099:RSS Delta= 16384, Malloc Delta= 0
2244:RSS Delta= 20480, Malloc Delta= 0
3001:RSS Delta= 16384, Malloc Delta= 0
5883:RSS Delta= 16384, Malloc Delta= 0

If I increased the grid
mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 
500 -malloc_test >512.log
grep -n -v "RSS Delta= 0, Malloc Delta= 0" 512.log
1:RSS Delta=1.05267e+06, Malloc Delta= 0
2:RSS Delta=1.05267e+06, Malloc Delta= 0
3:RSS Delta=1.05267e+06, Malloc Delta= 0
4:RSS Delta=1.05267e+06, Malloc Delta= 0
13:RSS Delta=1.24932e+08, Malloc Delta= 0

So we did see RSS increase in 4k-page sizes after KSPSolve. As long as there are 
no memory leaks, why do you care about it? Is it because you run out of memory?

On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. wrote:

   Thanks for the update. So the current conclusions are that using the Waitall 
in your code

1) solves the memory issue with OpenMPI in your code

2) does not solve the memory issue with PETSc KSPSolve

3) MPICH has memory issues both for your code and PETSc KSPSolve (despite) the 
wait all fix?

If you literately just comment out the call to KSPSolve() with OpenMPI is there 
no growth in memory usage?


Both 2 and 3 are concerning, indicate possible memory leak bugs in MPICH and 
not freeing all MPI resources in KSPSolve()

Junchao, can you please investigate 2 and 3 with, for example, a TS example 
that uses the linear solver (like with -ts_type beuler)? Thanks


  Barry



> On May 30, 2019, at 1:47 PM, Sanjay Govindjee wrote:
>
> Lawrence,
> Thanks for taking a look!  This is what I had been wondering about -- my 
> knowledge of MPI is pretty minimal and
> this origins of the routine were from a programmer we hired a decade+ back 
> from NERSC.  I'll have to look into
> VecScatter.  It will be great to dispense with our roll-your-own routines (we 
> even have our own reduceALL scattered around the code).
>
> Interestingly, the MPI_WaitALL has solved the problem when using OpenMPI but 
> it still persists with MPICH.  Graphs attached.
> I'm going to run with openmpi for now (but I guess I really still need to 
> figure out what is wrong with MPICH and WaitALL;
> I'll try Barry's suggestion of 
> --download-mpich-configure-arguments="--enable-error-messages=all --enable-g" 
> later today and report back).
>
> Regarding MPI_Barrier, it was put in due a problem that some processes were 
> finishing up sending and receiving and exiting the subroutine
> before the receiving processes had completed (which resulted in data loss as 
> the buffers are freed after the call to the routine). MPI_Barrier was the 
> solution proposed
> to us.  I don't think I can dispense with it, but will think about some more.
>
> I'm not so sure about using MPI_IRecv as it will require a bit of rewriting 
> since right now I process the received
> data sequentially after each blocking MPI_Recv -- clearly slower but easier 
> to code.
>
> Thanks again for the help.
>
> -sanjay
>
> On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
>> Hi Sanjay,
>>
>>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users wrote:
>>>
>>> The problem seems to persist but with a different signature.  Graphs 
>>> attached as before.
>>>
>>> Totals with MPICH (NB: single run)
>>>
>>> For the CG/Jacobi  data_exchange_total = 41,385,984; kspsolve_total 
>>> = 38,289,408
>>> For the GMRES/BJACOBI  data_exchange_total = 41,324,544; kspsolve_total = 41,324,544 [...]

Re: [petsc-users] Memory growth issue

2019-05-31 Thread Sanjay Govindjee via petsc-users

Yes, the issue is running out of memory on long runs.
Perhaps some clean-up happens later when the memory pressure builds, but 
that is a bit non-ideal.

-sanjay

On 5/31/19 12:53 PM, Zhang, Junchao wrote:

Sanjay,
I tried petsc with MPICH and OpenMPI on my Macbook. I 
inserted PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the 
beginning and end of KSPSolve and then computed the delta and summed 
over processes. Then I tested 
with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c

With OpenMPI,
mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler 
-ts_max_steps 500 > 128.log

grep -n -v "RSS Delta=         0, Malloc Delta= 0" 128.log
1:RSS Delta=     69632, Malloc Delta=         0
2:RSS Delta=     69632, Malloc Delta=         0
3:RSS Delta=     69632, Malloc Delta=         0
4:RSS Delta=     69632, Malloc Delta=         0
9:RSS Delta=9.25286e+06, Malloc Delta=         0
22:RSS Delta=     49152, Malloc Delta=         0
44:RSS Delta=     20480, Malloc Delta=         0
53:RSS Delta=     49152, Malloc Delta=         0
66:RSS Delta=      4096, Malloc Delta=         0
97:RSS Delta=     16384, Malloc Delta=         0
119:RSS Delta=     20480, Malloc Delta=         0
141:RSS Delta=     53248, Malloc Delta=         0
176:RSS Delta=     16384, Malloc Delta=         0
308:RSS Delta=     16384, Malloc Delta=         0
352:RSS Delta=     16384, Malloc Delta=         0
550:RSS Delta=     16384, Malloc Delta=         0
572:RSS Delta=     16384, Malloc Delta=         0
669:RSS Delta=     40960, Malloc Delta=         0
924:RSS Delta=     32768, Malloc Delta=         0
1694:RSS Delta=     20480, Malloc Delta=         0
2099:RSS Delta=     16384, Malloc Delta=         0
2244:RSS Delta=     20480, Malloc Delta=         0
3001:RSS Delta=     16384, Malloc Delta=         0
5883:RSS Delta=     16384, Malloc Delta=         0

If I increased the grid
mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler 
-ts_max_steps 500 -malloc_test >512.log

grep -n -v "RSS Delta=         0, Malloc Delta= 0" 512.log
1:RSS Delta=1.05267e+06, Malloc Delta=         0
2:RSS Delta=1.05267e+06, Malloc Delta=         0
3:RSS Delta=1.05267e+06, Malloc Delta=         0
4:RSS Delta=1.05267e+06, Malloc Delta=         0
13:RSS Delta=1.24932e+08, Malloc Delta=         0

So we did see RSS increase in 4k-page sizes after KSPSolve. As long as 
no memory leaks, why do you care about it? Is it because you run out 
of memory?


On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. wrote:



   Thanks for the update. So the current conclusions are that
using the Waitall in your code

1) solves the memory issue with OpenMPI in your code

2) does not solve the memory issue with PETSc KSPSolve

3) MPICH has memory issues both for your code and PETSc KSPSolve
(despite) the wait all fix?

If you literately just comment out the call to KSPSolve() with
OpenMPI is there no growth in memory usage?


Both 2 and 3 are concerning, indicate possible memory leak bugs in
MPICH and not freeing all MPI resources in KSPSolve()

Junchao, can you please investigate 2 and 3 with, for example, a
TS example that uses the linear solver (like with -ts_type
beuler)? Thanks


  Barry



> On May 30, 2019, at 1:47 PM, Sanjay Govindjee wrote:
>
> Lawrence,
> Thanks for taking a look!  This is what I had been wondering
about -- my knowledge of MPI is pretty minimal and
> this origins of the routine were from a programmer we hired a
decade+ back from NERSC.  I'll have to look into
> VecScatter.  It will be great to dispense with our roll-your-own
routines (we even have our own reduceALL scattered around the code).
>
> Interestingly, the MPI_WaitALL has solved the problem when using
OpenMPI but it still persists with MPICH.  Graphs attached.
> I'm going to run with openmpi for now (but I guess I really
still need to figure out what is wrong with MPICH and WaitALL;
> I'll try Barry's suggestion of
--download-mpich-configure-arguments="--enable-error-messages=all
--enable-g" later today and report back).
>
> Regarding MPI_Barrier, it was put in due a problem that some
processes were finishing up sending and receiving and exiting the
subroutine
> before the receiving processes had completed (which resulted in
data loss as the buffers are freed after the call to the routine).
MPI_Barrier was the solution proposed
> to us.  I don't think I can dispense with it, but will think
about some more.
>
> I'm not so sure about using MPI_IRecv as it will require a bit
of rewriting since right now I process the received
> data sequentially after each blocking MPI_Recv -- clearly slower
but easier to code.
>
> Thanks again for the help.
>
> -sanjay
>
> On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
>> Hi Sanjay,
  

Re: [petsc-users] Memory growth issue

2019-05-31 Thread Sanjay Govindjee via petsc-users

Thanks Stefano.

Reading the manual pages a bit more carefully,
I think I can see what I should be doing, which is roughly to:

1. Set up target Seq vectors on PETSC_COMM_SELF.
2. Use ISCreateGeneral to create ISs for the target Vecs and for the source 
Vec, which will be MPI on PETSC_COMM_WORLD.
3. Create the scatter context with VecScatterCreate.
4. Call VecScatterBegin/End on each process (instead of using my prior 
routine).  (A minimal sketch of these steps follows.)
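
A minimal free-form Fortran sketch of those four steps (the routine name and
the index arrays from_idx/to_idx are made up; sol is the parallel Vec from
KSPSolve, the indices into it are global, and the indices into the Seq target
are local):

#include <petsc/finclude/petscvec.h>
      subroutine scatter_sol_to_local(sol,nloc,from_idx,to_idx,vloc,ierr)
      use petscvec
      implicit none
      Vec            sol            ! parallel source Vec (from KSPSolve)
      Vec            vloc           ! sequential target Vec (step 1)
      PetscInt       nloc           ! number of entries this process needs
      PetscInt       from_idx(*)    ! global indices into sol
      PetscInt       to_idx(*)      ! local indices into vloc
      PetscErrorCode ierr
      IS             is_from,is_to
      VecScatter     ctx

      call VecCreateSeq(PETSC_COMM_SELF,nloc,vloc,ierr)
      call ISCreateGeneral(PETSC_COMM_SELF,nloc,from_idx, &
                           PETSC_COPY_VALUES,is_from,ierr)
      call ISCreateGeneral(PETSC_COMM_SELF,nloc,to_idx, &
                           PETSC_COPY_VALUES,is_to,ierr)
      call VecScatterCreate(sol,is_from,vloc,is_to,ctx,ierr)
      call VecScatterBegin(ctx,sol,vloc,INSERT_VALUES,SCATTER_FORWARD,ierr)
      call VecScatterEnd(ctx,sol,vloc,INSERT_VALUES,SCATTER_FORWARD,ierr)
      call ISDestroy(is_from,ierr)
      call ISDestroy(is_to,ierr)
!     keep ctx and reuse it on every later solve, or destroy it when done:
      call VecScatterDestroy(ctx,ierr)
      end subroutine scatter_sol_to_local

In practice the IS/VecScatter creation would be done once and only the
Begin/End pair repeated after every solve.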


Lingering questions:

a. Is there any performance advantage/disadvantage to creating a single 
parallel target Vec instead of multiple target Seq Vecs (in terms of the 
scatter operation)?

b. The data that ends up in the target on each processor needs to be in an 
application array.  Is there a clever way to 'move' the data from the 
scatter target to the array (short of just running a loop over it and 
copying)?

-sanjay



On 5/31/19 12:02 PM, Stefano Zampini wrote:



On May 31, 2019, at 9:50 PM, Sanjay Govindjee via petsc-users wrote:


Matt,
  Here is the process as it currently stands:

1) I have a PETSc Vec (sol), which come from a KSPSolve

2) Each processor grabs its section of sol via VecGetOwnershipRange 
and VecGetArrayReadF90
and inserts parts of its section of sol in a local array (locarr) 
using a complex but easily computable mapping.


3) The routine you are looking at then exchanges various parts of the 
locarr between the processors.




You need a VecScatter object 
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecScatterCreate.html#VecScatterCreate 




4) Each processor then does computations using its updated locarr.

Typing it out this way, I guess the answer to your question is 
"yes."  I have a global Vec and I want its values

sent in a complex but computable way to local vectors on each process.

-sanjay
On 5/31/19 3:37 AM, Matthew Knepley wrote:
On Thu, May 30, 2019 at 11:55 PM Sanjay Govindjee via petsc-users wrote:


Hi Juanchao,
Thanks for the hints below, they will take some time to absorb
as the vectors that are being moved around
are actually partly petsc vectors and partly local process vectors.


Is this code just doing a global-to-local map? Meaning, does it just 
map all the local unknowns to some global
unknown on some process? We have an even simpler interface for that, 
where we make the VecScatter

automatically,

https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/IS/ISLocalToGlobalMappingCreate.html#ISLocalToGlobalMappingCreate

Then you can use it with Vecs, Mats, etc.

  Thanks,

     Matt

Attached is the modified routine that now works (on leaking
memory) with openmpi.

-sanjay
On 5/30/19 8:41 PM, Zhang, Junchao wrote:


Hi, Sanjay,
  Could you send your modified data exchange code (psetb.F)
with MPI_Waitall? See other inlined comments below. Thanks.

On Thu, May 30, 2019 at 1:49 PM Sanjay Govindjee via
petsc-users mailto:petsc-users@mcs.anl.gov>> wrote:

Lawrence,
Thanks for taking a look!  This is what I had been
wondering about -- my
knowledge of MPI is pretty minimal and
this origins of the routine were from a programmer we hired
a decade+
back from NERSC.  I'll have to look into
VecScatter.  It will be great to dispense with our
roll-your-own
routines (we even have our own reduceALL scattered around
the code).

Petsc VecScatter has a very simple interface and you definitely
should go with.  With VecScatter, you can think in familiar
vectors and indices instead of the low level MPI_Send/Recv.
Besides that, PETSc has optimized VecScatter so that
communication is efficient.


Interestingly, the MPI_WaitALL has solved the problem when
using OpenMPI
but it still persists with MPICH. Graphs attached.
I'm going to run with openmpi for now (but I guess I really
still need
to figure out what is wrong with MPICH and WaitALL;
I'll try Barry's suggestion of
--download-mpich-configure-arguments="--enable-error-messages=all

--enable-g" later today and report back).

Regarding MPI_Barrier, it was put in due a problem that
some processes
were finishing up sending and receiving and exiting the
subroutine
before the receiving processes had completed (which
resulted in data
loss as the buffers are freed after the call to the routine).
MPI_Barrier was the solution proposed
to us.  I don't think I can dispense with it, but will
think about some
more.

After MPI_Send(), or after MPI_Isend(..,req) and MPI_Wait(req),
you can safely free the send buffer without worry that the
receive has not completed. MPI guarantees the receiver can get
the data, for example, through internal buffering.


> I'm not so sure about using MPI_IRecv as it will require a bit of rewriting [...]

Re: [petsc-users] Memory growth issue

2019-05-31 Thread Zhang, Junchao via petsc-users


On Fri, May 31, 2019 at 3:48 PM Sanjay Govindjee via petsc-users wrote:
Thanks Stefano.

Reading the manual pages a bit more carefully,
I think I can see what I should be doing.  Which should be roughly to

1. Set up target Seq vectors on PETSC_COMM_SELF
2. Use ISCreateGeneral to create ISs for the target Vecs  and the source Vec 
which will be MPI on PETSC_COMM_WORLD.
3. Create the scatter context with VecScatterCreate
4. Call VecScatterBegin/End on each process (instead of using my prior routine).

Lingering questions:

a. Is there any performance advantage/disadvantage to creating a single 
parallel target Vec instead
of multiple target Seq Vecs (in terms of the scatter operation)?
No performance difference. But pay attention: if you use Seq vecs, the indices 
in the IS are locally numbered; if you use an MPI vec, the indices are globally 
numbered.


b. The data that ends up in the target on each processor needs to be in an 
application
array.  Is there a clever way to 'move' the data from the scatter target to the 
array (short
of just running a loop over it and copying)?

See VecGetArray, VecGetArrayRead etc, which pull the data out of Vecs without 
memory copying.
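
For example, with the hypothetical Seq target Vec (vloc) from the sketch
earlier above, something like:

#include <petsc/finclude/petscvec.h>
      subroutine use_local_values(vloc,nloc,ierr)
      use petscvec
      implicit none
      Vec            vloc
      PetscInt       nloc
      PetscErrorCode ierr
      PetscScalar, pointer :: xv(:)

!     xv aliases vloc's storage -- no copy is made
      call VecGetArrayReadF90(vloc,xv,ierr)
!     ... use xv(1:nloc) wherever the application array (locarr) was used ...
      call VecRestoreArrayReadF90(vloc,xv,ierr)
      end subroutine use_local_values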
 -sanjay



On 5/31/19 12:02 PM, Stefano Zampini wrote:


On May 31, 2019, at 9:50 PM, Sanjay Govindjee via petsc-users wrote:

Matt,
  Here is the process as it currently stands:

1) I have a PETSc Vec (sol), which come from a KSPSolve

2) Each processor grabs its section of sol via VecGetOwnershipRange and 
VecGetArrayReadF90
and inserts parts of its section of sol in a local array (locarr) using a 
complex but easily computable mapping.

3) The routine you are looking at then exchanges various parts of the locarr 
between the processors.


You need a VecScatter object 
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/Vec/VecScatterCreate.html#VecScatterCreate

4) Each processor then does computations using its updated locarr.

Typing it out this way, I guess the answer to your question is "yes."  I have a 
global Vec and I want its values
sent in a complex but computable way to local vectors on each process.

-sanjay
On 5/31/19 3:37 AM, Matthew Knepley wrote:
On Thu, May 30, 2019 at 11:55 PM Sanjay Govindjee via petsc-users wrote:
Hi Juanchao,
Thanks for the hints below, they will take some time to absorb as the vectors 
that are being  moved around
are actually partly petsc vectors and partly local process vectors.

Is this code just doing a global-to-local map? Meaning, does it just map all 
the local unknowns to some global
unknown on some process? We have an even simpler interface for that, where we 
make the VecScatter
automatically,

  
https://www.mcs.anl.gov/petsc/petsc-current/docs/manualpages/IS/ISLocalToGlobalMappingCreate.html#ISLocalToGlobalMappingCreate

Then you can use it with Vecs, Mats, etc.
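
For what it's worth, a minimal sketch of that interface (the 'nloc' and
'ltog[]' arguments are illustrative assumptions, not from the actual code):

#include <petscvec.h>

/* Attach a local-to-global mapping to a parallel Vec so that values can be
   set with local indices; 'ltog[i]' gives the global index of local index i. */
PetscErrorCode AttachMapping(Vec x, PetscInt nloc, const PetscInt ltog[])
{
  PetscErrorCode         ierr;
  ISLocalToGlobalMapping map;

  PetscFunctionBegin;
  ierr = ISLocalToGlobalMappingCreate(PETSC_COMM_WORLD, 1, nloc, ltog, PETSC_COPY_VALUES, &map);CHKERRQ(ierr);
  ierr = VecSetLocalToGlobalMapping(x, map);CHKERRQ(ierr);
  /* Now VecSetValuesLocal(x, m, localidx, vals, INSERT_VALUES) can be used */
  ierr = ISLocalToGlobalMappingDestroy(&map);CHKERRQ(ierr); /* the Vec keeps its own reference */
  PetscFunctionReturn(0);
}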

  Thanks,

 Matt

Attached is the modified routine that now works (no longer leaking memory) with 
openmpi.

-sanjay
On 5/30/19 8:41 PM, Zhang, Junchao wrote:

Hi, Sanjay,
  Could you send your modified data exchange code (psetb.F) with MPI_Waitall? 
See other inlined comments below. Thanks.

On Thu, May 30, 2019 at 1:49 PM Sanjay Govindjee via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Lawrence,
Thanks for taking a look!  This is what I had been wondering about -- my
knowledge of MPI is pretty minimal and
the origins of the routine were with a programmer we hired a decade+
back from NERSC.  I'll have to look into
VecScatter.  It will be great to dispense with our roll-your-own
routines (we even have our own reduceALL scattered around the code).
Petsc VecScatter has a very simple interface and you definitely should go with it. 
 With VecScatter, you can think in familiar vectors and indices instead of the 
low level MPI_Send/Recv. Besides that, PETSc has optimized VecScatter so that 
communication is efficient.

Interestingly, the MPI_WaitALL has solved the problem when using OpenMPI
but it still persists with MPICH.  Graphs attached.
I'm going to run with openmpi for now (but I guess I really still need
to figure out what is wrong with MPICH and WaitALL;
I'll try Barry's suggestion of
--download-mpich-configure-arguments="--enable-error-messages=all
--enable-g" later today and report back).

Regarding MPI_Barrier, it was put in due to a problem where some processes
were finishing up sending and receiving and exiting the subroutine
before the receiving processes had completed (which resulted in data
loss as the buffers are freed after the call to the routine).
MPI_Barrier was the solution proposed
to us.  I don't think I can dispense with it, but will think about it some
more.
After MPI_Send(), or after MPI_Isend(..,req) and MPI_Wait(req), you can safely 
free the send buffer without worry that the receive has not completed. MPI 
guarantees the receiver can get the data, for example, through internal 
buffering.
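
A minimal sketch of that pattern in plain MPI (buffer sizes and names are
made up): post the nonblocking receives and sends, wait on all of the
requests, and only then free the send buffers; no MPI_Barrier is needed for
the correctness of the data exchange itself.

#include <mpi.h>
#include <stdlib.h>

/* Exchange a small buffer with every rank; illustrative only. */
int main(int argc, char **argv)
{
  int          rank, size, i, n = 100;
  double      *sendbuf, *recvbuf;
  MPI_Request *reqs;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);

  sendbuf = malloc((size_t)size * n * sizeof(double));
  recvbuf = malloc((size_t)size * n * sizeof(double));
  reqs    = malloc(2 * (size_t)size * sizeof(MPI_Request));

  for (i = 0; i < size * n; i++) sendbuf[i] = (double)rank;

  for (i = 0; i < size; i++) {
    MPI_Irecv(recvbuf + i * n, n, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &reqs[2 * i]);
    MPI_Isend(sendbuf + i * n, n, MPI_DOUBLE, i, 0, MPI_COMM_WORLD, &reqs[2 * i + 1]);
  }

  /* Completes both the sends and the receives; after this the send buffers
     may be freed safely. */
  MPI_Waitall(2 * size, reqs, MPI_STATUSES_IGNORE);

  free(sendbuf);
  free(recvbuf);
  free(reqs);
  MPI_Finalize();
  return 0;
}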

I'm not so su

Re: [petsc-users] Memory growth issue

2019-06-01 Thread Smith, Barry F. via petsc-users


  Junchao,

     This is insane. Either the OpenMPI library or something in the OS 
underneath related to sockets and interprocess communication is grabbing 
additional space for each round of MPI communication!  Does MPICH have the same 
values or different values than OpenMPI? When you run on Linux, do you get the 
same values as on Apple, or different ones? Same values would indicate the issue 
is inside OpenMPI/MPICH; different values would indicate the problem is more 
likely at the OS level. Does this happen only with the default VecScatter that 
uses blocking MPI, and what happens with PetscSF under Vec? Is it somehow 
related to PETSc's use of nonblocking sends and receives? One could presumably 
use valgrind to see exactly what lines in what code are causing these increases. 
I don't think we can just shrug and say this is the way it is; we need to track 
down and understand the cause (and if possible fix it).

  Barry


> On May 31, 2019, at 2:53 PM, Zhang, Junchao  wrote:
> 
> Sanjay,
> I tried petsc with MPICH and OpenMPI on my Macbook. I inserted 
> PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and 
> end of KSPSolve and then computed the delta and summed over processes. Then I 
> tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
> With OpenMPI, 
> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler -ts_max_steps 
> 500 > 128.log
> grep -n -v "RSS Delta= 0, Malloc Delta= 0" 128.log
> 1:RSS Delta= 69632, Malloc Delta= 0
> 2:RSS Delta= 69632, Malloc Delta= 0
> 3:RSS Delta= 69632, Malloc Delta= 0
> 4:RSS Delta= 69632, Malloc Delta= 0
> 9:RSS Delta=9.25286e+06, Malloc Delta= 0
> 22:RSS Delta= 49152, Malloc Delta= 0
> 44:RSS Delta= 20480, Malloc Delta= 0
> 53:RSS Delta= 49152, Malloc Delta= 0
> 66:RSS Delta=  4096, Malloc Delta= 0
> 97:RSS Delta= 16384, Malloc Delta= 0
> 119:RSS Delta= 20480, Malloc Delta= 0
> 141:RSS Delta= 53248, Malloc Delta= 0
> 176:RSS Delta= 16384, Malloc Delta= 0
> 308:RSS Delta= 16384, Malloc Delta= 0
> 352:RSS Delta= 16384, Malloc Delta= 0
> 550:RSS Delta= 16384, Malloc Delta= 0
> 572:RSS Delta= 16384, Malloc Delta= 0
> 669:RSS Delta= 40960, Malloc Delta= 0
> 924:RSS Delta= 32768, Malloc Delta= 0
> 1694:RSS Delta= 20480, Malloc Delta= 0
> 2099:RSS Delta= 16384, Malloc Delta= 0
> 2244:RSS Delta= 20480, Malloc Delta= 0
> 3001:RSS Delta= 16384, Malloc Delta= 0
> 5883:RSS Delta= 16384, Malloc Delta= 0
> 
> If I increased the grid
> mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 
> 500 -malloc_test >512.log
> grep -n -v "RSS Delta= 0, Malloc Delta= 0" 512.log
> 1:RSS Delta=1.05267e+06, Malloc Delta= 0
> 2:RSS Delta=1.05267e+06, Malloc Delta= 0
> 3:RSS Delta=1.05267e+06, Malloc Delta= 0
> 4:RSS Delta=1.05267e+06, Malloc Delta= 0
> 13:RSS Delta=1.24932e+08, Malloc Delta= 0
> 
> So we did see RSS increase in 4k-page sizes after KSPSolve. As long as no 
> memory leaks, why do you care about it? Is it because you run out of memory?
> 
> On Thu, May 30, 2019 at 1:59 PM Smith, Barry F.  wrote:
> 
>Thanks for the update. So the current conclusions are that using the 
> Waitall in your code
> 
> 1) solves the memory issue with OpenMPI in your code
> 
> 2) does not solve the memory issue with PETSc KSPSolve 
> 
> 3) MPICH has memory issues both for your code and PETSc KSPSolve (despite) 
> the wait all fix?
> 
> If you literally just comment out the call to KSPSolve() with OpenMPI is 
> there no growth in memory usage?
> 
> 
> Both 2 and 3 are concerning, indicate possible memory leak bugs in MPICH and 
> not freeing all MPI resources in KSPSolve()
> 
> Junchao, can you please investigate 2 and 3 with, for example, a TS example 
> that uses the linear solver (like with -ts_type beuler)? Thanks
> 
> 
>   Barry
> 
> 
> 
> > On May 30, 2019, at 1:47 PM, Sanjay Govindjee  wrote:
> > 
> > Lawrence,
> > Thanks for taking a look!  This is what I had been wondering about -- my 
> > knowledge of MPI is pretty minimal and
> > this origins of the routine were from a programmer we hired a decade+ back 
> > from NERSC.  I'll have to look into
> > VecScatter.  It will be great to dispense with our roll-your-own routines 
> > (we even have our own reduceALL scattered around the code).
> > 
> > Interestingly, the MPI_WaitALL has solved the problem when using OpenMPI 
> > but it still persists with MPICH.  Graphs attached.
> > I'm going to run with openmpi for now (but I guess I really still need to 
> > figure out what is wrong with MPICH and WaitALL;
> > I'll try Barry's suggestion of 
> > --download-mpich-configure-arguments="--enable-error-messages=all 
> > --enable-g" later t

Re: [petsc-users] Memory growth issue

2019-06-01 Thread Sanjay Govindjee via petsc-users

Barry,

If you look at the graphs I generated (on my Mac), you will see that 
OpenMPI and MPICH have very different values (along with the fact that 
MPICH does not seem to adhere to the standard for releasing MPI_Isend 
resources following an MPI_Wait).

-sanjay

PS: I agree with Barry's assessment; this is really not that acceptable.

On 6/1/19 1:00 AM, Smith, Barry F. wrote:

   Junchao,

  This is insane. Either the OpenMPI library or something in the OS 
underneath related to sockets and interprocess communication is grabbing 
additional space for each round of MPI communication!  Does MPICH have the same 
values or different values than OpenMP? When you run on Linux do you get the 
same values as Apple or different. --- Same values seem to indicate the issue 
is inside OpenMPI/MPICH different values indicates problem is more likely at 
the OS level. Does this happen only with the default VecScatter that uses 
blocking MPI, what happens with PetscSF under Vec? Is it somehow related to 
PETSc's use of nonblocking sends and receives? One could presumably use 
valgrind to see exactly what lines in what code are causing these increases. I 
don't think we can just shrug and say this is the way it is, we need to track 
down and understand the cause (and if possible fix).

   Barry



On May 31, 2019, at 2:53 PM, Zhang, Junchao  wrote:

Sanjay,
I tried petsc with MPICH and OpenMPI on my Macbook. I inserted 
PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and end 
of KSPSolve and then computed the delta and summed over processes. Then I 
tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
With OpenMPI,
mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler -ts_max_steps 500 
> 128.log
grep -n -v "RSS Delta= 0, Malloc Delta= 0" 128.log
1:RSS Delta= 69632, Malloc Delta= 0
2:RSS Delta= 69632, Malloc Delta= 0
3:RSS Delta= 69632, Malloc Delta= 0
4:RSS Delta= 69632, Malloc Delta= 0
9:RSS Delta=9.25286e+06, Malloc Delta= 0
22:RSS Delta= 49152, Malloc Delta= 0
44:RSS Delta= 20480, Malloc Delta= 0
53:RSS Delta= 49152, Malloc Delta= 0
66:RSS Delta=  4096, Malloc Delta= 0
97:RSS Delta= 16384, Malloc Delta= 0
119:RSS Delta= 20480, Malloc Delta= 0
141:RSS Delta= 53248, Malloc Delta= 0
176:RSS Delta= 16384, Malloc Delta= 0
308:RSS Delta= 16384, Malloc Delta= 0
352:RSS Delta= 16384, Malloc Delta= 0
550:RSS Delta= 16384, Malloc Delta= 0
572:RSS Delta= 16384, Malloc Delta= 0
669:RSS Delta= 40960, Malloc Delta= 0
924:RSS Delta= 32768, Malloc Delta= 0
1694:RSS Delta= 20480, Malloc Delta= 0
2099:RSS Delta= 16384, Malloc Delta= 0
2244:RSS Delta= 20480, Malloc Delta= 0
3001:RSS Delta= 16384, Malloc Delta= 0
5883:RSS Delta= 16384, Malloc Delta= 0

If I increased the grid
mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 500 
-malloc_test >512.log
grep -n -v "RSS Delta= 0, Malloc Delta= 0" 512.log
1:RSS Delta=1.05267e+06, Malloc Delta= 0
2:RSS Delta=1.05267e+06, Malloc Delta= 0
3:RSS Delta=1.05267e+06, Malloc Delta= 0
4:RSS Delta=1.05267e+06, Malloc Delta= 0
13:RSS Delta=1.24932e+08, Malloc Delta= 0

So we did see RSS increase in 4k-page sizes after KSPSolve. As long as no 
memory leaks, why do you care about it? Is it because you run out of memory?

On Thu, May 30, 2019 at 1:59 PM Smith, Barry F.  wrote:

Thanks for the update. So the current conclusions are that using the 
Waitall in your code

1) solves the memory issue with OpenMPI in your code

2) does not solve the memory issue with PETSc KSPSolve

3) MPICH has memory issues both for your code and PETSc KSPSolve (despite) the 
wait all fix?

If you literately just comment out the call to KSPSolve() with OpenMPI is there 
no growth in memory usage?


Both 2 and 3 are concerning, indicate possible memory leak bugs in MPICH and 
not freeing all MPI resources in KSPSolve()

Junchao, can you please investigate 2 and 3 with, for example, a TS example 
that uses the linear solver (like with -ts_type beuler)? Thanks


   Barry




On May 30, 2019, at 1:47 PM, Sanjay Govindjee  wrote:

Lawrence,
Thanks for taking a look!  This is what I had been wondering about -- my 
knowledge of MPI is pretty minimal and
this origins of the routine were from a programmer we hired a decade+ back from 
NERSC.  I'll have to look into
VecScatter.  It will be great to dispense with our roll-your-own routines (we 
even have our own reduceALL scattered around the code).

Interestingly, the MPI_WaitALL has solved the problem when using OpenMPI but it 
still persists with MPICH.  Graphs attached.
I'm going to run with openmpi for now (but I guess I really still need to

Re: [petsc-users] Memory growth issue

2019-06-01 Thread Zhang, Junchao via petsc-users


On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee 
<s...@berkeley.edu> wrote:
Barry,

If you look at the graphs I generated (on my Mac),  you will see that
OpenMPI and MPICH have very different values (along with the fact that
MPICH does not seem to adhere
to the standard (for releasing MPI_ISend resources following and MPI_Wait).

-sanjay
PS: I agree with Barry's assessment; this is really not that acceptable.

I also agree. I am doing various experiments to find out why.

On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>Junchao,
>
>   This is insane. Either the OpenMPI library or something in the OS 
> underneath related to sockets and interprocess communication is grabbing 
> additional space for each round of MPI communication!  Does MPICH have the 
> same values or different values than OpenMP? When you run on Linux do you get 
> the same values as Apple or different. --- Same values seem to indicate the 
> issue is inside OpenMPI/MPICH different values indicates problem is more 
> likely at the OS level. Does this happen only with the default VecScatter 
> that uses blocking MPI, what happens with PetscSF under Vec? Is it somehow 
> related to PETSc's use of nonblocking sends and receives? One could 
> presumably use valgrind to see exactly what lines in what code are causing 
> these increases. I don't think we can just shrug and say this is the way it 
> is, we need to track down and understand the cause (and if possible fix).
>
>Barry
>
>
>> On May 31, 2019, at 2:53 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>
>> Sanjay,
>> I tried petsc with MPICH and OpenMPI on my Macbook. I inserted 
>> PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and 
>> end of KSPSolve and then computed the delta and summed over processes. Then 
>> I tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
>> With OpenMPI,
>> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler 
>> -ts_max_steps 500 > 128.log
>> grep -n -v "RSS Delta= 0, Malloc Delta= 0" 128.log
>> 1:RSS Delta= 69632, Malloc Delta= 0
>> 2:RSS Delta= 69632, Malloc Delta= 0
>> 3:RSS Delta= 69632, Malloc Delta= 0
>> 4:RSS Delta= 69632, Malloc Delta= 0
>> 9:RSS Delta=9.25286e+06, Malloc Delta= 0
>> 22:RSS Delta= 49152, Malloc Delta= 0
>> 44:RSS Delta= 20480, Malloc Delta= 0
>> 53:RSS Delta= 49152, Malloc Delta= 0
>> 66:RSS Delta=  4096, Malloc Delta= 0
>> 97:RSS Delta= 16384, Malloc Delta= 0
>> 119:RSS Delta= 20480, Malloc Delta= 0
>> 141:RSS Delta= 53248, Malloc Delta= 0
>> 176:RSS Delta= 16384, Malloc Delta= 0
>> 308:RSS Delta= 16384, Malloc Delta= 0
>> 352:RSS Delta= 16384, Malloc Delta= 0
>> 550:RSS Delta= 16384, Malloc Delta= 0
>> 572:RSS Delta= 16384, Malloc Delta= 0
>> 669:RSS Delta= 40960, Malloc Delta= 0
>> 924:RSS Delta= 32768, Malloc Delta= 0
>> 1694:RSS Delta= 20480, Malloc Delta= 0
>> 2099:RSS Delta= 16384, Malloc Delta= 0
>> 2244:RSS Delta= 20480, Malloc Delta= 0
>> 3001:RSS Delta= 16384, Malloc Delta= 0
>> 5883:RSS Delta= 16384, Malloc Delta= 0
>>
>> If I increased the grid
>> mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler 
>> -ts_max_steps 500 -malloc_test >512.log
>> grep -n -v "RSS Delta= 0, Malloc Delta= 0" 512.log
>> 1:RSS Delta=1.05267e+06, Malloc Delta= 0
>> 2:RSS Delta=1.05267e+06, Malloc Delta= 0
>> 3:RSS Delta=1.05267e+06, Malloc Delta= 0
>> 4:RSS Delta=1.05267e+06, Malloc Delta= 0
>> 13:RSS Delta=1.24932e+08, Malloc Delta= 0
>>
>> So we did see RSS increase in 4k-page sizes after KSPSolve. As long as no 
>> memory leaks, why do you care about it? Is it because you run out of memory?
>>
>> On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>
>> Thanks for the update. So the current conclusions are that using the 
>> Waitall in your code
>>
>> 1) solves the memory issue with OpenMPI in your code
>>
>> 2) does not solve the memory issue with PETSc KSPSolve
>>
>> 3) MPICH has memory issues both for your code and PETSc KSPSolve (despite) 
>> the wait all fix?
>>
>> If you literately just comment out the call to KSPSolve() with OpenMPI is 
>> there no growth in memory usage?
>>
>>
>> Both 2 and 3 are concerning, indicate possible memory leak bugs in MPICH and 
>> not freeing all MPI resources in KSPSolve()
>>
>> Junchao, can you please investigate 2 and 3 with, for example, a TS example 
>> that uses the linear solver (like with -ts_type beuler)? Thanks
>>
>>
>>Barry
>>
>>
>>
>>> On May 30, 2019, at 1:47 PM, Sanjay Govindjee <s...@berkeley.edu> wrote:
>>>
>>> Lawrence,
>>> Thanks for taking a look!  This is what I had been wondering about -- my

Re: [petsc-users] Memory growth issue

2019-06-03 Thread Zhang, Junchao via petsc-users
Sanjay & Barry,
  Sorry, I made a mistake when I said I could reproduce Sanjay's experiments. I 
found that 1) to correctly use PetscMallocGetCurrentUsage() when petsc is 
configured without debugging, I have to add -malloc when running the program, 
and 2) I have to instrument the code outside of KSPSolve(); in my case, that is 
in SNESSolve_NEWTONLS. In the old experiments, I did it inside KSPSolve. Since 
KSPSolve can recursively call KSPSolve, the old results were misleading.
 With these fixes, I measured differences of RSS and Petsc malloc before/after 
KSPSolve. I did experiments on MacBook using 
src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands like 
mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler -ts_max_steps 500 
-malloc.
 I find if the grid size is small, I can see a non-zero RSS-delta randomly, 
either with one mpi rank or multiple ranks, with MPICH or OpenMPI. If I 
increase grid sizes, e.g., -da_grid_x 256 -da_grid_y 256, I only see non-zero 
RSS-delta randomly at the first few iterations (with MPICH or OpenMPI). When 
the computer workload is high by simultaneously running ex5-openmpi and 
ex5-mpich, the MPICH one pops up much more non-zero RSS-delta. But "Malloc 
Delta" behavior is stable across all runs. There is only one nonzero malloc 
delta value in the first KSPSolve call. All remaining are zero. Something like 
this:
mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler 
-ts_max_steps 500 -malloc
RSS Delta=   32489472, Malloc Delta=   26290304, RSS End=  136114176
RSS Delta=  32768, Malloc Delta=  0, RSS End=  138510336
RSS Delta=  0, Malloc Delta=  0, RSS End=  138522624
RSS Delta=  0, Malloc Delta=  0, RSS End=  138539008
So I think I can conclude there is no unfreed memory allocated by PETSc in 
KSPSolve().  Has MPICH allocated unfreed memory in KSPSolve? That is possible, 
and I am trying to find a way, like PetscMallocGetCurrentUsage(), to measure 
that. Also, I think the RSS delta is not a good way to measure memory 
allocation. It is dynamic and depends on the state of the computer (swap, shared 
libraries loaded, etc.) when running the code. We should focus on malloc 
instead.  If there were a valgrind tool, like the performance profiling tools, 
that could let users measure memory allocated but not freed in a user-specified 
code segment, that would be very helpful in this case. But I have not found one.
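
(For reference, a minimal sketch of this kind of instrumentation around
KSPSolve, with made-up names, summing the per-rank deltas with MPI_Allreduce:)

#include <petscksp.h>

/* Measure the RSS and PETSc-malloc deltas across one KSPSolve; illustrative only. */
PetscErrorCode SolveAndMeasure(KSP ksp, Vec b, Vec x)
{
  PetscErrorCode ierr;
  PetscLogDouble rss0, rss1, mal0, mal1, local[2], total[2];

  PetscFunctionBegin;
  ierr = PetscMemoryGetCurrentUsage(&rss0);CHKERRQ(ierr);
  ierr = PetscMallocGetCurrentUsage(&mal0);CHKERRQ(ierr);

  ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);

  ierr = PetscMemoryGetCurrentUsage(&rss1);CHKERRQ(ierr);
  ierr = PetscMallocGetCurrentUsage(&mal1);CHKERRQ(ierr);

  local[0] = rss1 - rss0;  /* resident set size delta on this rank */
  local[1] = mal1 - mal0;  /* PETSc malloc delta on this rank      */
  ierr = MPI_Allreduce(local, total, 2, MPI_DOUBLE, MPI_SUM, PETSC_COMM_WORLD);CHKERRQ(ierr);
  ierr = PetscPrintf(PETSC_COMM_WORLD, "RSS Delta=%g, Malloc Delta=%g\n",
                     (double)total[0], (double)total[1]);CHKERRQ(ierr);
  PetscFunctionReturn(0);
}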

Sanjay, did you say that currently you can run with OpenMPI without running out 
of memory, but with MPICH you run out of memory?  Is it feasible to share your 
code so that I can test with it? Thanks.

--Junchao Zhang

On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee 
<s...@berkeley.edu> wrote:
Barry,

If you look at the graphs I generated (on my Mac),  you will see that
OpenMPI and MPICH have very different values (along with the fact that
MPICH does not seem to adhere
to the standard (for releasing MPI_ISend resources following and MPI_Wait).

-sanjay

PS: I agree with Barry's assessment; this is really not that acceptable.

On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>Junchao,
>
>   This is insane. Either the OpenMPI library or something in the OS 
> underneath related to sockets and interprocess communication is grabbing 
> additional space for each round of MPI communication!  Does MPICH have the 
> same values or different values than OpenMP? When you run on Linux do you get 
> the same values as Apple or different. --- Same values seem to indicate the 
> issue is inside OpenMPI/MPICH different values indicates problem is more 
> likely at the OS level. Does this happen only with the default VecScatter 
> that uses blocking MPI, what happens with PetscSF under Vec? Is it somehow 
> related to PETSc's use of nonblocking sends and receives? One could 
> presumably use valgrind to see exactly what lines in what code are causing 
> these increases. I don't think we can just shrug and say this is the way it 
> is, we need to track down and understand the cause (and if possible fix).
>
>Barry
>
>
>> On May 31, 2019, at 2:53 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>
>> Sanjay,
>> I tried petsc with MPICH and OpenMPI on my Macbook. I inserted 
>> PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and 
>> end of KSPSolve and then computed the delta and summed over processes. Then 
>> I tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
>> With OpenMPI,
>> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler 
>> -ts_max_steps 500 > 128.log
>> grep -n -v "RSS Delta= 0, Malloc Delta= 0" 128.log
>> 1:RSS Delta= 69632, Malloc Delta= 0
>> 2:RSS Delta= 69632, Malloc Delta= 0
>> 3:RSS Delta= 69632, Malloc Delta= 0
>> 4:RSS Delta= 69632, Malloc Delta= 0
>> 9:RSS Delta=9.25286e+06, Malloc Delta= 0
>> 22:RSS Delta= 49152, Malloc De

Re: [petsc-users] Memory growth issue

2019-06-03 Thread Zhang, Junchao via petsc-users


On Mon, Jun 3, 2019 at 5:23 PM Stefano Zampini 
<stefano.zamp...@gmail.com> wrote:


On Jun 4, 2019, at 1:17 AM, Zhang, Junchao via petsc-users 
<petsc-users@mcs.anl.gov> wrote:

Sanjay & Barry,
  Sorry, I made a mistake that I said I could reproduced Sanjay's experiments. 
I found 1) to correctly use PetscMallocGetCurrentUsage() when petsc is 
configured without debugging, I have to add -malloc to run the program. 2) I 
have to instrument the code outside of KSPSolve(). In my case, it is in 
SNESSolve_NEWTONLS. In old experiments, I did it inside KSPSolve. Since 
KSPSolve can recursively call KSPSolve, the old results were misleading.
 With these fixes, I measured differences of RSS and Petsc malloc before/after 
KSPSolve. I did experiments on MacBook using 
src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands like 
mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler -ts_max_steps 500 
-malloc.
 I find if the grid size is small, I can see a non-zero RSS-delta randomly, 
either with one mpi rank or multiple ranks, with MPICH or OpenMPI. If I 
increase grid sizes, e.g., -da_grid_x 256 -da_grid_y 256, I only see non-zero 
RSS-delta randomly at the first few iterations (with MPICH or OpenMPI). When 
the computer workload is high by simultaneously running ex5-openmpi and 
ex5-mpich, the MPICH one pops up much more non-zero RSS-delta. But "Malloc 
Delta" behavior is stable across all runs. There is only one nonzero malloc 
delta value in the first KSPSolve call. All remaining are zero. Something like 
this:
mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler 
-ts_max_steps 500 -malloc
RSS Delta=   32489472, Malloc Delta=   26290304, RSS End=  136114176
RSS Delta=  32768, Malloc Delta=  0, RSS End=  138510336
RSS Delta=  0, Malloc Delta=  0, RSS End=  138522624
RSS Delta=  0, Malloc Delta=  0, RSS End=  138539008
So I think I can conclude there is no unfreed memory in KSPSolve() allocated by 
PETSc.  Has MPICH allocated unfreed memory in KSPSolve? That is possible and I 
am trying to find a way like PetscMallocGetCurrentUsage() to measure that. 
Also, I think RSS delta is not a good way to measure memory allocation. It is 
dynamic and depends on state of the computer (swap, shared libraries loaded 
etc) when running the code. We should focus on malloc instead.  If there was a 
valgrind tool, like performance profiling tools,  that can let users measure 
memory allocated but not freed in a user specified code segment, that would be 
very helpful in this case. But I have not found one.


Junchao

Have you ever tried Massif? http://valgrind.org/docs/manual/ms-manual.html

No. I came across it but am not familiar with it.  I did not find APIs to call 
to get the current memory usage. I will look at it further. Thanks.


Sanjay, did you say currently you could run with OpenMPI without out of memory, 
but with MPICH, you ran out of memory?  Is it feasible to share your code so 
that I can test with? Thanks.

--Junchao Zhang

On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee 
<s...@berkeley.edu> wrote:
Barry,

If you look at the graphs I generated (on my Mac),  you will see that
OpenMPI and MPICH have very different values (along with the fact that
MPICH does not seem to adhere
to the standard (for releasing MPI_ISend resources following and MPI_Wait).

-sanjay

PS: I agree with Barry's assessment; this is really not that acceptable.

On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>Junchao,
>
>   This is insane. Either the OpenMPI library or something in the OS 
> underneath related to sockets and interprocess communication is grabbing 
> additional space for each round of MPI communication!  Does MPICH have the 
> same values or different values than OpenMP? When you run on Linux do you get 
> the same values as Apple or different. --- Same values seem to indicate the 
> issue is inside OpenMPI/MPICH different values indicates problem is more 
> likely at the OS level. Does this happen only with the default VecScatter 
> that uses blocking MPI, what happens with PetscSF under Vec? Is it somehow 
> related to PETSc's use of nonblocking sends and receives? One could 
> presumably use valgrind to see exactly what lines in what code are causing 
> these increases. I don't think we can just shrug and say this is the way it 
> is, we need to track down and understand the cause (and if possible fix).
>
>Barry
>
>
>> On May 31, 2019, at 2:53 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>
>> Sanjay,
>> I tried petsc with MPICH and OpenMPI on my Macbook. I inserted 
>> PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and 
>> end of KSPSolve and then computed the delta and summed over processes. Then 
>> I tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c
>> With OpenMPI,
>> mpirun -n 4 ./ex5 -da_grid_x 128 -da_

Re: [petsc-users] Memory growth issue

2019-06-03 Thread Matthew Knepley via petsc-users
On Mon, Jun 3, 2019 at 6:56 PM Zhang, Junchao via petsc-users <
petsc-users@mcs.anl.gov> wrote:

> On Mon, Jun 3, 2019 at 5:23 PM Stefano Zampini 
> wrote:
>
>>
>>
>> On Jun 4, 2019, at 1:17 AM, Zhang, Junchao via petsc-users <
>> petsc-users@mcs.anl.gov> wrote:
>>
>> Sanjay & Barry,
>>   Sorry, I made a mistake that I said I could reproduced Sanjay's
>> experiments. I found 1) to correctly use PetscMallocGetCurrentUsage() when
>> petsc is configured without debugging, I have to add -malloc to run the
>> program. 2) I have to instrument the code outside of KSPSolve(). In my
>> case, it is in SNESSolve_NEWTONLS. In old experiments, I did it inside
>> KSPSolve. Since KSPSolve can recursively call KSPSolve, the old results
>> were misleading.
>>  With these fixes, I measured differences of RSS and Petsc malloc
>> before/after KSPSolve. I did experiments on MacBook
>> using src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with
>> commands like mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler
>> -ts_max_steps 500 -malloc.
>>  I find if the grid size is small, I can see a non-zero RSS-delta
>> randomly, either with one mpi rank or multiple ranks, with MPICH or
>> OpenMPI. If I increase grid sizes, e.g., -da_grid_x 256 -da_grid_y 256, I
>> only see non-zero RSS-delta randomly at the first few iterations (with
>> MPICH or OpenMPI). When the computer workload is high by simultaneously
>> running ex5-openmpi and ex5-mpich, the MPICH one pops up much more non-zero
>> RSS-delta. But "Malloc Delta" behavior is stable across all runs. There is
>> only one nonzero malloc delta value in the first KSPSolve call. All
>> remaining are zero. Something like this:
>>
>> mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler
>> -ts_max_steps 500 -malloc
>> RSS Delta=   32489472, Malloc Delta=   26290304, RSS End=  136114176
>> RSS Delta=      32768, Malloc Delta=          0, RSS End=  138510336
>> RSS Delta=          0, Malloc Delta=          0, RSS End=  138522624
>> RSS Delta=          0, Malloc Delta=          0, RSS End=  138539008
>>
>> So I think I can conclude there is no unfreed memory in KSPSolve()
>> allocated by PETSc.  Has MPICH allocated unfreed memory in KSPSolve? That
>> is possible and I am trying to find a way like PetscMallocGetCurrentUsage()
>> to measure that. Also, I think RSS delta is not a good way to measure
>> memory allocation. It is dynamic and depends on state of the computer
>> (swap, shared libraries loaded etc) when running the code. We should focus
>> on malloc instead.  If there was a valgrind tool, like performance
>> profiling tools,  that can let users measure memory allocated but not freed
>> in a user specified code segment, that would be very helpful in this case.
>> But I have not found one.
>>
>>
>> Junchao
>>
>> Have you ever tried Massif?
>> http://valgrind.org/docs/manual/ms-manual.html
>>
>
> No. I came across it but not familiar with it.  I did not find APIs to
> call to get current memory usage. Will look at it further. Thanks.
>

This is definitely the correct tool. It intercepts all calls to
malloc()/free() so it can give you the complete picture of allocated
memory at any time. It will draw a line graph of this labeled by the
routine that does each allocation.

   Matt

> Sanjay, did you say currently you could run with OpenMPI without out of
>> memory, but with MPICH, you ran out of memory?  Is it feasible to share
>> your code so that I can test with? Thanks.
>>
>> --Junchao Zhang
>>
>> On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee  wrote:
>>
>>> Barry,
>>>
>>> If you look at the graphs I generated (on my Mac),  you will see that
>>> OpenMPI and MPICH have very different values (along with the fact that
>>> MPICH does not seem to adhere
>>> to the standard (for releasing MPI_ISend resources following and
>>> MPI_Wait).
>>>
>>> -sanjay
>>>
>>> PS: I agree with Barry's assessment; this is really not that acceptable.
>>>
>>> On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>>> >Junchao,
>>> >
>>> >   This is insane. Either the OpenMPI library or something in the
>>> OS underneath related to sockets and interprocess communication is grabbing
>>> additional space for each round of MPI communication!  Does MPICH have the
>>> same values or different values than OpenMP? When you run on Linux do you
>>> get the same values as Apple or different. --- Same values seem to indicate
>>> the issue is inside OpenMPI/MPICH different values indicates problem is
>>> more likely at the OS level. Does this happen only with the default
>>> VecScatter that uses blocking MPI, what happens with PetscSF under Vec? Is
>>> it somehow related to PETSc's use of nonblocking sends and receives? One
>>> could presumably use valgrind to see exactly what lines in what code are
>>> causing these increases. I don't think we can just shrug and say this is
>>> the way it is, we need to track down and understand the cause (and if

Re: [petsc-users] Memory growth issue

2019-06-03 Thread Sanjay Govindjee via petsc-users

Junchao,
  It won't be feasible to share the code, but I will run a similar test 
as you have done (large problem); I will
try with both MPICH and OpenMPI.  I also agree that deltas are not ideal, 
as they do not account for latency in the freeing of memory,
etc.  But I will note that when we have the memory growth issue, the latency 
associated with free( ) appears not to be in play, since the total
memory footprint grows monotonically.

  I'll also have a look at massif.  If you figure out the interface 
and can send me the lines to instrument the code with, that will save me
some time.
-sanjay

On 6/3/19 3:17 PM, Zhang, Junchao wrote:

Sanjay & Barry,
  Sorry, I made a mistake that I said I could reproduced Sanjay's 
experiments. I found 1) to correctly use PetscMallocGetCurrentUsage() 
when petsc is configured without debugging, I have to add -malloc to 
run the program. 2) I have to instrument the code outside of 
KSPSolve(). In my case, it is in SNESSolve_NEWTONLS. In old 
experiments, I did it inside KSPSolve. Since KSPSolve can recursively 
call KSPSolve, the old results were misleading.
 With these fixes, I measured differences of RSS and Petsc malloc 
before/after KSPSolve. I did experiments on MacBook 
using src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c 
with commands like mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 
-ts_type beuler -ts_max_steps 500 -malloc.
 I find if the grid size is small, I can see a non-zero RSS-delta 
randomly, either with one mpi rank or multiple ranks, with MPICH or 
OpenMPI. If I increase grid sizes, e.g., -da_grid_x 256 -da_grid_y 
256, I only see non-zero RSS-delta randomly at the first few 
iterations (with MPICH or OpenMPI). When the computer workload is high 
by simultaneously running ex5-openmpi and ex5-mpich, the MPICH one 
pops up much more non-zero RSS-delta. But "Malloc Delta" behavior is 
stable across all runs. There is only one nonzero malloc delta value 
in the first KSPSolve call. All remaining are zero. Something like this:


mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type
beuler -ts_max_steps 500 -malloc
RSS Delta= 32489472, Malloc Delta=       26290304, RSS End=  136114176
RSS Delta=  32768, Malloc Delta=              0, RSS End=  138510336
RSS Delta=    0, Malloc Delta=              0, RSS End=  138522624
RSS Delta=    0, Malloc Delta=              0, RSS End=  138539008

So I think I can conclude there is no unfreed memory in KSPSolve() 
allocated by PETSc.  Has MPICH allocated unfreed memory in KSPSolve? 
That is possible and I am trying to find a way like 
PetscMallocGetCurrentUsage() to measure that. Also, I think RSS delta 
is not a good way to measure memory allocation. It is dynamic and 
depends on state of the computer (swap, shared libraries loaded etc) 
when running the code. We should focus on malloc instead.  If there 
was a valgrind tool, like performance profiling tools,  that can let 
users measure memory allocated but not freed in a user specified code 
segment, that would be very helpful in this case. But I have not found 
one.


Sanjay, did you say currently you could run with OpenMPI without out 
of memory, but with MPICH, you ran out of memory? Is it feasible to 
share your code so that I can test with? Thanks.


--Junchao Zhang

On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee wrote:


Barry,

If you look at the graphs I generated (on my Mac),  you will see that
OpenMPI and MPICH have very different values (along with the fact
that
MPICH does not seem to adhere
to the standard (for releasing MPI_ISend resources following and
MPI_Wait).

-sanjay

PS: I agree with Barry's assessment; this is really not that
acceptable.

On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>    Junchao,
>
>       This is insane. Either the OpenMPI library or something in
the OS underneath related to sockets and interprocess
communication is grabbing additional space for each round of MPI
communication!  Does MPICH have the same values or different
values than OpenMP? When you run on Linux do you get the same
values as Apple or different. --- Same values seem to indicate the
issue is inside OpenMPI/MPICH different values indicates problem
is more likely at the OS level. Does this happen only with the
default VecScatter that uses blocking MPI, what happens with
PetscSF under Vec? Is it somehow related to PETSc's use of
nonblocking sends and receives? One could presumably use valgrind
to see exactly what lines in what code are causing these
increases. I don't think we can just shrug and say this is the way
it is, we need to track down and understand the cause (and if
possible fix).
>
>    Barry
>
>
>> On May 31, 2019, at 2:53 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>
>> Sanjay,
>> I tried petsc with MPICH and OpenMPI on my Mac

Re: [petsc-users] Memory growth issue

2019-06-05 Thread Sanjay GOVINDJEE via petsc-users
PetscMemoryGetCurrentUsage( ) is just a cover for getrusage( ), so the use of 
the function is unrelated to Petsc.  The only difference here is mpich versus 
openmpi.
Notwithstanding, I can make a plot of the sum of the deltas around kspsolve.

Sent from my iPad

> On Jun 5, 2019, at 7:22 AM, Zhang, Junchao  wrote:
> 
> Sanjay, 
>   It sounds like the memory is allocated by PETSc, since you call 
> PetscMemoryGetCurrentUsage().  Make sure you use the latest PETSc version. 
> You can also do an experiment that puts two PetscMemoryGetCurrentUsage() 
> before & after KSPSolve(), calculates the delta, and then sums over 
> processes, so we know whether the memory is allocated in KSPSolve().
> 
> --Junchao Zhang
> 
> 
>> On Wed, Jun 5, 2019 at 1:19 AM Sanjay Govindjee  wrote:
>> Junchao,
>> 
>>   Attached is a graph of total RSS from my Mac using openmpi and mpich 
>> (installed with --download-openmpi and --download-mpich).
>> 
>>   The difference is pretty stark!  The WaitAll( ) in my part of the code 
>> fixed the run away memory 
>> problem using openmpi but definitely not with mpich.
>> 
>>   Tomorrow I hope to get my linux box set up; unfortunately it needs an OS 
>> update :(
>> Then I can try to run there and reproduce the same (or find out it is a Mac 
>> quirk, though the
>> reason I started looking at this was that a user on an HPC system pointed it 
>> out to me).
>> 
>> -sanjay
>> 
>> PS: To generate the data, all I did was place a call to 
>> PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by an 
>> MPI_AllReduce( ) to sum across the job (4 processors).
>>> On 6/4/19 4:27 PM, Zhang, Junchao wrote:
>>> Hi, Sanjay,
>>>   I managed to use Valgrind massif + MPICH master + PETSc master. I ran ex5 
>>> 500 time steps with "mpirun -n 4 valgrind --tool=massif --max-snapshots=200 
>>> --detailed-freq=1 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler 
>>> -ts_max_steps 500 -malloc"
>>>   I visualized the output with massif-visualizer. From the attached 
>>> picture, we can see the total heap size keeps constant most of the time and 
>>> is NOT monotonically increasing.  We can also see MPI only allocated memory 
>>> at initialization time and kept it. So it is unlikely that MPICH keeps 
>>> allocating memory in each KSPSolve call.
>>>   From graphs you sent, I can only see RSS is randomly increased after 
>>> KSPSolve, but that does not mean heap size keeps increasing.  I recommend 
>>> you also profile your code with valgrind massif and visualize it. I failed 
>>> to install massif-visualizer on MacBook and CentOS. But I easily got it 
>>> installed on Ubuntu.
>>>   I want you to confirm that with the MPI_Waitall fix, you still run out of 
>>> memory with MPICH (but not OpenMPI).  If needed, I can hack MPICH to get 
>>> its current memory usage so that we can calculate its difference after each 
>>> KSPSolve call.
>>>  
>>> 
>>> 
>>> 
>>> --Junchao Zhang
>>> 
>>> 
 On Mon, Jun 3, 2019 at 6:36 PM Sanjay Govindjee  wrote:
 Junchao,
   I won't be feasible to share the code but I will run a similar test as 
 you have done (large problem); I will
 try with both MPICH and OpenMPI.  I also agree that deltas are not ideal 
 as there they do not account for latency in the freeing of memory
 etc.  But I will note when we have the memory growth issue latency 
 associated with free( ) appears not to be in play since the total
 memory footprint grows monotonically.
 
   I'll also have a look at massif.  If you figure out the interface, and 
 can send me the lines to instrument the code with that will save me
 some time.
 -sanjay
> On 6/3/19 3:17 PM, Zhang, Junchao wrote:
> Sanjay & Barry,
>   Sorry, I made a mistake that I said I could reproduced Sanjay's 
> experiments. I found 1) to correctly use PetscMallocGetCurrentUsage() 
> when petsc is configured without debugging, I have to add -malloc to run 
> the program. 2) I have to instrument the code outside of KSPSolve(). In 
> my case, it is in SNESSolve_NEWTONLS. In old experiments, I did it inside 
> KSPSolve. Since KSPSolve can recursively call KSPSolve, the old results 
> were misleading.
>  With these fixes, I measured differences of RSS and Petsc malloc 
> before/after KSPSolve. I did experiments on MacBook using 
> src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with 
> commands like mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type 
> beuler -ts_max_steps 500 -malloc.
>  I find if the grid size is small, I can see a non-zero RSS-delta 
> randomly, either with one mpi rank or multiple ranks, with MPICH or 
> OpenMPI. If I increase grid sizes, e.g., -da_grid_x 256 -da_grid_y 256, I 
> only see non-zero RSS-delta randomly at the first few iterations (with 
> MPICH or OpenMPI). When the computer workload is high by simultaneously 
> running ex5-openmpi and ex5-mpich, the MPICH one pops up much 

Re: [petsc-users] Memory growth issue

2019-06-05 Thread Zhang, Junchao via petsc-users
OK, I see. I mistakenly read  PetscMemoryGetCurrentUsage as 
PetscMallocGetCurrentUsage.  You should also do PetscMallocGetCurrentUsage(), 
so that we know whether the increased memory is allocated by PETSc.

On Wed, Jun 5, 2019, 9:58 AM Sanjay GOVINDJEE 
<s...@berkeley.edu> wrote:
PetscMemoryGetCurrentUsage( ) is just a cover for rgetusage( ), so the use of 
the function is unrelated to Petsc.  The only difference here is mpich versus 
openmpi.
Notwithstanding, I can make a plot of the sum of the deltas around kspsolve.

Sent from my iPad

On Jun 5, 2019, at 7:22 AM, Zhang, Junchao 
<jczh...@mcs.anl.gov> wrote:

Sanjay,
  It sounds like the memory is allocated by PETSc, since you call 
PetscMemoryGetCurrentUsage().  Make sure you use the latest PETSc version. You 
can also do an experiment that puts two PetscMemoryGetCurrentUsage() before & 
after KSPSolve(), calculates the delta, and then sums over processes, so we 
know whether the memory is allocated in KSPSolve().

--Junchao Zhang


On Wed, Jun 5, 2019 at 1:19 AM Sanjay Govindjee 
<s...@berkeley.edu> wrote:
Junchao,

  Attached is a graph of total RSS from my Mac using openmpi and mpich 
(installed with --download-openmpi and --download-mpich).

  The difference is pretty stark!  The WaitAll( ) in my part of the code fixed 
the run away memory
problem using openmpi but definitely not with mpich.

  Tomorrow I hope to get my linux box set up; unfortunately it needs an OS 
update :(
Then I can try to run there and reproduce the same (or find out it is a Mac 
quirk, though the
reason I started looking at this was that a use on an HPC system pointed it out 
to me).

-sanjay

PS: To generate the data, all I did was place a call to 
PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by an 
MPI_AllReduce( ) to sum across the job (4 processors).

On 6/4/19 4:27 PM, Zhang, Junchao wrote:
Hi, Sanjay,
  I managed to use Valgrind massif + MPICH master + PETSc master. I ran ex5 500 
time steps with "mpirun -n 4 valgrind --tool=massif --max-snapshots=200 
--detailed-freq=1 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler 
-ts_max_steps 500 -malloc"
  I visualized the output with massif-visualizer. From the attached picture, we 
can see the total heap size keeps constant most of the time and is NOT 
monotonically increasing.  We can also see MPI only allocated memory at 
initialization time and kept it. So it is unlikely that MPICH keeps allocating 
memory in each KSPSolve call.
  From graphs you sent, I can only see RSS is randomly increased after 
KSPSolve, but that does not mean heap size keeps increasing.  I recommend you 
also profile your code with valgrind massif and visualize it. I failed to 
install massif-visualizer on MacBook and CentOS. But I easily got it installed 
on Ubuntu.
  I want you to confirm that with the MPI_Waitall fix, you still run out of 
memory with MPICH (but not OpenMPI).  If needed, I can hack MPICH to get its 
current memory usage so that we can calculate its difference after each 
KSPSolve call.




--Junchao Zhang


On Mon, Jun 3, 2019 at 6:36 PM Sanjay Govindjee 
<s...@berkeley.edu> wrote:
Junchao,
  I won't be feasible to share the code but I will run a similar test as you 
have done (large problem); I will
try with both MPICH and OpenMPI.  I also agree that deltas are not ideal as 
there they do not account for latency in the freeing of memory
etc.  But I will note when we have the memory growth issue latency associated 
with free( ) appears not to be in play since the total
memory footprint grows monotonically.

  I'll also have a look at massif.  If you figure out the interface, and can 
send me the lines to instrument the code with that will save me
some time.
-sanjay
On 6/3/19 3:17 PM, Zhang, Junchao wrote:
Sanjay & Barry,
  Sorry, I made a mistake that I said I could reproduced Sanjay's experiments. 
I found 1) to correctly use PetscMallocGetCurrentUsage() when petsc is 
configured without debugging, I have to add -malloc to run the program. 2) I 
have to instrument the code outside of KSPSolve(). In my case, it is in 
SNESSolve_NEWTONLS. In old experiments, I did it inside KSPSolve. Since 
KSPSolve can recursively call KSPSolve, the old results were misleading.
 With these fixes, I measured differences of RSS and Petsc malloc before/after 
KSPSolve. I did experiments on MacBook using 
src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands like 
mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler -ts_max_steps 500 
-malloc.
 I find if the grid size is small, I can see a non-zero RSS-delta randomly, 
either with one mpi rank or multiple ranks, with MPICH or OpenMPI. If I 
increase grid sizes, e.g., -da_grid_x 256 -da_grid_y 256, I only see non-zero 
RSS-delta randomly at the first few iterations (with MPICH or OpenMPI). When 
the computer workload is high by simultaneously running ex5-openmpi and 
ex5-mpich, the MPICH one pops up much more non-zero RSS-

Re: [petsc-users] Memory growth issue

2019-06-05 Thread Smith, Barry F. via petsc-users


  Are you reusing the same KSP the whole time, just making calls to KSPSolve, 
or are you creating a new KSP object? 

  Do you make any calls to KSPReset()? 

  Are you doing any MPI_Comm_dup()? 

  Are you attaching any attributes to MPI communicators?

   Thanks

> On Jun 5, 2019, at 1:18 AM, Sanjay Govindjee  wrote:
> 
> Junchao,
> 
>   Attached is a graph of total RSS from my Mac using openmpi and mpich 
> (installed with --download-openmpi and --download-mpich).
> 
>   The difference is pretty stark!  The WaitAll( ) in my part of the code 
> fixed the run away memory 
> problem using openmpi but definitely not with mpich.
> 
>   Tomorrow I hope to get my linux box set up; unfortunately it needs an OS 
> update :(
> Then I can try to run there and reproduce the same (or find out it is a Mac 
> quirk, though the
> reason I started looking at this was that a use on an HPC system pointed it 
> out to me).
> 
> -sanjay
> 
> PS: To generate the data, all I did was place a call to 
> PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by an 
> MPI_AllReduce( ) to sum across the job (4 processors).
> 
> On 6/4/19 4:27 PM, Zhang, Junchao wrote:
>> Hi, Sanjay,
>>   I managed to use Valgrind massif + MPICH master + PETSc master. I ran ex5 
>> 500 time steps with "mpirun -n 4 valgrind --tool=massif --max-snapshots=200 
>> --detailed-freq=1 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler 
>> -ts_max_steps 500 -malloc"
>>   I visualized the output with massif-visualizer. From the attached picture, 
>> we can see the total heap size keeps constant most of the time and is NOT 
>> monotonically increasing.  We can also see MPI only allocated memory at 
>> initialization time and kept it. So it is unlikely that MPICH keeps 
>> allocating memory in each KSPSolve call.
>>   From graphs you sent, I can only see RSS is randomly increased after 
>> KSPSolve, but that does not mean heap size keeps increasing.  I recommend 
>> you also profile your code with valgrind massif and visualize it. I failed 
>> to install massif-visualizer on MacBook and CentOS. But I easily got it 
>> installed on Ubuntu.
>>   I want you to confirm that with the MPI_Waitall fix, you still run out of 
>> memory with MPICH (but not OpenMPI).  If needed, I can hack MPICH to get its 
>> current memory usage so that we can calculate its difference after each 
>> KSPSolve call.
>>  
>> 
>> 
>> 
>> --Junchao Zhang
>> 
>> 
>> On Mon, Jun 3, 2019 at 6:36 PM Sanjay Govindjee  wrote:
>> Junchao,
>>   I won't be feasible to share the code but I will run a similar test as you 
>> have done (large problem); I will
>> try with both MPICH and OpenMPI.  I also agree that deltas are not ideal as 
>> there they do not account for latency in the freeing of memory
>> etc.  But I will note when we have the memory growth issue latency 
>> associated with free( ) appears not to be in play since the total
>> memory footprint grows monotonically.
>> 
>>   I'll also have a look at massif.  If you figure out the interface, and can 
>> send me the lines to instrument the code with that will save me
>> some time.
>> -sanjay
>> On 6/3/19 3:17 PM, Zhang, Junchao wrote:
>>> Sanjay & Barry,
>>>   Sorry, I made a mistake that I said I could reproduced Sanjay's 
>>> experiments. I found 1) to correctly use PetscMallocGetCurrentUsage() when 
>>> petsc is configured without debugging, I have to add -malloc to run the 
>>> program. 2) I have to instrument the code outside of KSPSolve(). In my 
>>> case, it is in SNESSolve_NEWTONLS. In old experiments, I did it inside 
>>> KSPSolve. Since KSPSolve can recursively call KSPSolve, the old results 
>>> were misleading.
>>>  With these fixes, I measured differences of RSS and Petsc malloc 
>>> before/after KSPSolve. I did experiments on MacBook using 
>>> src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands 
>>> like mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler 
>>> -ts_max_steps 500 -malloc.
>>>  I find if the grid size is small, I can see a non-zero RSS-delta randomly, 
>>> either with one mpi rank or multiple ranks, with MPICH or OpenMPI. If I 
>>> increase grid sizes, e.g., -da_grid_x 256 -da_grid_y 256, I only see 
>>> non-zero RSS-delta randomly at the first few iterations (with MPICH or 
>>> OpenMPI). When the computer workload is high by simultaneously running 
>>> ex5-openmpi and ex5-mpich, the MPICH one pops up much more non-zero 
>>> RSS-delta. But "Malloc Delta" behavior is stable across all runs. There is 
>>> only one nonzero malloc delta value in the first KSPSolve call. All 
>>> remaining are zero. Something like this:
>>> mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler 
>>> -ts_max_steps 500 -malloc
>>> RSS Delta=   32489472, Malloc Delta=   26290304, RSS End=  
>>> 136114176
>>> RSS Delta=  32768, Malloc Delta=  0, RSS End=  
>>> 138510336
>>> RSS Delta=  0, Malloc Delta=  0, RSS End=   

Re: [petsc-users] Memory growth issue

2019-06-05 Thread Sanjay Govindjee via petsc-users
I found the bug (naturally in my own code).  When I made the MPI_Wait( ) 
changes, I missed one location where this
was needed.   See the attached graphs for openmpi and mpich using 
CG+Jacobi and GMRES+BJacobi.


Interesting that openmpi did not care about this but mpich did. Also 
interesting that the memory was growing so much when the size of the 
data packets going back and forth was just a few hundred bytes.


Thanks for your efforts and patience.

-sanjay


On 6/5/19 2:38 PM, Smith, Barry F. wrote:

   Are you reusing the same KSP the whole time, just making calls to KSPSolve, 
or are you creating a new KSP object?

   Do you make any calls to KSPReset()?

   Are you doing any MPI_Comm_dup()?

   Are you attaching any attributes to MPI communicators?

Thanks


On Jun 5, 2019, at 1:18 AM, Sanjay Govindjee  wrote:

Junchao,

   Attached is a graph of total RSS from my Mac using openmpi and mpich 
(installed with --download-openmpi and --download-mpich).

   The difference is pretty stark!  The WaitAll( ) in my part of the code fixed 
the run away memory
problem using openmpi but definitely not with mpich.

   Tomorrow I hope to get my linux box set up; unfortunately it needs an OS 
update :(
Then I can try to run there and reproduce the same (or find out it is a Mac 
quirk, though the
reason I started looking at this was that a use on an HPC system pointed it out 
to me).

-sanjay

PS: To generate the data, all I did was place a call to 
PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by an 
MPI_AllReduce( ) to sum across the job (4 processors).

On 6/4/19 4:27 PM, Zhang, Junchao wrote:

Hi, Sanjay,
   I managed to use Valgrind massif + MPICH master + PETSc master. I ran ex5 500 time 
steps with "mpirun -n 4 valgrind --tool=massif --max-snapshots=200 --detailed-freq=1 
./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 500 -malloc"
   I visualized the output with massif-visualizer. From the attached picture, 
we can see the total heap size keeps constant most of the time and is NOT 
monotonically increasing.  We can also see MPI only allocated memory at 
initialization time and kept it. So it is unlikely that MPICH keeps allocating 
memory in each KSPSolve call.
   From graphs you sent, I can only see RSS is randomly increased after 
KSPSolve, but that does not mean heap size keeps increasing.  I recommend you 
also profile your code with valgrind massif and visualize it. I failed to 
install massif-visualizer on MacBook and CentOS. But I easily got it installed 
on Ubuntu.
   I want you to confirm that with the MPI_Waitall fix, you still run out of 
memory with MPICH (but not OpenMPI).  If needed, I can hack MPICH to get its 
current memory usage so that we can calculate its difference after each 
KSPSolve call.
  




--Junchao Zhang


On Mon, Jun 3, 2019 at 6:36 PM Sanjay Govindjee  wrote:
Junchao,
   I won't be feasible to share the code but I will run a similar test as you 
have done (large problem); I will
try with both MPICH and OpenMPI.  I also agree that deltas are not ideal as 
there they do not account for latency in the freeing of memory
etc.  But I will note when we have the memory growth issue latency associated 
with free( ) appears not to be in play since the total
memory footprint grows monotonically.

   I'll also have a look at massif.  If you figure out the interface, and can 
send me the lines to instrument the code with that will save me
some time.
-sanjay
On 6/3/19 3:17 PM, Zhang, Junchao wrote:

Sanjay & Barry,
   Sorry, I made a mistake that I said I could reproduced Sanjay's experiments. 
I found 1) to correctly use PetscMallocGetCurrentUsage() when petsc is 
configured without debugging, I have to add -malloc to run the program. 2) I 
have to instrument the code outside of KSPSolve(). In my case, it is in 
SNESSolve_NEWTONLS. In old experiments, I did it inside KSPSolve. Since 
KSPSolve can recursively call KSPSolve, the old results were misleading.
  With these fixes, I measured differences of RSS and Petsc malloc before/after 
KSPSolve. I did experiments on MacBook using 
src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands like 
mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler -ts_max_steps 500 
-malloc.
  I find if the grid size is small, I can see a non-zero RSS-delta randomly, either with 
one mpi rank or multiple ranks, with MPICH or OpenMPI. If I increase grid sizes, e.g., 
-da_grid_x 256 -da_grid_y 256, I only see non-zero RSS-delta randomly at the first few 
iterations (with MPICH or OpenMPI). When the computer workload is high by simultaneously 
running ex5-openmpi and ex5-mpich, the MPICH one pops up much more non-zero RSS-delta. 
But "Malloc Delta" behavior is stable across all runs. There is only one 
nonzero malloc delta value in the first KSPSolve call. All remaining are zero. Something 
like this:
mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler 
-ts_max_steps 500 -m

Re: [petsc-users] Memory growth issue

2019-06-05 Thread Smith, Barry F. via petsc-users


  Thanks for letting us know, now we can rest easy.



> On Jun 5, 2019, at 1:00 PM, Zhang, Junchao  wrote:
> 
> OK, I see. I mistakenly read  PetscMemoryGetCurrentUsage as 
> PetscMallocGetCurrentUsage.  You should also do PetscMallocGetCurrentUsage(), 
> so that we know whether the increased memory is allocated by PETSc.
> 
> On Wed, Jun 5, 2019, 9:58 AM Sanjay GOVINDJEE  wrote:
> PetscMemoryGetCurrentUsage( ) is just a cover for getrusage( ), so the use of 
> the function is unrelated to PETSc.  The only difference here is mpich versus 
> openmpi.
> Notwithstanding, I can make a plot of the sum of the deltas around KSPSolve.
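(For reference, the getrusage( ) query mentioned above reduces to roughly the 
following standalone sketch; the exact mechanism PETSc uses differs by OS. 
Note that ru_maxrss is a high-water mark rather than the current resident 
size, and is reported in kilobytes on Linux but in bytes on macOS.)

#include <stdio.h>
#include <sys/resource.h>

int main(void)
{
  struct rusage ru;
  /* RUSAGE_SELF: resource usage of the calling process */
  if (getrusage(RUSAGE_SELF,&ru)) { perror("getrusage"); return 1; }
  printf("max resident set size: %ld\n",(long)ru.ru_maxrss);
  return 0;
}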
> 
> Sent from my iPad
> 
> On Jun 5, 2019, at 7:22 AM, Zhang, Junchao  wrote:
> 
>> Sanjay, 
>>   It sounds like the memory is allocated by PETSc, since you call 
>> PetscMemoryGetCurrentUsage().  Make sure you use the latest PETSc version. 
>> You can also do an experiment that puts two PetscMemoryGetCurrentUsage() 
>> before & after KSPSolve(), calculates the delta, and then sums over 
>> processes, so we know whether the memory is allocated in KSPSolve().
>> 
>> --Junchao Zhang
>> 
>> 
>> On Wed, Jun 5, 2019 at 1:19 AM Sanjay Govindjee  wrote:
>> Junchao,
>> 
>>   Attached is a graph of total RSS from my Mac using openmpi and mpich 
>> (installed with --download-openmpi and --download-mpich).
>> 
>>   The difference is pretty stark!  The WaitAll( ) in my part of the code 
>> fixed the runaway memory problem using openmpi but definitely not with 
>> mpich.
>> 
>>   Tomorrow I hope to get my linux box set up; unfortunately it needs an OS 
>> update :(
>> Then I can try to run there and reproduce the same (or find out it is a Mac 
>> quirk, though the
>> reason I started looking at this was that a user on an HPC system pointed it 
>> out to me).
>> 
>> -sanjay
>> 
>> PS: To generate the data, all I did was place a call to 
>> PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by an 
>> MPI_AllReduce( ) to sum across the job (4 processors).

Re: [petsc-users] Memory growth issue

2019-06-05 Thread Zhang, Junchao via petsc-users
Sanjay,
   You have one more reason to use VecScatter, which is heavily used and 
well-tested.
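A minimal sketch of what that looks like (illustrative only; the vector sizes 
and ghost indices below are made up, not taken from the application): each 
rank pulls a couple of entries owned by its neighbor into a local work vector. 
The scatter object can be created once and reused every step, with no requests 
to track by hand.

#include <petscvec.h>

int main(int argc,char **argv)
{
  Vec            x,xl;
  IS             from,to;
  VecScatter     scat;
  PetscInt       i,n=64,nghost=2,ghosts[2];
  PetscMPIInt    rank,size;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc,&argv,NULL,NULL);if (ierr) return ierr;
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD,&rank);CHKERRQ(ierr);
  ierr = MPI_Comm_size(PETSC_COMM_WORLD,&size);CHKERRQ(ierr);

  /* a parallel vector with n entries per rank, filled with the owner's rank */
  ierr = VecCreateMPI(PETSC_COMM_WORLD,n,PETSC_DETERMINE,&x);CHKERRQ(ierr);
  for (i=0; i<n; i++) {ierr = VecSetValue(x,rank*n+i,(PetscScalar)rank,INSERT_VALUES);CHKERRQ(ierr);}
  ierr = VecAssemblyBegin(x);CHKERRQ(ierr);
  ierr = VecAssemblyEnd(x);CHKERRQ(ierr);

  /* each rank wants the first two entries owned by its right neighbor */
  ghosts[0] = ((rank+1)%size)*n;
  ghosts[1] = ((rank+1)%size)*n + 1;
  ierr = VecCreateSeq(PETSC_COMM_SELF,nghost,&xl);CHKERRQ(ierr);
  ierr = ISCreateGeneral(PETSC_COMM_SELF,nghost,ghosts,PETSC_COPY_VALUES,&from);CHKERRQ(ierr);
  ierr = ISCreateStride(PETSC_COMM_SELF,nghost,0,1,&to);CHKERRQ(ierr);
  ierr = VecScatterCreate(x,from,xl,to,&scat);CHKERRQ(ierr);

  /* the communication itself; reuse scat every time step */
  ierr = VecScatterBegin(scat,x,xl,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(scat,x,xl,INSERT_VALUES,SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecView(xl,PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);

  ierr = VecScatterDestroy(&scat);CHKERRQ(ierr);
  ierr = ISDestroy(&from);CHKERRQ(ierr);
  ierr = ISDestroy(&to);CHKERRQ(ierr);
  ierr = VecDestroy(&xl);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}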
--Junchao Zhang


On Wed, Jun 5, 2019 at 5:47 PM Sanjay Govindjee  wrote:
I found the bug (naturally in my own code).  When I made the MPI_Wait( )
changes, I missed one location where this
was needed.   See the attached graphs for openmpi and mpich using
CG+Jacobi and GMRES+BJacobi.

Interesting that openmpi did not care about this but mpich did.  Also
interesting that the memory was growing so much when the data packets
going back and forth were just a few hundred bytes.
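(For anyone following along, the pattern in question is the usual nonblocking 
neighbor exchange; the sketch below is illustrative only, not the actual 
application exchange code, with made-up buffer sizes and a simple ring of 
neighbors. Dropping the MPI_Waitall leaves every request, and whatever the MPI 
library allocated for it, unreclaimed, which is the kind of per-call growth 
seen here.)

#include <mpi.h>
#include <stdio.h>

int main(int argc,char **argv)
{
  int         rank,size,right,left,step,i;
  double      sendbuf[100],recvbuf[100];
  MPI_Request req[2];

  MPI_Init(&argc,&argv);
  MPI_Comm_rank(MPI_COMM_WORLD,&rank);
  MPI_Comm_size(MPI_COMM_WORLD,&size);
  right = (rank+1)%size;
  left  = (rank-1+size)%size;

  for (step=0; step<1000; step++) {
    for (i=0; i<100; i++) sendbuf[i] = rank + step;
    MPI_Irecv(recvbuf,100,MPI_DOUBLE,left, 0,MPI_COMM_WORLD,&req[0]);
    MPI_Isend(sendbuf,100,MPI_DOUBLE,right,0,MPI_COMM_WORLD,&req[1]);
    /* without this wait the requests are never completed and freed */
    MPI_Waitall(2,req,MPI_STATUSES_IGNORE);
  }
  if (!rank) printf("done\n");
  MPI_Finalize();
  return 0;
}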

Thanks for your efforts and patience.

-sanjay


On 6/5/19 2:38 PM, Smith, Barry F. wrote:
>Are you reusing the same KSP the whole time, just making calls to 
> KSPSolve, or are you creating a new KSP object?
>
>Do you make any calls to KSPReset()?
>
>Are you doing any MPI_Comm_dup()?
>
>Are you attaching any attributes to MPI communicators?
>
> Thanks
>
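(For reference, the reuse pattern behind the first of those questions looks 
roughly like the sketch below; this is an illustration only, with made-up 
function and argument names, not the application's actual driver. The point is 
that the KSP is created and destroyed once, outside the load-step loop.)

#include <petscksp.h>

PetscErrorCode solve_all_steps(Mat A,Vec b,Vec x,PetscInt nsteps)
{
  KSP            ksp;
  PetscInt       step;
  PetscErrorCode ierr;

  PetscFunctionBeginUser;
  ierr = KSPCreate(PETSC_COMM_WORLD,&ksp);CHKERRQ(ierr);   /* once */
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);
  for (step=0; step<nsteps; step++) {
    /* re-set the operator if the tangent changed this step */
    ierr = KSPSetOperators(ksp,A,A);CHKERRQ(ierr);
    ierr = KSPSolve(ksp,b,x);CHKERRQ(ierr);
  }
  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);                   /* once */
  PetscFunctionReturn(0);
}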