PetscMemoryGetCurrentUsage( ) is just a cover for getrusage( ), so what it reports is process-wide resident memory and is not specific to PETSc. The only difference here is MPICH versus OpenMPI. Notwithstanding, I can make a plot of the sum of the deltas around KSPSolve.

Sent from my iPad
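In code, the measurement being described amounts to something like the following minimal C sketch (the function name and the ksp, b, x objects are placeholders for whatever the application already has; error checking is omitted):

  #include <petscksp.h>

  /* Measure the RSS change across one KSPSolve and sum it over all ranks.
     PetscMemoryGetCurrentUsage() reports resident set size (getrusage() or the
     OS equivalent underneath); PetscLogDouble is a double, hence MPI_DOUBLE. */
  static PetscErrorCode SolveAndReportRSSDelta(KSP ksp, Vec b, Vec x)
  {
    PetscLogDouble before, after, delta, total;

    PetscMemoryGetCurrentUsage(&before);
    KSPSolve(ksp, b, x);
    PetscMemoryGetCurrentUsage(&after);
    delta = after - before;

    MPI_Allreduce(&delta, &total, 1, MPI_DOUBLE, MPI_SUM, PETSC_COMM_WORLD);
    PetscPrintf(PETSC_COMM_WORLD, "RSS delta summed over ranks: %g bytes\n", (double)total);
    return 0;
  }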
> On Jun 5, 2019, at 7:22 AM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>
> Sanjay,
> It sounds like the memory is allocated by PETSc, since you call PetscMemoryGetCurrentUsage(). Make sure you use the latest PETSc version. You can also do an experiment that puts two PetscMemoryGetCurrentUsage() calls before and after KSPSolve(), calculates the delta, and then sums over processes, so we know whether the memory is allocated in KSPSolve().
>
> --Junchao Zhang
>
>> On Wed, Jun 5, 2019 at 1:19 AM Sanjay Govindjee <s...@berkeley.edu> wrote:
>> Junchao,
>>
>> Attached is a graph of total RSS from my Mac using OpenMPI and MPICH (installed with --download-openmpi and --download-mpich).
>>
>> The difference is pretty stark! The WaitAll( ) in my part of the code fixed the runaway memory problem using OpenMPI but definitely not with MPICH.
>>
>> Tomorrow I hope to get my Linux box set up; unfortunately it needs an OS update :( Then I can try to run there and reproduce the same (or find out it is a Mac quirk, though the reason I started looking at this was that a user on an HPC system pointed it out to me).
>>
>> -sanjay
>>
>> PS: To generate the data, all I did was place a call to PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by an MPI_Allreduce( ) to sum across the job (4 processors).
>>
>>> On 6/4/19 4:27 PM, Zhang, Junchao wrote:
>>> Hi, Sanjay,
>>> I managed to use Valgrind massif + MPICH master + PETSc master. I ran ex5 for 500 time steps with "mpirun -n 4 valgrind --tool=massif --max-snapshots=200 --detailed-freq=1 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 500 -malloc" and visualized the output with massif-visualizer. From the attached picture, we can see the total heap size stays constant most of the time and is NOT monotonically increasing. We can also see that MPI only allocated memory at initialization time and kept it. So it is unlikely that MPICH keeps allocating memory in each KSPSolve call.
>>> From the graphs you sent, I can only see that RSS randomly increases after KSPSolve, but that does not mean the heap size keeps increasing. I recommend you also profile your code with valgrind massif and visualize it. I failed to install massif-visualizer on MacBook and CentOS, but I easily got it installed on Ubuntu.
>>> I want you to confirm that with the MPI_Waitall fix, you still run out of memory with MPICH (but not OpenMPI). If needed, I can hack MPICH to get its current memory usage so that we can calculate its difference after each KSPSolve call.
>>>
>>> <massif-ex5.png>
>>>
>>> --Junchao Zhang
>>>
>>>> On Mon, Jun 3, 2019 at 6:36 PM Sanjay Govindjee <s...@berkeley.edu> wrote:
>>>> Junchao,
>>>> It won't be feasible to share the code, but I will run a test similar to the one you have done (large problem); I will try with both MPICH and OpenMPI. I also agree that deltas are not ideal, as they do not account for latency in the freeing of memory, etc. But I will note that when we have the memory growth issue, latency associated with free( ) appears not to be in play, since the total memory footprint grows monotonically.
>>>>
>>>> I'll also have a look at massif. If you figure out the interface and can send me the lines to instrument the code with, that will save me some time.
>>>> -sanjay
>>>>
>>>>> On 6/3/19 3:17 PM, Zhang, Junchao wrote:
>>>>> Sanjay & Barry,
>>>>> Sorry, I made a mistake when I said I could reproduce Sanjay's experiments. I found that 1) to correctly use PetscMallocGetCurrentUsage() when PETSc is configured without debugging, I have to add -malloc when running the program; and 2) I have to instrument the code outside of KSPSolve(). In my case, that is in SNESSolve_NEWTONLS. In the old experiments I did it inside KSPSolve; since KSPSolve can recursively call KSPSolve, the old results were misleading.
>>>>> With these fixes, I measured the differences of RSS and PETSc malloc before/after KSPSolve. I did experiments on a MacBook using src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands like mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler -ts_max_steps 500 -malloc.
>>>>> I find that if the grid size is small, I can see a non-zero RSS delta randomly, either with one MPI rank or multiple ranks, with MPICH or OpenMPI. If I increase the grid size, e.g., -da_grid_x 256 -da_grid_y 256, I only see non-zero RSS deltas randomly in the first few iterations (with MPICH or OpenMPI). When the computer's workload is high from simultaneously running ex5-openmpi and ex5-mpich, the MPICH run pops up many more non-zero RSS deltas. But the "Malloc Delta" behavior is stable across all runs: there is only one nonzero malloc delta value, in the first KSPSolve call; all remaining ones are zero. Something like this:
>>>>> mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler -ts_max_steps 500 -malloc
>>>>> RSS Delta= 32489472, Malloc Delta= 26290304, RSS End= 136114176
>>>>> RSS Delta=    32768, Malloc Delta=        0, RSS End= 138510336
>>>>> RSS Delta=        0, Malloc Delta=        0, RSS End= 138522624
>>>>> RSS Delta=        0, Malloc Delta=        0, RSS End= 138539008
>>>>> So I think I can conclude there is no unfreed memory in KSPSolve() allocated by PETSc. Has MPICH allocated unfreed memory in KSPSolve? That is possible, and I am trying to find a way, like PetscMallocGetCurrentUsage(), to measure that. Also, I think the RSS delta is not a good way to measure memory allocation: it is dynamic and depends on the state of the computer (swap, shared libraries loaded, etc.) when running the code. We should focus on malloc instead. If there were a valgrind tool, like the performance profiling tools, that let users measure memory allocated but not freed in a user-specified code segment, that would be very helpful in this case. But I have not found one.
>>>>>
>>>>> Sanjay, did you say that currently you can run with OpenMPI without running out of memory, but with MPICH you ran out of memory? Is it feasible to share your code so that I can test with it? Thanks.
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>> On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee <s...@berkeley.edu> wrote:
>>>>>> Barry,
>>>>>>
>>>>>> If you look at the graphs I generated (on my Mac), you will see that OpenMPI and MPICH have very different values (along with the fact that MPICH does not seem to adhere to the standard for releasing MPI_Isend resources following an MPI_Wait).
>>>>>>
>>>>>> -sanjay
>>>>>>
>>>>>> PS: I agree with Barry's assessment; this is really not acceptable.
>>>>>>
>>>>>> On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>>>>>> > Junchao,
>>>>>> >
>>>>>> > This is insane.
>>>>>> > Either the OpenMPI library or something in the OS underneath, related to sockets and interprocess communication, is grabbing additional space for each round of MPI communication! Does MPICH have the same values or different values than OpenMPI? When you run on Linux, do you get the same values as on Apple, or different? Same values would indicate the issue is inside OpenMPI/MPICH; different values would indicate the problem is more likely at the OS level.
>>>>>> > Does this happen only with the default VecScatter that uses blocking MPI? What happens with PetscSF under Vec? Is it somehow related to PETSc's use of nonblocking sends and receives? One could presumably use valgrind to see exactly what lines in what code are causing these increases. I don't think we can just shrug and say this is the way it is; we need to track down and understand the cause (and if possible fix it).
>>>>>> >
>>>>>> > Barry
>>>>>> >
>>>>>> >> On May 31, 2019, at 2:53 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>>>>> >>
>>>>>> >> Sanjay,
>>>>>> >> I tried PETSc with MPICH and OpenMPI on my MacBook. I inserted PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and end of KSPSolve, then computed the delta and summed over processes. Then I tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c.
>>>>>> >> With OpenMPI,
>>>>>> >> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler -ts_max_steps 500 > 128.log
>>>>>> >> grep -n -v "RSS Delta=        0, Malloc Delta=        0" 128.log
>>>>>> >> 1:RSS Delta=     69632, Malloc Delta=         0
>>>>>> >> 2:RSS Delta=     69632, Malloc Delta=         0
>>>>>> >> 3:RSS Delta=     69632, Malloc Delta=         0
>>>>>> >> 4:RSS Delta=     69632, Malloc Delta=         0
>>>>>> >> 9:RSS Delta=9.25286e+06, Malloc Delta=        0
>>>>>> >> 22:RSS Delta=    49152, Malloc Delta=         0
>>>>>> >> 44:RSS Delta=    20480, Malloc Delta=         0
>>>>>> >> 53:RSS Delta=    49152, Malloc Delta=         0
>>>>>> >> 66:RSS Delta=     4096, Malloc Delta=         0
>>>>>> >> 97:RSS Delta=    16384, Malloc Delta=         0
>>>>>> >> 119:RSS Delta=   20480, Malloc Delta=         0
>>>>>> >> 141:RSS Delta=   53248, Malloc Delta=         0
>>>>>> >> 176:RSS Delta=   16384, Malloc Delta=         0
>>>>>> >> 308:RSS Delta=   16384, Malloc Delta=         0
>>>>>> >> 352:RSS Delta=   16384, Malloc Delta=         0
>>>>>> >> 550:RSS Delta=   16384, Malloc Delta=         0
>>>>>> >> 572:RSS Delta=   16384, Malloc Delta=         0
>>>>>> >> 669:RSS Delta=   40960, Malloc Delta=         0
>>>>>> >> 924:RSS Delta=   32768, Malloc Delta=         0
>>>>>> >> 1694:RSS Delta=  20480, Malloc Delta=         0
>>>>>> >> 2099:RSS Delta=  16384, Malloc Delta=         0
>>>>>> >> 2244:RSS Delta=  20480, Malloc Delta=         0
>>>>>> >> 3001:RSS Delta=  16384, Malloc Delta=         0
>>>>>> >> 5883:RSS Delta=  16384, Malloc Delta=         0
>>>>>> >>
>>>>>> >> If I increase the grid,
>>>>>> >> mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 500 -malloc_test >512.log
>>>>>> >> grep -n -v "RSS Delta=        0, Malloc Delta=        0" 512.log
>>>>>> >> 1:RSS Delta=1.05267e+06, Malloc Delta=        0
>>>>>> >> 2:RSS Delta=1.05267e+06, Malloc Delta=        0
>>>>>> >> 3:RSS Delta=1.05267e+06, Malloc Delta=        0
>>>>>> >> 4:RSS Delta=1.05267e+06, Malloc Delta=        0
>>>>>> >> 13:RSS Delta=1.24932e+08, Malloc Delta=        0
>>>>>> >>
>>>>>> >> So we did see RSS increase in 4K-page-sized chunks after KSPSolve. As long as there are no memory leaks, why do you care about it? Is it because you run out of memory?
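For reference, the before/after instrumentation Junchao describes amounts to something like the following C sketch (illustrative names only; error checking omitted; per the 6/3 message above, a non-debug PETSc build needs the -malloc option for PetscMallocGetCurrentUsage() to report meaningful values):

  #include <petscksp.h>

  /* Per-rank RSS and PETSc-malloc deltas across one KSPSolve, printed in a form
     similar to the "RSS Delta=..., Malloc Delta=..." output quoted above.
     ksp, b, x are placeholders for the application's existing objects. */
  static PetscErrorCode SolveAndReportDeltas(KSP ksp, Vec b, Vec x)
  {
    PetscLogDouble rss0, rss1, mal0, mal1;

    PetscMemoryGetCurrentUsage(&rss0);
    PetscMallocGetCurrentUsage(&mal0);
    KSPSolve(ksp, b, x);
    PetscMemoryGetCurrentUsage(&rss1);
    PetscMallocGetCurrentUsage(&mal1);

    PetscSynchronizedPrintf(PETSC_COMM_WORLD, "RSS Delta=%g, Malloc Delta=%g, RSS End=%g\n",
                            (double)(rss1 - rss0), (double)(mal1 - mal0), (double)rss1);
    PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);
    return 0;
  }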
>>>>>> >> On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>>>>> >>
>>>>>> >> Thanks for the update. So the current conclusions are that using the Waitall in your code
>>>>>> >>
>>>>>> >> 1) solves the memory issue with OpenMPI in your code
>>>>>> >>
>>>>>> >> 2) does not solve the memory issue with PETSc KSPSolve
>>>>>> >>
>>>>>> >> 3) MPICH has memory issues both for your code and PETSc KSPSolve (despite the Waitall fix)?
>>>>>> >>
>>>>>> >> If you literally just comment out the call to KSPSolve() with OpenMPI, is there no growth in memory usage?
>>>>>> >>
>>>>>> >> Both 2 and 3 are concerning; they indicate possible memory leak bugs in MPICH and/or MPI resources not being freed in KSPSolve().
>>>>>> >>
>>>>>> >> Junchao, can you please investigate 2 and 3 with, for example, a TS example that uses the linear solver (like with -ts_type beuler)? Thanks
>>>>>> >>
>>>>>> >> Barry
>>>>>> >>
>>>>>> >>> On May 30, 2019, at 1:47 PM, Sanjay Govindjee <s...@berkeley.edu> wrote:
>>>>>> >>>
>>>>>> >>> Lawrence,
>>>>>> >>> Thanks for taking a look! This is what I had been wondering about -- my knowledge of MPI is pretty minimal, and the routine originally came from a programmer we hired a decade+ back from NERSC. I'll have to look into VecScatter. It will be great to dispense with our roll-your-own routines (we even have our own reduceALL scattered around the code).
>>>>>> >>>
>>>>>> >>> Interestingly, the MPI_Waitall has solved the problem when using OpenMPI, but it still persists with MPICH. Graphs attached. I'm going to run with OpenMPI for now (but I guess I really still need to figure out what is wrong with MPICH and Waitall; I'll try Barry's suggestion of --download-mpich-configure-arguments="--enable-error-messages=all --enable-g" later today and report back).
>>>>>> >>>
>>>>>> >>> Regarding MPI_Barrier, it was put in due to a problem where some processes were finishing their sends and receives and exiting the subroutine before the receiving processes had completed (which resulted in data loss, as the buffers are freed after the call to the routine). MPI_Barrier was the solution proposed to us. I don't think I can dispense with it, but I will think about it some more.
>>>>>> >>>
>>>>>> >>> I'm not so sure about using MPI_Irecv, as it will require a bit of rewriting, since right now I process the received data sequentially after each blocking MPI_Recv -- clearly slower but easier to code.
>>>>>> >>>
>>>>>> >>> Thanks again for the help.
>>>>>> >>>
>>>>>> >>> -sanjay
>>>>>> >>>
>>>>>> >>> On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
>>>>>> >>>> Hi Sanjay,
>>>>>> >>>>
>>>>>> >>>>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>> >>>>>
>>>>>> >>>>> The problem seems to persist but with a different signature. Graphs attached as before.
>>>>>> >>>>> Totals with MPICH (NB: single run)
>>>>>> >>>>>
>>>>>> >>>>> For the CG/Jacobi case, data_exchange_total = 41,385,984; kspsolve_total = 38,289,408
>>>>>> >>>>> For the GMRES/BJACOBI case, data_exchange_total = 41,324,544; kspsolve_total = 41,324,544
>>>>>> >>>>>
>>>>>> >>>>> Just reading the MPI docs, I am wondering if I need some sort of MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange routine? I would have thought that with the blocking receives and the MPI_Barrier everything would have fully completed and been cleaned up before all processes exited the routine, but perhaps I am wrong on that.
>>>>>> >>>>
>>>>>> >>>> Skimming the Fortran code you sent, you do:
>>>>>> >>>>
>>>>>> >>>> for i in ...:
>>>>>> >>>>     call MPI_Isend(..., req, ierr)
>>>>>> >>>>
>>>>>> >>>> for i in ...:
>>>>>> >>>>     call MPI_Recv(..., ierr)
>>>>>> >>>>
>>>>>> >>>> But you never call MPI_Wait on the request you got back from the Isend, so the MPI library will never free the data structures it created.
>>>>>> >>>>
>>>>>> >>>> The usual pattern for these non-blocking communications is to allocate an array for the requests of length nsend+nrecv and then do:
>>>>>> >>>>
>>>>>> >>>> for i in nsend:
>>>>>> >>>>     call MPI_Isend(..., req[i], ierr)
>>>>>> >>>> for j in nrecv:
>>>>>> >>>>     call MPI_Irecv(..., req[nsend+j], ierr)
>>>>>> >>>>
>>>>>> >>>> call MPI_Waitall(req, ..., ierr)
>>>>>> >>>>
>>>>>> >>>> I note also that there's no need for the Barrier at the end of the routine; this kind of communication does neighbourwise synchronisation, so there is no need to add (unnecessary) global synchronisation too.
>>>>>> >>>>
>>>>>> >>>> As an aside, is there a reason you don't use PETSc's VecScatter to manage this global-to-local exchange?
>>>>>> >>>>
>>>>>> >>>> Cheers,
>>>>>> >>>>
>>>>>> >>>> Lawrence
>>>>>> >>>
>>>>>> >>> <cg_mpichwall.png><cg_wall.png><gmres_mpichwall.png><gmres_wall.png>
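To make Lawrence's pseudocode concrete, here is a hedged C sketch of the corrected exchange pattern (the application routine itself is Fortran; all names, buffers, and counts below are placeholders, not the actual interface):

  #include <mpi.h>
  #include <stdlib.h>

  /* Post all nonblocking sends and receives, then wait on every request so the
     MPI implementation can complete the operations and free its internal
     bookkeeping.  nsend/nrecv, the neighbour lists, buffers, counts, and tag
     are illustrative. */
  void neighbour_exchange(int nsend, int nrecv, const int *sendto, const int *recvfrom,
                          double **sendbuf, double **recvbuf, const int *scount,
                          const int *rcount, int tag, MPI_Comm comm)
  {
    MPI_Request *req = malloc((size_t)(nsend + nrecv) * sizeof(MPI_Request));
    int i;

    for (i = 0; i < nsend; i++)
      MPI_Isend(sendbuf[i], scount[i], MPI_DOUBLE, sendto[i], tag, comm, &req[i]);
    for (i = 0; i < nrecv; i++)
      MPI_Irecv(recvbuf[i], rcount[i], MPI_DOUBLE, recvfrom[i], tag, comm, &req[nsend + i]);

    /* Completes and releases every request; after this the buffers may be
       reused or freed.  As Lawrence notes, this neighbourwise completion also
       makes a trailing MPI_Barrier unnecessary. */
    MPI_Waitall(nsend + nrecv, req, MPI_STATUSES_IGNORE);
    free(req);
  }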