Sanjay,
   You have one more reason to use VecScatter, which is heavily used and 
well-tested.
--Junchao Zhang


On Wed, Jun 5, 2019 at 5:47 PM Sanjay Govindjee 
<s...@berkeley.edu> wrote:
I found the bug (naturally in my own code).  When I made the MPI_Wait( )
changes, I missed one location where this
was needed.   See the attached graphs for openmpi and mpich using
CG+Jacobi and GMRES+BJacobi.
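
For reference, the shape of the exchange after the fix is roughly as sketched
below.  This is only a schematic with placeholder names (sbuf, rbuf, sreq,
etc.), not the actual routine:

    subroutine data_exchange_sketch(nsend, nrecv, maxlen, scount, rcount,  &
                                    sdest, rsrc, sbuf, rbuf)
      ! Schematic of the send/receive pattern with the MPI_Waitall( ) added.
      use mpi
      implicit none
      integer          :: nsend, nrecv, maxlen
      integer          :: scount(nsend), rcount(nrecv)
      integer          :: sdest(nsend), rsrc(nrecv)
      double precision :: sbuf(maxlen,nsend), rbuf(maxlen,nrecv)
      integer          :: i, j, ierr, itag
      integer          :: sreq(nsend)

      itag = 1

      ! post all of the non-blocking sends
      do i = 1, nsend
        call MPI_Isend(sbuf(1,i), scount(i), MPI_DOUBLE_PRECISION,  &
                       sdest(i), itag, MPI_COMM_WORLD, sreq(i), ierr)
      end do

      ! blocking receives; the received data is processed as it arrives
      do j = 1, nrecv
        call MPI_Recv(rbuf(1,j), rcount(j), MPI_DOUBLE_PRECISION,  &
                      rsrc(j), itag, MPI_COMM_WORLD, MPI_STATUS_IGNORE, ierr)
      end do

      ! the call that was missing in one location: complete every MPI_Isend
      ! so the MPI library can release the resources tied to each request
      call MPI_Waitall(nsend, sreq, MPI_STATUSES_IGNORE, ierr)
    end subroutine data_exchange_sketch

Without that final MPI_Waitall( ), the requests created by the MPI_Isend calls
are never completed, so the MPI library never frees the data structures it
created for them.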

Interesting that openmpi did not care about this but mpich did. Also
interesting that the memory was growing so much when the data packets
going back and forth were just a few hundred bytes.

Thanks for your efforts and patience.

-sanjay


On 6/5/19 2:38 PM, Smith, Barry F. wrote:
>    Are you reusing the same KSP the whole time, just making calls to 
> KSPSolve, or are you creating a new KSP object?
>
>    Do you make any calls to KSPReset()?
>
>    Are you doing any MPI_Comm_dup()?
>
>    Are you attaching any attributes to MPI communicators?
>
>     Thanks
>
>> On Jun 5, 2019, at 1:18 AM, Sanjay Govindjee 
>> <s...@berkeley.edu> wrote:
>>
>> Junchao,
>>
>>    Attached is a graph of total RSS from my Mac using openmpi and mpich 
>> (installed with --download-openmpi and --download-mpich).
>>
>>    The difference is pretty stark!  The WaitAll( ) in my part of the code 
>> fixed the runaway memory problem using openmpi but definitely not with mpich.
>>
>>    Tomorrow I hope to get my linux box set up; unfortunately it needs an OS 
>> update :(
>> Then I can try to run there and reproduce the same behavior (or find out it 
>> is a Mac quirk, though the reason I started looking at this was that a user 
>> on an HPC system pointed it out to me).
>>
>> -sanjay
>>
>> PS: To generate the data, all I did was place a call to 
>> PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by an 
>> MPI_Allreduce( ) to sum across the job (4 processors).
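>>
>> In outline, the instrumentation looked something like the routine below 
>> (purely a schematic with made-up names, not the actual code):
>>
>>     subroutine report_rss()
>>       ! Schematic: sum the per-process resident set size over the whole job
>>       ! right after KSPSolve( ).  PetscLogDouble is a C double, so plain
>>       ! double precision is used here.
>>       use mpi
>>       implicit none
>>       double precision :: mem, totmem
>>       integer          :: ierr, rank
>>
>>       call PetscMemoryGetCurrentUsage(mem, ierr)
>>       call MPI_Allreduce(mem, totmem, 1, MPI_DOUBLE_PRECISION,  &
>>                          MPI_SUM, MPI_COMM_WORLD, ierr)
>>       call MPI_Comm_rank(MPI_COMM_WORLD, rank, ierr)
>>       if (rank == 0) write(*,*) 'total RSS after KSPSolve:', totmem
>>     end subroutine report_rss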
>>
>> On 6/4/19 4:27 PM, Zhang, Junchao wrote:
>>> Hi, Sanjay,
>>>    I managed to use Valgrind massif + MPICH master + PETSc master. I ran 
>>> ex5 for 500 time steps with "mpirun -n 4 valgrind --tool=massif 
>>> --max-snapshots=200 --detailed-freq=1 ./ex5 -da_grid_x 512 -da_grid_y 512 
>>> -ts_type beuler -ts_max_steps 500 -malloc".
>>>    I visualized the output with massif-visualizer. From the attached 
>>> picture, we can see that the total heap size stays constant most of the 
>>> time and is NOT monotonically increasing.  We can also see that MPI only 
>>> allocated memory at initialization time and kept it. So it is unlikely that 
>>> MPICH keeps allocating memory in each KSPSolve call.
>>>    From the graphs you sent, I can only see that RSS randomly increases 
>>> after KSPSolve, but that does not mean the heap size keeps increasing.  I 
>>> recommend you also profile your code with valgrind massif and visualize the 
>>> result. I failed to install massif-visualizer on my MacBook and on CentOS, 
>>> but I easily got it installed on Ubuntu.
>>>    I want you to confirm that with the MPI_Waitall fix, you still run out 
>>> of memory with MPICH (but not with OpenMPI).  If needed, I can hack MPICH 
>>> to get its current memory usage so that we can calculate the difference 
>>> after each KSPSolve call.
>>>
>>> <massif-ex5.png>
>>>
>>>
>>> --Junchao Zhang
>>>
>>>
>>> On Mon, Jun 3, 2019 at 6:36 PM Sanjay Govindjee 
>>> <s...@berkeley.edu> wrote:
>>> Junchao,
>>>    It won't be feasible to share the code, but I will run a similar test to 
>>> the one you have done (large problem); I will try with both MPICH and 
>>> OpenMPI.  I also agree that deltas are not ideal, as they do not account 
>>> for latency in the freeing of memory, etc.  But I will note that when we 
>>> have the memory growth issue, latency associated with free( ) appears not 
>>> to be in play, since the total memory footprint grows monotonically.
>>>
>>>    I'll also have a look at massif.  If you figure out the interface and 
>>> can send me the lines to instrument the code with, that will save me some 
>>> time.
>>> -sanjay
>>> On 6/3/19 3:17 PM, Zhang, Junchao wrote:
>>>> Sanjay & Barry,
>>>>    Sorry, I made a mistake when I said I could reproduce Sanjay's 
>>>> experiments. I found that 1) to correctly use PetscMallocGetCurrentUsage() 
>>>> when PETSc is configured without debugging, I have to run the program with 
>>>> -malloc, and 2) I have to instrument the code outside of KSPSolve(); in my 
>>>> case that is in SNESSolve_NEWTONLS. In the old experiments I did it inside 
>>>> KSPSolve. Since KSPSolve can recursively call KSPSolve, the old results 
>>>> were misleading.
>>>>   With these fixes, I measured the differences in RSS and PETSc malloc 
>>>> before/after KSPSolve. I ran experiments on my MacBook using 
>>>> src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands 
>>>> like mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler 
>>>> -ts_max_steps 500 -malloc.
>>>>   I find that if the grid size is small, I see a nonzero RSS delta at 
>>>> random, either with one MPI rank or with multiple ranks, and with either 
>>>> MPICH or OpenMPI. If I increase the grid size, e.g., -da_grid_x 256 
>>>> -da_grid_y 256, I only see nonzero RSS deltas, at random, in the first few 
>>>> iterations (with MPICH or OpenMPI). When the machine is heavily loaded 
>>>> because ex5-openmpi and ex5-mpich are running simultaneously, the MPICH 
>>>> run shows many more nonzero RSS deltas. But the "Malloc Delta" behavior is 
>>>> stable across all runs: there is only one nonzero malloc delta, in the 
>>>> first KSPSolve call; all the rest are zero. Something like this:
>>>> mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler -ts_max_steps 500 -malloc
>>>> RSS Delta=       32489472, Malloc Delta=       26290304, RSS End=      136114176
>>>> RSS Delta=          32768, Malloc Delta=              0, RSS End=      138510336
>>>> RSS Delta=              0, Malloc Delta=              0, RSS End=      138522624
>>>> RSS Delta=              0, Malloc Delta=              0, RSS End=      138539008
>>>> So I think I can conclude there is no unfreed memory allocated by PETSc in 
>>>> KSPSolve().  Has MPICH allocated unfreed memory in KSPSolve? That is 
>>>> possible, and I am trying to find a way, like PetscMallocGetCurrentUsage(), 
>>>> to measure that. Also, I think the RSS delta is not a good way to measure 
>>>> memory allocation: it is dynamic and depends on the state of the computer 
>>>> (swap, shared libraries loaded, etc.) when running the code. We should 
>>>> focus on malloc instead.  If there were a valgrind tool, analogous to the 
>>>> performance profiling tools, that let users measure memory allocated but 
>>>> not freed in a user-specified code segment, it would be very helpful here, 
>>>> but I have not found one.
>>>>
>>>> Sanjay, did you say that you can currently run with OpenMPI without 
>>>> running out of memory, but that with MPICH you run out of memory?  Is it 
>>>> feasible to share your code so that I can test with it? Thanks.
>>>>
>>>> --Junchao Zhang
>>>>
>>>> On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee 
>>>> <s...@berkeley.edu> wrote:
>>>> Barry,
>>>>
>>>> If you look at the graphs I generated (on my Mac), you will see that
>>>> OpenMPI and MPICH have very different values, and that MPICH does not seem
>>>> to adhere to the standard for releasing MPI_Isend resources following an
>>>> MPI_Wait.
>>>>
>>>> -sanjay
>>>>
>>>> PS: I agree with Barry's assessment; this is really not acceptable.
>>>>
>>>> On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>>>>>     Junchao,
>>>>>
>>>>>        This is insane. Either the OpenMPI library or something in the OS 
>>>>> underneath related to sockets and interprocess communication is grabbing 
>>>>> additional space for each round of MPI communication!  Does MPICH have 
>>>>> the same values as OpenMPI or different ones? When you run on Linux, do 
>>>>> you get the same values as on Apple or different ones? The same values 
>>>>> would indicate the issue is inside OpenMPI/MPICH; different values would 
>>>>> indicate the problem is more likely at the OS level. Does this happen 
>>>>> only with the default VecScatter that uses blocking MPI? What happens 
>>>>> with PetscSF under Vec? Is it somehow related to PETSc's use of 
>>>>> nonblocking sends and receives? One could presumably use valgrind to see 
>>>>> exactly what lines in what code are causing these increases. I don't 
>>>>> think we can just shrug and say this is the way it is; we need to track 
>>>>> down and understand the cause (and if possible fix it).
>>>>>
>>>>>     Barry
>>>>>
>>>>>
>>>>>> On May 31, 2019, at 2:53 PM, Zhang, Junchao 
>>>>>> <jczh...@mcs.anl.gov> wrote:
>>>>>>
>>>>>> Sanjay,
>>>>>> I tried PETSc with MPICH and OpenMPI on my MacBook. I inserted 
>>>>>> PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning 
>>>>>> and end of KSPSolve, then computed the deltas and summed them over 
>>>>>> processes. Then I tested with 
>>>>>> src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c.
>>>>>> With OpenMPI,
>>>>>> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler 
>>>>>> -ts_max_steps 500 > 128.log
>>>>>> grep -n -v "RSS Delta=         0, Malloc Delta=         0" 128.log
>>>>>> 1:RSS Delta=     69632, Malloc Delta=         0
>>>>>> 2:RSS Delta=     69632, Malloc Delta=         0
>>>>>> 3:RSS Delta=     69632, Malloc Delta=         0
>>>>>> 4:RSS Delta=     69632, Malloc Delta=         0
>>>>>> 9:RSS Delta=9.25286e+06, Malloc Delta=         0
>>>>>> 22:RSS Delta=     49152, Malloc Delta=         0
>>>>>> 44:RSS Delta=     20480, Malloc Delta=         0
>>>>>> 53:RSS Delta=     49152, Malloc Delta=         0
>>>>>> 66:RSS Delta=      4096, Malloc Delta=         0
>>>>>> 97:RSS Delta=     16384, Malloc Delta=         0
>>>>>> 119:RSS Delta=     20480, Malloc Delta=         0
>>>>>> 141:RSS Delta=     53248, Malloc Delta=         0
>>>>>> 176:RSS Delta=     16384, Malloc Delta=         0
>>>>>> 308:RSS Delta=     16384, Malloc Delta=         0
>>>>>> 352:RSS Delta=     16384, Malloc Delta=         0
>>>>>> 550:RSS Delta=     16384, Malloc Delta=         0
>>>>>> 572:RSS Delta=     16384, Malloc Delta=         0
>>>>>> 669:RSS Delta=     40960, Malloc Delta=         0
>>>>>> 924:RSS Delta=     32768, Malloc Delta=         0
>>>>>> 1694:RSS Delta=     20480, Malloc Delta=         0
>>>>>> 2099:RSS Delta=     16384, Malloc Delta=         0
>>>>>> 2244:RSS Delta=     20480, Malloc Delta=         0
>>>>>> 3001:RSS Delta=     16384, Malloc Delta=         0
>>>>>> 5883:RSS Delta=     16384, Malloc Delta=         0
>>>>>>
>>>>>> If I increased the grid
>>>>>> mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler 
>>>>>> -ts_max_steps 500 -malloc_test >512.log
>>>>>> grep -n -v "RSS Delta=         0, Malloc Delta=         0" 512.log
>>>>>> 1:RSS Delta=1.05267e+06, Malloc Delta=         0
>>>>>> 2:RSS Delta=1.05267e+06, Malloc Delta=         0
>>>>>> 3:RSS Delta=1.05267e+06, Malloc Delta=         0
>>>>>> 4:RSS Delta=1.05267e+06, Malloc Delta=         0
>>>>>> 13:RSS Delta=1.24932e+08, Malloc Delta=         0
>>>>>>
>>>>>> So we do see RSS increase in 4K-page-sized steps after KSPSolve. As long 
>>>>>> as there are no memory leaks, why do you care about it? Is it because 
>>>>>> you run out of memory?
>>>>>>
>>>>>> On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. 
>>>>>> <bsm...@mcs.anl.gov> wrote:
>>>>>>
>>>>>>      Thanks for the update. So the current conclusions are that using 
>>>>>> the Waitall in your code
>>>>>>
>>>>>> 1) solves the memory issue with OpenMPI in your code
>>>>>>
>>>>>> 2) does not solve the memory issue with PETSc KSPSolve
>>>>>>
>>>>>> 3) MPICH has memory issues both for your code and PETSc KSPSolve 
>>>>>> (despite the Waitall fix)?
>>>>>>
>>>>>> If you literally just comment out the call to KSPSolve() with OpenMPI, 
>>>>>> is there no growth in memory usage?
>>>>>>
>>>>>>
>>>>>> Both 2 and 3 are concerning; they indicate possible memory leak bugs in 
>>>>>> MPICH and/or MPI resources not being freed in KSPSolve().
>>>>>>
>>>>>> Junchao, can you please investigate 2 and 3 with, for example, a TS 
>>>>>> example that uses the linear solver (like with -ts_type beuler)? Thanks
>>>>>>
>>>>>>
>>>>>>     Barry
>>>>>>
>>>>>>
>>>>>>
>>>>>>> On May 30, 2019, at 1:47 PM, Sanjay Govindjee 
>>>>>>> <s...@berkeley.edu> wrote:
>>>>>>>
>>>>>>> Lawrence,
>>>>>>> Thanks for taking a look!  This is what I had been wondering about -- 
>>>>>>> my knowledge of MPI is pretty minimal, and the routine originated with 
>>>>>>> a programmer we hired a decade+ back from NERSC.  I'll have to look 
>>>>>>> into VecScatter.  It will be great to dispense with our roll-your-own 
>>>>>>> routines (we even have our own reduceALL scattered around the code).
>>>>>>>
>>>>>>> Interestingly, the MPI_Waitall has solved the problem when using 
>>>>>>> OpenMPI, but it still persists with MPICH.  Graphs attached.
>>>>>>> I'm going to run with OpenMPI for now (but I guess I really still need 
>>>>>>> to figure out what is wrong with MPICH and Waitall;
>>>>>>> I'll try Barry's suggestion of 
>>>>>>> --download-mpich-configure-arguments="--enable-error-messages=all 
>>>>>>> --enable-g" later today and report back).
>>>>>>>
>>>>>>> Regarding MPI_Barrier, it was put in due to a problem where some 
>>>>>>> processes were finishing up sending and receiving and exiting the 
>>>>>>> subroutine before the receiving processes had completed (which resulted 
>>>>>>> in data loss, as the buffers are freed after the call to the routine). 
>>>>>>> MPI_Barrier was the solution proposed to us.  I don't think I can 
>>>>>>> dispense with it, but will think about it some more.
>>>>>>>
>>>>>>> I'm not so sure about using MPI_Irecv, as it will require a bit of 
>>>>>>> rewriting since right now I process the received data sequentially 
>>>>>>> after each blocking MPI_Recv -- clearly slower but easier to code.
>>>>>>>
>>>>>>> Thanks again for the help.
>>>>>>>
>>>>>>> -sanjay
>>>>>>>
>>>>>>> On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
>>>>>>>> Hi Sanjay,
>>>>>>>>
>>>>>>>>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users 
>>>>>>>>> <petsc-users@mcs.anl.gov> wrote:
>>>>>>>>>
>>>>>>>>> The problem seems to persist but with a different signature.  Graphs 
>>>>>>>>> attached as before.
>>>>>>>>>
>>>>>>>>> Totals with MPICH (NB: single run)
>>>>>>>>>
>>>>>>>>> For the CG/Jacobi          data_exchange_total = 41,385,984; kspsolve_total = 38,289,408
>>>>>>>>> For the GMRES/BJACOBI      data_exchange_total = 41,324,544; kspsolve_total = 41,324,544
>>>>>>>>>
>>>>>>>>> Just reading the MPI docs, I am wondering if I need some sort of 
>>>>>>>>> MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange 
>>>>>>>>> routine?
>>>>>>>>> I would have thought that with the blocking receives and the 
>>>>>>>>> MPI_Barrier everything would have fully completed and been cleaned up 
>>>>>>>>> before all processes exited the routine, but perhaps I am wrong on 
>>>>>>>>> that.
>>>>>>>> Skimming the Fortran code you sent, you do:
>>>>>>>>
>>>>>>>> for i in ...:
>>>>>>>>      call MPI_Isend(..., req, ierr)
>>>>>>>>
>>>>>>>> for i in ...:
>>>>>>>>      call MPI_Recv(..., ierr)
>>>>>>>>
>>>>>>>> But you never call MPI_Wait on the request you got back from the 
>>>>>>>> Isend. So the MPI library will never free the data structures it 
>>>>>>>> created.
>>>>>>>>
>>>>>>>> The usual pattern for these non-blocking communications is to allocate 
>>>>>>>> an array for the requests of length nsend+nrecv and then do:
>>>>>>>>
>>>>>>>> for i in nsend:
>>>>>>>>      call MPI_Isend(..., req[i], ierr)
>>>>>>>> for j in nrecv:
>>>>>>>>      call MPI_Irecv(..., req[nsend+j], ierr)
>>>>>>>>
>>>>>>>> call MPI_Waitall(req, ..., ierr)
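>>>>>>>>
>>>>>>>> Filled out with made-up names and a double-precision payload (so purely 
>>>>>>>> a sketch, not your actual code), that pattern looks like:
>>>>>>>>
>>>>>>>>    subroutine exchange_sketch(nsend, nrecv, maxlen, scount, rcount,  &
>>>>>>>>                               sdest, rsrc, sbuf, rbuf)
>>>>>>>>      use mpi
>>>>>>>>      implicit none
>>>>>>>>      integer          :: nsend, nrecv, maxlen
>>>>>>>>      integer          :: scount(nsend), rcount(nrecv)
>>>>>>>>      integer          :: sdest(nsend), rsrc(nrecv)
>>>>>>>>      double precision :: sbuf(maxlen,nsend), rbuf(maxlen,nrecv)
>>>>>>>>      integer          :: i, j, ierr
>>>>>>>>      integer          :: req(nsend+nrecv)  ! one request per send and receive
>>>>>>>>
>>>>>>>>      do i = 1, nsend
>>>>>>>>        call MPI_Isend(sbuf(1,i), scount(i), MPI_DOUBLE_PRECISION,  &
>>>>>>>>                       sdest(i), 0, MPI_COMM_WORLD, req(i), ierr)
>>>>>>>>      end do
>>>>>>>>      do j = 1, nrecv
>>>>>>>>        call MPI_Irecv(rbuf(1,j), rcount(j), MPI_DOUBLE_PRECISION,  &
>>>>>>>>                       rsrc(j), 0, MPI_COMM_WORLD, req(nsend+j), ierr)
>>>>>>>>      end do
>>>>>>>>
>>>>>>>>      ! a single wait completes every send and receive; after this the
>>>>>>>>      ! receive buffers hold the data and the send buffers may be reused
>>>>>>>>      call MPI_Waitall(nsend+nrecv, req, MPI_STATUSES_IGNORE, ierr)
>>>>>>>>    end subroutine exchange_sketch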
>>>>>>>>
>>>>>>>> I note also that there's no need for the Barrier at the end of the 
>>>>>>>> routine: this kind of communication does neighbourwise synchronisation, 
>>>>>>>> so there is no need to add (unnecessary) global synchronisation too.
>>>>>>>>
>>>>>>>> As an aside, is there a reason you don't use PETSc's VecScatter to 
>>>>>>>> manage this global to local exchange?
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Lawrence
>>>>>>> <cg_mpichwall.png><cg_wall.png><gmres_mpichwall.png><gmres_wall.png>
>> <mac-mpich-openmpi.png>
