PetscMemoryGetCurrentUsage( ) is just a cover for getrusage( ), so what it reports is process-wide resident memory and is not specific to PETSc. The only difference here is MPICH versus OpenMPI. Notwithstanding, I can make a plot of the sum of the deltas around KSPSolve.

Sent from my iPad
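In code, the measurement being described amounts to something like the following minimal C sketch (the function name and the ksp, b, x objects are placeholders for whatever the application already has; error checking is omitted):

  #include <petscksp.h>

  /* Measure the RSS change across one KSPSolve and sum it over all ranks.
     PetscMemoryGetCurrentUsage() reports resident set size (getrusage() or the
     OS equivalent underneath); PetscLogDouble is a double, hence MPI_DOUBLE. */
  static PetscErrorCode SolveAndReportRSSDelta(KSP ksp, Vec b, Vec x)
  {
    PetscLogDouble before, after, delta, total;

    PetscMemoryGetCurrentUsage(&before);
    KSPSolve(ksp, b, x);
    PetscMemoryGetCurrentUsage(&after);
    delta = after - before;

    MPI_Allreduce(&delta, &total, 1, MPI_DOUBLE, MPI_SUM, PETSC_COMM_WORLD);
    PetscPrintf(PETSC_COMM_WORLD, "RSS delta summed over ranks: %g bytes\n", (double)total);
    return 0;
  }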
> On Jun 5, 2019, at 7:22 AM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>
> Sanjay,
> It sounds like the memory is allocated by PETSc, since you call PetscMemoryGetCurrentUsage(). Make sure you use the latest PETSc version. You can also do an experiment that puts two PetscMemoryGetCurrentUsage() calls before and after KSPSolve(), calculates the delta, and then sums over processes, so we know whether the memory is allocated in KSPSolve().
>
> --Junchao Zhang
>
>> On Wed, Jun 5, 2019 at 1:19 AM Sanjay Govindjee <s...@berkeley.edu> wrote:
>> Junchao,
>>
>> Attached is a graph of total RSS from my Mac using OpenMPI and MPICH (installed with --download-openmpi and --download-mpich).
>>
>> The difference is pretty stark! The WaitAll( ) in my part of the code fixed the runaway memory problem using OpenMPI but definitely not with MPICH.
>>
>> Tomorrow I hope to get my Linux box set up; unfortunately it needs an OS update :( Then I can try to run there and reproduce the same (or find out it is a Mac quirk, though the reason I started looking at this was that a user on an HPC system pointed it out to me).
>>
>> -sanjay
>>
>> PS: To generate the data, all I did was place a call to PetscMemoryGetCurrentUsage( ) right after KSPSolve( ), followed by an MPI_Allreduce( ) to sum across the job (4 processors).
>>
>>> On 6/4/19 4:27 PM, Zhang, Junchao wrote:
>>> Hi, Sanjay,
>>> I managed to use Valgrind massif + MPICH master + PETSc master. I ran ex5 for 500 time steps with "mpirun -n 4 valgrind --tool=massif --max-snapshots=200 --detailed-freq=1 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 500 -malloc" and visualized the output with massif-visualizer. From the attached picture, we can see the total heap size stays constant most of the time and is NOT monotonically increasing. We can also see that MPI only allocated memory at initialization time and kept it. So it is unlikely that MPICH keeps allocating memory in each KSPSolve call.
>>> From the graphs you sent, I can only see that RSS randomly increases after KSPSolve, but that does not mean the heap size keeps increasing. I recommend you also profile your code with valgrind massif and visualize it. I failed to install massif-visualizer on MacBook and CentOS, but I easily got it installed on Ubuntu.
>>> I want you to confirm that with the MPI_Waitall fix, you still run out of memory with MPICH (but not OpenMPI). If needed, I can hack MPICH to get its current memory usage so that we can calculate its difference after each KSPSolve call.
>>>
>>> <massif-ex5.png>
>>>
>>> --Junchao Zhang
>>>
>>>> On Mon, Jun 3, 2019 at 6:36 PM Sanjay Govindjee <s...@berkeley.edu> wrote:
>>>> Junchao,
>>>> It won't be feasible to share the code, but I will run a test similar to the one you have done (large problem); I will try with both MPICH and OpenMPI. I also agree that deltas are not ideal, as they do not account for latency in the freeing of memory, etc. But I will note that when we have the memory growth issue, latency associated with free( ) appears not to be in play, since the total memory footprint grows monotonically.
>>>>
>>>> I'll also have a look at massif. If you figure out the interface and can send me the lines to instrument the code with, that will save me some time.
>>>> -sanjay
>>>>
>>>>> On 6/3/19 3:17 PM, Zhang, Junchao wrote:
>>>>> Sanjay & Barry,
>>>>> Sorry, I made a mistake when I said I could reproduce Sanjay's experiments. I found that 1) to correctly use PetscMallocGetCurrentUsage() when PETSc is configured without debugging, I have to add -malloc when running the program; and 2) I have to instrument the code outside of KSPSolve(). In my case, that is in SNESSolve_NEWTONLS. In the old experiments I did it inside KSPSolve; since KSPSolve can recursively call KSPSolve, the old results were misleading.
>>>>> With these fixes, I measured the differences of RSS and PETSc malloc before/after KSPSolve. I did experiments on a MacBook using src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c with commands like mpirun -n 4 ./ex5 -da_grid_x 64 -da_grid_y 64 -ts_type beuler -ts_max_steps 500 -malloc.
>>>>> I find that if the grid size is small, I can see a non-zero RSS delta randomly, either with one MPI rank or multiple ranks, with MPICH or OpenMPI. If I increase the grid size, e.g., -da_grid_x 256 -da_grid_y 256, I only see non-zero RSS deltas randomly in the first few iterations (with MPICH or OpenMPI). When the computer's workload is high from simultaneously running ex5-openmpi and ex5-mpich, the MPICH run pops up many more non-zero RSS deltas. But the "Malloc Delta" behavior is stable across all runs: there is only one nonzero malloc delta value, in the first KSPSolve call; all remaining ones are zero. Something like this:
>>>>> mpirun -n 4 ./ex5-mpich -da_grid_x 256 -da_grid_y 256 -ts_type beuler -ts_max_steps 500 -malloc
>>>>> RSS Delta= 32489472, Malloc Delta= 26290304, RSS End= 136114176
>>>>> RSS Delta=    32768, Malloc Delta=        0, RSS End= 138510336
>>>>> RSS Delta=        0, Malloc Delta=        0, RSS End= 138522624
>>>>> RSS Delta=        0, Malloc Delta=        0, RSS End= 138539008
>>>>> So I think I can conclude there is no unfreed memory in KSPSolve() allocated by PETSc. Has MPICH allocated unfreed memory in KSPSolve? That is possible, and I am trying to find a way, like PetscMallocGetCurrentUsage(), to measure that. Also, I think the RSS delta is not a good way to measure memory allocation: it is dynamic and depends on the state of the computer (swap, shared libraries loaded, etc.) when running the code. We should focus on malloc instead. If there were a valgrind tool, like the performance profiling tools, that let users measure memory allocated but not freed in a user-specified code segment, that would be very helpful in this case. But I have not found one.
>>>>>
>>>>> Sanjay, did you say that currently you can run with OpenMPI without running out of memory, but with MPICH you ran out of memory? Is it feasible to share your code so that I can test with it? Thanks.
>>>>>
>>>>> --Junchao Zhang
>>>>>
>>>>>> On Sat, Jun 1, 2019 at 3:21 AM Sanjay Govindjee <s...@berkeley.edu> wrote:
>>>>>> Barry,
>>>>>>
>>>>>> If you look at the graphs I generated (on my Mac), you will see that OpenMPI and MPICH have very different values (along with the fact that MPICH does not seem to adhere to the standard for releasing MPI_Isend resources following an MPI_Wait).
>>>>>>
>>>>>> -sanjay
>>>>>>
>>>>>> PS: I agree with Barry's assessment; this is really not acceptable.
>>>>>>
>>>>>> On 6/1/19 1:00 AM, Smith, Barry F. wrote:
>>>>>> > Junchao,
>>>>>> >
>>>>>> > This is insane.
>>>>>> > Either the OpenMPI library or something in the OS underneath, related to sockets and interprocess communication, is grabbing additional space for each round of MPI communication! Does MPICH have the same values or different values than OpenMPI? When you run on Linux, do you get the same values as on Apple, or different? Same values would indicate the issue is inside OpenMPI/MPICH; different values would indicate the problem is more likely at the OS level.
>>>>>> > Does this happen only with the default VecScatter that uses blocking MPI? What happens with PetscSF under Vec? Is it somehow related to PETSc's use of nonblocking sends and receives? One could presumably use valgrind to see exactly what lines in what code are causing these increases. I don't think we can just shrug and say this is the way it is; we need to track down and understand the cause (and if possible fix it).
>>>>>> >
>>>>>> > Barry
>>>>>> >
>>>>>> >> On May 31, 2019, at 2:53 PM, Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
>>>>>> >>
>>>>>> >> Sanjay,
>>>>>> >> I tried PETSc with MPICH and OpenMPI on my MacBook. I inserted PetscMemoryGetCurrentUsage/PetscMallocGetCurrentUsage at the beginning and end of KSPSolve, then computed the delta and summed over processes. Then I tested with src/ts/examples/tutorials/advection-diffusion-reaction/ex5.c.
>>>>>> >> With OpenMPI,
>>>>>> >> mpirun -n 4 ./ex5 -da_grid_x 128 -da_grid_y 128 -ts_type beuler -ts_max_steps 500 > 128.log
>>>>>> >> grep -n -v "RSS Delta=        0, Malloc Delta=        0" 128.log
>>>>>> >> 1:RSS Delta=     69632, Malloc Delta=         0
>>>>>> >> 2:RSS Delta=     69632, Malloc Delta=         0
>>>>>> >> 3:RSS Delta=     69632, Malloc Delta=         0
>>>>>> >> 4:RSS Delta=     69632, Malloc Delta=         0
>>>>>> >> 9:RSS Delta=9.25286e+06, Malloc Delta=        0
>>>>>> >> 22:RSS Delta=    49152, Malloc Delta=         0
>>>>>> >> 44:RSS Delta=    20480, Malloc Delta=         0
>>>>>> >> 53:RSS Delta=    49152, Malloc Delta=         0
>>>>>> >> 66:RSS Delta=     4096, Malloc Delta=         0
>>>>>> >> 97:RSS Delta=    16384, Malloc Delta=         0
>>>>>> >> 119:RSS Delta=   20480, Malloc Delta=         0
>>>>>> >> 141:RSS Delta=   53248, Malloc Delta=         0
>>>>>> >> 176:RSS Delta=   16384, Malloc Delta=         0
>>>>>> >> 308:RSS Delta=   16384, Malloc Delta=         0
>>>>>> >> 352:RSS Delta=   16384, Malloc Delta=         0
>>>>>> >> 550:RSS Delta=   16384, Malloc Delta=         0
>>>>>> >> 572:RSS Delta=   16384, Malloc Delta=         0
>>>>>> >> 669:RSS Delta=   40960, Malloc Delta=         0
>>>>>> >> 924:RSS Delta=   32768, Malloc Delta=         0
>>>>>> >> 1694:RSS Delta=  20480, Malloc Delta=         0
>>>>>> >> 2099:RSS Delta=  16384, Malloc Delta=         0
>>>>>> >> 2244:RSS Delta=  20480, Malloc Delta=         0
>>>>>> >> 3001:RSS Delta=  16384, Malloc Delta=         0
>>>>>> >> 5883:RSS Delta=  16384, Malloc Delta=         0
>>>>>> >>
>>>>>> >> If I increase the grid,
>>>>>> >> mpirun -n 4 ./ex5 -da_grid_x 512 -da_grid_y 512 -ts_type beuler -ts_max_steps 500 -malloc_test >512.log
>>>>>> >> grep -n -v "RSS Delta=        0, Malloc Delta=        0" 512.log
>>>>>> >> 1:RSS Delta=1.05267e+06, Malloc Delta=        0
>>>>>> >> 2:RSS Delta=1.05267e+06, Malloc Delta=        0
>>>>>> >> 3:RSS Delta=1.05267e+06, Malloc Delta=        0
>>>>>> >> 4:RSS Delta=1.05267e+06, Malloc Delta=        0
>>>>>> >> 13:RSS Delta=1.24932e+08, Malloc Delta=        0
>>>>>> >>
>>>>>> >> So we did see RSS increase in 4K-page-sized chunks after KSPSolve. As long as there are no memory leaks, why do you care about it? Is it because you run out of memory?
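For reference, the before/after instrumentation Junchao describes amounts to something like the following C sketch (illustrative names only; error checking omitted; per the 6/3 message above, a non-debug PETSc build needs the -malloc option for PetscMallocGetCurrentUsage() to report meaningful values):

  #include <petscksp.h>

  /* Per-rank RSS and PETSc-malloc deltas across one KSPSolve, printed in a form
     similar to the "RSS Delta=..., Malloc Delta=..." output quoted above.
     ksp, b, x are placeholders for the application's existing objects. */
  static PetscErrorCode SolveAndReportDeltas(KSP ksp, Vec b, Vec x)
  {
    PetscLogDouble rss0, rss1, mal0, mal1;

    PetscMemoryGetCurrentUsage(&rss0);
    PetscMallocGetCurrentUsage(&mal0);
    KSPSolve(ksp, b, x);
    PetscMemoryGetCurrentUsage(&rss1);
    PetscMallocGetCurrentUsage(&mal1);

    PetscSynchronizedPrintf(PETSC_COMM_WORLD, "RSS Delta=%g, Malloc Delta=%g, RSS End=%g\n",
                            (double)(rss1 - rss0), (double)(mal1 - mal0), (double)rss1);
    PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);
    return 0;
  }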
>>>>>> >> On Thu, May 30, 2019 at 1:59 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>>>>>> >>
>>>>>> >> Thanks for the update. So the current conclusions are that using the Waitall in your code
>>>>>> >>
>>>>>> >> 1) solves the memory issue with OpenMPI in your code
>>>>>> >>
>>>>>> >> 2) does not solve the memory issue with PETSc KSPSolve
>>>>>> >>
>>>>>> >> 3) MPICH has memory issues both for your code and PETSc KSPSolve (despite the Waitall fix)?
>>>>>> >>
>>>>>> >> If you literally just comment out the call to KSPSolve() with OpenMPI, is there no growth in memory usage?
>>>>>> >>
>>>>>> >> Both 2 and 3 are concerning; they indicate possible memory leak bugs in MPICH and/or MPI resources not being freed in KSPSolve().
>>>>>> >>
>>>>>> >> Junchao, can you please investigate 2 and 3 with, for example, a TS example that uses the linear solver (like with -ts_type beuler)? Thanks
>>>>>> >>
>>>>>> >> Barry
>>>>>> >>
>>>>>> >>> On May 30, 2019, at 1:47 PM, Sanjay Govindjee <s...@berkeley.edu> wrote:
>>>>>> >>>
>>>>>> >>> Lawrence,
>>>>>> >>> Thanks for taking a look! This is what I had been wondering about -- my knowledge of MPI is pretty minimal, and the routine originally came from a programmer we hired a decade+ back from NERSC. I'll have to look into VecScatter. It will be great to dispense with our roll-your-own routines (we even have our own reduceALL scattered around the code).
>>>>>> >>>
>>>>>> >>> Interestingly, the MPI_Waitall has solved the problem when using OpenMPI, but it still persists with MPICH. Graphs attached. I'm going to run with OpenMPI for now (but I guess I really still need to figure out what is wrong with MPICH and Waitall; I'll try Barry's suggestion of --download-mpich-configure-arguments="--enable-error-messages=all --enable-g" later today and report back).
>>>>>> >>>
>>>>>> >>> Regarding MPI_Barrier, it was put in due to a problem where some processes were finishing their sends and receives and exiting the subroutine before the receiving processes had completed (which resulted in data loss, as the buffers are freed after the call to the routine). MPI_Barrier was the solution proposed to us. I don't think I can dispense with it, but I will think about it some more.
>>>>>> >>>
>>>>>> >>> I'm not so sure about using MPI_Irecv, as it will require a bit of rewriting, since right now I process the received data sequentially after each blocking MPI_Recv -- clearly slower but easier to code.
>>>>>> >>>
>>>>>> >>> Thanks again for the help.
>>>>>> >>>
>>>>>> >>> -sanjay
>>>>>> >>>
>>>>>> >>> On 5/30/19 4:48 AM, Lawrence Mitchell wrote:
>>>>>> >>>> Hi Sanjay,
>>>>>> >>>>
>>>>>> >>>>> On 30 May 2019, at 08:58, Sanjay Govindjee via petsc-users <petsc-users@mcs.anl.gov> wrote:
>>>>>> >>>>>
>>>>>> >>>>> The problem seems to persist but with a different signature. Graphs attached as before.
>>>>>> >>>>> Totals with MPICH (NB: single run)
>>>>>> >>>>>
>>>>>> >>>>> For the CG/Jacobi case, data_exchange_total = 41,385,984; kspsolve_total = 38,289,408
>>>>>> >>>>> For the GMRES/BJACOBI case, data_exchange_total = 41,324,544; kspsolve_total = 41,324,544
>>>>>> >>>>>
>>>>>> >>>>> Just reading the MPI docs, I am wondering if I need some sort of MPI_Wait/MPI_Waitall before my MPI_Barrier in the data exchange routine? I would have thought that with the blocking receives and the MPI_Barrier everything would have fully completed and been cleaned up before all processes exited the routine, but perhaps I am wrong on that.
>>>>>> >>>>
>>>>>> >>>> Skimming the Fortran code you sent, you do:
>>>>>> >>>>
>>>>>> >>>> for i in ...:
>>>>>> >>>>     call MPI_Isend(..., req, ierr)
>>>>>> >>>>
>>>>>> >>>> for i in ...:
>>>>>> >>>>     call MPI_Recv(..., ierr)
>>>>>> >>>>
>>>>>> >>>> But you never call MPI_Wait on the request you got back from the Isend, so the MPI library will never free the data structures it created.
>>>>>> >>>>
>>>>>> >>>> The usual pattern for these non-blocking communications is to allocate an array for the requests of length nsend+nrecv and then do:
>>>>>> >>>>
>>>>>> >>>> for i in nsend:
>>>>>> >>>>     call MPI_Isend(..., req[i], ierr)
>>>>>> >>>> for j in nrecv:
>>>>>> >>>>     call MPI_Irecv(..., req[nsend+j], ierr)
>>>>>> >>>>
>>>>>> >>>> call MPI_Waitall(req, ..., ierr)
>>>>>> >>>>
>>>>>> >>>> I note also that there's no need for the Barrier at the end of the routine; this kind of communication does neighbourwise synchronisation, so there is no need to add (unnecessary) global synchronisation too.
>>>>>> >>>>
>>>>>> >>>> As an aside, is there a reason you don't use PETSc's VecScatter to manage this global-to-local exchange?
>>>>>> >>>>
>>>>>> >>>> Cheers,
>>>>>> >>>>
>>>>>> >>>> Lawrence
>>>>>> >>>
>>>>>> >>> <cg_mpichwall.png><cg_wall.png><gmres_mpichwall.png><gmres_wall.png>
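To make Lawrence's pseudocode concrete, here is a hedged C sketch of the corrected exchange pattern (the application routine itself is Fortran; all names, buffers, and counts below are placeholders, not the actual interface):

  #include <mpi.h>
  #include <stdlib.h>

  /* Post all nonblocking sends and receives, then wait on every request so the
     MPI implementation can complete the operations and free its internal
     bookkeeping.  nsend/nrecv, the neighbour lists, buffers, counts, and tag
     are illustrative. */
  void neighbour_exchange(int nsend, int nrecv, const int *sendto, const int *recvfrom,
                          double **sendbuf, double **recvbuf, const int *scount,
                          const int *rcount, int tag, MPI_Comm comm)
  {
    MPI_Request *req = malloc((size_t)(nsend + nrecv) * sizeof(MPI_Request));
    int i;

    for (i = 0; i < nsend; i++)
      MPI_Isend(sendbuf[i], scount[i], MPI_DOUBLE, sendto[i], tag, comm, &req[i]);
    for (i = 0; i < nrecv; i++)
      MPI_Irecv(recvbuf[i], rcount[i], MPI_DOUBLE, recvfrom[i], tag, comm, &req[nsend + i]);

    /* Completes and releases every request; after this the buffers may be
       reused or freed.  As Lawrence notes, this neighbourwise completion also
       makes a trailing MPI_Barrier unnecessary. */
    MPI_Waitall(nsend + nrecv, req, MPI_STATUSES_IGNORE);
    free(req);
  }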