On 14 July 2016 at 01:07, frank <hengj...@uci.edu> wrote:

> Hi Dave,
>
> Sorry for the late reply. Thank you so much for your detailed reply.
>
> I have a question about the estimation of the memory usage. There are 4223139840 allocated non-zeros and 18432 MPI processes. Double precision is used. So the memory per process is:
>   4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74 MB?
> Did I do something wrong here? Because this seems too small.
>
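(As a quick sanity check of that arithmetic -- a minimal C sketch, assuming the 4223139840 figure is the global count of allocated non-zeros and counting only the 8-byte matrix values, not the AIJ column indices or row pointers:)

/* Back-of-the-envelope check of the per-process matrix memory (values only). */
#include <stdio.h>

int main(void)
{
  const double nnz_global = 4223139840.0;     /* allocated non-zeros, whole matrix */
  const double nranks     = 18432.0;          /* MPI processes                     */
  const double bytes      = nnz_global * 8.0; /* 8-byte double-precision values    */

  printf("total matrix values: %.1f GB\n", bytes / 1e9);                         /* ~33.8 GB  */
  printf("per MPI process    : %.2f MiB\n", bytes / nranks / (1024.0 * 1024.0)); /* ~1.75 MiB */
  return 0;
}

So the values alone come to roughly 1.75 MiB per process, in line with the 1.74 M figure above.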
No - I totally f***ed it up. You are correct. That'll teach me for fumbling around with my iPhone calculator and not using my brain. (Note that to convert to MB just divide by 1e6, not 1024^2 - although I apparently cannot convert between units correctly....)

From the PETSc objects associated with the solver, it looks like it _should_ run with 2 GB per MPI rank. Sorry for my mistake. Possibilities are: somewhere in your usage of PETSc you've introduced a memory leak; PETSc is doing a huge over-allocation (e.g. as per our discussion of MatPtAP); or in your application code there are other objects you have forgotten to log the memory for.

> I am running this job on Blue Waters
> <https://bluewaters.ncsa.illinois.edu/user-guide>
> I am using the 7-point FD stencil in 3D.

I thought so on both counts.

> I apologize that I made a stupid mistake in computing the memory per core. With my settings, each core can access only 2 GB of memory on average, instead of the 8 GB I mentioned in the previous email. I re-ran the job with 8 GB of memory per core on average and there was no "Out Of Memory" error. I will do more tests to see if there is still some memory issue.

Ok. I'd still like to know where the memory was being used since my estimates were off.

Thanks,
Dave

> Regards,
> Frank
>
> On 07/11/2016 01:18 PM, Dave May wrote:
>
> Hi Frank,
>
> On 11 July 2016 at 19:14, frank <hengj...@uci.edu> wrote:
>
>> Hi Dave,
>>
>> I re-ran the test using bjacobi as the preconditioner on the coarse mesh of telescope. The grid is 3072*256*768 and the process mesh is 96*8*24. The petsc option file is attached.
>> I still got the "Out Of Memory" error. The error occurred before the linear solver finished one step, so I don't have the full info from ksp_view. The info from ksp_view_pre is attached.
>
> Okay - that is essentially useless (sorry)
>
>> It seems to me that the error occurred when the decomposition was going to be changed.
>
> Based on what information?
> Running with -info would give us more clues, but will create a ton of output.
> Please try running the case which failed with -info.
>
>> I had another test with a grid of 1536*128*384 and the same process mesh as above. There was no error. The ksp_view info is attached for comparison. Thank you.
>
> [3] Here is my crude estimate of your memory usage.
> I'll target the biggest memory hogs only to get an order-of-magnitude estimate.
>
> * The fine grid operator contains 4223139840 non-zeros --> 1.8 GB per MPI rank assuming double precision. The indices for the AIJ could amount to another 0.3 GB (assuming 32-bit integers).
>
> * You use 5 levels of coarsening, so the other operators should represent (collectively)
>   2.1/8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300 MB per MPI rank on the communicator with 18432 ranks.
> The coarse grid should consume ~ 0.5 MB per MPI rank on the communicator with 18432 ranks.
>
> * You use a reduction factor of 64, making the new communicator with 288 MPI ranks.
> PCTelescope will first gather a temporary matrix associated with your coarse level operator, assuming a comm size of 288 living on the comm with size 18432.
> This matrix will require approximately 0.5 * 64 = 32 MB per core on the 288 ranks.
> This matrix is then used to form a new MPIAIJ matrix on the subcomm, thus requiring another 32 MB per rank.
> The temporary matrix is now destroyed.
>
> * Because a DMDA is detected, a permutation matrix is assembled. This requires 2 doubles per point in the DMDA.
> Your coarse DMDA contains 92 x 16 x 48 points.
> Thus the permutation matrix will require < 1 MB per MPI rank on the sub-comm.
>
> * Lastly, the matrix is permuted. This uses MatPtAP(), but the resulting operator will have the same memory footprint as the unpermuted matrix (32 MB). At any stage in PCTelescope, only 2 operators of size 32 MB are held in memory when the DMDA is provided.
>
> From my rough estimates, the worst-case memory footprint for any given core, given your options, is approximately
>   2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB.
> This is way below 8 GB.
>
> Note this estimate completely ignores:
> (1) the memory required for the restriction operator,
> (2) the potential growth in the number of non-zeros per row due to Galerkin coarsening (I wish -ksp_view_pre reported the output from MatView so we could see the number of non-zeros required by the coarse level operators),
> (3) all temporary vectors required by the CG solver, and those required by the smoothers,
> (4) internal memory allocated by MatPtAP,
> (5) memory associated with IS's used within PCTelescope.
>
> So either I am completely off in my estimates, or you have not carefully estimated the memory usage of your application code. Hopefully others might examine/correct my rough estimates.
>
> Since I don't have your code I cannot assess the latter.
> Since I don't have access to the same machine you are running on, I think we need to take a step back.
>
> [1] What machine are you running on? Send me a URL if it's available.
>
> [2] What discretization are you using? (I am guessing a scalar 7-point FD stencil.)
> If it's a 7-point FD stencil, we should be able to examine the memory usage of your solver configuration using a standard, lightweight existing PETSc example, run on your machine at the same scale.
> This would hopefully enable us to correctly evaluate the actual memory usage required by the solver configuration you are using.
>
> Thanks,
> Dave
>
>> Frank
>>
>> On 07/08/2016 10:38 PM, Dave May wrote:
>>
>> On Saturday, 9 July 2016, frank <hengj...@uci.edu> wrote:
>>
>>> Hi Barry and Dave,
>>>
>>> Thank you both for the advice.
>>>
>>> @Barry
>>> I made a mistake in the file names in the last email. I attached the correct files this time.
>>> For all three tests, 'Telescope' is used as the coarse preconditioner.
>>>
>>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12
>>> Part of the memory usage:
>>>   Vector   125   124   3971904   0.
>>>   Matrix   101   101   9462372   0.
>>>
>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24
>>> Part of the memory usage:
>>>   Vector   125   124   681672    0.
>>>   Matrix   101   101   1462180   0.
>>>
>>> In theory, the memory usage in Test1 should be 8 times that of Test2. In my case, it is about 6 times.
>>>
>>> == Test3: Grid: 3072*256*768, Process Mesh: 96*8*24. Sub-domain per process: 32*32*32
>>> Here I get the out-of-memory error.
>>>
>>> I tried to use -mg_coarse jacobi. In this way, I don't need to set -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly, right?
>>> The linear solver didn't work in this case. Petsc output some errors.
>>>
>>> @Dave
>>> In test3, I use only one instance of 'Telescope'. On the coarse mesh of 'Telescope', I used LU as the preconditioner instead of SVD.
>>> If I set the levels correctly, then on the last coarse mesh of MG where it calls 'Telescope', the sub-domain per process is 2*2*2.
>>> On the last coarse mesh of 'Telescope', there is only one grid point per process.
>>> I still got the OOM error. The detailed petsc option file is attached.
>>
>> Do you understand the expected memory usage for the particular parallel LU implementation you are using? I don't (seriously). Replace LU with bjacobi and re-run this test. My point about solver debugging is still valid.
>>
>> And please send the result of KSPView so we can see what is actually used in the computations.
>>
>> Thanks
>> Dave
>>
>>> Thank you so much.
>>>
>>> Frank
>>>
>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>>
>>>> On Jul 6, 2016, at 4:19 PM, frank <hengj...@uci.edu> wrote:
>>>>
>>>>> Hi Barry,
>>>>>
>>>>> Thank you for your advice.
>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768 and the process mesh is 96*8*24.
>>>>> The linear solver is 'cg', the preconditioner is 'mg', and 'telescope' is used as the preconditioner on the coarse mesh.
>>>>> The system gives me the "Out of Memory" error before the linear system is completely solved.
>>>>> The info from '-ksp_view_pre' is attached. It seems to me that the error occurs when it reaches the coarse mesh.
>>>>>
>>>>> The 2nd test uses a grid of 1536*128*384 and the process mesh is 96*8*24.
>>>>> The 3rd test uses the same grid but a different process mesh, 48*4*12.
>>>>
>>>> Are you sure this is right? The total matrix and vector memory usage goes from the 2nd test
>>>>   Vector   384   383    8,193,712   0.
>>>>   Matrix   103   103   11,508,688   0.
>>>> to the 3rd test
>>>>   Vector   384   383    1,590,520   0.
>>>>   Matrix   103   103    3,508,664   0.
>>>> That is, the memory usage got smaller, but if you have only 1/8th the processes and the same grid it should have gotten about 8 times bigger. Did you maybe cut the grid by a factor of 8 also? If so that still doesn't explain it, because the memory usage changed by a factor of 5 something for the vectors and 3 something for the matrices.
>>>>
>>>>> The linear solver and petsc options in the 2nd and 3rd tests are the same as in the 1st test. The linear solver works fine in both tests.
>>>>> I attached the memory usage of the 2nd and 3rd tests. The memory info is from the option '-log_summary'. I tried to use '-memory_info' as you suggested, but in my case petsc treated it as an unused option. It output nothing about the memory. Do I need to add something to my code so I can use '-memory_info'?
>>>>
>>>> Sorry, my mistake: the option is -memory_view.
>>>>
>>>> Can you run the one case with -memory_view and -mg_coarse jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to see how much memory is used without the telescope? Also run case 2 the same way.
>>>>
>>>> Barry
>>>>
>>>>> In both tests the memory usage is not large.
>>>>>
>>>>> It seems to me that it might be the 'telescope' preconditioner that allocated a lot of memory and caused the error in the 1st test.
>>>>> Is there a way to show how much memory it allocated?
>>>>>
>>>>> Frank
>>>>>
>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>>>>
>>>>>> Frank,
>>>>>>
>>>>>> You can run with -ksp_view_pre to have it "view" the KSP before the solve, so hopefully it gets that far.
>>>>>>
>>>>>> Please run the problem that does fit with -memory_info; when the problem completes it will show the "high water mark" for PETSc allocated memory and total memory used. We first want to look at these numbers to see if it is using more memory than you expect. You could also run with, say, half the grid spacing to see how the memory usage scales with the increase in grid points. Make the runs also with -log_view and send all the output from these options.
>>>>>>
>>>>>> Barry
>>>>>>
>>>>>> On Jul 5, 2016, at 5:23 PM, frank <hengj...@uci.edu> wrote:
>>>>>>
>>>>>>> Hi,
>>>>>>>
>>>>>>> I am using the CG ksp solver and multigrid preconditioner to solve a linear system in parallel.
>>>>>>> I chose to use 'Telescope' as the preconditioner on the coarse mesh for its good performance.
>>>>>>> The petsc options file is attached.
>>>>>>>
>>>>>>> The domain is a 3d box.
>>>>>>> It works well when the grid is 1536*128*384 and the process mesh is 96*8*24. When I double the grid in each direction and keep the same process mesh and petsc options, I get an "out of memory" error from the super-cluster I am using.
>>>>>>> Each process has access to at least 8 GB of memory, which should be more than enough for my application. I am sure that all the other parts of my code (except the linear solver) do not use much memory. So I suspect that something is wrong with the linear solver.
>>>>>>> The error occurs before the linear system is completely solved, so I don't have the info from ksp_view. I am not able to reproduce the error with a smaller problem either.
>>>>>>> In addition, I tried to use block Jacobi as the preconditioner with the same grid and the same decomposition. The linear solver runs extremely slowly but there is no memory error.
>>>>>>>
>>>>>>> How can I diagnose what exactly causes the error?
>>>>>>> Thank you so much.
>>>>>>>
>>>>>>> Frank
>>>>>>> <petsc_options.txt>
>>>>>
>>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
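The high-water-mark numbers reported by -memory_view (or by the memory section of -log_summary/-log_view) can also be queried from the application itself, which makes it easier to tell whether the peak comes from the solver or from elsewhere in the code. Below is a minimal sketch; it assumes PETSc's PetscMemoryGetMaximumUsage()/PetscMallocGetMaximumUsage() interfaces, calls PetscMemorySetGetMaximumUsage() early so the resident-set maximum is tracked, and depending on the PETSc version the PetscMalloc maximum may additionally need -malloc_log. The actual DMDA/KSP setup from this thread is omitted.

/* mem_report.c: print per-rank PETSc memory high-water marks after a solve (sketch only). */
#include <petscsys.h>

int main(int argc, char **argv)
{
  PetscErrorCode ierr;
  PetscLogDouble rss, rss_max, mal, mal_max;
  PetscMPIInt    rank;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL);CHKERRQ(ierr);
  ierr = PetscMemorySetGetMaximumUsage();CHKERRQ(ierr);   /* enable tracking of the peak RSS */
  ierr = MPI_Comm_rank(PETSC_COMM_WORLD, &rank);CHKERRQ(ierr);

  /* ... create the DMDA, assemble the operator, KSPSolve(), etc. ... */

  ierr = PetscMemoryGetCurrentUsage(&rss);CHKERRQ(ierr);      /* current resident set size of this rank  */
  ierr = PetscMemoryGetMaximumUsage(&rss_max);CHKERRQ(ierr);  /* peak RSS since the Set call              */
  ierr = PetscMallocGetCurrentUsage(&mal);CHKERRQ(ierr);      /* bytes currently allocated by PetscMalloc */
  ierr = PetscMallocGetMaximumUsage(&mal_max);CHKERRQ(ierr);  /* peak bytes allocated by PetscMalloc      */

  ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD,
           "[%d] RSS %g MB (max %g MB), PetscMalloc %g MB (max %g MB)\n",
           rank, rss/1e6, rss_max/1e6, mal/1e6, mal_max/1e6);CHKERRQ(ierr);
  ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);CHKERRQ(ierr);

  ierr = PetscFinalize();
  return ierr;
}

Comparing these per-rank peaks against a per-rank estimate of the kind Dave sketched above (operator non-zeros, the Galerkin-coarsened levels, the gathered PCTelescope operator, temporary vectors) is a quick way to see whether the solver configuration or the rest of the application is responsible for an out-of-memory failure.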