On Thursday, 15 September 2016, Hengjie Wang <hengj...@uci.edu> wrote:
> Hi Dave,
>
> Sorry, I should have put more comments in to explain the code.

No problem. I was looking at the code after only 3 hrs of sleep....

> The number of processes in each dimension is the same: Px = Py = Pz = P.
> So is the domain size.
> So if you want to run the code for 512^3 grid points on 16^3 cores, you
> need to set "-N 512 -P 16" on the command line.
> I added more comments and also fixed an error in the attached code. (The
> error only affects the accuracy of the solution, not the memory usage.)

Yep thanks, I see that now. I know this is only a test, but this is kinda
clunky. The DMDA can automatically choose the partition, and if the user
wants control over it, they can use the command line options
-da_processors_{x,y,z} (as in your options file). For my testing purposes
I'll have to tweak your code, as I don't want to have to change two options
every time I change the partition size or mesh size (I'll certainly get it
wrong every second time, leading to a loss of my time due to queue wait
times).

Thanks,
Dave
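To make the point above concrete, here is a minimal sketch (not Frank's
actual test code) of a DMDA setup where PETSc chooses the process partition
itself, so only the mesh size needs specifying. It assumes a recent PETSc
(3.7-era option-query signature) and omits error checking for brevity:

    /* Minimal sketch: create a 3d DMDA for a 7-point FD Laplacian and let
     * PETSc pick the process partition. PETSC_DECIDE removes the need for
     * a -P option; -da_processors_{x,y,z} can still override it. */
    #include <petscdmda.h>

    int main(int argc, char **argv)
    {
      DM       da;
      PetscInt N = 512; /* global grid points per dimension, set with -N */

      PetscInitialize(&argc, &argv, NULL, NULL);
      PetscOptionsGetInt(NULL, NULL, "-N", &N, NULL);
      DMDACreate3d(PETSC_COMM_WORLD,
                   DM_BOUNDARY_NONE, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                   DMDA_STENCIL_STAR,
                   N, N, N,                                  /* global mesh */
                   PETSC_DECIDE, PETSC_DECIDE, PETSC_DECIDE, /* partition   */
                   1,   /* dof: scalar problem */
                   1,   /* stencil width: 1 suffices for 7-pnt Laplace */
                   NULL, NULL, NULL, &da);
      DMSetFromOptions(da); /* honours -da_processors_x etc. */
      DMSetUp(da);
      /* ... create the operator and solve, as in the attached code ... */
      DMDestroy(&da);
      PetscFinalize();
      return 0;
    }

With PETSC_DECIDE in place, running on a different core count requires no
change to the options, and the partition can still be forced from the
command line when desired.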
> Thank you.
> Frank
>
> On 9/14/2016 9:05 PM, Dave May wrote:
>
> On Thursday, 15 September 2016, Dave May <dave.mayhe...@gmail.com> wrote:
>>
>> On Thursday, 15 September 2016, frank <hengj...@uci.edu> wrote:
>>>
>>> Hi,
>>>
>>> I wrote a simple code to reproduce the error. I hope this can help to
>>> diagnose the problem.
>>> The code just solves a 3d Poisson equation.
>>
>> Why is the stencil width a runtime parameter?? And why is the default
>> value 2? For 7-pnt FD Laplace, you only need a stencil width of 1.
>>
>> Was this choice made to mimic something in the real application code?
>
> Please ignore - I misunderstood your usage of the param set by -P
>
>>> I ran the code on a 1024^3 mesh. The process partition is 32 * 32 * 32.
>>> That's when I reproduce the OOM error. Each core has about 2G memory.
>>> I also ran the code on a 512^3 mesh with 16 * 16 * 16 processes. The
>>> ksp solver works fine.
>>> I attached the code, ksp_view_pre's output and my petsc option file.
>>>
>>> Thank you.
>>> Frank
>>>
>>> On 09/09/2016 06:38 PM, Hengjie Wang wrote:
>>>
>>> Hi Barry,
>>>
>>> I checked. On the supercomputer, I had the option "-ksp_view_pre" but
>>> it is not in the file I sent you. I am sorry for the confusion.
>>>
>>> Regards,
>>> Frank
>>>
>>> On Friday, September 9, 2016, Barry Smith <bsm...@mcs.anl.gov> wrote:
>>>>
>>>> > On Sep 9, 2016, at 3:11 PM, frank <hengj...@uci.edu> wrote:
>>>> >
>>>> > Hi Barry,
>>>> >
>>>> > I think the first KSP view output is from -ksp_view_pre. Before I
>>>> > submitted the test, I was not sure whether there would be an OOM
>>>> > error or not, so I added both -ksp_view_pre and -ksp_view.
>>>>
>>>>    But the options file you sent specifically does NOT list
>>>> -ksp_view_pre, so how could it be from that?
>>>>
>>>>    Sorry to be pedantic, but I've spent too much time in the past
>>>> trying to debug from incorrect information, and I want to make sure
>>>> that the information I have is correct before thinking. Please recheck
>>>> exactly what happened. Rerun with the exact input file you emailed if
>>>> that is needed.
>>>>
>>>>    Barry
>>>>
>>>> > Frank
>>>> >
>>>> > On 09/09/2016 12:38 PM, Barry Smith wrote:
>>>> >>    Why does ksp_view2.txt have two KSP views in it while
>>>> >> ksp_view1.txt has only one KSPView in it? Did you run two different
>>>> >> solves in the 2nd case but not the 1st?
>>>> >>
>>>> >>    Barry
>>>> >>
>>>> >>> On Sep 9, 2016, at 10:56 AM, frank <hengj...@uci.edu> wrote:
>>>> >>>
>>>> >>> Hi,
>>>> >>>
>>>> >>> I want to continue digging into the memory problem here.
>>>> >>> I did find a workaround in the past, which is to use fewer cores
>>>> >>> per node so that each core has 8G memory. However this is
>>>> >>> inefficient and expensive. I hope to locate the place that uses
>>>> >>> the most memory.
>>>> >>>
>>>> >>> Here is a brief summary of the tests I did in the past:
>>>> >>>
>>>> >>> Test1: Mesh 1536*128*384 | Process Mesh 48*4*12
>>>> >>>   Maximum (over computational time) process memory:        total 7.0727e+08
>>>> >>>   Current process memory:                                  total 7.0727e+08
>>>> >>>   Maximum (over computational time) space PetscMalloc()ed: total 6.3908e+11
>>>> >>>   Current space PetscMalloc()ed:                           total 1.8275e+09
>>>> >>>
>>>> >>> Test2: Mesh 1536*128*384 | Process Mesh 96*8*24
>>>> >>>   Maximum (over computational time) process memory:        total 5.9431e+09
>>>> >>>   Current process memory:                                  total 5.9431e+09
>>>> >>>   Maximum (over computational time) space PetscMalloc()ed: total 5.3202e+12
>>>> >>>   Current space PetscMalloc()ed:                           total 5.4844e+09
>>>> >>>
>>>> >>> Test3: Mesh 3072*256*768 | Process Mesh 96*8*24
>>>> >>>   The OOM (Out Of Memory) killer of the supercomputer terminated
>>>> >>>   the job during "KSPSolve".
>>>> >>>
>>>> >>> I attached the output of ksp_view (the third test's output is from
>>>> >>> ksp_view_pre), memory_view and also the petsc options.
>>>> >>>
>>>> >>> In all the tests, each core can access about 2G memory. In test3,
>>>> >>> there are 4223139840 non-zeros in the matrix. This will consume
>>>> >>> about 1.74 MB per process, using double precision. Considering
>>>> >>> some extra memory used to store the integer indices, 2G memory
>>>> >>> should still be more than enough.
>>>> >>>
>>>> >>> Is there a way to find out which part of KSPSolve uses the most
>>>> >>> memory?
>>>> >>> Thank you so much.
>>>> >>>
>>>> >>> BTW, there are 4 options that remain unused and I don't understand
>>>> >>> why they are omitted:
>>>> >>> -mg_coarse_telescope_mg_coarse_ksp_type value: preonly
>>>> >>> -mg_coarse_telescope_mg_coarse_pc_type value: bjacobi
>>>> >>> -mg_coarse_telescope_mg_levels_ksp_max_it value: 1
>>>> >>> -mg_coarse_telescope_mg_levels_ksp_type value: richardson
>>>> >>>
>>>> >>> Regards,
>>>> >>> Frank
>>>> >>>
>>>> >>> On 07/13/2016 05:47 PM, Dave May wrote:
>>>> >>>>
>>>> >>>> On 14 July 2016 at 01:07, frank <hengj...@uci.edu> wrote:
>>>> >>>> Hi Dave,
>>>> >>>>
>>>> >>>> Sorry for the late reply.
>>>> >>>> Thank you so much for your detailed reply.
>>>> >>>>
>>>> >>>> I have a question about the estimation of the memory usage. There
>>>> >>>> are 4223139840 allocated non-zeros and 18432 MPI processes.
>>>> >>>> Double precision is used. So the memory per process is:
>>>> >>>>   4223139840 * 8 bytes / 18432 / 1024 / 1024 = 1.74 MB?
>>>> >>>> Did I do something wrong here? Because this seems too small.
>>>> >>>>
>>>> >>>> No - I totally f***ed it up. You are correct. That'll teach me
>>>> >>>> for fumbling around with my iphone calculator and not using my
>>>> >>>> brain. (Note that to convert to MB just divide by 1e6, not 1024^2
>>>> >>>> - although I apparently cannot convert between units
>>>> >>>> correctly....)
>>>> >>>>
>>>> >>>> From the PETSc objects associated with the solver, it looks like
>>>> >>>> it _should_ run with 2GB per MPI rank. Sorry for my mistake.
>>>> >>>> Possibilities are: somewhere in your usage of PETSc you've
>>>> >>>> introduced a memory leak; PETSc is doing a huge over-allocation
>>>> >>>> (e.g. as per our discussion of MatPtAP); or in your application
>>>> >>>> code there are other objects you have forgotten to log the memory
>>>> >>>> for.
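Frank's per-process estimate can be checked mechanically. A self-contained
sketch (the nonzero and rank counts are taken from this thread; the 4-byte
index size assumes PETSc's default 32-bit PetscInt):

    /* Back-of-the-envelope check of the per-rank memory estimate above. */
    #include <stdio.h>

    int main(void)
    {
      const double nnz   = 4223139840.0; /* allocated non-zeros (global) */
      const double ranks = 18432.0;      /* 96 * 8 * 24 MPI processes    */
      const double MiB   = 1024.0 * 1024.0;

      /* matrix values, double precision */
      printf("values : %.2f MiB per rank\n", nnz * 8.0 / ranks / MiB); /* ~1.74 */
      /* AIJ column indices, one per nonzero */
      printf("indices: %.2f MiB per rank\n", nnz * 4.0 / ranks / MiB); /* ~0.87 */
      return 0;
    }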
>>>> >>>> I am running this job on Bluewater.
>>>> >>>> I am using the 7-point FD stencil in 3D.
>>>> >>>>
>>>> >>>> I thought so on both counts.
>>>> >>>>
>>>> >>>> I apologize that I made a stupid mistake in computing the memory
>>>> >>>> per core. My settings meant each core could access only 2G memory
>>>> >>>> on average, instead of the 8G I mentioned in a previous email. I
>>>> >>>> re-ran the job with 8G memory per core on average and there is no
>>>> >>>> "Out Of Memory" error. I will do more tests to see if there is
>>>> >>>> still some memory issue.
>>>> >>>>
>>>> >>>> Ok. I'd still like to know where the memory was being used since
>>>> >>>> my estimates were off.
>>>> >>>>
>>>> >>>> Thanks,
>>>> >>>> Dave
>>>> >>>>
>>>> >>>> Regards,
>>>> >>>> Frank
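For reference, the operator under discussion is assembled roughly as in the
following sketch (illustrative, not Frank's application code): a 7-point FD
Laplacian on the DMDA, which is where the ~4.2e9 nonzeros come from. It
assumes a matrix obtained from DMCreateMatrix() on the DMDA sketched
earlier (dof = 1, stencil width 1) and unit grid spacing:

    #include <petscdmda.h>

    /* Sketch: fill A with a 7-point Laplacian stencil. Boundary rows
     * simply omit the out-of-domain neighbours; a real code would scale
     * the entries by 1/h^2 and handle boundary conditions properly. */
    static PetscErrorCode AssembleLaplacian(DM da, Mat A)
    {
      DMDALocalInfo info;
      PetscInt      i, j, k;

      DMDAGetLocalInfo(da, &info);
      for (k = info.zs; k < info.zs + info.zm; k++) {
        for (j = info.ys; j < info.ys + info.ym; j++) {
          for (i = info.xs; i < info.xs + info.xm; i++) {
            MatStencil  row, col[7];     /* .c unused for a single dof */
            PetscScalar v[7];
            PetscInt    n = 0;

            row.i = i; row.j = j; row.k = k;
            /* diagonal, then the six neighbours that exist */
            col[n].i = i; col[n].j = j; col[n].k = k; v[n++] = -6.0;
            if (i > 0)           { col[n].i = i-1; col[n].j = j;   col[n].k = k;   v[n++] = 1.0; }
            if (i < info.mx - 1) { col[n].i = i+1; col[n].j = j;   col[n].k = k;   v[n++] = 1.0; }
            if (j > 0)           { col[n].i = i;   col[n].j = j-1; col[n].k = k;   v[n++] = 1.0; }
            if (j < info.my - 1) { col[n].i = i;   col[n].j = j+1; col[n].k = k;   v[n++] = 1.0; }
            if (k > 0)           { col[n].i = i;   col[n].j = j;   col[n].k = k-1; v[n++] = 1.0; }
            if (k < info.mz - 1) { col[n].i = i;   col[n].j = j;   col[n].k = k+1; v[n++] = 1.0; }
            MatSetValuesStencil(A, 1, &row, n, col, v, INSERT_VALUES);
          }
        }
      }
      MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);
      MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);
      return 0;
    }

At most 7 entries per row, 8 bytes per value, is exactly the accounting
used in the estimates above.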
>>>> >>>> On 07/11/2016 01:18 PM, Dave May wrote:
>>>> >>>>> Hi Frank,
>>>> >>>>>
>>>> >>>>> On 11 July 2016 at 19:14, frank <hengj...@uci.edu> wrote:
>>>> >>>>> Hi Dave,
>>>> >>>>>
>>>> >>>>> I re-ran the test using bjacobi as the preconditioner on the
>>>> >>>>> coarse mesh of telescope. The grid is 3072*256*768 and the
>>>> >>>>> process mesh is 96*8*24. The petsc option file is attached.
>>>> >>>>> I still got the "Out Of Memory" error. The error occurred before
>>>> >>>>> the linear solver finished one step, so I don't have the full
>>>> >>>>> info from ksp_view. The info from ksp_view_pre is attached.
>>>> >>>>>
>>>> >>>>> Okay - that is essentially useless (sorry)
>>>> >>>>>
>>>> >>>>> It seems to me that the error occurred when the decomposition
>>>> >>>>> was going to be changed.
>>>> >>>>>
>>>> >>>>> Based on what information?
>>>> >>>>> Running with -info would give us more clues, but will create a
>>>> >>>>> ton of output.
>>>> >>>>> Please try running the case which failed with -info.
>>>> >>>>>
>>>> >>>>> I had another test with a grid of 1536*128*384 and the same
>>>> >>>>> process mesh as above. There was no error. The ksp_view info is
>>>> >>>>> attached for comparison.
>>>> >>>>> Thank you.
>>>> >>>>>
>>>> >>>>> [3] Here is my crude estimate of your memory usage.
>>>> >>>>> I'll target the biggest memory hogs only to get an order of
>>>> >>>>> magnitude estimate.
>>>> >>>>>
>>>> >>>>> * The fine grid operator contains 4223139840 non-zeros --> 1.8
>>>> >>>>> GB per MPI rank assuming double precision.
>>>> >>>>> The indices for the AIJ could amount to another 0.3 GB (assuming
>>>> >>>>> 32 bit integers).
>>>> >>>>>
>>>> >>>>> * You use 5 levels of coarsening, so the other operators should
>>>> >>>>> represent (collectively)
>>>> >>>>> 2.1/8 + 2.1/8^2 + 2.1/8^3 + 2.1/8^4 ~ 300 MB per MPI rank on the
>>>> >>>>> communicator with 18432 ranks.
>>>> >>>>> The coarse grid should consume ~ 0.5 MB per MPI rank on the
>>>> >>>>> communicator with 18432 ranks.
>>>> >>>>>
>>>> >>>>> * You use a reduction factor of 64, making the new communicator
>>>> >>>>> have 288 MPI ranks.
>>>> >>>>> PCTelescope will first gather a temporary matrix associated with
>>>> >>>>> your coarse level operator, assuming a comm size of 288, living
>>>> >>>>> on the comm with size 18432.
>>>> >>>>> This matrix will require approximately 0.5 * 64 = 32 MB per core
>>>> >>>>> on the 288 ranks.
>>>> >>>>> This matrix is then used to form a new MPIAIJ matrix on the
>>>> >>>>> subcomm, thus requiring another 32 MB per rank.
>>>> >>>>> The temporary matrix is now destroyed.
>>>> >>>>>
>>>> >>>>> * Because a DMDA is detected, a permutation matrix is assembled.
>>>> >>>>> This requires 2 doubles per point in the DMDA.
>>>> >>>>> Your coarse DMDA contains 92 x 16 x 48 points.
>>>> >>>>> Thus the permutation matrix will require < 1 MB per MPI rank on
>>>> >>>>> the sub-comm.
>>>> >>>>>
>>>> >>>>> * Lastly, the matrix is permuted. This uses MatPtAP(), but the
>>>> >>>>> resulting operator will have the same memory footprint as the
>>>> >>>>> unpermuted matrix (32 MB). At any stage in PCTelescope, only 2
>>>> >>>>> operators of size 32 MB are held in memory when the DMDA is
>>>> >>>>> provided.
>>>> >>>>>
>>>> >>>>> From my rough estimates, the worst case memory footprint for any
>>>> >>>>> given core, given your options, is approximately
>>>> >>>>>   2100 MB + 300 MB + 32 MB + 32 MB + 1 MB = 2465 MB
>>>> >>>>> This is way below 8 GB.
>>>> >>>>>
>>>> >>>>> Note this estimate completely ignores:
>>>> >>>>> (1) the memory required for the restriction operator,
>>>> >>>>> (2) the potential growth in the number of non-zeros per row due
>>>> >>>>> to Galerkin coarsening (I wish -ksp_view_pre reported the output
>>>> >>>>> from MatView so we could see the number of non-zeros required by
>>>> >>>>> the coarse level operators),
>>>> >>>>> (3) all temporary vectors required by the CG solver, and those
>>>> >>>>> required by the smoothers,
>>>> >>>>> (4) internal memory allocated by MatPtAP,
>>>> >>>>> (5) memory associated with ISs used within PCTelescope.
>>>> >>>>>
>>>> >>>>> So either I am completely off in my estimates, or you have not
>>>> >>>>> carefully estimated the memory usage of your application code.
>>>> >>>>> Hopefully others might examine/correct my rough estimates.
>>>> >>>>>
>>>> >>>>> Since I don't have your code I cannot assess the latter.
>>>> >>>>> Since I don't have access to the same machine you are running
>>>> >>>>> on, I think we need to take a step back.
>>>> >>>>>
>>>> >>>>> [1] What machine are you running on? Send me a URL if it's
>>>> >>>>> available.
>>>> >>>>>
>>>> >>>>> [2] What discretization are you using? (I am guessing a scalar
>>>> >>>>> 7-point FD stencil.)
>>>> >>>>> If it's a 7-point FD stencil, we should be able to examine the
>>>> >>>>> memory usage of your solver configuration using a standard,
>>>> >>>>> lightweight existing PETSc example, run on your machine at the
>>>> >>>>> same scale.
>>>> >>>>> This would hopefully enable us to correctly evaluate the actual
>>>> >>>>> memory usage required by the solver configuration you are using.
>>>> >>>>>
>>>> >>>>> Thanks,
>>>> >>>>> Dave
>>>> >>>>>
>>>> >>>>> Frank
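Point [2] could look like the following in practice. A sketch (untested),
assuming PETSc's KSP tutorial ex45 (a 3D Laplacian on a DMDA); the
launcher, executable path, and options file name are placeholders:

    # Reproduce the solver's footprint with a stock example at full scale
    mpiexec -n 18432 ./ex45 \
        -da_grid_x 3072 -da_grid_y 256 -da_grid_z 768 \
        -options_file petsc_options.txt \
        -memory_view -log_view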
>>>> >>>>> On 07/08/2016 10:38 PM, Dave May wrote:
>>>> >>>>>>
>>>> >>>>>> On Saturday, 9 July 2016, frank <hengj...@uci.edu> wrote:
>>>> >>>>>> Hi Barry and Dave,
>>>> >>>>>>
>>>> >>>>>> Thank both of you for the advice.
>>>> >>>>>>
>>>> >>>>>> @Barry
>>>> >>>>>> I made a mistake in the file names in the last email. I
>>>> >>>>>> attached the correct files this time.
>>>> >>>>>> For all three tests, 'Telescope' is used as the coarse
>>>> >>>>>> preconditioner.
>>>> >>>>>>
>>>> >>>>>> == Test1: Grid: 1536*128*384, Process Mesh: 48*4*12
>>>> >>>>>> Part of the memory usage:
>>>> >>>>>>   Vector   125   124   3971904   0.
>>>> >>>>>>   Matrix   101   101   9462372   0.
>>>> >>>>>>
>>>> >>>>>> == Test2: Grid: 1536*128*384, Process Mesh: 96*8*24
>>>> >>>>>> Part of the memory usage:
>>>> >>>>>>   Vector   125   124    681672   0.
>>>> >>>>>>   Matrix   101   101   1462180   0.
>>>> >>>>>>
>>>> >>>>>> In theory, the per-process memory usage in Test1 should be 8
>>>> >>>>>> times that of Test2. In my case, it is about 6 times.
>>>> >>>>>>
>>>> >>>>>> == Test3: Grid: 3072*256*768, Process Mesh: 96*8*24.
>>>> >>>>>> Sub-domain per process: 32*32*32
>>>> >>>>>> Here I get the out of memory error.
>>>> >>>>>>
>>>> >>>>>> I tried to use -mg_coarse jacobi. In this way, I don't need to
>>>> >>>>>> set -mg_coarse_ksp_type and -mg_coarse_pc_type explicitly,
>>>> >>>>>> right?
>>>> >>>>>> The linear solver didn't work in this case. PETSc output some
>>>> >>>>>> errors.
>>>> >>>>>>
>>>> >>>>>> @Dave
>>>> >>>>>> In test3, I use only one instance of 'Telescope'. On the coarse
>>>> >>>>>> mesh of 'Telescope', I used LU as the preconditioner instead of
>>>> >>>>>> SVD.
>>>> >>>>>> If I set the levels correctly, then on the last coarse mesh of
>>>> >>>>>> MG where it calls 'Telescope', the sub-domain per process is
>>>> >>>>>> 2*2*2.
>>>> >>>>>> On the last coarse mesh of 'Telescope', there is only one grid
>>>> >>>>>> point per process.
>>>> >>>>>> I still got the OOM error. The detailed petsc option file is
>>>> >>>>>> attached.
>>>> >>>>>>
>>>> >>>>>> Do you understand the expected memory usage for the particular
>>>> >>>>>> parallel LU implementation you are using? I don't (seriously).
>>>> >>>>>> Replace LU with bjacobi and re-run this test. My point about
>>>> >>>>>> solver debugging is still valid.
>>>> >>>>>>
>>>> >>>>>> And please send the result of KSPView so we can see what is
>>>> >>>>>> actually used in the computations.
>>>> >>>>>>
>>>> >>>>>> Thanks,
>>>> >>>>>> Dave
>>>> >>>>>>
>>>> >>>>>> Thank you so much.
>>>> >>>>>>
>>>> >>>>>> Frank
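For orientation, a hypothetical excerpt of the kind of options file being
discussed, with bjacobi in place of LU on the telescope sub-communicator as
suggested above. Option names follow PETSc's prefix scheme (options for the
solver PCTelescope builds on the sub-communicator carry the "telescope_"
prefix); the level count and reduction factor here are illustrative, not
Frank's exact settings:

    -ksp_type cg
    -pc_type mg
    -pc_mg_levels 5
    -mg_coarse_ksp_type preonly
    -mg_coarse_pc_type telescope
    -mg_coarse_pc_telescope_reduction_factor 64
    # on the sub-communicator: bjacobi instead of parallel LU
    -mg_coarse_telescope_ksp_type preonly
    -mg_coarse_telescope_pc_type bjacobi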
>>>> >>>>>> On 07/06/2016 02:51 PM, Barry Smith wrote:
>>>> >>>>>> On Jul 6, 2016, at 4:19 PM, frank <hengj...@uci.edu> wrote:
>>>> >>>>>>
>>>> >>>>>> Hi Barry,
>>>> >>>>>>
>>>> >>>>>> Thank you for your advice.
>>>> >>>>>> I tried three tests. In the 1st test, the grid is 3072*256*768
>>>> >>>>>> and the process mesh is 96*8*24.
>>>> >>>>>> The linear solver is 'cg', the preconditioner is 'mg' and
>>>> >>>>>> 'telescope' is used as the preconditioner on the coarse mesh.
>>>> >>>>>> The system gives me the "Out of Memory" error before the linear
>>>> >>>>>> system is completely solved.
>>>> >>>>>> The info from '-ksp_view_pre' is attached. It seems to me that
>>>> >>>>>> the error occurs when it reaches the coarse mesh.
>>>> >>>>>>
>>>> >>>>>> The 2nd test uses a grid of 1536*128*384 and the process mesh
>>>> >>>>>> is 96*8*24. The 3rd test uses the same grid but a different
>>>> >>>>>> process mesh, 48*4*12.
>>>> >>>>>>
>>>> >>>>>>    Are you sure this is right? The total matrix and vector
>>>> >>>>>> memory usage goes from the 2nd test
>>>> >>>>>>   Vector   384   383    8,193,712   0.
>>>> >>>>>>   Matrix   103   103   11,508,688   0.
>>>> >>>>>> to the 3rd test
>>>> >>>>>>   Vector   384   383    1,590,520   0.
>>>> >>>>>>   Matrix   103   103    3,508,664   0.
>>>> >>>>>> That is, the memory usage got smaller, but if you have only
>>>> >>>>>> 1/8th the processes and the same grid it should have gotten
>>>> >>>>>> about 8 times bigger. Did you maybe cut the grid by a factor of
>>>> >>>>>> 8 also? If so, that still doesn't explain it, because the
>>>> >>>>>> memory usage changed by a factor of 5-something for the vectors
>>>> >>>>>> and 3-something for the matrices.
>>>> >>>>>>
>>>> >>>>>> The linear solver and petsc options in the 2nd and 3rd tests
>>>> >>>>>> are the same as in the 1st test. The linear solver works fine
>>>> >>>>>> in both tests.
>>>> >>>>>> I attached the memory usage of the 2nd and 3rd tests. The
>>>> >>>>>> memory info is from the option '-log_summary'. I tried to use
>>>> >>>>>> '-memory_info' as you suggested, but in my case petsc treated
>>>> >>>>>> it as an unused option. It output nothing about the memory. Do
>>>> >>>>>> I need to add something to my code so I can use '-memory_info'?
>>>> >>>>>>
>>>> >>>>>>    Sorry, my mistake: the option is -memory_view.
>>>> >>>>>>
>>>> >>>>>>    Can you run the one case with -memory_view and -mg_coarse
>>>> >>>>>> jacobi -ksp_max_it 1 (just so it doesn't iterate forever) to
>>>> >>>>>> see how much memory is used without the telescope? Also run
>>>> >>>>>> case 2 the same way.
>>>> >>>>>>
>>>> >>>>>>    Barry
>>>> >>>>>>
>>>> >>>>>> In both tests the memory usage is not large.
>>>> >>>>>>
>>>> >>>>>> It seems to me that it might be the 'telescope' preconditioner
>>>> >>>>>> that allocated a lot of memory and caused the error in the 1st
>>>> >>>>>> test.
>>>> >>>>>> Is there a way to show how much memory it allocated?
>>>> >>>>>>
>>>> >>>>>> Frank
>>>> >>>>>>
>>>> >>>>>> On 07/05/2016 03:37 PM, Barry Smith wrote:
>>>> >>>>>>    Frank,
>>>> >>>>>>
>>>> >>>>>>    You can run with -ksp_view_pre to have it "view" the KSP
>>>> >>>>>> before the solve, so hopefully it gets that far.
>>>> >>>>>>
>>>> >>>>>>    Please run the problem that does fit with -memory_info; when
>>>> >>>>>> the problem completes it will show the "high water mark" for
>>>> >>>>>> PETSc allocated memory and total memory used. We first want to
>>>> >>>>>> look at these numbers to see if it is using more memory than
>>>> >>>>>> you expect. You could also run with, say, half the grid spacing
>>>> >>>>>> to see how the memory usage scales with the increase in grid
>>>> >>>>>> points. Make the runs also with -log_view and send all the
>>>> >>>>>> output from these options.
>>>> >>>>>>
>>>> >>>>>>    Barry
>>>> >>>>>>
>>>> >>>>>> On Jul 5, 2016, at 5:23 PM, frank <hengj...@uci.edu> wrote:
>>>> >>>>>>
>>>> >>>>>> Hi,
>>>> >>>>>>
>>>> >>>>>> I am using the CG ksp solver and the multigrid preconditioner
>>>> >>>>>> to solve a linear system in parallel.
>>>> >>>>>> I chose to use 'Telescope' as the preconditioner on the coarse
>>>> >>>>>> mesh for its good performance.
>>>> >>>>>> The petsc options file is attached.
>>>> >>>>>>
>>>> >>>>>> The domain is a 3d box.
>>>> >>>>>> It works well when the grid is 1536*128*384 and the process
>>>> >>>>>> mesh is 96*8*24. When I double the size of the grid and keep
>>>> >>>>>> the same process mesh and petsc options, I get an "out of
>>>> >>>>>> memory" error from the super-cluster I am using.
>>>> >>>>>> Each process has access to at least 8G memory, which should be
>>>> >>>>>> more than enough for my application. I am sure that all the
>>>> >>>>>> other parts of my code (except the linear solver) do not use
>>>> >>>>>> much memory. So I suspect that there is something wrong with
>>>> >>>>>> the linear solver.
>>>> >>>>>> The error occurs before the linear system is completely solved,
>>>> >>>>>> so I don't have the info from ksp view. I am not able to
>>>> >>>>>> reproduce the error with a smaller problem either.
>>>> >>>>>> In addition, I tried to use block jacobi as the preconditioner
>>>> >>>>>> with the same grid and same decomposition. The linear solver
>>>> >>>>>> runs extremely slowly but there is no memory error.
>>>> >>>>>>
>>>> >>>>>> How can I diagnose what exactly causes the error?
>>>> >>>>>> Thank you so much.
>>>> >>>>>>
>>>> >>>>>> Frank
>>>> >>>>>> <petsc_options.txt>
>>>> >>>>>> <ksp_view_pre.txt><memory_test2.txt><memory_test3.txt><petsc_options.txt>
>>>> >>> <ksp_view1.txt><ksp_view2.txt><ksp_view3.txt><memory1.txt><memory2.txt><petsc_options1.txt><petsc_options2.txt><petsc_options3.txt>
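A sketch of the diagnostic run Barry asks for above, applied to the test
code from the top of the thread (the launcher and executable name are
placeholders; -N and -P are the test code's options described earlier):

    # Run once with a jacobi coarse solve and a single iteration to
    # measure memory without telescope in the picture
    mpiexec -n 32768 ./test_poisson -N 1024 -P 32 \
        -options_file petsc_options.txt \
        -mg_coarse_ksp_type preonly -mg_coarse_pc_type jacobi \
        -ksp_max_it 1 -memory_view -ksp_view_pre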