I would suggest Fande add this new implementation into PETSc. What is the algorithm? I'll try to see if I can further reduce the memory consumption of the current implementation of symbolic PtAP when I get time.

Hong
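For context, the operation discussed in this thread is the Galerkin triple product C = P^T * A * P that PETSc computes in MatPtAP during multigrid setup; those are the MatPtAP / MatPtAPSymbolic / MatPtAPNumeric events in the logs quoted below. The following is a minimal, self-contained sketch of that call, not Fande's all-at-once implementation; the matrices are toy examples, and the exact -matptap_via option values available depend on the PETSc version (the thread itself only names the "ptap-scalable" and "hypre" algorithms).

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, P, C;
  PetscInt       i, rstart, rend, n = 8, m = 4;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* Fine-grid operator A: 1D Laplacian stencil, n x n */
  ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, 3, NULL, 2, NULL, &A); CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &rstart, &rend); CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES); CHKERRQ(ierr);
    if (i > 0)   { ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES); CHKERRQ(ierr); }
    if (i < n-1) { ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES); CHKERRQ(ierr); }
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

  /* Piecewise-constant prolongator P: n x m, each coarse dof aggregates n/m fine dofs */
  ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, m, 1, NULL, 1, NULL, &P); CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(P, &rstart, &rend); CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    ierr = MatSetValue(P, i, i / (n / m), 1.0, INSERT_VALUES); CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(P, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);
  ierr = MatAssemblyEnd(P, MAT_FINAL_ASSEMBLY); CHKERRQ(ierr);

  /* C = P^T A P; the last argument is only an estimate of the nonzero fill.
     For MPIAIJ matrices the algorithm (and hence the memory/time trade-off
     discussed in this thread) is chosen at run time, e.g.
     -matptap_via scalable or -matptap_via hypre; how the new all-at-once
     path gets selected is not pinned down in this thread. */
  ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, 2.0, &C); CHKERRQ(ierr);
  ierr = MatView(C, PETSC_VIEWER_STDOUT_WORLD); CHKERRQ(ierr);

  ierr = MatDestroy(&C); CHKERRQ(ierr);
  ierr = MatDestroy(&P); CHKERRQ(ierr);
  ierr = MatDestroy(&A); CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Running with -log_view produces per-event lines like the MatPtAP entries quoted in the thread below.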
On Fri, Apr 12, 2019 at 8:27 AM Mark Adams via petsc-dev <petsc-dev@mcs.anl.gov> wrote:

> On Thu, Apr 11, 2019 at 11:42 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:
>
>> > On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>> >
>> > Interesting, nice work.
>> >
>> > It would be interesting to get the flop counters working.
>> >
>> > This looks like GMG, I assume 3D.
>> >
>> > The degree of parallelism is not very realistic. You should probably run a 10x smaller problem, at least, or use 10x more processes.
>>
>> Why do you say that? He's got his machine with a certain amount of physical memory per node; are you saying he should ignore/not use 90% of that physical memory for his simulation?
>
> In my experience 1.5M equations/process is about 50x more than applications run, but this is just anecdotal. Some apps are dominated by the linear solver in terms of memory, but some apps use a lot of memory in the physics parts of the code.
>
> The one app that I can think of where the memory usage is dominated by the solver does like 10 (pseudo) time steps with pretty hard nonlinear solves, so in the end they are not bound by turnaround time. But they are kind of an odd (academic) application and not very representative of what I see in the broader comp sci community. And these guys do have a scalable code, so instead of waiting a week on the queue to run a 10-hour job that uses 10% of the machine, they wait a day to run a 2-hour job that takes 50% of the machine, because the centers' scheduling policies work that way.
>
>> He should buy a machine 10x bigger just because it means having fewer degrees of freedom per node (who's footing the bill for this purchase?). At INL they run simulations for a purpose, not just for scalability studies, and there are no dang GPUs or barely used over-sized monstrosities sitting around to brag about twice a year at SC.
>
> I guess they are the nuke guys. I've never worked with them or seen this kind of complexity analysis in their talks, but OK, if they fill up memory with the solver then this is representative of a significant (DOE) app.
>
>> Barry
>>
>> > I guess it does not matter. This is basically like a one-node run because the subdomains are so large.
>> >
>> > And are you sure the numerics are the same with and without hypre? Hypre is 15x slower. Any ideas what is going on?
>> >
>> > It might be interesting to scale this test down to a node to see if this is from communication.
>> >
>> > Again, nice work,
>> > Mark
>> >
>> >
>> > On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong...@gmail.com> wrote:
>> > Hi Developers,
>> >
>> > I just want to share some good news. It is known that PETSc ptap-scalable takes too much memory for some applications because it needs to build intermediate data structures. Following Mark's suggestions, I implemented the all-at-once algorithm that does not cache any intermediate data.
>> >
>> > I did some comparisons; the new implementation is actually scalable in terms of memory usage and compute time, even though it is still slower than "ptap-scalable". There are some memory profiling results (see the attachments). The new all-at-once implementation uses a similar amount of memory to hypre, but it is way faster than hypre.
>> >
>> > For example, for a problem with 14,893,346,880 unknowns using 10,000 processor cores, here are the timing results:
>> >
>> > Hypre algorithm:
>> >
>> > MatPtAP              50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>> > MatPtAPSymbolic      50 1.0 2.3969e-02 13.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
>> > MatPtAPNumeric       50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>> >
>> > PETSc scalable PtAP:
>> >
>> > MatPtAP              50 1.0 1.1453e+02  1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
>> > MatPtAPSymbolic      50 1.0 5.1562e+01  1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1  0  3  3  9   1  0  3  3  9      0
>> > MatPtAPNumeric       50 1.0 6.3072e+01  1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>> >
>> > New implementation of the all-at-once algorithm:
>> >
>> > MatPtAP              50 1.0 2.2153e+02  1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4  0  7  7 17   4  0  7  7 17      0
>> > MatPtAPSymbolic      50 1.0 1.1055e+02  1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2  0  5  4  6   2  0  5  4  6      0
>> > MatPtAPNumeric       50 1.0 1.1102e+02  1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2  0  2  3 11   2  0  2  3 11      0
>> >
>> >
>> > You can see here that all-at-once is a bit slower than ptap-scalable, but it uses much less memory.
>> >
>> >
>> > Fande
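Reading the Max time column of the -log_view events above (each run performs 50 MatPtAP calls), a rough worked comparison, assuming the symbolic and numeric phases account for essentially the whole event:

  hypre:          3.5353e+03 s total                              -> ~70.7 s per PtAP
  ptap-scalable:  5.1562e+01 + 6.3072e+01 ~ 1.1453e+02 s total    -> ~2.3 s per PtAP
  all-at-once:    1.1055e+02 + 1.1102e+02 ~ 2.2153e+02 s total    -> ~4.4 s per PtAP

So the all-at-once algorithm is roughly 2x slower than ptap-scalable on this run but about 16x faster than hypre, consistent with Fande's summary that it trades some speed against ptap-scalable for a hypre-like memory footprint.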