Interesting, nice work. It would be interesting to get the flop counters working.
This looks like GMG, I assume 3D.

The degree of parallelism is not very realistic. You should probably run a problem at least 10x smaller, or use 10x more processes. I guess it does not matter. This is basically like a one-node run because the subdomains are so large.

And are you sure the numerics are the same with and without hypre? Hypre is 15x slower. Any ideas what is going on?

It might be interesting to scale this test down to a node to see if this is from communication.

Again, nice work,
Mark

On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong...@gmail.com> wrote:

> Hi Developers,
>
> I just want to share some good news. It is known that PETSc-ptap-scalable
> takes too much memory for some applications because it needs to build
> intermediate data structures. Following Mark's suggestions, I
> implemented the all-at-once algorithm, which does not cache any intermediate
> data.
>
> I did some comparisons. The new implementation is actually scalable in
> terms of memory usage and compute time, even though it is still
> slower than "ptap-scalable". There are some memory profiling results (see
> the attachments). The new all-at-once implementation uses a similar amount
> of memory as hypre, but it is way faster than hypre.
>
> For example, for a problem with 14,893,346,880 unknowns using 10,000
> processor cores, here are the timing results:
>
> Hypre algorithm:
>
> MatPtAP            50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
> MatPtAPSymbolic    50 1.0 2.3969e-02 13.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> MatPtAPNumeric     50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>
> PETSc scalable PtAP:
>
> MatPtAP            50 1.0 1.1453e+02  1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
> MatPtAPSymbolic    50 1.0 5.1562e+01  1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1  0  3  3  9   1  0  3  3  9      0
> MatPtAPNumeric     50 1.0 6.3072e+01  1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>
> New implementation of the all-at-once algorithm:
>
> MatPtAP            50 1.0 2.2153e+02  1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4  0  7  7 17   4  0  7  7 17      0
> MatPtAPSymbolic    50 1.0 1.1055e+02  1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2  0  5  4  6   2  0  5  4  6      0
> MatPtAPNumeric     50 1.0 1.1102e+02  1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2  0  2  3 11   2  0  2  3 11      0
>
> You can see here that the all-at-once algorithm is a bit slower than
> ptap-scalable, but it uses much less memory.
>
> Fande
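
For anyone who wants to check the arithmetic behind the comments in this thread, here is a minimal sketch. It only uses the problem size, core count, and MatPtAP wall-clock times quoted above; the variable and function names are made up for illustration and are not part of PETSc.

```python
# Wall-clock time in seconds of the MatPtAP event, copied from the
# log excerpts quoted in this thread.
matptap_seconds = {
    "hypre": 3.5353e+03,
    "ptap-scalable": 1.1453e+02,
    "all-at-once": 2.2153e+02,
}

# Problem size and core count from the quoted message.
unknowns = 14_893_346_880
cores = 10_000

# Average subdomain size; this is what makes the run look like a
# single-node run from a parallelism point of view.
unknowns_per_core = unknowns / cores


def slowdown(slow, fast):
    """Return how many times slower `slow` is than `fast` (hypothetical helper)."""
    return matptap_seconds[slow] / matptap_seconds[fast]


hypre_vs_allatonce = slowdown("hypre", "all-at-once")
hypre_vs_scalable = slowdown("hypre", "ptap-scalable")
allatonce_vs_scalable = slowdown("all-at-once", "ptap-scalable")

print(f"unknowns per core:            {unknowns_per_core:,.0f}")
print(f"hypre / all-at-once:          {hypre_vs_allatonce:.1f}x")
print(f"hypre / ptap-scalable:        {hypre_vs_scalable:.1f}x")
print(f"all-at-once / ptap-scalable:  {allatonce_vs_scalable:.1f}x")
```

This works out to roughly 1.5 million unknowns per core. Hypre comes out about 16x slower than all-at-once, which is in line with the roughly 15x noted in the reply, and about 31x slower than ptap-scalable. All-at-once is a little under 2x slower than ptap-scalable.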