> On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <petsc-dev@mcs.anl.gov> 
> wrote:
> 
> Interesting, nice work.
> 
> It would be interesting to get the flop counters working.
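> 
> (A minimal sketch of what that could look like, assuming the new kernels only 
> need explicit PetscLogFlops() calls; the function name and the 2n flop 
> estimate below are made up for illustration.)
> 
> #include <petscsys.h>
> 
> /* Count flops for a custom kernel so events like MatPtAPNumeric stop
>    reporting 0.00e+00 flop in -log_view; in real code the event would
>    be registered once at startup rather than per call. */
> static PetscErrorCode MyKernel(PetscInt n)
> {
>   PetscLogEvent  event;
>   PetscErrorCode ierr;
> 
>   PetscFunctionBegin;
>   ierr = PetscLogEventRegister("MyKernel", PETSC_OBJECT_CLASSID, &event);CHKERRQ(ierr);
>   ierr = PetscLogEventBegin(event, 0, 0, 0, 0);CHKERRQ(ierr);
>   /* ... numerical work: say one multiply and one add per entry ... */
>   ierr = PetscLogFlops(2.0*n);CHKERRQ(ierr);
>   ierr = PetscLogEventEnd(event, 0, 0, 0, 0);CHKERRQ(ierr);
>   PetscFunctionReturn(0);
> }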
> 
> This looks like GMG, I assume 3D.
> 
> The degree of parallelism is not very realistic. You should probably run a 
> problem at least 10x smaller, or use 10x more processes.
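> 
> (For scale: 14,893,346,880 unknowns on 10,000 cores is about 1.5 million 
> unknowns per core.)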

   Why do you say that? He's got his machine with a certain amount of physical 
memory per node; are you saying he should ignore/not use 90% of that physical 
memory for his simulation? Should he buy a machine 10x bigger just because it 
means having fewer degrees of freedom per node (and who's footing the bill for 
this purchase)? At INL they run simulations for a purpose, not just for 
scalability studies, and there are no dang GPUs or barely used over-sized 
monstrosities sitting around to brag about twice a year at SC.

   Barry



> I guess it does not matter. This is basically like a one-node run because the 
> subdomains are so large.
> 
> And are you sure the numerics are the same with and without hypre? Hypre is 
> 15x slower. Any ideas what is going on?
> 
> It might be interesting to scale this test down to a node to see if this is 
> from communication.
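> 
> (A sketch of how one might isolate the PtAP phase for that single-node 
> comparison; TimedPtAP is a made-up helper name. In the resulting -log_view 
> stage, the Mess/AvgLen columns show how much of the cost is message traffic.)
> 
> #include <petscmat.h>
> 
> /* Wrap C = P^T A P in its own -log_view stage so a single-node run
>    separates it cleanly from setup and solve. */
> static PetscErrorCode TimedPtAP(Mat A, Mat P, Mat *C)
> {
>   PetscLogStage  stage;
>   PetscErrorCode ierr;
> 
>   PetscFunctionBegin;
>   ierr = PetscLogStageRegister("PtAP", &stage);CHKERRQ(ierr);
>   ierr = PetscLogStagePush(stage);CHKERRQ(ierr);
>   ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, 2.0, C);CHKERRQ(ierr);
>   ierr = PetscLogStagePop();CHKERRQ(ierr);
>   PetscFunctionReturn(0);
> }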
> 
> Again, nice work,
> Mark
> 
> 
> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong...@gmail.com> wrote:
> Hi Developers,
> 
> I just want to share some good news. It is known that PETSc's ptap-scalable 
> algorithm takes too much memory for some applications because it needs to 
> build intermediate data structures. Following Mark's suggestions, I 
> implemented the all-at-once algorithm, which does not cache any intermediate 
> data.
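> 
> (For anyone who wants to try it: a minimal toy driver exercising MatPtAP. I 
> am assuming here that the new algorithm is selected through the usual 
> -matptap_via runtime option, alongside the existing scalable and hypre 
> variants; the matrices are tiny stand-ins, not the application's operators.)
> 
> #include <petscmat.h>
> 
> int main(int argc, char **argv)
> {
>   Mat            A, P, C;
>   PetscInt       i, rstart, rend, n = 8;
>   PetscErrorCode ierr;
> 
>   ierr = PetscInitialize(&argc, &argv, NULL, NULL);if (ierr) return ierr;
> 
>   /* A: 1-D Laplacian stencil, n x n, a stand-in fine-grid operator */
>   ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, 3, NULL, 2, NULL, &A);CHKERRQ(ierr);
>   ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
>   for (i = rstart; i < rend; i++) {
>     if (i > 0)   {ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>     if (i < n-1) {ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr);}
>     ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
>   }
>   ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>   ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
> 
>   /* P: piecewise-constant aggregation of point pairs, n x n/2 */
>   ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n/2, 1, NULL, 1, NULL, &P);CHKERRQ(ierr);
>   ierr = MatGetOwnershipRange(P, &rstart, &rend);CHKERRQ(ierr);
>   for (i = rstart; i < rend; i++) {
>     ierr = MatSetValue(P, i, i/2, 1.0, INSERT_VALUES);CHKERRQ(ierr);
>   }
>   ierr = MatAssemblyBegin(P, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
>   ierr = MatAssemblyEnd(P, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
> 
>   /* C = P^T * A * P; the algorithm is chosen at runtime via -matptap_via */
>   ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, 2.0, &C);CHKERRQ(ierr);
> 
>   ierr = MatDestroy(&C);CHKERRQ(ierr);
>   ierr = MatDestroy(&P);CHKERRQ(ierr);
>   ierr = MatDestroy(&A);CHKERRQ(ierr);
>   ierr = PetscFinalize();
>   return ierr;
> }
> 
> Running it with -log_view and -matptap_via allatonce (or scalable, or hypre) 
> should produce MatPtAP lines like the ones below.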
> 
> I did some comparisons. The new implementation is actually scalable in terms 
> of both memory usage and compute time, even though it is still slower than 
> "ptap-scalable". Some memory profiling results are in the attachments. The new 
> all-at-once implementation uses a similar amount of memory to hypre, but it is 
> way faster than hypre.
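> 
> (To reproduce the memory numbers: a minimal sketch of per-process peak-memory 
> reporting; ReportMaxMemory is a hypothetical helper, and 
> PetscMemorySetGetMaximumUsage() must be called early, right after 
> PetscInitialize, for the maximum to be tracked. The -memory_view runtime 
> option is the zero-code alternative.)
> 
> #include <petscsys.h>
> 
> /* Print each rank's peak resident memory, rank by rank. */
> static PetscErrorCode ReportMaxMemory(void)
> {
>   PetscLogDouble mem;
>   PetscErrorCode ierr;
> 
>   PetscFunctionBegin;
>   ierr = PetscMemoryGetMaximumUsage(&mem);CHKERRQ(ierr);
>   ierr = PetscSynchronizedPrintf(PETSC_COMM_WORLD, "peak memory: %g bytes\n", (double)mem);CHKERRQ(ierr);
>   ierr = PetscSynchronizedFlush(PETSC_COMM_WORLD, PETSC_STDOUT);CHKERRQ(ierr);
>   PetscFunctionReturn(0);
> }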
> 
> For example, for a problem with 14,893,346,880 unknowns on 10,000 processor 
> cores, here are the timing results (the columns are the standard -log_view 
> fields: call count, max time and ratio, max flop and ratio, message count, 
> average message length, reductions, percentages of the global and stage 
> totals, and Mflop/s):
> 
> Hypre algorithm:
> 
> MatPtAP               50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17     0
> MatPtAPSymbolic       50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0     0
> MatPtAPNumeric        50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17     0
> 
> PETSc scalable PtAP:
> 
> MatPtAP               50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
> MatPtAPSymbolic       50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1  0  3  3  9   1  0  3  3  9     0
> MatPtAPNumeric        50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
> 
> New implementation of the all-at-once algorithm:
> 
> MatPtAP               50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4  0  7  7 17   4  0  7  7 17     0
> MatPtAPSymbolic       50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2  0  5  4  6   2  0  5  4  6     0
> MatPtAPNumeric        50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2  0  2  3 11   2  0  2  3 11     0
> 
> 
> You can see here that the all-at-once algorithm is a bit slower than 
> ptap-scalable, but it uses much less memory.
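> 
> (From the MatPtAP rows above: hypre 3.5353e+03 s, ptap-scalable 1.1453e+02 s, 
> all-at-once 2.2153e+02 s, so all-at-once is about 2x slower than 
> ptap-scalable but roughly 16x faster than hypre here.)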
> 
> 
> Fande