Subject: Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm
From: Smith, Barry F. via petsc-dev
Date: Thu, 11 Apr 2019 20:43:19 -0700

> On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>
> Interesting, nice work.
>
> It would be interesting to get the flop counters working.
>
> This looks like GMG, I assume 3D.
>
> The degree of parallelism is not very realistic. You should probably run a
> 10x smaller problem, at least, or use 10x more processes.

Why do you say that? He's got his machine with a certain amount of physical
memory per node; are you saying he should ignore/not use 90% of that physical
memory for his simulation? He should buy a machine 10x bigger just because it
means having fewer degrees of freedom per node (who's footing the bill for
this purchase?). At INL they run simulations for a purpose, not just for
scalability studies, and there are no dang GPUs or barely used over-sized
monstrosities sitting around to brag about twice a year at SC.

   Barry

> I guess it does not matter. This is basically like a one-node run because the
> subdomains are so large.
>
> And are you sure the numerics are the same with and without hypre? Hypre is
> 15x slower. Any ideas what is going on?
>
> It might be interesting to scale this test down to a node to see if this is
> from communication.
>
> Again, nice work,
> Mark
>
>
> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong...@gmail.com> wrote:
>
> Hi Developers,
>
> I just want to share some good news. It is known that PETSc's "ptap-scalable"
> algorithm takes too much memory for some applications because it needs to
> build intermediate data structures. Following Mark's suggestions, I
> implemented an all-at-once algorithm that does not cache any intermediate
> data.
>
> I did some comparisons; the new implementation is actually scalable in terms
> of memory usage and compute time, even though it is still slower than
> "ptap-scalable". There are some memory profiling results (see the
> attachments). The new all-at-once implementation uses a similar amount of
> memory as hypre, but it is way faster than hypre.
>
> For example, for a problem with 14,893,346,880 unknowns using 10,000
> processor cores, here are the timing results:
>
> Hypre algorithm:
>
> MatPtAP              50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
> MatPtAPSymbolic      50 1.0 2.3969e-02 13.0 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  0  0  0  0  0   0  0  0  0  0      0
> MatPtAPNumeric       50 1.0 3.5353e+03  1.0 0.00e+00 0.0 1.9e+07 3.3e+04 6.0e+02 33  0  1  0 17  33  0  1  0 17      0
>
> PETSc scalable PtAP:
>
> MatPtAP              50 1.0 1.1453e+02  1.0 2.07e+09 3.8 6.6e+07 2.0e+05 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
> MatPtAPSymbolic      50 1.0 5.1562e+01  1.0 0.00e+00 0.0 4.1e+07 1.4e+05 3.5e+02  1  0  3  3  9   1  0  3  3  9      0
> MatPtAPNumeric       50 1.0 6.3072e+01  1.0 2.07e+09 3.8 2.4e+07 3.1e+05 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>
> New implementation of the all-at-once algorithm:
>
> MatPtAP              50 1.0 2.2153e+02  1.0 0.00e+00 0.0 1.0e+08 1.4e+05 6.0e+02  4  0  7  7 17   4  0  7  7 17      0
> MatPtAPSymbolic      50 1.0 1.1055e+02  1.0 0.00e+00 0.0 7.9e+07 1.2e+05 2.0e+02  2  0  5  4  6   2  0  5  4  6      0
> MatPtAPNumeric       50 1.0 1.1102e+02  1.0 0.00e+00 0.0 2.6e+07 2.0e+05 4.0e+02  2  0  2  3 11   2  0  2  3 11      0
>
> You can see here that the all-at-once algorithm is a bit slower than
> ptap-scalable, but it uses much less memory.
>
> Fande
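
For context, a minimal sketch (not code from this thread) of the operation being
compared above: all three sets of timings come from the same PETSc call, MatPtAP,
which forms the Galerkin coarse operator Ac = P^T * A * P, and the implementation
is selected at run time. The toy matrices below and the option value "allatonce"
(assumed here as the name of the new algorithm) are illustrative; "scalable" and
"hypre" should be existing -matptap_via values for MPIAIJ matrices, but check the
installed PETSc version for the exact names.

#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, P, Ac;
  PetscInt       i, rstart, rend, n = 8;
  PetscScalar    one = 1.0, two = 2.0;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;

  /* A: n x n diagonal matrix standing in for the fine-grid operator */
  ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n, 1, NULL, 0, NULL, &A);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(A, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    ierr = MatSetValue(A, i, i, two, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* P: n x n/2 piecewise-constant "prolongator" (two fine dofs per coarse dof) */
  ierr = MatCreateAIJ(PETSC_COMM_WORLD, PETSC_DECIDE, PETSC_DECIDE, n, n/2, 1, NULL, 1, NULL, &P);CHKERRQ(ierr);
  ierr = MatGetOwnershipRange(P, &rstart, &rend);CHKERRQ(ierr);
  for (i = rstart; i < rend; i++) {
    ierr = MatSetValue(P, i, i/2, one, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(P, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(P, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* Ac = P^T * A * P.  The algorithm used internally is picked on the command
     line, e.g. -matptap_via scalable, -matptap_via allatonce (assumed name), or
     -matptap_via hypre (needs PETSc built with hypre), together with -log_view
     to get MatPtAP/MatPtAPSymbolic/MatPtAPNumeric rows like the ones above. */
  ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &Ac);CHKERRQ(ierr);
  ierr = MatView(Ac, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);

  ierr = MatDestroy(&Ac);CHKERRQ(ierr);
  ierr = MatDestroy(&P);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}

Run, for example, as mpiexec -n 4 ./ex_ptap -matptap_via allatonce -log_view; the
MatPtAP row in the log then reflects whichever implementation was selected.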
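On Mark's point about getting the flop counters working: the all-at-once and
hypre MatPtAP rows above report 0.00e+00 flops, which is what -log_view shows
when a code path never calls PETSc's flop-logging routine. Below is a minimal
sketch, assuming the usual PetscLogFlops convention; MyPtAPNumericStub and its
2*nnz flop estimate are hypothetical and are not the actual all-at-once kernel.

#include <petscsys.h>

/* Hypothetical stand-in for a numeric PtAP kernel: the real work is elided,
   only the flop accounting is shown. */
static PetscErrorCode MyPtAPNumericStub(PetscInt nnz)
{
  PetscErrorCode ierr;

  PetscFunctionBegin;
  /* ... sparse triple-product numeric work over nnz nonzeros would go here ... */
  ierr = PetscLogFlops(2.0*nnz);CHKERRQ(ierr); /* illustrative: one multiply + one add per nonzero */
  PetscFunctionReturn(0);
}

int main(int argc, char **argv)
{
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = MyPtAPNumericStub(1000);CHKERRQ(ierr);
  ierr = PetscFinalize(); /* run with -log_view: the logged flops appear in the summary */
  return ierr;
}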