Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-15 Thread Mark Adams via petsc-dev
>
>> I wonder if their symbolic setup is getting called every time. It looks like
>> you do 50 solves, and that should be enough to amortize a one-time setup
>> cost.
>>
>
> Hypre does not have a concept of a symbolic phase. They do everything from
> scratch and won't reuse any data.
>

Really, Hypre does not cache the maps, non-zero structure, etc., that are
generated in RAP?

I suspect that is contributing to Hypre's poor performance, but it is not the
whole story, as you are only doing 5 solves.
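
For reference, this is roughly what PETSc's split into symbolic and numeric
phases buys: the symbolic product is computed once and reused for every later
RAP. A minimal sketch of that reuse pattern (the function name is a made-up
placeholder; it assumes A's values change between calls but its nonzero
pattern, and P, do not):

  #include <petscmat.h>

  /* Sketch: amortize the symbolic PtAP over repeated solves. The first call
     does the symbolic and numeric products; later calls reuse the cached
     symbolic data and only redo the numeric product (the MatPtAPNumeric event). */
  static PetscErrorCode UpdateCoarseOperator(Mat A, Mat P, Mat *C)
  {
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    if (!*C) {
      ierr = MatPtAP(A, P, MAT_INITIAL_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr); /* symbolic + numeric */
    } else {
      ierr = MatPtAP(A, P, MAT_REUSE_MATRIX, PETSC_DEFAULT, C);CHKERRQ(ierr);   /* numeric only */
    }
    PetscFunctionReturn(0);
  }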


Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-15 Thread Mark Adams via petsc-dev
>
> So you could reorder your equations and see a block diagonal matrix with
>> 576 blocks. right?
>>
>
> I am not sure I understand the question correctly. For each mesh vertex, we
> have a 576x576 diagonal matrix. The unknowns are ordered in this way:
> v0, v1, ..., v575 for vertex 1, another 576 variables for mesh vertex 2,
> and so on.
>

My question is: mathematically, or algebraically, is this preconditioner
equivalent to 576 Laplacian PCs? I see that it is not, because you coarsen
the number of variables per node, so your interpolation operators couple
your equations. I think that, other than the coupling from eigenvalue estimates
and Krylov methods, and the coupling from your variable coarsening, you
have independent scalar Laplacian PCs.
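
To spell that out as a sketch (writing L_i for the per-variable blocks of the
preconditioning matrix and P_s for a purely scalar mesh interpolation; both
symbols are mine, not from the runs above): if the interpolation did not touch
the variable index, then

  A \approx \operatorname{blkdiag}(L_1, \ldots, L_{576}), \qquad
  P = I_{576} \otimes P_s
  \;\Longrightarrow\;
  P^T A P = \operatorname{blkdiag}(P_s^T L_1 P_s, \ldots, P_s^T L_{576} P_s),

i.e. 576 independent scalar coarse problems. Once the coarsening also acts on
the variable index, P is no longer I_{576} \otimes P_s and the coarse
equations couple.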

10 levels is a lot. I am guessing you do like 5 levels of variable
coarsening and 5 levels of (normal) vertex coarsening with some sort of AMG
method.

This is a very different regime than the problems I am used to.

And it would still be interesting to see the flop counters to get a sense
of the underlying performance differences between the normal and the
all-at-once PtAP.
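
(The zero flop rates in the hypre and all-at-once MatPtAPNumeric rows
presumably just mean that nothing in those code paths reports its work to the
logger.) A hedged sketch of the kind of instrumentation that makes the
counters show up; the kernel name and loop body are stand-ins, not the actual
all-at-once code:

  #include <petscsys.h>

  /* Hypothetical inner kernel of a numeric PtAP sweep: do some accumulation
     and report the flops so -log_view can print a rate for the event. */
  static PetscErrorCode AllAtOnceAccumulate(PetscInt n, const PetscScalar *ap,
                                            const PetscScalar *p, PetscScalar *c)
  {
    PetscErrorCode ierr;
    PetscInt       i;

    PetscFunctionBeginUser;
    for (i = 0; i < n; i++) c[i] += ap[i] * p[i];   /* placeholder arithmetic */
    ierr = PetscLogFlops(2.0 * n);CHKERRQ(ierr);    /* one multiply + one add per entry */
    PetscFunctionReturn(0);
  }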


Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-15 Thread Mark Adams via petsc-dev
On Mon, Apr 15, 2019 at 2:56 PM Fande Kong  wrote:

>
>
> On Mon, Apr 15, 2019 at 6:49 AM Matthew Knepley  wrote:
>
>> On Mon, Apr 15, 2019 at 12:41 AM Fande Kong via petsc-dev <
>> petsc-dev@mcs.anl.gov> wrote:
>>
>>> On Fri, Apr 12, 2019 at 7:27 AM Mark Adams  wrote:
>>>


 On Thu, Apr 11, 2019 at 11:42 PM Smith, Barry F. 
 wrote:

>
>
> > On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> >
> > Interesting, nice work.
> >
> > It would be interesting to get the flop counters working.
> >
> > This looks like GMG, I assume 3D.
> >
> > The degree of parallelism is not very realistic. You should probably
> run a 10x smaller problem, at least, or use 10x more processes.
>
>Why do you say that? He's got his machine with a certain amount of
> physical memory per node, are you saying he should ignore/not use 90% of
> that physical memory for his simulation?


 In my experience 1.5M equations/process is about 50x more than
 applications run, but this is just anecdotal. Some apps are dominated by
 the linear solver in terms of memory, but some apps use a lot of memory in
 the physics parts of the code.

>>>
>>> The test case is solving the multigroup neutron transport equations
>>> where each mesh vertex could be associated with a hundred or a thousand
>>> variables. The mesh is actually small so that it can be handled efficiently
>>> in the physics part of the code. 90% of the memory is consumed by the
>>> solver (SNES, KSP, PC). This is the reason I was trying to implement a
>>> memory-friendly PtAP.
>>>
>>>
 The one app that I can think of where the memory usage is dominated by
 the solver does like 10 (pseudo) time steps with pretty hard nonlinear
 solves, so in the end they are not bound by turnaround time. But they are
 kind of an odd (academic) application and not very representative of what I
 see in the broader comp sci community. And these guys do have a scalable
 code, so instead of waiting a week on the queue to run a 10-hour job that
 uses 10% of the machine, they wait a day to run a 2-hour job that takes 50%
 of the machine because centers' scheduling policies work that way.

>>>
>>> Our code is scalable but we do not have a huge machine unfortunately.
>>>
>>>

 He should buy a machine 10x bigger just because it means having fewer
> degrees of freedom per node (who's footing the bill for this purchase?). At
> INL they run simulations for a purpose, not just for scalability studies,
> and there are no dang GPUs or barely used over-sized monstrosities sitting
> around to brag about twice a year at SC.
>

 I guess they are the nuke guys. I've never worked with them or seen this
 kind of complexity analysis in their talks, but OK, if they fill up memory
 with the solver then this is representative of a significant (DOE) app.

>>>
>>> You do not see the complexity analysis in the talks because most of the
>>> people at INL live in a different community. I will convince more people to
>>> give talks in our community in the future.
>>>
>>> We focus on nuclear energy simulations that involve multiphysics
>>> (neutron transport, mechanics contact, computational materials,
>>> compressible/incompressible flows, two-phase flows, etc.). We are
>>> developing a flexible, open-source platform that allows different physics
>>> groups to couple their codes together efficiently.
>>> https://mooseframework.inl.gov/old
>>>
>>
>> Fande, this is very interesting. Can you tell me:
>>
>>   1) A rough estimate of dofs/vertex (or cell or face) depending on where
>> you put unknowns
>>
>
> The big run (neutron transport equations) posted earlier has 576 variables
> on each mesh vertex. The physics guys think that, at the current stage, 100-1000
> variables (the number of energy groups times the number of neutron flight
> directions) on each mesh vertex will give us an acceptable simulation
> result. 1000 variables are preferred.
>
>
>
>>
>>   2) Are all unknowns on the same vertex coupled together? If not, where
>> do you specify block sparsity?
>>
>
> Yes, they are physically coupled together through the scattering and
> fission events. But we are using the matrix-free method, and the variable
> coupling is ignored in the preconditioning matrix so that the system won't
> take that much memory.
>

So the preconditioner looks like 576 independent Laplacian solves,
mathematically, and you could reorder your equations and see a block
diagonal matrix with 576 blocks, right?
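
For concreteness, the setup described above, sketched with plain PETSc calls
(the SetupSolver/FormFunction/FormPmat names are hypothetical placeholders and
MOOSE's actual wiring will differ): the full, coupled Jacobian is only applied
matrix-free, while the assembled preconditioning matrix drops the
scattering/fission coupling, which is what makes it look block diagonal per
vertex to the preconditioner.

  #include <petscsnes.h>

  extern PetscErrorCode FormFunction(SNES, Vec, Vec, void *);
  extern PetscErrorCode FormPmat(SNES, Vec, Mat, Mat, void *); /* assembles only the decoupled blocks */

  static PetscErrorCode SetupSolver(SNES snes, Vec r, Mat Pmat)
  {
    Mat            Jmf;
    PetscErrorCode ierr;

    PetscFunctionBeginUser;
    ierr = SNESSetFunction(snes, r, FormFunction, NULL);CHKERRQ(ierr);
    ierr = MatCreateSNESMF(snes, &Jmf);CHKERRQ(ierr);                      /* matrix-free action of the full Jacobian */
    ierr = SNESSetJacobian(snes, Jmf, Pmat, FormPmat, NULL);CHKERRQ(ierr); /* Pmat ignores inter-variable coupling */
    ierr = SNESSetFromOptions(snes);CHKERRQ(ierr);                         /* e.g. -pc_type gamg or -pc_type hypre */
    PetscFunctionReturn(0);
  }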


>
>
>>
>>   3) How are the coefficients from the equation discretized on the mesh?
>>
>
> The coefficients (often referred to as cross sections for neutron guys)
> could be different for each variable, and they totally depend on the
> reactor configuration. My current simulation indeed uses heterogeneous
> materials.
>
> I actually

Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-15 Thread Mark Adams via petsc-dev
>
>
> I guess you are interested in the performance of the new algorithms on
>  small problems. I will try to test a petsc example such as
> mat/examples/tests/ex96.c.
>

It's not a big deal. And the fact that they are similar on one node tells
us the kernels are similar.


>
>
>>
>> And are you sure the numerics are the same with and without hypre? Hypre
>> is 15x slower. Any ideas what is going on?
>>
>
> Hypre performs pretty well when the number of processor cores is small (a
> couple of hundred). I guess the issue is related to how they handle the
> communication.
>
>
>>
>> It might be interesting to scale this test down to a node to see if this
>> is from communication.
>>
>
I wonder if their symbolic setup is getting called every time. It looks like
you do 50 solves, and that should be enough to amortize a one-time setup
cost.

Does PETSc do any clever scalability tricks? You just pack and send
point-to-point messages, I would think, but maybe Hypre is doing something bad.
I have seen Hypre scale out to large machines, but on synthetic problems.

So this is a realistic problem. Can you run with -info, grep on GAMG, and
send me the ~20 lines of output? You will be able to see info about each
level, like the number of equations and the average nnz/row.
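
For example, something along these lines (the executable name and process
count are placeholders; the relevant pieces are -info and grepping the GAMG
messages):

  mpiexec -n 10000 ./your_app <your usual options> -pc_type gamg -info 2>&1 | grep GAMG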


>
> Hypre performs similarly to PETSc on a single compute node.
>
>
> Fande,
>
>
>>
>> Again, nice work,
>> Mark
>>
>>
>> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong  wrote:
>>
>>> Hi Developers,
>>>
>>> I just want to share some good news.  It is known that PETSc-ptap-scalable
>>> takes too much memory for some applications because it needs to build
>>> intermediate data structures.  Following Mark's suggestions, I
>>> implemented the all-at-once algorithm that does not cache any intermediate
>>> data.
>>>
>>> I did some comparisons; the new implementation is actually scalable in
>>> terms of memory usage and compute time, even though it is still
>>> slower than "ptap-scalable". There are some memory profiling results (see
>>> the attachments). The new all-at-once implementation uses a similar amount
>>> of memory as hypre, but is way faster than hypre.
>>>
>>> For example, for a problem with 14,893,346,880 unknowns using 10,000
>>> processor cores, here are the timing results:
>>>
>>> Hypre algorithm:
>>>
>>> MatPtAP   50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
>>> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
>>> MatPtAPSymbolic   50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00
>>> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
>>> MatPtAPNumeric50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
>>> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
>>>
>>> PETSc scalable PtAP:
>>>
>>> MatPtAP   50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05
>>> 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
>>> MatPtAPSymbolic   50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05
>>> 3.5e+02  1  0  3  3  9   1  0  3  3  9 0
>>> MatPtAPNumeric50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05
>>> 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>>>
>>> New implementation of the all-at-once algorithm:
>>>
>>> MatPtAP   50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05
>>> 6.0e+02  4  0  7  7 17   4  0  7  7 17 0
>>> MatPtAPSymbolic   50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05
>>> 2.0e+02  2  0  5  4  6   2  0  5  4  6 0
>>> MatPtAPNumeric50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05
>>> 4.0e+02  2  0  2  3 11   2  0  2  3 11 0
>>>
>>>
>>> You can see here that the all-at-once algorithm is a bit slower than
>>> ptap-scalable, but it uses much less memory.
>>>
>>>
>>> Fande
>>>
>>>
>>


Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-12 Thread Zhang, Hong via petsc-dev
I would suggest Fande add this new implementation into petsc. What is the 
algorithm?
I'll try to see if I can further reduce memory consumption of the current 
symbolic PtAP when I get time.
Hong

On Fri, Apr 12, 2019 at 8:27 AM Mark Adams via petsc-dev <petsc-dev@mcs.anl.gov> wrote:


On Thu, Apr 11, 2019 at 11:42 PM Smith, Barry F. <bsm...@mcs.anl.gov> wrote:


> On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>
> Interesting, nice work.
>
> It would be interesting to get the flop counters working.
>
> This looks like GMG, I assume 3D.
>
> The degree of parallelism is not very realistic. You should probably run a 
> 10x smaller problem, at least, or use 10x more processes.

   Why do you say that? He's got his machine with a certain amount of physical 
memory per node, are you saying he should ignore/not use 90% of that physical 
memory for his simulation?

In my experience 1.5M equations/process about 50x more than applications run, 
but this is just anecdotal. Some apps are dominated by the linear solver in 
terms of memory but some apps use a lot of memory in the physics parts of the 
code.

The one app that I can think of where the memory usage is dominated by the 
solver does like 10 (pseudo) time steps with pretty hard nonlinear solves, so 
in the end they are not bound by turnaround time. But they are kind of a odd 
(academic) application and not very representative of what I see in the broader 
comp sci community. And these guys do have a scalable code so instead of 
waiting a week on the queue to run a 10 hour job that uses 10% of the machine, 
they wait a day to run a 2 hour job that takes 50% of the machine because 
centers scheduling policies work that way.

He should buy a machine 10x bigger just because it means having less degrees of 
freedom per node (whose footing the bill for this purchase?). At INL they run 
simulations for a purpose, not just for scalability studies and there are no 
dang GPUs or barely used over-sized monstrocities sitting around to brag about 
twice a year at SC.

I guess the are the nuke guys. I've never worked with them or seen this kind of 
complexity analysis in their talks, but OK if they fill up memory with the 
solver then this is representative of a significant (DOE)app.


   Barry



> I guess it does not matter. This basically like a one node run because the 
> subdomains are so large.
>
> And are you sure the numerics are the same with and without hypre? Hypre is 
> 15x slower. Any ideas what is going on?
>
> It might be interesting to scale this test down to a node to see if this is 
> from communication.
>
> Again, nice work,
> Mark
>
>
> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong <fdkong...@gmail.com> wrote:
> Hi Developers,
>
> I just want to share a good news.  It is known PETSc-ptap-scalable is taking 
> too much memory for some applications because it needs to build intermediate 
> data structures.  According to Mark's suggestions, I implemented the  
> all-at-once algorithm that does not cache any intermediate data.
>
> I did some comparison,  the new implementation is actually scalable in terms 
> of the memory usage and the compute time even though it is still  slower than 
> "ptap-scalable".   There are some memory profiling results (see the 
> attachments). The new all-at-once implementation use the similar amount of 
> memory as hypre, but it way faster than hypre.
>
> For example, for a problem with 14,893,346,880 unknowns using 10,000 
> processor cores,  There are timing results:
>
> Hypre algorithm:
>
> MatPtAP   50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> MatPtAPSymbolic   50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> MatPtAPNumeric50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
>
> PETSc scalable PtAP:
>
> MatPtAP   50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05 
> 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
> MatPtAPSymbolic   50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05 
> 3.5e+02  1  0  3  3  9   1  0  3  3  9 0
> MatPtAPNumeric50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05 
> 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>
> New implementation of the all-at-once algorithm:
>
> MatPtAP   50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05 
> 6.0e+02  4  0  7  7 17   4  0  7  7 17 0
> MatPtAPSymbolic   50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05 
> 2.0e+02  2  0  5  4  6   2  0  5  4  6 0
> MatPtAPNumeric50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05 
> 4.0e+02  2  0  2  3 11   2  0  2  3 11 0
>
>
> You can see here the all-at-once is a bit slower than ptap-scalable, but it 
> uses only much less memory.
>
>
> Fande
>



Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-12 Thread Mark Adams via petsc-dev
On Thu, Apr 11, 2019 at 11:42 PM Smith, Barry F.  wrote:

>
>
> > On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev <
> petsc-dev@mcs.anl.gov> wrote:
> >
> > Interesting, nice work.
> >
> > It would be interesting to get the flop counters working.
> >
> > This looks like GMG, I assume 3D.
> >
> > The degree of parallelism is not very realistic. You should probably run
> a 10x smaller problem, at least, or use 10x more processes.
>
>Why do you say that? He's got his machine with a certain amount of
> physical memory per node, are you saying he should ignore/not use 90% of
> that physical memory for his simulation?


In my experience 1.5M equations/process is about 50x more than applications
run, but this is just anecdotal. Some apps are dominated by the linear
solver in terms of memory, but some apps use a lot of memory in the physics
parts of the code.

The one app that I can think of where the memory usage is dominated by the
solver does like 10 (pseudo) time steps with pretty hard nonlinear solves,
so in the end they are not bound by turnaround time. But they are kind of an
odd (academic) application and not very representative of what I see in the
broader comp sci community. And these guys do have a scalable code, so
instead of waiting a week on the queue to run a 10-hour job that uses 10%
of the machine, they wait a day to run a 2-hour job that takes 50% of the
machine because centers' scheduling policies work that way.

He should buy a machine 10x bigger just because it means having less
> degrees of freedom per node (whose footing the bill for this purchase?). At
> INL they run simulations for a purpose, not just for scalability studies
> and there are no dang GPUs or barely used over-sized monstrocities sitting
> around to brag about twice a year at SC.
>

I guess they are the nuke guys. I've never worked with them or seen this
kind of complexity analysis in their talks, but OK, if they fill up memory
with the solver then this is representative of a significant (DOE) app.


>
>Barry
>
>
>
> > I guess it does not matter. This basically like a one node run because
> the subdomains are so large.
> >
> > And are you sure the numerics are the same with and without hypre? Hypre
> is 15x slower. Any ideas what is going on?
> >
> > It might be interesting to scale this test down to a node to see if this
> is from communication.
> >
> > Again, nice work,
> > Mark
> >
> >
> > On Thu, Apr 11, 2019 at 7:08 PM Fande Kong  wrote:
> > Hi Developers,
> >
> > I just want to share a good news.  It is known PETSc-ptap-scalable is
> taking too much memory for some applications because it needs to build
> intermediate data structures.  According to Mark's suggestions, I
> implemented the  all-at-once algorithm that does not cache any intermediate
> data.
> >
> > I did some comparison,  the new implementation is actually scalable in
> terms of the memory usage and the compute time even though it is still
> slower than "ptap-scalable".   There are some memory profiling results (see
> the attachments). The new all-at-once implementation use the similar amount
> of memory as hypre, but it way faster than hypre.
> >
> > For example, for a problem with 14,893,346,880 unknowns using 10,000
> processor cores,  There are timing results:
> >
> > Hypre algorithm:
> >
> > MatPtAP   50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> > MatPtAPSymbolic   50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> > MatPtAPNumeric50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> >
> > PETSc scalable PtAP:
> >
> > MatPtAP   50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05
> 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
> > MatPtAPSymbolic   50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05
> 3.5e+02  1  0  3  3  9   1  0  3  3  9 0
> > MatPtAPNumeric50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05
> 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
> >
> > New implementation of the all-at-once algorithm:
> >
> > MatPtAP   50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05
> 6.0e+02  4  0  7  7 17   4  0  7  7 17 0
> > MatPtAPSymbolic   50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05
> 2.0e+02  2  0  5  4  6   2  0  5  4  6 0
> > MatPtAPNumeric50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05
> 4.0e+02  2  0  2  3 11   2  0  2  3 11 0
> >
> >
> > You can see here the all-at-once is a bit slower than ptap-scalable, but
> it uses only much less memory.
> >
> >
> > Fande
> >
>
>


Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-11 Thread Smith, Barry F. via petsc-dev



> On Apr 11, 2019, at 9:07 PM, Mark Adams via petsc-dev  
> wrote:
> 
> Interesting, nice work.
> 
> It would be interesting to get the flop counters working.
> 
> This looks like GMG, I assume 3D.
> 
> The degree of parallelism is not very realistic. You should probably run a 
> 10x smaller problem, at least, or use 10x more processes.

   Why do you say that? He's got his machine with a certain amount of physical 
memory per node; are you saying he should ignore/not use 90% of that physical 
memory for his simulation? He should buy a machine 10x bigger just because it 
means having fewer degrees of freedom per node (who's footing the bill for this 
purchase?). At INL they run simulations for a purpose, not just for scalability 
studies, and there are no dang GPUs or barely used over-sized monstrosities 
sitting around to brag about twice a year at SC.

   Barry



> I guess it does not matter. This basically like a one node run because the 
> subdomains are so large.
> 
> And are you sure the numerics are the same with and without hypre? Hypre is 
> 15x slower. Any ideas what is going on?
> 
> It might be interesting to scale this test down to a node to see if this is 
> from communication.
> 
> Again, nice work,
> Mark
> 
> 
> On Thu, Apr 11, 2019 at 7:08 PM Fande Kong  wrote:
> Hi Developers,
> 
> I just want to share a good news.  It is known PETSc-ptap-scalable is taking 
> too much memory for some applications because it needs to build intermediate 
> data structures.  According to Mark's suggestions, I implemented the  
> all-at-once algorithm that does not cache any intermediate data. 
> 
> I did some comparison,  the new implementation is actually scalable in terms 
> of the memory usage and the compute time even though it is still  slower than 
> "ptap-scalable".   There are some memory profiling results (see the 
> attachments). The new all-at-once implementation use the similar amount of 
> memory as hypre, but it way faster than hypre.
> 
> For example, for a problem with 14,893,346,880 unknowns using 10,000 
> processor cores,  There are timing results:
> 
> Hypre algorithm:
> 
> MatPtAP   50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> MatPtAPSymbolic   50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> MatPtAPNumeric50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> 
> PETSc scalable PtAP:
> 
> MatPtAP   50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05 
> 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
> MatPtAPSymbolic   50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05 
> 3.5e+02  1  0  3  3  9   1  0  3  3  9 0
> MatPtAPNumeric50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05 
> 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
> 
> New implementation of the all-at-once algorithm:
> 
> MatPtAP   50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05 
> 6.0e+02  4  0  7  7 17   4  0  7  7 17 0
> MatPtAPSymbolic   50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05 
> 2.0e+02  2  0  5  4  6   2  0  5  4  6 0
> MatPtAPNumeric50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05 
> 4.0e+02  2  0  2  3 11   2  0  2  3 11 0
> 
> 
> You can see here the all-at-once is a bit slower than ptap-scalable, but it 
> uses only much less memory.   
> 
> 
> Fande
>  



Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-11 Thread Mark Adams via petsc-dev
Interesting, nice work.

It would be interesting to get the flop counters working.

This looks like GMG, I assume 3D.

The degree of parallelism is not very realistic. You should probably run a
10x smaller problem, at least, or use 10x more processes. I guess it does
not matter; this is basically like a one-node run because the subdomains are
so large.

And are you sure the numerics are the same with and without hypre? Hypre is
15x slower. Any ideas what is going on?
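
For reference, a rough reading of that factor off the MatPtAP timings quoted
below:

  3.5353e+03 / 2.2153e+02 ≈ 16   (hypre vs. the new all-at-once PtAP)
  3.5353e+03 / 1.1453e+02 ≈ 31   (hypre vs. ptap-scalable)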

It might be interesting to scale this test down to a node to see if this is
from communication.

Again, nice work,
Mark


On Thu, Apr 11, 2019 at 7:08 PM Fande Kong  wrote:

> Hi Developers,
>
> I just want to share a good news.  It is known PETSc-ptap-scalable is
> taking too much memory for some applications because it needs to build
> intermediate data structures.  According to Mark's suggestions, I
> implemented the  all-at-once algorithm that does not cache any intermediate
> data.
>
> I did some comparison,  the new implementation is actually scalable in
> terms of the memory usage and the compute time even though it is still
> slower than "ptap-scalable".   There are some memory profiling results (see
> the attachments). The new all-at-once implementation use the similar amount
> of memory as hypre, but it way faster than hypre.
>
> For example, for a problem with 14,893,346,880 unknowns using 10,000
> processor cores,  There are timing results:
>
> Hypre algorithm:
>
> MatPtAP   50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> MatPtAPSymbolic   50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> MatPtAPNumeric50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
>
> PETSc scalable PtAP:
>
> MatPtAP   50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05
> 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
> MatPtAPSymbolic   50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05
> 3.5e+02  1  0  3  3  9   1  0  3  3  9 0
> MatPtAPNumeric50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05
> 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
>
> New implementation of the all-at-once algorithm:
>
> MatPtAP   50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05
> 6.0e+02  4  0  7  7 17   4  0  7  7 17 0
> MatPtAPSymbolic   50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05
> 2.0e+02  2  0  5  4  6   2  0  5  4  6 0
> MatPtAPNumeric50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05
> 4.0e+02  2  0  2  3 11   2  0  2  3 11 0
>
>
> You can see here the all-at-once is a bit slower than ptap-scalable, but
> it uses only much less memory.
>
>
> Fande
>
>


Re: [petsc-dev] New implementation of PtAP based on all-at-once algorithm

2019-04-11 Thread Smith, Barry F. via petsc-dev


  Excellent! Thanks

   Barry


> On Apr 11, 2019, at 6:08 PM, Fande Kong via petsc-dev  
> wrote:
> 
> Hi Developers,
> 
> I just want to share a good news.  It is known PETSc-ptap-scalable is taking 
> too much memory for some applications because it needs to build intermediate 
> data structures.  According to Mark's suggestions, I implemented the  
> all-at-once algorithm that does not cache any intermediate data. 
> 
> I did some comparison,  the new implementation is actually scalable in terms 
> of the memory usage and the compute time even though it is still  slower than 
> "ptap-scalable".   There are some memory profiling results (see the 
> attachments). The new all-at-once implementation use the similar amount of 
> memory as hypre, but it way faster than hypre.
> 
> For example, for a problem with 14,893,346,880 unknowns using 10,000 
> processor cores,  There are timing results:
> 
> Hypre algorithm:
> 
> MatPtAP   50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> MatPtAPSymbolic   50 1.0 2.3969e-0213.0 0.00e+00 0.0 0.0e+00 0.0e+00 
> 0.0e+00  0  0  0  0  0   0  0  0  0  0 0
> MatPtAPNumeric50 1.0 3.5353e+03 1.0 0.00e+00 0.0 1.9e+07 3.3e+04 
> 6.0e+02 33  0  1  0 17  33  0  1  0 17 0
> 
> PETSc scalable PtAP:
> 
> MatPtAP   50 1.0 1.1453e+02 1.0 2.07e+09 3.8 6.6e+07 2.0e+05 
> 7.5e+02  2  1  4  6 20   2  1  4  6 20 129418
> MatPtAPSymbolic   50 1.0 5.1562e+01 1.0 0.00e+00 0.0 4.1e+07 1.4e+05 
> 3.5e+02  1  0  3  3  9   1  0  3  3  9 0
> MatPtAPNumeric50 1.0 6.3072e+01 1.0 2.07e+09 3.8 2.4e+07 3.1e+05 
> 4.0e+02  1  1  2  4 11   1  1  2  4 11 235011
> 
> New implementation of the all-at-once algorithm:
> 
> MatPtAP   50 1.0 2.2153e+02 1.0 0.00e+00 0.0 1.0e+08 1.4e+05 
> 6.0e+02  4  0  7  7 17   4  0  7  7 17 0
> MatPtAPSymbolic   50 1.0 1.1055e+02 1.0 0.00e+00 0.0 7.9e+07 1.2e+05 
> 2.0e+02  2  0  5  4  6   2  0  5  4  6 0
> MatPtAPNumeric50 1.0 1.1102e+02 1.0 0.00e+00 0.0 2.6e+07 2.0e+05 
> 4.0e+02  2  0  2  3 11   2  0  2  3 11 0
> 
> 
> You can see here the all-at-once is a bit slower than ptap-scalable, but it 
> uses only much less memory.   
> 
> 
> Fande
>  
>