Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-08 Thread Jeff Hammond
On Friday, July 8, 2016, Barry Smith  wrote:

>
> > On Jul 8, 2016, at 12:17 PM, Jeff Hammond  > wrote:
> >
> >
> >
> > On Fri, Jul 8, 2016 at 9:48 AM, Richard Mills  > wrote:
> >
> >
> > On Fri, Jul 8, 2016 at 9:40 AM, Jeff Hammond  > wrote:
> >
> > > 1) How do we run at bandwidth peak on new architectures like Cori or
> Aurora?
> >
> >   Huh, there is a how here, not a why?
> > >
> > > Patrick and Rich have good suggestions here. Karl and Rich showed some
> promising numbers for KNL at the PETSc meeting.
> > >
> > >
> > > Future systems from multiple vendors basically move from 2-tier memory
> hierarchy of shared LLC and DRAM to a 3-tier hierarchy of fast memory (e.g.
> HBM), regular memory (e.g. DRAM), and slow (likely nonvolatile) memory  on
> a node.
> >
> >   Jeff,
> >
> >    Would Intel sell me a system that had essentially no regular memory
> (DRAM, which is too slow anyway) and no slow memory (which is absurdly too
> slow)?  What cost savings would I get in $ and power usage compared to, say,
> what is going into Theta? 10% and 20%, 5% and 30%, 5% and 5%? If it is a
> significant savings, then get the cut-down machine; if it is insignificant,
> then realize that the cost of not using it (the DRAM you paid so little for) is
> insignificant and not worth worrying about, just like cruise control when
> you don't use the highway. Actually, I could use the DRAM to store the
> history needed for the adjoints, so maybe it is OK to keep, but it is surely not
> useful for data that is continuously involved in the computation.
> >
> > Disclaimer: All of the following data is pulled off of the Internet,
> which in some cases is horribly unreliable.  My comments are strictly for
> academic discussion and not meant to be authoritative or have any influence
> on purchasing or design decisions.  Do not equate quoted TDP to measured
> power during any workload, or assume that different measurements can be
> compared directly.
> >
> > Your thinking is in line with
> http://www.nextplatform.com/2015/08/03/future-systems-intel-ponders-breaking-up-the-cpu/...
> >
> > Intel sells KNL packages as parts (
> http://ark.intel.com/products/family/92650/Intel-Xeon-Phi-Product-Family-x200#@Server)
> that don't have any DRAM in them, just MCDRAM.  It's the decision of the
> integrator what goes into the system, which of course is correlated to what
> the intended customer wants.  While you might not need a node with DRAM,
> many users do, and the systems that DOE buys are designed to meet the needs
> of their broad user base.
> >
> > I don't know if KNL is bootable without any DRAM at all - this is likely
> more to do with what motherboard, BIOS, etc. expect than the processor
> package itself.  However, the KNL alltoall mode addresses the case where
> DRAM channels are underpopulated (with fully populated channels, one should
> use quadrant, hemisphere, SNC-2 or SNC-4), so if DRAM is necessary, you
> should be able to boot it with only one channel populated.  Of course, if
> you do this, you'll get 1/6 of the DDR4 bandwidth.
> >
> > Just FYI: I have run on KNL systems with no DRAM, only MCDRAM.  This was
> on an internal lab machine and not a commercially available system, but I
> see no reason why one couldn't buy systems this way.
> >
> >
> > It puts quite a bit of pressure on the system software footprint if one
> does not have DDR4.  Blue Gene CNK had a very small memory footprint, but
> commodity Linux is certainly much larger.  The memory footprint of MPI at
> scale depends on the fabric HW/SW.  It's probably cheaper to buy one 32 GB
> stick per node than pay for someone to write the system software that gives
> you at least 15.5 GB of usable MCDRAM.
> >
>Sure.
>
> But this doesn't help with the question of whether the DRAM uses
> significant power. If it does use significant power, then it might make
> sense to be able to turn off most of the DRAM via software when running
> PETSc programs. If it doesn't use significant power, then it doesn't matter
> if we don't use it.
>
>
Like I said, it's approximately 0.37 W/GB and in the neighborhood of 15-20%
of total system power in supercomputers.

On the other hand, CPU is approx 50-60% of system power and, unlike DRAM,
amenable to a range of power optimizations, at least in theory. Whether
operators give users C- and P-state control is a policy question.

Jeff


>   Barry
>
> > Jeff
> >
> > --
> > Jeff Hammond
> > jeff.scie...@gmail.com 
> > http://jeffhammond.github.io/
>
>

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-08 Thread Barry Smith

> On Jul 8, 2016, at 12:17 PM, Jeff Hammond  wrote:
> 
> 
> 
> On Fri, Jul 8, 2016 at 9:48 AM, Richard Mills  wrote:
> 
> 
> On Fri, Jul 8, 2016 at 9:40 AM, Jeff Hammond  wrote:
> 
> > 1) How do we run at bandwidth peak on new architectures like Cori or Aurora?
> 
>   Huh, there is a how here, not a why?
> >
> > Patrick and Rich have good suggestions here. Karl and Rich showed some 
> > promising numbers for KNL at the PETSc meeting.
> >
> >
> > Future systems from multiple vendors basically move from 2-tier memory 
> > hierarchy of shared LLC and DRAM to a 3-tier hierarchy of fast memory (e.g. 
> > HBM), regular memory (e.g. DRAM), and slow (likely nonvolatile) memory  on 
> > a node.
> 
>   Jeff,
> 
>    Would Intel sell me a system that had essentially no regular memory (DRAM,
> which is too slow anyway) and no slow memory (which is absurdly too slow)?
> What cost savings would I get in $ and power usage compared to, say, what is
> going into Theta? 10% and 20%, 5% and 30%, 5% and 5%? If it is a
> significant savings, then get the cut-down machine; if it is insignificant,
> then realize that the cost of not using it (the DRAM you paid so little for) is
> insignificant and not worth worrying about, just like cruise control when you
> don't use the highway. Actually, I could use the DRAM to store the history
> needed for the adjoints, so maybe it is OK to keep, but it is surely not useful for
> data that is continuously involved in the computation.
> 
> Disclaimer: All of the following data is pulled off of the Internet, which in 
> some cases is horribly unreliable.  My comments are strictly for academic 
> discussion and not meant to be authoritative or have any influence on 
> purchasing or design decisions.  Do not equate quoted TDP to measured power 
> during any workload, or assume that different measurements can be compared 
> directly.
> 
> Your thinking is in line with 
> http://www.nextplatform.com/2015/08/03/future-systems-intel-ponders-breaking-up-the-cpu/...
> 
> Intel sells KNL packages as parts 
> (http://ark.intel.com/products/family/92650/Intel-Xeon-Phi-Product-Family-x200#@Server)
>  that don't have any DRAM in them, just MCDRAM.  It's the decision of the 
> integrator what goes into the system, which of course is correlated to what 
> the intended customer wants.  While you might not need a node with DRAM, many 
> users do, and the systems that DOE buys are designed to meet the needs of 
> their broad user base.
> 
> I don't know if KNL is bootable without any DRAM at all - this is likely more
> to do with what motherboard, BIOS, etc. expect than the processor package 
> itself.  However, the KNL alltoall mode addresses the case where DRAM 
> channels are underpopulated (with fully populated channels, one should use 
> quadrant, hemisphere, SNC-2 or SNC-4), so if DRAM is necessary, you should be 
> able to boot it with only one channel populated.  Of course, if you do this, 
> you'll get 1/6 of the DDR4 bandwidth.
> 
> Just FYI: I have run on KNL systems with no DRAM, only MCDRAM.  This was on 
> an internal lab machine and not a commercially available system, but I see no 
> reason why one couldn't buy systems this way.
> 
> 
> It puts quite a bit of pressure on the system software footprint if one does 
> not have DDR4.  Blue Gene CNK had a very small memory footprint, but 
> commodity Linux is certainly much larger.  The memory footprint of MPI at 
> scale depends on the fabric HW/SW.  It's probably cheaper to buy one 32 GB 
> stick per node than pay for someone to write the system software that gives 
> you at least 15.5 GB of usable MCDRAM.
> 
   Sure.

But this doesn't help with the question of whether the DRAM uses 
significant power. If it does use significant power, then it might make sense
to be able to turn off most of the DRAM via software when running PETSc
programs. If it doesn't use significant power, then it doesn't matter if we
don't use it.

  Barry

> Jeff
>  
> -- 
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/



Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-08 Thread Jeff Hammond
On Fri, Jul 8, 2016 at 9:48 AM, Richard Mills 
wrote:

>
>
> On Fri, Jul 8, 2016 at 9:40 AM, Jeff Hammond 
> wrote:
>
>>
>>> > 1) How do we run at bandwidth peak on new architectures like Cori or
>>> Aurora?
>>>
>>>   Huh, there is a how here, not a why?
>>> >
>>> > Patrick and Rich have good suggestions here. Karl and Rich showed some
>>> promising numbers for KNL at the PETSc meeting.
>>> >
>>> >
>>> > Future systems from multiple vendors basically move from 2-tier memory
>>> hierarchy of shared LLC and DRAM to a 3-tier hierarchy of fast memory (e.g.
>>> HBM), regular memory (e.g. DRAM), and slow (likely nonvolatile) memory  on
>>> a node.
>>>
>>>   Jeff,
>>>
>>>    Would Intel sell me a system that had essentially no regular memory
>>> (DRAM, which is too slow anyway) and no slow memory (which is absurdly too
>>> slow)?  What cost savings would I get in $ and power usage compared to, say,
>>> what is going into Theta? 10% and 20%, 5% and 30%, 5% and 5%? If it is a
>>> significant savings, then get the cut-down machine; if it is insignificant,
>>> then realize that the cost of not using it (the DRAM you paid so little for) is
>>> insignificant and not worth worrying about, just like cruise control when
>>> you don't use the highway. Actually, I could use the DRAM to store the
>>> history needed for the adjoints, so maybe it is OK to keep, but it is surely not
>>> useful for data that is continuously involved in the computation.
>>>
>>
>> *Disclaimer: All of the following data is pulled off of the Internet,
>> which in some cases is horribly unreliable.  My comments are strictly for
>> academic discussion and not meant to be authoritative or have any influence
>> on purchasing or design decisions.  Do not equate quoted TDP to measured
>> power during any workload, or assume that different measurements can be
>> compared directly.*
>>
>> Your thinking is in line with
>> http://www.nextplatform.com/2015/08/03/future-systems-intel-ponders-breaking-up-the-cpu/...
>>
>> Intel sells KNL packages as parts (
>> http://ark.intel.com/products/family/92650/Intel-Xeon-Phi-Product-Family-x200#@Server)
>> that don't have any DRAM in them, just MCDRAM.  It's the decision of the
>> integrator what goes into the system, which of course is correlated to what
>> the intended customer wants.  While you might not need a node with DRAM,
>> many users do, and the systems that DOE buys are designed to meet the needs
>> of their broad user base.
>>
>> I don't know if KNL is bootable without any DRAM at all - this is likely
>> more to do with what motherboard, BIOS, etc. expect than the processor
>> package itself.  However, the KNL alltoall mode addresses the case where
>> DRAM channels are underpopulated (with fully populated channels, one should
>> use quadrant, hemisphere, SNC-2 or SNC-4), so if DRAM is necessary, you
>> should be able to boot it with only one channel populated.  Of course, if
>> you do this, you'll get 1/6 of the DDR4 bandwidth.
>>
>
> Just FYI: I have run on KNL systems with no DRAM, only MCDRAM.  This was
> on an internal lab machine and not a commercially available system, but I
> see no reason why one couldn't buy systems this way.
>
>
It puts quite a bit of pressure on the system software footprint if one
does not have DDR4.  Blue Gene CNK had a very small memory footprint, but
commodity Linux is certainly much larger.  The memory footprint of MPI at
scale depends on the fabric HW/SW.  It's probably cheaper to buy one 32 GB
stick per node than pay for someone to write the system software that gives
you at least 15.5 GB of usable MCDRAM.

Jeff

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-08 Thread Richard Mills
On Fri, Jul 8, 2016 at 9:40 AM, Jeff Hammond  wrote:

>
>> > 1) How do we run at bandwidth peak on new architectures like Cori or
>> Aurora?
>>
>>   Huh, there is a how here, not a why?
>> >
>> > Patrick and Rich have good suggestions here. Karl and Rich showed some
>> promising numbers for KNL at the PETSc meeting.
>> >
>> >
>> > Future systems from multiple vendors basically move from 2-tier memory
>> hierarchy of shared LLC and DRAM to a 3-tier hierarchy of fast memory (e.g.
>> HBM), regular memory (e.g. DRAM), and slow (likely nonvolatile) memory  on
>> a node.
>>
>>   Jeff,
>>
>>    Would Intel sell me a system that had essentially no regular memory
>> (DRAM, which is too slow anyway) and no slow memory (which is absurdly too
>> slow)?  What cost savings would I get in $ and power usage compared to, say,
>> what is going into Theta? 10% and 20%, 5% and 30%, 5% and 5%? If it is a
>> significant savings, then get the cut-down machine; if it is insignificant,
>> then realize that the cost of not using it (the DRAM you paid so little for) is
>> insignificant and not worth worrying about, just like cruise control when
>> you don't use the highway. Actually, I could use the DRAM to store the
>> history needed for the adjoints, so maybe it is OK to keep, but it is surely not
>> useful for data that is continuously involved in the computation.
>>
>
> *Disclaimer: All of the following data is pulled off of the Internet,
> which in some cases is horribly unreliable.  My comments are strictly for
> academic discussion and not meant to be authoritative or have any influence
> on purchasing or design decisions.  Do not equate quoted TDP to measured
> power during any workload, or assume that different measurements can be
> compared directly.*
>
> Your thinking is in line with
> http://www.nextplatform.com/2015/08/03/future-systems-intel-ponders-breaking-up-the-cpu/...
>
> Intel sells KNL packages as parts (
> http://ark.intel.com/products/family/92650/Intel-Xeon-Phi-Product-Family-x200#@Server)
> that don't have any DRAM in them, just MCDRAM.  It's the decision of the
> integrator what goes into the system, which of course is correlated to what
> the intended customer wants.  While you might not need a node with DRAM,
> many users do, and the systems that DOE buys are designed to meet the needs
> of their broad user base.
>
> I don't know if KNL is bootable without any DRAM at all - this is likely
> more to do with what motherboard, BIOS, etc. expect than the processor
> package itself.  However, the KNL alltoall mode addresses the case where
> DRAM channels are underpopulated (with fully populated channels, one should
> use quadrant, hemisphere, SNC-2 or SNC-4), so if DRAM is necessary, you
> should be able to boot it with only one channel populated.  Of course, if
> you do this, you'll get 1/6 of the DDR4 bandwidth.
>

Just FYI: I have run on KNL systems with no DRAM, only MCDRAM.  This was on
an internal lab machine and not a commercially available system, but I see
no reason why one couldn't buy systems this way.

--Richard


> As to the question of DRAM power, there is a lot of detailed information
> available (e.g.
> https://www.micron.com/~/media/documents/products/power-calculator/ddr4_power_calc.xlsm,
>
> https://www.micron.com/~/media/Documents/Products/Technical%20Note/DRAM/TN4603.pdf,
> https://lenovopress.com/lp0083.pdf) but since I am lazy, I'll use the
> numbers reported on
> http://www.tomshardware.com/reviews/intel-core-i7-5960x-haswell-e-cpu,3918-13.html
> for client memory (i.e. not server memory, hence probably not providing
> ECC, but ECC doesn't change power consumption much), which works out to
> 0.37 W/GB for DDR4-2133, hence 71 W for 192 GB [
> http://www.nextplatform.com/2015/11/30/inside-future-knights-landing-xeon-phi-systems/].
> That 71W is ~1/3 of the processor package power (215W).  The network
> adapter draws some power, and the cables and switches (especially optics)
> are a nontrivial power draw.  So DRAM is at most 25% of the node power, and
> perhaps ~17% of system power based upon what I can derive from Shaheen II.
>
> Shaheen II Cray XC40
> 1.96 MW = 6174 * (2 sockets * 135 W/socket + 128 GB * 0.37 W/GB)
> 2.83 MW total
> = 69% from CPU+DRAM
>
> Again, *these are not the exact numbers* but what I can derive from
> https://www.top500.org/system/178515,
> https://www.hpc.kaust.edu.sa/content/shaheen-ii and
> http://ark.intel.com/products/81060/Intel-Xeon-Processor-E5-2698-v3-40M-Cache-2_30-GHz
> .
>
> Back to the higher level analysis, what is unfortunate about DRAM is that
> it needs power to hold data even if the data isn't used, because it is not
> persistent.  I don't know how well it powers down when the physical memory
> isn't mapped but it seems that power is not gated today [
> http://digitalpiglet.org/research/sion2014socc.pdf].  The advantage of
> nonvolatile memory is that it doesn't require power when not being
> accessed, whether or not the data is preserved.

Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-08 Thread Jeff Hammond
>
>
> > 1) How do we run at bandwidth peak on new architectures like Cori or
> Aurora?
>
>   Huh, there is a how here, not a why?
> >
> > Patrick and Rich have good suggestions here. Karl and Rich showed some
> promising numbers for KNL at the PETSc meeting.
> >
> >
> > Future systems from multiple vendors basically move from 2-tier memory
> hierarchy of shared LLC and DRAM to a 3-tier hierarchy of fast memory (e.g.
> HBM), regular memory (e.g. DRAM), and slow (likely nonvolatile) memory  on
> a node.
>
>   Jeff,
>
>    Would Intel sell me a system that had essentially no regular memory
> (DRAM, which is too slow anyway) and no slow memory (which is absurdly too
> slow)?  What cost savings would I get in $ and power usage compared to, say,
> what is going into Theta? 10% and 20%, 5% and 30%, 5% and 5%? If it is a
> significant savings, then get the cut-down machine; if it is insignificant,
> then realize that the cost of not using it (the DRAM you paid so little for) is
> insignificant and not worth worrying about, just like cruise control when
> you don't use the highway. Actually, I could use the DRAM to store the
> history needed for the adjoints, so maybe it is OK to keep, but it is surely not
> useful for data that is continuously involved in the computation.
>

*Disclaimer: All of the following data is pulled off of the Internet, which
in some cases is horribly unreliable.  My comments are strictly for
academic discussion and not meant to be authoritative or have any influence
on purchasing or design decisions.  Do not equate quoted TDP to measured
power during any workload, or assume that different measurements can be
compared directly.*

Your thinking is in line with
http://www.nextplatform.com/2015/08/03/future-systems-intel-ponders-breaking-up-the-cpu/...

Intel sells KNL packages as parts (
http://ark.intel.com/products/family/92650/Intel-Xeon-Phi-Product-Family-x200#@Server)
that don't have any DRAM in them, just MCDRAM.  It's the decision of the
integrator what goes into the system, which of course is correlated to what
the intended customer wants.  While you might not need a node with DRAM,
many users do, and the systems that DOE buys are designed to meet the needs
of their broad user base.

I don't know if KNL is bootable without any DRAM at all - this is likely
more to do with what motherboard, BIOS, etc. expect than the processor
package itself.  However, the KNL alltoall mode addresses the case where
DRAM channels are underpopulated (with fully populated channels, one should
use quadrant, hemisphere, SNC-2 or SNC-4), so if DRAM is necessary, you
should be able to boot it with only one channel populated.  Of course, if
you do this, you'll get 1/6 of the DDR4 bandwidth.

As to the question of DRAM power, there is a lot of detailed information
available (e.g.
https://www.micron.com/~/media/documents/products/power-calculator/ddr4_power_calc.xlsm,
https://www.micron.com/~/media/Documents/Products/Technical%20Note/DRAM/TN4603.pdf,
https://lenovopress.com/lp0083.pdf) but since I am lazy, I'll use the
numbers reported on
http://www.tomshardware.com/reviews/intel-core-i7-5960x-haswell-e-cpu,3918-13.html
for client memory (i.e. not server memory, hence probably not providing
ECC, but ECC doesn't change power consumption much), which works out to
0.37 W/GB for DDR4-2133, hence 71 W for 192 GB [
http://www.nextplatform.com/2015/11/30/inside-future-knights-landing-xeon-phi-systems/].
That 71W is ~1/3 of the processor package power (215W).  The network
adapter draws some power, and the cables and switches (especially optics)
are a nontrivial power draw.  So DRAM is at most 25% of the node power, and
perhaps ~17% of system power based upon what I can derive from Shaheen II.

Shaheen II Cray XC40
1.96 MW = 6174 * (2 sockets * 135 W/socket + 128 GB * 0.37 W/GB)
2.83 MW total
= 69% from CPU+DRAM

Again, *these are not the exact numbers* but what I can derive from
https://www.top500.org/system/178515,
https://www.hpc.kaust.edu.sa/content/shaheen-ii and
http://ark.intel.com/products/81060/Intel-Xeon-Processor-E5-2698-v3-40M-Cache-2_30-GHz
.
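
For anyone who wants to sanity-check the arithmetic, here is a throwaway
program that just redoes the estimate above (all inputs are the rough figures
quoted in this thread, not measurements):

#include <stdio.h>

int main(void)
{
  /* All inputs are the rough estimates quoted above, not measurements. */
  const double w_per_gb    = 0.37;   /* DDR4-2133, client parts         */
  const double socket_tdp  = 135.0;  /* Xeon E5-2698 v3 TDP, in W       */
  const double gb_per_node = 128.0;  /* Shaheen II memory per node      */
  const int    nodes       = 6174;
  const double total_mw    = 2.83;   /* reported total system power, MW */

  double dram_w  = gb_per_node * w_per_gb;      /* ~47 W per node  */
  double node_w  = 2.0 * socket_tdp + dram_w;   /* ~317 W per node */
  double cpudram = nodes * node_w / 1.0e6;      /* ~1.96 MW        */

  printf("DRAM: %.0f W/node (%.0f%% of CPU+DRAM)\n",
         dram_w, 100.0 * dram_w / node_w);
  printf("CPU+DRAM: %.2f MW = %.0f%% of %.2f MW total\n",
         cpudram, 100.0 * cpudram / total_mw, total_mw);
  return 0;
}

That reproduces the ~69% figure above; the DRAM piece alone is ~47 W of the
~317 W per node, i.e. roughly 15% of CPU+DRAM, consistent with the "at most
25% of node power" bound.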

Back to the higher level analysis, what is unfortunate about DRAM is that
it needs power to hold data even if the data isn't used, because it is not
persistent.  I don't know how well it powers down when the physical memory
isn't mapped but it seems that power is not gated today [
http://digitalpiglet.org/research/sion2014socc.pdf].  The advantage of
nonvolatile memory is that it doesn't require power when not being
accessed, whether or not the data is preserved.

I suspect that nonvolatile memory (NVM) is the right place to put your
adjoint matrices, provided the NVM bandwidth is sufficient.

*Disclaimer: All of these are academic comments.  Do not use them to try to
influence others or make any decisions.  Do your own research and be
skeptical of everything I derived from the Internet.*

Jeff

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-08 Thread Mark Adams
>
>
>>
> The why is "We need to run at bandwidth peak on new arches". I do not
> prescribe the How, just ask for it.
>
>
Be careful about specifying an optimization parameter unless it is really
what you want.  E.g., maximizing arithmetic intensity will lead you
to Cayley–Hamilton inversion, and minimizing ("avoiding") network
communication can lead to some funny algorithms if you are not careful.
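
(To spell the Cayley–Hamilton remark out: if p(t) = t^n + c_{n-1} t^{n-1} +
... + c_1 t + c_0 is the characteristic polynomial of an n x n matrix A, then
p(A) = 0, so for nonsingular A

    A^{-1} = -(1/c_0) (A^{n-1} + c_{n-1} A^{n-2} + ... + c_1 I).

The inverse becomes a chain of matrix-matrix products -- spectacular
arithmetic intensity, and a spectacularly bad way to solve linear systems.)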

Now, I suspect that maximizing bandwidth might not be gameable, because that
is where Sam Williams ends up -- with large memory -- with HPGMG-FV, which
is a matrix-free, 4th-order-accurate, finite-volume multigrid solver for the
3D Laplacian (with non-constant coefficients). (While I hesitate to use one
person's experience as an implied "proof", Sam is very thorough and
honest.)  And keeping the memory bus saturated over the next 10 years may not
be achievable even for Sam.

But in the space of equation solvers that are friendly to emerging
architectures, you are latency-constrained in practical -- not large-memory --
regimes.  That said, large memory is a good place to start: it is a baseline, and
easier to think about and achieve. Still, I wince when I see a goal that
only implies good performance.


Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-07 Thread Richard Mills
On Thu, Jul 7, 2016 at 5:06 PM, Jeff Hammond  wrote:

>
>
> On Thu, Jul 7, 2016 at 4:34 PM, Richard Mills 
> wrote:
>
>> On Fri, Jul 1, 2016 at 4:13 PM, Jeff Hammond 
>> wrote:
>>
>>> [...]
>>>
>>> Maybe I am just biased because I spend all of my time reading
>>> www.nextplatform.com, but I hear machine learning is becoming an
>>> important HPC workload.  While the most hyped efforts relate to running
>>> inaccurate - the technical term is half-precision - dense matrix
>>> multiplication as fast as possible, I suspect that more elegant approaches
>>> will prevail.  Presumably there is something that PETSc can do to enable
>>> machine learning algorithms.  As most of the existing approaches use silly
>>> programming models based on MapReduce, it can't be too hard for PETSc to do
>>> better.
>>>
>>
>> "Machine learning" is definitely the hype du jour, but when that term
>> gets thrown around, everyone is equating it with neural networks with a lot
>> of layers ("deep learning").  That's why everyone is going on about half
>> precision dense matrix multiplication, as low accuracy works fine for some
>> of this stuff.  The thing is, there are a ton of machine-learning
>> approaches out there that are NOT neural networks, and I worry that
>> everyone is too ready to jump into specialized hardware for neural nets
>> when maybe there are better approaches out there.  Regarding machine
>> learning approaches that use sparse matrix methods, I think that PETSc
>> (plus SLEPc) provide pretty good building blocks right now for these,
>> though there are probably things that could be better supported.  But what
>> machine learning approaches PETSc should target right now, I don't know.
>> Program managers currently like terms like "neuromorphic computing" and
>> half-precision computations seem to be the focus.  (Though why stop there?
>> Why not quarter precision?!!)
>>
>>
> Google TPU does quarter precision i.e. 8-bit fixed-point [
> http://www.nextplatform.com/2016/05/19/google-takes-unconventional-route-homegrown-machine-learning-chips/],
> so the machine learning folks have already gone there.  No need to
> speculate about it :-)
>

How wonderfully retro!  I remember doing stuff like this for 3D graphics,
back in the day when floating point was way too expensive, so we had to do
it with fixed point calculations.  I guess I'm getting pretty old in
computing years...

--Richard


>
> Jeff
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>


Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-07 Thread Barry Smith

> On Jul 7, 2016, at 7:05 PM, Jeff Hammond  wrote:
> 
> 
> 
> On Thu, Jul 7, 2016 at 1:04 PM, Matthew Knepley  wrote:
> On Fri, Jul 1, 2016 at 4:32 PM, Barry Smith  wrote:
> 
>The DOE SciDAC institutes have supported PETSc linear solver research/code 
> development for the past fifteen years.
> 
> This email is to solicit ideas for linear solver research/code 
> development work for the next round of SciDAC institutes (which will be a 4 
> year period) in PETSc. Please send me any ideas, no matter how crazy, on 
> things you feel are missing, broken, or incomplete in PETSc with regard to 
> linear solvers that we should propose to work on. In particular, issues 
> coming from particular classes of applications would be good. Generic "multi 
> physics" coupling types of things are too general (and old :-)) while  work 
> for extreme large scale is also out since that is covered under another call 
> (ECP). But particular types of optimizations etc for existing or new codes 
> could be in, just not for the very large scale.
> 
> Rough ideas and pointers to publications are all useful. There is an 
> extremely short fuse so the sooner the better,
> 
> I think the suggestions so far are fine, however they all seem to start at 
> the "how", whereas I would prefer we start at the "why". Maybe something like
> 
> 1) How do we run at bandwidth peak on new architectures like Cori or Aurora?

  Huh, there is a how here, not a why?
> 
> Patrick and Rich have good suggestions here. Karl and Rich showed some 
> promising numbers for KNL at the PETSc meeting.
> 
> 
> Future systems from multiple vendors basically move from 2-tier memory 
> hierarchy of shared LLC and DRAM to a 3-tier hierarchy of fast memory (e.g. 
> HBM), regular memory (e.g. DRAM), and slow (likely nonvolatile) memory  on a 
> node.  

  Jeff,

   Would Intel sell me a system that had essentially no regular memory (DRAM,
which is too slow anyway) and no slow memory (which is absurdly too slow)?
What cost savings would I get in $ and power usage compared to, say, what is
going into Theta? 10% and 20%, 5% and 30%, 5% and 5%? If it is a significant
savings, then get the cut-down machine; if it is insignificant, then realize that the
cost of not using it (the DRAM you paid so little for) is insignificant and not
worth worrying about, just like cruise control when you don't use the highway.
Actually, I could use the DRAM to store the history needed for the adjoints, so
maybe it is OK to keep, but it is surely not useful for data that is continuously
involved in the computation.

   Barry



   
> Xeon Phi and some GPUs have caches, but it is unclear to me if it actually 
> benefits software like PETSc to consider them.  Figuring out how to run PETSc 
> effectively on KNL should be generally useful...
> 
> Jeff
> 
> -- 
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/



Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-07 Thread Barry Smith

> On Jul 7, 2016, at 7:06 PM, Jeff Hammond  wrote:
> 
> 
> 
> On Thu, Jul 7, 2016 at 4:34 PM, Richard Mills  wrote:
> On Fri, Jul 1, 2016 at 4:13 PM, Jeff Hammond  wrote:
> [...]
> 
> Maybe I am just biased because I spend all of my time reading 
> www.nextplatform.com, but I hear machine learning is becoming an important 
> HPC workload.  While the most hyped efforts relate to running inaccurate -
> the technical term is half-precision - dense matrix multiplication as fast as 
> possible, I suspect that more elegant approaches will prevail.  Presumably 
> there is something that PETSc can do to enable machine learning algorithms.  
> As most of the existing approaches use silly programming models based on 
> MapReduce, it can't be too hard for PETSc to do better.
> 
> "Machine learning" is definitely the hype du jour, but when that term gets 
> thrown around, everyone is equating it with neural networks with a lot of 
> layers ("deep learning").  That's why everyone is going on about half 
> precision dense matrix multiplication, as low accuracy works fine for some of 
> this stuff.  The thing is, there are a ton of machine-learning approaches
> out there that are NOT neural networks, and I worry that everyone is too 
> ready to jump into specialized hardware for neural nets when maybe there are 
> better approaches out there.  Regarding machine learning approaches that use 
> sparse matrix methods, I think that PETSc (plus SLEPc) provide pretty good 
> building blocks right now for these, though there are probably things that 
> could be better supported.  But what machine learning approaches PETSc should 
> target right now, I don't know.  Program managers currently like terms like 
> "neuromorphic computing"

  It may be as much, or even more, the idiots who talk to program managers who
like "neuromorphic computing".


> and half-precision computations seem to be the focus.  (Though why stop 
> there?  Why not quarter precision?!!)
> 
> 
> Google TPU does quarter precision i.e. 8-bit fixed-point 
> [http://www.nextplatform.com/2016/05/19/google-takes-unconventional-route-homegrown-machine-learning-chips/],
>  so the machine learning folks have already gone there.  No need to speculate 
> about it :-)
> 
> Jeff
> 
> -- 
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/



Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-07 Thread Jeff Hammond
On Thu, Jul 7, 2016 at 4:34 PM, Richard Mills 
wrote:

> On Fri, Jul 1, 2016 at 4:13 PM, Jeff Hammond 
> wrote:
>
>> [...]
>>
>> Maybe I am just biased because I spend all of my time reading
>> www.nextplatform.com, but I hear machine learning is becoming an
>> important HPC workload.  While the most hyped efforts relate to running
>> inaccurate - the technical term is half-precision - dense matrix
>> multiplication as fast as possible, I suspect that more elegant approaches
>> will prevail.  Presumably there is something that PETSc can do to enable
>> machine learning algorithms.  As most of the existing approaches use silly
>> programming models based on MapReduce, it can't be too hard for PETSc to do
>> better.
>>
>
> "Machine learning" is definitely the hype du jour, but when that term gets
> thrown around, everyone is equating it with neural networks with a lot of
> layers ("deep learning").  That's why everyone is going on about half
> precision dense matrix multiplication, as low accuracy works fine for some
> of this stuff.  The thing is, there are a ton of machine-learning
> approaches out there that are NOT neural networks, and I worry that
> everyone is too ready to jump into specialized hardware for neural nets
> when maybe there are better approaches out there.  Regarding machine
> learning approaches that use sparse matrix methods, I think that PETSc
> (plus SLEPc) provide pretty good building blocks right now for these,
> though there are probably things that could be better supported.  But what
> machine learning approaches PETSc should target right now, I don't know.
> Program managers currently like terms like "neuromorphic computing" and
> half-precision computations seem to be the focus.  (Though why stop there?
> Why not quarter precision?!!)
>
>
Google TPU does quarter precision i.e. 8-bit fixed-point [
http://www.nextplatform.com/2016/05/19/google-takes-unconventional-route-homegrown-machine-learning-chips/],
so the machine learning folks have already gone there.  No need to
speculate about it :-)

Jeff

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-07 Thread Jeff Hammond
On Thu, Jul 7, 2016 at 1:04 PM, Matthew Knepley  wrote:

> On Fri, Jul 1, 2016 at 4:32 PM, Barry Smith  wrote:
>
>>
>>The DOE SciDAC institutes have supported PETSc linear solver
>> research/code development for the past fifteen years.
>>
>> This email is to solicit ideas for linear solver research/code
>> development work for the next round of SciDAC institutes (which will be a 4
>> year period) in PETSc. Please send me any ideas, no matter how crazy, on
>> things you feel are missing, broken, or incomplete in PETSc with regard to
>> linear solvers that we should propose to work on. In particular, issues
>> coming from particular classes of applications would be good. Generic
>> "multi physics" coupling types of things are too general (and old :-))
>> while  work for extreme large scale is also out since that is covered under
>> another call (ECP). But particular types of optimizations etc for existing
>> or new codes could be in, just not for the very large scale.
>>
>> Rough ideas and pointers to publications are all useful. There is an
>> extremely short fuse so the sooner the better,
>>
>
> I think the suggestions so far are fine, however they all seem to start at
> the "how", whereas I would prefer we start at the "why". Maybe something
> like
>
> 1) How do we run at bandwidth peak on new architectures like Cori or
> Aurora?
>
> Patrick and Rich have good suggestions here. Karl and Rich showed some
> promising numbers for KNL at the PETSc meeting.
>
>
Future systems from multiple vendors basically move from a 2-tier memory
hierarchy of shared LLC and DRAM to a 3-tier hierarchy of fast memory (e.g.
HBM), regular memory (e.g. DRAM), and slow (likely nonvolatile) memory on
a node.  Xeon Phi and some GPUs have caches, but it is unclear to me if it
actually benefits software like PETSc to consider them.  Figuring out how
to run PETSc effectively on KNL should be generally useful...
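
For the fast tier specifically, one concrete handle that already exists is the
memkind library's hbwmalloc interface. Here is a minimal sketch, assuming a
flat-mode KNL node with memkind installed (link with -lmemkind); this is an
illustration, not a statement about how PETSc does or should manage its
allocations:

#include <hbwmalloc.h>
#include <stdio.h>
#include <stdlib.h>

int main(void)
{
  size_t  n = 1 << 24;               /* 16M doubles = 128 MB */
  double *x;
  int     have_hbw = (hbw_check_available() == 0);

  /* Prefer MCDRAM if it is exposed as allocatable memory; fall back to DDR */
  x = have_hbw ? hbw_malloc(n * sizeof(double)) : malloc(n * sizeof(double));
  if (!x) return 1;

  for (size_t i = 0; i < n; i++) x[i] = 1.0;   /* first touch, stream through it */
  printf("allocated %zu MB in %s\n", (n * sizeof(double)) >> 20,
         have_hbw ? "MCDRAM" : "DDR");

  if (have_hbw) hbw_free(x); else free(x);
  return 0;
}

On a flat-mode node the zero-code-change baseline is simply to run the
unmodified binary under numactl with --membind pointed at the MCDRAM NUMA
node and compare.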

Jeff

-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/


Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-07 Thread Richard Mills
On Fri, Jul 1, 2016 at 4:13 PM, Jeff Hammond  wrote:

> [...]
>
> Maybe I am just biased because I spend all of my time reading
> www.nextplatform.com, but I hear machine learning is becoming an
> important HPC workload.  While the most hyped efforts relate to running
> inaccurate - the technical term is half-precision - dense matrix
> multiplication as fast as possible, I suspect that more elegant approaches
> will prevail.  Presumably there is something that PETSc can do to enable
> machine learning algorithms.  As most of the existing approaches use silly
> programming models based on MapReduce, it can't be too hard for PETSc to do
> better.
>

"Machine learning" is definitely the hype du jour, but when that term gets
thrown around, everyone is equating it with neural networks with a lot of
layers ("deep learning").  That's why everyone is going on about half
precision dense matrix multiplication, as low accuracy works fine for some
of this stuff.  The thing is, there are a ton of machine-learning
approaches out there that are NOT neural networks, and I worry that
everyone is too ready to jump into specialized hardware for neural nets
when maybe there are better approaches out there.  Regarding machine
learning approaches that use sparse matrix methods, I think that PETSc
(plus SLEPc) provide pretty good building blocks right now for these,
though there are probably things that could be better supported.  But what
machine learning approaches PETSc should target right now, I don't know.
Program managers currently like terms like "neuromorphic computing" and
half-precision computations seem to be the focus.  (Though why stop there?
Why not quarter precision?!!)

--Richard


> Jeff
>
> On Fri, Jul 1, 2016 at 2:32 PM, Barry Smith  wrote:
>
>>
>>The DOE SciDAC institutes have supported PETSc linear solver
>> research/code development for the past fifteen years.
>>
>> This email is to solicit ideas for linear solver research/code
>> development work for the next round of SciDAC institutes (which will be a 4
>> year period) in PETSc. Please send me any ideas, no matter how crazy, on
>> things you feel are missing, broken, or incomplete in PETSc with regard to
>> linear solvers that we should propose to work on. In particular, issues
>> coming from particular classes of applications would be good. Generic
>> "multi physics" coupling types of things are too general (and old :-))
>> while  work for extreme large scale is also out since that is covered under
>> another call (ECP). But particular types of optimizations etc for existing
>> or new codes could be in, just not for the very large scale.
>>
>> Rough ideas and pointers to publications are all useful. There is an
>> extremely short fuse so the sooner the better,
>>
>> Thanks
>>
>>   Barry
>>
>>
>>
>>
>
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/
>


Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-03 Thread Stefano Zampini
What about recycling of Krylov subspaces and improved support for initial 
guesses (already there, maybe we can add other low-order methods)?
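
For the initial-guess part that is already there, the hook is
KSPSetInitialGuessNonzero; a minimal sequential sketch (current C API, with a
1D Laplacian as a stand-in for a sequence of slowly varying solves):

static char help[] = "Reusing the previous solution as the initial guess.\n";
#include <petscksp.h>

int main(int argc, char **argv)
{
  Mat            A;
  Vec            x, b;
  KSP            ksp;
  PetscInt       i, n = 100;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, help);if (ierr) return ierr;
  ierr = MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 3, NULL, &A);CHKERRQ(ierr);
  for (i = 0; i < n; i++) {                       /* 1D Laplacian stencil */
    if (i > 0)     { ierr = MatSetValue(A, i, i-1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    if (i < n - 1) { ierr = MatSetValue(A, i, i+1, -1.0, INSERT_VALUES);CHKERRQ(ierr); }
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatCreateVecs(A, &x, &b);CHKERRQ(ierr);
  ierr = VecSet(b, 1.0);CHKERRQ(ierr);
  ierr = VecSet(x, 0.0);CHKERRQ(ierr);

  ierr = KSPCreate(PETSC_COMM_SELF, &ksp);CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A);CHKERRQ(ierr);
  /* Keep whatever is in x as the starting guess instead of zeroing it */
  ierr = KSPSetInitialGuessNonzero(ksp, PETSC_TRUE);CHKERRQ(ierr);
  ierr = KSPSetFromOptions(ksp);CHKERRQ(ierr);

  for (i = 0; i < 3; i++) {                 /* slowly varying right-hand sides */
    ierr = VecScale(b, 1.01);CHKERRQ(ierr);
    ierr = KSPSolve(ksp, b, x);CHKERRQ(ierr);  /* previous x seeds this solve */
  }

  ierr = KSPDestroy(&ksp);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = VecDestroy(&b);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}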

On Jul 2, 2016, at 10:44 PM, Hong  wrote:

> Efficient and scalable MatGetSubMatrix() and assembly of submatrices into a
> matrix -- used for multiphysics simulation, e.g., the wash project we are doing
> now.
> 
> Hong
> 
> On Sat, Jul 2, 2016 at 2:46 AM, Patrick Sanan  wrote:
> Maybe a general class of opportunities for PETSc could be wrapped up
> under "good abstractions for memory space awareness". That is, all
> these topics have been discussed recently:
> 
> 1.  Re Jeff's excellent suggestion about OpenMP, it would be nice if
> more were done to promote the alternative, MPI-based shared memory
> capability. It's quite clear that this is a good way to go, but this
> doesn't mean that people (who often equate shared memory parallelism
> with threads) will use it, so the more that can be done to make the
> internals of the library use this approach and the more that the
> classes (DM in particular) can make it easy to do things "the right
> way", the better.
> 
> 2. As we saw from Richard's talk on KNL, that device will feature
> (when used a certain way) one NUMA domain which can provide much
> greater memory bandwidth. As in the discussions here before on the
> topic, it's a real challenge to figure out how to make something
> like this usable via PETSc, and an even greater one to pick defaults
> in a way that won't sometimes produce bewildering performance results.
> Nevertheless, there's a good chance this is the kind of hardware that
> people will be running PETSc on, and given that this is something
> which attacks memory bandwidth bottlenecks, something that could speed
> up a lot of PETSc code.
> 
> 3. People will likely also be running on machines with
> coprocessors/accelerators/etc for a while, and there needs to be
> plumbing to deal with the fact that each device may provide an extra
> memory space which needs to be managed properly in the flat MPI
> environment. This is of course related to point 1, as good use of
> MPI-3 shared memory seems like a reasonable way forward.
> 
> 4. Related topics re MPI-4/endpoints and how PETSc really could be
> used properly from an existing multi-threaded environment. Karl has
> talked about the challenges of doing this the right way a lot
> recently.
> 
> With all of these, introducing some clever abstractions to allow these
> sorts of things to be as transparent/automatic/encapsulated as
> possible might be very valuable and well-used additions to the
> library.
> 
> On Sat, Jul 2, 2016 at 1:13 AM, Jeff Hammond  wrote:
> > Obviously, holistic support for OpenMP is critical to the future of PETSc
> > :-D
> >
> > On a more serious note, Matt and I have discussed the use of PETSc for
> > sparse multidimensional array computations for dimensions greater than 2,
> > also known as tensor computations. The associated paper describing previous
> > work with dense arrays is
> > http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-210.html.  There
> > was even an unsuccessful SciDAC application proposal that described how
> > PETSc could be used for that domain when sparsity is important.  To start,
> > all we'd need is sparse matrix x sparse matrix multiplication, which I hear
> > the multigrid folks also need.  Sparse times dense is also important.
> > Sparse tensor factorization would also help, but I get that there are enough
> > open math questions there that it might be impractical to try to implement
> > something in PETSc in the near future.
> >
> > Maybe I am just biased because I spend all of my time reading
> > www.nextplatform.com, but I hear machine learning is becoming an important
> > HPC workload.  While the most hyped efforts relate to running inaccurate -
> > the technical term is half-precision - dense matrix multiplication as fast
> > as possible, I suspect that more elegant approaches will prevail.
> > Presumably there is something that PETSc can do to enable machine learning
> > algorithms.  As most of the existing approaches use silly programming models
> > based on MapReduce, it can't be too hard for PETSc to do better.
> >
> > Jeff
> >
> > On Fri, Jul 1, 2016 at 2:32 PM, Barry Smith  wrote:
> >>
> >>
> >>The DOE SciDAC institutes have supported PETSc linear solver
> >> research/code development for the past fifteen years.
> >>
> >> This email is to solicit ideas for linear solver research/code
> >> development work for the next round of SciDAC institutes (which will be a 4
> >> year period) in PETSc. Please send me any ideas, no matter how crazy, on
> >> things you feel are missing, broken, or incomplete in PETSc with regard to
> >> linear solvers that we should propose to work on. In particular, issues
> >> coming from particular classes of applications would be good. Generic 
> >> "multi
> >> 

Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-02 Thread Hong
Efficient and scalable MatGetSubMatrix() and assembly of submatrices into a
matrix -- used for multiphysics simulation, e.g., the wash project we are doing
now.
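
For reference, the entry point being discussed is roughly the following (a
minimal sequential sketch against the current C API; the painful case is of
course the parallel one with many submatrices, which this toy example does
not touch):

static char help[] = "Extracting a submatrix with MatGetSubMatrix.\n";
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, Asub;
  IS             rows, cols;
  PetscInt       i, n = 10;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, help);if (ierr) return ierr;
  ierr = MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 1, NULL, &A);CHKERRQ(ierr);
  for (i = 0; i < n; i++) {                     /* A = diag(0, 1, ..., n-1) */
    ierr = MatSetValue(A, i, i, (PetscScalar)i, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* Take the 5x5 principal block: rows and columns 0..4 */
  ierr = ISCreateStride(PETSC_COMM_SELF, 5, 0, 1, &rows);CHKERRQ(ierr);
  ierr = ISCreateStride(PETSC_COMM_SELF, 5, 0, 1, &cols);CHKERRQ(ierr);
  ierr = MatGetSubMatrix(A, rows, cols, MAT_INITIAL_MATRIX, &Asub);CHKERRQ(ierr);
  ierr = MatView(Asub, PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);

  ierr = ISDestroy(&rows);CHKERRQ(ierr);
  ierr = ISDestroy(&cols);CHKERRQ(ierr);
  ierr = MatDestroy(&Asub);CHKERRQ(ierr);
  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}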

Hong

On Sat, Jul 2, 2016 at 2:46 AM, Patrick Sanan 
wrote:

> Maybe a general class of opportunities for PETSc could be wrapped up
> under "good abstractions for memory space awareness". That is, all
> these topics have been discussed recently:
>
> 1.  Re Jeff's excellent suggestion about OpenMP, it would be nice if
> more were done to promote the alternative, MPI-based shared memory
> capability. It's quite clear that this is a good way to go, but this
> doesn't mean that people (who often equate shared memory parallelism
> with threads) will use it, so the more that can be done to make the
> internals of the library use this approach and the more that the
> classes (DM in particular) can make it easy to do things "the right
> way", the better.
>
> 2. As we saw from Richard's talk on KNL, that device will feature
> (when used a certain way) one NUMA domain which can provide much
> greater memory bandwidth. As in the discussions here before on the
> topic, it's a real challenge to figure out how to make something
> like this usable via PETSc, and an even greater one to pick defaults
> in a way that won't sometimes produce bewildering performance results.
> Nevertheless, there's a good chance this is the kind of hardware that
> people will be running PETSc on, and given that this is something
> which attacks memory bandwidth bottlenecks, something that could speed
> up a lot of PETSc code.
>
> 3. People will likely also be running on machines with
> coprocessors/accelerators/etc for a while, and there needs to be
> plumbing to deal with the fact that each device may provide an extra
> memory space which needs to be managed properly in the flat MPI
> environment. This is of course related to point 1, as good use of
> MPI-3 shared memory seems like a reasonable way forward.
>
> 4. Related topics re MPI-4/endpoints and how PETSc really could be
> used properly from an existing multi-threaded environment. Karl has
> talked about the challenges of doing this the right way a lot
> recently.
>
> With all of these, introducing some clever abstractions to allow these
> sorts of things to be as transparent/automatic/encapsulated as
> possible might be very valuable and well-used additions to the
> library.
>
> On Sat, Jul 2, 2016 at 1:13 AM, Jeff Hammond 
> wrote:
> > Obviously, holistic support for OpenMP is critical to the future of PETSc
> > :-D
> >
> > On a more serious note, Matt and I have discussed the use of PETSc for
> > sparse multidimensional array computations for dimensions greater than 2,
> > also known as tensor computations. The associated paper describing
> previous
> > work with dense arrays is
> > http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-210.html.
> There
> > was even an unsuccessful SciDAC application proposal that described how
> > PETSc could be used for that domain when sparsity is important.  To
> start,
> > all we'd need is sparse matrix x sparse matrix multiplication, which I
> hear
> > the multigrid folks also need.  Sparse times dense is also important.
> > Sparse tensor factorization would also help, but I get that there are
> enough
> > open math questions there that it might be impractical to try to
> implement
> > something in PETSc in the near future.
> >
> > Maybe I am just biased because I spend all of my time reading
> > www.nextplatform.com, but I hear machine learning is becoming an
> important
> > HPC workload.  While the most hyped efforts relate to running
> inaccurate -
> > the technical term is half-precision - dense matrix multiplication as
> fast
> > as possible, I suspect that more elegant approaches will prevail.
> > Presumably there is something that PETSc can do to enable machine
> learning
> > algorithms.  As most of the existing approaches use silly programming
> models
> > based on MapReduce, it can't be too hard for PETSc to do better.
> >
> > Jeff
> >
> > On Fri, Jul 1, 2016 at 2:32 PM, Barry Smith  wrote:
> >>
> >>
> >>The DOE SciDAC institutes have supported PETSc linear solver
> >> research/code development for the past fifteen years.
> >>
> >> This email is to solicit ideas for linear solver research/code
> >> development work for the next round of SciDAC institutes (which will be
> a 4
> >> year period) in PETSc. Please send me any ideas, no matter how crazy, on
> >> things you feel are missing, broken, or incomplete in PETSc with regard
> to
> >> linear solvers that we should propose to work on. In particular, issues
> >> coming from particular classes of applications would be good. Generic
> "multi
> >> physics" coupling types of things are too general (and old :-)) while
> work
> >> for extreme large scale is also out since that is covered under another
> call
> >> (ECP). But particular types of 

Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-02 Thread Patrick Sanan
Maybe a general class of opportunities for PETSc could be wrapped up
under "good abstractions for memory space awareness". That is, all
these topics have been discussed recently:

1.  Re Jeff's excellent suggestion about OpenMP, it would be nice if
more were done to promote the alternative, MPI-based shared memory
capability. It's quite clear that this is a good way to go, but this
doesn't mean that people (who often equate shared memory parallelism
with threads) will use it, so the more that can be done to make the
internals of the library use this approach and the more that the
classes (DM in particular) can make it easy to do things "the right
way", the better. (A minimal sketch of the MPI-3 shared-memory mechanism
appears after this list.)

2. As we saw from Richard's talk on KNL, that device will feature
(when used a certain way) one NUMA domain which can provide much
greater memory bandwidth. As in the discussions here before on the
topic, it's a real challenge to figure out how to make something
like this usable via PETSc, and an even greater one to pick defaults
in a way that won't sometimes produce bewildering performance results.
Nevertheless, there's a good chance this is the kind of hardware that
people will be running PETSc on, and given that this is something
which attacks memory bandwidth bottlenecks, something that could speed
up a lot of PETSc code.

3. People will likely also be running on machines with
coprocessors/accelerators/etc for a while, and there needs to be
plumbing to deal with the fact that each device may provide an extra
memory space which needs to be managed properly in the flat MPI
environment. This is of course related to point 1, as good use of
MPI-3 shared memory seems like a reasonable way forward.

4. Related topics re MPI-4/endpoints and how PETSc really could be
used properly from an existing multi-threaded environment. Karl has
talked about the challenges of doing this the right way a lot
recently.

With all of these, introducing some clever abstractions to allow these
sorts of things to be as transparent/automatic/encapsulated as
possible could yield very valuable and well-used additions to the
library.
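
Re point 1, here is a minimal sketch of the MPI-3 shared-memory mechanism
itself (plain MPI, nothing PETSc-specific; fence synchronization is used only
to keep the example short):

#include <mpi.h>
#include <stdio.h>

/* Every rank on a node sees one shared array; node-rank 0 owns the storage. */
int main(int argc, char **argv)
{
  MPI_Comm  nodecomm;
  MPI_Win   win;
  MPI_Aint  sz;
  int       rank, noderank, disp;
  double   *base, *shared;

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);

  /* Split off the ranks that can share memory with this one */
  MPI_Comm_split_type(MPI_COMM_WORLD, MPI_COMM_TYPE_SHARED, 0,
                      MPI_INFO_NULL, &nodecomm);
  MPI_Comm_rank(nodecomm, &noderank);

  /* Node-rank 0 allocates 1024 doubles; everyone else allocates 0 bytes */
  sz = (noderank == 0) ? 1024 * sizeof(double) : 0;
  MPI_Win_allocate_shared(sz, sizeof(double), MPI_INFO_NULL, nodecomm,
                          &base, &win);

  /* Everyone asks where rank 0's segment lives in their own address space */
  MPI_Win_shared_query(win, 0, &sz, &disp, &shared);

  MPI_Win_fence(0, win);
  if (noderank == 0) shared[0] = 42.0;                 /* write once ...      */
  MPI_Win_fence(0, win);
  printf("rank %d sees shared[0] = %g\n", rank, shared[0]);  /* ... read everywhere */

  MPI_Win_free(&win);
  MPI_Comm_free(&nodecomm);
  MPI_Finalize();
  return 0;
}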

On Sat, Jul 2, 2016 at 1:13 AM, Jeff Hammond  wrote:
> Obviously, holistic support for OpenMP is critical to the future of PETSc
> :-D
>
> On a more serious note, Matt and I have discussed the use of PETSc for
> sparse multidimensional array computations for dimensions greater than 2,
> also known as tensor computations. The associated paper describing previous
> work with dense arrays is
> http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-210.html.  There
> was even an unsuccessful SciDAC application proposal that described how
> PETSc could be used for that domain when sparsity is important.  To start,
> all we'd need is sparse matrix x sparse matrix multiplication, which I hear
> the multigrid folks also need.  Sparse times dense is also important.
> Sparse tensor factorization would also help, but I get that there are enough
> open math questions there that it might be impractical to try to implement
> something in PETSc in the near future.
>
> Maybe I am just biased because I spend all of my time reading
> www.nextplatform.com, but I hear machine learning is becoming an important
> HPC workload.  While the most hyped efforts relate to running inaccurate -
> the technical term is half-precision - dense matrix multiplication as fast
> as possible, I suspect that more elegant approaches will prevail.
> Presumably there is something that PETSc can do to enable machine learning
> algorithms.  As most of the existing approaches use silly programming models
> based on MapReduce, it can't be too hard for PETSc to do better.
>
> Jeff
>
> On Fri, Jul 1, 2016 at 2:32 PM, Barry Smith  wrote:
>>
>>
>>The DOE SciDAC institutes have supported PETSc linear solver
>> research/code development for the past fifteen years.
>>
>> This email is to solicit ideas for linear solver research/code
>> development work for the next round of SciDAC institutes (which will be a 4
>> year period) in PETSc. Please send me any ideas, no matter how crazy, on
>> things you feel are missing, broken, or incomplete in PETSc with regard to
>> linear solvers that we should propose to work on. In particular, issues
>> coming from particular classes of applications would be good. Generic "multi
>> physics" coupling types of things are too general (and old :-)) while  work
>> for extreme large scale is also out since that is covered under another call
>> (ECP). But particular types of optimizations etc for existing or new codes
>> could be in, just not for the very large scale.
>>
>> Rough ideas and pointers to publications are all useful. There is an
>> extremely short fuse so the sooner the better,
>>
>> Thanks
>>
>>   Barry
>>
>>
>>
>
>
>
> --
> Jeff Hammond
> jeff.scie...@gmail.com
> http://jeffhammond.github.io/


Re: [petsc-dev] Soliciting suggestions for linear solver work under SciDAC 4 Institutes

2016-07-01 Thread Jeff Hammond
Obviously, holistic support for OpenMP is critical to the future of PETSc
:-D

On a more serious note, Matt and I have discussed the use of PETSc for
sparse multidimensional array computations for dimensions greater than 2,
also known as tensor computations. The associated paper describing previous
work with dense arrays is
http://www.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-210.html.  There
was even an unsuccessful SciDAC application proposal that described how
PETSc could be used for that domain when sparsity is important.  To start,
all we'd need is sparse matrix x sparse matrix multiplication, which I hear
the multigrid folks also need.  Sparse times dense is also important.
Sparse tensor factorization would also help, but I get that there are
enough open math questions there that it might be impractical to try to
implement something in PETSc in the near future.
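
For reference, the sparse-sparse product already has an entry point in PETSc;
a minimal sequential sketch with MatMatMult (current C API; the open question
is making this kind of product fast and scalable at the sizes that matter,
not whether the call exists):

static char help[] = "Sparse C = A*B with MatMatMult on two small AIJ matrices.\n";
#include <petscmat.h>

int main(int argc, char **argv)
{
  Mat            A, B, C;
  PetscInt       i, n = 8;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, help);if (ierr) return ierr;
  ierr = MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 2, NULL, &A);CHKERRQ(ierr);
  ierr = MatCreateSeqAIJ(PETSC_COMM_SELF, n, n, 2, NULL, &B);CHKERRQ(ierr);
  for (i = 0; i < n; i++) {        /* A = 2*I, B = cyclic shift by one column */
    ierr = MatSetValue(A, i, i, 2.0, INSERT_VALUES);CHKERRQ(ierr);
    ierr = MatSetValue(B, i, (i + 1) % n, 1.0, INSERT_VALUES);CHKERRQ(ierr);
  }
  ierr = MatAssemblyBegin(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(A, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyBegin(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);
  ierr = MatAssemblyEnd(B, MAT_FINAL_ASSEMBLY);CHKERRQ(ierr);

  /* Symbolic + numeric sparse product C = A*B */
  ierr = MatMatMult(A, B, MAT_INITIAL_MATRIX, PETSC_DEFAULT, &C);CHKERRQ(ierr);
  ierr = MatView(C, PETSC_VIEWER_STDOUT_SELF);CHKERRQ(ierr);

  ierr = MatDestroy(&A);CHKERRQ(ierr);
  ierr = MatDestroy(&B);CHKERRQ(ierr);
  ierr = MatDestroy(&C);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}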

Maybe I am just biased because I spend all of my time reading
www.nextplatform.com, but I hear machine learning is becoming an important
HPC workload.  While the most hyped efforts relate to running inaccurate -
the technical term is half-precision - dense matrix multiplication as fast
as possible, I suspect that more elegant approaches will prevail.
Presumably there is something that PETSc can do to enable machine learning
algorithms.  As most of the existing approaches use silly programming
models based on MapReduce, it can't be too hard for PETSc to do better.

Jeff

On Fri, Jul 1, 2016 at 2:32 PM, Barry Smith  wrote:

>
>The DOE SciDAC institutes have supported PETSc linear solver
> research/code development for the past fifteen years.
>
> This email is to solicit ideas for linear solver research/code
> development work for the next round of SciDAC institutes (which will be a 4
> year period) in PETSc. Please send me any ideas, no matter how crazy, on
> things you feel are missing, broken, or incomplete in PETSc with regard to
> linear solvers that we should propose to work on. In particular, issues
> coming from particular classes of applications would be good. Generic
> "multi physics" coupling types of things are too general (and old :-))
> while  work for extreme large scale is also out since that is covered under
> another call (ECP). But particular types of optimizations etc for existing
> or new codes could be in, just not for the very large scale.
>
> Rough ideas and pointers to publications are all useful. There is an
> extremely short fuse so the sooner the better,
>
> Thanks
>
>   Barry
>
>
>
>


-- 
Jeff Hammond
jeff.scie...@gmail.com
http://jeffhammond.github.io/