Jed,

    Thanks, this is very useful.

  Barry

> On Oct 31, 2019, at 11:47 AM, Jed Brown <j...@jedbrown.org> wrote:
> 
> "Smith, Barry F." <bsm...@mcs.anl.gov> writes:
> 
>>> On Oct 23, 2019, at 7:15 PM, Jed Brown <j...@jedbrown.org> wrote:
>>> 
>>> IMO, Figures 2 and 7+ are more interesting when the x axis (vector size)
>>> is replaced by execution time.  
>> 
>> 
>>> We don't scale by fixing the resource
>>> and increasing the problem size, we choose the global problem size based
>>> on accuracy/model complexity and choose a Pareto tradeoff of execution
>>> time with efficiency (1/cost) to decide how many nodes to use.  Most of
>>> those sloping tails on the left become vertical lines under that
>>> transformation.
>> 
>>   I don't see the connection between your first sentence and the other 
>> sentences.
>> 
>>   How does the plot with time instead of size tell you what number of 
>> processors to use?
> 
> The point is that in the planning stage, you don't care how many
> processors are used, you care whether the machine is capable of solving
> problem P in time T.  After determining that, you want to know how much
> it will cost so you can apply for an allocation.  Only once you have an
> allocation and need to configure input parameters for a particular model
> do you care how many elements per process and how many processes in total.
> 
>>   I don't understand the plots with x as a time axis, so I suspect most 
>> potential readers won't. The only point of the plots is really to give an 
>> idea of the scale of the performance, and that performance is low except for 
>> large sizes, so I will keep the plot axis as is.
> 
> Compare these two figures.  When plotting versus size, you see a long
> tail to the left, but can't tell if it's getting faster.  That makes
> the claim that lower latency is a distinct capability squishy and
> imprecise, while plotting versus time is directly relevant.  You
> can say we have n microseconds to do a complex task (time step of a
> model) and here each VecDot takes at least k microseconds on
> architecture A no matter how we scale it.
> 
> In these figures, we can read off that VecDot completes 8x faster on CPU
> than GPU, that the GPU is useless if your time budget is less than 100
> microseconds, and that it is clearly preferable if you have at least 200
> microseconds.  We know that intrinsic MPI_Allreduce latency is about 15
> microseconds on a nice machine at any scale (BG, etc.), so if we had an
> application where MPI_Allreduce was limiting performance on a previous
> problem configuration/architecture, then it'll hurt 6x as much here.
> 
> [Attached figures: VecDot_CPU_vs_GPU_time.png, VecDot_CPU_vs_GPU_size.png]
> 
> Hannah, could you please give me access to push?  I modified the script
> to make both kinds of plots.
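[Editor's note: the transformation Jed describes, replotting the same timing data against execution time instead of vector size, can be sketched roughly as below. All numbers and names here are made up for illustration; the actual script and data are in the repository Jed mentions.]

```python
# Hypothetical VecDot timings (vector length n, measured time in microseconds).
# These values are invented for illustration; real data would come from
# PETSc's performance logging.
sizes  = [1_000, 10_000, 100_000, 1_000_000, 10_000_000]
cpu_us = [2, 4, 20, 180, 1800]    # assumed CPU timings: low latency
gpu_us = [30, 31, 35, 60, 260]    # assumed GPU timings: latency floor, high throughput

def throughput(sizes, times_us):
    """Entries processed per second for each measurement."""
    return [n / (t * 1e-6) for n, t in zip(sizes, times_us)]

cpu_rate = throughput(sizes, cpu_us)
gpu_rate = throughput(sizes, gpu_us)

# Plotting rate versus size gives the familiar long sloping tail on the left.
# Plotting rate versus time (x = cpu_us / gpu_us instead of x = sizes) turns
# that tail into a near-vertical line at the latency floor: no GPU point can
# appear left of ~30 us no matter how small the vector, which answers
# "what can I do within my time budget?" directly from the x axis.
assert min(gpu_us) >= 30
# e.g. with matplotlib: plt.loglog(gpu_us, gpu_rate) vs plt.loglog(sizes, gpu_rate)
```

The point of the second axis choice is that a fixed time budget is a vertical line on the time plot, so one can read off directly which architecture delivers any throughput at all inside that budget.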
