Jed,

  Thanks, this is very useful.

  Barry

> On Oct 31, 2019, at 11:47 AM, Jed Brown <j...@jedbrown.org> wrote:
>
> "Smith, Barry F." <bsm...@mcs.anl.gov> writes:
>
>>> On Oct 23, 2019, at 7:15 PM, Jed Brown <j...@jedbrown.org> wrote:
>>>
>>> IMO, Figures 2 and 7+ are more interesting when the x axis (vector size)
>>> is replaced by execution time.
>>
>>
>>> We don't scale by fixing the resource
>>> and increasing the problem size, we choose the global problem size based
>>> on accuracy/model complexity and choose a Pareto tradeoff of execution
>>> time with efficiency (1/cost) to decide how many nodes to use. Most of
>>> those sloping tails on the left become vertical lines under that
>>> transformation.
>>
>> I don't see the connection between your first sentence and the other
>> sentences.
>>
>> How does the plot with time instead of size tell you what number of
>> processors to use?
>
> The point is that in the planning stage, you don't care how many
> processors are used, you care whether the machine is capable of solving
> problem P in time T. After determining that, you want to know how much
> it will cost so you can apply for an allocation. Only once you have an
> allocation and need to configure input parameters for a particular model
> do you care how many elements per process and how many processes in total.
>
>> I don't understand the plots with x as a time axis, so I suspect most
>> potential readers won't. The only point of the plots is really to give an
>> idea of the scale of the performance and that performance is low except for
>> large sizes so will keep the plot axis as is.
>
> Compare these two figures. When plotting versus size, you see a long
> tail to the left, but can't tell if it's getting faster. It makes the
> claim that lower latency is a specific capability a squishy and
> imprecise concept, while plotting versus time is directly relevant. You
> can say we have n microseconds to do a complex task (time step of a
> model) and here each VecDot takes at least k microseconds on
> architecture A no matter how we scale it.
>
> In these figures, we can read off that VecDot completes 8x faster on CPU
> than GPU. That the GPU is useless if your time budget is less than 100
> microseconds and clearly preferable if you have at least 200
> microseconds. We know that intrinsic MPI_Allreduce latency is about 15
> microseconds on a nice machine at any scale (BG, etc.), so if we had an
> application where MPI_Allreduce was limiting performance on a previous
> problem configuration/architecture, then it'll hurt 6x as much here.
>
> <VecDot_CPU_vs_GPU_time.png><VecDot_CPU_vs_GPU_size.png>
>
> Hannah, could you please give me access to push? I modified the script
> to make both kinds of plots.
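
For readers who find the time-axis plots unfamiliar, the transformation discussed above can be sketched in a few lines of Python. This is only an illustration under a simple latency-plus-bandwidth model; the `Arch` class, the function names, and every number in `CPU` and `GPU` are hypothetical and are not taken from the figures or from the plotting script mentioned in the thread.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Arch:
    name: str
    latency_us: float    # fixed cost of one VecDot, in microseconds (assumed)
    bytes_per_us: float  # sustained streaming rate in bytes/microsecond (assumed)


# Illustrative parameters only; these are NOT measurements.
CPU = Arch("CPU", latency_us=12.0, bytes_per_us=1.0e5)
GPU = Arch("GPU", latency_us=96.0, bytes_per_us=8.0e5)


def vecdot_time_us(arch, n):
    """Model time of a dot product of two length-n double vectors (16 bytes per entry pair)."""
    return arch.latency_us + 16.0 * n / arch.bytes_per_us


def size_to_time_axis(arch, sizes):
    """Map each size to (time, rate): the points drawn when the x axis is time."""
    return [(vecdot_time_us(arch, n), n / vecdot_time_us(arch, n)) for n in sizes]


def best_for_budget(archs, budget_us):
    """Return the name of the arch that fits the largest vector in budget_us, or None."""
    best = None
    for arch in archs:
        if budget_us <= arch.latency_us:
            continue  # this arch cannot finish any VecDot within the budget
        n = (budget_us - arch.latency_us) * arch.bytes_per_us / 16.0
        if best is None or n > best[1]:
            best = (arch.name, n)
    return best[0] if best else None
```

In this model, no point for an architecture can lie to the left of its latency, so the sloping left tail of a size-axis plot collapses toward a vertical line at x = `latency_us`. A call such as `best_for_budget([CPU, GPU], budget_us)` then answers the planning question directly: given a time budget, which architecture can do useful work at all, and which one does more of it.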