Karl,
   Thanks for your comments.

> On Oct 10, 2019, at 12:34 AM, Karl Rupp via petsc-dev <petsc-dev@mcs.anl.gov> wrote:
>
> Hi,
>
> Table 2 reports negative latencies. This doesn't look right to me ;-) If it's
> the outcome of a parameter fit to the performance model, then use a
> parameter name (e.g. alpha) instead of the term 'latency'.

   Per Jed's suggestion we will include some plots and additional information to make clearer what happens on the CPU.

> Figure 11 has a very narrow range in the y-coordinate and thus exaggerates
> the variation greatly. "GPU performance" should be adjusted to something like
> "execution time" to explain the meaning of the y-axis.

   Thanks. Fixed by also adding the next size, 10^7.

> Page 12: The latency for VecDot is higher than for VecAXPY because VecDot
> requires the result to be copied back to the host. This is an additional
> operation.

   Good point. We will include this.

> Regarding performance measurements: Did you synchronize after each kernel
> launch? I.e. did you run (approach A)
>
> for (many times) {
>   synchronize();
>   start_timer();
>   kernel_launch();
>   synchronize();
>   stop_timer();
> }
>
> and then take averages over the timings obtained, or did you (approach B)
>
> synchronize();
> start_timer();
> for (many times) {
>   kernel_launch();
> }
> synchronize();
> stop_timer();
>
> and then divide the obtained time by the number of runs?

   For all our runs, as stated near the beginning of the text, "many times" == 1. This seems to work fine and the results are reproducible, so I don't see a need to run multiple times.

   Barry

> Approach A will report a much higher latency than the latter, because
> synchronizations are expensive (i.e. your latency consists of kernel launch
> latency plus device synchronization latency). Approach B is slightly
> over-optimistic, but I've found it to better match what one observes for an
> algorithm involving several kernel launches.
>
> Best regards,
> Karli
>
>> On 10/10/19 12:34 AM, Smith, Barry F. via petsc-dev wrote:
>>
>> We've prepared a short report on the performance of vector operations on
>> Summit and would appreciate any feedback including: inconsistencies, lack
>> of clarity, incorrect notation or terminology, etc.
>>
>> Thanks
>>
>> Barry, Hannah, and Richard