Forgot to mention that a long VecScatter time might also be due to local memory copies. 
If the communication pattern has a large local-to-local (self-to-self) component, 
which often happens thanks to locality, then the memory-copy time is counted in 
VecScatter. You can analyze your code's communication pattern to see if that is 
the case.
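
As a minimal sketch (a toy example of mine, not something from the PETSc source tree), the program below builds a purely self-to-self scatter: each process scatters only the entries it owns into a vector with the same layout, so VecScatterBegin/End do memory copies and send no MPI messages. VecScatterView prints the resulting pattern so you can check how much of your own scatter is local.

#include <petscvec.h>

int main(int argc, char **argv)
{
  Vec            x, y;
  IS             is;
  VecScatter     sc;
  PetscInt       rstart, rend;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  ierr = VecCreateMPI(PETSC_COMM_WORLD, PETSC_DECIDE, 64, &x);CHKERRQ(ierr);
  ierr = VecDuplicate(x, &y);CHKERRQ(ierr);
  ierr = VecSet(x, 1.0);CHKERRQ(ierr);
  /* Each process scatters its own range of entries to itself: a purely
     local (self-to-self) pattern, so the scatter time is all memory copy */
  ierr = VecGetOwnershipRange(x, &rstart, &rend);CHKERRQ(ierr);
  ierr = ISCreateStride(PETSC_COMM_WORLD, rend - rstart, rstart, 1, &is);CHKERRQ(ierr);
  ierr = VecScatterCreate(x, is, y, is, &sc);CHKERRQ(ierr);
  ierr = VecScatterBegin(sc, x, y, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  ierr = VecScatterEnd(sc, x, y, INSERT_VALUES, SCATTER_FORWARD);CHKERRQ(ierr);
  /* Print the communication pattern to see how much of it is local */
  ierr = VecScatterView(sc, PETSC_VIEWER_STDOUT_WORLD);CHKERRQ(ierr);
  ierr = VecScatterDestroy(&sc);CHKERRQ(ierr);
  ierr = ISDestroy(&is);CHKERRQ(ierr);
  ierr = VecDestroy(&y);CHKERRQ(ierr);
  ierr = VecDestroy(&x);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}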

--Junchao Zhang


On Wed, Mar 20, 2019 at 4:44 PM Zhang, Junchao via petsc-users 
<petsc-users@mcs.anl.gov> wrote:


On Wed, Mar 20, 2019 at 4:18 PM Manuel Valera <mvaler...@sdsu.edu> wrote:
Thanks for your answer. So, for example, I have a log for 200 cores across 10 
nodes that reads:

------------------------------------------------------------------------------------------------------------------------
Event                Count      Time (sec)     Flop                              --- Global ---  --- Stage ---   Total
                   Max Ratio  Max     Ratio   Max  Ratio  Mess   Avg len Reduct  %T %F %M %L %R  %T %F %M %L %R Mflop/s
------------------------------------------------------------------------------------------------------------------------
VecScatterBegin     3014 1.0 4.5550e+01 2.6 0.00e+00 0.0 4.2e+06 1.1e+06 2.8e+01  4  0 63 56  0   4  0 63 56  0     0
VecScatterEnd       2976 1.0 1.2143e+02 1.7 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00 14  0  0  0  0  14  0  0  0  0     0

While for 20 cores on one node I have:

VecScatterBegin     2596 1.0 2.9142e+01 2.1 0.00e+00 0.0 1.2e+05 4.0e+06 3.0e+01  2  0 81 61  0   2  0 81 61  0     0
VecScatterEnd       2558 1.0 8.0344e+01 7.9 0.00e+00 0.0 0.0e+00 0.0e+00 0.0e+00  3  0  0  0  0   3  0  0  0  0     0

What does that mean?

Where do I see the max/min ratio in here? And why is the End step all 0.0e+00 in 
both, yet it still grows from 3% to 14% of total time? It seems I would need to 
run again with the -log_sync option, is that correct?

The max/min ratios are the 2.1 and 7.9 in your second log (2.6 and 1.7 in the 
first). MPI send/recv calls are in VecScatterBegin(); VecScatterEnd() only does 
MPI_Wait. That is why it reports zero messages. Yes, run with -log_sync and see 
what happens.
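
As a rough illustration (plain MPI, not PETSc's actual implementation), the Begin/End split behaves like the sketch below: the "begin" phase posts the nonblocking sends and receives, which is where the message counts come from, while the "end" phase only waits, so time spent waiting on a slow or imbalanced neighbor shows up there without any new messages.

#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv)
{
  int         rank, size, left, right;
  double      sendbuf[4] = {1, 2, 3, 4}, recvbuf[4];
  MPI_Request reqs[2];

  MPI_Init(&argc, &argv);
  MPI_Comm_rank(MPI_COMM_WORLD, &rank);
  MPI_Comm_size(MPI_COMM_WORLD, &size);
  right = (rank + 1) % size;
  left  = (rank - 1 + size) % size;

  /* "Begin": post the nonblocking receive and send; this is where
     the messages are counted */
  MPI_Irecv(recvbuf, 4, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
  MPI_Isend(sendbuf, 4, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

  /* ... purely local computation could overlap here ... */

  /* "End": wait for completion; no new messages, but waiting on a
     slow neighbor is charged to this call */
  MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);

  printf("rank %d received %g from rank %d\n", rank, recvbuf[0], left);
  MPI_Finalize();
  return 0;
}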

Different question: couldn't I estimate the total communication time as a 
typical communication time per MPI message times the number of MPI messages 
reported in the log? Or does it not work like that?

That probably won't work, because you have multiple processes doing send/recv at 
the same time and they might saturate the bandwidth. PETSc also overlaps 
computation with communication.
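
For example, in DMDA-based code the overlap idiom looks roughly like the sketch below (a toy grid with made-up sizes, not your model): local work is done between starting and finishing the halo exchange, so part of the communication is hidden behind computation and a simple per-message estimate would overcount it. The exchange itself is what shows up under VecScatterBegin/End in -log_view.

#include <petscdmda.h>

int main(int argc, char **argv)
{
  DM             da;
  Vec            g, l, work;
  PetscErrorCode ierr;

  ierr = PetscInitialize(&argc, &argv, NULL, NULL); if (ierr) return ierr;
  /* A small 2D DMDA with stencil width 1, as a stand-in for the model's grid */
  ierr = DMDACreate2d(PETSC_COMM_WORLD, DM_BOUNDARY_NONE, DM_BOUNDARY_NONE,
                      DMDA_STENCIL_STAR, 64, 64, PETSC_DECIDE, PETSC_DECIDE,
                      1, 1, NULL, NULL, &da);CHKERRQ(ierr);
  ierr = DMSetUp(da);CHKERRQ(ierr);
  ierr = DMCreateGlobalVector(da, &g);CHKERRQ(ierr);
  ierr = DMCreateLocalVector(da, &l);CHKERRQ(ierr);
  ierr = VecDuplicate(g, &work);CHKERRQ(ierr);
  ierr = VecSet(g, 1.0);CHKERRQ(ierr);
  ierr = VecSet(work, 2.0);CHKERRQ(ierr);

  /* Start the halo exchange (counted under VecScatterBegin) ... */
  ierr = DMGlobalToLocalBegin(da, g, INSERT_VALUES, l);CHKERRQ(ierr);
  /* ... do purely local work that does not need the ghost values ... */
  ierr = VecScale(work, 0.5);CHKERRQ(ierr);
  /* ... then wait for the ghost values (counted under VecScatterEnd) */
  ierr = DMGlobalToLocalEnd(da, g, INSERT_VALUES, l);CHKERRQ(ierr);

  ierr = VecDestroy(&work);CHKERRQ(ierr);
  ierr = VecDestroy(&l);CHKERRQ(ierr);
  ierr = VecDestroy(&g);CHKERRQ(ierr);
  ierr = DMDestroy(&da);CHKERRQ(ierr);
  ierr = PetscFinalize();
  return ierr;
}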

Thanks.





On Wed, Mar 20, 2019 at 2:02 PM Zhang, Junchao <jczh...@mcs.anl.gov> wrote:
See the "Mess  Avg len  Reduct" numbers in each log stage. Mess is the total 
number of messages sent in an event over all processes. Avg len is the average 
message length. Reduct is the number of global reductions.
Each event, like VecScatterBegin/End, has a maximal execution time over all 
processes and a max/min ratio. %T is sum(execution time of the event on each 
process)/sum(execution time of the stage on each process); it indicates how 
expensive the event is and is a number you should pay attention to.
If your code is imbalanced (i.e., has a big max/min ratio), then the performance 
numbers are skewed and become misleading because some processes are just waiting 
for others. In that case, besides -log_view, you can add -log_sync, which adds 
an extra MPI_Barrier to each event so that they all start at the same time. With 
that, the numbers are easier to interpret.
src/vec/vscat/examples/ex4.c is a tiny example for VecScatter logging.
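
For example, something like the following (the executable name is just a placeholder for your model):

mpiexec -n 200 ./mymodel -log_view -log_sync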

--Junchao Zhang


On Wed, Mar 20, 2019 at 2:58 PM Manuel Valera via petsc-users 
<petsc-users@mcs.anl.gov> wrote:
Hello,

I am working on timing my model, which we made MPI-scalable using PETSc DMDAs. 
I want to know more about the output log and how to calculate the total 
communication time for my runs. So far I see we have "MPI Messages" and "MPI 
Message Lengths" in the log, along with the VecScatterBegin and VecScatterEnd 
reports.

My question is: how do I interpret these numbers to get a rough estimate of how 
much overhead comes just from MPI communication time in my model runs?

Thanks,

