On Wed, Feb 10, 2016 at 8:12 AM, Xiangdong <[email protected]> wrote:
> On Mon, Feb 8, 2016 at 6:45 PM, Jed Brown <[email protected]> wrote:
>
>> Xiangdong <[email protected]> writes:
>>
>> > iii) Since the time ratios of VecDot (2.5) and MatMult (1.5) are still
>> > high, I reran the program with the IPM module. The IPM summary is here:
>> > https://drive.google.com/file/d/0BxEfb1tasJxhYXI0VkV0cjlLWUU/view?usp=sharing
>> > From these IPM results, MPI_Allreduce takes 74% of the MPI time. The
>> > communication-by-task figure (1st figure on p. 4 in the above link)
>> > shows that it is not well balanced. Is this related to the hardware and
>> > network (which the users cannot control), or can I do something in my
>> > code to improve it?
>>
>> Here are a few functions that don't have any communication, but still
>> have significant load imbalance.
>>
>> VecAXPY  1021815 1.0 2.2148e+01 2.1 1.89e+10 1.1 0.0e+00 0.0e+00 0.0e+00  2  4  0  0  0   2  4  0  0  0 207057
>> VecMAXPY  613089 1.0 1.3276e+01 2.2 2.27e+10 1.1 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0 414499
>> MatSOR    818390 1.0 1.9608e+02 1.5 2.00e+11 1.1 0.0e+00 0.0e+00 0.0e+00 22 40  0  0  0  22 40  0  0  0 247472
>
> The result above is from a run with 256 cores (16 nodes * 16 cores/node).
> I did another run with 64 nodes * 4 cores/node. Now these functions are
> much better balanced (a factor of 1.2-1.3, instead of 1.5-2.1).
> VecAXPY   987215 1.0 6.8469e+00 1.3 1.82e+10 1.1 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0 647096
> VecMAXPY  592329 1.0 6.0866e+00 1.3 2.19e+10 1.1 0.0e+00 0.0e+00 0.0e+00  1  4  0  0  0   1  4  0  0  0 873511
> MatSOR    790717 1.0 1.2933e+02 1.2 1.93e+11 1.1 0.0e+00 0.0e+00 0.0e+00 24 40  0  0  0  24 40  0  0  0 362525
>
> For the functions that require communication, the time ratio is about 1.4-1.6:
>
> VecDot    789772 1.0 8.4804e+01 1.4 1.46e+10 1.1 0.0e+00 0.0e+00 7.9e+05 14  3  0  0 40  14  3  0  0 40  41794
> VecNorm   597914 1.0 7.6259e+01 1.6 1.10e+10 1.1 0.0e+00 0.0e+00 6.0e+05 12  2  0  0 30  12  2  0  0 30  34996
>
> The full log summary for this new run is here:
> https://googledrive.com/host/0BxEfb1tasJxhVkZ2NHJkSmF4LUU
>
> Can we say now that the load imbalance is from the network communication,
> instead of memory bandwidth?

Actually, now it looks even more like what Jed was saying. With only 4
cores per node, each core has much more available memory bandwidth.

   Matt

> Thanks.
>
> Xiangdong
>
>> You can and should improve load balance before stressing about network
>> costs. This could be that the nodes aren't clean (running at different
>> speeds) or that the partition is not balancing data.

--
What most experimenters take for granted before they begin their
experiments is infinitely more interesting than any results to which
their experiments lead.
-- Norbert Wiener
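For readers following along: the ratio column being discussed (e.g. the 2.1
on VecAXPY in the 256-core run vs. the 1.3 in the 64x4 run) is the maximum
time over the minimum time across ranks, so 1.0 means perfectly balanced. A
minimal sketch of how such a ratio can be computed from per-rank timings;
the timing values below are invented for illustration only, not taken from
the runs above:

```python
# Sketch: compute a max/min "load imbalance ratio" like the ratio column
# in PETSc's -log_summary output, from per-rank elapsed times for one
# event (e.g. VecAXPY). All numbers here are made up for illustration.

def imbalance_ratio(times_per_rank):
    """Max over min of per-rank times; 1.0 means perfectly balanced."""
    return max(times_per_rank) / min(times_per_rank)

# Crowded node (many cores/node): ranks starved for memory bandwidth
# take much longer even though each does the same flop count.
crowded = [10.5, 11.0, 22.1, 12.0, 21.5, 10.8, 20.9, 11.2]

# Spread-out run (few cores/node): more bandwidth per core, so the
# per-rank times cluster together.
spread = [6.3, 6.8, 6.5, 6.7, 6.4, 6.8, 6.6, 6.5]

print(round(imbalance_ratio(crowded), 1))  # 2.1
print(round(imbalance_ratio(spread), 1))   # 1.1
```

This is why a high ratio on a communication-free, equal-work event points
at per-core resource contention (memory bandwidth, unclean nodes) rather
than the network.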
