You can also use -log_view -log_sync to sync before timing so that you can clearly see which operations are really imbalanced.
--Junchao Zhang

On Fri, May 20, 2022 at 12:37 PM Ernesto Prudencio via petsc-users <[email protected]> wrote:

> Thank you, Barry. I will dig more on the issue with your suggestions.
>
> Schlumberger-Private
>
> *From:* Barry Smith <[email protected]>
> *Sent:* Friday, May 20, 2022 12:33 PM
> *To:* Ernesto Prudencio <[email protected]>
> *Cc:* PETSc users list <[email protected]>
> *Subject:* [Ext] Re: [petsc-users] Very slow VecDot operations
>
> Ernesto,
>
> If you ran (or can run) with -log_view you could see the time "ratio" in the output that tells how much time the "fastest" rank spent on the dot product versus the "slowest". Based on the different counts per rank you report that ratio might be around 3. But based on the times you report is around 200!
>
> My guess is that for the VecDotRhs() some ranks are arriving at the vec dot long before other ranks and have to wait there an extremely long amount of time making it appear that the dot product is very slow. While, in reality, the large time credited to the vecdot is due to a misbalance in time for the operation before the VecDot.
>
> Barry
>
> On May 20, 2022, at 1:23 PM, Ernesto Prudencio via petsc-users <[email protected]> wrote:
>
> I am using LSQR to minimize || L x – b ||_2, where L is a sparse rectangular matrix with 145,253,395 rows, 209,423,775 columns, and around 54 billion non zeros.
>
> The numbers reported below are for a run with 27 compute nodes, each compute node with 4 MPI ranks, so a total of 108 ranks.
>
> Throughout the run, I assess the runtime taken by all dot products during the LSQR iterations, and I differentiate between dot products involving vectors of the size of the solution vector “x”, and dot products involving vectors of the size of the rhs “b”.
>
> Here are the numbers I get (we have an implementation of LSQR that performs some extra vector dot products for our needs):
>
> 236 VecDotSol take 1.523 seconds
> 226 VecDotRhs take 326.008 seconds
>
> Regarding the partition of rows and columns among the 108 MPI ranks:
>
> Rows: min = 838,529 ; avg = 1.34494e+06 ; max = 2,437,206
> Columns: min = 1,903,500 ; avg = 1.93911e+06 ; max = 1,946,270
>
> Regarding the partition of rows and columns among the 27 compute nodes:
>
> Rows: min = 3,575,584 ; avg = 5.37976e+06 ; max = 8,788,062
> Columns: min = 7,637,500 ; avg = 7.75644e+06 ; max = 7,785,080
>
> Questions:
>
> 1. Why the average run times are so different between VecDotSol and VecDotRhs?
> 2. Could the much bigger unbalancing among the number of rows per rank (compared to the very well balanced distribution of columns per rank) be the cause?
> 3. Have you ever observed such situation?
> 4. Could it be because of a bad MPI configuration / parametrization with respect to the underlying network?
> 5. But, if yes, why the VecDotSol dot products are so much faster than VecDotRhs?
>
> Thank you in advance,
>
> Ernesto.
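For reference, the rough figures quoted in this thread ("around 3" versus "around 200") can be reproduced from the numbers in the original message. The sketch below is only one plausible reading of that estimate: it assumes the first figure is the max/min row count per rank and the second is the per-call time of VecDotRhs relative to VecDotSol. The variable names are illustrative and are not part of PETSc.

```python
# Numbers copied from the original message; names are illustrative only.
rows_min, rows_max = 838_529, 2_437_206    # rows per rank (size of rhs "b")
cols_min, cols_max = 1_903_500, 1_946_270  # columns per rank (size of "x")

n_dot_sol, t_dot_sol = 236, 1.523          # VecDotSol: call count, total seconds
n_dot_rhs, t_dot_rhs = 226, 326.008        # VecDotRhs: call count, total seconds

# Work imbalance: largest local vector length divided by the smallest.
row_imbalance = rows_max / rows_min
col_imbalance = cols_max / cols_min

# Average time per dot product for each vector size.
per_call_sol = t_dot_sol / n_dot_sol
per_call_rhs = t_dot_rhs / n_dot_rhs

# Assumption: compare per-call times of the two kinds of dot product.
time_ratio = per_call_rhs / per_call_sol

print(f"row imbalance (max/min):    {row_imbalance:.2f}")
print(f"column imbalance (max/min): {col_imbalance:.2f}")
print(f"per-call VecDotSol:         {per_call_sol * 1e3:.2f} ms")
print(f"per-call VecDotRhs:         {per_call_rhs:.3f} s")
print(f"per-call time ratio:        {time_ratio:.0f}")
```

Under these assumptions the row imbalance comes out a little under 3, while the per-call time ratio is a bit over 200. That gap is far larger than the difference in local vector lengths alone would explain, which is consistent with the suggestion that the time is being spent waiting at the reduction and not in the dot product itself.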
