I am using LSQR to minimize || L x - b ||_2, where L is a sparse rectangular 
matrix with 145,253,395 rows, 209,423,775 columns, and around 54 billion 
nonzeros.

The numbers reported below are for a run on 27 compute nodes with 4 MPI ranks 
per node, for a total of 108 ranks.

Throughout the run, I measure the total time spent in all dot products during 
the LSQR iterations, and I distinguish between dot products on vectors the 
size of the solution vector "x" (VecDotSol) and dot products on vectors the 
size of the rhs "b" (VecDotRhs). Here are the numbers I get (our implementation 
of LSQR performs some extra vector dot products for our needs):

236 VecDotSol take 1.523 seconds
226 VecDotRhs take 326.008 seconds
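
For comparison with the per-call question below, the totals above can be turned into per-call averages. The following is only a small Python sketch of that arithmetic; the names `timings` and `per_call_seconds` are illustrative and not part of our LSQR code.

```python
# Per-call averages derived from the totals reported above.
# Each entry is (number of calls, total seconds), copied from the run.
timings = {
    "VecDotSol": (236, 1.523),
    "VecDotRhs": (226, 326.008),
}


def per_call_seconds(calls, total_seconds):
    """Average wall time of one dot product, in seconds."""
    return total_seconds / calls


averages = {name: per_call_seconds(*entry) for name, entry in timings.items()}
ratio = averages["VecDotRhs"] / averages["VecDotSol"]

for name, avg in averages.items():
    print(f"{name}: {avg * 1e3:.2f} ms per call")
print(f"VecDotRhs / VecDotSol per-call ratio: {ratio:.0f}x")
```

This gives roughly 6.5 ms per VecDotSol and roughly 1.44 s per VecDotRhs, a per-call gap of more than two orders of magnitude.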

Regarding the partition of rows and columns among the 108 MPI ranks:

Rows: min = 838,529 ; avg = 1.34494e+06 ; max = 2,437,206
Columns: min = 1,903,500 ; avg = 1.93911e+06 ; max = 1,946,270

Regarding the partition of rows and columns among the 27 compute nodes:

Rows: min = 3,575,584 ; avg = 5.37976e+06 ; max = 8,788,062
Columns: min = 7,637,500 ; avg = 7.75644e+06 ; max = 7,785,080
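
To put a number on how unbalanced each partition is, one common convention is the ratio max/avg (1.0 means perfectly balanced). The Python sketch below applies it to the figures above; choosing max/avg as the metric is an assumption made here for illustration, and the names are hypothetical.

```python
# Imbalance factors for the partitions reported above.
# Each entry is (min, avg, max), copied from the run.
partitions = {
    "rows per rank": (838_529, 1.34494e6, 2_437_206),
    "columns per rank": (1_903_500, 1.93911e6, 1_946_270),
    "rows per node": (3_575_584, 5.37976e6, 8_788_062),
    "columns per node": (7_637_500, 7.75644e6, 7_785_080),
}


def imbalance(min_count, avg_count, max_count):
    """Return (max/avg, max/min) for one partition."""
    return max_count / avg_count, max_count / min_count


factors = {name: imbalance(*entry) for name, entry in partitions.items()}

for name, (max_over_avg, max_over_min) in factors.items():
    print(f"{name}: max/avg = {max_over_avg:.3f}, max/min = {max_over_min:.3f}")
```

With these figures the busiest rank holds about 1.8 times the average number of rows, while the column partition stays within about half a percent of the average.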

Questions:

  1.  Why are the average run times so different between VecDotSol and 
VecDotRhs?
  2.  Could the much larger imbalance in the number of rows per rank 
(compared to the very well balanced distribution of columns per rank) be the 
cause?
  3.  Have you ever observed such a situation?
  4.  Could it be due to a bad MPI configuration / parametrization with 
respect to the underlying network?
  5.  But, if so, why are the VecDotSol dot products so much faster than 
the VecDotRhs ones?

Thank you in advance,

Ernesto.


