I wrote a test program to compare the performance of the Matrix Template
Library on packed array multiplication and although petsc outperformed mtl,
I was surprised that its performance was not strictly linear in the number
of entries:

1) I created a packed column major matrix of doubles in mtl with the
corresponding row major version in petsc; all matrices were dense with 10
entries per column (row), where the indices are chosen randomly.  

2) For each column size, I created a fixed (dense) vector of doubles and
performed 1000 multiplications (transposed multiplications) and recored the
timings, with the results as follows:

Iterations       rows    columns         entries         mtl     petc   
1000    1000    1000    10000   5.35    0.1
1000    1000    2000    10000   5.69    0.1
1000    1000    4000    10000   5.91    0.11
1000    1000    8000    10000   7.85    0.13
1000    1000    16000   10000   9.82    0.21
1000    1000    32000   10000   12.18   0.21
1000    2000    2000    20000   10.94   0.21
1000    2000    4000    20000   11.11   0.21
1000    2000    8000    20000   12.12   0.26
1000    2000    16000   20000   13.47   0.31
1000    2000    32000   20000   16.69   0.4
1000    2000    64000   20000   23.66   0.68
1000    4000    4000    40000   24.7    0.44
1000    4000    8000    40000   26.24   0.52
1000    4000    16000   40000   27.73   0.64
1000    4000    32000   40000   31.23   0.76
1000    4000    64000   40000   37.91   1.26
1000    4000    128000  40000   50      2.02
1000    8000    8000    80000   49.79   1.08
1000    8000    16000   80000   53.24   1.29
1000    8000    32000   80000   56.63   1.61
1000    8000    64000   80000   59.54   2.44
1000    8000    128000  80000   69.85   3.55
1000    8000    256000  80000   95.39   4.61
1000    16000   16000   160000  103.58  2.52
1000    16000   32000   160000  100.83  2.92
1000    16000   64000   160000  104.57  4.68
1000    16000   128000  160000  127.82  6.67
1000    16000   256000  160000  151.82  8.18
1000    16000   512000  160000  200.33  9.97
1000    32000   32000   320000  206.65  6
1000    32000   64000   320000  218.44  9.41
1000    32000   128000  320000  234.82  12.95
1000    32000   256000  320000  260.71  15.22
1000    32000   512000  320000  304.17  18.38
1000    32000   1024000 320000  405.6   21.76

As you can see, while petsc was 30-60 times faster than mtl, the performance
of both packages also depended significantly on the number of rows.  Maybe
this is not surprising, since I believe they both utilize the BLAS libraries
at some level, but since I can write down a simple algorithm for performing
the multiplication linearly in the number of entries, and I would expect
such a standard library to use the most efficient algorithm available, I am
wondering why this is happening?  Any ideas?


Reply via email to