This is a direct consequence of the so-called memory mountain, i.e. the
increasing cost of accessing memory farther and farther away from the
processor on a standard von Neumann architecture: cache -> RAM -> hard
drive. Larger matrices (and, just as important here, larger source
vectors) don't fit completely in the cache, so you pay a cache-miss
penalty. See Bryant and O'Hallaron, "Computer Systems: A Programmer's
Perspective", for more on this, or, for excruciating detail, check out
ATLAS (Automatically Tuned Linear Algebra Software).
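To make that concrete, here is a minimal sketch of a generic
compressed-sparse-row (CSR) matrix-vector product -- the usual technique,
not the actual PETSc or MTL kernel. The arithmetic is linear in the number
of entries, but each stored entry reads x[col[k]] at an effectively random
column index, so once the source vector outgrows the cache most of those
reads are misses:

/* Generic CSR sparse matrix-vector product, y = A*x.
 * A sketch, not the PETSc or MTL implementation.
 * Flops are O(nnz), but x[col[k]] is an effectively random
 * read: once x (8 bytes per column) no longer fits in cache,
 * most of these reads miss, which is why the timings grow
 * with the number of columns even at fixed nnz. */
void csr_matvec(int nrows, const int *rowptr, const int *col,
                const double *val, const double *x, double *y)
{
    for (int i = 0; i < nrows; i++) {
        double sum = 0.0;
        for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
            sum += val[k] * x[col[k]];   /* random access into x */
        y[i] = sum;
    }
}

At 8 bytes per double, a 1000-column source vector is only 8 KB and sits
comfortably in L1, while a 32000-column vector is about 256 KB and spills
out of typical caches; that matches the growth within each row group of
the table below.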
On 8/6/07, John R. Wicks <jwicks at cs.brown.edu> wrote:
> I wrote a test program to compare the performance of the Matrix Template
> Library on packed array multiplication, and although petsc outperformed mtl,
> I was surprised that its performance was not strictly linear in the number
> of entries:
>
> 1) I created a packed column-major matrix of doubles in mtl with the
> corresponding row-major version in petsc; all matrices were sparse with 10
> entries per column (row), where the indices are chosen randomly.
>
> 2) For each column size, I created a fixed (dense) vector of doubles and
> performed 1000 multiplications (transposed multiplications) and recorded the
> timings, with the results as follows:
>
> Iterations   rows  columns  entries     mtl  petsc
>       1000   1000     1000    10000    5.35   0.1
>       1000   1000     2000    10000    5.69   0.1
>       1000   1000     4000    10000    5.91   0.11
>       1000   1000     8000    10000    7.85   0.13
>       1000   1000    16000    10000    9.82   0.21
>       1000   1000    32000    10000   12.18   0.21
>       1000   2000     2000    20000   10.94   0.21
>       1000   2000     4000    20000   11.11   0.21
>       1000   2000     8000    20000   12.12   0.26
>       1000   2000    16000    20000   13.47   0.31
>       1000   2000    32000    20000   16.69   0.4
>       1000   2000    64000    20000   23.66   0.68
>       1000   4000     4000    40000   24.7    0.44
>       1000   4000     8000    40000   26.24   0.52
>       1000   4000    16000    40000   27.73   0.64
>       1000   4000    32000    40000   31.23   0.76
>       1000   4000    64000    40000   37.91   1.26
>       1000   4000   128000    40000   50      2.02
>       1000   8000     8000    80000   49.79   1.08
>       1000   8000    16000    80000   53.24   1.29
>       1000   8000    32000    80000   56.63   1.61
>       1000   8000    64000    80000   59.54   2.44
>       1000   8000   128000    80000   69.85   3.55
>       1000   8000   256000    80000   95.39   4.61
>       1000  16000    16000   160000  103.58   2.52
>       1000  16000    32000   160000  100.83   2.92
>       1000  16000    64000   160000  104.57   4.68
>       1000  16000   128000   160000  127.82   6.67
>       1000  16000   256000   160000  151.82   8.18
>       1000  16000   512000   160000  200.33   9.97
>       1000  32000    32000   320000  206.65   6
>       1000  32000    64000   320000  218.44   9.41
>       1000  32000   128000   320000  234.82  12.95
>       1000  32000   256000   320000  260.71  15.22
>       1000  32000   512000   320000  304.17  18.38
>       1000  32000  1024000   320000  405.6   21.76
>
> As you can see, while petsc was 30-60 times faster than mtl, the performance
> of both packages also depended significantly on the number of rows. Maybe
> this is not surprising, since I believe they both utilize the BLAS libraries
> at some level, but since I can write down a simple algorithm for performing
> the multiplication linearly in the number of entries, and I would expect
> such a standard library to use the most efficient algorithm available, I am
> wondering why this is happening. Any ideas?
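For what it's worth, you can see the effect without either library. A toy
driver along these lines (my own sketch, not the original test program;
sizes borrowed from the first six rows of the table above) keeps nnz and
the flop count fixed while growing the number of columns:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

int main(void)
{
    const int nrows = 1000, per_row = 10, reps = 1000;
    /* Fixed nnz, growing column count, as in the first six
     * rows of the table: 1000 x 1000 up to 1000 x 32000. */
    for (int ncols = 1000; ncols <= 32000; ncols *= 2) {
        const int nnz = nrows * per_row;
        int *rowptr = malloc((nrows + 1) * sizeof *rowptr);
        int *col    = malloc(nnz * sizeof *col);
        double *val = malloc(nnz * sizeof *val);
        double *x   = malloc(ncols * sizeof *x);
        double *y   = malloc(nrows * sizeof *y);

        for (int i = 0; i <= nrows; i++) rowptr[i] = i * per_row;
        for (int k = 0; k < nnz; k++) {
            col[k] = rand() % ncols;   /* random column index */
            val[k] = 1.0;
        }
        for (int j = 0; j < ncols; j++) x[j] = 1.0;

        clock_t t0 = clock();
        for (int r = 0; r < reps; r++)
            for (int i = 0; i < nrows; i++) {
                double sum = 0.0;
                for (int k = rowptr[i]; k < rowptr[i + 1]; k++)
                    sum += val[k] * x[col[k]];
                y[i] = sum;
            }
        double secs = (double)(clock() - t0) / CLOCKS_PER_SEC;

        /* print y[0] so the optimizer cannot discard the work */
        printf("cols=%7d  nnz=%d  time=%6.2f s  y[0]=%g\n",
               ncols, nnz, secs, y[0]);

        free(rowptr); free(col); free(val); free(x); free(y);
    }
    return 0;
}

The flop count is identical across all six runs, so any growth in the
timings comes purely from the memory hierarchy, not the algorithm.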
