Two specific papers I am relying on: Kazushige Goto's [1] describes decomposing dense multiplication into 6 kernels, plus per-kernel tuning. My understanding is that this is basically how GotoBLAS (these days, I believe, OpenBLAS) works. In particular, in Section 4.2.4 he stresses the importance of repacking the data into a contiguous memory format, even if it means memory copying. This is roughly how my hack works; what I am doing is about equivalent to a small GEPB kernel. Of course, for bigger matrices this needs to be carefully tuned and decomposed into kernels per the paper.
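To make the repacking point concrete, here is a minimal Scala sketch of the idea (the naming and loop structure are mine, not the paper's and not Mahout's API): the kc x n panel of b is first copied into one contiguous column-major buffer, so the innermost loop streams both operands linearly no matter how b was originally laid out.

object PackingSketch {
  // c(mc x n) += a(mc x kc) * b(kc x n), with the b panel packed first.
  def multiplyPacked(a: Array[Array[Double]], b: Array[Array[Double]],
                     c: Array[Array[Double]], mc: Int, kc: Int, n: Int): Unit = {
    // The memory copy [1] argues for: repack b into one contiguous,
    // column-major buffer so the inner loop below streams it linearly.
    val bPacked = new Array[Double](kc * n)
    var j = 0
    while (j < n) {
      var p = 0
      while (p < kc) {
        bPacked(j * kc + p) = b(p)(j)
        p += 1
      }
      j += 1
    }
    // Pairwise dot products: row i of a and packed column jj of b are
    // now both contiguous in memory.
    var i = 0
    while (i < mc) {
      val ai = a(i)
      var jj = 0
      while (jj < n) {
        val off = jj * kc
        var s = 0.0
        var p = 0
        while (p < kc) {
          s += ai(p) * bPacked(off + p)
          p += 1
        }
        c(i)(jj) += s
        jj += 1
      }
      i += 1
    }
  }
}

The copy is O(kc * n) while the multiply is O(mc * kc * n), which is the usual argument for why the repacking pays for itself.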
[2] basically says to hell with kernel tuning, claims to beat both [1] and MKL, and uses space-filling curves for cache optimization instead. I like this, but that particular work is also only for dense matrices; locality-sensitive sparse multiplication has yet to be explored (a generic recursive sketch of the cache-oblivious idea is at the end of this mail). Like I said, this is much more appealing since it seems to forgo the architecture-specific tuning that other BLAS libraries require. Let me know if there's any interest in working on either of the two.

[1] Goto, K., van de Geijn, R. A. "Anatomy of High-Performance Matrix Multiplication"
[2] Bader, M., Heinecke, A. "Cache Oblivious Dense and Sparse Matrix Multiplication Based on Peano Curves"

On Fri, Apr 17, 2015 at 5:26 PM, Dmitriy Lyubimov <[email protected]> wrote:

> Spent an hour on this today.
>
> What I am doing: simply reimplementing the pairwise dot-product algorithm in
> the stock dense matrix times().
>
> However, I am equipping every matrix with a structure "flavor" (i.e. dense(...)
> reports row-wise, dense(...).t reports column-wise, dense(...).t.t reports
> row-wise again, etc.).
>
> Next, I wrote a binary operator that switches on the combination of operand
> orientations and flips the misaligned operand(s), if any, to match the most
> "speedy" orientation, RW-CW. Here are the results for 300x300 dense matrix
> pairs:
>
> Ad %*% Bd: (107.125, 46.375)
> Ad' %*% Bd: (206.475, 39.325)
> Ad %*% Bd': (37.2, 42.65)
> Ad' %*% Bd': (100.95, 38.025)
> Ad'' %*% Bd'': (120.125, 43.3)
>
> These results are for transpose combinations of the original 300x300 dense
> random matrices, averaged over 40 runs (so the standard error should be well
> controlled), in ms. The first number is the stock times() application (i.e.
> what we'd do with the %*% operator now), and the second number is the time
> with rewriting matrices into the RW-CW orientation.
>
> For example, AB reorients B only, just like A''B''; AB' reorients nothing;
> and the worst case, A'B, reorients both. (I also tried running a sum of outer
> products for the A'B case without reorientation -- apparently the L1 misses
> far outweigh the cost of reorientation there; I got very bad results for the
> outer-product sum.)
>
> As we can see, the stock times() version does pretty badly even for dense
> operands in any orientation except the optimal one.
>
> Given that, I am inclined to just add orientation-driven structure
> optimization here and replace all stock calls with just the orientation
> adjustment.
>
> Of course, I will need to extend this matrix of cases with sparse and
> sparse-row matrix combinations (quite a few of those, I guess) and see what
> happens compared to the stock sparse multiplications.
>
> But even that seems like a big win to me (basically, just doing the
> reorientation optimization seems to give a 3x speedup on average in
> matrix-matrix multiplication in 3 cases out of 4, and ties in 1 case).
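To illustrate the orientation dispatch described in the quoted mail, here is a rough Scala sketch. The names (Flavor, Mat, reorient) are hypothetical and this is not the actual Mahout code, just the shape of the idea: transpose merely flips the reported flavor, and the multiplication operator copies any misaligned operand into the fast RW-CW combination before running plain pairwise dot products.

object OrientationSketch {
  sealed trait Flavor
  case object RowWise extends Flavor
  case object ColWise extends Flavor

  case class Mat(rows: Int, cols: Int, data: Array[Double], flavor: Flavor) {
    def apply(i: Int, j: Int): Double =
      if (flavor == RowWise) data(i * cols + j) else data(j * rows + i)

    // Transpose is free: reinterpret the same buffer under the flipped
    // flavor, which is how dense(...).t can "report column-wise".
    def t: Mat = Mat(cols, rows, data, if (flavor == RowWise) ColWise else RowWise)

    // Physical copy into the requested orientation -- the reorientation
    // cost included in the second number of each timing pair above.
    def reorient(target: Flavor): Mat =
      if (flavor == target) this
      else {
        val out = new Array[Double](rows * cols)
        var i = 0
        while (i < rows) {
          var j = 0
          while (j < cols) {
            if (target == RowWise) out(i * cols + j) = apply(i, j)
            else out(j * rows + i) = apply(i, j)
            j += 1
          }
          i += 1
        }
        Mat(rows, cols, out, target)
      }
  }

  // The binary operator: flip misaligned operand(s) into the fast RW-CW
  // combination, then run plain pairwise dot products.
  def times(a: Mat, b: Mat): Mat = {
    require(a.cols == b.rows)
    val ar = a.reorient(RowWise)
    val bc = b.reorient(ColWise)
    val out = new Array[Double](a.rows * b.cols)
    var i = 0
    while (i < a.rows) {
      var j = 0
      while (j < b.cols) {
        var s = 0.0
        var k = 0
        while (k < a.cols) {
          // both reads are sequential: row i of ar, column j of bc
          s += ar.data(i * ar.cols + k) * bc.data(j * bc.rows + k)
          k += 1
        }
        out(i * b.cols + j) = s
        j += 1
      }
      i += 1
    }
    Mat(a.rows, b.cols, out, RowWise)
  }
}

Note that since t only flips the flavor, A''B'' dispatches exactly like AB, consistent with those two cases showing similar reoriented timings above.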

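For completeness, since [2] may be less familiar: below is a generic recursive, cache-oblivious multiplication sketch in Scala. It is not the Peano-curve ordering of [2] itself (that work orders the recursion along a Peano curve and, per its title, also covers sparse storage); it only shows the baseline divide-and-conquer idea such schemes build on, and why no architecture-specific tuning parameter is needed.

object RecursiveSketch {
  // c(m x n) += a(m x k) * b(k x n); all arrays row-major with leading
  // dimensions lda/ldb/ldc, and (ar, ac) etc. locate the current
  // sub-block. No cache-size constant appears anywhere: the recursion
  // alone keeps shrinking the working set until it fits each cache level.
  def mm(a: Array[Double], b: Array[Double], c: Array[Double],
         m: Int, k: Int, n: Int,
         ar: Int, ac: Int, br: Int, bc: Int, cr: Int, cc: Int,
         lda: Int, ldb: Int, ldc: Int): Unit =
    if (m <= 16 && k <= 16 && n <= 16) {
      // base case: plain triple loop on a tiny block
      var i = 0
      while (i < m) {
        var j = 0
        while (j < n) {
          var s = 0.0
          var p = 0
          while (p < k) {
            s += a((ar + i) * lda + ac + p) * b((br + p) * ldb + bc + j)
            p += 1
          }
          c((cr + i) * ldc + cc + j) += s
          j += 1
        }
        i += 1
      }
    } else if (m >= k && m >= n) {
      // split the row dimension of a and c
      val h = m / 2
      mm(a, b, c, h, k, n, ar, ac, br, bc, cr, cc, lda, ldb, ldc)
      mm(a, b, c, m - h, k, n, ar + h, ac, br, bc, cr + h, cc, lda, ldb, ldc)
    } else if (n >= k) {
      // split the column dimension of b and c
      val h = n / 2
      mm(a, b, c, m, k, h, ar, ac, br, bc, cr, cc, lda, ldb, ldc)
      mm(a, b, c, m, k, n - h, ar, ac, br, bc + h, cr, cc + h, lda, ldb, ldc)
    } else {
      // split the shared dimension
      val h = k / 2
      mm(a, b, c, m, h, n, ar, ac, br, bc, cr, cc, lda, ldb, ldc)
      mm(a, b, c, m, k - h, n, ar, ac + h, br + h, bc, cr, cc, lda, ldb, ldc)
    }

  def multiply(a: Array[Double], b: Array[Double],
               m: Int, k: Int, n: Int): Array[Double] = {
    val c = new Array[Double](m * n)
    mm(a, b, c, m, k, n, 0, 0, 0, 0, 0, 0, k, n, n)
    c
  }
}

Because the recursion halves the largest dimension until the base case, sub-problems eventually fit in every cache level without the code ever naming a cache size; that is the cache-oblivious property.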