Re: Structure-based a %*% b optimization results.

Dmitriy Lyubimov Fri, 17 Apr 2015 22:04:27 -0700

yes, that's the idea -- make LHS or RHS reads mostly hitting cache with
minimal effort. This is of course will not work optimally, but it this
seems a low cost improvement for small matrices.


I would not say this looks good. This still looks bad, it's just the stock
looks awful.

There's a paper from circa 2007 that claims outperforming even then-MKL and
GOTO blas with space-filling curve algorithms. Incidentally, i dealt with
space-filling curve algorithms quite a bit, so hopefully it is not
prohibitively complex for me. If matrix rewriting costs are fairly tame as
they seem to be, then it probably makes sense to explore this for jvm-only
speedup of the sparse stuff. Nice thing about it is that it is oblivious of
the cpu cache architecture and sizes, or so is one of claims. (although i
would think MKL would catch up with anything published by now).

Although generally i would probably not go into in-core algebraic stuff but
rather take something off-the-shelf as an additional option. netlib or
bidmat.


On Fri, Apr 17, 2015 at 7:17 PM, Ted Dunning <[email protected]> wrote:

>
> This does look good.
>
> One additional thought would be to do a standard multi-level blocking
> implementation of matrix times. In my experience this often makes
> orientation much less important.
>
> The basic reason is that dense times requires n^3 ops but only n^2 memory
> operations.  By rearranging the loops you get reuse in registers and then
> reuse in L1 and L2.
>
> The win that you are getting now is due to cache lines being fully used
> rather than partially used and then lost before they are touched again.
>
> The last time I did this, there were only three important caching layers.
> Registers. Cache. Memory. There might be more now.  Done well, this used to
> buy >10x speed.  Might even buy more, especially with matrices that blow L2
> or even L3.
>
> Sent from my iPhone
>
> > On Apr 17, 2015, at 17:26, Dmitriy Lyubimov <[email protected]> wrote:
> >
> > Spent an hour on this today.
> >
> > What i am doing: simply reimplementing pairwise dot-product algorithm in
> > stock dense matrix times().
> >
> > However, equipping every matrix with structure "flavor" (i.e. dense(...)
> > reports row-wise ,  and dense(...).t reports column wise, dense().t.t
> > reports row-wise again, etc.)
> >
> > Next, wrote a binary operator that switches on combination of operand
> > orientation and flips misaligned operand(s) (if any) to match most
> "speedy"
> > orientation RW-CW. here are result for 300x300 dense matrix pairs:
> >
> > Ad %*% Bd: (107.125,46.375)
> > Ad' %*% Bd: (206.475,39.325)
> > Ad %*% Bd': (37.2,42.65)
> > Ad' %*% Bd': (100.95,38.025)
> > Ad'' %*% Bd'': (120.125,43.3)
> >
> > these results are for transpose combinations of original 300x300 dense
> > random matrices, averaged over 40 runs (so standard error should be well
> > controlled), in ms. First number is stock times() application (i.e. what
> > we'd do with %*% operator now), and second number is ms with rewriting
> > matrices into RW-CW orientation.
> >
> > For example, AB reorients B only, just like A''B'', AB' reorients
> nothing,
> > and worst case A'B re-orients both (I also tried to run sum of outer
> > products for A'B case without re-orientation -- apparently L1 misses far
> > outweigh costs of reorientation there, i got very bad results there for
> > outer product sum).
> >
> > as we can see, stock times() version does pretty bad for even dense
> > operands for any orientation except for the optimal.
> >
> > Given that, i am inclined just to add orientation-driven structure
> > optimization here and replace all stock calls with just orientation
> > adjustment.
> >
> > Of course i will need to append this matrix with sparse and sparse row
> > matrix combination (quite a bit of those i guess) and see what happens
> > compared to stock sparse multiplications.
> >
> > But even that seems like a big win to me (basically, just doing
> > reorientation optimization seems to give 3x speed up on average in
> > matrix-matrix multiplication in 3 cases out of 4, and ties in 1 case).
>

Re: Structure-based a %*% b optimization results.

Reply via email to