Sadly, no, since that was from a different job.

But here are some references with snippets:

This one indicates that things have changed dramatically even just from
2009:
http://www.cs.cornell.edu/~bindel/class/cs6210-f12/notes/lec02.pdf

This next one is a web aside from a pretty good-looking book [1]:
http://csapp.cs.cmu.edu/2e/waside/waside-blocking.pdf

I would guess that Samsara's optimizer could well do blocking in addition to
the transpose transformations that Dmitriy is talking about.
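
For reference, a minimal sketch of the kind of cache-blocked kernel described
in the quoted thread below might look roughly like this (plain Scala over
row-major Array[Double]; the function name and the block size are illustrative
guesses, not Mahout or Samsara code):

    // Illustrative tiled multiply, C += A * B, for n x n row-major arrays.
    // The i-k-j inner order streams through rows of B and C, so cache lines
    // are fully used; the tiling keeps each block hot in L1/L2.
    // blockSize is a tuning guess, not a recommended value.
    def blockedTimes(a: Array[Double], b: Array[Double], c: Array[Double],
                     n: Int, blockSize: Int = 64): Unit = {
      var ii = 0
      while (ii < n) {
        var kk = 0
        while (kk < n) {
          var jj = 0
          while (jj < n) {
            val iMax = math.min(ii + blockSize, n)
            val kMax = math.min(kk + blockSize, n)
            val jMax = math.min(jj + blockSize, n)
            var i = ii
            while (i < iMax) {
              var k = kk
              while (k < kMax) {
                val aik = a(i * n + k)   // held in a register across the j loop
                var j = jj
                while (j < jMax) {
                  c(i * n + j) += aik * b(k * n + j)
                  j += 1
                }
                k += 1
              }
              i += 1
            }
            jj += blockSize
          }
          kk += blockSize
        }
        ii += blockSize
      }
    }

The point is only the loop structure: a(i * n + k) stays in a register across
the innermost loop, and the tiles keep blocks of B and C resident in L1/L2.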


[1] http://csapp.cs.cmu.edu/



On Fri, Apr 17, 2015 at 10:24 PM, Andrew Musselman <[email protected]> wrote:

> Ted, do you have any sample code snippets?
>
> On Friday, April 17, 2015, Ted Dunning <[email protected]> wrote:
>
> >
> > This does look good.
> >
> > One additional thought would be to do a standard multi-level blocking
> > implementation of matrix times. In my experience this often makes
> > orientation much less important.
> >
> > The basic reason is that dense times requires n^3 arithmetic ops but only
> > n^2 unavoidable memory operations, so there is an O(n) factor of potential
> > reuse. By rearranging the loops you get that reuse in registers and then
> > reuse in L1 and L2.
> >
> > The win that you are getting now is due to cache lines being fully used
> > rather than partially used and then lost before they are touched again.
> >
> > The last time I did this, there were only three important caching layers:
> > registers, cache, memory. There might be more now. Done well, this used to
> > buy >10x speed. Might even buy more, especially with matrices that blow L2
> > or even L3.
> >
> > Sent from my iPhone
> >
> > > On Apr 17, 2015, at 17:26, Dmitriy Lyubimov <[email protected]> wrote:
> > >
> > > Spent an hour on this today.
> > >
> > > What I am doing: simply reimplementing the pairwise dot-product algorithm
> > > in the stock dense matrix times().
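
For context, the pairwise dot-product kernel referred to above is essentially
the textbook triple loop below. This is only a sketch assuming the usual
Mahout math Matrix accessors (rowSize, columnSize, getQuick, setQuick), not
the actual times() source:

    import org.apache.mahout.math.{DenseMatrix, Matrix}

    // Pairwise dot-product multiply: C(i, j) = row_i(A) dot col_j(B).
    // It walks a row of A and a column of B for every output cell, which is
    // why a row-wise A paired with a column-wise B (RW-CW) is the
    // cache-friendly case.
    def pairwiseTimes(a: Matrix, b: Matrix): Matrix = {
      val c = new DenseMatrix(a.rowSize(), b.columnSize())
      var i = 0
      while (i < a.rowSize()) {
        var j = 0
        while (j < b.columnSize()) {
          var s = 0.0
          var k = 0
          while (k < a.columnSize()) {
            s += a.getQuick(i, k) * b.getQuick(k, j)
            k += 1
          }
          c.setQuick(i, j, s)
          j += 1
        }
        i += 1
      }
      c
    }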
> > >
> > > However, I am equipping every matrix with a structure "flavor" (i.e.
> > > dense(...) reports row-wise, dense(...).t reports column-wise,
> > > dense().t.t reports row-wise again, etc.).
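
A self-contained sketch of how such a flavor could be tracked; the names
(Flavor, FlavoredMatrix, Dense, TransposedView) are illustrative and are not
the actual Samsara structure API:

    // Every matrix reports a traversal orientation, and transposition flips
    // it, so dense(...).t.t comes out row-wise again.
    sealed trait Flavor
    case object RowWise extends Flavor
    case object ColWise extends Flavor

    trait FlavoredMatrix {
      def flavor: Flavor
      def t: FlavoredMatrix = TransposedView(this)
    }

    // Backing storage elided; only the flavor bookkeeping is shown.
    case class Dense() extends FlavoredMatrix {
      def flavor: Flavor = RowWise            // dense(...) is row-major
    }

    case class TransposedView(m: FlavoredMatrix) extends FlavoredMatrix {
      def flavor: Flavor = m.flavor match {   // transpose flips the flavor
        case RowWise => ColWise
        case ColWise => RowWise
      }
      override def t: FlavoredMatrix = m      // .t.t returns the original
    }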
> > >
> > > Next, I wrote a binary operator that switches on the combination of
> > > operand orientations and flips the misaligned operand(s) (if any) to
> > > match the most "speedy" orientation, RW-CW. Here are results for 300x300
> > > dense matrix pairs:
> > >
> > > Ad %*% Bd: (107.125,46.375)
> > > Ad' %*% Bd: (206.475,39.325)
> > > Ad %*% Bd': (37.2,42.65)
> > > Ad' %*% Bd': (100.95,38.025)
> > > Ad'' %*% Bd'': (120.125,43.3)
> > >
> > > These results are for transpose combinations of the original 300x300
> > > dense random matrices, averaged over 40 runs (so standard error should be
> > > well controlled), in ms. The first number is the stock times() application
> > > (i.e. what we'd do with the %*% operator now), and the second number is
> > > the time with rewriting matrices into RW-CW orientation.
> > >
> > > For example, AB reorients B only, just like A''B''; AB' reorients
> > > nothing; and the worst case, A'B, reorients both. (I also tried running a
> > > sum of outer products for the A'B case without reorientation -- apparently
> > > L1 misses far outweigh the cost of reorientation there; I got very bad
> > > results for the outer-product sum.)
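
Continuing the flavor sketch above, the reorientation step described here
might look roughly like the following; copyRowWise, copyColWise and stockTimes
are hypothetical helpers standing in for the physical re-layout and the stock
RW-CW kernel:

    // Flip any misaligned operand into the fast RW-CW combination before
    // calling the kernel. The copies are O(n^2) while the multiply is O(n^3),
    // which is why reorientation wins despite the extra work.
    def orientedTimes(a: FlavoredMatrix, b: FlavoredMatrix): FlavoredMatrix = {
      val aRw = a.flavor match {
        case RowWise => a                  // A in A %*% B or A %*% B'
        case ColWise => copyRowWise(a)     // A' in A' %*% B (worst case)
      }
      val bCw = b.flavor match {
        case ColWise => b                  // B' in A %*% B'
        case RowWise => copyColWise(b)     // B in A %*% B or A'' %*% B''
      }
      stockTimes(aRw, bCw)                 // always runs the speedy RW-CW path
    }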
> > >
> > > As we can see, the stock times() version does pretty badly even for
> > > dense operands in any orientation except the optimal one.
> > >
> > > Given that, I am inclined to just add the orientation-driven structure
> > > optimization here and replace all stock calls with just the orientation
> > > adjustment.
> > >
> > > Of course, I will need to extend this matrix with sparse and sparse row
> > > matrix combinations (quite a few of those, I guess) and see what happens
> > > compared to the stock sparse multiplications.
> > >
> > > But even that seems like a big win to me (basically, just doing the
> > > reorientation optimization seems to give a 3x speedup on average in
> > > matrix-matrix multiplication in 3 cases out of 4, and ties in 1 case).
> >
>
