Re: Structure-based a %*% b optimization results.

2015-04-20 Thread Dmitriy Lyubimov
Two specific papers on my part to rely on: Kazushige Goto's [1] describes
decomposing the dense multiplication problem into six kernels, plus tuning.
My understanding is that this is basically how GotoBLAS (now, I believe,
OpenBLAS) works. In particular, in section 4.2.4 he speaks of the importance
of re-packing the data into a contiguous memory format, even if that means
memory copying. This is more or less how my hack works; what I did is
roughly equivalent to the small gepb kernel. Of course, for bigger matrices
this needs to be carefully tuned and decomposed into kernels per the paper.
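
To make the re-packing point concrete, here is a minimal sketch (plain
Scala, flat arrays, illustrative names -- not Goto's kernels and not the
Mahout API). One O(n*p) copy of a column-major operand buys fully
sequential reads in the O(m*n*p) loop:

object PackedMultiply {

  // a: m x n, row-major; bColMajor: n x p, column-major (e.g. a transposed
  // view). Returns c = a * b, m x p, row-major.
  def times(a: Array[Double], m: Int, n: Int,
            bColMajor: Array[Double], p: Int): Array[Double] = {

    // Pack B into row-major order: one O(n*p) copy so the innermost loop
    // below walks memory sequentially.
    val bPacked = new Array[Double](n * p)
    var j = 0
    while (j < p) {
      var k = 0
      while (k < n) {
        bPacked(k * p + j) = bColMajor(j * n + k)
        k += 1
      }
      j += 1
    }

    val c = new Array[Double](m * p)
    var i = 0
    while (i < m) {
      var k = 0
      while (k < n) {
        val aik = a(i * n + k)
        var jj = 0
        while (jj < p) {
          // contiguous scan of both bPacked and c
          c(i * p + jj) += aik * bPacked(k * p + jj)
          jj += 1
        }
        k += 1
      }
      i += 1
    }
    c
  }
}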

[2] basically says to hell with kernel tuning -- it claims to beat both [1]
and MKL -- and uses space-filling curves for cache optimization instead. I
like this, but that particular work is also just for dense matrices.
Locality-sensitive sparse multiplication is yet to be scavenged for. Like I
said, this is much more appealing since it seems to forgo the
architecture-specific tuning necessary for other BLAS libraries.
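
To give a flavor of the curve idea: [2] uses Peano curves, but the simpler
Morton (Z-order) indexing below illustrates the same cache-oblivious
principle. This is just an illustration of the index ordering, not the
paper's multiplication algorithm:

object MortonOrder {

  // Spread the low 16 bits of x so they occupy even bit positions.
  private def spreadBits(x: Int): Int = {
    var v = x & 0xffff
    v = (v | (v << 8)) & 0x00ff00ff
    v = (v | (v << 4)) & 0x0f0f0f0f
    v = (v | (v << 2)) & 0x33333333
    v = (v | (v << 1)) & 0x55555555
    v
  }

  // Morton index: bits of i and j interleaved. Nearby (i, j) cells land
  // near each other in memory, so 2x2, 4x4, 8x8... sub-blocks are
  // contiguous at every level, with no knowledge of cache sizes.
  def index(i: Int, j: Int): Int = (spreadBits(i) << 1) | spreadBits(j)

  def main(args: Array[String]): Unit = {
    // The 2x2 block (0..1, 0..1) maps to indices 0..3 -- one contiguous run.
    for (i <- 0 to 1; j <- 0 to 1) println(s"($i,$j) -> ${index(i, j)}")
  }
}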

Let me know if there's any interest in working on either of the two.

[1] Goto, van de Geijn. Anatomy of High-Performance Matrix Multiplication.
[2] Bader, Heinecke. Cache Oblivious Dense and Sparse Matrix Multiplication
Based on Peano Curves.


Re: Structure-based a %*% b optimization results.

2015-04-18 Thread Ted Dunning
Sadly, no, since that was from a different job.

But here are some references with snippets:

This one indicates that things have changed dramatically even just from
2009:
http://www.cs.cornell.edu/~bindel/class/cs6210-f12/notes/lec02.pdf

This next one is a web aside from a pretty good-looking book [1]:
http://csapp.cs.cmu.edu/2e/waside/waside-blocking.pdf

I would guess that Samsara's optimizer could well do blocking as well as
the transpose transformations that Dmitriy is talking about.


[1] http://csapp.cs.cmu.edu/



On Fri, Apr 17, 2015 at 10:24 PM, Andrew Musselman <andrew.mussel...@gmail.com> wrote:

> Ted you have any sample code snippets?


Re: Structure-based a %*% b optimization results.

2015-04-17 Thread Dmitriy Lyubimov
Yes, that's the idea -- make LHS or RHS reads mostly hit the cache with
minimal effort. This of course will not be optimal, but it seems like a
low-cost improvement for small matrices.

I would not say this looks good, though. This still looks bad; it's just
that the stock version looks awful.

There's a paper from circa 2007 that claims to outperform even the MKL and
GotoBLAS of that time with space-filling-curve algorithms. Incidentally, I
have dealt with space-filling-curve algorithms quite a bit, so hopefully it
is not prohibitively complex for me. If matrix rewriting costs are fairly
tame, as they seem to be, then it probably makes sense to explore this for
a JVM-only speedup of the sparse stuff. The nice thing about it is that it
is oblivious to the CPU cache architecture and sizes, or so one of the
claims goes (although I would think MKL would have caught up with anything
published by now).

Although generally I would probably not go into the in-core algebraic stuff
myself, but rather take something off the shelf as an additional option --
netlib or BIDMat.


Re: Structure-based a %*% b optimization results.

2015-04-17 Thread Andrew Musselman
Ted you have any sample code snippets?


Re: Structure-based a %*% b optimization results.

2015-04-17 Thread Ted Dunning

This does look good. 

One additional thought would be to do a standard multi-level blocking 
implementation of matrix times. In my experience this often makes orientation 
much less important. 

The basic reason is that dense times requires n^3 arithmetic ops but touches
only n^2 distinct values. By rearranging the loops you get reuse in registers
and then reuse in L1 and L2.

The win that you are getting now is due to cache lines being fully used rather 
than partially used and then lost before they are touched again. 

The last time I did this, there were only three important caching layers.  
Registers. Cache. Memory. There might be more now.  Done well, this used to buy 
>10x speed.  Might even buy more, especially with matrices that blow L2 or even 
L3. 
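
Roughly, the shape is as in this from-scratch Scala sketch (flat row-major
arrays, arbitrary block size -- not the code from that job; a real version
would tune the block to the cache and add a register-level micro-kernel):

object BlockedMultiply {
  val Block = 64 // arbitrary; tune to L1/L2 in practice

  // c = a * b for n x n row-major matrices. Each (ii, kk, jj) block triple
  // is small enough that the A block and B block stay resident in cache
  // across the inner loops, so each loaded cache line is reused many times.
  def times(a: Array[Double], b: Array[Double], n: Int): Array[Double] = {
    val c = new Array[Double](n * n)
    var ii = 0
    while (ii < n) {
      var kk = 0
      while (kk < n) {
        var jj = 0
        while (jj < n) {
          // multiply the (ii,kk) block of A into the (ii,jj) block of C
          var i = ii
          while (i < math.min(ii + Block, n)) {
            var k = kk
            while (k < math.min(kk + Block, n)) {
              val aik = a(i * n + k)
              var j = jj
              while (j < math.min(jj + Block, n)) {
                c(i * n + j) += aik * b(k * n + j)
                j += 1
              }
              k += 1
            }
            i += 1
          }
          jj += Block
        }
        kk += Block
      }
      ii += Block
    }
    c
  }
}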

Sent from my iPhone


Re: Structure-based a %*% b optimization results.

2015-04-17 Thread Andrew Musselman
Looks like a great improvement.


Re: Structure-based a %*% b optimization results.

2015-04-17 Thread Dmitriy Lyubimov
I mean, before we consider hardware-based implementations for bigger
matrices, this change seems like a very easy win.



Structure-based a %*% b optimization results.

2015-04-17 Thread Dmitriy Lyubimov
Spent an hour on this today.

What I am doing: simply reimplementing the pairwise dot-product algorithm of
the stock dense matrix times().

However, I am equipping every matrix with a structure "flavor" (i.e.
dense(...) reports row-wise, dense(...).t reports column-wise,
dense(...).t.t reports row-wise again, etc.).

Next, I wrote a binary operator that switches on the combination of operand
orientations and flips the misaligned operand(s) (if any) to match the
"speedy" orientation, RW-CW (row-wise times column-wise). Here are the
results for 300x300 dense matrix pairs:

Ad %*% Bd: (107.125,46.375)
Ad' %*% Bd: (206.475,39.325)
Ad %*% Bd': (37.2,42.65)
Ad' %*% Bd': (100.95,38.025)
Ad'' %*% Bd'': (120.125,43.3)

These results are for transpose combinations of the original 300x300 dense
random matrices, averaged over 40 runs (so the standard error should be well
controlled), in ms. The first number is the stock times() application (i.e.
what we'd do with the %*% operator now), and the second number is the time
with rewriting matrices into RW-CW orientation.

For example, AB reorients B only, just like A''B''; AB' reorients nothing;
and the worst case, A'B, reorients both. (I also tried a sum of outer
products for the A'B case without reorientation -- apparently L1 misses far
outweigh the costs of reorientation; I got very bad results for the
outer-product sum.)
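
For reference, the outer-product formulation is roughly the sketch below
(plain Scala, flat row-major layout assumed -- an illustration, not the
actual code). Rows of A and B are read contiguously, but every step k
rewrites the whole accumulator, and a 300x300 double matrix is ~720 KB,
nowhere near L1-resident:

object OuterProductSum {

  // a: n x m row-major, b: n x p row-major; returns A' * B (m x p) as the
  // sum over k of (row k of A) outer (row k of B).
  def atb(a: Array[Double], b: Array[Double],
          n: Int, m: Int, p: Int): Array[Double] = {
    val c = new Array[Double](m * p)
    var k = 0
    while (k < n) {
      var i = 0
      while (i < m) {
        val aki = a(k * m + i)       // contiguous read of row k of A
        var j = 0
        while (j < p) {
          // contiguous read of row k of B, but this touches ALL of c
          // on every k -- the source of the L1 thrashing.
          c(i * p + j) += aki * b(k * p + j)
          j += 1
        }
        i += 1
      }
      k += 1
    }
    c
  }
}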

As we can see, the stock times() version does pretty badly even for dense
operands, in any orientation except the optimal one.

Given that, I am inclined to just add orientation-driven structure
optimization here and replace all stock calls with just the orientation
adjustment.
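
A sketch of what that could look like (plain Scala with a nested-array
layout; the flavor names and multiply loop here are illustrative, not the
actual Mahout classes):

object OrientedMultiply {

  sealed trait Flavor
  case object RowWise extends Flavor
  case object ColWise extends Flavor

  // flavor says what data(i) holds: a row of the logical matrix, or a
  // column. A transposed view just reports the opposite flavor, no copy.
  case class Mat(data: Array[Array[Double]], flavor: Flavor)

  // O(n^2) copy that flips the physical orientation of the same logical matrix.
  private def reorient(m: Mat): Mat = {
    val t = Array.tabulate(m.data(0).length, m.data.length)((i, j) => m.data(j)(i))
    Mat(t, if (m.flavor == RowWise) ColWise else RowWise)
  }

  def times(a: Mat, b: Mat): Mat = {
    // Normalize to the speedy RW-CW case: rows of A and columns of B both
    // contiguous. AB' flips nothing, AB flips B, A'B' flips A, and the
    // worst case A'B flips both.
    val ar = if (a.flavor == RowWise) a else reorient(a)
    val bc = if (b.flavor == ColWise) b else reorient(b)
    val (m, n, p) = (ar.data.length, ar.data(0).length, bc.data.length)
    val c = Array.ofDim[Double](m, p)
    var i = 0
    while (i < m) {
      val ri = ar.data(i)
      var j = 0
      while (j < p) {
        val cj = bc.data(j)          // pairwise dot product, both contiguous
        var s = 0.0
        var k = 0
        while (k < n) { s += ri(k) * cj(k); k += 1 }
        c(i)(j) = s
        j += 1
      }
      i += 1
    }
    Mat(c, RowWise)
  }
}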

Of course I will need to extend this matrix of cases with the sparse and
sparse-row matrix combinations (quite a few of those, I guess) and see what
happens compared to the stock sparse multiplications.

But even that seems like a big win to me (basically, just doing the
reorientation optimization seems to give a 3x speedup on average in
matrix-matrix multiplication in 3 cases out of 4, and a tie in 1 case).