This does look good. 

One additional thought would be to do a standard multi-level blocking 
implementation of matrix times(). In my experience this often makes 
orientation much less important. 

The basic reason is that dense times() does n^3 arithmetic ops over only 
n^2 data, so each element can in principle be reused many times. By 
rearranging the loops you get reuse in registers and then reuse in L1 
and L2.  

The win that you are getting now is due to cache lines being fully used rather 
than partially used and then lost before they are touched again. 

The last time I did this, there were only three important caching 
layers: registers, cache, and memory. There might be more now. Done 
well, this used to buy a >10x speedup. It might buy even more, 
especially with matrices that blow out L2 or even L3. 
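
In code, one level of that blocking looks roughly like the sketch below 
(plain Scala over flat row-major arrays; this is only a sketch with a 
placeholder tile size, and a multi-level version nests further tiling, 
plus register blocking, inside these loops):

object BlockedTimes {
  val bs = 64 // placeholder tile size; tune so the working tiles fit in cache

  // C = A * B for n x n matrices in flat row-major arrays.
  def times(a: Array[Double], b: Array[Double], n: Int): Array[Double] = {
    val c = new Array[Double](n * n)
    var ii = 0
    while (ii < n) {
      var kk = 0
      while (kk < n) {
        var jj = 0
        while (jj < n) {
          // one bs x bs tile of A times one tile of B; the tiles stay
          // hot in cache while the inner loops sweep over them
          var i = ii
          while (i < math.min(ii + bs, n)) {
            var k = kk
            while (k < math.min(kk + bs, n)) {
              val aik = a(i * n + k)
              var j = jj
              while (j < math.min(jj + bs, n)) {
                c(i * n + j) += aik * b(k * n + j)
                j += 1
              }
              k += 1
            }
            i += 1
          }
          jj += bs
        }
        kk += bs
      }
      ii += bs
    }
    c
  }
}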


> On Apr 17, 2015, at 17:26, Dmitriy Lyubimov <[email protected]> wrote:
> 
> Spent an hour on this today.
> 
> What I am doing: simply reimplementing the pairwise dot-product
> algorithm in the stock dense matrix times().
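
For reference, that stock pairwise form looks roughly like this sketch 
(plain Scala over flat row-major arrays; an illustration, not the 
actual Mahout code):

object StockTimes {
  // C(i, j) = dot(row i of A, column j of B). With B stored row-wise,
  // the inner loop strides by n through b, touching a new cache line
  // on nearly every step; with B column-wise it runs contiguously.
  def times(a: Array[Double], b: Array[Double], n: Int): Array[Double] = {
    val c = new Array[Double](n * n)
    for (i <- 0 until n; j <- 0 until n) {
      var s = 0.0
      var k = 0
      while (k < n) {
        s += a(i * n + k) * b(k * n + j)
        k += 1
      }
      c(i * n + j) = s
    }
    c
  }
}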
> 
> In addition, I am equipping every matrix with a structure "flavor"
> (i.e. dense(...) reports row-wise, dense(...).t reports column-wise,
> dense(...).t.t reports row-wise again, etc.).
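
A minimal sketch of that flavor bookkeeping in plain Scala (names 
hypothetical, not the actual Mahout API):

sealed trait Flavor
case object RowWise extends Flavor
case object ColWise extends Flavor

object Flavor {
  // A transposed view flips the reported flavor, so m.t.t reports the
  // same flavor as m again.
  def flip(f: Flavor): Flavor = f match {
    case RowWise => ColWise
    case ColWise => RowWise
  }
}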
> 
> Next, I wrote a binary operator that switches on the combination of
> operand orientations and flips the misaligned operand(s) (if any) to
> match the "speediest" orientation, RW-CW (row-wise times
> column-wise). Here are the results for 300x300 dense matrix pairs:
> 
> Ad %*% Bd: (107.125,46.375)
> Ad' %*% Bd: (206.475,39.325)
> Ad %*% Bd': (37.2,42.65)
> Ad' %*% Bd': (100.95,38.025)
> Ad'' %*% Bd'': (120.125,43.3)
> 
> These results are for transpose combinations of the original 300x300
> dense random matrices, averaged over 40 runs (so the standard error
> should be well controlled), in ms. The first number is the stock
> times() application (i.e. what we'd do with the %*% operator now),
> and the second is the time after rewriting the matrices into RW-CW
> orientation.
> 
> For example, AB reorients B only, just like A''B''; AB' reorients
> nothing; and the worst case, A'B, reorients both. (I also tried
> running a sum of outer products for the A'B case without
> reorientation -- apparently L1 misses far outweigh the cost of
> reorientation there, as I got very bad results for the outer-product
> sum.)
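
A minimal sketch of that dispatch in plain Scala (all names here are 
hypothetical stand-ins, not the actual implementation):

sealed trait Flavor
case object RowWise extends Flavor
case object ColWise extends Flavor

// Stand-in matrix that carries only its flavor; real code holds data.
final case class Mat(flavor: Flavor)

object AlignedTimes {
  // Hypothetical helper: copy m into the requested physical layout.
  def cloneAs(m: Mat, f: Flavor): Mat = Mat(f)

  // Hypothetical fast kernel; assumes a row-wise and b column-wise.
  def timesRwCw(a: Mat, b: Mat): Mat = Mat(RowWise)

  // AB flips B only; AB' flips nothing; A'B flips both; A'B' flips A.
  def alignedTimes(a: Mat, b: Mat): Mat = {
    val aRw = if (a.flavor == RowWise) a else cloneAs(a, RowWise)
    val bCw = if (b.flavor == ColWise) b else cloneAs(b, ColWise)
    timesRwCw(aRw, bCw)
  }
}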
> 
> As we can see, the stock times() version does pretty badly even for
> dense operands in any orientation except the optimal one.
> 
> Given that, I am inclined to just add the orientation-driven
> structure optimization here and replace all the stock calls with the
> orientation-adjusting version.
> 
> Of course I will need to extend this combination matrix with sparse
> and sparse row matrix combinations (quite a few of those, I guess)
> and see what happens compared to the stock sparse multiplications.
> 
> But even that seems like a big win to me (basically, just doing the
> reorientation optimization gives a 3x speedup on average in
> matrix-matrix multiplication in 3 cases out of 4, and a tie in the
> remaining case).
