On my MacBook Air M1, the time taken is about 0.3 sec.
Applying your patch and then linking against the framework didn't show any
improvement.

If you want optimized gemm performance, you need to compile with
USE_OPENMP=1.
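For a rough cross-check outside J, here is an analogous benchmark sketch in Python/NumPy, with matrix sizes matching the J example below. This is not from the original thread; depending on the build, NumPy may link Accelerate (the same BLAS path the patch targets) or OpenBLAS, so the numbers are only indicative.

```python
import time
import numpy as np

# Matrix sizes mirror the J benchmark: a is 1000x2000, b is 2000x3000.
rng = np.random.default_rng(0)
a = rng.random((1000, 2000))
b = rng.random((2000, 3000))

# Warm-up call first, since the original post notes a warm-up period.
_ = a @ b

start = time.perf_counter()
for _ in range(10):
    c = a @ b
elapsed = (time.perf_counter() - start) / 10
print(f"mean time per matmul: {elapsed:.4f} s")
```

Comparing this against `100 timex 'a +/ . * b'` in J should show whether the BLAS being hit is in the same ballpark as the patched build.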



On Wed, Mar 30, 2022 at 4:09 PM Elijah Stone <[email protected]> wrote:

> Recent apple arm CPUs include a hardware coprocessor for matrix
> multiplication.  This is nominally not directly accessible to user code
> (though it has been partly reverse engineered), but must be accessed
> through apple's blas implementation.  Attached trivial patch makes j use
> this rather than its own routines for large matrix multiplication on
> darwin/arm.  Performance delta is quite good.  Before:
>
>     a=. ?1e3 2e3$0
>     b=. ?2e3 3e3$0
>     100 timex 'a +/ . * b'
> 0.103497
>
> after:
>
>     100 timex 'a +/ . * b'
> 0.0274741
>     0.103497%0.0274741
> 3.76708
>
> Nearly 4x faster!
>
> There seems to be a warmup period (big buffers go brrr...), so the gemm
> threshold should perhaps be tuned.  I did not take detailed measurements.
>
> (Fine print: benchmarks taken on a 14in macbook w/m1pro.)
>
> Also of note: on desktop (zen2), numpy is 3x faster than j.  I tried
> swapping out j's mm microkernel for the newest from blis, and got only a
> modest boost, so the problem is not there.  I think numpy is using
> openblas.  (On arm, j and numpy are reasonably close, and the hardware
> accelerator smokes both.)
>
>   -E
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
>
