How many cores does this use?

Henry Rich

On Wed, Mar 30, 2022, 4:09 AM Elijah Stone <[email protected]> wrote:

> Recent Apple ARM CPUs include a hardware coprocessor for matrix
> multiplication.  It is nominally not directly accessible to user code
> (though it has been partly reverse engineered) and must instead be reached
> through Apple's BLAS implementation.  The attached trivial patch makes J
> use that rather than its own routines for large matrix multiplication on
> darwin/arm.  The performance delta is quite good.  Before:
>
>     a=. ?1e3 2e3$0
>     b=. ?2e3 3e3$0
>     100 timex 'a +/ . * b'
> 0.103497
>
> after:
>
>     100 timex 'a +/ . * b'
> 0.0274741
>     0.103497%0.0274741
> 3.76708
>
> Nearly 4x faster!
>
> There seems to be a warmup period (big buffers go brrr...), so the size
> threshold at which J switches over to gemm should perhaps be tuned.  I did
> not take detailed measurements.
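>
> A rough way to probe both effects is to time the first multiplication in a
> fresh session separately from a steady-state average, and to try a size
> well below the 1e3-scale case above.  A minimal sketch (the 200x200 size
> and the names c and d are only illustrative):
>
>     timex 'a +/ . * b'       NB. first call: includes any warmup cost
>     100 timex 'a +/ . * b'   NB. steady-state average
>     c=. ?200 200$0
>     d=. ?200 200$0
>     100 timex 'c +/ . * d'   NB. small case: J's own routine may still win here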
>
> (Fine print: benchmarks taken on a 14-inch MacBook Pro with an M1 Pro.)
>
> Also of note: on desktop (Zen 2), numpy is 3x faster than J.  I tried
> swapping out J's matrix-multiply microkernel for the newest one from BLIS
> and got only a modest boost, so the bottleneck is not there.  I think
> numpy is using OpenBLAS.  (On ARM, J and numpy are reasonably close, and
> the hardware accelerator smokes both.)
>
>   -E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
