Before the patch: it used all available cores (10 on my system).  After: I think the coprocessor is shared, either one for the whole system or one for every n cores.

On Wed, 30 Mar 2022, Henry Rich wrote:

How many cores does this use?

Henry Rich

On Wed, Mar 30, 2022, 4:09 AM Elijah Stone <[email protected]> wrote:

Recent apple arm CPUs include a hardware coprocessor for matrix
multiplication.  It is nominally not directly accessible to user code
(though it has been partly reverse engineered) and must instead be reached
through apple's blas implementation (the Accelerate framework).  The
attached trivial patch makes j use that rather than its own routines for
large matrix multiplication on darwin/arm; a sketch of the kind of blas
call involved follows the timings below.  The performance delta is quite
good.  Before:

    a=. ?1e3 2e3$0
    b=. ?2e3 3e3$0
    100 timex 'a +/ . * b'
0.103497

after:

    100 timex 'a +/ . * b'
0.0274741
    0.103497%0.0274741
3.76708

Nearly 4x faster!
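
For anyone curious, here is a minimal standalone C sketch of the kind of
call involved (not the actual patch; the shapes are just the ones from the
benchmark above).  Accelerate exposes the standard cblas interface, so a
large double-precision product presumably goes through something like
cblas_dgemm.  Build with something like: cc -O2 demo.c -framework Accelerate

    #include <stdio.h>
    #include <stdlib.h>
    #include <Accelerate/Accelerate.h>   /* pulls in the cblas interface */

    int main(void) {
        /* same shapes as the j benchmark: (1000x2000) * (2000x3000) */
        const int M = 1000, K = 2000, N = 3000;
        double *a = malloc(sizeof(double) * M * K);
        double *b = malloc(sizeof(double) * K * N);
        double *c = malloc(sizeof(double) * M * N);
        for (int i = 0; i < M * K; i++) a[i] = drand48();
        for (int i = 0; i < K * N; i++) b[i] = drand48();

        /* c = 1.0*a*b + 0.0*c; row-major, no transposes */
        cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                    M, N, K, 1.0, a, K, b, N, 0.0, c, N);

        printf("c[0][0] = %g\n", c[0]);
        free(a); free(b); free(c);
        return 0;
    }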

There seems to be a warmup period (big buffers go brrr...), so the gemm
threshold should perhaps be tuned.  I did not take detailed measurements.
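
If someone wants to check the warmup, one way is to time each call
separately rather than averaging many runs; a rough harness along these
lines (again a sketch, not measurements I took):

    #include <stdio.h>
    #include <stdlib.h>
    #include <time.h>
    #include <Accelerate/Accelerate.h>

    int main(void) {
        const int M = 1000, K = 2000, N = 3000;
        double *a = malloc(sizeof(double) * M * K);
        double *b = malloc(sizeof(double) * K * N);
        double *c = malloc(sizeof(double) * M * N);
        for (int i = 0; i < M * K; i++) a[i] = drand48();
        for (int i = 0; i < K * N; i++) b[i] = drand48();

        /* time each of 10 successive calls; if there is a warmup,
           the first few should stand out from the steady state */
        for (int rep = 0; rep < 10; rep++) {
            struct timespec t0, t1;
            clock_gettime(CLOCK_MONOTONIC, &t0);
            cblas_dgemm(CblasRowMajor, CblasNoTrans, CblasNoTrans,
                        M, N, K, 1.0, a, K, b, N, 0.0, c, N);
            clock_gettime(CLOCK_MONOTONIC, &t1);
            printf("rep %d: %.6f s\n", rep,
                   (t1.tv_sec - t0.tv_sec) + 1e-9 * (t1.tv_nsec - t0.tv_nsec));
        }
        free(a); free(b); free(c);
        return 0;
    }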

(Fine print: benchmarks taken on a 14in macbook w/m1pro.)

Also of note: on desktop (zen2), numpy is about 3x faster than j at this
product.  I tried swapping out j's matrix-multiply microkernel for the
newest one from blis and got only a modest boost, so the gap is not in the
microkernel.  I think numpy is using openblas.  (On arm, j and numpy are
reasonably close, and the hardware accelerator smokes both.)

  -E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
