Recent Apple ARM CPUs include a hardware coprocessor for matrix multiplication. It is nominally not directly accessible to user code (though it has been partly reverse-engineered) and must instead be accessed through Apple's BLAS implementation. The attached trivial patch makes j use this rather than its own routines for large matrix multiplication on darwin/arm. The performance delta is quite good. Before:

   a=. ?1e3 2e3$0
   b=. ?2e3 3e3$0
   100 timex 'a +/ . * b'
0.103497

After:

   100 timex 'a +/ . * b'
0.0274741
   0.103497%0.0274741
3.76708

Nearly 4x faster!
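For a rough cross-check, here is the same shape of benchmark in numpy, which likewise hands large products to whatever BLAS it was linked against (Accelerate on macOS). This is only a sketch; absolute timings will of course differ by machine:

```python
import time
import numpy as np

# Same shapes as the J benchmark: (1000 x 2000) * (2000 x 3000).
a = np.random.rand(1000, 2000)
b = np.random.rand(2000, 3000)

start = time.perf_counter()
c = a @ b  # dispatches to the linked BLAS's dgemm
elapsed = time.perf_counter() - start
print(f"{elapsed:.4f}s, result shape {c.shape}")
```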

There seems to be a warmup period (big buffers go brrr...), so the gemm threshold should perhaps be tuned. I did not take detailed measurements.
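To illustrate the kind of tuning meant here, a toy dispatch sketch (Python, with a made-up `GEMM_THRESHOLD` — this is not the actual patch): small products stay on the in-house routine, large ones go to BLAS, and the crossover point is what would want measuring.

```python
import numpy as np

# Hypothetical cutoff: below this many multiply-adds, BLAS call
# overhead (and any coprocessor warmup) may outweigh its speedup.
GEMM_THRESHOLD = 32 * 32 * 32

def naive_matmul(a, b):
    """Triple-loop product, standing in for the interpreter's own kernel."""
    m, k = a.shape
    k2, n = b.shape
    assert k == k2
    out = np.zeros((m, n))
    for i in range(m):
        for j in range(n):
            out[i, j] = sum(a[i, p] * b[p, j] for p in range(k))
    return out

def matmul(a, b):
    """Dispatch on problem size: large products go to the BLAS-backed
    np.dot, small ones to the in-house kernel."""
    m, k = a.shape
    n = b.shape[1]
    if m * k * n >= GEMM_THRESHOLD:
        return np.dot(a, b)  # BLAS (Accelerate on darwin/arm)
    return naive_matmul(a, b)
```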

(Fine print: benchmarks were taken on a 14in MacBook Pro with an M1 Pro.)

Also of note: on desktop (Zen 2), numpy is 3x faster than j. I tried swapping out j's matrix-multiply microkernel for the newest one from BLIS and got only a modest boost, so the problem is not there. I think numpy is using OpenBLAS. (On ARM, j and numpy are reasonably close, and the hardware accelerator smokes both.)
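If anyone wants to confirm the OpenBLAS guess on their own box, numpy will report the BLAS/LAPACK it was built against:

```python
import numpy as np

# Prints the BLAS/LAPACK numpy was linked against; look for
# "openblas", "accelerate", or "mkl" in the output.
np.show_config()
```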

 -E
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
