How many cores does this use?

Henry Rich
On Wed, Mar 30, 2022, 4:09 AM Elijah Stone <[email protected]> wrote:

> Recent apple arm CPUs include a hardware coprocessor for matrix
> multiplication. This is nominally not directly accessible to user code
> (though it has been partly reverse engineered), but must be accessed
> through apple's blas implementation. Attached trivial patch makes j use
> this rather than its own routines for large matrix multiplication on
> darwin/arm. Performance delta is quite good. Before:
>
>    a=. ?1e3 2e3$0
>    b=. ?2e3 3e3$0
>    100 timex 'a +/ . * b'
> 0.103497
>
> After:
>
>    100 timex 'a +/ . * b'
> 0.0274741
>    0.103497%0.0274741
> 3.76708
>
> Nearly 4x faster!
>
> There seems to be a warmup period (big buffers go brrr...), so the gemm
> threshold should perhaps be tuned. I did not take detailed measurements.
>
> (Fine print: benchmarks taken on a 14in macbook w/m1pro.)
>
> Also of note: on desktop (zen2), numpy is 3x faster than j. I tried
> swapping out j's mm microkernel for the newest from blis, and got only a
> modest boost, so the problem is not there. I think numpy is using
> openblas. (On arm, j and numpy are reasonably close, and the hardware
> accelerator smokes both.)
>
> -E
> ----------------------------------------------------------------------
> For information about J forums see http://www.jsoftware.com/forums.htm
> ----------------------------------------------------------------------
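For anyone who wants to reproduce the cross-library comparison mentioned above, here is a minimal NumPy sketch of the same benchmark. It mirrors the J expression `a +/ . * b` (matrix product of a 1000x2000 by a 2000x3000 array of uniform random floats); the seed and timing approach are my own choices, not from the original post, and the absolute time will depend on which BLAS your NumPy build links against:

```python
import time

import numpy as np

# Same shapes as the J benchmark: a is 1000x2000, b is 2000x3000,
# entries uniform in [0, 1), like ?1e3 2e3$0 in J.
rng = np.random.default_rng(0)
a = rng.random((1000, 2000))
b = rng.random((2000, 3000))

start = time.perf_counter()
c = a @ b  # dispatches to whatever BLAS NumPy was built against
elapsed = time.perf_counter() - start

print(c.shape)  # → (1000, 3000)
print(f"matmul took {elapsed:.4f} s")
```

Running it a few times in a row also makes the warmup effect visible: the first call is typically slower than subsequent ones.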
