Thanks for the feedback.

af_matmul just starts the op on the GPU (or on other CPU threads) and
returns almost immediately. af_sync waits until the op is done. This is
clear if you play with them in the REPL.
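
Roughly what happens under the hood, sketched here with the ArrayFire C++
API on the assumption that the J verbs are thin covers over the equivalent
ArrayFire calls (the 4096x4096 size is just for illustration):

    #include <arrayfire.h>
    #include <cstdio>

    int main() {
        af::array a = af::randu(4096, 4096, f32);
        af::array b = af::randu(4096, 4096, f32);
        af::sync();                      // make sure the inputs are ready

        af::timer t = af::timer::start();
        af::array c = af::matmul(a, b);  // queues the kernel, returns at once
        double launch = af::timer::stop(t);

        t = af::timer::start();
        af::sync();                      // blocks until the device finishes
        double wait = af::timer::stop(t);

        printf("launch %.6f s, sync wait %.6f s\n", launch, wait);
        return 0;
    }

The launch time is near zero and essentially all of the wall time lands in
the sync step, which is why matmul shows 0 ms and sync shows ~3 s in the
table below.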

64-bit float support on retail NVIDIA GPUs is slow. For example, on my GTX
1050 Ti, matmul with f32 (single precision) is 25 times faster than with
f64.
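
To check the ratio on your own card, a minimal sketch in the same C++ API
(the 4096x4096 size is arbitrary and the 25x figure is just my 1050 Ti;
your numbers will differ):

    #include <arrayfire.h>
    #include <cstdio>

    // time one n x n matmul of the given type, including the compute
    static double time_matmul(af::dtype ty, int n) {
        af::array a = af::randu(n, n, ty);
        af::array b = af::randu(n, n, ty);
        af::sync();                          // inputs ready before timing
        af::timer t = af::timer::start();
        af::array c = af::matmul(a, b);
        af::sync();                          // wait for the multiply itself
        return af::timer::stop(t);
    }

    int main() {
        af::info();                          // shows which device is in use
        const int n = 4096;
        double t32 = time_matmul(f32, n);
        double t64 = time_matmul(f64, n);
        printf("f32 %.4f s  f64 %.4f s  ratio %.1fx\n", t32, t64, t64 / t32);
        return 0;
    }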

The next release will have changes to the benchmark script that make this
clear and will allow easier experiments. With f32, the mp_bench times are
more interesting.

The higher-end NVIDIA cards have much better f64 performance, and you would
see a big gain from using ArrayFire for matmul on larger arguments.


On Thu, Jan 27, 2022 at 8:41 PM Mike Powell <mdpow...@gmail.com> wrote:

> This looks really interesting.
>
> My first look at this was the mp_bench.ijs example. This produces (on my iMac):
>
> ┌────────┬────────┐
> │ step   │ millis │
> ├────────┼────────┤
> │mp      │3260    │
> ├────────┼────────┤
> │acreate │708     │
> ├────────┼────────┤
> │bcreate │474     │
> ├────────┼────────┤
> │matmul  │0       │
> ├────────┼────────┤
> │sync    │3252    │
> ├────────┼────────┤
> │get     │432     │
> ├────────┼────────┤
> │aftot   │4867    │
> ├────────┼────────┤
> │mp%aftot│0.669918│
> └────────┴────────┘
>
> That’s certainly not flattering for Arrayfire.
>
> My take on this goes like this. acreate and bcreate take significant time
> getting their arguments transposed. But once this is done, the matrix
> multiplication just zips by.
>
> Then comes the synchronization step. A full 3 seconds. Is this elapsed
> time? Surely not actual resources used time? (Maybe my Mac was doing a
> backup when I set this in motion.)
>
> And then the get step has to undo the transpose on the result. Some
> significant time there.
>
> What if we sent regular J data (row major) to Arrayfire and used the
> af_transpose function within Arrayfire to change columns to rows? Do the
> af_matmul and then finish up with another af_transpose on the result. This
> might be a lot quicker.
>
> Can someone expand on the timing of af_sync?
>
> If I were planning a machine learning (“ML”) example from J, I think it
> would end up as an initial passing of data in, followed by a good deal of
> af_ processing, finishing with an af_sync and a return of results to J.
> Simply trying to maximize the time spent with the fastest tool available.
>
> And that means expressing your ML logic in array terms. J should be good
> at that.
>
> Mike Powell
>
>
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
