No, the 1-core code has a non-AVX version.  The cache management provides a big gain whether you use the wide ALU or not.

Henry Rich

On 5/16/2019 8:55 PM, bill lam wrote:
I think fast 1-core code is non-existent for non-AVX CPUs, so it should only
switch over for AVX CPUs.

On Fri, May 17, 2019, 8:20 AM Henry Rich <[email protected]> wrote:

The code sizes the problem by calculating m*p*n, and:

if m*p*n <= 1000, it uses +/@(*"1 _)
if 1000 < m*p*n <= 5000000, it uses the fast 1-core code
if 5000000 < m*p*n, it uses BLAS
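(A minimal J sketch of that dispatch, just to make the rule concrete. The names mpsize, fastmp, blasmp, and dispatch are hypothetical; the real switching happens inside the interpreter, so the stand-ins here are all just +/ . * :)

mpsize =: #@[ * [: */ $@]    NB. m*p*n for x (m by p) times y (p by n)
fastmp =: blasmp =: +/ . *   NB. stand-ins for the internal routines
dispatch =: dyad define
  s =. x mpsize y
  if. s <: 1000 do. x +/@(*"1 _) y       NB. Ken's definition
  elseif. s <: 5000000 do. x fastmp y    NB. fast 1-core routine
  elseif. 1 do. x blasmp y               NB. BLAS
  end.
)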

Maybe some ARM user will write a Neon version.

Henry Rich


On 5/16/2019 8:12 PM, bill lam wrote:
Whether the switchover is at 11x11 or 100x100 I'm not so sure. Anyway, there is some arbitrary threshold.

IIUC 256-bit arithmetic only applies to x86_64 AVX. arm64 has NEON but needs another implementation.

On Fri, May 17, 2019, 8:01 AM Henry Rich <[email protected]> wrote:

The switchover to BLAS is at more like 100x100 IIRC; for smaller than
that it uses a fast 1-core routine that I wrote.

+/ . * is highly optimized, but for two different cases.  Say you are
multiplying mxp times pxn to produce an mxn product.  If m, p, and n are
big enough to allow the result to be calculated in 4x4 blocks, by
careful management of caches the I/O can be reduced relative to the
arithmetic, and the product can be produced as fast as the ALU can do
the arithmetic.
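(A rough J illustration of the blocking idea -- this models only the arithmetic decomposition into 4x4 blocks, not the cache behavior, and it assumes dimensions that are multiples of 4. The names tile, asm, and bmp are made up for this sketch:)

   mp   =: +/ . *
   tile =: (4 4 ,: 4 4) <;._3 ]              NB. cut a matrix into 4x4 blocks
   asm  =: [: ,/ [: > ,.&.>/"1               NB. reassemble a matrix of blocks
   bmp  =: [: asm (+&.>/@:(mp&.>)"1 _)&tile  NB. block [i;j] is +/ over k of A[i;k] mp B[k;j]
   A =: ? 8 8 $ 0
   B =: ? 8 8 $ 0
   (A mp B) -: A bmp B                       NB. the blockwise product matches
1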

If n is 1, what you have is a series of dot-products.  These are produced with special code that uses multiple 256-bit accumulators (in the next beta; now there are multiple 64-bit accumulators) to produce each scalar result.  This code is directly invoked via +/@:*"1, but +/ . * switches over to it when it feels like it.
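(For example, the two forms agree; the difference is only which internal code gets used:)

   mp =: +/ . *
   M =: ? 4 3 $ 0
   v =: ? 3 $ 0
   (M mp v) -: M +/@:*"1 v    NB. one dot-product per row of M
1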

Other values of m, n, and p are not as efficient, because working in 4x4 blocks has edge wastage.  If the matrices are smaller than 11x11 or so, the system just evaluates it as +/@(*"1 _), like Ken defined it.
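(That is, for small arguments:)

   a =: ? 5 4 $ 0
   b =: ? 4 3 $ 0
   (a +/ . * b) -: a +/@(*"1 _) b
1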

If the matrices are really big the system calls BLAS, which uses similar
techniques but can use multiple cores.

Henry Rich

On 5/16/2019 7:31 PM, bill lam wrote:
Ah, I forgot: if the size of the matrix is small, e.g. less than 11x11, then it
won't call the BLAS routine.

On Fri, May 17, 2019, 7:22 AM bill lam <[email protected]> wrote:

If you use j807 with an AVX-capable CPU, it should call optimized BLAS for
that pattern; you can compare with previous versions of J. If you want it even
faster, you can build j.dll/libj.so from source to enable multiple-core
support, and performance will scale up with the number of cores used.

On Fri, May 17, 2019, 7:13 AM 'Mike Day' via Programming <
[email protected]> wrote:

(Also answers Bill’s post, just in)

I think I misled you. Brian’s “dot” is more correctly the matrix product,
such as
        2 3 (+/ . *)&i. 3 4
20 23 26 29
56 68 80 92
so we’re talking about dot =: +/ . *

In some cases, Brian needs to multiply an mxn matrix A by a kxn matrix B for an mxk result,
        A dot |: B
In others, he needs C, shape mxn, by D, shape mxk, for an nxk result,
        (|: C) dot D
and of course, some are straight matrix multiplications.

I defined    Tdot =: |:@:[ +/ .* ] and dotT =: dot |:
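(For reference, a quick shape check of those variants -- toy sizes, m=2, n=3, k=4, using dot =: +/ . * :)

   dot  =: +/ . *
   Tdot =: |:@:[ dot ]     NB. (|: x) dot y
   dotT =: dot |:          NB. x dot (|: y)
   A =: ? 2 3 $ 0
   B =: ? 4 3 $ 0
   $ A dotT B              NB. mxk
2 4
   C =: ? 2 3 $ 0
   D =: ? 2 4 $ 0
   $ C Tdot D              NB. nxk
3 4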

Are matrix multiplications going to be enhanced?  And what about such variants as these?

Thanks,

Mike

Sent from my iPad

On 16 May 2019, at 18:43, Henry Rich <[email protected]> wrote:

In the next beta +/@:*"1 uses 256-bit instructions, which should help with dot-products.

Henry Rich

On 5/16/2019 1:27 PM, 'Mike Day' via Programming wrote:
I've tried various timings and tweaks - the dot products seem to consume the most time; it's marginally worth dividing by "num_examples" after summing "correct_logprobs" rather than summing the quotient, " correct_logprobs%num_examples ".
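(A tiny sketch of the two orderings, with made-up sizes -- the results agree tolerantly; the saving is one division instead of num_examples of them:)

   num_examples =: 1000
   correct_logprobs =: ? num_examples $ 0
   ((+/ correct_logprobs) % num_examples) -: +/ correct_logprobs % num_examples
1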

I added a couple of dot fns, Tdot =: |:@[ dot ] and dotT =: dot |: , to neaten up the code a bit.  Those transposes seem unavoidable.

In a practical application, you'd probably run cycles until either a suitable level of convergence is achieved - or until it's obvious that the process is divergent.

Cheers,

Mike


On 16/05/2019 15:20, Brian Schott wrote:
Mike,

Yes, I knew the reason that the calculation was done, but was surprised by the manner in which these authors applied the calculation (without the multiplication), and I applied the Amend incorrectly, by not remembering that it was being applied to an array.

And you are correct that the Amend approach is slower and more space-consuming than the Product approach. I re-applied -- correctly, this time, finally🤞 -- the Amend approach on a 'dbstopped' version of `train` and got the following timings. In retrospect, both methods require the condition check, and then multiplying by 0 and 1 may be very fast relative to Amend's needs.

         mnd =: 0:`(I.@(0&>:)@[)`]}"1
         ((hidden_layer>0)*dscores dot|:W2) -: hidden_layer mnd dscores dot|:W2
1
         10 timespacex'(hidden_layer>0)*dscores dot|:W2'
0.0004102 301568
         10 timespacex'hidden_layer mnd dscores dot|:W2'
0.0006501 535360

And btw, mnd1 =: 0:`(I.@(0>:[))`]}"1 , using a fork, is very slightly faster than mnd.
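(A toy check of what mnd does -- per row, it zeroes the positions of the right argument wherever the left argument is <: 0:)

   mnd =: 0:`(I.@(0&>:)@[)`]}"1
   2 _1 3 mnd 10 20 30
10 0 30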


Thanks, again,

On Thu, May 16, 2019 at 5:32 AM 'Mike Day' via Programming <
[email protected]> wrote:

The Python authors' comments here explain (well, they assert) why we're doing that filtering for hidden_layer > 0:

" Now we have the gradient on the outputs of the hidden layer.
Next,
we
have to backpropagate the ReLU non-linearity. This turns out to
be
easy
because ReLU during the backward pass is effectively a switch.
Since
r=max(0,x) , we have that dr/dx = 1 (x>0) . Combined with the
chain
rule, we see that the ReLU unit lets the gradient pass through
unchanged
if its input was greater than 0, but kills it if its input was
less
than
zero [or equal to zero - Mike's edit] during the forward pass."

Isn't it curious that the J-way of doing it,

        if. # ilow=. (<"1@:($ #: I.@:(0 >: ,))) hidden_layer do.  NB. find indices of elements <: 0
           dhidden =. 0 ilow } dhidden
        end.

is much slower than the naive

        dhidden =. (hidden_layer >0) * dscores dotT  W2

?

Mike


--
(B=)

----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm