In the next beta +/@:*"1 uses 256-bit instructions, which should help
with dot-products.
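For illustration (toy values, not from the thread), the rank-1 sum-of-products phrase computes the same thing as the dot =: +/ . * used in the code, so a vector-matrix product can be phrased either way:

   dot =: +/ . *                   NB. matrix product used elsewhere in the thread
   x =: 3 1 4
   y =: 1 5 9
   (x +/@:* y) , (x dot y)         NB. same result for plain vectors
44 44
   M =: 3 2 $ 1 0 0 1 1 1
   x dot M                         NB. vector-matrix product
7 5
   x +/@:*"1 |: M                  NB. the same thing via the rank-1 phrase
7 5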
Henry Rich
On 5/16/2019 1:27 PM, 'Mike Day' via Programming wrote:
I've tried various timings and tweaks - the dot products seem to consume the most time; it's marginally worth dividing by "num_examples" after summing "correct_logprobs" rather than summing the quotient, "correct_logprobs % num_examples".
I added a couple of dot fns, Tdot =: |:@[ dot ] and dotT =: dot |: , to neaten up the code a bit. Those transposes seem unavoidable.
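For reference, a quick sketch of how those helpers behave, assuming dot =: +/ . * as in the rest of the code (the matrices here are made up):

   dot  =: +/ . *
   Tdot =: |:@[ dot ]              NB. x Tdot y  is  (|: x) dot y
   dotT =: dot |:                  NB. x dotT y  is  x dot |: y  (a hook)
   A =: 2 3 $ 1 2 3 4 5 6
   B =: 2 3 $ 1 0 1 0 1 0
   (A Tdot B) -: (|: A) dot B
1
   (A dotT B) -: A dot |: B
1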
In a practical application, you'd probably run cycles until either a suitable level of convergence is achieved - or until it's obvious that the process is divergent.
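A rough sketch of what such a stopping loop might look like; 'step', 'tol', and 'maxiter' are hypothetical names, not from the original code, and the divergence test is just a crude placeholder:

train_until =: verb define
NB. Sketch only: 'step' is assumed to run one training cycle and return
NB. the current loss; y is tolerance , maximum iteration count.
'tol maxiter' =. y
old =. _
for_i. i. maxiter do.
  loss =. step ''
  if. loss > 10 * old do. 'diverging' return. end.      NB. crude blow-up check
  if. tol > | old - loss do. 'converged' return. end.   NB. loss has settled
  old =. loss
end.
'hit iteration limit'
)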
Cheers,
Mike
On 16/05/2019 15:20, Brian Schott wrote:
Mike,
Yes, I knew the reason that the calculation was done, but was surprised by the manner in which these authors applied the calculation (without the multiplication), and I applied the Amend incorrectly, by not remembering that it was being applied to an array.
And you are correct that the Amend approach is slower and more space-consuming than the Product approach. I re-applied -- correctly, this time, finally🤞 -- the Amend approach on a 'dbstopped' version of `train` and got the following timings. In retrospect, both methods require the condition check, and then multiplying by 0 and 1 may be very fast relative to Amend's needs.
mnd =: 0:`(I.@(0&>:)@[)`]}"1
((hidden_layer>0)*dscores dot|:W2) -: hidden_layer mnd dscores dot|:W2
1
10 timespacex'(hidden_layer>0)*dscores dot|:W2'
0.0004102 301568
10 timespacex'hidden_layer mnd dscores dot|:W2'
0.0006501 535360
And btw, mnd1 =: 0:`(I.@(0>:[))`]}"1 , using a fork, is very slightly faster than mnd.
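A small check on made-up data (hl stands in for hidden_layer, g for dscores dot |:W2) showing that mnd, mnd1, and the mask product all agree:

   mnd  =: 0:`(I.@(0&>:)@[)`]}"1
   mnd1 =: 0:`(I.@(0>:[))`]}"1
   hl =: 2 3 $ 1 0 _2 3 _1 5
   g  =: 2 3 $ 10 20 30 40 50 60
   hl mnd g                        NB. gradient zeroed where hl <: 0
10 0  0
40 0 60
   (hl mnd g) -: hl mnd1 g
1
   (hl mnd g) -: (hl > 0) * g
1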
Thanks, again,
On Thu, May 16, 2019 at 5:32 AM 'Mike Day' via Programming <[email protected]> wrote:
The Python authors' comments here explain (well, they assert) why we're doing that filtering for hidden_layer > 0:
"Now we have the gradient on the outputs of the hidden layer. Next, we have to backpropagate the ReLU non-linearity. This turns out to be easy because ReLU during the backward pass is effectively a switch. Since r = max(0,x), we have that dr/dx = 1 (x>0). Combined with the chain rule, we see that the ReLU unit lets the gradient pass through unchanged if its input was greater than 0, but kills it if its input was less than zero [or equal to zero - Mike's edit] during the forward pass."
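To make the quoted point concrete, a tiny example on made-up numbers (relu here is just 0&>. , standing in for the forward pass):

   relu =: 0&>.                    NB. forward pass: r = max(0,x)
   x =: _2 _1 0 1 2
   relu x
0 0 0 1 2
   x > 0                           NB. dr/dx: 1 where x > 0, else 0
0 0 0 1 1
   up =: 5 5 5 5 5                 NB. toy upstream gradient
   up * x > 0                      NB. chain rule: gradient killed where x <: 0
0 0 0 5 5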
Isn't it curious that the J-way of doing it,
if. # ilow=. (<"1@:($ #: I.@:(0 >: ,))) hidden_layer do.  NB. find indices of elements <: 0
    dhidden =. 0 ilow } dhidden
end.
is much slower than the naive
dhidden =. (hidden_layer >0) * dscores dotT W2
?
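For what it's worth, here is what that index-finding phrase produces on made-up data, and a check that amending those cells to zero matches the mask approach:

   hl =: 2 3 $ 1 0 _2 3 _1 5       NB. toy stand-in for hidden_layer
   g  =: 2 3 $ 10 20 30 40 50 60   NB. toy stand-in for dscores dotT W2
   ($ #: I.@:(0 >: ,)) hl          NB. full indices of the elements <: 0
0 1
0 2
1 1
   ilow =. (<"1@:($ #: I.@:(0 >: ,))) hl
   (0 ilow } g) -: (hl > 0) * g    NB. amending those cells to 0 matches the mask
1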
Mike
--
(B=)
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm