Did your dot product use (+/ . *) ? That form should already call the
optimized BLAS routine and be roughly 10 times faster. YMMV.
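
Roughly what I have in mind (the name "mp" and the test shapes below are
only illustrative - substitute whatever your "dot" currently is):

       mp =: +/ . *              NB. matrix product; J has special code for this form
       a =: ? 300 100 $ 0        NB. random float matrices, arbitrary sizes
       b =: ? 100 50 $ 0
       10 timespacex 'a mp b'    NB. compare against the timing of your current dot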

On Fri, May 17, 2019, 1:27 AM 'Mike Day' via Programming <
[email protected]> wrote:

> I've tried various timings and tweaks - the dot products seem to consume
> the most time.
>
> It's marginally worth dividing by "num_examples" after summing
> "correct_logprobs", rather than summing the quotient
> " correct_logprobs%num_examples ".
>
> I added a couple of dot fns,  Tdot =: |:@[ dot ]  and  dotT =: dot |:  ,
> to neaten up the code a bit.  Those transposes seem unavoidable.
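>
> For example (the name "data_loss" is only illustrative):
>
>        data_loss =. (+/ correct_logprobs) % num_examples   NB. sum first, then one division
>        data_loss =. +/ correct_logprobs % num_examples     NB. per-example divisions, then the sum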
>
> In a practical application, you'd probably run cycles until either a
> suitable level of convergence
>
> is achieved - or until it's obvious that the process is divergent.
>
> Cheers,
>
> Mike
>
>
> On 16/05/2019 15:20, Brian Schott wrote:
> > Mike,
> >
> > Yes, I knew the reason the calculation was done, but was surprised by
> > the manner in which these authors applied the calculation (without the
> > multiplication), and I applied the Amend incorrectly by not remembering
> > that it was being applied to an array.
> >
> > And you are correct that the Amend approach is slower and more
> > space-consuming than the Product approach. I re-applied the Amend
> > approach -- correctly, this time, finally🤞 -- on a 'dbstopped' version
> > of `train` and got the following timings. In retrospect, both methods
> > require the condition check, and then multiplying by 0s and 1s may be
> > very fast relative to Amend's needs.
> >
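> >        NB. mnd: per row, Amend 0 into the right argument wherever the left argument is <: 0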
> >        mnd =: 0:`(I.@(0&>:)@[)`]}"1
> >        ((hidden_layer>0)*dscores dot|:W2)-:hidden_layer mnd dscores dot|:W2
> > 1
> >        10 timespacex'(hidden_layer>0)*dscores dot|:W2'
> > 0.0004102 301568
> >        10 timespacex'hidden_layer mnd dscores dot|:W2'
> > 0.0006501 535360
> >
> > And btw, mnd1 =: 0:`(I.@(0>:[))`]}"1  using a fork is very slightly
> > faster than mnd.
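> > (It passes the same equivalence check as mnd, e.g.
> >        ((hidden_layer>0)*dscores dot|:W2)-:hidden_layer mnd1 dscores dot|:W2
> > should give 1.)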
> >
> >
> > Thanks, again,
> >
> > On Thu, May 16, 2019 at 5:32 AM 'Mike Day' via Programming <
> > [email protected]> wrote:
> >
> >> The Python authors' comments here explain (well, they assert) why we're
> >> doing that filtering for hidden_layer > 0:
> >>
> >> " Now we have the gradient on the outputs of the hidden layer. Next, we
> >> have to backpropagate the ReLU non-linearity. This turns out to be easy
> >> because ReLU during the backward pass is effectively a switch. Since
> >> r=max(0,x) , we have that dr/dx = 1 (x>0) . Combined with the chain
> >> rule, we see that the ReLU unit lets the gradient pass through unchanged
> >> if its input was greater than 0, but kills it if its input was less than
> >> zero [or equal to zero - Mike's edit] during the forward pass."
> >>
> >> Isn't it curious that the J-way of doing it,
> >>
> >>       if. # ilow=. (<"1@:($ #: I.@:(0 >: ,))) hidden_layer do.  NB. find indices of elements <: 0
> >>          dhidden =. 0 ilow } dhidden
> >>       end.
> >>
> >> is much slower than the naive
> >>
> >>       dhidden =. (hidden_layer >0) * dscores dotT  W2
> >> ?
> >>
> >> Mike
> >>
> >>
> >> --
> > (B=)
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
