Did your dot product use (+/ . *) ? That form should already call the optimized BLAS routine and be roughly 10 times faster. YMMV.
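To make that concrete, here is a minimal sketch of the kind of thing I mean - the verb name dot and the array shapes are just made up for illustration:

   dot =: +/ . *              NB. matrix product; this is the form that should take the optimized (BLAS) path
   a   =: ? 500 300 $ 0       NB. hypothetical random float matrices
   b   =: ? 300 200 $ 0
   $ a dot b                  NB. shape of the product: 500 200
   10 timespacex 'a dot b'    NB. average seconds and bytes over 10 runs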
On Fri, May 17, 2019, 1:27 AM 'Mike Day' via Programming <[email protected]> wrote:

> I've tried various timings and tweaks - the dot products seem to consume
> the most time;
>
> it's marginally worth dividing by "num_examples" after summing
> "correct_logprobs" rather than summing the quotient,
> "correct_logprobs % num_examples".
>
> I added a couple of dot fns, Tdot =: |:@[ dot ] and dotT =: dot |:
> to neaten up the code a bit. Those transposes seem unavoidable.
>
> In a practical application, you'd probably run cycles until either a
> suitable level of convergence is achieved - or until it's obvious that
> the process is divergent.
>
> Cheers,
>
> Mike
>
> On 16/05/2019 15:20, Brian Schott wrote:
> > Mike,
> >
> > Yes, I knew the reason that the calculation was done, but was surprised
> > by the manner in which these authors applied the calculation (without
> > the multiplication), and I applied the Amend incorrectly, by not
> > remembering that it was being applied to an array.
> >
> > And you are correct that the Amend approach is slower and more space
> > consuming than the Product approach. I re-applied -- correctly, this
> > time, finally🤞 -- the Amend approach on a 'dbstopped' version of
> > `train` and got the following timings. In retrospect, both methods
> > require the condition check, and then multiplying by 0 and 1 may be
> > very fast relative to Amend's needs.
> >
> >    mnd =: 0:`(I.@(0&>:)@[)`]}"1
> >    ((hidden_layer>0) * dscores dot |:W2) -: hidden_layer mnd dscores dot |:W2
> > 1
> >    10 timespacex '(hidden_layer>0) * dscores dot |:W2'
> > 0.0004102 301568
> >    10 timespacex 'hidden_layer mnd dscores dot |:W2'
> > 0.0006501 535360
> >
> > And btw, mnd1 =: 0:`(I.@(0>:[))`]}"1 using a fork is very slightly
> > faster than mnd.
> >
> > Thanks, again,
> >
> > On Thu, May 16, 2019 at 5:32 AM 'Mike Day' via Programming <[email protected]> wrote:
> >
> >> The Python authors' comments here explain (well, they assert) why we're
> >> doing that filtering for hidden_layer > 0:
> >>
> >> "Now we have the gradient on the outputs of the hidden layer. Next, we
> >> have to backpropagate the ReLU non-linearity. This turns out to be easy
> >> because ReLU during the backward pass is effectively a switch. Since
> >> r = max(0, x), we have that dr/dx = 1 (x > 0). Combined with the chain
> >> rule, we see that the ReLU unit lets the gradient pass through unchanged
> >> if its input was greater than 0, but kills it if its input was less than
> >> zero [or equal to zero - Mike's edit] during the forward pass."
> >>
> >> Isn't it curious that the J-way of doing it,
> >>
> >>    if. # ilow =. (<"1@:($ #: I.@:(0 >: ,))) hidden_layer do.  NB. find indices of elements <: 0
> >>      dhidden =. 0 ilow } dhidden
> >>    end.
> >>
> >> is much slower than the naive
> >>
> >>    dhidden =. (hidden_layer > 0) * dscores dotT W2
> >>
> >> ?
> >>
> >> Mike
> >
> > --
> > (B=)
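For anyone who wants to replay Brian's Amend-versus-multiply comparison above without the full network state, here is a small self-contained sketch; hidden and grad are made-up stand-ins for hidden_layer and dscores dot |:W2, and the shapes are arbitrary:

   dot    =: +/ . *
   mnd    =: 0:`(I.@(0&>:)@[)`]}"1           NB. Amend: zero out y wherever x <: 0, row by row
   hidden =: 0.5 -~ ? 1000 100 $ 0           NB. stand-in for hidden_layer, roughly half the entries negative
   grad   =: 0.5 -~ ? 1000 100 $ 0           NB. stand-in for the upstream gradient dscores dot |:W2
   ((hidden > 0) * grad) -: hidden mnd grad  NB. 1 - both forms mask the gradient identically
   10 timespacex '(hidden > 0) * grad'
   10 timespacex 'hidden mnd grad'

The match should be 1 either way, since multiplying by the 0-1 mask and amending the <: 0 positions to zero build the same array; only the time and space differ.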
