Ah, I forgot: if the matrix is small, e.g. less than 11x11, then it won't call the BLAS routine.
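For example (the exact cutoff is from memory, and the sizes below are only illustrative), comparing a product below and above the cutoff with timespacex should show the difference:

   dot =: +/ . *
   a10  =: ? 10 10 $ 0       NB. below the reported cutoff: handled by the interpreter's own loop
   a500 =: ? 500 500 $ 0     NB. well above it: should go through the optimized BLAS path
   10 timespacex 'a10 dot a10'
   10 timespacex 'a500 dot a500'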
On Fri, May 17, 2019, 7:22 AM bill lam <[email protected]> wrote:

> If you use j807 with an AVX-capable CPU, it should call the optimized BLAS
> for that pattern; you can compare with previous versions of J. If you want
> it even faster, you can build j.dll/libj.so from source to enable multiple
> core support, and performance will scale up with the number of cores used.
>
> On Fri, May 17, 2019, 7:13 AM 'Mike Day' via Programming <[email protected]> wrote:
>
>> (Also answers Bill's post, just in)
>>
>> I think I misled you. Brian's “dot” is more correctly the matrix product,
>> such as
>>    2 3 (+/ . *)&i. 3 4
>> 20 23 26 29
>> 56 68 80 92
>> so we're talking about dot =: +/ . *
>>
>> In some cases, Brian needs to multiply an mxn matrix A by a kxn matrix B
>> for an mxk result,
>>    A dot |: B
>> In others, he needs C, shape mxn, by D, shape mxk, for an nxk result,
>>    (|: C) dot D
>> and of course, some are straight matrix multiplications.
>>
>> I defined Tdot =: |:@:[ +/ .* ]  and  dotT =: dot |:
>>
>> Are matrix multiplications going to be enhanced? And what about such
>> variants as these?
>>
>> Thanks,
>>
>> Mike
>>
>> Sent from my iPad
>>
>> > On 16 May 2019, at 18:43, Henry Rich <[email protected]> wrote:
>> >
>> > In the next beta +/@:*"1 uses 256-bit instructions, which should help
>> > with dot-products.
>> >
>> > Henry Rich
>> >
>> >> On 5/16/2019 1:27 PM, 'Mike Day' via Programming wrote:
>> >> I've tried various timings and tweaks - the dot products seem to
>> >> consume the most time; it's marginally worth dividing by "num_examples"
>> >> after summing "correct_logprobs" rather than summing the quotient,
>> >> "correct_logprobs % num_examples".
>> >>
>> >> I added a couple of dot fns, Tdot =: |:@[ dot ] and dotT =: dot |:
>> >> to neaten up the code a bit. Those transposes seem unavoidable.
>> >>
>> >> In a practical application, you'd probably run cycles until either a
>> >> suitable level of convergence is achieved - or until it's obvious that
>> >> the process is divergent.
>> >>
>> >> Cheers,
>> >>
>> >> Mike
>> >>
>> >>> On 16/05/2019 15:20, Brian Schott wrote:
>> >>> Mike,
>> >>>
>> >>> Yes, I knew the reason that the calculation was done, but was surprised
>> >>> by the manner in which these authors applied the calculation (without
>> >>> the multiplication), and I applied the Amend incorrectly, by not
>> >>> remembering that it was being applied to an array.
>> >>>
>> >>> And you are correct that the Amend approach is slower and more
>> >>> space-consuming than the Product approach. I re-applied -- correctly,
>> >>> this time, finally🤞 -- the Amend approach on a 'dbstopped' version of
>> >>> `train` and got the following timings. In retrospect, both methods
>> >>> require the condition check, and multiplying by 0 and 1 may be very
>> >>> fast relative to Amend's needs.
>> >>>
>> >>>    mnd =: 0:`(I.@(0&>:)@[)`]}"1
>> >>>    ((hidden_layer>0)*dscores dot|:W2)-:hidden_layer mnd dscores dot|:W2
>> >>> 1
>> >>>    10 timespacex'(hidden_layer>0)*dscores dot|:W2'
>> >>> 0.0004102 301568
>> >>>    10 timespacex'hidden_layer mnd dscores dot|:W2'
>> >>> 0.0006501 535360
>> >>>
>> >>> And btw, mnd1 =: 0:`(I.@(0>:[))`]}"1 using a fork is very slightly
>> >>> faster than mnd.
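>> >>>
>> >>> As a tiny sanity check with made-up numbers (not from the script), both
>> >>> approaches zero the same cells:
>> >>>
>> >>>    x =: 2 _1 0 3         NB. stands in for hidden_layer (one row)
>> >>>    y =: 10 20 30 40      NB. stands in for dscores dot|:W2
>> >>>    x mnd y               NB. Amend: write 0 where x <: 0
>> >>> 10 0 0 40
>> >>>    (x>0) * y             NB. Product: multiply by the 0-1 mask
>> >>> 10 0 0 40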
>> >>>
>> >>> Thanks, again,
>> >>>
>> >>> On Thu, May 16, 2019 at 5:32 AM 'Mike Day' via Programming <[email protected]> wrote:
>> >>>
>> >>>> The Python authors' comments here explain (well, they assert) why
>> >>>> we're doing that filtering for hidden_layer > 0:
>> >>>>
>> >>>> "Now we have the gradient on the outputs of the hidden layer. Next, we
>> >>>> have to backpropagate the ReLU non-linearity. This turns out to be easy
>> >>>> because ReLU during the backward pass is effectively a switch. Since
>> >>>> r = max(0,x), we have that dr/dx = 1(x>0). Combined with the chain
>> >>>> rule, we see that the ReLU unit lets the gradient pass through
>> >>>> unchanged if its input was greater than 0, but kills it if its input
>> >>>> was less than zero [or equal to zero - Mike's edit] during the forward
>> >>>> pass."
>> >>>>
>> >>>> Isn't it curious that the J-way of doing it,
>> >>>>
>> >>>>    if. # ilow=. (<"1@:($ #: I.@:(0 >: ,))) hidden_layer do. NB. find indices of elements <: 0
>> >>>>        dhidden =. 0 ilow } dhidden
>> >>>>    end.
>> >>>>
>> >>>> is much slower than the naive
>> >>>>
>> >>>>    dhidden =. (hidden_layer >0) * dscores dotT W2
>> >>>> ?
>> >>>>
>> >>>> Mike
>> >>>>
>> >>> --
>> >>> (B=)
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
