Whether the switchover is at 11x11 or 100x100 I'm not so sure; anyway, there is some arbitrary threshold.
IIUC 256-bit arithmetic only applies to x86_64 AVX; arm64 has NEON but needs another implementation.

On Fri, May 17, 2019, 8:01 AM Henry Rich <[email protected]> wrote:

The switchover to BLAS is at more like 100x100 IIRC; for smaller than that it uses a fast 1-core routine that I wrote.

+/ . * is highly optimized, but for two different cases. Say you are multiplying mxp times pxn to produce an mxn product. If m, p, and n are big enough to allow the result to be calculated in 4x4 blocks, then by careful management of caches the I/O can be reduced relative to the arithmetic, and the product can be produced as fast as the ALU can do the arithmetic.

If n is 1, what you have is a series of dot-products. These are produced with special code that uses multiple 256-bit accumulators (in the next beta; currently there are multiple 64-bit accumulators) to produce each scalar result. This code is directly invoked via +/@:*"1, but +/ . * switches over to it when it sees fit.

Other values of m, n, and p are not as efficient because working in 4x4 blocks has edge wastage. If the matrices are smaller than about 11x11, the system just evaluates +/@:(*"1 _) as Ken defined it.

If the matrices are really big, the system calls BLAS, which uses similar techniques but can use multiple cores.

Henry Rich

On 5/16/2019 7:31 PM, bill lam wrote:

Ah, I forgot: if the matrix is small, e.g. less than 11x11, then it won't call the BLAS routine.

On Fri, May 17, 2019, 7:22 AM bill lam <[email protected]> wrote:

If you use j807 with an AVX-capable CPU, it should call optimized BLAS for that pattern; you can compare with previous versions of J. If you want it even faster, you can build j.dll/libj.so from source to enable multiple-core support, and performance will scale up with the number of cores used.

On Fri, May 17, 2019, 7:13 AM 'Mike Day' via Programming <[email protected]> wrote:

(Also answers Bill's post, just in)

I think I misled you. Brian's "dot" is more correctly the matrix product, such as

   2 3 (+/ . *)&i. 3 4
20 23 26 29
56 68 80 92

so we're talking about dot =: +/ . *

In some cases, Brian needs to multiply an mxn matrix A by a kxn matrix B for an mxk result,

   A dot |: B

In others, he needs C, shape mxn, by D, shape mxk, for an nxk result,

   (|: C) dot D

and of course, some are straight matrix multiplications.

I defined Tdot =: |:@:[ +/ . * ] and dotT =: dot |:

Are matrix multiplications going to be enhanced? And what about such variants as these?

Thanks,

Mike

On 16 May 2019, at 18:43, Henry Rich <[email protected]> wrote:

In the next beta +/@:*"1 uses 256-bit instructions, which should help with dot-products.

Henry Rich

On 5/16/2019 1:27 PM, 'Mike Day' via Programming wrote:

I've tried various timings and tweaks - the dot products seem to consume the most time; it's marginally worth dividing by "num_examples" after summing "correct_logprobs" rather than summing the quotient, "correct_logprobs%num_examples".

I added a couple of dot fns, Tdot =: |:@[ dot ] and dotT =: dot |: to neaten up the code a bit. Those transposes seem unavoidable.
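A quick sanity check that these definitions agree with the explicit transposed forms (A, B, C, D here are just arbitrary test arrays, not names from the code above):

   dot  =: +/ . *
   Tdot =: |:@:[ dot ]
   dotT =: dot |:
   A =: ? 5 7 $ 0        NB. arbitrary 5x7 test matrix
   B =: ? 4 7 $ 0        NB. arbitrary 4x7 test matrix
   (A dotT B) -: A dot |: B
1
   C =: ? 7 5 $ 0
   D =: ? 7 4 $ 0
   (C Tdot D) -: (|: C) dot D
1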
In a practical application, you'd probably run cycles until either a suitable level of convergence is achieved - or until it's obvious that the process is divergent.

Cheers,

Mike

On 16/05/2019 15:20, Brian Schott wrote:

Mike,

Yes, I knew the reason that the calculation was done, but was surprised by the manner in which these authors applied the calculation (without the multiplication), and I applied the Amend incorrectly, by not remembering that it was being applied to an array.

And you are correct that the Amend approach is slower and more space-consuming than the Product approach. I re-applied -- correctly, this time, finally🤞 -- the Amend approach on a 'dbstopped' version of `train` and got the following timings. In retrospect, both methods require the condition check, and then multiplying by 0 and 1 may be very fast relative to Amend's needs.

   mnd =: 0:`(I.@(0&>:)@[)`]}"1
   ((hidden_layer>0)*dscores dot|:W2) -: hidden_layer mnd dscores dot|:W2
1
   10 timespacex'(hidden_layer>0)*dscores dot|:W2'
0.0004102 301568
   10 timespacex'hidden_layer mnd dscores dot|:W2'
0.0006501 535360

And btw, mnd1 =: 0:`(I.@(0>:[))`]}"1 using a fork is very slightly faster than mnd.

Thanks, again,

On Thu, May 16, 2019 at 5:32 AM 'Mike Day' via Programming <[email protected]> wrote:

The Python authors' comments here explain (well, they assert) why we're doing that filtering for hidden_layer > 0:

"Now we have the gradient on the outputs of the hidden layer. Next, we have to backpropagate the ReLU non-linearity. This turns out to be easy because ReLU during the backward pass is effectively a switch. Since r = max(0,x), we have that dr/dx = 1 (x>0). Combined with the chain rule, we see that the ReLU unit lets the gradient pass through unchanged if its input was greater than 0, but kills it if its input was less than zero [or equal to zero - Mike's edit] during the forward pass."

Isn't it curious that the J way of doing it,

   if. # ilow=. (<"1@:($ #: I.@:(0 >: ,))) hidden_layer do.   NB. find indices of elements <: 0
     dhidden =. 0 ilow } dhidden
   end.

is much slower than the naive

   dhidden =. (hidden_layer >0) * dscores dotT W2

?

Mike
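The "switch" behaviour described in that quote can be seen on a tiny made-up example (x and dr below are arbitrary values; the upstream gradient passes through only where the forward input was positive):

   x  =: 0.5 _1.2 0 3 _0.7      NB. arbitrary forward inputs; ReLU output is 0 >. x
   dr =: 1 1 1 1 1              NB. incoming gradient dL/dr
   (x > 0) * dr                 NB. backward pass: dL/dx = dr where x > 0, else 0
1 0 0 1 0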
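For reference, the +/@:(*"1 _) form Henry mentions as the small-matrix fallback gives the same result as +/ . *; for instance, on the small example quoted earlier in the thread:

   A =: i. 2 3
   B =: i. 3 4
   A +/ . * B
20 23 26 29
56 68 80 92
   (A +/ . * B) -: A +/@:(*"1 _) B
1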
