I think the fast 1-core code is non-existent for non-avx cpus, so it should only switch over for avx cpus.
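For reference, a minimal J sketch of the comparison under discussion: it checks that the primitive +/ . * and Ken's spelled-out definition agree, and times both at one size. The name naive and the array sizes are my own stand-ins for illustration; they are chosen only to land in the "fast 1-core" range per the m*p*n figures Henry quotes below, and the sketch does not reproduce the engine's actual internal dispatch.

   NB. illustrative sketch only; names and sizes are arbitrary stand-ins
   naive =: +/@:*"1 _            NB. Ken's definition, spelled out explicitly
   a =: ? 60 70 $ 0              NB. m x p random matrix
   b =: ? 70 80 $ 0              NB. p x n; m*p*n = 336000, the "fast 1-core" range
   (a +/ . * b) -: a naive b     NB. same (tolerantly matched) result either way
   10 timespacex 'a +/ . * b'    NB. primitive, with its special matrix-product code
   10 timespacex 'a naive b'     NB. general rank-based evaluation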
On Fri, May 17, 2019, 8:20 AM Henry Rich <[email protected]> wrote:

> The code sizes the problem by calculating m*p*n, and
>
> m*p*n<=1000, it uses +/@:(*"1 _)
> 1000<m*p*n<=5000000, it uses fast 1-core code
> 5000000<m*p*n, it uses BLAS
>
> Maybe some ARM user will write a Neon version.
>
> Henry Rich
>
> On 5/16/2019 8:12 PM, bill lam wrote:
>> switchover at 11x11 or 100x100 I'm not so sure. Anyway there is some
>> arbitrary threshold.
>>
>> IIUC 256-bit arithmetic only applies to x86_64 avx. arm64 has neon but
>> needs another implementation.
>>
>> On Fri, May 17, 2019, 8:01 AM Henry Rich <[email protected]> wrote:
>>> The switchover to BLAS is at more like 100x100 IIRC; for smaller than
>>> that it uses a fast 1-core routine that I wrote.
>>>
>>> +/ . * is highly optimized, but for two different cases. Say you are
>>> multiplying mxp times pxn to produce an mxn product. If m, p, and n are
>>> big enough to allow the result to be calculated in 4x4 blocks, by
>>> careful management of caches the I/O can be reduced relative to the
>>> arithmetic, and the product can be produced as fast as the ALU can do
>>> the arithmetic.
>>>
>>> If n is 1, what you have is a series of dot-products. These are
>>> produced with special code that uses multiple 256-bit accumulators (in
>>> the next beta; now there are multiple 64-bit accumulators) to produce
>>> each scalar result. This code is directly invoked via +/@:*"1, but
>>> +/ . * switches over to it when it feels like it.
>>>
>>> Other values of m, n, and p are not as efficient because working in 4x4
>>> blocks has edge wastage. If the matrices are less than 11x11 or so the
>>> system just evaluates as +/@:(*"1 _) like Ken defined it.
>>>
>>> If the matrices are really big the system calls BLAS, which uses
>>> similar techniques but can use multiple cores.
>>>
>>> Henry Rich
>>>
>>> On 5/16/2019 7:31 PM, bill lam wrote:
>>>> Ah I forgot, if the size of the matrix is small, e.g. less than 11x11,
>>>> then it won't call the blas routine.
>>>>
>>>> On Fri, May 17, 2019, 7:22 AM bill lam <[email protected]> wrote:
>>>>> If you use j807 with an avx capable cpu, it should call optimized
>>>>> blas for that pattern; you can compare with previous versions of J.
>>>>> If you want even faster, you can build j.dll/libj.so from source to
>>>>> enable multiple-core support and performance will scale up with the
>>>>> number of cores used.
>>>>>
>>>>> On Fri, May 17, 2019, 7:13 AM 'Mike Day' via Programming <[email protected]> wrote:
>>>>>> (Also answers Bill's post, just in)
>>>>>>
>>>>>> I think I misled you. Brian's "dot" is more correctly the matrix
>>>>>> product, such as
>>>>>>    2 3 (+/ . *)&i. 3 4
>>>>>> 20 23 26 29
>>>>>> 56 68 80 92
>>>>>> so we're talking about dot =: +/ . *
>>>>>>
>>>>>> In some cases, Brian needs to multiply an mxn matrix A by a kxn
>>>>>> matrix B for an mxk result,
>>>>>>    A dot |: B
>>>>>> In others, he needs C, shape mxn, by D, shape mxk, for an nxk result,
>>>>>>    (|: C) dot D
>>>>>> and of course, some are straight matrix multiplications.
>>>>>>
>>>>>> I defined Tdot =: |:@:[ +/ .* ] and dotT =: dot |:
>>>>>>
>>>>>> Are matrix multiplications going to be enhanced? And what about such
>>>>>> variants as these?
>>>>>>
>>>>>> Thanks,
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>> Sent from my iPad
>>>>>>
>>>>>> On 16 May 2019, at 18:43, Henry Rich <[email protected]> wrote:
>>>>>>> In the next beta +/@:*"1 uses 256-bit instructions, which should
>>>>>>> help with dot-products.
>>>>>>>
>>>>>>> Henry Rich
>>>>>>>
>>>>>>> On 5/16/2019 1:27 PM, 'Mike Day' via Programming wrote:
>>>>>>>> I've tried various timings and tweaks - the dot products seem to
>>>>>>>> consume the most time; it's marginally worth dividing by
>>>>>>>> "num_examples" after summing "correct_logprobs" rather than
>>>>>>>> summing the quotient, "correct_logprobs%num_examples".
>>>>>>>>
>>>>>>>> I added a couple of dot fns, Tdot =: |:@[ dot ] and dotT =: dot |:
>>>>>>>> to neaten up the code a bit. Those transposes seem unavoidable.
>>>>>>>>
>>>>>>>> In a practical application, you'd probably run cycles until either
>>>>>>>> a suitable level of convergence is achieved - or until it's
>>>>>>>> obvious that the process is divergent.
>>>>>>>>
>>>>>>>> Cheers,
>>>>>>>>
>>>>>>>> Mike
>>>>>>>>
>>>>>>>> On 16/05/2019 15:20, Brian Schott wrote:
>>>>>>>>> Mike,
>>>>>>>>>
>>>>>>>>> Yes, I knew the reason that the calculation was done, but was
>>>>>>>>> surprised by the manner in which these authors applied the
>>>>>>>>> calculation (without the multiplication), and I applied the Amend
>>>>>>>>> incorrectly, by not remembering that it was being applied to an
>>>>>>>>> array.
>>>>>>>>>
>>>>>>>>> And you are correct that the Amend approach is slower and more
>>>>>>>>> space consuming than the Product approach. I re-applied --
>>>>>>>>> correctly, this time, finally🤞 -- the Amend approach on a
>>>>>>>>> 'dbstopped' version of `train` and got the following timings. In
>>>>>>>>> retrospect both methods require the condition check, and then
>>>>>>>>> multiplying by 0 and 1 may be very fast relative to Amend's needs.
>>>>>>>>>
>>>>>>>>>    mnd =: 0:`(I.@(0&>:)@[)`]}"1
>>>>>>>>>    ((hidden_layer>0)*dscores dot|:W2)-:hidden_layer mnd dscores dot|:W2
>>>>>>>>> 1
>>>>>>>>>    10 timespacex'(hidden_layer>0)*dscores dot|:W2'
>>>>>>>>> 0.0004102 301568
>>>>>>>>>    10 timespacex'hidden_layer mnd dscores dot|:W2'
>>>>>>>>> 0.0006501 535360
>>>>>>>>>
>>>>>>>>> And btw, mnd1 =: 0:`(I.@(0>:[))`]}"1 using a fork is very slightly
>>>>>>>>> faster than mnd.
>>>>>>>>>
>>>>>>>>> Thanks, again,
>>>>>>>>>
>>>>>>>>> On Thu, May 16, 2019 at 5:32 AM 'Mike Day' via Programming <[email protected]> wrote:
>>>>>>>>>> The Python authors' comments here explain (well, they assert)
>>>>>>>>>> why we're doing that filtering for hidden_layer > 0:
>>>>>>>>>>
>>>>>>>>>> "Now we have the gradient on the outputs of the hidden layer.
>>>>>>>>>> Next, we have to backpropagate the ReLU non-linearity. This
>>>>>>>>>> turns out to be easy because ReLU during the backward pass is
>>>>>>>>>> effectively a switch. Since r=max(0,x), we have that dr/dx = 1
>>>>>>>>>> (x>0). Combined with the chain rule, we see that the ReLU unit
>>>>>>>>>> lets the gradient pass through unchanged if its input was
>>>>>>>>>> greater than 0, but kills it if its input was less than zero [or
>>>>>>>>>> equal to zero - Mike's edit] during the forward pass."
>>>>>>>>>>
>>>>>>>>>> Isn't it curious that the J way of doing it,
>>>>>>>>>>
>>>>>>>>>>    if. # ilow=. (<"1@:($ #: I.@:(0 >: ,))) hidden_layer do.  NB. find indices of elements <: 0
>>>>>>>>>>      dhidden =. 0 ilow } dhidden
>>>>>>>>>>    end.
>>>>>>>>>>
>>>>>>>>>> is much slower than the naive
>>>>>>>>>>
>>>>>>>>>>    dhidden =. (hidden_layer >0) * dscores dotT W2
>>>>>>>>>>
>>>>>>>>>> ?
>>>>>>>>>>
>>>>>>>>>> Mike
>>>>>>>>>
>>>>>>>>> --
>>>>>>>>> (B=)

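For completeness, the Tdot / dotT variants Mike defines can be checked against the explicit transposes as below; A, B, C and D are throwaway arrays whose shapes merely mirror the mxn-by-kxn and mxn-by-mxk cases he describes.

   dot  =: +/ . *
   Tdot =: |:@:[ dot ]           NB. x Tdot y  is  (|: x) dot y
   dotT =: dot |:                NB. x dotT y  is  x dot (|: y)
   A =: ? 4 6 $ 0                NB. m x n
   B =: ? 5 6 $ 0                NB. k x n
   C =: ? 4 6 $ 0                NB. m x n
   D =: ? 4 5 $ 0                NB. m x k
   (A dotT B) -: A dot |: B      NB. m x k result
   (C Tdot D) -: (|: C) dot D    NB. n x k result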