Ah, I forgot: if the matrix is small, e.g. less than 11x11, then it won't call the BLAS routine.
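For example (the exact cutoff is from memory, and the sizes below are only illustrative), comparing a product below and above the cutoff with timespacex should show the difference:

   dot =: +/ . *
   a10  =: ? 10 10 $ 0       NB. below the reported cutoff: handled by the interpreter's own loop
   a500 =: ? 500 500 $ 0     NB. well above it: should go through the optimized BLAS path
   10 timespacex 'a10 dot a10'
   10 timespacex 'a500 dot a500'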
On Fri, May 17, 2019, 7:22 AM bill lam <[email protected]> wrote:

> If you use j807 with an AVX-capable CPU, it should call the optimized BLAS
> for that pattern; you can compare with previous versions of J. If you want
> it even faster, you can build j.dll/libj.so from source to enable multiple
> core support, and performance will scale up with the number of cores used.
>
> On Fri, May 17, 2019, 7:13 AM 'Mike Day' via Programming <[email protected]> wrote:
>
>> (Also answers Bill's post, just in)
>>
>> I think I misled you. Brian's “dot” is more correctly the matrix product,
>> such as
>>    2 3 (+/ . *)&i. 3 4
>> 20 23 26 29
>> 56 68 80 92
>> so we're talking about dot =: +/ . *
>>
>> In some cases, Brian needs to multiply an mxn matrix A by a kxn matrix B
>> for an mxk result,
>>    A dot |: B
>> In others, he needs C, shape mxn, by D, shape mxk, for an nxk result,
>>    (|: C) dot D
>> and of course, some are straight matrix multiplications.
>>
>> I defined Tdot =: |:@:[ +/ .* ]  and  dotT =: dot |:
>>
>> Are matrix multiplications going to be enhanced? And what about such
>> variants as these?
>>
>> Thanks,
>>
>> Mike
>>
>> Sent from my iPad
>>
>> > On 16 May 2019, at 18:43, Henry Rich <[email protected]> wrote:
>> >
>> > In the next beta +/@:*"1 uses 256-bit instructions, which should help
>> > with dot-products.
>> >
>> > Henry Rich
>> >
>> >> On 5/16/2019 1:27 PM, 'Mike Day' via Programming wrote:
>> >> I've tried various timings and tweaks - the dot products seem to
>> >> consume the most time; it's marginally worth dividing by "num_examples"
>> >> after summing "correct_logprobs" rather than summing the quotient,
>> >> "correct_logprobs % num_examples".
>> >>
>> >> I added a couple of dot fns, Tdot =: |:@[ dot ] and dotT =: dot |:
>> >> to neaten up the code a bit. Those transposes seem unavoidable.
>> >>
>> >> In a practical application, you'd probably run cycles until either a
>> >> suitable level of convergence is achieved - or until it's obvious that
>> >> the process is divergent.
>> >>
>> >> Cheers,
>> >>
>> >> Mike
>> >>
>> >>> On 16/05/2019 15:20, Brian Schott wrote:
>> >>> Mike,
>> >>>
>> >>> Yes, I knew the reason that the calculation was done, but was surprised
>> >>> by the manner in which these authors applied the calculation (without
>> >>> the multiplication), and I applied the Amend incorrectly, by not
>> >>> remembering that it was being applied to an array.
>> >>>
>> >>> And you are correct that the Amend approach is slower and more
>> >>> space-consuming than the Product approach. I re-applied -- correctly,
>> >>> this time, finally🤞 -- the Amend approach on a 'dbstopped' version of
>> >>> `train` and got the following timings. In retrospect, both methods
>> >>> require the condition check, and multiplying by 0 and 1 may be very
>> >>> fast relative to Amend's needs.
>> >>>
>> >>>    mnd =: 0:`(I.@(0&>:)@[)`]}"1
>> >>>    ((hidden_layer>0)*dscores dot|:W2)-:hidden_layer mnd dscores dot|:W2
>> >>> 1
>> >>>    10 timespacex'(hidden_layer>0)*dscores dot|:W2'
>> >>> 0.0004102 301568
>> >>>    10 timespacex'hidden_layer mnd dscores dot|:W2'
>> >>> 0.0006501 535360
>> >>>
>> >>> And btw, mnd1 =: 0:`(I.@(0>:[))`]}"1 using a fork is very slightly
>> >>> faster than mnd.
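>> >>>
>> >>> As a tiny sanity check with made-up numbers (not from the script), both
>> >>> approaches zero the same cells:
>> >>>
>> >>>    x =: 2 _1 0 3         NB. stands in for hidden_layer (one row)
>> >>>    y =: 10 20 30 40      NB. stands in for dscores dot|:W2
>> >>>    x mnd y               NB. Amend: write 0 where x <: 0
>> >>> 10 0 0 40
>> >>>    (x>0) * y             NB. Product: multiply by the 0-1 mask
>> >>> 10 0 0 40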
>> >>>
>> >>> Thanks, again,
>> >>>
>> >>> On Thu, May 16, 2019 at 5:32 AM 'Mike Day' via Programming <[email protected]> wrote:
>> >>>
>> >>>> The Python authors' comments here explain (well, they assert) why
>> >>>> we're doing that filtering for hidden_layer > 0:
>> >>>>
>> >>>> "Now we have the gradient on the outputs of the hidden layer. Next, we
>> >>>> have to backpropagate the ReLU non-linearity. This turns out to be easy
>> >>>> because ReLU during the backward pass is effectively a switch. Since
>> >>>> r = max(0,x), we have that dr/dx = 1(x>0). Combined with the chain
>> >>>> rule, we see that the ReLU unit lets the gradient pass through
>> >>>> unchanged if its input was greater than 0, but kills it if its input
>> >>>> was less than zero [or equal to zero - Mike's edit] during the forward
>> >>>> pass."
>> >>>>
>> >>>> Isn't it curious that the J-way of doing it,
>> >>>>
>> >>>>    if. # ilow=. (<"1@:($ #: I.@:(0 >: ,))) hidden_layer do. NB. find indices of elements <: 0
>> >>>>        dhidden =. 0 ilow } dhidden
>> >>>>    end.
>> >>>>
>> >>>> is much slower than the naive
>> >>>>
>> >>>>    dhidden =. (hidden_layer >0) * dscores dotT W2
>> >>>> ?
>> >>>>
>> >>>> Mike
>> >>>>
>> >>> --
>> >>> (B=)
----------------------------------------------------------------------
For information about J forums see http://www.jsoftware.com/forums.htm
