a. I've been lazy, sorry about that. I'll update it when I have time.
b. Yes: only arm64 has decent SIMD support.
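
If it helps, here is a rough way to check whether a given Pi is running a 64-bit ARM userland. This is only a sketch: the helper names (is_arm64, pointer_bits, describe) are made up for illustration, and it assumes Python 3 is installed. It relies on platform.machine() reporting "aarch64" (Linux) or "arm64" (macOS) on 64-bit ARM, and on the interpreter's pointer size to tell a 64-bit kernel with a 32-bit userland apart from a full 64-bit install.

```python
import platform
import struct


def is_arm64(machine: str) -> bool:
    """True if the machine string names a 64-bit ARM architecture."""
    return machine.lower() in ("aarch64", "arm64")


def pointer_bits() -> int:
    """Pointer width of the running Python interpreter, in bits."""
    return struct.calcsize("P") * 8


def describe(machine: str, bits: int) -> str:
    """Summarize whether arm64 binaries are likely usable here (heuristic)."""
    if is_arm64(machine) and bits == 64:
        return "64-bit ARM userland: arm64 builds can be used"
    if is_arm64(machine):
        return "64-bit ARM kernel, 32-bit userland"
    return "not arm64 (%s, %d-bit)" % (machine, bits)


if __name__ == "__main__":
    print(describe(platform.machine(), pointer_bits()))
```

The pointer-size check matters because a 32-bit OS image can run on a 64-bit kernel, in which case the machine string alone would be misleading.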

On Tue, May 25, 2021, 5:15 AM Joey K Tuttle <j...@qued.com> wrote:

> Good segue into a couple of questions that have been on my mind.
>
> a) Any thoughts or comments on a current beta release for Raspbian?
>
> b) Are there strong reasons to move to a 64 bit OS on the Raspberry Pi?
>
> - joey
>
>
> > On 2021 May 24, at 13:54, bill lam <bbill....@gmail.com> wrote:
> >
> > Matrix multiplication on arm64 Android should already be fully optimized,
> > including:
> > - BLAS routines with an arm64 ASIMD kernel
> > - OpenMP multithreading
> >
> > It is optimized on desktop too: J runs as fast as other multithreaded,
> > optimized BLAS/LAPACK implementations such as OpenBLAS.
> >
> > On Mon, May 24, 2021, 3:53 PM Ric Sherlock <tikk...@gmail.com> wrote:
> >
> >> Just to provide some context to Henry's statement that things have changed
> >> a bit since J8.05, below are the timings I get on my phone (Pixel 4a) using
> >> J902.
> >>
> >> ,.f"0]2^>:i.13
> >> 0.024127
> >>    1e_5
> >>    2e_6
> >>    3e_6
> >>  3.4e_5
> >> 0.000909
> >> 0.000425
> >> 0.012697
> >> 0.020461
> >> 0.139175
> >> 1.00075
> >>  6.6658
> >> 56.7179
> >>
> >>
> >>
> >> On Mon, 24 May 2021, 15:00 Henry Rich, <henryhr...@gmail.com> wrote:
> >>
> >>> J8.05 is very out-of-date for +/ . * .  Since then I have rewritten the
> >>> JE code a couple of times: the current version is pretty fast and has
> >>> special code depending on matrix sizes.
> >>>
> >>> If you are doing performance measurement you need to get an up-to-date
> >>> J.  Many primitives and combinations run 5-10x faster than they did in
> >>> 8.05.
> >>>
> >>> Henry Rich
> >>>
> >>> On 5/23/2021 10:32 PM, Imre Patyi wrote:
> >>>> Dear Programming in J,
> >>>>
> >>>> I made another test of numerical calculation in J,
> >>>> this time looking at multiplying two matrices using
> >>>> (+/ .*)  and here is what I have found.  It seems to
> >>>> me that J with (+/ .*) has acceptable speed only for
> >>>> matrices of order about 128 or below, after which order it
> >>>> quickly falls behind other standard numerical software such
> >>>> as python with numpy, and Octave.  I also wrote a naive C
> >>>> program for matrix multiplication; for orders 256, 1024,
> >>>> ..., 8192 J runs 2 to 4 times faster than the naive C program
> >>>> (which does not do SIMD or mind caching much).
> >>>>
> >>>> Numpy and Octave are able to use multiple threads and/or cores
> >>>> just by calling ordinary 'matmul', and they are about 7 to
> >>>> 25 times as fast as J in my experiment.  As a primitive in J
> >>>> the command (+/ .*) could be just as fast as in any competent
> >>>> numerical program available in C for matrix multiplication.
> >>>> Even if you do not want multithreading in J, it seems to
> >>>> me that (+/ .*) has roughly 1/4 or 1/8 the speed of what should
> >>>> be possible for a single threaded program.  It seems especially
> >>>> troubling that it becomes just as slow as a plain vanilla
> >>>> naive C program for larger sizes of the matrices.  I am not sure
> >>>> why J does not seem to use BLAS or LAPACK for matrix multiplication.
> >>>>
> >>>> Yours sincerely,
> >>>> Imre Patyi
> >>>>
> >>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>> Here is the summary of timings.
> >>>>
> >>>>    n  C time (s)  J time (s)  python time (s)  Octave time (s)     J/C  J/python  J/Octave
> >>>>  256      0.0780      0.0073           0.0010           0.0007  0.0936    7.3047    9.8987
> >>>>  512      0.2680      0.0671           0.0100           0.0050  0.2505    6.7137   13.4195
> >>>> 1024      1.8400      0.7293           0.0479           0.0380  0.3964   15.2255   19.1919
> >>>> 2048     14.0430      6.0432           0.2663           0.2851  0.4303   22.6938   21.1960
> >>>> 4096    109.8290     54.4634           2.2739           2.1620  0.4959   23.9513   25.1917
> >>>> 8192    874.8430    435.2600          17.1282          17.2197  0.4975   25.4120   25.2769
> >>>>
> >>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>> File: example-of-matmul.ijs
> >>>>
> >>>> f=: 3 : 0
> >>>> N=.y
> >>>> a=.2 o. ((4 : '(1234*x)+(5678*y)')"0 0)/~ (i.N)
> >>>> NB.smoutput(i.5){(i.5){a
> >>>> NB.smoutput''
> >>>> t=.timex'b=:a(+/ .*)a'
> >>>> NB.smoutput(i.5){(i.5){b
> >>>> NB.t;(60 60#:t)
> >>>> t
> >>>> )
> >>>>
> >>>> NB. Sample run.
> >>>> NB.   ,.f"0]2^>:i.13
> >>>> NB. 0.0135541
> >>>> NB.   3.5e_6
> >>>> NB.   2.9e_6
> >>>> NB.     4e_6
> >>>> NB.  1.77e_5
> >>>> NB. 0.0001052
> >>>> NB. 0.0008633
> >>>> NB. 0.0072972
> >>>> NB. 0.0671373
> >>>> NB. 0.729313
> >>>> NB.  6.04315
> >>>> NB.  54.4634
> >>>> NB.   435.26
> >>>>
> >>>>
> >>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>> File: example-with-numpy.py
> >>>>
> >>>> import numpy, time
> >>>> def f(n):
> >>>>     i=numpy.array(numpy.arange(n).reshape((1,n)))
> >>>>     a=numpy.cos(numpy.array(1234*i+5678*i.T))
> >>>>     #print(a.shape)
> >>>>     t0=time.perf_counter()  # time.time() is too coarse on Windows for small n
> >>>>     b=numpy.matmul(a,a)
> >>>>     return time.perf_counter()-t0
> >>>>
> >>>> for i in range(1,1+13):
> >>>>     print(f(2**i))
> >>>>
> >>>>
> >>>> r"""     Sample run.
> >>>> C:>py "example-with-numpy.py"
> >>>> 0.0020143985748291016
> >>>> 0.0
> >>>> 0.0
> >>>> 0.0
> >>>> 0.0
> >>>> 0.0009746551513671875
> >>>> 0.0
> >>>> 0.0009989738464355469
> >>>> 0.009999990463256836
> >>>> 0.04790067672729492
> >>>> 0.26629042625427246
> >>>> 2.273921251296997
> >>>> 17.128154277801514
> >>>> """
> >>>>
> >>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>> File:  The command I used in Octave.
> >>>>
> >>>>>> for n=2.^(1:13) ; i=(0:n-1) ; a=cos(1234*i'+5678*i) ; tic,b=a*a;toc, end
> >>>> Elapsed time is 1.3113e-05 seconds.
> >>>> Elapsed time is 1.90735e-05 seconds.
> >>>> Elapsed time is 1.38283e-05 seconds.
> >>>> Elapsed time is 1.3113e-05 seconds.
> >>>> Elapsed time is 2.09808e-05 seconds.
> >>>> Elapsed time is 4.88758e-05 seconds.
> >>>> Elapsed time is 0.000244141 seconds.
> >>>> Elapsed time is 0.00073719 seconds.
> >>>> Elapsed time is 0.00500298 seconds.
> >>>> Elapsed time is 0.0380011 seconds.
> >>>> Elapsed time is 0.285108 seconds.
> >>>> Elapsed time is 2.16196 seconds.
> >>>> Elapsed time is 17.2197 seconds.
> >>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>> File: example-of-naive-matmul.c
> >>>>
> >>>> #include <stdlib.h>
> >>>> #include <stdio.h>
> >>>> #include <math.h>
> >>>>
> >>>> int
> >>>> main(int argc, char **argv){
> >>>>
> >>>> int N ;
> >>>> if(argc<2){ /* no size argument given: argv[1] would be NULL */
> >>>> N=8192 ;
> >>>> } else {
> >>>> N=atoi(argv[1]) ;
> >>>> }
> >>>>
> >>>> double *a=(double*)calloc(N*N,sizeof(double));
> >>>> double *aT=(double*)calloc(N*N,sizeof(double));
> >>>> for(int i=0 ; i<N ; i++){
> >>>> for(int j =0 ; j<N ; j++){
> >>>> a[i+N*j]=aT[j+N*i]=cos(1234*i+5678*j) ;
> >>>> }
> >>>> }
> >>>>
> >>>> double *b=(double*)calloc(N*N,sizeof(double));
> >>>> for(int i=0 ; i<N ; i++){
> >>>> for(int j=0 ; j<N ; j++){
> >>>> double bij=0.0 ;
> >>>> for(int k=0 ; k<N ; k++){
> >>>> bij += aT[k+N*i]*a[k+N*j] ;
> >>>> }
> >>>> b[i+N*j]=bij ;
> >>>> }
> >>>> }
> >>>> printf("\n") ;
> >>>> /*
> >>>> for(int i=0 ; i<5 ; i++){
> >>>> for(int j=0 ; j<5 ; j++){
> >>>> printf("%f\t",a[i+N*j]) ;
> >>>> }
> >>>> printf("\n") ;
> >>>> }
> >>>> printf("\n") ;
> >>>> for(int i=0 ; i<5 ; i++){
> >>>> for(int j=0 ; j<5 ; j++){
> >>>> printf("%f\t",b[i+N*j]) ;
> >>>> }
> >>>> printf("\n") ;
> >>>> }
> >>>> */
> >>>> }
> >>>>
> >>>> /* Sample run.
> >>>> $ cc -o example-of-naive-matmul{,.c} -O3
> >>>> $ for i in {1..13}; do n=`echo 2^$i|bc`; echo $n ; time ./example-of-naive-matmul $n ; done
> >>>> 2
> >>>>
> >>>>
> >>>> real    0m0.038s
> >>>> user    0m0.015s
> >>>> sys     0m0.000s
> >>>> 4
> >>>>
> >>>>
> >>>> real    0m0.045s
> >>>> user    0m0.000s
> >>>> sys     0m0.030s
> >>>> 8
> >>>>
> >>>>
> >>>> real    0m0.047s
> >>>> user    0m0.030s
> >>>> sys     0m0.000s
> >>>> 16
> >>>>
> >>>>
> >>>> real    0m0.046s
> >>>> user    0m0.046s
> >>>> sys     0m0.015s
> >>>> 32
> >>>>
> >>>>
> >>>> real    0m0.051s
> >>>> user    0m0.015s
> >>>> sys     0m0.000s
> >>>> 64
> >>>>
> >>>>
> >>>> real    0m0.046s
> >>>> user    0m0.000s
> >>>> sys     0m0.030s
> >>>> 128
> >>>>
> >>>>
> >>>> real    0m0.045s
> >>>> user    0m0.000s
> >>>> sys     0m0.046s
> >>>> 256
> >>>>
> >>>>
> >>>> real    0m0.078s
> >>>> user    0m0.015s
> >>>> sys     0m0.030s
> >>>> 512
> >>>>
> >>>>
> >>>> real    0m0.268s
> >>>> user    0m0.218s
> >>>> sys     0m0.030s
> >>>> 1024
> >>>>
> >>>>
> >>>> real    0m1.840s
> >>>> user    0m1.811s
> >>>> sys     0m0.030s
> >>>> 2048
> >>>>
> >>>>
> >>>> real    0m14.043s
> >>>> user    0m13.937s
> >>>> sys     0m0.062s
> >>>> 4096
> >>>>
> >>>>
> >>>> real    1m49.829s
> >>>> user    1m49.578s
> >>>> sys     0m0.125s
> >>>> 8192
> >>>>
> >>>>
> >>>> real    14m34.843s
> >>>> user    14m33.046s
> >>>> sys     0m0.874s
> >>>>
> >>>> */
> >>>>
> >>>> ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~
> >>>> I ran all of the above on a lower midrange laptop with Windows 10,
> >>>> i5, 8GB RAM, 2 cores, 4 threads; I used J805, Anaconda python 3.5,
> >>>> Octave 5.2.0.
> >>>> ----------------------------------------------------------------------
> >>>> For information about J forums see http://www.jsoftware.com/forums.htm