Matrix multiplication on arm64 android should already be fully optimized, including Blas routine with arm64 asimd kernel Openmp multithreading
Optimized on desktop too, J runs as fast as other multithreaded optimized blas lapack such as openblas. On Mon, May 24, 2021, 3:53 PM Ric Sherlock <tikk...@gmail.com> wrote: > Just to provide some context to Henry's statement that things have changed > a bit since J8.05, below are the timings I get on my phone (Pixel 4a) using > J902. > > ,.f"0]2^>:i.13 > 0.024127 > 1e_5 > 2e_6 > 3e_6 > 3.4e_5 > 0.000909 > 0.000425 > 0.012697 > 0.020461 > 0.139175 > 1.00075 > 6.6658 > 56.7179 > > > > On Mon, 24 May 2021, 15:00 Henry Rich, <henryhr...@gmail.com> wrote: > > > J8.05 is very out-of-date for +/ . * . Since then I have rewritten the > > JE code a couple of times: the current version is pretty fast and has > > special code depending on matrix sizes. > > > > If you are doing performance measurement you need to get an up-to-date > > J. Many primitives and combinations run 5-10x faster than they did in > > 8.05. > > > > Henry Rich > > > > On 5/23/2021 10:32 PM, Imre Patyi wrote: > > > Dear Programming in J, > > > > > > I made another test of numerical calculation in J, > > > this time looking at multiplying two matrices using > > > (+/ .*) and here is what I have found. It seems to > > > me that J with (+/ .*) has acceptable speed only for > > > matrices of order about 128 or below, after which order it > > > quickly falls behind other standard numerical software such > > > as python with numpy, and Octave. I also wrote a naive C > > > program for matrix multiplication; for orders 256, 1024, > > > ..., 8192 J tracks as 2 to 4 faster than the naive C program > > > (which does not do SIMD or mind caching much). > > > > > > Numpy and Octave are able to use multiple threads and/or cores > > > just by calling ordinary 'matmul', and they are about 7 to > > > 25 times as fast as J in my experiment. As a primitive in J > > > the command (+/ .*) could be just as fast as in any competent > > > numerical program available in C for matrix multiplication. > > > Even if you do not want multithreading in J, it seems to > > > me that (+/ .*) has roughly 1/4 or 1/8 the speed of what should > > > be possible for a single threaded program. It seems especially > > > troubling that it becomes just as slow as a plain vanilla > > > naive C program for larger sizes of the matrices. I am not sure > > > why J does not seem to use BLAS or LAPACK for matrix multiplication. > > > > > > Yours sincerely, > > > Imre Patyi > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > Here is the summary of timings. > > > > > > n time, C time, J time, python time, Octave (time, J)/(time, C) (time, > > > J)/(time, python) (time, J)/(time, Octave) > > > 256 0.0780 0.0073 0.0010 0.0007 0.0936 7.3047 9.8987 > > > 512 0.2680 0.0671 0.0100 0.0050 0.2505 6.7137 13.4195 > > > 1024 1.8400 0.7293 0.0479 0.0380 0.3964 15.2255 19.1919 > > > 2048 14.0430 6.0432 0.2663 0.2851 0.4303 22.6938 21.1960 > > > 4096 109.8290 54.4634 2.2739 2.1620 0.4959 23.9513 25.1917 > > > 8192 874.8430 435.2600 17.1282 17.2197 0.4975 25.4120 25.2769 > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > File: example-of-matmul.ijs > > > > > > f=: 3 : 0 > > > N=.y > > > a=.2 o. ((4 : '(1234*x)+(5678*y)')"0 0)/~ (i.N) > > > NB.smoutput(i.5){(i.5){a > > > NB.smoutput'' > > > t=.timex'b=:a(+/ .*)a' > > > NB.smoutput(i.5){(i.5){b > > > NB.t;(60 60#:t) > > > t > > > ) > > > > > > NB. Sample run. > > > NB. ,.f"0]2^>:i.13 > > > NB. 0.0135541 > > > NB. 3.5e_6 > > > NB. 2.9e_6 > > > NB. 4e_6 > > > NB. 1.77e_5 > > > NB. 0.0001052 > > > NB. 0.0008633 > > > NB. 0.0072972 > > > NB. 0.0671373 > > > NB. 0.729313 > > > NB. 6.04315 > > > NB. 54.4634 > > > NB. 435.26 > > > > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > File: example-with-numpy.py > > > > > > import numpy, time > > > def f(n): > > > i=numpy.array(numpy.arange(n).reshape((1,n))) > > > a=numpy.cos(numpy.array(1234*i+5678*i.T)) > > > #print(a.shape) > > > t0=time.time() > > > b=numpy.matmul(a,a) > > > return time.time()-t0 > > > > > > for i in range(1,1+13): > > > print(f(2**i)) > > > > > > > > > r""" Sample run. > > > C:>py "example-with-numpy.py" > > > 0.0020143985748291016 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0 > > > 0.0009746551513671875 > > > 0.0 > > > 0.0009989738464355469 > > > 0.009999990463256836 > > > 0.04790067672729492 > > > 0.26629042625427246 > > > 2.273921251296997 > > > 17.128154277801514 > > > """ > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > File: The command I used in Octave. > > > > > >>> for n=2.^(1:13) ; i=(0:n-1) ; a=cos(1234*i'+5678*i) ; tic,b=a*a;toc, > > end > > > Elapsed time is 1.3113e-05 seconds. > > > Elapsed time is 1.90735e-05 seconds. > > > Elapsed time is 1.38283e-05 seconds. > > > Elapsed time is 1.3113e-05 seconds. > > > Elapsed time is 2.09808e-05 seconds. > > > Elapsed time is 4.88758e-05 seconds. > > > Elapsed time is 0.000244141 seconds. > > > Elapsed time is 0.00073719 seconds. > > > Elapsed time is 0.00500298 seconds. > > > Elapsed time is 0.0380011 seconds. > > > Elapsed time is 0.285108 seconds. > > > Elapsed time is 2.16196 seconds. > > > Elapsed time is 17.2197 seconds. > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > File: example-of-naive-matmul.c > > > > > > #include <stdlib.h> > > > #include <stdio.h> > > > #include <math.h> > > > > > > int > > > main(int argc, char **argv){ > > > > > > int N ; > > > if(argc==0){ > > > N=8192 ; > > > } else { > > > N=atoi(argv[1]) ; > > > } > > > > > > double *a=(double*)calloc(N*N,sizeof(double)); > > > double *aT=(double*)calloc(N*N,sizeof(double)); > > > for(int i=0 ; i<N ; i++){ > > > for(int j =0 ; j<N ; j++){ > > > a[i+N*j]=aT[j+N*i]=cos(1234*i+5678*j) ; > > > } > > > } > > > > > > double *b=(double*)calloc(N*N,sizeof(double)); > > > for(int i=0 ; i<N ; i++){ > > > for(int j=0 ; j<N ; j++){ > > > double bij=0.0 ; > > > for(int k=0 ; k<N ; k++){ > > > bij += aT[k+N*i]*a[k+N*j] ; > > > } > > > b[i+N*j]=bij ; > > > } > > > } > > > printf("\n") ; > > > /* > > > for(int i=0 ; i<5 ; i++){ > > > for(int j=0 ; j<5 ; j++){ > > > printf("%f\t",a[i+N*j]) ; > > > } > > > printf("\n") ; > > > } > > > printf("\n") ; > > > for(int i=0 ; i<5 ; i++){ > > > for(int j=0 ; j<5 ; j++){ > > > printf("%f\t",b[i+N*j]) ; > > > } > > > printf("\n") ; > > > } > > > */ > > > } > > > > > > /* Sample run. > > > $ cc -o example-of-naive-matmul{,.c} -O3 > > > $ for i in {1..13}; do n=`echo 2^$i|bc`; echo $n ; time > > > ./example-of-naive-matmul $n ; done > > > 2 > > > > > > > > > real 0m0.038s > > > user 0m0.015s > > > sys 0m0.000s > > > 4 > > > > > > > > > real 0m0.045s > > > user 0m0.000s > > > sys 0m0.030s > > > 8 > > > > > > > > > real 0m0.047s > > > user 0m0.030s > > > sys 0m0.000s > > > 16 > > > > > > > > > real 0m0.046s > > > user 0m0.046s > > > sys 0m0.015s > > > 32 > > > > > > > > > real 0m0.051s > > > user 0m0.015s > > > sys 0m0.000s > > > 64 > > > > > > > > > real 0m0.046s > > > user 0m0.000s > > > sys 0m0.030s > > > 128 > > > > > > > > > real 0m0.045s > > > user 0m0.000s > > > sys 0m0.046s > > > 256 > > > > > > > > > real 0m0.078s > > > user 0m0.015s > > > sys 0m0.030s > > > 512 > > > > > > > > > real 0m0.268s > > > user 0m0.218s > > > sys 0m0.030s > > > 1024 > > > > > > > > > real 0m1.840s > > > user 0m1.811s > > > sys 0m0.030s > > > 2048 > > > > > > > > > real 0m14.043s > > > user 0m13.937s > > > sys 0m0.062s > > > 4096 > > > > > > > > > real 1m49.829s > > > user 1m49.578s > > > sys 0m0.125s > > > 8192 > > > > > > > > > real 14m34.843s > > > user 14m33.046s > > > sys 0m0.874s > > > > > > */ > > > > > > ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ > > > I ran all of the above on a lower midrange laptop with Windows 10, > > > i5, 8GB RAM, 2 cores, 4 threads; I used J805, Anaconda python 3.5, > > > Octave 5.2.0. > > > ---------------------------------------------------------------------- > > > For information about J forums see http://www.jsoftware.com/forums.htm > > > > > > -- > > This email has been checked for viruses by AVG. > > https://www.avg.com > > > > ---------------------------------------------------------------------- > > For information about J forums see http://www.jsoftware.com/forums.htm > > > ---------------------------------------------------------------------- > For information about J forums see http://www.jsoftware.com/forums.htm > ---------------------------------------------------------------------- For information about J forums see http://www.jsoftware.com/forums.htm