> On 16 Feb 2024, at 2:48 am, Marten van Kerkwijk <m...@astro.utoronto.ca> wrote:
>
>> In [45]: %timeit np.add.reduce(a, axis=None)
>> 42.8 µs ± 2.44 µs per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>
>> In [43]: %timeit dotsum(a)
>> 26.1 µs ± 718 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>>
>> But theoretically, sum should be faster than the dot product by a fair bit.
>>
>> Isn't parallelisation implemented for it?
>
> I cannot reproduce that:
>
> In [3]: %timeit np.add.reduce(a, axis=None)
> 19.7 µs ± 184 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)
>
> In [4]: %timeit dotsum(a)
> 47.2 µs ± 360 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)
>
> But almost certainly it is indeed due to optimizations, since .dot uses
> BLAS, which is highly optimized (at least on some platforms, clearly
> better on yours than on mine!).
>
> I thought .sum() was optimized too, but perhaps less so?
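[The `dotsum` helper being timed is not shown anywhere in the thread; a minimal sketch of what such a dot-based sum might look like, assuming it reduces a 2-D array via BLAS matrix-vector products, would be:]

```python
import numpy as np

def dotsum(a):
    """Sum all elements of a 2-D array via BLAS matrix-vector products.

    ones(m) @ a yields the column sums (one BLAS gemv call), and dotting
    the result with ones(n) adds those up.  Unlike np.add.reduce, the
    gemv step can be multithreaded by the BLAS library.
    """
    m, n = a.shape
    return np.ones(m).dot(a).dot(np.ones(n))

a = np.arange(12.0).reshape(3, 4)
print(dotsum(a), a.sum())  # both give 66.0
```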
I can confirm at least that it does not seem to use multithreading: with the
conda-installed numpy+BLAS I almost exactly reproduce your numbers, whereas
linked against my own OpenBLAS build:

In [3]: %timeit np.add.reduce(a, axis=None)
19 µs ± 111 ns per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

# OMP_NUM_THREADS=1
In [4]: %timeit dots(a)
20.5 µs ± 164 ns per loop (mean ± std. dev. of 7 runs, 10,000 loops each)

# OMP_NUM_THREADS=8
In [4]: %timeit dots(a)
9.84 µs ± 1.1 µs per loop (mean ± std. dev. of 7 runs, 100,000 loops each)

add.reduce shows no difference between the two and always remains at
<= 100 % CPU usage. dotsum scales still better with larger matrices,
e.g. ~4x for 1000x1000.

Cheers,
        Derek

_______________________________________________
NumPy-Discussion mailing list -- numpy-discussion@python.org
https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
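[The OMP_NUM_THREADS comparison above can be reproduced in a single script; a small sketch, assuming OpenBLAS honours OMP_NUM_THREADS and using a 1000x1000 array as in the scaling comparison. Note that BLAS libraries read the thread-count environment variables when NumPy is first imported, so the variable must be set before the import:]

```python
import os
# OpenBLAS reads OMP_NUM_THREADS at import time, so set it
# before the first `import numpy`.
os.environ["OMP_NUM_THREADS"] = "1"

import numpy as np
from timeit import timeit

a = np.random.rand(1000, 1000)
v = np.ones(1000)

# Dot-based sum: two BLAS gemv calls, affected by the thread setting...
t_dot = timeit(lambda: v.dot(a).dot(v), number=100)
# ...versus np.add.reduce, which runs single-threaded either way.
t_red = timeit(lambda: np.add.reduce(a, axis=None), number=100)
print(f"dot-based sum: {t_dot:.4f} s, add.reduce: {t_red:.4f} s")
```

For changing the thread count at runtime rather than via the environment, the threadpoolctl package (`threadpool_limits`) can limit BLAS threads around a block of code.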