Accelerated BLAS operations are fast precisely because they take these
opportunities to rearrange the computations, not just (or primarily)
through parallelism. They are finely tuned kernels that exploit the
details of the CPU/FPU to pipeline instructions (which might be SIMD) and
to optimize memory movement and register use. In particular, SIMD
instructions might be used, on your machine, to process rows in batches of
4, giving the pattern that you are seeing.
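As a rough illustration (this is a sketch, not the actual BLAS kernel):
accumulating in four independent lanes and combining them at the end, the
way a vectorized kernel might, rounds differently than a strict
left-to-right sum, even though the two are mathematically equal:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(1024)

# Strict left-to-right accumulation.
sequential = 0.0
for v in x:
    sequential += v

# Four independent "lanes" combined at the end -- the kind of
# reordering a SIMD kernel performs. Mathematically identical,
# but the intermediate roundings differ.
lanes = x.reshape(-1, 4).sum(axis=0)
simd_like = lanes[0] + lanes[1] + lanes[2] + lanes[3]

# The two results agree to within a few ULPs but need not be bit-identical.
print(sequential, simd_like, sequential == simd_like)
```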

I, personally, am not willing to give up those optimizations for a slightly
more predictable order of additions.
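For reference, the underlying issue is simply that floating-point addition
is not associative, so any reordering of a sum can change the last bits of
the result:

```python
# Floating-point addition is not associative: regrouping the same
# three addends changes the rounding of the intermediate results.
a = (0.1 + 0.2) + 0.3   # 0.6000000000000001
b = 0.1 + (0.2 + 0.3)   # 0.6
print(a == b)           # False
```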

On Tue, Jul 25, 2023 at 8:04 AM Junyan Xu <junyanxum...@gmail.com> wrote:

> Hello everyone,
>
> I asked this question
> https://stackoverflow.com/questions/76707696/np-dot-yields-a-different-result-when-computed-in-two-pieces
> on StackOverflow a few days ago but still haven't got an answer. Long story
> short, I discovered (in certain common settings like Google Colab) that the
> method NumPy uses to compute the dot product of a 4D vector with a 4 x m
> matrix depends on m, so that if you cut a 4 x 2m matrix into two 4 x m
> matrices and compute the dot products with the vector individually, you get
> a slightly different result. More precisely, I found that:
>
> For m = 1, the order ((0+1)+2)+3 is used;
> for m = 2 and 3, the order (0+1)+(2+3) is used for all entries;
> for 4 ≤ m ≤ 7, an unknown method is used for the first 4 entries, and the
> order ((0+1)+2)+3 is used for the rest;
> for 8 ≤ m ≤ 11, an unknown method is used for the first 8 entries, and the
> order ((0+1)+2)+3 is used for the rest;
> for 12 ≤ m ≤ 15, an unknown method is used for the first 12 entries, and
> the order ((0+1)+2)+3 is used for the rest;
> and the pattern probably continues.
>
> If we instead take the dot product of an m x 4 matrix with a 4D vector
> (rowMode = True in my code):
>
> For m = 1, the order ((0+1)+2)+3 is used;
> for m ≥ 2, the order (0+2)+(1+3) is used for all entries (verified up to m
> = 17).
>
> Moreover, if you replace 4 by an arbitrary n, the difference between the
> results grows: super-linearly in n in the vec-dot-mat case, but
> sub-linearly in the mat-dot-vec case.
>
> I think this behavior is highly non-intuitive and probably unnecessary: we
> are simply repeating the same operation m times (possibly in parallel), so
> why should the method chosen depend on how many times the operation is
> repeated? Therefore, I'd like to propose an enhancement item to make the
> result independent of m. I realize this may need to be done upstream, so as
> a first step we may need to find out which BLAS library functions etc. are
> called to compute dot products and which are responsible for this behavior.
>
> Notebook for figuring out order of addition:
> https://colab.research.google.com/drive/1O4NtTWMiZbSxKFVg-lfBJDVypIBdVJ22?usp=sharing
> Notebook for growth with n:
> https://colab.research.google.com/drive/10ecRWvcMEXY0RchVlaLyPUFHJBA4KwA6?usp=sharing
>
> Thank you in advance for your thoughts on this.
>
> Best regards,
>
> Junyan
> _______________________________________________
> NumPy-Discussion mailing list -- numpy-discussion@python.org
> To unsubscribe send an email to numpy-discussion-le...@python.org
> https://mail.python.org/mailman3/lists/numpy-discussion.python.org/
> Member address: robert.k...@gmail.com
>


-- 
Robert Kern