Hello everyone,
I asked this question
https://stackoverflow.com/questions/76707696/np-dot-yields-a-different-result-when-computed-in-two-pieces
on Stack Overflow a few days ago but still haven't received an answer. Long story
short, I discovered (in certain common settings like Google Colab) that the
Accelerated BLAS operations are accelerated precisely by taking these
opportunities to rearrange the computations, not just (or primarily) by
parallelism. They are very finely tuned kernels that exploit low-level details
of the CPU/FPU to pipeline instructions (which might be SIMD) and optimize
memory m