I remember when a.sum(axis=0) was much slower than a.sum(axis=1) for
something like a = np.ones((1000, 1000)). Now the two run in about the
same time. How does numpy do it?
Does numpy do something like this?

    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            result[j] += a[i, j]
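
To make the loop order above concrete, here is a minimal runnable sketch. It assumes a C-contiguous array, so each row a[i] is contiguous in memory. sum_axis0_by_rows is a hypothetical helper written for illustration, not numpy's internal code; it only shows that walking the array row by row gives the same answer as a.sum(axis=0).

```python
import numpy as np

def sum_axis0_by_rows(a):
    """Sum over axis 0 by adding one contiguous row at a time.

    Illustrative only: this mirrors the i-outer, j-inner loop order,
    with the inner j loop replaced by a vectorized row addition.
    """
    result = np.zeros(a.shape[1], dtype=a.dtype)
    for i in range(a.shape[0]):
        result += a[i]  # a[i] is a contiguous row for C-ordered arrays
    return result

a = np.ones((1000, 1000))
by_rows = sum_axis0_by_rows(a)
```

The point of this order is that memory is read sequentially, even though the reduction runs over the slow axis.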
___
Numpy uses pairwise summation along the fast axis if that axis contains no
more than 8192 elements. How was 8192 chosen?
Doubling to 16384 would result in a lot more function call overhead due to
the recursion. Is it a speed issue? Memory? Or something else entirely?
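
For reference, here is a rough sketch of recursive pairwise summation and of how the number of recursive calls grows with length. The cutoff block=128 is an assumption chosen for illustration, and pairwise_sum and count_calls are hypothetical helpers, not numpy's actual implementation.

```python
import numpy as np

def pairwise_sum(x, block=128):
    """Recursive pairwise summation of a 1-D sequence.

    block is an assumed cutoff: at or below it, a plain loop is used.
    """
    n = len(x)
    if n <= block:
        total = 0.0
        for v in x:
            total += v
        return total
    half = n // 2
    return pairwise_sum(x[:half], block) + pairwise_sum(x[half:], block)

def count_calls(n, block=128):
    """Count the calls pairwise_sum makes for an input of length n."""
    if n <= block:
        return 1
    half = n // 2
    return 1 + count_calls(half, block) + count_calls(n - half, block)

total = pairwise_sum(np.ones(8192))
calls_8192 = count_calls(8192)    # 127 with block=128
calls_16384 = count_calls(16384)  # 255 with block=128
```

In this sketch, doubling the length adds one level of recursion and roughly doubles the number of calls, so the call count per element stays about the same. Whether that matches numpy's real behavior depends on its actual block size and buffering.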
__