[ https://issues.apache.org/jira/browse/ARROW-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302281#comment-17302281 ]
Yibo Cai commented on ARROW-11758: ---------------------------------- For record. Arrow's pairwise sum implementation is a bit different from numpy. Numpy adopts recursion with manually unrolled loops for last 16 summations. Arrow code is non-recursive, and no unroll. Though equivalent in theory, they are different for floating point arithmetic in reality. There may be gaps for large array summations. Actually, neither numpy nor pyarrow can give accurate results. The point is they both guarantees O(logn) round-off error bound. In below test, pyarrow gives the accurate result. Of course, there must be some tests in favor of numpy. The truth value is from mpmath arbitrary precision lib. *Test code* {code:python} import numpy as np import pyarrow.compute as pc import mpmath mpmath.dps = 30 t = np.arange(123456, 654321, dtype='float64') * 987654321 print(' numpy: %.0f' % np.sum(t)) print(' arrow: %.0f' % pc.sum(t).as_py()) print('mpmath: %.0f' % mpmath.fsum(t)) {code} *Result* {noformat} numpy: 203898299380326662144 arrow: 203898299380326498304 mpmath: 203898299380326498304 {noformat} > [C++][Compute] Summation kernel round-off error > ----------------------------------------------- > > Key: ARROW-11758 > URL: https://issues.apache.org/jira/browse/ARROW-11758 > Project: Apache Arrow > Issue Type: Bug > Components: C++ > Reporter: Yibo Cai > Assignee: Yibo Cai > Priority: Major > Labels: pull-request-available > Fix For: 4.0.0 > > Time Spent: 3.5h > Remaining Estimate: 0h > > From below test, summation kernel is of lower precision than numpy.sum. > Numpy implements pairwise summation [1] with O(logn) round-off error, better > than O(n\) error from naive summation. > *sum.py* > {code:python} > import numpy as np > import pyarrow.compute as pc > t = np.arange(321000, dtype='float64') > t2 = t - np.mean(t) > t2 *= t2 > print('numpy sum:', np.sum(t2)) > print('arrow sum:', pc.sum(t2)) > {code} > *test result* > {noformat} > # Verified with wolfram alpha (arbitrary precision), Numpy's result is > correct. > $ ARROW_USER_SIMD_LEVEL=SSE4_2 python sum.py > numpy sum: 2756346749973250.0 > arrow sum: 2756346749973248.0 > $ ARROW_USER_SIMD_LEVEL=AVX2 python sum.py > numpy sum: 2756346749973250.0 > arrow sum: 2756346749973249.0 > {noformat} > [1] https://en.wikipedia.org/wiki/Pairwise_summation -- This message was sent by Atlassian Jira (v8.3.4#803005)