[ 
https://issues.apache.org/jira/browse/ARROW-11758?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17302281#comment-17302281
 ] 

Yibo Cai commented on ARROW-11758:
----------------------------------

For record.
 Arrow's pairwise sum implementation is a bit different from numpy. Numpy 
adopts recursion with manually unrolled loops for last 16 summations. Arrow 
code is non-recursive, and no unroll. Though equivalent in theory, they are 
different for floating point arithmetic in reality. There may be gaps for large 
array summations.

Actually, neither numpy nor pyarrow can give accurate results. The point is 
they both guarantees O(logn) round-off error bound.

In below test, pyarrow gives the accurate result. Of course, there must be some 
tests in favor of numpy. The truth value is from mpmath arbitrary precision lib.

*Test code*
{code:python}
import numpy as np
import pyarrow.compute as pc
import mpmath

mpmath.dps = 30
t = np.arange(123456, 654321, dtype='float64') * 987654321

print(' numpy: %.0f' % np.sum(t))
print(' arrow: %.0f' % pc.sum(t).as_py())
print('mpmath: %.0f' % mpmath.fsum(t))
{code}
*Result*
{noformat}
 numpy: 203898299380326662144
 arrow: 203898299380326498304
mpmath: 203898299380326498304
{noformat}

> [C++][Compute] Summation kernel round-off error
> -----------------------------------------------
>
>                 Key: ARROW-11758
>                 URL: https://issues.apache.org/jira/browse/ARROW-11758
>             Project: Apache Arrow
>          Issue Type: Bug
>          Components: C++
>            Reporter: Yibo Cai
>            Assignee: Yibo Cai
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 4.0.0
>
>          Time Spent: 3.5h
>  Remaining Estimate: 0h
>
> From below test, summation kernel is of lower precision than numpy.sum.
> Numpy implements pairwise summation [1] with O(logn) round-off error, better 
> than O(n\) error from naive summation.
> *sum.py*
> {code:python}
> import numpy as np
> import pyarrow.compute as pc
> t = np.arange(321000, dtype='float64')
> t2 = t - np.mean(t)
> t2 *= t2
> print('numpy sum:', np.sum(t2))
> print('arrow sum:', pc.sum(t2))
> {code}
> *test result*
> {noformat}
> # Verified with wolfram alpha (arbitrary precision), Numpy's result is 
> correct. 
> $ ARROW_USER_SIMD_LEVEL=SSE4_2 python sum.py
> numpy sum: 2756346749973250.0
> arrow sum: 2756346749973248.0
> $ ARROW_USER_SIMD_LEVEL=AVX2 python sum.py 
> numpy sum: 2756346749973250.0
> arrow sum: 2756346749973249.0
> {noformat}
> [1] https://en.wikipedia.org/wiki/Pairwise_summation



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Reply via email to