[issue20499] Rounding errors with statistics.variance

Wolfgang Maier Fri, 07 Feb 2014 03:32:49 -0800

Wolfgang Maier added the comment:

I have written a patch for this issue (I'm uploading the complete new code for 
everyone to try it - importing it into Python3.3 works fine; a diff with 
additional tests against Oscar's example will follow soon).
Just as Oscar suggested, this new version performs all calculations using exact 
rational arithmetic and rounds/coerces only before returning the final result 
to the user. Its precision is thus limited only by that of the input data 
sequence.
It passes Oscar's examples 1-3, as you can easily test yourself. It also gives 
the correct answer in the fourth example, mean([D('1.2'), D('1.3'), 
D('1.55')]), although on my system the original statistics module gets this 
one right already.
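
To illustrate the "compute exactly, coerce only at the end" idea, here is a 
minimal sketch using fractions.Fraction. It is not the code from the patch; the 
function name exact_mean is made up for this illustration, and the final 
coercion assumes Decimal input.

```python
from decimal import Decimal as D
from fractions import Fraction

def exact_mean(data):
    """Mean in exact rational arithmetic; coerced to Decimal only at the end.

    Hypothetical helper for illustration, not part of the statistics module.
    """
    data = list(data)
    # Fraction(x) is exact for int, float, Decimal and Fraction inputs.
    total = sum(Fraction(x) for x in data)
    result = total / len(data)
    # The only rounding step: convert the exact ratio back to Decimal.
    return D(result.numerator) / D(result.denominator)

print(exact_mean([D('1.2'), D('1.3'), D('1.55')]))
```

The intermediate sum and the division by len(data) are exact here, so any 
rounding can only come from the last line.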


The implementation I chose for this is a bit different from Oscar's suggestion. 
Essentially, it introduces a dedicated module-private class, _ExactRatio, that 
represents numbers as exact ratios and gets passed between the different 
functions in the module. This class borrows many of its algorithms from 
fractions.Fraction, but has some specialized methods for central tasks in the 
statistics module making it much more efficient in this context than 
fractions.Fraction. This class is currently really minimal, but can easily be 
extended if necessary.
In my implementation this new class is used throughout the module whenever 
calculations with or conversions to exact ratios have to be performed, which 
allowed me to preserve almost all of the original code and to factor out the 
changes to the class.
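
For readers who want a feel for what a minimal exact-ratio class could look 
like, here is a purely hypothetical sketch. It is NOT the _ExactRatio class 
from the uploaded file; the name RatioSketch, its methods, and the choice to 
defer normalisation are all assumptions made for this illustration.

```python
from decimal import Decimal
from fractions import Fraction

class RatioSketch:
    """Hypothetical stand-in for an exact-ratio class (not the patch's code)."""
    __slots__ = ("num", "den")

    def __init__(self, num, den=1):
        self.num = num
        self.den = den

    @classmethod
    def from_number(cls, x):
        # Fraction(x) converts int, float, Decimal and Fraction exactly.
        f = Fraction(x)
        return cls(f.numerator, f.denominator)

    def __add__(self, other):
        # No gcd reduction here; normalisation is deferred to to_fraction().
        return RatioSketch(self.num * other.den + other.num * self.den,
                           self.den * other.den)

    def to_fraction(self):
        # Fraction(n, d) reduces to lowest terms.
        return Fraction(self.num, self.den)

s = RatioSketch.from_number(0.5) + RatioSketch.from_number(Decimal("0.25"))
print(s.to_fraction())
```

Whether skipping the reduction on every operation actually pays off depends on 
how large the intermediate integers grow, so treat this only as one possible 
design.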

As for performance, the gain imagined by Oscar is not always realized even 
though the variance functions are now using single passes over the data. 
Specifically, in the case of floats the overhead of having to convert 
everything to exact ratios first eats up all the savings.
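
A single pass over the data becomes possible because, with exact ratios, the 
textbook formula based on the sum and the sum of squares does not suffer from 
cancellation. The following is a minimal sketch of that idea using 
fractions.Fraction; the function name variance_single_pass is hypothetical and 
this is not the implementation from the patch.

```python
from fractions import Fraction

def variance_single_pass(data):
    """Sample variance from one pass, using exact rational arithmetic."""
    n = 0
    total = Fraction(0)
    total_sq = Fraction(0)
    for x in data:
        r = Fraction(x)      # exact conversion; this is the per-item overhead
        n += 1
        total += r
        total_sq += r * r
    if n < 2:
        raise ValueError("variance requires at least two data points")
    # Sum of squared deviations, exact because every term is a Fraction.
    ss = total_sq - total * total / n
    return ss / (n - 1)

data = [1000000004.0, 1000000007.0, 1000000013.0, 1000000016.0]
print(float(variance_single_pass(data)))
```

The conversion on the first line of the loop is where the extra cost for float 
input comes from.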
In the case of fractional input, there is a dramatic performance boost though. 
I compiled a small table comparing (kind of) average performance of the two 
versions with various input data types. Take this with a grain of salt because 
the differences can vary quite a bit depending on the exact data:

data type        performance gain (+) / loss (-) over original module, in %
---------        ----------------------------------------------------------
float                              - 10
short Decimal                      + 10
long Decimal                       - 25
Fraction                           + 80 (!!)
MyFloat                            + 25

With Decimal input, the cost of conversion to exact ratios depends on the 
number of digits in the Decimals, so with short Decimals the savings from the 
single-pass algorithm are larger than the conversion costs, but this reverses 
for very long Decimals.
MyFloat is a minimal class inheriting from float and overriding just its 
arithmetic methods to return MyFloat instances again.
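
A class along these lines could be written as follows. This is only a sketch of 
what such a float subclass might look like, not the class used for the timings; 
the helper _wrap and the selection of overridden methods are assumptions.

```python
def _wrap(name):
    """Build a method that calls float's version and re-wraps the result."""
    base = getattr(float, name)

    def method(self, *args):
        result = base(self, *args)
        # Pass NotImplemented through so Python can try the other operand.
        return result if result is NotImplemented else MyFloat(result)

    method.__name__ = name
    return method

class MyFloat(float):
    """float subclass whose arithmetic returns MyFloat instances again."""

for _name in ("__add__", "__radd__", "__sub__", "__rsub__",
              "__mul__", "__rmul__", "__truediv__", "__rtruediv__",
              "__neg__", "__abs__"):
    setattr(MyFloat, _name, _wrap(_name))

x = MyFloat(1.5) + 2
print(type(x).__name__, x)
```
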
The performance gain with Fraction input comes from two changes, the 
single-pass algorithm and an optimization in _sum (with Fraction, more than 
with any other type, the dictionary built by _sum can grow quite large and the 
optimization is in the conversion of the dictionary elements to exact ratios). 
This is why the extent of this gain can sometimes be significantly higher than 
the 80% listed in the table.

Try this, for example:

from statistics import variance as v
from statistics_with_exact_ratios import variance as v2
from fractions import Fraction

data = [Fraction(1,x) for x in range(1,2000)]

print('calculating variance using original statistics module ...')
print(float(v(data)))
print('now using exact ratio calculations ...')
print(float(v2(data)))
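
To see why the dictionary mentioned above can grow large with Fraction input, 
here is a small sketch. It assumes that the dictionary maps each distinct 
denominator to a running sum of numerators; that is my reading of _sum, not 
something spelled out in this message, and sum_by_denominator is a hypothetical 
name.

```python
from fractions import Fraction

def sum_by_denominator(data):
    """Exact sum via a {denominator: sum of numerators} dictionary.

    Returns the total and the number of dictionary entries.
    """
    partials = {}
    for x in data:
        f = Fraction(x)  # exact ratio of x
        partials[f.denominator] = partials.get(f.denominator, 0) + f.numerator
    total = Fraction(0)
    for d, n in sorted(partials.items()):
        total += Fraction(n, d)  # one Fraction addition per dictionary entry
    return total, len(partials)

# Floats share power-of-two denominators, so few entries are needed ...
print(sum_by_denominator([0.5, 0.25, 0.5]))
# ... while 1/1, 1/2, ..., 1/9 need one entry per value.
print(sum_by_denominator([Fraction(1, x) for x in range(1, 10)])[1])
```

With the data from the example above, almost every value has its own 
denominator, so the final loop over the dictionary does nearly as much work as 
summing the Fractions directly.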

I invite everybody to test my implementation, which is very unlikely to be free 
of bugs at this stage.

----------
type:  -> enhancement
Added file: http://bugs.python.org/file33955/statistics_with_exact_ratios.py

_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue20499>
_______________________________________