Wolfgang Maier added the comment:
I have written a patch for this issue (I'm uploading the complete new code for
everyone to try it - importing it into Python3.3 works fine; a diff with
additional tests against Oscar's example will follow soon).
Just as Oscar suggested, this new version performs all calculations using exact
rational arithmetics and rounds/coerces only before returning the final result
to the user. Its precision is, thus, only limited by that of the input data
sequence.
It passes Oscar's examples 1-3 as you can easily test yourself. It also gives
the correct answer in the fourth example - mean([D('1.2'), D('1.3'),
D('1.55')]) -, although on my system the original statistics module gets this
one right already.
The implementation I chose for this is a bit different from Oscar's suggestion.
Essentially, it introduces a dedicated module-private class _ExactRatio to
represent numbers as exact ratios and that gets passed between different
functions in the module. This class borrows many of its algorithms from
fractions.Fraction, but has some specialized methods for central tasks in the
statistics module making it much more efficient in this context than
fractions.Fraction. This class is currently really minimal, but can easily be
extended if necessary.
In my implementation this new class is used throughout the module whenever
calculations with or conversions to exact ratios have to be performed, which
allowed me to preserve almost all of the original code and to factor out the
changes to the class.
As for performance, the gain imagined by Oscar is not always realized even
though the variance functions are now using single passes over the data.
Specifically, in the case of floats the overhead of having to convert
everything to exact ratios first eats up all the savings.
In the case of fractional input, there is a dramatic performance boost though.
I compiled a small table comparing (kind of) average performance of the two
versions with various input data types. Take this with a grain of salt because
the differences can vary quite a bit depending on the exact data:
data type performance gain(+)/loss(-) over original module / %
--------- ----------------------------------------------------
float - 10 %
short Decimal + 10 %
long Decimal - 25 %
Fraction + 80 % (!!)
MyFloat + 25
With Decimal input the costs of conversion to exact ratios depends on the
digits in the Decimals, so with short Decimals the savings from the single-pass
algorithm are larger than the conversion costs, but this reverses for very long
Decimals.
MyFloat is a minimal class inheriting from float and overriding just its
arithmetic methods to return MyFloat instances again.
The performance gain with Fraction input comes from two changes, the
single-pass algorithm and an optimization in _sum (with Fraction, more than
with any other type, the dictionary built by _sum can grow quite large and the
optimization is in the conversion of the dictionary elements to exact ratios).
This is why the extent of this gain can sometimes be significantly higher than
the 80% listed in the table.
Try this, for example:
from statistics import variance as v
from statistics_with_exact_ratios import variance as v2
from fractions import Fraction
data = [Fraction(1,x) for x in range(1,2000)]
print('calculating variance using original statistics module ...')
print(float(v(data)))
print('now using exact ratio calculations ...')
print(float(v2(data)))
I invite everybody to test my implementation, which is very unlikely to be free
of bugs at this stage.
----------
type: -> enhancement
Added file: http://bugs.python.org/file33955/statistics_with_exact_ratios.py
_______________________________________
Python tracker <[email protected]>
<http://bugs.python.org/issue20499>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe:
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com