[issue35904] Add statistics.fmean(seq)

2019-04-23 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset 7280048690244e73b13f4f68b96c244bcb5434e8 by Raymond Hettinger in 
branch 'master':
bpo-35904: Add missing fmean() entry to the summary table (GH-12919)
https://github.com/python/cpython/commit/7280048690244e73b13f4f68b96c244bcb5434e8


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-04-23 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
pull_requests: +12845

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-21 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-21 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset 47d9987247bcc45983a6d51fd1ae46d5d356d0f8 by Raymond Hettinger in 
branch 'master':
bpo-35904: Add statistics.fmean() (GH-11892)
https://github.com/python/cpython/commit/47d9987247bcc45983a6d51fd1ae46d5d356d0f8


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-18 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

PR looks good to me, thanks Raymond.

Just at the moment I'm having problems with my internet connection leading to 
technical difficulties with Github. Hopefully I can resolve this soon.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-17 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> Yes please! I'm happy with the name fmean

Okay, the PR is ready.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-16 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
keywords: +patch
pull_requests: +11920
stage:  -> patch review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-10 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

> Would you like me to submit a PR with docs and tests?

Yes please! I'm happy with the name fmean.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-09 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

FWIW, fabs() and fmod() are other examples of where "f" indicates a function 
specialized to convert inputs to floats, perform floating point math, and 
return a float.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-08 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Would you like me to submit a PR with docs and tests?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-08 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

+1 from me as well.

I like your count() solution because 1) it gives the same answer for both an 
iterator and for an iterable 2) it preserves the memory friendly 
characteristics of iterators, 3), it is still reasonably fast, and 4) the 
function signature is still simple.

My top name preference is "fmean" because I'm used to "isqrt" for integers, 
"fsqrt" for floats, and "cmath" for complex where the first letter means the 
type.  Incidentally, that is why "fsum" is named with an "f". My second choice 
is "fastmean" without an underscore.  To my ear, "float_mean" is confusing, as 
if "float" were a verb and "mean" were the direct object, the antonym of 
"sink_mean" ;-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-08 Thread Mark Dickinson


Mark Dickinson  added the comment:

No further bikeshedding on the name from me; +1 on whatever you decide on. :-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-08 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

Oh, I seem to have accidentally reverted the change of title.

Sorry, that was definitely not intended and I don't know how it 
happened. But now that it has, I'm not going to change it until we have 
a decision on a name.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-07 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

> On my current 3.8 build, this code given an approx 500x speed-up

On my system, I only get a 30x speed-up using your timeit code. Using 
ints instead of random floats, I only get a 9x speed-up.

This just goes to show how sensitive these timing results are on 
platform and hardware.

What do you think of this implementation?

def floatmean(data:Iterable) -> Float:
try:
n = len(data)
except TypeError:
# Handle iterators with no len.
n = 0
def count(x):
nonlocal n
n += 1
return x
total = math.fsum(map(count, data))
return total/n
else:
return math.fsum(data)/n

Compared to the "no frills" fsum()/len() version:

- I see no visible slowdown on lists of floats;
- it handles iterators as well.

On my computer, the difference between the sequence path and the 
iterator path is just a factor of 3.5. How does it compare on other 
machines?

As for the name, I think we have three reasonable candidates:

float_mean
fast_mean
fmean

(with or without underscores for the first two). Do people have a 
preference?

--
title: Add statistics.fastmean(seq) -> Add statistics.fmean(seq)

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-07 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

>>def fmean(seq: Sequence[float]) -> float:
>>return math.fsum(seq) / len(seq)
>
> Is it intentional that this doesn't support iterators?

Since we need both the sum and the length, this seemed like a good starting 
point.  Also, the existing mean() function already covers the more general 
cases.

I suspect that it is common to keep the data in memory so that more than one 
descriptive statistic can be generated:

data = load_measurements()
data.sort()
n = len(data)
mu = fastmean(data)
sigma = stdev(data, xbar=mu)
low, q1, q2, q3, high = data[0], data[n//4], data[n//2], data[3*n//4], 
data[-1]
popular = mode(data, first_tie=True)


It's possible (though possibly not desirable) to provide an fallback path:

def fastmean(data: Iterable) -> float:
try:
return fsum(data) / len(data)
except TypeError:
# Slow alternative   
return float(mean(data))
# Memory intensive alternative
data = list(data)
return fsum(data) / len(data)  
# Less accurate alternative
total = n = 0
for n, x in enumerate(data, start=1):
total += x
return total / n

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-07 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

>def fmean(seq: Sequence[float]) -> float:
>return math.fsum(seq) / len(seq)

Is it intentional that this doesn't support iterators?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-07 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

In the PEP, I did say that I was making no attempt to compete with numpy 
for speed, and that correctness was more important than speed.

That doesn't mean I don't care about speed. Nor do I necessarily care 
about absolute precision when given nothing but float arguments. Mark 
suggests that using fsum() will be accurate to within 1.5 ulp which 
satisfies me for float arguments.

I doubt that stdev etc would be able to promise that accuracy, so 
provided your data is all floats, that seems like a pretty good result 
for the mean.

But I'm not super-keen on having two separate mean() functions if that 
opens the floodgates to people wanting every statistics function to grow 
a fast-but-floats-only twin. That would make me sad. But maybe mean() is 
special enough to justify twinning it.

In my ideal world, I'd have a single mean() function that had a 
fast-path for float data, but would automatically drop down to a slower 
but more accurate path for other types, out-of-range data, etc. I 
believe that the the built-in sum() function does something like this.

When I say "more accurate", this isn't a complaint about fsum(). It 
refers to the limitation of floats themselves. Call me Mr Silly if you 
like, but if I need to take the average of numbers bigger than 2**1074 I 
would like to be able to, even if it takes a bit longer :-)

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-06 Thread Mark Dickinson


Mark Dickinson  added the comment:

Double-checking my own assertions: here's an example of a list xs of floats for 
which `fsum(xs) / len(xs)` is out by more than 1 ulp. (Obtained simply by 
checking a few lists of random.random() outputs; it's probably possible to 
construct something more obvious.)

Python 3.7.2 (default, Dec 30 2018, 08:55:50)
[Clang 10.0.0 (clang-1000.11.45.5)] on darwin
Type "help", "copyright", "credits" or "license" for more information.
>>> import fractions, math>>> xs = [float.fromhex(h) for h in 
>>> ['0x1.88104e64ffa5cp-3', '0x1.9b793215ddca8p-3', '0x1.754cbf6b09730p-1', 
>>> '0x1.e2b4ca1df3680p-2', '0x1.91689b66782e1p-1']]
>>> approx_mean = math.fsum(xs) / len(xs)
>>> approx_mean  # in [0.25, 0.5], so 1 ulp is 2**-54
0.47536945341150305
>>> exact_mean = sum(fractions.Fraction(x) for x in xs) / len(xs)
>>> exact_mean
Fraction(10704368466236809, 22517998136852480)
>>> error_in_ulps = abs(exact_mean - fractions.Fraction(approx_mean)) * 2**54
>>> float(error_in_ulps)
1.2

I ran checks on 100 such randomly generated lists, and the error never 
exceeded 1.5 ulps.

Sketch of proof of the 1.5 ulps bound: the fsum result is out by at most 0.5 
ulps; the length n of the list is presumably represented exactly (lists with 
more than 2**53 elements would be infeasible). Division of the fsum result by n 
keeps the relative error the same, but potentially magnifies the ulps error by 
two, due to the usual "wobble" between relative error and ulps error, so that 
gives us up to 1 ulp error. Then the result of the division may need to be 
rounded again, giving another potential error of up to 0.5 ulps. The bound is 
strict: we can't actually attain 1.5 ulps, so the result we get can't be more 
than 1 ulp away from the correctly rounded result.

> the proposed fmean name seems a bit unfortunate in that it's reversing the 
> sense of sum and fsum

I see Josh already made this observation: apologies for the duplicate 
bikeshedding.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-06 Thread Mark Dickinson


Mark Dickinson  added the comment:

> Are you saying that "fsum(seq) / len(seq)" is incorrect because on some
> older builds there is a rare possibility of an error of 1 unit in the last
> place?

Just to be clear, it's not that rare a possibility, and it's not restricted to 
older builds. I think Raymond is referring to the bug where on some machines 
that are still using the x87 for arithmetic, double rounding can lead to the 
fsum result being out (by potentially an arbitrary amount, not just 1 ulp). But 
even without that consideration, we'll still often be out by 1 ulp or more on 
typical new systems.

However, if I got my sums right (and assuming that fsum _is_ correctly rounded, 
and IEEE 754 arithmetic is in use, and the rounding mode hasn't been changed 
from its default of round-ties-to-even, and we're excluding corner cases like 
overflow, underflow, etc., etc.), the result of fsum(seq) / len(seq) can never 
be out by more than 1.5 ulps. That's better than NumPy can promise (even with 
its use of pairwise summation in some -- but not all -- cases), and should be 
good enough for almost any practical purpose.

I think I'd rather see the regular statistics.mean sacrifice the 
good-to-the-last-bit accuracy (and it's really not that much of a sacrifice: 
accurate to 1.5 ulps is a _very_ respectable compromise) than have a new 
function. I don't know how feasible that is, though, given all the 
type-handling in statistics.mean.

Bikeshedding time: the proposed fmean name seems a bit unfortunate in that it's 
reversing the sense of sum and fsum, where fsum is the correctly-rounded, 
slightly slower variant. My first assumption on seeing the name was that fmean 
was supposed to be to mean as fsum is to sum.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-06 Thread Mark Dickinson


Change by Mark Dickinson :


--
nosy: +mark.dickinson

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-05 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> Correctness over speed.

Are you saying that "fsum(seq) / len(seq)" is incorrect because on some older 
builds there is a rare possibility of an error of 1 unit in the last place?  
Just about everyone who uses this module will not care one whit about that.  
ISTM, a requirement for exactness violates "practicality beats purity".  We 
really don't want to be in the position of advising people to never use this 
module because of design choices that aren't in line with user needs.

> It is easier to speed up a correct but slow function 
> than to correct a fast but buggy one.

Taken to an extreme, this could be used justify a BubbleSort.  In the case of 
statistics.mean(), slowness of three orders of magnitude makes this a nearly 
catastrophic choice for real world use.

> I don't like the naming symmetry between fsum and fmean;

Any reasonable name will suffice.  I suggested an f prefix that could be taken 
to mean that it is "fast" or that it returns a "float".  It isn't much 
different than isqrt() for an integer square root.

> numpy already exists for people who need insane speed for these algorithms 
> and are willing to compromise accuracy;

Let's see what Steven and Tim think about this.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-05 Thread Josh Rosenberg


Josh Rosenberg  added the comment:

Correct me if I'm wrong, but at least initially, the first listed goal of 
statistics (per the PEP) was:

"Correctness over speed. It is easier to speed up a correct but slow function 
than to correct a fast but buggy one."

numpy already exists for people who need insane speed for these algorithms and 
are willing to compromise accuracy; am I wrong in my impression that statistics 
is more about providing correct batteries included that are fast enough for 
simple uses, not reimplementing numpy piece by piece for hardcore number 
crunching?

Even if such a function were desirable, I don't like the naming symmetry 
between fsum and fmean; it's kind of misleading. math.fsum is a slower, but 
more precise, version of the built-in sum. Having statistics.fmean be a faster, 
less accurate, version of statistics.mean reverses that relationship between 
the f-prefixed and non-f-prefixed versions of a function.

--
nosy: +josh.r

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35904] Add statistics.fmean(seq)

2019-02-05 Thread Raymond Hettinger


New submission from Raymond Hettinger :

The current mean() function makes heroic efforts to achieve last bit accuracy 
and when possible to retain the data type of the input.

What is needed is an alternative that has a simpler signature, that is much 
faster, that is highly accurate without demanding perfection, and that is 
usually what people expect mean() is going to do, the same as their calculators 
or numpy.mean():

   def fmean(seq: Sequence[float]) -> float:
   return math.fsum(seq) / len(seq)

On my current 3.8 build, this code given an approx 500x speed-up (almost three 
orders of magnitude).   Note that having a fast fmean() function is important 
in resampling statistics where the mean() is typically called many times:  
http://statistics.about.com/od/Applications/a/Example-Of-Bootstrapping.htm 


$ ./python.exe -m timeit -r 11 -s 'from random import random' -s 'from 
statistics import mean' -s 'seq = [random() for i in range(10_000)]' 'mean(seq)'
50 loops, best of 11: 6.8 msec per loop

$ ./python.exe -m timeit -r 11 -s 'from random import random' -s 'from math 
import fsum' -s 'mean=lambda seq: fsum(seq)/len(seq)' -s 'seq = [random() for i 
in range(10_000)]' 'mean(seq)'
2000 loops, best of 11: 155 usec per loop

--
assignee: steven.daprano
components: Library (Lib)
messages: 334894
nosy: rhettinger, steven.daprano, tim.peters
priority: normal
severity: normal
status: open
title: Add statistics.fmean(seq)
type: behavior
versions: Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com