Mark Dickinson <dicki...@gmail.com> added the comment:

[Karthikeyan]

> can possibly break again if (x-c) * (x-c) was also changed to return float64 
> in future

I think it's safe to assume that multiplying two NumPy float32's will continue 
to give a float32 back in the future; NumPy has no reason to give back a 
different type, and changing this would be a big breaking change.

The big difference between (x-c)**2 and (x-c)*(x-c) in this respect is that the 
latter is purely an operation on float32 operands, while the former is a 
mixed-type operation: NumPy sees a binary operation between a float32 and a 
Python int, and has to figure out a suitable type for the result. The int to 
float32 conversion is regarded as introducing unacceptable precision loss, so 
it chooses float64 as an acceptable output type, converts both input operands 
to that type, and then does the computation.
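The distinction can be checked directly. This is a small sketch (not from the original message); note that the promotion rule for *Python* int operands is version-dependent (NEP 50 changed it in NumPy 2.0), so the lossy-cast case is illustrated here with np.int64, where the behaviour is stable:

```python
import numpy as np

# Two float32 operands: the result stays float32.
x = np.float32(1.5)
print((x * x).dtype)                          # float32

# Mixed float32/int64: int64 cannot be cast to float32 without
# precision loss, so NumPy promotes the result to float64.
print(np.promote_types(np.float32, np.int64))  # float64
```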

Some relevant parts of the NumPy docs:

https://docs.scipy.org/doc/numpy/reference/ufuncs.html#casting-rules
https://docs.scipy.org/doc/numpy/reference/generated/numpy.result_type.html

Even for pure Python floats, x*x is a simpler, more accurate, and likely faster 
operation than x**2. On a typical system, the former (eventually) maps to a 
hardware floating-point multiplication, which is highly likely to be correctly 
rounded, while the latter, after conversion of the r.h.s. to a float, maps to a 
libm pow call. That libm pow call *could* conceivably have a fast/accurate path 
for a right-hand-side of 2.0, but it could equally conceivably not have such a 
path.
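A quick way to check the speed claim on any particular system is a timeit comparison; the relative timings below are platform- and libm-dependent, so this is a measurement sketch rather than a definitive result:

```python
import timeit

# Compare a plain multiplication against a power with int exponent.
# Which is faster depends on the interpreter and the platform's libm.
t_mul = timeit.timeit("x*x", setup="x = 1.2345", number=10**6)
t_pow = timeit.timeit("x**2", setup="x = 1.2345", number=10**6)
print(f"x*x: {t_mul:.3f}s   x**2: {t_pow:.3f}s")
```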

OTOH, (x-c)*(x-c) repeats the subtraction unnecessarily, but perhaps assignment 
expressions could rescue us? For example, replace:

    sum((x-c)**2 for x in data)

with:

    sum((y:=(x-c))*y for x in data)
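Wrapped in a function (the name `sum_of_squared_deviations` is just for illustration), the walrus version avoids both the int exponent and the duplicated subtraction:

```python
# Sketch: each deviation is computed once, then squared by
# multiplication, so no mixed-type power operation is involved.
def sum_of_squared_deviations(data, c):
    return sum((y := (x - c)) * y for x in data)

data = [1.0, 2.0, 4.0]
c = sum(data) / len(data)
total = sum_of_squared_deviations(data, c)
print(total)
```

By construction this performs exactly the same float operations as `sum((x-c)*(x-c) for x in data)`, so the two agree bit-for-bit.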

[Steven]

> I am inclined to have the stdev of float32 return a float32 if possible. 

Would that also imply intermediate calculations being performed only with 
float32, or would intermediate calculations be performed with a more precise 
type? float32 has small enough precision to run into real accuracy problems 
with modestly large datasets. For example:

    >>> import numpy as np
    >>> sum(np.ones(10**8, dtype=np.float32), start=np.float32(0))
    16777216.0  # should be 10**8

or, less dramatically:

    >>> sum(np.full(10**6, 1.73, dtype=np.float32), start=np.float32(0)) / 10**6
    1.74242125  # only two accurate significant figures
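One way to square this circle, sketched below with a hypothetical helper (not part of the statistics module), is to accumulate in float64 and round back to float32 only at the end, so the *result* type matches the input type without the intermediate precision loss:

```python
import numpy as np

def mean_f32(data):
    # Hypothetical helper: accumulate in float64 for accuracy,
    # then round once back to float32 for the result type.
    total = np.float64(0.0)
    for x in data:
        total += np.float64(x)
    return np.float32(total / len(data))

data = np.full(10**6, 1.73, dtype=np.float32)
m = mean_f32(data)
print(m)  # close to 1.73, unlike the float32-only accumulation
```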

----------
nosy: +mark.dickinson

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue39218>
_______________________________________