[Python-ideas] Re: NAN handling in statistics functions

Marc-Andre Lemburg Tue, 24 Aug 2021 00:48:57 -0700

On 24.08.2021 05:53, Steven D'Aprano wrote:
> At the moment, the handling of NANs in the statistics module is 
> implementation dependent. In practice, that *usually* means that if your 
> data has a NAN in it, the result you get will probably be a NAN.
> 
>     >>> statistics.mean([1, 2, float('nan'), 4])
>     nan
> 
> But there are unfortunate exceptions to this:
> 
>     >>> statistics.median([1, 2, float('nan'), 4])
>     nan
>     >>> statistics.median([float('nan'), 1, 2, 4])
>     1.5
> 
> I've spoken to users of other statistics packages and languages, such as 
> R, and I cannot find any consensus on what the "right" behaviour should 
> be for NANs except "not that!".
> 
> So I propose that statistics functions gain a keyword only parameter to 
> specify the desired behaviour when a NAN is found:
> 
> - raise an exception
> 
> - return NAN
> 
> - ignore it (filter out NANs)
> 
> which seem to be the three most common preferences. (It seems to be 
> split roughly equally between the three.)
> 
> Thoughts? Objections?


Sounds good. This is similar to the errors argument we have
for codecs where users can determine what the behavior should be
in case of an error in processing.
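
To make the analogy concrete, here is a minimal sketch, not an actual
statistics API. The function name mean_with_policy, the parameter name
nan_policy and the three policy strings are assumptions made up for
illustration; the thread does not settle on any spelling. The codec
calls at the end use the existing errors argument of bytes.decode().

```python
import math
import statistics

# Hypothetical policy names; the proposal does not fix a spelling.
_POLICIES = ("raise", "return_nan", "ignore")


def _is_nan(x):
    # Only floats are checked here; Decimal NANs are out of scope for this sketch.
    return isinstance(x, float) and math.isnan(x)


def mean_with_policy(data, *, nan_policy="return_nan"):
    """Illustrative wrapper around statistics.mean with a keyword-only NAN policy."""
    if nan_policy not in _POLICIES:
        raise ValueError(f"unknown nan_policy: {nan_policy!r}")
    values = list(data)
    if any(_is_nan(x) for x in values):
        if nan_policy == "raise":
            raise ValueError("NAN found in data")
        if nan_policy == "return_nan":
            return math.nan
        # nan_policy == "ignore": drop the NANs and carry on.
        # If nothing is left, statistics.mean raises StatisticsError.
        values = [x for x in values if not _is_nan(x)]
    return statistics.mean(values)


# The codecs precedent: one keyword selects the error behaviour.
raw = b"ab\xff"  # \xff is never valid in UTF-8
ignored = raw.decode("utf-8", errors="ignore")    # invalid byte dropped
replaced = raw.decode("utf-8", errors="replace")  # invalid byte becomes U+FFFD
```

As with codecs, the caller states the intent at the call site, e.g.
mean_with_policy(data, nan_policy="ignore"), instead of relying on
whatever the implementation happens to do.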

> Does anyone have any strong feelings about what should be the default? 

No strong preference, but if the objective is to continue calculations
as much as possible even in the face of missing values, returning NAN
is the better choice.

Second best would be an exception, IMO, to signal: please be explicit
about what to do about NANs in the calculation. Failing early also
reduces the backtracking needed when the end result of a calculation
turns out to be NAN.

Filtering out NANs should always be an explicit choice to make.
Ideally such filtering should happen *before* any calculations
get applied. In some cases, it's better to replace NANs with
use case specific default values. In others, removing them is the
right thing to do.
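
A small sketch of doing this cleanup before calling into the statistics
module. The sample data and the replacement value 0.0 are assumptions
chosen only for illustration; a real default depends on the use case.

```python
import math
import statistics

data = [1.0, 2.0, float("nan"), 4.0]

# Option 1: remove NANs explicitly before any calculation.
cleaned = [x for x in data if not math.isnan(x)]

# Option 2: replace NANs with a use case specific default
# (0.0 is a made-up example value, not a recommendation).
filled = [0.0 if math.isnan(x) else x for x in data]

median_cleaned = statistics.median(cleaned)  # median of [1.0, 2.0, 4.0]
median_filled = statistics.median(filled)    # median of [1.0, 2.0, 0.0, 4.0]
```

The two choices give different answers (2.0 versus 1.5 here), which is
why the decision is better made visibly by the caller.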

Note that e.g. SQL defaults to ignoring NULLs in aggregate functions
such as AVG(), so there are standard precedents for ignoring NAN values
by default as well. And yes, that default can lead to wrong results
in reports, and those errors are hard to detect.
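
The SQL behaviour can be seen with the sqlite3 module from the standard
library. The table name and the values below are made up for
illustration.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE readings (value REAL)")
conn.executemany(
    "INSERT INTO readings (value) VALUES (?)",
    [(1.0,), (2.0,), (None,), (4.0,)],  # None is stored as NULL
)

# AVG() and COUNT(value) skip the NULL row; COUNT(*) does not.
avg, n_rows, n_non_null = conn.execute(
    "SELECT AVG(value), COUNT(*), COUNT(value) FROM readings"
).fetchone()
conn.close()
```

The average is computed over three rows although the table holds four,
and nothing in the result says so unless the counts are compared.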

-- 
Marc-Andre Lemburg
eGenix.com


