[Python-ideas] Re: Fix statistics.median()?

Christopher Barker Sun, 29 Dec 2019 16:08:42 -0800

On Sun, Dec 29, 2019 at 3:26 PM Richard Damon <rich...@damon-family.org>
wrote:


> > Frankly, I’m also confused as to why folks seem to think this is an
> > issue to be addressed in the sort() functions
>


> The way I see it, is that median doesn't handle NaNs in a reasonable
> way, because sorted doesn't handle them,


I don't think so -- median doesn't handle NaNs because handling them takes
a decision about how they should be handled, and code to write; maybe more
code because you can't use the bare sort() functions, but sort will never
solve the problem both generically and properly by itself.
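
For illustration, a minimal sketch (standard math module only, nothing taken from the statistics module itself) of why a bare sort can't settle this: every ordering comparison involving a NaN is false, so a comparison-based sort gets no consistent information about where the NaN belongs.

```python
import math

nan = math.nan

# Every ordering comparison with NaN is False, in both directions,
# so a comparison-based sort has nothing to go on.
comparisons = [nan < 1.0, nan > 1.0, nan == nan, 1.0 < nan, 1.0 > nan]
print(comparisons)

# The position of the NaN in the output is therefore not something to
# rely on; compare the results for two orderings of the same values.
print(sorted([2.0, nan, 1.0]))
print(sorted([nan, 2.0, 1.0]))
```

Whatever policy is chosen for NaNs has to be layered on top of this, which is the decision and the extra code referred to above.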


> because it is easy and quick to
> not handle NaN, and to handle them you need to define an Official
> meaning for them, and there are multiple reasonable meanings.


exactly.


> The reason
> to push most solutions to sorted, is that except for ignore, which can
> easily be implemented as a data filter to the input of the function, the
> exact same problem occurs in multiple functions (in the statistics
> module, that would include quantile) so by the principle of DRY, that is
> the logical place to implement the solution (if we don't implement the
> solution as an input filter)
>

well, no -- the logical place for DRY is to use the SAME sort
implementation for all functions in the statistics module that need a sort.
It only makes sense to try to push this to the standard sort if it were to
be used, in the same way, by many other uses of sort, and it didn't break
any current uses. On the other hand, saying "this is how the statistics
module interprets NaNs, and how things will be sorted" is a localized
decision -- it does not need to be useful for anything else, and it will, by
definition, not break any code that doesn't use the statistics module.

> At its beginning, the statistics module disclaims being a complete all
> encompassing statistics package,


sure -- but that doesn't mean it couldn't be more complete than it
currently is.

> and suggests using one if you need more
> advanced features, which I would consider most processing of NaN to be
> included in.


That's a perfectly valid opinion, but while I think that perhaps "handling
missing values" could be considered advanced, I'm not sure "giving a
correct and meaningful answer for all values of expressly supported data
types" is "advanced" -- in a way, quite the opposite -- it's the less
"advanced" coders, the ones who are not thinking about where NaNs might
appear and what the implications are, who are going to be bitten by the
current implementation.

Docs can help, but I think we can, and should, do better than that -- after
all it's well known that "no one reads documentation".

> One big reason to NOT fix the issue with NaNs in median is
> that such a fix likely has a measurable impact on the processing of
> the median.


You mean performance? Sure, but as I've argued before (no idea if anyone
agrees with me) the statistics package is already not a high performance
package anyway. If it turns out that it slows it down by, say, a factor of
two or more, then yes, maybe we need to forget it.


> I suspect that the simplest solution, and one that doesn't
> impact other uses would be simple filter functions (and perhaps median
> could be defined with an argument for what function to use, with a None
> option that would be fairly quick). One filter would remove NaNs (or
> None), one would throw an exception if there is a NaN, and another would
> just return the sequence [nan] if there are any NaNs in the input
> sequence (so the median would be nan). The same options could be added
> to other operations like quantile, which has a similar issue, and made
> available to the program for other use.
>

I agree -- this could be a good way to go.
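
A rough sketch of what the three filters described above might look like, assuming float-like data. The names `nan_ignore`, `nan_raise`, `nan_propagate`, `median_with_policy` and the `nan_policy` argument are all hypothetical; none of them exist in the statistics module.

```python
import math
from statistics import median


def _is_nan(x):
    # Assumption: float-like input, where NaN is the only value that
    # is not equal to itself.
    return x != x


def nan_ignore(data):
    """Filter: drop NaNs before the statistic is computed."""
    return [x for x in data if not _is_nan(x)]


def nan_raise(data):
    """Filter: refuse to compute anything if a NaN is present."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        raise ValueError("NaN in input")
    return data


def nan_propagate(data):
    """Filter: collapse the input to [nan] so the result is nan."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        return [math.nan]
    return data


def median_with_policy(data, nan_policy=None):
    # nan_policy=None is the quick path: no scan, current behaviour.
    if nan_policy is not None:
        data = nan_policy(data)
    return median(data)


print(median_with_policy([1.0, math.nan, 3.0], nan_ignore))
print(median_with_policy([1.0, math.nan, 3.0], nan_propagate))
```

Because the filters are plain functions on the input, the same three could be passed to a quantile wrapper without touching sort() at all.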


> There is one other option that might be possible to fix sorted,


<snip> see the last post if you want the details ...


> This would say that sorted would work with NaNs, but for median most
> NaNs are treated as more positive than infinity, so the median is
> biased, but at least you don't get absurd results.


yeah, but this is what I meant above -- you'd still want to check for NaNs
in the statistics functions. Though it would be a fast check, 'cause you
could check only the ones on the end after sorting.
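
To make the "check only the end" idea concrete, here is a sketch under a loud assumption: `nan_last_sorted` is a hypothetical stand-in for a sort that places NaNs after every other value, emulated here with a key function rather than by changing sort() or float. `median_checked` is likewise only illustrative.

```python
import math


def nan_last_sorted(data):
    # Hypothetical stand-in for a NaN-aware sort: the key puts every
    # NaN (x != x is True) after every ordinary value.
    return sorted(data, key=lambda x: (x != x, x))


def median_checked(data):
    ordered = nan_last_sorted(data)
    if not ordered:
        raise ValueError("no median for empty data")
    # All NaNs, if any, are at the end, so inspecting the last
    # element is enough; no full scan of the data is needed.
    if ordered[-1] != ordered[-1]:
        raise ValueError("NaN in input")
    n = len(ordered)
    mid = n // 2
    if n % 2:
        return ordered[mid]
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_checked([3.0, 1.0, 2.0]))
```

Raising is just one choice here; the same end-of-list check could instead return nan or strip the trailing NaNs.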

But the biggest barrier is that it would be a fair bit of churn on the
sort() functions (and the float class), and would only help for floats
anyway. If someone wants to propose this, please do -- but I don't think we
should wait for that to do something with the statistics module. Also, if
you want to pursue this, do go back and find the thread about type-checked
sorting -- I think this is it:

https://mail.python.org/pipermail/python-dev/2016-October/146613.html

I'm not sure if anything ever came of that.

- CHB

-- 
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/75BZ5UACRE6SWUJ4C4RYH2G6AQFKN7J3/
Code of Conduct: http://python.org/psf/codeofconduct/
