[Python-ideas] Re: Fix statistics.median()?

Richard Damon Sun, 29 Dec 2019 15:25:21 -0800

On 12/29/19 1:16 AM, Christopher Barker wrote:

OMG! This is fun and all, but:
On Sat, Dec 28, 2019 at 9:11 PM Richard Damon <rich...@damon-family.org> wrote:
    ... practicality beats purity.
And practically, everyone in this thread understands what a float is, and what a NaN is and is not.
Richard: I am honestly confused about what you think we should do. Sure, you can justify why the statistics module doesn’t currently handle NaNs well, but that doesn’t address the question of what it should do.
As far as I can tell, the only reasons for the current approach are ease of implementation and performance. Those are fine reasons, and why it was done that way in the first place.
But there seems to be (mostly) a consensus that it would be good to better handle NaNs in the statistics module.
I think the thing to do is decide what we want NaNs to mean: should they be interpreted as missing values or, essentially, errors?
You’ve made a good case that None is the “right” thing to use for missing values, and it could be used with int and other types. So yes, if the statistics module were to grow support for missing values, that could be the way to do it.
Which means that NaNs should either raise an exception or return NaN as a result. Those are options that are better than the current state.
Nevertheless, I think there is a practical argument for NaN-as-missing-value. Granted, it is widely used in other systems because it can be stored in a float data type, and that is not required for Python. But it is widely used, so is familiar to many.
But if we don’t go that route, it would be good to provide NaN-filtering routines in the statistics module; as the related thread shows, NaN detection is not trivial.
Frankly, I’m also confused as to why folks seem to think this is an issue to be addressed in the sort() functions: those are way too general and low level to be expected to solve this. And it would be a much heavier lift to make a change that central to Python anyway.
-CHB

The way I see it, median doesn't handle NaNs in a reasonable way because sorted doesn't handle them, because it is easy and quick to not handle NaNs, and to handle them you need to define an official meaning for them, and there are multiple reasonable meanings. The reason to push most solutions to sorted is that, except for ignore, which can easily be implemented as a data filter on the input of the function, the exact same problem occurs in multiple functions (in the statistics module, that would include quantiles), so by the principle of DRY, that is the logical place to implement the solution (if we don't implement the solution as an input filter).
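
A minimal sketch of why sorted (and so median) has nothing sensible to work with: every ordered comparison involving a NaN is false, so < stops being a consistent ordering. The helper is_sorted is a name made up for this illustration. No output of sorted itself is shown, because what it returns depends on where the NaNs sit in the input.

```python
nan = float("nan")

# Every ordered comparison with a NaN is False, in both directions,
# and a NaN is not even equal to itself.
comparisons = [nan < 1.0, 1.0 < nan, nan == nan, nan <= nan]


def is_sorted(seq):
    """Hypothetical check: True if no element is less than its predecessor."""
    return all(not (b < a) for a, b in zip(seq, seq[1:]))


# Judged by "<" alone this list is already in order, even though 3.0 comes
# before 1.0, because neither neighbour ever compares less than the NaN.
looks_sorted = is_sorted([3.0, nan, 1.0])
```

A sort that only ever asks "is b < a?" therefore has no reason to move anything in a list like this one.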

At its beginning, the statistics module disclaims being a complete, all-encompassing statistics package, and suggests using one if you need more advanced features, which I would consider most processing of NaNs to be included in. One big reason to NOT fix the issue with NaNs in median is that such a fix likely has a measurable impact on the processing of the median. I suspect that the simplest solution, and one that doesn't impact other uses, would be simple filter functions (and perhaps median could be defined with an argument for what function to use, with a None option that would be fairly quick). One filter would remove NaNs (or None), one would throw an exception if there is a NaN, and another would just return the sequence [nan] if there are any NaNs in the input sequence (so the median would be nan). The same options could be added to other operations like quantiles, which have a similar issue, and made available to the program for other uses.
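
The three filters could look roughly like the sketch below. All the names here (drop_missing, raise_on_nan, propagate_nan, the nan_filter argument, and the median wrapper) are invented for illustration; none of them exist in the statistics module. The NaN test only looks at float instances, which is a simplifying assumption.

```python
import math
import statistics


def _is_nan(x):
    # Simplifying assumption: only float NaNs are detected. Decimal or
    # Fraction inputs would need more care than this sketch takes.
    return isinstance(x, float) and math.isnan(x)


def drop_missing(data):
    """Filter 1: remove NaNs (and None) from the data."""
    return [x for x in data if x is not None and not _is_nan(x)]


def raise_on_nan(data):
    """Filter 2: raise an exception if any NaN is present."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        raise ValueError("NaN in input data")
    return data


def propagate_nan(data):
    """Filter 3: collapse the data to [nan] if any NaN is present."""
    data = list(data)
    return [math.nan] if any(_is_nan(x) for x in data) else data


def median(data, nan_filter=None):
    """Hypothetical wrapper: nan_filter=None keeps today's fast path."""
    if nan_filter is not None:
        data = nan_filter(data)
    return statistics.median(data)
```

With this shape, median([1.0, math.nan, 3.0, 5.0], nan_filter=drop_missing) takes the median of [1.0, 3.0, 5.0], and the same filters could be passed to a similar wrapper around quantiles.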

There is one other option that might make it possible to fix sorted: the IEEE spec does define another comparison function that could be used by sorted to sort the numbers, called something like total_order(a, b) (totalOrder in IEEE 754-2008), which returns true if a has a total order less than b. The total order is defined such that it acts like < for normal numbers, but also provides a total order for values that < doesn't work as well for (including equal values that have different representations, like -0 < +0 in the total_order, though they are == in the normal order). total_order defines positive NaNs to be greater than infinity (and negative NaNs less than negative infinity), with NaNs of differing representations being ordered by their representation, which puts quiet NaNs on the extremes, beyond the signaling NaNs.
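
One way to get this ordering in today's Python, without touching sorted, is a sort key built from the bit pattern of the float. This is a sketch that assumes the usual 64-bit IEEE binary format; total_order_key is a made-up name, not an existing function.

```python
import math
import struct


def total_order_key(x):
    """Map a float to an int that sorts in IEEE total order (sketch)."""
    # Reinterpret the 8 bytes of the double as a signed 64-bit integer.
    (bits,) = struct.unpack("<q", struct.pack("<d", x))
    # Floats with the sign bit set come out negative, but a larger
    # magnitude gives a larger bit pattern, so their order is reversed.
    # Flipping the low 63 bits puts them back in ascending order.
    return bits ^ 0x7FFFFFFFFFFFFFFF if bits < 0 else bits


pos_nan = math.copysign(math.nan, 1.0)
neg_nan = math.copysign(math.nan, -1.0)
values = [pos_nan, 1.0, 0.0, -0.0, math.inf, -math.inf, neg_nan]

# Resulting order: -nan, -inf, -0.0, 0.0, 1.0, inf, +nan
ordered = sorted(values, key=total_order_key)
```

Because the key is a plain int, the comparison itself never sees a NaN, and -0.0 lands before +0.0 as the total order requires.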

To do this, float would need to define a dunder (maybe __ltto__) for total order compare, which sorted would use instead of __lt__ if it exists (and either sorted does the fallback, or object just defaults this new dunder to call __lt__ if not overridden). Having object do the fallback would allow classes like set to remove this new dunder so sorted generates an error if you try to sort them, since in general, sets don't provide anything close to a total order with <.
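
A rough sketch of the "sorted does the fallback" variant. The built-in float can't be given a new dunder from Python code, so a float subclass stands in for it. TotalFloat, _less and total_sorted are all hypothetical names, and __ltto__ is only the name floated above, not a real protocol.

```python
import functools
import math
import struct


class TotalFloat(float):
    """Hypothetical float subclass carrying the proposed __ltto__ hook."""

    def _key(self):
        # Same bit-pattern trick as a total-order sort key for doubles.
        (bits,) = struct.unpack("<q", struct.pack("<d", self))
        return bits ^ 0x7FFFFFFFFFFFFFFF if bits < 0 else bits

    def __ltto__(self, other):
        return self._key() < TotalFloat(other)._key()


def _less(a, b):
    # Prefer the proposed hook; fall back to __lt__ when it is absent.
    ltto = getattr(type(a), "__ltto__", None)
    if ltto is not None:
        return ltto(a, b)
    return a < b


def total_sorted(iterable):
    """Stand-in for a sorted() that consults __ltto__ first."""

    def compare(a, b):
        if _less(a, b):
            return -1
        return 1 if _less(b, a) else 0

    return sorted(iterable, key=functools.cmp_to_key(compare))


nan = math.copysign(math.nan, 1.0)  # a NaN with the sign bit clear
floats = total_sorted([TotalFloat(nan), TotalFloat(2.0), TotalFloat(1.0)])
words = total_sorted(["b", "c", "a"])  # no __ltto__, so plain < is used
```

The sketch does not cover the other variant, where object supplies the default and a class like set removes it to make sorting an error.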

This would mean that sorted would work with NaNs, but for median most NaNs are treated as more positive than infinity, so the median is biased, but at least you don't get absurd results. My expectation would be that, if written in C or assembly, the total order comparison of two floats would be fast, maybe not as fast as a simple compare, but just a couple of machine instructions, so small compared to the other code in the loop.
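
A small illustration of that bias, using a made-up helper nan_last that places every NaN after all other values, which is where a total order puts positive NaNs:

```python
import math
import statistics


def nan_last(data):
    """Hypothetical helper: sort with every NaN placed after +inf."""
    return sorted(
        data,
        key=lambda x: (math.isnan(x), 0.0 if math.isnan(x) else x),
    )


data = [1.0, math.nan, 2.0, math.nan, 3.0]

ordered = nan_last(data)             # [1.0, 2.0, 3.0, nan, nan]
middle = ordered[len(ordered) // 2]  # element a median would pick: 3.0

# Median of the same data with the NaNs filtered out first: 2.0
clean_median = statistics.median(x for x in data if not math.isnan(x))
```

The NaNs pull the middle position upward, so the result is a real data value rather than nonsense, but it is higher than the median of the values that are actually numbers.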



--
Richard Damon
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/RWLM2PSHHWNZRLEEG2TECAGCQJZCNZEG/