On 12/29/19 1:16 AM, Christopher Barker wrote:
OMG! This is fun and all, but:
On Sat, Dec 28, 2019 at 9:11 PM Richard Damon
<rich...@damon-family.org> wrote:
... practicality beats purity.
And practically, everyone in this thread understands what a float is,
and what a NaN is and is not.
Richard: I am honestly confused about what you think we should do.
Sure, you can justify why the statistics module doesn’t currently
handle NaN’s well, but that doesn’t address the question of what it
should do.
As far as I can tell, the only reasons for the current approach are
ease of implementation and performance. Which are fine reasons, and
why it was done that way in the first place.
But there seems to be (mostly) a consensus that it would be good to
better handle NaNs in the statistics module.
I think the thing to do is decide what we want NaNs to mean: should
they be interpreted as missing values or, essentially, as errors?
You’ve made a good case that None is the “right” thing to use for
missing values — and could be used with int and other types. So yes,
if the statistics module were to grow support for missing values, that
could be the way to do it.
Which means that NaNs should either raise an exception or return NaN
as a result. Those are options that are better than the current state.
Nevertheless, I think there is a practical argument for NaN-as-missing
value. Granted, it is widely used in other systems because it can be
stored in a float data type, and that is not required for Python. But
it is widely used, so is familiar to many.
But if we don’t go that route, it would be good to provide
NaN-filtering routines in the statistics module — as the related
thread shows, NaN detection is not trivial.
Frankly, I’m also confused as to why folks seem to think this is an
issue to be addressed in the sort() functions — those are way too
general and low level to be expected to solve this. And it would be a
much heavier lift to make a change that central to Python anyway.
-CHB
The way I see it, median doesn't handle NaNs in a reasonable way
because sorted doesn't handle them. sorted doesn't handle them because
it is easy and quick not to, and handling them would require defining an
official meaning for NaN when there are multiple reasonable meanings.
The reason to push most solutions down to sorted is that, except for
ignoring NaNs (which can easily be implemented as a data filter on the
input of the function), the exact same problem occurs in multiple
functions (in the statistics module, that would include quantiles). So
by the principle of DRY, sorted is the logical place to implement the
solution (if we don't implement the solution as an input filter).
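
To make the problem concrete, here is a small sketch (the helper name
medians_over_orderings is mine, purely for illustration). It computes
the median of the same three values in every possible input order. With
a NaN present, the answer can depend on the order, because every
ordering comparison against NaN is False.

```python
import math
from itertools import permutations
from statistics import median

nan = math.nan

# Every ordering comparison involving NaN is False, so sorted() has no
# consistent place to put it.
comparisons = [nan < 1.0, nan > 1.0, nan == nan, 1.0 < nan]


def medians_over_orderings(values):
    """Return the median computed for each ordering of the same values."""
    return [median(list(p)) for p in permutations(values)]


results = medians_over_orderings([1.0, 2.0, nan])
for ordering, m in zip(permutations([1.0, 2.0, nan]), results):
    print(ordering, "->", m)
```

The printed medians are not all the same value, even though the data
set is identical each time.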
At its beginning, the statistics module disclaims being a complete,
all-encompassing statistics package, and suggests using one if you need
more advanced features, which I would consider most processing of NaN
to be included in. One big reason NOT to fix the issue with NaNs in
median is that such a fix likely has a measurable impact on the
processing of the median. I suspect that the simplest solution, and one
that doesn't impact other uses, would be simple filter functions (and
perhaps median could be defined with an argument for which filter
function to use, with a None option that would be fairly quick). One
filter would remove NaNs (or None), one would throw an exception if
there is a NaN, and another would just return the sequence [nan] if
there are any NaNs in the input sequence (so the median would be nan).
The same options could be added to other operations like quantiles,
which has a similar issue, and made available to the program for other
uses.
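
A minimal sketch of the three filters, assuming float NaNs only (as the
related thread notes, detecting NaN across Decimal, Fraction and other
types is harder). All of the names here (drop_missing, raise_on_nan,
poison_on_nan, filtered_median, nan_policy) are hypothetical; nothing
like this exists in the statistics module today.

```python
import math
from statistics import median


def _is_nan(x):
    # Assumption: only float NaNs are detected; other numeric types are
    # treated as ordinary values.
    return isinstance(x, float) and math.isnan(x)


def drop_missing(data):
    """Remove NaNs (and None) from the data."""
    return [x for x in data if x is not None and not _is_nan(x)]


def raise_on_nan(data):
    """Raise ValueError if any NaN is present; otherwise pass data on."""
    data = list(data)
    if any(_is_nan(x) for x in data):
        raise ValueError("NaN in input data")
    return data


def poison_on_nan(data):
    """Return [nan] if any NaN is present, so the result is nan."""
    data = list(data)
    return [math.nan] if any(_is_nan(x) for x in data) else data


def filtered_median(data, nan_policy=None):
    """Median with an optional pre-filter; None skips filtering."""
    if nan_policy is not None:
        data = nan_policy(data)
    return median(data)


print(filtered_median([1.0, math.nan, 3.0], drop_missing))
print(filtered_median([1.0, math.nan, 3.0], poison_on_nan))
```

With nan_policy=None the call costs no more than median does now, and
the same filters could be passed to other functions unchanged.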
There is one other option that might make it possible to fix sorted:
the IEEE spec defines another comparison function that sorted could use
to order the numbers, called something like total_order(a, b), which
returns true if a orders below b in the total order. The total order is
defined so that it acts like < for normal numbers, but also orders
values that < doesn't work as well for (including equal values that
have different representations: -0 < +0 in the total order, although
they are == in the normal order). total_order defines positive NaNs to
be greater than infinity (and negative NaNs less than negative
infinity), with NaNs of differing representations ordered by their
representation, which puts quiet NaNs on the extremes, beyond the
signaling NaNs.
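
In today's Python, one way to approximate this ordering is a sort key
built from the bit pattern of the double. This is only a sketch under
my own assumptions: the name total_order_key is made up, and it relies
on floats being IEEE 754 binary64 (true on the usual platforms).

```python
import math
import struct


def total_order_key(x):
    """Map a float to an integer that sorts in IEEE-style total order."""
    # Reinterpret the 64-bit double as an unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", float(x)))[0]
    if bits >> 63:
        # Negative sign bit: flip every bit, so that larger magnitudes
        # sort lower.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Non-negative: set the top bit so these sort above all negatives.
    return bits | (1 << 63)


pos_nan = math.copysign(math.nan, 1.0)
neg_nan = math.copysign(math.nan, -1.0)
data = [pos_nan, 1.0, -math.inf, 0.0, -0.0, math.inf, neg_nan, -1.0]
ordered = sorted(data, key=total_order_key)
print(ordered)
```

The negative NaN lands before -inf, -0.0 before 0.0, and the positive
NaN after inf, regardless of the input order.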
To do this, float would need to define a dunder (maybe __ltto__) for
total order comparison, which sorted would use instead of __lt__ if it
exists (and either sorted does the fallback, or object just defaults
this new dunder to call __lt__ if not overridden). Having object do the
fallback would allow classes like set to remove this new dunder so that
sorted generates an error if you try to sort them, since in general
sets don't provide anything close to a total order with <.
This would mean that sorted would work with NaNs. For median, most NaNs
would be treated as more positive than infinity, so the median is
biased, but at least you don't get absurd results. My expectation is
that, if written in C or assembly, the total order comparison of two
floats would be fast: maybe not as fast as a simple compare, but just a
few machine instructions, so small compared to the other code in the
loop.
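
The hook and its effect on the median can be mocked up in pure Python.
Everything here is hypothetical: __ltto__ is not a real dunder, and
TotalFloat, UnorderedSet, total_sorted and total_median are names I
made up, since the built-in float and set can't be patched. This
version has sorted do the fallback, and treats __ltto__ = None as
"removed" (in the same way __hash__ = None works).

```python
import functools
import math
import struct

_MISSING = object()


def _total_order_bits(x):
    # Assumes IEEE 754 binary64 floats; see the comments in _less below.
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    return bits ^ 0xFFFFFFFFFFFFFFFF if bits >> 63 else bits | (1 << 63)


class TotalFloat(float):
    """float subclass carrying the hypothetical __ltto__ hook."""

    def __ltto__(self, other):
        return _total_order_bits(self) < _total_order_bits(other)


class UnorderedSet(set):
    """Stands in for set opting out of total ordering."""
    __ltto__ = None


def _less(a, b):
    hook = getattr(type(a), "__ltto__", _MISSING)
    if hook is _MISSING:
        return a < b  # no hook defined: fall back to __lt__
    if hook is None:
        raise TypeError(f"{type(a).__name__} has no total order")
    return hook(a, b)


def total_sorted(iterable):
    """sorted() variant that prefers __ltto__ over __lt__."""
    def cmp(a, b):
        return -1 if _less(a, b) else (1 if _less(b, a) else 0)
    return sorted(iterable, key=functools.cmp_to_key(cmp))


def total_median(data):
    ordered = total_sorted(data)
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    mid = n // 2
    return ordered[mid] if n % 2 else (ordered[mid - 1] + ordered[mid]) / 2


nan = math.copysign(math.nan, 1.0)  # force a positive NaN
values = [TotalFloat(v) for v in (1.0, nan, 2.0, 3.0, nan)]
print(total_median(values))
```

The two positive NaNs sort to the top, so the median comes out as 3.0
rather than the 2.0 you would get from the three real values: biased
upward, but the same for every input order.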
--
Richard Damon
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/RWLM2PSHHWNZRLEEG2TECAGCQJZCNZEG/
Code of Conduct: http://python.org/psf/codeofconduct/