[Python-ideas] Re: Fix statistics.median()?

Steven D'Aprano Sun, 29 Dec 2019 23:03:16 -0800

On Sun, Dec 29, 2019 at 06:22:49PM -0500, Richard Damon wrote:

> The way I see it, is that median doesn't handle NaNs in a reasonable 
> way, because sorted doesn't handle them, because it is easy and quick to 
> not handle NaN, and to handle them you need to define an Official 
> meaning for them, and there are multiple reasonable meanings.

There is already an official way to handle NAN comparisons, and Python
follows that official standard, and so does sorted.

It might be annoying, it might be inconvenient, it might be surprising
to people's intuition (if they only think about values with a total
order and not the possibility of partial order) but the way NANs are
sorted seemingly at random throughout a sequence is a logical and
correct consequence of the semantics of NANs and the way sorting
operates.

It might be surprising, but what are the alternatives?

We can:

(1) Detect sets of values which make a partial order and raise an
exception or warning. But that's *really* expensive, it is at least
O(N**2) and requires comparing every element with every other element.

(2) Don't bother exhaustively comparing every element to every other
element, but when sorting, each time you compare two elements, check for
a partial order between those two elements alone.

But that too increases the cost of sorting, it has many false negatives
(you miss a lot of partial ordered sets), and it breaks the invariant
that sorting requires only the less than operator.

See https://bugs.python.org/issue36095 for more details.

(3) Don't try to detect all sets of values with a partial order, but
single out NANs only.

That violates the "special cases aren't special enough" koan in the Zen,
and it will slow down sorting for everyone even if they don't have NANs,
but okay, maybe this special case *is* special enough. So what do we do
now?

(4) There's a NAN in your data, so sort() will raise.

That's a bit extreme. It will annoy and frustrate people for whom the
odd placement of NANs is not a problem they care about. E.g. if I want
to process a set of numbers in order from smallest to largest, I don't
care if a NAN happens to get placed between 3.5 and 4.25.

(5) Move all the NANs to the front of the sequence! No the end of the
sequence! Half to the front and half to the back! Order them according
to their sign bit! No, order them according to the payload! Or both!

That's already six ways to order NANs. Let's focus on the two obvious
ones:

(6) Move all NANs to the start of the list. Or the end of the list.

Either way, you artifically decrease or increase the median:

[NAN, NAN, NAN, NAN, 1, 2, 3, 4, 5, 6, 7] # median 2
[1, 2, 3, 4, 5, 6, 7, NAN, NAN, NAN, NAN] # median 6

(7) No no no, once you have sorted the NANs to one end or the other,
median can check for the existance of even a single NAN, and then raise
or return NAN or strip them out or do whatever you like.

Great, but we can do the same without changing sort in any way. Doing
this in sort when it could be done elsewhere is the wrong place, and
probably less efficient too. Doing it in median, rather than sort, lets
you short-circuit as soon as you see a single NAN.

So changing the way sort() doesn't seem very promising. All the
alternatives have serious downsides that outweigh the benefits.

(8) How about fixing NANs? (For some definition of "fixing".)

That will violate the standard, and will change the behaviour of any
code that uses comparisons, some potentially far-reaching changes in
behaviour. And it glosses over the question of *how* we should change
NANs to order differently. Raise? Less than everything? Greater than
everything? Sort between negatives and zero? Justify your choice.

> One big reason to NOT fix the issue with NaNs in median is that such a
> fix likely has a measurable impact ction the processing of the median.

I just did some quick and dirty tests to try to estimate the likely cost
of NAN processing on median(). I get a slowdown of between 4 x and 8 x
by applying NAN processing.

For mean(), I get a slowdown of a factor of between 1.4 to 1.7 times.

--
Steven
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/XYH4LZDPY5JFGWIGM56VAHUM374KNLJT/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Fix statistics.median()?

Reply via email to