[Python-ideas] Re: Fix statistics.median()?

Andrew Barnert via Python-ideas Mon, 30 Dec 2019 15:37:34 -0800

> On Dec 30, 2019, at 14:05, Richard Damon <rich...@damon-family.org> wrote:
> 
> On 12/30/19 4:22 PM, Andrew Barnert via Python-ideas wrote:
>>> On Dec 30, 2019, at 06:50, Richard Damon <rich...@damon-family.org> wrote:
>>> On 12/30/19 12:06 AM, Andrew Barnert wrote:
>>>>> On Dec 29, 2019, at 20:04, Richard Damon <rich...@damon-family.org> wrote:
>>>>> Thus your total_order, while not REALLY a total order, is likely good 
>>>>> enough for most purposes.
>>>> Well, it is a total order of equivalence classes (with all IEEE-equal 
>>>> values being equivalent, all negative NaNs being equivalent, and all 
>>>> positive NaNs being equivalent).
>>>> But whether it’s good enough for most purposes depends on what those 
>>>> purposes are.
>>>> I’m having a hard time imagining any purposes where “some well-defined but 
>>>> arbitrary, unnatural, and idiosyncratic total order of equivalence classes 
>>>> of the binary64 values” is of any use at all. Maybe that’s just a failure 
>>>> of my imagination. But “the specific well-defined but arbitrary, unnatural 
>>>> order that almost every other similar bit of code in every language is 
>>>> likely to use” seems a lot more likely to be useful.
>>>> (And, needless to say, so does “this specific order that I defined 
>>>> myself”. I don’t know why the OP wanted to treat all NaNs as equivalents 
>>>> values greater than inf, but presumably he does know, and that’s fine. So, 
>>>> I don’t think that needs to be built into the module as one of the 
>>>> options, but I don’t see why he shouldn’t be able to specify it 
>>>> explicitly.)
>>> But it DOESN'T make all positive NaNs equivalent, as they do not compare 
>>> 'equal'. Neither is less than the other, but they don't compare equal. If 
>>> you use a sort function (not the one that Python currently uses) that uses 
>>> an equality test as part of its sort (rather than just letting the 
>>> falsehood of a < b and b< a imply that a == b) then it will break and 
>>> perhaps return illogical results. This is getting into the weeds of corner 
>>> cases, so may not be important.
>> Yes it does. If all positive NaNs sort the same with this key function and 
>> the Python sorted function, they form an equivalence class under that 
>> function. The fact that they wouldn’t form an equivalence class under a 
>> different function is irrelevant. It’s like arguing that 1 and 8 can’t be in 
>> the same equivalence class for mod 7 addition because they’re not equivalent 
>> under mod 11 addition.
>> And I don’t even know what you’re trying to argue here. That we shouldn’t 
>> allow this key function because it wouldn’t provide a total order if someone 
>> monkeypatched builtins to replace sorted? And therefore we should invent 
>> some “better” order that nobody has asked for and offer that?
> 
> The issue is that the functions generates keys of the form (1, nan) for NaNs, 
> and two of these do not compare equal, so not an equivalency class.


Forming an equivalence class is always under a specific relation, not absolute. 
The fact that they don’t form an equivalence class under == is no more relevant 
than the fact that they do form an equivalence class under is. There are a huge 
number or potential relations over binary64 x binary64, and many of them are 
equivalence relations and the rest are not, and the only thing that matters is 
the one actually used by sorted. Which is not is or == or any of the zillions 
or other possible relations, it’s `not (a<b or b<a)‘.

For floats with no key, that is not an equivalence relation. For example, 2 R 
nan, and nan R 3, but not 2 R 3, so R is not transitive. And that’s why floats 
with NaNs don’t get sorted correctly without a key. That’s the problem you’re 
apparently trying to solve

And this key does solve that problem. key(2) R key(NaN) doesn’t hold, so the 
original intransitivity doesn’t arise. And neither does any other; it’s 
transitive over the whole domain. And it’s also reflexive and symmetric. And 
that’s the definition of an equivalence relation. And any equivalence relation 
partitions the set into equivalence classes, again by definition. Here, the 
(key-processed) NaNs are in one class, the other (key-processed) values are in 
the same classes they would have been in with no key function, which also 
happens to be the same classes they’d be in with == on the unkeyed values, and 
these classes are totally ordered under <. And that’s what sorted requires. QED.

That doesn’t mean it’s a _useful_ solution. If I want a specific total ordering 
over specific equivalence classes and you give me a different total ordering 
over different equivalence classes, that isn’t useful. But not because it isn’t 
a total ordering over equivalence classes. It is, it’s just the wrong one. So 
offering a different total ordering over equivalence classes that still isn’t 
the one I asked for is no better and no worse: it’s equally valid and equally 
useless.

> IF the sort routine only uses <, and assumes that if neither a < b or b < a 
> then they are equal, but not all sort routine use that method. I think the 
> current sorted function just uses <, so it works as a key, it just isn't 
> guaranteed to work for all sort algorithms.

But who cares that a key that median is passing to sorted wouldn’t work with 
other sort algorithms that it will never be used with? That doesn’t affect 
whether it works with sorted.

> Basically, if a sort ever compares with a Nan using normal comparison rules, 
> you have opened the door to problems.

But because median is not ever using a sort that will ever compare a NaN with 
==, that isn’t relevant.

> In this case we will only ever compare a NaN to another NaN, and depending on 
> the sort routine, it still might work, and if the algorithm never uses an 
> equality test to see that they aren't equal (and few sorts actually test for 
> equality) it won't cause a problem.

> One simple solution is to just make the nan tuples be (1, 0) or (-1, 0) and 
> just make all NaNs a real equivalence class. If you want NaN to sort by their 
> representation, then you could could make that the second term of the tuple 
> (as an int)

Solution to what? You just said there’s no problem, so what problem are you 
trying to solve? The simplest solution to “it won’t cause a problem” is to 
change nothing. 

If you’re trying to solve the problem that someone could theoretically 
monkeypatch builtins.sorted and make it use a different algorithm, you haven’t 
solved that, because someone could just as easily monkeypatch builtins.sorted 
and make it always return [3, 1, 2] for any input. If “what if someone replaces 
the sorted function?” isn’t consenting adults territory, I don’t know what it.

>>> I will admit that if you want to sort a list that contains a mix of floats, 
>>> Decimals, and Rationals, then the tuple approach might be better then the 
>>> conversion to a 64 bit int as the key, since that doesn't work for the 
>>> other types except by converting them to float first.
>> Better for what?
>> If there’s no good use for the OP’s order or the tuple order or IEEE total 
>> order, we shouldn’t try to provide any of them; figuring out which one 
>> “makes sense” or coming up with one that “makes more sense” when we don’t 
>> even know what case we’re trying to make sense of is pointless. Inventing a 
>> new order that nobody asked for and nobody can even imagine a use for 
>> doesn’t help anyone, it’s just more code to write and maintain and one more 
>> option to confuse users with that benefits nobody. It doesn’t matter what 
>> properties that new order has.
>> If there is a good use for any of them, on the other hand… inventing a 
>> different order that isn’t one any of them asked for still doesn’t help 
>> anyone. We need to provide what they actually asked for, or give them a way 
>> to do so (most obviously, with a key parameter), or just reject their use as 
>> unimportant.
> Building the 64 bit int key to sort floats by IEEE total_order only works if 
> all the values are floats, as that key won't compare properly with other 
> types of numbers. If your data does have other types of numbers, than making 
> the tuple to force NaNs into their right place, and let the normal value 
> comparison work for normal numbers (or infinities which I think get compared 
> right).

There are a huge number of possible total orders of equivalence classes you 
could induce over the floats, and there’s not one that’s “the right one”. 
They’re all valid, but if someone needs a specific one, only that one is the 
one they needs.

>>> As to why NaNs are greater than infinity, that comes from the IEEE 
>>> specification on the total_order relationship of floating point numbers.
>> So what? If you care about IEEE total order, you want IEEE total order, not 
>> an unspecified idiosyncratic order that’s similar to it in some ways.
> 
> The question was why were NaNs sorted this way, and the basic answer is to 
> get a usable order for sorting, you have to put them somewhere, and the IEEE 
> total order definition is as good as any. Ultimately, no position would make 
> sense universally.  The order that IEEE chose is 1) a simple order to create 
> (being based fairly simply on the representation for binary floating point), 
> and 2) putting NaNs at the extreme may make it easier to remove them if that 
> is what is desired.
> 
> The order he generates is very close to the IEEE total order, the difference 
> are:
> 
> 1) It doesn't seperate -0 for +0, which shouldn't matter for most 
> applications.
> 
> 2) It doesn't provide an order between NaNs, but unless you are taking 
> special measures to distinguish the NaNs anyway, that doesn't really matter.

And it also doesn’t distinguish equal but distinct-bit-pattern subnormal values.

Sure, most of these things don’t come up very often. But that’s just why 
needing IEEE total order doesn’t come up very often. If it does come up, it’s 
because one or all of these things matter, so offering something that’s kind of 
similar to IEEE total order doesn’t help.

If the argument is that nobody will ever need IEEE total order, that’s fine. 
The person who asked for it already said he can’t think of a real use for it. 
And neither can anyone else. But then this solution is even less useful—it’s 
failing to solve a problem that didn’t even exist in the first place. If the 
only things people will ever need to do with nans and median is ignore, raise, 
poison, or backward compatibility with whatever Python 3.8  did, it doesn’t 
matter what fifth option you invent, it’s not going to solve anyone’s actual 
problem.
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/OSYT4GLO62HFNP4HD7GVMEEGQR3EHIRE/
Code of Conduct: http://python.org/psf/codeofconduct/

[Python-ideas] Re: Fix statistics.median()?

Reply via email to