Re: Sorting NaNs

2018-06-11 Thread Anders Munch

Steven D'Aprano:


It is not a guess if the user explicitly specifies that as the behaviour.


If that was the context, sure, no problem.

- Anders

--
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-10 Thread Steven D'Aprano
On Sun, 10 Jun 2018 21:28:02 +0200, Anders Munch wrote:

> Richard Damon wrote:
> 
>> The two behaviors that I have heard suggested are:
>>
>> 1) If any of the inputs are a NaN, the median should be a NaN.
>> (Propagating the NaN as indicator of a numeric error)
>>
>> 2) Remove the NaNs from the input set and process what is left. If
>> nothing, then return a NaN (treating NaN as a 'No Data' placeholder).
> 
> 3) Raise an exception.
> 
> I can't believe anyone even suggested 2).  "In the face of ambiguity,
> refuse the temptation to guess."

It is not a guess if the user explicitly specifies that as the behaviour.

It would be no more of a guess than if the user called

data = [x for x in data if not math.isnan(x)]

on their data first.



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-10 Thread Richard Damon
On 6/10/18 3:28 PM, Anders Munch wrote:
> Richard Damon wrote:
>
>> The two behaviors that I have heard suggested are:
>>
>> 1) If any of the inputs are a NaN, the median should be a NaN.
>> (Propagating the NaN as indicator of a numeric error)
>>
>> 2) Remove the NaNs from the input set and process what is left. If
>> nothing, then return a NaN (treating NaN as a 'No Data' placeholder).
>
> 3) Raise an exception.
>
> I can't believe anyone even suggested 2).  "In the face of ambiguity,
> refuse the temptation to guess."
>
> regards, Anders
>
Many people doing statistics (mis-)use 'NaN' as a flag for 'missing data
for this record'. If you want the statistic of a sample as an estimate
of the statistic of the population, then omitting 'missing data' can be
a reasonable option. I will agree that I see issues with this attitude,
but it isn't uncommon.

-- 
Richard Damon

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-10 Thread Anders Munch

Richard Damon wrote:


The two behaviors that I have heard suggested are:

1) If any of the inputs are a NaN, the median should be a NaN.
(Propagating the NaN as indicator of a numeric error)

2) Remove the NaNs from the input set and process what is left. If
nothing, then return a NaN (treating NaN as a 'No Data' placeholder).


3) Raise an exception.

I can't believe anyone even suggested 2).  "In the face of ambiguity, 
refuse the temptation to guess."


regards, Anders

--
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-08 Thread Gregory Ewing

Peter Pearson wrote:

What applications would have to worry about colliding floats?  I
don't know.  I'm coming from cryptology, where worrying about such
things becomes a reflex.


If collisions are something to be feared, then you're into
hash territory, where you're probably going to want *much*
more than even 52 bits.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-08 Thread Steven D'Aprano
On Fri, 08 Jun 2018 17:45:58 +0000, Peter Pearson wrote:

> What bothered me was my feeling that a "reasonable observer"
> would expect the random-float population to be much larger than 2**32,
> and the probably-collision-free sample size to be accordingly much
> larger than 2**16, which is, after all, small enough to appear in many
> applications.

People's intuition about probability is crap. Hence the Birthday Paradox 
I already linked to, and the Monty Hall Problem, which you can google 
for, and the Prosecutor's Fallacy:

http://www.conceptstew.co.uk/pages/prosecutors_fallacy.html

which sometimes (often?) leads to real miscarriages of justice:

https://en.wikipedia.org/wiki/Sally_Clark




-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-08 Thread Peter Pearson
On Fri, 8 Jun 2018 02:15:02 +0000 (UTC), Steven D'Aprano wrote:
> On Thu, 07 Jun 2018 20:43:10 +0000, Peter Pearson wrote:
[snip]
>> 
>> But gosh, if there are only 2**32 different "random" floats, then you'd
>> have about a 50% chance of finding a collision among any set of 2**16
>> samples.  Is that really tolerable?
>
> Why wouldn't it be? It would be shocking if a sufficiently large sequence 
> of numbers contained no collisions at all: that would imply the values 
> were very much NON random.

[snip]

> . . . I understand that Python's Mersenne Twister implementation 
> is based on 64-bit ints.

OK, I'll relax, particularly since Michael Lamparski's experiment
strongly indicates that random floats are drawn from a population much
larger than 2**16.

You're completely correct, of course, in noting that an absence of
collisions would condemn the random-number generator just as badly
as an excess.  What bothered me was my feeling that a "reasonable
observer" would expect the random-float population to be much larger
than 2**32, and the probably-collision-free sample size to be accordingly
much larger than 2**16, which is, after all, small enough to appear
in many applications.

Exactly what the "reasonable observer" would expect that population to
be, I don't know.  To a mathematician, there's zero chance of collision
in any finite sample of real numbers, or even just rational numbers; but
I don't think anybody would expect that from a computer.  When I picture
the diligent software engineer asking himself, "Wait, how large can I
make this sample before I'll start seeing collisions," I imagine his
first guess is going to be the size of a float's mantissa.

What applications would have to worry about colliding floats?  I
don't know.  I'm coming from cryptology, where worrying about such
things becomes a reflex.

-- 
To email me, substitute nowhere->runbox, invalid->com.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-08 Thread Chris Angelico
On Fri, Jun 8, 2018 at 9:37 PM, Steven D'Aprano
 wrote:
> On Fri, 08 Jun 2018 12:23:40 +1000, Chris Angelico wrote:
>
>> On Fri, Jun 8, 2018 at 12:15 PM, Steven D'Aprano
>>  wrote:
>>> If you truly were limited to 2**32 different values (we're not), then
>>> it would be exactly right and proper to expect a collision in 2**16
>>> samples. Actually, a lot less than that: more like 78000.
>>>
>>> https://en.wikipedia.org/wiki/Birthday_problem
>>
>> 2**16 is 65536, so the figure you give is pretty close to the
>> rule-of-thumb estimate that the square root of your pool is where you're
>> likely to have collisions.
>
> D'oh!
>
>
> I mean, I knew that, I was just testing to see if anyone was paying
> attention.
>

Plus, you were making another very clear point: Humans are *terrible*
at estimating exponentiation. Unless you've memorized specific values
(2**16 and 2**32 are fairly common and familiar), it's really hard to
estimate the value of x**y. Want to hazard a guess as to what
0.99**100 is? (That's the chances that a one-in-a-hundred event won't
happen after a hundred opportunities.) What about 85**8? (Theoretical
password entropy if you have uppercase, lowercase, digit, symbol.)
What is the smallest n such that 26**n > 85**8? (That's the length of
monocase password required for equivalent theoretical entropy.)
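
(For the curious, a quick sketch of mine -- not part of the original
message -- that simply evaluates those expressions instead of guessing;
the variable names are arbitrary:)

p_never = 0.99 ** 100       # chance the 1-in-100 event never happens: ~0.366
combos_85_8 = 85 ** 8       # 2724905250390625, roughly 2.7e15 combinations
n = 1                       # smallest monocase length with at least as many
while 26 ** n <= combos_85_8:
    n += 1
print(p_never, combos_85_8, n)   # n comes out to 11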

Thank you for that elegant demonstration of an important fact about humans. :-)

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-08 Thread Steven D'Aprano
On Fri, 08 Jun 2018 12:23:40 +1000, Chris Angelico wrote:

> On Fri, Jun 8, 2018 at 12:15 PM, Steven D'Aprano
>  wrote:
>> If you truly were limited to 2**32 different values (we're not), then
>> it would be exactly right and proper to expect a collision in 2**16
>> samples. Actually, a lot less than that: more like 78000.
>>
>> https://en.wikipedia.org/wiki/Birthday_problem
> 
> 2**16 is 65536, so the figure you give is pretty close to the
> rule-of-thumb estimate that the square root of your pool is where you're
> likely to have collisions.

D'oh!


I mean, I knew that, I was just testing to see if anyone was paying 
attention.





-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-08 Thread Chris Angelico
On Fri, Jun 8, 2018 at 12:15 PM, Steven D'Aprano
 wrote:
> If you truly were limited to 2**32 different values (we're not), then it
> would be exactly right and proper to expect a collision in 2**16 samples.
> Actually, a lot less than that: more like 78000.
>
> https://en.wikipedia.org/wiki/Birthday_problem

2**16 is 65536, so the figure you give is pretty close to the
rule-of-thumb estimate that the square root of your pool is where
you're likely to have collisions.

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-07 Thread Gregory Ewing

Michael Lamparski wrote:


In any case, it's verifiably not true for CPython.


Yes, CPython uses a particularly good PRNG. You may not be
as lucky using libraries that come with other languages.
A great many PRNG algorithms have been proposed, and a
good proportion of them produce 32-bit ints as their
basic output.

Of course, this isn't necessarily a problem, since
you can always stick two of them together if you need
more bits. But for the majority of purposes, 32 bits
is probably all you need anyway.
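
(A minimal sketch of the two approaches, added for illustration: a single
32-bit draw mapped to [0, 1) versus two draws stuck together to fill a
double's 53-bit significand, in the spirit of the usual MT19937 recipe.
The helper names and the use of random.getrandbits as the stand-in 32-bit
generator are mine:)

import random

def float_from_one_draw(rng32):
    # only 2**32 distinct results are possible
    return rng32() / 2**32

def float_from_two_draws(rng32):
    a = rng32() >> 5                      # keep 27 bits
    b = rng32() >> 6                      # keep 26 bits
    return (a * 2.0**26 + b) / 2.0**53    # 53 significant bits in [0, 1)

rng32 = lambda: random.getrandbits(32)
print(float_from_one_draw(rng32), float_from_two_draws(rng32))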

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-07 Thread Steven D'Aprano
On Thu, 07 Jun 2018 20:43:10 +0000, Peter Pearson wrote:

> On Thu, 07 Jun 2018 19:02:42 +1200, Gregory Ewing wrote:
>> Steven D'Aprano wrote:
>>> But if it were (let's say) 1 ULP greater or less than one half, would
>>> we even know?
>>
>> In practice it's probably somewhat bigger than 1 ULP. A typical PRNG
>> will first generate a 32-bit integer and then map it to a float, giving
>> a resolution coarser than the 52 bits of an IEEE double.
>>
>> But even then, the probability of getting exactly 0.5 is only 1/2^32,
>> which you're not likely to notice.
> 
> But gosh, if there are only 2**32 different "random" floats, then you'd
> have about a 50% chance of finding a collision among any set of 2**16
> samples.  Is that really tolerable?

Why wouldn't it be? It would be shocking if a sufficiently large sequence 
of numbers contained no collisions at all: that would imply the values 
were very much NON random.

If you truly were limited to 2**32 different values (we're not), then it 
would be exactly right and proper to expect a collision in 2**16 samples. 
Actually, a lot less than that: more like 78000.

https://en.wikipedia.org/wiki/Birthday_problem

(For 2**60 values, we'd need about 1.2 billion samples.)
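
(A quick back-of-the-envelope check of those two figures, not from the
original message, using the standard sqrt(2 * N * ln 2) approximation for
a 50% collision chance; the function name is mine:)

import math

def samples_for_even_odds(pool_size):
    return math.sqrt(2 * pool_size * math.log(2))

print(round(samples_for_even_odds(2**32)))   # ~77163, i.e. "more like 78000"
print(round(samples_for_even_odds(2**60)))   # ~1.26e9, i.e. "about 1.2 billion"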

Greg's statement about "a typical PRNG" may or may not be true, depending 
on what counts as "typical". The standard C language PRNG is notoriously 
awful, with a tiny period and lots and lots of correlations between 
values. It's barely random-ish.

But like many other languages, Python uses a more modern random number 
generator, the Mersenne Twister, which passes a battery of statistical 
tests for randomness (including DieHard) and has a very long period of 
2**19937 - 1. I understand that Python's Mersenne Twister implementation 
is based on 64-bit ints.

https://en.wikipedia.org/wiki/Mersenne_Twister




-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-07 Thread Michael Lamparski
On Thu, Jun 7, 2018 at 4:43 PM, Peter Pearson 
wrote:

> But gosh, if there are only 2**32 different "random" floats, then
> you'd have about a 50% chance of finding a collision among any
> set of 2**16 samples.  Is that really tolerable?
> --
> https://mail.python.org/mailman/listinfo/python-list
>

In any case, it's verifiably not true for CPython.

>>> import random
>>> def birthday(b):
...   rand = random.random
...   xs = [rand() for _ in range(2**b)]
...   return len(xs) - len(set(xs))
...
>>> birthday(24)
0
>>> [birthday(24) for _ in range(10)]
[0, 0, 0, 0, 0, 0, 0, 0, 0, 0]

Michael
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-07 Thread Peter Pearson
On Thu, 07 Jun 2018 19:02:42 +1200, Gregory Ewing wrote:
> Steven D'Aprano wrote:
>> But if it were (let's say) 1 ULP greater or less 
>> than one half, would we even know?
>
> In practice it's probably somewhat bigger than 1 ULP.
> A typical PRNG will first generate a 32-bit integer and
> then map it to a float, giving a resolution coarser than
> the 52 bits of an IEEE double.
>
> But even then, the probability of getting exactly 0.5
> is only 1/2^32, which you're not likely to notice.

But gosh, if there are only 2**32 different "random" floats, then
you'd have about a 50% chance of finding a collision among any
set of 2**16 samples.  Is that really tolerable?
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-07 Thread Chris Angelico
On Thu, Jun 7, 2018 at 2:14 PM, Steven D'Aprano
 wrote:
> On Sat, 02 Jun 2018 21:02:14 +1000, Chris Angelico wrote:
>
>> Point of curiosity: Why "> 0.5"?
>
> No particular reason, I just happened to hit that key and then copied and
> pasted the line into the next one.

Hah! The simplicity of it.

>> Normally when I want a fractional
>> chance, I write the comparison the other way: "random.random() < 0.5"
>> has exactly a 50% chance of occurring (presuming that random.random()
>> follows its correct documented distribution). I've no idea what the
>> probability of random.random() returning exactly 0.5 is
>
> Neither do I. But I expect that "exactly 50% chance" is only
> approximately true :-)

Oh, I have no doubt about that. (Though as Gregory says, the
uniformity is based on the PRNG more than on any enumeration of
floats. There are probably floating-point values between 0.0 and 1.0
that can never actually be returned.)

> So given that our mathematically pure(ish) probability of 0.5 for the
> reals has to be mapped in some way to a finite number of floats, I
> wouldn't want to categorically say that the probability remains
> *precisely* one half. But if it were (let's say) 1 ULP greater or less
> than one half, would we even know?
>
> 0.5 - 1 ULP = 0.49999999999999994
>
> 0.5 + 1 ULP = 0.5000000000000001
>
>
> I would love to see the statistical experiment that could distinguish
> those two probabilities from exactly 1/2 to even a 90% confidence
> level :-)

LOL, no kidding. How many RNG rolls would you need before you could
even distinguish between 50-50 and 49-51% chance? How many people have
done any sort of statistical analysis even that accurate?
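
(A rough answer, added as an aside rather than quoted from anyone, using
the normal approximation: the standard error of an observed proportion is
sqrt(p*(1-p)/n), so separating p=0.50 from p=0.49 by k standard errors
needs roughly n >= (k * 0.5 / 0.01)**2 rolls. The helper and its defaults
are mine:)

import math

def rolls_needed(p=0.5, delta=0.01, k=3):
    return math.ceil((k * math.sqrt(p * (1 - p)) / delta) ** 2)

print(rolls_needed(k=2))   # 10000 rolls for a 2-sigma separation
print(rolls_needed())      # 22500 rolls for a 3-sigma separation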

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-07 Thread Gregory Ewing

Steven D'Aprano wrote:
But if it were (let's say) 1 ULP greater or less 
than one half, would we even know?


In practice it's probably somewhat bigger than 1 ULP.
A typical PRNG will first generate a 32-bit integer and
then map it to a float, giving a resolution coarser than
the 52 bits of an IEEE double.

But even then, the probability of getting exactly 0.5
is only 1/2^32, which you're not likely to notice.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-06 Thread Steven D'Aprano
On Sat, 02 Jun 2018 21:02:14 +1000, Chris Angelico wrote:

> Point of curiosity: Why "> 0.5"? 

No particular reason, I just happened to hit that key and then copied and 
pasted the line into the next one.


> Normally when I want a fractional
> chance, I write the comparison the other way: "random.random() < 0.5"
> has exactly a 50% chance of occurring (presuming that random.random()
> follows its correct documented distribution). I've no idea what the
> probability of random.random() returning exactly 0.5 is

Neither do I. But I expect that "exactly 50% chance" is only 
approximately true :-)

My understanding is that given the assumption of uniformity, the 50% 
chance is mathematically true, but in that case, it makes no difference 
whether you go from 0 to 0.5 or 0.5 to 1.0. Mathematically it makes no 
difference whether you include or exclude the end points. In the Real 
numbers, there's an uncountable infinity of points either way.

*But* once you move to actual floats, that's no longer true. There are a 
great many more floats between 0 and 0.5 than between 0.5 and 1.0. 
Including the end points, if we enumerate the floats we get:

0.0 --> 0
0.5 --> 4602678819172646912
1.0 --> 4607182418800017408

so clearly the results of random.random() cannot be uniformly distributed 
over the individual floats. If they were, the probability of getting 
something less than or equal to 0.5 would be 4602678819172646912 / 
4607182418800017407 or a little more than 99.9%.
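
(One way to reproduce those ordinals, added for illustration: reinterpret
the IEEE 754 bit pattern of a double as an unsigned 64-bit integer. The
helper name is mine:)

import struct

def float_ordinal(x):
    return struct.unpack("<Q", struct.pack("<d", x))[0]

print(float_ordinal(0.0))   # 0
print(float_ordinal(0.5))   # 4602678819172646912
print(float_ordinal(1.0))   # 4607182418800017408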

So given that our mathematically pure(ish) probability of 0.5 for the 
reals has to be mapped in some way to a finite number of floats, I 
wouldn't want to categorically say that the probability remains 
*precisely* one half. But if it were (let's say) 1 ULP greater or less 
than one half, would we even know?

0.5 - 1 ULP = 0.49999999999999994

0.5 + 1 ULP = 0.5000000000000001


I would love to see the statistical experiment that could distinguish 
those two probabilities from exactly 1/2 to even a 90% confidence 
level :-)


> but since it
> can return 0.0 and cannot return 1.0, I've just always used less-than.
> (Also because it works nicely with other values - there's a 30% chance
> that random.random() is less than 0.3, etc.) Is there a reason for going
> greater-than, or is it simply that it doesn't matter?

No, no particular reason. If I had thought about it I would have used < 
too, but I didn't.



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Richard Damon
On 6/2/18 7:35 PM, Steven D'Aprano wrote:
> On Sat, 02 Jun 2018 17:28:28 +0200, Peter J. Holzer wrote:
>
>> On 2018-06-02 10:40:48 +0000, Steven D'Aprano wrote:
>>> On Sat, 02 Jun 2018 09:32:05 +0200, Peter J. Holzer wrote:
>>>> Also nope. It looks like NaNs just mess up sorting in an
>>>> unpredictable way. Is this the intended behaviour or just an accident
>>>> of implementation? (I think it's the latter: I can see how a sort
>>>> algorithm which doesn't treat NaN specially would produce such
>>>> results.)
>>>
>>> Neither -- it is a deliberate decision
>> If it was a deliberate decision I would say it was intentional.
>
> As the author of the statistics module, I can absolutely and 
> categorically tell you without even the tiniest doubt that the behaviour 
> of statistics.median with NANs is certainly not intentional.
>
> The behaviour of median with NANs is unspecified. It is whatever the 
> implementation happens to do. If it gives the right answer, great, and if 
> it doesn't, it doesn't.
>
> If somebody cares about this use case enough to suggest an alternative 
> behaviour, I'm willing to consider it.
>
The two behaviors that I have heard suggested are:

1) If any of the inputs are a NaN, the median should be a NaN.
(Propagating the NaN as indicator of a numeric error)

2) Remove the NaNs from the input set and process what is left. If
nothing, then return a NaN (treating NaN as a 'No Data' placeholder).

These are very different in interpretation, and not hard to create as a
wrapper to the function, so maybe not worth adding to the core function
and make it a bit slower.
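
(Illustrative sketches of those two wrappers, not from the original
message; the function names are mine, and isinstance is used so that
non-float inputs pass through untouched:)

import math
from statistics import median

def median_propagate_nan(data):
    # Option 1: any NaN in the input poisons the result.
    data = list(data)
    if any(isinstance(x, float) and math.isnan(x) for x in data):
        return float("nan")
    return median(data)

def median_ignore_nan(data):
    # Option 2: treat NaN as 'no data'; if nothing is left, return NaN.
    cleaned = [x for x in data if not (isinstance(x, float) and math.isnan(x))]
    return median(cleaned) if cleaned else float("nan")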

-- 
Richard Damon

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Richard Damon
On 6/2/18 7:50 PM, Ned Batchelder wrote:
> Careful, "same input" is vague.  In Python2, object() instances are
> compared based on their id, in other words, their memory location. It
> would be easy to overlook the idea that the layout in memory is part
> of whether two inputs are "the same".
>
> I have no idea what factors determine the result of comparing NaNs.
>
> --Ned.

NaNs have VERY well-defined comparison results, they are just strange. If you
compare a NaN to another floating-point value (even another NaN), all
comparisons except 'not equals' are false, and 'not equals' is true.
Thus all NaNs are unequal to all numbers, including themselves, but are
neither greater than nor less than any of them.


-- 
Richard Damon

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Steven D'Aprano
On Sat, 02 Jun 2018 21:51:16 +0100, Ben Bacarisse wrote:

> Paul Rubin  writes:
> 
>> Steven D'Aprano  writes:
>>> it too will mess up sorting in unpredictable ways. So don't do that.
>>
>> Hmm.  GHCi 7.4.2:
>>
>> Prelude> let x = 0.0 / 0.0
>> Prelude> x
>> NaN
>> Prelude> x==x
>> False
>> Prelude> :m Data.List
>> Prelude Data.List> sort [1,2,x,4,5]
>> [1.0,2.0,4.0,5.0,NaN]
> 
> But
> 
>   Prelude Data.List> sort [1,x,2,4,5]
>   [2.0,4.0,5.0,NaN,1.0]


Thanks Ben, and Paul (unfortunately I don't see Paul's messages for some 
reason).

If users of the statistics library would like to see better or different 
support for NANs, I urge you to let me know.





-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Ned Batchelder

On 6/2/18 6:16 PM, Richard Damon wrote:

On 6/2/18 4:51 PM, Ben Bacarisse wrote:

Paul Rubin  writes:


Steven D'Aprano  writes:

it too will mess up sorting in unpredictable ways. So don't do that.

Hmm.  GHCi 7.4.2:

 Prelude> let x = 0.0 / 0.0
 Prelude> x
 NaN
 Prelude> x==x
 False
 Prelude> :m Data.List
 Prelude Data.List> sort [1,2,x,4,5]
 [1.0,2.0,4.0,5.0,NaN]

But

   Prelude Data.List> sort [1,x,2,4,5]
   [2.0,4.0,5.0,NaN,1.0]

and

   Prelude Data.List> sort [1,2,x,4,5,x]
   [NaN,1.0,2.0,4.0,5.0,NaN]

and

   Prelude Data.List> sort [1,2,x,4,5,x,1/0]
   [1.0,2.0,4.0,Infinity,NaN,5.0,NaN]


Not sure what to make of this but at least sorting seems to give a
predictable result.

I suspect it is predictable if you know the algorithm, but I doubt it's
specified nor easily guessable from "outside".

(GHCi, version 8.0.2 here)


The sorting algorithm is almost certainly deterministic (as are the
comparisons with NaN), so given the same input you will get the same
output,
Careful, "same input" is vague.  In Python2, object() instances are 
compared based on their id, in other words, their memory location. It 
would be easy to overlook the idea that the layout in memory is part of 
whether two inputs are "the same".


I have no idea what factors determine the result of comparing NaNs.

--Ned.
--
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Gregory Ewing

Peter J. Holzer wrote:

If it was a deliberate decision I would say it was intentional.


It was a deliberate decision not to define an ordering for NaNs,
but the particular behaviour of sorting them is accidental.

--
Greg
--
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Steven D'Aprano
On Sat, 02 Jun 2018 17:28:28 +0200, Peter J. Holzer wrote:

> On 2018-06-02 10:40:48 +0000, Steven D'Aprano wrote:
>> On Sat, 02 Jun 2018 09:32:05 +0200, Peter J. Holzer wrote:
>> > Also nope. It looks like NaNs just mess up sorting in an
>> > unpredictable way. Is this the intended behaviour or just an accident
>> > of implementation? (I think it's the latter: I can see how a sort
>> > algorithm which doesn't treat NaN specially would produce such
>> > results.)
>> 
>> 
>> Neither -- it is a deliberate decision
> 
> If it was a deliberate decision I would say it was intentional.


As the author of the statistics module, I can absolutely and 
categorically tell you without even the tiniest doubt that the behaviour 
of statistics.median with NANs is certainly not intentional.

The behaviour of median with NANs is unspecified. It is whatever the 
implementation happens to do. If it gives the right answer, great, and if 
it doesn't, it doesn't.

If somebody cares about this use case enough to suggest an alternative 
behaviour, I'm willing to consider it.



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Richard Damon
On 6/2/18 4:51 PM, Ben Bacarisse wrote:
> Paul Rubin  writes:
>
>> Steven D'Aprano  writes:
>>> it too will mess up sorting in unpredictable ways. So don't do that.
>> Hmm.  GHCi 7.4.2:
>>
>> Prelude> let x = 0.0 / 0.0
>> Prelude> x
>> NaN
>> Prelude> x==x
>> False
>> Prelude> :m Data.List
>> Prelude Data.List> sort [1,2,x,4,5]
>> [1.0,2.0,4.0,5.0,NaN]
> But
>
>   Prelude Data.List> sort [1,x,2,4,5]
>   [2.0,4.0,5.0,NaN,1.0]
>
> and
>
>   Prelude Data.List> sort [1,2,x,4,5,x]
>   [NaN,1.0,2.0,4.0,5.0,NaN]
>
> and
>
>   Prelude Data.List> sort [1,2,x,4,5,x,1/0]
>   [1.0,2.0,4.0,Infinity,NaN,5.0,NaN]
>
>> Not sure what to make of this but at least sorting seems to give a
>> predictable result.
> I suspect it is predictable if you know the algorithm, but I doubt it's
> specified nor easily guessable from "outside".
>
> (GHCi, version 8.0.2 here)
>
The sorting algorithm is almost certainly deterministic (as are the
comparisons with NaN), so given the same input you will get the same
output. In that sense you could predict the output by effectively running
the sort and looking at the results, but that presumes you know the exact
implementation of the sort (which you could get from the implementation
source code). This could be called 'predictable', but that is a real stretch.


Because the sort is stable, I would expect that away from the NaNs in
the output the results are likely to be at least mostly sorted, as the
algorithm should mostly be working right except where a NaN is moving
through.

-- 
Richard Damon

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Ben Bacarisse
Paul Rubin  writes:

> Steven D'Aprano  writes:
>> it too will mess up sorting in unpredictable ways. So don't do that.
>
> Hmm.  GHCi 7.4.2:
>
> Prelude> let x = 0.0 / 0.0
> Prelude> x
> NaN
> Prelude> x==x
> False
> Prelude> :m Data.List
> Prelude Data.List> sort [1,2,x,4,5]
> [1.0,2.0,4.0,5.0,NaN]

But

  Prelude Data.List> sort [1,x,2,4,5]
  [2.0,4.0,5.0,NaN,1.0]

and

  Prelude Data.List> sort [1,2,x,4,5,x]
  [NaN,1.0,2.0,4.0,5.0,NaN]

and

  Prelude Data.List> sort [1,2,x,4,5,x,1/0]
  [1.0,2.0,4.0,Infinity,NaN,5.0,NaN]

> Not sure what to make of this but at least sorting seems to give a
> predictable result.

I suspect it is predictable if you know the algorithm, but I doubt it's
specified nor easily guessable from "outside".

(GHCi, version 8.0.2 here)

-- 
Ben.
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Peter J. Holzer
On 2018-06-02 10:40:48 +0000, Steven D'Aprano wrote:
> On Sat, 02 Jun 2018 09:32:05 +0200, Peter J. Holzer wrote:
> > Also nope. It looks like NaNs just mess up sorting in an unpredictable
> > way. Is this the intended behaviour or just an accident of
> > implementation? (I think it's the latter: I can see how a sort algorithm
> > which doesn't treat NaN specially would produce such results.)
> 
> 
> Neither -- it is a deliberate decision

If it was a deliberate decision I would say it was intentional.

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 


-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Chris Angelico
On Sat, Jun 2, 2018 at 9:02 PM, Steven D'Aprano
 wrote:
> The violation of reflexivity is weird though :-)
>
> x = float("nan")
> x == x  # returns False
>
> Don't argue, just accept it :-)

The way I learned it was: there are (of course) a finite number of bit
patterns that can represent NaN, but conceptually, there are an
infinite number of things that are not numbers, and so two different
NaN values could actually represent different non-numbers. Thus, for
consistency, we assume that no two non-numbers are equal, regardless
of trivial things like bit pattern or identity. (The IEEE standard
doesn't have anything about identity, AFAIK, so it's just the bit
pattern.)

Whether that helps or not is up to you. :) Violating reflexivity truly
is weird, especially since some comparisons are described as "x == y"
but implemented as "x is y or x == y". But trying to maintain
reflexivity would make other things even more weird...

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Steven D'Aprano
On Sat, 02 Jun 2018 10:31:37 +0100, Paul Moore wrote:

> 1. The behaviour of comparisons involving NaN values is weird (not
> undefined, as far as I know NaN behaviour is very well defined, but
> violates a number of normally fundamental properties of comparisons) 

Not so much weird. They're just *unordered* -- no order is defined on 
NANs, so if you ask whether one NAN is less than something, the answer is 
always False.

Some maths libraries provide a separate set of comparison functions which 
return a NAN instead of True/False on NAN comparisons, but Python doesn't 
do that.

The violation of reflexivity is weird though :-)

x = float("nan")
x == x  # returns False

Don't argue, just accept it :-)


> 2. The precise behaviour of the sort algorithm use by Python is not
> mandated by the language.

I don't think that is quite right: Python's sort is defined to be a 
stable, lexicographical sort. That's quite well-defined, for values which 
define a total ordering.

http://mathworld.wolfram.com/TotallyOrderedSet.html

The floats excluding NANs are totally ordered; the floats including NANs 
are not.

For values which are not totally ordered, no guarantees can be made. The 
result of sorting will depend on the precise details of the sort 
algorithm, the semantics of how the values compare, and the initial order 
of the values. 


> A consequence of (1) is that there is no meaningful definition of "a
> sorted list of numbers" if that list includes NaN,

Correct.


[...]
> So call it an accident of implementation if you like. Or "sorting a list
> with NaNs in it is meaningless" if you prefer. Or "undefined behaviour"
> if you're a fan of the language in the C standard.


While sorting NANs will result in some unpredictable or arbitrary order, 
that last one does not apply. If the C definition of undefined behaviour 
applied, you could write a two-line script like this:


print("Hello")
result = sorted([1, float("nan"), 2])


and the compiler could *legally* generate code to erase your hard disk 
instead of printing "Hello".



-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Chris Angelico
On Sat, Jun 2, 2018 at 8:40 PM, Steven D'Aprano
 wrote:
> Python's sort algorithm assumes that objects have consistent, sensible
> comparisons. If you write an object like this:
>
>
> class Rubbish:
>     def __eq__(self, other):
>         return random.random() > 0.5
>     def __lt__(self, other):
>         return random.random() > 0.5
>     def __gt__(self, other):
>         return random.random() > 0.5
>
> it too will mess up sorting in unpredictable ways. So don't do that.
>

Point of curiosity: Why "> 0.5"? Normally when I want a fractional
chance, I write the comparison the other way: "random.random() < 0.5"
has exactly a 50% chance of occurring (presuming that random.random()
follows its correct documented distribution). I've no idea what the
probability of random.random() returning exactly 0.5 is, but since it
can return 0.0 and cannot return 1.0, I've just always used less-than.
(Also because it works nicely with other values - there's a 30% chance
that random.random() is less than 0.3, etc.) Is there a reason for
going greater-than, or is it simply that it doesn't matter?

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Steven D'Aprano
On Sat, 02 Jun 2018 09:32:05 +0200, Peter J. Holzer wrote:

> Also nope. It looks like NaNs just mess up sorting in an unpredictable
> way. Is this the intended behaviour or just an accident of
> implementation? (I think it's the latter: I can see how a sort algorithm
> which doesn't treat NaN specially would produce such results.)


Neither -- it is a deliberate decision to ignore the possibility of such 
pathological objects when sorting. Feature requests to reconsider this 
decision in statistics.median will be considered seriously :-)

NANs are unordered values. ALL of these ought to return False, for ANY 
float value x:

NAN < x
NAN <= x
NAN > x
NAN >= x
NAN == x

Regardless of the value of x, NAN <op> x is always false (apart 
from != which is always true).
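
(A quick interactive check of the above, added for illustration:)

>>> nan = float("nan")
>>> nan < 1.0, nan <= 1.0, nan > 1.0, nan >= 1.0, nan == 1.0
(False, False, False, False, False)
>>> nan == nan, nan != nan
(False, True)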

Python's sort algorithm assumes that objects have consistent, sensible 
comparisons. If you write an object like this:


import random

class Rubbish:
    def __eq__(self, other):
        return random.random() > 0.5
    def __lt__(self, other):
        return random.random() > 0.5
    def __gt__(self, other):
        return random.random() > 0.5

it too will mess up sorting in unpredictable ways. So don't do that.


-- 
Steven D'Aprano
"Ever since I learned about confirmation bias, I've been seeing
it everywhere." -- Jon Ronson

-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Chris Angelico
On Sat, Jun 2, 2018 at 7:31 PM, Paul Moore  wrote:
> On 2 June 2018 at 08:32, Peter J. Holzer  wrote:
>> Browsing through older messages I came upon a thread discussing the
>> treatment of NaNs by median(). Since you have to (partially) sort the
>> values to compute the median, I played around with sorted():
>>
>> Python 3.5.3 (default, Jan 19 2017, 14:11:04)
>> [GCC 6.3.0 20170118] on linux
>> Type "help", "copyright", "credits" or "license" for more information.
>>
>> >>> sorted([3, 5, float('NaN'), 1])
>> [3, 5, nan, 1]
>>
>> What? Does NaN cause sorted to return the original list?
>>
>> >>> sorted([3, 5, float('NaN'), 1, 0.5])
>> [3, 5, nan, 0.5, 1]
>>
>> Nope. Does it partition the list into sublists, which are sorted
>> individually?
>>
>> >>> sorted([3, 5, -8, float('NaN'), 1, 0.5])
>> [-8, 0.5, 1, 3, 5, nan]
>> >>> sorted([3, 5, -8, float('NaN'), 1, 0.5, 33])
>> [-8, 0.5, 1, 3, 5, nan, 33]
>>
>> Also nope. It looks like NaNs just mess up sorting in an unpredictable
>> way. Is this the intended behaviour or just an accident of
>> implementation? (I think it's the latter: I can see how a sort algorithm
>> which doesn't treat NaN specially would produce such results.)
>
> I'd simply assume it's the result of two factors:
>
> 1. The behaviour of comparisons involving NaN values is weird (not
> undefined, as far as I know NaN behaviour is very well defined, but
> violates a number of normally fundamental properties of comparisons)
> 2. The precise behaviour of the sort algorithm use by Python is not
> mandated by the language.
>
> A consequence of (1) is that there is no meaningful definition of "a
> sorted list of numbers" if that list includes NaN, as the definition
> "being sorted" relies on properties like "only one of a < b and b < a
> is true" (precisely which properties depend on how you define "sorted"
> which no-one normally cares about because under normal assumptions
> they are all equivalent). A list including NaNs therefore cannot be
> permuted into a sorted order (which is the basic language-mandated
> detail of sorted() - there are others, but this is what matters here).
>
So call it an accident of implementation if you like. Or "sorting a
> list with NaNs in it is meaningless" if you prefer. Or "undefined
> behaviour" if you're a fan of the language in the C standard.

Sorting a list of uncomparable objects is meaningless (though not
undefined). I believe that a list of floats with some NaNs in it
should have all the other values sort correctly, with the NaNs doing
whatever they feel like doing (they're like cats - have you ever tried
to sort an array of cats?); but that's hard to do with the vanilla
sorting operation. If you want some sort of predictable behaviour,
it's not too hard; all you need is a comparison function that puts all
the NaNs together:

>>> sorted([3, 5, -8, float('NaN'), 1, 0.5, 33], key=lambda x: (x!=x, x))
[-8, 0.5, 1, 3, 5, 33, nan]

Or if you'd rather shove them all to the beginning:

>>> sorted([3, 5, -8, float('NaN'), 1, 0.5, 33], key=lambda x: (x==x, x))
[nan, -8, 0.5, 1, 3, 5, 33]

With that change, you have a guarantee that all finite and infinite
values will be correctly sorted, and the NaNs will be grouped (but in
no particular order within that group).

ChrisA
-- 
https://mail.python.org/mailman/listinfo/python-list


Re: Sorting NaNs

2018-06-02 Thread Paul Moore
On 2 June 2018 at 08:32, Peter J. Holzer  wrote:
> Browsing through older messages I came upon a thread discussing the
> treatment of NaNs by median(). Since you have to (partially) sort the
> values to compute the median, I played around with sorted():
>
> Python 3.5.3 (default, Jan 19 2017, 14:11:04)
> [GCC 6.3.0 20170118] on linux
> Type "help", "copyright", "credits" or "license" for more information.
>
> >>> sorted([3, 5, float('NaN'), 1])
> [3, 5, nan, 1]
>
> What? Does NaN cause sorted to return the original list?
>
> >>> sorted([3, 5, float('NaN'), 1, 0.5])
> [3, 5, nan, 0.5, 1]
>
> Nope. Does it partition the list into sublists, which are sorted
> individually?
>
> >>> sorted([3, 5, -8, float('NaN'), 1, 0.5])
> [-8, 0.5, 1, 3, 5, nan]
> >>> sorted([3, 5, -8, float('NaN'), 1, 0.5, 33])
> [-8, 0.5, 1, 3, 5, nan, 33]
>
> Also nope. It looks like NaNs just mess up sorting in an unpredictable
> way. Is this the intended behaviour or just an accident of
> implementation? (I think it's the latter: I can see how a sort algorithm
> which doesn't treat NaN specially would produce such results.)

I'd simply assume it's the result of two factors:

1. The behaviour of comparisons involving NaN values is weird (not
undefined, as far as I know NaN behaviour is very well defined, but
violates a number of normally fundamental properties of comparisons)
2. The precise behaviour of the sort algorithm use by Python is not
mandated by the language.

A consequence of (1) is that there is no meaningful definition of "a
sorted list of numbers" if that list includes NaN, as the definition
"being sorted" relies on properties like "only one of a < b and b < a
is true" (precisely which properties depend on how you define "sorted"
which no-one normally cares about because under normal assumptions
they are all equivalent). A list including NaNs therefore cannot be
permuted into a sorted order (which is the basic language-mandated
detail of sorted() - there are others, but this is what matters here).

So call it an accident of implementation if you like. Or "sorting a
list with NaNs in it is meaningless" if you prefer. Or "undefined
behaviour" if you're a fan of the language in the C standard.

Paul
-- 
https://mail.python.org/mailman/listinfo/python-list


Sorting NaNs

2018-06-02 Thread Peter J. Holzer
Browsing through older messages I came upon a thread discussing the
treatment of NaNs by median(). Since you have to (partially) sort the
values to compute the median, I played around with sorted():

Python 3.5.3 (default, Jan 19 2017, 14:11:04) 
[GCC 6.3.0 20170118] on linux
Type "help", "copyright", "credits" or "license" for more information.

>>> sorted([3, 5, float('NaN'), 1])
[3, 5, nan, 1]

What? Does NaN cause sorted to return the original list?

>>> sorted([3, 5, float('NaN'), 1, 0.5])
[3, 5, nan, 0.5, 1]

Nope. Does it partition the list into sublists, which are sorted
individually?

>>> sorted([3, 5, -8, float('NaN'), 1, 0.5])
[-8, 0.5, 1, 3, 5, nan]
>>> sorted([3, 5, -8, float('NaN'), 1, 0.5, 33])
[-8, 0.5, 1, 3, 5, nan, 33]

Also nope. It looks like NaNs just mess up sorting in an unpredictable
way. Is this the intended behaviour or just an accident of
implementation? (I think it's the latter: I can see how a sort algorithm
which doesn't treat NaN specially would produce such results.)

hp

-- 
   _  | Peter J. Holzer| we build much bigger, better disasters now
|_|_) || because we have much more sophisticated
| |   | h...@hjp.at | management tools.
__/   | http://www.hjp.at/ | -- Ross Anderson 


-- 
https://mail.python.org/mailman/listinfo/python-list