On 12/28/19 1:14 AM, Christopher Barker wrote:
On Fri, Dec 27, 2019 at 8:14 PM Richard Damon
<rich...@damon-family.org <mailto:rich...@damon-family.org>> wrote:
> It is a well known axiom of computing that returning an *incorrect*
> result is a very bad thing.
There is also an axiom that you can only expect valid results if you
meet the operations pre-conditions.
sure.
Sometimes, being totally defensive in checking for 'bad' inputs costs
you too much performance.
it can, yes, there are no hard rules about anything.
The stated requirement on the statistics module is you feed it
'numbers', and a NaN is by definition Not a Number.
Sure, but NaN IS a part of the Python float type, and it can and will
show up once in a while. That is not the same as expecting median (or
any other function in the module) to work with some arbitrary other
type. Practicality beats purity -- floats WILL be widely used in the
module, they should be accommodated.
Here is the text in the docs:
"""
This module provides functions for calculating mathematical statistics
of numeric (Real-valued) data.
<snip>
Unless explicitly noted, these functions support int, float, Decimal
and Fraction. Behaviour with other types (whether in the numeric tower
or not) is currently unsupported.
"""
So this is pretty clear -- the module is designed to work with int,
float, Decimal and Fraction -- so I think it's worth some effort to
support those well. And both float and Decimal have a NaN of some sort.
But the documentation that you reference says it works with NUMBERS, and
NaNs are explicitly NOT A NUMBER, so the statistics module specifically
hasn't made a claim that it will work with them.
Also, the general section says "unless explicitly noted", and median
makes an explicit reference to types that support order but not addition
needing to use a different function, which implies that a data type that
IS ordered and supports addition is usable with median (presumably some
user-defined number class that acts close enough to the other number
classes that the average of a and b is a value between them).
The median function also implies that its inputs have the property of
having an order, as well as being able to be added (and if you can't add
them, then you need to use median_low or median_high)
sure, but those are expected of the types above anyway. It doesn't
seem to me that this package is designed to work with ANY type that
has order and supports the operations needed by the function. For
instance, lists of numbers can be compared, so:
In [69]: stuff = [[1,2,3],
...: [4,6,1],
...: [8,9],
...: [4,5,6,7,8,9],
...: [5.0]
...: ]
In [70]:
In [70]: statistics.median(stuff)
Out[70]: [4, 6, 1]
Actually, since lists don't support addition in the manner required,
median isn't appropriate, but perhaps it has meaning with
median_low(). For example, if the lists represented hierarchical
references in a document (Chapter 1, Section 2, Paragraph 3) then the
median might have a meaning.
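For what it's worth, median_low() can already handle data like that today, since it only sorts and picks an element rather than averaging; a quick sketch:

```python
import statistics

# Hierarchical references (chapter, section, paragraph) compare
# lexicographically, so median_low() can pick the middle one
# without ever needing to add two lists together.
sections = [[1, 2, 3], [4, 6, 1], [8, 9], [4, 5, 6, 7, 8, 9], [5.0]]
print(statistics.median_low(sections))  # the middle element after sorting
```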
The fact that that worked is just a coincidence, it is not an
important part of the design that it does.
I will also point out that really the median function is a small
part of
the problem, all the x-tile based functions have the same issue,
absolutely, and pretty much all of them, though many will just give
you NaN as a result, which is much better than an arbitrary result.
By x-tile based functions I mean things like median (the 50th
percentile) and the upper and lower quartiles (the 75th and 25th
percentiles); the one provided in the module is quantiles(), which can
actually compute any even grouping.
and
fundamentally it is a problem with sorted().
I don't think so. In fact, sorted() is explicitly designed to work
with any type that is consistently ordered (I'm not sure if they HAVE
to be "totally ordered"), and floats with NaNs are not that.
Most sort routines require that the data at least define a partial
order, and they effectively define a total order by considering that if
a < b is false and b < a is false then a and b form an equivalence class
whose internal order we don't care about. Sets, for instance, with <
being "subset of", have a consistent ordering but don't form consistent
equivalence classes and thus don't sort properly.
As the statistics module is specifically designed to work with numbers
(and essentially only with numbers) it's the appropriate place to put
an accommodation for NaNs. Not to mention that while sorted() could be
adapted to do something more consistent with NaNs, since it is a
general purpose function, it's hard to know what the behavior should
be -- raise an Exception? remove them? put them all at the front or
back? Which makes sense depends entirely on the application. The
statistics module, on the other hand, is for statistics, so while
there are still multiple options, it's a lot easier to pick one and go
for it.
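As a sketch of one of those options, a key function (a hypothetical helper, not something the module provides) can push NaNs to the back before the statistics are computed:

```python
import math

def nan_last_key(x):
    # False sorts before True, so all real numbers come first
    # and the NaNs are grouped together at the back.
    return (isinstance(x, float) and math.isnan(x), x)

data = [3.0, float('nan'), 1.0, 4.0, 2.0]
print(sorted(data, key=nan_last_key))
```

The NaNs never get compared against the real numbers, because the boolean first element of the key tuple decides those comparisons.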
Has anyone tried to implement a version of these that checks for
inputs
that don't have a total order and seen what the performance impact is?
I think checking for total order is a red herring -- there really is a
reason to specifically deal with NaNs in floats (and Decimals), not
with ANY type that may not be totally ordered.
Part of the problem is that fixing one issue doesn't come close to
fixing the whole problem. The stated problem is that newbie/casual
programmers get confused that some things don't work when NaNs get into
the mix. Why is it ok to say that [3, 1, 4, nan, 2] is a 'sorted' array,
but not ok for 4 to be the median of that array?
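To make the confusion concrete, a small demonstration (the exact garbled order depends on the input, since every comparison against a NaN is False):

```python
import statistics

nan = float('nan')
data = [3.0, nan, 1.0, 4.0, 2.0]
print(sorted(data))             # even the non-NaN values end up out of order
print(statistics.median(data))  # whatever value lands in the middle slot
```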
I am not saying that we can't fix the problem (though I think I have
made a reasonable argument that it isn't a problem that MUST be fixed),
but rather that a change to just median is the wrong spot to make this
fix. The real problem is that naive programmers, when they do something
wrong and get NaNs into their data, get confused by some of the strange
answers that they can get. The fact that NaNs don't order is one of the
points of confusion, and rather than trying to fix, one by one, the
various operations that are defined for sorted data to handle the issue,
why not go to the core issue and deal with the base sorting operations?
Perhaps even better would be a math mode (perhaps unfortunately needing
to be the default, since we are trying to help beginners) that isn't
fully IEEE compliant but throws exceptions on the errors that get us
into the territory that causes the confusion. This might actually not
impact that many programs in real life, as how many programs actually
need to generate NaNs as a result of calculations?
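For what it's worth, the decimal module already offers something close to that mode: its context traps make invalid operations raise instead of quietly producing a NaN. A sketch:

```python
from decimal import Decimal, localcontext, InvalidOperation

with localcontext() as ctx:
    ctx.traps[InvalidOperation] = True   # this is the default for Decimal
    try:
        Decimal(0) / Decimal(0)          # would yield NaN if the trap were off
    except InvalidOperation:
        print("0/0 raised instead of silently producing NaN")
```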
Testing for NaNs isn't trivial, as it was pointed out elsewhere
that how you check depends on the type of number you have (Decimals
being different from floats).
yes, that is unfortunate.
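A type-aware check might look like this sketch (is_nan is a hypothetical name; the fallback relies on a quiet NaN being the only value unequal to itself):

```python
import math
from decimal import Decimal

def is_nan(x):
    # Dispatch on type: math.isnan() is for floats,
    # while Decimal has its own is_nan() method.
    if isinstance(x, float):
        return math.isnan(x)
    if isinstance(x, Decimal):
        return x.is_nan()
    # Generic fallback: a quiet NaN is the only value not equal to itself.
    return x != x
```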
To be really complete, you want to actually
detect that you have some elements that don't form a total order with
the other elements.
as above, being "complete" isn't necessary here. And even if you were
complete, what would you DO with a general non-totally-ordered type?
Raising an Exception would be OK, but any other option (like treating it
as a missing value) wouldn't make sense if you don't know more about
the type.
The only answer that I see that makes sense is raising an Exception
(likely in sorted). You also probably can't be totally thorough in
checking, as completely verifying transitivity would be an O(N**3)
operation, which is way too slow. Likely you would live with just
testing that each pair you check has exactly one of a < b, a == b,
b < a being true.
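That pairwise test is itself O(N**2), so still not cheap; a sketch, with check_trichotomy as a hypothetical name:

```python
from itertools import combinations

def check_trichotomy(data):
    # Verify that exactly one of a < b, a == b, b < a holds for every
    # pair. NaNs fail (all three are False), as do sets ({1} vs {2}).
    for a, b in combinations(data, 2):
        if (a < b) + (a == b) + (b < a) != 1:
            raise ValueError(f"no consistent ordering between {a!r} and {b!r}")
```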
In many ways, half fixing the issue makes it worse, as improving
reliability can lead you to forget that you need to take care, so the
remaining issues catch you harder.
I can't think of an example of this -- it's a fine principle, but if
we were to make the entire statistics module do something reasonable
with NaNs -- what other issues, exactly, would that hide?
In short: practicality beats purity:
- The fact that the module is designed to work with all the standard
number types doesn't mean it has to work with ANY type that supports
the operations needed for a given function.
- NaNs are part of the Python float and Decimal implementations -- so
they WILL show up once in a while. It would be good to handle them.
- NaNs can also be very helpful to indicate missing values -- this
can actually be very handy for statistics calculations. So it could be
a nice feature to add -- that NaN means "missing value"
ASSUMING that NaNs represent missing data is just ONE possible
interpretation for it. Making it THE interpretation in a low level
package seems wrong. It also is an interpretation that is easy to create
with a simple helper function that takes one sequence and returns
another one where all the NaNs are removed. Other interpretations of
what a NaN should reflect aren't as easy to implement outside the
operation, so if you are going to pick an interpretation, it probably
should be something else that is hard to handle externally.
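That helper is nearly a one-liner; a sketch (drop_nans is a hypothetical name):

```python
import math
from decimal import Decimal

def drop_nans(data):
    # Treat NaN as "missing": return the data with all NaNs filtered out.
    def _is_nan(x):
        return ((isinstance(x, float) and math.isnan(x))
                or (isinstance(x, Decimal) and x.is_nan()))
    return [x for x in data if not _is_nan(x)]
```

Then statistics.median(drop_nans(data)) gives the "NaN means missing" behaviour without baking that interpretation into the module itself.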
Also, Python, because it is dynamically typed, would in my mind be
better served by using something like None to indicate missing data in
the array. NaN was chosen in some languages as a missing value because
they couldn't handle mixed-type data arrays.
-CHB
--
Christopher Barker, PhD
--
Richard Damon
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at
https://mail.python.org/archives/list/python-ideas@python.org/message/VRDAF4HV4GSTTHSK7NM5KOLRB3QPOO72/
Code of Conduct: http://python.org/psf/codeofconduct/