On Fri, Dec 27, 2019 at 8:14 PM Richard Damon <rich...@damon-family.org>
wrote:

> > It is a well known axiom of computing that returning an *incorrect*
> > result is a very bad thing.
>
> There is also an axiom that you can only expect valid results if you
> meet the operations pre-conditions.
>

sure.


> Sometimes, being totally defensive in checking for 'bad' inputs costs
> you too much performance.
>

it can, yes, there are no hard rules about anything.


> The stated requirement on the statistics module is you feed it
> 'numbers', and a NaN is by definition Not a Number.
>

Sure, but NaN IS a part of the Python float type, and it can and will show
up once in a while. That is not the same as expecting median (or any other
function in the module) to work with some arbitrary other type.
Practicality beats purity -- floats WILL be widely used in the module, so they
should be accommodated.

Here is the text in the docs:
"""
This module provides functions for calculating mathematical statistics of
numeric (Real-valued) data.
<snip>
Unless explicitly noted, these functions support int, float, Decimal and
Fraction. Behaviour with other types (whether in the numeric tower or not)
is currently unsupported.
"""

So this is pretty clear - the module is designed to work with int, float,
Decimal and Fraction -- so I think it's worth some effort to well-support
those. And both float and Decimal have a NaN of some sort.
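
A small illustration of that point, using only the standard library (the
variable names are mine, just for the example): NaN is an ordinary value of
both types, and a float NaN can fall out of plain arithmetic.

```python
import math
from decimal import Decimal

# NaN values are ordinary instances of the float and Decimal types.
float_nan = float("nan")
decimal_nan = Decimal("NaN")
print(isinstance(float_nan, float))
print(isinstance(decimal_nan, Decimal))

# A float NaN can also come out of ordinary arithmetic, e.g. inf - inf,
# so it can appear in data without anyone writing float("nan").
inf = float("inf")
arithmetic_nan = inf - inf
print(math.isnan(arithmetic_nan))
```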


> The Median function also implies that its inputs have the property of
> having an order, as well as being able to be added (and if you can't add
> them, then you need to use median_lower or median_upper)
>

sure, but those are expected of the types above anyway. It doesn't seem to
me that this package is designed to work with ANY type that has order and
supports the operations needed by the function. For instance, lists of
numbers can be compared, so:

In [69]: stuff = [[1,2,3],
    ...:          [4,6,1],
    ...:          [8,9],
    ...:          [4,5,6,7,8,9],
    ...:          [5.0]
    ...:          ]

In [70]: statistics.median(stuff)
Out[70]: [4, 6, 1]

The fact that this worked is just a coincidence; it is not an important
part of the design.

> I will also point out that really the median function is a small part of
> the problem, all the x-tile based functions have the same issue,


absolutely, and pretty much all of them, though many will just give you NaN
as a result, which is much better than an arbitrary result.
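
To make the contrast concrete, here is a rough sketch (the data values are
made up for illustration). The mean of data containing a NaN comes back as
NaN, which at least signals a problem. The median depends on sorting, so
what it returns can depend on where the NaN happens to sit in the input; I
deliberately don't claim specific outputs for it here.

```python
import math
import statistics

nan = float("nan")
data = [1.0, nan, 3.0]

# The NaN propagates through the arithmetic, so the result is NaN.
mean_result = statistics.mean(data)
print(mean_result)

# The same three values in two different orders. median() sorts its input,
# and NaN does not sort consistently, so these results need not agree.
orderings = ([nan, 1.0, 3.0], [1.0, 3.0, nan])
median_results = [statistics.median(order) for order in orderings]
print(median_results)
```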


> and
> fundamentally it is a problem with sorted().
>

I don't think so. In fact, sorted() is explicitly designed to work with any
type that is consistently ordered (I'm not sure if they HAVE to be "totally
ordered"), and floats with NaNs are not that.
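
For what it's worth, this is what "not consistently ordered" looks like in
practice. A minimal sketch: every ordering and equality comparison that
involves a NaN is False, so a NaN is neither less than, greater than, nor
equal to any other float (or to itself).

```python
nan = float("nan")
x = 1.0

# Each of these evaluates to False, so none of "less", "greater" or
# "equal" holds between a NaN and another float.
comparisons = {
    "nan < x": nan < x,
    "nan > x": nan > x,
    "nan == x": nan == x,
    "nan == nan": nan == nan,
}
for expression, result in comparisons.items():
    print(expression, "->", result)
```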

As the statistics module is specifically designed to work with numbers (and
essentially only with numbers) it's the appropriate place to put an
accommodation for NaNs. Not to mention that while sorted() could be adapted
to do something more consistent with NaNs, since it is a general purpose
function, it's hard to know what the behavior should be -- raise an
Exception? remove them? put them all at the front or back? Which makes
sense depends entirely on the application. The statistics module, on the
other hand, is for statistics, so while there are still multiple options,
it's a lot easier to pick one and go for it.
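
The options listed above could each be sketched for a list of floats as
follows. The function names are hypothetical, not anything in the standard
library, and the sketch assumes float-only input. Which one is "right" is
exactly the question a general-purpose sorted() can't answer.

```python
import math


def sort_raising(values):
    """Refuse to sort input that contains a NaN."""
    values = list(values)
    if any(math.isnan(v) for v in values):
        raise ValueError("NaN in input")
    return sorted(values)


def sort_dropping(values):
    """Sort after silently removing every NaN."""
    return sorted(v for v in values if not math.isnan(v))


def sort_nan_last(values):
    """Sort with every NaN moved to the back.

    The key's first element is False for ordinary numbers and True for
    NaN, so all ordinary numbers compare before any NaN.
    """
    return sorted(values, key=lambda v: (math.isnan(v), v))


data = [3.0, float("nan"), 1.0, 2.0]
print(sort_dropping(data))
print(sort_nan_last(data))
```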

> Has anyone tried to implement a version of these that checks for inputs
> that don't have a total order and seen what the performance impact is?


I think checking for total order is a red herring -- there really is a
reason to deal specifically with NaNs in floats (and Decimals), not with ANY
type that may not be totally ordered.

> Testing for NaNs isn't trivial, as elsewhere it was pointed out that how
> you check is based on the type of the number you have (Decimals being
> different from floats).


yes, that is unfortunate.
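
A type-aware check doesn't have to be large, though. Here is a rough sketch
covering the four supported types; is_nan is a hypothetical helper name, not
something the statistics module provides. It uses math.isnan() for floats
and Decimal.is_nan() for Decimals (which is true for both quiet and
signaling NaNs); int and Fraction have no NaN at all.

```python
import math
from decimal import Decimal
from fractions import Fraction


def is_nan(x):
    """Return True if x is a float or Decimal NaN (hypothetical helper)."""
    if isinstance(x, float):
        return math.isnan(x)
    if isinstance(x, Decimal):
        # Covers both quiet and signaling Decimal NaNs.
        return x.is_nan()
    # int and Fraction cannot represent NaN.
    return False


samples = [1, 2.5, float("nan"), Decimal("NaN"), Decimal("sNaN"), Fraction(1, 2)]
print([is_nan(s) for s in samples])
```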


> To be really complete, you want to actually
> detect that you have some elements that don't form a total order with
> the other elements.
>

as above, being "complete" isn't necessary here. And even if you were
complete, what would you DO with a general non-total-ordered type? Raising
an Exception would be OK, but any other option (like treating it as a missing
value) wouldn't make sense if you don't know more about the type.


> In many ways, half fixing the issue makes it worse, as improving
> reliability can lead you to forget that you need to take care, so the
> remaining issues catch you harder.
>

I can't think of an example of this -- it's a fine principle, but if we
were to make the entire statistics module do something reasonable with NaNs
-- exactly what other issues would that hide?

In short: practicality beats purity:

- The fact that the module is designed to work with all the standard number
types doesn't mean it has to work with ANY type that supports the
operations needed for a given function.

- NaNs are part of the Python float and Decimal implementations -- so they
WILL show up once in a while. It would be good to handle them.

- NaNs can also be very helpful to indicate missing values -- this can
actually be very handy for statistics calculations. So it could be a nice
feature to add -- that NaN means "missing value"
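
As a rough sketch of that last point, assuming float-only data:
median_ignoring_nan is a hypothetical wrapper, not an existing or proposed
API, that treats each NaN as a missing observation and computes the median
of whatever is left.

```python
import math
import statistics


def median_ignoring_nan(values):
    """Median of the non-NaN values, treating NaN as "missing".

    Hypothetical helper, float input only. If every value is missing,
    statistics.median() raises StatisticsError for the empty data.
    """
    present = [v for v in values if not math.isnan(v)]
    return statistics.median(present)


nan = float("nan")
readings = [2.0, nan, 1.0, 3.0, nan]  # two missing observations
print(median_ignoring_nan(readings))  # 2.0, the median of [1.0, 2.0, 3.0]
```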

-CHB


-- 
Christopher Barker, PhD

Python Language Consulting
  - Teaching
  - Scientific Software Development
  - Desktop GUI and Web Development
  - wxPython, numpy, scipy, Cython
_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/IG7TQOOTL4FAA4ENQDUO7SNH4PDEFVV4/