[Python-ideas] Re: Fix statistics.median()?

David Mertz Sun, 29 Dec 2019 16:08:45 -0800

Several points:

* NaN as missing-value is widely used outside the Python standard library.
One could argue, somewhat reasonably, that Pandas and NumPy and PyTorch
misinterpret the IEEE-754 intention here, but this is EVERYWHERE in
numeric/scientific Python.  We could DOCUMENT that None is a better
placeholder for *missing* but we shouldn't be obnoxious to millions of
users of stuff outside stdlib.


* sorted() is WAY too low-level to add this logic to, and numeric types
with NaNs are much too special for the generic sorting.  That said, we DO
NOT NEED IT.  list.sort() and sorted() and friends already take a key
parameter.  This lets the appropriate tool—i.e. the statistics module, and
other things—develop a total_order() key function to match the IEEE
suggested ordering.  There is absolutely no reason or need to change
sorted() to accommodate this.
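
To make the key-function idea concrete, here is a rough sketch of what a total_order() key could look like for plain floats. The name total_order comes from the paragraph above; the bit-twiddling implementation is only one assumed approach, handles nothing but values convertible to a 64-bit float, and is not a proposal for the actual stdlib code.

```python
import math
import struct

def total_order(x):
    """Hypothetical sort key approximating IEEE 754 totalOrder for floats."""
    # Reinterpret the 64-bit float as an unsigned integer.
    bits = struct.unpack("<Q", struct.pack("<d", float(x)))[0]
    if bits >> 63:
        # Sign bit set: flip every bit so larger magnitudes sort earlier.
        return bits ^ 0xFFFFFFFFFFFFFFFF
    # Sign bit clear: set it so all non-negative values sort after negatives.
    return bits | 0x8000000000000000

# Build NaNs with an explicit sign so the example does not depend on the platform.
pos_nan = math.copysign(float("nan"), 1.0)
neg_nan = math.copysign(float("nan"), -1.0)

data = [3.0, pos_nan, float("-inf"), 0.0, neg_nan, -0.0, 1.0]
ordered = sorted(data, key=total_order)
# Negative NaN sorts first, then -inf, -0.0, 0.0, 1.0, 3.0, and positive NaN last.
```

Nothing in sorted() itself changes here; the ordering lives entirely in the key.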

* Yes, obviously I made the subject line about statistics.median(), but the
xtile() functions have all the same concerns, and live in the same module.

* For quiet NaNs, it really is easy to get them innocently.  E.g.:

def my_results(it):
    for x in it:
        x_1 = func1_with_asymptotes(x)
        x_2 = func2_with_asymptotes(x)
        result = x_1 / x_2
        yield result

median = statistics.median(my_results(my_iter))

That's perfectly reasonable code that will SOMETIMES wind up with qNaNs in
the collection of values... but that USUALLY will not.
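
A runnable variant of the same idea, with made-up stand-ins for the two functions (their bodies are assumptions chosen only so that both overflow to infinity for larger inputs). No exception is raised anywhere; the NaN simply appears when inf is divided by inf.

```python
import math

def func1_with_asymptotes(x):
    # Stand-in: float multiplication overflows silently to inf for large x.
    return x * 1e308

def func2_with_asymptotes(x):
    # Stand-in: overflows to inf even sooner than func1.
    return x * x * 1e308

def my_results(it):
    for x in it:
        x_1 = func1_with_asymptotes(x)
        x_2 = func2_with_asymptotes(x)
        yield x_1 / x_2  # inf / inf is a quiet NaN, not an error

values = list(my_results([0.5, 1.0, 10.0]))
# The first two results are ordinary floats; the third is NaN.
has_nan = any(math.isnan(v) for v in values)

# NaN compares false against everything, including itself, which is why
# sorting a list that contains one gives no reliable order.
nan = values[2]
comparisons = (nan < 1.0, nan > 1.0, nan == nan)
```

Only the largest input produces the NaN, so most runs over real data would never show the problem.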

* There is absolutely no need to lose any efficiency by making the
statistics functions more friendly.  All we need is an optional parameter
whose spelling I've suggested as `on_nan` (but bikeshed freely).  Under at
least one value of that parameter, we can keep EXACTLY the current
implementation, with all its warts and virtues as-is.  Maybe a spelling for
that option could be 'unsafe' or 'fast'?

* Another option can be 'ignore' (maybe 'skip', but 'ignore' is more
Pandas-like) which is simply:

def median(it, on_nan=DEFAULT):
    if on_nan == 'unsafe':
        ... do all the current stuff ...
    elif on_nan == "ignore":
        return median((x for x in it if not is_nan(x)), on_nan='unsafe')
    elif on_nan == "ieee_total_order":
        ... something with sorted(it, key=total_order) ...

Yes, this requires agreeing on the right implementation of is_nan(), with
several plausible versions proposed in this thread.
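
For illustration, one plausible is_nan() might look like the sketch below. Which types it should cover is exactly the open question, so treat the choices here (Decimal handled explicitly, non-numeric values reported as not-NaN) as assumptions rather than the agreed version.

```python
import math
from decimal import Decimal
from fractions import Fraction

def is_nan(x):
    """One possible NaN test; the type coverage is an assumption."""
    if isinstance(x, Decimal):
        # Decimal.is_nan() is true for both quiet and signaling NaNs and
        # avoids converting a signaling NaN to float, which would raise.
        return x.is_nan()
    try:
        return math.isnan(x)
    except (TypeError, OverflowError):
        # Not a real number (e.g. None), or an int too large for a float:
        # neither can be a NaN.
        return False

samples = [1.5, float("nan"), Decimal("NaN"), Fraction(1, 2), None, 10**400]
cleaned = [x for x in samples if not is_nan(x)]
# cleaned keeps 1.5, Fraction(1, 2), None and 10**400; both NaNs are dropped.
```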

* With the 'raise' and 'poison' ('propagate'?) options, the implementation
would be more like this:

items = []
for x in it:
    if is_nan(x):
        if on_nan == 'raise':
            raise ValueError('No median exists of collections with NaNs')
        elif on_nan == 'poison':
            return float('nan')
    else:
        items.append(x)
return median(items, on_nan='unsafe')
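
Putting the pieces together, a self-contained sketch of such a wrapper around the existing statistics.median() could read as follows. The on_nan parameter and its values are the spellings floated in this message, not an existing API; the default of 'unsafe' and the simple is_nan() helper are assumptions made only to keep the example short.

```python
import math
import statistics

def is_nan(x):
    # Simplified helper for this sketch: floats and float-convertible values only.
    try:
        return math.isnan(x)
    except (TypeError, OverflowError):
        return False

def median(it, on_nan="unsafe"):
    """Hypothetical wrapper; on_nan is a proposed spelling, not a real parameter."""
    if on_nan == "unsafe":
        # Exactly the current behaviour, with no extra pass over the data.
        return statistics.median(it)
    if on_nan == "ignore":
        return statistics.median(x for x in it if not is_nan(x))
    if on_nan not in ("raise", "poison"):
        raise ValueError(f"unknown on_nan value: {on_nan!r}")
    items = []
    for x in it:
        if is_nan(x):
            if on_nan == "raise":
                raise ValueError("No median exists of collections with NaNs")
            return float("nan")  # 'poison': a NaN in the input gives NaN out
        items.append(x)
    return statistics.median(items)

data = [1.0, float("nan"), 3.0, 2.0]
ignored = median(data, on_nan="ignore")    # median of [1.0, 3.0, 2.0]
poisoned = median(data, on_nan="poison")   # NaN
```

Only the non-'unsafe' branches pay for the NaN check, which is the point about efficiency made above.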


I think that's everything, really.  Nothing gets any slower, all use cases
are accommodated.

-- 
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/V5JTTRAXFAQSWCE3LY3JOZITGS5LG3GB/
Code of Conduct: http://python.org/psf/codeofconduct/
