Several points:

* NaN as a missing-value marker is widely used outside the Python standard library. One could argue, somewhat reasonably, that Pandas, NumPy, and PyTorch misinterpret the IEEE-754 intention here, but this is EVERYWHERE in numeric/scientific Python. We could DOCUMENT that None is a better placeholder for *missing*, but we shouldn't be obnoxious to millions of users of stuff outside stdlib.

* sorted() is WAY too low-level to add this logic to, and numeric types with NaNs are much too special for the generic sorting. That said, we DO NOT NEED IT. list.sort() and sorted() and friends already take a key parameter. This lets the appropriate tool (i.e. the statistics module, and other things) develop a total_order() key function to match the IEEE suggested ordering. There is absolutely no reason or need to change sorted() to accommodate this.

* Yes, obviously I made the subject line about statistics.median(), but the xtile() functions have all the same concerns, and live in the same module.

* For quiet NaNs, it really is easy to get them innocently. E.g.:

    def my_results(it):
        for x in it:
            x_1 = func1_with_asymptotes(x)
            x_2 = func2_with_asymptotes(x)
            result = x_1 / x_2
            yield result

    median = statistics.median(my_results(my_iter))

  That's perfectly reasonable code that will SOMETIMES wind up with qNaNs in the collection of values... but that USUALLY will not.

* There is absolutely no need to lose any efficiency by making the statistics functions more friendly. All we need is an optional parameter whose spelling I've suggested as `on_nan` (but bikeshed freely). Under at least one value of that parameter, we can keep EXACTLY the current implementation, with all its warts and virtues as-is. Maybe a spelling for that option could be 'unsafe' or 'fast'?

* Another option can be 'ignore' (maybe 'skip', but 'ignore' is more Pandas-like), which is simply:

    def median(it, on_nan=DEFAULT):
        if on_nan == 'unsafe':
            ... do all the current stuff ...
        elif on_nan == "ignore":
            return median((x for x in it if not is_nan(x)), on_nan='unsafe')
        elif on_nan == "ieee_total_order":
            ... something with sorted(it, key=total_order) ...

  Yes, this requires agreeing on the right implementation of is_nan(), with several plausible versions proposed in this thread.

* With the 'raise' and 'poison' ('propagate'?) options, the implementation would be more like this:

    items = []
    for x in it:
        if is_nan(x):
            if on_nan == 'raise':
                raise ValueError('No median exists of collections with NaNs')
            elif on_nan == 'poison':
                return float('nan')
        else:
            items.append(x)
    return median(items, on_nan='unsafe')

I think that's everything, really. Nothing gets any slower, all use cases are accommodated.

-- 
Keeping medicines from the bloodstreams of the sick; food from the bellies of the hungry; books from the hands of the uneducated; technology from the underdeveloped; and putting advocates of freedom in prisons. Intellectual property is to the 21st century what the slave trade was to the 16th.
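
For anyone who wants to poke at the `on_nan` idea, here is a rough, runnable sketch that pulls the fragments above into one function. Several things in it are my own assumptions rather than anything settled in this thread: the default of 'raise', the `x != x` spelling of is_nan(), a total_order() key that only handles values convertible to a C double, and the choice to take the middle element(s) of the total-ordered list for 'ieee_total_order'. Treat it as an illustration, not a proposed implementation.

```python
import statistics
import struct

_MODES = {'unsafe', 'ignore', 'raise', 'poison', 'ieee_total_order'}


def is_nan(x):
    # One of several plausible spellings: a NaN is the only value
    # that compares unequal to itself.
    return x != x


def total_order(x):
    # Sort key approximating IEEE-754 totalOrder for doubles (assumption:
    # x can be packed as a C double). Negative values have all bits
    # flipped, non-negative values get the sign bit set, so the unsigned
    # integers sort as: -NaN < -inf < ... < -0.0 < +0.0 < ... < +inf < +NaN.
    bits = struct.unpack('<Q', struct.pack('<d', x))[0]
    if bits >> 63:
        return bits ^ 0xFFFFFFFFFFFFFFFF
    return bits | 0x8000000000000000


def median(it, on_nan='raise'):
    # Hypothetical wrapper; the default value here is an arbitrary choice.
    if on_nan not in _MODES:
        raise ValueError(f'unknown on_nan mode: {on_nan!r}')

    if on_nan == 'unsafe':
        # Exactly the current behaviour, warts and all.
        return statistics.median(it)

    if on_nan == 'ieee_total_order':
        items = sorted(it, key=total_order)
        if not items:
            raise statistics.StatisticsError('no median for empty data')
        mid = len(items) // 2
        if len(items) % 2:
            return items[mid]
        return (items[mid - 1] + items[mid]) / 2

    items = []
    for x in it:
        if is_nan(x):
            if on_nan == 'raise':
                raise ValueError('No median exists of collections with NaNs')
            if on_nan == 'poison':
                return float('nan')
            # on_nan == 'ignore': drop the NaN and keep going
        else:
            items.append(x)
    return statistics.median(items)
```

The only extra cost in the NaN-aware modes is one linear pass before the sort that median() already does, and 'unsafe' skips even that.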