[Python-ideas] Re: Fix statistics.median()?

David Mertz Sun, 29 Dec 2019 21:07:14 -0800

On Sun, Dec 29, 2019 at 11:33 PM Andrew Barnert <abarn...@yahoo.com> wrote:


> IEEE total order specifies a distinct order for every distinct bit
> pattern, and tries to do so in a way that makes sense.
>

Ok, ok... I've got "learned up" about this three times now :-).  Given we
cannot control those bit patterns from Python, I'm a bit "meh"... but I get
the rule (yeah, yeah, the struct module).
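
For anyone following along, here is a minimal sketch of the struct-module route to distinct NaN bit patterns. The helper names `float_from_bits` and `bits_of` are made up for this illustration, and it assumes a platform whose native float is an IEEE 754 binary64 and which passes quiet-NaN payloads through unchanged (typical, but not guaranteed).

```python
import math
import struct

def float_from_bits(bits: int) -> float:
    """Reinterpret a 64-bit unsigned integer as a double (hypothetical helper)."""
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def bits_of(x: float) -> int:
    """Return the 64-bit pattern of a double as an integer (hypothetical helper)."""
    return struct.unpack("<Q", struct.pack("<d", x))[0]

# Three quiet NaNs that differ only in payload or sign bit.
plain_nan = float_from_bits(0x7FF8000000000000)
payload_nan = float_from_bits(0x7FF8000000000001)
negative_nan = float_from_bits(0xFFF8000000000000)

for value in (plain_nan, payload_nan, negative_nan):
    # All three are NaN as far as ordinary Python code can tell.
    print(hex(bits_of(value)), math.isnan(value))
```

Ordinary comparisons cannot tell these values apart; only the bit patterns differ, which is exactly what a total order would have to sort on.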


> The 95% case is handled by just ignore and raise. Novices should probably
> never be using anything else.
> Experts will definitely often want poison. And probably sometimes fast for
> backward compatibility and/or performance. That gets you to 98%.
>

Fair enough.  I really only care about the 98% case.  But if you can
convince Steven to add `key=` as well, no real harm to me.  My only
concern is a beginner who types `help(median)` and scratches her head over
the key oddness.  But I guess the docstring can say "Don't worry about this
if you don't need a custom sort order for your objects."
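
To make the four modes concrete, here is a rough sketch of what such a parameter could look like. The function name `median_on_nan`, the `on_nan` spelling, the default, and the exact semantics of each mode are my reading of the thread, not anything that exists in the statistics module.

```python
import math
import statistics

def median_on_nan(data, on_nan="raise"):
    """Hypothetical wrapper: median with an explicit NaN policy."""
    values = list(data)
    if on_nan == "fast":
        # No NaN check at all: whatever statistics.median() does today.
        return statistics.median(values)

    def is_nan(x):
        return isinstance(x, float) and math.isnan(x)

    has_nan = any(is_nan(x) for x in values)
    if on_nan == "raise":
        if has_nan:
            raise ValueError("NaN found in data")
    elif on_nan == "poison":
        if has_nan:
            return math.nan  # any NaN poisons the result
    elif on_nan == "ignore":
        values = [x for x in values if not is_nan(x)]  # treat NaN as missing
    else:
        raise ValueError(f"unknown on_nan mode: {on_nan!r}")
    return statistics.median(values)

sample = [1.0, math.nan, 3.0]
print(median_on_nan(sample, on_nan="ignore"))
print(median_on_nan(sample, on_nan="poison"))
```

The point of the sketch is only that "ignore" and "raise" are a couple of lines each, and "poison" and "fast" are not much more.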

It's also no real extra work to pass along a `key` argument to the
`sorted()` internal to the function.  I guess on the off chance the
implementation moves to Quickselect it will be slightly more work. But I
guess really not that much even then (hmmm... I think the implementation
would have to contain a kind of DSU inside it though for that).

Do remember that using `sorted()` is an implementation detail, not a
promise made by the functions in the statistics module.
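
A sketch of both variants, to show how small the difference is. Everything here is hypothetical: the names `median_low_sorted` and `median_low_quickselect` are invented, and I use the "low" median so that no averaging of arbitrary objects is needed. The Quickselect version decorates each item as `(key, index)` once, selects on the decorated pairs, then undecorates.

```python
import random

def median_low_sorted(data, key=None):
    """Low median via sorted(); key is passed straight through."""
    ordered = sorted(data, key=key)
    if not ordered:
        raise ValueError("no median for empty data")
    return ordered[(len(ordered) - 1) // 2]

def median_low_quickselect(data, key=None):
    """Low median via Quickselect on decorated (key, index) pairs."""
    items = list(data)
    if not items:
        raise ValueError("no median for empty data")
    if key is None:
        key = lambda x: x
    # Decorate: compute each key once; the index breaks ties, so the
    # items themselves are never compared and all pairs are distinct.
    pool = [(key(x), i) for i, x in enumerate(items)]
    k = (len(pool) - 1) // 2  # rank of the low median
    while True:
        pivot = random.choice(pool)
        smaller = [d for d in pool if d < pivot]
        larger = [d for d in pool if d > pivot]
        if k < len(smaller):
            pool = smaller
        elif k >= len(pool) - len(larger):
            k -= len(pool) - len(larger)
            pool = larger
        else:
            # Undecorate: the pivot is the element of rank k.
            return items[pivot[1]]

data = [3, -1, 2, -5, 4]
print(median_low_sorted(data, key=abs))
print(median_low_quickselect(data, key=abs))
```

Breaking ties by index gives the same answer as the stable `sorted()` version, so callers could not tell which implementation is underneath.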

> And experts might also want something different from IEEE total order, like
> uniformly pushing all NaNs to the end. I’m not sure when you’d actually
> want that, but since it was the original suggestion that kicked off this
> whole discussion, it’s obviously not inconceivable.
>

I think that idea of "NaNs to the end" was just ill-conceived nonsense.  I
mean MAYBE I can see some good in putting the +/-nans at both ends of the
order, so MAYBE the stuff in the middle winds up being in median.  But if
you have 100 nans and 50 real numbers, it seems just silly to automatically
select NaN as the median.  Especially under the "missing data" use that is
so common in data science (Pandas, R, etc).
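
Concretely, a small sketch of that 100-NaN case. The function `median_nans_last` is made up for this example and is not a proposal.

```python
import math
import statistics

def median_nans_last(data):
    """Hypothetical: sort with every NaN pushed to the end, then take the median."""
    # Key is (is_nan, value); NaNs get a dummy 0.0 so NaN is never compared.
    ordered = sorted(
        data, key=lambda x: (math.isnan(x), 0.0 if math.isnan(x) else x)
    )
    n = len(ordered)
    if n == 0:
        raise ValueError("no median for empty data")
    if n % 2 == 1:
        return ordered[n // 2]
    return (ordered[n // 2 - 1] + ordered[n // 2]) / 2

reals = [float(i) for i in range(50)]
data = reals + [math.nan] * 100

# Both middle positions (74 and 75 of 150) hold NaN, so the result is NaN.
print(median_nans_last(data))
# Treating NaN as missing data uses only the 50 real numbers.
print(statistics.median(reals))
```

With more NaNs than real numbers the middle of the sorted list is all NaN, so the "median" says nothing about the actual data.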


> I get the idea that, once you’ve already got an on_nan param, adding
> another value to that param doesn’t add as much cognitive load as adding a
> whole other param would. But I think a total order value is so rarely
> useful that it’s probably more load than it’s worth, while a key param is a
> more widely useful and therefore worth more load (although maybe still not
> enough).
>

Oh... yeah, the 'ieee_total_order' value was absolutely silly and
over-specialized. I just threw it in because some folks in the thread mentioned
it.  Still, it's a rarely used value for the one parameter, so that's a
little less burden than another parameter.  In Pandas we see that a lot...
some parameter will have 20 options, but 95% of the users use the default,
and 4.9% use one non-default option.  So the remaining 18 options cover the
0.1% use cases.

Yours, David...

-- 
Keeping medicines from the bloodstreams of the sick; food
from the bellies of the hungry; books from the hands of the
uneducated; technology from the underdeveloped; and putting
advocates of freedom in prisons.  Intellectual property is
to the 21st century what the slave trade was to the 16th.

_______________________________________________
Python-ideas mailing list -- python-ideas@python.org
To unsubscribe send an email to python-ideas-le...@python.org
https://mail.python.org/mailman3/lists/python-ideas.python.org/
Message archived at 
https://mail.python.org/archives/list/python-ideas@python.org/message/6VBRXYNH37JG6Q6GMMZQTFLKPGANOBCR/
Code of Conduct: http://python.org/psf/codeofconduct/
