Raymond Hettinger <raymond.hettin...@gmail.com> added the comment:

Thanks for propelling this forward :-)  I'm really happy to have an easy-to-reach 
tool that readily summarizes the shape of data and that can be used to 
compare how distributions differ.


> Octave and Maple call their parameter "method", so if we 
> stick with  "method" we're in good company.

The Langford paper also uses the word "method", so that is likely just the 
right word.


> I'm more concerned about the values taken by the method parameter.
> "Inclusive" and "Exclusive" have a related but distinct meaning
> when it comes to quartiles, which is different from the Excel usage

Feel free to change it to whatever communicates best.  The meaning I was 
going for is closer to the notion of an open interval versus a closed interval.  
In terms of use cases, one is for describing population data where the minimum 
input really is the 0th percentile and the maximum is the 100th percentile.  
The other is for sample data where the underlying population will have values 
outside the range of the empirical samples.  I'm not sure what words best 
describe the distinction.  The words "inclusive" and "exclusive" approximate 
that idea, but maybe you can do better.
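
To make the distinction concrete, here is a minimal sketch of the two 
behaviors, assuming the draft signature and the "inclusive"/"exclusive" 
names as they currently stand:

    import statistics

    data = [1, 2, 3, 4, 5]

    # Population view: the minimum and maximum really are the 0th and
    # 100th percentiles, so cut points stay within the observed range.
    statistics.quantiles(data, n=4, method='inclusive')
    # --> [2.0, 3.0, 4.0]

    # Sample view: the underlying population extends beyond the
    # observed minimum and maximum, so cut points can land outside it.
    statistics.quantiles(data, n=4, method='exclusive')
    # --> [1.5, 3.0, 4.5]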


> I have a working version of quantiles() which supports cutpoints 
> and all nine calculation methods supported by R.

My recommendation is to not do this.  Usually, it's better to start simple, 
focusing on the core use cases (i.e. sample and population), and then let users 
teach us what additions they really need (this is a YAGNI argument).  Once a 
feature is offered, it can never be taken away, even if it proves unhelpful in 
most situations or goes mostly unused.

In his 20-year retrospective, Hyndman expressed dismay that his paper had the 
opposite effect of what was intended: he had hoped for standardization on a 
single approach rather than a proliferation of all nine methods.  My experience 
in API design is that offering users too many choices complicates their lives, 
leading to suboptimal or incorrect choices and creating confusion.  That is 
likely why most software packages other than R offer only one or two options.

If you hold off, you can always add these options later.  We might just find 
that what we've got suffices for most everyday uses.  Also, I thought the 
spirit of the statistics module was to offer a few core statistical tools aimed 
at non-experts, deferring to external packages for richer collections of 
optimized, expert tools that cover every option.  For me, the best analogy is 
my two cameras.  One is a point-and-shoot that is easy to use and does a 
reasonable job.  The other is a professional SLR with hundreds of settings that 
I had to go to photography school to learn to use.

FWIW, I held off on adding "cut_points" because the normal use case is to get 
equally spaced quantiles.  It would be unusual to want 0.25 and 0.50 but not 
0.75.  The other reason is that user-provided cut points conflict with the core 
concept of "Divide *dist* into *n* continuous intervals with equal 
probability."  User-provided cut points also introduce other ways to go wrong 
(not being sorted, 0.0 or 1.0 not being valid for some methods, values outside 
the range 0.0 to 1.0).  The need for cut_points makes more sense for numpy or 
scipy, where it is common to pass around a linspace.  Everyday Python isn't 
like that.
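
As a sketch of why the equally spaced form suffices (again assuming the draft 
quantiles() signature), the common requests come straight from the n parameter, 
and a one-off percentile can still be read off a finer grid without a 
cut_points parameter:

    import statistics

    data = [10, 20, 30, 40, 50, 60, 70, 80, 90]

    # Equally spaced cut points come directly from the n parameter.
    statistics.quantiles(data, n=4)   # quartiles --> [25.0, 50.0, 75.0]
    statistics.quantiles(data, n=10)  # deciles: nine cut points

    # An arbitrary percentile (say the 95th) falls out of a finer
    # equally spaced grid.
    p95 = statistics.quantiles(data, n=100)[94]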

----------

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue36546>
_______________________________________