[issue35775] Add a general selection function to statistics

2020-03-09 Thread Rémi Lapeyre

Change by Rémi Lapeyre :


--
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35775] Add a general selection function to statistics

2019-05-24 Thread Rémi Lapeyre

Rémi Lapeyre  added the comment:

Hi Steven, thanks for taking the time to review my patch.

Regarding the relevance of adding select(), I was looking for work to do in the 
bug tracker and found some references to it 
(https://bugs.python.org/issue21592#msg219934 for example).

I knew that there are multiple definitions of percentiles, but I got sloppy in 
my previous response because I wanted to answer quickly. I will try not to do 
this again.


Regarding the use of sorting, I thought that sorting would be quicker than 
implementing a linear-time selection algorithm in pure Python, given the 
general performance of Timsort; some tests in 
https://bugs.python.org/issue21592 agreed with that.

For the iterator, I was thinking about how to implement percentiles when 
writing select() and thought that by writing:


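from itertools import islice
from statistics import StatisticsError

def _select(data, i, key=None):
    # Iterator over the sorted data, starting at the i-th smallest item (1-based).
    if not len(data):
        raise StatisticsError("select requires at least one data point")
    if not (1 <= i <= len(data)):
        raise StatisticsError(
            f"The index looked for must be between 1 and {len(data)}")
    data = sorted(data, key=key)
    return islice(data, i-1, None)

def select(data, i, key=None):
    # Return only the i-th smallest item.
    return next(_select(data, i, key=key))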


and then doing some variant of:

it = _select(data, i, key=key)
left, right = next(it), next(it)
# compute percentile with left and right

I could implement the quantiles without sorting the list multiple times. Now 
that quantiles() has been implemented by Raymond Hettinger, this is moot anyway.
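The "left and right" idea above can be sketched as follows. This is only an illustration under stated assumptions: `neighbours` and `interpolate` are hypothetical names that appear neither in the patch nor in the statistics module, and the linear interpolation shown is just one of the many possible rules discussed in this thread.

```python
from itertools import islice

def neighbours(data, i, key=None):
    """Return the i-th and (i+1)-th smallest items (1-based), sorting once.

    Hypothetical helper for illustration; not part of the statistics module.
    """
    it = islice(sorted(data, key=key), i - 1, None)
    left = next(it)
    # When i is the last rank there is no right neighbour, so reuse left.
    right = next(it, left)
    return left, right

def interpolate(left, right, fraction):
    """Linear interpolation between two adjacent order statistics."""
    return left + fraction * (right - left)

left, right = neighbours([5, 1, 4, 2, 3], 2)
midpoint = interpolate(left, right, 0.5)
```

Both neighbours come from a single sorted copy, which is the point of returning an iterator rather than a single item.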

Since it's probably not useful, feel free to disregard my PR.




[issue35775] Add a general selection function to statistics

2019-01-19 Thread Steven D'Aprano

Steven D'Aprano  added the comment:

Rémi, I've read over your patch and have some comments:

(1) You call sorted() to produce a list, but then instead of retrieving the 
item using ``data[i-1]`` you use ``itertools.islice``. That seems unnecessary 
to me. Do you have a reason for using ``islice``?

(2) select is not very useful on its own; we actually want it so we can 
calculate quantiles, e.g. percentiles, deciles, quartiles. If we want the 
k-quantiles (e.g. k=100 for percentiles) then there are k+1 k-quantiles in 
total, including the minimum and maximum. E.g. quartiles divide the data set 
into four equal sections, so there are five boundary values including the min 
and max.

So the caller is likely to be calling select repeatedly on the same data set, 
and hence making a copy of that data and sorting it repeatedly. If the data set 
is small, repeatedly making sorted copies is still cheap enough, but for large 
data sets, that will be expensive.

Do you have any thoughts on how to deal with that?
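One possible answer to the question above, sketched under assumptions: sort a single copy and read off every requested rank from it. The name `select_many` is hypothetical and is not part of the patch or of the statistics module; it raises `ValueError` only to keep the sketch self-contained.

```python
def select_many(data, ranks, key=None):
    """Return the items at the given 1-based ranks, sorting the data once.

    Hypothetical helper for illustration only.
    """
    ordered = sorted(data, key=key)  # the only copy and the only sort
    n = len(ordered)
    if n == 0:
        raise ValueError("select_many requires at least one data point")
    for i in ranks:
        if not 1 <= i <= n:
            raise ValueError(f"rank must be between 1 and {n}, got {i}")
    return [ordered[i - 1] for i in ranks]

# Minimum, middle item and maximum of five values from one sorted copy.
boundaries = select_many([9, 3, 7, 1, 5], [1, 3, 5])
```

The cost is one O(n log n) sort plus O(1) per requested rank, instead of one sort per call.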




[issue35775] Add a general selection function to statistics

2019-01-18 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> I'm very interested in adding quartiles and 
> general quantiles/fractiles, but I'm not 
> so sure that this select(data, index) function would be useful. 

I concur with Steven.




[issue35775] Add a general selection function to statistics

2019-01-18 Thread Steven D'Aprano

Steven D'Aprano  added the comment:

On Fri, Jan 18, 2019 at 11:13:41PM +, Rémi Lapeyre wrote:

> Wouldn't the 5th percentile be select(data, round(len(data)/20))?

Oh if only it were that simple!

Using the method you suggest, the 50th percentile is not the same as the 
median unless the length of the list is three more than a multiple of 
four. It also runs into problems for small lists where the index rounds 
down to zero.
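The two problems described above can be checked with a small experiment. This is a sketch under assumptions: `naive_percentile` is a hypothetical helper that generalises the suggested `round(len(data)/20)` to `round(len(data) * p / 100)`, uses 1-based ranks, and returns None when the rank falls outside the data. Python's round() rounds halves to the nearest even integer, which is what produces the pattern.

```python
from statistics import median

def naive_percentile(data, p):
    """p-th percentile as the item at rank round(len(data) * p / 100).

    Hypothetical helper for illustration; returns None when the computed
    1-based rank is out of range (for example when it rounds down to zero).
    """
    i = round(len(data) * p / 100)
    if not 1 <= i <= len(data):
        return None
    return sorted(data)[i - 1]

# List lengths for which the naive 50th percentile equals the median.
matching = [n for n in range(1, 13)
            if naive_percentile(list(range(n)), 50) == median(list(range(n)))]

# A small list where the rank for the 5th percentile rounds down to zero.
too_small = naive_percentile([1, 2, 3], 5)
```

For lengths 1 to 12, `matching` contains only 3, 7 and 11, i.e. the lengths that are three more than a multiple of four.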

Langford (2006) does a literature review and finds fifteen methods for 
calculating the quartiles (Q1, Q2, Q3), of which twelve are distinct and 
incompatible; Hyndman & Fan (1996) did similar for general quantiles and 
came up with nine, of which seven match Langford's.

I know of at least six other methods, which gives a total of 20 distinct 
ways of calculating quartiles or quantiles.

http://jse.amstat.org/v14n3/langford.html

https://robjhyndman.com/publications/quantiles/

I stress that these are not merely different algorithms which give the 
same answer, but different methods which sometimes disagree on their 
answers. So whichever method you use, some people are going to be 
annoyed or confused or both.
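As a small illustration of methods that disagree, the quantiles() function mentioned elsewhere in this thread (statistics.quantiles, Python 3.8+) offers two interpolation rules. The data set below is an arbitrary example, not taken from the discussion.

```python
from statistics import median, quantiles

data = [1, 2, 3, 4, 5, 6, 7]

# Quartiles computed with the two rules offered by statistics.quantiles().
exclusive = quantiles(data, n=4, method="exclusive")
inclusive = quantiles(data, n=4, method="inclusive")

print("exclusive:", exclusive)
print("inclusive:", inclusive)

# For this data both rules agree on the median but not on Q1 and Q3.
agree_on_median = exclusive[1] == inclusive[1] == median(data)
disagree_on_q1_q3 = (exclusive[0], exclusive[2]) != (inclusive[0], inclusive[2])
```

Neither result is wrong; each follows a different, documented definition of a quartile.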

http://mathforum.org/library/drmath/view/60969.html

Other statistics libraries provide a choice, e.g.:

- R and Octave provide the same 9 as H&F.
- Maple provides 6 of those, plus 2 others.
- Wessa provides 5 that match H&F, plus another 3.
- SAS provides 5.
- even Excel provides 2 different ways.

Statisticians don't even agree on which is the "best" method. H&F 
recommend their method number 8. Langford recommends his method 4. I 
think that your suggestion matches Langford's method 14, which is H&F's 
method 3.

Selecting the i-th item from a list is the easy part. Turning that into 
meaningful quantiles, percentiles etc is where it gets really hairy. My 
favourite quote on this comes from J Nash on the Gnumeric mailing list:

Ultimately, this question boils down to where to cut to
divide 4 candies among 5 children. No matter what you do,
things get ugly.




[issue35775] Add a general selection function to statistics

2019-01-18 Thread Rémi Lapeyre

Rémi Lapeyre  added the comment:

Wouldn't the 5th percentile be select(data, round(len(data)/20))?




[issue35775] Add a general selection function to statistics

2019-01-18 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

I'm very interested in adding quartiles and general quantiles/fractiles, but 
I'm not so sure that this select(data, index) function would be useful. Can you 
explain how you would use this?




[issue35775] Add a general selection function to statistics

2019-01-18 Thread Mark Dickinson


Change by Mark Dickinson :


--
nosy: +mark.dickinson




[issue35775] Add a general selection function to statistics

2019-01-18 Thread Rémi Lapeyre

Change by Rémi Lapeyre :


--
nosy: +rhettinger, steven.daprano




[issue35775] Add a general selection function to statistics

2019-01-18 Thread Rémi Lapeyre

Change by Rémi Lapeyre :


--
keywords: +patch, patch
pull_requests: +11337, 11338
stage:  -> patch review




[issue35775] Add a general selection function to statistics

2019-01-18 Thread Rémi Lapeyre

Change by Rémi Lapeyre :


--
keywords: +patch
pull_requests: +11337
stage:  -> patch review




[issue35775] Add a general selection function to statistics

2019-01-18 Thread Rémi Lapeyre

New submission from Rémi Lapeyre :

As discussed in #30999, the attached PR adds a general selection function to 
the statistics module. This makes it simple to get the element at a given 
quantile of a collection.

https://www.cs.rochester.edu/~gildea/csc282/slides/C09-median.pdf

--
components: Library (Lib)
messages: 333964
nosy: remi.lapeyre
priority: normal
severity: normal
status: open
title: Add a general selection function to statistics
type: enhancement
versions: Python 3.8
