[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-08-08 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset 5925b7d555bc36bd43ee8704ae75cc51900cf2d4 by Raymond Hettinger 
(Miss Islington (bot)) in branch '3.8':
bpo-35892: Add usage note to mode() (GH-15122) (GH-15176)
https://github.com/python/cpython/commit/5925b7d555bc36bd43ee8704ae75cc51900cf2d4


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-08-08 Thread miss-islington


Change by miss-islington :


--
pull_requests: +14907
pull_request: https://github.com/python/cpython/pull/15176

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-08-08 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset e43e7ed36480190083740fd75e2b9cdca72f1a68 by Raymond Hettinger in 
branch 'master':
bpo-35892: Add usage note to mode() (GH-15122)
https://github.com/python/cpython/commit/e43e7ed36480190083740fd75e2b9cdca72f1a68


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-08-04 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
pull_requests: +14862
pull_request: https://github.com/python/cpython/pull/15122

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-03-12 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
resolution:  -> fixed
stage: patch review -> resolved
status: open -> closed

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-03-12 Thread Raymond Hettinger


Raymond Hettinger  added the comment:


New changeset fc06a192fdc44225ef1cc879f615a81931ad0a85 by Raymond Hettinger in 
branch 'master':
bpo-35892: Fix mode() and add multimode() (#12089)
https://github.com/python/cpython/commit/fc06a192fdc44225ef1cc879f615a81931ad0a85


--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-03-11 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

Looks good to me, I'm happy to accept it.

Thank you for your efforts Raymond, can I trouble you to do the merge yourself 
please, I'm still having issues using the Github website.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-03-10 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Here's a text only link to the patch:  
https://patch-diff.githubusercontent.com/raw/python/cpython/pull/12089.patch

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-03-10 Thread Henry Chen


Change by Henry Chen :


--
nosy: +scotchka

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-03-10 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Steven, are you okay with applying this PR so we can put this to rest, cleanly 
and permanently?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-28 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

Attached a draft PR for discussion purposes.  Let me know what you think (I'm 
not wedded to any part of it).

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-28 Thread Raymond Hettinger


Change by Raymond Hettinger :


--
keywords: +patch
pull_requests: +12099
stage:  -> patch review

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-26 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

> Are you happy guaranteeing that it will always be the first
> mode encountered?

Yes.  

All of the other implementations I looked at make some guarantee about which 
mode is returned.  Maple, Matlab, and Excel all return the first encountered.¹  
That is convenient for us because it is what Counter(data).most_common(1) 
already does and does cheaply (single pass, no auxiliary memory).  It also 
matches what a number of our other tools do:

>>> max(3, 3.0)   # 3 is first encountered
3
>>> max(3.0, 3)   # 3.0 is first encountered
3.0
>>> list(dict.fromkeys('aabbaacc'))[0] # 'a' is first encountered
'a'
>>> sorted([3, 3.0])[0]  # 3 is first encountered (due to sort stability)
3
>>> sorted([3.0, 3])[0]  # 3.0 is first encountered (due to sort stability)
3.0

¹ Scipy returned the smallest value rather than first value but it algorithm 
was sorting based to accommodate running a parallel mode() computation on 
multiple columns of an array. For us, that approach would be much slow, would 
require more memory, and would require more bookkeeping. 

P.S. I'm no longer thinking this should be saved for Pycon sprints.  That is 
right at the beta 1 feature freeze.  We should aim to get this resolved well in 
advance of that.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-26 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

> Proposed spec:
> '''
> Modify the API statistics.mode to handle multimodal cases so that the 
> first mode encountered is the one returned.  If the input is empty, 
> raise a StatisticsError.

Are you happy guaranteeing that it will always be the first mode 
encountered? I'm not happy about it, but I'll follow you lead on this 
one.

> TestCases:
> mode([])   --> StatisticsError
> mode('aabbbcc') --> 'c'

That should be 'b'.

> mode(iter('aabbbcc')) --> 'c'

And again 'b'.

> mode('eeffgg') --> 'a'

If it is first encountered, that should be 'd'.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-25 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

>  If others agree that it is sufficiently easy, we can assign 
> the task to Cheryl.

It's only easy if we clearly specify what we want to occur.  Deciding what the 
right behavior should be is not a beginner skill.

Proposed spec:
'''
Modify the API statistics.mode to handle multimodal cases so that the first 
mode encountered is the one returned.  If the input is empty, raise a 
StatisticsError.

TestCases:
mode([])   --> StatisticsError
mode('aabbbcc') --> 'c'
mode(iter('aabbbcc')) --> 'c'
mode('eeffgg') --> 'a'

Implementation:
* Discard the internal _counts function.
* Instead use Counter(data).most_common(1)[0][0]
  because that makes only a single pass over the data

Documentation:
* Update statistics.rst and include a versionchanged directive

* In the Whatsnew compatibility section, note this is a behavior change.
  Code that used to raise StatisticsError will now return a useful value.
  Note that the rationale for the change is that the current mode()
  behavior would unexpectedly fail when given multimodal data.

When: 
* We want this for 3.8 so it can't wait very long
'''

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-25 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

What do people think about leaving this as an "Easy First Issue" for the 
sprint? If others agree that it is sufficiently easy, we can assign the task to 
Cheryl.

It should be fairly easy: mode calls an internal function _counts which is not 
public and not used anywhere else, so it can be changed completely or even 
discarded if needed.

--
nosy: +cheryl.sabella

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-25 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

Executive summary:

- let's change the behaviour of mode to return a single mode rather than raise 
an exception if there are multimodes;
- and let's do it without a depreciation period.


Further comments in no particular order:

I agree that in practice the current implementation of mode has turned out to 
be quite awkward.

Unfortunately I feel that just chosing one mode and returning it as "the" mode 
is statistically dubious (despite it being commonplace!). I've checked with 
some statisticians, or at least people who claim to be statisticians on the 
Internet, and they agree.

(The usual advice from statisticians is to always plot your data before talking 
about the mode or modes, as even local maxima can be meaningful.)

Nevertheless, Raymond's use-case of the nearest neighbor algorithm is 
persuasive. We should change the mode function to return one of the modes.

Which one? Again, people I've spoken to suggest that there's no statistical 
basis to picking any specific one (first found, last found, smallest, largest). 
I'm unhappy about documenting which mode is returned, as I expect it will be an 
implementation detail that could change in the future.

Raymond, are you happy to leave the specific detail of which mode is returned 
unspecified? Or do you have an argument for why we should specify it, and can 
do so without limiting future implementations?

I too strongly prefer a separate multimode function over a flag to the mode 
function.

As for backwards-compatibility, I've asked around trying to find anyone who 
claims that changing the behaviour would break their code. I haven't found 
anyone. I guess that means that unless any of the other core devs object, I'm 
happy to go with Raymond's recommendation to just change the behaviour without 
a depreciation period.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-22 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> shouldn't be "mode" at some point be replaced by "multimode" ?

No.  The signature is completely incompatible with the existing mode() function.

Like MS Excel which has two functions, MODE.SGNL and MODE.MULT, we should also 
have two functions, each with a clean signature and with running speed that is 
optimal for the desired result:

   mode(data: Iterable) -> object
   Returns a single value
   If multiple modes found, return first encountered
   Raises StatisticsError for an empty input
   Fast O(n), single pass, doesn't keep all data in memory

   multimode(data: Iterable) -> List[object]
   Always returns a list
   If multiple modes found, all are returned
   Returns an empty list for an empty input
   Slow O(n log n), loads all data in memory, full sort

For the first one, I recommend skipping deprecation and just changing the 
behavior that usually raises an exception for multiple modes.  In its current 
form, it likely to create bugs due to uncaught exceptions, and it doesn't 
provide any work-around if you do want only one of the modes.

By analogy, consider what would happen if we had a max() function that raised 
an exception when there were duplicate maximum values.  It would be almost 
usable.  So, what the real max() actually does is return the first encountered 
maximum value:

>>> max(3, 1, 3.0)
3
>>> max(3.0, 1, 3)
3.0

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-22 Thread Francis MB


Francis MB  added the comment:

Good options itemization!

>> This would give us a clean, fast API with no flags:
>>   mode(Iterable) -> scalar
>>   multimode(Iterable) -> list
[...]
>> For any of those options, we should still add a separate multimode()
>> function.
[..]
>> * Add a Deprecation warning to the current behavior of mode() when it
>> finds multimodal data.  Then change the behavior in Python 3.9.  This
>> is uncomfortable for one release, but does get us to the cleanest API
>> and best performance.

LGTM upto 3.9, shouldn't be "mode" at some point be replaced by
"multimode" ?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-19 Thread Francis MB


Francis MB  added the comment:

>> [...] This keeps the signature simple (Iterable -> Scalar). [...]
>>
>> Categorical, binned, or ordinal data:
>>
>>   mode(data: Iterable, *, first_tie=False) -> object
>>   multimode(data: Iterable) -> List[object]

This seems reasonable to me due legacy (although I really thing that
multimode is just the real thing :-) )

>> Continuous data:
>>   mode(data: Iterable[Real]) -> Real

What should return in that case: E.g.: mode([42.0, 42.0, 42.0, 1.0, 1.0, 1.0]) ?

42.0 ? or 1.0 ? or [42.0, 1.0] ? or do I have misunderstood something ? 


Thanks!

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-18 Thread Windson Yang


Windson Yang  added the comment:

I think right now we can 
> Change mode() to return the first tie instead of raising an exception.  This 
> is a behavior change but leaves you with the cleanest API going forward.

as well as 
> Add a Deprecation warning to the current behavior of mode() when it finds 
> multimodal data.  

In Python 3.9, we change the behavior.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-17 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

The attraction to having a first_tie=False flag is that it is backwards 
compatible with the current API.

However on further reflection, people would almost never want the existing 
behavior of raising an exception rather than returning a useful result.

So, we could make first_tie the default behavior and not have a flag at all.  
We would also add a separate multimode() function with a clean signature, 
always returning a list.  FWIW, MS Excel evolved to this solution as well (it 
has MODE.SGNL and MODE.MULT).

This would give us a clean, fast API with no flags:

mode(Iterable) -> scalar
multimode(Iterable) -> list  

ISTM this is an all round win:

* Fixes the severe usability problems with mode().  Currently, it can't be used 
without a try/except. The except-clause can't distinguish between an empty data 
condition and no unique mode condition, nor can the except clause access the 
dataset counts to get to a useful answer.

* It makes mode() memory friendly and faster.  Currently, mode() has to convert 
the counter items into a list() and run a full sort.  However, if we only need 
the first mode, we can feed the items iterator to max() and make only a single 
pass, never having to create a list.

However, there would be an incompatibility in the API change.  It is possible 
that a person really does want an exception instead of a value when there are 
multiple modes.  In that case, this would break their code so that they have to 
switch to multimode() as a remediation.  OTOH, it is also possible that people 
have existing code that uses mode() but their code has a latent bug because 
they weren't using a try/except and haven't yet encountered a multi-modal 
dataset. 

So here are some options:

* Change mode() to return the first tie instead of raising an exception.  This 
is a behavior change but leaves you with the cleanest API going forward.

* Add a Deprecation warning to the current behavior of mode() when it finds 
multimodal data.  Then change the behavior in Python 3.9.  This is 
uncomfortable for one release, but does get us to the cleanest API and best 
performance.

* Change mode() to have a flag where first_tie defaults to False.  This is 
backwards compatible, but it doesn't fix latent bugs, it leaves us with 
on-going API complexities, and the default case remains the least usable option.

* Leave mode() as-is and just document its limitations.

For any of those options, we should still add a separate multimode() function.

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-17 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> Did I miss something?

Yes.  It doesn't really matter which mode is returned as long as it is 
deterministically chosen.  We're proposing to return the first mode rather than 
the smallest mode.  

Scipy returns the smallest mode because that is convenient given that the 
underlying operation is np.unique() which returns unique values in sorted order 
[1]. 

We want to return the first mode encountered because that is convenient given 
that the underlying operation is max() which returns the first maximum value it 
encounters.

Another advantage of return-first rather than return-smallest is that our 
mode() would work for data values that are hashable but not orderable (i.e. 
frozensets).

[1] https://github.com/scipy/scipy/blob/v0.19.1/scipy/stats/stats.py#L378-L443

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-16 Thread Windson Yang


Windson Yang  added the comment:

I only tested stats.mode() from scipy, data = 'BBAAC' should also return 'A'. 
But in your code **return return Counter(seq).most_common(1)[0][0]** will 
return B. Did I miss something?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-16 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

I've been thinking about this a bit more.  ISTM that for now, it's best to get 
mode() working for everyday use, returning just the first mode encountered.  
This keeps the signature simple (Iterable -> Scalar).

In the future, if needed, there is room to add separate functions with their 
own clean signatures and use cases.  For example, having a separate multimode() 
that always returns a list would be clearer and easier to use than setting a 
flag on the current scalar valued mode() function. 

Another example would be mode_estimate() that returns a scalar float
that estimates the mode given data sampled from a continuous variable. This 
warrants its own function because the signature, use cases, options, and 
implementation method are all completely different from a mode based on 
counting.

Categorical, binned, or ordinal data:

   mode(data: Iterable, *, first_tie=False) -> object
   multimode(data: Iterable) -> List[object]

Continuous data:

   mode(data: Iterable[Real]) -> Real

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-16 Thread Raymond Hettinger


Raymond Hettinger  added the comment:

> We can return the smallest value from the **table** instead
> of the code below.

Internally, that does too much work and then throws most of it away.

The difference between Counter(data).most_common()[1] and 
Counter(data).most_common(1) is that the former materializes the entire items 
iterator into a list and does a full sort, while the latter is memory friendly 
and makes a single pass with min().

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-16 Thread Windson Yang


Windson Yang  added the comment:

IMHO, we don't need to add the option. We can return the smallest value from 
the **table** instead of the code below.

if len(table) == 1:
return table[0][0]

[1] https://github.com/python/cpython/blob/master/Lib/statistics.py#L502

--
nosy: +Windson Yang

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-16 Thread Francis MB

Francis MB  added the comment:

Good point Raymond!

Only a minor observation on the packages API:  


[1] SciPy: scipy.stats.mode(a, axis=0, nan_policy='propagate')
"Returns an array of the modal (most common) **value** in the passed array." 
--> Here it claims to return just ONE value

And use of different policies on parameters :
nan_policy : {‘propagate’, ‘raise’, ‘omit’}, optional
Defines how to handle when input contains nan. ‘propagate’ returns nan, ‘raise’ 
throws an error, ‘omit’ performs the calculations ignoring nan values. Default 
is ‘propagate’.

Equivalent one could say 'multivalue_policy'


[2] Matlab: Mode: "Most frequent **values** in array"

...returns the sample mode of A, which is the most frequently occurring *value* 
in A"

IMHO it seems inconsistent *values* vs. *value* (or a doc-bug ?).


An a question:
Does it that mean that mode in that case really should potentially return an 
array of values, e.g. all the values with equal frequency?

In that case the user has the chance to get the first, the last or just all, ...



[1] 
https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.mode.html

[2] https://la.mathworks.com/help/matlab/ref/mode.html

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-16 Thread Raymond Hettinger

Raymond Hettinger  added the comment:

I would stick with "first_tie=False".  That caters to the common case, avoids 
API complications, and does something similar to what other stats packages are 
doing.

- API Survey 

Maple: """This function is only guaranteed to return one potential mode - in 
cases where multiple modes exist, only the first detected mode is guaranteed to 
be returned.""" -- 
https://www.maplesoft.com/support/help/maple/view.aspx?path=Statistics%2fMode

R: 'R does not have a standard in-built function to calculate mode.' -- 
https://www.tutorialspoint.com/r/r_mean_median_mode.htm

Matlab: "When there are multiple values occurring equally frequently, mode 
returns the smallest of those values. ... If A is an empty 0-by-0 matrix, 
mode(A) returns NaN." -- https://www.mathworks.com/help/matlab/ref/mode.html

Mathematica: "The mode of a set of data is implemented in the Wolfram Language 
as Commonest[data]. ... When several elements occur with equal frequency, 
Commonest[list,…] picks first the ones that occur first in list." -- 
http://mathworld.wolfram.com/Mode.html and 
https://reference.wolfram.com/language/ref/Commonest.html

SciPy: "If there is more than one such value, only the smallest is returned." 
-- 
https://docs.scipy.org/doc/scipy-0.19.1/reference/generated/scipy.stats.mode.html

SAS: "Most frequent value (if not unique, the smallest mode)" -- 
https://documentation.sas.com/?docsetId=qcug=qcug_capability_sect225.htm=14.2=en

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-10 Thread Steven D'Aprano


Steven D'Aprano  added the comment:

Thanks Raymond for the interesting use-case.

The original design of mode() was support only the basic form taught in 
secondary schools, namely a single unique mode for categorical data or 
discrete numerical data.

I think it is time to consider a richer interface to support more uses, 
such as estimating the mode of continuous numerical data, and the 
multi-mode case you bring up.

One reason I've been hesitant is that deciding what is the right 
behaviour is quite hard (or at least it has been for me). I think there 
are a few cases to consider:

- the existing behaviour (which may not be very useful?) which is to
  raise an exception unless the mode is unique; 

- would it be more useful to return an out-of-band value like a NAN 
  or None?

- the multi-mode case where you want some arbitrary(?) mode, possibly
  the left-most (smallest) for numeric data;

- the multi-mode case where you want all the modes.

I like Francis' suggestion to use an enum to select the behavioural, er, 
mode (pun intended). What do you think?

--

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-10 Thread Francis MB


Francis MB  added the comment:

>> There may be better names for the flag.  "tie_goes_to_first_encountered" 
>> seemed a bit long though ;-)

Could it may be an alternative to set the mode tie case in a form like:

def mode(seq, *, case=CHOOSE_FIRST):
  [...]

(or TIE_CHOOSE_FIRST, ...) where CHOOSE_FIRST is an enum ?

Thanks!

--
nosy: +francismb

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com



[issue35892] Fix awkwardness of statistics.mode() for multimodal datasets

2019-02-03 Thread Raymond Hettinger

New submission from Raymond Hettinger :

The current code for mode() does a good deal of extra work to support its two 
error outcomes (empty input and multimodal input).  That latter case is 
informative but doesn't provide any reasonable way to find just one of those 
modes, where any of the most popular would suffice.  This arises in nearest 
neighbor algorithms for example. I suggest adding an option to the API:

   def mode(seq, *, first_tie=False):   
   if tie_goes_to_first:
   # CHOOSE FIRST x ∈ S | ∄ y ∈ S : x ≠ y ∧ count(y) > count(x)
   return return Counter(seq).most_common(1)[0][0]
   ...

Use it like this:

>>> data = 'ABBAC'
>>> assert mode(data, first_tie=True) == 'A'

With the current API, there is no reasonable way to get to 'A' from 'ABBAC'.

Also, the new code path is much faster than the existing code path because it 
extracts only the 1 most common using min() rather than the n most common which 
has to sort the whole items() list.  New path: O(n).  Existing path: O(n log n).

Note, the current API is somewhat awkward to use.  In general, a user can't 
know in advance that the data only contains a single mode.  Accordingly, every 
call to mode() has to be wrapped in a try-except.  And if the user just wants 
one of those modal values, there is no way to get to it.  See 
https://docs.scipy.org/doc/scipy/reference/generated/scipy.stats.mode.html for 
comparison.

There may be better names for the flag.  "tie_goes_to_first_encountered" seemed 
a bit long though ;-)

--
assignee: steven.daprano
components: Library (Lib)
messages: 334796
nosy: rhettinger, steven.daprano
priority: normal
severity: normal
status: open
title: Fix awkwardness of statistics.mode() for multimodal datasets
type: behavior
versions: Python 3.8

___
Python tracker 

___
___
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com