[issue45851] statistics.multimode is inefficient (time and space) (mode somewhat, too)

Stefan Pochmann Fri, 19 Nov 2021 17:42:28 -0800


New submission from Stefan Pochmann <stefan.pochm...@gmail.com>:


The current implementation is:

def multimode(data):
    counts = Counter(iter(data)).most_common()
    maxcount, mode_items = next(groupby(counts, key=itemgetter(1)), (0, []))
    return list(map(itemgetter(0), mode_items))

The most_common() does a complete sort of Counter item tuples, taking O(n log 
n) time and quite big O(n) extra space (mostly for all those tuples).

When Raymond Hettinger suggested it in 
https://bugs.python.org/issue35892#msg336338 he said it should have "running 
speed that is optimal for the desired result". But then he detailed that with 
"Slow O(n log n), loads all data in memory, full sort".

Which seems like a mistake, as that's not optimal. It's easy to do in O(n) time 
and O(1) extra memory (in addition to the Counter and result, I mean):

def multimode(data):
    counts = Counter(iter(data))
    if not counts:
        return []
    maxcount = max(counts.values())
    return [value for value, count in counts.items() if count == maxcount]

If there are only very few *different* values then the time/space after 
creating the Counter is insignificant compared to the Counter creation. But if 
there are many different values, it can be significant.

statistics.mode takes O(n) time and O(1) space, which is optimal, but I found 
an apparently faster way anyway (code at end).

For example for data = random.choices(range(n), k=n):

           |     multimode     |       mode
     n     | current  proposal | current  proposal
-----------+-------------------+------------------
         1 |    131%    70%    |    125%   58%
        10 |    144%    73%    |    119%   53%
       100 |    126%    71%    |    108%   29%
     1,000 |    123%    65%    |     62%   22%
    10,000 |    172%    55%    |     53%   18%
   100,000 |    164%    44%    |     55%   20%
 1,000,000 |     85%    20%    |     22%    4%
10,000,000 |     56%    12%    |     11%    4%

All four start with Counter(iter(data)), so I took that as baseline and the 
above results show relative additional times. For example 55% means if Counter 
construction alone took 10 seconds, the function took 15.5 seconds.

An extreme case, data = list(range(n)):

           |     multimode    |       mode
     n     | current proposal | current proposal
-----------+------------------+-----------------
         1 |   128%   67%     |   124%   56%
        10 |   187%   93%     |   141%   52%
       100 |   316%  149%     |   181%   45%
     1,000 |   380%  174%     |   213%   46%
    10,000 |   349%  111%     |   146%   30%
   100,000 |   397%  128%     |   159%   34%
 1,000,000 |   336%   95%     |   112%   24%
10,000,000 |   349%   97%     |   109%   23%

I also tried a bunch of other cases, didn't find one where my versions weren't 
quite a bit faster.

My mode() version:

from operator import indexOf
from itertools import islice

def mode(data):
    counts = Counter(iter(data))
    if not counts:
        raise StatisticsError('no mode for empty data') from None
    maxcount = max(counts.values())
    index = indexOf(counts.values(), maxcount)
    return next(islice(counts, index, None))

----------
components: Library (Lib)
messages: 406638
nosy: Stefan Pochmann
priority: normal
severity: normal
status: open
title: statistics.multimode is inefficient (time and space) (mode somewhat, too)
type: performance
versions: Python 3.10

_______________________________________
Python tracker <rep...@bugs.python.org>
<https://bugs.python.org/issue45851>
_______________________________________
_______________________________________________
Python-bugs-list mailing list
Unsubscribe: 
https://mail.python.org/mailman/options/python-bugs-list/archive%40mail-archive.com

[issue45851] statistics.multimode is inefficient (time and space) (mode somewhat, too)

Reply via email to