Vilya Harvey wrote:
2009/7/4 Andre Engels <andreeng...@gmail.com>:
On Sat, Jul 4, 2009 at 9:33 AM, mclovin<hanoo...@gmail.com> wrote:
Currently I need to find the most common elements in thousands of
arrays within one large array (arround 2 million instances with ~70k
unique elements)...
Try flattening the arrays into a single large array & sorting it. Then
you can just iterate over the large array counting as you go; you only
ever have to insert into the dict once for each value and there's no
lookups in the dict....

Actually the next step is to maintain a min-heap as you run down the
sorted array.  Something like:

import numpy as np
import heapq


def frequency(arr):
    '''Generate frequency-value pairs from a numpy array'''
    clustered = arr.flatten() # copy (so can safely sort)
    clustered.sort() # Bring identical values together
    scanner = iter(clustered)
    last = scanner.next()
    count = 1
    for el in scanner:
        if el == last:
            count += 1
        else:
            yield count, last
            last = el
            count = 1
    yield count, last


def most_frequent(arr, N):
    '''Return the top N (freq, val) elements in arr'''
    counted = frequency(arr) # get an iterator for freq-val pairs
    heap = []
    # First, just fill up the array with the first N distinct
    for i in range(N):
        try:
            heap.append(counted.next())
        except StopIteration:
            break # If we run out here, no need for a heap
    else:
        # more to go, switch to a min-heap, and replace the least
        # element every time we find something better
        heapq.heapify(heap)
        for pair in counted:
            if pair > heap[0]:
                heapq.heapreplace(heap, pair)
    return sorted(heap, reverse=True) # put most frequent first.


--Scott David Daniels
scott.dani...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list

Reply via email to