Re: finding most common elements between thousands of multiple arrays.

2009-07-10 Thread Scott David Daniels

Raymond Hettinger wrote:

[Scott David Daniels]

def most_frequent(arr, N): ...

In Py2.4 and later, see heapq.nlargest().

I should have remembered this one.


In Py3.1, see collections.Counter(data).most_common(n)

This one is from Py3.2, I think.


--Scott David Daniels
scott.dani...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-08 Thread Raymond Hettinger
[Scott David Daniels]
> def most_frequent(arr, N):
>      '''Return the top N (freq, val) elements in arr'''
>      counted = frequency(arr) # get an iterator for freq-val pairs
>      heap = []
>      # First, just fill up the array with the first N distinct
>      for i in range(N):
>          try:
>              heap.append(counted.next())
>          except StopIteration:
>              break # If we run out here, no need for a heap
>      else:
>          # more to go, switch to a min-heap, and replace the least
>          # element every time we find something better
>          heapq.heapify(heap)
>          for pair in counted:
>              if pair > heap[0]:
>                  heapq.heapreplace(heap, pair)
>      return sorted(heap, reverse=True) # put most frequent first.

In Py2.4 and later, see heapq.nlargest().
In Py3.1, see collections.Counter(data).most_common(n)
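
For example (a sketch, assuming the frequency() generator from the earlier
post, the OP's my_array, and N as the number of results wanted):

import heapq
from collections import Counter   # Counter is new in Py2.7/3.1

# with the frequency() generator quoted above:
top = heapq.nlargest(N, frequency(arr))

# or counting directly over the array-of-arrays:
top = Counter(int(x) for arr in my_array for x in arr).most_common(N)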


Raymond
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-07 Thread Andrew Henshaw

"mclovin"  wrote in message 
news:c5332c9b-2348-4194-bfa0-d70c77107...@x3g2000yqa.googlegroups.com...
> Currently I need to find the most common elements in thousands of
> arrays within one large array (arround 2 million instances with ~70k
> unique elements)
>
> so I set up a dictionary to handle the counting so when I am
> iterating  I up the count on the corrosponding dictionary element. I
> then iterate through the dictionary and find the 25 most common
> elements.
>
> the elements are initially held in a array within an array. so I am am
> just trying to find the most common elements between all the arrays
> contained in one large array.
> my current code looks something like this:
> d = {}
> for arr in my_array:
> -for i in arr:
> #elements are numpy integers and thus are not accepted as dictionary
> keys
> ---d[int(i)]=d.get(int(i),0)+1
>
> then I filter things down. but with my algorithm that only takes about
> 1 sec so I dont need to show it here since that isnt the problem.
>
>
> But there has to be something better. I have to do this many many
> times and it seems silly to iterate through 2 million things just to
> get 25. The element IDs are integers and are currently being held in
> numpy arrays in a larger array. this ID is what makes up the key to
> the dictionary.
>
> It currently takes about 5 seconds to accomplish this with my current
> algorithm.
>
> So does anyone know the best solution or algorithm? I think the trick
> lies in matrix intersections but I do not know.

Would the following work for you, or am I missing something?  For a 5Kx5K 
array, this takes about a tenth of a second on my machine.  This code 
doesn't deal with the sub-array issue.

#
import numpy
import time

LOWER = 0
UPPER = 1024
SIZE = 5000
NUM_BEST = 4

# sample data
data = numpy.random.randint(LOWER, UPPER, (SIZE,SIZE)).astype(int)

time.clock()
count = numpy.bincount(data.flat)
best = sorted(zip(count, range(len(count))))[-NUM_BEST:]
print 'time=', time.clock()
print best
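
One caveat, in case it matters for the real data: numpy.bincount wants
non-negative integers and allocates one bin per value up to the maximum.
If sorting every bin ever shows up in a profile, a cheaper last step might
be (a sketch, using the same count array as above):

best_vals = numpy.argsort(count)[-NUM_BEST:]  # values with the largest counts, ascending
print [(count[v], int(v)) for v in best_vals]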



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-06 Thread Scott David Daniels

Peter Otten wrote:

Scott David Daniels wrote:


Scott David Daniels wrote:



 t = timeit.Timer('sum(part[:-1]==part[1:])',
  'from __main__ import part')


What happens if you calculate the sum in numpy? Try

t = timeit.Timer('(part[:-1]==part[1:]).sum()',
 'from __main__ import part')


Good idea, I hadn't thought of adding numpy bools.
(part[:-1]==part[1:]).sum()
is only a slight improvement over
len(part[part[:-1]==part[1:]])
when there are few elements, but it is almost twice
as fast when there are a lot (reflecting the work
of allocating and copying).

>>> import numpy
>>> import timeit
>>> original = numpy.random.normal(0, 100, (1000, 1000)).astype(int)
>>> data = original.flatten()
>>> data.sort()
>>> t = timeit.Timer('sum(part[:-1]==part[1:])',
 'from __main__ import part')
>>> u = timeit.Timer('len(part[part[:-1]==part[1:]])',
 'from __main__ import part')
>>> v = timeit.Timer('(part[:-1]==part[1:]).sum()',
 'from __main__ import part')

>>> part = data[::100]
>>> (part[:-1]==part[1:]).sum()
9390
>>> t.repeat(3, 10)
[0.56368281443587875, 0.55615057220961717, 0.55465764503594528]
>>> u.repeat(3, 1000)
[0.89576580263690175, 0.89276374511291579, 0.8937328626963108]
>>> v.repeat(3, 1000)
[0.24798598704592223, 0.24715431709898894, 0.24498979618920202]
>>>
>>> part = original.flatten()[::100]
>>> (part[:-1]==part[1:]).sum()
27
>>> t.repeat(3, 10)
[0.57576898739921489, 0.56410158274297828, 0.56988248506445416]
>>> u.repeat(3, 1000)
[0.27312186325366383, 0.27315007913011868, 0.27214492344683094]
>>> v.repeat(3, 1000)
[0.28410342655297427, 0.28374053126867693, 0.28318990262732768]
>>>

Net result: go back to former definition of candidates (a number,
not the actual entries), but calculate that number as matches.sum(),
not len(part[matches]).

Now the latest version of this (compressed) code:
> ...
> sampled = data[::stride]
> matches = sampled[:-1] == sampled[1:]
> candidates = sum(matches)  # count identified matches
> while candidates > N * 10:  # 10 -- heuristic
>     stride *= 2  # heuristic increase
>     sampled = data[::stride]
>     matches = sampled[:-1] == sampled[1:]
>     candidates = sum(matches)
> while candidates < N * 3:  # heuristic slop for long runs
>     stride //= 2  # heuristic decrease
>     sampled = data[::stride]
>     matches = sampled[:-1] == sampled[1:]
>     candidates = sum(matches)
> former = None
> past = 0
> for value in sampled[matches]:
>     ...
is:
  ...
  sampled = data[::stride]
  matches = sampled[:-1] == sampled[1:]
  candidates = matches.sum()  # count identified matches
  while candidates > N * 10:  # 10 -- heuristic
      stride *= 2  # heuristic increase
      sampled = data[::stride]
      matches = sampled[:-1] == sampled[1:]
      candidates = matches.sum()
  while candidates < N * 3:  # heuristic slop for long runs
      stride //= 2  # heuristic decrease
      sampled = data[::stride]
      matches = sampled[:-1] == sampled[1:]
      candidates = matches.sum()
  former = None
  past = 0
  for value in sampled[matches]:
      ...

Now I think I can let this problem go, esp. since it was
mclovin's problem in the first place.

--Scott David Daniels
scott.dani...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-05 Thread Peter Otten
Scott David Daniels wrote:

> Scott David Daniels wrote:

>  t = timeit.Timer('sum(part[:-1]==part[1:])',
>   'from __main__ import part')

What happens if you calculate the sum in numpy? Try

t = timeit.Timer('(part[:-1]==part[1:]).sum()',
 'from __main__ import part')


Peter

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-05 Thread Steven D'Aprano
On Sun, 05 Jul 2009 17:30:58 -0700, Scott David Daniels wrote:

> Summary: when dealing with numpy, (or any bulk <-> individual values
> transitions), try several ways that you think are equivalent and
> _measure_.

This advice is *much* more general than numpy -- it applies to any 
optimization exercise. People's intuitions about what's fast and what's 
slow are often very wrong.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-05 Thread Scott David Daniels

Scott David Daniels wrote:

... Here's a heuristic replacement for my previous frequency code:
I've tried to mark where you could fudge numbers if the run time
is at all close.


Boy, I cannot let go.  I did a bit of a test checking the cost of
calculating the number of discovered samples, and found, after:
import timeit
import numpy
original = numpy.random.normal(0, 100, (1000, 1000)).astype(int)
data = original.flatten()
data.sort()
part = data[::100]
t = timeit.Timer('sum(part[:-1]==part[1:])',
 'from __main__ import part')
v = timeit.Timer('len(part[part[:-1]==part[1:]])',
 'from __main__ import part')

I got:
>>> t.repeat(3, 10)
[0.58319842326318394, 0.57617574300638807, 0.57831819407238072]
>>> v.repeat(3, 1000)
[0.93933027801040225, 0.93704535073584339, 0.94096260837613954]

So, len(part[mask]) is almost 50X faster!  I checked:
>>> sum(part[:-1]==part[1:])
9393
>>> len(part[part[:-1]==part[1:]])
9393

That's an awful lot of matches, so I tried again with higher selectivity:
data = original.flatten()  # no sorting, so runs missing
part = data[::100]

>>> t.repeat(3, 10)
[0.58641335700485797, 0.58458854407490435, 0.58872594142576418]
>>> v.repeat(3, 1000)
[0.27352554584422251, 0.27375686015921019, 0.27433291102624935]

about 200X faster

>>> len(part[part[:-1]==part[1:]])
39
>>> sum(part[:-1]==part[1:])
39

So my new version of this (compressed) code:

...
sampled = data[::stride]
matches = sampled[:-1] == sampled[1:]
candidates = sum(matches)  # count identified matches
while candidates > N * 10:  # 10 -- heuristic
    stride *= 2  # heuristic increase
    sampled = data[::stride]
    matches = sampled[:-1] == sampled[1:]
    candidates = sum(matches)
while candidates < N * 3:  # heuristic slop for long runs
    stride //= 2  # heuristic decrease
    sampled = data[::stride]
    matches = sampled[:-1] == sampled[1:]
    candidates = sum(matches)
former = None
past = 0
for value in sampled[matches]:
    ...

is:
  ...
  sampled = data[::stride]
  candidates = sampled[sampled[:-1] == sampled[1:]]
  while len(candidates) > N * 10:  # 10 -- heuristic
      stride *= 2  # heuristic increase
      sampled = data[::stride]
      candidates = sampled[sampled[:-1] == sampled[1:]]
  while len(candidates) < N * 3:  # heuristic slop for long runs
      stride //= 2  # heuristic decrease
      sampled = data[::stride]
      candidates = sampled[sampled[:-1] == sampled[1:]]
  former = None
  past = 0
  for value in candidates:
      ...
This change is important, for we try several strides before
settling on a choice, meaning the optimization can be valuable.
This also means we could be pickier at choosing strides (try
more values), since checking is cheaper than before.

Summary: when dealing with numpy, (or any bulk <-> individual values
transitions), try several ways that you think are equivalent and
_measure_.  In the OODB work I did we called this "impedance mismatch,"
and it is likely some boundary transitions are _much_ faster than
others.  The sum case is one of them; I am getting numpy booleans
back, rather than Python booleans, so the conversions aren't taking the fast path.

--Scott David Daniels
scott.dani...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Steven D'Aprano
On Sat, 04 Jul 2009 07:19:48 -0700, Scott David Daniels wrote:

> Actually the next step is to maintain a min-heap as you run down the
> sorted array.  Something like:

Not bad.

I did some tests on it, using the following sample data:

arr = np.array([xrange(i, i+7000) for i in xrange(143)] + 
  [[750]*7000] + [xrange(3*i, 3*i+7000) for i in xrange(142)])


and compared your code against the following simple function:


def count(arr, N):
    D = {}
    for v in arr:
        for x in v:
            D[x] = D.get(x, 0) + 1
    freq = []
    for el, f in D.iteritems():
        freq.append((f, el))
    return sorted(freq, reverse=True)[:N]


As a rough figure, your min-heap code is approximately twice as fast as 
mine.

To the OP: I think you should start profiling your code and find out 
exactly *where* it is slow and concentrate on that. I think that trying a 
heuristic to estimate the most frequent elements by taking a random 
sample is likely to be a mistake -- whatever you're trying to accomplish 
with the frequency counts, the use of such a heuristic will mean that 
you're only approximately accomplishing it.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Steven D'Aprano
On Sat, 04 Jul 2009 15:06:29 -0700, mclovin wrote:

> like I said I need to do this 480,000 times so to get this done
> realistically I need to analyse about 5 a second. It appears that the
> average matrix size contains about 15 million elements.

Have you considered recording the element counts as you construct the 
arrays? This will marginally increase the time it takes to build the 
array, but turn finding the most frequent elements into a very quick 
operation.
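
A minimal sketch of that idea (the helper names are made up for
illustration; hook them into wherever the arrays actually get built):

import heapq
from collections import defaultdict

running = defaultdict(int)            # running tally shared by all the sub-arrays

def add_array(arr):
    '''Fold a newly built sub-array into the running counts.'''
    for x in arr:
        running[int(x)] += 1

def top_n(n=25):
    '''Nearly free at query time; the counting was paid for during construction.'''
    return heapq.nlargest(n, ((freq, val) for val, freq in running.iteritems()))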



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Emile van Sebille

On 7/4/2009 12:33 AM mclovin said...

Currently I need to find the most common elements in thousands of
arrays within one large array (arround 2 million instances with ~70k
unique elements)

so I set up a dictionary to handle the counting so when I am
iterating  I 


  ** up the count on the corrosponding dictionary element **

Right at this point, instead of or in addition to counting, why not save 
the large array index in a list?  Then when you've identified the 25 
most common elements you'll already have a list of pointers to the 
instances to work from.
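
Something like this sketch of the idea (names invented for illustration):

from collections import defaultdict

locations = defaultdict(list)          # value -> list of (sub-array index, position)
for row, arr in enumerate(my_array):
    for col, v in enumerate(arr):
        locations[int(v)].append((row, col))
# the count of any value is just len(locations[v]),
# and the places it occurs come along for free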


Emile


--
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread MRAB

mclovin wrote:

On Jul 4, 3:29 pm, MRAB  wrote:

mclovin wrote:

[snip]


like I said I need to do this 480,000 times so to get this done
realistically I need to analyse about 5 a second. It appears that the
average matrix size contains about 15 million elements.
I threaded my program using your code and I did about 1,000 in an hour
so it is still much too slow.
When I selected 1 million random elements to count, 8 out of the top
10 of those were in the top 25 of the precise way and 18 of the 25
were in the top 25 of the precise way. so I suppose that could be an
option.

The values are integers, aren't they? What is the range of values?


There are appox 550k unique values with a range of 0-2million with
gaps.


I've done a little experimentation with lists (no numpy involved) and
found that I got a x2 speed increase if I did the counting using a list,
something like this:

counts = [0] * 2000001  # one slot per possible value (the values range 0-2 million)
for x in values:
    counts[x] += 1
# keep only the values that actually occurred
counts = dict(e for e in enumerate(counts) if e[1] != 0)
--
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Scott David Daniels

mclovin wrote:

On Jul 4, 12:51 pm, Scott David Daniels  wrote:

mclovin wrote:

OK then. I will try some of the strategies here but I guess things
arent looking too good. I need to run this over a dataset that someone
pickled. I need to run this 480,000 times so you can see my
frustration. So it doesn't need to be "real time" but it would be nice
it was done sorting this month.
Is there a "bet guess" strategy where it is not 100% accurate but much
faster?


Well, I timed a run of a version of mine, and the scan is approx 5X
longer than the copy-and-sort.  Time arr_of_arr.flatten().sort() to
see how quickly the copy and sort happens.So you could try a variant
exploiting the following property:
 If you know the minimum length of a run that will be in the top 25,
then the value for each of the most-frequent run entries must show up at
positions n * stride and (n + 1) * stride (for some n).  That should
drastically reduce the scan cost, as long as stride is reasonably large

sum(flattened[:-stride:stride] == flattened[stride::stride]) == 1000
So there are only 1000 points to investigate.
With any distribution other than uniform, that should go _way_ down.
So just pull out those points, use bisect to get their frequencies, and
feed those results into the heap accumulation.

--Scott David Daniels


I dont quite understand what you are saying but I know this: the times
the most common element appears varies greatly. Sometimes it appears
over 1000 times, and some times it appears less than 50. It all
depends on the density of the arrays I am analyzing.


Here's a heuristic replacement for my previous frequency code:
I've tried to mark where you could fudge numbers if the run time
is at all close.

import bisect

def frequency(arr_of_arr, N, stride=100):
    '''produce (freq, value) pairs for data in arr_of_arr.

    Tries to produce > N pairs.  stride is a guess at half
    the length of the shortest run in the top N runs.
    '''
    # if the next two lines are too slow, this whole approach is toast
    data = arr_of_arr.flatten()  # big allocation
    data.sort()  # a couple of seconds for 25 million ints

    # stride is a length forcing examination of a run.
    sampled = data[::stride]
    # Note this is a view into data, and is still sorted.
    # We know that any run of length 2 * stride - 1 in data _must_ have
    # consecutive entries in sampled.  Compare them "in parallel"
    matches = sampled[:-1] == sampled[1:]
    # matches is True or False for stride-separated values from sampled
    candidates = sum(matches)  # count identified matches

    # while candidates is huge, keep trying with a larger stride
    while candidates > N * 10:  # 10 -- heuristic
        stride *= 2  # heuristic increase
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)

    # if we got too few, move stride down:
    while candidates < N * 3:  # heuristic slop for long runs
        stride //= 2  # heuristic decrease
        sampled = data[::stride]
        matches = sampled[:-1] == sampled[1:]
        candidates = sum(matches)

    # Here we have a "nice" list of candidates that is likely
    # to include every run we actually want. sampled[matches] is
    # the sorted list of candidate values.  It may have duplicates
    former = None
    past = 0
    # In the loop here we only use sampled to pick the values we
    # then go find in data.  We avoid checking for the same value twice
    for value in sampled[matches]:
        if value == former:
            continue  # A long run: multiple matches in sampled
        former = value  # Make sure we only try this one once
        # find the beginning of the run
        start = bisect.bisect_left(data, value, past)
        # find the end of the run (we know it is at least stride long)
        past = bisect.bisect_right(data, value, start + stride)
        yield past - start, value  # produce frequency, value data

--Scott David Daniels
scott.dani...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread mclovin
On Jul 4, 3:29 pm, MRAB  wrote:
> mclovin wrote:
>
> [snip]
>
> > like I said I need to do this 480,000 times so to get this done
> > realistically I need to analyse about 5 a second. It appears that the
> > average matrix size contains about 15 million elements.
>
> > I threaded my program using your code and I did about 1,000 in an hour
> > so it is still much too slow.
>
> > When I selected 1 million random elements to count, 8 out of the top
> > 10 of those were in the top 25 of the precise way and 18 of the 25
> > were in the top 25 of the precise way. so I suppose that could be an
> > option.
>
> The values are integers, aren't they? What is the range of values?

There are approx 550k unique values with a range of 0-2 million with
gaps.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread MRAB

mclovin wrote:
[snip]

like I said I need to do this 480,000 times so to get this done
realistically I need to analyse about 5 a second. It appears that the
average matrix size contains about 15 million elements.

I threaded my program using your code and I did about 1,000 in an hour
so it is still much too slow.

When I selected 1 million random elements to count, 8 out of the top
10 of those were in the top 25 of the precise way and 18 of the 25
were in the top 25 of the precise way. so I suppose that could be an
option.


The values are integers, aren't they? What is the range of values?
--
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread mclovin
On Jul 4, 12:51 pm, Scott David Daniels  wrote:
> mclovin wrote:
> > OK then. I will try some of the strategies here but I guess things
> > arent looking too good. I need to run this over a dataset that someone
> > pickled. I need to run this 480,000 times so you can see my
> > frustration. So it doesn't need to be "real time" but it would be nice
> > it was done sorting this month.
>
> > Is there a "bet guess" strategy where it is not 100% accurate but much
> > faster?
>
> Well, I timed a run of a version of mine, and the scan is approx 5X
> longer than the copy-and-sort.  Time arr_of_arr.flatten().sort() to
> see how quickly the copy and sort happens.So you could try a variant
> exploiting the following property:
>      If you know the minimum length of a run that will be in the top 25,
> then the value for each of the most-frequent run entries must show up at
> positions n * stride and (n + 1) * stride (for some n).  That should
> drastically reduce the scan cost, as long as stride is reasonably large.
>
> For my uniformly distributed 0..1024 values in 5M x 5M array,
> About 2.5 sec to flatten and sort.
> About 15 sec to run one of my heapish thingies.
> the least frequency encountered: 24716
> so, with stride at
>
> sum(flattened[:-stride:stride] == flattened[stride::stride]) == 1000
> So there are only 1000 points to investigate.
> With any distribution other than uniform, that should go _way_ down.
> So just pull out those points, use bisect to get their frequencies, and
> feed those results into the heap accumulation.
>
> --Scott David Daniels

I don't quite understand what you are saying but I know this: the times
the most common element appears varies greatly. Sometimes it appears
over 1000 times, and sometimes it appears less than 50. It all
depends on the density of the arrays I am analyzing.

like I said I need to do this 480,000 times so to get this done
realistically I need to analyse about 5 a second. It appears that the
average matrix size contains about 15 million elements.

I threaded my program using your code and I did about 1,000 in an hour
so it is still much too slow.

When I selected 1 million random elements to count, 8 out of the top
10 of those were in the top 25 of the precise way and 18 of the 25
were in the top 25 of the precise way. so I suppose that could be an
option.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Scott David Daniels

mclovin wrote:

OK then. I will try some of the strategies here but I guess things
arent looking too good. I need to run this over a dataset that someone
pickled. I need to run this 480,000 times so you can see my
frustration. So it doesn't need to be "real time" but it would be nice
it was done sorting this month.

Is there a "bet guess" strategy where it is not 100% accurate but much
faster?


Well, I timed a run of a version of mine, and the scan is approx 5X
longer than the copy-and-sort.  Time arr_of_arr.flatten().sort() to
see how quickly the copy and sort happens.So you could try a variant
exploiting the following property:
If you know the minimum length of a run that will be in the top 25,
then the value for each of the most-frequent run entries must show up at
positions n * stride and (n + 1) * stride (for some n).  That should
drastically reduce the scan cost, as long as stride is reasonably large.

For my uniformly distributed 0..1024 values in 5M x 5M array,
About 2.5 sec to flatten and sort.
About 15 sec to run one of my heapish thingies.
the least frequency encountered: 24716
so, with stride at

sum(flattened[:-stride:stride] == flattened[stride::stride]) == 1000
So there are only 1000 points to investigate.
With any distribution other than uniform, that should go _way_ down.
So just pull out those points, use bisect to get their frequencies, and 
feed those results into the heap accumulation.


--Scott David Daniels
--
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Lie Ryan
mclovin wrote:
> OK then. I will try some of the strategies here but I guess things
> arent looking too good. I need to run this over a dataset that someone
> pickled. I need to run this 480,000 times so you can see my
> frustration. So it doesn't need to be "real time" but it would be nice
> it was done sorting this month.
> 
> Is there a "bet guess" strategy where it is not 100% accurate but much
> faster?

Heuristics?

If you don't need 100% accuracy, you can simply sample 1 or so
element and find the most common element in this small sample space. It
should be much faster, though you'll probably need to determine the best
cutoff number (too small and you're risking biases, too large and it
would be slower). random.sample() might be useful here.
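
A rough sketch of the idea (this samples with replacement via randrange
rather than random.sample, to avoid building a flattened copy; note that
picking a sub-array first weights short and long sub-arrays equally):

import random
from collections import defaultdict

def approx_top_n(arr_of_arr, n=25, sample_size=10000):  # sample_size is the cutoff to tune
    counts = defaultdict(int)
    for _ in xrange(sample_size):
        arr = arr_of_arr[random.randrange(len(arr_of_arr))]   # a random sub-array
        counts[int(arr[random.randrange(len(arr))])] += 1     # a random element of it
    return sorted(counts, key=counts.get, reverse=True)[:n]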
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Vilya Harvey
2009/7/4 Steven D'Aprano :
> On Sat, 04 Jul 2009 13:42:06 +, Steven D'Aprano wrote:
>
>> On Sat, 04 Jul 2009 10:55:44 +0100, Vilya Harvey wrote:
>>
>>> 2009/7/4 Andre Engels :
 On Sat, Jul 4, 2009 at 9:33 AM, mclovin wrote:
> Currently I need to find the most common elements in thousands of
> arrays within one large array (arround 2 million instances with ~70k
> unique elements)
>> ...
 There's no better algorithm for the general case. No method of
 checking the matrices using less than 200-x look-ups will ensure
 you that there's not a new value with x occurences lurking somewhere.
>>>
>>> Try flattening the arrays into a single large array & sorting it. Then
>>> you can just iterate over the large array counting as you go; you only
>>> ever have to insert into the dict once for each value and there's no
>>> lookups in the dict.
>>
>> You're suggesting to do a whole bunch of work copying 2,000,000 pointers
>> into a single array, then a whole bunch of more work sorting that second
>> array (which is O(N*log N) on average), and then finally iterate over
>> the second array. Sure, that last step will on average involve fewer
>> than O(N) steps,
>
> Er what?
>
> Ignore that last comment -- I don't know what I was thinking. You still
> have to iterate over all N elements, sorted or not.
>
>> but to get to that point you've already done more work
>> than just iterating over the array-of-arrays in the first place.
>
> What it does buy you though, as you pointed out, is reducing the number
> of explicit dict lookups and writes. However, dict lookups and writes are
> very fast, fast enough that they're used throughout Python. A line like:
>
> count += 1
>
> actually is a dict lookup and write.

I did some tests, just to be sure, and you're absolutely right: just
creating the flattened list took several hundred (!) times as long as
iterating through all the lists in place. Live and learn...

Vil.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread mclovin
OK then. I will try some of the strategies here but I guess things
aren't looking too good. I need to run this over a dataset that someone
pickled. I need to run this 480,000 times so you can see my
frustration. So it doesn't need to be "real time" but it would be nice
if it was done sorting this month.

Is there a "best guess" strategy where it is not 100% accurate but much
faster?

On Jul 4, 7:38 am, Steven D'Aprano  wrote:
> On Sat, 04 Jul 2009 13:42:06 +, Steven D'Aprano wrote:
> > On Sat, 04 Jul 2009 10:55:44 +0100, Vilya Harvey wrote:
>
> >> 2009/7/4 Andre Engels :
> >>> On Sat, Jul 4, 2009 at 9:33 AM, mclovin wrote:
>  Currently I need to find the most common elements in thousands of
>  arrays within one large array (arround 2 million instances with ~70k
>  unique elements)
> > ...
> >>> There's no better algorithm for the general case. No method of
> >>> checking the matrices using less than 200-x look-ups will ensure
> >>> you that there's not a new value with x occurences lurking somewhere.
>
> >> Try flattening the arrays into a single large array & sorting it. Then
> >> you can just iterate over the large array counting as you go; you only
> >> ever have to insert into the dict once for each value and there's no
> >> lookups in the dict.
>
> > You're suggesting to do a whole bunch of work copying 2,000,000 pointers
> > into a single array, then a whole bunch of more work sorting that second
> > array (which is O(N*log N) on average), and then finally iterate over
> > the second array. Sure, that last step will on average involve fewer
> > than O(N) steps,
>
> Er what?
>
> Ignore that last comment -- I don't know what I was thinking. You still
> have to iterate over all N elements, sorted or not.
>
> > but to get to that point you've already done more work
> > than just iterating over the array-of-arrays in the first place.
>
> What it does buy you though, as you pointed out, is reducing the number
> of explicit dict lookups and writes. However, dict lookups and writes are
> very fast, fast enough that they're used throughout Python. A line like:
>
> count += 1
>
> actually is a dict lookup and write.
>
> --
> Steven

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Steven D'Aprano
On Sat, 04 Jul 2009 13:42:06 +, Steven D'Aprano wrote:

> On Sat, 04 Jul 2009 10:55:44 +0100, Vilya Harvey wrote:
> 
>> 2009/7/4 Andre Engels :
>>> On Sat, Jul 4, 2009 at 9:33 AM, mclovin wrote:
 Currently I need to find the most common elements in thousands of
 arrays within one large array (arround 2 million instances with ~70k
 unique elements)
> ...
>>> There's no better algorithm for the general case. No method of
>>> checking the matrices using less than 200-x look-ups will ensure
>>> you that there's not a new value with x occurences lurking somewhere.
>> 
>> Try flattening the arrays into a single large array & sorting it. Then
>> you can just iterate over the large array counting as you go; you only
>> ever have to insert into the dict once for each value and there's no
>> lookups in the dict.
> 
> You're suggesting to do a whole bunch of work copying 2,000,000 pointers
> into a single array, then a whole bunch of more work sorting that second
> array (which is O(N*log N) on average), and then finally iterate over
> the second array. Sure, that last step will on average involve fewer
> than O(N) steps, 

Er what?

Ignore that last comment -- I don't know what I was thinking. You still 
have to iterate over all N elements, sorted or not.

> but to get to that point you've already done more work
> than just iterating over the array-of-arrays in the first place.

What it does buy you though, as you pointed out, is reducing the number 
of explicit dict lookups and writes. However, dict lookups and writes are 
very fast, fast enough that they're used throughout Python. A line like:

count += 1

actually is a dict lookup and write.



-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Scott David Daniels

Vilya Harvey wrote:

2009/7/4 Andre Engels :

On Sat, Jul 4, 2009 at 9:33 AM, mclovin wrote:

Currently I need to find the most common elements in thousands of
arrays within one large array (arround 2 million instances with ~70k
unique elements)...

Try flattening the arrays into a single large array & sorting it. Then
you can just iterate over the large array counting as you go; you only
ever have to insert into the dict once for each value and there's no
lookups in the dict


Actually the next step is to maintain a min-heap as you run down the
sorted array.  Something like:

import numpy as np
import heapq


def frequency(arr):
    '''Generate frequency-value pairs from a numpy array'''
    clustered = arr.flatten()  # copy (so can safely sort)
    clustered.sort()  # Bring identical values together
    scanner = iter(clustered)
    last = scanner.next()
    count = 1
    for el in scanner:
        if el == last:
            count += 1
        else:
            yield count, last
            last = el
            count = 1
    yield count, last


def most_frequent(arr, N):
    '''Return the top N (freq, val) elements in arr'''
    counted = frequency(arr)  # get an iterator for freq-val pairs
    heap = []
    # First, just fill up the array with the first N distinct
    for i in range(N):
        try:
            heap.append(counted.next())
        except StopIteration:
            break  # If we run out here, no need for a heap
    else:
        # more to go, switch to a min-heap, and replace the least
        # element every time we find something better
        heapq.heapify(heap)
        for pair in counted:
            if pair > heap[0]:
                heapq.heapreplace(heap, pair)
    return sorted(heap, reverse=True)  # put most frequent first.


--Scott David Daniels
scott.dani...@acm.org
--
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Steven D'Aprano
On Sat, 04 Jul 2009 10:55:44 +0100, Vilya Harvey wrote:

> 2009/7/4 Andre Engels :
>> On Sat, Jul 4, 2009 at 9:33 AM, mclovin wrote:
>>> Currently I need to find the most common elements in thousands of
>>> arrays within one large array (arround 2 million instances with ~70k
>>> unique elements)
...
>> There's no better algorithm for the general case. No method of checking
>> the matrices using less than 200-x look-ups will ensure you that
>> there's not a new value with x occurences lurking somewhere.
> 
> Try flattening the arrays into a single large array & sorting it. Then
> you can just iterate over the large array counting as you go; you only
> ever have to insert into the dict once for each value and there's no
> lookups in the dict.

You're suggesting to do a whole bunch of work copying 2,000,000 pointers 
into a single array, then a whole bunch of more work sorting that second 
array (which is O(N*log N) on average), and then finally iterate over the 
second array. Sure, that last step will on average involve fewer than 
O(N) steps, but to get to that point you've already done more work than 
just iterating over the array-of-arrays in the first place.

Now, if you're really lucky, your strategy can be done in fast C code 
instead of slow Python code, and you might see a speed-up for values of N 
which aren't too big. But I wouldn't put money on it.

Another strategy might be to pre-count elements in each array, as you 
build or modify them. This will marginally slow down each modification 
you make to the array, but the payback will be that finding the frequency 
of any element will be almost instantaneous.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Neil Crighton
You can join all your arrays into a single big array with concatenate.

>>> import numpy as np
>>> a = np.concatenate(array_of_arrays)

Then count the number of occurrances of each unique element using this trick
with searchsorted. This should be pretty fast.

>>> a.sort()
>>> unique_a = np.unique(a)
>>> count = []
>>> for val in unique_a:
...     count.append(a.searchsorted(val, side='right') -
...                  a.searchsorted(val, side='left'))
>>> mostcommonvals = unique_a[np.argsort(count)[-25:]]
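
searchsorted also accepts an array of search values, so the Python-level
loop can be pushed into numpy (same a and unique_a as above):

>>> left = a.searchsorted(unique_a, side='left')
>>> right = a.searchsorted(unique_a, side='right')
>>> count = right - left
>>> mostcommonvals = unique_a[np.argsort(count)[-25:]]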


Neil

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Vilya Harvey
2009/7/4 Andre Engels :
> On Sat, Jul 4, 2009 at 9:33 AM, mclovin wrote:
>> Currently I need to find the most common elements in thousands of
>> arrays within one large array (arround 2 million instances with ~70k
>> unique elements)
>>
>> so I set up a dictionary to handle the counting so when I am
>> iterating  I up the count on the corrosponding dictionary element. I
>> then iterate through the dictionary and find the 25 most common
>> elements.
>>
>> the elements are initially held in a array within an array. so I am am
>> just trying to find the most common elements between all the arrays
>> contained in one large array.
>> my current code looks something like this:
>> d = {}
>> for arr in my_array:
>> -for i in arr:
>> #elements are numpy integers and thus are not accepted as dictionary
>> keys
>> ---d[int(i)]=d.get(int(i),0)+1
>>
>> then I filter things down. but with my algorithm that only takes about
>> 1 sec so I dont need to show it here since that isnt the problem.
>>
>>
>> But there has to be something better. I have to do this many many
>> times and it seems silly to iterate through 2 million things just to
>> get 25. The element IDs are integers and are currently being held in
>> numpy arrays in a larger array. this ID is what makes up the key to
>> the dictionary.
>>
>>  It currently takes about 5 seconds to accomplish this with my current
>> algorithm.
>>
>> So does anyone know the best solution or algorithm? I think the trick
>> lies in matrix intersections but I do not know.
>
> There's no better algorithm for the general case. No method of
> checking the matrices using less than 200-x look-ups will ensure
> you that there's not a new value with x occurences lurking somewhere.

Try flattening the arrays into a single large array & sorting it. Then
you can just iterate over the large array counting as you go; you only
ever have to insert into the dict once for each value and there's no
lookups in the dict. I don't know numpy, so there's probably a more
efficient way to write this, but this should show what I'm talking
about:

big_arr = sorted(reduce(list.__add__, my_array, []))
counts = {}
last_val = big_arr[0]
count = 0
for val in big_arr:
    if val == last_val:
        count += 1
    else:
        counts[last_val] = count
        count = 1  # start the count for the new run at 1, not 0
        last_val = val
counts[last_val] = count  # to get the count for the last value.

If flattening the arrays isn't practical, you may still get some
improvements by sorting them individually and applying the same
principle to each of them:

counts = {}
for arr in my_array:
    sorted_arr = sorted(arr)
    last_val = sorted_arr[0]
    count = 0
    for val in sorted_arr:
        if val == last_val:
            count += 1
        else:
            counts[last_val] = counts.get(last_val, 0) + count
            count = 1  # start the count for the new run at 1, not 0
            last_val = val
    counts[last_val] = counts.get(last_val, 0) + count

Hope that helps...

Vil.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Andre Engels
On Sat, Jul 4, 2009 at 9:33 AM, mclovin wrote:
> Currently I need to find the most common elements in thousands of
> arrays within one large array (arround 2 million instances with ~70k
> unique elements)
>
> so I set up a dictionary to handle the counting so when I am
> iterating  I up the count on the corrosponding dictionary element. I
> then iterate through the dictionary and find the 25 most common
> elements.
>
> the elements are initially held in a array within an array. so I am am
> just trying to find the most common elements between all the arrays
> contained in one large array.
> my current code looks something like this:
> d = {}
> for arr in my_array:
> -for i in arr:
> #elements are numpy integers and thus are not accepted as dictionary
> keys
> ---d[int(i)]=d.get(int(i),0)+1
>
> then I filter things down. but with my algorithm that only takes about
> 1 sec so I dont need to show it here since that isnt the problem.
>
>
> But there has to be something better. I have to do this many many
> times and it seems silly to iterate through 2 million things just to
> get 25. The element IDs are integers and are currently being held in
> numpy arrays in a larger array. this ID is what makes up the key to
> the dictionary.
>
>  It currently takes about 5 seconds to accomplish this with my current
> algorithm.
>
> So does anyone know the best solution or algorithm? I think the trick
> lies in matrix intersections but I do not know.

There's no better algorithm for the general case. No method of
checking the matrices using less than 200-x look-ups will ensure
you that there's not a new value with x occurrences lurking somewhere.

However, if you need this information more often, and if the number of
values that change in between is small compared to the total size of
the matrices, you could make a gain in subsequent calculations by
remembering partial results. Depending on the situation you could for
example keep the dictionary with the counts and update it each time,
or keep such a dictionary for each "big" matrix, and set a flag when
that dictionary is not up-to-date any more.
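
A bare-bones sketch of the per-matrix variant (class and method names
invented for illustration):

class CountedMatrix(object):
    def __init__(self, arr):
        self.arr = arr
        self._counts = None               # None means the cached counts are stale
    def set(self, i, value):
        self.arr[i] = value
        self._counts = None               # any change invalidates the cache
    def counts(self):
        if self._counts is None:          # recount only when something changed
            d = {}
            for x in self.arr:
                d[int(x)] = d.get(int(x), 0) + 1
            self._counts = d
        return self._counts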

-- 
André Engels, andreeng...@gmail.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: finding most common elements between thousands of multiple arrays.

2009-07-04 Thread Chris Rebert
On Sat, Jul 4, 2009 at 12:33 AM, mclovin wrote:
> Currently I need to find the most common elements in thousands of
> arrays within one large array (arround 2 million instances with ~70k
> unique elements)
>
> so I set up a dictionary to handle the counting so when I am
> iterating  I up the count on the corrosponding dictionary element. I
> then iterate through the dictionary and find the 25 most common
> elements.
>
> the elements are initially held in a array within an array. so I am am
> just trying to find the most common elements between all the arrays
> contained in one large array.
> my current code looks something like this:
> d = {}
> for arr in my_array:
> -for i in arr:
> #elements are numpy integers and thus are not accepted as dictionary
> keys
> ---d[int(i)]=d.get(int(i),0)+1
>
> then I filter things down. but with my algorithm that only takes about
> 1 sec so I dont need to show it here since that isnt the problem.
>
>
> But there has to be something better. I have to do this many many
> times and it seems silly to iterate through 2 million things just to
> get 25. The element IDs are integers and are currently being held in
> numpy arrays in a larger array. this ID is what makes up the key to
> the dictionary.
>
>  It currently takes about 5 seconds to accomplish this with my current
> algorithm.
>
> So does anyone know the best solution or algorithm? I think the trick
> lies in matrix intersections but I do not know.

Using a defaultdict
(http://docs.python.org/library/collections.html#collections.defaultdict)
would speed it up (but only slightly) and make it clearer to read.
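
For example, the counting loop from the original post would become
something like:

from collections import defaultdict

d = defaultdict(int)
for arr in my_array:
    for i in arr:
        d[int(i)] += 1    # missing keys start at 0, so no .get() on every element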

Cheers,
Chris
-- 
http://blog.rebertia.com
-- 
http://mail.python.org/mailman/listinfo/python-list