Re: Keeping track of the N largest values

2010-12-31 Thread Dan Stromberg
On Sat, Dec 25, 2010 at 7:42 AM, Roy Smith r...@panix.com wrote:
 I'm processing a stream of N numbers and want to keep track of the K
 largest.  There's too many numbers in the stream (i.e. N is too large)
 to keep in memory at once.  K is small (100 would be typical).

 From a theoretical point of view, I should be able to do this in N log K
 time.  What I'm doing now is essentially:

 top = [-1]    # Assume all x are = 0
 for x in input():
    if x = top[0]:
        continue
    top.append(x)
    if len(top)  K:
        top.sort()
        top.pop(0)

 I can see pathological cases (say, all input values the same) where
 running time would be N K log K, but on average (N  K and random
 distribution of values), this should be pretty close to N.

 Is there a better way to do this, either from a theoretical running time
 point of view, or just a nicer way to code this in Python?
 --
 http://mail.python.org/mailman/listinfo/python-list


heapq is probably fine, but I've empirically found a treap faster than
a heap under some conditions:

http://stromberg.dnsalias.org/~strombrg/treap/
http://stromberg.dnsalias.org/~strombrg/highest/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-28 Thread Miki
 I'm processing a stream of N numbers and want to keep track of the K 
 largest.  There's too many numbers in the stream (i.e. N is too large) 
 to keep in memory at once.  K is small (100 would be typical).
 ...

deque can be bounded by maxsize, might fit the bill:
 from collections import deque
 d = deque([], 3)
 for i in range(10): d.append(i)
 d
deque([7, 8, 9], maxlen=3)
 


HTH,
--
Miki
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-27 Thread Stefan Sonnenberg-Carstens

Am 26.12.2010 19:51, schrieb Stefan Sonnenberg-Carstens:

l = []
K = 10

while 1:
a = input()
if len(l) == K:
l.remove(min(l))
l=[x for x in l if x  a] + [a] + [x for x in l if x  a]
print l 

A minor fault made it into my prog:

l = [0]
K = 10

while 1:
a = input()
l=[x for x in l if x  a] + [a] + [x for x in l if x  a]
if len(l) == K:
l.remove(min(l))
print l
--
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-26 Thread Peter Otten
Roy Smith wrote:

 In article xns9e59a44d7cc49duncanbo...@127.0.0.1,
  Duncan Booth duncan.bo...@invalid.invalid wrote:
 
 Roy Smith r...@panix.com wrote:
 
  
  I'm processing a stream of N numbers and want to keep track of the K
  largest.  There's too many numbers in the stream (i.e. N is too large)
  to keep in memory at once.  K is small (100 would be typical).
  ...
  Is there a better way to do this, either from a theoretical running
  time point of view, or just a nicer way to code this in Python?
 
 How about:
 
   from heapq import nlargest
   top = nlargest(K, input())
 
 It uses a heap so avoids completely resorting each time.
 
 Hmmm, that looks like it would solve the problem as stated, but there's
 another twist.  In addition to finding the K largest values, I *also*
 need to do some other processing on all the values (which I didn't show
 in the original example, incorrectly thinking it not germane to my
 question).  The problem with nlargest() is that it doesn't give me a
 hook to do that.

If Paul's solution doesn't suffice -- the heapq module has the building 
blocks for a custom solution, e. g.:

import heapq
from functools import partial

class NLargest(object):
def __init__(self, n):
self.n = n
self.heap = []
def tally(self, item):
heap = self.heap
if len(heap) = self.n:
heapq.heappushpop(heap, item)
self.tally = partial(heapq.heappushpop, heap)
else:
heapq.heappush(heap, item)
def __str__(self):
return str(sorted(self.heap, reverse=True))

if __name__ == __main__:
import random
items = range(100)
random.shuffle(items)
accu = NLargest(10)
for item in items:
accu.tally(item)
print item, accu

 
 PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
 proportional to n, and not to the iterator length.  That's the only
 reasonable conclusion, but the docs don't actually come out and say it.

I would hope so.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-26 Thread n00m

from bisect import insort_left

K = 5
top = []
while 1:
x = input()
if len(top)  K:
insort_left(top, x)
elif x  top[0]:
del top[0]
insort_left(top, x)
print top


will be enough

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-26 Thread Roy Smith
In article 
e05e480b-8956-4984-b4cc-9a1666380...@l32g2000yqc.googlegroups.com,
 n00m n...@narod.ru wrote:

 from bisect import insort_left
 
 K = 5
 top = []
 while 1:
 x = input()
 if len(top)  K:
 insort_left(top, x)
 elif x  top[0]:
 del top[0]
 insort_left(top, x)
 print top
 
 
 will be enough

Hmmm, that's an interesting idea.  Thanks.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-26 Thread Stefan Sonnenberg-Carstens

Am 25.12.2010 16:42, schrieb Roy Smith:

I'm processing a stream of N numbers and want to keep track of the K
largest.  There's too many numbers in the stream (i.e. N is too large)
to keep in memory at once.  K is small (100 would be typical).

 From a theoretical point of view, I should be able to do this in N log K
time.  What I'm doing now is essentially:

top = [-1]# Assume all x are= 0
for x in input():
 if x= top[0]:
 continue
 top.append(x)
 if len(top)  K:
 top.sort()
 top.pop(0)

I can see pathological cases (say, all input values the same) where
running time would be N K log K, but on average (N  K and random
distribution of values), this should be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?

Here is my version:

l = []
K = 10

while 1:
a = input()
if len(l) == K:
l.remove(min(l))
l=[x for x in l if x  a] + [a] + [x for x in l if x  a]
print l
--
http://mail.python.org/mailman/listinfo/python-list


Keeping track of the N largest values

2010-12-25 Thread Roy Smith
I'm processing a stream of N numbers and want to keep track of the K 
largest.  There's too many numbers in the stream (i.e. N is too large) 
to keep in memory at once.  K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K 
time.  What I'm doing now is essentially:

top = [-1]# Assume all x are = 0   

for x in input():
if x = top[0]:
continue
top.append(x)
if len(top)  K:
top.sort()
top.pop(0)

I can see pathological cases (say, all input values the same) where 
running time would be N K log K, but on average (N  K and random 
distribution of values), this should be pretty close to N.

Is there a better way to do this, either from a theoretical running time 
point of view, or just a nicer way to code this in Python?
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-25 Thread Peter Otten
Roy Smith wrote:

 I'm processing a stream of N numbers and want to keep track of the K
 largest.  There's too many numbers in the stream (i.e. N is too large)
 to keep in memory at once.  K is small (100 would be typical).
 
 From a theoretical point of view, I should be able to do this in N log K
 time.  What I'm doing now is essentially:
 
 top = [-1]# Assume all x are = 0
 for x in input():
 if x = top[0]:
 continue
 top.append(x)
 if len(top)  K:
 top.sort()
 top.pop(0)
 
 I can see pathological cases (say, all input values the same) where
 running time would be N K log K, but on average (N  K and random
 distribution of values), this should be pretty close to N.
 
 Is there a better way to do this, either from a theoretical running time
 point of view, or just a nicer way to code this in Python?

http://docs.python.org/library/heapq.html#heapq.nlargest
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-25 Thread Duncan Booth
Roy Smith r...@panix.com wrote:

 
 I'm processing a stream of N numbers and want to keep track of the K 
 largest.  There's too many numbers in the stream (i.e. N is too large)
 to keep in memory at once.  K is small (100 would be typical).
 ... 
 Is there a better way to do this, either from a theoretical running
 time point of view, or just a nicer way to code this in Python?

How about:

  from heapq import nlargest
  top = nlargest(K, input())

It uses a heap so avoids completely resorting each time.

-- 
Duncan Booth http://kupuguy.blogspot.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-25 Thread Robert Kern

On 12/25/10 10:42 AM, Roy Smith wrote:

I'm processing a stream of N numbers and want to keep track of the K
largest.  There's too many numbers in the stream (i.e. N is too large)
to keep in memory at once.  K is small (100 would be typical).


From a theoretical point of view, I should be able to do this in N log K

time.  What I'm doing now is essentially:

top = [-1]# Assume all x are= 0
for x in input():
 if x= top[0]:
 continue
 top.append(x)
 if len(top)  K:
 top.sort()
 top.pop(0)

I can see pathological cases (say, all input values the same) where
running time would be N K log K, but on average (N  K and random
distribution of values), this should be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?


heapq.nlargest()

http://docs.python.org/library/heapq#heapq.nlargest

--
Robert Kern

I have come to believe that the whole world is an enigma, a harmless enigma
 that is made terrible by our own mad attempt to interpret it as though it had
 an underlying truth.
  -- Umberto Eco

--
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-25 Thread Roy Smith
In article xns9e59a44d7cc49duncanbo...@127.0.0.1,
 Duncan Booth duncan.bo...@invalid.invalid wrote:

 Roy Smith r...@panix.com wrote:
 
  
  I'm processing a stream of N numbers and want to keep track of the K 
  largest.  There's too many numbers in the stream (i.e. N is too large)
  to keep in memory at once.  K is small (100 would be typical).
  ... 
  Is there a better way to do this, either from a theoretical running
  time point of view, or just a nicer way to code this in Python?
 
 How about:
 
   from heapq import nlargest
   top = nlargest(K, input())
 
 It uses a heap so avoids completely resorting each time.

Hmmm, that looks like it would solve the problem as stated, but there's 
another twist.  In addition to finding the K largest values, I *also* 
need to do some other processing on all the values (which I didn't show 
in the original example, incorrectly thinking it not germane to my 
question).  The problem with nlargest() is that it doesn't give me a 
hook to do that.

PS: I'm assuming heapq.nlargest(n, iterable) operates with memory 
proportional to n, and not to the iterator length.  That's the only 
reasonable conclusion, but the docs don't actually come out and say it.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-25 Thread Paul Rubin
Roy Smith r...@panix.com writes:
   from heapq import nlargest
   top = nlargest(K, input())

 In addition to finding the K largest values, I *also* need to do some
 other processing on all the values   The problem with nlargest()
 is that it doesn't give me a hook to do that.

def processed_input():
  for x in input():
process(x)
yield x

top = nlargest(K, processed_input())

You could also write that more consisely with genexps but it's a bit
clumsy.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Keeping track of the N largest values

2010-12-25 Thread John Nagle

On 12/25/2010 8:04 AM, Peter Otten wrote:

Roy Smith wrote:


I'm processing a stream of N numbers and want to keep track of the K
largest.  There's too many numbers in the stream (i.e. N is too large)
to keep in memory at once.  K is small (100 would be typical).


http://docs.python.org/library/heapq.html#heapq.nlargest


Incidentally, if it happens that the data is already in
a database, MySQL will do that.

SELECT val FROM tab ORDER BY val DESC LIMIT N;

will, for small N, keep only N values. For large N,
it sorts.

   That's for an un-indexed field.  If val is an
index, this is a very fast operation, since the tree
building has already been done.

John Nagle

--
http://mail.python.org/mailman/listinfo/python-list