Re: Keeping track of the N largest values
On Sat, Dec 25, 2010 at 7:42 AM, Roy Smith r...@panix.com wrote:
> I'm processing a stream of N numbers and want to keep track of the K
> largest.  There's too many numbers in the stream (i.e. N is too large) to
> keep in memory at once.  K is small (100 would be typical).
>
> From a theoretical point of view, I should be able to do this in N log K
> time.  What I'm doing now is essentially:
>
>     top = [-1]  # Assume all x are >= 0
>     for x in input():
>         if x <= top[0]:
>             continue
>         top.append(x)
>         if len(top) > K:
>             top.sort()
>             top.pop(0)
>
> I can see pathological cases (say, all input values the same) where
> running time would be N K log K, but on average (N >> K and a random
> distribution of values), this should be pretty close to N.
>
> Is there a better way to do this, either from a theoretical running time
> point of view, or just a nicer way to code this in Python?
> --
> http://mail.python.org/mailman/listinfo/python-list

heapq is probably fine, but I've empirically found a treap faster than a
heap under some conditions:

http://stromberg.dnsalias.org/~strombrg/treap/
http://stromberg.dnsalias.org/~strombrg/highest/
Re: Keeping track of the N largest values
> I'm processing a stream of N numbers and want to keep track of the K
> largest.  There's too many numbers in the stream (i.e. N is too large) to
> keep in memory at once.  K is small (100 would be typical).
> ...

deque can be bounded by maxlen, might fit the bill:

    >>> from collections import deque
    >>> d = deque([], 3)
    >>> for i in range(10):
    ...     d.append(i)
    ...
    >>> d
    deque([7, 8, 9], maxlen=3)

HTH,
--
Miki
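A caveat worth noting about the suggestion above: a deque bounded by maxlen keeps the K most *recent* values, not the K largest, so it only matches the problem statement when the stream happens to arrive in ascending order. A quick sketch in modern Python (the sample stream is illustrative):

```python
from collections import deque
from heapq import nlargest

stream = [5, 1, 9, 3, 7, 2, 8]

d = deque(stream, maxlen=3)   # a bounded deque discards from the left,
print(list(d))                # so it keeps the last 3 seen: [7, 2, 8]

print(nlargest(3, stream))    # the 3 largest, descending: [9, 8, 7]
```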
Re: Keeping track of the N largest values
Am 26.12.2010 19:51, schrieb Stefan Sonnenberg-Carstens:
> l = []
> K = 10
> while 1:
>     a = input()
>     if len(l) == K:
>         l.remove(min(l))
>     l = [x for x in l if x < a] + [a] + [x for x in l if x > a]
>     print l

A minor fault made it into my prog:

l = [0]
K = 10
while 1:
    a = input()
    l = [x for x in l if x < a] + [a] + [x for x in l if x > a]
    if len(l) == K:
        l.remove(min(l))
    print l
Re: Keeping track of the N largest values
Roy Smith wrote:
> In article <xns9e59a44d7cc49duncanbo...@127.0.0.1>, Duncan Booth
> duncan.bo...@invalid.invalid wrote:
>> Roy Smith r...@panix.com wrote:
>>> I'm processing a stream of N numbers and want to keep track of the K
>>> largest.  There's too many numbers in the stream (i.e. N is too large)
>>> to keep in memory at once.  K is small (100 would be typical).
>>> ...
>>> Is there a better way to do this, either from a theoretical running
>>> time point of view, or just a nicer way to code this in Python?
>>
>> How about:
>>
>>     from heapq import nlargest
>>     top = nlargest(K, input())
>>
>> It uses a heap so avoids completely resorting each time.
>
> Hmmm, that looks like it would solve the problem as stated, but there's
> another twist.  In addition to finding the K largest values, I *also*
> need to do some other processing on all the values (which I didn't show
> in the original example, incorrectly thinking it not germane to my
> question).  The problem with nlargest() is that it doesn't give me a
> hook to do that.

If Paul's solution doesn't suffice -- the heapq module has the building
blocks for a custom solution, e.g.:

import heapq
from functools import partial

class NLargest(object):
    def __init__(self, n):
        self.n = n
        self.heap = []

    def tally(self, item):
        heap = self.heap
        if len(heap) >= self.n:
            heapq.heappushpop(heap, item)
            self.tally = partial(heapq.heappushpop, heap)
        else:
            heapq.heappush(heap, item)

    def __str__(self):
        return str(sorted(self.heap, reverse=True))

if __name__ == "__main__":
    import random
    items = range(100)
    random.shuffle(items)
    accu = NLargest(10)
    for item in items:
        accu.tally(item)
        print item, accu

> PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
> proportional to n, and not to the iterator length.  That's the only
> reasonable conclusion, but the docs don't actually come out and say it.

I would hope so.
Re: Keeping track of the N largest values
from bisect import insort_left

K = 5
top = []
while 1:
    x = input()
    if len(top) < K:
        insort_left(top, x)
    elif x > top[0]:
        del top[0]
        insort_left(top, x)
    print top

will be enough
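The same idea, wrapped as a function over an arbitrary iterable so it can be run without input() (a hypothetical rephrasing of the loop above, in modern Python):

```python
from bisect import insort_left

def k_largest(stream, k):
    top = []                      # kept sorted ascending; top[0] is the smallest
    for x in stream:
        if len(top) < k:
            insort_left(top, x)   # still filling up: insert in sorted position
        elif x > top[0]:
            del top[0]            # evict the current smallest
            insort_left(top, x)
    return top

print(k_largest([4, 1, 7, 3, 9, 2, 8], 3))  # [7, 8, 9]
```

Each insort costs O(K) for the list shift, so this is N*K in the worst case rather than N log K, but for K around 100 the constant factors keep it competitive.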
Re: Keeping track of the N largest values
In article <e05e480b-8956-4984-b4cc-9a1666380...@l32g2000yqc.googlegroups.com>,
n00m n...@narod.ru wrote:
> from bisect import insort_left
>
> K = 5
> top = []
> while 1:
>     x = input()
>     if len(top) < K:
>         insort_left(top, x)
>     elif x > top[0]:
>         del top[0]
>         insort_left(top, x)
>     print top
>
> will be enough

Hmmm, that's an interesting idea.  Thanks.
Re: Keeping track of the N largest values
Am 25.12.2010 16:42, schrieb Roy Smith:
> I'm processing a stream of N numbers and want to keep track of the K
> largest.  There's too many numbers in the stream (i.e. N is too large) to
> keep in memory at once.  K is small (100 would be typical).
>
> From a theoretical point of view, I should be able to do this in N log K
> time.  What I'm doing now is essentially:
>
>     top = [-1]  # Assume all x are >= 0
>     for x in input():
>         if x <= top[0]:
>             continue
>         top.append(x)
>         if len(top) > K:
>             top.sort()
>             top.pop(0)
>
> I can see pathological cases (say, all input values the same) where
> running time would be N K log K, but on average (N >> K and a random
> distribution of values), this should be pretty close to N.
>
> Is there a better way to do this, either from a theoretical running time
> point of view, or just a nicer way to code this in Python?

Here is my version:

l = []
K = 10
while 1:
    a = input()
    if len(l) == K:
        l.remove(min(l))
    l = [x for x in l if x < a] + [a] + [x for x in l if x > a]
    print l
Keeping track of the N largest values
I'm processing a stream of N numbers and want to keep track of the K
largest.  There's too many numbers in the stream (i.e. N is too large) to
keep in memory at once.  K is small (100 would be typical).

From a theoretical point of view, I should be able to do this in N log K
time.  What I'm doing now is essentially:

top = [-1]  # Assume all x are >= 0
for x in input():
    if x <= top[0]:
        continue
    top.append(x)
    if len(top) > K:
        top.sort()
        top.pop(0)

I can see pathological cases (say, all input values the same) where
running time would be N K log K, but on average (N >> K and a random
distribution of values), this should be pretty close to N.

Is there a better way to do this, either from a theoretical running time
point of view, or just a nicer way to code this in Python?
Re: Keeping track of the N largest values
Roy Smith wrote:
> I'm processing a stream of N numbers and want to keep track of the K
> largest.  There's too many numbers in the stream (i.e. N is too large) to
> keep in memory at once.  K is small (100 would be typical).
>
> From a theoretical point of view, I should be able to do this in N log K
> time.  What I'm doing now is essentially:
>
>     top = [-1]  # Assume all x are >= 0
>     for x in input():
>         if x <= top[0]:
>             continue
>         top.append(x)
>         if len(top) > K:
>             top.sort()
>             top.pop(0)
>
> I can see pathological cases (say, all input values the same) where
> running time would be N K log K, but on average (N >> K and a random
> distribution of values), this should be pretty close to N.
>
> Is there a better way to do this, either from a theoretical running time
> point of view, or just a nicer way to code this in Python?

http://docs.python.org/library/heapq.html#heapq.nlargest
Re: Keeping track of the N largest values
Roy Smith r...@panix.com wrote:
> I'm processing a stream of N numbers and want to keep track of the K
> largest.  There's too many numbers in the stream (i.e. N is too large) to
> keep in memory at once.  K is small (100 would be typical).
> ...
> Is there a better way to do this, either from a theoretical running time
> point of view, or just a nicer way to code this in Python?

How about:

    from heapq import nlargest
    top = nlargest(K, input())

It uses a heap so avoids completely resorting each time.

--
Duncan Booth
http://kupuguy.blogspot.com
Re: Keeping track of the N largest values
On 12/25/10 10:42 AM, Roy Smith wrote:
> I'm processing a stream of N numbers and want to keep track of the K
> largest.  There's too many numbers in the stream (i.e. N is too large) to
> keep in memory at once.  K is small (100 would be typical).
>
> From a theoretical point of view, I should be able to do this in N log K
> time.  What I'm doing now is essentially:
>
>     top = [-1]  # Assume all x are >= 0
>     for x in input():
>         if x <= top[0]:
>             continue
>         top.append(x)
>         if len(top) > K:
>             top.sort()
>             top.pop(0)
>
> I can see pathological cases (say, all input values the same) where
> running time would be N K log K, but on average (N >> K and a random
> distribution of values), this should be pretty close to N.
>
> Is there a better way to do this, either from a theoretical running time
> point of view, or just a nicer way to code this in Python?

heapq.nlargest()

http://docs.python.org/library/heapq#heapq.nlargest

--
Robert Kern

"I have come to believe that the whole world is an enigma, a harmless
enigma that is made terrible by our own mad attempt to interpret it as
though it had an underlying truth."
  -- Umberto Eco
Re: Keeping track of the N largest values
In article <xns9e59a44d7cc49duncanbo...@127.0.0.1>, Duncan Booth
duncan.bo...@invalid.invalid wrote:
> Roy Smith r...@panix.com wrote:
>> I'm processing a stream of N numbers and want to keep track of the K
>> largest.  There's too many numbers in the stream (i.e. N is too large)
>> to keep in memory at once.  K is small (100 would be typical).
>> ...
>> Is there a better way to do this, either from a theoretical running
>> time point of view, or just a nicer way to code this in Python?
>
> How about:
>
>     from heapq import nlargest
>     top = nlargest(K, input())
>
> It uses a heap so avoids completely resorting each time.

Hmmm, that looks like it would solve the problem as stated, but there's
another twist.  In addition to finding the K largest values, I *also* need
to do some other processing on all the values (which I didn't show in the
original example, incorrectly thinking it not germane to my question).
The problem with nlargest() is that it doesn't give me a hook to do that.

PS: I'm assuming heapq.nlargest(n, iterable) operates with memory
proportional to n, and not to the iterator length.  That's the only
reasonable conclusion, but the docs don't actually come out and say it.
Re: Keeping track of the N largest values
Roy Smith r...@panix.com writes:
>> from heapq import nlargest
>> top = nlargest(K, input())
>
> In addition to finding the K largest values, I *also* need to do some
> other processing on all the values.  The problem with nlargest() is
> that it doesn't give me a hook to do that.

def processed_input():
    for x in input():
        process(x)
        yield x

top = nlargest(K, processed_input())

You could also write that more concisely with genexps but it's a bit
clumsy.
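For what it's worth, the genexp version alluded to above might look something like this; `process`, `seen`, and the sample stream are stand-ins for the real per-value work:

```python
from heapq import nlargest

seen = []
def process(x):
    seen.append(x)        # stand-in for the real per-value processing

stream = [4, 1, 7, 3, 9]

# process() returns None, so "process(x) or True" is always True:
# every value gets processed, and every value passes through to nlargest.
top = nlargest(3, (x for x in stream if process(x) or True))
print(top)   # [9, 7, 4]
print(seen)  # [4, 1, 7, 3, 9]
```

The `or True` trick is what makes it clumsy: the filter clause is abused purely for its side effect.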
Re: Keeping track of the N largest values
On 12/25/2010 8:04 AM, Peter Otten wrote:
> Roy Smith wrote:
>> I'm processing a stream of N numbers and want to keep track of the K
>> largest.  There's too many numbers in the stream (i.e. N is too large)
>> to keep in memory at once.  K is small (100 would be typical).
>
> http://docs.python.org/library/heapq.html#heapq.nlargest

Incidentally, if it happens that the data is already in a database, MySQL
will do that:

    SELECT val FROM tab ORDER BY val DESC LIMIT N;

will, for small N, keep only N values.  For large N, it sorts.  That's
for an un-indexed field.  If val is indexed, this is a very fast
operation, since the tree building has already been done.

John Nagle
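The same query works with the sqlite3 module in the standard library; a minimal sketch (the table and column names are illustrative, matching the SQL above):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE tab (val INTEGER)")
conn.executemany("INSERT INTO tab (val) VALUES (?)",
                 [(v,) for v in [4, 1, 7, 3, 9, 2, 8]])

# ORDER BY ... DESC LIMIT k returns the k largest values
rows = conn.execute("SELECT val FROM tab ORDER BY val DESC LIMIT 3").fetchall()
print([r[0] for r in rows])   # [9, 8, 7]
```

Without an index the engine has to sort; a CREATE INDEX on val would let it read the k rows straight off the B-tree, as the post notes for MySQL.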