Re: Best way to handle large lists?
Hi! Thanks Jeremy. I am in the process of converting my stuff to use sets! I wouldn't have thought it would have made that big a deal! I guess it is live and learn.

If you have simple records with a big amount of data, you can try dbhash. With this you don't run out of memory:

    import dbhash
    import time
    import random
    import gc
    import sys

    itemcount = 250000

    db = dbhash.open('test.dbh', 'w')
    for i in range(itemcount):
        db[str(i)] = str(i)

    littlelist = []
    littleset = set()
    while len(littlelist) < 1000:
        x = str(random.randint(0, itemcount - 1))
        if not (x in littlelist):
            littlelist.append(x)
            littleset.add(x)

    def DBHash():
        gc.collect()
        hk = db.has_key
        st = time.time()
        newlist = []
        for val in littlelist:
            if hk(val):
                newlist.append(val)
        et = time.time()
        print "Size", len(newlist)
        newlist.sort()
        print "Hash", hash(str(newlist))
        print "Time", "%04f" % (et - st)
        print

    def Set():
        gc.collect()
        largeset = set()
        for i in range(itemcount):
            largeset.add(str(i))
        st = time.time()
        newset = largeset.intersection(littleset)
        newsetlist = []
        while newset:
            newsetlist.append(newset.pop())
        et = time.time()
        print "Size", len(newsetlist)
        newsetlist.sort()
        print "Hash", hash(str(newsetlist))
        print "Time", "%04f" % (et - st)

    DBHash()
    Set()

dd -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Maybe the application should use sets instead of lists for these collections. What would sets do for me over lists? Searching for an element in a list is O(n); searching for an element in a set is O(1) (for reasonably distributed elements). Harald -- http://mail.python.org/mailman/listinfo/python-list
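[Editor's note: a quick, self-contained way to observe the difference Harald describes; the sizes are arbitrary, chosen only for illustration.]

```python
import timeit

setup = "data = list(range(100000)); s = set(data)"

# Probe for the worst-case element: the last one in the list.
list_time = timeit.timeit("99999 in data", setup=setup, number=100)
set_time = timeit.timeit("99999 in s", setup=setup, number=100)

# The list scans up to all 100000 elements per probe; the set hashes once.
print("list: %.4fs  set: %.4fs" % (list_time, set_time))
```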
Re: Best way to handle large lists?
So are you saying that using a dict means a faster search since you only need to look up one value? I would think that you would have to look through the keys and stop at the first key that matches, since each key has to be unique; so if it is nearer the front of the set of keys, perhaps it would be a quicker lookup? On the other hand, if it is nearer the end of the set of keys, would it not be slower? Does this make it more dependent on the search order whether a list or a dict is faster? Or am I completely on the wrong track? -h Hari Sekhon Fredrik Lundh wrote: Hari Sekhon wrote: That is surprising since I read on this list recently that lists were faster than dicts depends on what you're doing with them, of course. It was one reason that was cited as to why local vars are better than global vars. L[int] is indeed a bit faster than D[string] (but not much), but that doesn't mean that you can loop over *all* items in a list faster than you can look up a single key in a dictionary. /F -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Hari Sekhon wrote: So are you saying that using a dict means a faster search since you only need to look up one value? I would think that you would have to look through the keys and stop at the first key that matches since each key has to be uniq, so perhaps if it is nearer the front of the set of keys then perhaps it would be a quicker lookup? http://en.wikipedia.org/wiki/Hash_table /F -- http://mail.python.org/mailman/listinfo/python-list
Best way to handle large lists?
I have a system that has a few lists that are very large (thousands or tens of thousands of entries) and some that are rather small. Many times I have to produce the difference between a large list and a small one, without destroying the integrity of either list. I was wondering if anyone has any recommendations on how to do this and keep performance high? Is there a better way than [ i for i in bigList if i not in smallList ] Thanks. Chaz -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Chaz Ginger [EMAIL PROTECTED] wrote: I have a system that has a few lists that are very large (thousands or tens of thousands of entries) and some that are rather small. Many times I have to produce the difference between a large list and a small one, without destroying the integrity of either list. I was wondering if anyone has any recommendations on how to do this and keep performance high? Is there a better way than [ i for i in bigList if i not in smallList ] How about: smallSet = set(smallList) something = [ i for i in bigList if i not in smallSet ] Use timeit.py on some representative data to see what difference that makes. -- http://mail.python.org/mailman/listinfo/python-list
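[Editor's note: a sketch of the timeit comparison suggested above, with invented data standing in for bigList and smallList.]

```python
import timeit

setup = """
bigList = [str(i) for i in range(5000)]
smallList = [str(i) for i in range(0, 5000, 10)]
smallSet = set(smallList)
"""

# The same comprehension, probing a list vs probing a set.
probe_list = timeit.timeit("[i for i in bigList if i not in smallList]",
                           setup=setup, number=5)
probe_set = timeit.timeit("[i for i in bigList if i not in smallSet]",
                          setup=setup, number=5)
print("probe list: %.4fs  probe set: %.4fs" % (probe_list, probe_set))
```

The one-line change (building `smallSet` once) turns each `not in` test from a linear scan into a hash lookup.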
Re: Best way to handle large lists?
Chaz Ginger [EMAIL PROTECTED] writes: I have a system that has a few lists that are very large (thousands or tens of thousands of entries) and some that are rather small. Many times I have to produce the difference between a large list and a small one, without destroying the integrity of either list. I was wondering if anyone has any recommendations on how to do this and keep performance high? Is there a better way than [ i for i in bigList if i not in smallList ] diff = list(set(bigList) - set(smallList)) Note that doesn't get you the elements in any particular order. -- http://mail.python.org/mailman/listinfo/python-list
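[Editor's note: if the result has to keep bigList's original order, a middle ground is to keep the comprehension but convert only smallList to a set. A minimal sketch with made-up data:]

```python
bigList = ["a", "b", "c", "d", "e", "b"]
smallList = ["b", "d"]

smallSet = set(smallList)
# One O(n) pass over bigList with O(1) membership tests; order and
# duplicates in bigList are preserved, unlike set(bigList) - set(smallList).
diff = [i for i in bigList if i not in smallSet]
print(diff)  # → ['a', 'c', 'e']
```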
Re: Best way to handle large lists?
Chaz Ginger wrote: I have a system that has a few lists that are very large (thousands or tens of thousands of entries) and some that are rather small. Many times I have to produce the difference between a large list and a small one, without destroying the integrity of either list. I was wondering if anyone has any recommendations on how to do this and keep performance high? Is there a better way than [ i for i in bigList if i not in smallList ] Thanks. Chaz Hi! If you have big lists, you can use dbm-like databases. They are very quick: BSDDB, flashdb, etc. See SleepyCat, or see the Python help. "in" is very slow on large datasets, but bsddb uses hash values, so it is very quick. The SleepyCat database has many extras; you can set the cache size and many other parameters. Or, if you don't like dbm-style databases, you can use SQLite. It is also quick, and you can use SQL commands. A little slower than bsddb, but it is like an SQL server. You can improve the speed with special parameters. dd -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
I don't know enough about Python internals, but the suggested solutions all seem to involve scanning bigList. Can this presumably linear operation be avoided by using dict or similar to find all occurrences of smallist items in biglist and then deleting those occurrences? Bill Williams In article [EMAIL PROTECTED], Chaz Ginger [EMAIL PROTECTED] wrote: I have a system that has a few lists that are very large (thousands or tens of thousands of entries) and some that are rather small. Many times I have to produce the difference between a large list and a small one, without destroying the integrity of either list. I was wondering if anyone has any recommendations on how to do this and keep performance high? Is there a better way than [ i for i in bigList if i not in smallList ] Thanks. Chaz -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
I don't know much about the python internals either, so this may be the blind leading the blind, but aren't dicts much slower to work with than lists and therefore wouldn't your suggestion to use dicts be much slower? I think it's something to do with the comparative overhead of using keys in dicts rather than using positional indexes in lists/arrays... At least that is what I thought. Can anyone confirm this? -h Hari Sekhon Bill Williams wrote: I don't know enough about Python internals, but the suggested solutions all seem to involve scanning bigList. Can this presumably linear operation be avoided by using dict or similar to find all occurrences of smallist items in biglist and then deleting those occurrences? Bill Williams In article [EMAIL PROTECTED], Chaz Ginger [EMAIL PROTECTED] wrote: I have a system that has a few lists that are very large (thousands or tens of thousands of entries) and some that are rather small. Many times I have to produce the difference between a large list and a small one, without destroying the integrity of either list. I was wondering if anyone has any recommendations on how to do this and keep performance high? Is there a better way than [ i for i in bigList if i not in smallList ] Thanks. Chaz -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Bill Williams enlightened us with: I don't know enough about Python internals, but the suggested solutions all seem to involve scanning bigList. Can this presumably linear operation be avoided by using dict or similar to find all occurrences of smallist items in biglist and then deleting those occurrences? And how would that beat O(n)? Every element of bigList has to be scanned at one point, either to compare it to every earlier element in bigList and eliminate it, or to compare it to every element in smallList. Run benchmarks on the suggestions, and see which is fastest for yourself. Sybren -- Sybren Stüvel Stüvel IT - http://www.stuvel.eu/ -- http://mail.python.org/mailman/listinfo/python-list
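[Editor's note: one possible shape for the benchmark Sybren suggests. The data, sizes, and function names are made up for illustration.]

```python
import random
import time

bigList = [str(random.randrange(10**6)) for _ in range(20000)]
smallList = bigList[:100]

def via_list():
    # O(len(bigList) * len(smallList)) comparisons.
    return [i for i in bigList if i not in smallList]

def via_set():
    # O(len(bigList)) passes with O(1) probes.
    s = set(smallList)
    return [i for i in bigList if i not in s]

def via_difference():
    # Fastest to write, but loses order and collapses duplicates.
    return list(set(bigList) - set(smallList))

for fn in (via_list, via_set, via_difference):
    t0 = time.time()
    fn()
    print("%-15s %.4fs" % (fn.__name__, time.time() - t0))
```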
Re: Best way to handle large lists?
I've done that and decided that Python's 'list comprehension' isn't the way to go. I was hoping that perhaps someone had some experience with a C or C++ library with a Python interface that would make a difference. Chaz Sybren Stuvel wrote: Bill Williams enlightened us with: I don't know enough about Python internals, but the suggested solutions all seem to involve scanning bigList. Can this presumably linear operation be avoided by using dict or similar to find all occurrences of smallist items in biglist and then deleting those occurrences? And how would that beat O(n)? Every element of bigList has to be scanned at one point, either to compare it to every earlier element in bigList and eliminate it, or to compare it to every element in smallList. Run benchmarks on the suggestions, and see which is fastest for yourself. Sybren -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Sybren Stuvel [EMAIL PROTECTED] writes: I don't know enough about Python internals, but the suggested solutions all seem to involve scanning bigList. Can this presumably linear operation be avoided by using dict or similar to find all occurrences of smallist items in biglist and then deleting those occurrences? And how would that beat O(n)? Every element of bigList has to be scanned at one point, either to compare it to every earlier element in bigList and eliminate it, or to compare it to every element in smallList. Maybe the application should use sets instead of lists for these collections. -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Paul Rubin wrote: Sybren Stuvel [EMAIL PROTECTED] writes: I don't know enough about Python internals, but the suggested solutions all seem to involve scanning bigList. Can this presumably linear operation be avoided by using dict or similar to find all occurrences of smallist items in biglist and then deleting those occurrences? And how would that beat O(n)? Every element of bigList has to be scanned at one point, either to compare it to every earlier element in bigList and eliminate it, or to compare it to every element in smallList. Maybe the application should use sets instead of lists for these collections. What would sets do for me over lists? Chaz -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Chaz Ginger wrote: I have a system that has a few lists that are very large (thousands or tens of thousands of entries) and some that are rather small. Many times I have to produce the difference between a large list and a small one, without destroying the integrity of either list. I was wondering if anyone has any recommendations on how to do this and keep performance high? Is there a better way than [ i for i in bigList if i not in smallList ] Thanks. Chaz IMHO the only way to speed things up is to know more about the actual data in the lists (e.g. are the elements unique, can they be sorted, etc.) and take advantage of all that information to come up with a faster algorithm. If they are unique, sets might be a good choice. If they are sorted, the bisect module might help. The specifics about the list(s) may yield a faster method. -Larry -- http://mail.python.org/mailman/listinfo/python-list
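[Editor's note: an illustration of the bisect idea, not the poster's code; the helper name and hostnames are invented.]

```python
import bisect

# Keeping the big list sorted lets binary search replace the linear scan.
bigSorted = sorted(["a.example.com", "b.example.com", "c.example.com",
                    "d.example.com", "e.example.com"])

def contains(sorted_list, item):
    # O(log n) membership test via binary search.
    i = bisect.bisect_left(sorted_list, item)
    return i < len(sorted_list) and sorted_list[i] == item

print(contains(bigSorted, "c.example.com"))  # → True
print(contains(bigSorted, "x.example.com"))  # → False
```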
Re: Best way to handle large lists?
Chaz Ginger wrote: What would sets do for me over lists? It's faster to tell whether something is in a set or dict than in a list (for some minimum size). Jeremy -- Jeremy Sanders http://www.jeremysanders.net/ -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Larry Bates wrote: Chaz Ginger wrote: I have a system that has a few lists that are very large (thousands or tens of thousands of entries) and some that are rather small. Many times I have to produce the difference between a large list and a small one, without destroying the integrity of either list. I was wondering if anyone has any recommendations on how to do this and keep performance high? Is there a better way than [ i for i in bigList if i not in smallList ] Thanks. Chaz IMHO the only way to speed things up is to know more about the actual data in the lists (e.g. are the elements unique, can they be sorted, etc.) and take advantage of all that information to come up with a faster algorithm. If they are unique, sets might be a good choice. If they are sorted, the bisect module might help. The specifics about the list(s) may yield a faster method. -Larry Each item in the list is a fully qualified domain name, e.g. foo.bar.com. The order in the list has no importance. That is about all there is to the list, other than to say the number of items in a list can top out at about 10,000. Chaz -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Chaz Ginger [EMAIL PROTECTED] wrote in message news:[EMAIL PROTECTED] Each item in the list is a fully qualified domain name, e.g. foo.bar.com. The order in the list has no importance. So you don't actually need to use lists at all, then. You can just use sets and write: newSet = bigSet - littleSet -- http://mail.python.org/mailman/listinfo/python-list
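[Editor's note: with fully qualified domain names as elements, that might look like the following sketch; the hostnames are invented.]

```python
bigSet = {"foo.bar.com", "mail.bar.com", "www.bar.com", "ns1.bar.com"}
littleSet = {"mail.bar.com", "ns1.bar.com"}

# Set difference: everything in bigSet that is not in littleSet.
# Neither operand is modified, so both collections keep their integrity.
newSet = bigSet - littleSet
print(sorted(newSet))  # → ['foo.bar.com', 'www.bar.com']
```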
Re: Best way to handle large lists?
Jeremy Sanders wrote: Chaz Ginger wrote: What would sets do for me over lists? It's faster to tell whether something is in a set or dict than in a list (for some minimum size). Jeremy That is surprising, since I read on this list recently that lists were faster than dicts and that variables stored in lists were referenced/used faster. It was one reason cited as to why local vars are better than global vars. The guy actually did a looping test and timed it to show the speed difference. Can anybody please step in and correct us? -h Hari Sekhon -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Jeremy Sanders wrote: Chaz Ginger wrote: What would sets do for me over lists? It's faster to tell whether something is in a set or dict than in a list (for some minimum size). As a footnote, this program

    import random
    num = 100000
    a = set(range(num))
    for i in range(100000):
        x = random.randint(0, num - 1) in a

completes in less than a second, whereas

    import random
    num = 100000
    a = range(num)
    for i in range(100000):
        x = random.randint(0, num - 1) in a

takes a long time on my computer. Jeremy -- Jeremy Sanders http://www.jeremysanders.net/ -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Jeremy Sanders wrote: Jeremy Sanders wrote: Chaz Ginger wrote: What would sets do for me over lists? It's faster to tell whether something is in a set or dict than in a list (for some minimum size). As a footnote, this program

    import random
    num = 100000
    a = set(range(num))
    for i in range(100000):
        x = random.randint(0, num - 1) in a

completes in less than a second, whereas

    import random
    num = 100000
    a = range(num)
    for i in range(100000):
        x = random.randint(0, num - 1) in a

takes a long time on my computer. Jeremy Thanks Jeremy. I am in the process of converting my stuff to use sets! I wouldn't have thought it would have made that big a deal! I guess it is live and learn. Peace, Chaz -- http://mail.python.org/mailman/listinfo/python-list
Re: Best way to handle large lists?
Hari Sekhon wrote: That is surprising since I read on this list recently that lists were faster than dicts depends on what you're doing with them, of course. It was one reason that was cited as to why local vars are better than global vars. L[int] is indeed a bit faster than D[string] (but not much), but that doesn't mean that you can loop over *all* items in a list faster than you can look up a single key in a dictionary. /F -- http://mail.python.org/mailman/listinfo/python-list
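[Editor's note: a rough, per-operation illustration of the distinction /F draws; the container sizes are arbitrary.]

```python
import timeit

setup = "L = list(range(10000)); D = {str(i): i for i in range(10000)}"

# Cost of one indexed access, one key lookup, and one full-list scan.
index_per = timeit.timeit("L[5000]", setup=setup, number=100000) / 100000
key_per = timeit.timeit("D['5000']", setup=setup, number=100000) / 100000
scan_per = timeit.timeit("9999 in L", setup=setup, number=1000) / 1000

# Single lookups are cheap either way; scanning a whole list for one
# value is what gets expensive.
print(index_per, key_per, scan_per)
```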