Re: Best way to handle large lists?

2006-10-04 Thread durumdara
Hi !
 Thanks Jeremy. I am in the process of converting my stuff to use sets! I
 wouldn't have thought it would have made that big a deal! I guess it is
 live and learn.
   
If you have simplified records with a big amount of data, you can try
dbhash. With it you won't run out of memory...

dd


import dbhash
import time
import random
import gc

itemcount = 250000  # assumed size; the original figure was mangled in the archive

db = dbhash.open('test.dbh', 'w')
for i in range(itemcount):
    db[str(i)] = str(i)

littlelist = []
littleset = set()

# build 1000 unique random keys, kept both as a list and as a set
while len(littlelist) < 1000:
    x = str(random.randint(0, itemcount - 1))
    if not (x in littlelist):
        littlelist.append(x)
        littleset.add(x)

def DBHash():
    gc.collect()
    hk = db.has_key
    st = time.time()
    newlist = []
    for val in littlelist:
        if hk(val):
            newlist.append(val)
    et = time.time()
    print "Size", len(newlist)
    newlist.sort()
    print "Hash", hash(str(newlist))
    print "Time", "%.4f" % (et - st)
    print

def Set():
    gc.collect()
    largeset = set()
    for i in range(itemcount):
        largeset.add(str(i))

    st = time.time()
    newset = largeset.intersection(littleset)
    newsetlist = []
    while newset:
        newsetlist.append(newset.pop())
    et = time.time()
    print "Size", len(newsetlist)
    newsetlist.sort()
    print "Hash", hash(str(newsetlist))
    print "Time", "%.4f" % (et - st)

DBHash()
Set()



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-04 Thread GHUM
  Maybe the application should use sets instead of lists for these
  collections.
 What would sets do for me over lists?

searching for an element in a list is O(n)
searching for an element in a set is O(1) (for reasonably distributed
elements)
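
(A minimal sketch to see this for yourself, assuming Python 2.x as used
elsewhere in this thread; the sizes are invented for illustration:)

import timeit

# membership in a list scans the elements one by one: O(n)
list_test = timeit.Timer('999999 in coll', 'coll = range(1000000)')
# membership in a set hashes the element and probes a table: O(1)
set_test = timeit.Timer('999999 in coll', 'coll = set(range(1000000))')

print "list:", min(list_test.repeat(3, 10))
print "set: ", min(set_test.repeat(3, 10))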

Harald

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-04 Thread Hari Sekhon

So are you saying that using a dict means a faster search since you
only need to look up one value?

I would think that you would have to look through the keys and stop at
the first key that matches, since each key has to be unique. So perhaps
if it is nearer the front of the set of keys it would be a quicker
lookup?

On the other hand, if it is nearer the end of the set of keys would it
not be slower?

Does this make it more dependent on the search order whether a list or
a dict is faster? Or am I completely on the wrong track?

-h
Hari Sekhon


Fredrik Lundh wrote:

  Hari Sekhon wrote:

   That is surprising since I read on this list recently that lists were
   faster than dicts

  depends on what you're doing with them, of course.

   It was one reason that was cited as to why local vars are better than
   global vars.

  L[int] is indeed a bit faster than D[string] (but not much), but that
  doesn't mean that you can loop over *all* items in a list faster than
  you can look up a single key in a dictionary.

  /F


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Best way to handle large lists?

2006-10-04 Thread Fredrik Lundh
Hari Sekhon wrote:

 So are you saying that using a dict means a faster search since you only
 need to look up one value?

 I would think that you would have to look through the keys and stop at
 the first key that matches since each key has to be uniq, so perhaps if
 it is nearer the front of the set of keys then perhaps it would be a
 quicker lookup?

http://en.wikipedia.org/wiki/Hash_table
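
(Roughly the idea, as a toy sketch -- not CPython's actual
implementation, which is far more refined:)

# a dict lookup does not scan the keys; it hashes the key and jumps
# straight to one small bucket, however many keys the table holds
nbuckets = 8
buckets = [[] for _ in range(nbuckets)]

def insert(key, value):
    buckets[hash(key) % nbuckets].append((key, value))

def lookup(key):
    # only the few entries in this one bucket are examined
    for k, v in buckets[hash(key) % nbuckets]:
        if k == key:
            return v
    raise KeyError(key)

insert('foo.bar.com', 1)
print lookup('foo.bar.com')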

/F 



-- 
http://mail.python.org/mailman/listinfo/python-list


Best way to handle large lists?

2006-10-03 Thread Chaz Ginger
I have a system that has a few lists that are very large (thousands or
tens of thousands of entries) and some that are rather small. Many times
I have to produce the difference between a large list and a small one,
without destroying the integrity of either list. I was wondering if
anyone has any recommendations on how to do this and keep performance
high? Is there a better way than

[ i for i in bigList if i not in smallList ]

Thanks.
Chaz
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Duncan Booth
Chaz Ginger [EMAIL PROTECTED] wrote:

 I have a system that has a few lists that are very large (thousands or
 tens of thousands of entries) and some that are rather small. Many times
 I have to produce the difference between a large list and a small one,
 without destroying the integrity of either list. I was wondering if
 anyone has any recommendations on how to do this and keep performance
 high? Is there a better way than
 
 [ i for i in bigList if i not in smallList ]

How about:

smallSet = set(smallList)
something = [ i for i in bigList if i not in smallSet ]

Use timeit.py on some representative data to see what difference that 
makes.
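
(For instance, something along these lines -- the sizes here are
invented purely for illustration:)

import timeit

setup = """
import random
bigList = [str(random.randint(0, 100000)) for i in range(10000)]
smallList = bigList[:100]
smallSet = set(smallList)
"""

print timeit.Timer('[i for i in bigList if i not in smallList]',
                   setup).timeit(10)
print timeit.Timer('[i for i in bigList if i not in smallSet]',
                   setup).timeit(10)
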
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Paul Rubin
Chaz Ginger [EMAIL PROTECTED] writes:
 I have a system that has a few lists that are very large (thousands or
 tens of thousands of entries) and some that are rather small. Many times
 I have to produce the difference between a large list and a small one,
 without destroying the integrity of either list. I was wondering if
 anyone has any recommendations on how to do this and keep performance
 high? Is there a better way than
 
 [ i for i in bigList if i not in smallList ]

diff = list(set(bigList) - set(smallList))

Note that doesn't get you the elements in any particular order.
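
(If the order of bigList matters, a set can still do the membership
test while a list comprehension preserves the order -- a minimal sketch
using the names from the original post:)

smallSet = set(smallList)
diff = [i for i in bigList if i not in smallSet]  # keeps bigList's order
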
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread durumdara
Chaz Ginger wrote:
 I have a system that has a few lists that are very large (thousands or
 tens of thousands of entries) and some that are rather small. Many times
 I have to produce the difference between a large list and a small one,
 without destroying the integrity of either list. I was wondering if
 anyone has any recommendations on how to do this and keep performance
 high? Is there a better way than

 [ i for i in bigList if i not in smallList ]

 Thanks.
 Chaz
   
Hi !

If you have a big list, you can use a dbm-like database. They are very
quick: bsddb, flashdb, etc. See SleepyCat, or see the Python help.

The 'in' operator is very slow on large lists, but bsddb uses hash
values, so it is very quick. The SleepyCat database has many extras;
you can set the cache size and many other parameters.

Or if you don't like dbm-style databases, you can use SQLite. It is
also quick, and you can use SQL commands. It is a little slower than
bsddb, but it works like an SQL server, and you can improve the speed
with special parameters.
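
(A minimal SQLite sketch of the idea, assuming the sqlite3 module
bundled with Python 2.5, or the pysqlite2 package on older versions;
the table and host names are invented:)

import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('CREATE TABLE big (name TEXT PRIMARY KEY)')
conn.executemany('INSERT INTO big VALUES (?)',
                 [('host%d.example.com' % i,) for i in range(10000)])

# the PRIMARY KEY index makes this a fast indexed lookup, not a scan
cur = conn.execute('SELECT 1 FROM big WHERE name = ?',
                   ('host42.example.com',))
print cur.fetchone() is not None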

dd

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Bill Williams
I don't know enough about Python internals, but the suggested solutions 
all seem to involve scanning bigList. Can this presumably linear 
operation be avoided by using dict or similar to find all occurrences of 
smallist items in biglist and then deleting those occurrences?

Bill Williams



In article [EMAIL PROTECTED],
 Chaz Ginger [EMAIL PROTECTED] wrote:

 I have a system that has a few lists that are very large (thousands or
 tens of thousands of entries) and some that are rather small. Many times
 I have to produce the difference between a large list and a small one,
 without destroying the integrity of either list. I was wondering if
 anyone has any recommendations on how to do this and keep performance
 high? Is there a better way than
 
 [ i for i in bigList if i not in smallList ]
 
 Thanks.
 Chaz
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Hari Sekhon
I don't know much about the python internals either, so this may be the 
blind leading the blind, but aren't dicts much slower to work with than 
lists and therefore wouldn't your suggestion to use dicts be much 
slower? I think it's something to do with the comparative overhead of 
using keys in dicts rather than using positional indexes in lists/arrays...

At least that is what I thought.

Can anyone confirm this?

-h

Hari Sekhon



Bill Williams wrote:
 I don't know enough about Python internals, but the suggested solutions 
 all seem to involve scanning bigList. Can this presumably linear 
 operation be avoided by using dict or similar to find all occurrences of 
 smallist items in biglist and then deleting those occurrences?

 Bill Williams



 In article [EMAIL PROTECTED],
  Chaz Ginger [EMAIL PROTECTED] wrote:

   
 I have a system that has a few lists that are very large (thousands or
 tens of thousands of entries) and some that are rather small. Many times
 I have to produce the difference between a large list and a small one,
 without destroying the integrity of either list. I was wondering if
 anyone has any recommendations on how to do this and keep performance
 high? Is there a better way than

 [ i for i in bigList if i not in smallList ]

 Thanks.
 Chaz
 
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Sybren Stuvel
Bill Williams enlightened us with:
 I don't know enough about Python internals, but the suggested
 solutions all seem to involve scanning bigList. Can this presumably
 linear operation be avoided by using dict or similar to find all
 occurrences of smallist items in biglist and then deleting those
 occurrences?

And how would that beat O(n)? Every element of bigList has to be
scanned at one point, either to compare it to every earlier element in
bigList and eliminate it, or to compare it to every element in
smallList.

Run benchmarks on the suggestions, and see which is fastest for
yourself.

Sybren
-- 
Sybren Stüvel
Stüvel IT - http://www.stuvel.eu/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Chaz Ginger
I've done that and decided that Python's 'list comprehension' isn't a
way to go. I was hoping that perhaps someone had some experience with
some C or C++ library that has a Python interface that would make a
difference.

Chaz

Sybren Stuvel wrote:
 Bill Williams enlightened us with:
 I don't know enough about Python internals, but the suggested
 solutions all seem to involve scanning bigList. Can this presumably
 linear operation be avoided by using dict or similar to find all
 occurrences of smallist items in biglist and then deleting those
 occurrences?
 
 And how would that beat O(n)? Every element of bigList has to be
 scanned at one point, either to compare it to every earlier element in
 bigList and eliminate it, or to compare it to every element in
 smallList.
 
 Run benchmarks on the suggestions, and see which is fastest for
 yourself.
 
 Sybren
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Paul Rubin
Sybren Stuvel [EMAIL PROTECTED] writes:
  I don't know enough about Python internals, but the suggested
  solutions all seem to involve scanning bigList. Can this presumably
  linear operation be avoided by using dict or similar to find all
  occurrences of smallist items in biglist and then deleting those
  occurrences?
 
 And how would that beat O(n)? Every element of bigList has to be
 scanned at one point, either to compare it to every earlier element in
 bigList and eliminate it, or to compare it to every element in
 smallList.

Maybe the application should use sets instead of lists for these
collections.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Chaz Ginger
Paul Rubin wrote:
 Sybren Stuvel [EMAIL PROTECTED] writes:
 I don't know enough about Python internals, but the suggested
 solutions all seem to involve scanning bigList. Can this presumably
 linear operation be avoided by using dict or similar to find all
 occurrences of smallist items in biglist and then deleting those
 occurrences?
 And how would that beat O(n)? Every element of bigList has to be
 scanned at one point, either to compare it to every earlier element in
 bigList and eliminate it, or to compare it to every element in
 smallList.
 
 Maybe the application should use sets instead of lists for these
 collections.
What would sets do for me over lists?

Chaz
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Larry Bates
Chaz Ginger wrote:
 I have a system that has a few lists that are very large (thousands or
 tens of thousands of entries) and some that are rather small. Many times
 I have to produce the difference between a large list and a small one,
 without destroying the integrity of either list. I was wondering if
 anyone has any recommendations on how to do this and keep performance
 high? Is there a better way than
 
 [ i for i in bigList if i not in smallList ]
 
 Thanks.
 Chaz


IMHO the only way to speed things up is to know more about the
actual data in the lists (e.g. are the elements unique, can they
be sorted, etc.) and take advantage of all that information to
come up with a faster algorithm.  If they are unique, sets
might be a good choice.  If they are sorted, the bisect module
might help.  The specifics of the list(s) may yield a faster
method.
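
(If both lists can be kept sorted, a sketch of the bisect approach,
using the names from the original post:)

import bisect

def contains(sortedlist, item):
    # binary search: O(log n) per lookup instead of O(n)
    i = bisect.bisect_left(sortedlist, item)
    return i < len(sortedlist) and sortedlist[i] == item

smallSorted = sorted(smallList)
diff = [i for i in bigList if not contains(smallSorted, i)]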

-Larry
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Jeremy Sanders
Chaz Ginger wrote:

 What would sets do for me over lists?

It's faster to tell whether something is in a set or dict than in a list
(for some minimum size).

Jeremy

-- 
Jeremy Sanders
http://www.jeremysanders.net/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Chaz Ginger
Larry Bates wrote:
 Chaz Ginger wrote:
 I have a system that has a few lists that are very large (thousands or
 tens of thousands of entries) and some that are rather small. Many times
 I have to produce the difference between a large list and a small one,
 without destroying the integrity of either list. I was wondering if
 anyone has any recommendations on how to do this and keep performance
 high? Is there a better way than

 [ i for i in bigList if i not in smallList ]

 Thanks.
 Chaz
 
 
 IMHO the only way to speed things up is to know more about the
 actual data in the lists (e.g are the elements unique, can they
 be sorted, etc) and take advantage of all that information to
 come up with a faster algorithm.  If they are unique, sets
 might be a good choice.  If they are sorted, bisect module
 might help.  The specifics about the list(s) may yield a faster
 method.
 
 -Larry
Each item in the list is a fully qualified domain name, e.g.
foo.bar.com. The order in the list has no importance. That is about all
there is to the list, other than to say the number of items in a list
can top out at about 10,000.

Chaz
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Richard Brodie

Chaz Ginger [EMAIL PROTECTED] wrote in message 
news:[EMAIL PROTECTED]

 Each item in the list is a fully qualified domain name, e.g.
 foo.bar.com. The order in the list has no importance.

So you don't actually need to use lists at all, then.
You can just use sets and write:

newSet = bigSet - littleSet
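
(With the OP's kind of data that might look like, for instance:)

bigSet = set(['foo.bar.com', 'baz.bar.com', 'quux.bar.com'])
littleSet = set(['baz.bar.com'])

newSet = bigSet - littleSet     # the names not in littleSet
print sorted(newSet)            # ['foo.bar.com', 'quux.bar.com']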


-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Hari Sekhon

Jeremy Sanders wrote:

  Chaz Ginger wrote:

   What would sets do for me over lists?

  It's faster to tell whether something is in a set or dict than in a list
  (for some minimum size).

  Jeremy

That is surprising since I read on this list recently that lists were
faster than dicts, and that variables stored in lists were faster to
reference/use. It was one reason that was cited as to why local vars
are better than global vars. The guy actually did a looping test and
timed it to show the speed difference.

Can anybody please step in and correct us?

-h

Hari Sekhon


-- 
http://mail.python.org/mailman/listinfo/python-list

Re: Best way to handle large lists?

2006-10-03 Thread Jeremy Sanders
Jeremy Sanders wrote:

 Chaz Ginger wrote:
 
 What would sets do for me over lists?
 
 It's faster to tell whether something is in a set or dict than in a list
 (for some minimum size).

As a footnote, this program

import random
num = 100000  # assumed size; the original figures were lost in formatting

a = set( range(num) )
for i in range(100000):
    x = random.randint(0, num-1) in a

completes in less than a second, whereas

import random
num = 100000  # assumed size, as above

a = range(num)
for i in range(100000):
    x = random.randint(0, num-1) in a

takes a long time on my computer.

Jeremy

-- 
Jeremy Sanders
http://www.jeremysanders.net/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Chaz Ginger
Jeremy Sanders wrote:
 Jeremy Sanders wrote:
 
 Chaz Ginger wrote:

 What would sets do for me over lists?
 It's faster to tell whether something is in a set or dict than in a list
 (for some minimum size).
 
 As a footnote, this program
 
 import random
 num = 100000  # assumed size; the original figures were lost in formatting
 
 a = set( range(num) )
 for i in range(100000):
     x = random.randint(0, num-1) in a
 
 completes in less than a second, whereas
 
 import random
 num = 100000  # assumed size, as above
 
 a = range(num)
 for i in range(100000):
     x = random.randint(0, num-1) in a
 
 takes a long time on my computer.
 
 Jeremy
 
Thanks Jeremy. I am in the process of converting my stuff to use sets! I
wouldn't have thought it would have made that big a deal! I guess it is
live and learn.

Peace,
Chaz
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: Best way to handle large lists?

2006-10-03 Thread Fredrik Lundh
Hari Sekhon wrote:

 That is surprising since I read on this list recently that lists were 
 faster than dicts

depends on what you're doing with them, of course.

 It was one reason that was cited as to why local vars are better than
  global vars.

L[int] is indeed a bit faster than D[string] (but not much), but that 
doesn't mean that you can loop over *all* items in a list faster than 
you can look up a single key in a dictionary.
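
(A rough illustration with timeit; absolute numbers are machine
dependent, and the sizes are invented:)

import timeit

setup = "L = range(10000); D = dict((str(i), i) for i in range(10000))"

# a single indexed access vs a single key lookup: both cheap
print timeit.Timer('L[5000]', setup).timeit()
print timeit.Timer("D['5000']", setup).timeit()

# scanning a list for membership vs one dict lookup: no contest
print timeit.Timer('9999 in L', setup).timeit(1000)
print timeit.Timer("'9999' in D", setup).timeit(1000)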

/F

-- 
http://mail.python.org/mailman/listinfo/python-list