Re: matching strings in a large set of strings

2010-05-06 Thread M.-A. Lemburg
Dennis Lee Bieber wrote:
 On Thu, 29 Apr 2010 11:38:28 +0200, Karin Lagesen
 karin.lage...@bio.uio.no declaimed the following in comp.lang.python:
 
 Hello.

 I have approx 83 million strings, all 14 characters long. I need to be
 able to take another string and find out whether this one is present
 within the 83 million strings.

   Is this "another string" also exactly 14 characters long?
 
 Now, I have tried storing these strings as a list, a set and a dictionary.
 I know that finding things in a set and a dictionary is very much faster
 than working with a list, so I tried those first. However, I run out of
 memory building both the set and the dictionary, so what I seem to be left
 with is the list, and using the in method.

   So don't load them into memory... First use a file-based (not memory
 limited) sort routine on the 80M strings (if position in the file is
 important, make a copy with the line number on the end of each line
 before sorting -- but since you've tried using Sets which do not retain
 order... probably not a matter).
 
   Then, since you know the length of the file, and the length of each
 line is fixed, you can do direct access I/O (well, since the C-stream
 style I/O doesn't do record access, you'll need to calculate offsets
 from the start of the file as (record# - 1) * recordlen)...
 
   That lets you do a binary search on the file. Much faster than a
 linear search (linear search will average out to 41.5M read operations;
 binary should be around 1 reads)
 
   Or load the data into some relational database and let the compiled
 code do the searching -- probably faster than byte-code interpreted
 Python for the same algorithm...
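
A rough sketch of the fixed-record binary search suggested above -- assuming
the strings have already been written to a sorted file with one 14-character
string per line (15 bytes per record including the newline); the file name is
illustrative:

def file_contains(path, target, strlen=14):
    """Binary search a sorted file of fixed-length records for target."""
    reclen = strlen + 1                    # 14 characters plus '\n'
    with open(path, 'rb') as f:
        f.seek(0, 2)                       # jump to the end to get the file size
        nrecords = f.tell() // reclen
        lo, hi = 0, nrecords - 1
        while lo <= hi:
            mid = (lo + hi) // 2
            f.seek(mid * reclen)           # offset = record index * record length
            rec = f.read(strlen)
            if rec == target:
                return True
            elif rec < target:
                lo = mid + 1
            else:
                hi = mid - 1
    return False

# e.g. file_contains('sorted_strings.txt', 'ABCDEFGHIJKLMN')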

... or put the data into a simple on-disc dictionary such as
what mxBeeBase implements:

http://www.egenix.com/products/python/mxBase/mxBeeBase/

Once you have that dict set up, lookups are very fast.

-- 
Marc-Andre Lemburg
eGenix.com

Professional Python Services directly from the Source  (#1, May 06 2010)
 Python/Zope Consulting and Support ...http://www.egenix.com/
 mxODBC.Zope.Database.Adapter ... http://zope.egenix.com/
 mxODBC, mxDateTime, mxTextTools ...http://python.egenix.com/


::: Try our new mxODBC.Connect Python Database Interface for free ! 


   eGenix.com Software, Skills and Services GmbH  Pastor-Loeh-Str.48
D-40764 Langenfeld, Germany. CEO Dipl.-Math. Marc-Andre Lemburg
   Registered at Amtsgericht Duesseldorf: HRB 46611
   http://www.egenix.com/company/contact/
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-05-03 Thread Bryan
Karin Lagesen wrote:
 I have approx 83 million strings, all 14 characters long. I need to be
 able to take another string and find out whether this one is present
 within the 83 million strings.
[...]
 I run out of memory building both the set and the dictionary, so
 what I seem to be left with is the list

I agree with the recommendations to try modules that maintain the set
on disk (sqlite, dbm, shelve), but here's an in-memory solution: Hash
to about 8 million buckets, and use a compact representation for the
several strings in each bucket. That gives you most of the speed and
avoids most of the memory overhead. Python makes it easy:


class ConstLenStrSet:

    """Keep a set of strings all of the same length.
    Support set.add() and Python's 'in' test.
    """

    def __init__(self, strlen=14, nbuckets=8000000):
        # default of roughly 8 million buckets, as described above
        assert strlen > 0 and nbuckets > 0
        self.strlen = strlen
        self.nbuckets = nbuckets | 1
        self.table = [''] * self.nbuckets

    def _hashstr(self, s):
        return hash(s) % self.nbuckets

    def add(self, s):
        assert len(s) == self.strlen
        if s not in self:
            self.table[self._hashstr(s)] += s

    def __contains__(self, s):
        data = self.table[self._hashstr(s)]
        for i in range(0, len(data), self.strlen):
            if data[i : i + self.strlen] == s:
                return True
        return False


# Rudimentary demo-test:

from random import choice
from string import letters

def rnd_str(slen=14):
    return ''.join(choice(letters) for _ in range(slen))

# Test against Python's built-in set
sref = set()
for i in range(83):
    sref.add(rnd_str())

print 'Strings generated.'

sset = ConstLenStrSet(14, 8)
for s in sref:
    sset.add(s)

print 'ConstLenStrSet loaded.'

for s in sref:
    assert s in sset

for i in range(1000):
    s = rnd_str()
    if s in sref:
        print 'That was unlucky.'
    else:
        assert s not in sset



If building the set is too slow, and you know you don't have a lot of
duplicate strings, you can use a faster insert method that doesn't
check whether the string is already in the set:


    def add_quick(self, s):
        assert len(s) == self.strlen
        self.table[self._hashstr(s)] += s


--
--Bryan
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-05-02 Thread Albert van der Horst
In article 877hnpjtdw@rudin.co.uk,
Paul Rudin  paul.nos...@rudin.co.uk wrote:
Karin Lagesen karin.lage...@bio.uio.no writes:

 Hello.

 I have approx 83 million strings, all 14 characters long. I need to be
 able to take another string and find out whether this one is present
 within the 83 million strings.

 Now, I have tried storing these strings as a list, a set and a dictionary.
 I know that finding things in a set and a dictionary is very much faster
 than working with a list, so I tried those first. However, I run out of
 memory building both the set and the dictionary, so what I seem to be left
 with is the list, and using the in method.

 I imagine that there should be a faster/better way than this?


Shouldn't a set with 83 million 14 character strings be fine in memory
on a stock PC these days? I suppose if it's low on ram you might start
swapping which will kill performance. Perhaps the method you're using to
build the data structures creates lots of garbage? How much ram do you
have and how much memory does the python process use as it builds your
data structures?

And if this is practical there should be no swapping problems,
as the working set will be a small fraction of the data used.

SNIP

There are other algorithms you can use that have better theoretical
performance than using bisect for this particular problem, but you get
into trade offs between executing things in python as opposed to C if
you start to hand code things in python.

There are a lot of possibilities, but they depend a great deal on
secondary conditions, like how often the 83 M dataset changes.

Groetjes Albert

--
-- 
Albert van der Horst, UTRECHT,THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
alb...@spearc.xs4all.nl =n http://home.hccnet.nl/a.w.m.van.der.horst

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-05-01 Thread News123
Dennis Lee Bieber wrote:
 On Thu, 29 Apr 2010 11:38:28 +0200, Karin Lagesen
 karin.lage...@bio.uio.no declaimed the following in comp.lang.python:
 
 Hello.

 I have approx 83 million strings, all 14 characters long. I need to be
 able to take another string and find out whether this one is present
 within the 83 million strings.



   So don't load them into memory... First use a file-based (not memory

 
   That lets you do a binary search on the file. Much faster than a
 linear search (linear search will average out to 41.5M read operations;
 binary should be around 1 reads)

Don't you mean 27 reads instead of 41.5 M reads?

>>> from math import log
>>> log(83e6)/log(2)
26.306608000671101



N
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-05-01 Thread Stefan Behnel

Duncan Booth, 30.04.2010 10:20:

So more than 3GB just for the strings (and that's for Python 2.x; on
Python 3.x you'll need nearly 5GB).

Running on a 64 bit version of Python should be fine, but for a 32 bit
system a naive approach just isn't going to work.

Option 1: use a trie. That should reduce storage, maybe it will reduce
it enough, maybe not. It depends on the data.


Depending on the implementation and the data, a trie can also use a lot 
/more/ space than the set of strings that it contains. The 83 million 14 
character strings can well include a relatively large subset of the 
possible permutations (imagine purely numeric, hex-numeric or lower-case 
alpha-numeric strings, for example), so the trie will likely need to branch 
very often with very low intermediate run length. If you use pointers for 
trie branches, that's at least 8 bytes per branch on a 64bit system, versus 
1 byte per character in a byte string list. Depending on the ratio of 
branches to characters, one or the other may win. So a naive approach 
likely won't work for tries either.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-05-01 Thread News123
Dennis Lee Bieber wrote:
 On Sat, 01 May 2010 13:48:02 +0200, News123 news1...@free.fr declaimed
 the following in gmane.comp.python.general:
 
 Dennis Lee Bieber wrote:
 That lets you do a binary search on the file. Much faster than a
 linear search (linear search will average out to 41.5M read operations;
 binary should be around 1 reads)
 Don't you mean 27 reads instead of 41.5 M reads?

   Don't you mean my 1 reads? The 41.5M is the average for the
 linear search.
  
Indeed, this is what I meant. or phrased differently:
about 27 reads with a binary search instead of 41.5M reads average with
a linear search.




 >>> from math import log
 >>> log(83e6)/log(2)
 26.306608000671101
   Probably -- it was late at night and I was working things out in my
 head...

I know about late nights.  I just wondered whether I overlooked something.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-30 Thread Paul Rudin
Karin Lagesen karin.lage...@bio.uio.no writes:

 Hello.

 I have approx 83 million strings, all 14 characters long. I need to be
 able to take another string and find out whether this one is present
 within the 83 million strings.

 Now, I have tried storing these strings as a list, a set and a dictionary.
 I know that finding things in a set and a dictionary is very much faster
 than working with a list, so I tried those first. However, I run out of
 memory building both the set and the dictionary, so what I seem to be left
 with is the list, and using the in method.

 I imagine that there should be a faster/better way than this?


Shouldn't a set with 83 million 14 character strings be fine in memory
on a stock PC these days? I suppose if it's low on ram you might start
swapping which will kill performance. Perhaps the method you're using to
build the data structures creates lots of garbage? How much ram do you
have and how much memory does the python process use as it builds your
data structures?

A set should give good performance if the target string is also 14
characters.

If you do end up with the list then try using bisect
http://docs.python.org/library/bisect.html which should be quicker
than  just using in (which I think just scans linearly through the list
testing for equality).

There are other algorithms you can use that have better theoretical
performance than using bisect for this particular problem, but you get
into trade offs between executing things in python as opposed to C if
you start to hand code things in python.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-30 Thread Duncan Booth
Paul Rudin paul.nos...@rudin.co.uk wrote:

 Shouldn't a set with 83 million 14 character strings be fine in memory
 on a stock PC these days? I suppose if it's low on ram you might start
 swapping which will kill performance. Perhaps the method you're using
 to build the data structures creates lots of garbage? How much ram do
 you have and how much memory does the python process use as it builds
 your data structures?

Some simple experiments should show you that a stock PC running a 32 bit 
Python will struggle:

>>> s = "12345678901234"
>>> sys.getsizeof(s)
38
>>> 83000000 * 38
3154000000

So more than 3GB just for the strings (and that's for Python 2.x; on 
Python 3.x you'll need nearly 5GB).

Running on a 64 bit version of Python should be fine, but for a 32 bit 
system a naive approach just isn't going to work.

Option 1: use a trie. That should reduce storage, maybe it will reduce 
it enough, maybe not. It depends on the data.

Option 2: use a simple database. e.g. shelve. Simple to write and easy 
to use.

Option 3: use a linear search through the file containing the 83 million 
strings. If the OP really does want to check *one* string then that is 
comparatively slow but still much faster than posting the question here. 
If they really wanted to check say 10,000 strings then put those strings 
in a set and test each line of the 83 million line file against the set 
instead of doing it the other way round. At some number of test strings 
this is probably faster than using a database.
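
A rough sketch of that reversed test (the file names are illustrative):

# the 10,000 (or however many) strings we want to look up
queries = set(line.strip() for line in open('queries.txt'))

found = set()
with open('all_strings.txt') as f:      # the 83-million-line file
    for line in f:
        s = line.strip()
        if s in queries:
            found.add(s)

missing = queries - found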


-- 
Duncan Booth http://kupuguy.blogspot.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-30 Thread Steven D'Aprano
On Fri, 30 Apr 2010 08:23:39 +0100, Paul Rudin wrote:

 Karin Lagesen karin.lage...@bio.uio.no writes:
 
 Hello.

 I have approx 83 million strings, all 14 characters long. I need to be
 able to take another string and find out whether this one is present
 within the 83 million strings.

 Now, I have tried storing these strings as a list, a set and a
 dictionary. I know that finding things in a set and a dictionary is
 very much faster than working with a list, so I tried those first.
 However, I run out of memory building both the set and the dictionary,
 so what I seem to be left with is the list, and using the in method.

 I imagine that there should be a faster/better way than this?


 Shouldn't a set with 83 million 14 character strings be fine in memory
 on a stock PC these days? 

Not even close. Using Python 2.6:

>>> s = "12345678901234"
>>> assert len(s) == 14
>>> import sys
>>> sys.getsizeof(s)
38

So a single 14 char string takes 38 bytes.


>>> import random, string
>>> chars = list(string.letters + string.digits)*4
>>> def rnd_str():
...     random.shuffle(chars)
...     return ''.join(chars[:14])
...
>>> s = set()
>>> while len(s) < 83000:
...     s.add(rnd_str())
...
>>> sys.getsizeof(s)
1048688


So a set with 83000 such strings takes approximately 1 MB. So far fairly 
trivial. But that's just the memory used by the container (the set), not 
the contents. 38 bytes * 83,000 strings = another 3 MB. Which of course 
is trivial for a modern PC, but the OP is using 83 million such strings, 
not 83 thousand, which gives us a grand total of at least 3 gigabytes. An 
entry level desktop PC these days is generally 2GB, and entry level 
notebooks might have half a gig.

If the OP is on a 64 bit system, every pointer will be twice the size, 
leading to even larger memory requirements. And if the strings are 
unicode, then potentially they could be anything up to four times larger 
each. Worst case, the OP could need something of the order of 24 GB to 
store the strings all in memory.


-- 
Steven
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-30 Thread Paul Rudin
Duncan Booth duncan.bo...@invalid.invalid writes:

 Paul Rudin paul.nos...@rudin.co.uk wrote:

 Shouldn't a set with 83 million 14 character strings be fine in memory
 on a stock PC these days? I suppose if it's low on ram you might start
 swapping which will kill performance. Perhaps the method you're using
 to build the data structures creates lots of garbage? How much ram do
 you have and how much memory does the python process use as it builds
 your data structures?

 Some simple experiments should show you that a stock PC running a 32 bit 
 Python will struggle:

 >>> s = "12345678901234"
 >>> sys.getsizeof(s)
 38
 >>> 83000000 * 38
 3154000000

 So more than 3GB just for the strings (and that's for Python 2.x; on 
 Python 3.x you'll need nearly 5GB).

 Running on a 64 bit version of Python should be fine, but for a 32 bit 
 system a naive approach just isn't going to work.

It depends - a few gig of RAM can be cheap compared with programmer
time. If you know you can solve a problem by spending a few euros on
some extra RAM it can be a good solution! It depends of course where the
code is being deployed - if it's something that's intended to be
deployed widely then you can't expect thousands of people to go out and
buy more RAM - but if it's a one off deployment for a particular
environment then it can be the best way to go.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-30 Thread Christian Heimes

>>> s = "12345678901234"
>>> assert len(s) == 14
>>> import sys
>>> sys.getsizeof(s)
38

So a single 14 char string takes 38 bytes.


Make that at least 40 bytes. You have to take memory alignment into account.


So a set with 83000 such strings takes approximately 1 MB. So far fairly
trivial. But that's just the memory used by the container (the set), not
the contents. 38 bytes * 83,000 strings = another 3 MB. Which of course
is trivial for a modern PC, but the OP is using 83 million such strings,
not 83 thousand, which gives us a grand total of at least 3 gigabytes. An
entry level desktop PC these days is generally 2GB, and entry level
notebooks might have half a gig.


You are pretty much screwed on a 32-bit system here. In my experience a 
32-bit system can't store more than 2.5 to 2.8 GB on the heap. Eventually 
malloc() will fail, since large parts of the 4 GB address space are 
reserved for other things like the stack, the entry point, shared library 
mappings, error detection etc. Memory fragmentation isn't an issue here.


Other ideas:

* use a SQL database with an index on the data column. The index could 
optimize the "starts with" case (a rough sketch follows below).
* You need to search for a string inside a large set of texts? Sounds 
like a job for a full-text search engine! Grab PyLucene and index your 
data inside a Lucene database. An SSD helps a lot here.
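
A minimal sqlite3 sketch of the first idea; the file names are illustrative
and the one-column schema is just one way to do it:

import sqlite3

conn = sqlite3.connect('strings.db')
conn.execute('CREATE TABLE IF NOT EXISTS strings (s TEXT PRIMARY KEY)')

# Load the strings; the PRIMARY KEY doubles as the index.
with open('all_strings.txt') as f:
    conn.executemany('INSERT OR IGNORE INTO strings VALUES (?)',
                     ((line.strip(),) for line in f))
conn.commit()

def present(s):
    cur = conn.execute('SELECT 1 FROM strings WHERE s = ?', (s,))
    return cur.fetchone() is not None

def starts_with(prefix):
    # Range query on the same index; assumes plain ASCII strings.
    cur = conn.execute('SELECT s FROM strings WHERE s >= ? AND s < ?',
                       (prefix, prefix + '\x7f'))
    return [row[0] for row in cur]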


Christian

--
http://mail.python.org/mailman/listinfo/python-list


External Hashing [was Re: matching strings in a large set of strings]

2010-04-30 Thread Helmut Jarausch
I think one could apply an external hashing technique which would require only
very few disk accesses per lookup.
Unfortunately, I'm not aware of an implementation in Python.
Does anybody know about a Python implementation of external hashing?

Thanks,
Helmut.

-- 
Helmut Jarausch

Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: External Hashing [was Re: matching strings in a large set of strings]

2010-04-30 Thread Tim Chase

On 04/30/2010 12:51 PM, Helmut Jarausch wrote:

I think one could apply an external hashing technique which would require only
very few disk accesses per lookup.
Unfortunately, I'm not aware of an implementation in Python.
Does anybody know about a Python implementation of external hashing?


While you don't detail what you're hashing, Stefan Behnel 
already suggested (in the parent thread) using one of Python's 
native dbm modules (I just use anydbm and let it choose).  The 
underlying implementations should be fairly efficient (assuming 
you don't use the dumbdbm last-resort fallback).  With the anydbm 
interface, you can implement dict/set semantics as long as you 
take care that everything is marshalled into and out of strings 
for keys/values.
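
For example, a minimal sketch of anydbm used as an on-disk set (the stored
value is just a placeholder; the file name is illustrative):

import anydbm                          # Python 2; the Python 3 equivalent is dbm

db = anydbm.open('strings.dbm', 'c')   # 'c' creates the file if needed

def add(s):
    db[s] = '1'                        # keys and values must be strings

def contains(s):
    return db.has_key(s)               # hashed lookup on disk, no full scan

# add('ABCDEFGHIJKLMN')
# print(contains('ABCDEFGHIJKLMN'))
# db.close()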


-tkc



--
http://mail.python.org/mailman/listinfo/python-list


Re: External Hashing [was Re: matching strings in a large set of strings]

2010-04-30 Thread Dave Angel

Helmut Jarausch wrote:

I think one could apply an external hashing technique which would require only
very few disk accesses per lookup.
Unfortunately, I'm not aware of an implementation in Python.
Does anybody know about a Python implementation of external hashing?

Thanks,
Helmut.

  
That's what databases are for.  But if you're trying to streamline it 
further than a database, you need to give some constraints.  In an 
earlier thread, you mentioned strings with a constant size.  I don't 
think you ever actually said that the match string was also of the same 
size, but I'll assume it is.  Final question is how often you'll be 
updating the list, versus searching it.  If you do both frequently, 
stick to a database.


On the other hand, if the list on disk is fixed, you could thoroughly 
optimize its makeup.  Perhaps the easiest efficient form would be to 
have a file in two parts.  The first part is the hash table, where each 
entry has a seek-offset and a size.  And the rest of the file is just a 
list of items, sorted by hash value.


If you make a hash large enough that the maximum collision list is a 
couple of k, then two seeks would get any record.  First seek to the 
hash entry, then use that to decide where to seek for the data, and how 
much to read.  Then you simply search that block in memory.
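
The lookup side might look roughly like this -- the exact layout (a header of
packed (offset, size) pairs followed by the 14-byte records grouped by bucket)
and the CRC-based hash are illustrative assumptions, not something specified
above:

import struct, zlib

STRLEN, NBUCKETS = 14, 1000003           # illustrative sizes
ENTRY = struct.Struct('<QQ')             # (byte offset, byte count) per bucket

def bucket(s):
    return zlib.crc32(s) % NBUCKETS      # any hash that is stable on disk will do

def lookup(path, s):
    with open(path, 'rb') as f:
        # Seek 1: read this bucket's (offset, size) entry from the header.
        f.seek(bucket(s) * ENTRY.size)
        offset, size = ENTRY.unpack(f.read(ENTRY.size))
        # Seek 2: read the whole bucket and scan it in memory.
        f.seek(offset)
        data = f.read(size)
    return any(data[i:i + STRLEN] == s
               for i in range(0, len(data), STRLEN))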


You could also play games with a two-level hash, where each block of 
records has its own mini hash table.  Doesn't seem worth it for a mere 
83 million records.


DaveA
--
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-29 Thread Peter Otten
Karin Lagesen wrote:

 I have approx 83 million strings, all 14 characters long. I need to be
 able to take another string and find out whether this one is present
 within the 83 million strings.
 
 Now, I have tried storing these strings as a list, a set and a dictionary.
 I know that finding things in a set and a dictionary is very much faster
 than working with a list, so I tried those first. However, I run out of
 memory building both the set and the dictionary, so what I seem to be left
 with is the list, and using the in method.
 
 I imagine that there should be a faster/better way than this?

Do you need all matches or do you just have to know whether there are any? 
Can the search string be shorter than 14 characters?

One simple improvement over the list may be using one big string instead of 
the 83 million short ones and then search it using string methods.
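
A rough sketch of that, assuming the search string is also exactly 14
characters (the file name is illustrative):

# One big string of 83e6 * 14 characters, with no per-string overhead.
big = open('all_strings.txt').read().replace('\n', '')

def present(s, strlen=14):
    pos = big.find(s)
    while pos != -1:
        if pos % strlen == 0:   # only accept matches aligned on a 14-char boundary
            return True
        pos = big.find(s, pos + 1)
    return False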

Peter
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-29 Thread Stefan Behnel

Karin Lagesen, 29.04.2010 11:38:

I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings.

Now, I have tried storing these strings as a list, a set and a dictionary.
I know that finding things in a set and a dictionary is very much faster
than working with a list, so I tried those first. However, I run out of
memory building both the set and the dictionary, so what I seem to be left
with is the list, and using the in method.

I imagine that there should be a faster/better way than this?


Try one of the dbm modules in the stdlib. They give you dictionary-like 
lookups on top of a persistent database.


Stefan

--
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-29 Thread Mark Tolonen


Karin Lagesen karin.lage...@bio.uio.no wrote in message 
news:416f727c6f5b0edb932b425db9579808.squir...@webmail.uio.no...

Hello.

I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings.

Now, I have tried storing these strings as a list, a set and a dictionary.
I know that finding things in a set and a dictionary is very much faster
than working with a list, so I tried those first. However, I run out of
memory building both the set and the dictionary, so what I seem to be left
with is the list, and using the in method.

I imagine that there should be a faster/better way than this?


You may be overthinking it.  How fast does it need to be?  I generated the 
following file:


>>> f = open('test.txt', 'wt')
>>> for i in xrange(83000000):
...     f.write('0123456789ABCD\n')
...
>>> f.close()

It took about 15 seconds to search that file with a command line search tool 
(Windows, and a slow IDE hard drive):


findstr xyz test.txt

It took about twice that to search via Python with 'in':

>>> for line in open('test.txt'):
...     if 'xyz' in line:
...         print line
...

Reading only a line at a time has the advantage of using very little memory. 
Storing 83 million 14-character strings in memory requires a minimum of 
about 1.2GB not counting overhead for lists/sets/dictionaries.


-Mark


--
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-29 Thread MRAB

Karin Lagesen wrote:

Hello.

I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings.

Now, I have tried storing these strings as a list, a set and a dictionary.
I know that finding things in a set and a dictionary is very much faster
than working with a list, so I tried those first. However, I run out of
memory building both the set and the dictionary, so what I seem to be left
with is the list, and using the in method.

I imagine that there should be a faster/better way than this?


You could sort the list and then use a binary chop (Google can tell you
what that is if you don't know already).
--
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-29 Thread Duncan Booth
MRAB pyt...@mrabarnett.plus.com wrote:

 Karin Lagesen wrote:
 Hello.
 
 I have approx 83 million strings, all 14 characters long. I need to
 be able to take another string and find out whether this one is
 present within the 83 million strings.
 
 Now, I have tried storing these strings as a list, a set and a
 dictionary. I know that finding things in a set and a dictionary is
 very much faster than working with a list, so I tried those first.
 However, I run out of memory building both the set and the
 dictionary, so what I seem to be left with is the list, and using the
 in method. 
 
 I imagine that there should be a faster/better way than this?
 
 You could sort the list and then use a binary chop (Google can tell
 you what that is if you don't know already).
 

... and Python comes batteries included. You can use bisect.bisect_left to 
do the binary chop and then you just have to check whether it found the 
string.
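
For example (assuming the sorted list fits in memory; the file name is
illustrative):

import bisect

strings = sorted(line.strip() for line in open('all_strings.txt'))

def present(s):
    i = bisect.bisect_left(strings, s)
    return i < len(strings) and strings[i] == s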


-- 
Duncan Booth http://kupuguy.blogspot.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-29 Thread Miki

 I have approx 83 million strings, all 14 characters long. I need to be
 able to take another string and find out whether this one is present
 within the 83 million strings.
Have a look at the shelve module.

If you want to write the algorithm yourself, I suggest
http://en.wikipedia.org/wiki/Trie
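
A minimal dict-of-dicts trie, just to illustrate the idea (see elsewhere in
the thread for the memory caveats):

def trie_add(root, s):
    node = root
    for ch in s:
        node = node.setdefault(ch, {})
    node[None] = True                 # end-of-string marker

def trie_contains(root, s):
    node = root
    for ch in s:
        if ch not in node:
            return False
        node = node[ch]
    return None in node

root = {}
trie_add(root, 'ABCDEFGHIJKLMN')
print(trie_contains(root, 'ABCDEFGHIJKLMN'))   # True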

HTH,
--
Miki
http://pythonwise.blogspot.com
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: matching strings in a large set of strings

2010-04-29 Thread Terry Reedy

On 4/29/2010 5:38 AM, Karin Lagesen wrote:

Hello.

I have approx 83 million strings, all 14 characters long. I need to be
able to take another string and find out whether this one is present
within the 83 million strings.


If the 'other string' is also 14 chars, so that you are looking for 
exact matches, and you are doing this repeatedly, you might look to 
match hash values first.


--
http://mail.python.org/mailman/listinfo/python-list