Re: matching strings in a large set of strings
Dennis Lee Bieber wrote:
> On Thu, 29 Apr 2010 11:38:28 +0200, Karin Lagesen
> <karin.lage...@bio.uio.no> declaimed the following in comp.lang.python:
>
>> Hello. I have approx 83 million strings, all 14 characters long. I
>> need to be able to take another string and find out whether this one
>> is present within the 83 million strings.
>
> Is this other string also exactly 14 characters long?
>
>> Now, I have tried storing these strings as a list, a set and a
>> dictionary. I know that finding things in a set and a dictionary is
>> very much faster than working with a list, so I tried those first.
>> However, I run out of memory building both the set and the dictionary,
>> so what I seem to be left with is the list, and using the in method.
>
> So don't load them into memory... First use a file-based (not memory
> limited) sort routine on the 83M strings (if position in the file is
> important, make a copy with the line number on the end of each line
> before sorting -- but since you've tried using sets, which do not
> retain order, it's probably not a concern).
>
> Then, since you know the length of the file, and the length of each
> line is fixed, you can do direct-access I/O (well, since C-stream-style
> I/O doesn't do record access, you'll need to calculate offsets from the
> start of the file as (record# - 1) * recordlen)... That lets you do a
> binary search on the file. Much faster than a linear search (linear
> search will average out to 41.5M read operations; binary should be
> around 27 reads).
>
> Or load the data into some relational database and let the compiled
> code do the searching -- probably faster than byte-code-interpreted
> Python for the same algorithm...

... or put the data into a simple on-disk dictionary such as what
mxBeeBase implements:

http://www.egenix.com/products/python/mxBase/mxBeeBase/

Once you have that dict set up, lookups are very fast.

-- 
Marc-Andre Lemburg
eGenix.com
http://www.egenix.com/company/contact/
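The file-based binary search is only a few lines of Python. A minimal
sketch, assuming a sorted file of newline-terminated 14-character
records; the file layout and function name are illustrative, not from
the post:

    import os

    RECLEN = 15   # 14 characters plus the trailing newline

    def in_sorted_file(path, target):
        # Binary search over fixed-length records; never loads the file.
        key = (target + '\n').encode('ascii')
        with open(path, 'rb') as f:
            lo, hi = 0, os.path.getsize(path) // RECLEN
            while lo < hi:
                mid = (lo + hi) // 2
                f.seek(mid * RECLEN)   # offset = record number * record length
                rec = f.read(RECLEN)
                if rec < key:
                    lo = mid + 1
                elif rec > key:
                    hi = mid
                else:
                    return True
        return False

With 83 million records this touches at most about 27 of them, which is
the log2(83e6) = 26.3 figure discussed elsewhere in the thread.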
Re: matching strings in a large set of strings
Karin Lagesen wrote:
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
[...]
> I run out of memory building both the set and the dictionary, so what
> I seem to be left with is the list

I agree with the recommendations to try modules that maintain the set on
disk (sqlite, dbm, shelve), but here's an in-memory solution: hash to
about 8 million buckets, and use a compact representation -- one packed
string -- for the several strings in each bucket. That gives you most of
the speed and avoids most of the memory overhead. Python makes it easy:

class ConstLenStrSet:

    """ Keep a set of strings, all of the same length.
        Supports set.add() and Python's 'in' test.
    """

    def __init__(self, strlen=14, nbuckets=8000000):
        assert strlen > 0 and nbuckets > 0
        self.strlen = strlen
        self.nbuckets = nbuckets | 1  # force an odd bucket count
        self.table = [''] * self.nbuckets

    def _hashstr(self, s):
        return hash(s) % self.nbuckets

    def add(self, s):
        assert len(s) == self.strlen
        if s not in self:
            self.table[self._hashstr(s)] += s

    def __contains__(self, s):
        # Scan the bucket's packed string in strlen-sized steps.
        data = self.table[self._hashstr(s)]
        for i in range(0, len(data), self.strlen):
            if data[i : i + self.strlen] == s:
                return True
        return False


# Rudimentary demo-test:

from random import choice
from string import letters

def rnd_str(slen=14):
    return ''.join(choice(letters) for _ in range(slen))

# Test against Python's built-in set:
sref = set()
for i in range(830000):
    sref.add(rnd_str())
print('Strings generated.')

sset = ConstLenStrSet(14, 80000)
for s in sref:
    sset.add(s)
print('ConstLenStrSet loaded.')

for s in sref:
    assert s in sset

for i in range(1000):
    s = rnd_str()
    if s in sref:
        print('That was unlucky.')
    else:
        assert s not in sset

If building the set is too slow, and you know you don't have a lot of
duplicate strings, you can use a faster insert method that doesn't check
whether the string is already in the set:

    def add_quick(self, s):
        assert len(s) == self.strlen
        self.table[self._hashstr(s)] += s

-- 
--Bryan
Re: matching strings in a large set of strings
In article 877hnpjtdw@rudin.co.uk, Paul Rudin <paul.nos...@rudin.co.uk>
wrote:
> Karin Lagesen <karin.lage...@bio.uio.no> writes:
>> Hello. I have approx 83 million strings, all 14 characters long. I
>> need to be able to take another string and find out whether this one
>> is present within the 83 million strings.
>>
>> [...]
>
> Shouldn't a set with 83 million 14 character strings be fine in memory
> on a stock PC these days? I suppose if it's low on ram you might start
> swapping, which will kill performance. Perhaps the method you're using
> to build the data structures creates lots of garbage? How much ram do
> you have, and how much memory does the python process use as it builds
> your data structures?

And if this is practical there should be no swapping problems, as the
working set will be a small fraction of the data used.

<SNIP>

> There are other algorithms you can use that have better theoretical
> performance than using bisect for this particular problem, but you get
> into trade-offs between executing things in Python as opposed to C if
> you start to hand-code things in Python.

There are a lot of possibilities, but they depend a great deal on
secondary conditions, like how often the 83M dataset changes.

Groetjes Albert

-- 
Albert van der Horst, UTRECHT, THE NETHERLANDS
Economic growth -- being exponential -- ultimately falters.
http://home.hccnet.nl/a.w.m.van.der.horst
Re: matching strings in a large set of strings
Dennis Lee Bieber wrote:
> On Thu, 29 Apr 2010 11:38:28 +0200, Karin Lagesen
> <karin.lage...@bio.uio.no> declaimed the following in comp.lang.python:
>
>> Hello. I have approx 83 million strings, all 14 characters long. I
>> need to be able to take another string and find out whether this one
>> is present within the 83 million strings.
>
> So don't load them into memory... First use a file-based (not memory
> limited) sort routine [...]
>
> That lets you do a binary search on the file. Much faster than a linear
> search (linear search will average out to 41.5M read operations; binary
> should be around 27 reads)

Don't you mean 27 reads instead of 41.5M reads?

    >>> from math import log
    >>> log(83e6)/log(2)
    26.306608000671101

N
Re: matching strings in a large set of strings
Duncan Booth, 30.04.2010 10:20:
> So more than 3GB just for the strings (and that's for Python 2.x; on
> Python 3.x you'll need nearly 5GB).
>
> Running on a 64 bit version of Python should be fine, but for a 32 bit
> system a naive approach just isn't going to work.
>
> Option 1: use a trie. That should reduce storage, maybe it will reduce
> it enough, maybe not. It depends on the data.

Depending on the implementation and the data, a trie can also use a lot
/more/ space than the set of strings that it contains. The 83 million
14-character strings can well include a relatively large subset of the
possible permutations (imagine purely numeric, hex-numeric or lower-case
alphanumeric strings, for example), so the trie will likely need to
branch very often with very low intermediate run length. If you use
pointers for trie branches, that's at least 8 bytes per branch on a
64-bit system, versus 1 byte per character in a byte string list.
Depending on the ratio of branches to characters, one or the other may
win. So a naive approach likely won't work for tries either.

Stefan
Re: matching strings in a large set of strings
Dennis Lee Bieber wrote:
> On Sat, 01 May 2010 13:48:02 +0200, News123 <news1...@free.fr>
> declaimed the following in gmane.comp.python.general:
>
>> Dennis Lee Bieber wrote:
>>> That lets you do a binary search on the file. Much faster than a
>>> linear search (linear search will average out to 41.5M read
>>> operations; binary should be around 27 reads)
>>
>> Don't you mean 27 reads instead of 41.5M reads?
>
> Don't you mean my 27 reads? The 41.5M is the average for the linear
> search.

Indeed, this is what I meant. Or, phrased differently: about 27 reads
with a binary search instead of 41.5M reads on average with a linear
search.

    >>> from math import log
    >>> log(83e6)/log(2)
    26.306608000671101

> Probably -- it was late at night and I was working things out in my
> head...

I know about late nights. I just wondered whether I had overlooked
something.
Re: matching strings in a large set of strings
Karin Lagesen <karin.lage...@bio.uio.no> writes:

> Hello. I have approx 83 million strings, all 14 characters long. I need
> to be able to take another string and find out whether this one is
> present within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a
> dictionary. I know that finding things in a set and a dictionary is
> very much faster than working with a list, so I tried those first.
> However, I run out of memory building both the set and the dictionary,
> so what I seem to be left with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

Shouldn't a set with 83 million 14 character strings be fine in memory
on a stock PC these days? I suppose if it's low on ram you might start
swapping, which will kill performance. Perhaps the method you're using
to build the data structures creates lots of garbage? How much ram do
you have, and how much memory does the python process use as it builds
your data structures?

A set should give good performance if the target string is also 14
characters.

If you do end up with the list, then try using bisect
<http://docs.python.org/library/bisect.html>, which should be quicker
than just using in (which I think just scans linearly through the list
testing for equality).

There are other algorithms you can use that have better theoretical
performance than using bisect for this particular problem, but you get
into trade-offs between executing things in Python as opposed to C if
you start to hand-code things in Python.
Re: matching strings in a large set of strings
Paul Rudin <paul.nos...@rudin.co.uk> wrote:
> Shouldn't a set with 83 million 14 character strings be fine in memory
> on a stock PC these days? I suppose if it's low on ram you might start
> swapping, which will kill performance. Perhaps the method you're using
> to build the data structures creates lots of garbage? How much ram do
> you have, and how much memory does the python process use as it builds
> your data structures?

Some simple experiments should show you that a stock PC running a 32 bit
Python will struggle:

>>> import sys
>>> s = '12345678901234'
>>> sys.getsizeof(s)
38
>>> 83000000 * 38
3154000000

So more than 3GB just for the strings (and that's for Python 2.x; on
Python 3.x you'll need nearly 5GB). Running on a 64 bit version of
Python should be fine, but for a 32 bit system a naive approach just
isn't going to work.

Option 1: use a trie. That should reduce storage, maybe it will reduce
it enough, maybe not. It depends on the data.

Option 2: use a simple database, e.g. shelve. Simple to write and easy
to use.

Option 3: use a linear search through the file containing the 83 million
strings. If the OP really does want to check *one* string, then that is
comparatively slow but still much faster than posting the question here.
If they really wanted to check say 10,000 strings, then put those
strings in a set and test each line of the 83-million-line file against
the set, instead of doing it the other way round. At some number of test
strings this is probably faster than using a database.

-- 
Duncan Booth
http://kupuguy.blogspot.com
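Option 3 takes only a few lines. A rough sketch, assuming the 83 million
strings sit one per line in a file; the file name and the query strings
are made up for the example:

    # Invert the lookup: put the (few) query strings in a set and
    # stream the (huge) file past it once.
    queries = set(['ABCDEFGHIJKLMN', 'OPQRSTUVWXYZAB'])   # made-up queries

    found = set()
    for line in open('strings.txt'):
        line = line.rstrip('\n')
        if line in queries:
            found.add(line)
            if len(found) == len(queries):
                break                 # all queries answered; stop early

    missing = queries - found         # queries not present in the big file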
Re: matching strings in a large set of strings
On Fri, 30 Apr 2010 08:23:39 +0100, Paul Rudin wrote:
> Karin Lagesen <karin.lage...@bio.uio.no> writes:
>> Hello. I have approx 83 million strings, all 14 characters long. I
>> need to be able to take another string and find out whether this one
>> is present within the 83 million strings.
>>
>> [...]
>
> Shouldn't a set with 83 million 14 character strings be fine in memory
> on a stock PC these days?

Not even close. Using Python 2.6:

>>> s = '12345678901234'
>>> assert len(s) == 14
>>> import sys
>>> sys.getsizeof(s)
38

So a single 14 char string takes 38 bytes.

>>> import random, string
>>> chars = list(string.letters + string.digits)*4
>>> def rnd_str():
...     random.shuffle(chars)
...     return ''.join(chars[:14])
...
>>> s = set()
>>> while len(s) < 83000:
...     s.add(rnd_str())
...
>>> sys.getsizeof(s)
1048688

So a set with 83,000 such strings takes approximately 1 MB. So far
fairly trivial. But that's just the memory used by the container (the
set), not the contents. 38 bytes * 83,000 strings = another 3 MB. Which
of course is trivial for a modern PC, but the OP is using 83 million
such strings, not 83 thousand, which gives us a grand total of at least
3 gigabytes.

An entry level desktop PC these days is generally 2GB, and entry level
notebooks might have half a gig. If the OP is on a 64 bit system, every
pointer will be twice the size, leading to even larger memory
requirements. And if the strings are unicode, then potentially they
could be anything up to four times larger each. Worst case, the OP could
need something of the order of 24 GB to store the strings all in memory.

-- 
Steven
Re: matching strings in a large set of strings
Duncan Booth <duncan.bo...@invalid.invalid> writes:
> Paul Rudin <paul.nos...@rudin.co.uk> wrote:
>> Shouldn't a set with 83 million 14 character strings be fine in memory
>> on a stock PC these days? I suppose if it's low on ram you might start
>> swapping, which will kill performance. Perhaps the method you're using
>> to build the data structures creates lots of garbage? How much ram do
>> you have, and how much memory does the python process use as it builds
>> your data structures?
>
> Some simple experiments should show you that a stock PC running a 32
> bit Python will struggle:
>
> >>> s = '12345678901234'
> >>> sys.getsizeof(s)
> 38
> >>> 83000000 * 38
> 3154000000
>
> So more than 3GB just for the strings (and that's for Python 2.x; on
> Python 3.x you'll need nearly 5GB). Running on a 64 bit version of
> Python should be fine, but for a 32 bit system a naive approach just
> isn't going to work.

It depends - a few gig of RAM can be cheap compared with programmer
time. If you know you can solve a problem by spending a few euros on
some extra RAM, it can be a good solution! It depends of course where
the code is being deployed - if it's something that's intended to be
deployed widely then you can't expect thousands of people to go out and
buy more RAM - but if it's a one-off deployment for a particular
environment then it can be the best way to go.
Re: matching strings in a large set of strings
> >>> s = '12345678901234'
> >>> assert len(s) == 14
> >>> import sys
> >>> sys.getsizeof(s)
> 38
>
> So a single 14 char string takes 38 bytes.

Make that at least 40 bytes. You have to take memory alignment into
account.

> So a set with 83,000 such strings takes approximately 1 MB. So far
> fairly trivial. But that's just the memory used by the container (the
> set), not the contents. 38 bytes * 83,000 strings = another 3 MB. Which
> of course is trivial for a modern PC, but the OP is using 83 million
> such strings, not 83 thousand, which gives us a grand total of at least
> 3 gigabytes. An entry level desktop PC these days is generally 2GB, and
> entry level notebooks might have half a gig.

You are pretty much screwed on a 32bit system here. In my experience,
32bit systems can't store more than 2.5 to 2.8 GB on the heap.
Eventually malloc() will fail, since large parts of the 4 GB address
space are reserved for other things like the stack, the entry point,
shared library mappings, error detection, etc. Memory fragmentation
isn't an issue here.

Other ideas:

* Use a SQL database with an index on the data column. The index could
  optimize the "starting with" case.

* You need to search for a string inside a large set of texts? Sounds
  like a job for a fulltext search engine! Grab PyLucene and index your
  data inside a Lucene database. An SSD helps a lot here.

Christian
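For the first idea, the stdlib's sqlite3 module is enough. A hedged
sketch of the shape of it -- the file names are invented, and for 83M
rows you would want to batch the one-time load:

    import sqlite3

    conn = sqlite3.connect('strings.db')
    conn.execute('CREATE TABLE IF NOT EXISTS strings (s TEXT PRIMARY KEY)')

    # One-time bulk load; the PRIMARY KEY doubles as the lookup index.
    with open('strings.txt') as f:
        conn.executemany('INSERT OR IGNORE INTO strings VALUES (?)',
                         ((line.rstrip('\n'),) for line in f))
    conn.commit()

    def present(s):
        # A B-tree index probe: a handful of page reads, nothing in RAM.
        row = conn.execute('SELECT 1 FROM strings WHERE s = ?',
                           (s,)).fetchone()
        return row is not None

    print(present('ABCDEFGHIJKLMN'))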
External Hashing [was Re: matching strings in a large set of strings]
I think one could apply an external hashing technique which would
require only very few disk accesses per lookup. Unfortunately, I'm not
aware of an implementation in Python. Does anybody know about a Python
implementation of external hashing?

Thanks,
Helmut.

-- 
Helmut Jarausch
Lehrstuhl fuer Numerische Mathematik
RWTH - Aachen University
D 52056 Aachen, Germany
Re: External Hashing [was Re: matching strings in a large set of strings]
On 04/30/2010 12:51 PM, Helmut Jarausch wrote:
> I think one could apply an external hashing technique which would
> require only very few disk accesses per lookup. Unfortunately, I'm not
> aware of an implementation in Python. Does anybody know about a Python
> implementation of external hashing?

While you don't detail what you're hashing, Stefan Behnel already
suggested (in the parent thread) using one of Python's native dbm
modules (I just use anydbm and let it choose). The underlying
implementations should be fairly efficient, assuming you don't end up
with the dumbdbm last-resort fallback. With the anydbm interface, you
can implement dict/set semantics as long as you take care that
everything is marshalled into and out of strings for keys/values.

-tkc
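A minimal sketch of that set-on-dbm idea, in Python 2 spelling to match
the rest of the thread; the file name is arbitrary:

    import anydbm   # Python 2; on Python 3 use "import dbm as anydbm"

    db = anydbm.open('strings.dbm', 'c')  # 'c' creates the file if needed

    def add(s):
        db[s] = ''            # value is a don't-care; only the key matters

    def contains(s):
        return db.has_key(s)  # spelled "s in db" on Python 3

    add('ABCDEFGHIJKLMN')
    print(contains('ABCDEFGHIJKLMN'))   # True
    print(contains('OPQRSTUVWXYZAB'))   # False (never added)
    db.close()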
Re: External Hashing [was Re: matching strings in a large set of strings]
Helmut Jarausch wrote:
> I think one could apply an external hashing technique which would
> require only very few disk accesses per lookup. Unfortunately, I'm not
> aware of an implementation in Python. Does anybody know about a Python
> implementation of external hashing?

That's what databases are for. But if you're trying to streamline it
further than a database, you need to give some constraints.

In an earlier thread, you mentioned strings of a constant size. I don't
think you ever actually said that the match string was also of the same
size, but I'll assume it is. The final question is how often you'll be
updating the list versus searching it. If you do both frequently, stick
to a database. On the other hand, if the list on disk is fixed, you
could thoroughly optimize its makeup.

Perhaps the easiest efficient form would be to have a file in two parts.
The first part is the hash table, where each entry has a seek offset and
a size. The rest of the file is just a list of items, sorted by hash
value. If you make the hash table large enough that the maximum
collision list is a couple of k, then two seeks would get any record:
first seek to the hash entry, then use that to decide where to seek for
the data and how much to read. Then you simply search that block in
memory.

You could also play games with a two-level hash, where each block of
records has its own mini hash table. Doesn't seem worth it for a mere
83 million records.

DaveA
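A rough sketch of that two-part file; the bucket count, the struct
layout and the md5-based bucket hash are all illustrative choices, not
something from the post:

    import hashlib, struct

    RECLEN = 14
    NBUCKETS = 1 << 16                 # 64K buckets; tune to the data size
    HDR = struct.Struct('<QI')         # per bucket: 8-byte offset, 4-byte size

    def bucket(s):
        # Stable hash (unlike Python's hash()) so the file survives restarts.
        return int(hashlib.md5(s.encode('ascii')).hexdigest(), 16) % NBUCKETS

    def build(strings, path):
        # Assumes the strings fit in memory once, during the one-time build.
        buckets = [[] for _ in range(NBUCKETS)]
        for s in strings:
            buckets[bucket(s)].append(s)
        with open(path, 'wb') as f:
            pos = NBUCKETS * HDR.size        # records start after the table
            for blist in buckets:            # part 1: (offset, size) table
                size = len(blist) * RECLEN
                f.write(HDR.pack(pos, size))
                pos += size
            for blist in buckets:            # part 2: records, by bucket
                for s in sorted(blist):
                    f.write(s.encode('ascii'))

    def lookup(path, s):
        with open(path, 'rb') as f:
            f.seek(bucket(s) * HDR.size)             # seek 1: table entry
            off, size = HDR.unpack(f.read(HDR.size))
            f.seek(off)                              # seek 2: bucket block
            block = f.read(size)
        t = s.encode('ascii')
        return any(block[i:i+RECLEN] == t for i in range(0, size, RECLEN))

With 83 million 14-byte records and 64K buckets, the average bucket
block is around 18 KB, so a lookup is two seeks plus one small read, as
described.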
Re: matching strings in a large set of strings
Karin Lagesen wrote:
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a
> dictionary. I know that finding things in a set and a dictionary is
> very much faster than working with a list, so I tried those first.
> However, I run out of memory building both the set and the dictionary,
> so what I seem to be left with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

Do you need all matches, or do you just have to know whether there are
any? Can the search string be shorter than 14 characters?

One simple improvement over the list may be using one big string instead
of the 83 million short ones, and then searching it using string
methods.

Peter
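A sketch of that big-string idea, assuming exact 14-character probes;
the alignment test keeps a match that straddles two neighbouring strings
from counting:

    # One contiguous ~1.2 GB str instead of 83 million small objects,
    # so there is only one object's worth of overhead in total.
    STRLEN = 14

    def load(path):
        # Strings are assumed to be one per line in the file.
        return ''.join(line.rstrip('\n') for line in open(path))

    def contains(big, s):
        assert len(s) == STRLEN
        i = big.find(s)
        while i != -1:
            if i % STRLEN == 0:       # only boundary-aligned hits count
                return True
            i = big.find(s, i + 1)
        return False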
Re: matching strings in a large set of strings
Karin Lagesen, 29.04.2010 11:38:
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a
> dictionary. I know that finding things in a set and a dictionary is
> very much faster than working with a list, so I tried those first.
> However, I run out of memory building both the set and the dictionary,
> so what I seem to be left with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

Try one of the dbm modules in the stdlib. They give you dictionary-like
lookups on top of a persistent database.

Stefan
Re: matching strings in a large set of strings
"Karin Lagesen" <karin.lage...@bio.uio.no> wrote in message
news:416f727c6f5b0edb932b425db9579808.squir...@webmail.uio.no...
> Hello. I have approx 83 million strings, all 14 characters long. I need
> to be able to take another string and find out whether this one is
> present within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a
> dictionary. I know that finding things in a set and a dictionary is
> very much faster than working with a list, so I tried those first.
> However, I run out of memory building both the set and the dictionary,
> so what I seem to be left with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

You may be overthinking it. How fast does it need to be? I generated the
following file:

>>> f = open('test.txt', 'wt')
>>> for i in xrange(83000000):
...     f.write('0123456789ABCD\n')
...
>>> f.close()

It took about 15 seconds to search that file with a command-line search
tool (Windows, and a slow IDE hard drive):

    findstr xyz test.txt

It took about twice that to search via Python with 'in':

>>> for line in open('test.txt'):
...     if 'xyz' in line:
...         print line
...

Reading only a line at a time has the advantage of using very little
memory. Storing 83 million 14-character strings in memory requires a
minimum of about 1.2GB, not counting the overhead of lists, sets or
dictionaries.

-Mark
Re: matching strings in a large set of strings
Karin Lagesen wrote:
> Hello. I have approx 83 million strings, all 14 characters long. I need
> to be able to take another string and find out whether this one is
> present within the 83 million strings.
>
> Now, I have tried storing these strings as a list, a set and a
> dictionary. I know that finding things in a set and a dictionary is
> very much faster than working with a list, so I tried those first.
> However, I run out of memory building both the set and the dictionary,
> so what I seem to be left with is the list, and using the in method.
>
> I imagine that there should be a faster/better way than this?

You could sort the list and then use a binary chop (Google can tell you
what that is if you don't know already).
Re: matching strings in a large set of strings
MRAB <pyt...@mrabarnett.plus.com> wrote:
> Karin Lagesen wrote:
>> Hello. I have approx 83 million strings, all 14 characters long. I
>> need to be able to take another string and find out whether this one
>> is present within the 83 million strings.
>>
>> [...]
>>
>> I imagine that there should be a faster/better way than this?
>
> You could sort the list and then use a binary chop (Google can tell you
> what that is if you don't know already).

... and Python comes batteries included. You can use bisect.bisect_left
to do the binary chop, and then you just have to check whether it found
the string.

-- 
Duncan Booth
http://kupuguy.blogspot.com
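A minimal sketch of that, assuming the sorted list fits in memory:

    import bisect

    def contains(sorted_strings, s):
        i = bisect.bisect_left(sorted_strings, s)
        # bisect_left gives the insertion point; verify it really found s.
        return i < len(sorted_strings) and sorted_strings[i] == s

    strings = sorted(['OPQRSTUVWXYZAB', 'ABCDEFGHIJKLMN'])   # toy data
    print(contains(strings, 'ABCDEFGHIJKLMN'))   # True
    print(contains(strings, 'AAAAAAAAAAAAAA'))   # False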
Re: matching strings in a large set of strings
> I have approx 83 million strings, all 14 characters long. I need to be
> able to take another string and find out whether this one is present
> within the 83 million strings.

Have a look at the shelve module. If you want to write the algorithm
yourself, I suggest a trie: http://en.wikipedia.org/wiki/Trie

HTH,
-- 
Miki
http://pythonwise.blogspot.com
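For illustration, a toy trie over nested dicts (as Stefan points out
elsewhere in the thread, per-node overhead can make a naive trie larger
than the strings themselves):

    END = object()   # sentinel marking "a complete string ends here"

    def trie_add(root, s):
        node = root
        for ch in s:
            node = node.setdefault(ch, {})
        node[END] = True

    def trie_contains(root, s):
        node = root
        for ch in s:
            if ch not in node:
                return False
            node = node[ch]
        return END in node

    root = {}
    trie_add(root, 'ABCDEFGHIJKLMN')
    print(trie_contains(root, 'ABCDEFGHIJKLMN'))   # True
    print(trie_contains(root, 'ABCDEFGHIJKLMX'))   # False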
Re: matching strings in a large set of strings
On 4/29/2010 5:38 AM, Karin Lagesen wrote:
> Hello. I have approx 83 million strings, all 14 characters long. I need
> to be able to take another string and find out whether this one is
> present within the 83 million strings.

If the 'other string' is also 14 chars, so that you are looking for
exact matches, and you are doing this repeatedly, you might look to
match hash values first.
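One hedged reading of that suggestion: keep only a sorted array of hash
values in RAM as a prefilter, and confirm the rare candidate hits
against the strings on disk. The 64-bit md5-derived hash and the array
module are illustrative choices:

    import array, bisect, hashlib

    def h64(s):
        # Stable 64-bit hash derived from md5 (hash() can vary per run).
        return int(hashlib.md5(s.encode('ascii')).hexdigest()[:16], 16)

    def build_filter(strings):
        # ~8 bytes per entry: about 664 MB for 83 million strings.
        return array.array('Q', sorted(h64(s) for s in strings))

    def maybe_present(hashes, s):
        # False: definitely absent. True: probably present -- confirm
        # with an exact check against the file on disk.
        x = h64(s)
        i = bisect.bisect_left(hashes, x)
        return i < len(hashes) and hashes[i] == x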