Re: don't need dictionary's keys - hash table?
Yeah, hash(hash(immutable)) == hash(immutable), it seems. I'm not sure
this is part of the specification, but it happens that way:

    >>> hash('abc')
    -1600925533
    >>> hash(hash('abc'))
    -1600925533
    >>> hash(hash(hash('abc')))
    -1600925533

Of course then hash(0) == 0, hash(1) == 1, and also hash(2**32) == 1;
all of this in standard CPython on IA32.

Nick Vatamaniuc

Ganesan Rajagopal wrote:
> Terry == Terry Hancock [EMAIL PROTECTED] writes:
>
> > Note that it is trivial to catch collisions on entry and correct
> > them:
> >
> >     key = hash(url_string)
> >     i = 0
> >     if d.has_key(key):
> >         while d.has_key(key):
> >             i += 1
>
> hash is a number. It's sufficient to do
>
>     while d.has_key(key):
>         key += 1
>
> > I am a little surprised that hash(hash(s)) == hash(s), is that
> > actually true?
>
>     >>> hash(42)
>     42
>
> Ganesan
>
> --
> Ganesan Rajagopal

--
http://mail.python.org/mailman/listinfo/python-list
Re: don't need dictionary's keys - hash table?
> I am a little surprised that hash(hash(s)) == hash(s), is that
> actually true?

As hashes are integers, and the hash of a (small) integer is the
integer itself, this isn't too surprising.

Diez
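[Editorial aside: the small-int fixed point is easy to check on a modern
interpreter, though the string claim in this thread deserves a caveat
today. This is CPython behavior, not part of the language spec.]

```python
# CPython implementation detail, not a language guarantee: an int whose
# magnitude is below the hash modulus hashes to itself, so hashing is
# idempotent once the first hash() call has produced such an int.
print(hash(42))            # 42
print(hash(hash(42)))      # 42

# Caveat for modern CPython: a string hash spans a full machine word,
# while an int hash is reduced modulo a prime (2**61 - 1 on 64-bit
# builds), so hash(hash(s)) == hash(s) is NOT guaranteed for every
# string on today's interpreters, even though it held on the 32-bit
# Python 2 builds discussed in this thread.
s = "abc"
print(hash(hash(s)) == hash(s))   # usually True, but can be False
```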
Re: don't need dictionary's keys - hash table?
Terry Hancock wrote:
> Nick Vatamaniuc wrote:
> > The original post just said that he wanted to index some values by
> > their urls and didn't really care about the urls themselves, so I
> > suggested that he just use the hash of the key as the key. [...]
> > So instead of dic[urlstring] = value he will do
> > dic[hash(urlstring)] = value. But then to retrieve he will have to
> > do old_value = dic[hash(urlstring)].
>
> Note that it is trivial to catch collisions on entry and correct them:
>
>     key = hash(url_string)
>     i = 0
>     if d.has_key(key):
>         while d.has_key(key):
>             i += 1
>         d[key + repr(i)] = val
>     else:
>         d[key] = val
>
> or something along those lines. No need to keep the original string.

How do you intend to get the value back out of the dictionary?

    candidates = []
    for key in xrange(hash(url), sys.maxint):
        try:
            candidates.append(d[key])
        except KeyError:
            break

Instead, I would put all values with the same hash into a container,
along the lines of

    d.setdefault(hash(url), []).append(value)

Here is a simple implementation that can be used to put the idea to a
test. It keeps all but the first url for a cluster with identical
hashes.

    class Cluster(object):
        """A cluster of dictionary values with the same hash(key).
        Helper for Dict."""
        def __init__(self, default):
            self.pairs = {}
            self.default = default
        def __setitem__(self, key, value):
            assert key not in self.pairs
            self.pairs[key] = value
        def __getitem__(self, key):
            return self.pairs.get(key, self.default)
        def __repr__(self):
            return "Cluster(default=%r, pairs=%r)" % (self.default,
                                                      self.pairs)
        def itervalues(self):
            yield self.default
            for value in self.pairs.itervalues():
                yield value

    class Dict(object):
        """A dictionary that stores hashes instead of keys as long as
        there are no collisions. The value entered first becomes the
        default for a cluster of values with the same hash.

        It follows that you cannot implement a reliable containment
        test and must make sure by exterior means that only existing
        keys are looked up."""
        def __init__(self):
            self.dict = {}
        def __setitem__(self, key, value):
            short_key = hash(key)
            try:
                cluster = self.dict[short_key]
            except KeyError:
                self.dict[short_key] = value
            else:
                if isinstance(cluster, Cluster):
                    cluster[key] = value
                else:
                    cluster = Cluster(cluster)
                    cluster[key] = value
                    self.dict[short_key] = cluster
        def __getitem__(self, key):
            cluster = self.dict[hash(key)]
            if not isinstance(cluster, Cluster):
                return cluster
            return cluster[key]
        def itervalues(self):
            for value in self.dict.itervalues():
                if isinstance(value, Cluster):
                    for v in value.itervalues():
                        yield v
                else:
                    yield value
        def __repr__(self):
            return "Dict(%s)" % repr(self.dict)[1:-1]

    if __name__ == "__main__":
        class BadHash(object):
            """Key with user-defined hash. Allows enforcing hash
            collisions."""
            def __init__(self, value, hash):
                self.value = value
                self.hash = hash
            def __hash__(self):
                return hash(self.hash)
            def __eq__(self, other):
                return self.value == other.value
            def __repr__(self):
                return "BadHash(value=%r, hash=%r)" % (self.value,
                                                       self.hash)

        d = Dict()
        for i, v in enumerate("alpha beta gamma delta".split()):
            d[BadHash(v, i)] = v
        for v in "sigma tau omega".split():
            d[BadHash(v, 42)] = v
        print d
        assert d[BadHash("gamma", 2)] == "gamma"
        assert d[BadHash("sigma", 42)] == "sigma"
        assert d[BadHash("tau", 42)] == "tau"
        # but be warned that
        assert d[BadHash("whatever", 42)] == "sigma"
        expected = sorted("alpha beta gamma delta sigma tau omega".split())
        assert list(sorted(d.itervalues())) == expected

I fear that the memory footprint of a url is not large enough to
warrant the approach, though.

Peter
Re: don't need dictionary's keys - hash table?
[EMAIL PROTECTED] wrote:
> > depending on your application, a bloom filter might be good enough:
> >
> >     http://en.wikipedia.org/wiki/Bloom_filter
>
> Thanks (everyone) for the comments. I like the idea of the bloom
> filter or using an md5 hash, since a rare collision will not be a
> show-stopper in my case.

btw, since my brain is on vacation, could anyone who already has all
the necessary background information in their head (Paul?) tell me if
using a chopped-up MD5 or SHA-1 (or whatever) digest as the hash
functions for a bloom filter is a good idea... i.e. something like:

    import sha

    try:
        all
    except NameError:
        def all(S):
            for x in S:
                if not x:
                    return False
            return True

    def digest(data):
        d = sha.sha(data).digest()
        return [d[i:i+4] for i in range(0, len(d), 4)]

    class bloom(object):
        def __init__(self):
            self.filter = [set() for i in range(len(digest("")))]
        def add(self, data):
            for s, p in zip(self.filter, digest(data)):
                s.add(p)
        def __contains__(self, data):
            return all(p in s for s, p in zip(self.filter, digest(data)))

    b = bloom()
    b.add("showbiz")
    b.add("absolution")

    print "hullaballo" in b
    print "showbiz" in b
    print "origin of symmetry" in b

(and if this isn't a good idea, why it isn't).

/F
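[Editorial aside: the old `sha` module is gone in Python 3. A
hypothetical `hashlib`-based re-rendering of the same chopped-digest
idea (names like `digest_slices` and `BloomOfSets` are mine, not from
the original post) might look like this:]

```python
import hashlib

def digest_slices(data, nslices=5, width=4):
    # One SHA-1 digest (20 bytes) chopped into five 4-byte sub-hashes,
    # standing in for five independent hash functions.
    d = hashlib.sha1(data.encode("utf-8")).digest()
    return [d[i * width:(i + 1) * width] for i in range(nslices)]

class BloomOfSets(object):
    """Set-per-slice sketch of a Bloom filter: a key is reported present
    only if every slice is found in its corresponding set."""
    def __init__(self, nslices=5):
        self.nslices = nslices
        self.filters = [set() for _ in range(nslices)]
    def add(self, data):
        for s, p in zip(self.filters, digest_slices(data, self.nslices)):
            s.add(p)
    def __contains__(self, data):
        return all(p in s for s, p in
                   zip(self.filters, digest_slices(data, self.nslices)))

b = BloomOfSets()
b.add("showbiz")
b.add("absolution")
print("showbiz" in b)        # True
print("hullaballoo" in b)    # False (barring a rare false positive)
```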
Re: don't need dictionary's keys - hash table?
Fredrik Lundh [EMAIL PROTECTED] writes:
> btw, since my brain is on vacation, could anyone who already has all
> the necessary background information in their head (Paul?) tell me
> if using a chopped-up MD5 or SHA-1 (or whatever) digest as the hash
> functions for a bloom filter is a good idea...

Erm, my brain isn't on vacation, it's just plain missing ;-). However,
I'd say there's nothing wrong in principle with chopping up sha/md5
output if you want to get several hashes as you're doing, assuming the
total hash length is enough. The advantage over computing separate
hashes for each piece is optimization: you don't compute as many
hashes. But there's more flexibility with something like (untested):

    def multi_hash(data, i, size):      # find i'th slice of hash
        return sha.new('%x,%s' % (i, data)).digest()[:size]

Also, the idea of a Bloom filter is to use less memory than a hash
table, by not storing the keys. In your implementation you chop the
sha/md5 output into 4-byte pieces, which means about 4 billion
possible values for each set, which might be too many for a typical
computer. Also, you end up storing each of those 4-byte values as a
Python string.

So I think in practice you'd want smaller sets, and to represent them
as bit vectors instead of as Python sets (untested):

    import array

    class bit_vector(object):
        def __init__(self, size_in_bits):
            nbytes = (size_in_bits + 7) // 8
            self.vec = array.array('B', '\0' * nbytes)
        def __getitem__(self, i):
            # get i'th bit from vector
            a, b = divmod(i, 8)
            return bool(self.vec[a] & (1 << b))
        def __setitem__(self, i, value):
            # set i'th bit in vector
            a, b = divmod(i, 8)
            v = int(bool(value))    # the actual bit to set, 1 or 0
            self.vec[a] = (self.vec[a] & ~(1 << b)) | (v << b)

If we use a bit vector representation we want the hash value to be an
integer (untested):

    def bloom_hash(data, i, size_in_bits):
        # my favorite way to get an integer value from sha
        h = int(sha.new('%x,%s' % (i, data)).hexdigest(), 16)
        return h % size_in_bits

We'd then implement the Bloom filter (untested):

    class bloom(object):
        def __init__(self, nlayers, layersize):
            self.nlayers, self.layersize = nlayers, layersize
            self.filter = [bit_vector(layersize)
                           for i in xrange(nlayers)]
        def add(self, data):
            for i in xrange(self.nlayers):
                self.filter[i][bloom_hash(data, i, self.layersize)] = True
        def __contains__(self, data):
            return False not in (
                self.filter[i][bloom_hash(data, i, self.layersize)]
                for i in xrange(self.nlayers))

Finally, I think to use Bloom filters effectively you have to choose
nlayers and layersize carefully, according to the number of keys you
expect to see, the amount of memory you have available, and/or the
probability of false matches that you're willing to accept. The
standard references (whatever they are) about Bloom filters probably
have suitable formulas for computing these things.
Re: don't need dictionary's keys - hash table?
Paul Rubin http://[EMAIL PROTECTED] writes:
> Finally, I think to use Bloom filters effectively you have to choose
> nlayers and layersize carefully, according to the number of keys you
> expect to see, the amount of memory you have available, and/or the
> probability of false matches that you're willing to accept. The
> standard references (whatever they are) about Bloom filters probably
> have suitable formulas for computing these things.

Heh, there's an online calculator:

  http://www.cc.gatech.edu/fac/Pete.Manolios/bloom-filters/calculator.html

The link is from http://en.wikipedia.org/wiki/Bloom_filter which is a
good article. As described in the article, there's only one layer in
the Bloom filter (you use n different hashes but they all touch the
same bit map), which correspondsds to how the ancient Unix spell
checker worked, I think. It would take some analysis to show which way
is better. I may try working out the formulas as an exercise but I'm
too sleepy for that right now.
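[Editorial aside: for the record, the standard sizing formulas the
calculator above implements are m = -n*ln(p)/(ln 2)^2 bits and
k = (m/n)*ln(2) hash functions; a small sketch:]

```python
import math

def bloom_parameters(n, p):
    # Optimal bit count m and hash count k for n keys at a target
    # false-positive rate p (standard Bloom filter sizing formulas).
    m = int(math.ceil(-n * math.log(p) / (math.log(2) ** 2)))
    k = max(1, int(round(m / float(n) * math.log(2))))
    return m, k

# e.g. a million URLs at a 1% false-positive rate:
m, k = bloom_parameters(10 ** 6, 0.01)
print(m, k)   # about 9.6 million bits (~1.2 MB) and 7 hash functions
```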
Re: don't need dictionary's keys - hash table?
[EMAIL PROTECTED] (k) wrote:

k> Hello, I am using some very large dictionaries with keys that are
k> long strings (urls). For a large dictionary these keys start to
k> take up a significant amount of memory. I do not need access to
k> these keys -- I only need to be able to retrieve the value
k> associated with a certain key, so I do not want to have the keys
k> stored in memory. Could I just hash() the url strings first and
k> use the resulting integer as the key? I think what I'm after here
k> is more like a traditional hash table. If I do it this way am I
k> going to get the memory savings I am after? Will the hash function
k> always generate unique keys? Also, would the same technique work
k> for a set?

Maybe a Berkeley DB hash file would be a good alternative. It can
contain all your key/value pairs but will only keep a small amount in
memory.

--
Piet van Oostrum [EMAIL PROTECTED] URL: http://www.cs.uu.nl/~piet
[PGP 8DAE142BE17999C4] Private email: [EMAIL PROTECTED]
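[Editorial aside: along the same lines, the standard library's
dbm-backed `shelve` module keeps the pairs on disk; this sketch uses
`shelve` rather than Berkeley DB bindings specifically, and the url and
file name are made up.]

```python
import os
import shelve
import tempfile

# Disk-backed mapping: keys and values live in a dbm file, not in the
# Python heap, so long url keys cost almost no process memory.
path = os.path.join(tempfile.mkdtemp(), "urlcache")

db = shelve.open(path)
db["http://example.com/a/very/long/url?with=query&and=params"] = 123
db.close()

db = shelve.open(path)   # reopen: the data survived the round trip
print(db["http://example.com/a/very/long/url?with=query&and=params"])
db.close()
```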
don't need dictionary's keys - hash table?
Hello,

I am using some very large dictionaries with keys that are long
strings (urls). For a large dictionary these keys start to take up a
significant amount of memory. I do not need access to these keys -- I
only need to be able to retrieve the value associated with a certain
key, so I do not want to have the keys stored in memory. Could I just
hash() the url strings first and use the resulting integer as the key?
I think what I'm after here is more like a traditional hash table.

If I do it this way am I going to get the memory savings I am after?
Will the hash function always generate unique keys? Also, would the
same technique work for a set?

Any other thoughts or considerations are appreciated. Thank you.
Re: don't need dictionary's keys - hash table?
[EMAIL PROTECTED] wrote:
> Hello, I am using some very large dictionaries with keys that are
> long strings (urls). [...] Could I just hash() the url strings first
> and use the resulting integer as the key? [...] Will the hash
> function always generate unique keys? Also, would the same technique
> work for a set?

I just realized that of course the hash is not always going to be
unique, so this wouldn't really work. And it seems a hash table would
still need to store the keys (as strings) so that string comparisons
can be done when a collision occurs. I guess there's no avoiding
storing the keys?
Re: don't need dictionary's keys - hash table?
[EMAIL PROTECTED] wrote:
> Will the hash function always generate unique keys?

no. hash() is designed for dictionaries (hash tables), not for use as
a cryptographic hash.

depending on your application, a bloom filter might be good enough:

    http://en.wikipedia.org/wiki/Bloom_filter

(see the links section for a Python implementation)

/F
Re: don't need dictionary's keys - hash table?
[EMAIL PROTECTED] wrote:
> I just realized that of course the hash is not always going to be
> unique, so this wouldn't really work. And it seems a hash table
> would still need to store the keys (as strings) so that string
> comparisons can be done when a collision occurs.

btw, Python's dictionary type *is* a highly-optimized implementation
of a traditional hash table.

/F
Re: don't need dictionary's keys - hash table?
[EMAIL PROTECTED] wrote:
> Hello, I am using some very large dictionaries with keys that are
> long strings (urls). [...] I think what I'm after here is more like
> a traditional hash table.

Python dictionaries are traditional hash tables.

> If I do it this way am I going to get the memory savings I am after?
> Will the hash function always generate unique keys? Also, would the
> same technique work for a set? Any other thoughts or considerations
> are appreciated.

You could try and create an md5 sum of your strings and use that as
the key. It _should_ be good enough, but I'm no crypto expert so take
that with a grain of salt.

Diez
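[Editorial aside: a minimal sketch of that suggestion, with an
illustrative url that is not from the thread. The 16-byte MD5 digest
stands in for the arbitrarily long key.]

```python
import hashlib

def short_key(url):
    # A fixed 16-byte digest replaces the url string; an accidental MD5
    # collision on non-adversarial input is astronomically unlikely.
    return hashlib.md5(url.encode("utf-8")).digest()

d = {}
url = "http://example.com/some/very/long/path?with=lots&of=parameters"
d[short_key(url)] = 42

print(d[short_key(url)])      # 42
print(len(short_key(url)))    # 16 bytes, however long the url is
```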
Re: don't need dictionary's keys - hash table?
It should be enough, but it might be a little slower than
hash(string).

Diez B. Roggisch wrote:
> [EMAIL PROTECTED] wrote:
> > [...] Will the hash function always generate unique keys? Also,
> > would the same technique work for a set? Any other thoughts or
> > considerations are appreciated.
>
> You could try and create an md5 sum of your strings and use that as
> the key. It _should_ be good enough, but I'm no crypto expert so
> take that with a grain of salt.
>
> Diez
Re: don't need dictionary's keys - hash table?
Dictionaries are hash tables in Python. If you don't really want to
store the keys, just use dic[hash(key)] = value; this way the
dictionary will have the same shape and distribution of keys as
dic[key] = value, because hash('abc') == hash(hash('abc')), but the
long strings of actual keys are not referenced by the dictionary. Now
you couldn't do dic.keys() and see your urls, and every time you want
to do dic['abc'] you would get a KeyError exception.

Hope this helps,
Nick Vatamaniuc

[EMAIL PROTECTED] wrote:
> I just realized that of course the hash is not always going to be
> unique, so this wouldn't really work. And it seems a hash table
> would still need to store the keys (as strings) so that string
> comparisons can be done when a collision occurs. I guess there's no
> avoiding storing the keys?
Re: don't need dictionary's keys - hash table?
Nick Vatamaniuc wrote:
> If you don't really want to store the keys, just use
> dic[hash(key)] = value; this way the dictionary will have the same
> shape and distribution of keys as dic[key] = value, because
> hash('abc') == hash(hash('abc')), but the long strings of actual
> keys are not referenced by the dictionary.

how come you're so sure that there will never be any collisions?

/F
Re: don't need dictionary's keys - hash table?
> how come you're so sure that there will never be any collisions?

because none of his strings want their insurance to go up... :*)

-tkc
Re: don't need dictionary's keys - hash table?
Fred,

I never said there would be no collisions. For clarity, can you quote
the exact phrase where I mentioned that? To say that there will be no
collisions is nonsense: the number of possible long url strings is
certainly larger than 2**32, so applying the pigeonhole principle you
can easily show that there will be collisions. The point is to make
collisions as unlikely as possible for human-readable text (that is,
make the hashes as uniform as possible), which is why creating good
cryptographic hashes is not easy. Not that the Python hash function
works as hard as MD5 (it is probably a multiplication and a modulo, if
I had to guess). But this is a topic beyond the scope of this thread.

The original post just said that he wanted to index some values by
their urls and didn't really care about the urls themselves, so I
suggested that he just use the hash of the key as the key. The
dictionary will then take a hash of that [note that:
hash(string) == hash(hash(string))], then the dictionary will not keep
a reference to the urls and GC will collect them. So instead of
dic[urlstring] = value he will do dic[hash(urlstring)] = value. But
then to retrieve he will have to do old_value = dic[hash(urlstring)].

Hopefully this makes my point more clear,
Nick V.

Fredrik Lundh wrote:
> Nick Vatamaniuc wrote:
> > If you don't really want to store the keys, just use
> > dic[hash(key)] = value [...]
>
> how come you're so sure that there will never be any collisions?
>
> /F
Re: don't need dictionary's keys - hash table?
Nick Vatamaniuc wrote:
> I never said there would be no collisions. For clarity, can you
> quote the exact phrase where I mentioned that?

the text I quoted is only true if you can guarantee that there will be
no collisions.

/F
Re: don't need dictionary's keys - hash table?
Please don't top post. I had to fix that below.

> > > Any other thoughts or considerations are appreciated.
> >
> > You could try and create an md5 sum of your strings and use that
> > as the key. It _should_ be good enough, but I'm no crypto expert
> > so take that with a grain of salt.
>
> It should be enough, but it might be a little slower than
> hash(string).

hash alone won't be sufficient, as it is guaranteed to have too many
collisions. The OP already figured that out, btw.

Diez
Re: don't need dictionary's keys - hash table?
[EMAIL PROTECTED] writes:
> If I do it this way am I going to get the memory savings I am after?
> Will the hash function always generate unique keys? Also, would the
> same technique work for a set?

A hash function that always generates unique hashes is called a
perfect hash. They tend to be slow, and of course, to generate one you
need to know all the keys in advance.

If you use a long cryptographic hash like md5 or sha, the chance of a
collision is very small. That's probably your best bet. In general, if
your hash is N bits long, you will probably start getting collisions
once you have around 2**(N/2) keys. sha is a 160-bit hash, so getting
collisions by accident is extremely unlikely.
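[Editorial aside: the 2**(N/2) rule of thumb is the birthday bound. A
quick check with the standard approximation
p = 1 - exp(-n*(n-1) / 2**(N+1)):]

```python
import math

def collision_probability(nkeys, nbits):
    # Birthday approximation: chance that nkeys random nbits-bit hashes
    # contain at least one collision.
    x = nkeys * (nkeys - 1) / (2.0 * 2.0 ** nbits)
    return -math.expm1(-x)   # == 1 - exp(-x), accurate even for tiny x

# Around 2**(N/2) keys the chance is already roughly 40%:
print(collision_probability(2 ** 16, 32))
# A billion keys against a 160-bit hash: effectively zero.
print(collision_probability(10 ** 9, 160))
```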
Re: don't need dictionary's keys - hash table?
> depending on your application, a bloom filter might be good enough:
>
>     http://en.wikipedia.org/wiki/Bloom_filter

Thanks (everyone) for the comments. I like the idea of the bloom
filter or using an md5 hash, since a rare collision will not be a
show-stopper in my case.
Re: don't need dictionary's keys - hash table?
Nick Vatamaniuc wrote:
> The original post just said that he wanted to index some values by
> their urls and didn't really care about the urls themselves, so I
> suggested that he just use the hash of the key as the key. [...] So
> instead of dic[urlstring] = value he will do
> dic[hash(urlstring)] = value. But then to retrieve he will have to
> do old_value = dic[hash(urlstring)].

Note that it is trivial to catch collisions on entry and correct them:

    key = hash(url_string)
    i = 0
    if d.has_key(key):
        while d.has_key(key):
            i += 1
        d[key + repr(i)] = val
    else:
        d[key] = val

or something along those lines. No need to keep the original string.
Obviously this is only efficient if collisions are rare.

I am a little surprised that hash(hash(s)) == hash(s), is that
actually true?

Cheers,
Terry

--
Terry Hancock ([EMAIL PROTECTED])
Anansi Spaceworks http://www.AnansiSpaceworks.com
Re: don't need dictionary's keys - hash table?
Terry == Terry Hancock [EMAIL PROTECTED] writes:

> Note that it is trivial to catch collisions on entry and correct
> them:
>
>     key = hash(url_string)
>     i = 0
>     if d.has_key(key):
>         while d.has_key(key):
>             i += 1

hash is a number. It's sufficient to do

    while d.has_key(key):
        key += 1

> I am a little surprised that hash(hash(s)) == hash(s), is that
> actually true?

    >>> hash(42)
    42

Ganesan

--
Ganesan Rajagopal
Re: don't need dictionary's keys - hash table?
Ganesan Rajagopal [EMAIL PROTECTED] writes:
> hash is a number. It's sufficient to do
>
>     while d.has_key(key):
>         key += 1

This is called linear probing and is not considered that great a
collision-resolution strategy for hash tables. Really, if you want an
exhaustive study of hashing, see Knuth vol. 3. If you're just trying
to get the application doing something reasonable, let the database do
its job.
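[Editorial aside: for illustration, a toy open-addressing table with
linear probing (entirely hypothetical, not code from the thread). Note
that resolving a probe requires comparing stored keys, which is exactly
why a real hash table cannot simply drop them.]

```python
class LinearProbeTable(object):
    """Toy fixed-size hash table using linear probing. No resizing, so
    it assumes fewer insertions than slots; real tables (CPython's dict
    included) resize and use smarter probe sequences."""
    def __init__(self, size=16):
        self.slots = [None] * size
    def _probe(self, key):
        i = hash(key) % len(self.slots)
        # Step linearly past slots taken by other keys.
        while self.slots[i] is not None and self.slots[i][0] != key:
            i = (i + 1) % len(self.slots)
        return i
    def __setitem__(self, key, value):
        self.slots[self._probe(key)] = (key, value)
    def __getitem__(self, key):
        slot = self.slots[self._probe(key)]
        if slot is None:
            raise KeyError(key)
        return slot[1]

t = LinearProbeTable()
t["http://a.example/"] = 1
t["http://b.example/"] = 2
print(t["http://a.example/"], t["http://b.example/"])   # 1 2
```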