Re: don't need dictionary's keys - hash table?

2006-07-13 Thread Nick Vatamaniuc
Yeah hash(hash(immutable))=hash(immutable) it seems. Not sure this is a
specification but it happens that way:
--
$  hash('abc')
-1600925533
$  hash(hash('abc'))
-1600925533
$  hash(hash(hash(('abc'
-1600925533

-
Of course then hash(0)=0, hash(1)=0 and also hash(2**32)=1, all this in
standard CPython on IA32.

Nick Vatamaniuc



Ganesan Rajagopal wrote:
  Terry == Terry Hancock [EMAIL PROTECTED] writes:

  Note that it is trivial to catch collisions on entry and correct them:

  key = hash(url_string)
  i  = 0
  if d.has_key(key):
 while d.has_key(key):
 i += 1

 hash is a number. It's sufficient to do

 while d.has_key(key):
 key += 1

  I am a little surprised that hash(hash(s)) == hash(s), is that actually
  true?
 
  hash(42)
 42
 
 Ganesan
 
 -- 
 Ganesan Rajagopal

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-13 Thread Diez B. Roggisch
   I am a little surprised that hash(hash(s)) == hash(s), is that actually
 true?

As hashes are integers  the hash of an integer is the integer itself, 
this isn't to surprising.


Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-13 Thread Peter Otten
Terry Hancock wrote:

 Nick Vatamaniuc wrote:
  The original post just said that he wanted to index some values by
  their urls and didn't really care about the urls themselves, so I
  suggested that he just use the hash of the key as the key. The
  dictionary will then take a hash of that [note that:
  hash(string)=hash(hash(string)) ] then the dictionary will not keep
  the reference to the urls and GC will collect them. So instead of
  dic[urlstring']=value he will do dic[hash(urlstring)]=value. But then
  to retrieve he will have to do old_value=dic[hash(urlstring]].
 
 Note that it is trivial to catch collisions on entry and correct them:
 
 key = hash(url_string)
 i  = 0
 if d.has_key(key):
while d.has_key(key):
i += 1
d[key + repr(i)] = val
 else:
d[key] = val
 
 or something along those lines. No need to keep the original string.

How do you intend to get the value back out of the dictionary?

candidates = []
for key in xrange(hash(url)):
try:
candidates.append(d[key])
except KeyError:
break

Instead, I would put all values with the same hash into a container, along
the lines of

d.setdefault(hash(url), []).append(value))

Here is a simple implementation that can be used to put the idea to a test.
It keeps all but the first url for a cluster with identical hashes.

class Cluster(object):
A cluster of dictionary values with the same hash(key).

Helper for Dict.

def __init__(self, default):
self.pairs = {}
self.default = default
def __setitem__(self, key, value):
assert key not in self.pairs
self.pairs[key] = value
def __getitem__(self, key):
return self.pairs.get(key, self.default)
def __repr__(self):
return Cluster(default=%r, pairs=%r) % (self.default, self.pairs)
def itervalues(self):
yield self.default
for value in self.pairs.itervalues():
yield value

class Dict(object):
A dictionary that stores hashes instead of keys as long as there are
no collisions.

The value entered first becomes the default for a cluster of values with
the same hash. It follows that you cannot implement a reliable
containment test and must make sure by exterior means that only existing
keys are looked up.

def __init__(self):
self.dict = {}
def __setitem__(self, key, value):
short_key = hash(key)
try:
cluster = self.dict[short_key]
except KeyError:
self.dict[short_key] = value
else:
if isinstance(cluster, Cluster):
   cluster[key] = value
else:
cluster = Cluster(cluster)
cluster[key] = value
self.dict[short_key] = cluster
def __getitem__(self, key):
cluster = self.dict[hash(key)]
if not isinstance(cluster, Cluster):
return cluster
return cluster[key]
def itervalues(self):
for value in self.dict.itervalues():
if isinstance(value, Cluster):
for v in value.itervalues():
yield v
else:
yield value
def __repr__(self):
return Dict(%s) % repr(self.dict)[1:-1]

if __name__ == __main__:
class BadHash(object):
Key with user-defined hash.

Allows enforcing hash collisions.

def __init__(self, value, hash):
self.value = value
self.hash = hash
def __hash__(self):
return hash(self.hash)
def __eq__(self, other):
return self.value == other.value
def __repr__(self):
return BadHash(value=%r, hash=%r) % (self.value, self.hash)
d = Dict()
for i, v in enumerate(alpha beta gamma delta.split()):
d[BadHash(v, i)] = v
for v in sigma tau omega.split():
d[BadHash(v, 42)] = v
print d
assert d[BadHash(gamma, 2)] == gamma

assert d[BadHash(sigma, 42)] == sigma
assert d[BadHash(tau, 42)] == tau

# but be warned that
assert d[BadHash(whatever, 42)] == sigma

expected = sorted(alpha beta gamma delta sigma tau omega.split())
assert list(sorted(d.itervalues())) == expected

I fear that the memory footprint of a url is not large enough to warrant the
approach, though.

Peter




-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-13 Thread Peter Otten
Terry Hancock wrote:

 Nick Vatamaniuc wrote:
  The original post just said that he wanted to index some values by
  their urls and didn't really care about the urls themselves, so I
  suggested that he just use the hash of the key as the key. The
  dictionary will then take a hash of that [note that:
  hash(string)=hash(hash(string)) ] then the dictionary will not keep
  the reference to the urls and GC will collect them. So instead of
  dic[urlstring']=value he will do dic[hash(urlstring)]=value. But then
  to retrieve he will have to do old_value=dic[hash(urlstring]].
 
 Note that it is trivial to catch collisions on entry and correct them:
 
 key = hash(url_string)
 i  = 0
 if d.has_key(key):
while d.has_key(key):
i += 1
d[key + repr(i)] = val
 else:
d[key] = val
 
 or something along those lines. No need to keep the original string.

How do you intend to get the value back out of the dictionary?

candidates = []
for key in xrange(hash(url), sys.maxint):
try:
candidates.append(d[key])
except KeyError:
break

Instead, I would put all values with the same hash into a container, along
the lines of

d.setdefault(hash(url), []).append(value))

Here is a simple implementation that can be used to put the idea to a test.
It keeps all but the first url for a cluster with identical hashes.

class Cluster(object):
A cluster of dictionary values with the same hash(key).

Helper for Dict.

def __init__(self, default):
self.pairs = {}
self.default = default
def __setitem__(self, key, value):
assert key not in self.pairs
self.pairs[key] = value
def __getitem__(self, key):
return self.pairs.get(key, self.default)
def __repr__(self):
return Cluster(default=%r, pairs=%r) % (self.default, self.pairs)
def itervalues(self):
yield self.default
for value in self.pairs.itervalues():
yield value

class Dict(object):
A dictionary that stores hashes instead of keys as long as there are
no collisions.

The value entered first becomes the default for a cluster of values with
the same hash. It follows that you cannot implement a reliable
containment test and must make sure by exterior means that only existing
keys are looked up.

def __init__(self):
self.dict = {}
def __setitem__(self, key, value):
short_key = hash(key)
try:
cluster = self.dict[short_key]
except KeyError:
self.dict[short_key] = value
else:
if isinstance(cluster, Cluster):
   cluster[key] = value
else:
cluster = Cluster(cluster)
cluster[key] = value
self.dict[short_key] = cluster
def __getitem__(self, key):
cluster = self.dict[hash(key)]
if not isinstance(cluster, Cluster):
return cluster
return cluster[key]
def itervalues(self):
for value in self.dict.itervalues():
if isinstance(value, Cluster):
for v in value.itervalues():
yield v
else:
yield value
def __repr__(self):
return Dict(%s) % repr(self.dict)[1:-1]

if __name__ == __main__:
class BadHash(object):
Key with user-defined hash.

Allows enforcing hash collisions.

def __init__(self, value, hash):
self.value = value
self.hash = hash
def __hash__(self):
return hash(self.hash)
def __eq__(self, other):
return self.value == other.value
def __repr__(self):
return BadHash(value=%r, hash=%r) % (self.value, self.hash)
d = Dict()
for i, v in enumerate(alpha beta gamma delta.split()):
d[BadHash(v, i)] = v
for v in sigma tau omega.split():
d[BadHash(v, 42)] = v
print d
assert d[BadHash(gamma, 2)] == gamma

assert d[BadHash(sigma, 42)] == sigma
assert d[BadHash(tau, 42)] == tau

# but be warned that
assert d[BadHash(whatever, 42)] == sigma

expected = sorted(alpha beta gamma delta sigma tau omega.split())
assert list(sorted(d.itervalues())) == expected

I fear that the memory footprint of a url is not large enough to warrant the
approach, though.

Peter




-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-13 Thread Fredrik Lundh
[EMAIL PROTECTED] wrote:

 depending on your application, a bloom filter might be a good enough:

  http://en.wikipedia.org/wiki/Bloom_filter


 Thanks (everyone) for the comments.  I like the idea of the bloom
 filter or using an md5 hash, since a rare collision will not be a
 show-stopper in my case.

btw, since my brain is on vacation, could anyone who already has all the 
necessary
background information in their head (Paul?) tell me if using a chopped-up MD5 
or
SHA-1 (or whatever) digest as the hash functions for a bloom filter is a good 
idea...

i.e. something like:

import sha

try:
all
except NameError:
def all(S):
 for x in S:
 if not x:
return False
 return True

def digest(data):
d = sha.sha(data).digest()
return [d[i:i+4] for i in range(0, len(d), 4)]

class bloom(object):

def __init__(self):
self.filter = [set() for i in range(len(digest()))]

def add(self, data):
for s, p in zip(self.filter, digest(data)):
s.add(p)
def __contains__(self, data):

return all(p in s for s, p in zip(self.filter, digest(data)))

b = bloom()
b.add(showbiz)
b.add(absolution)
print hullaballo in b
print showbiz in b
print origin of symmetry in b

(and if this isn't a good idea, why it isn't).

/F 



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-13 Thread Paul Rubin
Fredrik Lundh [EMAIL PROTECTED] writes:
 btw, since my brain is on vacation, could anyone who already has all
 the necessary background information in their head (Paul?) tell me
 if using a chopped-up MD5 or SHA-1 (or whatever) digest as the hash
 functions for a bloom filter is a good idea...

Erm, my brain isn't on vacation, it's just plain missing ;-).  However
I'd say there's nothing wrong in principle with chopping up sha/md5
output if you want to get several hashes as you're doing, assuming the
total hash length is enough.  The advantage over doing separate hashes
for each piece is optimization: you don't compute as many hashes.
But there's more flexibility with something like (untested):

   def multi_hash(data, i, size):# find i'th slice of hash
 return sha.new('%x,%s'% (i, data)).digest()[:size]

Also, the idea of a Bloom filter is to use less memory than a hash
table, by not storing the keys.  In your implementation you chop the
sha/md5 output into 4-byte pieces, which means about 4 billion slots
in each set, which might be too many for a typical computer.  Also,
you end up storing each of those 4-byte values as a Python string.  So
I think in practice you'd want smaller sets, and to represent them as
bit vectors instead of as Python sets (untested):

  import array
  class bit_vector(object):
 def __init__(self, size_in_bits):
nbytes = (size_in_bits + 7) // 8
self.vec = array.array('B', '\0' * nbytes)

 def __getattr__(self, i):  # get i'th bit from vector
a,b = divmod(i, 8)
return bool(self.vec[a]  (1  b))

 def __setattr__(self, i, value):   # set i'th bit in vector
a,b = divmod(i, 8)
v = int(bool(value))# the actual bit to set, 1 or 0
self.vec[a] = (self.vec[a]  ~(1  b)) | (v  b)

If we use a bit vector representation we want the hash value to be an
integer (untested):

   def bloom_hash(data, i, size_in_bits):
  # my favorite way to get an integer value from sha
  h = int(sha.new('%x,%s'% (i, data)).hexdigest(), 16)
  return h % size_in_bits

We'd then implement the Bloom filter (untested):

   class bloom(object):
  def __init__(self, nlayers, layersize):
 self.nlayers, self.layersize = nlayers, layersize
 self.filter = [bit_vector(layersize) for i in xrange(nlayers)]

  def add(self, data):
 for i in xrange(self.nlayers):
self.filter[i][bloom_hash(data, i, self.layersize)] = True

  def __contains__(self, data):
 return False not in \ 
   (self.filter[i][bloom_hash(data, i, self.layersize)] \
 for i in xrange(self.nlayers))

Finally, I think to use Bloom filters effectively you have to choose
nlayers and layersize carefully, according to the number of keys you
expect to see, the amount of memory you have available, and/or the
probability of false matches that you're willing to accept.  The
standard references (whatever they are) about Bloom filters probably
have some suitable formulas for computing these things.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-13 Thread Paul Rubin
Paul Rubin http://[EMAIL PROTECTED] writes:
 Finally, I think to use Bloom filters effectively you have to choose
 nlayers and layersize carefully, according to the number of keys you
 expect to see, the amount of memory you have available, and/or the
 probability of false matches that you're willing to accept.  The
 standard references (whatever they are) about Bloom filters probably
 have some suitable formulas for computing these things.

Heh, there's an online calculator:

http://www.cc.gatech.edu/fac/Pete.Manolios/bloom-filters/calculator.html

Link is from 

http://en.wikipedia.org/wiki/Bloom_filter

which is a good article.  As described in the article, there's only
one layer in the Bloom filter (you use n different hashes but they
all touch the same bit map) which corresponds to how the ancient Unix
spell checker worked, I think.  It would take some analysis to show
which way is better.  I may try working out the formulas as an
exercise but I'm too sleepy for that right now.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-13 Thread Piet van Oostrum
 [EMAIL PROTECTED] (k) wrote:

k Hello,
k I am using some very large dictionaries with keys that are long strings
k (urls).  For a large dictionary these keys start to take up a
k significant amount of memory.  I do not need access to these keys -- I
k only need to be able to retrieve the value associated with a certain
k key, so I do not want to have the keys stored in memory.  Could I just
k hash() the url strings first and use the resulting integer as the key?
k I think what I'm after here is more like a tradition hash table.  If I
k do it this way am I going to get the memory savings I am after?  Will
k the hash function always generate unique keys?  Also, would the same
k technique work for a set?

Maybe a Berkeley DB hash file would be a good alternative. It can contain
all your key,value pairs but will only keep a small amount in memory.
-- 
Piet van Oostrum [EMAIL PROTECTED]
URL: http://www.cs.uu.nl/~piet [PGP 8DAE142BE17999C4]
Private email: [EMAIL PROTECTED]
-- 
http://mail.python.org/mailman/listinfo/python-list


don't need dictionary's keys - hash table?

2006-07-12 Thread kdotsky
Hello,
I am using some very large dictionaries with keys that are long strings
(urls).  For a large dictionary these keys start to take up a
significant amount of memory.  I do not need access to these keys -- I
only need to be able to retrieve the value associated with a certain
key, so I do not want to have the keys stored in memory.  Could I just
hash() the url strings first and use the resulting integer as the key?
I think what I'm after here is more like a tradition hash table.  If I
do it this way am I going to get the memory savings I am after?  Will
the hash function always generate unique keys?  Also, would the same
technique work for a set?

Any other thoughts or considerations are appreciated.

Thank You.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread kdotsky
[EMAIL PROTECTED] wrote:
 Hello,
 I am using some very large dictionaries with keys that are long strings
 (urls).  For a large dictionary these keys start to take up a
 significant amount of memory.  I do not need access to these keys -- I
 only need to be able to retrieve the value associated with a certain
 key, so I do not want to have the keys stored in memory.  Could I just
 hash() the url strings first and use the resulting integer as the key?
 I think what I'm after here is more like a tradition hash table.  If I
 do it this way am I going to get the memory savings I am after?  Will
 the hash function always generate unique keys?  Also, would the same
 technique work for a set?


I just realized that of course the hash is not always going to be
unique, so this wouldn't really work.  And it seems a hash table would
still need to store the keys (as strings) so that string comparisons
can be done when a collision occurs.  I guess there's no avoiding
storing they keys?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Fredrik Lundh
[EMAIL PROTECTED] wrote:

 Will the hash function always generate unique keys?

no.  hash() is designed for dictionaries (hash tables), not for use as a 
cryptographic hash.

depending on your application, a bloom filter might be a good enough:

 http://en.wikipedia.org/wiki/Bloom_filter

(see the links section for a Python implementation)

/F

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Fredrik Lundh
[EMAIL PROTECTED] wrote:

 I just realized that of course the hash is not always going to be
 unique, so this wouldn't really work.  And it seems a hash table would
 still need to store the keys (as strings) so that string comparisons
 can be done when a collision occurs.

btw, Python's dictionary type *is* a highly-optimized implementation of 
a traditional hash table.

/F

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Diez B. Roggisch
[EMAIL PROTECTED] wrote:

 Hello,
 I am using some very large dictionaries with keys that are long strings
 (urls).  For a large dictionary these keys start to take up a
 significant amount of memory.  I do not need access to these keys -- I
 only need to be able to retrieve the value associated with a certain
 key, so I do not want to have the keys stored in memory.  Could I just
 hash() the url strings first and use the resulting integer as the key?
 I think what I'm after here is more like a tradition hash table. 

python dictionaries are traditional hash-tables.

 If I 
 do it this way am I going to get the memory savings I am after?  Will
 the hash function always generate unique keys?  Also, would the same
 technique work for a set?
 
 Any other thoughts or considerations are appreciated.

You could try and create a md5 sum of your strings and use that as key. It
_should_ be good enough, but I'm no crypto expert so take that with a grain
of salt.

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Nick Vatamaniuc
It should be enough but it might be a little slower than hash(string).

Diez B. Roggisch wrote:
 [EMAIL PROTECTED] wrote:

  Hello,
  I am using some very large dictionaries with keys that are long strings
  (urls).  For a large dictionary these keys start to take up a
  significant amount of memory.  I do not need access to these keys -- I
  only need to be able to retrieve the value associated with a certain
  key, so I do not want to have the keys stored in memory.  Could I just
  hash() the url strings first and use the resulting integer as the key?
  I think what I'm after here is more like a tradition hash table.

 python dictionaries are traditional hash-tables.

  If I
  do it this way am I going to get the memory savings I am after?  Will
  the hash function always generate unique keys?  Also, would the same
  technique work for a set?
 
  Any other thoughts or considerations are appreciated.

 You could try and create a md5 sum of your strings and use that as key. It
 _should_ be good enough, but I'm no crypto expert so take that with a grain
 of salt.
 
 Diez

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Nick Vatamaniuc
Dictionaries are hash tables in Python.

If you don't really want to store the keys, just use
dic[hash(key)]=value, this way the dictionary will have the same shape
and distribution of keys as dic[key]=value because
hash('abc')=hash(hash('abc'))  but the long string of actual keys are
not referenced by the dictionary.

Now you couldn't do dic.keys() and see your urls, and every time you
want to do dic['abc'] you would get a KeyError exception.

Hope this helps,
Nick Vatamaniuc

[EMAIL PROTECTED] wrote:
 [EMAIL PROTECTED] wrote:
  Hello,
  I am using some very large dictionaries with keys that are long strings
  (urls).  For a large dictionary these keys start to take up a
  significant amount of memory.  I do not need access to these keys -- I
  only need to be able to retrieve the value associated with a certain
  key, so I do not want to have the keys stored in memory.  Could I just
  hash() the url strings first and use the resulting integer as the key?
  I think what I'm after here is more like a tradition hash table.  If I
  do it this way am I going to get the memory savings I am after?  Will
  the hash function always generate unique keys?  Also, would the same
  technique work for a set?
 

 I just realized that of course the hash is not always going to be
 unique, so this wouldn't really work.  And it seems a hash table would
 still need to store the keys (as strings) so that string comparisons
 can be done when a collision occurs.  I guess there's no avoiding
 storing they keys?

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Fredrik Lundh
Nick Vatamaniuc wrote:

 If you don't really want to store the keys, just use
 dic[hash(key)]=value, this way the dictionary will have the same shape
 and distribution of keys as dic[key]=value because
 hash('abc')=hash(hash('abc'))  but the long string of actual keys are
 not referenced by the dictionary.

how come you're so sure that there will never be any collisions ?

/F

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Tim Chase
 how come you're so sure that there will never be any collisions ?

because none of his strings want their insurance to go up...

:*)

-tkc



-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Nick Vatamaniuc
Fred,

I never said there will be no collisions. For clarity, can you quote
the exact phrase where I mentioned that?

To say that there will be no collision is nonsense because the # of
possible long url strings is certainly larger than 2^32, applying the
pigeon hole principle you could easily show that there will be
collisions.

The point is to make collision as unlikely as possible for the human
readable text (that is make them as uniform as possible), that is why
creating good cryptographic hashes is not easy.  Not that the Python
hash function work as hard as an MD5 (it is probably a multiplication
and a modulo if I had to guess). But this is a topic beyond scope of
this thread.

The original post just said that he wanted to index some values by
their urls and didn't really care about the urls themselves, so I
suggested that he just use the hash of the key as the key. The
dictionary will then take a hash of that [note that:
hash(string)=hash(hash(string)) ]  then  the dictionary will not keep
the reference to the urls and GC will collect them. So instead of
dic[urlstring']=value he will do dic[hash(urlstring)]=value. But then
to retrieve he will have to do old_value=dic[hash(urlstring]].

Hopefully this make my point more clear,
Nick V.

Fredrik Lundh wrote:
 Nick Vatamaniuc wrote:

  If you don't really want to store the keys, just use
  dic[hash(key)]=value, this way the dictionary will have the same shape
  and distribution of keys as dic[key]=value because
  hash('abc')=hash(hash('abc'))  but the long string of actual keys are
  not referenced by the dictionary.

 how come you're so sure that there will never be any collisions ?
 
 /F

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Fredrik Lundh
Nick Vatamaniuc wrote:

 I never said there will be no collisions. For clarity, can you quote
 the exact phrase where I mentioned that?

the text I quoted is only true if you can guarantee that there will be 
no collisions.

/F

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Diez B. Roggisch

Please don't top post. I had to fix that below.

 Any other thoughts or considerations are appreciated.
 You could try and create a md5 sum of your strings and use that as key. It
 _should_ be good enough, but I'm no crypto expert so take that with a grain
 of salt.

  It should be enough but it might be a little slower than hash(string).

hash alone won't be sufficient, as it is guaranteed to have to much 
collisions. The OP already figured that out, btw.

Diez
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Paul Rubin
[EMAIL PROTECTED] writes:
 do it this way am I going to get the memory savings I am after?  Will
 the hash function always generate unique keys?  Also, would the same
 technique work for a set?

A hash function that always generates unique hashes is called a
perfect hash.  They tend to be slow, and of course, to generate one,
you need to know all the keys in advance.

If you use a long cryptographic hash like md5 or sha, the chances of 
collision is very small.  That's probably your best bet.

In general, if your hash is N bits long, you will probably get
collisions once you have around 2**(N/2) keys.  sha is a 160
bit hash so getting collisions by accident is extremely unlikely.
-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread kdotsky
 depending on your application, a bloom filter might be a good enough:

  http://en.wikipedia.org/wiki/Bloom_filter


Thanks (everyone) for the comments.  I like the idea of the bloom
filter or using an md5 hash, since a rare collision will not be a
show-stopper in my case.

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Terry Hancock
Nick Vatamaniuc wrote:
  The original post just said that he wanted to index some values by
  their urls and didn't really care about the urls themselves, so I
  suggested that he just use the hash of the key as the key. The
  dictionary will then take a hash of that [note that:
  hash(string)=hash(hash(string)) ] then the dictionary will not keep
  the reference to the urls and GC will collect them. So instead of
  dic[urlstring']=value he will do dic[hash(urlstring)]=value. But then
  to retrieve he will have to do old_value=dic[hash(urlstring]].

Note that it is trivial to catch collisions on entry and correct them:

key = hash(url_string)
i  = 0
if d.has_key(key):
   while d.has_key(key):
   i += 1
   d[key + repr(i)] = val
else:
   d[key] = val

or something along those lines. No need to keep the original string.

Obviously this is only efficient if collisions are rare.
   
I am a little surprised that hash(hash(s)) == hash(s), is that actually
true?

Cheers,
Terry

-- 
Terry Hancock ([EMAIL PROTECTED])
Anansi Spaceworks http://www.AnansiSpaceworks.com

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Ganesan Rajagopal
 Terry == Terry Hancock [EMAIL PROTECTED] writes:

 Note that it is trivial to catch collisions on entry and correct them:

 key = hash(url_string)
 i  = 0
 if d.has_key(key):
while d.has_key(key):
i += 1

hash is a number. It's sufficient to do

while d.has_key(key):
key += 1
   
 I am a little surprised that hash(hash(s)) == hash(s), is that actually
 true?

 hash(42)
42

Ganesan

-- 
Ganesan Rajagopal

-- 
http://mail.python.org/mailman/listinfo/python-list


Re: don't need dictionary's keys - hash table?

2006-07-12 Thread Paul Rubin
Ganesan Rajagopal [EMAIL PROTECTED] writes:
 hash is a number. It's sufficient to do
 
 while d.has_key(key):
 key += 1

This is called linear probing and is not considered that great a
collision resolution strategy for hash tables.  Really, if you want an
exhaustive study about hashing, see Knuth vol 3.  If you're just
trying to get the application doing something reasonable, let the
database do its job.
-- 
http://mail.python.org/mailman/listinfo/python-list