Re: a note on multiword hashed tokens and collisions (fwd)

Michael Parker 21 Apr 2004 00:27:22 -0000

On Wed, Apr 21, 2004 at 10:15:13AM +1200, Sidney Markowitz wrote:
> Michael Parker wrote:
> >Question is, is using that value gonna work in the long run
> >for dbs with 3-4 million tokens?
> 
> substr(sha1($token), -5) and CHAR(5) is good to about 2 million using my 
> no-brainer criteria of expecting no collisions at all.
>


[ Snip Sidney's excellent analysis ]

Ahhh...ok it's much clearer now, thanks.
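
For anyone following along, here is a minimal sketch of the truncation being
discussed, written in Python rather than the Perl quoted above. It assumes
sha1() returns the raw 20-byte digest (as Digest::SHA1's sha1 does), so the
last 5 bytes are the "40 bit version" and fit a CHAR(5) column. The name
token_key is made up for illustration.

```python
import hashlib


def token_key(token: str, nbytes: int = 5) -> bytes:
    """Return the last `nbytes` bytes of the raw SHA-1 digest of `token`.

    Mirrors substr(sha1($token), -5), assuming sha1() yields the binary
    digest and not the 40-character hex string.
    """
    digest = hashlib.sha1(token.encode("utf-8")).digest()  # 20 raw bytes
    return digest[-nbytes:]


if __name__ == "__main__":
    # 5 bytes = 40 bits (CHAR(5)); 6 bytes = 48 bits (CHAR(6)).
    print(token_key("viagra").hex())
    print(token_key("viagra", 6).hex())
```

Since both keys are cut from the tail of the same digest, the 6-byte key
always ends with the 5-byte key.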

> 
> substr(sha1($token), -6) and CHAR(6) is no-brainer good to about 32 
> million tokens.
> 
> Can you try your benchmark with that? I expect that it would get you the 
> same performance as the 40 bit version and still enough reduction in db 
> size to be worthwhile.
> 

It should only really affect size (~5% greater than the CHAR(5)
version), so I didn't do a full benchmark.

> But I think that my numbers show that 40-bit should be ok at 4 million 
> and certainly at 3 million.

I think the overlap of a few tokens is fine and I'm gonna move forward
with the CHAR(5) version.
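
To put a rough number on "a few tokens", here is a sketch using the standard
birthday approximation. This is not necessarily the criterion Sidney used in
the snipped analysis, and it assumes the truncated SHA1 values behave like
uniformly random keys. The function names are made up for illustration.

```python
import math


def expected_colliding_pairs(n_tokens: int, bits: int) -> float:
    """Expected number of token pairs that share the same truncated hash.

    There are n*(n-1)/2 pairs, and each pair collides with probability
    1 / 2**bits when the keys are uniformly distributed.
    """
    return n_tokens * (n_tokens - 1) / 2 / 2 ** bits


def prob_any_collision(n_tokens: int, bits: int) -> float:
    """Poisson approximation to the chance of at least one collision."""
    return -math.expm1(-expected_colliding_pairs(n_tokens, bits))


if __name__ == "__main__":
    for n in (2_000_000, 3_000_000, 4_000_000):
        for bits in (40, 48):
            pairs = expected_colliding_pairs(n, bits)
            print(f"{n:>9} tokens, {bits} bits: {pairs:.3f} expected pairs")
```

Under those assumptions, a 40-bit key gives roughly 4 colliding pairs at
3 million tokens and roughly 7 at 4 million, while a 48-bit key gives about
0.03 at 4 million.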

For SQL we gain a 40% space savings over the previous method (this
includes switching to int userid instead of username) using MyISAM
tables in MySQL.

Michael
