On Wed, Apr 21, 2004 at 10:15:13AM +1200, Sidney Markowitz wrote:
> Michael Parker wrote:
> >Question is, is using that value gonna work in the long run
> >for dbs with 3-4 million tokens?
>
> substr(sha1($token), -5) and CHAR(5) is good to about 2 million using my
> no-brainer criteria of expecting no collisions at all.
>

[ Snip Sidney's excellent analysis ]

Ahhh...ok it's much clearer now, thanks.

>
> substr(sha1($token), -6) and CHAR(6) is no-brainer good to about 32
> million tokens.
>
> Can you try your benchmark with that? I expect that it would get you the
> same performance as the 40 bit version and still enough reduction in db
> size to be worthwhile.
>

It should only really affect size (~5% greater than the CHAR(5) version),
so I didn't do a full benchmark.

> But I think that my numbers show that 40-bit should be ok at 4 million
> and certainly at 3 million.

I think the overlap of a few tokens is fine and I'm gonna move forward
with the CHAR(5) version. For SQL we gain a 40% space savings over the
previous method (this includes switching to int userid instead of
username) using MyISAM tables in MySQL.

Michael
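
For anyone who wants to sanity-check the collision figures discussed above, here is a rough sketch in Python. It is not the code from the thread. It assumes that sha1() in the Perl expression returns the raw 20-byte digest (which is what makes 5 characters equal 40 bits), and it uses the standard birthday approximation for the expected number of colliding pairs. The names token_key and expected_collisions are made up for illustration.

```python
import hashlib


def token_key(token: bytes, nbytes: int = 5) -> bytes:
    """Last nbytes of the raw SHA-1 digest.

    Assumed to mirror substr(sha1($token), -5) when sha1() returns
    the 20-byte binary digest rather than a hex string.
    """
    return hashlib.sha1(token).digest()[-nbytes:]


def expected_collisions(n_tokens: int, nbytes: int) -> float:
    """Birthday approximation: expected colliding pairs is about
    n * (n - 1) / 2 divided by the number of possible keys."""
    keyspace = 2 ** (8 * nbytes)
    return n_tokens * (n_tokens - 1) / (2 * keyspace)


# Compare the CHAR(5) and CHAR(6) variants at the token counts mentioned.
for nbytes in (5, 6):
    for n in (2_000_000, 4_000_000, 32_000_000):
        pairs = expected_collisions(n, nbytes)
        print(f"CHAR({nbytes}), {n:>10,} tokens: ~{pairs:.2f} expected colliding pairs")
```

Under this approximation the CHAR(5) key gives roughly 7 expected colliding pairs at 4 million tokens, which fits the "overlap of a few tokens" described above.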
