> -----Original Message----- > From: [EMAIL PROTECTED] [mailto:pgsql-hackers- > [EMAIL PROTECTED] On Behalf Of Greg Stark > Sent: Tuesday, August 01, 2006 4:42 PM > To: Hannu Krosing > Cc: Andrew Dunstan; Gregory Stark; Tom Lane; Alvaro Herrera; Jim C. Nasby; > Luke Lonergan; [EMAIL PROTECTED]; Bruce Momjian; Jie Zhang; Gavin > Sherry; pgsql-hackers@postgresql.org > Subject: Re: [HACKERS] Hash indexes > > Hannu Krosing <[EMAIL PROTECTED]> writes: > > > Ühel kenal päeval, T, 2006-08-01 kell 10:54, kirjutas Andrew Dunstan: > > > Gregory Stark wrote: > > > > > > > > I looked a while back and was suspicious about the actual hash > functions too. > > > > It seemed like a lot of them were vastly suboptimal. That would mean > we're > > > > often dealing with mostly empty and mostly full buckets instead of > well > > > > distributed hash tables. > > > > > > > > > > > > > > > > > > This is now sounding like a lot of low hanging fruit ... highly > > > performant hash indexed tables could possibly be a very big win. > > > > > > > Are you sure about the badness of our hash functions ? > > > > I just tested and hashtext(text) has about 1.4% of collisions on about > > 120M distinct texts, which is not bad considering thet total space for > > hashes is 4G, meaning that 120M covers itself already 3% of possible > > hash space. > > I don't recall exactly what triggered my suspicions, but I think it had > more > to do with things like how int4 and int8 were mapped to hash codes and > what > the hash function looks like that compresses the hash codes into the > actual > bucket. IIRC it's just the low order bits for int8 and it's the actual > int4 > which then only takes the low order bits. I seem to recall wondering what > would happen if you had, say, a list of /24 network addresses for example. > > I didn't actually do any tests though so it's quite possible I missed > something that mitigated the issues I feared.
There is a program called QDBM: http://qdbm.sourceforge.net/ That has very effective disk based hashing. A clever extension that he uses is to use btrees to chain hash collisions (very similar to an idea I like to use from time to time which is using a hash table of skiplists). The project is LGPL but it is possible that Mikio Hirabayashi could be talked into letting the on disk hash table portion be made available with a Berkeley style license. As for effectiveness of hash functions, there are test programs on this page: http://www.burtleburtle.net/bob/hash/index.html and this hash library: http://eternallyconfuzzled.com/libs/jsw_hlib.html has statistics built into it (you can easily add and test your own hash functions using the provided API). > -- > greg > > > ---------------------------(end of broadcast)--------------------------- > TIP 9: In versions below 8.0, the planner will ignore your desire to > choose an index scan if your joining column's datatypes do not > match ---------------------------(end of broadcast)--------------------------- TIP 9: In versions below 8.0, the planner will ignore your desire to choose an index scan if your joining column's datatypes do not match