> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
> [EMAIL PROTECTED] On Behalf Of Greg Stark
> Sent: Tuesday, August 01, 2006 4:42 PM
> To: Hannu Krosing
> Cc: Andrew Dunstan; Gregory Stark; Tom Lane; Alvaro Herrera; Jim C. Nasby;
> Luke Lonergan; [EMAIL PROTECTED]; Bruce Momjian; Jie Zhang; Gavin
> Sherry; pgsql-hackers@postgresql.org
> Subject: Re: [HACKERS] Hash indexes
> 
> Hannu Krosing <[EMAIL PROTECTED]> writes:
> 
> > Ühel kenal päeval, T, 2006-08-01 kell 10:54, kirjutas Andrew Dunstan:
> > > Gregory Stark wrote:
> > > >
> > > > I looked a while back and was suspicious about the actual hash
> functions too.
> > > > It seemed like a lot of them were vastly suboptimal. That would mean
> we're
> > > > often dealing with mostly empty and mostly full buckets instead of
> well
> > > > distributed hash tables.
> > > >
> > > >
> > > >
> > >
> > > This is now sounding like a lot of low hanging fruit ... highly
> > > performant hash indexed tables could possibly be a very big win.
> > >
> >
> > Are you sure about the badness of our hash functions ?
> >
> > I just tested and hashtext(text) has about 1.4% of collisions on about
> > 120M distinct texts, which is not bad considering thet total space for
> > hashes is 4G, meaning that 120M covers itself already 3% of possible
> > hash space.
> 
> I don't recall exactly what triggered my suspicions, but I think it had
> more
> to do with things like how int4 and int8 were mapped to hash codes and
> what
> the hash function looks like that compresses the hash codes into the
> actual
> bucket. IIRC it's just the low order bits for int8 and it's the actual
> int4
> which then only takes the low order bits. I seem to recall wondering what
> would happen if you had, say, a list of /24 network addresses for example.
> 
> I didn't actually do any tests though so it's quite possible I missed
> something that mitigated the issues I feared.

There is a program called QDBM:
http://qdbm.sourceforge.net/

That has very effective disk based hashing.  A clever extension that he uses is 
to use btrees to chain hash collisions (very similar to an idea I like to use 
from time to time which is using a hash table of skiplists).  The project is 
LGPL but it is possible that Mikio Hirabayashi could be talked into letting the 
on disk hash table portion be made available with a Berkeley style license.

As for effectiveness of hash functions, there are test programs on this page:
http://www.burtleburtle.net/bob/hash/index.html

and this hash library:
http://eternallyconfuzzled.com/libs/jsw_hlib.html
has statistics built into it (you can easily add and test your own hash 
functions using the provided API).

> --
> greg
> 
> 
> ---------------------------(end of broadcast)---------------------------
> TIP 9: In versions below 8.0, the planner will ignore your desire to
>        choose an index scan if your joining column's datatypes do not
>        match

---------------------------(end of broadcast)---------------------------
TIP 9: In versions below 8.0, the planner will ignore your desire to
       choose an index scan if your joining column's datatypes do not
       match

Reply via email to