There are several reasons why Postgres' hash indexes currently suck, but one of the bigger ones is that the time to build an index on a large existing table is excessive, e.g. http://archives.postgresql.org/pgsql-novice/2007-03/msg00064.php
I'm not sure if this has been discussed before, but I suddenly realized while responding to the above message that the reason for the awful performance is pretty obvious: hashbuild starts with a minimum-size index (two buckets) and repeatedly splits buckets as insertions are done, exactly the same as ordinary dynamic growth of the index would do. This means that for an N-row table, approximately N/entries-per-bucket splits need to occur during index build, which results in roughly O(N^2) performance because we have to reprocess already-inserted entries over and over.

This explains the empirical observation I made a long time ago:
http://archives.postgresql.org/pgsql-hackers/2002-04/msg01379.php

This could be fixed with a relatively small amount of new code: when beginning hashbuild, estimate the parent table's rowcount (the same method used by the planner will do fine, viz. RelationGetNumberOfBlocks times an estimated tuple density) and construct the appropriate number of buckets immediately. No splits, just write out empty pages as fast as we can. *Then* do the insertions.

Comments?

regards, tom lane
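P.S. To make the arithmetic concrete, here is a minimal standalone sketch of the pre-sizing computation. Everything in it is illustrative: the helper names (estimate_num_tuples, initial_num_buckets), the entries-per-bucket target, and the density and block-count figures are assumptions, not the actual hash AM code. The real version would get its block count from RelationGetNumberOfBlocks and its density from the planner's estimation logic, as described above.

#include <stdio.h>
#include <stdint.h>

#define ENTRIES_PER_BUCKET 64  /* assumed target fill per bucket, not the real ffactor */

/* Planner-style row estimate: number of blocks times assumed tuple density. */
static uint64_t
estimate_num_tuples(uint64_t rel_blocks, double tuples_per_block)
{
    return (uint64_t) (rel_blocks * tuples_per_block);
}

/* Smallest power of 2 >= the bucket count needed to hold ntuples. */
static uint64_t
initial_num_buckets(uint64_t ntuples)
{
    uint64_t needed = (ntuples + ENTRIES_PER_BUCKET - 1) / ENTRIES_PER_BUCKET;
    uint64_t nbuckets = 2;      /* the current code always starts here */

    while (nbuckets < needed)
        nbuckets <<= 1;
    return nbuckets;
}

int
main(void)
{
    uint64_t blocks  = 100000;  /* assumed heap size in blocks (~800MB at 8K) */
    double   density = 50.0;    /* assumed tuples per block */

    uint64_t ntuples  = estimate_num_tuples(blocks, density);
    uint64_t nbuckets = initial_num_buckets(ntuples);

    /*
     * Each split adds exactly one bucket, so growing from 2 buckets to
     * nbuckets takes nbuckets - 2 splits, every one of which rereads and
     * rewrites entries that were already inserted.  Pre-sizing performs
     * zero splits: just write nbuckets empty pages, then insert.
     */
    printf("estimated tuples: %llu\n", (unsigned long long) ntuples);
    printf("initial buckets:  %llu (instead of 2)\n", (unsigned long long) nbuckets);
    printf("splits avoided:   %llu\n", (unsigned long long) (nbuckets - 2));
    return 0;
}

With these made-up numbers the table is estimated at 5 million tuples, which rounds up to 131072 buckets; the current code would reach that size through 131070 incremental splits, while the pre-sized build performs none.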