Re: [HACKERS] [PATCH]-hash index improving

Dann Corbit Tue, 22 Jul 2008 20:47:01 -0700

> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:pgsql-hackers-
> [EMAIL PROTECTED] On Behalf Of Xiao Meng
> Sent: Tuesday, July 22, 2008 7:57 PM
> To: Simon Riggs
> Cc: [email protected]
> Subject: Re: [HACKERS] [PATCH]-hash index improving
> 
> Well, I'll do it after I finish my second patch.
> Hash index should be more efficient than btree when N is big enough.
> It seems meaningful to find how big N is in an experiment way.


The savings will depend on many factors.  Another thing (besides volume) which 
is important is the sort of key data being indexed.

Consider a unique key on six varchar(40) fields:

1.  Country
2.  State/Province
3.  City
4.  Division
5.  Branch
6.  Office

Typically, a key such as this will have lots of clumps of similar data, only 
being differentiated with the final segment.  This sort of index is often used 
for reporting purposes.  To determine a unique entry, it is not unlikely that 
more than 200 characters will be traversed.  A hash index gets a special boost 
here because a much smaller data signature is stored.  Even a 64 bit hash 
occupies only 8 bytes.  On the other hand, consider an index on a field 
consisting of a single character.  Here, the pages of the b-tree will have a 
huge volume of entries per page, requiring fewer pages to search, and the hash 
index is many times larger and hence more pages will have to be loaded.

These things make a hash index desirable:
1. Unique index
2. Long keys
3. Huge data cardinality
4. Equality search

These things make a hash index undesirable:
1.  Non-unique index
2.  Short keys
3.  Small data sets

These things render a hash index as worthless (except in COMBINATION with a 
b-tree type index):
1.  Need for range searches like BETWEEN
2.  Need for ORDER BY on the column(s)

As an aside:
I guess it will also be nice if you can CLUSTER both index and data values on 
the hash.  It may need a different algorithm than a b-tree clustering concept.  
I know that Rdb uses different kinds of storage areas for hashed indexes verses 
b-tree indexes.

This effort to create hashed indexes is very valuable.  Because it becomes more 
and more dominant as the data scales up, right at the point when things get 
tougher is when it becomes the most helpful.  If you have a tiny table, it does 
not even matter if you index it, because (for instance) 10 rows will probably 
always stay in memory and iteration will find what is needed instantly.  But if 
you have hundreds of millions of rows or billions of rows, now is when 
performance really matters.  So when the data scales to preposterous size 
(which it has an uncanny ability to do) the boost of performance becomes even 
more valuable.


-- 
Sent via pgsql-hackers mailing list ([email protected])
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers

Re: [HACKERS] [PATCH]-hash index improving

Reply via email to