On Sun, 2007-09-02 at 13:04 -0500, Kenneth Marshall wrote:
> Dear PostgreSQL Hackers:
>
> After following the hackers mailing list for quite a while,
> I am going to start investigating what will need to be done
> to improve hash index performance. Below are the pieces of
> this project that I am currently considering:
>
> 1. Characterize the current hash index implementation against
> the BTree index, with a focus on space utilization and
> lookup performance against a collection of test data. This
> will give a baseline performance test to evaluate the impact
> of changes. I initially do not plan to bench the hash creation
> process since my initial focus will be on lookup performance.
>
> 2. Evaluate the performance of different hash index implementations
> and/or changes to the current implementation. My current plan is
> to keep the implementation as simple as possible and still provide
> the desired performance. Several hash index suggestions deal with
> changing the layout of the keys on a page to improve lookup
> performance, including reducing the bucket size to a fraction of
> a page or only storing the hash value on the page, instead of
> the index value itself. My goal in this phase is to produce one
> or more versions with better performance than the current BTree.
>
> 3. Look at build time and concurrency issues with the addition of
> some additional tests to the test bed. (1)
>
> 4. Repeat as needed.
>
> This is the rough plan. Does anyone see anything critical that
> is missing at this point? Please send me any suggestions for test
> data and various performance test ideas, since I will be working
> on that first.
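For the baseline comparison in (1), something along these lines could
serve as a starting point. This is only a sketch: the table name, row
count, and uniformly distributed integer keys are illustrative
assumptions, and lookups should be timed with only one of the two
indexes in place, since with both present the planner will pick just one.

    -- Test data: one table, one candidate key column.
    CREATE TABLE hash_test AS
        SELECT g AS id, md5(g::text) AS payload
        FROM generate_series(1, 10000000) AS g;

    CREATE INDEX hash_test_btree ON hash_test USING btree (id);
    CREATE INDEX hash_test_hash  ON hash_test USING hash (id);

    -- Space utilization of the two indexes.
    SELECT relname, pg_size_pretty(pg_relation_size(oid))
    FROM pg_class
    WHERE relname IN ('hash_test_btree', 'hash_test_hash');

    -- Equality-lookup performance (repeat with each index alone).
    EXPLAIN ANALYZE SELECT * FROM hash_test WHERE id = 1234567;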
Sounds good. I'd be particularly interested in large indexes, say
~0.5-2GB. There are likely to be various effects that only become
apparent as the indexes grow; it would be too easy to do all the tests
with smaller indexes and miss things. Other factors are:

- volatility
- concurrency

My general experience is that hash-based indexes are better when the
range of inputs is relatively well known, allowing a fast lookup. If
that is the only benefit of hash indexes, a flexible hashing scheme may
simply weaken the case for using them. If that's true, should the index
build process examine the key values in the data to determine the best
parameters to use? A kind of ANALYZE before build.

My current feeling is that they ought to be very good at handling
read-mostly situations such as privilege checking, or UPDATE-intensive
situations such as Customer-Current-Credit tracking when the number of
customers is large.

It might also be worth looking at lossy hash indexes, i.e. indexes that
store only block numbers. That would need to be part of the discussion
around how lossy we will allow indexes to be. We currently have two
kinds of full-text index with different concurrency use cases, so it
should be acceptable for hash indexes to have a clear benefit in one
use case and a clear loss in another.

--
 Simon Riggs
 2ndQuadrant  http://www.2ndQuadrant.com
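A rough sketch of the large, UPDATE-intensive case mentioned above
(Customer-Current-Credit tracking), assuming generated integer keys;
the table name, row count, and random-key workload are illustrative
only, and the row count would need adjusting until the index falls in
the suggested 0.5-2GB range:

    CREATE TABLE customer_credit (
        customer_id integer,
        credit      numeric DEFAULT 0
    );

    INSERT INTO customer_credit (customer_id)
        SELECT g FROM generate_series(1, 50000000) AS g;

    CREATE INDEX customer_credit_hash
        ON customer_credit USING hash (customer_id);

    -- Check that the index has reached the target size range.
    SELECT pg_size_pretty(pg_relation_size('customer_credit_hash'));

    -- One step of the workload: look up one customer by key and update it.
    -- Run many of these concurrently from several sessions and compare
    -- throughput against a btree index on the same column.
    UPDATE customer_credit
       SET credit = credit + 1
     WHERE customer_id = 1 + (random() * 49999999)::integer;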