On Mon, Sep 03, 2007 at 10:33:54AM +0100, Simon Riggs wrote:
> >
> > This is the rough plan. Does anyone see anything critical that
> > is missing at this point? Please send me any suggestions for test
> > data and various performance test ideas, since I will be working
> > on that first.
>
> Sounds good.
>
> I'd be particularly interested in large indexes, say ~0.5 - 2GB. There
> are likely to be various effects apparent as the indexes grow. It would
> be too easy to do all the tests with smaller indexes and miss things.
>
> Other factors are:
> - volatility
> - concurrency
>
> My general experience is that hash-based indexes are better when the
> range of inputs is relatively well-known, allowing a fast lookup. If
> that is the only benefit of hash indexes, a flexible hashing scheme may
> simply weaken the benefit-case for using them. If that's true, should
> the index build process examine the key values in the data to determine
> the best parameters to use? Kind of an ANALYZE before build.
>
> My current feeling is that they ought to be very good at handling
> read-mostly situations such as privilege checking, or UPDATE-intensive
> situations such as Customer-Current-Credit tracking when the number of
> customers is large.
>
> It might also be worth looking at lossy hash indexes, i.e. the index
> stores only the block numbers. That would need to be part of the
> discussion around how lossy we will allow indexes to be.
>
> We currently have two kinds of full text index with different
> concurrency use cases, so it should be acceptable to have hash indexes
> show a clear benefit in one use case but a clear loss in another.
>
> --
> Simon Riggs
> 2ndQuadrant  http://www.2ndQuadrant.com
>
Simon,

Thank you for your input. I would like to include some tests with large
indexes too. Do you have any ideas for a test corpus, or should we try
to generate the test data programmatically? Many people in the text
retrieval literature use the TREC* data for at least some of their
runs. I am going to check at work to see whether the campus has access
to that data; otherwise I will do some web crawling to generate sample
data.

I have just posted a reply to Tom Lane with some further ideas for
consideration in the new hash index support. Like you, I suspect that
volatile data resulting in many index changes may not work well with
hash indexes in general. PostgreSQL has the additional burden of
needing to access both the index and the data heap, and obviously the
less I/O needed, the better the performance is likely to be. The new
HOT functionality, plus clustering the table data on the hash index,
would effectively organize the table into the "hash buckets", which
could help reduce churn in both the index and the tables.

Regards,
Ken

---------------------------(end of broadcast)---------------------------
TIP 7: You can help support the PostgreSQL project by donating at

                http://www.postgresql.org/about/donate
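[Editor's note: as a hedged sketch of the "generate the test data
programmatically" option discussed above (my illustration, not part of
the thread; the Zipf-style skew, key length, and function name are all
assumptions), one could produce COPY-loadable text keys with a skewed
repeat distribution, modelling the "well-known range of inputs" case
Simon describes:]

```python
import random
import string

def generate_test_keys(n, key_len=16, zipf_s=1.3, distinct=None, seed=42):
    """Generate n text keys for hash-index load tests.

    The Zipf-like skew is an assumption for illustration: a small
    vocabulary of distinct keys, where low-rank keys repeat far more
    often than high-rank ones, so lookups hit a well-known input range.
    """
    rng = random.Random(seed)          # fixed seed => reproducible runs
    distinct = distinct or max(1, n // 10)
    # Build the vocabulary of distinct random lowercase keys.
    vocab = [''.join(rng.choices(string.ascii_lowercase, k=key_len))
             for _ in range(distinct)]
    # Zipf-like weights: the key of rank r gets weight 1 / r**zipf_s.
    weights = [1.0 / (r ** zipf_s) for r in range(1, distinct + 1)]
    return rng.choices(vocab, weights=weights, k=n)

# Emit tab-separated rows suitable for loading with COPY.
for i, key in enumerate(generate_test_keys(100)):
    print(f"{i}\t{key}")
```

The emitted rows could then be bulk-loaded into a test table before
building the hash index; varying `distinct` and `zipf_s` would give one
axis for the volatility experiments mentioned above.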