On Mon, Mar 19, 2007 at 06:55:56PM -0700, Jeff Davis wrote: > On Mon, 2007-03-19 at 21:24 -0400, Tom Lane wrote: > > Jeff Davis <[EMAIL PROTECTED]> writes: > > > We can already keep stats for a functional index. Is there a reason we > > > can't keep stats for a multi-column index? > > > > The questions that need to be answered are (1) what stats are you gonna > > collect, and (2) exactly what are you going to do with them when you > > have 'em? > > > > All the previous discussions have stalled on the question of how to > > avoid trying to collect stats about an exponentially large number of > > column combinations; we've never even reached the question of what > > stats we'd actually want given that a particular combination has been > > determined to be interesting. Perhaps that's a trivial question, > > but it's been a mighty long time since I took statistics ... > > > > I know we can't keep stats on every combination of columns. My initial > idea would be to only keep stats about a multi-column index (and > probably optional for those, too). > > My thinking was that we could keep a histogram (and MCVs, etc.) of the > non-scalar key in the multi-column index. That would provide the data > the planner needs to answer a query like "WHERE a = 1 and b < 1000" if a > and b are dependent and you have an index on (a,b). <snip> > AndrewSN pointed out on IRC that keeping a histogram of non-scalar > values is not as easy as I thought, because PostgreSQL doesn't allow > arrays of composite types, among other problems.
I don't think the array problem is that big a deal, since PostgreSQL doesn't enforce array dimensions at all. You can just make the arrays for multi-column stats 2 dimensional, though handling indexes with different data types among the columns would be a bit tricky... right now the only choice I can think of would be to require that values could be cast to and from text and just store text in the array. Though obviously it'd be better to just allow arrays of composite types... The other challenge is that you can't make all the same assumptions with a multi-field histogram that you can with a single-field one. For example, if this is our index: a b - - 1 1 1 2 ... 1 1000 2 500 2 501 ... 3 5000 The histogram would likely position the buckets such that 1,1000 and 2,500 would fall within one bucket, which means the planner has no idea that b doesn't exceed 1000 when a is 1. I'm not sure how big of an issue that is in reality, though, because the planner does know that the bucket can only represent so many rows. It might be worth coming up with a different means to store the histogram for the multi-column case. > Is this a worthwhile area of exploration? ISTM it trips people up often enough to make it worth at least exploring... -- Jim Nasby [EMAIL PROTECTED] EnterpriseDB http://enterprisedb.com 512.569.9461 (cell) ---------------------------(end of broadcast)--------------------------- TIP 7: You can help support the PostgreSQL project by donating at http://www.postgresql.org/about/donate