I've been interested in what it would take to start tracking cross-column statistics. A review of the mailing lists as linked from the TODO item on the subject [1] suggests the following concerns:
1) What information exactly would be tracked? 2) How would it be kept from exploding in size? 3) For which combinations of columns would statistics be kept? The major concern in #1 seemed to be that the most suitable form for keeping most common value lists, histograms, etc. is in an array, and at the time of the posts I read, arrays of composite types weren't possible. This seems much less of a concern now -- perhaps in greatest part because a test I just did against a recent 8.4devel sure makes it look like stats on composite type columns aren't even kept. The most straightforward is that we'd keep a simple multi-dimensional histogram, but that leads to a discussion of #2. #2 is an interesting problem, and possibly the most difficult from a theoretical perspective. Provided a fair expectation that the results would be useful, I'd like to play with various forms of statistical regressions to compress large cross-column histograms. But I'm not sure I want to spend the time if there are other serious issues with the whole idea of cross-column stats. Another possibility involves not storing a histogram at all, but instead storing an array of quantiles for each column. In other words, if S represents the statistics target of a column, we'd have an array A of S+1 entries of the type in question, where the beginning and end of the array are the minimum and maximum values in the column, and the range between any two consecutive array elements A[n] and A[n+1] contains 1/S of the total entries in the column. For a uniformly distributed column, these array elements would be evenly spaced across the total range of the column; the difference between consecutive array elements would indicate deviations from this distribution. Unfortunately this only works if there's a valid distance metric for the data type in question. In the archives, #3 was raised frequently as a concern -- obviously tracking stats on all permutations of the columns in a database is a non-starter. However there seemed to be little argument over sensible defaults, namely that columns involved in a multi-column index would have cross-column statistics kept automatically, and the user would be allowed to instruct the server to keep stats on other arbitrary groups of columns. There was some discussion about the fact that the user would have to be smart about how (s)he used this feature, but there are enough other potential foot guns in the system of similar magnitude that worrying too much about that seems of little use. I bring all this up because I'm interested in looking at how this might be done, and perhaps doing it if 1) time permits, and 2) someone else doesn't beat me to it. Comments, anyone? Am I missing something obvious/serious/etc.? Thanks in advance. - Josh / eggyknap [1] "Allow accurate statistics to be collected on indexes with more than one column or expression indexes, perhaps using per-index statistics", http://wiki.postgresql.org/wiki/Todo -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers