A couple of us using the US Census TIGER database have noticed something about the way ANALYZE gathers statistics. If you follow the thread "Query Optimizer 8.0.1" you'll see the progression of the debate.
To summarize what I think we've seen: the current implementation of ANALYZE is designed around sampling a table to characterize the basic trend of the data. The problem with that approach is that it assumes the data has a single trend. Greg Stark's "Cross column statistics" post touches on the general problem.

The best analogy so far is the difference between an oscilloscope and a spectrum analyzer. The current statistics gathering is like a sampling oscilloscope trying to display a single waveform. Some data is more like an audio signal, where many smaller trends are buried in a seemingly random stream; with a spectrum analyzer you can see the individual components. The visualizations in Winamp or XMMS give the idea.

Let's assume the data is in a multiple-column sort order: a set of street addresses sorted by (state, streetname, streettype, address):

    MA, ABBOT, RD, 100
    MA, ABBOT, RD, 200
    MA, ABBOT, RD, 300
    MA, ABBOT, ST, 100
    MA, ABBOT, ST, 200
    MA, MAPLE, RD, 100
    MA, MAPLE, RD, 200
    MA, MAPLE, ST, 100
    ...
    WY, ABBOT, RD, 100
    ...

This table has MILLIONS of rows: every single address in the country. The "trend" of state is clearly an increasing step ramp over the entire table. The trend of streetname can be pictured as a series of ramps, one per state. The trend of streettype, similarly, is a series of ramps per street name, and the waveform for address is a ramp per streettype.

The statistics PostgreSQL currently employs will work great for "state," but much less well for "streetname." A query like "select * from addresses where streetname = 'ABBOT'" will be estimated as more expensive than it really is. Most of the ABBOTs sit together in about 50 clusters (one per state, assuming every state has at least one "ABBOT"), but the current stats are not designed to detect this. Yes, if the sub-trends are small enough, index scans eventually do become more expensive than table scans, but the current stats can't tell where that point is. Clearly it is not at the secondary-sort ("streetname") level.

I've found that increasing the sample size in analyze.c can help in specific cases, but the overall problem remains.

The questions are: Is this really a problem? If so, what can we do? I was thinking of trying to compute a sliding-window standard deviation, which should be able to detect the smaller trends within the table; this would require a lot of work in analyze.c. If the sliding-window deviation is low, the correlation of the table should be increased, telling the planner that an index scan is a better choice. The actual math behind the values has to be worked out, of course, but what do you think of the idea?
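To make the proposal a little more concrete, here is a minimal sketch of the sliding-window computation, written as a standalone C program rather than as a patch to analyze.c. Everything in it is an assumption of mine: the names (local_clustering, stddev, WINDOW_SIZE) are invented, the window width is an arbitrary starting point, and I'm pretending the sampled column values have already been converted to doubles in physical row order. The real version would have to work on the sampled heap tuples and feed its result into the correlation statistic somehow.

    /* sliding_stddev.c -- toy model of the proposed statistic.
     * All names here are hypothetical, nothing from the backend.
     * Build with: cc sliding_stddev.c -lm
     */
    #include <math.h>
    #include <stdio.h>
    #include <stdlib.h>

    #define WINDOW_SIZE 32      /* window width; how to choose this is open */

    /* Plain sample standard deviation of v[0..n-1]. */
    static double
    stddev(const double *v, int n)
    {
        double  sum = 0.0, sumsq = 0.0;
        int     i;

        for (i = 0; i < n; i++)
        {
            sum += v[i];
            sumsq += v[i] * v[i];
        }
        return sqrt((sumsq - sum * sum / n) / (n - 1));
    }

    /*
     * Average standard deviation over every WINDOW_SIZE-wide window of
     * physically consecutive values, divided by the whole-sample standard
     * deviation.  Near 0 means the column is locally clustered even though
     * its table-wide statistics look scattered; near 1 means genuinely
     * scattered.
     */
    static double
    local_clustering(const double *v, int n)
    {
        double  global = stddev(v, n);
        double  sum = 0.0, sumsq = 0.0, local_sum = 0.0;
        int     nwindows = 0;
        int     i;

        if (n <= WINDOW_SIZE || global <= 0.0)
            return 1.0;         /* degenerate case: change nothing */

        /* Prime the running sums with the first window... */
        for (i = 0; i < WINDOW_SIZE; i++)
        {
            sum += v[i];
            sumsq += v[i] * v[i];
        }

        /* ...then slide it one row at a time with incremental updates. */
        for (i = WINDOW_SIZE; i <= n; i++)
        {
            double  var = (sumsq - sum * sum / WINDOW_SIZE) / (WINDOW_SIZE - 1);

            local_sum += sqrt(var > 0.0 ? var : 0.0);   /* rounding guard */
            nwindows++;
            if (i < n)
            {
                sum += v[i] - v[i - WINDOW_SIZE];
                sumsq += v[i] * v[i] - v[i - WINDOW_SIZE] * v[i - WINDOW_SIZE];
            }
        }
        return (local_sum / nwindows) / global;
    }

    int
    main(void)
    {
        /* Fake "streetname" column: 50 states, each a ramp of 1000 names. */
        int     n = 50 * 1000;
        double *v = malloc(n * sizeof(double));
        int     i;

        for (i = 0; i < n; i++)
            v[i] = (double) (i % 1000);

        /* Comes out well under 0.1: clustered despite the 50 restarts. */
        printf("clustering ratio: %.3f\n", local_clustering(v, n));
        free(v);
        return 0;
    }

On the streetname example above, the global deviation is large (the values span the whole alphabet) while each window's deviation is tiny (physically neighboring rows share a state and mostly a street), so the ratio comes out near zero; that is exactly the signal the planner is missing today. Windows that straddle a state boundary see a big jump, but there are only ~50 of those clusters, so they barely move the average.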