> [EMAIL PROTECTED] writes: > >> The basic problem with a fixed sample is that is assumes a normal >> distribution. > > That's sort of true, but not in the way you think it is. > [snip]
Greg, I think you have an excellent ability to articulate stats, but I think that the view that this is like election polling is incorrect. Election polling assumes a very simple outcome: Some standard ditribution of a limited number options. I don't think it applies to this. > >> When you look at a sine wave on an oscilloscope, you can see it clear as >> day. When you look at music on the scope, you know there are many waves >> there, but it is difficult to make heads or tails of it. (use xmms or >> winamp to see for yourself) The waves change in frequency, amplitude, >> and >> duration over a very large scale. That's why you use a spectrum analyzer >> to go from time domain to frequency domain. In frequency domain, you can >> see the trends better. > > That's not a bad analogy to many problems where you're measuring data that > has > non-randomness in it but that are not visible in the domain that the > statistics that are being analyzed. This seems to happen a lot with > geographic > data, for instance. EXACTLY!!! > > If you find that increasing the stats targets improves things then this > isn't true. If you find that it doesn't then what's really needed is a > cleverer set of statistics to look for. I will be the first one to say that increasing the samples is not perfect, but it is a methodology that will help without major changes in postgres. Simply increasing the samples to a percentage of the estimated number of rows (with some upper and lower limits of course) will increase the accuracy of the "n_distinct" and "correlation" settings (at least a little bit), and that will make a huge impact with very little work. If we want to discuss improved statatistics, then we should include a standard deviation and a sliding window deviation, or something like that. Hell, maybe even FFT. The basic problem, I think, is that the sampling mechanism is more like an oscilloscope looking for large trends instead of a spectrum analyzer looking for the smaller ones. We have to be able to tell the planner that adjacent values are less random even though, as a whole, they are seemingly random. ---------------------------(end of broadcast)--------------------------- TIP 3: if posting/reading through Usenet, please send an appropriate subscribe-nomail command to [EMAIL PROTECTED] so that your message can get through to the mailing list cleanly