On Sep 28, 2006, at 8:51 PM, Tom Lane wrote:
[..]
> The information we've seen says that the only statistically reliable way
> to arrive at an accurate n_distinct estimate is to examine most of the
> table :-(. Which seems infeasible for extremely large tables, which is
> exactly where the problem is worst.  Marginal increases in the sample
> size seem unlikely to help much ... as indeed your experiment shows.

I think a first step might be to introduce a new analyze command, such as ANALYZE FULL. This would be executed deliberately (IOW not by autovacuum), like CLUSTER or VACUUM FULL, when deemed necessary by the DBA. As the name implies, the command would scan the entire table and fill in the stats based on that (as ANALYZE used to do, IIRC).

It would also be useful if this command froze the stats so that autovacuum didn't clobber them with inaccurate ones shortly thereafter. Perhaps an explicit ANALYZE FULL FREEZE command would cover that case: a normal ANALYZE would not overwrite the stats of a stats-frozen table, but another ANALYZE FULL would. Such a frozen state would also be useful if you wanted to hand-tweak the stats for a single table, have the tweak stick, and still use autovac. As I understand it now, with autovac on, you cannot do that unless you hack the pg_autovacuum table (i.e., set anl_base_thresh to an artificially high value).
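
For concreteness, the workaround today and the proposed commands might look something like this (the table name is made up, the UPDATE assumes a pg_autovacuum row already exists for the table, and the FULL/FREEZE syntax is of course only a proposal):

    -- today's workaround: keep autovacuum from re-analyzing the table by
    -- giving it an absurdly high analyze threshold
    UPDATE pg_autovacuum
       SET anl_base_thresh = 2000000000
     WHERE vacrelid = 'bigtable'::regclass;

    -- proposed (hypothetical) commands
    ANALYZE FULL bigtable;          -- scan the whole table for stats
    ANALYZE FULL FREEZE bigtable;   -- same, but also mark the stats frozen so
                                    -- a plain ANALYZE won't overwrite them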

Another option (which I think others have suggested) would be to make this the behavior of VACUUM ANALYZE. That at least saves the baggage of a new command, and the autovac daemon could run it. Perhaps some smarts could also be built in: what if VACUUM ANALYZE first ran a normal (sampled) ANALYZE, then performed the VACUUM with a full ANALYZE pass, and compared the stats gathered by the full pass to those of the sampled pass? If the full ANALYZE statistics differed sufficiently from the sampled ones, the table would be flagged so that the autovac daemon no longer runs a normal ANALYZE on it. A global ANALYZE could also skip it (though this seems more magical).
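
To get a feel for how badly the sampled estimate misses on a particular column, a DBA can already do a crude version of that comparison by hand (table and column names below are made up; note that a negative n_distinct in pg_stats means a fraction of the row count):

    ANALYZE bigtable;

    SELECT s.n_distinct,
           CASE WHEN s.n_distinct < 0
                THEN -s.n_distinct * c.reltuples    -- ratio form
                ELSE s.n_distinct END               AS sampled_estimate,
           (SELECT count(DISTINCT some_col)
              FROM bigtable)                        AS full_scan_count
      FROM pg_stats s, pg_class c
     WHERE s.tablename = 'bigtable'
       AND s.attname   = 'some_col'
       AND c.relname   = 'bigtable';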

A more pie-in-the-sky idea could take advantage of the fact that the larger a table is, the less likely its statistics are to change much over time. If we cannot afford to sample many rows in a given analyze pass, then perhaps we could take a "Newton's method" approach and converge on an accurate value over time, with each analyze pass contributing more samples to the statistics and honing them incrementally rather than simply replacing the old ones.

I'm not a statistician, so it's not clear to me how much more state you would need to keep between analyze passes to make this viable, but for it to work the following would need to be true:

1) ANALYZE would need to be run on a regular basis (luckily we have autovacuum to help). However, you would want to analyze the table periodically even if not much had changed; perhaps tuning the autovac parameters is enough here.

2) Each analyze pass would need to sample randomly so that multiple passes tend to sample different rows.

3) The stats would need to be cumulative somehow. Perhaps this means storing sample values between passes, or some other statistical voodoo (a rough sketch of the stored-samples idea follows below).

4) It would need to be smart enough to realize when a table has changed drastically and toss out the old stats in that case. Either that, or we require a human to tell us via ANALYZE FULL/VACUUM ANALYZE.

I think the incremental stats approach would more or less depend on the full ANALYZE functionality for bootstrapping: when you first load the table, you want the stats to be right immediately, not to "converge" on the right value after some indeterminate amount of time.
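
Just to make point 3 above a bit more concrete: if the sampled values from successive passes were kept in, say, a hypothetical side table accumulated_sample(val), each pass could recompute the estimate over the whole accumulated sample with the same Haas-and-Stokes estimator that ANALYZE uses today (IIRC), n*d / (n - f1 + f1*n/N). Something like:

    -- d  = distinct values in the accumulated sample
    -- n  = total rows in the accumulated sample
    -- f1 = values seen exactly once in the sample
    -- total_rows = estimated row count of the real table
    SELECT (n * d)::float8 / (n - f1 + f1 * n / total_rows) AS n_distinct_est
      FROM (SELECT count(*)                                 AS d,
                   sum(cnt)                                 AS n,
                   sum(CASE WHEN cnt = 1 THEN 1 ELSE 0 END) AS f1,
                   (SELECT reltuples FROM pg_class
                     WHERE relname = 'bigtable')            AS total_rows
              FROM (SELECT val, count(*) AS cnt
                      FROM accumulated_sample
                     GROUP BY val) per_value) s;

The hard parts are exactly points 2 and 4: keeping the accumulated sample representative as passes pile up, and noticing when the table has changed enough that the whole sample should be thrown away.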

-Casey