On Thu, Oct 21, 2021 at 2:13 PM Greg Stark <st...@mit.edu> wrote:
> The problem I'm finding is that the distribution of these small
> subsets can swing quickly. And understanding intercolumn correlations
> even if we could do it perfectly would be no help at all.
>
> Consider a table with millions of rows that are "done" but none that
> are "pending". Inserting just a few hundred or thousand new pending
> rows makes any estimates based on the existing statistics entirely
> incorrect. Even if we had perfect statistics capable of making perfect
> estimates they would be entirely wrong once a few inserts of pending
> rows are done.
I am very sympathetic to this view of things. This asymmetry obviously
exists, and it matters. There is no getting around that.

> Worse, this is kind of true for even n_dead_tup, n_mod_since_analyze,
> etc are kind of affected by this. It's easy (at least on older
> versions, maybe Peter's work has fixed this for btree) to get severe
> index bloat because vacuum doesn't run for a long time relative to the
> size of the busy portion of a table.

My work (especially in 14) has definitely helped a great deal with
index bloat, by cleaning it up in a targeted fashion, based on
page-level considerations. That is really the only approach that can
work; we can never expect VACUUM to be able to deal with that, no
matter what, simply because it's totally normal and expected for index
bloat to grow at an uneven rate over time.

I do still think that there is an unsolved issue here, which leads to
problems with index bloat when there isn't "B-Tree keyspace
concentration" of garbage index tuples. That problem is with the
statistics that drive VACUUM themselves; they just don't work very
well in certain cases [1].

Statistics that drive autovacuum usually come from ANALYZE, of course.
But the entire intellectual justification for database statistics
doesn't really carry over to VACUUM. There are certain "physical
database" implementation details that bleed into the way ANALYZE
counts dead rows. For example, in practice most dead tuples are
LP_DEAD stub line pointers (not even whole tuples). They're only 4
bytes, whereas live tuples are about 30 bytes at a minimum (depending
on how you count it). This leads to ANALYZE's block-based sampling
becoming confused. The confusion seems related to the fact that
ANALYZE is really a "logical database" thing. It's slightly amazing
that statistics from ANALYZE work as well as they do for query
planning, so we shouldn't be too surprised when they work much less
well as a driver for VACUUM.

Note that the TPC-C issue I describe in [1] involves a table that's a
little bit like a queue table, but with lots of non-HOT updates (lots
overall, but only one update per logical row, ever). This might tie
into what Thomas just said about DB2 and queue tables.

> I'm imagining to really tackle this we should be doing something like
> noticing when inserts, updates, deletes are affecting key values that
> are "rare" according to the statistics and triggering autovacuum
> ANALYZE statements that use indexes to only update the statistics for
> the relevant key ranges.

I'm not sure. I tend to think that the most promising approaches all
involve some kind of execution-time smarts about the statistics and
their inherent unreliability. Somehow query execution itself should
become less gullible, at least in cases where we really can have high
confidence in the statistics being wrong at this exact time, for this
exact key space.

[1] https://postgr.es/m/CAH2-Wz=9r83wcwzcpuh4fvpedm4znzbzmvp3rt21+xhqwmu...@mail.gmail.com

--
Peter Geoghegan
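To put rough numbers on the LP_DEAD point above, here is a
back-of-the-envelope sketch in Python (not PostgreSQL code). It uses
only the 4-byte and ~30-byte figures quoted in the message plus an
assumed 8KB page with a roughly 24-byte header, and it deliberately
ignores alignment padding and the per-page line pointer cap
(MaxHeapTuplesPerPage), so the counts are illustrative rather than
exact:

    # Back-of-the-envelope arithmetic (not PostgreSQL code): how the size
    # asymmetry between LP_DEAD stubs and small live tuples plays out at
    # the block level. Real pages also have alignment padding and a cap
    # on line pointers per page (MaxHeapTuplesPerPage), ignored here.

    PAGE_SIZE = 8192       # default BLCKSZ
    PAGE_HEADER = 24       # approximate page header overhead
    DEAD_STUB = 4          # an LP_DEAD stub is just a 4-byte line pointer
    LIVE_ROW = 30 + 4      # ~30-byte minimal tuple plus its line pointer

    usable = PAGE_SIZE - PAGE_HEADER
    dead_per_page = usable // DEAD_STUB   # ~2000 stubs (before the cap)
    live_per_page = usable // LIVE_ROW    # ~240 small live rows

    n = 1_000_000
    pages_if_dead = -(-n // dead_per_page)   # ceiling division
    pages_if_live = -(-n // live_per_page)

    print(f"{n} dead stubs fit in ~{pages_if_dead} pages; "
          f"{n} small live rows need ~{pages_if_live} pages")

    # The same logical row count occupies very different numbers of blocks,
    # so a fixed-size random sample of blocks sees dead items at a very
    # different rate than live rows, and the extrapolated dead-tuple total
    # inherits that distortion.

The only point of the sketch is that the same number of logically dead
rows can be packed into far fewer blocks than live rows, which is one
way a block-based sample can misjudge how much garbage the table as a
whole contains.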