On Wed, Jan 29, 2020 at 10:41 AM Robert Haas <robertmh...@gmail.com> wrote:
> Yeah, maybe. I'm tempted to advocate for dropping the GUC and keeping
> the reloption. If the worst case is a 3% regression and you expect
> that to be rare, I don't think a GUC is really worth it, especially
> given that the proposed semantics seem somewhat confusing. The
> reloption can be used in a pinch to protect against either bugs or
> performance regressions, whichever may occur, and it doesn't seem like
> you need a second mechanism.
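
To make the reloption fallback concrete (the table and index names
below are made up, and the option's exact spelling may still change
before anything is committed), turning deduplication off for a single
index would look something like this:

    -- Hypothetical table/index names; only the reloption itself matters.
    CREATE INDEX orders_custid_idx ON orders (customer_id)
        WITH (deduplicate_items = off);

    -- Or, for an existing index that turns out to regress:
    ALTER INDEX orders_custid_idx SET (deduplicate_items = off);
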
I like the idea of getting rid of the GUC. I should stop talking about
it for now, and go back to reassessing the extent of the regression in
highly unsympathetic cases. The patch has become faster in a couple of
different ways since I last looked at this question, and it's entirely
possible that the regression is even smaller than it was before.

> One thing that you could do is try to learn whether deduplication (I
> really don't like that name, but here we are) seems to be working for
> a given index, perhaps even in a given session. For instance, suppose
> you keep track of what happened the last ten times the current session
> attempted deduplication within a given index. Store the state in the
> relcache.

It's tempting to try to reason about the state of an index over time
like this, but I don't think that it's ever going to work well. Imagine
a unique index where 50% of all values are NULLs, on an append-only
table. Actually, let's say it's a non-unique index with unique
integers, and NULL values for the remaining 50% of rows -- that way we
don't get the benefit of the incoming-item-is-duplicate heuristic.
There will be two contradictory tendencies within this particular
index: deduplication keeps paying off for the NULLs, while never paying
off for the unique integers. We might end up ping-ponging between the
two behaviors. It seems better to just accept a small performance hit.

Sometimes we can add heuristics to things like the split point location
choice logic (nbtsplitloc.c) that behave as if we were keeping track of
how things progress across successive, related page splits in the same
index -- though we don't actually keep track of anything. I'm thinking
of things like Postgres v12's "split after new tuple" optimization,
which makes the TPC-C indexes so much smaller. We can do these things
statelessly and safely only because the heuristics have little to lose
and much to gain. To a lesser extent, it's okay because the heuristics
are self-limiting -- we can only make an incorrect inference about what
to do because we were unlucky, and there is no reason to think that
we'll consistently make the wrong choice. It feels a little bit like
quicksort to me.

--
Peter Geoghegan