On 7/24/23 04:10, Andrey Lepikhov wrote:
> On 20/7/2023 18:46, Tomas Vondra wrote:
>> On 7/20/23 08:37, Andrey Lepikhov wrote:
>>> On 3/10/2022 21:56, Tom Lane wrote:
>>>> Revert "Optimize order of GROUP BY keys".
>>>>
>>>> This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and
>>>> several follow-on fixes.
>>>> ...
>>>> Since we're hard up against the release deadline for v15, let's
>>>> revert these changes for now. We can always try again later.
>>>
>>> It may be time to restart the project. As a first step, I rebased the
>>> patch on the current master. It wasn't trivial because of some recent
>>> optimizations (a29eab, 1349d27 and 8d83a5d).
>>> Now, let's repeat the review and rewrite the current patch according
>>> to the reasons stated in the revert commit.
>>
>> I think the fundamental task is to make the costing more reliable, and
>> the commit message of 443df6e2db points out a couple of challenges in
>> this area. Not sure how feasible it is to address enough of them ...
>>
>> 1) procost = 1.0 - I guess we could make this more realistic by doing
>> some microbenchmarks and tuning the costs for the most expensive cases.
>>
>> 2) estimating quicksort comparisons - This relies on ndistinct
>> estimates, and I'm not sure how much more reliable we can make those.
>> Probably not much :-( Not sure what to do about this; the only thing I
>> can think of is to track the "reliability" of the estimates and only do
>> the reordering when we have high confidence in them. That means we'll
>> miss some optimization opportunities, but it should limit the risk.
>
> I read up on the history of this thread.
> As I see it, all the problems mentioned above can be avoided by
> dropping the new cost model entirely. We can simply sort the GROUP BY
> columns according to their 'ndistinct' values.
> I see the reason for introducing the cost model in [1]. The main
> supporting point there is that, with this patch, people could no longer
> optimize a query themselves by arranging the columns in a more suitable
> order. But now we have at least the GUC to switch off the behaviour
> introduced here, and some extensions, like the well-known pg_hint_plan,
> can help with automation.
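
Just to make the tradeoff concrete, here is a rough sketch of what
reordering by ndistinct is supposed to buy us (table/column names, the
stats and the GUC name are made up / from memory, so treat them as
assumptions rather than as output of the patch):

    -- suppose pg_stats reports n_distinct ~ 1000000 for t.a and ~ 10 for t.b
    -- written this way, the grouping is implemented by sorting on (b, a)
    SELECT b, a, count(*)
    FROM   t
    GROUP  BY b, a;

    -- putting the high-ndistinct column first, i.e. sorting on (a, b),
    -- means almost every tuple comparison is resolved on the first key,
    -- so the comparator for b is rarely called - that's the quantity the
    -- reverted cost model tried to estimate per column
    SELECT b, a, count(*)
    FROM   t
    GROUP  BY a, b;

    -- escape hatch from the reverted patch (IIRC the GUC was named like
    -- this, but double-check):
    -- SET enable_group_by_reordering = off;

Of course this only works out if the ndistinct estimates are roughly
right; with correlated or badly estimated columns the reordering can
just as easily pick an order that is worse (e.g. one that no longer
matches an existing index or the requested ORDER BY).
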
I think the main concern is that if we reorder the group keys and get it
wrong, it's a regression. If that happens often (due to costing based on
poor stats), it's a problem. Yes, there's a GUC, but that's a rather
blunt instrument, unfortunately.

> So, how about committing the undoubted part of the feature and working
> on the cost model in a new thread?

But Tom's commit message says this:

    Worse, to arrive at estimates of the number of calls made to the
    lower-order-column comparison functions, the code needs to make
    estimates of the numbers of distinct values of multiple columns,
    which are necessarily even less trustworthy than per-column stats.

so I'm not sure this really counts as "undoubted".


regards

-- 
Tomas Vondra
EnterpriseDB: http://www.enterprisedb.com
The Enterprise PostgreSQL Company