On 24/7/2023 16:56, Tomas Vondra wrote:
On 7/24/23 04:10, Andrey Lepikhov wrote:
On 20/7/2023 18:46, Tomas Vondra wrote:
On 7/20/23 08:37, Andrey Lepikhov wrote:
On 3/10/2022 21:56, Tom Lane wrote:
Revert "Optimize order of GROUP BY keys".

This reverts commit db0d67db2401eb6238ccc04c6407a4fd4f985832 and
several follow-on fixes.
...
Since we're hard up against the release deadline for v15, let's
revert these changes for now.  We can always try again later.

It may be time to restart the project. As a first step, I rebased the
patch on the current master. It wasn't trivial because of some recent
optimizations (a29eab, 1349d27 and 8d83a5d).
Now, let's repeat the review and rewrite the current patch according to
the reasons stated in the revert commit.

I think the fundamental task is to make the costing more reliable, and
the commit message 443df6e2db points out a couple challenges in this
area. Not sure how feasible it is to address enough of them ...

1) procost = 1.0 - I guess we could make this more realistic by doing
some microbenchmarks and tuning the costs for the most expensive cases.

2) estimating quicksort comparisons - This relies on ndistinct
estimates, and I'm not sure how much more reliable we can make those.
Probably not much :-( Not sure what to do about this, the only thing I
can think of is to track "reliability" of the estimates and only do the
reordering if we have high confidence in the estimates. That means we'll
miss some optimization opportunities, but it should limit the risk.
I read up on the history of this thread.
As I see it, all the problems mentioned above can be avoided by dropping
the new cost model entirely. We could simply sort the GROUP BY columns
according to their 'ndistinct' values.
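To make the idea concrete, here is a rough sketch on a made-up table (not
taken from the patch). Putting the column with the larger ndistinct first
means the leading comparison resolves almost all ties, so the lower-order
comparator is rarely called:

    -- made-up table: "a" has ~100000 distinct values, "b" only ~10
    CREATE TABLE t AS
      SELECT (random() * 100000)::int AS a,
             (random() * 10)::int     AS b
      FROM generate_series(1, 1000000);
    ANALYZE t;

    SET enable_hashagg = off;  -- force sort-based grouping for the comparison

    -- grouping by (a, b): nearly every comparison is resolved on "a" alone
    EXPLAIN ANALYZE SELECT count(*) FROM t GROUP BY a, b;

    -- grouping by (b, a): with only ~10 distinct values of "b", most
    -- comparisons fall through to the second column
    EXPLAIN ANALYZE SELECT count(*) FROM t GROUP BY b, a;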
I see the reason for introducing the cost model in [1]. The main
supporting point there is that, with this patch, people could no longer
optimize the query by themselves by arranging the columns in a better
order. But now we at least have the GUC to switch off the behaviour
introduced here. Also, some extensions, like the well-known
pg_hint_plan, can help with automation.
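For the record, the opt-out I mean is a one-liner (assuming the GUC keeps
the name it had in the reverted patch):

    SET enable_group_by_reordering = off;  -- disable the reordering for this session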

I think the main concern is that if we reorder the group keys and get it
wrong, it's a regression. If that happens often (due to costing based on
poor stats), it's a problem. Yes, there's a GUC, but that's a rather
blunt instrument, unfortunately.
I see. My point here is that if the ndistinct of one column is much larger than the ndistinct of another, it is quite probable that this relationship also holds during the grouping phase. We can get some regression here, but it can be outweighed by the benefit described below.
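As a hypothetical sanity check (table and column names are made up), we can
compare what the heuristic would read against the ground truth. When the
estimates differ by orders of magnitude, even a sizeable estimation error
does not flip the ordering decision:

    -- what the heuristic would read (negative n_distinct is a fraction of reltuples)
    SELECT attname, n_distinct
    FROM pg_stats
    WHERE tablename = 't' AND attname IN ('a', 'b');

    -- ground truth, for comparison
    SELECT count(DISTINCT a) AS real_a, count(DISTINCT b) AS real_b FROM t;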

So, how about committing the undoubted part of the feature and working
on the cost model in a new thread?


But Tom's commit message says this:

     Worse, to arrive at estimates of the number of calls made to the
     lower-order-column comparison functions, the code needs to make
     estimates of the numbers of distinct values of multiple columns,
     which are necessarily even less trustworthy than per-column stats.

so I'm not sure this really counts as "undoubted".
Don't try to estimate multiple columns - just sort the columns according to their ndistinct values, as a heuristic. I think we should estimate the number of distinct values of multiple columns only if we have extended statistics on those columns. That could also extend the applicability of extended statistics.
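A sketch of what I mean, with a hypothetical statistics object on the same
made-up table:

    -- multi-column ndistinct is only estimated when such an object exists
    CREATE STATISTICS t_a_b_ndistinct (ndistinct) ON a, b FROM t;
    ANALYZE t;

    SELECT statistics_name, n_distinct
    FROM pg_stats_ext
    WHERE statistics_name = 't_a_b_ndistinct';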

The suggestion above is just an attempt to pick the low-hanging fruit right now. If it turns out not to be viable, we will take the long road.

--
regards,
Andrey Lepikhov
Postgres Professional


