Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Tomas Vondra Tue, 09 Jul 2019 07:54:50 -0700

On Tue, Jul 09, 2019 at 09:28:42AM -0400, James Coleman wrote:

On Mon, Jul 8, 2019 at 9:37 PM Tomas Vondra
<[email protected]> wrote:

On Mon, Jul 08, 2019 at 12:07:06PM -0400, James Coleman wrote:
> ...
>
>I guess I'm still not following. If (2) is responsible (currently) for
>adding an explicit sort, why wouldn't adding an incremental sort be an
>example of that responsibility? The subpath that either a Sort or
>IncrementalSort is being added on top of (to then feed into the
>GatherMerge) is the same in both cases right?
>
>Unless you're saying that the difference is that since we have a
>partial ordering already for incremental sort then incremental sort
>falls into the category of "maintaining" existing ordering of the
>subpath?
>

Oh, I think I understand what you're saying. Essentially, we should not
be making generate_gather_paths responsible for adding the incremental
sort. Instead, we should be looking at places than are adding explicit
sort (using create_sort_path) and also consider adding incremental sort.

I definitely agree with the second half - we should look at all places
that create explicit sorts and make them also consider incremental
sorts. That makes sense.


Yep, exactly.


If I remember correctly, one of the previous patch versions (in the early
2018 commitfests) actually modified many of those places, but it did that
in a somewhat "naive" way. It simply used incremental sort whenever the
path was partially sorted, or something like that. So it did not use
costing properly. There was an attempt to fix that in the last commitfest
but the costing model was deemed to be a bit too rough and unreliable

(especially the ndistinct estimates can be quite weak), so the agreementwas to try salvaging the patch for PG11 by only considering incremental

sort in "safest" places with greatest gains.

We've significantly improved the costing model since then, and the
implementation likely handles the corner cases much better. But that does
not mean we have to introduce the incremental sort to all those places at
once - it might be wise to split that into separate patches.

It's not just about picking the right plan - we've kinda what impact these
extra paths might have on planner performance, so maybe we should look at
that too. And the impact might be different for each of those places.

I'll leave that up to you, but I certainly won't insist on doing it all in
one huge patch.

But I'm not sure it'll address all cases - the problem is that those
places add the explicit sort because they need sorted input. Gather
Merge does not do that, it only preserves existing ordering of paths.

So it's possible the path does not have an explicit sort on to, and
gather merge will not know to add it. And once we have the gather merge
in place, we can't push place "under" it.


That's the explanation I was missing; and it makes sense (to restate:
sometimes sorting is useful even when not required for correctness of
the user returned data).


Yes, although even when the sorting is required for correctness (because
the user specified ORDER BY) you can do it at different points in the
plan. We'd still produce correct results, but the sort might be done at
the very end without these changes.

For example we might end up with plans

  Incremental Sort (presorted: a, path keys: a,b)
   -> Gather Merge (path keys: a)
       -> Index Scan (path keys: a)

but with those changes we might push the incremental sort down into the
parallel part:

  Gather Merge (path keys: a,b)
   -> Incremental Sort (presorted: a, path keys: a,b)
       -> Index Scan (path keys: a)

which is likely better. Perhaps that's what you meant, though.

In fact, we already have code dealing with this "issue" for a special
case - see gather_grouping_paths(). It generates plain gather merge
paths, but then also considers building one with explicit sort. But it
only does that for grouping paths (when it's clear we need to be looking
at grouping_pathkeys), and there are generate_gather_paths() that don't
have similar treatment.


I just find it humorous both of us were writing separate emails
mentioning that function at the same time.

;-)

...

And I think I know why is that - while gather_grouping_paths() tries to
add explicit sort below the gather merge, there are other places that
call generate_gather_paths() that don't do that. In this case it's
probably apply_scanjoin_target_to_paths() which simply builds

   parallel (seq|index) scan + gather merge

and that's it. The problem is likely the same - the code does not know
which pathkeys are "interesting" at that point. We probably need to
teach planner to do this.


I had also noticed that that was an obvious place where
generate_gather_paths() was used but an explicit sort wasn't also
added separately, which makes me think the division of labor is
probably currently wrong regardless of the incremental sort patch.

Do you agree? Should we try to fix that (likely with your new
"interesting paths" version of generate_gather_paths()) first as a
prefix patch to adding incremental sort?


I'm not sure what the generate_useful_gather_paths() should do but does
not? Or is it just the division of labor that you think is wrong? In any
case, feel free to whack it until you're happy with it.

...

Attached is a slightly modified patch series:

1) 0001 considers incremental sort paths in various places (adds the new
generate_useful_gather_paths and modifies places calling create_sort_path)


I need to spend some decent time digesting this patch, but the concept
sounds very useful.

OK.

2) 0002 and 0003 are fixes I mentioned before

3) 0004 adds a new GUC force_incremental_sort that (when set to 'on')
tries to nudge the optimizer into using incremental sort by essentially
making it free (i.e. using startup/total costs of the subpath). I've found
this useful when trying to force incremental sorts into plans where it may
not be the best strategy.


That will be super helpful. I do wonder if we need to expose (in the
production patch) a GUC of some kind to adjust incremental sort
costing so that users can try to tweak when it is preferred over
regular sort.


This GUC is really meant primarily for development, to force choice of
incremental sort during regression tests (so as to use incremental sort in
as many plans as possible). I'd remove it from the final patch. I think
the general consensus on pgsql-hackers is that we should not introduce
GUCs unless absolutely necessary. But for development GUCs - sure.

FWIW I'm not sure it's a good idea to look at both enable_incremental_sort
and enable_sort in cost_incremental_sort(). Not only end up with
disable_cost twice when both GUCs are set to 'off' at the moment, but it
might be useful to be able to disable those two sort types independently.
For example you might set just enable_sort=off and we'd still generate
incremental sort paths.

...

This definitely needs more work. We need to refactor it in some way, e.g.
have a function that would consider both explicit sort (on the cheapest
path) and incremental sort (on all paths), and call it from all those
places. Haven't tried it, though.

There's also a couple more places where we do create_sort_path() and don't
consider incremental sort yet - window functions, distinct etc.


Yep, and likely want regression tests for all of these cases also.


Yes, that's definitely something a committable patch needs to include.

Another thing we should have is a collection of tests with data sets that
"break" the costing model in some way (skew, correlated columns,
non-uniform group sizes, ...). That's something not meant for commit,
because it'll probably require significant amounts of data, but we need it
to asses the quality of the planner/costing part. I know there are various
ad-hoc test cases in the thread history, it'd be good to consolidate that
into once place.

regards

--
Tomas Vondra                  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Reply via email to