Re: [PATCH] Incremental sort (was: PoC: Partial sort)

James Coleman Sat, 28 Mar 2020 07:20:00 -0700

On Fri, Mar 27, 2020 at 10:58 PM Tomas Vondra
<tomas.von...@2ndquadrant.com> wrote:
> ...
> >> As a side note here, I'm wondering if this (determining useful pathkeys)
> >> can be made a bit smarter by looking both at query_pathkeys and pathkeys
> >> useful for merging, similarly to what truncate_useless_pathkeys() does.
> >> But that can be seen as an improvement of what we do now.
> >
> >Unless your comment below about looking at truncate_useless_pathkeys
> >is implying you're considering aiming to get this in now, I wonder if
> >we should just expand the comment to reference pathkeys useful for
> >merging as a possible future extension.
> >
>
> Maybe. I've been thinking about how to generate those path keys, but
> it's a bit tricky, because pathkeys_useful_for_merging() - the thing
> that backs truncate_useless_pathkeys() actually gets pathkeys and merely
> verifies if those are useful for merging. But this code needs to do the
> opposite - generate those pathkeys.
>
> But let's say we know we have a join on columns (a,b,c). For each of
> those PathKey values we know it's useful for merging, but should we
> generate pathkeys (a,b,c), (b,c,a), ... or any other permutation? I
> suppose we can look at pathkeys for existing paths of the relation to
> prune the possibilities a bit, but what then?


I'm not convinced it's worth it this time around. Both because of the
late hour in the CF etc., but also because it seems likely to become
pretty complex quickly, and also far more likely to raise performance
questions in planning if there's not a lot of work done to keep it
limited.

> BTW I wonder if we actually need the ec_has_volatile check in
> get_useful_pathkeys_for_relation. The comment says we can't but is it
> really true? pathkeys_useful_for_ordering doesn't do any such checks, so
> I'd bet this is merely an unnecessary copy-paste from postgres_fdw.
> Opinions?

I hadn't really looked at that part in depth before, but after reading
it over, re-reading the definition of volatility, and running a few
basic queries, I agree.

For example: we already do allow volatile pathkeys in a simple query like:
-- t(a, b) with index on (a)
select * from t order by a, random();

It makes sense that you couldn't push down such a sort to a foreign
server, given there's no constraints said function isn't operating
directly on the primary database (in the fdw case). But that obviously
wouldn't apply here.

> >> >9. optimizer/path/allpaths.c get_useful_pathkeys_for_relation:
> >> >* Considering query_pathkeys is always worth it, because it might let us
> >> >* avoid a local sort.
> >> >
> >> >That originally was a copy from the fdw code, but since the two
> >> >functions have diverged (Is that concerning? I could be confusing, but
> >> >isn't a compilation problem) I didn't move the function.
> >> >
> >>
> >> I think it's OK the two functions diverged, it's simply because the FDW
> >> one needs to check other things too. But I might rework this once I look
> >> closer at truncate_useless_pathkeys.
> >
> >Agreed, for now at least. It's tempting to think they should always be
> >shared, but I'm not convinced (without a lot more digging) that this
> >represents structural rather than incidental duplication.
> >
>
> The more I look at pathkeys_useful_for_ordering() the more I think the
> get_useful_pathkeys_for_relation() function should look more like it
> than the postgres_fdw one ...

If we go down that path (and indeed this is actually implied by
removing the volatile check too), doesn't that really just mean (at
least for now) that get_useful_pathkeys_for_relation goes away
entirely and we effectively use root->query_pathkeys directly? The
only thing your lose them is the check that each eclass has a member
in the rel. But that check probably wasn't quite right anyway: at
least for incremental sort (maybe not regular sort), I think we
probably don't necessarily care that the entire list has members in
the rel, but rather that some prefix does, right? The idea there would
be that we could create a gather merge here that supplies a partial
ordering (that prefix of query_pathkeys) and then above it the planner
might place another incremental sort (say above a join), or perhaps
even a join above that would actually be sufficient (since many joins
are capable of providing an ordering anyway).

I've attached (added prefix .txt to avoid the cfbot assuming this is a
full series) an idea in that direction to see if we're thinking along
the same route. You'll want to apply and view with `git diff -w`
probably.

The attached also adds a few XXX comments. In particular, I wonder if
we should verify that some prefix of the root->query_pathkeys is
actually made up of eclass members for the current rel, because
otherwise I think we can skip the loop on the subpaths entirely.

> >> >I did notice though that find_em_expr_for_rel() is wholesale copied
> >> >(and unchanged) from the fdw code, so I moved it to equivclass.c so
> >> >both places can share it.
> >> >
> >>
> >> +1
> >>
>
> ... which would also get rid of find_em_expr_for_rel().

... which, I think, would retain the need for find_em_expr_for_rel().

Note: The attached applied to the previous series compiles and runs
make check...but I haven't really tested it; it's meant more for "is
this the direction we want to go".

James

diff --git a/src/backend/optimizer/path/allpaths.c 
b/src/backend/optimizer/path/allpaths.c
index 32bf734820..eea41fafe3 100644
--- a/src/backend/optimizer/path/allpaths.c
+++ b/src/backend/optimizer/path/allpaths.c
@@ -2727,63 +2727,6 @@ generate_gather_paths(PlannerInfo *root, RelOptInfo 
*rel, bool override_rows)
        }
 }
 
-/*
- * get_useful_pathkeys_for_relation
- *             Determine which orderings of a relation might be useful.
- *
- * Getting data in sorted order can be useful either because the requested
- * order matches the final output ordering for the overall query we're
- * planning, or because it enables an efficient merge join.  Here, we try
- * to figure out which pathkeys to consider.
- *
- * This allows us to do incremental sort on top of an index scan under a gather
- * merge node, i.e. parallelized.
- *
- * XXX At the moment this can only ever return a list with a single element,
- * because it looks at query_pathkeys only. So we might return the pathkeys
- * directly, but it seems plausible we'll want to consider other orderings
- * in the future.
- */
-static List *
-get_useful_pathkeys_for_relation(PlannerInfo *root, RelOptInfo *rel)
-{
-       List       *useful_pathkeys_list = NIL;
-       ListCell   *lc;
-
-       /*
-        * Considering query_pathkeys is always worth it, because it might 
allow us
-        * to avoid a total sort when we have a partially presorted path 
available.
-        */
-       if (root->query_pathkeys)
-       {
-               bool            query_pathkeys_ok = true;
-
-               foreach(lc, root->query_pathkeys)
-               {
-                       PathKey    *pathkey = (PathKey *) lfirst(lc);
-                       EquivalenceClass *pathkey_ec = pathkey->pk_eclass;
-                       Expr       *em_expr;
-
-                       /*
-                        * We can't use incremental sort for pathkeys 
containing volatile
-                        * expressions. We could walk the exppression itself, 
but checking
-                        * ec_has_volatile here saves some cycles.
-                        */
-                       if (pathkey_ec->ec_has_volatile ||
-                               !(em_expr = find_em_expr_for_rel(pathkey_ec, 
rel)))
-                       {
-                               query_pathkeys_ok = false;
-                               break;
-                       }
-               }
-
-               if (query_pathkeys_ok)
-                       useful_pathkeys_list = 
list_make1(list_copy(root->query_pathkeys));
-       }
-
-       return useful_pathkeys_list;
-}
-
 /*
  * generate_useful_gather_paths
  *             Generate parallel access paths for a relation by pushing a 
Gather or
@@ -2800,7 +2743,6 @@ generate_useful_gather_paths(PlannerInfo *root, 
RelOptInfo *rel, bool override_r
        ListCell   *lc;
        double          rows;
        double     *rowsp = NULL;
-       List       *useful_pathkeys_list = NIL;
        Path       *cheapest_partial_path = NULL;
 
        /* If there are no partial paths, there's nothing to do here. */
@@ -2818,9 +2760,6 @@ generate_useful_gather_paths(PlannerInfo *root, 
RelOptInfo *rel, bool override_r
        if (!enable_incrementalsort)
                return;
 
-       /* consider incremental sort for interesting orderings */
-       useful_pathkeys_list = get_useful_pathkeys_for_relation(root, rel);
-
        /* used for explicit (full) sort paths */
        cheapest_partial_path = linitial(rel->partial_pathlist);
 
@@ -2829,104 +2768,125 @@ generate_useful_gather_paths(PlannerInfo *root, 
RelOptInfo *rel, bool override_r
         *
         * XXX I wonder if we need to consider adding a projection here, as
         * create_ordered_paths does.
+        *
+        * XXX I wonder if it'd be worth verifying that at least some prefix
+        * of root->query_pathkeys has eclass members for this rel before
+        * looping through the subpaths. If, for example, the first pathkey
+        * isn't in this rel, then we know this loop won't ever be useful.
         */
-       foreach(lc, useful_pathkeys_list)
+       foreach(lc, rel->partial_pathlist)
        {
-               List       *useful_pathkeys = lfirst(lc);
-               ListCell   *lc2;
                bool            is_sorted;
                int                     presorted_keys;
+               Path       *subpath = (Path *) lfirst(lc);
+               GatherMergePath *path;
+               List       *useful_pathkeys = NIL;
 
-               foreach(lc2, rel->partial_pathlist)
-               {
-                       Path       *subpath = (Path *) lfirst(lc2);
-                       GatherMergePath *path;
-
-                       /* path has no ordering at all, can't use incremental 
sort */
-                       if (subpath->pathkeys == NIL)
-                               continue;
+               /* path has no ordering at all, can't use incremental sort */
+               if (subpath->pathkeys == NIL)
+                       continue;
 
-                       is_sorted = 
pathkeys_common_contained_in(useful_pathkeys,
-                                                                               
                         subpath->pathkeys,
-                                                                               
                         &presorted_keys);
+               /* consider incremental sort for interesting orderings */
+               useful_pathkeys = truncate_useless_pathkeys(root, rel, 
subpath->pathkeys);
 
-                       /*
-                        * When the partial path is already sorted, we can just 
add a gather
-                        * merge on top, and we're done - no point in adding 
explicit sort.
-                        *
-                        * XXX Can't we skip this (maybe only for the cheapest 
partial path)
-                        * when the path is already sorted? Then it's likely 
duplicate with
-                        * the path created by generate_gather_paths.
-                        */
-                       if (is_sorted)
-                       {
-                               path = create_gather_merge_path(root, rel, 
subpath, rel->reltarget,
-                                                                               
                subpath->pathkeys, NULL, rowsp);
+               /*
+                * Getting data in sorted order can be useful either because 
the requested
+                * order matches the final output ordering for the overall 
query we're
+                * planning, or because it enables an efficient merge join.  
Here, we try
+                * to figure out which pathkeys to consider.
+                *
+                * This allows us to do incremental sort on top of an index 
scan under a gather
+                * merge node, i.e. parallelized.
+                *
+                * XXX At the moment this can only ever return a list with a 
single element,
+                * because it looks at query_pathkeys only. So we might return 
the pathkeys
+                * directly, but it seems plausible we'll want to consider 
other orderings
+                * in the future.
+                *
+                * XXX Should this just be root->query_pathkeys == 
useful_pathkeys
+                * (or length of the two are equal) given our using 
truncate_useless_pathkeys
+                * above?
+                */
+               is_sorted = pathkeys_common_contained_in(useful_pathkeys,
+                                                                               
                 subpath->pathkeys,
+                                                                               
                 &presorted_keys);
 
-                               add_path(rel, &path->path);
-                               continue;
-                       }
+               /*
+                * When the partial path is already sorted, we can just add a 
gather
+                * merge on top, and we're done - no point in adding explicit 
sort.
+                *
+                * XXX Can't we skip this (maybe only for the cheapest partial 
path)
+                * when the path is already sorted? Then it's likely duplicate 
with
+                * the path created by generate_gather_paths.
+                */
+               if (is_sorted)
+               {
+                       path = create_gather_merge_path(root, rel, subpath, 
rel->reltarget,
+                                                                               
        subpath->pathkeys, NULL, rowsp);
 
-                       Assert(!is_sorted);
+                       add_path(rel, &path->path);
+                       continue;
+               }
 
-                       /*
-                        * Consider regular sort for the cheapest partial path 
(for each
-                        * useful pathkeys). We know the path is not sorted, 
because we'd
-                        * not get here otherwise.
-                        *
-                        * XXX This is not redundant with the gather merge path 
created in
-                        * generate_gather_paths, because that merely preserves 
ordering of
-                        * the cheapest partial path, while here we add an 
explicit sort to
-                        * get match the useful ordering.
-                        */
-                       if (cheapest_partial_path == subpath)
-                       {
-                               Path       *tmp;
+               Assert(!is_sorted);
 
-                               tmp = (Path *) create_sort_path(root,
-                                                                               
                rel,
-                                                                               
                subpath,
-                                                                               
                useful_pathkeys,
-                                                                               
                -1.0);
+               /*
+                * Consider regular sort for the cheapest partial path (for each
+                * useful pathkeys). We know the path is not sorted, because 
we'd
+                * not get here otherwise.
+                *
+                * XXX This is not redundant with the gather merge path created 
in
+                * generate_gather_paths, because that merely preserves 
ordering of
+                * the cheapest partial path, while here we add an explicit 
sort to
+                * get match the useful ordering.
+                */
+               if (cheapest_partial_path == subpath)
+               {
+                       Path       *tmp;
 
-                               rows = tmp->rows * tmp->parallel_workers;
+                       tmp = (Path *) create_sort_path(root,
+                                                                               
        rel,
+                                                                               
        subpath,
+                                                                               
        useful_pathkeys,
+                                                                               
        -1.0);
 
-                               path = create_gather_merge_path(root, rel,
-                                                                               
                tmp,
-                                                                               
                rel->reltarget,
-                                                                               
                tmp->pathkeys,
-                                                                               
                NULL,
-                                                                               
                rowsp);
+                       rows = tmp->rows * tmp->parallel_workers;
 
-                               add_path(rel, &path->path);
+                       path = create_gather_merge_path(root, rel,
+                                                                               
        tmp,
+                                                                               
        rel->reltarget,
+                                                                               
        tmp->pathkeys,
+                                                                               
        NULL,
+                                                                               
        rowsp);
 
-                               /* Fall through */
-                       }
+                       add_path(rel, &path->path);
 
-                       /*
-                        * Consider incremental sort, but only when the subpath 
is already
-                        * partially sorted on a pathkey prefix.
-                        */
-                       if (presorted_keys > 0)
-                       {
-                               Path       *tmp;
+                       /* Fall through */
+               }
 
-                               tmp = (Path *) 
create_incremental_sort_path(root,
-                                                                               
                                        rel,
-                                                                               
                                        subpath,
-                                                                               
                                        useful_pathkeys,
-                                                                               
                                        presorted_keys,
-                                                                               
                                        -1);
-
-                               path = create_gather_merge_path(root, rel,
-                                                                               
                tmp,
-                                                                               
                rel->reltarget,
-                                                                               
                tmp->pathkeys,
-                                                                               
                NULL,
-                                                                               
                rowsp);
-
-                               add_path(rel, &path->path);
-                       }
+               /*
+                * Consider incremental sort, but only when the subpath is 
already
+                * partially sorted on a pathkey prefix.
+                */
+               if (presorted_keys > 0)
+               {
+                       Path       *tmp;
+
+                       tmp = (Path *) create_incremental_sort_path(root,
+                                                                               
                                rel,
+                                                                               
                                subpath,
+                                                                               
                                useful_pathkeys,
+                                                                               
                                presorted_keys,
+                                                                               
                                -1);
+
+                       path = create_gather_merge_path(root, rel,
+                                                                               
        tmp,
+                                                                               
        rel->reltarget,
+                                                                               
        tmp->pathkeys,
+                                                                               
        NULL,
+                                                                               
        rowsp);
+
+                       add_path(rel, &path->path);
                }
        }
 }

Re: [PATCH] Incremental sort (was: PoC: Partial sort)

Reply via email to