Re: [PATCH] Erase the distinctClause if the result is unique by definition

Andy Fan Thu, 12 Mar 2020 21:52:12 -0700

On Fri, Mar 13, 2020 at 11:46 AM David Rowley <[email protected]> wrote:


> On Fri, 13 Mar 2020 at 14:47, Andy Fan <[email protected]> wrote:
> > 1.   for pupulate_baserel_uniquekeys,  we need handle the "pk = Const"
> as well.
> > (relation_has_unqiue_for has a similar logic) currently the following
> distinct path is still
> >  there.
>
> Yeah, I left a comment in propagate_unique_keys_to_joinrel() to
> mention that still needs to be done.

> postgres=# explain select distinct b from t100 where pk = 1;
> >                                     QUERY PLAN
> >
> ----------------------------------------------------------------------------------
> >  Unique  (cost=8.18..8.19 rows=1 width=4)
> >    ->  Sort  (cost=8.18..8.19 rows=1 width=4)
> >          Sort Key: b
> >          ->  Index Scan using t100_pkey on t100  (cost=0.15..8.17 rows=1
> width=4)
> >                Index Cond: (pk = 1)
> > (5 rows)
> >
> > I think in this case,  we can add  both (pk) and (b) as the
> UniquePaths.  If so we
> > can get more opportunities to reach our goal.
>
> The UniqueKeySet containing "b" could only be added in the
> distinct_rel in the upper planner. It must not change the input_rel
> for the distinct.
>
> I think we maintain UniqueKey even without distinct_rel,  so at this
stage,
Can we say b is unique for this(no is possible)?  If yes,  we probably
need to set that information without consider the distinct clause.

It's likely best to steer clear of calling UniqueKeys UniquePaths as
> it might confuse people.  The term "path" is used in PostgreSQL as a
> lightweight representation containing all the information required to
> build a plan node in createplan.c. More details in
> src/backend/optimizer/README.
>
>
OK.


> > 2.  As for the propagate_unique_keys_to_joinrel,  we can add 1 more
> UniquePath as
> > (rel1_unique_paths,  rel2_unique_paths) if the current rules doesn't
> apply.
> > or else the following cases can't be handled.
> >
> > postgres=# explain select distinct t100.pk,  t101.pk from t100, t101;
> >                                    QUERY PLAN
> >
> --------------------------------------------------------------------------------
> >  Unique  (cost=772674.11..810981.11 rows=5107600 width=8)
> >    ->  Sort  (cost=772674.11..785443.11 rows=5107600 width=8)
> >          Sort Key: t100.pk, t101.pk
> >          ->  Nested Loop  (cost=0.00..63915.85 rows=5107600 width=8)
> >                ->  Seq Scan on t100  (cost=0.00..32.60 rows=2260 width=4)
> >                ->  Materialize  (cost=0.00..43.90 rows=2260 width=4)
> >                      ->  Seq Scan on t101  (cost=0.00..32.60 rows=2260
> width=4)
> > (7 rows)
>
> I don't really follow what you mean here. It seems to me there's no
> way we can skip doing DISTINCT in the case above.  If you've just
> missed out the join clause and you meant to have "WHERE t100.pk =
> t101.pk", then we can likely fix that later with some sort of
> functional dependency tracking.


In the above case the result should be unique, the knowledge behind that
is if *we join 2 unique results in any join method, the result is unique as
well*
in the above example,  the final unique Key is (t100.pk, t101.pk).


> Likely we can just add a Relids field
> to UniqueKeySet to track which relids are functionally dependant on a
> row from the UniqueKeySet's uk_exprs.  That might be as simple as just
> pull_varnos from the non-matched exprs and checking to ensure the
> result is a subset of functionally dependant rels.  I'd need to give
> that more thought.
>
> Was this a case you had working in your patch?
>

I think we can do that after I get your UniqueKey idea,  so, no,  my
previous patch is not as smart
as yours:)


> > But if we add such rule,  the unique paths probably become much longer,
> so we need
> > a strategy to tell if the UniquePath is useful for our query,  if not,
> we can ignore that.
> > rel->reltarget maybe a good info for such optimization.   I think we can
> take this into
> > consideration for pupulate_baserel_uniquekeys as well.
>
> I don't really think the number of unique indexes in a base rel will
> really ever get out of hand for legitimate cases.
> propagate_unique_keys_to_joinrel is just concatenating baserel
> UniqueKeySets to the joinrel. They're not copied, so it's just tagging
> pointers onto the end of an array, which is at best a memcpy(), or at
> worst a realloc() then memcpy(). That's not so costly.
>

The memcpy is not key concerns here.  My main point is we need
to focus on the length of RelOptInfo->uniquekeys.  For example:
t  has 3 uk like this (uk1),  (uk2), (uk3).  And the query is
select b from t where m = 1;  If so there is no need to add these 3
to UniqueKeys so that we can keep rel->uniquekeys shorter.

The the length of rel->uniquekeys maybe a  concern if we add the rule
I suggested above , the (t100.pk, t101.pk) case.   Think about this
for example:

1. select .. from t1, t2, t3, t4...;
2. suppose each table has 2 UniqueKeys, named (t{m}_uk{n})
3. follow my above rule (t1.pk1, t2.pk）is a UniqueKey for joinrel.
4. suppose we join with the following order (t1 vs t2 vs t3 vs t4)

For for (t1 vs t2),  we need to add 4 more UniqueKeys for this joinrel.
(t1_uk1 , t2_uk1), (t1_uk1, t2_uk2), (t1_uk2 , t2_uk1), (t1_uk2, t2_uk2)

After we come to join the last one, the joinrel->uniquekey will be much
longer
which makes the scan of it less efficient.

But this will not be an issue if my above rule should not be considered.
so
we need to talk about that first.

> For the non_null info,  Tom suggested to add maintain such info
> RelOptInfo,
> > I have done that for the not_null_info for basic relation catalog,  I
> think we can
> > maintain the same flag for joinrel and the not null info from
> find_nonnullable_vars as
> > well, but I still didn't find a good place to add that so far.
>
> I'd considered just adding a get_notnull() function to lsyscache.c.
> Just below get_attname() looks like a good spot.  I imagined just
> setting the bit in the UniqueKeySet's non_null_keys field
> corresponding to the column position from the index.  I could see the
> benefit of having a field in RelOptInfo if there was some way to
> determine the not-null properties of all columns in the table at once,
>

do you mean get the non-null  properties from catalog or restrictinfo?
if you mean catalog,  get_relation_info may be a good place for that.

but there's not, so we're likely best just looking at the ones that
> there are unique indexes on.
>


> > A small question about the following code:
> >
> > +       if (relation_has_uniquekeys_for(root, input_rel,
> get_sortgrouplist_exprs(parse->distinctClause, parse->targetList), false))
> > +       {
> > +
> > +               add_path(distinct_rel, (Path *)cheapest_input_path);
> > +
> > +               /* XXX yeah yeah, need to call the hooks etc. */
> > +
> > +               /* Now choose the best path(s) */
> > +               set_cheapest(distinct_rel);
> > +
> > +               return distinct_rel;
> > +       }
> >
> > Since we don't create new RelOptInfo/Path,  do we need to call add_path
> and set_cheapest?
>
> The distinct_rel already exists. add_path() is the standard way we
> have of adding paths to the rel's pathlist.  Why would you want to
> bypass that? set_cheapest() is our standard way of looking at the
> pathlist and figuring out the least costly one. It's not a very hard
> job to do when there's just 1 path. Not sure why you'd want to do it
> another way.
>

I got the point now.  In this case,  you create an new RelOptInfo
named distinct_rel, so we *must* set it.  Can we just return the input_rel
in this case? if we can,  we don't need that.

Re: [PATCH] Erase the distinctClause if the result is unique by definition

Reply via email to