Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2017-09-12 Thread Tom Lane
Jim Nasby  writes:
> I've verified that the patch still applies and make check-world is clean.

Not any more :-(.  Here's a v3 rebased over HEAD.  No substantive
change from v2.

regards, tom lane

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index b83d919..0311a50 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** _outPlannerInfo(StringInfo str, const Pl
*** 2228,2233 
--- 2228,2234 
  	WRITE_FLOAT_FIELD(tuple_fraction, "%.4f");
  	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
  	WRITE_UINT_FIELD(qual_security_level);
+ 	WRITE_BOOL_FIELD(needBaseTids);
  	WRITE_BOOL_FIELD(hasInheritedTarget);
  	WRITE_BOOL_FIELD(hasJoinRTEs);
  	WRITE_BOOL_FIELD(hasLateralRTEs);
diff --git a/src/backend/optimizer/plan/Makefile b/src/backend/optimizer/plan/Makefile
index 88a9f7f..1db6bd5 100644
*** a/src/backend/optimizer/plan/Makefile
--- b/src/backend/optimizer/plan/Makefile
*** top_builddir = ../../../..
*** 13,18 
  include $(top_builddir)/src/Makefile.global
  
  OBJS = analyzejoins.o createplan.o initsplan.o planagg.o planmain.o planner.o \
! 	setrefs.o subselect.o
  
  include $(top_srcdir)/src/backend/common.mk
--- 13,18 
  include $(top_builddir)/src/Makefile.global
  
  OBJS = analyzejoins.o createplan.o initsplan.o planagg.o planmain.o planner.o \
! 	planunionor.o setrefs.o subselect.o
  
  include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 34317fe..8424e04 100644
*** a/src/backend/optimizer/plan/analyzejoins.c
--- b/src/backend/optimizer/plan/analyzejoins.c
*** clause_sides_match_join(RestrictInfo *ri
*** 154,159 
--- 154,162 
   * cases, but we don't have the infrastructure to prove them.)  We also
   * have to check that the inner side doesn't generate any variables needed
   * above the join.
+  *
+  * Note: making this significantly smarter might break planunionor.c.
+  * Study that file before doing so.
   */
  static bool
  join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index f4e0a6e..1350417 100644
*** a/src/backend/optimizer/plan/planmain.c
--- b/src/backend/optimizer/plan/planmain.c
*** query_planner(PlannerInfo *root, List *t
*** 206,211 
--- 206,219 
  	add_placeholders_to_base_rels(root);
  
  	/*
+ 	 * Also, if we have a request to emit baserel CTIDs, add those to the
+ 	 * baserel targetlists now.  This likewise has to be done after join
+ 	 * removal, because we only want CTIDs for rels that survive join removal.
+ 	 */
+ 	if (root->needBaseTids)
+ 		add_base_rel_ctids(root);
+ 
+ 	/*
  	 * Construct the lateral reference sets now that we have finalized
  	 * PlaceHolderVar eval levels.
  	 */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 6b79b3a..5eb025f 100644
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*** subquery_planner(PlannerGlobal *glob, Qu
*** 528,533 
--- 528,534 
  	root->grouping_map = NULL;
  	root->minmax_aggs = NIL;
  	root->qual_security_level = 0;
+ 	root->needBaseTids = false;
  	root->hasInheritedTarget = false;
  	root->hasRecursion = hasRecursion;
  	if (hasRecursion)
*** grouping_planner(PlannerInfo *root, bool
*** 1584,1589 
--- 1585,1591 
  		WindowFuncLists *wflists = NULL;
  		List	   *activeWindows = NIL;
  		grouping_sets_data *gset_data = NULL;
+ 		List	   *union_or_subpaths;
  		standard_qp_extra qp_extra;
  
  		/* A recursive query should always have setOperations */
*** grouping_planner(PlannerInfo *root, bool
*** 1667,1672 
--- 1669,1682 
  			preprocess_minmax_aggregates(root, tlist);
  
  		/*
+ 		 * Preprocess join OR clauses that might be better handled as UNIONs.
+ 		 * This likewise needs to be close to the query_planner() call.  But
+ 		 * it doesn't matter which of preprocess_minmax_aggregates() and this
+ 		 * function we call first, because they treat disjoint sets of cases.
+ 		 */
+ 		union_or_subpaths = split_join_or_clauses(root);
+ 
+ 		/*
  		 * Figure out whether there's a hard limit on the number of rows that
  		 * query_planner's result subplan needs to return.  Even if we know a
  		 * hard limit overall, it doesn't apply if the query has any
*** grouping_planner(PlannerInfo *root, bool
*** 1701,1706 
--- 1711,1724 
  	standard_qp_callback, _extra);
  
  		/*
+ 		 * If we found any way to convert a join OR clause to a union, finish
+ 		 * up generating the path(s) for that, and add them into the topmost
+ 		 * scan/join relation.
+ 		 */
+ 		if (union_or_subpaths)
+ 			finish_union_or_paths(root, current_rel, union_or_subpaths);
+ 
+ 		/*
  		 * Convert the query's result tlist into 

Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2017-03-19 Thread Jim Nasby

On 3/19/17 2:32 PM, Tom Lane wrote:

Jim Nasby  writes:

Having thought about it, I share Tom's concerns. And now I'm worried
about what happens if there are multiple separate OR clauses. I guess
those would be handled by separate UNIONs?


As proposed, the patch would try to optimize by splitting each OR clause
independently, and would choose whichever way gave the best cost estimate.
I'm not sure it's possible to do better than that, and even if it is,
I think improving it could be left for later.


Agreed.


I'd also considered an approach of de-duping on the basis of all relation
ctids, while allowing a rel's ctid to be returned as NULL from a UNION arm
in which the rel was eliminated entirely.  But that doesn't fix it,
because in this example the first arm would return (a.ctid, NULL) while
the second arm would return (NULL, b.ctid), so that the UNION step would
still fail to detect any duplication.  To make it work, we'd have to not
eliminate the joins, which would pretty much defeat the usefulness of
the optimization for your original example case.


It might still be worth-while in some circumstances. In your example, if 
there were these indexes:


a__id ON a(id), a__x ON a(x)
b__id ON b(id), b__y ON b(y)

then it might be faster to nested loop a__x=42 to b__id=a.id and union 
that with b__y=43 nested to a__id=b.id.


That said, now isn't the time to be adding more complexity.


So full joins definitely break this whole optimization.  I think it's okay
with left joins though, because when starting from "a left join b" it will
never be possible to remove "a" so we'll always include a.ctid in the
UNION de-duping step.  If b is removed in some arm, then it must be true
that we get exactly one left-join output row per a row, regardless of the
contents of b, in that arm.  The argument for the patch being okay is
essentially that we must get exactly one left-join output row per a row,
regardless of the contents of b, in *every* arm, because the various
modified versions of the OR clause can't affect that conclusion.  In some
of the arms we might not remove b, and we might even be able to reduce the
left join to an inner join, but there should still be no more than one
join output row per a row.  That being the case, it should be sufficient
to de-dup using a.ctid while ignoring b.ctid.


The only case I can think of is: would it be possible (perhaps not 
today; maybe in the future) for other parts of the query to affect join 
elimination? I can't conceive of how that might happen, but if it did 
then it's possible that the elimination would work differently with the 
UNION than it would with an OR.


The comment on join_is_removable() does mention that there's other 
potentially interesting cases that we can't handle now; it's maybe worth 
mentioning


Other than that, I can't see any issues with your logic.


Any clearer yet?


Absolutely. I think it would be very valuable to include that with the 
initial comment in planunionor.c. Join reduction and removal is already 
tricky enough to wrap your head around.


Other comments:


+  * is retty mechanical, but we can't do it until we have a RelOptInfo for the


s/retty/pretty/


I suspect that in many systems single-table queries are far more common 
than CTEs, so maybe it's worth reversing those two tests in 
split_join_or_clauses().



For the Unique path, it would be nice if the code did what would be 
necessary to consider a TID hash join, since that means a user could 
create the appropriate operator and it would just be picked up. 
Certainly not worth much effort at this point though.



+   /*
+* Must not have any volatile functions in FROM or WHERE (see notes at
+* head of file).
+*/
+   if (contain_volatile_functions((Node *) parse->jointree))


Is there by chance anywhere else that needs to check that? Maybe worth 
adding the info to the Query struct if so.



+* We insist that all baserels used in the query be plain relations, so


Dumb question... views have already be decomposed at this point, right?

Perhaps less dumb question... what happens if the original query already 
had setOps? AIUI setOps work has already been done by the time 
split_join_or_clauses() happens; I just want to check that that's OK.


I'm not sure a GUC is worth it... I suspect that any query with multiple 
rels and an OR condition is going to be expensive enough that whatever 
additional plan time there is won't be noticeable.


I've verified that the patch still applies and make check-world is clean.
--
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG http://OpenSCG.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2017-03-19 Thread Tom Lane
I wrote:
> Consider
>   SELECT count(*)
>   FROM a FULL JOIN b ON (a.id = b.id)
>   WHERE (a.x = 42 OR b.y = 43);
> and suppose that a and b have mutual FK constraints guaranteeing that
> every non-null a.id value has exactly one match in b and vice versa.

Oh, that was sloppy of me.  Join removal depends on unique constraints
not FK constraints.  So actually, this example just requires assuming
that both a.id and b.id are known unique, which is much less far-fetched
than assuming circular FK constraints.

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2017-03-19 Thread Tom Lane
Jim Nasby  writes:
> On 3/16/17 11:54 AM, David Steele wrote:
>> On 2/14/17 4:03 PM, Tom Lane wrote:
>>> Well, the key point is whether it's really OK to de-dup on the basis
>>> of only the CTIDs that are not eliminated in any UNION arm.  I was
>>> feeling fairly good about that until I thought of the full-join-to-
>>> left-join-to-no-join conversion issue mentioned in the comment.

>> Jim, have you had time to think about this?  Any insights?

> Having thought about it, I share Tom's concerns. And now I'm worried 
> about what happens if there are multiple separate OR clauses. I guess 
> those would be handled by separate UNIONs?

As proposed, the patch would try to optimize by splitting each OR clause
independently, and would choose whichever way gave the best cost estimate.
I'm not sure it's possible to do better than that, and even if it is,
I think improving it could be left for later.

> Tom, could you expand the description some, especially with some examples?

Here's a counterexample --- admittedly rather artificial, but still a
counterexample --- to applying the optimization when there are FULL joins.
Consider

SELECT count(*)
FROM a FULL JOIN b ON (a.id = b.id)
WHERE (a.x = 42 OR b.y = 43);

and suppose that a and b have mutual FK constraints guaranteeing that
every non-null a.id value has exactly one match in b and vice versa.  (For
the purposes of this example, I think it doesn't matter whether or not we
allow there to be null id values.)  Also suppose that there are some join
rows satisfying both a.x = 42 and b.y = 43, while other join rows satisfy
only one of the OR arms.  The patch would try to implement this query as,
approximately,

SELECT count(*)
FROM
( SELECT FROM a FULL JOIN b ON (a.id = b.id) WHERE a.x = 42
  UNION-using-ctids
  SELECT FROM a FULL JOIN b ON (a.id = b.id) WHERE b.y = 43 )

Now in the first UNION arm, we'd observe that the strict a.x = 42
condition allows reduction of the join strength to LEFT JOIN, and then
we'd deduce from the FK on b that the join to b can be dropped altogether.
In the second arm, we'd similarly conclude that a can be dropped
altogether, leaving

SELECT count(*)
FROM
( SELECT FROM a WHERE a.x = 42
  UNION-using-ctids
  SELECT FROM b WHERE b.y = 43 )

But at this point there are no rels that are not eliminated in any UNION
arm, so that no de-duping would occur in the UNION, meaning that we'd
double-count any join rows in which both a.x = 42 and b.y = 43.

I'd also considered an approach of de-duping on the basis of all relation
ctids, while allowing a rel's ctid to be returned as NULL from a UNION arm
in which the rel was eliminated entirely.  But that doesn't fix it,
because in this example the first arm would return (a.ctid, NULL) while
the second arm would return (NULL, b.ctid), so that the UNION step would
still fail to detect any duplication.  To make it work, we'd have to not
eliminate the joins, which would pretty much defeat the usefulness of
the optimization for your original example case.

So full joins definitely break this whole optimization.  I think it's okay
with left joins though, because when starting from "a left join b" it will
never be possible to remove "a" so we'll always include a.ctid in the
UNION de-duping step.  If b is removed in some arm, then it must be true
that we get exactly one left-join output row per a row, regardless of the
contents of b, in that arm.  The argument for the patch being okay is
essentially that we must get exactly one left-join output row per a row,
regardless of the contents of b, in *every* arm, because the various
modified versions of the OR clause can't affect that conclusion.  In some
of the arms we might not remove b, and we might even be able to reduce the
left join to an inner join, but there should still be no more than one
join output row per a row.  That being the case, it should be sufficient
to de-dup using a.ctid while ignoring b.ctid.

Any clearer yet?

regards, tom lane


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2017-03-18 Thread Jim Nasby

On 3/16/17 11:54 AM, David Steele wrote:

On 2/14/17 4:03 PM, Tom Lane wrote:

Jim Nasby  writes:

On 2/14/17 1:18 PM, Tom Lane wrote:

One point that could use further review is whether the de-duplication
algorithm is actually correct.  I'm only about 95% convinced by the
argument I wrote in planunionor.c's header comment.



I'll put some thought into it and see if I can find any holes. Are you
only worried about the removal of "useless" rels or is there more?


Well, the key point is whether it's really OK to de-dup on the basis
of only the CTIDs that are not eliminated in any UNION arm.  I was
feeling fairly good about that until I thought of the full-join-to-
left-join-to-no-join conversion issue mentioned in the comment.
Now I'm wondering if there are other holes; or maybe I'm wrong about
that one and it's not necessary to be afraid of full joins.


This patch applies cleanly (with offsets) and compiles at cccbdde.

Jim, have you had time to think about this?  Any insights?


Having thought about it, I share Tom's concerns. And now I'm worried 
about what happens if there are multiple separate OR clauses. I guess 
those would be handled by separate UNIONs?


I'm also finding it a bit hard to follow the comment Tom mentioned. I'm 
pretty sure I understand what's going on until this part:



The identical proof can be expected to apply
+  * in other arms, except in an arm that references that rel in its version
+  * of the OR clause.  But in such an arm, we have effectively added a
+  * restriction clause to what is known in other arms, which means that the
+  * set of rows output by that rel can't increase compared to other arms.


AIUI, this is describing a case something like this:

SELECT child.blah FROM child LEFT JOIN parent USING(parent_id)
  WHERE child.foo AND (child.baz=1 or child.baz=2)

given that parent.parent_id is unique. Except for these concerns, there 
would need to be a complex OR somewhere in here that sometimes 
referenced parent and sometimes didn't, such as


  WHERE child.foo AND (child.baz=1 OR parent.foo=3)

But I'm not following the logic here (very possibly because I'm wrong 
about what I said above):



+  * Therefore the situation in such an arm must be that including the rel
+  * could result in either zero or one output row, rather than exactly one
+  * output row as in other arms.  So we still don't need to consider it for
+  * de-duplication.


I'm definitely not certain about the rest of it.

Tom, could you expand the description some, especially with some examples?
--
Jim Nasby, Chief Data Architect, Austin TX
OpenSCG http://OpenSCG.com


--
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers


[HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2017-03-16 Thread David Steele
On 2/14/17 4:03 PM, Tom Lane wrote:
> Jim Nasby  writes:
>> On 2/14/17 1:18 PM, Tom Lane wrote:
>>> One point that could use further review is whether the de-duplication
>>> algorithm is actually correct.  I'm only about 95% convinced by the
>>> argument I wrote in planunionor.c's header comment.
> 
>> I'll put some thought into it and see if I can find any holes. Are you 
>> only worried about the removal of "useless" rels or is there more?
> 
> Well, the key point is whether it's really OK to de-dup on the basis
> of only the CTIDs that are not eliminated in any UNION arm.  I was
> feeling fairly good about that until I thought of the full-join-to-
> left-join-to-no-join conversion issue mentioned in the comment.
> Now I'm wondering if there are other holes; or maybe I'm wrong about
> that one and it's not necessary to be afraid of full joins.

This patch applies cleanly (with offsets) and compiles at cccbdde.

Jim, have you had time to think about this?  Any insights?

-- 
-David
da...@pgmasters.net


-- 
Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org)
To make changes to your subscription:
http://www.postgresql.org/mailpref/pgsql-hackers