Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2021-05-15 Thread Andy Fan
On Mon, Mar 18, 2019 at 8:09 AM Noah Misch  wrote:

> On Tue, Oct 02, 2018 at 10:53:40AM -0400, Tom Lane wrote:
> > Noah Misch  writes:
> > > On Mon, Oct 01, 2018 at 09:32:10PM -0400, Tom Lane wrote:
> > >> FWIW, my problem with this patch is that I remain unconvinced of the
> basic
> > >> correctness of the transform (specifically the unique-ification
> approach).
> > >> Noah's points would be important to address if we were moving the
> patch
> > >> towards commit, but I don't see much reason to put effort into it
> until
> > >> we can think of a way to prove whether that works.
> >
> > > Not even effort to fix the assertion failures I reported?
> >
> > If it seemed relevant to the proof-of-correctness problem, I would look
> > into it, but ...
>
I put some hours into theoretical study of the proof, and I didn't find any
> holes.  When the planner removes "l0 LEFT JOIN r1", it does that because
> there's one output row per l0.ctid regardless of the rows of r1.  Hence,
> deduplicating on (l0.ctid) is equivalent to deduplicating on (l0.ctid,
> r1.ctid).  In the bad FULL JOIN case, (l0.ctid, r1.ctid) would be
> sufficient
> as a key, but we're out of luck because removing the join makes us have
> only
> l0.ctid for some tuples and only r1.ctid for other tuples.
>
> If PostgreSQL ever gets inner join removal, it would be harder to preserve
> this optimization in this form.  At that point, perhaps we'd cost the path
> that retains the join for the benefit of $SUBJECT.  Given the performance
> test
> results, $SUBJECT may already need a cost-based decision on whether to use
> it.
>
>
As for the proof-of-correctness problem, I think the below strategy should
be
easy to understand.

SELECT * FROM any_v WHERE (A OR B OR C).
=>
SELECT * FROM any_v WHERE A
UNION ALL
SELECT * FROM any_v WHERE B AND !A
UNION ALL
SELECT * FROM any_v WHERE C AND !A AND !(B AND !A);
where !(Expr) means  (NOT (expr) OR (EXPR) IS NULL)
In this way, there is no duplicated data at the first. Oracle uses a similar
way like this[1].

- A shared planner info idea.

About the planning time case, suppose we have a query below.

SELECT * FROM t1, t2, ... t20  WHERE (t1.a = 1 or t2.a = 1) and xxx;

With the current strategy, t3 .. t20 would be planned many times, including
base
relation like (t3.. t20) and joinrel like (t{3, 4}, t(3,4,5}, t(3,4,5,6})
and
so on.  I guess all these relations would get the same result at every
time. we
don't need to do that at every section of the Union ALL case.  So an idea of
PlannerInfo A can access the planning result of PlannerInfo B comes, this
idea
doesn't come in my mind for the first time,  many cost based
transformations may
need this.


Then I have the following POC in my mind.

PlannerInfo
{

  PlannerInfo *shared_planner_info;
  Relids  or_relids; // this field can be improved later.
};

With this way, if we find a 'relids' is not related to or_relids. We can
grab
the planning result from shared_planner_info (in this or-union case, we can
set
that after we plan the first section of UNION).


Then I found this one would not work after planning time partition prune
since
the relid would change. For example:

P (P1, P2) PARTITION BY A
Q (Q1, Q2) PARTITION BY A

SELECT * FROM t, p, q where (t.a = 1 or p.a = 1).  So in the un-transformed
case, Q2 would have relid = 7.  After the transform, Q2 probably has relid
= 6.

So basically, I have 4 questions now.
1. Would investing on the shared_planner_info idea be a good idea?
2. Without planning time prune, shall the above design work?
3. Currently I want to use relid to identify a resultset and pathlists which
   have the planning time prune issue, should we consider other methods?
4. Any other suggestion about to resolve the 'duplicated planning case'
besides
   the shared_planner_info method?

[1]
https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion

-- 
Best Regards
Andy Fan (https://www.aliyun.com/)


Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2019-03-17 Thread Noah Misch
On Tue, Oct 02, 2018 at 10:53:40AM -0400, Tom Lane wrote:
> Noah Misch  writes:
> > On Mon, Oct 01, 2018 at 09:32:10PM -0400, Tom Lane wrote:
> >> FWIW, my problem with this patch is that I remain unconvinced of the basic
> >> correctness of the transform (specifically the unique-ification approach).
> >> Noah's points would be important to address if we were moving the patch
> >> towards commit, but I don't see much reason to put effort into it until
> >> we can think of a way to prove whether that works.
> 
> > Not even effort to fix the assertion failures I reported?
> 
> If it seemed relevant to the proof-of-correctness problem, I would look
> into it, but ...

I put some hours into theoretical study of the proof, and I didn't find any
holes.  When the planner removes "l0 LEFT JOIN r1", it does that because
there's one output row per l0.ctid regardless of the rows of r1.  Hence,
deduplicating on (l0.ctid) is equivalent to deduplicating on (l0.ctid,
r1.ctid).  In the bad FULL JOIN case, (l0.ctid, r1.ctid) would be sufficient
as a key, but we're out of luck because removing the join makes us have only
l0.ctid for some tuples and only r1.ctid for other tuples.

If PostgreSQL ever gets inner join removal, it would be harder to preserve
this optimization in this form.  At that point, perhaps we'd cost the path
that retains the join for the benefit of $SUBJECT.  Given the performance test
results, $SUBJECT may already need a cost-based decision on whether to use it.



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-10-02 Thread Tom Lane
Noah Misch  writes:
> On Mon, Oct 01, 2018 at 09:32:10PM -0400, Tom Lane wrote:
>> FWIW, my problem with this patch is that I remain unconvinced of the basic
>> correctness of the transform (specifically the unique-ification approach).
>> Noah's points would be important to address if we were moving the patch
>> towards commit, but I don't see much reason to put effort into it until
>> we can think of a way to prove whether that works.

> Not even effort to fix the assertion failures I reported?

If it seemed relevant to the proof-of-correctness problem, I would look
into it, but ...

regards, tom lane



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-10-02 Thread Noah Misch
On Mon, Oct 01, 2018 at 09:32:10PM -0400, Tom Lane wrote:
> Michael Paquier  writes:
> > On Mon, Sep 03, 2018 at 06:59:10PM -0700, Noah Misch wrote:
> >> If you're going to keep this highly-simplified estimate, please expand the
> >> comment to say why it doesn't matter or what makes it hard to do better.  
> >> The
> >> non-planunionor.c path for the same query computes its own estimate of the
> >> same underlying quantity.  Though it may be too difficult in today's 
> >> system,
> >> one could copy the competing path's row count estimate here.  Perhaps it
> >> doesn't matter because higher-level processing already assumes equal row 
> >> count
> >> among competitors?
> 
> > As there has been no replies to Noah's review for one month, I am
> > marking this patch as returned with feedback for now.
> 
> FWIW, my problem with this patch is that I remain unconvinced of the basic
> correctness of the transform (specifically the unique-ification approach).
> Noah's points would be important to address if we were moving the patch
> towards commit, but I don't see much reason to put effort into it until
> we can think of a way to prove whether that works.

Not even effort to fix the assertion failures I reported?



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-10-01 Thread Tom Lane
Michael Paquier  writes:
> On Mon, Sep 03, 2018 at 06:59:10PM -0700, Noah Misch wrote:
>> If you're going to keep this highly-simplified estimate, please expand the
>> comment to say why it doesn't matter or what makes it hard to do better.  The
>> non-planunionor.c path for the same query computes its own estimate of the
>> same underlying quantity.  Though it may be too difficult in today's system,
>> one could copy the competing path's row count estimate here.  Perhaps it
>> doesn't matter because higher-level processing already assumes equal row 
>> count
>> among competitors?

> As there has been no replies to Noah's review for one month, I am
> marking this patch as returned with feedback for now.

FWIW, my problem with this patch is that I remain unconvinced of the basic
correctness of the transform (specifically the unique-ification approach).
Noah's points would be important to address if we were moving the patch
towards commit, but I don't see much reason to put effort into it until
we can think of a way to prove whether that works.

regards, tom lane



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-10-01 Thread Michael Paquier
On Mon, Sep 03, 2018 at 06:59:10PM -0700, Noah Misch wrote:
> If you're going to keep this highly-simplified estimate, please expand the
> comment to say why it doesn't matter or what makes it hard to do better.  The
> non-planunionor.c path for the same query computes its own estimate of the
> same underlying quantity.  Though it may be too difficult in today's system,
> one could copy the competing path's row count estimate here.  Perhaps it
> doesn't matter because higher-level processing already assumes equal row count
> among competitors?

As there has been no replies to Noah's review for one month, I am
marking this patch as returned with feedback for now.
--
Michael


signature.asc
Description: PGP signature


Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-09-03 Thread Noah Misch
On Sun, Feb 12, 2017 at 09:32:36AM -0800, Tom Lane wrote:
> It's not so much poor choices as the cost of the optimization attempt ---
> if there's a K-relation OR clause, this will increase the cost of planning
> by a factor approaching K+1, whether or not you get a better plan out of
> it.  I ran the regression tests with some instrumentation and determined
> that this logic fires a dozen or two times, and fails to produce a plan
> that looks cheaper than the standard plan in any of those cases.  So if we
> go down this road, not only do we need a GUC but I suspect it had better
> default to off; only people using star schemas are really likely to get a
> win out of it.

I benchmarked using the attached script.  Highlights:

$ perl mk-bench-union-or.pl --dimensions=11 --tests=const | psql -X
# TRAP: FailedAssertion("!(uniq_exprs != ((List *) ((void *)0)))", File: 
"planunionor.c", Line: 335)

$ perl mk-bench-union-or.pl --dimensions=11 --tests=var --exhaustive 
--values-per-dimension=0 | psql -X
# TRAP: FailedAssertion("!(list_length(dest_tlist) == list_length(src_tlist))", 
File: "tlist.c", Line: 344)

$ perl mk-bench-union-or.pl --dimensions=7 --or-clauses=20 --exhaustive 
--tests=const,rowmark | psql -X
... (with planunionor.c plan)
 Planning Time: 1902.013 ms
 Execution Time: 291.629 ms
... (without planunionor.c plan)
 Planning Time: 11.100 ms
 Execution Time: 64.452 ms
# planning 170x slower, chosen plan 4.5x slower

I agree this warrants a GUC, but I propose a goal of making it fitting to
enable by default.  The is_suitable_join_or_clause() test is a good indicator
of promising queries, and I suspect it's cheap enough to run all the time.  As
a DBA, I'd struggle to predict when is_suitable_join_or_clause() will pass
while the optimization as a whole will lose; I would resort to testing each
important query both ways.  In other words, the GUC would boil down to a
planner hint, not to a declaration about the table schema.

On Thu, Aug 23, 2018 at 02:10:46PM -0400, Tom Lane wrote:
> Rebased up to HEAD, per cfbot nagging.  Still no substantive change from
> v2.

> +  * is retty mechanical, but we can't do it until we have a RelOptInfo for 
> the

Jim Nasby had pointed out this s/retty/pretty/ typo.

> + void
> + finish_union_or_paths(PlannerInfo *root, RelOptInfo *joinrel,
> +   List *union_or_subpaths)
> + {
...
> + pathnode = makeNode(UniquePath);
...
> + /* Estimate number of output rows */
> + pathnode->path.rows = appendpath->rows;

If you're going to keep this highly-simplified estimate, please expand the
comment to say why it doesn't matter or what makes it hard to do better.  The
non-planunionor.c path for the same query computes its own estimate of the
same underlying quantity.  Though it may be too difficult in today's system,
one could copy the competing path's row count estimate here.  Perhaps it
doesn't matter because higher-level processing already assumes equal row count
among competitors?


mk-bench-union-or.pl
Description: Perl program


Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-08-23 Thread Peter Geoghegan
On Thu, Aug 23, 2018 at 5:20 PM, Peter Geoghegan  wrote:
> This patch adds an enhancement that is an example of a broader class
> of optimizer enhancement primarily aimed at making star-schema queries
> have more efficient plans, by arranging to use several independent
> nested loop joins based on a common pattern. Each nestloop join has
> one particular dimension table on the outer side, and the fact table
> on the inner side.

Correction: I meant that for each join, the outer side is a scan of
some particular fact table index, while the inner side probes some
particular dimension table's primary key, and evaluates the
dimension's qual.

-- 
Peter Geoghegan



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-08-23 Thread Peter Geoghegan
On Thu, Aug 23, 2018 at 11:10 AM, Tom Lane  wrote:
> Rebased up to HEAD, per cfbot nagging.  Still no substantive change from
> v2.

I happened to have the opportunity to talk to Tom about this patch in
person. I expressed some very general concerns that are worth
repeating publicly.

This patch adds an enhancement that is an example of a broader class
of optimizer enhancement primarily aimed at making star-schema queries
have more efficient plans, by arranging to use several independent
nested loop joins based on a common pattern. Each nestloop join has
one particular dimension table on the outer side, and the fact table
on the inner side. The query plan is not so much a tree as it is a DAG
(directed acyclic graph), because the fact table is visited multiple
times. (There are already cases in Postgres in which the query plan is
technically a DAG, actually, but it could be taken much further.)

Aside from being inherently more efficient, DAG-like star schema plans
are also *ideal* targets for parallel query. The executor can execute
each nested loop join in a parallel worker with minimal contention --
the inner side of each nestloop join all probe a different fact table
index to the others. It's almost like executing several different
simple queries concurrently, with some serial processing at the end.
Even that serial processing can sometimes be minimized by having some
of the parallel workers use a Bloom filter in shared memory.

Tom is already concerned that the optimization added by this patch may
be too much of a special case, which is understandable. It may be that
we're failing to identify some greater opportunity to add DAG-like
plans for star schema queries.

-- 
Peter Geoghegan



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-08-23 Thread Tom Lane
I wrote:
> [ join-or-to-union-4.patch ]

Rebased up to HEAD, per cfbot nagging.  Still no substantive change from
v2.

regards, tom lane

diff --git a/src/backend/nodes/outfuncs.c b/src/backend/nodes/outfuncs.c
index 6269f47..8935503 100644
*** a/src/backend/nodes/outfuncs.c
--- b/src/backend/nodes/outfuncs.c
*** _outPlannerInfo(StringInfo str, const Pl
*** 2310,2315 
--- 2310,2316 
  	WRITE_FLOAT_FIELD(limit_tuples, "%.0f");
  	WRITE_UINT_FIELD(qual_security_level);
  	WRITE_ENUM_FIELD(inhTargetKind, InheritanceKind);
+ 	WRITE_BOOL_FIELD(needBaseTids);
  	WRITE_BOOL_FIELD(hasJoinRTEs);
  	WRITE_BOOL_FIELD(hasLateralRTEs);
  	WRITE_BOOL_FIELD(hasDeletedRTEs);
diff --git a/src/backend/optimizer/plan/Makefile b/src/backend/optimizer/plan/Makefile
index 88a9f7f..1db6bd5 100644
*** a/src/backend/optimizer/plan/Makefile
--- b/src/backend/optimizer/plan/Makefile
*** top_builddir = ../../../..
*** 13,18 
  include $(top_builddir)/src/Makefile.global
  
  OBJS = analyzejoins.o createplan.o initsplan.o planagg.o planmain.o planner.o \
! 	setrefs.o subselect.o
  
  include $(top_srcdir)/src/backend/common.mk
--- 13,18 
  include $(top_builddir)/src/Makefile.global
  
  OBJS = analyzejoins.o createplan.o initsplan.o planagg.o planmain.o planner.o \
! 	planunionor.o setrefs.o subselect.o
  
  include $(top_srcdir)/src/backend/common.mk
diff --git a/src/backend/optimizer/plan/analyzejoins.c b/src/backend/optimizer/plan/analyzejoins.c
index 0e73f9c..fdffb6b 100644
*** a/src/backend/optimizer/plan/analyzejoins.c
--- b/src/backend/optimizer/plan/analyzejoins.c
*** clause_sides_match_join(RestrictInfo *ri
*** 155,160 
--- 155,163 
   * cases, but we don't have the infrastructure to prove them.)  We also
   * have to check that the inner side doesn't generate any variables needed
   * above the join.
+  *
+  * Note: making this significantly smarter might break planunionor.c.
+  * Study that file before doing so.
   */
  static bool
  join_is_removable(PlannerInfo *root, SpecialJoinInfo *sjinfo)
diff --git a/src/backend/optimizer/plan/planmain.c b/src/backend/optimizer/plan/planmain.c
index b05adc7..4486885 100644
*** a/src/backend/optimizer/plan/planmain.c
--- b/src/backend/optimizer/plan/planmain.c
*** query_planner(PlannerInfo *root, List *t
*** 212,217 
--- 212,225 
  	add_placeholders_to_base_rels(root);
  
  	/*
+ 	 * Also, if we have a request to emit baserel CTIDs, add those to the
+ 	 * baserel targetlists now.  This likewise has to be done after join
+ 	 * removal, because we only want CTIDs for rels that survive join removal.
+ 	 */
+ 	if (root->needBaseTids)
+ 		add_base_rel_ctids(root);
+ 
+ 	/*
  	 * Construct the lateral reference sets now that we have finalized
  	 * PlaceHolderVar eval levels.
  	 */
diff --git a/src/backend/optimizer/plan/planner.c b/src/backend/optimizer/plan/planner.c
index 96bf060..36907a4 100644
*** a/src/backend/optimizer/plan/planner.c
--- b/src/backend/optimizer/plan/planner.c
*** subquery_planner(PlannerGlobal *glob, Qu
*** 623,628 
--- 623,629 
  	root->minmax_aggs = NIL;
  	root->qual_security_level = 0;
  	root->inhTargetKind = INHKIND_NONE;
+ 	root->needBaseTids = false;
  	root->hasRecursion = hasRecursion;
  	if (hasRecursion)
  		root->wt_param_id = SS_assign_special_param(root);
*** grouping_planner(PlannerInfo *root, bool
*** 1797,1802 
--- 1798,1804 
  		WindowFuncLists *wflists = NULL;
  		List	   *activeWindows = NIL;
  		grouping_sets_data *gset_data = NULL;
+ 		List	   *union_or_subpaths;
  		standard_qp_extra qp_extra;
  
  		/* A recursive query should always have setOperations */
*** grouping_planner(PlannerInfo *root, bool
*** 1874,1879 
--- 1876,1889 
  			preprocess_minmax_aggregates(root, tlist);
  
  		/*
+ 		 * Preprocess join OR clauses that might be better handled as UNIONs.
+ 		 * This likewise needs to be close to the query_planner() call.  But
+ 		 * it doesn't matter which of preprocess_minmax_aggregates() and this
+ 		 * function we call first, because they treat disjoint sets of cases.
+ 		 */
+ 		union_or_subpaths = split_join_or_clauses(root);
+ 
+ 		/*
  		 * Figure out whether there's a hard limit on the number of rows that
  		 * query_planner's result subplan needs to return.  Even if we know a
  		 * hard limit overall, it doesn't apply if the query has any
*** grouping_planner(PlannerInfo *root, bool
*** 1908,1913 
--- 1918,1931 
  	standard_qp_callback, _extra);
  
  		/*
+ 		 * If we found any way to convert a join OR clause to a union, finish
+ 		 * up generating the path(s) for that, and add them into the topmost
+ 		 * scan/join relation.
+ 		 */
+ 		if (union_or_subpaths)
+ 			finish_union_or_paths(root, current_rel, union_or_subpaths);
+ 
+ 		/*
  		 * Convert the query's result tlist into PathTarget format.
  		 *
  		 * Note: it's desirable to 

Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-04-01 Thread David Rowley
On 30 March 2018 at 15:05, Andres Freund  wrote:
>> + * To allow join removal to happen, we can't reference the CTID column
>> + * of an otherwise-removable relation.
>
> A brief hint why wouldn't hurt.

Maybe something like:

/*
 * Join removal is only ever possible when no columns of the
to-be-removed relation
 * are referenced.  If we added the CTID here then we could
inadvertently disable join
 * removal.  We'll need to delay adding the CTID until after join
removal takes place.
 */

(I've not read the code, just the thread)

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-04-01 Thread David Rowley
On 3 February 2018 at 03:26, Tom Lane  wrote:
> Tomas Vondra  writes:
>> ISTM this patch got somewhat stuck as we're not quite sure the
>> transformation is correct in all cases. Is my impression correct?
>
> Yeah, that's the core issue.
>
>> If yes, how to we convince ourselves? Would some sort of automated testing
>> (generating data and queries) help? I'm willing to spend some cycles on
>> that, if considered helpful.
>
> I'm not sure if that would be enough to convince doubters.  On the other
> hand, if it found problems, that would definitely be useful.

I've read over this thread and as far as I can see there is concern
that the UNION on the ctids might not re-uniquify the rows again. Tom
mentioned a problem with FULL JOINs, but the concern appears to have
been invalidated due to wrong thinking about how join removals work.

As of now, I'm not quite sure who exactly is concerned. Tom thought
there was an issue but quickly corrected himself.

As far as I see it, there are a few cases we'd need to disable the optimization:

1. Query contains target SRFs (the same row might get duplicated, we
don't want to UNION these duplicates out again, they're meant to be
there)
2. Query has setops (ditto)
3. Any base rels are not RELKIND_RELATION (we need the ctid to
uniquely identify rows)
4. Query has volatile functions (don't want to evaluate volatile
functions more times than requested)

As far as the DISTINCT clause doing the right thing for joins, I see
no issues, even with FULL JOINs. In both branches of the UNION the
join condition will be the same so each side of the join always has
the same candidate row to join to.  I don't think the optimization is
possible if there are OR clauses in the join condition, but that's not
being proposed.

FULL JOINS appear to be fine as the row is never duplicated on a
non-match, so there will only be one version of t1.ctid, NULL::tid or
NULL::tid, t1.ctid and all ctids in the distinctClauses cannot all be
NULL at once.

I used the following SQL to help my brain think through this. There
are two versions of each query, one with DISTINCT and one without. If
the DISTINCT returns fewer rows than the one without then we need to
disable this optimization for that case. I've written queries for 3 of
the above 4 cases. I saw from reading the thread that case #4 is
already disabled:

drop table if exists t1,t2;
create table t1 (a int);
create table t2 (a int);

insert into t1 values(1),(1),(2),(4);
insert into t2 values(1),(1),(3),(3),(4),(4);

select t1.ctid,t2.ctid from t1 full join t2 on t1.a = t2.a;

select distinct t1.ctid,t2.ctid from t1 full join t2 on t1.a = t2.a;

-- case 1: must disable in face of tSRFs

select ctid from (select ctid,generate_Series(1,2) from t1) t;

select distinct ctid from (select ctid,generate_Series(1,2) from t1) t;

-- case 2: must disable optimization with setops.

select ctid from (select ctid from t1 union all select ctid from t1) t;

select distinct ctid from (select ctid from t1 union all select ctid from t1) t;

-- case 3: must disable if we join to anything other than a
RELKIND_RELATION (no ctid)

select ctid from (select t1.ctid from t1 inner join (values(1),(1))
x(x) on t1.a = x.x) t;

select distinct ctid from (select t1.ctid from t1 inner join
(values(1),(1)) x(x) on t1.a = x.x) t;

I've not read the patch yet, but I understand what it's trying to
achieve. My feelings about the patch are that it would be useful to
have. I think if someone needs this then they'll be very happythat
we've added it. I also think there should be a GUC to disable it, and
it should be enabled through the entire alpha/beta period, and we
should consider what the final value for it should be just before RC1.
It's a bit sad to exclude foreign tables, and I'm not too sure what
hurdles this leaves out for pluggable storage. No doubt we'll need to
disable the optimization for those too unless they can provide us with
some row identifier.

-- 
 David Rowley   http://www.2ndQuadrant.com/
 PostgreSQL Development, 24x7 Support, Training & Services



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-03-29 Thread Andres Freund
Hi,

I've only skimmed the thread, looking at the patch on its own.


On 2018-01-04 17:50:48 -0500, Tom Lane wrote:
> diff --git a/src/backend/optimizer/plan/plaindex ...dd11e72 .
> --- a/src/backend/optimizer/plan/planunionor.c
> +++ b/src/backend/optimizer/plan/planunionor.c
> @@ -0,0 +1,667 @@
> +/*-
> + *
> + * planunionor.c
> + * Consider whether join OR clauses can be converted to UNION queries.
> + *
> + * The current implementation of the UNION step is to de-duplicate using
> + * row CTIDs.

Could we skip using the ctid if there's a DISTINCT (or something to that
effect) above? We do not need to avoid removing rows that are identical
if that's done anyway.


> A big limitation is that this only works on plain relations,
> + * and not for instance on foreign tables.  Another problem is that we can
> + * only de-duplicate by sort/unique, not hashing; but that could be fixed
> + * if we write a hash opclass for TID.

I wonder if an alternative could be some sort of rowid that we invent.
It'd not be that hard to introduce an executor node (or do it in
projection) that simply counts row and returns that as a
column. Together with e.g. range table id that'd be unique. But for that
we would need to guarantee that foreign tables / subqueries /
... returned the same result in two scans.  We could do so by pushing
the data gathering into a CTE, but that'd make this exercise moot.

Why can't we ask at least FDWs to return something ctid like?


> + * To allow join removal to happen, we can't reference the CTID column
> + * of an otherwise-removable relation.

A brief hint why wouldn't hurt.

> +/*
> + * Is query as a whole safe to apply union OR transformation to?
> + * This checks relatively-expensive conditions that we don't want to
> + * worry about until we've found a candidate OR clause.
> + */
> +static bool
> +is_query_safe_for_union_or_transform(PlannerInfo *root)
> +{
> + Query  *parse = root->parse;
> + Relids  allbaserels;
> + ListCell   *lc;
> + int relid;
> +
> + /*
> +  * Must not have any volatile functions in FROM or WHERE (see notes at
> +  * head of file).
> +  */
> + if (contain_volatile_functions((Node *) parse->jointree))
> + return false;

Hm, are there any SRFs that could be in relevant places? I think we
reject them everywhere were they'd be problematic (as targetlist is
processed above)?


Do you have any plans for this patch at this moment?

Greetings,

Andres Freund



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-02-24 Thread Peter Geoghegan
On Fri, Feb 2, 2018 at 4:53 PM, Tom Lane  wrote:
> Tomas Vondra  writes:
>> BTW wouldn't it be possible to derive "traditional" proof in relational
>> algebra, similarly to other transforms?
>
> Perhaps.  The patch depends on CTID, but you could probably model that
> as a primary key in a proof.

I'll remind you that commit b648b703 made the TID sort performed by
CREATE INDEX CONCURRENTLY over 3 times faster in cases where the sort
completes in memory. The simplest way to get a significant portion of
that benefit for your approach with sort+unique on CTID would be to
add sortsupport with abbreviated keys to the TID btree opclass.

-- 
Peter Geoghegan



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-02-02 Thread Tom Lane
Tomas Vondra  writes:
> BTW wouldn't it be possible to derive "traditional" proof in relational
> algebra, similarly to other transforms?

Perhaps.  The patch depends on CTID, but you could probably model that
as a primary key in a proof.

regards, tom lane



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-02-02 Thread Tomas Vondra


On 02/02/2018 03:26 PM, Tom Lane wrote:
> Tomas Vondra  writes:
>> ISTM this patch got somewhat stuck as we're not quite sure the
>> transformation is correct in all cases. Is my impression correct?
> 
> Yeah, that's the core issue.
> 
>> If yes, how to we convince ourselves? Would some sort of automated testing
>> (generating data and queries) help? I'm willing to spend some cycles on
>> that, if considered helpful.
> 
> I'm not sure if that would be enough to convince doubters.  On the other
> hand, if it found problems, that would definitely be useful.
> 

OK, I'll take a stab at this, then. I wasn't really suggesting it might
prove definitely prove the transformation is correct - clearly, it can
only find counter-examples. But I think it's worth it anyway.

BTW wouldn't it be possible to derive "traditional" proof in relational
algebra, similarly to other transforms?

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-02-02 Thread Tom Lane
Tomas Vondra  writes:
> ISTM this patch got somewhat stuck as we're not quite sure the
> transformation is correct in all cases. Is my impression correct?

Yeah, that's the core issue.

> If yes, how to we convince ourselves? Would some sort of automated testing
> (generating data and queries) help? I'm willing to spend some cycles on
> that, if considered helpful.

I'm not sure if that would be enough to convince doubters.  On the other
hand, if it found problems, that would definitely be useful.

regards, tom lane



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2018-02-02 Thread Tomas Vondra
On 01/04/2018 11:50 PM, Tom Lane wrote:
> I wrote:
>> Jim Nasby  writes:
>>> I've verified that the patch still applies and make check-world is clean.
> 
>> Not any more :-(.  Here's a v3 rebased over HEAD.  No substantive
>> change from v2.
> 
> Again rebased up to HEAD; still no substantive changes.
> 

ISTM this patch got somewhat stuck as we're not quite sure the
transformation is correct in all cases. Is my impression correct? If
yes, how to we convince ourselves? Would some sort of automated testing
(generating data and queries) help? I'm willing to spend some cycles on
that, if considered helpful.

regards

-- 
Tomas Vondra  http://www.2ndQuadrant.com
PostgreSQL Development, 24x7 Support, Remote DBA, Training & Services



Re: [HACKERS] Re: Improve OR conditions on joined columns (common star schema problem)

2017-11-29 Thread Michael Paquier
On Tue, Sep 12, 2017 at 9:48 PM, Tom Lane  wrote:
> Jim Nasby  writes:
>> I've verified that the patch still applies and make check-world is clean.
>
> Not any more :-(.  Here's a v3 rebased over HEAD.  No substantive
> change from v2.

Patch applies and compiles, and it got no reviews. Moved to CF 2018-01.
-- 
Michael