On Mon, Mar 18, 2019 at 8:09 AM Noah Misch <n...@leadboat.com> wrote:
> On Tue, Oct 02, 2018 at 10:53:40AM -0400, Tom Lane wrote: > > Noah Misch <n...@leadboat.com> writes: > > > On Mon, Oct 01, 2018 at 09:32:10PM -0400, Tom Lane wrote: > > >> FWIW, my problem with this patch is that I remain unconvinced of the > basic > > >> correctness of the transform (specifically the unique-ification > approach). > > >> Noah's points would be important to address if we were moving the > patch > > >> towards commit, but I don't see much reason to put effort into it > until > > >> we can think of a way to prove whether that works. > > > > > Not even effort to fix the assertion failures I reported? > > > > If it seemed relevant to the proof-of-correctness problem, I would look > > into it, but ... > I put some hours into theoretical study of the proof, and I didn't find any > holes. When the planner removes "l0 LEFT JOIN r1", it does that because > there's one output row per l0.ctid regardless of the rows of r1. Hence, > deduplicating on (l0.ctid) is equivalent to deduplicating on (l0.ctid, > r1.ctid). In the bad FULL JOIN case, (l0.ctid, r1.ctid) would be > sufficient > as a key, but we're out of luck because removing the join makes us have > only > l0.ctid for some tuples and only r1.ctid for other tuples. > > If PostgreSQL ever gets inner join removal, it would be harder to preserve > this optimization in this form. At that point, perhaps we'd cost the path > that retains the join for the benefit of $SUBJECT. Given the performance > test > results, $SUBJECT may already need a cost-based decision on whether to use > it. > > As for the proof-of-correctness problem, I think the below strategy should be easy to understand. SELECT * FROM any_v WHERE (A OR B OR C). => SELECT * FROM any_v WHERE A UNION ALL SELECT * FROM any_v WHERE B AND !A UNION ALL SELECT * FROM any_v WHERE C AND !A AND !(B AND !A); where !(Expr) means (NOT (expr) OR (EXPR) IS NULL) In this way, there is no duplicated data at the first. Oracle uses a similar way like this[1]. - A shared planner info idea. About the planning time case, suppose we have a query below. SELECT * FROM t1, t2, ... t20 WHERE (t1.a = 1 or t2.a = 1) and xxx; With the current strategy, t3 .. t20 would be planned many times, including base relation like (t3.. t20) and joinrel like (t{3, 4}, t(3,4,5}, t(3,4,5,6}) and so on. I guess all these relations would get the same result at every time. we don't need to do that at every section of the Union ALL case. So an idea of PlannerInfo A can access the planning result of PlannerInfo B comes, this idea doesn't come in my mind for the first time, many cost based transformations may need this. Then I have the following POC in my mind. PlannerInfo { PlannerInfo *shared_planner_info; Relids or_relids; // this field can be improved later. }; With this way, if we find a 'relids' is not related to or_relids. We can grab the planning result from shared_planner_info (in this or-union case, we can set that after we plan the first section of UNION). Then I found this one would not work after planning time partition prune since the relid would change. For example: P (P1, P2) PARTITION BY A Q (Q1, Q2) PARTITION BY A SELECT * FROM t, p, q where (t.a = 1 or p.a = 1). So in the un-transformed case, Q2 would have relid = 7. After the transform, Q2 probably has relid = 6. So basically, I have 4 questions now. 1. Would investing on the shared_planner_info idea be a good idea? 2. Without planning time prune, shall the above design work? 3. Currently I want to use relid to identify a resultset and pathlists which have the planning time prune issue, should we consider other methods? 4. Any other suggestion about to resolve the 'duplicated planning case' besides the shared_planner_info method? [1] https://blogs.oracle.com/optimizer/optimizer-transformations:-or-expansion -- Best Regards Andy Fan (https://www.aliyun.com/)