On 3 February 2018 at 03:26, Tom Lane <t...@sss.pgh.pa.us> wrote: > Tomas Vondra <tomas.von...@2ndquadrant.com> writes: >> ISTM this patch got somewhat stuck as we're not quite sure the >> transformation is correct in all cases. Is my impression correct? > > Yeah, that's the core issue. > >> If yes, how to we convince ourselves? Would some sort of automated testing >> (generating data and queries) help? I'm willing to spend some cycles on >> that, if considered helpful. > > I'm not sure if that would be enough to convince doubters. On the other > hand, if it found problems, that would definitely be useful.
I've read over this thread and as far as I can see there is concern that the UNION on the ctids might not re-uniquify the rows again. Tom mentioned a problem with FULL JOINs, but the concern appears to have been invalidated due to wrong thinking about how join removals work. As of now, I'm not quite sure who exactly is concerned. Tom thought there was an issue but quickly corrected himself. As far as I see it, there are a few cases we'd need to disable the optimization: 1. Query contains target SRFs (the same row might get duplicated, we don't want to UNION these duplicates out again, they're meant to be there) 2. Query has setops (ditto) 3. Any base rels are not RELKIND_RELATION (we need the ctid to uniquely identify rows) 4. Query has volatile functions (don't want to evaluate volatile functions more times than requested) As far as the DISTINCT clause doing the right thing for joins, I see no issues, even with FULL JOINs. In both branches of the UNION the join condition will be the same so each side of the join always has the same candidate row to join to. I don't think the optimization is possible if there are OR clauses in the join condition, but that's not being proposed. FULL JOINS appear to be fine as the row is never duplicated on a non-match, so there will only be one version of t1.ctid, NULL::tid or NULL::tid, t1.ctid and all ctids in the distinctClauses cannot all be NULL at once. I used the following SQL to help my brain think through this. There are two versions of each query, one with DISTINCT and one without. If the DISTINCT returns fewer rows than the one without then we need to disable this optimization for that case. I've written queries for 3 of the above 4 cases. I saw from reading the thread that case #4 is already disabled: drop table if exists t1,t2; create table t1 (a int); create table t2 (a int); insert into t1 values(1),(1),(2),(4); insert into t2 values(1),(1),(3),(3),(4),(4); select t1.ctid,t2.ctid from t1 full join t2 on t1.a = t2.a; select distinct t1.ctid,t2.ctid from t1 full join t2 on t1.a = t2.a; -- case 1: must disable in face of tSRFs select ctid from (select ctid,generate_Series(1,2) from t1) t; select distinct ctid from (select ctid,generate_Series(1,2) from t1) t; -- case 2: must disable optimization with setops. select ctid from (select ctid from t1 union all select ctid from t1) t; select distinct ctid from (select ctid from t1 union all select ctid from t1) t; -- case 3: must disable if we join to anything other than a RELKIND_RELATION (no ctid) select ctid from (select t1.ctid from t1 inner join (values(1),(1)) x(x) on t1.a = x.x) t; select distinct ctid from (select t1.ctid from t1 inner join (values(1),(1)) x(x) on t1.a = x.x) t; I've not read the patch yet, but I understand what it's trying to achieve. My feelings about the patch are that it would be useful to have. I think if someone needs this then they'll be very happythat we've added it. I also think there should be a GUC to disable it, and it should be enabled through the entire alpha/beta period, and we should consider what the final value for it should be just before RC1. It's a bit sad to exclude foreign tables, and I'm not too sure what hurdles this leaves out for pluggable storage. No doubt we'll need to disable the optimization for those too unless they can provide us with some row identifier. -- David Rowley http://www.2ndQuadrant.com/ PostgreSQL Development, 24x7 Support, Training & Services