On Thu, Apr 6, 2017 at 6:37 AM, Robert Haas <robertmh...@gmail.com> wrote: > On Wed, Apr 5, 2017 at 2:42 AM, Ashutosh Bapat > <ashutosh.ba...@enterprisedb.com> wrote: >> Only inner join conditions have equivalence classes associated with >> those. Outer join conditions create single element equivalence >> classes. So, we can not associate equivalence classes as they are with >> partition scheme. If we could do that, it makes life much easier since >> checking whether equi-join between all partition keys exist, is simply >> looking up equivalence classes that cover joining relations and find >> em_member corresponding to partition keys. > > OK. > >> It looks like we should only keep strategy, partnatts, partopfamily >> and parttypcoll in PartitionScheme. A partition-wise join between two >> relations would be possible if all those match. > > Yes, I think so. Conceivably you could even exclude partnatts and > strategy, since there's nothing preventing a partitionwise join > between a list-partitioned table and a range-partitioned table, or > between a table range-partitioned on (a) and another range-partitioned > on (a, b), but there is probably not much benefit in trying to cover > such cases. I think it's reasonable to tell users that this is only > going to work when the partitioning strategy is the same and the join > conditions include all of the partitioning columns on both sides. > >> There's a relevant comment in 0006, build_joinrel_partition_info() >> (probably that name needs to change, but I will do that once we have >> settled on design) >> + /* >> + * Construct partition keys for the join. >> + * >> + * An INNER join between two partitioned relations is partition by key >> + * expressions from both the relations. For tables A and B >> partitioned by a and b >> + * respectively, (A INNER JOIN B ON A.a = B.b) is partitioned by both A.a >> + * and B.b. >> + * >> + * An OUTER join like (A LEFT JOIN B ON A.a = B.b) may produce rows with >> + * B.b NULL. These rows may not fit the partitioning conditions imposed >> on >> + * B.b. Hence, strictly speaking, the join is not partitioned by B.b. >> + * Strictly speaking, partition keys of an OUTER join should include >> + * partition key expressions from the OUTER side only. Consider a join >> like >> + * (A LEFT JOIN B on (A.a = B.b) LEFT JOIN C ON B.b = C.c. If we do not >> + * include B.b as partition key expression for (AB), it prohibits us from >> + * using partition-wise join when joining (AB) with C as there is no >> + * equi-join between partition keys of joining relations. But two NULL >> + * values are never equal and no two rows from mis-matching partitions >> can >> + * join. Hence it's safe to include B.b as partition key expression for >> + * (AB), even though rows in (AB) are not strictly partitioned by B.b. >> + */ >> >> I think that also needs to be reviewed carefully. > > The following passage from src/backend/optimizer/README seems highly relevant: > > === > The planner's treatment of outer join reordering is based on the following > identities: > > 1. (A leftjoin B on (Pab)) innerjoin C on (Pac) > = (A innerjoin C on (Pac)) leftjoin B on (Pab) > > where Pac is a predicate referencing A and C, etc (in this case, clearly > Pac cannot reference B, or the transformation is nonsensical). > > 2. (A leftjoin B on (Pab)) leftjoin C on (Pac) > = (A leftjoin C on (Pac)) leftjoin B on (Pab) > > 3. (A leftjoin B on (Pab)) leftjoin C on (Pbc) > = A leftjoin (B leftjoin C on (Pbc)) on (Pab) > > Identity 3 only holds if predicate Pbc must fail for all-null B rows > (that is, Pbc is strict for at least one column of B). If Pbc is not > strict, the first form might produce some rows with nonnull C columns > where the second form would make those entries null. > === > > In other words, I think your statement that null is never equal to > null is a bit imprecise. Somebody could certainly create an operator > that is named "=" which returns true in that case, and then they could > say, hey, two nulls are equal (when you use that operator). The > argument needs to be made in terms of the formal properties of the > operator. The relevant logic is in have_partkey_equi_join: > > + /* Skip clauses which are not equality conditions. */ > + if (rinfo->hashjoinoperator == InvalidOid && > !rinfo->mergeopfamilies) > + continue; > > Actually, I think the hashjoinoperator test is formally and > practically unnecessary here; lower down there is a test to see > whether the partitioning scheme's operator family is a member of > rinfo->mergeopfamilies, which will certainly fail if we got through > this test with rinfo->mergeopfamilies == NIL just on the strength of > rinfo->hashjoinoperator != InvalidOid. So you can just bail out if > rinfo->mergeopfamilies == NIL. But the underlying point here is that > the only thing you really know about the function is that it's got to > be a strategy-3 operator in some btree opclass; if that guarantees > strictness, then so be it -- but I wasn't able to find anything in the > code or documentation off-hand that supports that contention, so we > might need to think a bit more about why (or if) this is guaranteed to > be true.
I need more time to think about this. Will get back to this soon. > >> Partition-wise joins >> may be happy including partition keys from all sides, but >> partition-wise aggregates may not be, esp. when pushing complete >> aggregation down to partitions. In that case, rows with NULL partition >> key, which falls on nullable side of join, will be spread across >> multiple partitions. Proabably, we should separate nullable and >> non-nullable partition key expressions. > > I don't think I understand quite what you're getting at here. Can you > spell this out in more detail? To push an aggregate down to > partitions, you need the grouping key to match the applicable > partition key, and the partition key shouldn't allow nulls in more > than one place. Now I think your point may be that outer join > semantics could let them creep in there, e.g. SELECT b.x, sum(a.y) > FROM a LEFT JOIN b ON a.x = b.x GROUP BY 1 -- which would indeed be a > good test case for partitionwise aggregate. I'd be inclined to think > that we should just give up on partitionwise aggregate in such cases; > it's not worth trying to optimize such a weird query, at least IMHO. > (Does this sort of case ever happen with joins? I think not, as long > as the join operator is strict.) Yes, this is the case, I am thinking about. No, it doesn't happen with join. > > I spent some time thinking about this patch set today and I don't see > that there's much point in committing any more of this to v10. I > think that 0001 and 0002 are probably committable or very close at > this point. However, 0001 is adding more complexity than I think is > warranted until we're actually ready to commit the feature that uses > it, and 0002 is so small that committing isn't really going to smooth > future development much. 0003-0009 are essentially all one big patch > that will have to be committed together. Ok. Thanks. -- Best Wishes, Ashutosh Bapat EnterpriseDB Corporation The Postgres Database Company -- Sent via pgsql-hackers mailing list (pgsql-hackers@postgresql.org) To make changes to your subscription: http://www.postgresql.org/mailpref/pgsql-hackers