This concept is called semi-join reduction (SJR); this paper covers it in detail: http://www-db.in.tum.de/research/publications/conferences/semijoin.pdf
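
For anyone who wants something concrete, here is a minimal sketch of the rewrite discussed in this thread, Join(X, Y) -> Join(SemiJoin(X, Y), Y). It uses plain Python lists of dicts as stand-ins for tables; the function names and the row representation are illustrative assumptions, not Optiq or Hive APIs, and a real engine would do this across nodes rather than in memory.

```python
from typing import Dict, Hashable, List, Set

Row = Dict[str, object]


def semi_join(x_rows: List[Row], y_keys: Set[Hashable], key: str) -> List[Row]:
    """Keep only the X rows whose join key appears among Y's key values."""
    return [x for x in x_rows if x[key] in y_keys]


def join(x_rows: List[Row], y_rows: List[Row], key: str) -> List[Row]:
    """Plain in-memory hash join: build an index on Y, probe it with X."""
    index: Dict[Hashable, List[Row]] = {}
    for y in y_rows:
        index.setdefault(y[key], []).append(y)
    return [{**x, **y} for x in x_rows for y in index.get(x[key], [])]


def join_with_semi_join_reduction(
    x_rows: List[Row], y_rows: List[Row], key: str
) -> List[Row]:
    """Join(SemiJoin(X, Y), Y): filter X by Y's distinct keys, then join."""
    # Only the distinct key column of Y is needed for the reduction step,
    # which is the cheap part to ship when Y's rows are wide.
    y_keys = {y[key] for y in y_rows}
    reduced_x = semi_join(x_rows, y_keys, key)
    # Only the surviving X rows take part in the full join against Y.
    return join(reduced_x, y_rows, key)
```

The reduced plan returns the same rows as the plain join; the saving is that X rows with no match in Y are dropped before the full join. Replacing the exact set `y_keys` with a bloom filter would shrink what is shipped further, at the cost of letting some non-matching X rows through to the final join.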

As Vladimir mentioned, SJR is analogous to bloom filters, except that the dimension table itself is used rather than just a bloom filter. TPC-H queries 17 and 20 are good candidates for semi-join reduction. This feature should definitely be on our roadmap.

Thanks,
Mostafa

On Fri, Aug 8, 2014 at 1:04 PM, Julian Hyde wrote:

> You’re right that bloom filters are useful. I was just exploring what
> could be done at the logical level; when it comes to implementing the
> semi-join, bloom filters are a good option, if you can accept an
> approximate answer.
>
> Here’s a scenario where it would make sense to transform JoinRel(X, Y) ->
> JoinRel(SemiJoinRel(X, Y), Y). Let’s suppose that Y has a large number of
> rows and columns (i.e. the average row length is large). We can ship the
> set of distinct Y key values to X, semi-join them, then send the filtered X
> rows to Y.
>
> So, SemiJoin(X, Y) has significantly lower I/O cost than Join(X, Y) even
> though it reads the same number of rows from X and Y, because it reads
> fewer columns from Y.
>
> We’ve replaced one shuffle join with two map joins.
>
> Julian
