You’re right that bloom filters are useful. I was just exploring what could be 
done at the logical level; when it comes to implementing the semi-join, bloom 
filters are a good option, if you can accept an approximate answer.
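
To make the "approximate" part concrete, here is a minimal Bloom filter sketch. The class, its parameters (num_bits, num_hashes) and the key values are all made up for illustration and are not taken from any particular engine. The filter can let through some X keys that have no match in Y, but it never drops a key that does match.

```python
import hashlib


class BloomFilter:
    """Toy Bloom filter; the sizes are illustrative, not tuned."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits)  # one byte per bit, for simplicity

    def _positions(self, value):
        # Derive num_hashes bit positions by salting a single hash function.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos] = 1

    def might_contain(self, value):
        # True can be a false positive; False is always correct.
        return all(self.bits[pos] for pos in self._positions(value))


# Hypothetical join keys.
y_keys = [1, 3, 5, 7]
x_keys = [1, 2, 5, 8]

bf = BloomFilter()
for k in y_keys:
    bf.add(k)

# X keys that survive the approximate semi-join.
candidates = [k for k in x_keys if bf.might_contain(k)]
```

If the filtered X rows only feed a later exact join against Y, any false positives would be discarded there, so the approximation costs some extra rows rather than correctness.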

Here’s a scenario where it would make sense to transform JoinRel(X, Y) ->
JoinRel(SemiJoinRel(X, Y), Y). Let’s suppose that Y has a large number of rows 
and columns (i.e. the average row length is large). We can ship the set of 
distinct Y key values to X, semi-join them, then send the filtered X rows to Y.
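
A rough sketch of that rewrite on in-memory rows, to show that the two plans return the same result. The tables, column names and the join / semi_join helpers are hypothetical; they stand in for JoinRel and SemiJoinRel and are not any real API.

```python
# Hypothetical data: X is narrow, Y is wide (y_payload stands in for many columns).
X = [
    {"k": 1, "x_val": "a"},
    {"k": 2, "x_val": "b"},
    {"k": 5, "x_val": "c"},
]
Y = [
    {"k": 1, "y_payload": "wide row 1"},
    {"k": 1, "y_payload": "wide row 1b"},
    {"k": 3, "y_payload": "wide row 3"},
    {"k": 5, "y_payload": "wide row 5"},
]


def join(left, right, key="k"):
    """Inner equi-join using a hash table built on the left input."""
    index = {}
    for l in left:
        index.setdefault(l[key], []).append(l)
    return [{**l, **r} for r in right for l in index.get(r[key], [])]


def semi_join(left, right, key="k"):
    """Keep the left rows that have at least one match in right.

    Only the key column of right is read.
    """
    right_keys = {r[key] for r in right}  # distinct Y keys, shipped to X
    return [l for l in left if l[key] in right_keys]


direct = join(X, Y)                      # JoinRel(X, Y)
rewritten = join(semi_join(X, Y), Y)     # JoinRel(SemiJoinRel(X, Y), Y)
```

Here semi_join(X, Y) drops the X row with k = 2 before anything is sent to Y, and both plans produce the same three joined rows.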

So, SemiJoin(X, Y) has significantly lower I/O cost than Join(X, Y) even though 
it reads the same number of rows from X and Y, because it reads fewer columns 
from Y.
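
A back-of-the-envelope version of that comparison, with invented numbers. It assumes Y is stored so that a scan can read only the columns it needs; the row count, column count and column width below are purely illustrative.

```python
def bytes_read(rows, columns_read, avg_column_bytes):
    """Crude I/O estimate: every row is read, but only the requested columns."""
    return rows * columns_read * avg_column_bytes


# Hypothetical shape of Y.
y_rows = 1_000_000
y_columns = 50
avg_column_bytes = 8

join_y_io = bytes_read(y_rows, y_columns, avg_column_bytes)  # all columns of Y
semi_join_y_io = bytes_read(y_rows, 1, avg_column_bytes)     # key column only

ratio = join_y_io // semi_join_y_io
```

Under these assumptions the semi-join reads the same million rows of Y but one fiftieth of the bytes.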

We’ve replaced one shuffle join with two map joins: one that ships the distinct
Y keys to X, and one that ships the filtered X rows to Y.

Julian
