I’m looking into using semi-joins in Hive. My main focus is on cases where 
semi-joins are specified in the query (explicitly or via IN/EXISTS), but 
synthetic semi-joins, introduced by the planner, are also interesting.

There is a logical rule, AddRedundantSemiJoinRule. It transforms JoinRel(X, Y) 
--> JoinRel(SemiJoinRel(X, Y), Y).
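
To make the rewrite concrete, here is a toy sketch in plain Python (not Hive or Calcite code). The relations, the key name "k", and the helper functions join and semi_join are all made up for illustration. It only shows that filtering X down to the rows that have a match in Y, and then joining with Y, gives the same result as the original join:

```python
# Toy relations as lists of dicts; "k" is a hypothetical join key.
X = [
    {"k": 1, "x": "a"},
    {"k": 2, "x": "b"},
    {"k": 3, "x": "c"},
    {"k": 2, "x": "d"},
]
Y = [
    {"k": 2, "y": "p"},
    {"k": 2, "y": "q"},
    {"k": 4, "y": "r"},
]


def join(left, right, key="k"):
    """Inner equi-join: one output row per matching (left, right) pair."""
    return [{**l, **r} for l in left for r in right if l[key] == r[key]]


def semi_join(left, right, key="k"):
    """Left semi-join: keep a left row (once) if any right row matches it."""
    right_keys = {r[key] for r in right}
    return [l for l in left if l[key] in right_keys]


original = join(X, Y)                   # JoinRel(X, Y)
reduced_x = semi_join(X, Y)             # SemiJoinRel(X, Y)
rewritten = join(reduced_x, Y)          # JoinRel(SemiJoinRel(X, Y), Y)

print(len(X), "rows in X,", len(reduced_x), "rows after the semi-join")
print("same result:", original == rewritten)
```

The semi-join is redundant in the sense that it never changes the answer; any benefit would come from the outer join seeing fewer rows of X.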

This transformation might be worth the effort if X is large, Y is small (i.e. 
a candidate for a map-join), and (I’m guessing) the NDV of the join key is small.
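
As a rough way to reason about that guess, the sketch below estimates how many rows of X would survive the semi-join. It is not how Hive or Calcite cost the rule; the function names, the thresholds, and the estimate itself are assumptions. The estimate is the textbook one: join-key values are uniformly distributed in X, and every key value in Y also occurs in X, so the surviving fraction is about NDV(Y.key) / NDV(X.key).

```python
def surviving_fraction(ndv_x_key, ndv_y_key):
    """Estimated fraction of X rows kept by SemiJoin(X, Y).

    Assumes uniform key distribution in X and that Y's key values are a
    subset of X's (both are simplifying assumptions, not Hive behaviour).
    """
    if ndv_x_key <= 0:
        return 1.0
    return min(1.0, ndv_y_key / ndv_x_key)


def looks_worthwhile(rows_x, ndv_x_key, rows_y, ndv_y_key,
                     small_table_rows=100_000, min_reduction=0.5):
    """Hypothetical heuristic: Y must be small enough to broadcast, and the
    semi-join must be expected to remove at least min_reduction of X."""
    y_is_small = rows_y <= small_table_rows
    kept = surviving_fraction(ndv_x_key, ndv_y_key)
    return y_is_small and (1.0 - kept) >= min_reduction


# Example with made-up statistics: a large X, and a small Y covering 10 of
# X's 1000 distinct key values.
kept = surviving_fraction(ndv_x_key=1000, ndv_y_key=10)
estimated_rows = int(1_000_000 * kept)
decision = looks_worthwhile(rows_x=1_000_000, ndv_x_key=1000,
                            rows_y=50, ndv_y_key=10)
print(estimated_rows, decision)
```

Under these assumptions the rewrite pays off only when Y covers a small share of X's key values; if Y's keys cover most of X's, the semi-join removes little and is pure overhead. How far the available NDV stats can be trusted for this is exactly question 3 below.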

Three questions:
1. Have I characterized the sweet spot of the transformation correctly?
2. Are there any benchmark or typical queries where this would be useful?
3. Do we have good enough stats to consider applying this rule?

I’ll capture the results of the discussion in a JIRA.

Julian
