Sort-merge join improvement

Petar Zecevic Tue, 17 Apr 2018 18:21:25 -0700

Hello everybody

We (at the University of Zagreb and the University of Washington) have implemented an optimization of Spark's sort-merge join (SMJ) which has improved the performance of our jobs considerably, and we would like to know if the Spark community thinks it would be useful to include this in the main distribution.

The problem we are solving is the case where you have two big tables partitioned by column X, but also sorted by column Y (within partitions), and you need to calculate an expensive function on the joined rows. During a sort-merge join, Spark will cross-join all rows that have the same X value and calculate the function's value on every resulting pair. If the two tables have a large number of rows per X, this can result in a huge number of calculations.

Our optimization allows you to reduce the number of matching rows per X using a range condition on Y columns of the two tables. Something like:


... WHERE t1.X = t2.X AND t1.Y BETWEEN t2.Y - d AND t2.Y + d

The way SMJ is currently implemented, these extra conditions have no influence on the number of rows (per X) being checked because these extra conditions are put in the same block with the function being calculated.
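To illustrate the cost described above, here is a minimal plain-Python sketch (not Spark code, and not how SMJ is actually written). It assumes rows are simple (x, y) tuples; the function name naive_range_join and the sample data are made up for the example. The range condition is tested per pair, so every right row with the same X is still visited.

```python
from collections import defaultdict


def naive_range_join(left, right, d):
    """Join (x, y) rows on equal x, then filter on the range condition.

    Returns the matching triples and the number of pairs examined.
    """
    right_by_x = defaultdict(list)
    for x, y in right:
        right_by_x[x].append(y)

    matches, examined = [], 0
    for x, y_left in left:
        # Every right row with the same x is visited, in range or not.
        for y_right in right_by_x.get(x, []):
            examined += 1
            # Same shape as: t1.Y BETWEEN t2.Y - d AND t2.Y + d
            if y_right - d <= y_left <= y_right + d:
                matches.append((x, y_left, y_right))
    return matches, examined


left = [(1, 10), (1, 20), (1, 30)]
right = [(1, 9), (1, 11), (1, 19), (1, 31), (1, 50)]
matches, examined = naive_range_join(left, right, d=1)
print(len(matches), examined)  # 4 matches, but all 3 * 5 = 15 pairs examined
```

With n left rows and m right rows for one X value, the number of pairs examined is always n * m, regardless of how selective the range condition is.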

Our optimization changes the sort-merge join so that, when these extra conditions are specified, a queue is used instead of the ExternalAppendOnlyUnsafeRowArray class. This queue is then used as a moving window across the values from the right relation as the left row changes. You could call this a combination of an equi-join and a theta join (we call it "sort-merge inner range join").
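The moving-window idea can be sketched in plain Python as follows. This is only an illustration under simplifying assumptions, not our Spark implementation: it handles a single X value, takes the Y values of both sides as lists already sorted in ascending order, and uses a collections.deque as the queue. The name window_range_join and the sample data are hypothetical.

```python
from collections import deque


def window_range_join(left_ys, right_ys, d):
    """Range-join two ascending lists of y values that share one x key.

    Returns matching (y_left, y_right) pairs and the number of right rows
    that entered the window.
    """
    window = deque()
    right_iter = iter(right_ys)
    pending = next(right_iter, None)
    matches, touched = [], 0
    for y_left in left_ys:
        # Pull in right rows up to the upper bound y_left + d.
        while pending is not None and pending <= y_left + d:
            window.append(pending)
            touched += 1
            pending = next(right_iter, None)
        # Drop right rows that fell below the lower bound y_left - d.
        # Left rows are ascending, so dropped rows are never needed again.
        while window and window[0] < y_left - d:
            window.popleft()
        # Everything left in the window satisfies the range condition.
        matches.extend((y_left, y_right) for y_right in window)
    return matches, touched


pairs, touched = window_range_join([10, 20, 30], [9, 11, 19, 31, 50], d=1)
print(len(pairs), touched)  # 4 matches, only 4 right rows entered the window
```

Each right row enters and leaves the queue at most once, so the expensive function only needs to run on pairs that already satisfy the range condition.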

Potential use-cases for this are joins based on spatial or temporal distance calculations.

The optimization is triggered automatically when an equi-join expression is present AND lower and upper range conditions on a secondary column are specified. If the tables aren't sorted by both columns, appropriate sorts will be added.



We have several questions:

1. Do you see any other way to optimize queries like these (eliminate unnecessary calculations) without changing the sort-merge join algorithm?

2. We believe there is a more general pattern here and that this could help in other similar situations where secondary sorting is available. Would you agree?


3. Would you like us to open a JIRA ticket and create a pull request?

Thanks,

Petar Zecevic



