[ https://issues.apache.org/jira/browse/SPARK-8682?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15193470#comment-15193470 ]
Herman van Hovell commented on SPARK-8682: ------------------------------------------ I have recently updated the PR (which is a broadcast range join). It'll need a rebase though. The thing is that this doesn't need to be in sql to work. We can use {{ExperimentalMethods.extraStrategies}} to hook this into the planner. > Range Join for Spark SQL > ------------------------ > > Key: SPARK-8682 > URL: https://issues.apache.org/jira/browse/SPARK-8682 > Project: Spark > Issue Type: Improvement > Components: SQL > Reporter: Herman van Hovell > Attachments: perf_testing.scala > > > Currently Spark SQL uses a Broadcast Nested Loop join (or a filtered > Cartesian Join) when it has to execute the following range query: > {noformat} > SELECT A.*, > B.* > FROM tableA A > JOIN tableB B > ON A.start <= B.end > AND A.end > B.start > {noformat} > This is horribly inefficient. The performance of this query can be greatly > improved, when one of the tables can be broadcasted, by creating a range > index. A range index is basically a sorted map containing the rows of the > smaller table, indexed by both the high and low keys. using this structure > the complexity of the query would go from O(N * M) to O(N * 2 * LOG(M)), N = > number of records in the larger table, M = number of records in the smaller > (indexed) table. > I have created a pull request for this. According to the [Spark SQL: > Relational Data Processing in > Spark|http://people.csail.mit.edu/matei/papers/2015/sigmod_spark_sql.pdf] > paper similar work (page 11, section 7.2) has already been done by the ADAM > project (cannot locate the code though). > Any comments and/or feedback are greatly appreciated. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org