Tejas Patil created SPARK-17570: ----------------------------------- Summary: Avoid Hash and Exchange in Sort Merge join if bucketing factor is multiple for tables Key: SPARK-17570 URL: https://issues.apache.org/jira/browse/SPARK-17570 Project: Spark Issue Type: Bug Components: SQL Affects Versions: 2.0.0 Reporter: Tejas Patil Priority: Minor
In case of bucketed tables, Spark will avoid doing `Sort` and `Exchange` if the input tables and output table has same number of buckets. However, unequal bucketing will always lead to `Sort` and `Exchange`. If the number of buckets in the output table is a factor of the buckets in the input table, we should be able to avoid `Sort` and `Exchange` and directly join those. eg. Assume Input1, Input2 and Output be bucketed + sorted tables over the same columns but with different number of buckets. Input1 has 8 buckets, Input1 has 4 buckets and Output has 4 buckets. Since hash-partitioning is done using Modulus, if we JOIN buckets (0, 4) of Input1 and buckets (0, 4, 8) of Input2 in the same task, it would give the bucket 0 of output table. {noformat} Input1 (0, 4) (1, 3) (2, 5) (3, 7) Input2 (0, 4, 8) (1, 3, 9) (2, 5, 10) (3, 7, 11) Output (0) (1) (2) (3) {noformat} -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org