Tejas Patil created SPARK-17570:
-----------------------------------

             Summary: Avoid Hash and Exchange in Sort Merge join if bucketing 
factor is multiple for tables
                 Key: SPARK-17570
                 URL: https://issues.apache.org/jira/browse/SPARK-17570
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 2.0.0
            Reporter: Tejas Patil
            Priority: Minor


In case of bucketed tables, Spark will avoid doing `Sort` and `Exchange` if the 
input tables and output table has same number of buckets. However, unequal 
bucketing will always lead to `Sort` and `Exchange`. If the number of buckets 
in the output table is a factor of the buckets in the input table, we should be 
able to avoid `Sort` and `Exchange` and directly join those.
eg.

Assume Input1, Input2 and Output be bucketed + sorted tables over the same 
columns but with different number of buckets. Input1 has 8 buckets, Input1 has 
4 buckets and Output has 4 buckets. Since hash-partitioning is done using 
Modulus, if we JOIN buckets (0, 4) of Input1 and buckets (0, 4, 8) of Input2 in 
the same task, it would give the bucket 0 of output table.

{noformat}
Input1   (0, 4)      (1, 3)      (2, 5)       (3, 7)
Input2   (0, 4, 8)   (1, 3, 9)   (2, 5, 10)   (3, 7, 11)
Output   (0)         (1)         (2)          (3)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org

Reply via email to