[ 
https://issues.apache.org/jira/browse/SPARK-35703?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Chao Sun updated SPARK-35703:
-----------------------------
    Description: Currently Spark has {{HashClusteredDistribution}} and 
{{ClusteredDistribution}}. The only difference between the two is that the 
former is stricter when deciding whether a bucket join can avoid a shuffle: 
compared to the latter, it requires an *exact* match between the clustering 
keys from the output partitioning (i.e., {{HashPartitioning}}) and the join 
keys. However, this is unnecessary, as we should be able to avoid the shuffle 
when the set of clustering keys is a subset of the join keys, just like 
{{ClusteredDistribution}}.   (was: Currently Spark has 
{{HashClusteredDistribution}} and {{ClusteredDistribution}}. The only 
difference between the two is that the former is stricter when deciding 
whether a bucket join can avoid a shuffle: compared to the latter, it requires 
an *exact* match between the clustering keys from the output partitioning and 
the join keys. However, this is unnecessary, as we should be able to avoid the 
shuffle when the set of clustering keys is a subset of the join keys, just 
like {{ClusteredDistribution}}. )

> Remove HashClusteredDistribution
> --------------------------------
>
>                 Key: SPARK-35703
>                 URL: https://issues.apache.org/jira/browse/SPARK-35703
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>    Affects Versions: 3.2.0
>            Reporter: Chao Sun
>            Priority: Major
>
> Currently Spark has {{HashClusteredDistribution}} and 
> {{ClusteredDistribution}}. The only difference between the two is that the 
> former is stricter when deciding whether a bucket join can avoid a shuffle: 
> compared to the latter, it requires an *exact* match between the clustering 
> keys from the output partitioning (i.e., {{HashPartitioning}}) and the join 
> keys. However, this is unnecessary, as we should be able to avoid the shuffle 
> when the set of clustering keys is a subset of the join keys, just like 
> {{ClusteredDistribution}}. 
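
To make the difference between the two rules concrete, here is a minimal sketch in plain Python. It is a toy model, not Spark's implementation: keys are plain strings rather than Catalyst expressions, the function names satisfies_exact and satisfies_subset are hypothetical, and the sketch only looks at one side of the join (it ignores partition counts and whether the two sides are partitioned compatibly with each other).

```python
from typing import Sequence


def satisfies_exact(partition_keys: Sequence[str], join_keys: Sequence[str]) -> bool:
    """Strict rule (HashClusteredDistribution-like, hypothetical model):
    the partitioning keys must equal the join keys, in the same order."""
    return list(partition_keys) == list(join_keys)


def satisfies_subset(partition_keys: Sequence[str], join_keys: Sequence[str]) -> bool:
    """Relaxed rule (ClusteredDistribution-like, hypothetical model):
    every partitioning key must appear among the join keys.
    Note: an empty key list trivially passes here; that case is not modeled."""
    return all(key in join_keys for key in partition_keys)


# A table bucketed by `a`, joined on (a, b).
bucket_keys = ["a"]
join_keys = ["a", "b"]

print(satisfies_exact(bucket_keys, join_keys))   # False: shuffle would be required
print(satisfies_subset(bucket_keys, join_keys))  # True: shuffle could be avoided
```

Under the relaxed rule, rows that agree on (a, b) necessarily agree on a, so they already sit in the same bucket; that is the intuition behind the subset condition described above.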



