Nikita Eshkeev created SPARK-43021:
--------------------------------------

             Summary: Shuffle happens when Coalesce Buckets should occur
                 Key: SPARK-43021
                 URL: https://issues.apache.org/jira/browse/SPARK-43021
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.1
            Reporter: Nikita Eshkeev


h1. What I did

I ran the following code:

{{from pyspark.sql import SparkSession}}

{{spark = (}}
{{  SparkSession}}
{{    .builder}}
{{    .appName("Bucketing")}}
{{    .master("local[4]")}}
{{    .config("spark.sql.bucketing.coalesceBucketsInJoin.enabled", True)}}
{{    .config("spark.sql.autoBroadcastJoinThreshold", "-1")}}
{{    .getOrCreate()}}
{{)}}

{{df1 = spark.range(0, 100)}}
{{df2 = spark.range(0, 100, 2)}}

{{df1.write.bucketBy(4, "id").mode("overwrite").saveAsTable("t1")}}
{{df2.write.bucketBy(2, "id").mode("overwrite").saveAsTable("t2")}}

{{t1 = spark.table("t1")}}
{{t2 = spark.table("t2")}}

{{t2.join(t1, "id").explain()}}

h1. What happened

The physical plan printed by {{explain()}} contains an Exchange node for the join.

h1. What is expected

The plan should not contain any Exchange (shuffle) nodes: {{t1}} has 4 buckets 
and {{t2}} has 2, so their ratio is 2, which is less than 4 
({{spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio}}). The 
[CoalesceBucketsInJoin|https://github.com/apache/spark/blob/c9878a212958bc54be529ef99f5e5d1ddf513ec8/sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoin.scala]
 rule should therefore be applied.
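
To make the arithmetic behind this expectation explicit, here is a minimal pure-Python sketch of the bucket-count condition as I understand it from the rule linked above. It is not Spark's code: the helper name {{buckets_can_coalesce}} and its parameters are hypothetical, and the default of 4 for {{max_bucket_ratio}} is taken from the value quoted in this report. It covers only the bucket counts and ignores the other requirements of the rule, such as the join type and the join keys matching the bucket columns.

```python
def buckets_can_coalesce(num_buckets_a: int,
                         num_buckets_b: int,
                         max_bucket_ratio: int = 4) -> bool:
    """Hypothetical helper: check only the bucket-count part of the condition.

    Assumed condition (my reading, not copied from Spark):
      * the two bucket counts differ,
      * the larger count is divisible by the smaller one,
      * larger / smaller does not exceed max_bucket_ratio.
    """
    if num_buckets_a <= 0 or num_buckets_b <= 0:
        raise ValueError("bucket counts must be positive")

    larger = max(num_buckets_a, num_buckets_b)
    smaller = min(num_buckets_a, num_buckets_b)

    if larger == smaller:
        # Equal bucket counts need no coalescing at all.
        return False

    return larger % smaller == 0 and larger // smaller <= max_bucket_ratio


# t1 has 4 buckets, t2 has 2 buckets, as in the reproduction above.
print(buckets_can_coalesce(4, 2))
```

Under these assumptions the check passes for the 4-bucket and 2-bucket tables, which is why I expected the join to be planned without a shuffle.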



--
This message was sent by Atlassian Jira
(v8.20.10#820010)
