Nikita Eshkeev created SPARK-43021:
--------------------------------------

             Summary: Shuffle happens when Coalesce Buckets should occur
                 Key: SPARK-43021
                 URL: https://issues.apache.org/jira/browse/SPARK-43021
             Project: Spark
          Issue Type: Bug
          Components: SQL
    Affects Versions: 3.3.1
            Reporter: Nikita Eshkeev

h1. What I did

I ran the following code:

{{from pyspark.sql import SparkSession}}

{{spark = (}}
{{    SparkSession}}
{{    .builder}}
{{    .appName("Bucketing")}}
{{    .master("local[4]")}}
{{    .config("spark.sql.bucketing.coalesceBucketsInJoin.enabled", True)}}
{{    .config("spark.sql.autoBroadcastJoinThreshold", "-1")}}
{{    .getOrCreate()}}
{{)}}

{{df1 = spark.range(0, 100)}}
{{df2 = spark.range(0, 100, 2)}}

{{df1.write.bucketBy(4, "id").mode("overwrite").saveAsTable("t1")}}
{{df2.write.bucketBy(2, "id").mode("overwrite").saveAsTable("t2")}}

{{t1 = spark.table("t1")}}
{{t2 = spark.table("t2")}}

{{t2.join(t1, "id").explain()}}

h1. What happened

The join plan contains an Exchange node.

h1. What is expected

The plan should not contain any Exchange/Shuffle nodes: {{t1}} has 4 buckets and {{t2}} has 2 buckets, so their ratio is 2, which is less than 4 ({{spark.sql.bucketing.coalesceBucketsInJoin.maxBucketRatio}}), and [CoalesceBucketsInJoin|https://github.com/apache/spark/blob/c9878a212958bc54be529ef99f5e5d1ddf513ec8/sql/core/src/main/scala/org/apache/spark/sql/execution/bucketing/CoalesceBucketsInJoin.scala] should therefore be applied.
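
The bucket-count arithmetic behind the expectation above can be sketched in plain Python. This is only an illustration, not Spark's implementation: {{can_coalesce_buckets}} is a hypothetical helper, and it assumes that the rule needs the two bucket counts to differ, the larger to be a multiple of the smaller, and their ratio not to exceed the configured maximum (4 in this report). The linked rule also checks other things, such as the join type and the join keys, which this sketch ignores.

```python
def can_coalesce_buckets(num_buckets_a: int, num_buckets_b: int,
                         max_bucket_ratio: int = 4) -> bool:
    """Hypothetical helper: bucket-count part of the coalescing precondition.

    Assumptions (not taken from the report): the counts must differ, the
    larger must be divisible by the smaller, and the ratio must not exceed
    max_bucket_ratio.
    """
    small, large = sorted((num_buckets_a, num_buckets_b))
    if small <= 0:
        raise ValueError("bucket counts must be positive")
    if small == large:
        # Equal bucket counts already line up; there is nothing to coalesce.
        return False
    return large % small == 0 and large // small <= max_bucket_ratio


# The tables from the report: t1 has 4 buckets, t2 has 2 buckets.
print(can_coalesce_buckets(4, 2))
```

Under these assumptions the pair (4, 2) qualifies, which matches the expectation that no Exchange node is needed for the join.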