I am reading back two datasets that I saved to disk with the ```bucketBy```
option, bucketed on the same key with the same number of buckets. When I read
them back and join them on that key, the join should not result in a shuffle
(i.e. there should be no Exchange on either side of the physical plan).

But that isn't what I am seeing.

*The following code reproduces the behavior:*

import random

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Disable broadcast joins so the planner picks a sort-merge join.
spark = SparkSession.builder.config("spark.sql.autoBroadcastJoinThreshold",
                                    "-1").getOrCreate()

data1 = [(i, random.randint(1, 5), random.randint(1, 5)) for t in range(2) for i in range(5)]
data2 = [(i, random.randint(1, 5), random.randint(1, 5)) for t in range(2) for i in range(5)]
df1 = spark.createDataFrame(data1, schema='a int,b int,c int')
df2 = spark.createDataFrame(data2, schema='a int,b int,c int')

parquet_path1 = './bucket_test_parquet1'
parquet_path2 = './bucket_test_parquet2'

# Bucket both tables on the join key "a", with the same number of buckets (5).
df1.write.bucketBy(5, "a").format("parquet").saveAsTable('df1', path=parquet_path1, mode='overwrite')
df2.write.bucketBy(5, "a").format("parquet").saveAsTable('df2', path=parquet_path2, mode='overwrite')

# Read the bucketed output back from the parquet paths.
read_parquet1 = spark.read.format("parquet").load(parquet_path1)
read_parquet1.createOrReplaceTempView("read_parquet1")
read_parquet1 = spark.sql("SELECT * FROM read_parquet1")

read_parquet2 = spark.read.format("parquet").load(parquet_path2)
read_parquet2.createOrReplaceTempView("read_parquet2")
read_parquet2 = spark.sql("SELECT * FROM read_parquet2")

read_parquet1.join(read_parquet2, on='a').explain()

*The output that I am getting is:*

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [a#24, b#25, c#26, b#34, c#35]
   +- SortMergeJoin [a#24], [a#33], Inner
      :- Sort [a#24 ASC NULLS FIRST], false, 0
      :  +- Exchange hashpartitioning(a#24, 200), ENSURE_REQUIREMENTS, [id=#61]
      :     +- Filter isnotnull(a#24)
      :        +- FileScan parquet [a#24,b#25,c#26] Batched: true, DataFilters: [isnotnull(a#24)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/nitin/pymonsoon/bucket_test_parquet1], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:int,b:int,c:int>
      +- Sort [a#33 ASC NULLS FIRST], false, 0
         +- Exchange hashpartitioning(a#33, 200), ENSURE_REQUIREMENTS, [id=#62]
            +- Filter isnotnull(a#33)
               +- FileScan parquet [a#33,b#34,c#35] Batched: true, DataFilters: [isnotnull(a#33)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/home/nitin/pymonsoon/bucket_test_parquet2], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:int,b:int,c:int>


*The plan clearly shows an Exchange (hashpartitioning) on both sides of
the join. Kindly help me understand the utility of ```bucketBy```.*
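
For reference, here is a minimal sketch of the comparison I plan to try
next: reading the data back through the table names registered by
```saveAsTable``` (via ```spark.table```) instead of through the raw parquet
paths, on the assumption that the bucketing metadata is recorded in the
metastore rather than in the parquet files themselves. The names df1/df2
match the code above.

# Sketch: read via the metastore table names instead of the raw paths.
# Assumption: saveAsTable records the bucket spec in the metastore, so
# spark.table() can expose it, while spark.read...load(path) cannot.
t1 = spark.table("df1")
t2 = spark.table("df2")
t1.join(t2, on='a').explain()

# Inspect the table metadata to confirm the bucket spec was recorded.
spark.sql("DESCRIBE EXTENDED df1").show(truncate=False)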
