PySpark version: 3.1.3
*Question 1:* What is DataFilters in the Spark physical plan? How is it
different from PushedFilters?
*Question 2:* When joining two datasets, why is the isnotnull filter
applied twice on the join key column? In the physical plan, it appears
once as a PushedFilter on the scan and is then explicitly applied again
as a Filter node right after it. Why is that so?
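For reference, even a plain filter on one of the Parquet tables read in
the code below seems to surface both fields in the scan node, e.g.
(sketch; the a#.. expression IDs and exact plan text will vary per run):

dfl_int.filter(dfl_int.a == 1).explain()
# the FileScan line shows something like:
#   DataFilters: [isnotnull(a#23L), (a#23L = 1)]
#   PushedFilters: [IsNotNull(a), EqualTo(a,1)]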
code:
import os
import pandas as pd, numpy as np
import pyspark

spark = pyspark.sql.SparkSession.builder.getOrCreate()

save_loc = "gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/"

# two DataFrames with a nullable join key 'a' (~5% nulls)
df1 = spark.createDataFrame(pd.DataFrame({'a': np.random.choice([1, 2, None], size=1000, p=[0.47, 0.48, 0.05]),
                                          'b': np.random.random(1000)}))
df2 = spark.createDataFrame(pd.DataFrame({'a': np.random.choice([1, 2, None], size=1000, p=[0.47, 0.48, 0.05]),
                                          'b': np.random.random(1000)}))

# write to Parquet and read back so the join runs against a FileScan
df1.write.parquet(os.path.join(save_loc, "dfl_key_int"))
df2.write.parquet(os.path.join(save_loc, "dfr_key_int"))

dfl_int = spark.read.parquet(os.path.join(save_loc, "dfl_key_int"))
dfr_int = spark.read.parquet(os.path.join(save_loc, "dfr_key_int"))

dfl_int.join(dfr_int, on='a', how='inner').explain()
output:
== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Project [a#23L, b#24, b#28]
   +- BroadcastHashJoin [a#23L], [a#27L], Inner, BuildRight, false
      :- Filter isnotnull(a#23L)
      :  +- FileScan parquet [a#23L,b#24] Batched: true, DataFilters: [isnotnull(a#23L)], Format: Parquet, Location: InMemoryFileIndex[gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/dfl_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,b:double>
      +- BroadcastExchange HashedRelationBroadcastMode(List(input[0, bigint, false]),false), [id=#75]
         +- Filter isnotnull(a#27L)
            +- FileScan parquet [a#27L,b#28] Batched: true, DataFilters: [isnotnull(a#27L)], Format: Parquet, Location: InMemoryFileIndex[gs://monsoon-credittech.appspot.com/spark_datasets/random_tests/dfr_key_int], PartitionFilters: [], PushedFilters: [IsNotNull(a)], ReadSchema: struct<a:bigint,b:double>
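In case the tree above is hard to read with the line wrapping, the same
plan can also be printed in formatted mode (available since Spark 3.0),
which lists DataFilters, PartitionFilters and PushedFilters on separate
lines under each scan node:

dfl_int.join(dfr_int, on='a', how='inner').explain(mode="formatted")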
--
Regards,
Nitin