[ https://issues.apache.org/jira/browse/SPARK-52032 ]
ASF GitHub Bot updated SPARK-52032:
-----------------------------------
    Labels: pull-request-available  (was: )

ORC filter pushdown causes incorrect results with eqNullSafe (<=>) in DataFrame filter (Parquet works as expected)
------------------------------------------------------------------------------------------------------------------

                Key: SPARK-52032
                URL: https://issues.apache.org/jira/browse/SPARK-52032
            Project: Spark
         Issue Type: Bug
         Components: Spark Core
   Affects Versions: 3.5.5
           Reporter: Masafumi Ito
           Priority: Critical
             Labels: pull-request-available

h1. Overview of the Issue

When applying a filter using `eqNullSafe` (`<=>`, the null-safe equality operator) to a DataFrame created from an ORC file, rows containing null values are incorrectly excluded, even though they should be retained. With Parquet files, the same operation produces the expected results. Additionally, if ORC filter pushdown is disabled, the ORC DataFrame also behaves correctly.

h1. Detailed Explanation

h2. Steps to Reproduce

1. Prepare a DataFrame (created from an ORC file) with the following data:
{code:java}
+---+----+-------------------+
| id|name|              email|
+---+----+-------------------+
|  1|test|               null|
|  2|null|exam...@example.com|
+---+----+-------------------+{code}
2. Apply the following filter operation to this DataFrame (`df`):
{code:java}
df.filter(!(col("name") <=> "dummy")){code}
3. *Expected result:* all rows should be retained, as shown below.
{code:java}
+---+----+-------------------+
| id|name|              email|
+---+----+-------------------+
|  1|test|               null|
|  2|null|exam...@example.com|
+---+----+-------------------+{code}
4. *Actual result:* the row where `name` is `null` is excluded.
{code:java}
+---+----+-------------------+
| id|name|              email|
+---+----+-------------------+
|  1|test|               null|
+---+----+-------------------+{code}

h2. Additional Information

* With DataFrames created from Parquet files, the same filter operation produces the expected results.
* Disabling ORC filter pushdown (`spark.sql.orc.filterPushdown`) also results in correct behavior for ORC.
* The physical plan is identical for ORC and Parquet. In both formats, the condition `NOT (name <=> dummy)` appears in DataFilters and PushedFilters in the plan.

h3. ORC Physical Plan
{code:java}
== Physical Plan ==
*(1) Filter NOT (name#36 <=> dummy)
+- *(1) ColumnarToRow
   +- FileScan orc [id#35,name#36,email#37] Batched: true, DataFilters: [NOT (name#36 <=> dummy)], Format: ORC, Location: InMemoryFileIndex(1 paths)[file:/***/spark-null-safe/data/null_safe_..., PartitionFilters: [], PushedFilters: [Not(EqualNullSafe(name,dummy))], ReadSchema: struct<id:int,name:string,email:string>{code}
h3. Parquet Physical Plan
{code:java}
== Physical Plan ==
*(1) Filter NOT (name#42 <=> dummy)
+- *(1) ColumnarToRow
   +- FileScan parquet [id#41,name#42,email#43] Batched: true, DataFilters: [NOT (name#42 <=> dummy)], Format: Parquet, Location: InMemoryFileIndex(1 paths)[file:/***/spark-null-safe/data/null_safe_..., PartitionFilters: [], PushedFilters: [Not(EqualNullSafe(name,dummy))], ReadSchema: struct<id:int,name:string,email:string>{code}
Thus, although the physical query plans are identical, only ORC produces incorrect filter results.
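For reference, the expected semantics of `<=>` can be verified independently of any file format. The following is a minimal sketch (the object name and local-mode session are illustrative, not part of this report): `NULL <=> 'dummy'` evaluates to false rather than null, so its negation is true and a row with a null `name` must pass the filter.
{code:java}
import org.apache.spark.sql.SparkSession

object EqNullSafeSemantics {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("EqNullSafeSemantics")
      .master("local[*]")
      .getOrCreate()

    // <=> never evaluates to null: NULL <=> 'dummy' is false, so the negated
    // predicate is true and a row with a null name must be retained.
    spark.sql("SELECT NOT (CAST(NULL AS STRING) <=> 'dummy') AS keep_null_row").show()

    spark.stop()
  }
}
{code}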
h1. Reproduction Code
{code:java}
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

object NullSafeTest {
  private val projectDir = new java.io.File(".").getCanonicalPath
  private val dataPath = s"$projectDir/data/null_safe_test"

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder
      .appName("NullSafeTest")
      .master("local[*]")
      // .config("spark.sql.orc.filterPushdown", "false") // ORC works as expected without filter pushdown
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      (1, "test", null),
      (2, null, "exam...@example.com")
    ).toDF("id", "name", "email")
    df.show()

    df.write.mode("overwrite").orc(s"$dataPath/orc")
    df.write.mode("overwrite").parquet(s"$dataPath/parquet")

    val orcDf = spark.read.orc(s"$dataPath/orc")
    val parquetDf = spark.read.parquet(s"$dataPath/parquet")

    // ORC: the null-name row is wrongly dropped when pushdown is enabled.
    val filteredOrcDf = orcDf.filter(!(col("name") <=> "dummy"))
    filteredOrcDf.show()
    filteredOrcDf.explain("extended")

    // Parquet: both rows are retained, as expected.
    val filteredParquetDf = parquetDf.filter(!(col("name") <=> "dummy"))
    filteredParquetDf.show()
    filteredParquetDf.explain("extended")

    spark.stop()
  }
}
{code}
h1. Summary

* When using ORC files with filter pushdown enabled, filtering with `eqNullSafe` (`<=>`) unexpectedly excludes rows containing null values.
* This issue does not occur with Parquet, or with ORC when filter pushdown is disabled.
* There may be a bug in how ORC filter pushdown handles null values.
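Until the pushdown bug is fixed, the report's own observation suggests a session-level workaround: disable ORC filter pushdown via `spark.sql.orc.filterPushdown`, the configuration key named above. A minimal sketch, reusing `spark`, `dataPath`, and `col` from the reproduction code; setting the key at session build time, as in the commented-out line there, works as well.
{code:java}
// Workaround sketch: trade pushdown performance for correctness.
// The conf must be set before the ORC files are read.
spark.conf.set("spark.sql.orc.filterPushdown", "false")

val orcDfNoPushdown = spark.read.orc(s"$dataPath/orc")
orcDfNoPushdown.filter(!(col("name") <=> "dummy")).show() // both rows appear
{code}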