[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan Blue updated PARQUET-1510:
-------------------------------
    Labels: correctness pull-request-available  (was: pull-request-available)

> Dictionary filter skips null values when evaluating not-equals.
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1510
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1510
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Ryan Blue
>            Priority: Major
>              Labels: correctness, pull-request-available
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
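
For readers unfamiliar with dictionary-based pruning, the following is a minimal sketch of how a not-equals check that consults only the dictionary could produce the empty result shown in the report. It is a simplified model and not parquet-mr's actual code: `ColumnChunk`, `canDropIgnoringNulls`, and `canDropNullAware` are hypothetical names. It assumes that nulls are never stored in the dictionary and that, as the plain-encoding output in the report shows, a null row satisfies the predicate.

```scala
// Simplified, hypothetical model of dictionary-based row-group pruning.
// This is NOT parquet-mr's DictionaryFilter; every name here is illustrative.
object NotEqDictionarySketch {

  // A column chunk modeled as its dictionary (the distinct non-null values)
  // plus a flag recording whether any row is null.
  // Assumption: nulls never appear in the dictionary itself.
  final case class ColumnChunk(dictionary: Set[String], hasNulls: Boolean)

  // Unsafe rule: decides from the dictionary alone. If every dictionary
  // entry equals the target, it concludes that no row can satisfy
  // `value != target` and drops the chunk, overlooking null rows.
  def canDropIgnoringNulls(chunk: ColumnChunk, target: String): Boolean =
    chunk.dictionary.forall(_ == target)

  // Null-aware rule: a chunk that may contain nulls is never dropped,
  // because (under the stated assumption) a null row matches not-equals.
  def canDropNullAware(chunk: ColumnChunk, target: String): Boolean =
    !chunk.hasNulls && chunk.dictionary.forall(_ == target)

  def main(args: Array[String]): Unit = {
    // Mirrors the /tmp/foo data from the report: values "A", "A", null.
    val foo = ColumnChunk(Set("A"), hasNulls = true)
    println(s"dictionary-only rule drops the chunk: ${canDropIgnoringNulls(foo, "A")}")
    println(s"null-aware rule drops the chunk: ${canDropNullAware(foo, "A")}")
  }
}
```

In this model the dictionary-only rule drops the `/tmp/foo` chunk, so the null row never reaches the reader, while the null-aware rule keeps it. Whether the actual fix takes this form is not stated in the message above.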