[ https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16752590#comment-16752590 ]
ASF GitHub Bot commented on PARQUET-1510:
-----------------------------------------

rdblue commented on pull request #603: PARQUET-1510: Fix notEq for optional columns with null values.
URL: https://github.com/apache/parquet-mr/pull/603

Dictionaries cannot contain null values, so a notEq filter cannot conclude from the dictionary alone that a block has no matching rows. It must also check whether the block may contain at least one null value. If there are no null values, then the existing check is correct.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Dictionary filter skips null values when evaluating not-equals.
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1510
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1510
>             Project: Parquet
>          Issue Type: Improvement
>            Reporter: Ryan Blue
>            Priority: Major
>              Labels: pull-request-available
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
> This is a correctness issue.

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
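
The rule stated in the pull request comment above (a notEq filter may drop a block based on its dictionary only when the block has no null values) can be sketched roughly as follows. This is a minimal illustration, not the parquet-mr implementation: the class `DictionaryNotEqSketch`, the method `canDropNotEq`, and the `mayContainNulls` flag are hypothetical names, and the dictionary is modeled as a plain `Set<String>`. It also assumes the comparison value itself is non-null.

```java
import java.util.Set;

// Hypothetical sketch; none of these names are parquet-mr API.
final class DictionaryNotEqSketch {

    /**
     * Decides whether a block (row group) can be skipped for `column != value`,
     * given the set of distinct non-null values in the column's dictionary.
     *
     * @param dictionary      distinct non-null values stored in the block
     * @param value           the non-null value being compared against
     * @param mayContainNulls true unless the block is known to have no nulls
     * @return true if no row in the block can match the predicate
     */
    static boolean canDropNotEq(Set<String> dictionary, String value, boolean mayContainNulls) {
        // Nulls are never stored in the dictionary, yet a null row is "not equal"
        // to a non-null value. If a null may exist, the block might match.
        if (mayContainNulls) {
            return false;
        }
        // No nulls: every row holds a dictionary value, so the block can be
        // dropped only when the dictionary holds exactly the compared value.
        return dictionary.size() == 1 && dictionary.contains(value);
    }
}
```

Under these assumptions, the `/tmp/foo` case from the issue corresponds to a dictionary containing only "A" with a null present, so `canDropNotEq` returns false and the block is still read, which lets the null row reach the filter.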