[ 
https://issues.apache.org/jira/browse/PARQUET-1510?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-1510:
-------------------------------
    Labels: correctness pull-request-available  (was: pull-request-available)

> Dictionary filter skips null values when evaluating not-equals.
> ---------------------------------------------------------------
>
>                 Key: PARQUET-1510
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1510
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Ryan Blue
>            Priority: Major
>              Labels: correctness, pull-request-available
>
> This was discovered in Spark, see SPARK-26677. From the Spark PR:
> {code}
> // Repeat the values to get dictionary encoding.
> Seq(Some("A"), Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/foo")
> spark.read.parquet("/tmp/foo").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> +-----+
> {code}
> {code}
> // Use plain encoding.
> Seq(Some("A"), None).toDF.repartition(1).write.mode("overwrite").parquet("/tmp/bar")
> spark.read.parquet("/tmp/bar").where("NOT (value <=> 'A')").show()
> +-----+
> |value|
> +-----+
> | null|
> +-----+
> {code}
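> This is a correctness issue: the null row satisfies NOT (value <=> 'A') and is returned with plain encoding, but it is missing from the result when the column is dictionary-encoded.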
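
To illustrate the failure mode described in the issue, here is a minimal sketch of a dictionary-based "can this chunk be skipped?" check for a not-equals predicate. It is not parquet-mr's code or API: ColumnChunk, NotEqDictionaryFilter, canDropIgnoringNulls and canDrop are hypothetical names, and the model assumes the reader knows the chunk's distinct non-null values and, optionally, its null count. The point it tries to show is that nulls are not stored in the dictionary, so a check that looks only at the dictionary can wrongly skip a chunk holding "A" and null.

```scala
// Hypothetical model of one dictionary-encoded column chunk (not parquet-mr's API).
// `dictionary` holds the distinct non-null values; `nullCount` is None when unknown.
final case class ColumnChunk(dictionary: Set[String], nullCount: Option[Long])

object NotEqDictionaryFilter {
  // Looks only at the dictionary. Nulls never appear there, so a chunk that
  // contains "A" and null is judged to hold nothing but "A" and is dropped.
  def canDropIgnoringNulls(chunk: ColumnChunk, value: String): Boolean =
    chunk.dictionary == Set(value)

  // Drops the chunk only when every row is known to equal `value`: the
  // dictionary holds just `value` and the null count is known to be zero.
  // An unknown null count is treated as "may contain nulls".
  def canDrop(chunk: ColumnChunk, value: String): Boolean =
    chunk.dictionary == Set(value) && chunk.nullCount.contains(0L)
}
```

For the data in the first snippet, ColumnChunk(Set("A"), Some(1L)), canDropIgnoringNulls returns true and the null row is lost, while canDrop returns false and the chunk is read.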



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
