[ https://issues.apache.org/jira/browse/SPARK-17213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15851960#comment-15851960 ]
Apache Spark commented on SPARK-17213:
--------------------------------------

User 'liancheng' has created a pull request for this issue:
https://github.com/apache/spark/pull/16791

> Parquet String Pushdown for Non-Eq Comparisons Broken
> -----------------------------------------------------
>
>                 Key: SPARK-17213
>                 URL: https://issues.apache.org/jira/browse/SPARK-17213
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.1.0
>            Reporter: Andrew Duffy
>            Assignee: Cheng Lian
>             Fix For: 2.1.0
>
> Spark defines ordering over strings based on comparison of UTF8 byte arrays,
> which compares bytes as unsigned integers. Currently, however, Parquet does
> not respect this ordering. This is in the process of being fixed in Parquet
> (JIRA and PR linked below), but until then all pushed-down filters over
> strings are broken, with an actual correctness issue for {{>}} and {{<}}.
> *Repro:*
> Querying directly from an in-memory DataFrame:
> {code}
> > Seq("a", "é").toDF("name").where("name > 'a'").count
> 1
> {code}
> Querying from a Parquet dataset:
> {code}
> > Seq("a", "é").toDF("name").write.parquet("/tmp/bad")
> > spark.read.parquet("/tmp/bad").where("name > 'a'").count
> 0
> {code}
> This happens because Spark sorts the rows as {{[a, é]}}, but Parquet's
> string comparison is based on signed byte array comparison, so it actually
> creates one row group with statistics {{min=é, max=a}}, and that row group
> is dropped by the query.
> Given the way Parquet pushes down {{Eq}}, this does not affect correctness
> for equality filters, but it forces you to read row groups you should be
> able to skip.
> Link to PARQUET issue: https://issues.apache.org/jira/browse/PARQUET-686
> Link to PR: https://github.com/apache/parquet-mr/pull/362

--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org
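The signed-vs-unsigned discrepancy described above can be demonstrated outside of Spark and Parquet. The sketch below is a minimal illustration, not Parquet's or Spark's actual code: the class name `ByteOrderDemo` and both comparator methods are hypothetical stand-ins for, respectively, parquet-mr's pre-PARQUET-686 signed binary statistics ordering and Spark's unsigned `UTF8String` ordering. In UTF-8, "a" encodes to the single byte 0x61 (97) while "é" encodes to 0xC3 0xA9; as a signed byte, 0xC3 is -61, which is why the two orderings disagree.

```java
import java.nio.charset.StandardCharsets;

public class ByteOrderDemo {

    // Lexicographic comparison treating each byte as a SIGNED value
    // (the behavior of parquet-mr's binary min/max statistics before
    // the PARQUET-686 fix, as described in the issue above).
    static int signedCompare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) return Byte.compare(a[i], b[i]);
        }
        return Integer.compare(a.length, b.length);
    }

    // Lexicographic comparison treating each byte as an UNSIGNED value
    // (the ordering Spark defines over UTF8 strings).
    static int unsignedCompare(byte[] a, byte[] b) {
        int n = Math.min(a.length, b.length);
        for (int i = 0; i < n; i++) {
            if (a[i] != b[i]) {
                return Integer.compare(a[i] & 0xFF, b[i] & 0xFF);
            }
        }
        return Integer.compare(a.length, b.length);
    }

    public static void main(String[] args) {
        byte[] a = "a".getBytes(StandardCharsets.UTF_8); // [0x61]
        byte[] e = "é".getBytes(StandardCharsets.UTF_8); // [0xC3, 0xA9]

        // Unsigned (Spark): 0xC3 is 195 > 97, so "é" > "a".
        System.out.println("unsigned: é > a ? " + (unsignedCompare(e, a) > 0));

        // Signed (old Parquet stats): 0xC3 is -61 < 97, so "é" < "a".
        // This is how a row group containing [a, é] ends up with the
        // inverted statistics min=é, max=a and gets wrongly skipped.
        System.out.println("signed:   é < a ? " + (signedCompare(e, a) < 0));
    }
}
```

Both prompts print `true`: under the signed ordering the row group's "maximum" is "a", so a pushed-down `name > 'a'` filter concludes the group can hold no matching rows and drops it.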