[ https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593024#comment-16593024 ]
Dongjoon Hyun commented on SPARK-25206:
---------------------------------------

Hi, [~yucai], [~cloud_fan], [~smilegator], [~hyukjin.kwon]. In Spark 2.4, we are still trying to fix long-standing Parquet case-sensitivity issues (Spark 2.1.x raises exceptions, and Spark 2.2.x behaves the same as Spark 2.3.x). Unfortunately, this effort is incomplete and unstable even in Spark 2.4, because one patch (SPARK-25207) is still unmerged and more unknown patches may come in the future. Given that, we had better consider any backporting to `branch-2.3` only after Spark 2.4 becomes stable, and land the patches together rather than one by one. What do you think about this? Are the current three Spark-2.4-only Parquet patches (SPARK-25132, SPARK-24716, SPARK-25207) considered a complete set of patches for this?

> Wrong data may be returned for Parquet
> --------------------------------------
>
>                 Key: SPARK-25206
>                 URL: https://issues.apache.org/jira/browse/SPARK-25206
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.2.2, 2.3.1
>            Reporter: yucai
>            Priority: Blocker
>              Labels: correctness
>         Attachments: image-2018-08-24-18-05-23-485.png,
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png,
> image-2018-08-24-22-46-05-346.png, image-2018-08-25-09-54-53-219.png,
> image-2018-08-25-10-04-21-901.png, pr22183.png
>
> In current Spark 2.3.1, the query below returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
>
> scala> sql("select * from t").show
> +----+
> |  ID|
> +----+
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> +----+
>
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
>
> scala> sql("set spark.sql.parquet.filterPushdown").show
> +--------------------+-----+
> |                 key|value|
> +--------------------+-----+
> |spark.sql.parquet...| true|
> +--------------------+-----+
>
> scala> sql("set spark.sql.parquet.filterPushdown=false").show
> +--------------------+-----+
> |                 key|value|
> +--------------------+-----+
> |spark.sql.parquet...|false|
> +--------------------+-----+
>
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+
> {code}
>
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("{color:#ff0000}ID{color}"), 0:
> Integer) into Parquet, but {color:#ff0000}ID{color} does not exist in
> /tmp/data (Parquet is case sensitive; the file actually has
> {color:#ff0000}id{color}), so no records are returned.
> In Spark 2.1, the user gets an exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in
> schema!{code}
> But in Spark 2.3, they get wrong results silently.
>
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive
> metastore schema to do the pushdown, which is perfect for this issue.
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?

--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
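To illustrate the root cause, here is a minimal, self-contained Scala sketch of the safe behavior: resolving the query's column name against the file's physical schema case-insensitively before building a pushdown predicate, and skipping pushdown when there is no unique match. This is a hypothetical illustration, not Spark's actual implementation; `PushdownNameResolution` and `resolve` are invented names for this example.

```scala
// Hypothetical sketch (not Spark's real code): a pushdown filter built on a
// column name that is absent from the Parquet file's physical schema matches
// zero rows, so the name must be resolved against the file schema first.
object PushdownNameResolution {
  /** Resolve a query column name against the physical Parquet field names,
    * case-insensitively. Returns the physical name to push down on, or None
    * (zero or ambiguous matches), in which case pushdown should be skipped. */
  def resolve(parquetFields: Seq[String], queryName: String): Option[String] = {
    val matches = parquetFields.filter(_.equalsIgnoreCase(queryName))
    matches match {
      case Seq(unique) => Some(unique) // exactly one match: safe to push down
      case _           => None         // missing or ambiguous: do not push down
    }
  }

  def main(args: Array[String]): Unit = {
    // The file in the repro stores the column as "id", the table declares "ID".
    println(resolve(Seq("id"), "ID"))       // unique match: push down on "id"
    println(resolve(Seq("id", "ID"), "ID")) // ambiguous: skip pushdown
  }
}
```

Skipping pushdown on a missing or ambiguous name degrades only performance; pushing a filter on the wrong-case name silently loses rows, which is why this is marked a correctness Blocker.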