[https://issues.apache.org/jira/browse/SPARK-25206?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16591756#comment-16591756]

yucai commented on SPARK-25206:
-------------------------------

[~cloud_fan], we need both [https://github.com/apache/spark/pull/21696] and 
[https://github.com/apache/spark/pull/22183] to fix this bug.

 

*With only* [https://github.com/apache/spark/pull/21696], no records are 
returned:
{code:java}
# In a shell:
rm -rf /tmp/data /tmp/data_csv
./bin/spark-shell

// In spark-shell:
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE IF EXISTS t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").write.csv("/tmp/data_csv")

scala> spark.read.csv("/tmp/data_csv")
res4: org.apache.spark.sql.DataFrame = []
{code}
*Root Cause*: No filter is pushed down, but "ID" is still selected from the 
parquet file, which does not have this field (the file schema only has 
lower-case "id"). The parquet scan therefore returns 10 null records, which 
are then filtered out by "ID" > 0 in FilterExec, so 0 records are returned. 
See:

!image-2018-08-24-22-46-05-346.png!
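
To see why FilterExec drops them: NULL > 0 evaluates to NULL under SQL's 
three-valued logic, and Filter only keeps rows where the predicate is true. 
A minimal spark-shell sketch (illustrative only, not the actual plan from the 
screenshot):
{code:java}
// Illustration only: an all-null column never satisfies ID > 0, because
// NULL > 0 evaluates to NULL and Filter keeps only rows that evaluate to true.
import spark.implicits._

val nulls = Seq[java.lang.Long](null, null, null).toDF("ID")
nulls.filter($"ID" > 0).count()
// returns 0: every null row is filtered out
{code}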

*With both* [https://github.com/apache/spark/pull/21696] and 
[https://github.com/apache/spark/pull/22183], the expected records are 
returned:
{code:java}
# In a shell:
rm -rf /tmp/data /tmp/data_csv
./bin/spark-shell

// In spark-shell:
spark.range(10).write.parquet("/tmp/data")
sql("DROP TABLE IF EXISTS t")
sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
sql("select * from t where id > 0").write.csv("/tmp/data_csv")

scala> spark.read.csv("/tmp/data_csv").show
+---+
|_c0|
+---+
| 2|
| 3|
| 4|
| 7|
| 8|
| 9|
| 5|
| 6|
| 1|
+---+
{code}
(The row order only reflects the order in which the CSV part files are read; 
all 9 expected records are present.)

> Wrong data may be returned when pushdown is enabled
> ----------------------------------------------------
>
>                 Key: SPARK-25206
>                 URL: https://issues.apache.org/jira/browse/SPARK-25206
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.3.1
>            Reporter: yucai
>            Priority: Blocker
>              Labels: correctness
>         Attachments: image-2018-08-24-18-05-23-485.png, 
> image-2018-08-24-22-33-03-231.png, image-2018-08-24-22-34-11-539.png, 
> image-2018-08-24-22-46-05-346.png, pr22183.png
>
>
> In current Spark 2.3.1, the query below returns wrong data silently.
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> scala> sql("select * from t where id > 0").show
> +---+
> | ID|
> +---+
> +---+{code}
>  
> *Root Cause*
> Spark pushes down FilterApi.gt(intColumn("{color:#ff0000}ID{color}"), 0: 
> Integer) into parquet, but {color:#ff0000}ID{color} does not exist in 
> /tmp/data (parquet is case sensitive; the file actually has 
> {color:#ff0000}id{color}), so no records are returned.
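> A hedged sketch of that predicate built directly with parquet-mr's 
> FilterApi (the actual call site in Spark is assumed to be ParquetFilters):
> {code:java}
> import org.apache.parquet.filter2.predicate.FilterApi
> 
> // The column name "ID" comes from the metastore schema; the file itself
> // only has lower-case "id", so this predicate can never match a row group.
> val pred = FilterApi.gt(FilterApi.intColumn("ID"), java.lang.Integer.valueOf(0))
> {code}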
> In Spark 2.1, the user will get Exception:
> {code:java}
> Caused by: java.lang.IllegalArgumentException: Column [ID] was not found in 
> schema!{code}
> But in Spark 2.3, they will get wrong results silently.
>  
> Since SPARK-24716, Spark uses the Parquet schema instead of the Hive 
> metastore schema to do the pushdown, which fixes this issue.
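> As a hedged sketch of that idea (not Spark's actual code), the pushdown 
> column name would be resolved against the Parquet file schema first, so the 
> predicate uses the file's own casing:
> {code:java}
> import org.apache.parquet.filter2.predicate.FilterApi
> 
> // Hypothetical: the field name is taken from the Parquet footer ("id"),
> // so the pushed predicate can actually match.
> val fileFieldName = "id"
> val pred = FilterApi.gt(FilterApi.intColumn(fileFieldName), java.lang.Integer.valueOf(0))
> {code}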
> [~yumwang], [~cloud_fan], [~smilegator], any thoughts? Should we backport it?


