[ https://issues.apache.org/jira/browse/SPARK-25207?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16593108#comment-16593108 ]

Dongjoon Hyun commented on SPARK-25207:
---------------------------------------

According to the PR, this seems to be a new regression introduced in Spark 2.4.
It is not specific to the schema-mismatch case. For example, in the following
schema-matched case, the input size is less than or equal to 8.0 MB in Spark
2.3.1, but current master shows the behavior below.

{code}
spark.sparkContext.hadoopConfiguration.setInt("parquet.block.size", 8 * 1024 * 1024)
spark.range(1, 40 * 1024 * 1024, 1, 1).sortWithinPartitions("id").write.mode("overwrite").parquet("/tmp/t")
sql("CREATE TABLE t (id LONG) USING parquet LOCATION '/tmp/t'")
// The input size should be less than or equal to 8 MB.
sql("select * from t where id < 100L").show()
// It is already less than or equal to 8 MB here.
sql("select * from t where id < 100L").write.mode("overwrite").csv("/tmp/id")
{code}

!image.png! 

> Case-insensitive field resolution for filter pushdown when reading Parquet
> -------------------------------------------------------------------------
>
>                 Key: SPARK-25207
>                 URL: https://issues.apache.org/jira/browse/SPARK-25207
>             Project: Spark
>          Issue Type: Bug
>          Components: SQL
>    Affects Versions: 2.4.0
>            Reporter: yucai
>            Priority: Major
>              Labels: Parquet
>         Attachments: image.png
>
>
> Currently, filter pushdown does not work if the Parquet schema and the Hive
> metastore schema differ in letter case, even when spark.sql.caseSensitive is
> false. For example:
> {code:java}
> spark.range(10).write.parquet("/tmp/data")
> sql("DROP TABLE t")
> sql("CREATE TABLE t (ID LONG) USING parquet LOCATION '/tmp/data'")
> sql("select * from t where id > 0").show
> {code}
> -No filter will be pushed down.-
> {code}
> scala> sql("select * from t where id > 0").explain   // Filters are pushed with `ID`
> == Physical Plan ==
> *(1) Project [ID#90L]
> +- *(1) Filter (isnotnull(id#90L) && (id#90L > 0))
>    +- *(1) FileScan parquet default.t[ID#90L] Batched: true, Format: Parquet, Location: InMemoryFileIndex[file:/tmp/data], PartitionFilters: [], PushedFilters: [IsNotNull(ID), GreaterThan(ID,0)], ReadSchema: struct<ID:bigint>
> scala> sql("select * from t").show    // Parquet returns NULL for `ID` because the file has `id`.
> +----+
> |  ID|
> +----+
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> |null|
> +----+
> scala> sql("select * from t where id > 0").show   // `NULL > 0` is `false`.
> +---+
> | ID|
> +---+
> +---+
> {code}
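For context, a fix along these lines would need to map a pushed-down column name onto the Parquet file's actual field names case-insensitively when spark.sql.caseSensitive is false, and refuse to push the filter when the match is ambiguous. The following is a minimal standalone sketch of that resolution logic only; the object and method names are illustrative and are not Spark's actual internal API:

```scala
// Hypothetical sketch of case-insensitive field resolution for filter
// pushdown. Not Spark's actual code; names here are illustrative.
object FieldResolution {
  // Returns the Parquet field name a filter should use for `name`,
  // or None if the filter cannot safely be pushed down.
  def resolve(parquetFields: Seq[String],
              name: String,
              caseSensitive: Boolean): Option[String] = {
    if (caseSensitive) {
      parquetFields.find(_ == name)
    } else {
      // Case-insensitive match; push no filter if ambiguous,
      // e.g. the file contains both "id" and "ID".
      parquetFields.filter(_.equalsIgnoreCase(name)) match {
        case Seq(single) => Some(single)
        case _           => None
      }
    }
  }
}
```

With the file field `id` and the metastore column `ID`, `resolve(Seq("id"), "ID", caseSensitive = false)` yields `Some("id")`, so the pushed filter could be rewritten against `id` instead of the non-existent `ID`.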



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
