johanl-db commented on PR #44006:
URL: https://github.com/apache/spark/pull/44006#issuecomment-1831624859

   > It's unfortunate that the check for Spark type versus Parquet type happens in `ParquetVectorUpdaterFactory` which is after predicate pushdown for row groups. Will similar issue happen for float to double in certain cases?
   
   There's no issue with float to double because we were already strict when deciding whether to build a row group filter: we only accept float values for float columns and double values for double columns, so no overflow is possible.
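   To illustrate, here is a minimal sketch of the kind of strict value-type check this relies on (hypothetical names, not Spark's actual code): a filter value is only accepted for push-down when its runtime type matches the Parquet column's physical floating-point type exactly.
   
   ```scala
   // Hypothetical sketch of a strict value-type check: only push down a
   // filter value whose runtime type matches the Parquet physical type.
   def acceptsFloatingPointValue(parquetType: String, value: Any): Boolean =
     (parquetType, value) match {
       case ("FLOAT", _: java.lang.Float)   => true
       case ("DOUBLE", _: java.lang.Double) => true
       case _                               => false // e.g. a Double against FLOAT is rejected
     }
   ```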
   
   > Hi, @johanl-db . Do you happen to know what causes this? I'm curious if this is Apache Spark 3.5.0-only issue or not.
   
   When creating row group filters, we accept any value and don't check whether it actually fits in the Parquet type. If the read schema is `LONG`, for example, and the Parquet type is `INT32`, then before this change you could pass a filter value that overflows `INT32`. We have stricter type checks in the Parquet readers themselves, but by that point it's too late: the row group may already have been incorrectly skipped, so that check never triggers.
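   As a concrete illustration, here is a self-contained sketch (hypothetical helper names, not Spark's actual code) of how an unchecked `LONG`-to-`INT32` conversion can skip a row group that actually contains matching rows:
   
   ```scala
   object RowGroupOverflowSketch {
     // Hypothetical stand-in for a row group's INT32 column statistics.
     final case class Int32Stats(min: Int, max: Int)
   
     // Naive filter construction: truncates the Long value without checking
     // that it fits in an Int (the pre-fix behavior described above).
     // For a predicate "col < v", the row group can be skipped iff min >= v.
     def canSkipLessThan(stats: Int32Stats, value: Long): Boolean = {
       val truncated = value.toInt // silent overflow: Long.MaxValue.toInt == -1
       stats.min >= truncated
     }
   
     def main(args: Array[String]): Unit = {
       val stats = Int32Stats(min = 0, max = 10) // row group holds values 0..10
   
       // Query: col < Long.MaxValue — every INT32 row matches, so the row
       // group must NOT be skipped. But the value truncates to -1, and the
       // resulting filter "col < -1" wrongly allows skipping the row group.
       println(canSkipLessThan(stats, Long.MaxValue)) // prints: true (wrong!)
   
       // A safe version would first check that the value fits in INT32 and
       // refuse to build the row group filter otherwise.
       val fits = Long.MaxValue >= Int.MinValue && Long.MaxValue <= Int.MaxValue
       println(fits) // prints: false — so the filter should not be pushed down
     }
   }
   ```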
   
   Looking at https://github.com/apache/spark/blame/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L478, this goes back at least to 3.0, and it seems the check wasn't any stricter in earlier versions, so I'd say this behavior was always there.
   

