johanl-db commented on PR #44006:
URL: https://github.com/apache/spark/pull/44006#issuecomment-1831624859

> It's unfortunate that the check for Spark type versus Parquet type happens in `ParquetVectorUpdaterFactory`, which is after predicate pushdown for row groups. Will a similar issue happen for float to double in certain cases?

There's no issue with float to double because we were already strict when deciding whether to build a row group filter: we only accept float values for float and double values for double, so no overflow is possible.

> Hi, @johanl-db. Do you happen to know what causes this? I'm curious whether this is an Apache Spark 3.5.0-only issue or not.

When creating row group filters, we accept any value and don't check whether it actually fits in the target type. If the read schema is `LONG`, for example, and the Parquet type is `INT32`, you could pass a value that overflows before this change. We have stricter type checks in the Parquet readers themselves, but by then it's too late: the row group may already have been incorrectly skipped, so that check never triggers.

Looking at https://github.com/apache/spark/blame/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L478, this goes back at least to 3.0, and the check doesn't appear to have been any stricter in earlier versions, so I'd say this behavior was always there.
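To make the overflow concrete, here is a minimal self-contained sketch of the failure mode. The object, method, and stats names are hypothetical, not Spark's actual row group filter code; it only illustrates why narrowing a pushed-down `LONG` predicate value to a column's `INT32` physical type can silently skip a row group that contains matching rows:

```scala
// Sketch (assumed names, not Spark's implementation): read schema is LONG,
// but the Parquet column's physical type is INT32.
object RowGroupFilterOverflow {

  // Hypothetical statistics for one row group of the INT32 column.
  val rowGroupMin: Int = -100
  val rowGroupMax: Int = -1

  // Unsafe: narrow the LONG predicate value to Int before comparing against
  // the stats. For `col >= value`, skip the group when rowGroupMax < value.
  def naiveCanSkipGtEq(value: Long): Boolean =
    rowGroupMax < value.toInt // toInt wraps around on overflow

  // Safe: only narrow when the value is representable as INT32; otherwise
  // compare using LONG arithmetic (or don't build a filter at all).
  def safeCanSkipGtEq(value: Long): Boolean =
    if (value >= Int.MinValue && value <= Int.MaxValue) rowGroupMax < value.toInt
    else rowGroupMax.toLong < value

  def main(args: Array[String]): Unit = {
    // Predicate: col >= -4294967296L (-2^32). Every INT32 value satisfies it,
    // so the row group must be read. But (-4294967296L).toInt wraps to 0.
    val v = -4294967296L
    println(s"naive skip: ${naiveCanSkipGtEq(v)}") // true  -> group wrongly skipped
    println(s"safe skip:  ${safeCanSkipGtEq(v)}")  // false -> group correctly read
  }
}
```

Once the group is skipped, the stricter per-row checks in the reader never run, which matches the behavior described above.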
> It's unfortunate that the check for Spark type versus Parquet type happens in `ParquetVectorUpdaterFactory` which is after predicate pushdown for row groups. Will similar issue happen for float to double in certain cases? There's no issue with float to double because we were already strict when deciding whether to build a row group filter: we only accept float values for float and double values for double so no overflow possible. > Hi, @johanl-db . Do you happen to know what causes this? I'm curious if this is Apache Spark 3.5.0-only issue or not. When creating row group filters, we accept any value and don't check if the value actually fits in the target type. If the read schema is `LONG` for example and the parquet type is `INT32`, you could pass a value that will overflow before this change. We have stricter type checks in the Parquet readers themselves, but by that time it's too late as the row group may already be incorrectly skipped and that check won't trigger. Looking at https://github.com/apache/spark/blame/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L478, this goes back at least to 3.0 and it seems the check wasn't even less strict in earlier versions so I'd say this behavior was always there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org