johanl-db commented on PR #44006:
URL: https://github.com/apache/spark/pull/44006#issuecomment-1831624859

> It's unfortunate that the check for Spark type versus Parquet type happens in `ParquetVectorUpdaterFactory`, which is after predicate pushdown for row groups. Will a similar issue happen for float to double in certain cases?

There's no issue with float to double because we were already strict when deciding whether to build a row group filter: we only accept float values for float and double values for double, so no overflow is possible.

> Hi, @johanl-db. Do you happen to know what causes this? I'm curious whether this is an Apache Spark 3.5.0-only issue or not.

When creating row group filters, we accept any value and don't check whether it actually fits in the target type. If the read schema is `LONG`, for example, and the Parquet type is `INT32`, you could pass a value that overflows before this change. We have stricter type checks in the Parquet readers themselves, but by then it's too late: the row group may already have been incorrectly skipped, so that check never triggers.

Looking at https://github.com/apache/spark/blame/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L478, this goes back at least to 3.0, and the check doesn't appear to have been any stricter in earlier versions, so I'd say this behavior was always there.
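To make the overflow concrete, here is a minimal self-contained sketch of the failure mode. The object, method, and stats names are hypothetical, not Spark's actual row group filter code; it only illustrates why narrowing a pushed-down `LONG` predicate value to a column's `INT32` physical type can silently skip a row group that contains matching rows:

```scala
// Sketch (assumed names, not Spark's implementation): read schema is LONG,
// but the Parquet column's physical type is INT32.
object RowGroupFilterOverflow {

  // Hypothetical statistics for one row group of the INT32 column.
  val rowGroupMin: Int = -100
  val rowGroupMax: Int = -1

  // Unsafe: narrow the LONG predicate value to Int before comparing against
  // the stats. For `col >= value`, skip the group when rowGroupMax < value.
  def naiveCanSkipGtEq(value: Long): Boolean =
    rowGroupMax < value.toInt // toInt wraps around on overflow

  // Safe: only narrow when the value is representable as INT32; otherwise
  // compare using LONG arithmetic (or don't build a filter at all).
  def safeCanSkipGtEq(value: Long): Boolean =
    if (value >= Int.MinValue && value <= Int.MaxValue) rowGroupMax < value.toInt
    else rowGroupMax.toLong < value

  def main(args: Array[String]): Unit = {
    // Predicate: col >= -4294967296L (-2^32). Every INT32 value satisfies it,
    // so the row group must be read. But (-4294967296L).toInt wraps to 0.
    val v = -4294967296L
    println(s"naive skip: ${naiveCanSkipGtEq(v)}") // true  -> group wrongly skipped
    println(s"safe skip:  ${safeCanSkipGtEq(v)}")  // false -> group correctly read
  }
}
```

Once the group is skipped, the stricter per-row checks in the reader never run, which matches the behavior described above.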
> It's unfortunate that the check for Spark type versus Parquet type happens in `ParquetVectorUpdaterFactory` which is after predicate pushdown for row groups. Will similar issue happen for float to double in certain cases? There's no issue with float to double because we were already strict when deciding whether to build a row group filter: we only accept float values for float and double values for double so no overflow possible. > Hi, @johanl-db . Do you happen to know what causes this? I'm curious if this is Apache Spark 3.5.0-only issue or not. When creating row group filters, we accept any value and don't check if the value actually fits in the target type. If the read schema is `LONG` for example and the parquet type is `INT32`, you could pass a value that will overflow before this change. We have stricter type checks in the Parquet readers themselves, but by that time it's too late as the row group may already be incorrectly skipped and that check won't trigger. Looking at https://github.com/apache/spark/blame/branch-3.0/sql/core/src/main/scala/org/apache/spark/sql/execution/datasources/parquet/ParquetFilters.scala#L478, this goes back at least to 3.0 and it seems the check wasn't even less strict in earlier versions so I'd say this behavior was always there. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org --------------------------------------------------------------------- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org