andygrove opened a new issue, #4343:
URL: https://github.com/apache/datafusion-comet/issues/4343
## Description
`native_datafusion` silently accepts decimal-to-decimal Parquet reads where
the requested read type narrows the precision or scale below what is needed to
represent the file's values. Spark's vectorized reader rejects these
conversions with `SchemaColumnConvertNotSupportedException` because the file
values cannot be safely represented in the requested type. `native_datafusion`
instead returns wrong (truncated/overflowed) values.
This is the decimal-to-decimal counterpart to #4297 (primitive-to-primitive
numeric/date conversions), and a generalisation of the specific case that was
tracked in #4089.
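For intuition on why these reads are unsafe: a decimal is stored as an unscaled integer plus a scale, and reading at a smaller scale divides away low-order digits. A standalone Rust sketch (illustrative helper, not Comet code) of the lossy rescale:

```rust
// Decimals are stored as an unscaled integer plus a scale: 123.45 at
// scale 2 is the integer 12345. Reading it at a smaller scale divides
// the unscaled value, silently dropping digits -- the wrong result this
// issue describes. (Standalone illustration, not Comet code.)
fn rescale(unscaled: i128, from_scale: u32, to_scale: u32) -> i128 {
    if to_scale >= from_scale {
        unscaled * 10i128.pow(to_scale - from_scale) // lossless widening
    } else {
        unscaled / 10i128.pow(from_scale - to_scale) // truncates!
    }
}

fn main() {
    // DECIMAL(10,2) value 123.45 read as DECIMAL(5,0):
    assert_eq!(rescale(12345, 2, 0), 123); // 0.45 is silently lost
    // Widening the scale instead is lossless: 123.45 -> 123.4500
    assert_eq!(rescale(12345, 2, 4), 1234500);
}
```

This is exactly why Spark's vectorized reader refuses the conversion up front instead of producing a truncated value.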
## Affected tests (Spark 4.1.1, `dev/diffs/4.1.1.diff`)
The following tests are currently tagged `IgnoreCometNativeDataFusion`,
pointing at umbrella issue #3720:
- `ParquetQuerySuite` — `SPARK-34212 Parquet should read decimals correctly`
  Asserts `SchemaColumnConvertNotSupportedException` when reading e.g.
  `DECIMAL(18,2)` as `DECIMAL(3,0)`.
- `ParquetTypeWideningSuite` — `parquet decimal precision change
  Decimal($fromPrecision, 2) -> Decimal($toPrecision, 2)`
  Iterates precision pairs across INT32 / INT64 / FIXED_LEN_BYTE_ARRAY-backed
  decimals; expects an error whenever the vectorized reader is enabled and
  `fromPrecision > toPrecision`.
- `ParquetTypeWideningSuite` — `parquet decimal precision and scale change
  Decimal($fromPrecision, $fromScale) -> Decimal($toPrecision, $toScale)`
  Same idea, but varies both precision and scale.

The same tests exist in the 3.4 / 3.5 / 4.0 diffs and are ignored under
#3720 there as well.
## Reproduction
```scala
import org.apache.comet.CometConf
import org.apache.spark.sql.internal.SQLConf

withSQLConf(
    CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
    SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    Seq(BigDecimal("123.45")).toDF("d")
      .selectExpr("cast(d as decimal(10,2)) as d")
      .write.parquet(path)
    spark.read.schema("d decimal(5,0)").parquet(path).show()
    // Expected: SparkException (SchemaColumnConvertNotSupportedException)
    // Actual: silently returns a wrong/truncated value
  }
}
```
`native_iceberg_compat` correctly throws `SparkException` for this case.
## Suggested approach
Same direction as #4297: extend the allowlist used by
`replace_with_spark_cast` / the decimal branch of the schema adapter so that
decimal-to-decimal coercions match Spark's `ParquetVectorUpdaterFactory` rules
— only accept widening (or equal-scale precision widening) and reject
everything else with `SparkError::ParquetSchemaConvert`.
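A rough sketch of what that check could look like on the native side. The `SparkError::ParquetSchemaConvert` name comes from the suggestion above, but the enum definition, function signature, and the exact widening predicate here are hypothetical; the authoritative rule set is whatever `ParquetVectorUpdaterFactory` accepts:

```rust
// Hypothetical sketch of the decimal branch of the schema adapter check.
// The error enum is illustrative, not Comet's actual type.
#[derive(Debug)]
enum SparkError {
    ParquetSchemaConvert(String),
}

// Accept only conversions that cannot lose values: the scale must not
// shrink and the capacity for integer digits must not shrink. This is the
// mathematical "safe widening" condition; it assumes scale <= precision.
fn check_decimal_coercion(
    from_precision: u8, from_scale: u8,
    to_precision: u8, to_scale: u8,
) -> Result<(), SparkError> {
    let widening = to_scale >= from_scale
        && (to_precision - to_scale) >= (from_precision - from_scale);
    if widening {
        Ok(())
    } else {
        Err(SparkError::ParquetSchemaConvert(format!(
            "cannot read DECIMAL({from_precision},{from_scale}) as \
             DECIMAL({to_precision},{to_scale})"
        )))
    }
}

fn main() {
    // The reproduction above: DECIMAL(10,2) read as DECIMAL(5,0) is rejected.
    assert!(check_decimal_coercion(10, 2, 5, 0).is_err());
    // Equal-scale precision widening stays accepted.
    assert!(check_decimal_coercion(10, 2, 18, 2).is_ok());
}
```

Wired into the schema adapter, a failed check would surface as the `SparkException` the Spark-side tests assert on, matching what `native_iceberg_compat` already does.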
## Parent issue
Split from umbrella #3720 (and #4089, which fixed a single decimal narrowing
case but did not unblock the broader test coverage).
--
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
To unsubscribe, e-mail: [email protected]
For queries about this service, please contact Infrastructure at:
[email protected]