andygrove opened a new issue, #4344:
URL: https://github.com/apache/datafusion-comet/issues/4344
## Description
`native_datafusion` silently accepts integer-to-decimal Parquet reads where
the requested decimal type cannot represent the integer values in the file.
Spark's vectorized reader rejects these conversions with
`SchemaColumnConvertNotSupportedException` (per
`ParquetVectorUpdaterFactory.getUpdater`) because reading e.g. an INT64 column
into a `DECIMAL(p,s)` whose precision is below the integer's required precision
is unsafe. `native_datafusion` instead returns wrong (truncated/overflowed)
values.
This is the integer-to-decimal counterpart to #4297 (primitive-to-primitive
numeric/date conversions) and #4343 (decimal-to-decimal narrowing).
## Affected tests (Spark 4.1.1, `dev/diffs/4.1.1.diff`)
Currently tagged with `IgnoreCometNativeDataFusion`, pointing at the umbrella
issue #3720:
- `ParquetTypeWideningSuite` — `unsupported parquet conversion $fromType -> $toType`
  (the second occurrence in the suite, the integer→decimal block at line ~264).
  Iterates pairs such as:
  - `ByteType -> DECIMAL(1, 0)`
  - `ShortType -> DECIMAL(ByteDecimal.precision, 0)` / `DECIMAL(ByteDecimal.precision + 1, 1)` etc.
  - `IntegerType -> ShortDecimal` / `DECIMAL(IntDecimal.precision - 1, 0)` etc.
  - `LongType -> IntDecimal` / `DECIMAL(LongDecimal.precision - 1, 0)` etc.

  Expects `SchemaColumnConvertNotSupportedException` when the vectorized reader
  is enabled and the target decimal precision is too small to hold the integer.
The same tests exist in the 3.4 / 3.5 / 4.0 diffs and are ignored under
#3720 there as well.
## Reproduction
```scala
import org.apache.comet.CometConf
import org.apache.spark.sql.internal.SQLConf

withSQLConf(
  CometConf.COMET_NATIVE_SCAN_IMPL.key -> CometConf.SCAN_NATIVE_DATAFUSION,
  SQLConf.USE_V1_SOURCE_LIST.key -> "parquet") {
  withTempPath { dir =>
    val path = dir.getCanonicalPath
    Seq(123456L).toDF("c")
      .selectExpr("cast(c as bigint) as c")
      .write.parquet(path)
    // LongType is INT64 in Parquet; a target DECIMAL(p, 0) with p < 19 cannot
    // represent every Long, so Spark rejects it. native_datafusion accepts it.
    spark.read.schema("c decimal(5, 0)").parquet(path).show()
  }
}
```
## Suggested approach
Same direction as #4297 / #4343: extend the integer→decimal branch of the
schema adapter / `replace_with_spark_cast` to mirror Spark's allowlist — only
accept conversions where the target decimal precision is large enough to hold
the integer's range (and scale is 0, or handled per Spark's rules). Reject
everything else with `SparkError::ParquetSchemaConvert`.
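As a rough illustration of that allowlist (the function name, error type, and wiring into the schema adapter are assumptions for the sketch, not the actual `replace_with_spark_cast` API), the check could look something like this:

```rust
use arrow::datatypes::DataType;

/// Stand-in for the real error; Comet would raise SparkError::ParquetSchemaConvert.
#[derive(Debug)]
struct SchemaConvertError(String);

/// Accept an integer -> DECIMAL(precision, scale) read only when every value of
/// the source type fits in the target's integral digits; reject everything else.
fn check_int_to_decimal(
    source: &DataType,
    precision: u8,
    scale: i8,
) -> Result<(), SchemaConvertError> {
    // Decimal digits needed to represent the full range of the source type.
    let digits_needed: i32 = match source {
        DataType::Int8 => 3,
        DataType::Int16 => 5,
        DataType::Int32 => 10,
        DataType::Int64 => 19,
        other => {
            return Err(SchemaConvertError(format!(
                "not an integer source type: {other:?}"
            )))
        }
    };
    // Integral digits actually available in the target decimal.
    let integral_digits = precision as i32 - scale.max(0) as i32;
    if scale >= 0 && integral_digits >= digits_needed {
        Ok(())
    } else {
        Err(SchemaConvertError(format!(
            "Parquet {source:?} cannot be read as DECIMAL({precision}, {scale})"
        )))
    }
}
```

Hooked into the integer→decimal branch of the adapter, a failed check would surface as the rejection the `ParquetTypeWideningSuite` cases above expect.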
## Parent issue
Split from umbrella #3720.