mbutrovich opened a new issue, #7220: URL: https://github.com/apache/arrow-rs/issues/7220
**Is your feature request related to a problem or challenge? Please describe what you are trying to do.**

We are adapting [DataFusion Comet](https://github.com/apache/datafusion-comet) (a Spark accelerator) to use DataFusion's native Parquet reader, which is backed by arrow-rs. Spark _still_ defaults to writing timestamps in Parquet as INT96 (following Hive, Impala, and other systems), which most systems infer as a timestamp despite the Parquet spec having a separate timestamp type. arrow-rs [converts INT96 to `Timestamp(TimeUnit::Nanosecond, None)`](https://github.com/apache/arrow-rs/blob/88eaa33ea5c959c4f129ad1b3d292d9bab1ba670/parquet/src/arrow/schema/primitive.rs#L104). At nanosecond precision, that data type cannot represent the same range of dates that Spark originally wrote to the file.

**Describe the solution you'd like**

An opt-in option that lets each INT96 value pass through as its unmodified bytes, perhaps as `FixedSizeBinary(12)`.

**Describe alternatives you've considered**

- An option to choose the precision used when inferring INT96 as a timestamp. For example, Spark uses microsecond precision, so reading as `Timestamp(TimeUnit::Microsecond, None)` would support a larger range. However, I do not think it is reasonable to push Spark-specific options into arrow-rs.
- An option to read INT96 as a struct of `Time64` and `Date32` Arrow types, which is essentially what an INT96 timestamp represents. However, I take the same issue with this as with the previous point.

**Additional context**

- Please see https://github.com/apache/datafusion/issues/7958 for relevant discussion from 2023.
- Interpreting INT96 as a timestamp is fraught with peril. It depends on the [Spark config](https://spark.apache.org/docs/latest/configuration.html) and the [Spark version](https://kontext.tech/article/1062/spark-2x-to-3x-date-timestamp-and-int96-rebase-modes), and there still seems to be debate about whether arithmetic during conversion should wrap on overflow.
- DataFusion's `SchemaAdapter` gives us a lot of control over how to adjust data coming out of its Parquet reader. However, because this lossy conversion to an Arrow type happens in arrow-rs, it is too late for us to fix it in a custom `SchemaAdapter`. If we implement this feature, we will be able to handle all of the Spark-specific configs and version quirks in a `SchemaAdapter`.
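To illustrate what a downstream `SchemaAdapter` could do with the proposed pass-through, here is a minimal sketch of decoding the 12 raw INT96 bytes (as they would appear in a `FixedSizeBinary(12)` value) into microseconds since the Unix epoch, the precision Spark uses. The function name, layout handling, and checked-overflow policy are assumptions for illustration, not arrow-rs API: an INT96 timestamp stores a little-endian nanoseconds-of-day in the first 8 bytes and a little-endian Julian day number in the last 4.

```rust
/// Julian day number of the Unix epoch (1970-01-01).
const JULIAN_DAY_OF_EPOCH: i64 = 2_440_588;
const MICROS_PER_DAY: i64 = 86_400 * 1_000_000;

/// Hypothetical helper: decode a raw INT96 value into microseconds since
/// the Unix epoch. Bytes 0..8 are little-endian nanoseconds within the
/// day; bytes 8..12 are the little-endian Julian day number. Returns
/// `None` on arithmetic overflow instead of wrapping.
fn int96_to_micros(bytes: &[u8; 12]) -> Option<i64> {
    let nanos_of_day = u64::from_le_bytes(bytes[0..8].try_into().ok()?) as i64;
    let julian_day = u32::from_le_bytes(bytes[8..12].try_into().ok()?) as i64;
    let days = julian_day.checked_sub(JULIAN_DAY_OF_EPOCH)?;
    days.checked_mul(MICROS_PER_DAY)?
        .checked_add(nanos_of_day / 1_000)
}

fn main() {
    // 1970-01-01T00:00:00.000001: Julian day 2440588, 1000 ns of day.
    let mut raw = [0u8; 12];
    raw[0..8].copy_from_slice(&1_000u64.to_le_bytes());
    raw[8..12].copy_from_slice(&2_440_588u32.to_le_bytes());
    assert_eq!(int96_to_micros(&raw), Some(1));
    println!("ok");
}
```

Because microseconds since the epoch fit a much wider range of Julian days into an `i64` than nanoseconds do, a conversion like this (applied after the reader hands back the raw bytes) avoids the overflow that the current eager `Timestamp(TimeUnit::Nanosecond, None)` conversion bakes in.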
