Re: [DISCUSS] INT96 stats

Micah Kornfield Fri, 13 Jun 2025 09:55:51 -0700

Hi Alkis,

I don't think this is just an implementation detail: the spec currently
explicitly states that the INT96 sort order is undefined [1].


Despite this, a quick scan of the C++ implementation seems to indicate that
it might be reading/writing stats for INT96 (again, I only looked quickly,
but I couldn't find guards against it).  There is an old bug [2] on
correcting comparisons (I didn't look closely to see whether these align
with the proposed parquet-java changes), so there is a chance that files in
the wild written by C++ have incorrect statistics.
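
To make the concern concrete, here is a minimal sketch of how a byte-wise
comparison and a logical comparison can disagree on min/max. It assumes the
layout conventionally used by Impala and Spark for INT96 timestamps (8 bytes
of little-endian nanoseconds-of-day followed by 4 bytes of little-endian
Julian day). The helper names encode_int96 and int96_sort_key are
hypothetical, chosen for illustration only; this is not the code from
parquet-java, Photon, or the C++ implementation.

```python
import struct

def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    """Hypothetical helper: pack a timestamp into the assumed 12-byte layout."""
    # Assumed layout: little-endian int64 nanos-of-day, then little-endian int32 Julian day.
    return struct.pack("<qi", nanos_of_day, julian_day)

def int96_sort_key(raw: bytes) -> tuple:
    """Decode 12 bytes into (julian_day, nanos_of_day) so tuples sort chronologically."""
    if len(raw) != 12:
        raise ValueError("INT96 values are exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)

# Two values one nanosecond apart across midnight (arbitrary Julian day numbers).
earlier = encode_int96(2460839, 86_399_999_999_999)  # last nanosecond of a day
later = encode_int96(2460840, 0)                      # first nanosecond of the next day

# Plain byte-wise comparison looks at the low-order nanos bytes first,
# so it picks the chronologically earlier value as the maximum.
bytewise_max = max([earlier, later])

# Comparing on the decoded (day, nanos) pair gives the chronological maximum.
logical_max = max([earlier, later], key=int96_sort_key)
```

Under these assumptions, bytewise_max ends up being the earlier timestamp
while logical_max is the later one, which is the kind of mismatch that would
make min/max statistics unsafe to use for filtering.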

Cheers,
Micah

[1]
https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079
[2] https://github.com/apache/parquet-cpp/pull/399/files


On Fri, Jun 13, 2025 at 6:24 AM Alkis Evlogimenos
<[email protected]> wrote:

> Hi folks,
>
> While INT96 is now deprecated, it's still the default timestamp type in
> Spark, resulting in a significant amount of existing data written in this
> format.
>
> Historically, parquet-mr/java has not emitted or read statistics for INT96.
> This was likely due to the fact that standard byte comparison on the INT96
> representation doesn't align with logical comparisons, potentially leading
> to incorrect min/max values. This is unfortunate because timestamp filters
> are extremely common and lack of stats limits optimization opportunities.
>
> Since its inception, Photon <https://www.databricks.com/product/photon> has
> emitted and utilized INT96 statistics by employing a logical comparator,
> ensuring their correctness. We have now implemented
> <https://github.com/apache/parquet-java/pull/3243> the same support within
> parquet-java.
>
> We'd like to get the community's thoughts on this addition. We anticipate
> that most users may not be directly affected due to the declining use of
> INT96. However, we are interested in identifying any potential drawbacks or
> unforeseen issues with this approach.
>
> Cheers
>
