Looking more closely, the C++ code is quite old (not even in the Arrow repo),
and the current code [1] looks like it matches the spec.

[1]
https://github.com/apache/arrow/blob/2ba455f17e7cdbfe2b2f1aa3dfb2bf00878a17e1/cpp/src/parquet/types.cc#L302

On Fri, Jun 13, 2025 at 9:55 AM Micah Kornfield <[email protected]>
wrote:

> Hi Alkis,
>
> I don't think this is just an implementation detail: the spec currently
> explicitly states that the int96 sort order is undefined [1].
>
> Despite this, a quick scan of the C++ seems to indicate it might be
> reading/writing stats for int96 (again only looked quickly but I couldn't
> find guards against it).  There is an old bug [2] on correcting comparisons
> (I didn't look closely to see if these align with the Parquet-java changes
> proposed), so there is a chance that files in the wild written by C++ might
> have incorrect statistics.
>
> Cheers,
> Micah
>
> [1]
> https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079
> [2] https://github.com/apache/parquet-cpp/pull/399/files
>
>
> On Fri, Jun 13, 2025 at 6:24 AM Alkis Evlogimenos
> <[email protected]> wrote:
>
>> Hi folks,
>>
>> While INT96 is now deprecated, it's still the default timestamp type in
>> Spark, resulting in a significant amount of existing data written in this
>> format.
>>
>> Historically, parquet-mr/java has not emitted or read statistics for INT96.
>> This was likely due to the fact that standard byte comparison on the INT96
>> representation doesn't align with logical comparisons, potentially leading
>> to incorrect min/max values. This is unfortunate because timestamp filters
>> are extremely common and lack of stats limits optimization opportunities.
>>
>> Since its inception, Photon <https://www.databricks.com/product/photon> has
>> emitted and utilized INT96 statistics by employing a logical comparator,
>> ensuring their correctness. We have now implemented
>> <https://github.com/apache/parquet-java/pull/3243> the same support within
>> parquet-java.
>>
>> We'd like to get the community's thoughts on this addition. We anticipate
>> that most users may not be directly affected due to the declining use of
>> INT96. However, we are interested in identifying any potential drawbacks or
>> unforeseen issues with this approach.
>>
>> Cheers
>>
>
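
To make the byte-comparison point from the quoted message concrete, here is a
minimal sketch in Python. It assumes the commonly used INT96 timestamp layout
(8 bytes of nanoseconds within the day, then 4 bytes of Julian day number, both
little-endian, with the day treated as signed). The function names are
hypothetical and this is not the code from the parquet-java PR or from Photon;
it only illustrates why a logical comparator and a raw byte comparison can
disagree on min/max.

```python
import struct

# Assumed layout: int64 nanos-of-day followed by int32 Julian day,
# both little-endian, 12 bytes total with no padding.
INT96_LAYOUT = "<qi"


def encode_int96(julian_day, nanos_of_day):
    """Pack a (Julian day, nanos-of-day) pair into 12 INT96 bytes."""
    return struct.pack(INT96_LAYOUT, nanos_of_day, julian_day)


def int96_sort_key(raw):
    """Return a key that orders INT96 values chronologically.

    The day is the most significant part, so it comes first in the key.
    """
    if len(raw) != 12:
        raise ValueError("INT96 values must be exactly 12 bytes")
    nanos_of_day, julian_day = struct.unpack(INT96_LAYOUT, raw)
    return (julian_day, nanos_of_day)


def int96_min_max(values):
    """Compute (min, max) statistics using the logical ordering."""
    values = list(values)
    if not values:
        raise ValueError("cannot compute statistics for an empty column")
    return (min(values, key=int96_sort_key), max(values, key=int96_sort_key))


# Two timestamps: 'earlier' is 1 ns into one day, 'later' is midnight of
# the next day. The first byte of 'earlier' is 0x01 and of 'later' is 0x00.
earlier = encode_int96(2460840, 1)
later = encode_int96(2460841, 0)

logical_min, logical_max = int96_min_max([later, earlier])
bytewise_min = min([earlier, later])  # plain lexicographic byte comparison
```

With these two values the logical minimum is `earlier`, while the plain byte
comparison picks `later` as the minimum, because it looks at the low-order
nanosecond byte first. Statistics produced that way could cause a reader to
skip row groups that actually match a timestamp filter.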
