Hi Alkis,

I don't think this is just an implementation detail: the spec currently states explicitly that the int96 sort order is undefined [1].
Despite this, a quick scan of the C++ seems to indicate it might be reading/writing stats for int96 (again only looked quickly but I couldn't find guards against it). There is an old bug [2] on correcting comparisons (I didn't look closely to see if these align with the Parquet-java changes proposed), so there is a chance that files in the wild written by C++ might have incorrect statistics.

Cheers,
Micah

[1] https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L1079
[2] https://github.com/apache/parquet-cpp/pull/399/files

On Fri, Jun 13, 2025 at 6:24 AM Alkis Evlogimenos <[email protected]> wrote:

> Hi folks,
>
> While INT96 is now deprecated, it's still the default timestamp type in
> Spark, resulting in a significant amount of existing data written in this
> format.
>
> Historically, parquet-mr/java has not emitted or read statistics for INT96.
> This was likely due to the fact that standard byte comparison on the INT96
> representation doesn't align with logical comparisons, potentially leading
> to incorrect min/max values. This is unfortunate because timestamp filters
> are extremely common and lack of stats limits optimization opportunities.
>
> Since its inception Photon <https://www.databricks.com/product/photon>
> emitted and utilized INT96 statistics by employing a logical comparator,
> ensuring their correctness. We have now implemented
> <https://github.com/apache/parquet-java/pull/3243> the same support within
> parquet-java.
>
> We'd like to get the community's thoughts on this addition. We anticipate
> that most users may not be directly affected due to the declining use of
> INT96. However, we are interested in identifying any potential drawbacks or
> unforeseen issues with this approach.
>
> Cheers
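
For anyone following along, here is a minimal sketch of why plain byte comparison can give wrong min/max values for INT96 timestamps. It assumes the commonly used layout (8 bytes of nanoseconds-of-day followed by 4 bytes of Julian day, both little-endian). The helper names `encode_int96` and `logical_key` are hypothetical and are not taken from parquet-java, parquet-cpp, or Photon; this is only an illustration, not the comparator from either pull request.

```python
import struct

def encode_int96(julian_day: int, nanos_of_day: int) -> bytes:
    # Assumed layout: 8-byte little-endian nanos-of-day, then
    # 4-byte little-endian Julian day (12 bytes total).
    return struct.pack("<qi", nanos_of_day, julian_day)

def logical_key(raw: bytes):
    # Decode and order by day first, then by nanos within the day.
    nanos_of_day, julian_day = struct.unpack("<qi", raw)
    return (julian_day, nanos_of_day)

# Two timestamps: "later" is on the following day but has fewer nanos.
earlier = encode_int96(2459999, 10)
later = encode_int96(2460000, 5)
values = [earlier, later]

# Lexicographic byte comparison looks at the nanos bytes first,
# so it picks the later timestamp as the minimum.
bytewise_min = min(values)

# Comparing on (day, nanos) gives the chronologically correct minimum.
logical_min = min(values, key=logical_key)
```

In this sketch `bytewise_min` is the later timestamp while `logical_min` is the earlier one, which is the kind of mismatch that would make byte-ordered statistics unsafe for filtering.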
