Raunaq Morarka created PARQUET-2352:
---------------------------------------
Summary: Update parquet format spec to allow truncation of row
group min/max stats
Key: PARQUET-2352
URL: https://issues.apache.org/jira/browse/PARQUET-2352
Project: Parquet
Issue Type: Improvement
Reporter: Raunaq Morarka
Column index stats are explicitly allowed to be truncated:
[https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L958]
Row group min/max stats, however, do not appear to be allowed to be truncated
[https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L219],
although the spec does not state this as explicitly as it does for the column
index. This forces implementations either to drop min/max row group stats for
columns with long strings, missing opportunities to filter row groups, or to
seemingly deviate from the spec by truncating those stats.
https://issues.apache.org/jira/browse/PARQUET-1685 introduced a feature in
parquet-mr that allows users to deviate from the spec and configure truncation
of min/max row group stats. Unfortunately, readers have no way to detect
whether truncation took place.
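To illustrate why truncated bounds remain usable for filtering but stop being
actual column values, here is a minimal sketch of byte-wise truncation. It is
not the parquet-mr implementation; the function names `truncate_min` and
`truncate_max` are hypothetical, and the sketch assumes unsigned lexicographic
byte ordering and ignores UTF-8 character boundaries.

```python
def truncate_min(value: bytes, length: int) -> bytes:
    """Lower bound: a prefix never sorts after the value it was cut from."""
    return value[:length]


def truncate_max(value: bytes, length: int) -> bytes:
    """Upper bound: cut to `length` bytes, then increment the last byte
    that is not 0xFF so the result sorts at or after the original value."""
    if len(value) <= length:
        return value  # nothing to truncate
    prefix = bytearray(value[:length])
    # Trailing 0xFF bytes cannot be incremented, so drop them.
    while prefix and prefix[-1] == 0xFF:
        prefix.pop()
    if not prefix:
        return value  # no shorter upper bound exists; keep the original
    prefix[-1] += 1
    return bytes(prefix)


original = b"apple pie"
lower = truncate_min(original, 4)   # b"appl"
upper = truncate_max(original, 4)   # b"appm"
# The truncated pair still brackets the original, so row group filtering
# stays correct, but neither bound is a value that occurs in the column.
assert lower <= original <= upper
```

A reader that returned `upper` as the answer to a MAX aggregation would report
a string that is not in the data, which is why truncation must be detectable
before aggregation pushdown is safe.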
Since truncation is possible and cannot be detected explicitly, attempts to
push down min/max aggregation to parquet have avoided implementing it for
string columns (e.g.
https://issues.apache.org/jira/browse/SPARK-36645).
Given the above situation, the spec should be updated to allow truncation of
min/max row group stats. This would align the spec with the current reality
that min/max row group stats for string columns may already be truncated.
Additionally, a flag could be added to the stats to specify whether min/max
stats are truncated. By checking the value of this flag, reader implementations
could then safely implement min/max aggregation pushdown for string columns on
newly written data. When the flag is absent, as it would be on existing data,
readers would have to assume that the stats may be truncated.
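The reader-side check described above might look roughly like the following
sketch. Everything here is an assumption for illustration: the `RowGroupStats`
class, its `truncated` field, and the helper names are hypothetical and do not
correspond to any existing Parquet API or spec field.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class RowGroupStats:
    min_value: bytes
    max_value: bytes
    # Hypothetical flag: True = truncated, False = exact,
    # None = flag absent (files written before the flag existed).
    truncated: Optional[bool] = None


def can_push_down(stats: RowGroupStats) -> bool:
    """Only stats explicitly marked as not truncated are safe to use."""
    return stats.truncated is False


def pushed_down_min(row_groups: Sequence[RowGroupStats]) -> Optional[bytes]:
    """Return the column minimum from stats alone, or None when the reader
    has to fall back to scanning the data."""
    if not row_groups or not all(can_push_down(s) for s in row_groups):
        return None
    return min(s.min_value for s in row_groups)


exact = [
    RowGroupStats(b"b", b"d", truncated=False),
    RowGroupStats(b"a", b"c", truncated=False),
]
legacy = RowGroupStats(b"a", b"z")  # flag absent, so treated as unsafe

assert pushed_down_min(exact) == b"a"
assert pushed_down_min(exact + [legacy]) is None
```

Treating an absent flag the same as "truncated" keeps old files correct at the
cost of a scan, matching the conservative default proposed above.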