[
https://issues.apache.org/jira/browse/PARQUET-2352?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17789061#comment-17789061
]
ASF GitHub Bot commented on PARQUET-2352:
-----------------------------------------
mapleFU commented on PR #216:
URL: https://github.com/apache/parquet-format/pull/216#issuecomment-1824221453
So it seems that `PageHeader` might also have to write these two statistics
(though if PageIndex is enabled, we might not write page statistics at all)?
> Update parquet format spec to allow truncation of row group min/max stats
> -------------------------------------------------------------------------
>
> Key: PARQUET-2352
> URL: https://issues.apache.org/jira/browse/PARQUET-2352
> Project: Parquet
> Issue Type: Improvement
> Reporter: Raunaq Morarka
> Assignee: Raunaq Morarka
> Priority: Major
> Fix For: format-2.10.0
>
>
> Column index stats are explicitly allowed to be truncated:
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L958]
> However, it seems that row group min/max stats are not allowed to be truncated
> [https://github.com/apache/parquet-format/blob/master/src/main/thrift/parquet.thrift#L219],
> although this is not stated explicitly as it is in the column index case. This
> forces implementations to either drop min/max row group stats for columns
> with long strings, missing opportunities to filter row groups, or
> seemingly deviate from the spec by truncating min/max row group stats.
> https://issues.apache.org/jira/browse/PARQUET-1685 introduced a feature to
> parquet-mr which allows users to deviate from spec and configure truncation
> of min/max row group stats. Unfortunately, there is no way for readers to
> detect whether truncation took place.
> Since truncation is possible and cannot be detected explicitly, attempts to
> push down min/max aggregation to Parquet have avoided implementing it for
> string columns (e.g.
> https://issues.apache.org/jira/browse/SPARK-36645).
> Given the above situation, the spec should be updated to allow truncation of
> min/max row group stats. This would align the spec with the current reality
> that string column min/max row group stats may be truncated.
> Additionally, a flag could be added to the stats to specify whether min/max
> stats are truncated. Reader implementations could then safely implement
> min/max aggregation pushdown for strings on new data going forward by
> checking the value of this flag. When the flag is absent, as on existing
> data, readers should assume that the stats may be truncated.
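
To make the proposal above concrete, here is a minimal Python sketch of how a writer might truncate string min/max stats while keeping them valid bounds, and how a reader might consult a truncation flag before pushing down a min/max aggregation. Everything here is an assumption for illustration: `ColumnStats`, `is_truncated`, `truncate_min`, `truncate_max`, and `can_push_down_min_max` are hypothetical names, not part of the Parquet format or any Parquet library, and the truncation strategy shown is just one common approach.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ColumnStats:
    """Hypothetical stand-in for the row group stats of one string column."""
    min_value: bytes
    max_value: bytes
    # Hypothetical flag from the proposal; None models existing data
    # that was written before any such flag existed.
    is_truncated: Optional[bool] = None


def truncate_min(value: bytes, length: int) -> bytes:
    """A prefix never sorts after the full value, so it is still a lower bound."""
    return value[:length]


def truncate_max(value: bytes, length: int) -> Optional[bytes]:
    """Shorten value to at most `length` bytes while keeping an upper bound.

    The prefix alone would sort before the full value, so the last byte
    that is not 0xFF is incremented. Returns None if no such bound exists.
    """
    if len(value) <= length:
        return value
    prefix = bytearray(value[:length])
    while prefix:
        if prefix[-1] != 0xFF:
            prefix[-1] += 1
            return bytes(prefix)
        prefix.pop()  # trailing 0xFF cannot be incremented; drop it
    return None


def can_push_down_min_max(stats: ColumnStats) -> bool:
    """Only trust min/max as exact values when the flag explicitly says so.

    A missing flag (None) is treated like truncated data, as the issue suggests.
    """
    return stats.is_truncated is False
```

With this sketch, truncated stats remain usable for filtering row groups, since they still bound the real values, but a reader would answer `MIN`/`MAX` queries from stats only when `can_push_down_min_max` returns `True`.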