[ 
https://issues.apache.org/jira/browse/PARQUET-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961187#comment-16961187
 ] 

Ryan Blue commented on PARQUET-1685:
------------------------------------

Looks like Gabor is right. The stats fields used for each column chunk (and 
page) are called min_value and max_value, so we should not truncate them. We 
will have to use the new indexes to add truncation. That's good because we want 
more people to look at the implementation and validate that work anyway.

Maybe we could add a flag for truncating the min and max values, as long as it 
is disabled by default and stored in the file's key-value metadata.
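
For readers following along, here is a minimal Java sketch of the kind of 
truncation under discussion. It is not the parquet-mr or Iceberg code. The 
class name TruncateStats, the method names, and the maxCodePoints parameter 
are all hypothetical. It assumes the column is compared in Unicode code point 
order (which matches unsigned byte order for UTF-8), so a truncated min must 
sort at or before the real min and a truncated max at or after the real max.

```java
// Hypothetical helper, for illustration only.
final class TruncateStats {
    private TruncateStats() {}

    /**
     * Lower bound: keep at most maxCodePoints code points.
     * A prefix never sorts after the original string.
     */
    public static String truncateMin(String value, int maxCodePoints) {
        if (value.codePointCount(0, value.length()) <= maxCodePoints) {
            return value; // already short enough
        }
        return value.substring(0, value.offsetByCodePoints(0, maxCodePoints));
    }

    /**
     * Upper bound: keep at most maxCodePoints code points, then increment
     * the last code point that can be incremented so the result sorts
     * after the original. Returns null when no such bound exists, in which
     * case the caller would have to keep the untruncated value.
     */
    public static String truncateMax(String value, int maxCodePoints) {
        if (value.codePointCount(0, value.length()) <= maxCodePoints) {
            return value; // already short enough
        }
        int[] prefix = value.codePoints().limit(maxCodePoints).toArray();
        for (int i = prefix.length - 1; i >= 0; i--) {
            int next = prefix[i] + 1;
            if (next == Character.MIN_SURROGATE) {
                // Surrogates are not valid code points on their own; skip them.
                next = Character.MAX_SURROGATE + 1;
            }
            if (next <= Character.MAX_CODE_POINT) {
                prefix[i] = next;
                // Drop everything after the incremented position.
                return new String(prefix, 0, i + 1);
            }
        }
        return null; // every kept code point was already the maximum
    }
}
```

With a limit of 3, a min of "abcdef" would be stored as "abc" and a max of 
"abcdef" as "abd". Both are looser bounds than the real values, which is why 
they would not fit fields whose names promise the exact min_value and 
max_value.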

> Truncate the stored min and max for String statistics to reduce the footer 
> size 
> --------------------------------------------------------------------------------
>
>                 Key: PARQUET-1685
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1685
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> Iceberg has a useful feature that truncates the stored min and max statistics 
> to minimize the metadata size. We can borrow the same approach in Parquet to 
> reduce the size of the footer, or even of the page headers. Here is the code 
> in Iceberg: 
> [https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
