[ 
https://issues.apache.org/jira/browse/PARQUET-1685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16961165#comment-16961165
 ] 

Xinli Shang commented on PARQUET-1685:
--------------------------------------

Hi [~gszadovszky], thanks for your reply!

Regarding "an implementation might rely on the fact that the min/max values are 
actual values", was this already discussed earlier, when statistics truncation 
was implemented for the 'column index'? I would like to add [~rdblue], who may 
already have discussed and thought about this, since truncation is implemented 
in Iceberg.

For the 4k hard limit, I am thinking about it from the other direction. If 
statistics are written empty because they are oversized, queries become less 
efficient. If truncating can reduce the size, and so reduce how often 
statistics are written empty, then it is a big win.
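
To make the truncation being discussed concrete, here is a minimal sketch of 
how a string min/max pair could be shortened while staying valid as bounds. It 
is loosely modeled on the general idea only; it is not the Iceberg or 
parquet-mr code. The class name StatTruncation, the method names, and the 
maxCodePoints parameter are all hypothetical.

```java
public final class StatTruncation {
    private StatTruncation() {}

    /**
     * Lower bound: keep only the first maxCodePoints code points.
     * A prefix never sorts after the full value, so it is still a valid min.
     * Assumes maxCodePoints >= 0 (hypothetical parameter).
     */
    public static String truncateMin(String value, int maxCodePoints) {
        if (value.codePointCount(0, value.length()) <= maxCodePoints) {
            return value; // already short enough, keep the actual value
        }
        return value.substring(0, value.offsetByCodePoints(0, maxCodePoints));
    }

    /**
     * Upper bound: truncate, then increment the last code point that can be
     * incremented so the result sorts after every string sharing the prefix.
     * Returns null when no such bound exists (caller would store no max).
     */
    public static String truncateMax(String value, int maxCodePoints) {
        if (value.codePointCount(0, value.length()) <= maxCodePoints) {
            return value; // already short enough, keep the actual value
        }
        StringBuilder prefix = new StringBuilder(
                value.substring(0, value.offsetByCodePoints(0, maxCodePoints)));
        // Walk backwards until a code point can be bumped by one.
        while (prefix.length() > 0) {
            int last = prefix.codePointBefore(prefix.length());
            prefix.setLength(prefix.length() - Character.charCount(last));
            int next = last + 1;
            // Skip the surrogate range, which holds no valid code points.
            if (next >= Character.MIN_SURROGATE && next <= Character.MAX_SURROGATE) {
                next = Character.MAX_SURROGATE + 1;
            }
            if (next <= Character.MAX_CODE_POINT) {
                return prefix.appendCodePoint(next).toString();
            }
            // Otherwise this code point was already the maximum: drop it and retry.
        }
        return null;
    }
}
```

With a limit of 4, a max of "parquet-statistics" would be stored as "parr", 
which never occurs in the column. That is exactly the case where a reader 
relying on min/max being actual values would be affected.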

In 1.11.0+, is using the 'column index' enforced, with page statistics no 
longer written?

> Truncate the stored min and max for String statistics to reduce the footer 
> size 
> --------------------------------------------------------------------------------
>
>                 Key: PARQUET-1685
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1685
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>    Affects Versions: 1.10.1
>            Reporter: Xinli Shang
>            Assignee: Xinli Shang
>            Priority: Major
>             Fix For: 1.12.0
>
>
> Iceberg has a cool feature that truncates the stored min, max statistics to 
> minimize the metadata size. We can borrow this approach and truncate them in 
> Parquet as well, to reduce the size of the footer, or even the page header. 
> Here is the code in Iceberg: 
> [https://github.com/apache/incubator-iceberg/blob/master/api/src/main/java/org/apache/iceberg/util/UnicodeUtil.java].



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
