[ 
https://issues.apache.org/jira/browse/ARROW-15855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502202#comment-17502202
 ] 

Antoine Pitrou edited comment on ARROW-15855 at 3/7/22, 11:10 AM:
------------------------------------------------------------------

cc [~jorisvandenbossche] [~alenkaf]



was (Author: pitrou):
cc [~jorisvandenbossche]


> [Python] Add dictionary_pagesize_limit to Parquet writer
> --------------------------------------------------------
>
>                 Key: ARROW-15855
>                 URL: https://issues.apache.org/jira/browse/ARROW-15855
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Parquet, Python
>            Reporter: Xinyu Zeng
>            Priority: Major
>             Fix For: 8.0.0
>
>
> Although the Python Parquet API is a wrapper around the C++ one, some tuning 
> knobs are not exposed in Python, for example dictionary_pagesize_limit_. The 
> dictionary page size can easily exceed the limit when one or more of the 
> following hold: 1. the row_group_size is relatively large, e.g. the default 
> is 64M; 2. the size per entry is large, e.g. a large string column; 3. the 
> data is not very repetitive. If this parameter cannot be tuned, dictionary 
> encoding may not be fully used. In C++, however, the parameter can be tuned 
> to an optimal setting.
>  
> Other parameters are also not exposed in Python, for example 
> max_statistics_size.
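
To illustrate why the three conditions in the description push a dictionary page past its limit, here is a rough back-of-the-envelope sketch in plain Python. It is not pyarrow or parquet-cpp code: the helper names are hypothetical, the 1 MiB default limit is an assumption about the C++ writer, and the size estimate assumes the dictionary page stores each distinct string once with a 4-byte length prefix (PLAIN encoding for BYTE_ARRAY).

```python
# Hypothetical helpers for illustration only; not part of pyarrow or parquet-cpp.

# Assumption: the C++ writer's default dictionary page size limit is 1 MiB.
ASSUMED_DEFAULT_LIMIT = 1024 * 1024


def estimated_dictionary_bytes(values):
    """Estimate the dictionary page size for a string column in one row group.

    Each distinct value is counted once, as a 4-byte length prefix plus its
    UTF-8 payload (assumed PLAIN encoding of BYTE_ARRAY values).
    """
    distinct = set(values)
    return sum(4 + len(v.encode("utf-8")) for v in distinct)


def dictionary_fits(values, limit=ASSUMED_DEFAULT_LIMIT):
    """Return True if the estimated dictionary stays within the limit."""
    return estimated_dictionary_bytes(values) <= limit


# A row group with 100,000 rows: 20,000 distinct 100-character strings,
# each repeated 5 times (large entries, low repetition).
column = [f"{i:0100d}" for i in range(20_000)] * 5

size = estimated_dictionary_bytes(column)
print(size)                                            # estimated bytes
print(dictionary_fits(column))                         # with the assumed default
print(dictionary_fits(column, limit=4 * 1024 * 1024))  # with a raised limit
```

Under these assumptions the estimate exceeds 1 MiB, so a writer would stop using the dictionary for this column, while a 4 MiB limit would keep it in use. That is the kind of tuning the issue asks to expose in Python.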



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
