[ 
https://issues.apache.org/jira/browse/ARROW-15855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17502410#comment-17502410
 ] 

Alenka Frim commented on ARROW-15855:
-------------------------------------

That is correct: these parameters are not exposed in Python, and it would be 
good if they were. Thank you for reporting, [~xzeng].

> [Python] Add dictionary_pagesize_limit to Parquet writer
> --------------------------------------------------------
>
>                 Key: ARROW-15855
>                 URL: https://issues.apache.org/jira/browse/ARROW-15855
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Parquet, Python
>            Reporter: Xinyu Zeng
>            Priority: Major
>             Fix For: 8.0.0
>
>
> Although the Python Parquet API is a wrapper around C++, some tuning 
> knobs are not exposed in Python, for example dictionary_pagesize_limit_. The 
> dictionary page size can easily exceed the limit when one or more of the 
> following hold: 1. The row_group_size is relatively large, e.g. the default 
> is 64M. 2. The size per entry is large, e.g. a large string column. 3. The 
> data is not very repetitive. If this parameter cannot be tuned, dictionary 
> encoding may not be fully utilized. In C++, 
> however, this parameter can be tuned to an optimal setting.
>  
> Other parameters are also not exposed in Python, for example 
> max_statistics_size.
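
To make the three conditions in the description concrete, here is a rough, pure-Python sketch that estimates how large a string column's dictionary page could get for one row group. It rests on assumptions that are not stated in the issue: that dictionary entries are stored with a 4-byte length prefix plus the UTF-8 bytes of each distinct value, and that the limit is 1 MiB. The helper names are hypothetical and are not part of any Arrow API.

```python
# Sketch only: estimate the dictionary page size for one row group of strings.
# Assumption: each distinct value costs a 4-byte length prefix plus its UTF-8 bytes.

ASSUMED_LIMIT_BYTES = 1024 * 1024  # assumed 1 MiB limit, not taken from the issue


def estimated_dictionary_bytes(values):
    """Return the estimated dictionary size in bytes for the given values."""
    distinct = set(values)
    return sum(4 + len(v.encode("utf-8")) for v in distinct)


def exceeds_limit(values, limit=ASSUMED_LIMIT_BYTES):
    """Return True if the estimated dictionary would not fit under the limit."""
    return estimated_dictionary_bytes(values) > limit


# Highly repetitive short strings: the dictionary stays tiny.
repetitive = ["red", "green", "blue"] * 100_000

# Large, all-unique strings: the dictionary grows with the row group.
unique_large = [f"{i:06d}" + "x" * 250 for i in range(10_000)]

print(estimated_dictionary_bytes(repetitive), exceeds_limit(repetitive))
print(estimated_dictionary_bytes(unique_large), exceeds_limit(unique_large))
```

Under these assumptions, only the second column crosses the limit, which is the situation where being able to raise the limit from Python would matter.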



--
This message was sent by Atlassian Jira
(v8.20.1#820001)
