[ https://issues.apache.org/jira/browse/ARROW-2057?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16831328#comment-16831328 ]

Even Oldridge commented on ARROW-2057:
--------------------------------------

RAPIDS.AI has recently implemented a Parquet reader that loads data directly to 
the GPU. According to its developers, the optimal page size for GPUs is much 
smaller than the default of 1 MiB and should be set closer to 256 KiB. My 
current workflow uses pyarrow for the Parquet write, and I'd love to be able to 
specify this.

> [Python] Configure size of data pages in pyarrow.parquet.write_table
> --------------------------------------------------------------------
>
>                 Key: ARROW-2057
>                 URL: https://issues.apache.org/jira/browse/ARROW-2057
>             Project: Apache Arrow
>          Issue Type: Improvement
>          Components: Python
>            Reporter: Wes McKinney
>            Assignee: Uwe L. Korn
>            Priority: Major
>              Labels: beginner, parquet
>             Fix For: 0.14.0
>
>
> It would be useful to be able to set the size of data pages (within Parquet 
> column chunks) from Python. The current default is set to 1MiB at 
> https://github.com/apache/parquet-cpp/blob/0875e43010af485e1c0b506d77d7e0edc80c66cc/src/parquet/properties.h#L81.
>  It might be useful in some situations to lower this for more granular access.
> We should provide this value as a parameter to 
> {{pyarrow.parquet.write_table}}.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
