[ 
https://issues.apache.org/jira/browse/PARQUET-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Ryan Blue updated PARQUET-321:
------------------------------
    Fix Version/s:     (was: 1.9.1)
                   1.10.0

> Set the HDFS padding default to 8MB
> -----------------------------------
>
>                 Key: PARQUET-321
>                 URL: https://issues.apache.org/jira/browse/PARQUET-321
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>            Priority: Major
>             Fix For: 1.10.0
>
>
> PARQUET-306 added the ability to pad row groups so that they align with HDFS 
> blocks to avoid remote reads. The ParquetFileWriter will now either pad the 
> remaining space in the block or target a row group for the remaining size.
> The padding maximum is the threshold for how much padding will be written. 
> If the space left in the block is under this threshold, it is padded. If it 
> is greater than this threshold, the next row group is sized to fit into the 
> remaining space. The current padding maximum is 0.
> I think we should change the padding maximum to 8MB. My reasoning is this: we 
> want this number to be small enough that it won't prevent the library from 
> writing reasonable row groups, but larger than the minimum size row group we 
> would want to write. 8MB is 1/16th of the 128MB row group default, so I think 
> it is reasonable: we don't want a row group to be smaller than 8MB.
> We also want this to be large enough that a few undersized row groups in a 
> block don't cause a tiny row group to be written in the excess space. 8MB 
> accounts for 4 row groups that are each 2MB under-size.
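
The pad-or-fit decision described above can be sketched roughly as follows. This is only an illustration, not the ParquetFileWriter code: the class and method names are hypothetical, the 32MB row group size is an assumed value chosen to match the "4 row groups that are 2MB under-size" case, and treating a remainder exactly equal to the threshold as "pad" is also an assumption.

```java
// Hypothetical sketch of the padding decision; not the parquet-mr implementation.
public class PaddingSketch {
    static final long MB = 1024L * 1024L;

    // Bytes left before the next HDFS block boundary at stream position pos.
    static long remainingInBlock(long pos, long blockSize) {
        return blockSize - (pos % blockSize);
    }

    // Assumption: a remainder equal to the padding maximum is padded as well.
    static boolean shouldPad(long remaining, long maxPadding) {
        return remaining <= maxPadding;
    }

    // Target size for the next row group: a full row group if the leftover
    // space gets padded, otherwise whatever still fits in the current block.
    static long nextRowGroupTarget(long pos, long blockSize,
                                   long rowGroupSize, long maxPadding) {
        long remaining = remainingInBlock(pos, blockSize);
        if (shouldPad(remaining, maxPadding)) {
            return rowGroupSize;
        }
        return Math.min(remaining, rowGroupSize);
    }

    public static void main(String[] args) {
        long blockSize = 128 * MB;
        long rowGroupSize = 32 * MB;   // assumed for illustration
        long pos = 4 * (30 * MB);      // four row groups, each 2MB under-size

        // Padding maximum of 8MB: the 8MB gap is padded, next target is 32MB.
        System.out.println(nextRowGroupTarget(pos, blockSize, rowGroupSize, 8 * MB) / MB);
        // Padding maximum of 0: the next row group is squeezed into 8MB.
        System.out.println(nextRowGroupTarget(pos, blockSize, rowGroupSize, 0) / MB);
    }
}
```

With a padding maximum of 0 the remainder is never padded, because at least one byte always remains before the next block boundary, so the writer would target an 8MB row group in this scenario.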



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
