[ https://issues.apache.org/jira/browse/PARQUET-321?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Ryan Blue updated PARQUET-321:
------------------------------
    Fix Version/s:     (was: 1.9.1)
                   1.10.0

> Set the HDFS padding default to 8MB
> -----------------------------------
>
>                 Key: PARQUET-321
>                 URL: https://issues.apache.org/jira/browse/PARQUET-321
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-mr
>            Reporter: Ryan Blue
>            Assignee: Ryan Blue
>            Priority: Major
>             Fix For: 1.10.0
>
>
> PARQUET-306 added the ability to pad row groups so that they align with HDFS
> blocks to avoid remote reads. The ParquetFileWriter will now either pad the
> remaining space in the block or target a row group for the remaining size.
> The padding maximum controls the threshold of the amount of padding that will
> be used. If the space left is under this threshold, it is padded. If it is
> greater than this threshold, then the next row group is fit into the
> remaining space. The current padding maximum is 0.
> I think we should change the padding maximum to 8MB. My reasoning is this: we
> want this number to be small enough that it won't prevent the library from
> writing reasonable row groups, but larger than the minimum size row group we
> would want to write. 8MB is 1/16th of the row group default, so I think it is
> reasonable: we don't want a row group to be smaller than 8MB.
> We also want this to be large enough that a few row groups in a block don't
> cause a tiny row group to be written in the excess space. 8MB accounts for 4
> row groups that are 2MB under-size. In addition, it is reasonable to not
> allow row groups under 8MB.
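
The threshold rule and the arithmetic in the description can be sketched roughly as follows. This is an illustration only, not the parquet-mr implementation: the class, enum, and method names are hypothetical, the 128MB row group default is inferred from "8MB is 1/16th of the row group default", and treating a remainder exactly equal to the threshold as padded is an assumption the description does not settle.

```java
// Illustrative sketch only; these names are hypothetical, not the parquet-mr API.
class PaddingSketch {
    static final long MB = 1024L * 1024L;

    // Inferred from the description: 8MB is 1/16th of the row group default.
    static final long DEFAULT_ROW_GROUP_SIZE = 128 * MB;
    static final long PROPOSED_MAX_PADDING = 8 * MB;

    enum Action { PAD_TO_BLOCK_END, FIT_ROW_GROUP_IN_REMAINDER }

    // Decide what to do with the space left in the current HDFS block.
    static Action decide(long remainingInBlock, long maxPadding) {
        // Assumption: a remainder exactly equal to the threshold is padded.
        if (remainingInBlock <= maxPadding) {
            return Action.PAD_TO_BLOCK_END;
        }
        return Action.FIT_ROW_GROUP_IN_REMAINDER;
    }

    public static void main(String[] args) {
        // Four row groups that each come in 2MB under-size leave 8MB in the block.
        long leftover = 4 * (2 * MB);

        // With the current maximum of 0, the leftover becomes a small row group.
        System.out.println(decide(leftover, 0));

        // With the proposed 8MB maximum, the leftover is padded instead.
        System.out.println(decide(leftover, PROPOSED_MAX_PADDING));
    }
}
```

With a maximum of 0, any non-empty remainder is handed to a row group, however small it is. Raising the maximum to 8MB turns those small remainders into padding.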