[ https://issues.apache.org/jira/browse/PARQUET-1381?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17092880#comment-17092880 ]
ASF GitHub Bot commented on PARQUET-1381:
-----------------------------------------

shangxinli commented on a change in pull request #775:
URL: https://github.com/apache/parquet-mr/pull/775#discussion_r415418003

##########
File path: parquet-hadoop/src/main/java/org/apache/parquet/hadoop/ParquetFileWriter.java
##########
@@ -919,6 +895,59 @@ public void appendRowGroup(SeekableInputStream from, BlockMetaData rowGroup,
     endBlock();
   }
 
+  /**
+   * Merges adjacent row groups in the supplied files while maintaining that the new groups is no more than the specified
+   * maxRowGroupSize
+   * @param inputFiles input files to merge
+   * @param maxRowGroupSize the maximum size in bytes the new created groups can be
+   * @param useV2Writer whether to use a V2 encoding based writer when rewriting dictionary encoded pages
+   * @param compression compression to use when writing
+   * @throws IOException
+   */
+  public void mergeRowGroups(List<InputFile> inputFiles, long maxRowGroupSize, boolean useV2Writer, CompressionCodecName compression) throws IOException {

Review comment:
I prefer not to unless you strongly think we should.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org

> Add merge blocks command to parquet-tools
> -----------------------------------------
>
>                 Key: PARQUET-1381
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1381
>             Project: Parquet
>          Issue Type: New Feature
>          Components: parquet-mr
>   Affects Versions: 1.10.0
>            Reporter: Ekaterina Galieva
>            Assignee: Ekaterina Galieva
>            Priority: Major
>              Labels: pull-request-available
>
> The current implementation of the merge command in parquet-tools doesn't
> merge row groups; it just places one after the other.
> Add an API and a command option to merge small blocks into larger ones,
> up to a specified size limit.
> h6. Implementation details:
> Blocks are not reordered, so as not to break possible initial predicate
> pushdown optimizations.
> Blocks are not divided to fit the upper bound perfectly.
> This is an intentional performance optimization.
> It makes it possible to form new blocks by copying the full content of
> smaller blocks by column, not by row.
> h6. Examples:
> # Input files with block sizes:
> {code:java}
> [128 | 35], [128 | 40], [120]{code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b}}
> {code:java}
> [128 | 35 | 128 | 40 | 120]
> {code}
> {{merge -b -l 256}}
> {code:java}
> [163 | 168 | 120]
> {code}
> # Input files with block sizes:
> {code:java}
> [128 | 35], [40], [120], [6] {code}
> Expected output file block sizes:
> {{merge}}
> {code:java}
> [128 | 35 | 40 | 120 | 6]
> {code}
> {{merge -b}}
> {code:java}
> [128 | 75 | 126]
> {code}
> {{merge -b -l 256}}
> {code:java}
> [203 | 126]{code}

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
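
The block-size examples in the issue description are consistent with a simple greedy pass over adjacent blocks. Below is a minimal sketch of that reading, working on sizes only. It is not the code from pull request #775. The class name `MergeBlocksSketch` and the method `mergeAdjacent` are hypothetical. The default limit of 128 used for `merge -b` is an assumption inferred from the second example, not something the issue states.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration of the merge rule described in PARQUET-1381.
// It models block sizes only; it does not read or write Parquet data.
final class MergeBlocksSketch {

  /**
   * Walks the blocks of all input files in their original order and
   * accumulates adjacent blocks while the running total stays within limit.
   * Blocks are never reordered and never split, so a single block larger
   * than limit is emitted unchanged.
   */
  static List<Long> mergeAdjacent(List<List<Long>> files, long limit) {
    List<Long> merged = new ArrayList<>();
    long current = 0;     // size of the output block being accumulated
    boolean open = false; // true once at least one input block was added
    for (List<Long> file : files) {
      for (long size : file) {
        if (open && current + size > limit) {
          // Adding this block would exceed the limit: close the current one.
          merged.add(current);
          current = 0;
          open = false;
        }
        current += size;
        open = true;
      }
    }
    if (open) {
      merged.add(current);
    }
    return merged;
  }

  public static void main(String[] args) {
    List<List<Long>> first = List.of(List.of(128L, 35L), List.of(128L, 40L), List.of(120L));
    List<List<Long>> second = List.of(List.of(128L, 35L), List.of(40L), List.of(120L), List.of(6L));

    System.out.println(mergeAdjacent(first, 256L));  // [163, 168, 120]
    System.out.println(mergeAdjacent(second, 128L)); // [128, 75, 126] (assumed default limit)
    System.out.println(mergeAdjacent(second, 256L)); // [203, 126]
  }
}
```

Because whole blocks are only ever concatenated, an implementation along these lines could copy column chunks from the source blocks as-is instead of re-assembling rows, which appears to be the optimization the description refers to.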