[ https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248854#comment-17248854 ]
Gabor Szadovszky commented on PARQUET-1951:
-------------------------------------------

[~satishkotha], I would not suggest using the parquet-tools merge command. (BTW, parquet-tools and parquet-cli are different. If this is really about parquet-tools, please correct the component label.)

Originally, the merge command was created because of the "small files problem". The small files problem occurs when you create a huge number of small (a couple of MB) parquet files instead of fewer, larger ones. (It usually happens with streaming.) Reading that many files is significantly slower than reading the same amount of data from fewer but larger files. Unfortunately, the current implementation of the merge command cannot deal with the original problem because it does not merge row groups. Having fewer, larger parquet files with the same number of row groups does not solve the "small files problem"; it only hides it.

We were thinking about this issue and found that it cannot be solved properly by a tool that is executed on a single machine. A much better solution would be something that takes advantage of the whole cluster, but that is out of the scope of parquet-mr.

The merge command also does not support column indexes or bloom filters.

> Allow different strategies to combine key values when merging parquet files
> ---------------------------------------------------------------------------
>
>                 Key: PARQUET-1951
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1951
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cli
>            Reporter: satish
>            Priority: Minor
>
> I work on the Apache Hudi project. We store some additional metadata in parquet
> files (the key range in the file, for example). So the metadata differs across
> the parquet files that we want to merge.
> Here is what I'm thinking:
> 1) The merge command takes an additional command line option: --strategy
> <StrategyClassName>.
> 2) We introduce a new strategy class in parquet-hadoop to keep the same
> behavior as today.
> We can extend that class and provide our custom implementation.

--
This message was sent by Atlassian Jira
(v8.3.4#803005)
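
For illustration only, the strategy idea in the quoted issue could be sketched roughly as below. Nothing here is an existing parquet-mr API: `KeyValueMergeStrategy`, `StrictStrategy`, `KeyRangeStrategy`, the `example.key.range` key and its `min:max` value format are all hypothetical names chosen for the sketch. It also assumes, without confirmation from this thread, that "the same behavior as today" means refusing to merge a key whose values differ between files.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch: none of these types exist in parquet-mr.
final class MergeStrategySketch {

    /** Hypothetical strategy: collapse all values seen for one key into a single value. */
    interface KeyValueMergeStrategy {
        String merge(String key, Set<String> values);
    }

    /** Assumed default behavior: accept a key only if every file agrees on its value. */
    static class StrictStrategy implements KeyValueMergeStrategy {
        @Override
        public String merge(String key, Set<String> values) {
            if (values.size() != 1) {
                throw new IllegalStateException(
                        "conflicting values for key " + key + ": " + values);
            }
            return values.iterator().next();
        }
    }

    /** Custom extension: widen "min:max" ranges for one key, stay strict for all others. */
    static class KeyRangeStrategy extends StrictStrategy {
        static final String RANGE_KEY = "example.key.range"; // hypothetical metadata key

        @Override
        public String merge(String key, Set<String> values) {
            if (!RANGE_KEY.equals(key)) {
                return super.merge(key, values);
            }
            if (values.isEmpty()) {
                throw new IllegalStateException("no values for key " + key);
            }
            long min = Long.MAX_VALUE;
            long max = Long.MIN_VALUE;
            for (String value : values) {
                String[] parts = value.split(":");
                min = Math.min(min, Long.parseLong(parts[0]));
                max = Math.max(max, Long.parseLong(parts[1]));
            }
            return min + ":" + max;
        }
    }

    /** Applies the strategy to every key collected from the input files' footers. */
    static Map<String, String> mergeAll(Map<String, Set<String>> collected,
                                        KeyValueMergeStrategy strategy) {
        Map<String, String> merged = new LinkedHashMap<>();
        for (Map.Entry<String, Set<String>> entry : collected.entrySet()) {
            merged.put(entry.getKey(), strategy.merge(entry.getKey(), entry.getValue()));
        }
        return merged;
    }

    public static void main(String[] args) {
        // Values as they might be gathered from two input files (made-up data).
        Map<String, Set<String>> collected = new LinkedHashMap<>();
        collected.put("writer.version", new TreeSet<>(Arrays.asList("1")));
        collected.put(KeyRangeStrategy.RANGE_KEY,
                new TreeSet<>(Arrays.asList("0:99", "100:250")));
        System.out.println(mergeAll(collected, new KeyRangeStrategy()));
    }
}
```

A class like this would only address the key-value metadata conflict raised in the issue. It does not change the row group concern described in the comment above, since the merged file would still contain the same number of row groups.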