[ 
https://issues.apache.org/jira/browse/PARQUET-1951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17248854#comment-17248854
 ] 

Gabor Szadovszky commented on PARQUET-1951:
-------------------------------------------

[~satishkotha], I would not recommend using the parquet-tools merge command. 
(BTW, parquet-tools and parquet-cli are different tools. If this is really 
about parquet-tools, please correct the component label.)
Originally, the merge command was created because of the "small files problem". 
The small files problem occurs when you create a huge number of small (a couple 
of MB) parquet files instead of fewer, larger ones. (This usually happens with 
streaming.) Reading that many files is significantly slower than reading the 
same amount of data from fewer but larger files.
Unfortunately, the current implementation of the merge command cannot deal with 
the original problem because it does not merge row groups. Having fewer, larger 
parquet files with the same number of row groups does not solve the "small 
files problem"; it only hides it. We were thinking about this issue and found 
that it cannot be solved properly by a tool that is executed on a single 
machine. A much better solution would be something that takes advantage of the 
whole cluster, but that is out of the scope of parquet-mr.
The merge command also does not support column indexes or bloom filters. 
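
To make the row group point concrete, here is a toy sketch. It is not 
parquet-mr code: a file is modeled as a plain list of row group sizes in MB, 
and the file count, the 4 MB size and the 128 MB target are all assumed numbers 
chosen only for illustration. It compares a merge that copies row groups 
unchanged with a hypothetical rewrite that re-packs the data.

```python
from typing import List

# Toy model (not parquet-mr code): a file is a list of row-group sizes in MB.
# All sizes and the 128 MB target are assumed numbers for illustration.


def concat_merge(files: List[List[int]]) -> List[int]:
    """Copy every row group unchanged into one output file."""
    return [row_group for f in files for row_group in f]


def rewrite_merge(files: List[List[int]], target_mb: int = 128) -> List[int]:
    """Re-pack all data into row groups of up to target_mb each."""
    total = sum(row_group for f in files for row_group in f)
    full, rest = divmod(total, target_mb)
    return [target_mb] * full + ([rest] if rest else [])


# 1000 small files, each holding a single 4 MB row group.
small_files = [[4] for _ in range(1000)]

print(len(concat_merge(small_files)))   # 1000 row groups, now inside one file
print(len(rewrite_merge(small_files)))  # 32 row groups
```

In this model the concatenating merge leaves a reader with exactly as many row 
groups to process as before, while only the rewrite reduces the count. The 
rewrite has to read and re-encode all of the data, which is the expensive part.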

> Allow different strategies to combine key values when merging parquet files
> ---------------------------------------------------------------------------
>
>                 Key: PARQUET-1951
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1951
>             Project: Parquet
>          Issue Type: Improvement
>          Components: parquet-cli
>            Reporter: satish
>            Priority: Minor
>
> I work on the Apache Hudi project. We store some additional metadata in 
> parquet files (the key range in the file, for example), so the metadata 
> differs between the parquet files that we want to merge. 
> Here is what I'm thinking:
> 1) The merge command takes an additional command-line option: --strategy 
> <StrategyClassName>. 
> 2) We introduce a new strategy class in parquet-hadoop that keeps the same 
> behavior as today.  
> We can then extend that class and provide our own custom implementation.
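
The proposal quoted above could be sketched roughly as follows. This is a 
language-neutral illustration in Python, not the parquet-hadoop API. Every name 
in it (KeyValueMergeStrategy, StrictStrategy, KeyRangeStrategy, merge_metadata, 
the "key_range" key and its "low:high" format) is hypothetical. It also assumes 
that "the same behavior as today" means failing when a key has different values 
in different files.

```python
from abc import ABC, abstractmethod
from typing import Dict, List


class KeyValueMergeStrategy(ABC):
    """Hypothetical hook: combine the values one key has across files."""

    @abstractmethod
    def merge(self, key: str, values: List[str]) -> str:
        ...


class StrictStrategy(KeyValueMergeStrategy):
    """Assumed default: refuse to merge when the values disagree."""

    def merge(self, key: str, values: List[str]) -> str:
        if len(set(values)) > 1:
            raise ValueError(f"conflicting values for key {key!r}: {values}")
        return values[0]


class KeyRangeStrategy(StrictStrategy):
    """Hypothetical custom strategy: widen 'low:high' ranges, else be strict."""

    def merge(self, key: str, values: List[str]) -> str:
        if key == "key_range":
            lows, highs = zip(*(v.split(":") for v in values))
            return f"{min(lows)}:{max(highs)}"
        return super().merge(key, values)


def merge_metadata(per_file: List[Dict[str, str]],
                   strategy: KeyValueMergeStrategy) -> Dict[str, str]:
    """Collect each key's values across files and let the strategy combine them."""
    collected: Dict[str, List[str]] = {}
    for metadata in per_file:
        for key, value in metadata.items():
            collected.setdefault(key, []).append(value)
    return {key: strategy.merge(key, values) for key, values in collected.items()}


files = [
    {"writer": "x", "key_range": "a:f"},
    {"writer": "x", "key_range": "g:m"},
]
print(merge_metadata(files, KeyRangeStrategy()))
# {'writer': 'x', 'key_range': 'a:m'}
```

With StrictStrategy the same input raises a ValueError because the two 
"key_range" values differ, which is the situation the reporter describes.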



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
