[ https://issues.apache.org/jira/browse/PARQUET-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941860#comment-16941860 ]

Gabor Szadovszky commented on PARQUET-1670:
-------------------------------------------

Based on the code and the help message of the {{merge}} command, there is no 
{{-b}} option. I don't know why the tool does not complain about it.

The current implementation concatenates the row groups of the small files, so 
the resulting file in your case will contain many small row groups. It 
therefore does not solve the issue, because the real problem with many small 
files is the many small row groups. In its current shape the merge command is, 
in my opinion, not useful. That's why it prints the following message.
{quote}"The command doesn't merge row groups, just places one after the other. 
When used to merge many small files, the resulting file will still contain 
small row groups, which usually leads to bad query performance."{quote}
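
For reference, this is roughly what the concatenating merge does today. A 
minimal sketch against the Hadoop-path based parquet-mr API (some of these 
overloads are deprecated in newer releases; the class and method layout here 
is illustrative only):
{code:java}
import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.hadoop.ParquetFileReader;
import org.apache.parquet.hadoop.ParquetFileWriter;
import org.apache.parquet.schema.MessageType;

public class ConcatMerge {
  // Stitches the input files together without rewriting row groups, so the
  // output keeps one small row group per small input file.
  public static void concat(Configuration conf, List<Path> inputs, Path output)
      throws IOException {
    MessageType schema = ParquetFileReader
        .readFooter(conf, inputs.get(0))
        .getFileMetaData()
        .getSchema();
    ParquetFileWriter writer =
        new ParquetFileWriter(conf, schema, output, ParquetFileWriter.Mode.CREATE);
    writer.start();
    for (Path input : inputs) {
      writer.appendFile(conf, input); // copies the existing row groups as-is
    }
    writer.end(Collections.emptyMap());
  }
}
{code}
This is cheap because no data is decoded, but it is exactly why the warning 
above applies: the small row groups survive the merge.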

The parquet-mr library processes the data sequentially, so 100% load on one 
core seems fine. I don't know why the memory consumption reaches 20GB, but a 
JVM typically does not run a full GC until the heap approaches the maximum 
available memory. So my guess is that the 20GB is mostly unused objects that 
would be garbage collected if required. I also don't know why it is that slow, 
but it does not really matter, as the result is not really useful anyway.

Unfortunately, we don't have a properly working tool that could solve your 
problem.  My only idea is to read all the data back from the many files 
row-by-row and write them to one file.
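
A minimal sketch of that idea using parquet-avro, assuming all input files 
share the same schema (the class and method layout here is illustrative, not 
an existing tool):
{code:java}
import java.io.IOException;
import java.util.List;

import org.apache.avro.Schema;
import org.apache.avro.generic.GenericRecord;
import org.apache.hadoop.fs.Path;
import org.apache.parquet.avro.AvroParquetReader;
import org.apache.parquet.avro.AvroParquetWriter;
import org.apache.parquet.hadoop.ParquetReader;
import org.apache.parquet.hadoop.ParquetWriter;
import org.apache.parquet.hadoop.metadata.CompressionCodecName;

public class RowByRowMerge {
  public static void merge(List<Path> inputs, Path output, Schema schema)
      throws IOException {
    try (ParquetWriter<GenericRecord> writer = AvroParquetWriter
        .<GenericRecord>builder(output)
        .withSchema(schema)
        .withCompressionCodec(CompressionCodecName.SNAPPY)
        .build()) {
      for (Path input : inputs) {
        try (ParquetReader<GenericRecord> reader = AvroParquetReader
            .<GenericRecord>builder(input)
            .build()) {
          GenericRecord record;
          while ((record = reader.read()) != null) {
            writer.write(record); // rows accumulate into full-size row groups
          }
        }
      }
    }
  }
}
{code}
Because the writer re-buffers the rows and flushes them as normally sized row 
groups, memory usage should stay roughly bounded by one row group at a time.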

> parquet-tools merge extremely slow with block-option
> ----------------------------------------------------
>
>                 Key: PARQUET-1670
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1670
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Alexander Gunkel
>            Priority: Major
>
> parquet-tools merge is extremely time- and memory-consuming when used with 
> the block option.
>  
> The merge function builds a bigger file out of several smaller Parquet files. 
> Used without the block option, it just concatenates the files into a bigger 
> one without building larger row groups, which doesn't help with query 
> performance issues. With the block option, parquet-tools builds bigger 
> row groups, which improves query performance, but the merge process itself 
> is extremely slow and memory-consuming.
>  
> Consider a case in which you have many small parquet files, e.g. 1000 files 
> with a size of 100kb. Merging them into one file fails on my machine because 
> even 20GB of memory are not enough for the process (the total amount of data 
> as well as the resulting file should be smaller than 100MB).
>  
> Different situation: Consider having 100 files of size 1MB. Then merging them 
> is possible with 20GB of RAM, but it takes almost half an hour to process, 
> which is too much for many use cases.
>  
> Is there any way to accelerate the merge and reduce the memory consumption?


