[ 
https://issues.apache.org/jira/browse/PARQUET-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941870#comment-16941870
 ] 

Alexander Gunkel commented on PARQUET-1670:
-------------------------------------------

[~gszadovszky], there was an option -b until commit 
[ab42fe5180366120336fb3f8b9e6540aadb5da1b|https://github.com/apache/parquet-mr/commit/ab42fe5180366120336fb3f8b9e6540aadb5da1b]
 (originally introduced in commit 863a081850e56bbbb38d7b68b478a3bd40779723) ;)

> parquet-tools merge extremely slow with block-option
> ----------------------------------------------------
>
>                 Key: PARQUET-1670
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1670
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Alexander Gunkel
>            Priority: Major
>
> parquet-tools merge is extremely time- and memory-consuming when used with 
> the block option.
>  
> The merge function builds a bigger file out of several smaller Parquet files. 
> Used without the block option, it just concatenates the files into a bigger 
> one without building larger row groups, which doesn't help with query 
> performance. With the block option, parquet-tools builds bigger row groups, 
> which improves query performance, but the merge process itself is extremely 
> slow and memory-consuming.
>  
> Consider a case in which you have many small Parquet files, e.g. 1000 files 
> of 100 KB each. Merging them into one file fails on my machine because 
> even 20 GB of memory is not enough for the process (the total amount of data, 
> as well as the resulting file, should be smaller than 100 MB).
>  
> A different situation: consider having 100 files of 1 MB each. Merging them 
> is possible with 20 GB of RAM, but it takes almost half an hour to process, 
> which is too much for many use cases.
>  
> Is there any way to speed up the merge and reduce its memory 
> consumption?
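
The difference between the two merge modes described above can be illustrated with a toy model. This is only a sketch of the idea, not parquet-tools code: the function names `concatenate` and `merge_blocks` are hypothetical, row groups are represented only by their sizes in bytes, and the 128 MB upper bound is an assumed value for the block-size limit.

```python
KB = 1024
MB = 1024 * KB


def concatenate(files):
    """Model of merging without the block option: every input row group
    is copied unchanged, so the output has as many row groups as the inputs."""
    return [size for row_groups in files for size in row_groups]


def merge_blocks(files, limit):
    """Model of merging with the block option: adjacent row groups are
    combined greedily until adding the next one would exceed `limit`."""
    merged = []
    current = 0
    for size in concatenate(files):
        if current and current + size > limit:
            merged.append(current)  # close the current output row group
            current = 0
        current += size
    if current:
        merged.append(current)
    return merged


# 1000 input files, each holding a single 100 KB row group (the first case above).
files = [[100 * KB] for _ in range(1000)]

plain = concatenate(files)
blocked = merge_blocks(files, limit=128 * MB)  # 128 MB is an assumed limit

print(len(plain), "row groups without the block option")
print(len(blocked), "row group(s) with the block option")
```

In this model the plain merge leaves 1000 tiny row groups in the output, while the block merge produces a single row group of just under 100 MB. The model says nothing about why the real implementation needs so much time and memory; it only shows what the two modes are meant to produce.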



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
