[ https://issues.apache.org/jira/browse/PARQUET-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941870#comment-16941870 ]

Alexander Gunkel commented on PARQUET-1670:
-------------------------------------------

[~gszadovszky], there was an option -b until commit [ab42fe5180366120336fb3f8b9e6540aadb5da1b|https://github.com/apache/parquet-mr/commit/ab42fe5180366120336fb3f8b9e6540aadb5da1b] (originally introduced in commit 863a081850e56bbbb38d7b68b478a3bd40779723) ;)

> parquet-tools merge extremely slow with block-option
> ----------------------------------------------------
>
>                 Key: PARQUET-1670
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1670
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Alexander Gunkel
>            Priority: Major
>
> parquet-tools merge is extremely time- and memory-consuming when used with
> the block option.
>
> The merge function builds a bigger file out of several smaller Parquet files.
> Used without the block option, it just concatenates the files into a bigger
> one without building larger row groups. That doesn't help with
> query-performance issues. With the block option, parquet-tools builds bigger
> row groups, which improves query performance, but the merge process itself
> is extremely slow and memory-consuming.
>
> Consider a case in which you have many small Parquet files, e.g. 1000 files
> with a size of 100 kB each. Merging them into one file fails on my machine
> because even 20 GB of memory is not enough for the process (the total amount
> of data, as well as the resulting file, should be smaller than 100 MB).
>
> Different situation: consider having 100 files of size 1 MB each. Merging
> them is possible with 20 GB of RAM, but it takes almost half an hour to
> process, which is too much for many use cases.
>
> Is there any possibility to accelerate the merge and reduce its memory
> requirements?



--
This message was sent by Atlassian Jira
(v8.3.4#803005)