[ https://issues.apache.org/jira/browse/PARQUET-1670?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16941886#comment-16941886 ]
Gabor Szadovszky commented on PARQUET-1670:
-------------------------------------------
OK, I forgot about it. I was the one who reverted this feature. It was trying to do some more advanced merging, but the concept was not correct. I would not suggest using it.

> parquet-tools merge extremely slow with block-option
> ----------------------------------------------------
>
>                 Key: PARQUET-1670
>                 URL: https://issues.apache.org/jira/browse/PARQUET-1670
>             Project: Parquet
>          Issue Type: Bug
>            Reporter: Alexander Gunkel
>            Priority: Major
>
> parquet-tools merge is extremely time- and memory-consuming when used with the block-option.
>
> The merge function builds a bigger file out of several smaller Parquet files. Used without the block-option, it just concatenates the files into a bigger one without building larger row groups, which does not help with query-performance issues. With the block-option, parquet-tools builds bigger row groups, which improves query performance, but the merge process itself is extremely slow and memory-consuming.
>
> Consider a case in which you have many small Parquet files, e.g. 1000 files of 100 kB each. Merging them into one file fails on my machine because even 20 GB of memory is not enough for the process, although the total amount of data, as well as the resulting file, should be smaller than 100 MB.
>
> A different situation: consider having 100 files of 1 MB each. Merging them is then possible with 20 GB of RAM, but it takes almost half an hour, which is too much for many use cases.
>
> Is there any possibility to accelerate the merge and reduce the memory footprint?

-- This message was sent by Atlassian Jira (v8.3.4#803005)