Thanks, Hyukjin. I'll try using parquet-tools for 1.9 based on the JIRA. If that doesn't work, I'll try Kite.
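For anyone else hitting the same wall: below is a minimal toy sketch (Python, standard library only) of why plain byte-level concatenation, which is roughly what "hdfs dfs -getmerge" does, leaves a Parquet file unreadable. This is not the real Parquet format (the real footer holds Thrift-serialized metadata), but it mimics the relevant layout: data first, metadata footer at the end, closed by a 4-byte footer length and the magic bytes. A footer-based reader seeks to the end of the file, so after concatenation it can only ever see the last file's metadata.

```python
import struct

MAGIC = b"PAR1"

def toy_footer_file(payload: bytes, meta: bytes) -> bytes:
    # Mimic Parquet's layout: magic, data pages, footer metadata,
    # 4-byte little-endian footer length, trailing magic.
    return MAGIC + payload + meta + struct.pack("<I", len(meta)) + MAGIC

def read_meta(blob: bytes) -> bytes:
    # A footer-based reader seeks to the end: the last 4 bytes must be
    # the magic, and the 4 bytes before that give the footer length.
    assert blob[-4:] == MAGIC, "not a valid footer-based file"
    meta_len = struct.unpack("<I", blob[-8:-4])[0]
    return blob[-8 - meta_len:-8]

a = toy_footer_file(b"rows-A", b"meta-A")
b = toy_footer_file(b"rows-B", b"meta-B")

# Byte-level concatenation, as "getmerge"-style tooling would do:
merged = a + b

# The reader finds only the second file's footer; rows-A is invisible.
print(read_meta(merged))  # -> b'meta-B'
```

This is why the row groups have to be merged at the format level (rewriting a combined footer), which is what the parquet-tools merge command from PARQUET-460 is meant to do, rather than at the byte level.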
Cheers,
Ben

> On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Hi Benjamin,
>
> As you might already know, I believe the Hadoop command does not merge
> column-based formats such as ORC or Parquet; it simply concatenates them.
>
> I haven't tried this myself, but I remember seeing a JIRA in Parquet:
> https://issues.apache.org/jira/browse/PARQUET-460
>
> It seems parquet-tools allows merging small Parquet files into one.
>
> Also, I believe there are command-line tools in Kite:
> https://github.com/kite-sdk/kite
>
> This might be useful.
>
> Thanks!
>
> 2016-12-23 7:01 GMT+09:00 Benjamin Kim <bbuil...@gmail.com>:
> Has anyone tried to merge *.gz.parquet files before? I'm trying to merge
> them into one file after they are output from Spark. Doing a coalesce(1)
> on the Spark cluster will not work; it just does not have the resources
> to do it. I'm trying to do it from the command line without using Spark,
> and will use the command in a shell script. I tried "hdfs dfs -getmerge",
> but the resulting file becomes unreadable by Spark with a gzip footer
> error.
>
> Thanks,
> Ben
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org