Thanks, Hyukjin.

I’ll try using the Parquet tools for 1.9

On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:

Hi Benjamin,


As you might already know, I believe the Hadoop commands do not actually
merge column-based formats such as ORC or Parquet; they simply concatenate
the files, which breaks the format.

I haven't tried this myself, but I remember seeing a JIRA in Parquet -
https://issues.apache.org/jira/browse/PARQUET-460

It seems parquet-tools allows merging small Parquet files into one.
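A rough sketch of what that would look like from the shell, assuming a parquet-tools build that includes the merge command from PARQUET-460 (the jar name and HDFS paths below are placeholders, not from this thread):

```shell
# Merge several small Parquet files into one output file.
# Usage per PARQUET-460: merge <input> [<input> ...] <output>
# NOTE: this appends the row groups side by side; it does not
# rewrite them into larger row groups.
hadoop jar parquet-tools-1.9.0.jar merge \
  hdfs:///data/part-00000.gz.parquet \
  hdfs:///data/part-00001.gz.parquet \
  hdfs:///data/merged.gz.parquet
```

One caveat worth knowing: because merge only concatenates row groups rather than rewriting them, the output keeps the many small row groups of the inputs, so it may not improve read performance much.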


Also, I believe there are command-line tools in Kite -
https://github.com/kite-sdk/kite

This might be useful.


Thanks!

2016-12-23 7:01 GMT+09:00 Benjamin Kim <bbuil...@gmail.com>:

Has anyone tried to merge *.gz.parquet files before? I'm trying to merge
them into one file after they are output from Spark. Doing a coalesce(1) on
the Spark cluster will not work; it just does not have the resources to do
it. So I'm trying to do it from the command line instead of using Spark,
since I want to run the command in a shell script. I tried "hdfs dfs
-getmerge", but the resulting file becomes unreadable by Spark with a gzip
footer error.





Thanks,


Ben

