Thanks, Hyukjin. I'll try using parquet-tools for 1.9 based on the JIRA. If that doesn't work, I'll try Kite.
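For anyone else hitting the same wall: below is a minimal toy sketch (Python, standard library only) of why plain byte-level concatenation, which is roughly what "hdfs dfs -getmerge" does, leaves a Parquet file unreadable. This is not the real Parquet format (the real footer holds Thrift-serialized metadata), but it mimics the relevant layout: data first, metadata footer at the end, closed by a 4-byte footer length and the magic bytes. A footer-based reader seeks to the end of the file, so after concatenation it can only ever see the last file's metadata.

```python
import struct

MAGIC = b"PAR1"

def toy_footer_file(payload: bytes, meta: bytes) -> bytes:
    # Mimic Parquet's layout: magic, data pages, footer metadata,
    # 4-byte little-endian footer length, trailing magic.
    return MAGIC + payload + meta + struct.pack("<I", len(meta)) + MAGIC

def read_meta(blob: bytes) -> bytes:
    # A footer-based reader seeks to the end: the last 4 bytes must be
    # the magic, and the 4 bytes before that give the footer length.
    assert blob[-4:] == MAGIC, "not a valid footer-based file"
    meta_len = struct.unpack("<I", blob[-8:-4])[0]
    return blob[-8 - meta_len:-8]

a = toy_footer_file(b"rows-A", b"meta-A")
b = toy_footer_file(b"rows-B", b"meta-B")

# Byte-level concatenation, as "getmerge"-style tooling would do:
merged = a + b

# The reader finds only the second file's footer; rows-A is invisible.
print(read_meta(merged))  # -> b'meta-B'
```

This is why the row groups have to be merged at the format level (rewriting a combined footer), which is what the parquet-tools merge command from PARQUET-460 is meant to do, rather than at the byte level.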
Cheers,
Ben

> On Dec 23, 2016, at 12:43 PM, Hyukjin Kwon <gurwls...@gmail.com> wrote:
>
> Hi Benjamin,
>
> As you might already know, I believe the Hadoop command does not merge
> column-based formats such as ORC or Parquet; it simply concatenates them.
>
> I haven't tried this myself, but I remember seeing a JIRA in Parquet:
> https://issues.apache.org/jira/browse/PARQUET-460
>
> It seems parquet-tools allows merging small Parquet files into one.
>
> Also, I believe there are command-line tools in Kite:
> https://github.com/kite-sdk/kite
>
> This might be useful.
>
> Thanks!
>
> 2016-12-23 7:01 GMT+09:00 Benjamin Kim <bbuil...@gmail.com>:
> Has anyone tried to merge *.gz.parquet files before? I'm trying to merge
> them into one file after they are output from Spark. Doing a coalesce(1)
> on the Spark cluster will not work; it just does not have the resources
> to do it. I'm trying to do it from the command line without using Spark,
> and will use the command in a shell script. I tried "hdfs dfs -getmerge",
> but the resulting file becomes unreadable by Spark with a gzip footer
> error.
>
> Thanks,
> Ben
>
> ---------------------------------------------------------------------
> To unsubscribe e-mail: user-unsubscr...@spark.apache.org