Hi, check out the FileUtil.copyMerge function in the Hadoop API <https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/fs/FileUtil.html#copyMerge(org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, org.apache.hadoop.fs.FileSystem, org.apache.hadoop.fs.Path, boolean, org.apache.hadoop.conf.Configuration, java.lang.String)>.
It's simple:

1. Get the Hadoop configuration from the Spark context and open the file system:
   FileSystem fs = FileSystem.get(sparkContext.hadoopConfiguration());
2. Create a Path <https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/fs/Path.html> for the source directory and another for the destination file.
3. Call copyMerge:
   FileUtil.copyMerge(fs, inputPath, fs, destPath, true, sparkContext.hadoopConfiguration(), null);

2015-09-13 23:25 GMT+02:00 unk1102 <umesh.ka...@gmail.com>:

> Hi, I have a Spark job which creates around 500 part files inside each
> directory I process, and I have thousands of such directories, so I need
> to merge these 500 small part files. I am using
> spark.sql.shuffle.partitions = 500, and my final small files are ORC
> files. Is there a way to merge ORC files in Spark? If not, please suggest
> the best way to merge files created by a Spark job in HDFS. Thanks much.
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-merge-final-output-part-files-created-by-Spark-job-tp24681.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Gaspar Muñoz
@gmunozsoria <http://www.stratio.com/>

Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // *@stratiobd <https://twitter.com/StratioBD>*
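Putting the three steps above together, a minimal sketch could look like the following. The class and method names (PartFileMerger, mergePartFiles) are hypothetical; the Configuration would come from sparkContext.hadoopConfiguration() in a real job, and the srcDir/dstFile paths are placeholders for your own locations.

```java
import java.io.IOException;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.FileUtil;
import org.apache.hadoop.fs.Path;

// Hypothetical helper class, not part of any API.
public class PartFileMerger {

    /**
     * Concatenates every part file under srcDir into the single file
     * dstFile, using the Hadoop 2.x FileUtil.copyMerge API.
     * Returns true if any files were merged.
     */
    public static boolean mergePartFiles(Configuration conf,
                                         String srcDir,
                                         String dstFile) throws IOException {
        // In a Spark job you would pass sparkContext.hadoopConfiguration() here.
        FileSystem fs = FileSystem.get(conf);
        return FileUtil.copyMerge(fs, new Path(srcDir),
                                  fs, new Path(dstFile),
                                  true,  // delete the source directory afterwards
                                  conf,
                                  null); // no separator string between files
    }
}
```

Two caveats worth noting: copyMerge does a raw byte-level concatenation, which is fine for plain-text output but will not, as far as I know, produce a valid ORC file (ORC stores per-file footer metadata), so for the ORC case from the original question a different approach is needed. Also, copyMerge was removed from FileUtil in Hadoop 3.x, so this only applies to Hadoop 2.x clusters.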