For text files this merge works fine, but for binary formats like ORC, Parquet, or Avro I am not sure it will. These formats are not append-able, because they write metadata about the data either in the header or in the footer of the file. You have to use the format-specific API to merge the data.

Yong
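The point above can be illustrated with a toy footer-based format (a deliberately simplified stand-in for ORC/Parquet, which keep their real metadata in a tail footer): after naive byte concatenation, a reader that seeks to the end of the file finds only the last file's footer, so the merged file is effectively corrupt.

```python
import struct

# Toy footer-based format: payload bytes followed by a 4-byte little-endian
# record count, mimicking how ORC/Parquet keep metadata at the tail.
def write_file(records):
    payload = b"".join(records)
    return payload + struct.pack("<I", len(records))

def read_count(data):
    # A reader locates metadata by seeking to the end of the file.
    return struct.unpack("<I", data[-4:])[0]

a = write_file([b"r1", b"r2"])
b = write_file([b"r3"])

# Naive byte-level concatenation buries the first file's footer mid-stream:
merged = a + b

# The reader only sees the last footer and reports 1 record, not 3.
print(read_count(merged))  # 1
```

This is why a plain copyMerge-style concatenation works for line-oriented text but not for footer-based columnar formats.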
Date: Mon, 14 Sep 2015 09:10:33 +0200
Subject: Re: Best way to merge final output part files created by Spark job
From: gmu...@stratio.com
To: umesh.ka...@gmail.com
CC: user@spark.apache.org

Hi,

Check out the FileUtil.copyMerge function in the Hadoop API. It's simple:

1. Get the Hadoop configuration from the Spark context:
   FileSystem fs = FileSystem.get(sparkContext.hadoopConfiguration());
2. Create new Paths for the source directory and the destination.
3. Call copyMerge:
   FileUtil.copyMerge(fs, inputPath, fs, destPath, true, sparkContext.hadoopConfiguration(), null);

2015-09-13 23:25 GMT+02:00 unk1102 <umesh.ka...@gmail.com>:

> Hi, I have a Spark job which creates around 500 part files inside each
> directory I process, and I have thousands of such directories, so I need
> to merge these 500 small part files. I am using
> spark.sql.shuffle.partitions = 500 and my final small files are ORC files.
> Is there a way to merge ORC files in Spark? If not, please suggest the
> best way to merge files created by a Spark job in HDFS. Thanks much.
>
> --
> View this message in context: http://apache-spark-user-list.1001560.n3.nabble.com/Best-way-to-merge-final-output-part-files-created-by-Spark-job-tp24681.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.

--
Gaspar Muñoz
@gmunozsoria
Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd
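For readers without a Hadoop cluster handy, the behavior of FileUtil.copyMerge can be sketched on the local filesystem: it concatenates every part file in a source directory, in name order, into a single destination file. This is a stdlib-only analogy (the helper name copy_merge_local is mine, not a Hadoop API), not the real HDFS call:

```python
import os
import shutil
import tempfile

# Local-filesystem sketch of what FileUtil.copyMerge does on HDFS:
# concatenate every part file in a directory, in name order, into one file.
def copy_merge_local(src_dir, dest_file):
    parts = sorted(p for p in os.listdir(src_dir) if p.startswith("part-"))
    with open(dest_file, "wb") as out:
        for name in parts:
            with open(os.path.join(src_dir, name), "rb") as part:
                shutil.copyfileobj(part, out)

# Demo with three fake part files, like a Spark job's text output directory.
src = tempfile.mkdtemp()
for i, text in enumerate(["alpha\n", "beta\n", "gamma\n"]):
    with open(os.path.join(src, f"part-{i:05d}"), "w") as f:
        f.write(text)

dest = os.path.join(src, "merged.txt")
copy_merge_local(src, dest)
print(open(dest).read())
```

As the thread's reply notes, this byte-level concatenation is only safe for text-like formats; for ORC the usual approach is instead to reduce the number of output files at write time (e.g. fewer shuffle partitions, or coalescing before the write).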