Re: Best way to merge final output part files created by Spark job

2016-07-01 Thread kali.tumm...@gmail.com
Try using the coalesce function to repartition to the desired number of output
files (see the sketch just below); to merge files that have already been
written, use Hive and INSERT OVERWRITE TABLE with the options that follow it.
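
For the coalesce approach, a minimal Scala sketch (the DataFrame df, the target
of 16 partitions, and the output path are illustrative assumptions, not from
the original message):

    // Reduce the number of output files before writing; coalesce avoids the
    // full shuffle that repartition would trigger.
    df.coalesce(16).write.orc("/path/to/output")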

set hive.merge.smallfiles.avgsize=256;
set hive.merge.size.per.task=256;
set 
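
For reference, these merge options usually go together with the flags that
enable merging, and the size thresholds are in bytes. The property names below
are real Hive options, but the values (roughly 256 MB) are illustrative
assumptions, not from the original message:

    set hive.merge.mapfiles=true;           -- merge small files of map-only jobs
    set hive.merge.mapredfiles=true;        -- merge small files of map-reduce jobs
    set hive.merge.smallfiles.avgsize=268435456;  -- merge when avg file size is below this
    set hive.merge.size.per.task=268435456;       -- target size of the merged files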






Re: Best way to merge final output part files created by Spark job

2015-09-17 Thread MEETHU MATHEW
Try coalesce(1) before writing. Thanks & Regards, Meethu M
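
As a sketch (df is an assumed placeholder for the DataFrame being written):

    // coalesce(1) forces a single partition, so the job emits one part file.
    // All data then passes through one task, so this only suits small outputs.
    df.coalesce(1).write.orc("/path/to/single-file-output")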



RE: Best way to merge final output part files created by Spark job

2015-09-14 Thread java8964
For text files this merge (the copyMerge approach below) works fine, but for
binary formats like "ORC", "Parquet" or "Avro", I'm not sure it will work.
These formats are in fact not appendable, as they write their metadata either
in the header or the footer of the file. You have to use the format-specific
API to merge the data.
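
For ORC or Parquet, the simplest format-aware merge is to round-trip the files
through Spark itself, which rewrites them as fewer, larger files; for ORC
tables, Hive also offers ALTER TABLE ... CONCATENATE. A sketch of the
round-trip (the paths and the target of 8 partitions are illustrative
assumptions; spark is a SparkSession in current APIs):

    // Read the many small ORC part files and rewrite them as larger ones.
    val df = spark.read.orc("/path/to/many-part-files")
    df.coalesce(8).write.orc("/path/to/merged-output")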
Yong


Re: Best way to merge final output part files created by Spark job

2015-09-14 Thread Gaspar Muñoz
Hi, check out the FileUtil.copyMerge function in the Hadoop API
<https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/fs/FileUtil.html#copyMerge(org.apache.hadoop.fs.FileSystem,org.apache.hadoop.fs.Path,org.apache.hadoop.fs.FileSystem,org.apache.hadoop.fs.Path,boolean,org.apache.hadoop.conf.Configuration,java.lang.String)>.

It's simple:

   1. Get the Hadoop configuration from the Spark context:
      FileSystem fs = FileSystem.get(sparkContext.hadoopConfiguration());
   2. Create a new Path
      <https://hadoop.apache.org/docs/r2.7.0/api/org/apache/hadoop/fs/Path.html>
      with the destination and source directories.
   3. Call copyMerge:
      FileUtil.copyMerge(fs, inputPath, fs, destPath, true,
      sparkContext.hadoopConfiguration(), null);
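
Putting the steps together, a minimal Scala sketch (the paths are placeholders;
note that copyMerge does a plain byte-level concatenation, so it only suits
text-like formats, and it was removed from FileUtil in Hadoop 3.x):

    import org.apache.hadoop.fs.{FileSystem, FileUtil, Path}

    val conf = sparkContext.hadoopConfiguration
    val fs = FileSystem.get(conf)
    val inputPath = new Path("/path/to/part-files") // directory of part files
    val destPath = new Path("/path/to/merged-file") // single merged output file
    // The boolean deletes the source directory after a successful merge;
    // the trailing null means no separator string is inserted between files.
    FileUtil.copyMerge(fs, inputPath, fs, destPath, true, conf, null)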




-- 

Gaspar Muñoz
@gmunozsoria


Vía de las dos Castillas, 33, Ática 4, 3ª Planta
28224 Pozuelo de Alarcón, Madrid
Tel: +34 91 352 59 42 // @stratiobd <https://twitter.com/StratioBD>


Best way to merge final output part files created by Spark job

2015-09-13 Thread unk1102
Hi, I have a Spark job which creates around 500 part files inside each
directory I process, and I have thousands of such directories, so I need to
merge these small part files. I am using spark.sql.shuffle.partitions=500 and
my final small files are ORC files. Is there a way to merge ORC files in Spark?
If not, please suggest the best way to merge files created by a Spark job in
HDFS. Thanks much.


