Hi, I tried both approaches, df.repartition(6) and df.coalesce(6), but neither reduces the number of part-xxxxx files. Even after calling these methods I still see around 200 small part files of about 20 MB each, which are again ORC files.
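One thing worth checking: coalesce and repartition return a new DataFrame rather than changing the one they are called on, so the coalesced result has to be the DataFrame you actually write. A minimal sketch, assuming df is the DataFrame your job builds and the path is the one from the earlier mail:

    // No effect on the write below: coalesce returns a NEW DataFrame,
    // it does not modify df in place.
    df.coalesce(6)
    df.write.format("orc").save("/path/in/hdfs") // still ~200 files

    // Make coalesce the last step before the write instead:
    df.coalesce(6).write.format("orc").save("/path/in/hdfs") // ~6 files

Also, 200 happens to be the default value of spark.sql.shuffle.partitions, so if any shuffle (join, groupBy, etc.) runs after your coalesce, the partition count goes back to 200. Keeping the coalesce as the final transformation before the write, or lowering the setting with sqlContext.setConf("spark.sql.shuffle.partitions", "6"), should avoid that.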
On Tue, Jul 7, 2015 at 12:52 AM, Sathish Kumaran Vairavelu <vsathishkuma...@gmail.com> wrote:

> Try the coalesce function to limit the number of part files.
>
> On Mon, Jul 6, 2015 at 1:23 PM kachau <umesh.ka...@gmail.com> wrote:
>
>> Hi, I have a couple of Spark jobs that process thousands of files every
>> day. File sizes vary from MBs to GBs. After the job finishes I usually
>> save the output using the following code:
>>
>> finalJavaRDD.saveAsParquetFile("/path/in/hdfs"); OR
>> dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file as of Spark 1.4
>>
>> The Spark job creates plenty of small part files in the final output
>> directory. As far as I understand, Spark creates a part file for each
>> partition/task; please correct me if I am wrong. How do we control the
>> number of part files Spark creates? Finally, I would like to create Hive
>> tables over these Parquet/ORC directories, and I have heard that Hive is
>> slow when there are a large number of small files. Please guide me; I am
>> new to Spark. Thanks in advance.
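For the question in the quote, the practical difference between the two calls, as a self-contained sketch against the Spark 1.4 APIs (the toy data and paths here are made up for illustration; it writes Parquet because, as far as I know, format("orc") in 1.4 needs a HiveContext, while Parquet works with a plain SQLContext):

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.sql.SQLContext

    object PartFileDemo {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("part-file-demo"))
        val sqlContext = new SQLContext(sc)
        import sqlContext.implicits._

        // Toy DataFrame with 200 partitions, standing in for real job output.
        val df = sc.parallelize(1 to 1000000, 200)
          .map(i => (i, i.toString))
          .toDF("id", "value")

        // coalesce(n): merges down to n partitions without a full shuffle --
        // cheap, but output file sizes can be uneven.
        df.coalesce(6).write.format("parquet").save("/path/in/hdfs/coalesced")

        // repartition(n): full shuffle into n roughly equal partitions --
        // costlier, but gives evenly sized output files.
        df.repartition(6).write.format("parquet").save("/path/in/hdfs/repartitioned")
      }
    }

Since one output file is written per partition of the final DataFrame/RDD, either call placed just before the save controls the number of part files; coalesce is usually enough when you only want fewer files.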