Try the coalesce function to limit the number of part files.

On Mon, Jul 6, 2015 at 1:23 PM kachau <umesh.ka...@gmail.com> wrote:
> Hi,
>
> I have a couple of Spark jobs which process thousands of files every
> day. File sizes may vary from MBs to GBs. After finishing a job I usually
> save the output using the following code:
>
> finalJavaRDD.saveAsParquetFile("/path/in/hdfs");
>
> OR
>
> dataFrame.write.format("orc").save("/path/in/hdfs") // storing as ORC file as of Spark 1.4
>
> The Spark job creates plenty of small part files in the final output
> directory. As far as I understand, Spark creates a part file for each
> partition/task; please correct me if I am wrong. How do we control the
> number of part files Spark creates? Finally, I would like to create a Hive
> table using these Parquet/ORC directories, and I heard Hive is slow when
> we have a large number of small files. Please guide me; I am new to Spark.
> Thanks in advance.
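
To make the coalesce suggestion concrete: since each partition is written out as one part file, the number passed to coalesce is roughly the number of files you get. Below is a minimal sketch of one way to pick that number from the expected output size. The class name PartFileSizing, the method targetPartitions, the 128 MB target and the 10 GB total are all assumptions made up for illustration; they are not part of Spark or of the original job. The Spark calls appear only in comments so the snippet runs without Spark on the classpath.

```java
// Hypothetical helper (not part of Spark): picks a partition count so that
// each output part file lands near a chosen target size.
public final class PartFileSizing {
    private PartFileSizing() {}

    /** Ceiling of totalBytes / targetFileBytes, never less than 1. */
    public static int targetPartitions(long totalBytes, long targetFileBytes) {
        if (totalBytes < 0 || targetFileBytes <= 0) {
            throw new IllegalArgumentException(
                "totalBytes must be >= 0 and targetFileBytes must be > 0");
        }
        long parts = totalBytes / targetFileBytes
                + (totalBytes % targetFileBytes == 0 ? 0 : 1);
        // Clamp into the int range that coalesce(int) accepts, with a floor of 1.
        return (int) Math.max(1L, Math.min(parts, (long) Integer.MAX_VALUE));
    }

    public static void main(String[] args) {
        long target = 128L * 1024 * 1024;       // assumed target part-file size: 128 MB
        long total = 10L * 1024 * 1024 * 1024;  // assumed total output size: 10 GB
        int n = targetPartitions(total, target);
        System.out.println(n);                  // prints 80

        // With Spark available, the count would be applied just before the save:
        //   dataFrame.coalesce(n).write().format("orc").save("/path/in/hdfs");
        // The same coalesce(n) call can be placed in front of saveAsParquetFile.
    }
}
```

Keep in mind that coalesce merges existing partitions without a shuffle, so a very small count also reduces the parallelism of the final stage. If that makes the job too slow, repartition(n) is the shuffling alternative.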