I have a spark job that creates 6 million rows in RDDs. I convert the RDD into Data-frame and write it to HDFS. Currently it takes 3 minutes to write it to HDFS.
Here is the snippet:- RDDList.parallelStream().forEach(mapJavaRDD -> { if (mapJavaRDD != null) { JavaRDD<Row> rowRDD = mapJavaRDD.mapPartitionsWithIndex((integer, v2) -> { <logical operation> return new ArrayList<Row>(1).iterator(); }, false); DataFrame dF = sqlContext.createDataFrame(rowRDD, schema).coalesce(3); synchronized (finalLock) { dF.write().mode(SaveMode.Append).parquet("hdfs location"); } }); After looking into the logs I know the following is the reason for the job taking too long:- *dF.write().mode(SaveMode.Append).parquet("hdfs location");* I also get the following errors due to it:- 15/10/21 21:12:30 WARN scheduler.TaskSetManager: Stage 31 contains a task of very large size (378 KB). The maximum recommended task size is 100 KB.4 of these kind of warnings appeared. java.lang.IllegalArgumentException: java.lang.IllegalArgumentException: spark.sql.execution.id is already set