I was getting the following error without it:

org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /.gz.parquet (inode ): File does not exist. [Lease. Holder: DFSClient_NONMAPREDUCE_, pendingcreates: 1]

I think that is due to a deadlock.

On Tue, Nov 3, 2015 at 7:48 AM, Ted Yu <yuzhih...@gmail.com> wrote:

> I am a bit curious: why is the synchronization on finalLock needed?
>
> Thanks
>
> On Oct 23, 2015, at 8:25 AM, Anubhav Agarwal <anubha...@gmail.com> wrote:
>
> I have a Spark job that creates 6 million rows in RDDs. I convert the RDD
> into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write
> it to HDFS.
>
> Here is the snippet:
> RDDList.parallelStream().forEach(mapJavaRDD -> {
>     if (mapJavaRDD != null) {
>         JavaRDD<Row> rowRDD =
>             mapJavaRDD.mapPartitionsWithIndex((integer, v2) -> {
>                 <logical operation>
>                 return new ArrayList<Row>(1).iterator();
>             }, false);
>
>         DataFrame dF = sqlContext.createDataFrame(rowRDD, schema).coalesce(3);
>         synchronized (finalLock) {
>             dF.write().mode(SaveMode.Append).parquet("hdfs location");
>         }
>     }
> });
>
> After looking into the logs I know the following is the reason for the job
> taking too long:
> *dF.write().mode(SaveMode.Append).parquet("hdfs location");*
>
> I also get the following errors due to it:
> 15/10/21 21:12:30 WARN scheduler.TaskSetManager: Stage 31 contains a task
> of very large size (378 KB). The maximum recommended task size is 100 KB.
> Four warnings of this kind appeared.
>
> java.lang.IllegalArgumentException: java.lang.IllegalArgumentException:
> spark.sql.execution.id is already set
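
For reference, the locking pattern in the quoted snippet can be sketched in plain Java, without Spark. This is only an illustration under assumptions: the class name SerializedAppendSketch, the appendAll helper, and the in-memory list standing in for the HDFS output path are all hypothetical. The sketch does not reproduce the LeaseExpiredException, and it does not show whether a deadlock occurs. It only shows how synchronizing on a shared lock makes the appends issued from a parallel stream run one at a time.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SerializedAppendSketch {
    // Shared lock, playing the role of finalLock in the quoted snippet.
    private static final Object finalLock = new Object();

    // Hypothetical stand-in for the shared output location.
    // ArrayList is not thread-safe, so unsynchronized concurrent appends could corrupt it.
    private static final List<String> sink = new ArrayList<>();

    // Appends every non-null batch to the shared sink from a parallel stream
    // and returns the total number of elements written.
    static int appendAll(List<List<String>> batches) {
        sink.clear();
        batches.parallelStream().forEach(batch -> {
            if (batch != null) {
                // Only one worker thread at a time may append to the sink.
                synchronized (finalLock) {
                    sink.addAll(batch);
                }
            }
        });
        return sink.size();
    }

    public static void main(String[] args) {
        List<List<String>> batches = Arrays.asList(
            Arrays.asList("a", "b"),
            null,                      // skipped, like the null check in the snippet
            Arrays.asList("c"),
            Arrays.asList("d", "e", "f"));
        System.out.println(appendAll(batches));
    }
}
```

The trade-off is the one visible in the thread: the lock keeps concurrent appends from interfering with each other, but it also means the writes happen strictly one after another, so the parallel stream gives no speed-up for the write step itself.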