Re: Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-11-03 Thread Ted Yu
I am a bit curious: why is the synchronization on finalLock needed? Thanks > On Oct 23, 2015, at 8:25 AM, Anubhav Agarwal wrote: > > I have a spark job that creates 6 million rows in RDDs. I convert the RDD > into Data-frame and write it to HDFS. Currently it takes 3

Re: Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-11-03 Thread Anubhav Agarwal
I was getting the following error without it: org.apache.hadoop.ipc.RemoteException(org.apache.hadoop.hdfs.server.namenode.LeaseExpiredException): No lease on /.gz.parquet (inode ): File does not exist. [Lease. Holder: DFSClient_NONMAPREDUCE_, pendingcreates: 1] I think that is due to a deadlock.

Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-10-23 Thread morfious902002
I have a Spark job that creates 6 million rows in RDDs. I convert each RDD into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write it to HDFS. I am using Spark 1.5.1 with YARN. Here is the snippet: RDDList.parallelStream().forEach(mapJavaRDD -> { if

Improve parquet write speed to HDFS and spark.sql.execution.id is already set ERROR

2015-10-23 Thread Anubhav Agarwal
I have a Spark job that creates 6 million rows in RDDs. I convert each RDD into a DataFrame and write it to HDFS. Currently it takes 3 minutes to write it to HDFS. Here is the snippet: RDDList.parallelStream().forEach(mapJavaRDD -> { if (mapJavaRDD != null) {