In the code, instead of a map, try to use mapPartitions. Can you look at the event timeline and see where it's taking time?
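The reason mapPartitions can help: map invokes your function once per record, while mapPartitions invokes it once per partition with an iterator, so any per-record setup cost (opening a connection, instantiating a writer or parser) is paid once per partition instead. A minimal pure-Java sketch of that difference, mimicking the two contracts with plain collections (this is not the Spark API; the connection counter and the partition layout are purely illustrative):

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

public class MapVsMapPartitions {
    // Stand-in for an expensive per-use resource (hypothetical).
    static int connectionsOpened = 0;

    // map-style contract: the function sees one record, so any setup
    // it needs happens once per record.
    static String lookup(String record) {
        connectionsOpened++;
        return record.toUpperCase();
    }

    // mapPartitions-style contract: the function sees the whole
    // partition as an iterator, so setup happens once per partition.
    static List<String> lookupPartition(Iterator<String> records) {
        connectionsOpened++;
        List<String> out = new ArrayList<>();
        while (records.hasNext()) {
            out.add(records.next().toUpperCase());
        }
        return out;
    }

    public static void main(String[] args) {
        List<List<String>> partitions = Arrays.asList(
                Arrays.asList("a", "b", "c"),
                Arrays.asList("d", "e"));

        // map: invoked per record -> 5 "connections" for 5 records
        connectionsOpened = 0;
        for (List<String> p : partitions)
            for (String r : p) lookup(r);
        System.out.println("map-style connections: " + connectionsOpened);

        // mapPartitions: invoked per partition -> 2 "connections"
        connectionsOpened = 0;
        for (List<String> p : partitions)
            lookupPartition(p.iterator());
        System.out.println("mapPartitions-style connections: " + connectionsOpened);
    }
}
```

In real Spark code the same shift means moving resource creation out of the lambda passed to map and into the top of the function passed to mapPartitions.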
[image: Inline image 1]
You can see it from the driver UI under the Stages tab.

Thanks
Best Regards

On Sat, Dec 5, 2015 at 11:14 PM, Ram VISWANADHA <ram.viswana...@dailymotion.com> wrote:

> I tried partitionBy with a HashPartitioner; still the same issue.
> groupBy operation:
> https://gist.github.com/ramv-dailymotion/4e19b96b625c52d7ed3b#file-saveasparquet-java-L51
> Join operation:
> https://gist.github.com/ramv-dailymotion/4e19b96b625c52d7ed3b#file-saveasparquet-java-L80
>
> Best Regards,
> Ram
> --
>
> Date: Saturday, December 5, 2015 at 7:18 AM
> To: Akhil Das <ak...@sigmoidanalytics.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: Improve saveAsTextFile performance
>
> >If you are doing a join/groupBy kind of operations then you need to make
> >sure the keys are evenly distributed throughout the partitions.
>
> Yes, I am doing join/groupBy operations. Can you point me to docs on how
> to do this?
>
> Spark 1.5.2
>
> First attempt (Aggregated Metrics by Executor):
> Executor ID: 32
> Address: rc-spark-poc-w-3.c.dailymotion-data.internal:51748
> Task Time: 1.2 h
> Total Tasks: 18 (0 failed, 18 succeeded)
> Shuffle Read Size / Records: 4.4 MB / 167812
> Shuffle Write Size / Records: 51.5 GB / 128713
> Shuffle Spill (Memory): 153.1 GB
> Shuffle Spill (Disk): 51.1 GB
>
> Second attempt (Aggregated Metrics by Executor):
> Executor ID: 5
> Address: rc-spark-poc-w-1.c.dailymotion-data.internal:41061
> Task Time: 47 min
> Total Tasks: 8 (0 failed, 8 succeeded)
> Shuffle Read Size / Records: 3.9 MB / 95334
>
> Best Regards,
> Ram
>
> From: Akhil Das <ak...@sigmoidanalytics.com>
> Date: Saturday, December 5, 2015 at 1:32 AM
> To: Ram VISWANADHA <ram.viswana...@dailymotion.com>
> Cc: user <user@spark.apache.org>
> Subject: Re: Improve saveAsTextFile performance
>
> Which version of Spark are you using? Can you look at the event timeline
> and the DAG of the job and see where it's spending more time?
> .save simply triggers your entire pipeline. If you are doing a
> join/groupBy kind of operation, then you need to make sure the keys are
> evenly distributed throughout the partitions.
>
> Thanks
> Best Regards
>
> On Sat, Dec 5, 2015 at 8:24 AM, Ram VISWANADHA <
> ram.viswana...@dailymotion.com> wrote:
>
>> That didn’t work :(
>> Any help? I have documented some steps here:
>>
>> http://stackoverflow.com/questions/34048340/spark-saveastextfile-last-stage-almost-never-finishes
>>
>> Best Regards,
>> Ram
>>
>> From: Sahil Sareen <sareen...@gmail.com>
>> Date: Wednesday, December 2, 2015 at 10:18 PM
>> To: Ram VISWANADHA <ram.viswana...@dailymotion.com>
>> Cc: Ted Yu <yuzhih...@gmail.com>, user <user@spark.apache.org>
>> Subject: Re: Improve saveAsTextFile performance
>>
>> http://stackoverflow.com/questions/29213404/how-to-split-an-rdd-into-multiple-smaller-rdds-given-a-max-number-of-rows-per
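On "keys evenly distributed throughout the partitions": when a handful of keys are hot, a HashPartitioner alone cannot help, because one key always hashes to one partition. A standard trick is "salting": append a small deterministic suffix to the key so a hot key spreads over several partitions, then merge the partial results afterwards. A pure-Java sketch of the placement effect (the modulo rule below mirrors hash-partitioner-style placement, but NUM_SALTS, NUM_PARTITIONS, and the record layout are illustrative assumptions, not Spark API):

```java
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

public class SaltedKeys {
    static final int NUM_SALTS = 4;
    static final int NUM_PARTITIONS = 4;

    // Hash-partitioner-style placement: nonNegative(hash) % numPartitions.
    static int partitionFor(String key) {
        return Math.floorMod(key.hashCode(), NUM_PARTITIONS);
    }

    // Salt a key deterministically from a per-record id, so the same
    // record always lands in the same salted bucket.
    static String salt(String key, int recordId) {
        return key + "#" + (recordId % NUM_SALTS);
    }

    public static void main(String[] args) {
        // A heavily skewed input: 12 records, all with the same key.
        List<String> records = Collections.nCopies(12, "hotKey");

        Map<Integer, Integer> unsalted = new TreeMap<>();
        Map<Integer, Integer> salted = new TreeMap<>();
        for (int i = 0; i < records.size(); i++) {
            String key = records.get(i);
            unsalted.merge(partitionFor(key), 1, Integer::sum);
            salted.merge(partitionFor(salt(key, i)), 1, Integer::sum);
        }
        // Unsalted: all 12 records pile up in a single partition.
        System.out.println("unsalted partition counts: " + unsalted);
        // Salted: the 12 records spread across 4 partitions, 3 each.
        System.out.println("salted partition counts:   " + salted);
    }
}
```

The cost is a second aggregation step: after grouping by the salted key, you group once more by the original key to combine the per-salt partial results.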