Hi, I was able to solve the issue. Putting down the settings that worked for me.
1) It was happening due to the large number of partitions. I coalesce()'d the RDD as early as possible in my code into far fewer partitions (used .coalesce(10000) to bring it down from 500K to 10K).

2) Increased the values of the parameters spark.akka.frameSize (= 500), spark.akka.timeout, spark.akka.askTimeout, and spark.core.connection.ack.wait.timeout to get rid of the insufficient-frame-size and timeout errors.

Thanks
Himanish

On Thu, Feb 26, 2015 at 5:00 PM, Himanish Kushary <himan...@gmail.com> wrote:

> Hi,
>
> I am working with an RDD (PairRDD) with 500K+ partitions. The RDD is loaded
> into memory; its size is around 18G.
>
> Whenever I run distinct() on the RDD, CPU usage on the driver host
> (spark-shell in yarn-client mode) rockets up (400+%) and the distinct()
> process seems to stall. The Spark driver UI also hangs.
>
> In Ganglia the only node with high load is the driver host. I have tried
> repartitioning the data into a smaller number of partitions (using
> coalesce or repartition) with no luck.
>
> I have attached the jstack output, which shows a few threads in BLOCKED
> status. Not sure what exactly is going on here.
>
> The driver program was started with 15G memory on AWS EMR. Appreciate any
> thoughts regarding the issue.
>
> --
> Thanks & Regards
> Himanish

--
Thanks & Regards
Himanish
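For reference, the two fixes above can be sketched roughly as below. This is a sketch, not the exact command from the thread: only spark.akka.frameSize = 500 and the 15G driver memory are stated above; the timeout values are hypothetical placeholders, and the spark.akka.* settings apply to Spark 1.x only (they were removed in Spark 2.0).

```shell
# Sketch, assuming Spark 1.x in yarn-client mode on EMR.
# frameSize=500 (MB) is from the thread; the timeout values
# below are hypothetical placeholders, not the ones actually used.
spark-shell --master yarn-client \
  --driver-memory 15g \
  --conf spark.akka.frameSize=500 \
  --conf spark.akka.timeout=600 \
  --conf spark.akka.askTimeout=600 \
  --conf spark.core.connection.ack.wait.timeout=600

# Then, inside the shell, cut the partition count early, before distinct():
#   val compact = pairRdd.coalesce(10000)  // 500K -> 10K partitions, no full shuffle
#   compact.distinct().count()
```

coalesce() merges existing partitions without a full shuffle, which is why it is cheaper here than repartition() for reducing 500K partitions to 10K.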