Thanks for coming back to the list with the response!

On Fri, Feb 27, 2015 at 3:16 PM, Himanish Kushary <himan...@gmail.com> wrote:
> Hi,
>
> I was able to solve the issue. Putting down the settings that worked for
> me.
>
> 1) It was happening due to the large number of partitions. I *coalesce*'d
> the RDD as early as possible in my code into far fewer partitions (used
> .coalesce(10000) to bring the count down from 500K to 10K).
>
> 2) Increased the settings for the parameters *spark.akka.frameSize* (=
> 500), *spark.akka.timeout*, *spark.akka.askTimeout*, and
> *spark.core.connection.ack.wait.timeout* to get rid of any insufficient
> frame size and timeout errors.
>
> Thanks
> Himanish
>
> On Thu, Feb 26, 2015 at 5:00 PM, Himanish Kushary <himan...@gmail.com>
> wrote:
>
>> Hi,
>>
>> I am working with an RDD (PairRDD) with 500K+ partitions. The RDD is
>> loaded into memory; its size is around 18G.
>>
>> Whenever I run distinct() on the RDD, CPU usage on the driver host
>> (spark-shell in yarn-client mode) rockets up (400+%) and the distinct()
>> process seems to stall. The Spark driver UI also hangs.
>>
>> In Ganglia the only node with high load is the driver host. I have tried
>> repartitioning the data into a smaller number of partitions (using
>> coalesce or repartition) with no luck.
>>
>> I have attached the jstack output, which shows a few threads in BLOCKED
>> status. Not sure what exactly is going on here.
>>
>> The driver program was started with 15G memory on AWS EMR. Appreciate any
>> thoughts regarding the issue.
>>
>> --
>> Thanks & Regards
>> Himanish
>
> --
> Thanks & Regards
> Himanish
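For anyone finding this thread later, the two fixes above can be sketched as follows. This is a minimal, hedged example: the application name, input path, and timeout values of 300 are illustrative placeholders (only spark.akka.frameSize = 500 comes from the thread), and the property names are the Spark 1.x Akka-era settings discussed here (they were replaced in later Spark versions).

```scala
import org.apache.spark.{SparkConf, SparkContext}

object DistinctTuning {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf()
      .setAppName("distinct-tuning") // hypothetical app name
      // Fix 2: raise the Akka frame size (in MB) to avoid
      // "insufficient frame size" errors on large task results
      .set("spark.akka.frameSize", "500")
      // Generous timeouts (values here are illustrative, not from the thread)
      .set("spark.akka.timeout", "300")
      .set("spark.akka.askTimeout", "300")
      .set("spark.core.connection.ack.wait.timeout", "300")
    val sc = new SparkContext(conf)

    val pairs = sc.textFile("hdfs:///path/to/input") // hypothetical input path
      .map(line => (line, 1))
      // Fix 1: collapse ~500K partitions down to 10K as early as possible;
      // coalesce avoids a full shuffle when only reducing the partition count
      .coalesce(10000)

    // distinct() now runs over 10K partitions instead of 500K, which keeps
    // the driver from drowning in per-partition task bookkeeping
    println(pairs.distinct().count())
    sc.stop()
  }
}
```

The key insight is that the driver tracks metadata for every task, so 500K+ partitions means 500K+ tasks per stage; coalescing before the wide distinct() shuffle is what relieves the driver-side CPU load described in the original message.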