Hi,

I was able to solve the issue. Putting down the settings that worked for me.

1) It was happening due to the large number of partitions. I coalesce()'d
the RDD as early as possible in my code into far fewer partitions (used
.coalesce(10000) to bring the count down from 500K to 10K)
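In spark-shell that change looks roughly like this (a minimal sketch; the RDD name `pairRdd` is just illustrative):

```scala
// Coalesce early, before any shuffle-heavy operation, to bring 500K
// partitions down to 10K. coalesce() avoids a full shuffle when only
// reducing the partition count (unlike repartition()).
val coalesced = pairRdd.coalesce(10000)

// distinct() now produces far fewer tasks (and task-status messages)
// for the driver to track, which is what was overwhelming it.
val deduped = coalesced.distinct()
```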

2) Increased the settings for the parameters spark.akka.frameSize (= 500),
spark.akka.timeout, spark.akka.askTimeout and
spark.core.connection.ack.wait.timeout to get rid of any insufficient
frame size and timeout errors
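For reference, these can all be passed when launching the shell. frameSize = 500 is the value from my run; the timeout values below are only illustrative placeholders, tune them for your own job:

```shell
spark-shell \
  --conf spark.akka.frameSize=500 \
  --conf spark.akka.timeout=600 \
  --conf spark.akka.askTimeout=600 \
  --conf spark.core.connection.ack.wait.timeout=600
```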

Thanks
Himanish

On Thu, Feb 26, 2015 at 5:00 PM, Himanish Kushary <himan...@gmail.com>
wrote:

> Hi,
>
> I am working with an RDD (PairRDD) with 500K+ partitions. The RDD is loaded
> into memory; the size is around 18G.
>
> Whenever I run a distinct() on the RDD, the driver (spark-shell in
> yarn-client mode) host CPU usage rockets up (400+ %) and the distinct()
> process seems to stall. The Spark driver UI also hangs.
>
> In Ganglia the only node with high load is the driver host. I have tried
> repartitioning the data into a smaller number of partitions (using coalesce
> or repartition) with no luck.
>
> I have attached the jstack output, which shows a few threads in BLOCKED
> status. Not sure what exactly is going on here.
>
> The driver program was started with 15G memory on AWS EMR. Appreciate any
> thoughts regarding the issue.
>
> --
> Thanks & Regards
> Himanish
>



-- 
Thanks & Regards
Himanish
