Re: High CPU usage in Driver

2015-02-27 Thread Paweł Szulc
Thanks for coming back to the list with the solution!

On Fri, Feb 27, 2015 at 3:16 PM, Himanish Kushary himan...@gmail.com
wrote:

 Hi,

 I was able to solve the issue. Putting down the settings that worked for
 me.

 1) It was happening due to the large number of partitions. I *coalesce*'d
 the RDD as early as possible in my code into far fewer partitions (used
 .coalesce(10000) to bring it down from 500K to 10K).

 2) Increased the parameters *spark.akka.frameSize* (= 500), *spark.akka.timeout*,
 *spark.akka.askTimeout*, and *spark.core.connection.ack.wait.timeout* to get rid
 of any insufficient-frame-size and timeout errors.

 Thanks
 Himanish

 On Thu, Feb 26, 2015 at 5:00 PM, Himanish Kushary himan...@gmail.com
 wrote:

 Hi,

 I am working with an RDD (PairRDD) with 500K+ partitions. The RDD is
 loaded into memory; its size is around 18 GB.

 Whenever I run distinct() on the RDD, the driver (spark-shell in
 yarn-client mode) host CPU usage rockets up (400%+) and the distinct()
 job seems to stall. The Spark driver UI also hangs.

 In Ganglia, the only node with high load is the driver host. I have tried
 repartitioning the data into fewer partitions (using coalesce or
 repartition) with no luck.

 I have attached the jstack output, which shows a few threads in BLOCKED
 state. Not sure what exactly is going on here.

 The driver program was started with 15 GB of memory on AWS EMR. I'd
 appreciate any thoughts on the issue.

 --
 Thanks & Regards
 Himanish




 --
 Thanks & Regards
 Himanish



Re: High CPU usage in Driver

2015-02-27 Thread Himanish Kushary
Hi,

I was able to solve the issue. Putting down the settings that worked for me.

1) It was happening due to the large number of partitions. I *coalesce*'d
the RDD as early as possible in my code into far fewer partitions (used
.coalesce(10000) to bring it down from 500K to 10K).

2) Increased the parameters *spark.akka.frameSize* (= 500), *spark.akka.timeout*,
*spark.akka.askTimeout*, and *spark.core.connection.ack.wait.timeout* to get rid
of any insufficient-frame-size and timeout errors (a rough sketch of both changes
follows below).
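
A rough, self-contained sketch of what both changes might look like together. Only
frameSize = 500 and the 500K-to-10K coalesce come from this thread; the app name,
the input source, and the specific timeout values are illustrative assumptions. In
spark-shell (yarn-client) the properties would instead be set at launch (e.g. with
--conf), since the shell creates sc itself:

  import org.apache.spark.{SparkConf, SparkContext}
  import org.apache.spark.SparkContext._ // Spark 1.x implicits for sequenceFile

  // Larger Akka frame size plus longer timeouts, set before the context starts.
  val conf = new SparkConf()
    .setAppName("dedupe-with-coalesce")                   // hypothetical name
    .set("spark.akka.frameSize", "500")                   // MB, value from the thread
    .set("spark.akka.timeout", "300")                     // seconds, assumed value
    .set("spark.akka.askTimeout", "300")                  // seconds, assumed value
    .set("spark.core.connection.ack.wait.timeout", "600") // seconds, assumed value
  val sc = new SparkContext(conf)

  // Coalesce as early as possible so that a later wide operation such as
  // distinct() shuffles ~10K partitions instead of 500K+.
  val pairs = sc.sequenceFile[String, String]("hdfs:///path/to/input") // assumed source
  val deduped = pairs.coalesce(10000).distinct()

Note that coalesce(10000) merges existing partitions without a full shuffle, which
is why doing it before the distinct() keeps the task count (and the driver's
scheduling and shuffle bookkeeping) manageable.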

Thanks
Himanish

On Thu, Feb 26, 2015 at 5:00 PM, Himanish Kushary himan...@gmail.com
wrote:

 Hi,

 I am working with an RDD (PairRDD) with 500K+ partitions. The RDD is
 loaded into memory; its size is around 18 GB.

 Whenever I run distinct() on the RDD, the driver (spark-shell in
 yarn-client mode) host CPU usage rockets up (400%+) and the distinct()
 job seems to stall. The Spark driver UI also hangs.
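
A minimal sketch (as typed into spark-shell) of the kind of job being described;
the data source, the key/value types, and the path are illustrative assumptions,
not details from this thread:

  import org.apache.spark.rdd.RDD

  // Hypothetical pair RDD loaded with an extremely high partition count.
  val pairs: RDD[(String, String)] =
    sc.sequenceFile[String, String]("hdfs:///path/to/input") // assumed source
  println(pairs.partitions.length) // on the order of 500,000 partitions here

  // distinct() shuffles every partition; with 500K+ partitions the driver must
  // schedule and track a huge number of tasks and shuffle outputs, which is
  // consistent with the driver-side CPU spike reported here.
  val deduped = pairs.distinct()
  deduped.count() // the stall is observed while this job runs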

 In Ganglia, the only node with high load is the driver host. I have tried
 repartitioning the data into fewer partitions (using coalesce or
 repartition) with no luck.

 I have attached the jstack output, which shows a few threads in BLOCKED
 state. Not sure what exactly is going on here.

 The driver program was started with 15 GB of memory on AWS EMR. I'd
 appreciate any thoughts on the issue.

 --
 Thanks & Regards
 Himanish




-- 
Thanks & Regards
Himanish