Thanks for coming back to the list with the resolution!
On Fri, Feb 27, 2015 at 3:16 PM, Himanish Kushary <himan...@gmail.com>
wrote:
Hi,
I was able to solve the issue. Putting down the settings that worked for
me.
1) It was happening due to the large number of partitions. I *coalesce*'d
the RDD as early as possible in my code into far fewer partitions (used
coalesce to bring the count down from 500K to 10K)
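As a side note, the reason coalesce is cheap here is that it only merges existing partitions into fewer buckets without a full shuffle. A minimal Spark-free sketch of that idea (the function and the list-of-lists representation are illustrative, not Spark's actual implementation):

```python
def coalesce(partitions, n):
    """Merge a list of partitions into n buckets, keeping consecutive
    partitions together and never splitting one (i.e., no shuffle)."""
    total = len(partitions)
    buckets = [[] for _ in range(n)]
    for i, part in enumerate(partitions):
        # Partition i lands in bucket floor(i * n / total), so
        # neighbouring partitions end up in the same bucket.
        buckets[i * n // total].extend(part)
    return buckets

# Scaled-down version of the 500K -> 10K reduction: 500 -> 10
parts = [[x] for x in range(500)]
merged = coalesce(parts, 10)
```

Each of the 10 buckets simply concatenates 50 of the original single-element partitions; no record moves between buckets out of order.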
2) Increased the values of the parameters spark.akka.frameSize (= 500),
spark.akka.timeout, spark.akka.askTimeout and
spark.core.connection.ack.wait.timeout
to get rid of the insufficient frame size and timeout errors
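For reference, settings like these can be passed straight to the shell at launch. Only frameSize=500 comes from the fix above; the timeout values here are placeholders to tune for your job, and the spark.akka.* properties apply to the Akka-based RPC in Spark 1.x:

```
spark-shell \
  --conf spark.akka.frameSize=500 \
  --conf spark.akka.timeout=300 \
  --conf spark.akka.askTimeout=300 \
  --conf spark.core.connection.ack.wait.timeout=300
```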
Thanks
Himanish
On Thu, Feb 26, 2015 at 5:00 PM, Himanish Kushary <himan...@gmail.com>
wrote:
Hi,
I am working with an RDD (PairRDD) with 500K+ partitions. The RDD is
loaded into memory; its size is around 18G.
Whenever I run distinct() on the RDD, CPU usage on the driver host
(spark-shell in yarn-client mode) rockets up (400+%) and the distinct()
operation seems to stall. The Spark driver UI also hangs.
In Ganglia the only node with high load is the driver host. I have tried
repartitioning the data into a smaller number of partitions (using coalesce
or repartition) with no luck.
I have attached the jstack output, which shows a few threads in BLOCKED
state. Not sure what exactly is going on here.
The driver program was started with 15G memory on AWS EMR. Appreciate any
thoughts regarding the issue.
--
Thanks Regards
Himanish
--
Thanks Regards
Himanish