You can resort to serialized storage (still in memory) of your RDDs - this
will greatly reduce GC pressure, since each partition is stored as a single
serialized buffer rather than as many small objects. With the OFF_HEAP
storage level the data even lives outside the JVM heap entirely (most likely
in Tachyon, the distributed in-memory file system that Spark can use
internally).
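
For instance, a minimal sketch of what this could look like (the jar name is
illustrative, and the persist calls shown in the comments would go in your
application code):

```shell
# Switch to Kryo serialization - generally faster and more compact than
# the default Java serialization, which matters a lot for serialized storage
spark-submit \
  --conf spark.serializer=org.apache.spark.serializer.KryoSerializer \
  app.jar   # hypothetical application jar

# In the application, persist with a serialized storage level, e.g.:
#   rdd.persist(StorageLevel.MEMORY_ONLY_SER)   // serialized, on-heap
#   rdd.persist(StorageLevel.OFF_HEAP)          // off the JVM heap (Tachyon)
```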

 

Also review the object-oriented model of your RDD elements to see whether it
consists of too many redundant objects and multiple levels of hierarchy - in
high-performance computing and in distributed, object-oriented cluster
frameworks like Spark, some of the classic "OO patterns" represent
unnecessary overhead ..

 

From: Shuai Zheng [mailto:szheng.c...@gmail.com] 
Sent: Thursday, April 23, 2015 6:14 PM
To: user@spark.apache.org
Subject: Slower performance when bigger memory?

 

Hi All,

 

I am running some benchmarks on an r3.8xlarge instance. I have a cluster with
one master (no executor on it) and one slave (r3.8xlarge).

 

My job has 1000 tasks in stage 0.

 

An r3.8xlarge has 244G memory and 32 cores.

 

If I create 4 executors, each with 8 cores + 50G memory, each task takes
around 320s-380s. And if I use only one big executor with 32 cores and 200G
memory, each task takes 760s-900s.
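
For reference, the 4-executor layout above could be requested with flags
like the following (assuming the cluster manager is YARN; on standalone mode
the options differ, and the jar name is illustrative):

```shell
# 4 executors x 8 cores x 50G on the single slave node
spark-submit \
  --num-executors 4 \
  --executor-cores 8 \
  --executor-memory 50g \
  benchmark.jar   # hypothetical application jar
```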

 

And I checked the log; it looks like minor GC takes much longer when using
200G memory:

 

285.242: [GC [PSYoungGen: 29027310K->8646087K(31119872K)]
38810417K->19703013K(135977472K), 11.2509770 secs] [Times: user=38.95
sys=120.65, real=11.25 secs] 
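
(The GC log line above comes from JVM GC logging, which can be enabled on
the executors with something like the following - the exact flag set is
illustrative, and the jar name is a placeholder:)

```shell
# Turn on detailed, timestamped GC logging in each executor JVM
spark-submit \
  --conf "spark.executor.extraJavaOptions=-verbose:gc -XX:+PrintGCDetails -XX:+PrintGCTimeStamps" \
  app.jar   # hypothetical application jar
```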

 

And when it uses 50G memory, each minor GC takes less than 1s.

 

I am trying to figure out the best way to configure Spark. For some special
reason, I am tempted to use bigger memory on a single executor if there is no
significant performance penalty. But now it looks like there is?

 

Does anyone have any idea?

 

Regards,

 

Shuai
