It's highly dependent on what the issue is with your particular job, but
the ones I modify most commonly are:

spark.storage.memoryFraction
spark.shuffle.memoryFraction
parallelism (a parameter on many RDD calls) -- increase from the default
level to get more, smaller tasks that are more likely to finish
Kryo serialization (spark.serializer) instead of the default Java one
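
As a rough sketch of how those settings might be written down and passed
along: the values below are purely illustrative, not recommendations, and
the helper name to_java_opts is made up for this example.  It assumes the
settings are handed to the JVM as -Dkey=value system properties, and it
uses spark.default.parallelism as a stand-in for the per-call parallelism
parameter.

```python
# Illustrative values only; tune them against your own job.
# `to_java_opts` is a hypothetical helper, not part of Spark.
SETTINGS = {
    "spark.storage.memoryFraction": 0.4,   # share of heap for cached RDDs
    "spark.shuffle.memoryFraction": 0.3,   # share of heap for shuffle aggregation
    "spark.default.parallelism": 400,      # more, smaller tasks
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}


def to_java_opts(settings):
    """Render settings as -Dkey=value JVM system properties, sorted by key."""
    return " ".join("-D%s=%s" % (k, v) for k, v in sorted(settings.items()))


if __name__ == "__main__":
    print(to_java_opts(SETTINGS))
```

Keeping the settings in one place like this makes it easier to change one
knob at a time and see which one actually made the job finish.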

A while back I also modified:
spark.storage.blockManagerTimeoutIntervalMs -- when a stop-the-world GC on
a slave caused heartbeat timeouts but the slave would eventually recover, I
would bump this parameter up from its default (this was on 0.7.x)

It's also been a while since I messed with GC tuning for Spark, but I'd
generally recommend capping your JVM size at about 31.5 GB so you can keep
compressed pointers.  Better to run multiple JVMs at that size than a
single one that's 128 GB, for example.  I think the general practice for
GC tuning is to have your interactive-time JVMs (like the driver or the
master) run with the concurrent mark-sweep (CMS) collector, and the bulk
computation JVMs (like a worker or executor) run with parallel GC.  Not
sure how the newer G1 collector fits into those.  And most of the time if
you're messing with GC parameters, your issues are actually at the Spark
level and you should be spending your time figuring out why that's causing
problems instead.
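
To make the sizing rule of thumb concrete, here is a minimal sketch of the
arithmetic.  The function names are hypothetical, the 31 GB cap is just the
figure above rounded down, and it ignores memory you'd want to leave for
the OS and other processes.  The two -XX flags are the standard HotSpot
switches for the CMS and parallel collectors.

```python
import math

MAX_HEAP_GB = 31  # stay under ~32 GB so the JVM keeps compressed pointers


def split_heap(total_gb, max_heap_gb=MAX_HEAP_GB):
    """Return (jvm_count, heap_gb_each) for a memory budget in whole GB."""
    count = max(1, math.ceil(total_gb / max_heap_gb))
    return count, total_gb // count


def gc_flag(role):
    """Pick a collector flag by JVM role, following the rule of thumb above."""
    interactive = {"driver", "master"}
    if role in interactive:
        return "-XX:+UseConcMarkSweepGC"  # favor short pauses
    return "-XX:+UseParallelGC"           # favor throughput


if __name__ == "__main__":
    count, heap = split_heap(128)
    print("%d JVMs with -Xmx%dg %s" % (count, heap, gc_flag("executor")))
```

For a 128 GB budget this comes out to five JVMs of 25 GB each rather than
one huge heap.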

Another thing I did: if a job wouldn't finish but consisted of several
steps, I would manually save the output of each step to disk (HDFS) and
have the next step load it from there.  A lot of my jobs only need to
finish once, so as long as I get the result, it's OK even if the process
is more manual than it should be.
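
The save-each-step idea looks roughly like the sketch below.  This is not
Spark code: local pickle files stand in for HDFS, and run_step is a
made-up helper.  In a real job you'd write each intermediate RDD out with
something like saveAsObjectFile or saveAsTextFile and read it back in the
next run.

```python
import os
import pickle
import tempfile


def run_step(name, compute, checkpoint_dir):
    """Load the saved result for `name` if present; else compute and save it."""
    path = os.path.join(checkpoint_dir, name + ".pkl")
    if os.path.exists(path):
        with open(path, "rb") as f:
            return pickle.load(f)
    result = compute()
    tmp_path = path + ".tmp"
    with open(tmp_path, "wb") as f:
        pickle.dump(result, f)
    os.replace(tmp_path, path)  # only a complete file gets the final name
    return result


if __name__ == "__main__":
    ckpt = tempfile.mkdtemp()  # stand-in for an HDFS directory
    cleaned = run_step("cleaned", lambda: [x for x in range(10) if x % 2 == 0], ckpt)
    total = run_step("total", lambda: sum(cleaned), ckpt)
    print(cleaned, total)
```

If the job dies in a later step, rerunning it picks up from the last step
that was saved instead of starting over.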

Hope that helps!
Andrew



On Sun, Apr 13, 2014 at 4:33 PM, Jim Blomo <jim.bl...@gmail.com> wrote:

> On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash <and...@andrewash.com> wrote:
> > The biggest issue I've come across is that the cluster is somewhat
> > unstable when under memory pressure.  Meaning that if you attempt to
> > persist an RDD that's too big for memory, even with MEMORY_AND_DISK,
> > you'll often still get OOMs.  I had to carefully modify some of the
> > space tuning parameters and GC settings to get some jobs to even
> > finish.
>
> Would you mind sharing some of these settings?  Even just a GitHub
> gist would be helpful.  These are the main issues I've run into as
> well, and memory pressure also seems to be correlated with akka
> timeouts, possibly because of GC pauses.
>
