It's highly dependent on what the issue is with your particular job, but the ones I modify most commonly are:
spark.storage.memoryFraction spark.shuffle.memoryFraction parallelism (a parameter on many RDD calls) -- increase from the default level to get more, smaller tasks that are more likely to finish Use Kryo A while back I also modified: spark.storage.blockManagerTimeoutIntervalMs -- when a stop-the-world GC on a slave caused heartbeat timeouts but the slave would eventually recover, I would bump up this parameter from default (this was 0.7.x) It's also been a while since I messed with GC tuning for Spark, but I'd generally recommend capping your JVM size at about 31.5 GB so you can keep compressed pointers. Better to run multiple JVMs at that size than a single that's 128GB for example. I think the general practice for GC tuning is to have your interactive-time JVMs (like the driver or the master) run with the concurrent mark and sweep collector, and the bulk computation JVMs (like a worker or executor) run with parallel GC. Not sure how the newer G1 collector fits in to those. And most of the time if you're messing with GC parameters, your issues are actually at the Spark level and you should be spending your time figuring out why that's causing problems instead. Another thing I did was if a job wouldn't finish but consisted of several steps, I could manually save each step along the way to disk (HDFS) and load from there. A lot of my jobs only need to finish once, so as long as I get it done (even if it's a more manual process than it should be) is ok. Hope that helps! Andrew On Sun, Apr 13, 2014 at 4:33 PM, Jim Blomo <jim.bl...@gmail.com> wrote: > On Thu, Apr 10, 2014 at 12:24 PM, Andrew Ash <and...@andrewash.com> wrote: > > The biggest issue I've come across is that the cluster is somewhat > unstable > > when under memory pressure. Meaning that if you attempt to persist an > RDD > > that's too big for memory, even with MEMORY_AND_DISK, you'll often still > get > > OOMs. I had to carefully modify some of the space tuning parameters and > GC > > settings to get some jobs to even finish. > > Would you mind sharing some of these settings? Even just a GitHub > gist would be helpful. These are the main issues I've run into as > well, and memory pressure also seems to be correlated with akka > timeouts, possibly because of GC pauses. >