Hi, I have 4 very powerful boxes (256GB RAM, 32 cores each). I am running Spark 1.2.0 in yarn-client mode with the following layout:

spark.executor.cores=4
spark.executor.memory=28G
spark.yarn.executor.memoryOverhead=4096
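For completeness, this is roughly how that layout looks when passed through a SparkConf in yarn-client mode (just a minimal sketch; the app name, the explicit executor-instances setting, and the way I create the context are illustrative rather than my exact submission code):

import org.apache.spark.{SparkConf, SparkContext}

// Executor layout described above, set on the SparkConf before the
// SparkContext is created (yarn-client mode).
val conf = new SparkConf()
  .setAppName("als-implicit")                           // illustrative name
  .set("spark.executor.cores", "4")
  .set("spark.executor.memory", "28g")
  .set("spark.yarn.executor.memoryOverhead", "4096")    // in MB
  .set("spark.executor.instances", "25")                // the 25 executors mentioned below

val sc = new SparkContext(conf)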
I am submitting a bigger ALS trainImplicit job (rank=100, 15 iterations) on a dataset with ~3 billion ratings, using 25 executors. At some point one of the executors crashes with:

15/02/19 05:41:06 WARN util.AkkaUtils: Error sending message in 1 attempts
java.util.concurrent.TimeoutException: Futures timed out after [30 seconds]
        at scala.concurrent.impl.Promise$DefaultPromise.ready(Promise.scala:219)
        at scala.concurrent.impl.Promise$DefaultPromise.result(Promise.scala:223)
        at scala.concurrent.Await$$anonfun$result$1.apply(package.scala:107)
        at scala.concurrent.BlockContext$DefaultBlockContext$.blockOn(BlockContext.scala:53)
        at scala.concurrent.Await$.result(package.scala:107)
        at org.apache.spark.util.AkkaUtils$.askWithReply(AkkaUtils.scala:187)
        at org.apache.spark.executor.Executor$$anon$1.run(Executor.scala:398)
15/02/19 05:41:06 ERROR executor.Executor: Exception in task 131.0 in stage 51.0 (TID 7259)
java.lang.OutOfMemoryError: GC overhead limit exceeded
        at java.lang.reflect.Array.newInstance(Array.java:75)
        at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1671)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)
        at java.io.ObjectInputStream.readArray(ObjectInputStream.java:1707)
        at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1345)

The "GC overhead limit exceeded" part is clear enough and suggests the executor is running out of memory. But since I have 1TB of RAM available across the cluster, this is more likely down to a suboptimal configuration than an actual lack of memory. Can anyone point me in the right direction on how to tackle this?

Thanks,
Antony
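P.S. For reference, the call is essentially the stock MLlib entry point; a simplified sketch (the input path, the parsing, and everything other than rank and iterations are placeholders, not my real pipeline):

import org.apache.spark.mllib.recommendation.{ALS, Rating}
import org.apache.spark.rdd.RDD

// Ratings RDD built from the ~3 billion implicit-feedback records;
// sc is the SparkContext from the setup above.
val ratings: RDD[Rating] = sc.textFile("hdfs:///path/to/ratings")
  .map(_.split(','))
  .map { case Array(user, item, count) =>
    Rating(user.toInt, item.toInt, count.toDouble)
  }

// rank = 100, 15 iterations, as mentioned above; other ALS parameters left at defaults
val model = ALS.trainImplicit(ratings, 100, 15)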