My workers are running out of memory (OOM) over time. I am running a streaming job on Spark 1.4.0. Here is the relevant excerpt from a worker heap dump:
16,802 instances of "org.apache.spark.deploy.worker.ExecutorRunner", loaded by "sun.misc.Launcher$AppClassLoader @ 0xdff94088", occupy 488,249,688 bytes (95.80%). These instances are referenced from one instance of "java.lang.Object[]", loaded by "<system class loader>".
Keywords: org.apache.spark.deploy.worker.ExecutorRunner, java.lang.Object[], sun.misc.Launcher$AppClassLoader @ 0xdff94088

I also get the error below continuously if one of the workers/executors dies on any node in my Spark cluster. Even if I restart the worker, the error does not go away; I have to force-kill my streaming job and restart it to fix the issue. Is this a bug? I am using Spark 1.4.0. MY_IP in the logs below is the IP of the worker node that failed.

15/09/03 11:29:11 WARN BlockManagerMaster: Failed to remove RDD 194218 - Ask timed out on [Actor[akka.tcp://sparkExecutor@MY_IP:38223/user/BlockManagerEndpoint1#656884654]] after [120000 ms]
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://sparkExecutor@MY_IP:38223/user/BlockManagerEndpoint1#656884654]] after [120000 ms]
        at akka.pattern.PromiseActorRef$$anonfun$1.apply$mcV$sp(AskSupport.scala:333)
        at akka.actor.Scheduler$$anon$7.run(Scheduler.scala:117)
        at scala.concurrent.Future$InternalCallbackExecutor$.scala$concurrent$Future$InternalCallbackExecutor$$unbatchedExecute(Future.scala:694)
        at scala.concurrent.Future$InternalCallbackExecutor$.execute(Future.scala:691)
        at akka.actor.LightArrayRevolverScheduler$TaskHolder.executeTask(Scheduler.scala:467)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.executeBucket$1(Scheduler.scala:419)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.nextTick(Scheduler.scala:423)
        at akka.actor.LightArrayRevolverScheduler$$anon$8.run(Scheduler.scala:375)
        at java.lang.Thread.run(Thread.java:745)
15/09/03 11:29:11 WARN BlockManagerMaster: Failed to remove RDD 194217 - Ask timed out on [Actor[akka.tcp://sparkExecutor@MY_IP:38223/user/BlockManagerEndpoint1#656884654]] after [120000 ms]
akka.pattern.AskTimeoutException: Ask timed out on [Actor[akka.tcp://sparkExecutor@MY_IP:38223/user/BlockManagerEndpoint1#656884654]] after [120000 ms]
        (same stack trace as above)
15/09/03 11:29:11 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 16723
15/09/03 11:29:11 WARN BlockManagerMaster: Failed to remove RDD 194216 - Ask timed out on [Actor[akka.tcp://sparkExecutor@MY_IP:38223/user/BlockManagerEndpoint1#656884654]] after [120000 ms]

It is easily reproducible if I manually stop a worker on one of my nodes:

15/09/03 23:52:18 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 329
15/09/03 23:52:18 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 333
15/09/03 23:52:18 ERROR SparkDeploySchedulerBackend: Asked to remove non-existent executor 334

The errors do not stop even if I start the worker again.
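For reference, this is roughly how I would like to stop the job instead of force-killing it. This is only a minimal sketch, assuming Spark Streaming 1.4; the socket source is just a placeholder, and the /tmp/stop_streaming_job marker file that signals the stop is a hypothetical name:

import java.nio.file.{Files, Paths}

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}

// Minimal sketch: stop the streaming job gracefully instead of force-killing it.
// The app name, 10-second batch interval, socket source, and marker file path
// are illustrative assumptions, not taken from my actual job.
object GracefulStopExample {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("GracefulStopExample")
    val ssc = new StreamingContext(conf, Seconds(10))

    // Placeholder source and output operation; the real job reads from Kafka.
    val lines = ssc.socketTextStream("localhost", 9999)
    lines.count().print()

    ssc.start()

    // Poll for an external "stop" marker; when it appears, stop gracefully so
    // that batches already received are fully processed before shutdown.
    val stopMarker = Paths.get("/tmp/stop_streaming_job")
    var stopped = false
    while (!stopped) {
      stopped = ssc.awaitTerminationOrTimeout(10000)  // true once the context has terminated
      if (!stopped && Files.exists(stopMarker)) {
        ssc.stop(stopSparkContext = true, stopGracefully = true)
        stopped = true
      }
    }
  }
}

The idea is that ssc.stop(stopSparkContext = true, stopGracefully = true) lets batches that were already received finish processing before the context shuts down, instead of killing them mid-flight.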
Follow-up question: my streaming job has consumed some events from a Kafka topic, and they are still pending to be scheduled because of the processing delay. Will force-killing the streaming job lose the data that has not yet been scheduled? Please help ASAP.
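In case it matters for the answer: I could switch to the direct Kafka API with checkpointing. Below is only a minimal sketch, assuming the spark-streaming-kafka artifact is on the classpath; the broker list, topic name, and checkpoint directory are placeholders:

import kafka.serializer.StringDecoder

import org.apache.spark.SparkConf
import org.apache.spark.streaming.{Seconds, StreamingContext}
import org.apache.spark.streaming.kafka.KafkaUtils

// Minimal sketch: direct Kafka stream plus checkpointing, so that a restarted
// driver resumes from the offsets recorded in the checkpoint instead of losing
// unprocessed events. Broker list, topic, and checkpoint path are placeholders.
object KafkaDirectRestartExample {
  val checkpointDir = "hdfs:///checkpoints/my-streaming-app"

  def createContext(): StreamingContext = {
    val conf = new SparkConf().setAppName("KafkaDirectRestartExample")
    val ssc = new StreamingContext(conf, Seconds(10))
    ssc.checkpoint(checkpointDir)

    val kafkaParams = Map("metadata.broker.list" -> "broker1:9092,broker2:9092")
    val topics = Set("my_topic")

    // With createDirectStream there is no receiver: offsets are tracked by Spark
    // and saved with the checkpoint, so unprocessed events stay in Kafka and are
    // re-read after a restart.
    val stream = KafkaUtils.createDirectStream[String, String, StringDecoder, StringDecoder](
      ssc, kafkaParams, topics)

    stream.map(_._2).count().print()  // placeholder processing
    ssc
  }

  def main(args: Array[String]): Unit = {
    // Recover from the checkpoint if one exists, otherwise create a fresh context.
    val ssc = StreamingContext.getOrCreate(checkpointDir, createContext _)
    ssc.start()
    ssc.awaitTermination()
  }
}

My understanding (please correct me if wrong) is that with createDirectStream the data stays in Kafka and the checkpointed offsets let a restarted driver re-read whatever was not yet processed, whereas with the receiver-based API I would additionally need spark.streaming.receiver.writeAheadLog.enable=true to avoid losing received-but-unprocessed data on a non-graceful stop.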