We are using Spark Job Server to submit Spark jobs (our Spark version is 0.9.1).
After the Spark Job Server has been running for a while, we often see the
following "executor lost" errors in its log. As a consequence, the Spark
driver (allocated inside the Spark Job Server) gradually loses executors, and
eventually the Spark Job Server is no longer able to submit jobs. We have
searched for a solution but have had no luck so far. Please help if you have
any ideas. Thanks!

[2014-11-25 01:37:36,250] INFO parkDeploySchedulerBackend [] [akka://JobServer/user/context-supervisor/next-staging] - Executor 6 disconnected, so removing it
[2014-11-25 01:37:36,252] ERROR cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/next-staging] - Lost executor 6 on XXXX: remote Akka client disassociated
[2014-11-25 01:37:36,252] INFO ark.scheduler.DAGScheduler [] [] - Executor lost: 6 (epoch 8)
[2014-11-25 01:37:36,252] INFO ge.BlockManagerMasterActor [] [] - Trying to remove executor 6 from BlockManagerMaster.
[2014-11-25 01:37:36,252] INFO storage.BlockManagerMaster [] [] - Removed 6 successfully in removeExecutor
[2014-11-25 01:37:36,286] INFO ient.AppClient$ClientActor [] [akka://JobServer/user/context-supervisor/next-staging] - Executor updated: app-20141125002023-0037/6 is now FAILED (Command exited with code 143)