Here is a sample exception I collected from a Spark worker node (there are many such errors across the other worker nodes). It looks to me like the Spark worker failed to communicate with the executor locally.

14/12/04 04:26:37 ERROR EndpointWriter: AssociationError [akka.tcp://sparkwor...@spark-prod1.xxx:7079] -> [akka.tcp://sparkexecu...@spark-prod1.xxx:47710]: Error [Association failed with [akka.tcp://sparkexecu...@spark-prod1.xxx:47710]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://sparkexecu...@spark-prod1.xxx:47710]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: spark-prod1.XXX/10.51.XX.XX:47710

On Wednesday, December 3, 2014 5:05 PM, Ted Yu <yuzhih...@gmail.com> wrote:

bq. to get the logs from the data nodes

Minor correction: the logs are collected from machines where node managers run.

Cheers

On Wed, Dec 3, 2014 at 3:39 PM, Ganelin, Ilya <ilya.gane...@capitalone.com> wrote:

You want to look further up the stack (there are almost certainly other errors before this happens), and those other errors may give you a better idea of what is going on. Also, if you are running on YARN, you can run "yarn logs -applicationId <yourAppId>" to get the logs from the data nodes.

-----Original Message-----
From: S. Zhou [myx...@yahoo.com.INVALID]
Sent: Wednesday, December 03, 2014 06:30 PM Eastern Standard Time
To: user@spark.apache.org
Subject: Spark executor lost

We are using Spark job server to submit Spark jobs (our Spark version is 0.91). After running the Spark job server for a while, we often see the following errors (executor lost) in the Spark job server log. As a consequence, the Spark driver (allocated inside the Spark job server) gradually loses executors, and eventually the Spark job server is no longer able to submit jobs. We tried to google for solutions, but so far no luck. Please help if you have any ideas. Thanks!

[2014-11-25 01:37:36,250] INFO parkDeploySchedulerBackend [] [akka://JobServer/user/context-supervisor/next-staging] - Executor 6 disconnected, so removing it
[2014-11-25 01:37:36,252] ERROR cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/next-staging] - Lost executor 6 on XXXX: remote Akka client disassociated
[2014-11-25 01:37:36,252] INFO ark.scheduler.DAGScheduler [] [] - Executor lost: 6 (epoch 8)
[2014-11-25 01:37:36,252] INFO ge.BlockManagerMasterActor [] [] - Trying to remove executor 6 from BlockManagerMaster.
[2014-11-25 01:37:36,252] INFO storage.BlockManagerMaster [] [] - Removed 6 successfully in removeExecutor
[2014-11-25 01:37:36,286] INFO ient.AppClient$ClientActor [] [akka://JobServer/user/context-supervisor/next-staging] - Executor updated: app-20141125002023-0037/6 is now FAILED (Command exited with code 143)
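
One detail in the job server log may be worth checking: the executor is reported as FAILED with exit code 143. By the usual POSIX shell convention, a status above 128 means the process was terminated by a signal (status minus 128), so 143 would correspond to signal 15 (SIGTERM). That is an interpretation of the convention, not something the log itself states. Below is a minimal Python sketch for pulling these lines out of a job server log and decoding the code. The function names are hypothetical, and the pattern only assumes the "Executor updated" line format shown in the excerpt above.

```python
import re
import signal

# Matches the "Executor updated" line format seen in the log excerpt above.
# The exit-code part is optional, since other state changes may not carry one.
EXIT_RE = re.compile(
    r"Executor updated: (?P<app>\S+)/(?P<executor>\d+) is now (?P<state>[A-Z]+)"
    r"(?: \(Command exited with code (?P<code>\d+)\))?"
)


def describe_exit_code(code):
    """Interpret an exit status using the 128 + signal-number convention."""
    if code > 128:
        signum = code - 128
        try:
            return f"killed by {signal.Signals(signum).name}"
        except ValueError:
            # Signal number not known on this platform.
            return f"killed by signal {signum}"
    return "exited normally" if code == 0 else f"exited with error status {code}"


def find_executor_exits(log_text):
    """Return (app_id, executor_id, state, exit_code_or_None) for each match."""
    results = []
    for m in EXIT_RE.finditer(log_text):
        code = int(m.group("code")) if m.group("code") else None
        results.append((m.group("app"), int(m.group("executor")), m.group("state"), code))
    return results


if __name__ == "__main__":
    sample = (
        "[2014-11-25 01:37:36,286] INFO ient.AppClient$ClientActor [] "
        "[akka://JobServer/user/context-supervisor/next-staging] - Executor updated: "
        "app-20141125002023-0037/6 is now FAILED (Command exited with code 143)"
    )
    for app, executor, state, code in find_executor_exits(sample):
        reason = describe_exit_code(code) if code is not None else "no exit code"
        print(f"{app} executor {executor}: {state} ({reason})")
```

If the executors really are being terminated by a signal, the worker and executor logs from just before each loss are the place to look for whatever sent it, which matches the advice above to look further up the stack.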