Here is a sample exception I collected from a Spark worker node (there are 
many such errors across the other worker nodes). It looks to me like the Spark 
worker failed to communicate with the executor locally.
14/12/04 04:26:37 ERROR EndpointWriter: AssociationError 
[akka.tcp://sparkwor...@spark-prod1.xxx:7079] -> 
[akka.tcp://sparkexecu...@spark-prod1.xxx:47710]: Error [Association failed 
with [akka.tcp://sparkexecu...@spark-prod1.xxx:47710]] [
akka.remote.EndpointAssociationException: Association failed with 
[akka.tcp://sparkexecu...@spark-prod1.xxx:47710]
Caused by: 
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: 
Connection refused: spark-prod1.XXX/10.51.XX.XX:47710
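
A "Connection refused" on the executor port usually means nothing is listening there at that moment. One way to confirm this from the worker host is a plain TCP probe. The following is a minimal sketch, not part of Spark; the function name `can_connect` and the placeholder host and port in the usage line are assumptions for illustration only.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        # create_connection resolves the host and attempts a TCP handshake.
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name resolution failures.
        return False


if __name__ == "__main__":
    # Hypothetical values: substitute the host and port from your own log line.
    print(can_connect("localhost", 47710))
```

If the probe fails while the worker still lists the executor, the executor process has most likely already exited, so the executor's own stderr/stdout logs are the next place to look.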

 

     On Wednesday, December 3, 2014 5:05 PM, Ted Yu <yuzhih...@gmail.com> wrote:
   

 bq.  to get the logs from the data nodes
Minor correction: the logs are collected from machines where node managers run.
Cheers
On Wed, Dec 3, 2014 at 3:39 PM, Ganelin, Ilya <ilya.gane...@capitalone.com> 
wrote:

You want to look further up the stack (there are almost certainly other errors 
before this happens), and those other errors may give you a better idea of what 
is going on. Also, if you are running on YARN, you can run "yarn logs 
-applicationId <yourAppId>" to get the logs from the data nodes.





-----Original Message-----
From: S. Zhou [myx...@yahoo.com.INVALID]
Sent: Wednesday, December 03, 2014 06:30 PM Eastern Standard Time
To: user@spark.apache.org
Subject: Spark executor lost

We are using Spark job server to submit Spark jobs (our Spark version is 0.91). 
After running the Spark job server for a while, we often see the following 
errors (executor lost) in the Spark job server log. As a consequence, the Spark 
driver (allocated inside the Spark job server) gradually loses executors, and 
eventually the Spark job server is no longer able to submit jobs. We tried 
searching for solutions on Google, but so far no luck. Please help if you have 
any ideas. Thanks!
[2014-11-25 01:37:36,250] INFO  parkDeploySchedulerBackend [] [akka://JobServer/user/context-supervisor/next-staging] - Executor 6 disconnected, so removing it
[2014-11-25 01:37:36,252] ERROR cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/next-staging] - Lost executor 6 on XXXX: remote Akka client disassociated
[2014-11-25 01:37:36,252] INFO  ark.scheduler.DAGScheduler [] [] - Executor lost: 6 (epoch 8)
[2014-11-25 01:37:36,252] INFO  ge.BlockManagerMasterActor [] [] - Trying to remove executor 6 from BlockManagerMaster.
[2014-11-25 01:37:36,252] INFO  storage.BlockManagerMaster [] [] - Removed 6 successfully in removeExecutor
[2014-11-25 01:37:36,286] INFO  ient.AppClient$ClientActor [] [akka://JobServer/user/context-supervisor/next-staging] - Executor updated: app-20141125002023-0037/6 is now FAILED (Command exited with code 143)
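
The "Command exited with code 143" in the last log line is worth decoding. By the usual POSIX shell convention, an exit status above 128 means the process was terminated by signal (status - 128), so 143 corresponds to signal 15 (SIGTERM). This says the executor was asked to terminate, but not who asked. The sketch below pulls the exit code out of such a line and maps it to a signal name; the regular expression and the helper name `describe_exit` are illustrative assumptions based only on the line format shown in this log.

```python
import re
import signal

# Pattern derived from the "Executor updated" line in the log above (an assumption
# about the format; adjust it if your log lines differ).
EXIT_RE = re.compile(
    r"Executor updated: (\S+) is now (\w+) \(Command exited with code (\d+)\)"
)


def describe_exit(line):
    """Return (executor_id, state, exit_code, signal_name) or None if no match."""
    match = EXIT_RE.search(line)
    if match is None:
        return None
    executor_id, state, code = match.group(1), match.group(2), int(match.group(3))
    signal_name = None
    if code > 128:
        # Exit status 128 + N conventionally means "killed by signal N".
        try:
            signal_name = signal.Signals(code - 128).name
        except ValueError:
            signal_name = None  # not a known signal number on this platform
    return executor_id, state, code, signal_name
```

Running this over the job server log gives a quick count of how many executors ended on a signal versus a normal non-zero exit, which helps narrow down whether something external is stopping them.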





   
