It says connection refused, so first make sure the network is configured properly (open the required ports between the master and the worker nodes). If the ports are configured correctly, then I assume the executor process is getting killed for some reason, and the connection is refused because nothing is listening on that port any more.
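As a quick way to check the first possibility, here is a minimal sketch of a TCP reachability test; the hostname and port below are placeholders, so substitute your actual master/worker addresses and the ports your cluster uses (7077 is only the standalone master default):

```python
import socket

def port_open(host: str, port: int, timeout: float = 3.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:  # covers ConnectionRefusedError, timeouts, DNS failures
        return False

# Placeholder host/port -- replace with your own master or worker address.
print(port_open("spark-master.example.com", 7077))
```

Run it from each worker against the master (and vice versa); a `False` for a port that should be open points at a firewall or configuration issue rather than a dying process.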
Thanks
Best Regards

On Fri, Dec 5, 2014 at 12:30 AM, S. Zhou <myx...@yahoo.com.invalid> wrote:

> Here is a sample exception I collected from a spark worker node (there
> are many such errors across other worker nodes). It looks to me that the
> spark worker failed to communicate with its executor locally.
>
> 14/12/04 04:26:37 ERROR EndpointWriter: AssociationError
> [akka.tcp://sparkwor...@spark-prod1.xxx:7079] ->
> [akka.tcp://sparkexecu...@spark-prod1.xxx:47710]: Error [Association
> failed with [akka.tcp://sparkexecu...@spark-prod1.xxx:47710]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkexecu...@spark-prod1.xxx:47710]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: spark-prod1.XXX/10.51.XX.XX:47710
>
>
> On Wednesday, December 3, 2014 5:05 PM, Ted Yu <yuzhih...@gmail.com> wrote:
>
> bq. to get the logs from the data nodes
>
> Minor correction: the logs are collected from the machines where node
> managers run.
>
> Cheers
>
> On Wed, Dec 3, 2014 at 3:39 PM, Ganelin, Ilya <ilya.gane...@capitalone.com> wrote:
>
> You want to look further up the stack (there are almost certainly other
> errors before this happens), and those other errors may give you a better
> idea of what is going on. Also, if you are running on YARN, you can run
> "yarn logs -applicationId <yourAppId>" to get the logs from the data nodes.
>
> Sent with Good (www.good.com)
>
> -----Original Message-----
> From: S. Zhou [myx...@yahoo.com.INVALID]
> Sent: Wednesday, December 03, 2014 06:30 PM Eastern Standard Time
> To: user@spark.apache.org
> Subject: Spark executor lost
>
> We are using Spark Job Server to submit Spark jobs (our Spark version is
> 0.9.1). After running the Spark Job Server for a while, we often see the
> following errors (executor lost) in the Spark Job Server log. As a
> consequence, the Spark driver (allocated inside the Spark Job Server)
> gradually loses executors.
> And finally the Spark Job Server is no longer able to submit jobs. We
> tried to google for solutions but so far no luck. Please help if you have
> any ideas. Thanks!
>
> [2014-11-25 01:37:36,250] INFO parkDeploySchedulerBackend [] [akka://JobServer/user/context-supervisor/next-staging] - Executor 6 disconnected, so removing it
> [2014-11-25 01:37:36,252] ERROR cheduler.TaskSchedulerImpl [] [akka://JobServer/user/context-supervisor/next-staging] - Lost executor 6 on XXXX: remote Akka client disassociated
> [2014-11-25 01:37:36,252] INFO ark.scheduler.DAGScheduler [] [] - *Executor lost*: 6 (epoch 8)
> [2014-11-25 01:37:36,252] INFO ge.BlockManagerMasterActor [] [] - Trying to remove executor 6 from BlockManagerMaster.
> [2014-11-25 01:37:36,252] INFO storage.BlockManagerMaster [] [] - Removed 6 successfully in removeExecutor
> [2014-11-25 01:37:36,286] INFO ient.AppClient$ClientActor [] [akka://JobServer/user/context-supervisor/next-staging] - Executor updated: app-20141125002023-0037/6 is now FAILED (Command exited with code 143)
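One note on the last log line: an exit code above 128 from a Unix process encodes "killed by signal (code - 128)", so 143 means the executor JVM received SIGTERM, i.e. something external (the cluster manager, the OS OOM killer, etc.) terminated it rather than it crashing on its own. A small sketch of the arithmetic:

```python
import signal

# Shell-style exit codes above 128 encode "terminated by signal (code - 128)".
exit_code = 143
sig_num = exit_code - 128
print(sig_num)                       # 15
print(signal.Signals(sig_num).name)  # SIGTERM
```

That points the investigation toward whoever sent the signal, which is why checking further up the stack (and the node manager logs, as suggested above) usually finds the real cause.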