Rob,

I have seen this too. I have 16 nodes in my Spark cluster, and for some reason (after app failures) one of the workers will go offline. When I ssh to the machine in question, I find that the Java process is still running, but for some reason the master does not notice it. I have not had time to investigate (my setup is manual: 0.9 in standalone mode).

Ognen

On 3/5/14, 12:27 PM, Rob Povey wrote:
I installed Spark 0.9.0 from the CDH parcel yesterday, in standalone mode, on
top of a six-node cluster running CDH 4.6 on CentOS.

What I'm seeing is that when jobs fail, the worker process will often crash.
The worker restarts on the node, but the Master never uses the restarted
worker again, and it doesn't show up in the web interface.

Has anyone seen anything like this? Is there an obvious workaround/fix other
than manually restarting the workers?

In the Master log I see the following repeated many times, filer being the
"lost" node. What it looks like to me is that when the worker actor is
restarted by Akka, it gets a new ID and for whatever reason does not
register with the master.

Any ideas?

14/03/04 20:04:44 WARN master.Master: Got heartbeat from unregistered worker worker-20140304183709-filer.maana.io-7078
14/03/04 20:04:54 WARN master.Master: Got heartbeat from unregistered worker worker-20140304183709-filer.maana.io-7078
14/03/04 20:04:59 WARN master.Master: Got heartbeat from unregistered worker worker-20140304183709-filer.maana.io-7078
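
For what it's worth, that warning seems to come from the Master's heartbeat
handler, which (if I'm reading the 0.9 source correctly) only accepts
heartbeats from worker IDs it already has registered. Here is a minimal
sketch of that logic — my paraphrase, not the exact source:

    import scala.collection.mutable.HashMap

    // Sketch of the Master's Heartbeat handling (my reading of Spark 0.9;
    // not the actual code). Heartbeats are only honored for worker IDs
    // already present in the registration map.
    object HeartbeatSketch {
      class WorkerInfo(var lastHeartbeat: Long)
      val idToWorker = new HashMap[String, WorkerInfo]()

      def heartbeat(workerId: String) {
        idToWorker.get(workerId) match {
          case Some(info) =>
            info.lastHeartbeat = System.currentTimeMillis()  // known worker: refresh
          case None =>
            // Unknown ID: the heartbeat is dropped -- exactly the WARN above.
            println("WARN Got heartbeat from unregistered worker " + workerId)
        }
      }

      def main(args: Array[String]) {
        heartbeat("worker-20140304183709-filer.maana.io-7078")  // never registered -> WARN
      }
    }

So if the restarted worker actor is heartbeating under an ID the Master never
saw register, the Master will log this forever and never use the node.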


On filer itself I can see that the worker shut down with the following
exception, and that it has since been restarted and is running.

14/03/04 18:37:09 INFO worker.Worker: Executor app-20140304183705-0036/0
finished with state KILLED
14/03/04 18:37:09 INFO worker.CommandUtils: Redirection to
/var/run/spark/work/app-20140304183705-0036/0/stderr closed: Bad file
descriptor
14/03/04 18:37:09 ERROR actor.OneForOneStrategy: key not found: app-20140304183705-0036/0
java.util.NoSuchElementException: key not found: app-20140304183705-0036/0
         at scala.collection.MapLike$class.default(MapLike.scala:228)
         at scala.collection.AbstractMap.default(Map.scala:58)
         at scala.collection.mutable.HashMap.apply(HashMap.scala:64)
         at org.apache.spark.deploy.worker.Worker$$anonfun$receive$1.applyOrElse(Worker.scala:232)
         at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
         at akka.actor.ActorCell.invoke(ActorCell.scala:456)
         at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
         at akka.dispatch.Mailbox.run(Mailbox.scala:219)
         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
14/03/04 18:37:09 ERROR remote.EndpointWriter: AssociationError
[akka.tcp://sparkwor...@filer.maana.io:7078] ->
[akka.tcp://sparkexecu...@filer.maana.io:58331]: Error [Association failed
with [akka.tcp://sparkexecu...@filer.maana.io:58331]] [
akka.remote.EndpointAssociationException: Association failed with
[akka.tcp://sparkexecu...@filer.maana.io:58331]
Caused by:
akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
Connection refused: filer.maana.io/192.168.1.33:58331
]
14/03/04 18:37:09 INFO handler.ContextHandler: stopped
o.e.j.s.h.ContextHandler{*,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped
o.e.j.s.h.ContextHandler{/json,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped
o.e.j.s.h.ContextHandler{/logPage,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped
o.e.j.s.h.ContextHandler{/log,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped
o.e.j.s.h.ContextHandler{/static,null}
14/03/04 18:37:09 INFO handler.ContextHandler: stopped
o.e.j.s.h.ContextHandler{/metrics/json,null}
14/03/04 18:37:09 INFO worker.Worker: Starting Spark worker
filer.maana.io:7078 with 4 cores, 30.3 GB RAM
14/03/04 18:37:09 INFO worker.Worker: Spark home:
/opt/cloudera/parcels/SPARK/lib/spark
14/03/04 18:37:09 INFO server.Server: jetty-7.6.8.v20121106
14/03/04 18:37:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/metrics/json,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/static,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/log,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/static,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/log,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/logPage,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{/json,null}
14/03/04 18:37:09 INFO handler.ContextHandler: started
o.e.j.s.h.ContextHandler{*,null}
14/03/04 18:37:09 INFO server.AbstractConnector: Started
SelectChannelConnector@0.0.0.0:18081
14/03/04 18:37:09 INFO ui.WorkerWebUI: Started Worker web UI at
http://filer.maana.io:18081
14/03/04 18:37:09 INFO worker.Worker: Connecting to master
spark://Master.maana.io:7077...
14/03/04 18:37:09 INFO worker.Worker: Successfully registered with master
spark://Master.maana.io:7077
14/03/05 08:53:35 INFO actor.LocalActorRef: Message
[akka.remote.transport.AssociationHandle$Disassociated] from
Actor[akka://sparkWorker/deadLetters] to
Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40192.168.1.33%3A37859-60#-1831633323]
was not delivered. [24] dead letters encountered. This logging can be turned
off or adjusted with configuration settings 'akka.log-dead-letters' and
'akka.log-dead-letters-during-shutdown'.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child
process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child
process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child
process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child
process.
14/03/05 09:06:29 INFO worker.ExecutorRunner: Shutdown hook killing child
process.
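
Looking at the stack trace earlier, the crash at Worker.scala:232 appears to
be an unguarded mutable-HashMap apply: the handler looks up the finished
executor by its "app/exec" ID, and if that entry was already removed,
HashMap.apply throws the NoSuchElementException that kills the actor (hence
the OneForOneStrategy restart). A minimal sketch of the failure mode, and the
guarded lookup I'd expect a fix to use — my assumption, not the actual patch:

    import scala.collection.mutable.HashMap

    object ExecutorLookupSketch {
      // fullId -> executor; String is a stand-in for the real runner type
      val executors = new HashMap[String, String]()

      def main(args: Array[String]) {
        val fullId = "app-20140304183705-0036/0"

        // Unguarded apply, as in the trace above: throws
        // java.util.NoSuchElementException: key not found: app-20140304183705-0036/0
        // val executor = executors(fullId)

        // Guarded lookup that would let the Worker survive the race:
        executors.get(fullId) match {
          case Some(executor) => println("cleaning up " + executor)
          case None => println("ignoring state change for unknown executor " + fullId)
        }
      }
    }

Even so, the log shows the restarted worker registering successfully at
18:37:09, which makes it all the stranger that the Master keeps rejecting its
heartbeats.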

--
Some people, when confronted with a problem, think "I know, I'll use regular 
expressions." Now they have two problems.
-- Jamie Zawinski
