I don't think it's a red herring... (By the way, spark.driver.host needs to be set to the IP or FQDN of the machine where you're running the driver program.)
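For example, something like this on the conf you pass to your SparkContext (a sketch only -- the host and port here are just the ones from my log below; substitute an address your workers can actually reach):

    import org.apache.spark.SparkConf

    // Illustrative values only: spark.driver.host must be an IP or FQDN that
    // the worker machines can connect back to; the port can optionally be
    // pinned as well (useful if you need to open it in a firewall).
    val conf = new SparkConf()
      .set("spark.driver.host", "a100-2-200-238") // driver machine's address
      .set("spark.driver.port", "61505")          // optional: fix the port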
I am running 0.9.2 on CDH4 and the beginning of my executor log looks like below (I've obfuscated the IP -- this is the log from executor a100-2-200-245). My driver is running on a100-2-200-238. I am not specifically setting spark.driver.host or the port, but depending on how your machine is set up, you might need to:

SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/10/03 18:14:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/10/03 18:14:48 INFO Remoting: Starting remoting
14/10/03 18:14:48 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkExecutor@a100-2-200-245:56760]
14/10/03 18:14:48 INFO Remoting: Remoting now listens on addresses: [akka.tcp://sparkExecutor@a100-2-200-245:56760]
**14/10/03 18:14:48 INFO executor.CoarseGrainedExecutorBackend: Connecting to driver: akka.tcp://spark@a100-2-200-238:61505/user/CoarseGrainedScheduler**
14/10/03 18:14:48 INFO worker.WorkerWatcher: Connecting to worker akka.tcp://sparkWorker@a100-2-200-245:48067/user/Worker
14/10/03 18:14:48 INFO worker.WorkerWatcher: Successfully connected to akka.tcp://sparkWorker@a100-2-200-245:48067/user/Worker
**14/10/03 18:14:49 INFO executor.CoarseGrainedExecutorBackend: Successfully registered with driver**
14/10/03 18:14:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/10/03 18:14:49 INFO Remoting: Starting remoting

If you look at the lines marked with **, this is where the executor successfully connects to the driver, and at this point you should see your app show up in the UI under "Running applications"... The worker log you're posting -- is that the log stored under work/app-<id>/<executor-id>/stderr? The first line you show in that log is "INFO worker.Worker: Executor app-20141002131901-0002/9 finished with state FAILED", but I imagine something prior to that should say why the executor failed?

On Fri, Oct 3, 2014 at 2:56 PM, Irina Fedulova <fedul...@gmail.com> wrote:
> Yana, many thanks for looking into this!
>
> I am not running spark-shell in local mode -- I really am starting
> spark-shell with --master spark://master:7077 and running against the
> cluster.
>
> Second, I tried setting "spark.driver.host" to "master", both in the
> Scala app when creating the context and in the conf/spark-defaults.conf
> file, but this did not make any difference. The worker logs still show
> the same messages:
>
> 14/10/03 13:37:30 ERROR remote.EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@host2:51414] ->
> [akka.tcp://sparkExecutor@host2:53851]:
> Error [Association failed with [akka.tcp://sparkExecutor@host2:53851]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkExecutor@host2:53851]
> Caused by:
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: host2/xxx.xx.xx.xx:53851
> ]
>
> Note that host1, host2, etc. are slave hostnames, and each slave has an
> error message about itself: host1:<some random port> cannot connect to
> host1:<some random port>.
>
> However, I noticed that after running SparkPi successfully, its app log
> is also populated with similar "connection refused" messages, yet this
> does not lead to application death... So these worker logs are probably
> a false clue.
>
> On 03.10.14 19:37, Yana Kadiyska wrote:
>> When you're running spark-shell and the example, are you actually
>> specifying --master spark://master:7077 as shown here:
>> http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark
>> Because if you're not, your spark-shell is running in local mode and not
>> actually connecting to the cluster. Also, if you run spark-shell against
>> the cluster, you'll see it listed under the Running applications in the
>> master UI. It would be pretty odd for spark-shell to connect
>> successfully to the cluster but for your app not to... (which is why I
>> suspect you're running spark-shell in local mode).
>>
>> Another thing to check: the executors need to connect back to your
>> driver, so it could be that you have to set the driver host or driver
>> port. In fact, looking at your executor log, this seems fairly likely:
>> is host1/xxx.xx.xx.xx:45542 the machine where your driver is running?
>> Is that host/port reachable from the worker machines?
>>
>> On Fri, Oct 3, 2014 at 5:32 AM, Irina Fedulova <fedul...@gmail.com> wrote:
>>
>> Hi,
>>
>> I have set up a Spark 0.9.2 standalone cluster using CDH5 and the
>> pre-built Spark distribution archive for Hadoop 2. I was not using the
>> spark-ec2 scripts because I am not on the EC2 cloud.
>>
>> Spark-shell seems to be working properly -- I am able to perform
>> simple RDD operations, and e.g. the SparkPi standalone example works
>> well when run via `run-example`. The web UI shows all workers
>> connected.
>>
>> However, my standalone Scala application gets "connection refused"
>> messages. I think this has something to do with configuration,
>> because spark-shell and SparkPi work well. I verified that
>> .setMaster and .setSparkHome are properly assigned within the Scala app.
>>
>> Is there anything else in the configuration of a standalone Scala app
>> on Spark that I am missing? I would very much appreciate any clues.
>>
>> Namely, I am trying to run the MovieLensALS.scala example from the
>> AMPCamp big data mini course
>> (http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html).
>>
>> Here is the error I get when I try to run the compiled jar:
>> ---------------
>> root@master:~/machine-learning/scala# sbt/sbt package "run /movielens/medium"
>> Launching sbt from sbt/sbt-launch-0.12.4.jar
>> [info] Loading project definition from /root/training/machine-learning/scala/project
>> [info] Set current project to movielens-als (in build file:/root/training/machine-learning/scala/)
>> [info] Compiling 1 Scala source to /root/training/machine-learning/scala/target/scala-2.10/classes...
>> [warn] there were 2 deprecation warning(s); re-run with -deprecation for details
>> [warn] one warning found
>> [info] Packaging /root/training/machine-learning/scala/target/scala-2.10/movielens-als_2.10-0.0.jar ...
>> [info] Done packaging.
>> [success] Total time: 6 s, completed Oct 2, 2014 1:19:00 PM
>> [info] Running MovieLensALS /movielens/medium
>> master = spark://master:7077
>> log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
>> log4j:WARN Please initialize the log4j system properly.
>> log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
>> 14/10/02 13:19:01 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
>> HERE
>> THERE
>> 14/10/02 13:19:02 INFO FileInputFormat: Total input paths to process : 1
>> 14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 0 on host2: remote Akka client disassociated
>> 14/10/02 13:19:03 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
>> 14/10/02 13:19:03 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
>> 14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 4 on host5: remote Akka client disassociated
>> 14/10/02 13:19:03 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
>> 14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 1 on host4: remote Akka client disassociated
>> 14/10/02 13:19:03 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
>> 14/10/02 13:19:03 WARN TaskSetManager: Lost TID 4 (task 0.0:1)
>> 14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 3 on host3: remote Akka client disassociated
>> 14/10/02 13:19:03 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
>> 14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 2 on host1: remote Akka client disassociated
>> 14/10/02 13:19:03 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
>> 14/10/02 13:19:03 WARN TaskSetManager: Lost TID 7 (task 0.0:0)
>> 14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 6 on host4: remote Akka client disassociated
>> 14/10/02 13:19:04 WARN TaskSetManager: Lost TID 8 (task 0.0:0)
>> 14/10/02 13:19:04 WARN TaskSetManager: Lost TID 9 (task 0.0:1)
>> 14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 5 on host2: remote Akka client disassociated
>> 14/10/02 13:19:04 WARN TaskSetManager: Lost TID 10 (task 0.0:1)
>> 14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 7 on host5: remote Akka client disassociated
>> 14/10/02 13:19:04 WARN TaskSetManager: Lost TID 11 (task 0.0:0)
>> 14/10/02 13:19:04 WARN TaskSetManager: Lost TID 12 (task 0.0:1)
>> 14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 8 on host3: remote Akka client disassociated
>> 14/10/02 13:19:04 WARN TaskSetManager: Lost TID 13 (task 0.0:1)
>> 14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 9 on host1: remote Akka client disassociated
>> 14/10/02 13:19:04 WARN TaskSetManager: Lost TID 14 (task 0.0:0)
>> 14/10/02 13:19:04 WARN TaskSetManager: Lost TID 15 (task 0.0:1)
>> 14/10/02 13:19:05 ERROR AppClient$ClientActor: Master removed our application: FAILED; stopping client
>> 14/10/02 13:19:05 WARN SparkDeploySchedulerBackend: Disconnected from Spark cluster! Waiting for reconnection...
>> 14/10/02 13:19:06 ERROR TaskSchedulerImpl: Lost executor 11 on host5: remote Akka client disassociated
>> 14/10/02 13:19:06 WARN TaskSetManager: Lost TID 17 (task 0.0:0)
>> 14/10/02 13:19:06 WARN TaskSetManager: Lost TID 16 (task 0.0:1)
>> ---------------
>>
>> And this is the error log on one of the workers:
>> ---------------
>> 14/10/02 13:19:05 INFO worker.Worker: Executor app-20141002131901-0002/9 finished with state FAILED message Command exited with code 1 exitStatus 1
>> 14/10/02 13:19:05 INFO actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkWorker/deadLetters] to Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40xxx.xx.xx.xx%3A57719-15#1504298502] was not delivered. [6] dead letters encountered.
>> This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>> 14/10/02 13:19:05 ERROR remote.EndpointWriter: AssociationError
>> [akka.tcp://sparkWorker@host1:47421] ->
>> [akka.tcp://sparkExecutor@host1:45542]: Error [Association failed
>> with [akka.tcp://sparkExecutor@host1:45542]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://sparkExecutor@host1:45542]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: host1/xxx.xx.xx.xx:45542
>> ]
>> 14/10/02 13:19:05 ERROR remote.EndpointWriter: AssociationError
>> [akka.tcp://sparkWorker@host1:47421] ->
>> [akka.tcp://sparkExecutor@host1:45542]: Error [Association failed
>> with [akka.tcp://sparkExecutor@host1:45542]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://sparkExecutor@host1:45542]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: host1/xxx.xx.xx.xx:45542
>> ]
>> 14/10/02 13:19:05 ERROR remote.EndpointWriter: AssociationError
>> [akka.tcp://sparkWorker@host1:47421] ->
>> [akka.tcp://sparkExecutor@host1:45542]: Error [Association failed
>> with [akka.tcp://sparkExecutor@host1:45542]] [
>> akka.remote.EndpointAssociationException: Association failed with
>> [akka.tcp://sparkExecutor@host1:45542]
>> Caused by:
>> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>> Connection refused: host1/xxx.xx.xx.xx:45542
>> ---------------
>>
>> Thanks!
>> Irina
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: user-unsubscribe@spark.apache.org
>> For additional commands, e-mail: user-h...@spark.apache.org
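For completeness, here is roughly how the context is typically created in a standalone 0.9.x app like the one above (a sketch only -- the Spark home path, driver host, and master URL are placeholders, not taken from Irina's actual cluster; the jar path is the one from the sbt output above). Passing the application jar via setJars is what lets the executors fetch the app's classes:

    import org.apache.spark.{SparkConf, SparkContext}

    // Placeholder values -- adjust to your own cluster.
    val conf = new SparkConf()
      .setMaster("spark://master:7077")   // same URL the workers registered with
      .setAppName("MovieLensALS")
      .setSparkHome("/opt/spark")         // hypothetical install path
      .setJars(Seq("target/scala-2.10/movielens-als_2.10-0.0.jar")) // ship app jar
      .set("spark.driver.host", "master") // an address the workers can reach
    val sc = new SparkContext(conf)

And since every failure in these logs ultimately shows up as "Connection refused", a crude way to verify from a worker machine that it can reach the driver at all is to open a plain TCP socket to the driver's host and port (again, the host and port here are placeholders -- use whatever your driver log says it is listening on):

    import java.net.{InetSocketAddress, Socket}

    // Hypothetical address -- substitute the driver host/port from your logs.
    val socket = new Socket()
    try {
      socket.connect(new InetSocketAddress("master", 61505), 5000) // 5 s timeout
      println("driver port is reachable")
    } catch {
      case e: java.io.IOException => println(s"cannot connect: ${e.getMessage}")
    } finally {
      socket.close()
    }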