I don't think it's a red herring... (btw, spark.driver.host needs to be set
to the IP or FQDN of the machine where you're running the driver program).

I am running 0.9.2 on CDH4, and the beginning of my executor log looks like
the excerpt below (I've obfuscated the IPs -- this is the log from executor
a100-2-200-245). My driver is running on a100-2-200-238. I am not
specifically setting spark.driver.host or the port, but depending on how
your machine is set up you might need to.

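If you do need to set them, it would look roughly like this when creating
the context (a minimal sketch against the 0.9.x API -- the app name and the
host/port values here are placeholders from my setup, not something to copy
verbatim):

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")
      .setAppName("MyApp") // placeholder app name
      // address of the driver machine, as reachable from the workers
      .set("spark.driver.host", "a100-2-200-238")
      // optional: pin the driver port instead of letting Spark choose a random one
      .set("spark.driver.port", "61505")
    val sc = new SparkContext(conf)

Anyway, here's the executor log:
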
SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
14/10/03 18:14:48 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/10/03 18:14:48 INFO Remoting: Starting remoting
14/10/03 18:14:48 INFO Remoting: Remoting started; listening on
addresses :[akka.tcp://sparkExecutor@a100-2-200-245:56760]
14/10/03 18:14:48 INFO Remoting: Remoting now listens on addresses:
[akka.tcp://sparkExecutor@a100-2-200-245:56760]
**14/10/03 18:14:48 INFO executor.CoarseGrainedExecutorBackend:
Connecting to driver:
akka.tcp://spark@a100-2-200-238:61505/user/CoarseGrainedScheduler**
14/10/03 18:14:48 INFO worker.WorkerWatcher: Connecting to worker
akka.tcp://sparkWorker@a100-2-200-245:48067/user/Worker
14/10/03 18:14:48 INFO worker.WorkerWatcher: Successfully connected to
akka.tcp://sparkWorker@a100-2-200-245:48067/user/Worker
**14/10/03 18:14:49 INFO executor.CoarseGrainedExecutorBackend:
Successfully registered with driver**
14/10/03 18:14:49 INFO slf4j.Slf4jLogger: Slf4jLogger started
14/10/03 18:14:49 INFO Remoting: Starting remoting

If you look at the lines marked with **, that is where the executor
successfully connects to and registers with the driver, and at this point you
should see your app show up in the UI under "Running Applications"... The
worker log you're posting -- is that the log stored under
work/app-<id>/<executor-id>/stderr? The first line you show in that log is

 INFO worker.Worker: Executor
    app-20141002131901-0002/9 finished with state FAILED

but I imagine something prior to that would say why the executor failed?
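
One quick sanity check while you're at it (a hypothetical snippet --
substitute the driver host/port that your own executor log prints after
"Connecting to driver"): from a worker node, verify that the driver's
address is reachable at all:

    import java.net.{InetSocketAddress, Socket}

    // placeholder host/port -- use the values from your executor log
    val s = new Socket()
    try {
      s.connect(new InetSocketAddress("a100-2-200-238", 61505), 5000)
      println("driver host/port reachable")
    } finally {
      s.close()
    }

If that times out or is refused, you're looking at a firewall or binding
problem rather than a Spark one.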

On Fri, Oct 3, 2014 at 2:56 PM, Irina Fedulova <fedul...@gmail.com> wrote:

> Yana, many thanks for looking into this!
>
> I am not running spark-shell in local mode -- I am really starting
> spark-shell with --master spark://master:7077 and running in cluster mode.
>
> The second thing is that I tried setting "spark.driver.host" to "master"
> both in the Scala app when creating the context and in the
> conf/spark-defaults.conf file, but this did not make any difference. The
> worker logs still show the same messages:
> 14/10/03 13:37:30 ERROR remote.EndpointWriter: AssociationError
> [akka.tcp://sparkWorker@host2:51414] -> 
> [akka.tcp://sparkExecutor@host2:53851]:
> Error [Association failed with [akka.tcp://sparkExecutor@host2:53851]] [
> akka.remote.EndpointAssociationException: Association failed with
> [akka.tcp://sparkExecutor@host2:53851]
> Caused by: 
> akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
> Connection refused: host2/xxx.xx.xx.xx:53851
> ]
>
> Note that host1, host2, etc. are slave hostnames, and each slave has an
> error message about itself: host1:<some random port> cannot connect to
> host1:<some random port>.
>
> However, I noticed that after a successful run the SparkPi app log is also
> populated with similar "connection refused" messages, but this does not
> lead to application death... So these worker logs are probably a false clue.
>
>
>
> On 03.10.14 19:37, Yana Kadiyska wrote:
>
>> When you're running spark-shell and the example, are you actually
>> specifying --master spark://master:7077 as shown here:
>> http://spark.apache.org/docs/latest/programming-guide.html#initializing-spark
>>
>> Because if you're not, your spark-shell is running in local mode and not
>> actually connecting to the cluster. Also, if you run spark-shell against
>> the cluster, you'll see it listed under "Running Applications" in the
>> master UI. It would be pretty odd for spark-shell to connect
>> successfully to the cluster but for your app to not connect... (which is
>> why I suspect that you're running spark-shell locally.)
>>
>> Another thing to check: the executors need to connect back to your
>> driver, so it could be that you have to set the driver host or driver
>> port... in fact, looking at your executor log, this seems fairly likely.
>> Is host1/xxx.xx.xx.xx:45542 the machine where your driver is running? Is
>> that host/port reachable from the worker machines?
>>
>> On Fri, Oct 3, 2014 at 5:32 AM, Irina Fedulova <fedul...@gmail.com> wrote:
>>
>>     Hi,
>>
>>     I have set up a Spark 0.9.2 standalone cluster using CDH5 and the
>>     pre-built Spark distribution archive for Hadoop 2. I was not using
>>     the spark-ec2 scripts because I am not on the EC2 cloud.
>>
>>     Spark-shell seems to be working properly -- I am able to perform
>>     simple RDD operations, and, e.g., the SparkPi standalone example
>>     works well when run via `run-example`. The Web UI shows all workers
>>     connected.
>>
>>     However, my standalone Scala application gets "connection refused"
>>     messages. I think this has something to do with configuration,
>>     because spark-shell and SparkPi work well. I verified that
>>     .setMaster and .setSparkHome are properly assigned within the Scala
>>     app.
>>
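>>     For reference, the context is created in the app roughly like this
>>     (a minimal sketch of the relevant part only -- the Spark home path
>>     and jar path are illustrative, not a verbatim copy of my code):
>>
>>         import org.apache.spark.{SparkConf, SparkContext}
>>
>>         val conf = new SparkConf()
>>           .setMaster("spark://master:7077")
>>           .setAppName("MovieLensALS")
>>           .setSparkHome("/root/spark") // illustrative path
>>           .setJars(Seq("target/scala-2.10/movielens-als_2.10-0.0.jar"))
>>         val sc = new SparkContext(conf)
>>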
>>     Is there anything else in the configuration of a standalone Scala
>>     app on Spark that I am missing?
>>     I would very much appreciate any clues.
>>
>>     Namely, I am trying to run the MovieLensALS.scala example from the
>>     AMPCamp big data mini course
>>     (http://ampcamp.berkeley.edu/big-data-mini-course/movie-recommendation-with-mllib.html).
>>
>>     Here is error which I get when try to run compiled jar:
>>     ---------------
>>     root@master:~/machine-learning/scala# sbt/sbt package "run /movielens/medium"
>>     Launching sbt from sbt/sbt-launch-0.12.4.jar
>>     [info] Loading project definition from /root/training/machine-learning/scala/project
>>     [info] Set current project to movielens-als (in build file:/root/training/machine-learning/scala/)
>>     [info] Compiling 1 Scala source to /root/training/machine-learning/scala/target/scala-2.10/classes...
>>     [warn] there were 2 deprecation warning(s); re-run with -deprecation for details
>>     [warn] one warning found
>>     [info] Packaging /root/training/machine-learning/scala/target/scala-2.10/movielens-als_2.10-0.0.jar
>>     ...
>>     [info] Done packaging.
>>     [success] Total time: 6 s, completed Oct 2, 2014 1:19:00 PM
>>     [info] Running MovieLensALS /movielens/medium
>>     master = spark://master:7077
>>     log4j:WARN No appenders could be found for logger (akka.event.slf4j.Slf4jLogger).
>>     log4j:WARN Please initialize the log4j system properly.
>>     log4j:WARN See http://logging.apache.org/log4j/1.2/faq.html#noconfig for more info.
>>     14/10/02 13:19:01 WARN NativeCodeLoader: Unable to load
>>     native-hadoop library for your platform... using builtin-java
>>     classes where applicable
>>     HERE
>>     THERE
>>     14/10/02 13:19:02 INFO FileInputFormat: Total input paths to process : 1
>>     14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 0 on host2:
>>     remote Akka client disassociated
>>     14/10/02 13:19:03 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
>>     14/10/02 13:19:03 WARN TaskSetManager: Lost TID 0 (task 0.0:0)
>>     14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 4 on host5:
>>     remote Akka client disassociated
>>     14/10/02 13:19:03 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
>>     14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 1 on host4:
>>     remote Akka client disassociated
>>     14/10/02 13:19:03 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
>>     14/10/02 13:19:03 WARN TaskSetManager: Lost TID 4 (task 0.0:1)
>>     14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 3 on host3:
>>     remote Akka client disassociated
>>     14/10/02 13:19:03 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
>>     14/10/02 13:19:03 ERROR TaskSchedulerImpl: Lost executor 2 on host1:
>>     remote Akka client disassociated
>>     14/10/02 13:19:03 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
>>     14/10/02 13:19:03 WARN TaskSetManager: Lost TID 7 (task 0.0:0)
>>     14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 6 on host4:
>>     remote Akka client disassociated
>>     14/10/02 13:19:04 WARN TaskSetManager: Lost TID 8 (task 0.0:0)
>>     14/10/02 13:19:04 WARN TaskSetManager: Lost TID 9 (task 0.0:1)
>>     14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 5 on host2:
>>     remote Akka client disassociated
>>     14/10/02 13:19:04 WARN TaskSetManager: Lost TID 10 (task 0.0:1)
>>     14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 7 on host5:
>>     remote Akka client disassociated
>>     14/10/02 13:19:04 WARN TaskSetManager: Lost TID 11 (task 0.0:0)
>>     14/10/02 13:19:04 WARN TaskSetManager: Lost TID 12 (task 0.0:1)
>>     14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 8 on host3:
>>     remote Akka client disassociated
>>     14/10/02 13:19:04 WARN TaskSetManager: Lost TID 13 (task 0.0:1)
>>     14/10/02 13:19:04 ERROR TaskSchedulerImpl: Lost executor 9 on host1:
>>     remote Akka client disassociated
>>     14/10/02 13:19:04 WARN TaskSetManager: Lost TID 14 (task 0.0:0)
>>     14/10/02 13:19:04 WARN TaskSetManager: Lost TID 15 (task 0.0:1)
>>     14/10/02 13:19:05 ERROR AppClient$ClientActor: Master removed our
>>     application: FAILED; stopping client
>>     14/10/02 13:19:05 WARN SparkDeploySchedulerBackend: Disconnected
>>     from Spark cluster! Waiting for reconnection...
>>     14/10/02 13:19:06 ERROR TaskSchedulerImpl: Lost executor 11 on
>>     host5: remote Akka client disassociated
>>     14/10/02 13:19:06 WARN TaskSetManager: Lost TID 17 (task 0.0:0)
>>     14/10/02 13:19:06 WARN TaskSetManager: Lost TID 16 (task 0.0:1)
>>     ---------------
>>
>>     And this is error log on one of the workers:
>>     ---------------
>>     14/10/02 13:19:05 INFO worker.Worker: Executor
>>     app-20141002131901-0002/9 finished with state FAILED message Command
>>     exited with code 1 exitStatus 1
>>     14/10/02 13:19:05 INFO actor.LocalActorRef: Message
>>     [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying]
>>     from Actor[akka://sparkWorker/deadLetters] to
>>     Actor[akka://sparkWorker/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkWorker%40xxx.xx.xx.xx%3A57719-15#1504298502]
>>     was not delivered. [6] dead letters encountered. This logging can be
>>     turned off or adjusted with configuration settings
>>     'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
>>     14/10/02 13:19:05 ERROR remote.EndpointWriter: AssociationError
>>     [akka.tcp://sparkWorker@host1:47421] ->
>>     [akka.tcp://sparkExecutor@host1:45542]: Error [Association failed
>>     with [akka.tcp://sparkExecutor@host1:45542]] [
>>     akka.remote.EndpointAssociationException: Association failed with
>>     [akka.tcp://sparkExecutor@host1:45542]
>>     Caused by:
>>     akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>>     Connection refused: host1/xxx.xx.xx.xx:45542
>>     ]
>>     14/10/02 13:19:05 ERROR remote.EndpointWriter: AssociationError
>>     [akka.tcp://sparkWorker@host1:47421] ->
>>     [akka.tcp://sparkExecutor@host1:45542]: Error [Association failed
>>     with [akka.tcp://sparkExecutor@host1:45542]] [
>>     akka.remote.EndpointAssociationException: Association failed with
>>     [akka.tcp://sparkExecutor@host1:45542]
>>     Caused by:
>>     akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>>     Connection refused: host1/xxx.xx.xx.xx:45542
>>     ]
>>     14/10/02 13:19:05 ERROR remote.EndpointWriter: AssociationError
>>     [akka.tcp://sparkWorker@host1:47421] ->
>>     [akka.tcp://sparkExecutor@host1:45542]: Error [Association failed
>>     with [akka.tcp://sparkExecutor@host1:45542]] [
>>     akka.remote.EndpointAssociationException: Association failed with
>>     [akka.tcp://sparkExecutor@host1:45542]
>>     Caused by:
>>     akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2:
>>     Connection refused: host1/xxx.xx.xx.xx:45542
>>     ---------------
>>
>>     Thanks!
>>     Irina
>>
