Hi, I'm just getting started with Spark and have installed the parcels in our CDH5 GA cluster.
Master: hadoop-pg-5.cluster
Worker: hadoop-pg-7.cluster

Since the usual advice is to use FQDNs, these settings seem reasonable to me. Both daemons are running, the master web UI shows the connected worker, and the log entries confirm it.

master:

2014-04-13 21:26:40,641 INFO Remoting: Starting remoting
2014-04-13 21:26:40,930 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077]
2014-04-13 21:26:41,356 INFO org.apache.spark.deploy.master.Master: Starting Spark master at spark://hadoop-pg-5.cluster:7077
...
2014-04-13 21:26:41,439 INFO org.eclipse.jetty.server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:18080
2014-04-13 21:26:41,441 INFO org.apache.spark.deploy.master.ui.MasterWebUI: Started Master web UI at http://hadoop-pg-5.cluster:18080
2014-04-13 21:26:41,476 INFO org.apache.spark.deploy.master.Master: I have been elected leader! New state: ALIVE
2014-04-13 21:27:40,319 INFO org.apache.spark.deploy.master.Master: Registering worker hadoop-pg-5.cluster:7078 with 2 cores, 64.0 MB RAM

worker:

2014-04-13 21:27:39,037 INFO akka.event.slf4j.Slf4jLogger: Slf4jLogger started
2014-04-13 21:27:39,136 INFO Remoting: Starting remoting
2014-04-13 21:27:39,413 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@hadoop-pg-7.cluster:7078]
2014-04-13 21:27:39,706 INFO org.apache.spark.deploy.worker.Worker: Starting Spark worker hadoop-pg-7.cluster:7078 with 2 cores, 64.0 MB RAM
2014-04-13 21:27:39,708 INFO org.apache.spark.deploy.worker.Worker: Spark home: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark
...
2014-04-13 21:27:39,888 INFO org.eclipse.jetty.server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:18081
2014-04-13 21:27:39,889 INFO org.apache.spark.deploy.worker.ui.WorkerWebUI: Started Worker web UI at http://hadoop-pg-7.cluster:18081
2014-04-13 21:27:39,890 INFO org.apache.spark.deploy.worker.Worker: Connecting to master spark://hadoop-pg-5.cluster:7077...
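One thing that stands out in the registration line is that the worker advertises only 2 cores and 64.0 MB RAM. As far as I understand, these figures come from the worker's spark-env.sh; a sketch of the relevant settings (the values here are illustrative, not my actual configuration):

```shell
# spark-env.sh on the worker node -- illustrative values, not my actual config
export SPARK_WORKER_CORES=2      # cores this worker offers to executors
export SPARK_WORKER_MEMORY=1g    # total memory this worker may hand out to executors
```

After changing these, the worker daemon has to be restarted before the master shows the new resources.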
2014-04-13 21:27:40,360 INFO org.apache.spark.deploy.worker.Worker: Successfully registered with master spark://hadoop-pg-5.cluster:7077

So far, so good. Now I want to run the Python pi example by executing (on the worker):

cd /opt/cloudera/parcels/CDH/lib/spark && ./bin/pyspark ./python/examples/pi.py spark://hadoop-pg-5.cluster:7077

Here the strange thing happens: the script never executes. It hangs, repeating this output forever:

14/04/13 21:31:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/04/13 21:31:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

The whole log is:

14/04/13 21:30:44 INFO Slf4jLogger: Slf4jLogger started
14/04/13 21:30:45 INFO Remoting: Starting remoting
14/04/13 21:30:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO SparkEnv: Registering BlockManagerMaster
14/04/13 21:30:45 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140413213045-acec
14/04/13 21:30:45 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/04/13 21:30:45 INFO ConnectionManager: Bound socket to port 57506 with id = ConnectionManagerId(hadoop-pg-7.cluster,57506)
14/04/13 21:30:45 INFO BlockManagerMaster: Trying to register BlockManager
14/04/13 21:30:45 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager hadoop-pg-7.cluster:57506 with 294.9 MB RAM
14/04/13 21:30:45 INFO BlockManagerMaster: Registered BlockManager
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:45 INFO HttpBroadcast: Broadcast server started at http://10.147.210.7:51224
14/04/13 21:30:45 INFO SparkEnv: Registering MapOutputTracker
14/04/13 21:30:45 INFO HttpFileServer: HTTP File server directory is /tmp/spark-f9ab98c8-2adf-460a-9099-6dc07c7dc89f
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:46 INFO SparkUI: Started Spark Web UI at http://hadoop-pg-7.cluster:4040
14/04/13 21:30:46 INFO AppClient$ClientActor: Connecting to master spark://hadoop-pg-5.cluster:7077...
14/04/13 21:30:47 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140413213046-0000
14/04/13 21:30:48 INFO SparkContext: Starting job: reduce at ./python/examples/pi.py:36
14/04/13 21:30:48 INFO DAGScheduler: Got job 0 (reduce at ./python/examples/pi.py:36) with 2 output partitions (allowLocal=false)
14/04/13 21:30:48 INFO DAGScheduler: Final stage: Stage 0 (reduce at ./python/examples/pi.py:36)
14/04/13 21:30:48 INFO DAGScheduler: Parents of final stage: List()
14/04/13 21:30:48 INFO DAGScheduler: Missing parents: List()
14/04/13 21:30:48 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at reduce at ./python/examples/pi.py:36), which has no missing parents
14/04/13 21:30:48 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[1] at reduce at ./python/examples/pi.py:36)
14/04/13 21:30:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/04/13 21:31:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/04/13 21:31:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

At this point I have to cancel the script. When I do, the master logs the following (at the moment the Python pi script is cancelled):

2014-04-13 21:30:46,965 INFO org.apache.spark.deploy.master.Master: Registering app PythonPi
2014-04-13 21:30:46,974 INFO org.apache.spark.deploy.master.Master: Registered app PythonPi with ID app-20140413213046-0000
2014-04-13 21:31:27,123 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,125 INFO org.apache.spark.deploy.master.Master: Removing app app-20140413213046-0000
2014-04-13 21:31:27,143 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,144 INFO akka.actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.147.210.7%3A44207-2#-389971336] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2014-04-13 21:31:27,194 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,199 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,215 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,222 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,234 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,238 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.

What is going wrong here?
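If I read the repeated warning correctly, "Initial job has not accepted any resources" means that no worker can satisfy the application's resource request. My worker registered with only 64.0 MB RAM, while Spark's default executor memory request is considerably larger (512 MB, if I understand the docs), so possibly no resource offer ever matches. A sketch of the two directions I see, using the property and variable names documented for standalone mode (I have not verified that either one fixes the hang):

```shell
# Direction A: raise the worker's offer.
# In spark-env.sh on the worker, then restart the worker daemon:
export SPARK_WORKER_MEMORY=1g

# Direction B: lower the application's request below what the worker advertises.
# Set the executor memory property for the driver before launching:
export SPARK_JAVA_OPTS="-Dspark.executor.memory=32m"
./bin/pyspark ./python/examples/pi.py spark://hadoop-pg-5.cluster:7077
```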
I get the same behaviour if I start spark-shell on the worker and run, for example:

sc.parallelize(1 to 100, 10).count

Any help highly appreciated,
Gerd
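PS: In case it helps to narrow things down, the same example can be run in local mode as a sanity check, bypassing the standalone master entirely (the examples in this Spark version take the master URL as their first argument):

```shell
cd /opt/cloudera/parcels/CDH/lib/spark
# "local" runs driver and executors inside one JVM -- no cluster resources needed
./bin/pyspark ./python/examples/pi.py local
```

If this completes while the cluster run hangs, the problem is presumably in scheduling/resources or networking between the nodes, not in the example itself.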