Hi, I'm just getting started with Spark and have installed the parcels in our CDH5 GA cluster.
Master: hadoop-pg-5.cluster
Worker: hadoop-pg-7.cluster

Since the usual advice is to use FQDNs, these settings seem reasonable to me. Both daemons are running, the master web UI shows the connected worker, and the log entries confirm it.

master:

2014-04-13 21:26:40,641 INFO Remoting: Starting remoting
2014-04-13 21:26:40,930 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077]
2014-04-13 21:26:41,356 INFO org.apache.spark.deploy.master.Master: Starting Spark master at spark://hadoop-pg-5.cluster:7077
...
2014-04-13 21:26:41,439 INFO org.eclipse.jetty.server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:18080
2014-04-13 21:26:41,441 INFO org.apache.spark.deploy.master.ui.MasterWebUI: Started Master web UI at http://hadoop-pg-5.cluster:18080
2014-04-13 21:26:41,476 INFO org.apache.spark.deploy.master.Master: I have been elected leader! New state: ALIVE
2014-04-13 21:27:40,319 INFO org.apache.spark.deploy.master.Master: Registering worker hadoop-pg-5.cluster:7078 with 2 cores, 64.0 MB RAM

worker:

2014-04-13 21:27:39,037 INFO akka.event.slf4j.Slf4jLogger: Slf4jLogger started
2014-04-13 21:27:39,136 INFO Remoting: Starting remoting
2014-04-13 21:27:39,413 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://sparkWorker@hadoop-pg-7.cluster:7078]
2014-04-13 21:27:39,706 INFO org.apache.spark.deploy.worker.Worker: Starting Spark worker hadoop-pg-7.cluster:7078 with 2 cores, 64.0 MB RAM
2014-04-13 21:27:39,708 INFO org.apache.spark.deploy.worker.Worker: Spark home: /opt/cloudera/parcels/CDH-5.0.0-1.cdh5.0.0.p0.47/lib/spark
...
2014-04-13 21:27:39,888 INFO org.eclipse.jetty.server.AbstractConnector: Started SelectChannelConnector@0.0.0.0:18081
2014-04-13 21:27:39,889 INFO org.apache.spark.deploy.worker.ui.WorkerWebUI: Started Worker web UI at http://hadoop-pg-7.cluster:18081
2014-04-13 21:27:39,890 INFO org.apache.spark.deploy.worker.Worker: Connecting to master spark://hadoop-pg-5.cluster:7077...
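One thing that stands out in the registration line is that the worker advertises only 2 cores and 64.0 MB RAM. As far as I understand, these figures come from the worker's spark-env.sh; a sketch of the relevant settings (the values here are illustrative, not my actual configuration):

```shell
# spark-env.sh on the worker node -- illustrative values, not my actual config
export SPARK_WORKER_CORES=2      # cores this worker offers to executors
export SPARK_WORKER_MEMORY=1g    # total memory this worker may hand out to executors
```

After changing these, the worker daemon has to be restarted before the master shows the new resources.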
2014-04-13 21:27:40,360 INFO org.apache.spark.deploy.worker.Worker: Successfully registered with master spark://hadoop-pg-5.cluster:7077

So far, so good. Now I want to run the Python pi example by executing (on the worker):

cd /opt/cloudera/parcels/CDH/lib/spark && ./bin/pyspark ./python/examples/pi.py spark://hadoop-pg-5.cluster:7077

Here the strange thing happens: the script never executes. It hangs, repeating this output forever:

14/04/13 21:31:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/04/13 21:31:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

The whole log is:

14/04/13 21:30:44 INFO Slf4jLogger: Slf4jLogger started
14/04/13 21:30:45 INFO Remoting: Starting remoting
14/04/13 21:30:45 INFO Remoting: Remoting started; listening on addresses :[akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO Remoting: Remoting now listens on addresses: [akka.tcp://spark@hadoop-pg-7.cluster:50601]
14/04/13 21:30:45 INFO SparkEnv: Registering BlockManagerMaster
14/04/13 21:30:45 INFO DiskBlockManager: Created local directory at /tmp/spark-local-20140413213045-acec
14/04/13 21:30:45 INFO MemoryStore: MemoryStore started with capacity 294.9 MB.
14/04/13 21:30:45 INFO ConnectionManager: Bound socket to port 57506 with id = ConnectionManagerId(hadoop-pg-7.cluster,57506)
14/04/13 21:30:45 INFO BlockManagerMaster: Trying to register BlockManager
14/04/13 21:30:45 INFO BlockManagerMasterActor$BlockManagerInfo: Registering block manager hadoop-pg-7.cluster:57506 with 294.9 MB RAM
14/04/13 21:30:45 INFO BlockManagerMaster: Registered BlockManager
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:45 INFO HttpBroadcast: Broadcast server started at http://10.147.210.7:51224
14/04/13 21:30:45 INFO SparkEnv: Registering MapOutputTracker
14/04/13 21:30:45 INFO HttpFileServer: HTTP File server directory is /tmp/spark-f9ab98c8-2adf-460a-9099-6dc07c7dc89f
14/04/13 21:30:45 INFO HttpServer: Starting HTTP Server
14/04/13 21:30:46 INFO SparkUI: Started Spark Web UI at http://hadoop-pg-7.cluster:4040
14/04/13 21:30:46 INFO AppClient$ClientActor: Connecting to master spark://hadoop-pg-5.cluster:7077...
14/04/13 21:30:47 INFO SparkDeploySchedulerBackend: Connected to Spark cluster with app ID app-20140413213046-0000
14/04/13 21:30:48 INFO SparkContext: Starting job: reduce at ./python/examples/pi.py:36
14/04/13 21:30:48 INFO DAGScheduler: Got job 0 (reduce at ./python/examples/pi.py:36) with 2 output partitions (allowLocal=false)
14/04/13 21:30:48 INFO DAGScheduler: Final stage: Stage 0 (reduce at ./python/examples/pi.py:36)
14/04/13 21:30:48 INFO DAGScheduler: Parents of final stage: List()
14/04/13 21:30:48 INFO DAGScheduler: Missing parents: List()
14/04/13 21:30:48 INFO DAGScheduler: Submitting Stage 0 (PythonRDD[1] at reduce at ./python/examples/pi.py:36), which has no missing parents
14/04/13 21:30:48 INFO DAGScheduler: Submitting 2 missing tasks from Stage 0 (PythonRDD[1] at reduce at ./python/examples/pi.py:36)
14/04/13 21:30:48 INFO TaskSchedulerImpl: Adding task set 0.0 with 2 tasks
14/04/13 21:31:03 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory
14/04/13 21:31:18 WARN TaskSchedulerImpl: Initial job has not accepted any resources; check your cluster UI to ensure that workers are registered and have sufficient memory

At this point I have to cancel the script. When I do, the master logs the following (at the moment the Python pi script is cancelled):

2014-04-13 21:30:46,965 INFO org.apache.spark.deploy.master.Master: Registering app PythonPi
2014-04-13 21:30:46,974 INFO org.apache.spark.deploy.master.Master: Registered app PythonPi with ID app-20140413213046-0000
2014-04-13 21:31:27,123 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,125 INFO org.apache.spark.deploy.master.Master: Removing app app-20140413213046-0000
2014-04-13 21:31:27,143 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,144 INFO akka.actor.LocalActorRef: Message [akka.remote.transport.ActorTransportAdapter$DisassociateUnderlying] from Actor[akka://sparkMaster/deadLetters] to Actor[akka://sparkMaster/system/transports/akkaprotocolmanager.tcp0/akkaProtocol-tcp%3A%2F%2FsparkMaster%4010.147.210.7%3A44207-2#-389971336] was not delivered. [1] dead letters encountered. This logging can be turned off or adjusted with configuration settings 'akka.log-dead-letters' and 'akka.log-dead-letters-during-shutdown'.
2014-04-13 21:31:27,194 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,199 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,215 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,222 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.
2014-04-13 21:31:27,234 ERROR akka.remote.EndpointWriter: AssociationError [akka.tcp://sparkMaster@hadoop-pg-5.cluster:7077] -> [akka.tcp://spark@hadoop-pg-7.cluster:50601]: Error [Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]] [
akka.remote.EndpointAssociationException: Association failed with [akka.tcp://spark@hadoop-pg-7.cluster:50601]
Caused by: akka.remote.transport.netty.NettyTransport$$anonfun$associate$1$$anon$2: Connection refused: hadoop-pg-7.cluster/10.147.210.7:50601
]
2014-04-13 21:31:27,238 INFO org.apache.spark.deploy.master.Master: akka.tcp://spark@hadoop-pg-7.cluster:50601 got disassociated, removing it.

What is going wrong here?
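If I read the repeated warning correctly, "Initial job has not accepted any resources" means that no worker can satisfy the application's resource request. My worker registered with only 64.0 MB RAM, while Spark's default executor memory request is considerably larger (512 MB, if I understand the docs), so possibly no resource offer ever matches. A sketch of the two directions I see, using the property and variable names documented for standalone mode (I have not verified that either one fixes the hang):

```shell
# Direction A: raise the worker's offer.
# In spark-env.sh on the worker, then restart the worker daemon:
export SPARK_WORKER_MEMORY=1g

# Direction B: lower the application's request below what the worker advertises.
# Set the executor memory property for the driver before launching:
export SPARK_JAVA_OPTS="-Dspark.executor.memory=32m"
./bin/pyspark ./python/examples/pi.py spark://hadoop-pg-5.cluster:7077
```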
I get the same behaviour if I start spark-shell on the worker and run, for example:

sc.parallelize(1 to 100, 10).count

Any help highly appreciated,
Gerd
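PS: In case it helps to narrow things down, the same example can be run in local mode as a sanity check, bypassing the standalone master entirely (the examples in this Spark version take the master URL as their first argument):

```shell
cd /opt/cloudera/parcels/CDH/lib/spark
# "local" runs driver and executors inside one JVM -- no cluster resources needed
./bin/pyspark ./python/examples/pi.py local
```

If this completes while the cluster run hangs, the problem is presumably in scheduling/resources or networking between the nodes, not in the example itself.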