What happens if you enable DEBUG-level logging on the workers? You might be affected by the same root problem in https://spark-project.atlassian.net/browse/SPARK-1025, where PythonRDD is logging and ignoring an IOException that's actually an unrecoverable failure, but this is only apparent in the DEBUG logs.
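
For example, you could turn up logging on each worker in conf/log4j.properties. A minimal sketch, assuming the stock log4j.properties.template (the last line narrows DEBUG to just PythonRDD, in case root-level DEBUG is too noisy):

    # Template default is INFO; DEBUG shows the swallowed exceptions
    log4j.rootCategory=DEBUG, console
    log4j.appender.console=org.apache.log4j.ConsoleAppender
    log4j.appender.console.layout=org.apache.log4j.PatternLayout
    log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n

    # Alternatively, enable DEBUG only for the class in question
    log4j.logger.org.apache.spark.api.python.PythonRDD=DEBUG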
My pull request at https://github.com/apache/incubator-spark/pull/504 addresses this problem for FileNotFoundException, but I may need to add special cases for other types of IOException (or remove the code that catches generic IOExceptions; I'd do this, but I'm not sure which exceptions were originally being ignored as spurious, so I'm not sure which more specific exceptions should be ignored). A minimal sketch of that distinction follows the quoted message below.

On Fri, Jan 24, 2014 at 1:43 PM, lobdellb <lobde...@gmail.com> wrote:

> I set up a cluster with four nodes using the spark cluster setup. I built
> the Spark 0.8.1 binaries with "sbt/sbt assembly" and
> "./make-distribution.sh", and copied the result to each client machine
> (each of which has Java and Python installed). DNS works in all
> directions, no machine has a firewall of any sort, and passwordless SSH
> works from the master node.
>
> I am using the pyspark interface, launched with:
>
> *MASTER=spark://namenode1:7077 ./pyspark*
>
> I am able to load the web interface at *http://namenode1:8080/* and
> browse my four workers.
>
> On the pyspark command line, I run:
>
> *> data = sc.textFile("my_text_file.txt")
> > data.count()*
>
> This launches a job, which can be seen on the web interface
> (http://namenode1:8080/). I can also see the Java processes running on
> the slave nodes:
>
> *lobdellb@slave2:~/spark-0.8.1-incubating/conf$ jps
> 22366 CoarseGrainedExecutorBackend
> 22301 Worker
> 22876 Jps*
>
> Additionally, I can see the Python daemon process running on the slaves:
>
> *lobdellb@slave2:~/spark-0.8.1-incubating/conf$ ps aux | grep python
> lobdellb 22433 0.0 0.2 46028 11856 ? S 14:43 0:00 python
> /home/lobdellb/spark-0.8.1-incubating/python/pyspark/daemon.py*
>
> However, the slaves remain idle, i.e., no computation is done, the
> *data.count()* command never finishes, and the load average of every
> cluster machine stays near 0.0.
>
> I adjusted log4j.properties to show ALL error messages, but none of the
> messages point to a problem. The Master log
> (spark-username-org.apache.spark.deploy.master.Master-1-namenode1.out) is
> quite verbose but has no messages that seem indicative of a malfunction.
> The slave node messages from the web interface are easier to parse and
> perhaps offer some clues, but I'm not sure what I'm looking at:
>
> 14/01/24 14:42:04 INFO Slf4jEventHandler: Slf4jEventHandler started
> 14/01/24 14:42:04 INFO SparkEnv: Connecting to BlockManagerMaster:
> akka://sp...@scalaplay.hsd1.il.comcast.net:52181/user/BlockManagerMaster
> 14/01/24 14:42:04 INFO DiskBlockManager: Created local directory at
> /tmp/spark-local-20140124144204-5097
> 14/01/24 14:42:04 INFO MemoryStore: MemoryStore started with capacity
> 324.4 MB.
> 14/01/24 14:42:04 INFO ConnectionManager: Bound socket to port 55782
> with id = ConnectionManagerId(slave1.hsd1.il.comcast.net,55782)
> 14/01/24 14:42:04 INFO BlockManagerMaster: Trying to register BlockManager
> 14/01/24 14:42:04 INFO BlockManagerMaster: Registered BlockManager
> 14/01/24 14:42:04 INFO SparkEnv: Connecting to MapOutputTracker:
> akka://sp...@scalaplay.hsd1.il.comcast.net:52181/user/MapOutputTracker
> 14/01/24 14:42:04 INFO HttpFileServer: HTTP File server directory is
> /tmp/spark-157b67f1-1be6-4289-bff2-b29cf23a7e67
> 14/01/24 14:43:22 INFO CoarseGrainedExecutorBackend: Got assigned task 2
> 14/01/24 14:43:22 INFO CoarseGrainedExecutorBackend: Got assigned task 6
> 14/01/24 14:43:22 INFO Executor: Running task ID 6
> 14/01/24 14:43:22 INFO Executor: Running task ID 2
> 14/01/24 14:43:22 INFO HttpBroadcast: Started reading broadcast variable 0
> 14/01/24 14:43:23 INFO MemoryStore: ensureFreeSpace(39200) called with
> curMem=0, maxMem=340147568
> 14/01/24 14:43:23 INFO MemoryStore: Block broadcast_0 stored as values to
> memory (estimated size 38.3 KB, free 324.4 MB)
> 14/01/24 14:43:23 INFO HttpBroadcast: Reading broadcast variable 0 took
> 0.333493883 s
> 14/01/24 14:43:23 INFO BlockManager: Found block broadcast_0 locally
> 14/01/24 14:43:23 INFO HadoopRDD: Input split:
> file:/home/lobdellb/data/Building_Permits.csv:67108864+33554432
> 14/01/24 14:43:23 INFO HadoopRDD: Input split:
> file:/home/lobdellb/data/Building_Permits.csv:201326592+3555642
> 14/01/24 14:43:23 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 14/01/24 14:43:23 WARN LoadSnappy: Snappy native library not loaded
> *14/01/24 14:43:23 INFO PythonRDD: stdin writer to Python finished early
> 14/01/24 14:43:23 INFO PythonRDD: stdin writer to Python finished early*
>
> I would start sifting through the Python source code (which is only a
> few thousand lines) and attempt to understand what it does -- probably by
> adding logging -- but before doing that, I am hoping someone might know:
>
> (1) Is there additional logging I can access which will help identify
> the problem?
> (2) Is there an obvious flaw in my setup?
>
> Much appreciated,
> -Bryce
>
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/Pyspark-job-not-starting-tp903.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
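
As promised above, here is a minimal, self-contained sketch of the distinction the patch is after -- not the actual PythonRDD code; the object, method names, and messages are made up for illustration:

    import java.io.{FileNotFoundException, IOException}

    object StdinWriterSketch {
      // Hypothetical stand-in for the stdin-writer thread that feeds a
      // partition to the Python worker; here it fails on a missing file.
      def writePartition(): Unit =
        throw new FileNotFoundException("my_text_file.txt")

      def logDebug(msg: String, e: Throwable): Unit =
        println(s"DEBUG: $msg: $e")

      def main(args: Array[String]): Unit =
        try writePartition()
        catch {
          // Old behavior: every IOException (FileNotFoundException is a
          // subclass of it) was logged at DEBUG and swallowed, so the job
          // hung with only "stdin writer to Python finished early" at INFO:
          //   case e: IOException => logDebug("ignoring IOException", e)

          // Patched behavior: rethrow the known-fatal case so the task
          // fails loudly...
          case e: FileNotFoundException => throw e
          // ...and keep ignoring only the IOExceptions assumed spurious.
          case e: IOException => logDebug("ignoring IOException", e)
        }
    }

With the generic catch, a missing input file produces only a DEBUG line plus the idle "finished early" INFO messages seen above; with the rethrow, the task fails and the error reaches the driver.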