What happens if you enable DEBUG-level logging on the workers?  You might
be affected by the same root problem in
https://spark-project.atlassian.net/browse/SPARK-1025, where PythonRDD is
logging and ignoring an IOException that's actually an unrecoverable
failure, but this is only apparent in the DEBUG logs.
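
For reference, a minimal sketch of how to do that, assuming you start from
the stock conf/log4j.properties.template that ships with Spark (copy it to
conf/log4j.properties on every worker; the logger name below is my best
guess at the right one to target):

    # conf/log4j.properties on each worker
    # Either lower the root level to DEBUG...
    log4j.rootCategory=DEBUG, console
    # ...or, more quietly, turn on DEBUG just for PythonRDD:
    log4j.logger.org.apache.spark.api.python.PythonRDD=DEBUG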

My pull request at
https://github.com/apache/incubator-spark/pull/504 addresses this
problem for FileNotFoundException, but I may need to add
special cases for other types of IOException (or remove the code that
catches generic IOExceptions altogether; I'd do that, but I'm not sure which
exceptions were originally being ignored as spurious, so I don't know which
more specific exceptions should still be ignored).
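
To illustrate the pattern (a hedged sketch only -- not the actual PythonRDD
code, and not necessarily what the pull request does): let exceptions that
signal an unrecoverable failure propagate so the task fails loudly, and only
log-and-swallow the cases that really are spurious.

    import java.io.{FileNotFoundException, IOException}

    // Hypothetical stand-in for the stdin-writer thread's error handling;
    // the method name and the println are illustrative only.
    def runWriterThread(writeIteratorToStream: () => Unit): Unit = {
      try {
        writeIteratorToStream()
      } catch {
        case e: FileNotFoundException =>
          // Unrecoverable (e.g. a missing input file): re-throw so the
          // task fails visibly instead of the job hanging.
          throw e
        case e: IOException =>
          // Historically assumed spurious (the Python worker may stop
          // reading early on purpose); swallowed, which is why real
          // failures only surface in DEBUG-level logs.
          println(s"stdin writer to Python finished early: ${e.getMessage}")
      }
    }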


On Fri, Jan 24, 2014 at 1:43 PM, lobdellb <lobde...@gmail.com> wrote:

> I set up a cluster with four nodes using the Spark cluster setup.  I built
> the Spark 0.8.1 binaries with "sbt/sbt assembly" and
> "./make-distribution.sh", the result of which I copied to each client
> machine (each of which has Java + Python installed).  DNS works in all
> directions, no machine is running a firewall of any sort, and passwordless
> SSH works from the master node.
>
> I am using the pyspark interface, launched with:
> MASTER=spark://namenode1:7077 ./pyspark
>
> I am able to load the web interface, with http://namenode1:8080/, and
> browse my four workers.
>
> On the pyspark command line, I run:
> >>> data = sc.textFile("my_text_file.txt")
> >>> data.count()
> This launches a job, which can be seen on the web interface
> (http://namenode1:8080/).  Also, I can see the Java processes running on
> the
> slave nodes:
>
> lobdellb@slave2:~/spark-0.8.1-incubating/conf$ jps
> 22366 CoarseGrainedExecutorBackend
> 22301 Worker
> 22876 Jps
>
> Additionally, I can see the python daemon process running on the slaves:
>
> lobdellb@slave2:~/spark-0.8.1-incubating/conf$ ps aux | grep python
> lobdellb 22433  0.0  0.2  46028 11856 ?        S    14:43   0:00 python
> /home/lobdellb/spark-0.8.1-incubating/python/pyspark/daemon.py
>
> However, the slaves remain idle, i.e., no computation is done: the
> data.count() command never finishes, and the load average on all cluster
> machines stays near 0.0.
>
> I adjusted log4j.properties to show ALL error messages, but none of the
> messages point to the problem.  The Master log
> (spark-username-org.apache.spark.deploy.master.Master-1-namenode1.out) is
> quite verbose but has no messages which seem indicative of a malfunction.
> The slave node messages from the web interface are easier to mentally parse
> and perhaps offer some clues, but I'm not sure what I'm looking at:
>
> 14/01/24 14:42:04 INFO Slf4jEventHandler: Slf4jEventHandler started
> 14/01/24 14:42:04 INFO SparkEnv: Connecting to BlockManagerMaster:
> akka://sp...@scalaplay.hsd1.il.comcast.net:52181/user/BlockManagerMaster
> 14/01/24 14:42:04 INFO DiskBlockManager: Created local directory at
> /tmp/spark-local-20140124144204-5097
> 14/01/24 14:42:04 INFO MemoryStore: MemoryStore started with capacity
> 324.4 MB.
> 14/01/24 14:42:04 INFO ConnectionManager: Bound socket to port 55782 with
> id = ConnectionManagerId(slave1.hsd1.il.comcast.net,55782)
> 14/01/24 14:42:04 INFO BlockManagerMaster: Trying to register BlockManager
> 14/01/24 14:42:04 INFO BlockManagerMaster: Registered BlockManager
> 14/01/24 14:42:04 INFO SparkEnv: Connecting to MapOutputTracker:
> akka://sp...@scalaplay.hsd1.il.comcast.net:52181/user/MapOutputTracker
> 14/01/24 14:42:04 INFO HttpFileServer: HTTP File server directory is
> /tmp/spark-157b67f1-1be6-4289-bff2-b29cf23a7e67
> 14/01/24 14:43:22 INFO CoarseGrainedExecutorBackend: Got assigned task 2
> 14/01/24 14:43:22 INFO CoarseGrainedExecutorBackend: Got assigned task 6
> 14/01/24 14:43:22 INFO Executor: Running task ID 6
> 14/01/24 14:43:22 INFO Executor: Running task ID 2
> 14/01/24 14:43:22 INFO HttpBroadcast: Started reading broadcast variable 0
> 14/01/24 14:43:23 INFO MemoryStore: ensureFreeSpace(39200) called with
> curMem=0, maxMem=340147568
> 14/01/24 14:43:23 INFO MemoryStore: Block broadcast_0 stored as values to
> memory (estimated size 38.3 KB, free 324.4 MB)
> 14/01/24 14:43:23 INFO HttpBroadcast: Reading broadcast variable 0 took
> 0.333493883 s
> 14/01/24 14:43:23 INFO BlockManager: Found block broadcast_0 locally
> 14/01/24 14:43:23 INFO HadoopRDD: Input split:
> file:/home/lobdellb/data/Building_Permits.csv:67108864+33554432
> 14/01/24 14:43:23 INFO HadoopRDD: Input split:
> file:/home/lobdellb/data/Building_Permits.csv:201326592+3555642
> 14/01/24 14:43:23 WARN NativeCodeLoader: Unable to load native-hadoop
> library for your platform... using builtin-java classes where applicable
> 14/01/24 14:43:23 WARN LoadSnappy: Snappy native library not loaded
> 14/01/24 14:43:23 INFO PythonRDD: stdin writer to Python finished early
> 14/01/24 14:43:23 INFO PythonRDD: stdin writer to Python finished early
>
> I would start sifting through the Python source code (which is only a few
> thousand lines) and attempt to understand what it does -- probably by
> adding logging -- but before doing this, I am hoping someone might know:
>
> (1)  Is there additional logging I can access which will help identify the
> problem?
> (2)  Is there an obvious flaw in my setup?
>
> Much appreciated
> -Bryce
>
>
>
>
