These tips were very helpful! By setting SPARK_MASTER_IP as you suggest, I was able to make progress. Unfortunately, it is still unclear to me how to specify the hadoop-client dependency for a stand-alone PySpark application, so I still get the EOFException, since I am running a non-default Hadoop distribution (2.3.0-cdh5.0.0, shipped with CDH 5). The documentation describes how to add a hadoop-client dependency to the pom.xml of a Java application, but not how to do the same for PySpark. To work around the EOFException, I created a multi-node Hadoop cluster running version 1.0.4 (the default Hadoop for Spark 0.9.1). That worked, and I was able to run a multi-node Spark job successfully.
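For reference, the Java route described in the documentation is just a Maven dependency along these lines (a sketch only; the CDH version string is simply the one from my cluster, and you may also need Cloudera's repository declaration, which I have not verified):

    <dependency>
      <groupId>org.apache.hadoop</groupId>
      <artifactId>hadoop-client</artifactId>
      <version>2.3.0-cdh5.0.0</version>
    </dependency>

My current, untested guess for PySpark is that the equivalent is to rebuild the Spark assembly against the same Hadoop version (something like SPARK_HADOOP_VERSION=2.3.0-cdh5.0.0 sbt/sbt assembly), but I would appreciate confirmation from anyone who has actually done this.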
The question remains though: how do you specify a hadoop-client dependency for a Python stand-alone Spark application (i.e., do the equivalent of adding it to the pom.xml for a Java Spark application)?

Thanks!
-T.J.

On Thu, May 29, 2014 at 4:04 AM, jaranda <jordi.ara...@bsc.es> wrote:

> I finally got it working. Main points:
>
> - I had to add hadoop-client dependency to avoid a strange EOFException.
> - I had to set SPARK_MASTER_IP in conf/start-master.sh to hostname -f
>   instead of hostname, since akka seems not to work properly with host
>   names / ip, it requires fully qualified domain names.
> - I also set SPARK_MASTER_IP in conf/spark-env.sh to hostname -f so that
>   other workers can reach the master.
> - Be sure that conf/slaves also contains fully qualified domain names.
> - It seems that both master and workers need to have access to the driver
>   client and since I was within a VPN I had lot of troubles with this. It
>   took some time but I finally realized it.
>
> Making these changes, everything just worked like a charm!
>
> --
> View this message in context:
> http://apache-spark-user-list.1001560.n3.nabble.com/A-Standalone-App-in-Scala-Standalone-mode-issues-tp6493p6514.html
> Sent from the Apache Spark User List mailing list archive at Nabble.com.
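For anyone else following this thread, here is roughly how I applied the quoted suggestions on my cluster (a sketch only; the host names below are placeholders for my own machines):

    # conf/spark-env.sh -- have the master advertise its fully qualified name
    export SPARK_MASTER_IP=$(hostname -f)

    # conf/slaves -- one fully qualified worker host name per line
    worker1.example.com
    worker2.example.com

With that in place, the workers register against spark://<master-fqdn>:7077 (7077 being the default standalone master port).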