[ https://issues.apache.org/jira/browse/SPARK-11509?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14994616#comment-14994616 ]
Andrew Davidson commented on SPARK-11509:
-----------------------------------------

Okay, after a couple of days of hacking it looks like my test program is working. Here is my recipe (I hope this helps others).

My test program is now:

In [1]: from pyspark import SparkContext
        textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")

In [2]: print("hello world")
hello world

In [3]: textFile.take(3)
Out[3]: [' hello world', '']

Installation instructions

1. Ssh to the cluster master.

2. Sudo su.

3. Install python3.4 on all machines:
```
yum install -y python34
bash-4.2# which python3
/usr/bin/python3
pssh -h /root/spark-ec2/slaves yum install -y python34
```

4. Install pip on all machines:
```
yum list available | grep pip
yum install -y python34-pip
find /usr/bin -name "*pip*" -print
/usr/bin/pip-3.4
pssh -h /root/spark-ec2/slaves yum install -y python34-pip
```

5. Install ipython on the master and slaves:
```
/usr/bin/pip-3.4 install ipython
pssh -h /root/spark-ec2/slaves /usr/bin/pip-3.4 install ipython
```

6. Install the python development packages and Jupyter on the master:
```
yum install -y python34-devel
/usr/bin/pip-3.4 install jupyter
```

7. Update spark-env.sh on all machines so python3.4 is used by default:
```
cd /root/spark/conf
printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=python3.4\n" >> /root/spark/conf/spark-env.sh
for i in `cat slaves` ; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
```

8. Restart the cluster:
```
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh
```

Running the ipython notebook

1. Set up an ssh tunnel on your local machine:
```
ssh -i $KEY_FILE -N -f -L localhost:8888:localhost:7000 ec2-user@$SPARK_MASTER
```

2. Log on to the cluster master and start the ipython notebook server:
```
export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000"
$SPARK_ROOT/bin/pyspark --master local[2]
```

3. On your local machine open http://localhost:8888
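As a quick sanity check that the recipe actually took effect, you can compare the driver's Python version against what the workers report from inside a notebook. This is a minimal sketch of my own, not part of the recipe above; it assumes the notebook already has a live SparkContext bound to `sc` and uses only the standard `platform` module:

```
import platform

# Python version the driver itself is running.
driver_version = platform.python_version()

# Run a tiny job with more partitions than cores so every worker that picks
# up a task reports its Python version.
worker_versions = set(
    sc.parallelize(range(16), 16)
      .map(lambda _: platform.python_version())
      .collect())

print("driver :", driver_version)
print("workers:", worker_versions)
# If the setup above worked, both sides report 3.4.x and PySpark no longer
# raises the "different minor versions" exception.
```

If the worker set contains anything other than the driver's version, re-check step 7 (spark-env.sh must be pushed to every slave) before restarting the cluster.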
> ipython notebooks do not work on clusters created using the
> spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 script
> ------------------------------------------------------------
>
>                 Key: SPARK-11509
>                 URL: https://issues.apache.org/jira/browse/SPARK-11509
>             Project: Spark
>          Issue Type: Bug
>          Components: Documentation, EC2, PySpark
>    Affects Versions: 1.5.1
>         Environment: AWS cluster
> [ec2-user@ip-172-31-29-60 ~]$ uname -a
> Linux ip-172-31-29-60.us-west-1.compute.internal 3.4.37-40.44.amzn1.x86_64 #1 SMP Thu Mar 21 01:17:08 UTC 2013 x86_64 x86_64 x86_64 GNU/Linux
>            Reporter: Andrew Davidson
>
> I recently downloaded spark-1.5.1-bin-hadoop2.6 to my local Mac.
> I used spark-1.5.1-bin-hadoop2.6/ec2/spark-ec2 to create an AWS cluster. I am able to run the Java SparkPi example on the cluster; however, I am not able to run ipython notebooks on the cluster. (I connect using an ssh tunnel.)
> According to the 1.5.1 getting started doc
> http://spark.apache.org/docs/latest/programming-guide.html#using-the-shell
> the following should work:
>
> PYSPARK_DRIVER_PYTHON=ipython PYSPARK_DRIVER_PYTHON_OPTS="notebook --no-browser --port=7000" /root/spark/bin/pyspark
>
> I am able to connect to the notebook server and start a notebook; however:
>
> Bug 1) The default SparkContext does not exist:
>
> from pyspark import SparkContext
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
>
> ---------------------------------------------------------------------------
> NameError                                 Traceback (most recent call last)
> <ipython-input-1-127b6a58d5cc> in <module>()
>       1 from pyspark import SparkContext
> ----> 2 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
>       3 textFile.take(3)
> NameError: name 'sc' is not defined
>
> Bug 2) If I create a SparkContext, I get the following Python version mismatch error:
>
> sc = SparkContext("local", "Simple App")
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
>
> File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
>     ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions
>
> I am able to run ipython notebooks on my local Mac as follows. (By default you would get an error that the driver and workers are using different versions of Python.)
>
> $ cat ~/bin/pySparkNotebook.sh
> #!/bin/sh
> set -x # turn debugging on
> #set +x # turn debugging off
> export PYSPARK_PYTHON=python3
> export PYSPARK_DRIVER_PYTHON=python3
> IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark $*
>
> I have spent a lot of time trying to debug the pyspark script, but I cannot figure out what the problem is.
> Please let me know if there is something I can do to help.
> Andy
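For anyone landing here with bug 1: a workaround I have used (a sketch, not an official fix) is to pin the worker Python and create the context yourself at the top of the notebook. It assumes python3.4 is installed on every node, as in the recipe, and that the variable is set before the first SparkContext starts, since the gateway and its workers inherit the driver process environment:

```
import os

# Must be set before the SparkContext launches its workers; otherwise the
# workers fall back to the system `python` and you hit the "different minor
# versions" exception from bug 2.
os.environ["PYSPARK_PYTHON"] = "python3.4"

from pyspark import SparkConf, SparkContext

conf = SparkConf().setAppName("notebook").setMaster("local[2]")
sc = SparkContext(conf=conf)

textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
print(textFile.take(3))
```

Setting PYSPARK_PYTHON in spark-env.sh (step 7 of the recipe) is the cleaner fix, since it applies to every job rather than one notebook.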