What a BEAR! The following recipe worked for me (it took a couple of days of hacking).

I hope this improves the out-of-the-box experience for others.

Andy

My test program is now:

In [1]:
from pyspark import SparkContext
textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
In [2]:
print("hello world²)

hello world
In [3]:
textFile.take(3)

Out[3]:
[' hello world', '']


Installation instructions

1. ssh to the cluster master
2. sudo su to become root (both commands are sketched below)
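
For reference, steps 1 and 2 look roughly like this, assuming the same $KEY_FILE and $SPARK_MASTER variables used for the ssh tunnel later on:

```
ssh -i $KEY_FILE ec2-user@$SPARK_MASTER
sudo su
```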
3. install python3.4 on all machines
```
# on the master
yum install -y python34
which python3
# /usr/bin/python3

# on all the slaves
pssh -h /root/spark-ec2/slaves yum install -y python34
```
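
Optionally, verify that every slave picked up python3 (this assumes pssh's -i flag for inline output):

```
pssh -i -h /root/spark-ec2/slaves "python3 --version"
```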

4. Install pip on all machines


```
# find and install the python3.4 pip package on the master
yum list available | grep pip
yum install -y python34-pip

# confirm where pip was installed
find /usr/bin -name "*pip*" -print
# /usr/bin/pip-3.4

# install pip on all the slaves
pssh -h /root/spark-ec2/slaves yum install -y python34-pip
```

5. Install ipython

```
/usr/bin/pip-3.4 install ipython

# mirror the install on the slaves
pssh -h /root/spark-ec2/slaves /usr/bin/pip-3.4 install ipython
```

6. Install the python3.4 development packages and jupyter on the master
```
yum install -y python34-devel
/usr/bin/pip-3.4 install jupyter
```
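
Optionally, confirm that jupyter and ipython landed under the python3.4 interpreter (just a sanity-check sketch using pip show and IPython.__version__):

```
/usr/bin/pip-3.4 show jupyter
python3 -c "import IPython; print(IPython.__version__)"
```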

7. Update spark-env.sh on all machines so python3.4 is used by default

```
cd /root/spark/conf
printf "\n# Set Spark Python version\nexport PYSPARK_PYTHON=python3.4\n" >> /root/spark/conf/spark-env.sh

# copy the updated spark-env.sh to every slave
for i in `cat slaves`; do scp spark-env.sh root@$i:/root/spark/conf/spark-env.sh; done
```
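
A quick way to confirm the setting reached every slave (again assuming pssh is available):

```
pssh -i -h /root/spark-ec2/slaves "grep PYSPARK_PYTHON /root/spark/conf/spark-env.sh"
```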

8. Restart cluster

```
/root/spark/sbin/stop-all.sh
/root/spark/sbin/start-all.sh
```

Running ipython notebook

1. Set up an ssh tunnel on your local machine
   ssh -i $KEY_FILE -N -f -L localhost:8888:localhost:7000 ec2-user@$SPARK_MASTER

2. Log on to the cluster master and start the ipython notebook server

```
export PYSPARK_PYTHON=python3.4
export PYSPARK_DRIVER_PYTHON=python3.4
export IPYTHON_OPTS="notebook --no-browser --port=7000"

# the notebook server listens on port 7000, which the ssh tunnel maps to local port 8888
$SPARK_ROOT/bin/pyspark --master local[2]
```

3. On your local machine, open http://localhost:8888 in a browser
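
As a quick sanity check in the first notebook cell, you can confirm that the driver and the workers now agree on the Python version (a small sketch; it assumes pyspark has already created sc, as in the test program above):

```
import sys

# driver-side Python version
print("driver: %d.%d" % sys.version_info[:2])

# worker-side Python versions, collected from a tiny throwaway job
versions = (sc.parallelize(range(4), 2)
              .map(lambda _: "%d.%d" % sys.version_info[:2])
              .distinct()
              .collect())
print("workers:", versions)
```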



From:  Andrew Davidson <a...@santacruzintegration.com>
Date:  Friday, November 6, 2015 at 2:18 PM
To:  "user @spark" <user@spark.apache.org>
Subject:  bug: can not run Ipython notebook on cluster

> Does anyone use iPython notebooks?
> 
> I am able to use it on my local machine with Spark, however I cannot get it to
> work on my cluster.
> 
> 
> For some unknown reason, on my cluster I have to manually create the spark context.
> My test code generated this exception:
> 
> Exception: Python in worker has different version 2.7 than that in driver 2.6,
> PySpark cannot run with different minor versions
> 
> On my mac I can solve the exception problem by setting
> 
> export PYSPARK_PYTHON=python3
> 
> export PYSPARK_DRIVER_PYTHON=python3
> 
> IPYTHON_OPTS=notebook $SPARK_ROOT/bin/pyspark
> 
> 
> 
> On my cluster I set the values to python2.7, and IPYTHON_OPTS="notebook
> --no-browser --port=7000". I connect using an ssh tunnel from my local machine.
> 
> 
> 
> I also tried installing python 3, pip, ipython, and jupyter on my cluster.
> 
> 
> 
> I tried adding export PYSPARK_PYTHON=python2.7 to the
> /root/spark/conf/spark-env.sh on all my machines
> 
> 
> from pyspark import SparkContext
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
> 
> 
> In [1]:
> from pyspark import SparkContext
> sc = SparkContext("local", "Simple App")
> textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> textFile.take(3)
> ---------------------------------------------------------------------------
> Py4JJavaError                             Traceback (most recent call last)
> <ipython-input-1-e0006b323300> in <module>()
>       2 sc = SparkContext("local", "Simple App")
>       3 textFile = sc.textFile("file:///home/ec2-user/dataScience/readme.md")
> ----> 4 textFile.take(3)
> 
> /root/spark/python/pyspark/rdd.py in take(self, num)
>    1297 
>    1298             p = range(partsScanned, min(partsScanned + numPartsToTry, totalParts))
> -> 1299             res = self.context.runJob(self, takeUpToNumLeft, p)
>    1300 
>    1301             items += res
> 
> /root/spark/python/pyspark/context.py in runJob(self, rdd, partitionFunc, partitions, allowLocal)
>     914         # SparkContext#runJob.
>     915         mappedRDD = rdd.mapPartitions(partitionFunc)
> --> 916         port = self._jvm.PythonRDD.runJob(self._jsc.sc(), mappedRDD._jrdd, partitions)
>     917         return list(_load_from_socket(port, mappedRDD._jrdd_deserializer))
>     918 
> 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/java_gateway.py in __call__(self, *args)
>     536         answer = self.gateway_client.send_command(command)
>     537         return_value = get_return_value(answer, self.gateway_client,
> --> 538                 self.target_id, self.name)
>     539 
>     540         for temp_arg in temp_args:
> 
> /root/spark/python/lib/py4j-0.8.2.1-src.zip/py4j/protocol.py in get_return_value(answer, gateway_client, target_id, name)
>     298                 raise Py4JJavaError(
>     299                     'An error occurred while calling {0}{1}{2}.\n'.
> --> 300                     format(target_id, '.', name), value)
>     301             else:
>     302                 raise Py4JError(
> 
> Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.runJob.
> : org.apache.spark.SparkException: Job aborted due to stage failure: Task 0 in stage 0.0 failed 1 times, most recent failure: Lost task 0.0 in stage 0.0 (TID 0, localhost): org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
>     ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions
> 
>       at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>       at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>       at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>       at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>       at org.apache.spark.scheduler.Task.run(Task.scala:88)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       at java.lang.Thread.run(Thread.java:745)
> 
> Driver stacktrace:
>       at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1283)
>       at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1271)
>       at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1270)
>       at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
>       at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
>       at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1270)
>       at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>       at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:697)
>       at scala.Option.foreach(Option.scala:236)
>       at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:697)
>       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.doOnReceive(DAGScheduler.scala:1496)
>       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1458)
>       at org.apache.spark.scheduler.DAGSchedulerEventProcessLoop.onReceive(DAGScheduler.scala:1447)
>       at org.apache.spark.util.EventLoop$$anon$1.run(EventLoop.scala:48)
>       at org.apache.spark.scheduler.DAGScheduler.runJob(DAGScheduler.scala:567)
>       at org.apache.spark.SparkContext.runJob(SparkContext.scala:1822)
>       at org.apache.spark.SparkContext.runJob(SparkContext.scala:1835)
>       at org.apache.spark.SparkContext.runJob(SparkContext.scala:1848)
>       at org.apache.spark.api.python.PythonRDD$.runJob(PythonRDD.scala:393)
>       at org.apache.spark.api.python.PythonRDD.runJob(PythonRDD.scala)
>       at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
>       at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
>       at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
>       at java.lang.reflect.Method.invoke(Method.java:497)
>       at py4j.reflection.MethodInvoker.invoke(MethodInvoker.java:231)
>       at py4j.reflection.ReflectionEngine.invoke(ReflectionEngine.java:379)
>       at py4j.Gateway.invoke(Gateway.java:259)
>       at py4j.commands.AbstractCommand.invokeMethod(AbstractCommand.java:133)
>       at py4j.commands.CallCommand.execute(CallCommand.java:79)
>       at py4j.GatewayConnection.run(GatewayConnection.java:207)
>       at java.lang.Thread.run(Thread.java:745)
> Caused by: org.apache.spark.api.python.PythonException: Traceback (most recent call last):
>   File "/root/spark/python/lib/pyspark.zip/pyspark/worker.py", line 64, in main
>     ("%d.%d" % sys.version_info[:2], version))
> Exception: Python in worker has different version 2.7 than that in driver 2.6, PySpark cannot run with different minor versions
> 
>       at org.apache.spark.api.python.PythonRunner$$anon$1.read(PythonRDD.scala:166)
>       at org.apache.spark.api.python.PythonRunner$$anon$1.<init>(PythonRDD.scala:207)
>       at org.apache.spark.api.python.PythonRunner.compute(PythonRDD.scala:125)
>       at org.apache.spark.api.python.PythonRDD.compute(PythonRDD.scala:70)
>       at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:297)
>       at org.apache.spark.rdd.RDD.iterator(RDD.scala:264)
>       at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:66)
>       at org.apache.spark.scheduler.Task.run(Task.scala:88)
>       at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:214)
>       at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>       at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>       ... 1 more
> 

