I'm trying to set up Toree as follows:

```
CLUSTER_NAME=$(curl -s -k -u $BI_USER:$BI_PASS -X GET https://${BI_HOST}:9443/api/v1/clusters | python -c 'import sys, json; print(json.load(sys.stdin)["items"][0]["Clusters"]["cluster_name"]);')
echo Cluster Name: $CLUSTER_NAME

CLUSTER_HOSTS=$(curl -s -k -u $BI_USER:$BI_PASS -X GET https://${BI_HOST}:9443/api/v1/clusters/${CLUSTER_NAME}/hosts | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [ item["Hosts"]["host_name"] for item in items ]; print(" ".join(hosts));')
echo Cluster Hosts: $CLUSTER_HOSTS

wget -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh

# Install Anaconda if it isn't already installed
[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b

# Check that toree is available; if not, install it
./anaconda2/bin/python -c 'import toree' || ./anaconda2/bin/pip install toree

# Install Toree
./anaconda2/bin/jupyter toree install \
   --spark_home=/usr/iop/current/spark-client/ \
   --user --interpreters Scala,PySpark,SparkR \
   --spark_opts="--master yarn" \
   --python_exec=${HOME}/anaconda2/bin/python2.7

# Install Anaconda on all of the cluster nodes
for CLUSTER_HOST in ${CLUSTER_HOSTS}; do
   if [[ "$CLUSTER_HOST" != "$BI_HOST" ]]; then
      echo "*** Processing $CLUSTER_HOST ***"
      ssh $BI_USER@$CLUSTER_HOST "wget -q -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh"
      ssh $BI_USER@$CLUSTER_HOST "[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b"

      # You can install your pip modules on each node using something like this:
      # ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/bin/pip install yourlibrary"
   fi
done

echo 'Finished installing'
```

However, when I try to run a PySpark job I get the following error:

```
Name: org.apache.toree.interpreter.broker.BrokerException
Message: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
: org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, bi4c-xxxx-data-3.bi.services.bluemix.net): org.apache.spark.SparkException:
Error from python worker:
  /home/biadmin/anaconda2/bin/python2.7: No module named pyspark
PYTHONPATH was:
  /disk3/local/filecache/103/spark-assembly.jar
java.io.EOFException
```

Any ideas what is going wrong?
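If it helps narrow things down: the traceback shows the worker's `PYTHONPATH` pointing only at the `spark-assembly.jar` file, and the Anaconda python apparently can't resolve `pyspark` from there. The general failure mode is easy to reproduce locally (nothing Spark-specific here, and `fakemod` is a made-up module standing in for `pyspark`): an import only succeeds when the module's directory is actually on `PYTHONPATH`.

```python
import os
import subprocess
import sys
import tempfile

# Create a throwaway module in a temp directory.
tmp = tempfile.mkdtemp()
with open(os.path.join(tmp, "fakemod.py"), "w") as f:
    f.write("VALUE = 42\n")

# Without the directory on PYTHONPATH the import fails --
# the same "No module named ..." error the worker reports.
r1 = subprocess.run([sys.executable, "-c", "import fakemod"],
                    capture_output=True,
                    env={**os.environ, "PYTHONPATH": ""})

# With the directory on PYTHONPATH the import succeeds.
r2 = subprocess.run([sys.executable, "-c", "import fakemod; print(fakemod.VALUE)"],
                    capture_output=True, text=True,
                    env={**os.environ, "PYTHONPATH": tmp})

print("missing path fails:", r1.returncode != 0)
print("with path:", r2.stdout.strip())
```

So my suspicion is that whatever the Spark worker puts on `PYTHONPATH` for the Anaconda interpreter doesn't include the pyspark sources, but I don't know why.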