Hi Chip,

Thanks for the response.
Is this a defect with Toree, or have I misconfigured something?

Many thanks,
Chris

On 15 December 2016 at 19:14, Chip Senkbeil <chip.senkb...@gmail.com> wrote:
> It's showing your PYTHONPATH as
> /disk3/local/filecache/103/spark-assembly.jar. Toree is looking for pyspark
> on your PYTHONPATH.
>
> https://github.com/apache/incubator-toree/blob/master/pyspark-interpreter/src/main/scala/org/apache/toree/kernel/interpreter/pyspark/PySparkProcess.scala#L78
>
> That code shows us augmenting the existing PYTHONPATH to include
> $SPARK_HOME/python/, which is where we search for your pyspark distribution.
>
> Your PYTHONPATH isn't even showing us adding $SPARK_HOME/python/, which is
> also troubling.
>
> On Wed, Dec 14, 2016 at 9:41 AM chris snow <chsnow...@gmail.com> wrote:
> >
> > I'm trying to set up Toree as follows:
> >
> > CLUSTER_NAME=$(curl -s -k -u $BI_USER:$BI_PASS -X GET \
> >   https://${BI_HOST}:9443/api/v1/clusters \
> >   | python -c 'import sys, json; print(json.load(sys.stdin)["items"][0]["Clusters"]["cluster_name"]);')
> > echo Cluster Name: $CLUSTER_NAME
> >
> > CLUSTER_HOSTS=$(curl -s -k -u $BI_USER:$BI_PASS -X GET \
> >   https://${BI_HOST}:9443/api/v1/clusters/${CLUSTER_NAME}/hosts \
> >   | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [item["Hosts"]["host_name"] for item in items]; print(" ".join(hosts));')
> > echo Cluster Hosts: $CLUSTER_HOSTS
> >
> > wget -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
> >
> > # Install Anaconda if it isn't already installed
> > [[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b
> >
> > # Check that toree is available; if not, install it
> > ./anaconda2/bin/python -c 'import toree' || ./anaconda2/bin/pip install toree
> >
> > # Install toree
> > ./anaconda2/bin/jupyter toree install \
> >   --spark_home=/usr/iop/current/spark-client/ \
> >   --user --interpreters Scala,PySpark,SparkR \
> >   --spark_opts="--master yarn" \
> >   --python_exec=${HOME}/anaconda2/bin/python2.7
> >
> > # Install Anaconda on all of the cluster nodes
> > for CLUSTER_HOST in ${CLUSTER_HOSTS};
> > do
> >   if [[ "$CLUSTER_HOST" != "$BI_HOST" ]];
> >   then
> >     echo "*** Processing $CLUSTER_HOST ***"
> >     ssh $BI_USER@$CLUSTER_HOST "wget -q -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh"
> >     ssh $BI_USER@$CLUSTER_HOST "[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b"
> >
> >     # You can install your pip modules on each node using something like this:
> >     # ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/bin/pip install yourlibrary"
> >   fi
> > done
> >
> > echo 'Finished installing'
> >
> > However, when I try to run a pyspark job I get the following error:
> >
> > Name: org.apache.toree.interpreter.broker.BrokerException
> > Message: Py4JJavaError: An error occurred while calling
> > z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> > : org.apache.spark.SparkException: Job aborted due to stage failure:
> > Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in
> > stage 0.0 (TID 6, bi4c-xxxx-data-3.bi.services.bluemix.net):
> > org.apache.spark.SparkException:
> > Error from python worker:
> >   /home/biadmin/anaconda2/bin/python2.7: No module named pyspark
> > PYTHONPATH was:
> >   /disk3/local/filecache/103/spark-assembly.jar
> > java.io.EOFException
> >
> > Any ideas what is going wrong?
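For anyone hitting the same "No module named pyspark" worker error, the sketch below shows roughly what the PYTHONPATH augmentation Chip describes should produce, and a quick check that the pyspark sources actually exist under $SPARK_HOME. The `augment_pythonpath` helper and the default SPARK_HOME value are illustrative assumptions for this thread, not code taken from Toree:

```shell
#!/bin/sh
# Sketch of the PYTHONPATH augmentation Toree performs (see the
# PySparkProcess.scala link in the thread), assuming SPARK_HOME points
# at the Spark client install. Helper name and default path below are
# illustrative assumptions.

# Prepend $SPARK_HOME/python to an existing PYTHONPATH.
augment_pythonpath() {
    # $1 = SPARK_HOME, $2 = existing PYTHONPATH (may be empty)
    printf '%s/python:%s\n' "$1" "$2"
}

SPARK_HOME="${SPARK_HOME:-/usr/iop/current/spark-client}"
echo "Driver PYTHONPATH should include: $(augment_pythonpath "$SPARK_HOME" "$PYTHONPATH")"

# If the worker error persists, confirm the pyspark sources actually
# exist under $SPARK_HOME/python on every node, not just the driver:
if [ -d "$SPARK_HOME/python/pyspark" ]; then
    echo "pyspark sources present under $SPARK_HOME/python"
else
    echo "pyspark sources missing under $SPARK_HOME/python" >&2
fi
```

Running this on each cluster node would show whether the failing workers even have a pyspark distribution to pick up; if they don't, no PYTHONPATH setting on the driver will fix the executors.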