I'm trying to set up Apache Toree as follows:

    CLUSTER_NAME=$(curl -s -k -u $BI_USER:$BI_PASS -X GET "https://${BI_HOST}:9443/api/v1/clusters" \
        | python -c 'import sys, json; print(json.load(sys.stdin)["items"][0]["Clusters"]["cluster_name"])')
    echo Cluster Name: $CLUSTER_NAME

    CLUSTER_HOSTS=$(curl -s -k -u $BI_USER:$BI_PASS -X GET "https://${BI_HOST}:9443/api/v1/clusters/${CLUSTER_NAME}/hosts" \
        | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [item["Hosts"]["host_name"] for item in items]; print("\n".join(hosts))')
    echo Cluster Hosts: $CLUSTER_HOSTS
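
    # (Sanity check, purely illustrative and not part of the original steps:
    # the two one-liners above assume an Ambari-style JSON response shape.
    # A stub payload can verify the parsing offline, no cluster needed.)
    echo '{"items": [{"Hosts": {"host_name": "node1"}}, {"Hosts": {"host_name": "node2"}}]}' \
        | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [item["Hosts"]["host_name"] for item in items]; print("\n".join(hosts))'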

    wget -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh

    # Install Anaconda if it isn't already installed
    [[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b

    # Check that the toree package is importable; if not, install it
    ./anaconda2/bin/python -c 'import toree' || ./anaconda2/bin/pip install toree

    # Register the Toree kernels with Jupyter
    ./anaconda2/bin/jupyter toree install \
            --spark_home=/usr/iop/current/spark-client/ \
            --user --interpreters Scala,PySpark,SparkR  \
            --spark_opts="--master yarn" \
            --python_exec=${HOME}/anaconda2/bin/python2.7
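
    # (Sanity check, not from the original steps: list the registered
    # kernels; the exact Toree kernel names vary by version.)
    ./anaconda2/bin/jupyter kernelspec list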

    # Install Anaconda on all of the other cluster nodes
    for CLUSTER_HOST in ${CLUSTER_HOSTS};
    do
       if [[ "$CLUSTER_HOST" != "$BI_HOST" ]];
       then
          echo "*** Processing $CLUSTER_HOST ***"
          ssh $BI_USER@$CLUSTER_HOST "wget -q -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh"
          ssh $BI_USER@$CLUSTER_HOST "[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b"

          # You can install your pip modules on each node with something like:
          # ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/bin/pip install yourlibrary"
       fi
    done
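
    # (Optional sanity check, a sketch: confirm every node now has the same
    # Anaconda python before relying on it from YARN.)
    for CLUSTER_HOST in ${CLUSTER_HOSTS};
    do
       ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python --version"
    done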

    echo 'Finished installing'

However, when I try to run a PySpark job, I get the following error:

    Name: org.apache.toree.interpreter.broker.BrokerException
    Message: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
    : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, bi4c-xxxx-data-3.bi.services.bluemix.net): org.apache.spark.SparkException:
    Error from python worker:
      /home/biadmin/anaconda2/bin/python2.7: No module named pyspark
    PYTHONPATH was:
      /disk3/local/filecache/103/spark-assembly.jar
    java.io.EOFException
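
To narrow this down, a check along these lines might show whether the worker's Python can import pyspark under that PYTHONPATH (an untested sketch; the host and paths are copied verbatim from the error above):

    ssh $BI_USER@bi4c-xxxx-data-3.bi.services.bluemix.net \
        "PYTHONPATH=/disk3/local/filecache/103/spark-assembly.jar \
         /home/biadmin/anaconda2/bin/python2.7 -c 'import pyspark'"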

Any ideas what is going wrong?
