Hi Chip,

Thanks for the response.

Is this a defect in Toree, or have I misconfigured something?
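
For reference, this is how I have been checking what the installer actually
recorded for the PySpark kernel (the kernel directory name below is my best
guess for this Toree version, so it may differ):

    # List the kernel specs that jupyter can see for this anaconda install
    ${HOME}/anaconda2/bin/jupyter kernelspec list

    # Inspect what the --user install wrote (SPARK_HOME, python executable,
    # spark opts); the apache_toree_pyspark directory name is an assumption
    cat ${HOME}/.local/share/jupyter/kernels/apache_toree_pyspark/kernel.json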

Many thanks,

Chris

On 15 December 2016 at 19:14, Chip Senkbeil <chip.senkb...@gmail.com> wrote:

> It's showing your PYTHONPATH as /disk3/local/filecache/103/spark-assembly.jar.
> Toree is looking for pyspark on your PYTHONPATH.
>
> https://github.com/apache/incubator-toree/blob/master/pyspark-interpreter/src/main/scala/org/apache/toree/kernel/interpreter/pyspark/PySparkProcess.scala#L78
>
> That code shows us augmenting the existing PYTHONPATH to include
> $SPARK_HOME/python/, which is where we search for your pyspark distribution.
>
> Your PYTHONPATH isn't even showing the $SPARK_HOME/python/ entry being
> added, which is also troubling.
>
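If I'm reading that PySparkProcess code right, the augmentation should be
roughly equivalent to the sketch below on the node where the kernel runs
(my own approximation; the py4j zip entry is an assumption on my part and
is not shown in the Toree snippet):

    # Rebuild the PYTHONPATH the way I understand Toree builds it
    export SPARK_HOME=/usr/iop/current/spark-client
    PY4J_ZIP=$(ls ${SPARK_HOME}/python/lib/py4j-*-src.zip 2>/dev/null | head -1)
    export PYTHONPATH="${SPARK_HOME}/python:${PY4J_ZIP}:${PYTHONPATH}"

    # Should print the pyspark package location rather than an ImportError
    ${HOME}/anaconda2/bin/python2.7 -c 'import pyspark; print(pyspark.__file__)'
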
> On Wed, Dec 14, 2016 at 9:41 AM chris snow <chsnow...@gmail.com> wrote:
>
> > I'm trying to setup toree as follows:
> >
> >     CLUSTER_NAME=$(curl -s -k -u $BI_USER:$BI_PASS -X GET https://${BI_HOST}:9443/api/v1/clusters | python -c 'import sys, json; print(json.load(sys.stdin)["items"][0]["Clusters"]["cluster_name"]);')
> >     echo Cluster Name: $CLUSTER_NAME
> >
> >     CLUSTER_HOSTS=$(curl -s -k -u $BI_USER:$BI_PASS -X GET https://${BI_HOST}:9443/api/v1/clusters/${CLUSTER_NAME}/hosts | python -c 'import sys, json; items = json.load(sys.stdin)["items"]; hosts = [ item["Hosts"]["host_name"] for item in items ]; print(" ".join(hosts));')
> >     echo Cluster Hosts: $CLUSTER_HOSTS
> >
> >     wget -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh
> >
> >     # Install anaconda if it isn't already installed
> >     [[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b
> >
> >     # check toree is available, if not install it
> >     ./anaconda2/bin/python -c 'import toree' || ./anaconda2/bin/pip install toree
> >
> >     # Install toree
> >     ./anaconda2/bin/jupyter toree install \
> >             --spark_home=/usr/iop/current/spark-client/ \
> >             --user --interpreters Scala,PySpark,SparkR  \
> >             --spark_opts="--master yarn" \
> >             --python_exec=${HOME}/anaconda2/bin/python2.7
> >
> >     # Install anaconda on all of the cluster nodes
> >     for CLUSTER_HOST in ${CLUSTER_HOSTS};
> >     do
> >        if [[ "$CLUSTER_HOST" != "$BI_HOST" ]];
> >        then
> >           echo "*** Processing $CLUSTER_HOST ***"
> >           ssh $BI_USER@$CLUSTER_HOST "wget -q -c https://repo.continuum.io/archive/Anaconda2-4.1.1-Linux-x86_64.sh"
> >           ssh $BI_USER@$CLUSTER_HOST "[[ -d anaconda2 ]] || bash Anaconda2-4.1.1-Linux-x86_64.sh -b"
> >
> >           # You can install your pip modules on each node using something like this:
> >           # ssh $BI_USER@$CLUSTER_HOST "${HOME}/anaconda2/bin/python -c 'import yourlibrary' || ${HOME}/anaconda2/bin/pip install yourlibrary"
> >        fi
> >     done
> >
> >     echo 'Finished installing'
> >
> > However, when I try to run a pyspark job I get the following error:
> >
> >     Name: org.apache.toree.interpreter.broker.BrokerException
> >     Message: Py4JJavaError: An error occurred while calling z:org.apache.spark.api.python.PythonRDD.collectAndServe.
> >     : org.apache.spark.SparkException: Job aborted due to stage failure: Task 1 in stage 0.0 failed 4 times, most recent failure: Lost task 1.3 in stage 0.0 (TID 6, bi4c-xxxx-data-3.bi.services.bluemix.net): org.apache.spark.SparkException:
> >     Error from python worker:
> >       /home/biadmin/anaconda2/bin/python2.7: No module named pyspark
> >     PYTHONPATH was:
> >       /disk3/local/filecache/103/spark-assembly.jar
> >     java.io.EOFException
> >
> > Any ideas what is going wrong?
> >
>
