I still cannot reproduce the problem. Let me try a little more and update here.
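In the meantime, one common cause of "Python worker exited unexpectedly (crashed)" followed by a java.io.EOFException is a mismatch between the Python environment on the driver and on the YARN executors. A quick check you could run in the notebook (just a sketch; it assumes the sc that Zeppelin provides):

%pyspark
import sys

# Python version used by the Zeppelin/PySpark driver
print "driver:    %s" % sys.version

# Distinct Python versions reported by the YARN executors
def worker_version(_):
    import sys
    return sys.version

print "executors: %s" % sc.parallelize(range(8), 8).map(worker_version).distinct().collect()

If the versions differ, pointing PYSPARK_PYTHON at the same interpreter on all nodes may help.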
Thanks,
moon

On Mon, Jul 13, 2015 at 2:50 PM Chad Timmins <ctimm...@trulia.com> wrote:

> I already export SPARK_HOME in my .bashrc, and I confirmed it is
> /home/hadoop/spark in the Zeppelin notebook.
>
> I configure Zeppelin using the following script (almost identical to a
> gist another user posted):
>
> # Install Zeppelin
> git clone https://github.com/apache/incubator-zeppelin.git /home/hadoop/zeppelin
> cd /home/hadoop/zeppelin
> mvn clean package -Pspark-1.3 -Dhadoop.version=2.4.0 -Phadoop-2.4 -Pyarn -DskipTests
>
> # Configure Zeppelin: copy selected spark-defaults.conf properties
> # into ZEPPELIN_JAVA_OPTS as -D options
> SPARK_DEFAULTS=/home/hadoop/spark/conf/spark-defaults.conf
>
> declare -a ZEPPELIN_JAVA_OPTS
> if [ -f $SPARK_DEFAULTS ]; then
>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>     $(grep spark.executor.instances $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>     $(grep spark.executor.cores $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>     $(grep spark.executor.memory $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>     $(grep spark.default.parallelism $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
> fi
> echo "${ZEPPELIN_JAVA_OPTS[@]}"
>
> cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh
> cat <<EOF >> conf/zeppelin-env.sh
> export MASTER=yarn-client
> export HADOOP_CONF_DIR=$HADOOP_CONF_DIR
> export ZEPPELIN_SPARK_USEHIVECONTEXT=false
> export ZEPPELIN_JAVA_OPTS="${ZEPPELIN_JAVA_OPTS[@]}"
> EOF
>
> Thank you so much for helping.
>
> -Chad
>
> From: moon soo Lee <m...@apache.org>
> Reply-To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
> Date: Monday, July 13, 2015 at 12:25 PM
> To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
> Subject: Re: PySpark RDD method errors
>
> Could you try exporting the SPARK_HOME variable? Like:
>
> export SPARK_HOME=/home/hadoop/spark
>
> On Mon, Jul 13, 2015 at 10:55 AM Chad Timmins <ctimm...@trulia.com> wrote:
>
>> Hi,
>>
>> Thanks for the quick reply. I have set up my configuration for
>> Zeppelin exactly as you did, except for the port number. I had to add this to
>> zeppelin/conf/zeppelin-env.sh:
>>
>> export PYTHONPATH=$PYTHONPATH:/home/hadoop/spark/python
>>
>> Before the interpreter patch, my PYTHONPATH env variable looked like:
>>
>> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/pyspark.zip:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>>
>> AFTER the patch, PYTHONPATH looked like:
>>
>> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>>
>> I am still getting the same errors even after I removed the extra
>> Python path from conf/zeppelin-env.sh.
>> Currently my Zeppelin environment looks like:
>>
>> export MASTER=yarn-client
>> export HADOOP_CONF_DIR=/home/hadoop/conf
>> export ZEPPELIN_SPARK_USEHIVECONTEXT=false
>> export ZEPPELIN_JAVA_OPTS=""
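A side note on the PYTHONPATH above: after the patch, pyspark.zip no longer appears on the path, so the pyspark modules should be picked up from /home/hadoop/spark/python instead. A quick way to confirm which copies the notebook actually imports (just a sketch, run in a %pyspark paragraph):

%pyspark
import pyspark, py4j

# Show where the interpreter's Python process imports PySpark and Py4J from
print "pyspark: %s" % pyspark.__file__
print "py4j:    %s" % py4j.__file__

If these point somewhere other than /home/hadoop/spark, the notebook and the cluster could be running different PySpark copies.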
>>
>> From: moon soo Lee <m...@apache.org>
>> Reply-To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>> Date: Sunday, July 12, 2015 at 8:59 AM
>> To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>> Subject: Re: PySpark RDD method errors
>>
>> Hi,
>>
>> Thanks for sharing the problem.
>> I have tried with AWS EMR, and I could make all the code work without
>> error.
>>
>> I've set:
>>
>> export HADOOP_CONF_DIR=/home/hadoop/conf
>> export SPARK_HOME=/home/hadoop/spark
>> export ZEPPELIN_PORT=9090
>>
>> with 'yarn-client' for the master property.
>> export SPARK_HOME does not work correctly without this patch:
>> https://github.com/apache/incubator-zeppelin/pull/151
>>
>> Could you share your configuration of Zeppelin with the EMR cluster?
>>
>> Thanks,
>> moon
>>
>> On Thu, Jul 9, 2015 at 3:35 PM Chad Timmins <ctimm...@trulia.com> wrote:
>>
>>> Hi,
>>>
>>> When I run the filter() method on an RDD object and then try to print
>>> its results using collect(), I get a Py4JJavaError. It is not only filter()
>>> but other methods that cause similar errors, and I cannot figure out what
>>> is causing this. PySpark from the command line works fine, but it does not
>>> work in the Zeppelin notebook. My setup is an AWS EMR instance running
>>> Spark 1.3.1 on Amazon's Hadoop 2.4.0. I have included a snippet of code
>>> and the resulting error below. Thank you, and please let me know if you
>>> need any additional information.
>>>
>>> %pyspark
>>>
>>> nums = [1,2,3,4,5,6]
>>>
>>> rdd_nums = sc.parallelize(nums)
>>> rdd_sq = rdd_nums.map(lambda x: pow(x,2))
>>> rdd_cube = rdd_nums.map(lambda x: pow(x,3))
>>> rdd_odd = rdd_nums.filter(lambda x: x%2 == 1)
>>>
>>> print "nums: %s" % rdd_nums.collect()
>>> print "squares: %s" % rdd_sq.collect()
>>> print "cubes: %s" % rdd_cube.collect()
>>> print "odds: %s" % rdd_odd.collect()
>>>
>>> Py4JJavaError: An error occurred while calling
>>> z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>>> in stage 107.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>>> 107.0 (TID 263, ip-10-204-134-29.us-west-2.compute.internal):
>>> org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
>>> …
>>> …
>>> Caused by: java.io.EOFException at
>>> java.io.DataInputStream.readInt(DataInputStream.java:392) at
>>> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108)
>>> ... 10 more
>>>
>>> If I instead set rdd_odd = rdd_nums.filter(lambda x: x%2), I don't
>>> get an error.
>>>
>>> Thanks,
>>>
>>> Chad Timmins
>>>
>>> Software Engineer Intern at Trulia
>>> B.S. Electrical Engineering, UC Davis 2015
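P.S. About the last observation: for integer input, filter(lambda x: x%2) and filter(lambda x: x%2 == 1) keep exactly the same elements, because a nonzero int is truthy in Python. A plain Python 2 check, with no Spark involved:

nums = [1, 2, 3, 4, 5, 6]
# Both predicates keep the odd numbers: 1 is truthy, 0 is falsy.
print filter(lambda x: x % 2, nums)       # [1, 3, 5]
print filter(lambda x: x % 2 == 1, nums)  # [1, 3, 5]

Since the two forms are equivalent, the fact that one crashes in the notebook and the other does not suggests the failure is in the Python worker environment rather than in the predicate itself.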