I still cannot reproduce the problem. Let me try a little more and update here.
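In the meantime, one common cause of "Python worker exited unexpectedly (crashed)" followed by a java.io.EOFException is a mismatch between the Python environment on the driver and on the YARN executors. A quick check you could run in the notebook (just a sketch; it assumes the sc that Zeppelin provides):

%pyspark
import sys

# Python version used by the Zeppelin/PySpark driver
print "driver:    %s" % sys.version

# Distinct Python versions reported by the YARN executors
def worker_version(_):
    import sys
    return sys.version

print "executors: %s" % sc.parallelize(range(8), 8).map(worker_version).distinct().collect()

If the versions differ, pointing PYSPARK_PYTHON at the same interpreter on all nodes may help.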
Thanks,
moon

On Mon, Jul 13, 2015 at 2:50 PM Chad Timmins <ctimm...@trulia.com> wrote:

> I already export SPARK_HOME in my .bashrc, and I confirmed it is
> /home/hadoop/spark in the Zeppelin notebook.
>
> I configure Zeppelin using the following script (almost identical to a
> gist another user posted):
>
> # Install Zeppelin
> git clone https://github.com/apache/incubator-zeppelin.git /home/hadoop/zeppelin
> cd /home/hadoop/zeppelin
> mvn clean package -Pspark-1.3 -Dhadoop.version=2.4.0 -Phadoop-2.4 -Pyarn -DskipTests
>
> # Configure Zeppelin: copy selected spark-defaults.conf properties
> # into ZEPPELIN_JAVA_OPTS as -D options
> SPARK_DEFAULTS=/home/hadoop/spark/conf/spark-defaults.conf
>
> declare -a ZEPPELIN_JAVA_OPTS
> if [ -f $SPARK_DEFAULTS ]; then
>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>     $(grep spark.executor.instances $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>     $(grep spark.executor.cores $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>     $(grep spark.executor.memory $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
>   ZEPPELIN_JAVA_OPTS=("${ZEPPELIN_JAVA_OPTS[@]}" \
>     $(grep spark.default.parallelism $SPARK_DEFAULTS | awk '{print "-D" $1 "=" $2}'))
> fi
> echo "${ZEPPELIN_JAVA_OPTS[@]}"
>
> cp conf/zeppelin-env.sh.template conf/zeppelin-env.sh
> cat <<EOF >> conf/zeppelin-env.sh
> export MASTER=yarn-client
> export HADOOP_CONF_DIR=$HADOOP_CONF_DIR
> export ZEPPELIN_SPARK_USEHIVECONTEXT=false
> export ZEPPELIN_JAVA_OPTS="${ZEPPELIN_JAVA_OPTS[@]}"
> EOF
>
> Thank you so much for helping.
>
> -Chad
>
> From: moon soo Lee <m...@apache.org>
> Reply-To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
> Date: Monday, July 13, 2015 at 12:25 PM
> To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
> Subject: Re: PySpark RDD method errors
>
> Could you try exporting the SPARK_HOME variable? Like:
>
> export SPARK_HOME=/home/hadoop/spark
>
> On Mon, Jul 13, 2015 at 10:55 AM Chad Timmins <ctimm...@trulia.com> wrote:
>
>> Hi,
>>
>> Thanks for the quick reply. I have set up my configuration for
>> Zeppelin exactly as you did, except for the port number. I had to add this to
>> zeppelin/conf/zeppelin-env.sh:
>>
>> export PYTHONPATH=$PYTHONPATH:/home/hadoop/spark/python
>>
>> Before the interpreter patch, my PYTHONPATH env variable looked like:
>>
>> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/pyspark.zip:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>>
>> AFTER the patch, PYTHONPATH looked like:
>>
>> :/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python:/home/hadoop/spark/python/lib/py4j-0.8.2.1-src.zip
>>
>> I am still getting the same errors even after I removed the extra
>> Python path from conf/zeppelin-env.sh.
>> Currently my Zeppelin environment looks like:
>>
>> export MASTER=yarn-client
>> export HADOOP_CONF_DIR=/home/hadoop/conf
>> export ZEPPELIN_SPARK_USEHIVECONTEXT=false
>> export ZEPPELIN_JAVA_OPTS=""
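A side note on the PYTHONPATH above: after the patch, pyspark.zip no longer appears on the path, so the pyspark modules should be picked up from /home/hadoop/spark/python instead. A quick way to confirm which copies the notebook actually imports (just a sketch, run in a %pyspark paragraph):

%pyspark
import pyspark, py4j

# Show where the interpreter's Python process imports PySpark and Py4J from
print "pyspark: %s" % pyspark.__file__
print "py4j:    %s" % py4j.__file__

If these point somewhere other than /home/hadoop/spark, the notebook and the cluster could be running different PySpark copies.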
>>
>> From: moon soo Lee <m...@apache.org>
>> Reply-To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>> Date: Sunday, July 12, 2015 at 8:59 AM
>> To: "users@zeppelin.incubator.apache.org" <users@zeppelin.incubator.apache.org>
>> Subject: Re: PySpark RDD method errors
>>
>> Hi,
>>
>> Thanks for sharing the problem.
>> I have tried with AWS EMR, and I could make all the code work without
>> error.
>>
>> I've set:
>>
>> export HADOOP_CONF_DIR=/home/hadoop/conf
>> export SPARK_HOME=/home/hadoop/spark
>> export ZEPPELIN_PORT=9090
>>
>> with 'yarn-client' for the master property.
>> export SPARK_HOME does not work correctly without this patch:
>> https://github.com/apache/incubator-zeppelin/pull/151
>>
>> Could you share your configuration of Zeppelin with the EMR cluster?
>>
>> Thanks,
>> moon
>>
>> On Thu, Jul 9, 2015 at 3:35 PM Chad Timmins <ctimm...@trulia.com> wrote:
>>
>>> Hi,
>>>
>>> When I run the filter() method on an RDD object and then try to print
>>> its results using collect(), I get a Py4JJavaError. It is not only filter()
>>> but other methods that cause similar errors, and I cannot figure out what
>>> is causing this. PySpark from the command line works fine, but it does not
>>> work in the Zeppelin notebook. My setup is an AWS EMR instance running
>>> Spark 1.3.1 on Amazon's Hadoop 2.4.0. I have included a snippet of code
>>> and the resulting error below. Thank you, and please let me know if you
>>> need any additional information.
>>>
>>> %pyspark
>>>
>>> nums = [1,2,3,4,5,6]
>>>
>>> rdd_nums = sc.parallelize(nums)
>>> rdd_sq = rdd_nums.map(lambda x: pow(x,2))
>>> rdd_cube = rdd_nums.map(lambda x: pow(x,3))
>>> rdd_odd = rdd_nums.filter(lambda x: x%2 == 1)
>>>
>>> print "nums: %s" % rdd_nums.collect()
>>> print "squares: %s" % rdd_sq.collect()
>>> print "cubes: %s" % rdd_cube.collect()
>>> print "odds: %s" % rdd_odd.collect()
>>>
>>> Py4JJavaError: An error occurred while calling
>>> z:org.apache.spark.api.python.PythonRDD.collectAndServe. :
>>> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0
>>> in stage 107.0 failed 4 times, most recent failure: Lost task 0.3 in stage
>>> 107.0 (TID 263, ip-10-204-134-29.us-west-2.compute.internal):
>>> org.apache.spark.SparkException: Python worker exited unexpectedly (crashed)
>>> …
>>> …
>>> Caused by: java.io.EOFException at
>>> java.io.DataInputStream.readInt(DataInputStream.java:392) at
>>> org.apache.spark.api.python.PythonRDD$$anon$1.read(PythonRDD.scala:108)
>>> ... 10 more
>>>
>>> If I instead set rdd_odd = rdd_nums.filter(lambda x: x%2), I don't
>>> get an error.
>>>
>>> Thanks,
>>>
>>> Chad Timmins
>>>
>>> Software Engineer Intern at Trulia
>>> B.S. Electrical Engineering, UC Davis 2015
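P.S. About the last observation: for integer input, filter(lambda x: x%2) and filter(lambda x: x%2 == 1) keep exactly the same elements, because a nonzero int is truthy in Python. A plain Python 2 check, with no Spark involved:

nums = [1, 2, 3, 4, 5, 6]
# Both predicates keep the odd numbers: 1 is truthy, 0 is falsy.
print filter(lambda x: x % 2, nums)       # [1, 3, 5]
print filter(lambda x: x % 2 == 1, nums)  # [1, 3, 5]

Since the two forms are equivalent, the fact that one crashes in the notebook and the other does not suggests the failure is in the Python worker environment rather than in the predicate itself.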