Nope... I didn't try Java 6. The standard installation guide didn't say anything about Java 7, and suggested doing "-DskipTests" for the build: http://spark.apache.org/docs/latest/building-with-maven.html
So, I didn't see the warning message...
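(For anyone hitting the same thing, here is a quick, untested sketch of how to check whether an assembly jar has crossed the classic zip format's 65,535-entry limit, using Python's zipfile module; the jar path below is just a placeholder for wherever your build puts the assembly jar.)

import zipfile

# Placeholder path -- point this at your actual assembly jar.
jar_path = "/path/to/spark-assembly.jar"

jar = zipfile.ZipFile(jar_path)
entries = len(jar.infolist())
jar.close()

# The classic zip format tops out at 65535 entries; beyond that the
# archive needs the Zip64 extension, which seems to be what bit us here.
print("%d entries, Zip64 needed: %s" % (entries, entries > 65535))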
On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> Are you building Spark with Java 6 or Java 7? Java 6 uses the extended
> Zip format and Java 7 uses Zip64. I think we've tried to add some
> build warnings if Java 7 is used, for this reason:
>
> https://github.com/apache/spark/blob/master/make-distribution.sh#L102
>
> Any luck if you use JDK 6 to compile?
>
>
> On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:
> > OK, my colleague found this:
> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
> >
> > And my jar file has 70011 files. Fantastic..
> >
> >
> > On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:
> >>
> >> I asked several people, no one seems to believe that we can do this:
> >> $ PYTHONPATH=/path/to/assembly/jar python
> >> >>> import pyspark
> >>
> >> The following pull request did mention something about generating a zip
> >> file for all python-related modules:
> >> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
> >>
> >> I've tested that zipped modules can at least be imported via zipimport.
> >>
> >> Any ideas?
> >>
> >> -Simon
> >>
> >>
> >> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <and...@databricks.com> wrote:
> >>>
> >>> Hi Simon,
> >>>
> >>> You shouldn't have to install pyspark on every worker node. In YARN mode,
> >>> pyspark is packaged into your assembly jar and shipped to your executors
> >>> automatically. This seems like a more general problem. There are a few
> >>> things to try:
> >>>
> >>> 1) Run a simple pyspark shell with yarn-client, and do
> >>> "sc.parallelize(range(10)).count()" to see if you get the same error
> >>>
> >>> 2) If so, check if your assembly jar is compiled correctly. Run
> >>>
> >>> $ jar -tf <path/to/assembly/jar> pyspark
> >>> $ jar -tf <path/to/assembly/jar> py4j
> >>>
> >>> to see if the files are there. For Py4j, you need both the python files
> >>> and the Java class files.
> >>>
> >>> 3) If the files are there, try running a simple python shell (not a pyspark
> >>> shell) with the assembly jar on the PYTHONPATH:
> >>>
> >>> $ PYTHONPATH=/path/to/assembly/jar python
> >>> >>> import pyspark
> >>>
> >>> 4) If that works, try it on every worker node. If it doesn't work, there
> >>> is probably something wrong with your jar.
> >>>
> >>> There is a known issue for PySpark on YARN - jars built with Java 7
> >>> cannot be properly opened by Java 6. I would either verify that the
> >>> JAVA_HOME set on all of your workers points to Java 7 (by setting
> >>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
> >>>
> >>> $ cd /path/to/spark/home
> >>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0
> >>>
> >>> 5) You can check out
> >>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
> >>> which has more detailed information about how to debug running an
> >>> application on YARN in general. In my experience, the steps outlined there
> >>> are quite useful.
> >>>
> >>> Let me know if you get it working (or not).
> >>>
> >>> Cheers,
> >>> Andrew
> >>>
> >>>
> >>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:
> >>>
> >>>> Hi folks,
> >>>>
> >>>> I have a weird problem when using pyspark with yarn. I started ipython
> >>>> as follows:
> >>>>
> >>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
> >>>> --num-executors 4 --executor-memory 4G
> >>>>
> >>>> When I create a notebook, I can see workers being created, and indeed I
> >>>> see the spark UI running on my client machine on port 4040.
> >>>>
> >>>> I have the following simple script:
> >>>> """
> >>>> import pyspark
> >>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
> >>>> oneday = data.map(lambda line: line.split(",")).\
> >>>>              map(lambda f: (f[0], float(f[1]))).\
> >>>>              filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
> >>>>              map(lambda t: (parser.parse(t[0]), t[1]))
> >>>> oneday.take(1)
> >>>> """
> >>>>
> >>>> By executing this, I see that it is my client machine (where ipython is
> >>>> launched) that is reading all the data from HDFS and producing the result
> >>>> of take(1), rather than my worker nodes...
> >>>>
> >>>> When I do "data.count()", things blow up altogether. But I do see
> >>>> something like this in the error message:
> >>>> """
> >>>> Error from python worker:
> >>>> /usr/bin/python: No module named pyspark
> >>>> """
> >>>>
> >>>> Am I supposed to install pyspark on every worker node?
> >>>>
> >>>> Thanks.
> >>>>
> >>>> -Simon
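(For completeness, a rough Python-side version of the check Andrew describes in step 3 above, i.e. verifying that pyspark and py4j can actually be imported out of the assembly jar via zipimport; the jar path is again only a placeholder and this is a sketch.)

import sys

# Putting a jar/zip on sys.path is equivalent to the
# PYTHONPATH=/path/to/assembly/jar trick mentioned above.
jar_path = "/path/to/spark-assembly.jar"
sys.path.insert(0, jar_path)

try:
    import pyspark   # the pyspark python sources must be inside the jar
    import py4j      # the py4j python files must be there as well
    print("pyspark and py4j import fine from the jar")
except ImportError as e:
    # A failure here usually means the python files are missing from the
    # jar, or zipimport cannot read the archive (e.g. too many entries).
    print("import failed: %s" % e)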