Are you building Spark with Java 6 or Java 7? Java 6 uses the extended Zip format and Java 7 uses Zip64. I think we tried to add a build warning when Java 7 is used, for exactly this reason:
https://github.com/apache/spark/blob/master/make-distribution.sh#L102

Any luck if you use JDK 6 to compile?

On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:
> OK, my colleague found this:
> https://mail.python.org/pipermail/python-list/2014-May/671353.html
>
> And my jar file has 70011 files. Fantastic...
>
> On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:
>> I asked several people, and no one seems to believe that we can do this:
>>
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark
>>
>> The following pull request did mention something about generating a zip
>> file for all python-related modules:
>> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
>>
>> I've tested that zipped modules can at least be imported via zipimport.
>>
>> Any ideas?
>>
>> -Simon
>>
>> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <and...@databricks.com> wrote:
>>> Hi Simon,
>>>
>>> You shouldn't have to install pyspark on every worker node. In YARN mode,
>>> pyspark is packaged into your assembly jar and shipped to your executors
>>> automatically. This seems like a more general problem. There are a few
>>> things to try:
>>>
>>> 1) Run a simple pyspark shell with yarn-client, and do
>>> "sc.parallelize(range(10)).count()" to see if you get the same error.
>>>
>>> 2) If so, check whether your assembly jar is compiled correctly. Run
>>>
>>> $ jar -tf <path/to/assembly/jar> pyspark
>>> $ jar -tf <path/to/assembly/jar> py4j
>>>
>>> to see if the files are there. For Py4j, you need both the python files
>>> and the Java class files.
>>>
>>> 3) If the files are there, try running a simple python shell (not a pyspark
>>> shell) with the assembly jar on the PYTHONPATH:
>>>
>>> $ PYTHONPATH=/path/to/assembly/jar python
>>> >>> import pyspark
>>>
>>> 4) If that works, try it on every worker node. If it doesn't work, there
>>> is probably something wrong with your jar.
>>>
>>> There is a known issue for PySpark on YARN: jars built with Java 7
>>> cannot be properly opened by Java 6. I would either verify that the
>>> JAVA_HOME set on all of your workers points to Java 7 (by setting
>>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
>>>
>>> $ cd /path/to/spark/home
>>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0
>>>
>>> 5) You can check out
>>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
>>> which has more detailed information about how to debug running an
>>> application on YARN in general. In my experience, the steps outlined there
>>> are quite useful.
>>>
>>> Let me know if you get it working (or not).
>>>
>>> Cheers,
>>> Andrew
>>>
>>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:
>>>> Hi folks,
>>>>
>>>> I have a weird problem when using pyspark with yarn. I started ipython
>>>> as follows:
>>>>
>>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G
>>>>
>>>> When I create a notebook, I can see workers being created, and indeed I
>>>> see the Spark UI running on my client machine on port 4040.
>>>>
>>>> I have the following simple script:
>>>>
>>>> """
>>>> import pyspark
>>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>>>> oneday = data.map(lambda line: line.split(",")).\
>>>>          map(lambda f: (f[0], float(f[1]))).\
>>>>          filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
>>>>          map(lambda t: (parser.parse(t[0]), t[1]))
>>>> oneday.take(1)
>>>> """
>>>>
>>>> By executing this, I see that it is my client machine (where ipython is
>>>> launched) that reads all the data from HDFS and produces the result of
>>>> take(1), rather than my worker nodes...
>>>>
>>>> When I do "data.count()", things blow up altogether. But I do see
>>>> in the error message something like this:
>>>>
>>>> """
>>>> Error from python worker:
>>>>   /usr/bin/python: No module named pyspark
>>>> """
>>>>
>>>> Am I supposed to install pyspark on every worker node?
>>>>
>>>> Thanks.
>>>> -Simon
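
A quick way to double-check the two suspects raised in this thread, the Zip64 / entry-count limit and whether the python sources actually ended up inside the assembly jar, is a few lines of plain Python. This is only a rough diagnostic sketch: the jar path is a placeholder, the "pyspark/" and "py4j/" prefixes are assumed to sit at the root of the assembly, and the 65535-entry threshold is the classic pre-Zip64 central-directory limit that the python-list thread linked above reports zipimport cannot handle.

"""
import sys
import zipfile

ASSEMBLY_JAR = "/path/to/assembly/jar"  # placeholder: your spark assembly jar

names = zipfile.ZipFile(ASSEMBLY_JAR).namelist()
print("total entries: %d" % len(names))

# Beyond 65535 entries the archive needs Zip64, which zipimport reportedly
# cannot read (per the python-list thread above; Simon's jar had 70011 files).
if len(names) > 65535:
    print("warning: archive needs Zip64; zipimport will likely fail")

# Andrew's step 2, done from python instead of `jar -tf` (prefixes assumed).
print("pyspark files: %d" % sum(1 for n in names if n.startswith("pyspark/")))
print("py4j files: %d" % sum(1 for n in names if n.startswith("py4j/")))

# Equivalent of: PYTHONPATH=/path/to/assembly/jar python; >>> import pyspark
sys.path.insert(0, ASSEMBLY_JAR)
import pyspark  # an ImportError here suggests the jar itself is the problem (Andrew's steps 3/4)
print("imported pyspark from %s" % pyspark.__file__)
"""

Running this with the same /usr/bin/python the YARN workers use gives a quicker signal than waiting for the "No module named pyspark" error from an executor.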
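For completeness, the snippet in Simon's original message calls parser.parse without showing where parser comes from; it presumably refers to dateutil. A self-contained version of the same job, assuming a pyspark shell that already provides sc and the same HDFS path, would look roughly like this:

"""
from dateutil import parser  # assumption: parser.parse in the original is dateutil's parser

data = sc.textFile("hdfs://test/tmp/data/*").cache()

# Keep rows from a single day and attach a parsed timestamp, as in the original.
oneday = (data.map(lambda line: line.split(","))
              .map(lambda f: (f[0], float(f[1])))
              .filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02")
              .map(lambda t: (parser.parse(t[0]), t[1])))

print(oneday.take(1))
"""

The parenthesized method chain just replaces the backslash continuations; the behavior is the same.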