Are you building Spark with Java 6 or Java 7? Java 6 uses the extended Zip format and Java 7 uses Zip64. I think we tried to add a build warning when Java 7 is used, for exactly this reason:
https://github.com/apache/spark/blob/master/make-distribution.sh#L102

Any luck if you use JDK 6 to compile?

On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:
> OK, my colleague found this:
> https://mail.python.org/pipermail/python-list/2014-May/671353.html
>
> And my jar file has 70011 files. Fantastic...
>
> On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:
>> I asked several people, and no one seems to believe that we can do this:
>>
>> $ PYTHONPATH=/path/to/assembly/jar python
>> >>> import pyspark
>>
>> The following pull request did mention something about generating a zip
>> file for all python-related modules:
>> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
>>
>> I've tested that zipped modules can at least be imported via zipimport.
>>
>> Any ideas?
>>
>> -Simon
>>
>> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <and...@databricks.com> wrote:
>>> Hi Simon,
>>>
>>> You shouldn't have to install pyspark on every worker node. In YARN mode,
>>> pyspark is packaged into your assembly jar and shipped to your executors
>>> automatically. This seems like a more general problem. There are a few
>>> things to try:
>>>
>>> 1) Run a simple pyspark shell with yarn-client, and do
>>> "sc.parallelize(range(10)).count()" to see if you get the same error.
>>>
>>> 2) If so, check whether your assembly jar is compiled correctly. Run
>>>
>>> $ jar -tf <path/to/assembly/jar> pyspark
>>> $ jar -tf <path/to/assembly/jar> py4j
>>>
>>> to see if the files are there. For Py4j, you need both the python files
>>> and the Java class files.
>>>
>>> 3) If the files are there, try running a simple python shell (not a pyspark
>>> shell) with the assembly jar on the PYTHONPATH:
>>>
>>> $ PYTHONPATH=/path/to/assembly/jar python
>>> >>> import pyspark
>>>
>>> 4) If that works, try it on every worker node. If it doesn't work, there
>>> is probably something wrong with your jar.
>>>
>>> There is a known issue for PySpark on YARN: jars built with Java 7
>>> cannot be properly opened by Java 6. I would either verify that the
>>> JAVA_HOME set on all of your workers points to Java 7 (by setting
>>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
>>>
>>> $ cd /path/to/spark/home
>>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0
>>>
>>> 5) You can check out
>>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
>>> which has more detailed information about how to debug running an
>>> application on YARN in general. In my experience, the steps outlined there
>>> are quite useful.
>>>
>>> Let me know if you get it working (or not).
>>>
>>> Cheers,
>>> Andrew
>>>
>>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:
>>>> Hi folks,
>>>>
>>>> I have a weird problem when using pyspark with yarn. I started ipython
>>>> as follows:
>>>>
>>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4 --num-executors 4 --executor-memory 4G
>>>>
>>>> When I create a notebook, I can see workers being created, and indeed I
>>>> see the Spark UI running on my client machine on port 4040.
>>>>
>>>> I have the following simple script:
>>>>
>>>> """
>>>> import pyspark
>>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
>>>> oneday = data.map(lambda line: line.split(",")).\
>>>>          map(lambda f: (f[0], float(f[1]))).\
>>>>          filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
>>>>          map(lambda t: (parser.parse(t[0]), t[1]))
>>>> oneday.take(1)
>>>> """
>>>>
>>>> By executing this, I see that it is my client machine (where ipython is
>>>> launched) that reads all the data from HDFS and produces the result of
>>>> take(1), rather than my worker nodes...
>>>>
>>>> When I do "data.count()", things blow up altogether. But I do see
>>>> in the error message something like this:
>>>>
>>>> """
>>>> Error from python worker:
>>>>   /usr/bin/python: No module named pyspark
>>>> """
>>>>
>>>> Am I supposed to install pyspark on every worker node?
>>>>
>>>> Thanks.
>>>> -Simon
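
A quick way to double-check the two suspects raised in this thread, the Zip64 / entry-count limit and whether the python sources actually ended up inside the assembly jar, is a few lines of plain Python. This is only a rough diagnostic sketch: the jar path is a placeholder, the "pyspark/" and "py4j/" prefixes are assumed to sit at the root of the assembly, and the 65535-entry threshold is the classic pre-Zip64 central-directory limit that the python-list thread linked above reports zipimport cannot handle.

"""
import sys
import zipfile

ASSEMBLY_JAR = "/path/to/assembly/jar"  # placeholder: your spark assembly jar

names = zipfile.ZipFile(ASSEMBLY_JAR).namelist()
print("total entries: %d" % len(names))

# Beyond 65535 entries the archive needs Zip64, which zipimport reportedly
# cannot read (per the python-list thread above; Simon's jar had 70011 files).
if len(names) > 65535:
    print("warning: archive needs Zip64; zipimport will likely fail")

# Andrew's step 2, done from python instead of `jar -tf` (prefixes assumed).
print("pyspark files: %d" % sum(1 for n in names if n.startswith("pyspark/")))
print("py4j files: %d" % sum(1 for n in names if n.startswith("py4j/")))

# Equivalent of: PYTHONPATH=/path/to/assembly/jar python; >>> import pyspark
sys.path.insert(0, ASSEMBLY_JAR)
import pyspark  # an ImportError here suggests the jar itself is the problem (Andrew's steps 3/4)
print("imported pyspark from %s" % pyspark.__file__)
"""

Running this with the same /usr/bin/python the YARN workers use gives a quicker signal than waiting for the "No module named pyspark" error from an executor.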
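For completeness, the snippet in Simon's original message calls parser.parse without showing where parser comes from; it presumably refers to dateutil. A self-contained version of the same job, assuming a pyspark shell that already provides sc and the same HDFS path, would look roughly like this:

"""
from dateutil import parser  # assumption: parser.parse in the original is dateutil's parser

data = sc.textFile("hdfs://test/tmp/data/*").cache()

# Keep rows from a single day and attach a parsed timestamp, as in the original.
oneday = (data.map(lambda line: line.split(","))
              .map(lambda f: (f[0], float(f[1])))
              .filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02")
              .map(lambda t: (parser.parse(t[0]), t[1])))

print(oneday.take(1))
"""

The parenthesized method chain just replaces the backslash continuations; the behavior is the same.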