Nope... I didn't try Java 6. The standard installation guide didn't say anything about Java 7, and suggested doing "-DskipTests" for the build: http://spark.apache.org/docs/latest/building-with-maven.html
So, I didn't see the warning message...
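(For anyone hitting the same thing, here is a quick, untested sketch of how to check whether an assembly jar has crossed the classic zip format's 65,535-entry limit, using Python's zipfile module; the jar path below is just a placeholder for wherever your build puts the assembly jar.)

import zipfile

# Placeholder path -- point this at your actual assembly jar.
jar_path = "/path/to/spark-assembly.jar"

jar = zipfile.ZipFile(jar_path)
entries = len(jar.infolist())
jar.close()

# The classic zip format tops out at 65535 entries; beyond that the
# archive needs the Zip64 extension, which seems to be what bit us here.
print("%d entries, Zip64 needed: %s" % (entries, entries > 65535))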
On Mon, Jun 2, 2014 at 3:48 PM, Patrick Wendell <pwend...@gmail.com> wrote:
> Are you building Spark with Java 6 or Java 7? Java 6 uses the extended
> Zip format and Java 7 uses Zip64. I think we've tried to add some
> build warnings if Java 7 is used, for this reason:
>
> https://github.com/apache/spark/blob/master/make-distribution.sh#L102
>
> Any luck if you use JDK 6 to compile?
>
>
> On Mon, Jun 2, 2014 at 12:03 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:
> > OK, my colleague found this:
> > https://mail.python.org/pipermail/python-list/2014-May/671353.html
> >
> > And my jar file has 70011 files. Fantastic..
> >
> >
> > On Mon, Jun 2, 2014 at 2:34 PM, Xu (Simon) Chen <xche...@gmail.com> wrote:
> >>
> >> I asked several people, no one seems to believe that we can do this:
> >> $ PYTHONPATH=/path/to/assembly/jar python
> >> >>> import pyspark
> >>
> >> The following pull request did mention something about generating a zip
> >> file for all python-related modules:
> >> https://www.mail-archive.com/reviews@spark.apache.org/msg08223.html
> >>
> >> I've tested that zipped modules can at least be imported via zipimport.
> >>
> >> Any ideas?
> >>
> >> -Simon
> >>
> >>
> >> On Mon, Jun 2, 2014 at 11:50 AM, Andrew Or <and...@databricks.com> wrote:
> >>>
> >>> Hi Simon,
> >>>
> >>> You shouldn't have to install pyspark on every worker node. In YARN mode,
> >>> pyspark is packaged into your assembly jar and shipped to your executors
> >>> automatically. This seems like a more general problem. There are a few
> >>> things to try:
> >>>
> >>> 1) Run a simple pyspark shell with yarn-client, and do
> >>> "sc.parallelize(range(10)).count()" to see if you get the same error
> >>>
> >>> 2) If so, check if your assembly jar is compiled correctly. Run
> >>>
> >>> $ jar -tf <path/to/assembly/jar> pyspark
> >>> $ jar -tf <path/to/assembly/jar> py4j
> >>>
> >>> to see if the files are there. For Py4j, you need both the python files
> >>> and the Java class files.
> >>>
> >>> 3) If the files are there, try running a simple python shell (not a pyspark
> >>> shell) with the assembly jar on the PYTHONPATH:
> >>>
> >>> $ PYTHONPATH=/path/to/assembly/jar python
> >>> >>> import pyspark
> >>>
> >>> 4) If that works, try it on every worker node. If it doesn't work, there
> >>> is probably something wrong with your jar.
> >>>
> >>> There is a known issue for PySpark on YARN - jars built with Java 7
> >>> cannot be properly opened by Java 6. I would either verify that the
> >>> JAVA_HOME set on all of your workers points to Java 7 (by setting
> >>> SPARK_YARN_USER_ENV), or simply build your jar with Java 6:
> >>>
> >>> $ cd /path/to/spark/home
> >>> $ JAVA_HOME=/path/to/java6 ./make-distribution --with-yarn --hadoop 2.3.0-cdh5.0.0
> >>>
> >>> 5) You can check out
> >>> http://spark.apache.org/docs/latest/running-on-yarn.html#debugging-your-application,
> >>> which has more detailed information about how to debug running an
> >>> application on YARN in general. In my experience, the steps outlined there
> >>> are quite useful.
> >>>
> >>> Let me know if you get it working (or not).
> >>>
> >>> Cheers,
> >>> Andrew
> >>>
> >>>
> >>> 2014-06-02 17:24 GMT+02:00 Xu (Simon) Chen <xche...@gmail.com>:
> >>>
> >>>> Hi folks,
> >>>>
> >>>> I have a weird problem when using pyspark with yarn. I started ipython
> >>>> as follows:
> >>>>
> >>>> IPYTHON=1 ./pyspark --master yarn-client --executor-cores 4
> >>>> --num-executors 4 --executor-memory 4G
> >>>>
> >>>> When I create a notebook, I can see workers being created, and indeed I
> >>>> see the spark UI running on my client machine on port 4040.
> >>>>
> >>>> I have the following simple script:
> >>>> """
> >>>> import pyspark
> >>>> data = sc.textFile("hdfs://test/tmp/data/*").cache()
> >>>> oneday = data.map(lambda line: line.split(",")).\
> >>>>              map(lambda f: (f[0], float(f[1]))).\
> >>>>              filter(lambda t: t[0] >= "2013-01-01" and t[0] < "2013-01-02").\
> >>>>              map(lambda t: (parser.parse(t[0]), t[1]))
> >>>> oneday.take(1)
> >>>> """
> >>>>
> >>>> By executing this, I see that it is my client machine (where ipython is
> >>>> launched) that is reading all the data from HDFS and producing the result
> >>>> of take(1), rather than my worker nodes...
> >>>>
> >>>> When I do "data.count()", things blow up altogether. But I do see
> >>>> something like this in the error message:
> >>>> """
> >>>> Error from python worker:
> >>>> /usr/bin/python: No module named pyspark
> >>>> """
> >>>>
> >>>> Am I supposed to install pyspark on every worker node?
> >>>>
> >>>> Thanks.
> >>>>
> >>>> -Simon
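(For completeness, a rough Python-side version of the check Andrew describes in step 3 above, i.e. verifying that pyspark and py4j can actually be imported out of the assembly jar via zipimport; the jar path is again only a placeholder and this is a sketch.)

import sys

# Putting a jar/zip on sys.path is equivalent to the
# PYTHONPATH=/path/to/assembly/jar trick mentioned above.
jar_path = "/path/to/spark-assembly.jar"
sys.path.insert(0, jar_path)

try:
    import pyspark   # the pyspark python sources must be inside the jar
    import py4j      # the py4j python files must be there as well
    print("pyspark and py4j import fine from the jar")
except ImportError as e:
    # A failure here usually means the python files are missing from the
    # jar, or zipimport cannot read the archive (e.g. too many entries).
    print("import failed: %s" % e)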