[ https://issues.apache.org/jira/browse/SPARK-1520?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13972819#comment-13972819 ]

Sean Owen commented on SPARK-1520:
----------------------------------

Java 6 had a limit of 65,535 entries per jar in total (the zip format's 
16-bit entry count), but the limit is much higher in Java 7, which added 
zip64 support:
http://stackoverflow.com/questions/9616250/what-is-the-maximum-number-of-files-per-jar
https://blogs.oracle.com/xuemingshen/entry/zip64_support_for_4g_zipfile

When I build the assembly I find that it has 70948 files, well past that 
limit. I think you are certainly onto something.
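
For reference, a quick way to check the entry count yourself (a sketch; the 
path assumes the standard assembly build location):

{code}
# Count entries in the assembly jar; anything over 65,535 requires zip64.
jar tf assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar | wc -l
{code}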

I see the same behavior as you, using the latest Java 6 and Java 7. I also 
note that "unzip -l" succeeds on the jar built with Java 6, but fails on the 
jar built with Java 7 with the following:

{code}
error:  expected central file header signature not found (file #70949).
  (please check that you have transferred or created the zipfile in the
  appropriate BINARY mode and that you have compiled UnZip properly)
{code}

This might not be Java's fault. It could be that SBT's merging of the zip 
files does not correctly handle Java 7's output, which is zip64 at this 
entry count.
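
One way to probe that theory: zip64 archives carry a distinct "end of central 
directory" record with signature PK\x06\x06 near the end of the file. A rough 
check for its presence (just a sketch; assumes bash and GNU grep, and that 
the jar has no long trailing comment):

{code}
# Look for the zip64 end-of-central-directory signature (PK\x06\x06) in the
# archive's tail; a non-zero count suggests the jar was written as zip64.
tail -c 4096 spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar | grep -ac $'PK\x06\x06'
{code}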

As a short-term solution, we can probably slim the assembly jar back under 
the limit. For example, fastutil is still in there for some reason, and 
accounts for 10,666 files. It shouldn't be there.
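
To see fastutil's contribution directly (same jar as above):

{code}
# Count only the fastutil entries in the assembly:
jar tf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar | grep -c '^it/unimi/dsi/fastutil/'
{code}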

You can get a quick view into where the files are with:

{code}
jar tf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar | grep -oE "(.+/)+" | uniq -c | sort -rn | head -100
{code}

{code}
2883 breeze/linalg/operators/
2034 it/unimi/dsi/fastutil/objects/
1396 spire/std/
1379 scala/tools/nsc/typechecker/
1351 breeze/linalg/
1215 it/unimi/dsi/fastutil/longs/
1214 it/unimi/dsi/fastutil/ints/
1213 it/unimi/dsi/fastutil/doubles/
1211 it/unimi/dsi/fastutil/floats/
1210 it/unimi/dsi/fastutil/shorts/
1209 it/unimi/dsi/fastutil/chars/
1209 it/unimi/dsi/fastutil/bytes/
1187 scala/reflect/internal/
 896 com/google/common/collect/
 894 tachyon/thrift/
 886 spire/algebra/
 797 scala/tools/nsc/transform/
 749 scala/tools/nsc/interpreter/
 723 org/netlib/lapack/
 677 spire/math/
...
{code}

> Inclusion of breeze corrupts assembly when compiled with JDK7 and run on JDK6
> -----------------------------------------------------------------------------
>
>                 Key: SPARK-1520
>                 URL: https://issues.apache.org/jira/browse/SPARK-1520
>             Project: Spark
>          Issue Type: Bug
>          Components: MLlib, Spark Core
>            Reporter: Patrick Wendell
>            Priority: Blocker
>             Fix For: 1.0.0
>
>
> This is a real doozie - when compiling a Spark assembly with JDK7, the 
> produced jar does not work well with JRE6. I confirmed the byte code being 
> produced is JDK 6 compatible (major version 50). What happens is that, 
> silently, the JRE will not load any class files from the assembled jar.
> {code}
> $> sbt/sbt assembly/assembly
> $> /usr/lib/jvm/java-1.7.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator
> usage: ./bin/spark-class org.apache.spark.ui.UIWorkloadGenerator [master] [FIFO|FAIR]
> $> /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp /home/patrick/Documents/spark/assembly/target/scala-2.10/spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator
> Exception in thread "main" java.lang.NoClassDefFoundError: org/apache/spark/ui/UIWorkloadGenerator
> Caused by: java.lang.ClassNotFoundException: org.apache.spark.ui.UIWorkloadGenerator
>       at java.net.URLClassLoader$1.run(URLClassLoader.java:217)
>       at java.security.AccessController.doPrivileged(Native Method)
>       at java.net.URLClassLoader.findClass(URLClassLoader.java:205)
>       at java.lang.ClassLoader.loadClass(ClassLoader.java:323)
>       at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:294)
>       at java.lang.ClassLoader.loadClass(ClassLoader.java:268)
> Could not find the main class: org.apache.spark.ui.UIWorkloadGenerator. Program will exit.
> {code}
> I also noticed that if the jar is unzipped, and the classpath set to the 
> current directory, it "just works". Finally, if the assembly jar is 
> compiled with JDK6, it also works. The error is seen with any class, not just 
> the UIWorkloadGenerator. Also, this error doesn't exist in branch 0.9, only 
> in master.
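> A minimal way to reproduce that unzip case (a sketch; run from 
> assembly/target/scala-2.10/, and the "extracted" directory name is just 
> for illustration):
> {code}
> $ mkdir extracted && cd extracted
> $ jar xf ../spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
> $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp . org.apache.spark.ui.UIWorkloadGenerator # works
> {code}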
> *Isolation*
> -I ran a git bisection and this appeared after the MLLib sparse vector patch 
> was merged:-
> https://github.com/apache/spark/commit/80c29689ae3b589254a571da3ddb5f9c866ae534
> SPARK-1212
> -I narrowed this down specifically to the inclusion of the breeze library. 
> Just adding breeze to an older (unaffected) build triggered the issue.-
> I've found that if I just unpack and re-pack the jar (using `jar` from Java 6 
> or 7) it always works:
> {code}
> $ cd assembly/target/scala-2.10/
> $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # fails
> $ jar xvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar
> $ jar cvf spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar *
> $ /usr/lib/jvm/java-1.6.0-openjdk-amd64/bin/java -cp ./spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar org.apache.spark.ui.UIWorkloadGenerator # succeeds
> {code}
> One other observation: the Breeze package contains single directories with 
> huge numbers of files (e.g. 2000+ class files in one directory). It's 
> possible we are hitting bugs or corner cases in the compatibility of the 
> jar's internal storage format.
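> One quick way to inspect that structure is Info-ZIP's zipinfo, assuming it 
> is installed (it may hit the same central-directory error as unzip on the 
> broken jar, but works on the Java 6 build):
> {code}
> $ zipinfo -v spark-assembly-1.0.0-SNAPSHOT-hadoop1.0.4.jar | head -n 25  # archive-level details, incl. end-of-central-directory info
> {code}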



--
This message was sent by Atlassian JIRA
(v6.2#6252)
