[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-77428412

Yes indeed. Maven is the build of reference.
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user jkleckner commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-77427477

By the way, I wanted to mention that because Spark's production build uses Maven but some developers use sbt during development, shading HyperLogLog in the Maven build could produce a different assembly under sbt (unshaded) from what Maven would produce.
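For concreteness, relocating the clearspring classes in Spark's Maven build would look roughly like the fragment below — a minimal maven-shade-plugin sketch, where the `org.spark-project.clearspring` target package is a hypothetical name chosen for illustration, not anything the PR proposes. Because sbt does not read this Maven configuration, an sbt-built assembly would still carry the classes under their original `com.clearspring.analytics` names, which is exactly the divergence being pointed out.

```xml
<!-- Sketch only: relocating clearspring inside the Spark assembly.
     The shadedPattern is a hypothetical package name for illustration. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <relocations>
      <relocation>
        <pattern>com.clearspring.analytics</pattern>
        <shadedPattern>org.spark-project.clearspring.analytics</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```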
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user jkleckner commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-77426386

Closing this PR and hoping that the user-classpath-first mechanism gets fixed.
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user jkleckner closed the pull request at: https://github.com/apache/spark/pull/4780
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-77406012

Do you mind closing this PR?
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-77265506

I think shading CA's HyperLogLog usage does fix this, but it is a band-aid. I am still not clear why the user-classpath-first mechanism doesn't resolve this; it should be the right way to fix it, but I don't see why it doesn't work.
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user jkleckner commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-77195569

Well, [sbt-assembly rename](http://stackoverflow.com/questions/24596914/sbt-assembly-rename-class-with-merge-conflicts-shade) does not perform a shade operation. So I guess the options are:

* We manually shade/rename the classes ourselves (see the sketch below).
* We switch to Maven for packaging instead of sbt in order to do the shading.
* Fix Spark via one of:
  * Since Spark does use Maven, it could shade (relocate) its use of HyperLogLog.
  * Include the fastutil package (this patch) — rejected strongly, for understandable reasons.
  * Fix the user-classpath-first mechanism to properly find clearspring in our jar.

Did I miss any options?
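As a note on the first option: later releases of sbt-assembly (0.14.0 and up) added jarjar-based shade rules, so the manual rename can be expressed in the build rather than by editing sources. A minimal sketch, assuming that plugin version and a hypothetical `shaded.clearspring` target package:

```scala
// build.sbt sketch — requires sbt-assembly 0.14.0+, which added ShadeRule support.
// "shaded.clearspring" is a hypothetical package name chosen for illustration.
assemblyShadeRules in assembly := Seq(
  ShadeRule.rename("com.clearspring.analytics.**" -> "shaded.clearspring.@1").inAll
)
```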
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user jkleckner commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-76831730

> I wonder if we're hitting some bugs with the serializer's classloader not seeing the right classloaders. Some issues like that are fixed in 1.3.0. It may still be that this is OK in 1.3, but that's just a guess right now.
>
> Remind me where the error occurs? is it within a stack trace that includes the serializer?

Perhaps. Here is a bit of test code that I run before really setting up the RDDs, to provoke the failure. The dump suggests it is in the constructor, though.

```scala
val qDigest = new QDigest(256.0)
val s2 = " qDigest: " + qDigest.toString()
println(s2)
```

```
Exception in thread "main" java.lang.NoClassDefFoundError: it/unimi/dsi/fastutil/longs/Long2LongOpenHashMap
	at com.clearspring.analytics.stream.quantile.QDigest.<init>(QDigest.java:79)
	at com.*
	at com.*
	at com.*
	at com.*
	at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
	at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
	at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
	at java.lang.reflect.Method.invoke(Method.java:606)
	at org.apache.spark.deploy.SparkSubmit$.launch(SparkSubmit.scala:358)
	at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:75)
	at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
Caused by: java.lang.ClassNotFoundException: it.unimi.dsi.fastutil.longs.Long2LongOpenHashMap
	at java.net.URLClassLoader$1.run(URLClassLoader.java:366)
	at java.net.URLClassLoader$1.run(URLClassLoader.java:355)
	at java.security.AccessController.doPrivileged(Native Method)
	at java.net.URLClassLoader.findClass(URLClassLoader.java:354)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:425)
	at sun.misc.Launcher$AppClassLoader.loadClass(Launcher.java:308)
	at java.lang.ClassLoader.loadClass(ClassLoader.java:358)
	... 12 more
```

> Spark doesn't build with `minimizeJar`. I was referring to the parquet-column package. Spark should not, philosophically, be depended upon to provide anything but Spark (well, and anything third party that's necessary to invoke it). Indeed a lot of issues here descend from the fact that things aren't shaded and conflict.

Ah yes, parquet-column was the one with minimizeJar, not Spark.

> classpath-first is supposed to be a mechanism to work around this no matter if the conflict came from elsewhere. And if it isn't, that needs to be fixed ideally, as a first priority.

Makes sense.
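One way to see which jar actually supplied QDigest at runtime — i.e. whether Spark's copy or the app assembly's copy won — is to ask the class for its code source. A small diagnostic sketch, not part of the original test code:

```scala
import com.clearspring.analytics.stream.quantile.QDigest

// Print where QDigest was loaded from and by which classloader.
// If the code source points at the Spark assembly jar rather than the app
// assembly, the fastutil classes bundled with the app are invisible to it.
val cls = classOf[QDigest]
val src = Option(cls.getProtectionDomain.getCodeSource).map(_.getLocation)
println(s"code source : $src")
println(s"classloader : ${cls.getClassLoader}")
```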
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-76778163

Darn, I had hoped that was the answer, since it really seems like prioritizing your classes should mean your copy of CA, which definitely sits next to all of fastutil, is used. Yes, that's what is supposed to happen. I wonder if we're hitting some bugs with the _serializer's_ classloader not seeing the right classloaders. Some issues like that are fixed in 1.3.0. It may still be that this is OK in 1.3, but that's just a guess right now. Remind me where the error occurs? Is it within a stack trace that includes the serializer?

Spark doesn't build with `minimizeJar`. I was referring to the `parquet-column` package. Spark should not, philosophically, be depended upon to provide anything but Spark (well, and anything third party that's necessary to invoke it). Indeed a lot of issues here descend from the fact that things aren't shaded and conflict. There's a good argument that just about everything Spark (and Hadoop) uses should be shaded. It wasn't, and at this point it might be disruptive to change, but in theory you can use the classpath-first properties to overrule whatever is in Spark anyway. I don't think you need `minimizeJar` to make things work, but it would help to make your app jar much smaller.

I don't think the HyperLogLog usage is relevant per se. If you mean, should that just be shaded in Spark? It sure should resolve the conflict that appears to be at the root of this, but I guess we're assuming that Spark is the only thing in the whole runtime classpath that includes CA. I don't even know that. classpath-first is supposed to be a mechanism to work around this no matter whether the conflict came from elsewhere. And if it isn't, that needs to be fixed, ideally as a first priority.
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user jkleckner commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-76767297

> You're definitely sure CA is in your app JAR? I ask just because you mention it was marked provided above, though also in your assembly. Worth double-checking.

Oh yes, it shows as completely there (sbt doesn't minimize) via `jar tf` on the assembly. Also all of fastutil...

> Are you using Spark 1.3? because it now occurs to me that spark.driver.userClassPathFirst and spark.executor.userClassPathFirst don't exist before 1.3. The equivalents are the less-well-named spark.files.userClassPathFirst and spark.yarn.user.classpath.first (for YARN mode). Can you try those instead if you're using <= 1.2?

Ouch. Got bitten by the inconsistency and bugs among 1.2.0 local vs YARN and 1.3. Am using 1.2.0a on Amazon EMR. My inner-loop testing was with local mode, and this bug https://issues.apache.org/jira/browse/SPARK-4739 indicates that it won't work with local. On the presumption that unused options can't hurt, and to cover the bases on both old and new option flag settings, I changed my script to use:

```bash
--conf spark.files.userClassPathFirst=true \
--conf spark.driver.userClassPathFirst=true \
--conf spark.executor.userClassPathFirst=true \
--conf spark.yarn.user.classpath.first=true \
--conf spark.files.user.classpath.first=true \
```

So I retried the case where CA and fastutil are complete (and not renamed or shaded) in our user assembly. I can confirm that both local and YARN get the class-not-found exception `java.lang.NoClassDefFoundError: it/unimi/dsi/fastutil/longs/Long2LongOpenHashMap` even with the user-classpath-first settings.

Can you confirm that the correct behavior would be for the CA and fastutil classes to be found in our assembly when using user-jar-first, even if some of CA is contained in the spark jar? Or does this require shading?

> So, I would close this PR in the sense that this isn't the fix. I would leave SPARK-6029 open though until there's a resolution to the issue one way or the other.

First, please comment on the following. Philosophically, it is not clear that the `minimizeJar` option used to create the Spark assembly is consistent with expecting people to mark it as `provided`. If, as in this case, a broader set of classes is required by a downstream jar, it creates the sort of conflict we are seeing. In some sense, the dependent package is, in fact, not `provided`, at least not fully. I'm surprised this hasn't happened in more cases.

> You're effectively shading CA (and not fastutil); you should also be able to achieve that through your build rather than bother with source, though, I don't know how that works in SBT. (minimizeJar is a function of Maven's shading plugin.)
>
> fastutil-in-Spark isn't the issue per se, since indeed Spark doesn't have it! what it does have is CA.

Yes, there is a rename feature in sbt which should accomplish the shading without manually renaming. There does not appear to be a `minimizeJar` equivalent. I will give that renaming a try. Do you think that Spark should be the one to rename CA, given that it is cherry-picking the HyperLogLog class and nothing else? If it did, other dependent analytics packages could declare a dependence on CA without conflicts; as it stands, it creates headaches for them.

Your pointers to the other JIRA bugs do seem relevant and to have some mechanisms in common.
http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-tp5832p5879.html

> * for a class that exists in both my jar and spark kernel it tries to use userClassLoader and ends up with a NoClassDefFoundError. the class is org.apache.avro.mapred.AvroInputFormat and the NoClassDefFoundError is for org.apache.hadoop.mapred.FileInputFormat (which the parentClassLoader is responsible for since it is not in my jar). i currently catch this NoClassDefFoundError and call parentClassLoader.loadClass but thats clearly not a solution since it loads the wrong version.

http://apache-spark-user-list.1001560.n3.nabble.com/java-serialization-errors-with-spark-files-userClassPathFirst-true-tp5832p5875.html

> ok i think the issue is visibility: a classloader can see all classes loaded by its parent classloader. but userClassLoader does not have a parent classloader, so its not able to "see" any classes that parentLoader is responsible for. in my case userClassLoader is trying to get AvroInputFormat which probably somewhere statically references FileInputFormat, which is invisible to userClassLoader.

These two may explain why putting our jar first still doesn't work without shading. I confess that I am not an expert on class loaders...
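To illustrate the visibility problem described in those two posts, here is a minimal sketch of a child-first ("user classpath first") classloader — hypothetical structure, not Spark's actual implementation. It tries its own URLs before delegating to the parent; the failure mode in the posts corresponds to a variant where the parent fallback is missing, so a child-loaded class cannot resolve static references that only the parent could satisfy.

```scala
import java.net.{URL, URLClassLoader}

// Sketch only: a child-first classloader (no locking, for brevity).
class ChildFirstClassLoader(urls: Array[URL], parent: ClassLoader)
    extends URLClassLoader(urls, parent) {

  override def loadClass(name: String, resolve: Boolean): Class[_] = {
    // Reuse a class this loader already defined, if any.
    val loaded = findLoadedClass(name)
    val cls =
      if (loaded != null) loaded
      else
        try findClass(name) // child first: look in our own URLs
        catch {
          // Without this fallback, anything only the parent can see
          // (e.g. FileInputFormat in the posts above) becomes invisible.
          case _: ClassNotFoundException => super.loadClass(name, resolve)
        }
    if (resolve) resolveClass(cls)
    cls
  }
}
```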
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-76681725

So, I would close this PR in the sense that this isn't the fix. I would leave SPARK-6029 open though until there's a resolution to the issue one way or the other.

Are you using Spark 1.3? because it now occurs to me that `spark.driver.userClassPathFirst` and `spark.executor.userClassPathFirst` don't exist before 1.3. The equivalents are the less-well-named `spark.files.userClassPathFirst` and `spark.yarn.user.classpath.first` (for YARN mode). Can you try those instead if you're using <= 1.2? Also, just double-check that CA is really in your assembly.
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-76507231

So, marking CA as provided but putting it in your assembly is contradictory, but in the end, the right thing is to include CA (and its dependencies) in your assembly, yes. I was asking whether you mark Spark as provided; it should be.

You're effectively shading CA (and not fastutil); you should also be able to achieve that through your build rather than bother with source, though, I don't know how that works in SBT. (`minimizeJar` is a function of Maven's shading plugin.)

fastutil-in-Spark isn't the issue per se, since indeed Spark doesn't have it! What it does have is CA. Your result seems to confirm that the problem is really CA, in the sense that your app finds Spark's loaded copy of CA classes, but that classloader can't see your classloader, which also has CA, but also the fastutil it needs. Shading CA disambiguates this. (So, put that workaround in your pocket for now; you should be able to do this with SBT.)

This is what the userClassPathFirst stuff is supposed to resolve though. You're definitely sure CA is in your app JAR? I ask just because you mention it was marked provided above, though also in your assembly. Worth double-checking.

Otherwise, I'm not sure. It almost sounds like the inverse of the unresolved (?) https://issues.apache.org/jira/browse/SPARK-1863, but it may surprise me by being the same issue.
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user jkleckner commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-76428126

> Are you packaging CA yourself, are you marking all the Spark deps as provided? I see you're already using the `userClassPathFirst` settings.

Cases of spark-submit that fail (with the fastutil class-not-found exception):

- Mark CA as provided
- Include CA in our assembly
- Include CA in our assembly and use `spark.driver.userClassPathFirst=true` and `spark.executor.userClassPathFirst=true`
- Include CA and `fastutil` in our assembly
- Include CA and `fastutil` in our assembly and use `userClassPath=true`

Cases that succeed:

- Include source for CA directly and rename the package from `clearspring` to `clearsprin`
- Include source for CA directly, rename the package from `clearspring` to `clearsprin`, and use `userClassPath=true`

Note that these last two cases do cause `fastutil` to be included in the assembly.

> (PS good news, I see that parquet-column's shading is set to only include classes that are used from fastutil. That's great; there are only tens of classes added. That's why they had to shade.)

I don't see a `minimizeJar` option with `sbt-assembly`, which is what we are using. I have to say, that would be nice since it is over 16MB...

So, I guess we will have to just copy/rename the CA package? Or should `userClassPathFirst` work in this situation, where Spark is using one of the classes?
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-76214905

(PS good news, I see that `parquet-column`'s shading is set to only include classes that are used from `fastutil`. That's great; there are only tens of classes added. That's why they had to shade.)
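For reference, the shade-and-minimize pattern described here — relocate fastutil under a private package and keep only the classes actually referenced — looks roughly like the maven-shade-plugin fragment below. This is a sketch of the general technique, not parquet-column's exact configuration.

```xml
<!-- Sketch: shade fastutil under a private package and drop unused classes.
     Illustrative only; not parquet-column's actual configuration. -->
<plugin>
  <groupId>org.apache.maven.plugins</groupId>
  <artifactId>maven-shade-plugin</artifactId>
  <configuration>
    <!-- Keep only classes reachable from the project's own code. -->
    <minimizeJar>true</minimizeJar>
    <artifactSet>
      <includes>
        <include>it.unimi.dsi:fastutil</include>
      </includes>
    </artifactSet>
    <relocations>
      <relocation>
        <pattern>it.unimi.dsi</pattern>
        <shadedPattern>parquet.it.unimi.dsi</shadedPattern>
      </relocation>
    </relocations>
  </configuration>
</plugin>
```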
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user srowen commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-76214312

Since the PR is the implementation of an issue resolution, if the discussion is about the implementation it can happen here on the PR.

Shading isn't the issue, in the sense that it wouldn't affect whether you can access un-shaded fastutil classes. So, that's bad if parquet-column brings back in all of fastutil, since it has caused a problem with assembly size in the past. It looks like Spark SQL uses it. This is a separate issue at the moment.

It shouldn't matter what Spark does with Spark's copy of clearspring-analytics, in the sense that you shouldn't care or rely on what Spark does or doesn't use. In practice it can matter, since, if your code uses CA, and it finds CA in the Spark classloader, then that copy can't see the fastutil that your usage and your copy of CA (which you should be bundling with your app) can see.

Are you packaging CA yourself? Are you marking all the Spark deps as provided? I see you're already using the `userClassPathFirst` settings.
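The packaging arrangement being asked about — Spark marked provided, CA bundled in the app assembly — would look roughly like this in an sbt build. A minimal sketch; the version numbers are illustrative, matching the era of this thread.

```scala
// build.sbt sketch: Spark provided by the cluster, clearspring bundled with the app.
libraryDependencies ++= Seq(
  "org.apache.spark" %% "spark-core" % "1.2.0" % "provided",
  // Bundled into the assembly so the app's CA sits next to its fastutil dependency.
  "com.clearspring.analytics" % "stream" % "2.7.0"
)
```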
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user jkleckner commented on a diff in the pull request: https://github.com/apache/spark/pull/4780#discussion_r25440596

--- Diff: pom.xml ---
```
@@ -471,13 +471,6 @@
       <groupId>com.clearspring.analytics</groupId>
       <artifactId>stream</artifactId>
       <version>2.7.0</version>
-      <!-- Only HyperLogLog is used, which doesn't depend on fastutil -->
-      <exclusions>
-        <exclusion>
-          <groupId>it.unimi.dsi</groupId>
-          <artifactId>fastutil</artifactId>
-        </exclusion>
-      </exclusions>
```
--- End diff --

I'm very open to suggestions. Note that parquet is already bringing in part of it via its own assembly jar, renamed under parquet/. I haven't worked with sbt-assembly rename, but is that a solution? Perhaps we should all use a different implementation other than QDigest, like tdigest?

BTW, I'm unclear where discussion is best placed. Should it be here or there: https://issues.apache.org/jira/browse/SPARK-6029

The more I look into this sort of `dependency **ll`, the more I believe it is intractable in the current way we (generally, not Spark specifically) package things.
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user srowen commented on a diff in the pull request: https://github.com/apache/spark/pull/4780#discussion_r25414302

--- Diff: pom.xml ---
```
@@ -471,13 +471,6 @@
       <groupId>com.clearspring.analytics</groupId>
       <artifactId>stream</artifactId>
       <version>2.7.0</version>
-      <!-- Only HyperLogLog is used, which doesn't depend on fastutil -->
-      <exclusions>
-        <exclusion>
-          <groupId>it.unimi.dsi</groupId>
-          <artifactId>fastutil</artifactId>
-        </exclusion>
-      </exclusions>
```
--- End diff --

We need to not bring fastutil back into the build. It is massive, and it caused packaging problems. Let's find another answer.
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/4780#issuecomment-76125266

Can one of the admins verify this patch?
[GitHub] spark pull request: [SPARK-6029] Stop excluding fastutil package
GitHub user jkleckner opened a pull request:

https://github.com/apache/spark/pull/4780

[SPARK-6029] Stop excluding fastutil package

Spark excludes "fastutil" dependencies of "clearspring" quantiles. This patch removes the exclusion of the dependency.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/jkleckner/spark SPARK-6029-stop-excluding-fastutil

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/4780.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #4780

commit 13242068789496ecfc83ebf13bb881e8a25abe8f
Author: Jim Kleckner
Date: 2015-02-26T05:23:57Z

    [SPARK-6029] Stop excluding fastutil package

    Spark excludes "fastutil" dependencies of "clearspring" quantiles. This patch removes the exclusion of the dependency.