FWIW, I never built Spark using Maven; always use sbt assembly.

On Tue, Oct 21, 2014 at 11:55 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> Ok, the mystery is solved.
>
> The safe sequence, from my limited testing, is:
> 1) delete ~/.m2/repository/org/spark and mahout
> 2) build Spark for your version of Hadoop, *but do not use "mvn package ..."*; use "mvn install ...". This will put a copy of the exact bits you need into the Maven cache for building Mahout against. In my case, using Hadoop 1.2.1, it was "mvn -Dhadoop.version=1.2.1 -DskipTests clean install". If you run tests on Spark, some failures can safely be ignored according to the Spark guys, so check before giving up.
> 3) build Mahout with "mvn clean install"
>
> This will create Mahout from exactly the same bits you will run on your cluster. It got rid of a missing anon function for me. The problem occurs when you use a different version of Spark on your cluster than you used to build Mahout, and this is rather hidden by Maven. Maven downloads from repos any dependency that is not in the local .m2 cache, so you have to make sure your version of Spark is there so Maven won't download one that is incompatible. Unless you really know what you are doing, I'd build both Spark and Mahout for now.
>
> BTW, I will check in the Spark 1.1.0 version of Mahout once I do some more testing.
>
> On Oct 21, 2014, at 10:26 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> Sorry to hear. I bet you'll find a way.
>
> The Spark JIRA trail leads to two suggestions:
> 1) use spark-submit to execute code with your own entry point (other than spark-shell). One theory points to not loading all needed Spark classes from the calling code (Mahout in our case). I can hand-check the jars for the anon function I am missing.
> 2) there may be different class names in the running code (created by building Spark locally) and the version referenced in the Mahout POM. If this turns out to be true, it means we can't rely on building Spark locally. Is there a Maven target that puts the artifacts of the Spark build in the .m2/repository local cache? That would be an easy way to test this theory.
>
> Either of these could cause missing classes.
>
> On Oct 21, 2014, at 9:52 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>
> No, I haven't used it with anything but 1.0.1 and 0.9.x.
>
> On a side note, I have just changed my employer. It is one of these big guys that make it very difficult to do any contributions, so I am not sure how much of anything I will be able to share/contribute.
>
> On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> > But unless you have the time to devote to errors, avoid it. I've built everything from scratch using 1.0.2 and 1.1.0 and am getting these and missing-class errors. The 1.x branch seems to have some kind of peculiar build-order dependencies. The errors sometimes don't show up until runtime, passing all build tests.
> >
> > Dmitriy, have you successfully used any Spark version other than 1.0.1 on a cluster? If so, do you recall the exact order and from what sources you built?
> >
> > On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >
> > You can't use a Spark client of one version and have the backend of another. You can try to change the Spark dependency in the Mahout POMs to match your backend (or, vice versa, you can change your backend to match what's on the client).
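For reference, the sequence above as plain shell commands -- a sketch that assumes Spark and Mahout source trees checked out side by side and a Hadoop 1.2.1 cluster, as in Pat's example; the cache paths follow the org.apache.spark and org.apache.mahout Maven coordinates:

    # 1) clear any cached Spark/Mahout artifacts so Maven can't pick up stale ones
    rm -rf ~/.m2/repository/org/apache/spark ~/.m2/repository/org/apache/mahout
    # 2) build Spark for the cluster's Hadoop and *install* (not package) it into the local cache
    (cd spark && mvn -Dhadoop.version=1.2.1 -DskipTests clean install)
    # 3) build Mahout against exactly those bits
    (cd mahout && mvn clean install)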
> > On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <balijamahesh....@gmail.com> wrote:
> >
> >> Hi All,
> >>
> >> Here are the errors I get, running in pseudo-distributed mode with Spark 1.0.2 and the latest Mahout code (clone).
> >>
> >> When I run the command from the page https://mahout.apache.org/users/sparkbindings/play-with-shell.html
> >>
> >> val drmX = drmData(::, 0 until 4)
> >>
> >> java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
> >>     at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
> >>     at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>     at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>     at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>     at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
> >>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>     at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
> >>     at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
> >>     at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
> >>     at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
> >>     at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>     at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>     at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>     at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> >>     at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
> >>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> >>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>     at java.lang.Thread.run(Thread.java:701)
> >> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
> >> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
> >> 14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
> >> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
> >> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
> >> 14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
> >> org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
> >>     java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
> >>     java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>     java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>     java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
> >>     java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
> >>     java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
> >>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>     org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>     org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
> >>     org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
> >>     java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
> >>     java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
> >>     java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
> >>     java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
> >>     org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
> >>     org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
> >>     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
> >>     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
> >>     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> >>     java.lang.Thread.run(Thread.java:701)
> >> Driver stacktrace:
> >>     at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
> >>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
> >>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
> >>     at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
> >>     at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
> >>     at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
> >>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> >>     at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
> >>     at scala.Option.foreach(Option.scala:236)
> >>     at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
> >>     at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
> >>     at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
> >>     at akka.actor.ActorCell.invoke(ActorCell.scala:456)
> >>     at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
> >>     at akka.dispatch.Mailbox.run(Mailbox.scala:219)
> >>     at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
> >>     at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> >>     at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> >>     at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> >>     at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> >>
> >> Best,
> >> Mahesh Balija.
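The InvalidClassException above is the signature of mismatched Spark builds: the stream was written by one build of org.apache.spark.rdd.RDD and read by another. A quick way to confirm which jars disagree is to print the serialVersionUID each side computes -- a minimal sketch using the standard java.io.ObjectStreamClass API; the assembly jar name is only an example, substitute whatever each machine actually runs:

    // serialVersionCheck.scala -- hypothetical diagnostic script, not part of Mahout or Spark.
    // Run with each side's Spark jar on the classpath, e.g.
    //   scala -cp spark-assembly-1.0.2-hadoop1.2.1.jar serialVersionCheck.scala
    // If the printed numbers differ between client and cluster, the builds are incompatible.
    import java.io.ObjectStreamClass

    val cls = Class.forName("org.apache.spark.rdd.RDD")
    val desc = ObjectStreamClass.lookup(cls) // descriptor Java serialization will use
    println(s"${cls.getName} serialVersionUID = ${desc.getSerialVersionUID}")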
> >> On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >>
> >>> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >>>
> >>>> Is anyone else nervous about ignoring this issue, or about relying on non-build (hand-run), test-driven transitive-dependency checking? I hope someone else will chime in.
> >>>>
> >>>> As to running unit tests on a TEST_MASTER, I'll look into it. Can we set up the build machine to do this? I'd feel better about eyeballing deps if we could have a TEST_MASTER automatically run during builds at Apache. Maybe the regular unit tests are OK for building locally ourselves.
> >>>>
> >>>>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
> >>>>>
> >>>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
> >>>>>
> >>>>>> Maybe a more fundamental issue is that we don't know for sure whether we have missing classes or not. The job.jar at least used the POM dependencies to guarantee every needed class was present. So the job.jar seems to solve the problem but may ship some unnecessary duplicate code, right?
> >>>>>
> >>>>> No, as I wrote, Spark doesn't work with the job.jar format. Neither, as it turns out, does more recent Hadoop MR, btw.
> >>>>
> >>>> Not speaking literally of the format. Spark understands jars, and Maven can build one from transitive dependencies.
> >>>>
> >>>>> Yes, this is A LOT of duplicate code (it will normally take MINUTES to start up tasks with all of it, just on copy time). This is absolutely not the way to go with this.
> >>>>
> >>>> Lack of a guarantee to load seems like a bigger problem than startup time. Clearly we can't just ignore this.
> >>>
> >>> Nope. Given the highly iterative nature and dynamic task allocation in this environment, one is looking at effects similar to MapReduce. This is not the only reason why I never go to MR anymore, but it's one of the main ones.
> >>>
> >>> How about an experiment: why don't you create an assembly that copies ALL transitive dependencies into one folder, and then try to broadcast it from a single point (the front end) to, well... let's start with 20 machines. (Of course we ideally want to get into the 10^3..10^4 range -- but why bother if we can't do it for 20.)
> >>>
> >>> Or, heck, let's try to simply parallel-copy it 20 times between two machines that are not collocated on the same subnet.
> >>>
> >>>>>> There may be any number of bugs waiting for the time we try running on a node machine that doesn't have some class in its classpath.
> >>>>>
> >>>>> No. Assuming any given method is tested on all its execution paths, there will be no bugs. Bugs of that sort will only appear if the user is using algebra directly and calls something that is not on the path, from the closure. In which case our answer to this is the same as for the solver-methodology developers -- use a customized SparkConf while creating the context to include the stuff you really want (see the sketch at the end of this thread).
> >>>>>
> >>>>> Also, another right answer to this is that we probably should reasonably provide the toolset here. For example, all the stats stuff found in R base and the R stat packages, so the user is not compelled to go non-native.
> >>>>
> >>>> Huh? This is not true. The one I ran into was found by calling something in math from something in math-scala. It led outside, and you can encounter such things even in algebra. In fact you have no idea whether these problems exist, except for the fact that you have used it a lot personally.
> >>>
> >>> You ran it with your own code that never existed before.
> >>>
> >>> But there's a difference between released Mahout code (which is what you are working on) and user code. Released code must run through remote tests as you suggested, and thus guarantee there are no such problems with post-release code.
> >>>
> >>> For users, we can only provide a way for them to load the stuff that they decide to use. We don't have a priori knowledge of what they will use. It is the same thing that Spark does, and the same thing that MR does, isn't it?
> >>>
> >>> Of course Mahout should rigorously drop the stuff it doesn't load from the Scala scope. No argument about that. In fact that's what I suggested as the #1 solution. But there's nothing much to do here but to go dependency-cleansing for the math and spark code. Part of the reason there's so much is that newer modules still bring in everything from mrLegacy.
> >>>
> >>> You are right in saying it is hard to guess which other dependencies in the util/legacy code are actually used. But that's not a justification for the brute-force "copy them all" approach, which virtually guarantees re-creating one of the foremost legacy problems this work was intended to address.
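On Dmitriy's SparkConf suggestion above: a minimal sketch of what "include the stuff you really want" can look like, shipping only the jars a job's closures actually reference instead of a job.jar-style bundle of every transitive dependency. SparkConf.setJars is standard Spark 1.x API; the master URL and jar paths here are hypothetical:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setMaster("spark://master:7077")   // hypothetical standalone master URL
      .setAppName("algebra-job")
      // Executors fetch these jars at task start, so list only what the
      // closures actually need rather than every transitive dependency.
      .setJars(Seq(
        "/opt/jars/my-solver.jar",        // hypothetical user code
        "/opt/jars/extra-dependency.jar"  // hypothetical extra dependency
      ))
    val sc = new SparkContext(conf)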