Right. Something else has come up, so I haven't tried the shell tutorial yet. If anyone else wants to try it, you can build Mahout from this PR: https://github.com/apache/mahout/pull/61
On Oct 21, 2014, at 3:28 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

Hm, no, they don't push different binary releases to Maven. I assume they only push the default one.

On Tue, Oct 21, 2014 at 3:26 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

PS: I remember a discussion about packaging binary Spark distributions, so there are in fact a number of different Spark artifact releases. However, I am not sure they push them to Maven repositories. (If they did, they might use different Maven classifiers for them.) If that's the case, then one plausible strategy is to recommend rebuilding Mahout with a dependency on the classifier corresponding to the actual Spark binary release in use.

On Tue, Oct 21, 2014 at 2:21 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

If you are using the Mahout shell or the command-line drivers (which I don't), the correct thing to do would seem to be for the mahout script to simply take Spark dependencies from the installed $SPARK_HOME rather than from Mahout's assembly. That would be consistent with what other projects do in a similar situation, and it should also keep things compatible between minor releases of Spark.

But I think you are right in the sense that the problem is that Spark jars are not uniquely identified by Maven artifact id and version, unlike most other products. (E.g., if we see mahout-math-0.9.jar we expect there to be one and only one released artifact in existence -- but one's local build may create incompatible variations.)

On Tue, Oct 21, 2014 at 1:51 PM, Pat Ferrel <p...@occamsmachete.com> wrote:

The problem is not in building Spark; it is in building Mahout against the correct Spark jars. If you are using CDH and Hadoop 2, the correct jars are in the repos.

For the rest of us, though the process below seems like an error-prone hack to me, it does work on Linux and BSD/Mac. It should really be addressed by Spark, IMO.

BTW, the cache is laid out differently on Linux, but I don't think you need to delete it anyway.

On Oct 21, 2014, at 12:27 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

FWIW, I never built Spark using Maven. I always use sbt assembly.

On Tue, Oct 21, 2014 at 11:55 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

OK, the mystery is solved. The safe sequence, from my limited testing, is:

1) Delete ~/.m2/repository/org/spark and mahout.
2) Build Spark for your version of Hadoop, but do not use "mvn package ..."; use "mvn install ...". This puts a copy of the exact bits you need into the Maven cache for building Mahout against. In my case, using Hadoop 1.2.1, it was "mvn -Dhadoop.version=1.2.1 -DskipTests clean install". If you run tests on Spark, some failures can safely be ignored according to the Spark guys, so check before giving up.
3) Build Mahout with "mvn clean install".

This creates Mahout from exactly the same bits you will run on your cluster; it got rid of a missing anon function for me. The problem occurs when you use a different version of Spark on your cluster than you used to build Mahout, and this is rather hidden by Maven: Maven downloads from repos any dependency that is not in the local .m2 cache, so you have to make sure your version of Spark is there so Maven won't download one that is incompatible. Unless you really know what you are doing, I'd build both Spark and Mahout this way for now.

BTW, I will check in the Spark 1.1.0 version of Mahout once I do some more testing.

On Oct 21, 2014, at 10:26 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

Sorry to hear that. I bet you'll find a way.

The Spark Jira trail leads to two suggestions:

1) Use spark-submit to execute code with your own entry point (other than spark-shell). One theory points to not loading all needed Spark classes from the calling code (Mahout in our case). I can hand-check the jars for the anon function I am missing.
2) There may be different class names in the running code (created by building Spark locally) and in the version referenced in the Mahout POM. If this turns out to be true, it means we can't rely on building Spark locally. Is there a Maven target that puts the artifacts of the Spark build into the .m2/repository local cache? That would be an easy way to test this theory.

Either of these could cause missing classes.

On Oct 21, 2014, at 9:52 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

No, I haven't used it with anything but 1.0.1 and 0.9.x.

On a side note, I have just changed employers. It is one of those big companies that make it very difficult to do any contributions, so I am not sure how much of anything I will be able to share or contribute.

On Tue, Oct 21, 2014 at 9:43 AM, Pat Ferrel <p...@occamsmachete.com> wrote:

But unless you have the time to devote to errors, avoid it. I've built everything from scratch using 1.0.2 and 1.1.0 and am getting these and other missing-class errors. The 1.x branch seems to have some kind of peculiar build-order dependencies, and the errors sometimes don't show up until runtime, after passing all build tests.

Dmitriy, have you successfully used any Spark version other than 1.0.1 on a cluster? If so, do you recall the exact order and from what sources you built?

On Oct 21, 2014, at 9:35 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

You can't use a Spark client of one version with a backend of another. You can try to change the Spark dependency in the Mahout POMs to match your backend (or, vice versa, change your backend to match what's on the client).
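Before touching the POMs, it can help to confirm which Spark build the client side has actually loaded. Here is a minimal diagnostic sketch -- plain JDK reflection, nothing Mahout-specific, assuming you paste it into the Mahout/Spark shell on the driver:

    // Show which jar this JVM loaded Spark's RDD class from. Running the same
    // two lines against a worker's classpath shows whether the client and the
    // backend resolve to different Spark builds.
    val rddClass = Class.forName("org.apache.spark.rdd.RDD")
    println("RDD class loaded from: " + rddClass.getProtectionDomain.getCodeSource.getLocation)

If the printed path is not the assembly from the $SPARK_HOME you run on the cluster, the client and backend have diverged.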
On Tue, Oct 21, 2014 at 7:12 AM, Mahesh Balija <balijamahesh....@gmail.com> wrote:

Hi All,

Here are the errors I get when I run in pseudo-distributed mode, with Spark 1.0.2 and the latest Mahout code (cloned). When I run this command from the page https://mahout.apache.org/users/sparkbindings/play-with-shell.html:

val drmX = drmData(::, 0 until 4)

I get:

java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
    at java.io.ObjectStreamClass.initNonProxy(ObjectStreamClass.java:592)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
    at java.io.ObjectInputStream.readNonProxyDesc(ObjectInputStream.java:1621)
    at java.io.ObjectInputStream.readClassDesc(ObjectInputStream.java:1516)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1770)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
    at org.apache.spark.scheduler.ResultTask$.deserializeInfo(ResultTask.scala:61)
    at org.apache.spark.scheduler.ResultTask.readExternal(ResultTask.scala:141)
    at java.io.ObjectInputStream.readExternalData(ObjectInputStream.java:1836)
    at java.io.ObjectInputStream.readOrdinaryObject(ObjectInputStream.java:1795)
    at java.io.ObjectInputStream.readObject0(ObjectInputStream.java:1349)
    at java.io.ObjectInputStream.readObject(ObjectInputStream.java:369)
    at org.apache.spark.serializer.JavaDeserializationStream.readObject(JavaSerializer.scala:63)
    at org.apache.spark.serializer.JavaSerializerInstance.deserialize(JavaSerializer.scala:85)
    at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:165)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1146)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    at java.lang.Thread.run(Thread.java:701)
14/10/21 19:35:37 WARN TaskSetManager: Lost TID 1 (task 0.0:1)
14/10/21 19:35:37 WARN TaskSetManager: Lost TID 2 (task 0.0:0)
14/10/21 19:35:37 WARN TaskSetManager: Lost TID 3 (task 0.0:1)
14/10/21 19:35:38 WARN TaskSetManager: Lost TID 4 (task 0.0:0)
14/10/21 19:35:38 WARN TaskSetManager: Lost TID 5 (task 0.0:1)
14/10/21 19:35:38 WARN TaskSetManager: Lost TID 6 (task 0.0:0)
org.apache.spark.SparkException: Job aborted due to stage failure: Task 0.0:0 failed 4 times, most recent failure: Exception failure in TID 6 on host mahesh-VirtualBox.local: java.io.InvalidClassException: org.apache.spark.rdd.RDD; local class incompatible: stream classdesc serialVersionUID = 385418487991259089, local class serialVersionUID = -6766554341038829528
    [task stack trace repeats the same deserialization frames as above, from
    java.io.ObjectStreamClass.initNonProxy through java.lang.Thread.run]
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1044)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1028)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1026)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1026)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$handleTaskSetFailed$1.apply(DAGScheduler.scala:634)
    at scala.Option.foreach(Option.scala:236)
    at org.apache.spark.scheduler.DAGScheduler.handleTaskSetFailed(DAGScheduler.scala:634)
    at org.apache.spark.scheduler.DAGSchedulerEventProcessActor$$anonfun$receive$2.applyOrElse(DAGScheduler.scala:1229)
    at akka.actor.ActorCell.receiveMessage(ActorCell.scala:498)
    at akka.actor.ActorCell.invoke(ActorCell.scala:456)
    at akka.dispatch.Mailbox.processMailbox(Mailbox.scala:237)
    at akka.dispatch.Mailbox.run(Mailbox.scala:219)
    at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:386)
    at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
    at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
    at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
    at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)

Best,
Mahesh Balija.
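The InvalidClassException above is Java serialization reporting that the driver and the executors hold different builds of org.apache.spark.rdd.RDD: each side derives a default serialVersionUID from its own class bytes, and the two numbers disagree. As a minimal sketch (standard JDK API only), you can print the number your local classpath produces and compare it across machines:

    // Print the default serialVersionUID this classpath derives for Spark's RDD.
    // If the driver and a worker print different numbers, their Spark builds differ.
    import java.io.ObjectStreamClass
    val desc = ObjectStreamClass.lookup(Class.forName("org.apache.spark.rdd.RDD"))
    println("local serialVersionUID = " + desc.getSerialVersionUID)

Rebuilding Mahout against the exact Spark bits installed on the cluster, as described earlier in the thread, makes the two numbers match.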
On Tue, Oct 21, 2014 at 2:38 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> On Mon, Oct 20, 2014 at 1:51 PM, Pat Ferrel <p...@occamsmachete.com> wrote:
>
> Is anyone else nervous about ignoring this issue, or about relying on non-build (hand-run) test-driven transitive-dependency checking? I hope someone else will chime in.
>
> As to running unit tests on a TEST_MASTER, I'll look into it. Can we set up the build machine to do this? I'd feel better about eyeballing deps if we could have a TEST_MASTER run automatically during builds at Apache. Maybe the regular unit tests are OK for building locally ourselves.
>
>> On Oct 20, 2014, at 12:23 PM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:
>>
>>> On Mon, Oct 20, 2014 at 11:44 AM, Pat Ferrel <p...@occamsmachete.com> wrote:
>>>
>>> Maybe a more fundamental issue is that we don't know for sure whether we have missing classes or not. The job.jar at least used the POM dependencies to guarantee every needed class was present. So the job.jar seems to solve the problem, but it may ship some unnecessary duplicate code, right?
>>
>> No. As I wrote, Spark doesn't work with the job.jar format. Neither, as it turns out, does more recent Hadoop MR, BTW.
>
> I'm not speaking literally of the format. Spark understands jars, and Maven can build one from the transitive dependencies.
>
>> Yes, and that is A LOT of duplicate code (it will normally take MINUTES to start tasks, on copy time alone). This is absolutely not the way to go.
>
> A lack of any guarantee to load seems like a bigger problem than startup time. Clearly we can't just ignore this.

Nope. Given the highly iterative nature and dynamic task allocation of this environment, you would be looking at effects similar to MapReduce. That is not the only reason I never go back to MR anymore, but it is one of the main ones.

How about an experiment: create an assembly that copies ALL transitive dependencies into one folder, then try to broadcast it from a single point (the front end) to -- well, let's start with 20 machines. (Ideally we want to get into the 10^3..10^4 range, but why bother if we can't do it for 20?) Or, heck, simply try to parallel-copy it 20 times between two machines that are not collocated on the same subnet.

>>> There may be any number of bugs waiting for the time we try running on a node machine that doesn't have some class in its classpath.
>>
>> No. Assuming any given method is tested on all its execution paths, there will be no bugs. Bugs of that sort will only appear if the user is using algebra directly and calls something from a closure that is not on the path. In that case, our answer is the same as for solver-methodology developers -- use a customized SparkConf when creating the context, to include the stuff you really want.
>>
>> Another right answer is that we should probably provide a reasonable toolset here -- for example, all the stats stuff found in R base and the R stat packages, so the user is not compelled to go non-native.
>
> Huh? This is not true. The one I ran into was found by calling something in math from something in math-scala. It led outside, and you can encounter such things even in algebra. In fact, you have no idea whether these problems exist, except insofar as you have personally used it a lot.

You ran it with your own code, which never existed before. There is a difference between released Mahout code (which is what you are working on) and user code. Released code must run through remote tests, as you suggested, which guarantees there are no such problems in post-release code.

For users, all we can do is provide a way to load the things they decide to use; we have no a priori knowledge of what that will be. That is the same thing Spark does, and the same thing MR does, isn't it?

Of course, Mahout should rigorously drop from the Scala scope the stuff it doesn't load. No argument there; in fact, that is what I suggested as the #1 solution. But there is nothing much to do here except go dependency-cleansing through the math and spark code. Part of the reason there is so much of it is that the newer modules still bring in everything from mrLegacy.

You are right that it is hard to guess which other dependencies in the util/legacy code are actually used, but that is not a justification for a brute-force "copy them all" approach that virtually guarantees reintroducing one of the foremost legacy problems this work was intended to address.
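On the "customized SparkConf" point above, here is a minimal sketch of what that might look like. The master URL and jar paths are hypothetical placeholders -- substitute whatever your closures actually reference; setJars ships the listed jars to the executors so the classes are available on the backend classpath:

    import org.apache.spark.{SparkConf, SparkContext}

    // Sketch: explicitly ship the extra jars a user's closures need, rather
    // than assembling every transitive dependency. Paths are hypothetical.
    val conf = new SparkConf()
      .setAppName("my-mahout-job")
      .setMaster("spark://master:7077")      // hypothetical cluster master
      .setJars(Seq(
        "/opt/myapp/lib/extra-stats.jar",    // hypothetical user library
        "/opt/myapp/lib/my-solvers.jar"))    // hypothetical solver code
    val sc = new SparkContext(conf)

This keeps task startup cheap while still guaranteeing that the classes a job actually references can load on the workers.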