Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1
I was having this same problem early this week and had to include my changes in the assembly.

On Sat, Aug 9, 2014 at 9:59 AM, Debasish Das debasish.da...@gmail.com wrote:

I validated that I can reproduce this problem with master as well (without adding any of my mllib changes). I separated the mllib jar from the assembly, deployed the assembly, and then supplied the mllib jar through the --jars option to spark-submit. I get this error:

14/08/09 12:49:32 INFO DAGScheduler: Failed to run count at ALS.scala:299
Exception in thread "main" org.apache.spark.SparkException: Job aborted due to stage failure: Task 238 in stage 40.0 failed 4 times, most recent failure: Lost task 238.3 in stage 40.0 (TID 10002, tblpmidn05adv-hdp.tdc.vzwcorp.com): java.lang.ClassCastException: scala.Tuple1 cannot be cast to scala.Product2
    org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5$$anonfun$apply$4.apply(CoGroupedRDD.scala:159)
    scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
    org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
    org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
    org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
    scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:129)
    org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)
    scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
    scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
    scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
    scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
    org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:126)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
    org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
    org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
    org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
    org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
    org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
    org.apache.spark.scheduler.Task.run(Task.scala:54)
    org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
    java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
    java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
    java.lang.Thread.run(Thread.java:744)
Driver stacktrace:
    at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)
    at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)
    at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
    at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
    at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)
    at
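For reference, "include my changes in the assembly" means roughly the following (a sketch under my own layout; directory names, module list, and jar paths are from my setup and will differ in yours):

```shell
# Rebuild the assembly so the modified mllib classes ship inside it, instead of
# deploying a stock assembly plus a separate mllib jar via --jars (the mix that
# produced the Tuple1/Product2 binary mismatch above). Paths are illustrative.
cd ~/src/spark                                    # your Spark fork
mvn -T8 -DskipTests -pl core,mllib,assembly install
# Ship the one rebuilt assembly jar; do NOT also pass the mllib jar to --jars.
rsync --progress assembly/target/scala-2.10/spark-assembly-*.jar \
    $MASTER:spark/lib/
```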
Documentation confusing or incorrect for decision trees?
I found the section on ordering categorical features really interesting, but the A, B, C example seemed inconsistent. Am I interpreting this passage wrong, or are there typos? Aren't the split candidates A | C, B and A, C | B ?

The passage in question:

> For example, for a binary classification problem with one categorical feature with three categories A, B and C with corresponding proportion of label 1 as 0.2, 0.6 and 0.4, the categorical features are ordered as A followed by C followed B or A, B, C. The two split candidates are A | C, B and A , B | C where | denotes the split.
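If I've understood the rule, the categories are simply sorted by their proportion of label 1, and the split candidates are the contiguous prefixes of that order. A one-liner sanity check, using the proportions from the quoted example:

```shell
# Sort the categories by their proportion of label 1 (A=0.2, B=0.6, C=0.4).
# The resulting order is A, C, B, so the two ordered-split candidates are the
# prefixes of that order:  A | C, B   and   A, C | B
printf 'A 0.2\nB 0.6\nC 0.4\n' | sort -n -k2 | awk '{print $1}'
```

which prints A, C, B in that order, matching the splits I'd expect rather than the ones in the docs.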
Problems running modified spark version on ec2 cluster
I'm trying to run a forked version of mllib where I am experimenting with a boosted trees implementation. Here is what I've tried, but can't seem to get working properly:

*Directory layout:*
src/spark-dev (spark github fork)
  pom.xml - I've tried changing the version to 1.2 arbitrarily in core and mllib
src/forestry (test driver)
  pom.xml - depends on spark-core and spark-mllib with version 1.2

*spark-defaults.conf:*
spark.master spark://ec2-54-224-112-117.compute-1.amazonaws.com:7077
spark.verbose true
spark.files.userClassPathFirst false   # I've tried both true and false here
spark.executor-memory 6G
spark.jars spark-mllib_2.10-1.2.0-SNAPSHOT.jar,spark-core_2.10-1.2.0-SNAPSHOT.jar,spark-streaming_2.10-1.2.0-SNAPSHOT.jar

*Build and run script:*
MASTER=r...@ec2-54-224-112-117.compute-1.amazonaws.com
PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
FORESTRY_DIR=~/src/forestry-main
SPARK_DIR=~/src/spark-dev
cd $SPARK_DIR
mvn -T8 -DskipTests -pl core,mllib,streaming install
cd $FORESTRY_DIR
mvn -T8 -DskipTests package
rsync --progress ~/src/spark-dev/mllib/target/spark-mllib_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/spark-dev/core/target/spark-core_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/spark-dev/streaming/target/spark-streaming_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/forestry-main/target/$PRIMARY_JAR $MASTER:
rsync --progress ~/src/forestry-main/spark-defaults.conf $MASTER:spark/conf
ssh $MASTER spark/bin/spark-submit $PRIMARY_JAR --class forestry.TreeTest --verbose

In spark-dev/mllib I've added a new class, GradientBoostingTree, which I'm referencing from TreeTest in my test driver. The driver pulls some data from s3, converts it to LabeledPoint, and then calls GradientBoostingTree.train(...) identically to how DecisionTree works. This is all fine until we call examples.map { x => tree.predict(x.features) }, where tree is a DecisionTree that I've also modified in my fork.

At this point, the workers blow up because they can't find a new method I've added to the tree.model.Node class. My suspicion is that maybe the workers have deserialized the DecisionTreeModel into a different version of mllib that doesn't have my changes? Is my setup all wrong? I'm using an EC2 cluster because it is so easy to start up and manage; maybe I need to fully distribute my new version of spark to all the workers before starting the job? Is there an easy way to do that?
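The failure mode I suspect can be illustrated without Spark at all. A toy sketch (nothing here is Spark code; the file names just mirror my jars): class resolution is first-match-wins along an ordered search path, so if a stale assembly jar precedes my rebuilt mllib jar on the workers' classpath, they load the old Node class that lacks my new method.

```shell
# Toy model of classloader shadowing: resolve a name against a colon-separated
# search path, first hit wins - just like a JVM flat classpath lookup.
set -e
demo=$(mktemp -d)
mkdir -p "$demo/assembly" "$demo/mllib"
printf 'old Node (no new method)\n' > "$demo/assembly/Node.class"
printf 'new Node (with my method)\n' > "$demo/mllib/Node.class"

resolve() {  # resolve <name> against the colon-separated path in $CP
  for dir in $(printf '%s\n' "$CP" | tr ':' ' '); do
    if [ -f "$dir/$1" ]; then cat "$dir/$1"; return 0; fi
  done
  return 1
}

CP="$demo/assembly:$demo/mllib"; resolve Node.class  # assembly first: old class wins
CP="$demo/mllib:$demo/assembly"; resolve Node.class  # mllib first: new class wins
```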
Re: Problems running modified spark version on ec2 cluster
After rummaging through the worker instances I noticed they were using the assembly jar (which I hadn't noticed before). Now, instead of using the core and mllib jars individually, I'm just overwriting the assembly jar on the master and using spark-ec2/copy-dir. For posterity, my run script is:

MASTER=r...@ec2-54-224-110-72.compute-1.amazonaws.com
PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
ASSEMBLY_SRC=spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar
ASSEMBLY_DEST=spark-assembly-1.0.1-hadoop1.0.4.jar
FORESTRY_DIR=~/src/forestry-main
SPARK_DIR=~/src/spark-dev
cd $SPARK_DIR
mvn -T8 -DskipTests -pl core,mllib,assembly install
cd $FORESTRY_DIR
mvn -T8 -DskipTests package
rsync --progress ~/src/spark-dev/assembly/target/scala-2.10/$ASSEMBLY_SRC $MASTER:spark/lib/$ASSEMBLY_DEST
rsync --progress ~/src/forestry-main/target/$PRIMARY_JAR $MASTER:
rsync --progress ~/src/forestry-main/spark-defaults.conf $MASTER:spark/conf
ssh $MASTER spark-ec2/copy-dir --delete /root/spark/lib
ssh $MASTER spark/bin/spark-submit $PRIMARY_JAR --class com.ttforbes.TreeTest --verbose

On Mon, Aug 4, 2014 at 10:23 AM, Matt Forbes m...@tellapart.com wrote:

> I'm trying to run a forked version of mllib where I am experimenting with a boosted trees implementation.