Re: Using mllib-1.1.0-SNAPSHOT on Spark 1.0.1

2014-08-09 Thread Matt Forbes
I was having this same problem earlier this week and had to include my
changes in the assembly.


On Sat, Aug 9, 2014 at 9:59 AM, Debasish Das debasish.da...@gmail.com
wrote:

 I validated that I can reproduce this problem with master as well (without
 adding any of my mllib changes)...

 I separated the mllib jar from the assembly, deployed the assembly, and then
 supplied the mllib jar to spark-submit via the --jars option...

 I get this error:

 14/08/09 12:49:32 INFO DAGScheduler: Failed to run count at ALS.scala:299
 Exception in thread "main" org.apache.spark.SparkException: Job aborted due
 to stage failure: Task 238 in stage 40.0 failed 4 times, most recent
 failure: Lost task 238.3 in stage 40.0 (TID 10002,
 tblpmidn05adv-hdp.tdc.vzwcorp.com): java.lang.ClassCastException:
 scala.Tuple1 cannot be cast to scala.Product2
     org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5$$anonfun$apply$4.apply(CoGroupedRDD.scala:159)
     scala.collection.Iterator$$anon$11.next(Iterator.scala:328)
     org.apache.spark.util.collection.ExternalAppendOnlyMap.insertAll(ExternalAppendOnlyMap.scala:138)
     org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:159)
     org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$5.apply(CoGroupedRDD.scala:158)
     scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
     scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
     scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
     scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
     org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:158)
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
     org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
     org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
     org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
     org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:129)
     org.apache.spark.rdd.CoGroupedRDD$$anonfun$compute$2.apply(CoGroupedRDD.scala:126)
     scala.collection.TraversableLike$WithFilter$$anonfun$foreach$1.apply(TraversableLike.scala:772)
     scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
     scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
     scala.collection.TraversableLike$WithFilter.foreach(TraversableLike.scala:771)
     org.apache.spark.rdd.CoGroupedRDD.compute(CoGroupedRDD.scala:126)
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
     org.apache.spark.rdd.MappedValuesRDD.compute(MappedValuesRDD.scala:31)
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
     org.apache.spark.rdd.FlatMappedValuesRDD.compute(FlatMappedValuesRDD.scala:31)
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     org.apache.spark.rdd.RDD.iterator(RDD.scala:229)
     org.apache.spark.rdd.FlatMappedRDD.compute(FlatMappedRDD.scala:33)
     org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:262)
     org.apache.spark.CacheManager.getOrCompute(CacheManager.scala:61)
     org.apache.spark.rdd.RDD.iterator(RDD.scala:227)
     org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:62)
     org.apache.spark.scheduler.Task.run(Task.scala:54)
     org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:199)
     java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
     java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
     java.lang.Thread.run(Thread.java:744)

 Driver stacktrace:
 at org.apache.spark.scheduler.DAGScheduler.org$apache$spark$scheduler$DAGScheduler$$failJobAndIndependentStages(DAGScheduler.scala:1153)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1142)
 at org.apache.spark.scheduler.DAGScheduler$$anonfun$abortStage$1.apply(DAGScheduler.scala:1141)
 at scala.collection.mutable.ResizableArray$class.foreach(ResizableArray.scala:59)
 at scala.collection.mutable.ArrayBuffer.foreach(ArrayBuffer.scala:47)
 at org.apache.spark.scheduler.DAGScheduler.abortStage(DAGScheduler.scala:1141)
 at
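
For reference, the cast in that first frame is easy to reproduce in isolation:
scala.Tuple1 implements Product1, not Product2, so the co-group path is
evidently being handed single-element tuples where it expects key-value pairs.
A minimal Scala snippet showing just the cast (not the actual MLlib code path):

    // Throws: java.lang.ClassCastException:
    //   scala.Tuple1 cannot be cast to scala.Product2
    val x: Any = Tuple1(1)
    x.asInstanceOf[Product2[Int, Int]]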

 

Documentation confusing or incorrect for decision trees?

2014-08-06 Thread Matt Forbes
I found the section on ordering categorical features really interesting,
but the A, B, C example seems inconsistent. Am I misreading this passage,
or are there typos? Shouldn't the split candidates be A | C, B and A, C | B?

For example, for a binary classification problem with one categorical
feature with three categories A, B and C with corresponding proportion of
label 1 as 0.2, 0.6 and 0.4, the categorical features are ordered as A
followed by C followed B or A, B, C. The two split candidates are A | C, B
and A , B | C where | denotes the split.
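
Here is how I understand the heuristic, which is why I would expect A | C, B
and A, C | B. A small standalone Scala sketch of my interpretation (not
MLlib's actual implementation):

    // Sort the categories by their proportion of label 1, then consider
    // only the k - 1 contiguous splits of that ordering.
    val labelOneProportion = Map("A" -> 0.2, "B" -> 0.6, "C" -> 0.4)

    // Increasing proportion gives A (0.2), C (0.4), B (0.6).
    val ordered = labelOneProportion.toSeq.sortBy(_._2).map(_._1)

    // Each prefix of the ordering is one candidate split; this prints
    // "A | C, B" and "A, C | B".
    (1 until ordered.size).foreach { i =>
      val (left, right) = ordered.splitAt(i)
      println(left.mkString(", ") + " | " + right.mkString(", "))
    }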


Problems running modified spark version on ec2 cluster

2014-08-04 Thread Matt Forbes
I'm trying to run a forked version of mllib where I am experimenting with a
boosted-trees implementation. Here is what I've tried, but I can't seem to
get it working properly:

*Directory layout:*

src/spark-dev  (spark github fork)
  pom.xml - I've tried changing the version to 1.2 arbitrarily in core and mllib
src/forestry  (test driver)
  pom.xml - depends on spark-core and spark-mllib with version 1.2

*spark-defaults.conf:*

spark.master                    spark://ec2-54-224-112-117.compute-1.amazonaws.com:7077
spark.verbose                   true
spark.files.userClassPathFirst  false  # I've tried both true and false here
spark.executor.memory           6G
spark.jars                      spark-mllib_2.10-1.2.0-SNAPSHOT.jar,spark-core_2.10-1.2.0-SNAPSHOT.jar,spark-streaming_2.10-1.2.0-SNAPSHOT.jar
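
(For what it's worth, the same settings can also be attached programmatically
from the driver; a minimal sketch using the stock SparkConf API, assuming the
same jar paths as above:)

    import org.apache.spark.SparkConf

    val conf = new SparkConf()
      .setAppName("TreeTest")
      .setMaster("spark://ec2-54-224-112-117.compute-1.amazonaws.com:7077")
      .set("spark.executor.memory", "6g")
      .setJars(Seq(
        "spark-mllib_2.10-1.2.0-SNAPSHOT.jar",
        "spark-core_2.10-1.2.0-SNAPSHOT.jar",
        "spark-streaming_2.10-1.2.0-SNAPSHOT.jar"))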

*Build and run script:*

MASTER=r...@ec2-54-224-112-117.compute-1.amazonaws.com
PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
FORESTRY_DIR=~/src/forestry-main
SPARK_DIR=~/src/spark-dev
cd $SPARK_DIR
mvn -T8 -DskipTests -pl core,mllib,streaming install
cd $FORESTRY_DIR
mvn -T8 -DskipTests package
rsync --progress ~/src/spark-dev/mllib/target/spark-mllib_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/spark-dev/core/target/spark-core_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/spark-dev/streaming/target/spark-streaming_2.10-1.2.0-SNAPSHOT.jar $MASTER:
rsync --progress ~/src/forestry-main/target/$PRIMARY_JAR $MASTER:
rsync --progress ~/src/forestry-main/spark-defaults.conf $MASTER:spark/conf
ssh $MASTER spark/bin/spark-submit $PRIMARY_JAR --class forestry.TreeTest --verbose

In spark-dev/mllib I've added a new class, GradientBoostingTree, which I'm
referencing from TreeTest in my test driver. The driver pulls some data
from S3, converts it to LabeledPoint, and then calls
GradientBoostingTree.train(...) exactly the way DecisionTree is called. This
is all fine until we call examples.map { x => tree.predict(x.features) },
where tree is a DecisionTree that I've also modified in my fork. At that
point the workers blow up because they can't find a new method I've added
to the tree.model.Node class. My suspicion is that the workers may be
deserializing the DecisionTreeModel against a different version of mllib
that doesn't have my changes?
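
For context, the driver flow is roughly the following. This is a
self-contained sketch against the stock DecisionTree API; GradientBoostingTree
and my Node changes are my own experimental code, so they are left out:

    import org.apache.spark.{SparkConf, SparkContext}
    import org.apache.spark.mllib.linalg.Vectors
    import org.apache.spark.mllib.regression.LabeledPoint
    import org.apache.spark.mllib.tree.DecisionTree

    object TreeTest {
      def main(args: Array[String]): Unit = {
        val sc = new SparkContext(new SparkConf().setAppName("TreeTest"))
        // Stand-in for the data pulled from S3 and converted to LabeledPoint.
        val examples = sc.parallelize(Seq.tabulate(100) { i =>
          LabeledPoint((i % 2).toDouble,
            Vectors.dense(i.toDouble, (i % 2).toDouble))
        })
        // (input, numClasses, categoricalFeaturesInfo, impurity, maxDepth, maxBins)
        val model = DecisionTree.trainClassifier(
          examples, 2, Map[Int, Int](), "gini", 4, 32)
        // The map below ships the model to the executors; if their classpath
        // holds a different build of mllib, deserialization or method lookup
        // fails on the workers, which matches the error I'm seeing.
        examples.map(x => model.predict(x.features)).collect().foreach(println)
        sc.stop()
      }
    }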

Is my setup all wrong? I'm using an EC2 cluster because it is so easy to
start up and manage. Maybe I need to fully distribute my new version of
Spark to all the workers before starting the job? Is there an easy way to
do that?


Re: Problems running modified spark version on ec2 cluster

2014-08-04 Thread Matt Forbes
After rummaging through the worker instances I noticed they were using the
assembly jar (which I hadn't noticed before). Now, instead of shipping the
core and mllib jars individually, I'm just overwriting the assembly jar on
the master and distributing it with spark-ec2/copy-dir. For posterity, my
run script is:

MASTER=r...@ec2-54-224-110-72.compute-1.amazonaws.com
PRIMARY_JAR=forestry-main-1.0-SNAPSHOT-jar-with-dependencies.jar
ASSEMBLY_SRC=spark-assembly-1.1.0-SNAPSHOT-hadoop1.0.4.jar
ASSEMBLY_DEST=spark-assembly-1.0.1-hadoop1.0.4.jar
FORESTRY_DIR=~/src/forestry-main
SPARK_DIR=~/src/spark-dev
cd $SPARK_DIR
mvn -T8 -DskipTests -pl core,mllib,assembly install
cd $FORESTRY_DIR
mvn -T8 -DskipTests package
rsync --progress ~/src/spark-dev/assembly/target/scala-2.10/$ASSEMBLY_SRC $MASTER:spark/lib/$ASSEMBLY_DEST
rsync --progress ~/src/forestry-main/target/$PRIMARY_JAR $MASTER:
rsync --progress ~/src/forestry-main/spark-defaults.conf $MASTER:spark/conf
ssh $MASTER spark-ec2/copy-dir --delete /root/spark/lib
ssh $MASTER spark/bin/spark-submit $PRIMARY_JAR --class com.ttforbes.TreeTest --verbose


