Xiangrui,

Do you have any idea how to make this work?
Thanks
- Terry

On Sunday, September 6, 2015 at 17:41, Terry Hole <hujie.ea...@gmail.com> wrote:

> Sean
>
> Do you know how to tell the decision tree that the "label" is binary, or how
> to set some attributes on the dataframe to carry the number of classes?
>
> Thanks!
> - Terry
>
> On Sun, Sep 6, 2015 at 5:23 PM, Sean Owen <so...@cloudera.com> wrote:
>
>> (Sean)
>> The error suggests that the type is not a binary or nominal attribute
>> though. I think that's the missing step. A double-valued column need
>> not be one of these attribute types.
>>
>> On Sun, Sep 6, 2015 at 10:14 AM, Terry Hole <hujie.ea...@gmail.com> wrote:
>> > Hi, Owen,
>> >
>> > The dataframe "training" is built from an RDD of a case class,
>> > RDD[LabeledDocument], where the case class is defined like this:
>> > case class LabeledDocument(id: Long, text: String, label: Double)
>> >
>> > So there is already a default "label" column of type "double".
>> >
>> > I already tried to set the label column for the decision tree like this:
>> > val lr = new DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini").setLabelCol("label")
>> > It raised the same error.
>> >
>> > I also tried to change the "label" to "int" type; that reported an error
>> > with the following stack. I have no idea how to make this work.
>> >
>> > java.lang.IllegalArgumentException: requirement failed: Column label must be of type DoubleType but was actually IntegerType.
>> >   at scala.Predef$.require(Predef.scala:233)
>> >   at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
>> >   at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
>> >   at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
>> >   at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
>> >   at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
>> >   at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
>> >   at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
>> >   at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
>> >   at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
>> >   at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:162)
>> >   at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
>> >   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:116)
>> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
>> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:56)
>> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:58)
>> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:60)
>> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:62)
>> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:64)
>> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:66)
>> >   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:68)
>> >   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:70)
>> >   at $iwC$$iwC$$iwC.<init>(<console>:72)
>> >   at $iwC$$iwC.<init>(<console>:74)
>> >   at $iwC.<init>(<console>:76)
>> >   at <init>(<console>:78)
>> >   at .<init>(<console>:82)
>> >   at .<clinit>(<console>)
>> >   at .<init>(<console>:7)
>> >   at .<clinit>(<console>)
>> >   at $print(<console>)
>> >
>> > Thanks!
>> > - Terry
>> >
>> > On Sun, Sep 6, 2015 at 4:53 PM, Sean Owen <so...@cloudera.com> wrote:
>> >>
>> >> I think somewhere along the line you've not specified your label
>> >> column -- it's defaulting to "label" and it does not recognize it, or
>> >> at least not as a binary or nominal attribute.
>> >>
>> >> On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole <hujie.ea...@gmail.com> wrote:
>> >> > Hi, Experts,
>> >> >
>> >> > I followed the Spark ML pipeline guide to test DecisionTreeClassifier
>> >> > in the Spark shell with Spark 1.4.1, but I always get the following
>> >> > error. Do you have any idea how to fix this?
>> >> >
>> >> > The error stack:
>> >> > java.lang.IllegalArgumentException: DecisionTreeClassifier was given input with invalid label column label, without the number of classes specified. See StringIndexer.
>> >> >   at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
>> >> >   at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
>> >> >   at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>> >> >   at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>> >> >   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
>> >> >   at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
>> >> >   at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>> >> >   at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>> >> >   at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>> >> >   at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>> >> >   at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:129)
>> >> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
>> >> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
>> >> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
>> >> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
>> >> >   at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
>> >> >   at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
>> >> >   at $iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
>> >> >   at $iwC$$iwC$$iwC.<init>(<console>:59)
>> >> >   at $iwC$$iwC.<init>(<console>:61)
>> >> >   at $iwC.<init>(<console>:63)
>> >> >   at <init>(<console>:65)
>> >> >   at .<init>(<console>:69)
>> >> >   at .<clinit>(<console>)
>> >> >   at .<init>(<console>:7)
>> >> >   at .<clinit>(<console>)
>> >> >   at $print(<console>)
>> >> >
>> >> > The code I executed is:
>> >> > import org.apache.spark.ml.Pipeline
>> >> > import org.apache.spark.ml.classification.DecisionTreeClassifier
>> >> > import org.apache.spark.ml.feature.{HashingTF, Tokenizer}
>> >> >
>> >> > // Labeled and unlabeled instance types.
>> >> > // Spark SQL can infer schema from case classes.
>> >> > case class LabeledDocument(id: Long, text: String, label: Double)
>> >> > case class Document(id: Long, text: String)
>> >> > // Prepare training documents, which are labeled.
>> >> > val training = sc.parallelize(Seq(
>> >> >   LabeledDocument(0L, "a b c d e spark", 1.0),
>> >> >   LabeledDocument(1L, "b d", 0.0),
>> >> >   LabeledDocument(2L, "spark f g h", 1.0),
>> >> >   LabeledDocument(3L, "hadoop mapreduce", 0.0)))
>> >> >
>> >> > // Configure an ML pipeline, which consists of three stages: tokenizer, hashingTF, and lr.
>> >> > val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
>> >> > val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
>> >> > val lr = new DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini")
>> >> > val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
>> >> >
>> >> > // The error is raised by the following line
>> >> > val model = pipeline.fit(training.toDF)
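
For reference, here is a minimal sketch of the two routes the thread points at: the "See StringIndexer" hint in the error message, and Terry's question about setting attributes on the dataframe so that it carries the number of classes. It assumes a Spark 1.4.x spark-shell session, where sc and sqlContext are already defined. The names labelIndexer, indexedLabel, dt, dtOnRawLabel and withMeta are made up for illustration, and the sketch has not been run against 1.4.1, so treat it as a starting point and not as a confirmed fix.

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.attribute.NominalAttribute
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}
import sqlContext.implicits._ // enables .toDF(); spark-shell normally imports this already

case class LabeledDocument(id: Long, text: String, label: Double)

val training = sc.parallelize(Seq(
  LabeledDocument(0L, "a b c d e spark", 1.0),
  LabeledDocument(1L, "b d", 0.0),
  LabeledDocument(2L, "spark f g h", 1.0),
  LabeledDocument(3L, "hadoop mapreduce", 0.0))).toDF()

val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

// Route A: let StringIndexer produce a label column that carries nominal
// metadata (including the number of classes), and train on that column.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel") // hypothetical column name

val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")
  .setMaxDepth(5)
  .setMaxBins(32)
  .setImpurity("gini")

val pipeline = new Pipeline().setStages(Array(labelIndexer, tokenizer, hashingTF, dt))
val model = pipeline.fit(training)

// Route B: attach the nominal attribute to the existing "label" column by hand.
// Assumption: the labels are already valid class indices (0.0 and 1.0 here).
val meta = NominalAttribute.defaultAttr.withName("label").withNumValues(2).toMetadata()
val withMeta = training.select(
  training("id"),
  training("text"),
  training("label").as("label", meta))

val dtOnRawLabel = new DecisionTreeClassifier() // label column defaults to "label"
  .setMaxDepth(5)
  .setMaxBins(32)
  .setImpurity("gini")

val model2 = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, dtOnRawLabel))
  .fit(withMeta)
```

One thing to keep in mind with route A: StringIndexer assigns indices by label frequency, so the values in indexedLabel (and therefore the predictions) are not guaranteed to match the original 0.0/1.0 labels. Route B leaves the label values untouched but relies on them already being 0-based class indices.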