Hi Owen,

The DataFrame "training" is built from an RDD of a case class, RDD[LabeledDocument], where the case class is defined like this:

case class LabeledDocument(id: Long, text: String, *label: Double*)
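For reference, the schema Spark SQL infers from this case class can be confirmed on the shell; a minimal check (assuming the sqlContext implicits that spark-shell provides are in scope):

import sqlContext.implicits._
// Prints the inferred schema; expect roughly:
//   root
//    |-- id: long (nullable = false)
//    |-- text: string (nullable = true)
//    |-- label: double (nullable = false)
training.toDF.printSchema()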
So there is already a default "label" column with "double" type. I already tried to set the label column for the decision tree like this:

val lr = new DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini").setLabelCol("label")

It raised the same error. I also tried changing "label" to "int" type, but that reported an error like the following stack. I have no idea how to make this work.

java.lang.IllegalArgumentException: requirement failed: *Column label must be of type DoubleType but was actually IntegerType*.
    at scala.Predef$.require(Predef.scala:233)
    at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
    at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
    at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
    at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
    at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
    at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
    at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
    at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
    at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:162)
    at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
    at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:116)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:56)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:58)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:60)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:62)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:64)
    at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:66)
    at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:68)
    at $iwC$$iwC$$iwC$$iwC.<init>(<console>:70)
    at $iwC$$iwC$$iwC.<init>(<console>:72)
    at $iwC$$iwC.<init>(<console>:74)
    at $iwC.<init>(<console>:76)
    at <init>(<console>:78)
    at .<init>(<console>:82)
    at .<clinit>(<console>)
    at .<init>(<console>:7)
    at .<clinit>(<console>)
    at $print(<console>)

Thanks!
- Terry

On Sun, Sep 6, 2015 at 4:53 PM, Sean Owen <so...@cloudera.com> wrote:
> I think somewhere along the line you've not specified your label
> column -- it's defaulting to "label" and it does not recognize it, or
> at least not as a binary or nominal attribute.
>
> On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole <hujie.ea...@gmail.com> wrote:
> > Hi, Experts,
> >
> > I followed the Spark ML pipeline guide to test DecisionTreeClassifier
> > on the Spark shell with Spark 1.4.1, but I always hit an error like
> > the following. Do you have any idea how to fix this?
> >
> > The error stack:
> > java.lang.IllegalArgumentException: DecisionTreeClassifier was given
> > input with invalid label column label, without the number of classes
> > specified. See StringIndexer.
> >     at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
> >     at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
> >     at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
> >     at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
> >     at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
> >     at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
> >     at scala.collection.Iterator$class.foreach(Iterator.scala:727)
> >     at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
> >     at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
> >     at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
> >     at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:129)
> >     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
> >     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
> >     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
> >     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
> >     at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
> >     at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
> >     at $iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
> >     at $iwC$$iwC$$iwC.<init>(<console>:59)
> >     at $iwC$$iwC.<init>(<console>:61)
> >     at $iwC.<init>(<console>:63)
> >     at <init>(<console>:65)
> >     at .<init>(<console>:69)
> >     at .<clinit>(<console>)
> >     at .<init>(<console>:7)
> >     at .<clinit>(<console>)
> >     at $print(<console>)
> >
> > The code executed is:
> > // Labeled and unlabeled instance types.
> > // Spark SQL can infer schema from case classes.
> > case class LabeledDocument(id: Long, text: String, label: Double)
> > case class Document(id: Long, text: String)
> >
> > // Prepare training documents, which are labeled.
> > val training = sc.parallelize(Seq(
> >   LabeledDocument(0L, "a b c d e spark", 1.0),
> >   LabeledDocument(1L, "b d", 0.0),
> >   LabeledDocument(2L, "spark f g h", 1.0),
> >   LabeledDocument(3L, "hadoop mapreduce", 0.0)))
> >
> > // Configure an ML pipeline, which consists of three stages:
> > // tokenizer, hashingTF, and lr.
> > val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
> > val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
> > val lr = new DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini")
> > val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
> >
> > // The error is raised by the following line
> > val model = pipeline.fit(training.toDF)
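Following the pointer in the error message itself ("See StringIndexer"): DecisionTreeClassifier reads the number of classes from ML attribute metadata on the label column, and a plain Double column from a case class carries none. Fitting a StringIndexer on the label and training against its output column attaches that metadata. A minimal sketch of the adjusted pipeline, untested against 1.4.1; "indexedLabel" is just an illustrative column name, and on some 1.4.x builds StringIndexer may require a string input column, in which case cast the label first:

import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.classification.DecisionTreeClassifier
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}

// Fit an indexer on the raw "label" column. Its output column carries
// nominal-attribute metadata (including the class count) that
// DecisionTreeClassifier.train was complaining about.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")  // illustrative name, not required by the API

val tokenizer = new Tokenizer()
  .setInputCol("text")
  .setOutputCol("words")

val hashingTF = new HashingTF()
  .setNumFeatures(1000)
  .setInputCol(tokenizer.getOutputCol)
  .setOutputCol("features")

val dt = new DecisionTreeClassifier()
  .setMaxDepth(5)
  .setMaxBins(32)
  .setImpurity("gini")
  .setLabelCol("indexedLabel")   // train against the indexed column

// Run the indexer as the first pipeline stage, then fit as before.
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, tokenizer, hashingTF, dt))
val model = pipeline.fit(training.toDF)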