Hi experts,

I followed the Spark ML Pipelines guide
<http://spark.apache.org/docs/latest/ml-guide.html> to test
DecisionTreeClassifier in the spark-shell with Spark 1.4.1, but I always hit
the following error. Do you have any idea how to fix it?

The error stack:
java.lang.IllegalArgumentException: DecisionTreeClassifier was given input
with invalid label column label, without the number of classes specified.
See StringIndexer.
        at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
        at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
        at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
        at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
        at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
        at scala.collection.Iterator$class.foreach(Iterator.scala:727)
        at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
        at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
        at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:129)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:42)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:47)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:49)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:51)
        at $iwC$$iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:53)
        at $iwC$$iwC$$iwC$$iwC$$iwC.<init>(<console>:55)
        at $iwC$$iwC$$iwC$$iwC.<init>(<console>:57)
        at $iwC$$iwC$$iwC.<init>(<console>:59)
        at $iwC$$iwC.<init>(<console>:61)
        at $iwC.<init>(<console>:63)
        at <init>(<console>:65)
        at .<init>(<console>:69)
        at .<clinit>(<console>)
        at .<init>(<console>:7)
        at .<clinit>(<console>)
        at $print(<console>)

The code I executed is:

// Labeled and unlabeled instance types.
// Spark SQL can infer schema from case classes.
case class LabeledDocument(id: Long, text: String, label: Double)
case class Document(id: Long, text: String)
// Prepare training documents, which are labeled.
val training = sc.parallelize(Seq(
  LabeledDocument(0L, "a b c d e spark", 1.0),
  LabeledDocument(1L, "b d", 0.0),
  LabeledDocument(2L, "spark f g h", 1.0),
  LabeledDocument(3L, "hadoop mapreduce", 0.0)))

// Configure an ML pipeline with three stages: tokenizer, hashingTF, and a
// decision tree classifier.
val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
val hashingTF = new HashingTF().setNumFeatures(1000).setInputCol(tokenizer.getOutputCol).setOutputCol("features")
val lr = new DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32).setImpurity("gini")
val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))

// The error is raised by the following line:
val model = pipeline.fit(training.toDF)
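The error message points at StringIndexer, so my guess is that
DecisionTreeClassifier needs the number of classes recorded as metadata on
the label column, which a StringIndexer stage would add. Is something like
the following sketch the intended fix? (The "indexedLabel" column name is my
own choice, and I have not confirmed this works.)

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.{HashingTF, StringIndexer, Tokenizer}
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Index the raw label so its column metadata records the number of classes.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")
val dt = new DecisionTreeClassifier()
  .setLabelCol("indexedLabel")   // train on the indexed label column
  .setFeaturesCol("features")
  .setMaxDepth(5)
  .setMaxBins(32)
  .setImpurity("gini")
val pipeline = new Pipeline()
  .setStages(Array(labelIndexer, tokenizer, hashingTF, dt))
val model = pipeline.fit(training.toDF)
```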
