Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier
Sean,

Thank you! I finally got this to work, although it is a bit ugly: manually setting the metadata of the dataframe so the "label" column carries a nominal attribute with two values.

import org.apache.spark.ml.attribute._
import org.apache.spark.sql.types._

val df = training.toDF()
val schema = df.schema
val rowRDD = df.rdd

// Attach a nominal (binary) ML attribute to the existing column metadata.
def enrich(m: Metadata): Metadata = {
  val na = NominalAttribute.defaultAttr.withValues("0", "1")
  na.toMetadata(m)
}

val newSchema = StructType(schema.map(f =>
  if (f.name == "label") f.copy(metadata = enrich(f.metadata)) else f))
val model = pipeline.fit(sqlContext.createDataFrame(rowRDD, newSchema))

Thanks!
- Terry

On Mon, Sep 7, 2015 at 4:24 PM, Sean Owen wrote:
> Hm, off the top of my head I don't know. I haven't looked at this
> aspect in a while, strangely. It's an attribute in the metadata of the
> field. I assume there's a method for setting this metadata when you
> construct the input data.
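The original error message ("See StringIndexer") points at an alternative to hand-editing the metadata: indexing the label produces a column that already carries nominal-attribute metadata, including the number of classes, which is what DecisionTreeClassifier looks for. A rough sketch, not tested on 1.4.1 -- StringIndexer at that time indexed string columns, so the double label may first need casting to string, and the "indexedLabel" column name is illustrative:

```scala
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.feature.StringIndexer
import org.apache.spark.ml.classification.DecisionTreeClassifier

// Index the raw label; the output column gets nominal metadata attached.
val labelIndexer = new StringIndexer()
  .setInputCol("label")
  .setOutputCol("indexedLabel")

// Train the tree on the indexed column instead of the raw one.
val dt = new DecisionTreeClassifier()
  .setMaxDepth(5).setMaxBins(32).setImpurity("gini")
  .setLabelCol("indexedLabel")

val pipeline = new Pipeline()
  .setStages(Array(tokenizer, hashingTF, labelIndexer, dt))
```

This avoids rebuilding the schema by hand, at the cost of an extra pipeline stage.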
Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier
Xiangrui,

Do you have any idea how to make this work?

Thanks
- Terry

Terry Hole wrote on Sunday, September 6, 2015 at 17:41:
> Sean
>
> Do you know how to tell decision tree that the "label" is a binary or set
> some attributes to dataframe to carry number of classes?
>
> Thanks!
> - Terry
Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier
Hi, Owen,

The dataframe "training" is built from an RDD of a case class, RDD[LabeledDocument], where the case class is defined as:

case class LabeledDocument(id: Long, text: String, label: Double)

So there is already a default "label" column with "double" type.

I already tried to set the label column for the decision tree explicitly:

val lr = new DecisionTreeClassifier()
  .setMaxDepth(5).setMaxBins(32).setImpurity("gini")
  .setLabelCol("label")

It raised the same error.

I also tried changing "label" to "int" type; that reported the error below instead. I have no idea how to make this work.

java.lang.IllegalArgumentException: requirement failed: Column label must be of type DoubleType but was actually IntegerType.
        at scala.Predef$.require(Predef.scala:233)
        at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:37)
        at org.apache.spark.ml.PredictorParams$class.validateAndTransformSchema(Predictor.scala:53)
        at org.apache.spark.ml.Predictor.validateAndTransformSchema(Predictor.scala:71)
        at org.apache.spark.ml.Predictor.transformSchema(Predictor.scala:116)
        at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:162)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:51)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:60)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:108)
        at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:162)
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:59)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:116)
        (REPL $iwC frames omitted)

Thanks!
- Terry

On Sun, Sep 6, 2015 at 4:53 PM, Sean Owen wrote:
> I think somewhere along the line you've not specified your label
> column -- it's defaulting to "label" and it does not recognize it, or
> at least not as a binary or nominal attribute.
Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier
I think somewhere along the line you've not specified your label
column -- it's defaulting to "label" and it does not recognize it, or
at least not as a binary or nominal attribute.

On Sun, Sep 6, 2015 at 5:47 AM, Terry Hole wrote:
> Hi, Experts,
>
> I followed the guide of the spark ml pipeline to test DecisionTreeClassifier on
> the spark shell with spark 1.4.1, but I always get an error like the following. Do
> you have any idea how to fix this?
>
> The error stack:
> java.lang.IllegalArgumentException: DecisionTreeClassifier was given input
> with invalid label column label, without the number of classes specified.
> See StringIndexer.
>         at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:71)
>         at org.apache.spark.ml.classification.DecisionTreeClassifier.train(DecisionTreeClassifier.scala:41)
>         at org.apache.spark.ml.Predictor.fit(Predictor.scala:90)
>         at org.apache.spark.ml.Predictor.fit(Predictor.scala:71)
>         at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:133)
>         at org.apache.spark.ml.Pipeline$$anonfun$fit$2.apply(Pipeline.scala:129)
>         at scala.collection.Iterator$class.foreach(Iterator.scala:727)
>         at scala.collection.AbstractIterator.foreach(Iterator.scala:1157)
>         at scala.collection.IterableViewLike$Transformed$class.foreach(IterableViewLike.scala:42)
>         at scala.collection.SeqViewLike$AbstractTransformed.foreach(SeqViewLike.scala:43)
>         at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:129)
>         (REPL $iwC frames omitted)
>
> The execute code is:
> // Labeled and unlabeled instance types.
> // Spark SQL can infer schema from case classes.
> case class LabeledDocument(id: Long, text: String, label: Double)
> case class Document(id: Long, text: String)
>
> // Prepare training documents, which are labeled.
> val training = sc.parallelize(Seq(
>   LabeledDocument(0L, "a b c d e spark", 1.0),
>   LabeledDocument(1L, "b d", 0.0),
>   LabeledDocument(2L, "spark f g h", 1.0),
>   LabeledDocument(3L, "hadoop mapreduce", 0.0)))
>
> // Configure an ML pipeline, which consists of three stages: tokenizer,
> // hashingTF, and lr.
> val tokenizer = new Tokenizer().setInputCol("text").setOutputCol("words")
> val hashingTF = new HashingTF().setNumFeatures(1000)
>   .setInputCol(tokenizer.getOutputCol).setOutputCol("features")
> val lr = new DecisionTreeClassifier()
>   .setMaxDepth(5).setMaxBins(32).setImpurity("gini")
> val pipeline = new Pipeline().setStages(Array(tokenizer, hashingTF, lr))
>
> // Error raises from the following line
> val model = pipeline.fit(training.toDF)

---------------------------------------------------------------------
To unsubscribe, e-mail: user-unsubscr...@spark.apache.org
For additional commands, e-mail: user-h...@spark.apache.org
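Sean's diagnosis can be checked directly in the shell: the classifier consults ML attribute metadata on the label column, and a schema inferred by reflection from a case class carries none, so the number of classes is unknown. A small hypothetical inspection sketch:

```scala
// Inspect the column metadata that DecisionTreeClassifier reads.
// For a schema inferred from a plain case class this metadata is empty,
// which is why the classifier cannot determine the number of classes.
val df = training.toDF()
println(df.schema("label").metadata)
```

If the printed metadata contains no ml_attr entry, attaching a nominal attribute (or indexing the label) is what is missing.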
Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier
(Sean)
The error suggests that the type is not a binary or nominal attribute
though. I think that's the missing step. A double-valued column need
not be one of these attribute types.

On Sun, Sep 6, 2015 at 10:14 AM, Terry Hole wrote:
> Hi, Owen,
>
> The dataframe "training" is from an RDD of case class: RDD[LabeledDocument],
> while the case class is defined as this:
> case class LabeledDocument(id: Long, text: String, label: Double)
>
> So there is already a default "label" column with "double" type.
>
> I already tried to set the label column for decision tree as this:
> val lr = new DecisionTreeClassifier().setMaxDepth(5).setMaxBins(32)
>   .setImpurity("gini").setLabelCol("label")
> It raised the same error.
>
> I also tried to change the "label" to "int" type; it reported:
> java.lang.IllegalArgumentException: requirement failed: Column label must be
> of type DoubleType but was actually IntegerType.
Re: Meets "java.lang.IllegalArgumentException" when test spark ml pipe with DecisionTreeClassifier
Sean,

Do you know how to tell the decision tree that the "label" is binary, or how to set some attributes on the dataframe to carry the number of classes?

Thanks!
- Terry

On Sun, Sep 6, 2015 at 5:23 PM, Sean Owen wrote:
> (Sean)
> The error suggests that the type is not a binary or nominal attribute
> though. I think that's the missing step. A double-valued column need
> not be one of these attribute types.