Yanbo Liang created SPARK-16750:
-----------------------------------

             Summary: ML GaussianMixture training failed due to feature column type mistake
                 Key: SPARK-16750
                 URL: https://issues.apache.org/jira/browse/SPARK-16750
             Project: Spark
          Issue Type: Bug
          Components: ML
            Reporter: Yanbo Liang
            Assignee: Yanbo Liang
ML GaussianMixture training fails due to a feature column type mistake: the feature column type should be {{ml.linalg.VectorUDT}}, but {{mllib.linalg.VectorUDT}} was used by mistake. The bug is easy to reproduce with the following code:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(Seq(
    (1, Vectors.dense(0.0, 1.0, 4.0)),
    (2, Vectors.dense(1.0, 0.0, 4.0)),
    (3, Vectors.dense(1.0, 0.0, 5.0)),
    (4, Vectors.dense(0.0, 0.0, 5.0)))
  ).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("features_scaled")
  .setMin(0.0)
  .setMax(5.0)
val gmm = new GaussianMixture()
  .setFeaturesCol("features_scaled")
  .setK(2)
val pipeline = new Pipeline().setStages(Array(scaler, gmm))
pipeline.fit(df)

java.lang.IllegalArgumentException: requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
  at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
  at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
  at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
{code}
The reason this bug was not found during unit tests is that some estimators/transformers missed calling {{transformSchema(dataset.schema)}} first during {{fit}} or {{transform}}. I added that call to all estimators/transformers that were missing it.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
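The "validate the schema before doing any work" pattern behind the fix can be sketched in plain Scala without a Spark dependency. This is only an illustration: {{Field}}, {{Schema}}, and {{SimpleEstimator}} are hypothetical stand-ins for Spark's {{StructType}} and {{PipelineStage}} API, and the type check mirrors what {{SchemaUtils.checkColumnType}} does.

```scala
// Toy stand-ins for StructType/StructField (hypothetical, not Spark's API).
case class Field(name: String, dataType: String)
case class Schema(fields: Seq[Field]) {
  def apply(name: String): Field =
    fields.find(_.name == name)
      .getOrElse(throw new IllegalArgumentException(s"no column $name"))
}

class SimpleEstimator(featuresCol: String) {
  // Mirrors SchemaUtils.checkColumnType: fail fast on a wrong column type.
  def transformSchema(schema: Schema): Schema = {
    val actual = schema(featuresCol).dataType
    require(actual == "ml.linalg.VectorUDT",
      s"Column $featuresCol must be of type ml.linalg.VectorUDT " +
      s"but was actually $actual")
    schema
  }

  // The guard SPARK-16750 adds: fit validates the schema before training,
  // so a wrong column type surfaces immediately, not mid-pipeline.
  def fit(schema: Schema): String = {
    transformSchema(schema) // validate first
    "model"                 // training would happen here
  }
}
```

With this guard in place, fitting on a column typed like the old {{mllib.linalg.VectorUDT}} fails immediately with a `requirement failed` message, which is the behavior the unit tests would have needed to catch the original bug.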