[ https://issues.apache.org/jira/browse/SPARK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Yanbo Liang updated SPARK-16750:
--------------------------------
    Description:

ML GaussianMixture training failed due to a feature column type mistake: the feature column type should be {{ml.linalg.VectorUDT}}, but {{mllib.linalg.VectorUDT}} was used by mistake. This bug is easy to reproduce with the following code:

{code}
val df = spark.createDataFrame(
  Seq(
    (1, Vectors.dense(0.0, 1.0, 4.0)),
    (2, Vectors.dense(1.0, 0.0, 4.0)),
    (3, Vectors.dense(1.0, 0.0, 5.0)),
    (4, Vectors.dense(0.0, 0.0, 5.0)))
).toDF("id", "features")
val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("features_scaled")
  .setMin(0.0)
  .setMax(5.0)
val gmm = new GaussianMixture()
  .setFeaturesCol("features_scaled")
  .setK(2)
val pipeline = new Pipeline().setStages(Array(scaler, gmm))
pipeline.fit(df)
{code}

Running it fails with:

{code}
java.lang.IllegalArgumentException: requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
  at scala.Predef$.require(Predef.scala:224)
  at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
  at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
  at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
  at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
  at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
  at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
  at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
  at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
  at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
  at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
  at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
{code}

The reason this bug was not caught by unit tests is that some estimators/transformers missed calling {{transformSchema(dataset.schema)}} first during {{fit}} or {{transform}}. I added the call to all estimators/transformers that were missing it.

  was: ML GaussianMixture training failed due to feature column type mistake. The feature column type should be {{ml.linalg.VectorUDT}} but got {{mllib.linalg.VectorUDT}} by mistake.
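As context for the fix, a minimal sketch of the pattern described above: {{transformSchema}}, {{fit}}, and {{SchemaUtils.checkColumnType}} are the methods visible in the stack trace, but the classes below are illustrative stand-ins, not the actual Spark source (the real {{SchemaUtils}} is internal to Spark, so a simplified equivalent is inlined here).

{code}
import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.sql.Dataset
import org.apache.spark.sql.types.StructType

// Hypothetical stand-in for Spark's internal SchemaUtils.checkColumnType,
// included so the sketch is self-contained.
object ExampleSchemaUtils {
  def checkColumnType(schema: StructType, colName: String): Unit = {
    val actual = schema(colName).dataType
    require(actual.isInstanceOf[VectorUDT],
      s"Column $colName must be of type ${new VectorUDT} but was actually $actual.")
  }
}

class ExampleStage {
  // Mirrors validateAndTransformSchema: the check must use the new
  // ml.linalg.VectorUDT, not the old mllib.linalg.VectorUDT.
  def transformSchema(schema: StructType): StructType = {
    ExampleSchemaUtils.checkColumnType(schema, "features")
    schema
  }

  // The fix: call transformSchema(dataset.schema) first, so a wrong column
  // type fails fast with a clear message instead of surfacing mid-training.
  def fit(dataset: Dataset[_]): Unit = {
    transformSchema(dataset.schema)
    // ... training would go here ...
  }
}
{code}

With this check in place at the top of every {{fit}}/{{transform}}, the {{mllib.linalg.VectorUDT}} mismatch above would have been caught by any unit test that exercised the stage.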
> ML GaussianMixture training failed due to feature column type mistake
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16750
>                 URL: https://issues.apache.org/jira/browse/SPARK-16750
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org