Yanbo Liang created SPARK-16750:
-----------------------------------

             Summary: ML GaussianMixture training failed due to feature column type mistake
                 Key: SPARK-16750
                 URL: https://issues.apache.org/jira/browse/SPARK-16750
             Project: Spark
          Issue Type: Bug
          Components: ML
            Reporter: Yanbo Liang
            Assignee: Yanbo Liang


ML GaussianMixture training fails because of a feature column type mistake. 
The schema check should expect {{ml.linalg.VectorUDT}} for the feature column, 
but it checks for {{mllib.linalg.VectorUDT}} by mistake.
This bug is easy to reproduce with the following code:
{code}
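import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

// `spark` is the SparkSession provided by spark-shell.
val df = spark.createDataFrame(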
  Seq(
    (1, Vectors.dense(0.0, 1.0, 4.0)),
    (2, Vectors.dense(1.0, 0.0, 4.0)),
    (3, Vectors.dense(1.0, 0.0, 5.0)),
    (4, Vectors.dense(0.0, 0.0, 5.0)))
).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("features_scaled")
  .setMin(0.0)
  .setMax(5.0)

val gmm = new GaussianMixture()
  .setFeaturesCol("features_scaled")
  .setK(2)

val pipeline = new Pipeline().setStages(Array(scaler, gmm))
pipeline.fit(df)

requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
java.lang.IllegalArgumentException: requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
        at scala.Predef$.require(Predef.scala:224)
        at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
        at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
        at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
        at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
        at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
        at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
        at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
{code}
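This bug was not caught by unit tests because some estimators/transformers 
did not first call {{transformSchema(dataset.schema)}} in {{fit}} or 
{{transform}}, so their schema checks were never exercised when they were used 
directly. In the stack trace above, the check runs only because 
{{Pipeline.fit}} calls {{transformSchema}} on each stage. I added the missing 
call to every estimator/transformer that lacked it.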
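
As a rough illustration of the pattern described above (not the actual patch), a transformer can validate the schema as the first step of {{transform}}. The class {{FeaturesPassThrough}} and its hard-coded {{features}} column are hypothetical and exist only for this sketch. It assumes Spark 2.x, where {{SQLDataTypes.VectorType}} is the public handle for the {{ml.linalg}} vector type.

```scala
import org.apache.spark.ml.Transformer
import org.apache.spark.ml.linalg.SQLDataTypes.VectorType
import org.apache.spark.ml.param.ParamMap
import org.apache.spark.ml.util.Identifiable
import org.apache.spark.sql.{DataFrame, Dataset}
import org.apache.spark.sql.types.StructType

// Hypothetical transformer, for illustration only; not part of Spark.
class FeaturesPassThrough(override val uid: String) extends Transformer {

  def this() = this(Identifiable.randomUID("featuresPassThrough"))

  // Schema check: the "features" column must be an ml.linalg vector.
  override def transformSchema(schema: StructType): StructType = {
    val actual = schema("features").dataType
    require(actual == VectorType,
      s"Column features must be of type $VectorType but was actually $actual.")
    schema
  }

  override def transform(dataset: Dataset[_]): DataFrame = {
    // Validate first, so a wrong column type fails here even when the
    // transformer is called directly rather than through a Pipeline.
    transformSchema(dataset.schema, logging = true)
    dataset.toDF()
  }

  override def copy(extra: ParamMap): FeaturesPassThrough = defaultCopy(extra)
}
```

With this structure, a unit test that calls {{transform}} directly goes through the same type check that {{Pipeline.fit}} triggers via {{transformSchema}}, so a wrong expected type would show up in either case.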


