[ https://issues.apache.org/jira/browse/SPARK-16750?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yanbo Liang updated SPARK-16750:
--------------------------------
    Description: 
ML GaussianMixture training fails due to a feature column type mistake: the feature column should be of type {{ml.linalg.VectorUDT}}, but it is validated against {{mllib.linalg.VectorUDT}} by mistake.
This bug is easy to reproduce with the following code:
{code}
import org.apache.spark.ml.Pipeline
import org.apache.spark.ml.clustering.GaussianMixture
import org.apache.spark.ml.feature.MinMaxScaler
import org.apache.spark.ml.linalg.Vectors

val df = spark.createDataFrame(
  Seq(
    (1, Vectors.dense(0.0, 1.0, 4.0)),
    (2, Vectors.dense(1.0, 0.0, 4.0)),
    (3, Vectors.dense(1.0, 0.0, 5.0)),
    (4, Vectors.dense(0.0, 0.0, 5.0)))
).toDF("id", "features")

val scaler = new MinMaxScaler()
  .setInputCol("features")
  .setOutputCol("features_scaled")
  .setMin(0.0)
  .setMax(5.0)

val gmm = new GaussianMixture()
  .setFeaturesCol("features_scaled")
  .setK(2)

val pipeline = new Pipeline().setStages(Array(scaler, gmm))
pipeline.fit(df)
{code}

Running {{pipeline.fit(df)}} fails with:

{code}
requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
java.lang.IllegalArgumentException: requirement failed: Column features_scaled must be of type org.apache.spark.mllib.linalg.VectorUDT@f71b0bce but was actually org.apache.spark.ml.linalg.VectorUDT@3bfc3ba7.
        at scala.Predef$.require(Predef.scala:224)
        at org.apache.spark.ml.util.SchemaUtils$.checkColumnType(SchemaUtils.scala:42)
        at org.apache.spark.ml.clustering.GaussianMixtureParams$class.validateAndTransformSchema(GaussianMixture.scala:64)
        at org.apache.spark.ml.clustering.GaussianMixture.validateAndTransformSchema(GaussianMixture.scala:275)
        at org.apache.spark.ml.clustering.GaussianMixture.transformSchema(GaussianMixture.scala:342)
        at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
        at org.apache.spark.ml.Pipeline$$anonfun$transformSchema$4.apply(Pipeline.scala:180)
        at scala.collection.IndexedSeqOptimized$class.foldl(IndexedSeqOptimized.scala:57)
        at scala.collection.IndexedSeqOptimized$class.foldLeft(IndexedSeqOptimized.scala:66)
        at scala.collection.mutable.ArrayOps$ofRef.foldLeft(ArrayOps.scala:186)
        at org.apache.spark.ml.Pipeline.transformSchema(Pipeline.scala:180)
        at org.apache.spark.ml.PipelineStage.transformSchema(Pipeline.scala:70)
        at org.apache.spark.ml.Pipeline.fit(Pipeline.scala:132)
{code}
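The failing frame is {{GaussianMixtureParams.validateAndTransformSchema}} (GaussianMixture.scala:64). A minimal sketch of the likely one-line fix inside that Spark-internal check, assuming the validation simply picked up the wrong {{VectorUDT}} import:
{code}
// Sketch of the column check in validateAndTransformSchema.
// Before (buggy): the old RDD-based vector type was imported by mistake.
//   import org.apache.spark.mllib.linalg.VectorUDT
// After (fixed): validate against the DataFrame-based vector type that
// ml estimators/transformers actually produce and consume.
import org.apache.spark.ml.linalg.VectorUDT
import org.apache.spark.ml.util.SchemaUtils

SchemaUtils.checkColumnType(schema, $(featuresCol), new VectorUDT)
{code}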
The reason this bug was not caught by the unit tests is that some estimators/transformers did not call {{transformSchema(dataset.schema)}} first during {{fit}} or {{transform}}. I added the call to all estimators/transformers that were missing it.
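For illustration, a condensed sketch of that early-validation pattern in an estimator's {{fit}} (training body elided; {{transformSchema(schema, logging)}} is the protected variant {{PipelineStage}} provides):
{code}
override def fit(dataset: Dataset[_]): GaussianMixtureModel = {
  // Validate the input schema before touching any data, so a wrong
  // column type fails fast here with a clear message instead of
  // surfacing later from a downstream pipeline stage.
  transformSchema(dataset.schema, logging = true)
  // ... actual training ...
}
{code}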


> ML GaussianMixture training failed due to feature column type mistake
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16750
>                 URL: https://issues.apache.org/jira/browse/SPARK-16750
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>            Reporter: Yanbo Liang
>            Assignee: Yanbo Liang
>


