[ 
https://issues.apache.org/jira/browse/SPARK-20604?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Wayne Zhang updated SPARK-20604:
--------------------------------
    Description: 
Imputer currently requires the input column to be of type Double or Float, but
its logic should work on any numeric data type. Many practical problems involve
integer data, and manually casting it to Double before calling Imputer quickly
becomes tedious. The transformer could be extended to handle all numeric types.

The example below shows Imputer failing on integer data.
{code}
    import org.apache.spark.ml.feature.Imputer
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.IntegerType

    val df = spark.createDataFrame(Seq(
      (0, 1.0, 1.0, 1.0),
      (1, 11.0, 11.0, 11.0),
      (2, 1.5, 1.5, 1.5),
      (3, Double.NaN, 4.5, 1.5)
    )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")
    val imputer = new Imputer()
      .setInputCols(Array("value1"))
      .setOutputCols(Array("out1"))
    // Casting the input column to IntegerType makes fit() throw the exception below.
    imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))

java.lang.IllegalArgumentException: requirement failed: Column value1 must be 
of type equal to one of the following types: [DoubleType, FloatType] but was 
actually of type IntegerType.

{code}
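
Until Imputer accepts all numeric types, the manual cast mentioned above can be 
wrapped in a small helper. The following is only a sketch of that workaround, 
not part of the proposed change: the object name ImputerCastSketch, the helper 
castToDouble, and the local SparkSession settings are assumptions made for 
illustration. 
{code}
    import org.apache.spark.ml.feature.Imputer
    import org.apache.spark.sql.{DataFrame, SparkSession}
    import org.apache.spark.sql.functions.col
    import org.apache.spark.sql.types.{DoubleType, NumericType}

    object ImputerCastSketch {
      // Hypothetical helper: cast the named numeric columns to DoubleType and
      // leave every other column untouched.
      def castToDouble(df: DataFrame, cols: Seq[String]): DataFrame =
        cols.foldLeft(df) { (acc, name) =>
          acc.schema(name).dataType match {
            case DoubleType => acc
            case _: NumericType => acc.withColumn(name, col(name).cast(DoubleType))
            case other => throw new IllegalArgumentException(
              s"Column $name has non-numeric type $other")
          }
        }

      def main(args: Array[String]): Unit = {
        val spark = SparkSession.builder()
          .master("local[1]")
          .appName("imputer-cast-sketch")
          .getOrCreate()
        import spark.implicits._

        // Integer column with one missing (null) value.
        val df = Seq[(Int, Option[Int])]((0, Some(1)), (1, Some(11)), (2, None))
          .toDF("id", "value1")

        val imputer = new Imputer()
          .setInputCols(Array("value1"))
          .setOutputCols(Array("out1"))

        // Cast first, then fit and transform on the Double-typed column.
        val casted = castToDouble(df, Seq("value1"))
        imputer.fit(casted).transform(casted).show()

        spark.stop()
      }
    }
{code}
Note that the imputed output column is Double even though the original data was 
integer, so callers who need integers back must cast again after transform. 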



  was:
Imputer currently requires the input column to be of type Double or Float, but
its logic should work on any numeric data type. Many practical problems involve
integer data, and manually casting it to Double before calling Imputer quickly
becomes tedious. The transformer could be extended to handle all numeric types.

The example below shows failure of Bucketizer on integer data.
{code}
    val df = spark.createDataFrame( Seq(
      (0, 1.0, 1.0, 1.0),
      (1, 11.0, 11.0, 11.0),
      (2, 1.5, 1.5, 1.5),
      (3, Double.NaN, 4.5, 1.5)
    )).toDF("id", "value1", "expected_mean_value1", "expected_median_value1")
    val imputer = new Imputer()
      .setInputCols(Array("value1"))
      .setOutputCols(Array("out1"))
    imputer.fit(df.withColumn("value1", col("value1").cast(IntegerType)))

java.lang.IllegalArgumentException: requirement failed: Column value1 must be 
of type equal to one of the following types: [DoubleType, FloatType] but was 
actually of type IntegerType.

{code}




> Allow Imputer to handle all numeric types
> -----------------------------------------
>
>                 Key: SPARK-20604
>                 URL: https://issues.apache.org/jira/browse/SPARK-20604
>             Project: Spark
>          Issue Type: Bug
>          Components: ML
>    Affects Versions: 2.1.0
>            Reporter: Wayne Zhang
>            Assignee: Apache Spark
>


