[GitHub] spark pull request #17864: [SPARK-20604][ML] Allow imputer to handle numeric...

2017-05-25 Thread actuaryzhang
GitHub user actuaryzhang commented on a diff in the pull request:

https://github.com/apache/spark/pull/17864#discussion_r118600408
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -94,12 +94,13 @@ private[feature] trait ImputerParams extends Params with HasInputCols {
  * :: Experimental ::
  * Imputation estimator for completing missing values, either using the mean or the median
  * of the columns in which the missing values are located. The input columns should be of
- * DoubleType or FloatType. Currently Imputer does not support categorical features
+ * numeric type. Currently Imputer does not support categorical features
  * (SPARK-15041) and possibly creates incorrect values for a categorical feature.
  *
  * Note that the mean/median value is computed after filtering out missing values.
  * All Null values in the input columns are treated as missing, and so are also imputed. For
  * computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.
+ * The output column is always of Double type regardless of the input column type.
--- End diff --

@MLnick Here is the note on always returning Double type. 
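
For readers skimming the thread, the behaviour that doc comment describes (mean taken over non-missing values only, nulls imputed, output always Double) can be sketched in plain Scala. This is only an illustration of the documented semantics, not Spark's implementation: `ImputeSketch` and `imputeMean` are hypothetical names, and `None` stands in for a null cell.

```scala
object ImputeSketch {
  // Mean imputation over any numeric element type T.
  // None models a missing (null) value; the result is always Seq[Double],
  // mirroring the "output column is always of Double type" note in the diff.
  def imputeMean[T](values: Seq[Option[T]])(implicit num: Numeric[T]): Seq[Double] = {
    // Filter out missing values first, then widen the rest to Double.
    val present = values.flatten.map(num.toDouble)
    require(present.nonEmpty, "cannot impute a column with no observed values")
    val mean = present.sum / present.size
    // Keep observed values (as Double) and replace missing ones with the mean.
    values.map(_.map(num.toDouble).getOrElse(mean))
  }
}
```

Because the function is generic over `Numeric[T]`, the same call works for `Int`, `Long`, `Float` or `Double` input, which is the kind of widening the new doc line refers to.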


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #17864: [SPARK-20604][ML] Allow imputer to handle numeric...

2017-05-04 Thread actuaryzhang
GitHub user actuaryzhang opened a pull request:

https://github.com/apache/spark/pull/17864

[SPARK-20604][ML] Allow imputer to handle numeric types

## What changes were proposed in this pull request?

Imputer currently requires input columns to be Double or Float, but the 
logic should work on any numeric data type. Many practical problems have 
integer data types, and it gets tedious to cast them to Double manually 
before calling Imputer. This PR extends Imputer to handle all numeric 
types.
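
For context, the manual cast mentioned above looks roughly like the following. It is a minimal sketch, assuming a DataFrame with an integer column; the object name `ImputerCastSketch`, the helper `imputeWithCast`, and the column names in the usage note are hypothetical and not part of the patch.

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.col
import org.apache.spark.sql.types.DoubleType

object ImputerCastSketch {
  // Workaround needed while Imputer only accepts Double/Float input:
  // cast the column to Double, then fit and apply a mean Imputer.
  def imputeWithCast(df: DataFrame, inputCol: String, outputCol: String): DataFrame = {
    // Replace the (e.g. integer) column with a Double-typed copy.
    val casted = df.withColumn(inputCol, col(inputCol).cast(DoubleType))

    val imputer = new Imputer()
      .setInputCols(Array(inputCol))
      .setOutputCols(Array(outputCol))
      .setStrategy("mean")

    // fit() computes the mean over non-missing values; transform() fills the gaps.
    imputer.fit(casted).transform(casted)
  }
}
```

A call such as `ImputerCastSketch.imputeWithCast(df, "age", "age_imputed")` would have to be repeated (or looped) for every integer column, which is the tedium this PR aims to remove.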

## How was this patch tested?
new test

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/actuaryzhang/spark imputer

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/17864.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #17864


commit e9ab39c2bdca76dae2b5cc40f90e4f5b2f9416c8
Author: Wayne Zhang 
Date:   2017-05-04T22:18:46Z

allow imputer to handle numeric types



