[GitHub] spark pull request #17864: [SPARK-20604][ML] Allow imputer to handle numeric...
Github user actuaryzhang commented on a diff in the pull request:

    https://github.com/apache/spark/pull/17864#discussion_r118600408

    --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
    @@ -94,12 +94,13 @@ private[feature] trait ImputerParams extends Params with HasInputCols {
      * :: Experimental ::
      * Imputation estimator for completing missing values, either using the mean or the median
      * of the columns in which the missing values are located. The input columns should be of
    - * DoubleType or FloatType. Currently Imputer does not support categorical features
    + * numeric type. Currently Imputer does not support categorical features
      * (SPARK-15041) and possibly creates incorrect values for a categorical feature.
      *
      * Note that the mean/median value is computed after filtering out missing values.
      * All Null values in the input columns are treated as missing, and so are also imputed. For
      * computing median, DataFrameStatFunctions.approxQuantile is used with a relative error of 0.001.
    + * The output column is always of Double type regardless of the input column type.
    --- End diff --

    @MLnick Here is the note on always returning Double type.

---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #17864: [SPARK-20604][ML] Allow imputer to handle numeric...
GitHub user actuaryzhang opened a pull request:

    https://github.com/apache/spark/pull/17864

    [SPARK-20604][ML] Allow imputer to handle numeric types

    ## What changes were proposed in this pull request?

    Imputer currently requires input column to be Double or Float, but the logic should work on any numeric data types. Many practical problems have integer data types, and it could get very tedious to manually cast them into Double before calling imputer. This transformer could be extended to handle all numeric types.

    ## How was this patch tested?

    new test

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/actuaryzhang/spark imputer

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/17864.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #17864

----
commit e9ab39c2bdca76dae2b5cc40f90e4f5b2f9416c8
Author: Wayne Zhang
Date:   2017-05-04T22:18:46Z

    allow imputer to handle numeric types

----
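
For readers who want to see the intended semantics in isolation, here is a minimal plain-Scala sketch of mean imputation that accepts any numeric element type and always returns Double values, as the pull request proposes. It does not use Spark and is not the Imputer implementation: `ImputeSketch` and `imputeMean` are hypothetical names, and `None` stands in for a null cell as an assumption made for illustration.

```scala
// Hypothetical sketch, not Spark's Imputer: mean imputation over any numeric type.
object ImputeSketch {
  // `None` models a missing (null) cell; the result is always Seq[Double].
  def imputeMean[T](column: Seq[Option[T]])(implicit num: Numeric[T]): Seq[Double] = {
    // Widen the observed values to Double so Int, Long, Float, etc. share one code path.
    val present = column.collect { case Some(v) => num.toDouble(v) }
    require(present.nonEmpty, "cannot impute a column with no observed values")
    // The surrogate is computed only from the values that are present.
    val mean = present.sum / present.size
    // Keep observed values (as Double) and fill each missing cell with the mean.
    column.map {
      case Some(v) => num.toDouble(v)
      case None    => mean
    }
  }
}

object ImputeSketchDemo {
  def main(args: Array[String]): Unit = {
    // An integer column with one missing entry; no manual cast is needed by the caller.
    val counts: Seq[Option[Int]] = Seq(Some(1), None, Some(4))
    println(ImputeSketch.imputeMean(counts))
  }
}
```

Widening to Double inside the helper is what removes the need for a manual cast at the call site, and it is also why the output element type cannot follow the input type: the mean of integer values is generally not an integer.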