[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-220254380 @hhbyyh really sorry this slipped in the rush around 2.0 QA. I'm afraid it will have to be revisited after the 2.0 release! But I think it should be in good shape to merge soon after that happens.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216485692 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216485606 **[Test build #57620 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57620/consoleFull)** for PR 11601 at commit [`4e07431`](https://github.com/apache/spark/commit/4e0743139796ac53df2554cfa53736b8035bae15). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216485694 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57620/ Test PASSed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216477659 **[Test build #57620 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57620/consoleFull)** for PR 11601 at commit [`4e07431`](https://github.com/apache/spark/commit/4e0743139796ac53df2554cfa53736b8035bae15).
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216472078 @MLnick Yes, I've been preparing the doc and examples and they are almost ready.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216462472 @hhbyyh I think this is far enough along that I'm comfortable putting it in 2.0, unless @jkbradley or anyone else has an objection. However, this will need an entry in the HTML doc, as well as a small example. Will you be able to ensure those are completed during the initial QA period?
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216462512 @hhbyyh made a small doc comment too.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61848348 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. + * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature. " + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean or the median + * of the column in which the missing values are located. The input column should be of + * DoubleType or FloatType. + * + * Note that the mean/median value is computed after filtering out missing values. + * All Null values in the input column are treated as missing, and so are also imputed. 
+ */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mean", missingValue -> Double.NaN) + + override def fit(dataset: Dataset[_]): ImputerModel = { +transformSchema(dataset.schema, logging = true) +val ic = col($(inputCol))
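A minimal usage sketch of the Imputer API quoted in the diff above, for readers following the review. It assumes this PR's patch is applied and a `sqlContext` is in scope (as in the test suite); the data and column names are purely illustrative.

```scala
import org.apache.spark.ml.feature.Imputer

// Toy data: -1.0 in the last row stands in for a missing value.
val df = sqlContext.createDataFrame(Seq(
  (0, 1.0),
  (1, 3.0),
  (2, 4.0),
  (3, -1.0)
)).toDF("id", "value")

// Replace every occurrence of missingValue (-1.0 here) with the median
// of the remaining values in the input column.
val imputer = new Imputer()
  .setInputCol("value")
  .setOutputCol("value_imputed")
  .setStrategy("median")
  .setMissingValue(-1.0)

val model = imputer.fit(df)   // computes the surrogate: the median of 1.0, 3.0, 4.0
model.transform(df).show()    // -1.0 in "value_imputed" is replaced by 3.0
```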
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216435795 @MLnick @jkbradley Is it possible to make this into 2.0? Thanks.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216011080 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57459/ Test PASSed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216011066 **[Test build #57459 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57459/consoleFull)** for PR 11601 at commit [`cca8dd4`](https://github.com/apache/spark/commit/cca8dd41714d79476c2bf23f706012a282c53bcb). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216011079 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-216008178 **[Test build #57459 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57459/consoleFull)** for PR 11601 at commit [`cca8dd4`](https://github.com/apache/spark/commit/cca8dd41714d79476c2bf23f706012a282c53bcb).
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215975928 Thanks for the review @MLnick. I'll create the JIRA now.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215719811 @hhbyyh if you have time, could you create follow-up JIRAs for:
* PySpark impl
* adding mode at a later date for categorical features
* investigate efficiency of approaches using DataFrame/Dataset and/or approx approaches such as `frequentItems` or Count-Min Sketch (will require an update to CMS to return "heavy-hitters").
* investigate if we can use metadata to only allow mode for categorical features (or perhaps as an easier alternative, allow mode for only Int/Long columns)
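To make the efficiency item above concrete, here is a rough DataFrame-based sketch of how the two surrogate statistics could be computed. This is not taken from the PR: the helper names and the categorical column are hypothetical, `approxQuantile` is the existing `DataFrameStatFunctions` method, and the mode helper is the naive exact baseline that `frequentItems` or a Count-Min Sketch approach would aim to improve on.

```scala
import org.apache.spark.sql.DataFrame
import org.apache.spark.sql.functions.desc

// Approximate median of a numeric column; the relative error (0.001) is illustrative.
def approxMedian(df: DataFrame, col: String): Double =
  df.stat.approxQuantile(col, Array(0.5), 0.001).head

// Naive exact mode of a (hypothetical) categorical column: the baseline an
// approximate heavy-hitters approach would be compared against.
def exactMode(df: DataFrame, col: String): Any =
  df.groupBy(col).count().orderBy(desc("count")).head().get(0)
```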
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215715233 A few small final comments, otherwise LGTM. @jkbradley will leave open for a day or two in case you want to take a final pass.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61577109 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean or the + * median of the column in which the missing values are located. InputCol should be --- End diff -- I prefer "The input column" rather than "InputCol".
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61576752 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0) +)).toDF("id", "value", "expected_mean", "expected_median") +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select("expected_" + strategy, "out").collect().foreach { + case Row(exp: Double, out: Double) => + assert(exp ~== out absTol 1e-5, s"Imputed values differ. Expected: $exp, actual: $out") + } +} + } + + test("Imputer for Double with missing Value -1.0 and contains NaN") { --- End diff -- Can we call this `Imputer should handle NaNs in data when 'missingValue' is not NaN` for clarity --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
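The behaviour the suggested test name captures, sketched with the same shape of data as the quoted test (assuming this PR's patch and a `sqlContext` in scope): when `missingValue` is set to -1.0, NaN is treated as ordinary data rather than as a missing-value marker.

```scala
import org.apache.spark.ml.feature.Imputer

val df = sqlContext.createDataFrame(Seq(
  (0, 1.0), (1, 3.0), (2, Double.NaN), (3, -1.0)
)).toDF("id", "value")

val model = new Imputer()
  .setInputCol("value")
  .setOutputCol("out")
  .setStrategy("mean")
  .setMissingValue(-1.0)
  .fit(df)

// Per the quoted test's expectations: -1.0 is replaced by the mean of the
// non-NaN, non-missing values (2.0), while the NaN row passes through unchanged.
model.transform(df).show()
```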
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61575765 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0) +)).toDF("id", "value", "expected_mean", "expected_median") +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select("expected_" + strategy, "out").collect().foreach { + case Row(exp: Double, out: Double) => + assert(exp ~== out absTol 1e-5, s"Imputed values differ. Expected: $exp, actual: $out") + } +} + } + + test("Imputer for Double with missing Value -1.0 and contains NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 3.0, 3.0, 3.0), + (2, Double.NaN, Double.NaN, Double.NaN), + (3, -1.0, 2.0, 3.0) +)).toDF("id", "value", "expected_mean", "expected_median") +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1.0) + val model = imputer.fit(df) + model.transform(df).select("expected_" + strategy, "out").collect().foreach { +case Row(exp: Double, out: Double) => + assert((exp.isNaN && out.isNaN) || (exp ~== out absTol 1e-5), +s"Imputed values differ. Expected: $exp, actual: $out") + } +} + } + + test("Imputer for Float with missing Value -1.0") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0F, 1.0F, 1.0F), + (1, 3.0F, 3.0F, 3.0F), + (2, 10.0F, 10.0F, 10.0F), + (3, 10.0F, 10.0F, 10.0F), + (4, -1.0F, 6.0F, 3.0F) +)).toDF("id", "value", "expected_mean", "expected_median") + +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1) + val model = imputer.fit(df) + val result = model.transform(df) + model.transform(df).select("expected_" + strategy, "out").collect().foreach { +case Row(exp: Float, out: Float) => + assert(exp == out, s"Imputed values differ. 
Expected: $exp, actual: $out") + } +} + } + + test("Imputer should impute null") { --- End diff -- Could we call this `Imputer should impute null as well as 'missingValue'` to clarify
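And for the null case discussed here, a condensed, illustrative version of the suite's null test (again assuming this PR's API and a `sqlContext` in scope): nulls are imputed exactly like the configured `missingValue` placeholder.

```scala
import org.apache.spark.ml.feature.Imputer

val df = sqlContext.createDataFrame(Seq(
  (0, 4.0), (1, 10.0), (2, 10.0), (3, Double.NaN), (4, -1.0)
)).toDF("id", "value")

// Turn the -1.0 placeholder into a real SQL null, as the suite does.
val withNulls = df.selectExpr("*", "IF(value = -1.0, null, value) AS nullable_value")

val model = new Imputer()
  .setInputCol("nullable_value")
  .setOutputCol("out")
  .setStrategy("mean")          // default missingValue is Double.NaN
  .fit(withNulls)

// Both the NaN row and the null row are imputed with the mean of {4.0, 10.0, 10.0} = 8.0.
model.transform(withNulls).show()
```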
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215656460 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215656462 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57321/ Test PASSed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215656374 **[Test build #57321 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57321/consoleFull)** for PR 11601 at commit [`335ded7`](https://github.com/apache/spark/commit/335ded7d92d7ab6f50e2d16348cfde051eb26579). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215650485 **[Test build #57321 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57321/consoleFull)** for PR 11601 at commit [`335ded7`](https://github.com/apache/spark/commit/335ded7d92d7ab6f50e2d16348cfde051eb26579).
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215615452 **[Test build #57298 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57298/consoleFull)** for PR 11601 at commit [`aef094b`](https://github.com/apache/spark/commit/aef094bc7b7a00c0ded1b2998b7f98d2bc42c666). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215615467 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57298/ Test FAILed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215615465 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215613222 **[Test build #57298 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57298/consoleFull)** for PR 11601 at commit [`aef094b`](https://github.com/apache/spark/commit/aef094bc7b7a00c0ded1b2998b7f98d2bc42c666).
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215503334 Other than a couple minor comments, this LGTM pending tests.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61467752 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,122 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0) +)).toDF("id", "value", "expected_mean", "expected_median") +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select("exp_" + strategy, "out").collect().foreach { + case Row(exp: Double, out: Double) => + assert(exp ~== out absTol 1e-5, s"Imputed values differ. Expected: $exp, actual: $out") + } +} + } + + test("Imputer for Double with missing Value -1.0 and contains NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 3.0, 3.0, 3.0), + (2, Double.NaN, Double.NaN, Double.NaN), + (3, -1.0, 2.0, 3.0) +)).toDF("id", "value", "expected_mean", "expected_median") +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1.0) + val model = imputer.fit(df) + model.transform(df).select("exp_" + strategy, "out").collect().foreach { +case Row(exp: Double, out: Double) => + assert((exp.isNaN && out.isNaN) || (exp ~== out absTol 1e-5), +s"Imputed values differ. Expected: $exp, actual: $out") + } +} + } + + test("Imputer for Float with missing Value -1.0") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0F, 1.0F, 1.0F), + (1, 3.0F, 3.0F, 3.0F), + (2, 10.0F, 10.0F, 10.0F), + (3, 10.0F, 10.0F, 10.0F), + (4, -1.0F, 6.0F, 3.0F) +)).toDF("id", "value", "expected_mean", "expected_median") + +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1) + val model = imputer.fit(df) + val result = model.transform(df) + model.transform(df).select("exp_" + strategy, "out").collect().foreach { +case Row(exp: Float, out: Float) => + assert(exp == out, s"Imputed values differ. 
Expected: $exp, actual: $out") + } +} + } + + test("Imputer should impute null") { +val df = sqlContext.createDataFrame( Seq( + (0, 4.0, 4.0, 4.0), + (1, 10.0, 10.0, 10.0), + (2, 10.0, 10.0, 10.0), + (3, Double.NaN, 8.0, 10.0), + (4, -1.0, 8.0, 10.0) +)).toDF("id", "value", "expected_mean", "expected_median") +val df2 = df.selectExpr("*", "IF(value=-1.0, null, value) as nullable_value") +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("nullable_value").setOutputCol("out") +.setStrategy(strategy) + val model = imputer.fit(df2) + model.transform(df2).select("exp_" + strategy, "out").collect().foreach { --- End diff -- As Jenkins has already pointed out, you'll need to change this to `.select("expected_" +
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61455519 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + --- End diff -- There needs to be a space after `.. feature.` here.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61455460 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,220 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. + * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean or the + * median of the column in which the missing values are located. InputCol should be + * of DoubleType or FloatType. + * + * Note that the mean/median value is computed after filtering out missing values. + * All Null values in the input column are treated as missing, and so are also imputed. 
+ */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mean", missingValue -> Double.NaN) + + override def fit(dataset: Dataset[_]): ImputerModel = { +transformSchema(dataset.schema, logging = true) +val ic = col($(inputCol)) +
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215335519 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57229/ Test FAILed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215335518 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215335486 **[Test build #57229 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57229/consoleFull)** for PR 11601 at commit [`4e1c34a`](https://github.com/apache/spark/commit/4e1c34a77b8e4382c00a0438e00e34f544b591a3). * This patch **fails Spark unit tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215330151 **[Test build #57229 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57229/consoleFull)** for PR 11601 at commit [`4e1c34a`](https://github.com/apache/spark/commit/4e1c34a77b8e4382c00a0438e00e34f544b591a3).
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215330198 Documents updated and removed a unit test.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61373605 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. + * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean") or the + * median("median") of the column in which the missing values are located. + * + * Note that all the null values will be imputed as well. --- End diff -- Yes. That's better. 
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61373554 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. + * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean") or the + * median("median") of the column in which the missing values are located. --- End diff -- @sethah, I tried yet I'm afraid it draws more confusion than the help. After all, the current behavior is not out of expectation and works for most ( if not all) users. 
I'd prefer to skip the detailed explanation in the API documentation.
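For readers skimming the quoted diff, here is a minimal usage sketch of the estimator as it is defined in this patch. It is only a sketch based on the setters visible in the diff (`setInputCol`, `setOutputCol`, `setStrategy`, `setMissingValue`); the patch was still under review at this point, so the API that was eventually merged may differ.

```scala
import org.apache.spark.ml.feature.Imputer
import org.apache.spark.sql.SparkSession

// Usage sketch based on the setters shown in the quoted diff; the patch
// was still under review, so the merged API may differ from this.
object ImputerUsageSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("imputer-sketch").getOrCreate()
    import spark.implicits._

    // A single Double column, with NaN as the default missing-value placeholder.
    val df = Seq(1.0, 1.0, 3.0, 4.0, Double.NaN).toDF("value")

    val imputer = new Imputer()
      .setInputCol("value")
      .setOutputCol("out")
      .setStrategy("mean")          // "mean" (default) or "median"
      .setMissingValue(Double.NaN)  // placeholder for missing entries

    val model = imputer.fit(df)     // computes the surrogate: (1 + 1 + 3 + 4) / 4 = 2.25
    model.transform(df).show()      // the NaN row gets 2.25 in the "out" column

    spark.stop()
  }
}
```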
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61373571 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. + * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean") or the + * median("median") of the column in which the missing values are located. + * --- End diff -- Cool. Thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61365096 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0) +)).toDF("id", "value", "exp_mean", "exp_median") +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select("exp_" + strategy, "out").collect().foreach { + case Row(exp: Double, out: Double) => + assert(exp ~== out absTol 1e-5, s"Imputed values differ. Expected: $exp, actual: $out") + } +} + } + + test("Imputer for Double with missing Value -1.0") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0), + (4, -1.0, 2.25, 1.0) +)).toDF("id", "value", "exp_mean", "exp_median") +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1.0) + val model = imputer.fit(df) + model.transform(df).select("exp_" + strategy, "out").collect().foreach { +case Row(exp: Double, out: Double) => + assert(exp ~== out absTol 1e-5, s"Impute($strategy) error. Expected: $exp, actual: $out") + } +} + } + + test("Imputer for Double with missing Value -1.0 and contains NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 3.0, 3.0, 3.0), + (2, Double.NaN, Double.NaN, Double.NaN), + (3, -1.0, 2.0, 3.0) --- End diff -- Actually, it always choose the smaller one of the two middle values as I saw in some tests. In this test case, the median is computed from [1, 3, Double.NaN]. And Double.NaN is treated as it's greater than Double.MaValue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
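The behaviour hhbyyh describes (the median of an even-length column coming back as one of the two middle observations rather than their average) can be reproduced directly with `DataFrameStatFunctions.approxQuantile`, which appears to be what backs the "median" strategy. The snippet below is only an illustration of that point, not code from the patch.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Illustration only: the 0.5 approximate quantile of [1.0, 3.0] comes back
// as one of the observed values (1.0 or 3.0), never the interpolated 2.0
// that an exact median would give.
object ApproxMedianSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("median-sketch").getOrCreate()
    import spark.implicits._

    val df = Seq(1.0, 3.0, Double.NaN).toDF("value")

    // Drop NaN before computing the quantile; NaN compares greater than
    // Double.MaxValue, so leaving it in could shift the result.
    val median = df.filter(!col("value").isNaN)
      .stat.approxQuantile("value", Array(0.5), 0.001)
      .head

    println(s"approximate median = $median") // 1.0 or 3.0, not 2.0
    spark.stop()
  }
}
```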
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61321225 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. + * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean") or the + * median("median") of the column in which the missing values are located. --- End diff -- Another subtlety I noticed is that, regardless of the `missingValue` param, we filter out `NaN` values before computing the mean. 
I think this is the correct behavior, since otherwise you'd just replace missing values with `NaN`. I'm not sure we even need to document it; it might make things more confusing. We could note that the "mean" strategy is NaN-safe, maybe?
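A minimal sketch of the fit-time filtering sethah describes, assuming the surrogate is computed with an ordinary DataFrame aggregation (the names below are made up for illustration and are not taken from the patch): rows equal to the `missingValue` placeholder and `NaN` rows are both excluded before the mean is taken.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{avg, col}

// Sketch of the filtering described above (not the patch's actual code):
// both missingValue rows and NaN rows are dropped before the "mean"
// surrogate is computed.
object MeanSurrogateSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("mean-sketch").getOrCreate()
    import spark.implicits._

    val missingValue = -1.0
    val df = Seq(1.0, 1.0, 3.0, 4.0, missingValue, Double.NaN).toDF("value")

    val surrogate = df
      .filter(col("value") =!= missingValue && !col("value").isNaN)
      .agg(avg(col("value")))
      .first()
      .getDouble(0)

    println(s"mean surrogate = $surrogate") // 2.25 for this data
    spark.stop()
  }
}
```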
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61318222 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. + * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean") or the --- End diff -- I don't think it's necessary to say mean ("mean") and median ("median") here. Just "mean or the median" will suffice. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61318084 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. + * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean") or the + * median("median") of the column in which the missing values are located. --- End diff -- can we add something saying how we impute, i.e. 
"either using the mean or median of the input column, computed after filtering out missing values" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61317809 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. + * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean") or the + * median("median") of the column in which the missing values are located. + * + * Note that all the null values will be imputed as well. 
--- End diff -- Can we amend the doc slightly to say something like "Note that null values in the input column are treated as missing, and so are also imputed"?
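The observable effect of that wording can be illustrated with a tiny, hypothetical snippet (this is not how the patch implements null handling internally, and the names are made up): a null in the input column ends up replaced by the surrogate, exactly as an occurrence of `missingValue` would.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{coalesce, col, lit}

// Hypothetical illustration of the documented behaviour: nulls in the
// input column are treated as missing and replaced by the surrogate.
object NullAsMissingSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder().master("local[*]").appName("null-sketch").getOrCreate()
    import spark.implicits._

    val surrogate = 2.0 // pretend this was computed during fit()
    val df = Seq((0, Some(1.0)), (1, Some(3.0)), (2, None: Option[Double]))
      .toDF("id", "value")

    // Observable effect only: null -> surrogate. A real transform would
    // also replace rows equal to the missingValue placeholder.
    df.withColumn("out", coalesce(col("value"), lit(surrogate))).show()

    spark.stop()
  }
}
```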
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61315937 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mea
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61308835 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException --- End diff -- unused import --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61308682 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,219 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the approximate median value of the feature. + * Default: mean + * + * @group param + */ + final val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + final val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, FloatType)) +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +SchemaUtils.appendColumn(schema, $(outputCol), inputType) + } +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean") or the + * median("median") of the column in which the missing values are located. + * --- End diff -- Can we document that we only support "Float" and "Double" types for now? 
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61306820 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { --- End diff -- Sorry to nitpick, but "expected_mean" seems significantly more clear than "exp_mean" at the expense of only a few extra characters. Only change it if you push more commits since its minor :) --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61305806 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0) +)).toDF("id", "value", "exp_mean", "exp_median") +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select("exp_" + strategy, "out").collect().foreach { + case Row(exp: Double, out: Double) => + assert(exp ~== out absTol 1e-5, s"Imputed values differ. Expected: $exp, actual: $out") + } +} + } + + test("Imputer for Double with missing Value -1.0") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0), + (4, -1.0, 2.25, 1.0) +)).toDF("id", "value", "exp_mean", "exp_median") +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1.0) + val model = imputer.fit(df) + model.transform(df).select("exp_" + strategy, "out").collect().foreach { +case Row(exp: Double, out: Double) => + assert(exp ~== out absTol 1e-5, s"Impute($strategy) error. Expected: $exp, actual: $out") + } +} + } + + test("Imputer for Double with missing Value -1.0 and contains NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 3.0, 3.0, 3.0), + (2, Double.NaN, Double.NaN, Double.NaN), + (3, -1.0, 2.0, 3.0) --- End diff -- Does approx quantile always choose the greater of the two middle values as the median? If so, can we add a comment noting that? NumPy computes the median of `[1.0, 3.0]` exactly as `2.0`. Future developers might think it's a mistake. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61305479 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,141 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0) +)).toDF("id", "value", "exp_mean", "exp_median") +Seq("mean", "median").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select("exp_" + strategy, "out").collect().foreach { + case Row(exp: Double, out: Double) => + assert(exp ~== out absTol 1e-5, s"Imputed values differ. Expected: $exp, actual: $out") + } +} + } + + test("Imputer for Double with missing Value -1.0") { --- End diff -- I'd argue this test is not necessary because of the test right below, where we test that it works for integer missing values and leaves `NaN`s alone. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215155942 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215155656 **[Test build #57140 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57140/consoleFull)** for PR 11601 at commit [`053d489`](https://github.com/apache/spark/commit/053d489a70a28674029ee51a69f529e851261c96). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215155949 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57140/ Test PASSed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215143852 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215143820 **[Test build #57139 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57139/consoleFull)** for PR 11601 at commit [`b3633e8`](https://github.com/apache/spark/commit/b3633e8dd0edf47a684aa344ba6a3c43ac0d91fe). * This patch **fails MiMa tests**. * This patch merges cleanly. * This patch adds no public classes.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215143854 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57139/ Test FAILed.
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215143141 **[Test build #57140 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57140/consoleFull)** for PR 11601 at commit [`053d489`](https://github.com/apache/spark/commit/053d489a70a28674029ee51a69f529e851261c96).
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215141731 **[Test build #57139 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57139/consoleFull)** for PR 11601 at commit [`b3633e8`](https://github.com/apache/spark/commit/b3633e8dd0edf47a684aa344ba6a3c43ac0d91fe).
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61286231 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0, 1.0 ) +)).toDF("id", "value", "mean", "median", "mode") +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Double, d2: Double) => + assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be $d1") +} +} + } + + test("Imputer for Double with missing Value -1.0") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0, 4.0), + (4, -1.0, 2.25, 1.0, 1.0 ) +)).toDF("id", "value", "mean", "median", "mode") +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1.0) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Double, d2: Double) => + assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be $d1") +} +} + } + + test("Imputer for Int with missing Value -1") { +val df = sqlContext.createDataFrame( Seq( + (0, 1, 1, 1, 1), + (1, 3, 3, 3, 3), + (2, 10, 10, 10, 10), + (3, 10, 10, 10, 10), + (4, -1, 6, 3, 10) +)).toDF("id", "value", "mean", "median", "mode") + +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Int, d2: Int) => + assert(d1 === d2, s"Imputer ut error: $d2 should be $d1") +} +} + } + + test("Imputer should impute null") { +val df = sqlContext.createDataFrame( Seq( + (0, 1, 1, 1, 1), + (1, 3, 3, 3, 3), + (2, 10, 10, 10, 10), + (3, 10, 10, 10, 10), + (4, -1, 6, 3, 10) --- 
End diff -- added
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61286255 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0, 1.0 ) +)).toDF("id", "value", "mean", "median", "mode") +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Double, d2: Double) => + assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be $d1") +} +} + } + + test("Imputer for Double with missing Value -1.0") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0, 4.0), + (4, -1.0, 2.25, 1.0, 1.0 ) +)).toDF("id", "value", "mean", "median", "mode") +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1.0) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Double, d2: Double) => + assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be $d1") +} +} + } + + test("Imputer for Int with missing Value -1") { +val df = sqlContext.createDataFrame( Seq( + (0, 1, 1, 1, 1), + (1, 3, 3, 3, 3), + (2, 10, 10, 10, 10), + (3, 10, 10, 10, 10), + (4, -1, 6, 3, 10) +)).toDF("id", "value", "mean", "median", "mode") + +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Int, d2: Int) => + assert(d1 === d2, s"Imputer ut error: $d2 should be $d1") +} +} + } + --- End diff -- added --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61286207 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0, 1.0 ) +)).toDF("id", "value", "mean", "median", "mode") +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Double, d2: Double) => + assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be $d1") --- End diff -- sure, changed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61286179 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { --- End diff -- I get the point, yet data in the format of ``` val df = sqlContext.createDataFrame( Seq( (0, 1.0, 1.0, 1.0), (1, 1.0, 1.0, 1.0), (2, 3.0, 3.0, 3.0), (3, 4.0, 4.0, 4.0), (4, Double.NaN, 2.25, 1.0) )).toDF("id", "value", "exp_mean", "exp_median") ``` provides direct correspondence. I've updated the columns names. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61282637 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + --- End diff -- Sure, thanks. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-215057260 Thanks for helping with the review and for the helpful discussion. I've read through the comments. The two major changes are: 1. This initial PR no longer includes support for the "mode" strategy; the preference is to implement "mode" later for categorical features (using metadata) and the DataFrame API. 2. We'll support only FloatType and DoubleType, for type safety. Preparing the update now. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
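For context, a minimal sketch of how the narrowed-down transformer would be used after these two changes, based on the setters shown in the quoted diffs; the DataFrame `df`, its column names, and the chosen output column are assumptions for illustration, not taken from the PR:

```scala
import org.apache.spark.ml.feature.Imputer

// Assumes `df` is an existing DataFrame with a DoubleType (or FloatType) column
// "value" in which Double.NaN marks the missing entries.
val imputer = new Imputer()
  .setInputCol("value")
  .setOutputCol("value_imputed")
  .setStrategy("median")           // "mean" or "median"; "mode" is left out of this first version

val model = imputer.fit(df)        // computes the surrogate (mean or median) of "value"
val imputed = model.transform(df)  // copies "value" to "value_imputed" with NaN replaced
```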
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61163112 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mea
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61162298 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61158015 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mea
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61157610 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + --- End diff -- Make all Param vals final --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
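A hedged illustration of the suggested change: each Param val in the shared trait declared `final` so concrete classes cannot override it. The doc strings are abbreviated and the supported-strategy list is inlined here purely for brevity; the mixins (HasInputCol, HasOutputCol) are omitted:

```scala
import org.apache.spark.ml.param.{DoubleParam, Param, ParamValidators, Params}

private[feature] trait ImputerParams extends Params {

  // `final` prevents subclasses from accidentally overriding the Param definition.
  final val strategy: Param[String] = new Param[String](this, "strategy",
    "strategy for imputation: mean, median or mode",
    ParamValidators.inArray[String](Array("mean", "median", "mode")))

  final val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
    "The placeholder for the missing values. All occurrences of missingValue will be imputed")
}
```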
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61157367 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61156296 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61152055 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { --- End diff -- The way these tests are written was pretty confusing to me. It seems like the "mean" column should be the mean of some _values_ but really it is the expected output when "mean" strategy is used. This is minor since it just affects readability and might be more of a personal preference. I think the following is clearer: ```scala val df = sqlContext.createDataFrame( Seq( (0, 1.0), (1, 1.0), (2, 3.0), (3, 4.0), (4, Double.NaN) )).toDF("id", "value") val expectedOutput = Map( "mean"->Array(1.0, 1.0, 3.0, 4.0, 2.25), "median" -> Array(1.0, 1.0, 3.0, 4.0, 1.0), "mode" -> Array(1.0, 1.0, 3.0, 4.0, 1.0)) Seq("mean", "median", "mode").foreach { strategy => val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) val model = imputer.fit(df) val result = model.transform(df).select("out").collect().map(_.getDouble(0)) result.zip(expectedOutput(strategy)).foreach { case (actual, expected) => assert(actual ~== expected absTol 1e-5) } ``` Really, just any way of indicating that the extra columns are _expected outputs_ would be clearer to me. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
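For completeness, a hedged, compilable version of the reorganization suggested above (closing braces added, expected values unchanged). It assumes it sits inside ImputerSuite with the existing imports, and, like the original suggestion, it relies on collect() preserving the input order, which holds for this small local DataFrame:

```scala
test("Imputer for Double with default missing value NaN") {
  val df = sqlContext.createDataFrame(Seq(
    (0, 1.0), (1, 1.0), (2, 3.0), (3, 4.0), (4, Double.NaN)
  )).toDF("id", "value")

  // Expected transform output, keyed by imputation strategy.
  val expectedOutput = Map(
    "mean"   -> Array(1.0, 1.0, 3.0, 4.0, 2.25),
    "median" -> Array(1.0, 1.0, 3.0, 4.0, 1.0),
    "mode"   -> Array(1.0, 1.0, 3.0, 4.0, 1.0))

  Seq("mean", "median", "mode").foreach { strategy =>
    val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
    val model = imputer.fit(df)
    val result = model.transform(df).select("out").collect().map(_.getDouble(0))
    result.zip(expectedOutput(strategy)).foreach { case (actual, expected) =>
      assert(actual ~== expected absTol 1e-5)
    }
  }
}
```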
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61136907 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + --- End diff -- Using median here is fine I think, but can we document that it's actually an approximation? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
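For reference, a hedged sketch of how an approximate median is typically obtained in Spark SQL, via DataFrameStatFunctions.approxQuantile; the relativeError value below is an assumption for illustration, not necessarily what this PR uses:

```scala
// Approximate median (0.5 quantile) of the input column. approxQuantile trades
// accuracy for speed; a relativeError of 0.0 forces an exact but more expensive
// computation, which is why documenting the approximation matters.
val Array(approxMedian) = df.stat.approxQuantile("value", Array(0.5), 0.001)
```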
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user sethah commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61136387 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mea
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user BenFradet commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-214792913 LGTM, really looking forward to this. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r6326 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", --- End diff -- I think the word "values" is missing in this one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user BenFradet commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61110533 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], --- End diff -- `checkNumericType` even --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
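A hedged sketch of the simplification being suggested: replacing the manual NumericType require with the existing SchemaUtils.checkNumericType helper inside the validateAndTransformSchema method quoted in the diff. This is a fragment of the ImputerParams trait (it uses $(...)), not standalone code; the rest of the method is kept as in the diff:

```scala
import org.apache.spark.ml.util.SchemaUtils
import org.apache.spark.sql.types.{StructField, StructType}

protected def validateAndTransformSchema(schema: StructType): StructType = {
  // Replaces: require(inputType.isInstanceOf[NumericType], "... must be of NumericType")
  SchemaUtils.checkNumericType(schema, $(inputCol))
  require(!schema.fieldNames.contains($(outputCol)),
    s"Output column ${$(outputCol)} already exists.")
  val inputField = schema($(inputCol))
  StructType(schema.fields :+ StructField($(outputCol), inputField.dataType, inputField.nullable))
}
```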
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61104774 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mea
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-214780361 @hhbyyh sorry for the delay. I've made some comments and suggestions. In particular, as per one of my inline comments, I'd propose we don't support the "mode" option in this first version. Let's see whether any users actually need/request it. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61105305 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0, 1.0 ) +)).toDF("id", "value", "mean", "median", "mode") +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Double, d2: Double) => + assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be $d1") +} +} + } + + test("Imputer for Double with missing Value -1.0") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0, 4.0), + (4, -1.0, 2.25, 1.0, 1.0 ) +)).toDF("id", "value", "mean", "median", "mode") +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1.0) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Double, d2: Double) => + assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be $d1") +} +} + } + + test("Imputer for Int with missing Value -1") { +val df = sqlContext.createDataFrame( Seq( + (0, 1, 1, 1, 1), + (1, 3, 3, 3, 3), + (2, 10, 10, 10, 10), + (3, 10, 10, 10, 10), + (4, -1, 6, 3, 10) +)).toDF("id", "value", "mean", "median", "mode") + +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Int, d2: Int) => + assert(d1 === d2, s"Imputer ut error: $d2 should be $d1") +} +} + } + --- End diff -- we should also have a test for a `non-NaN` missing value, but with `NaN` in the dataset, to check that "mean" and "median" behave as we expect. 
--- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
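To make the suggestion above concrete, here is a minimal sketch of what such a test might look like (hypothetical data and helpers from the existing suite; the expected surrogates assume `NaN` entries are excluded when computing the statistic and left untouched on transform, which is exactly the behaviour the test is meant to pin down, and the median expectation depends on how the implementation picks the quantile for an even count):

```scala
test("Imputer with non-NaN missingValue and NaN present in the data") {
  // Hypothetical sketch: -1.0 is the declared missing value, but the data also contains NaN.
  // Expected surrogates assume NaN is ignored when computing the statistic:
  // mean of {1.0, 3.0} = 2.0; median of {1.0, 3.0} taken as 1.0 here (adjust if needed).
  val df = sqlContext.createDataFrame(Seq(
    (0, 1.0),
    (1, 3.0),
    (2, Double.NaN),
    (3, -1.0)
  )).toDF("id", "value")

  Seq(("mean", 2.0), ("median", 1.0)).foreach { case (strategy, expected) =>
    val imputer = new Imputer().setInputCol("value").setOutputCol("out")
      .setStrategy(strategy).setMissingValue(-1.0)
    val model = imputer.fit(df)
    // Only the row holding the declared missing value (-1.0) should be imputed.
    val imputed = model.transform(df).filter("id = 3").select("out").head().getDouble(0)
    assert(imputed ~== expected absTol 1e-5,
      s"Imputed values differ. Expected: $expected, actual: $imputed")
  }
}
```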
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61105066 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0, 1.0 ) +)).toDF("id", "value", "mean", "median", "mode") +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Double, d2: Double) => + assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be $d1") +} +} + } + + test("Imputer for Double with missing Value -1.0") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0, 4.0), + (4, -1.0, 2.25, 1.0, 1.0 ) +)).toDF("id", "value", "mean", "median", "mode") +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1.0) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Double, d2: Double) => + assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be $d1") +} +} + } + + test("Imputer for Int with missing Value -1") { +val df = sqlContext.createDataFrame( Seq( + (0, 1, 1, 1, 1), + (1, 3, 3, 3, 3), + (2, 10, 10, 10, 10), + (3, 10, 10, 10, 10), + (4, -1, 6, 3, 10) +)).toDF("id", "value", "mean", "median", "mode") + +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) +.setMissingValue(-1) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Int, d2: Int) => + assert(d1 === d2, s"Imputer ut error: $d2 should be $d1") +} +} + } + + test("Imputer should impute null") { +val df = sqlContext.createDataFrame( Seq( + (0, 1, 1, 1, 1), + (1, 3, 3, 3, 3), + (2, 10, 10, 10, 10), + (3, 10, 10, 10, 10), + (4, -1, 6, 3, 10) --- 
End diff -- can we also test a "missing value" here, as well as null, to check that it does the right thing. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
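A rough sketch of what such a test could look like (hypothetical; the column is made nullable via `Option[Double]`, and the expected surrogate assumes both `null` and the `-1.0` placeholder are excluded from the computation and both get imputed, which is the behaviour under discussion):

```scala
test("Imputer should impute both null and the missingValue placeholder") {
  // Hypothetical sketch: nullable input column containing a null and the declared
  // missing value -1.0. Expected surrogate: mean of {1.0, 3.0, 8.0} = 4.0.
  val df = sqlContext.createDataFrame(Seq[(Int, Option[Double])](
    (0, Some(1.0)),
    (1, Some(3.0)),
    (2, Some(8.0)),
    (3, Some(-1.0)),
    (4, None)
  )).toDF("id", "value")

  val imputer = new Imputer().setInputCol("value").setOutputCol("out")
    .setStrategy("mean").setMissingValue(-1.0)
  val model = imputer.fit(df)
  val expected = 4.0
  // Both the -1.0 row and the null row should come out as the surrogate value.
  model.transform(df).filter("id >= 3").select("out").collect().foreach {
    case Row(imputed: Double) =>
      assert(imputed ~== expected absTol 1e-5,
        s"Imputed values differ. Expected: $expected, actual: $imputed")
  }
}
```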
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61103526 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mea
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61101270 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mea
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61099521 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mea
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61100245 --- Diff: mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala --- @@ -0,0 +1,121 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.spark.ml.feature + +import org.apache.spark.SparkFunSuite +import org.apache.spark.ml.util.{DefaultReadWriteTest} +import org.apache.spark.mllib.util.MLlibTestSparkContext +import org.apache.spark.mllib.util.TestingUtils._ +import org.apache.spark.sql.Row + +class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with DefaultReadWriteTest { + + test("Imputer for Double with default missing Value NaN") { +val df = sqlContext.createDataFrame( Seq( + (0, 1.0, 1.0, 1.0, 1.0), + (1, 1.0, 1.0, 1.0, 1.0), + (2, 3.0, 3.0, 3.0, 3.0), + (3, 4.0, 4.0, 4.0, 4.0), + (4, Double.NaN, 2.25, 1.0, 1.0 ) +)).toDF("id", "value", "mean", "median", "mode") +Seq("mean", "median", "mode").foreach { strategy => + val imputer = new Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy) + val model = imputer.fit(df) + model.transform(df).select(strategy, "out").collect() +.foreach { case Row(d1: Double, d2: Double) => + assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be $d1") --- End diff -- Could we make this error message more informative, like "Imputed values differ. Expected: $d1, actual: $d2" --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
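For illustration, the suggested wording applied to the existing assertion (with `d1` as the expected value and `d2` as the actual value, as in the test above) would read roughly:

```scala
// Suggested, more informative failure message for the existing assertion.
assert(d1 ~== d2 absTol 1e-5,
  s"Imputed values differ. Expected: $d1, actual: $d2")
```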
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61099076 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ --- End diff -- likewise, we can use `SchemaUtils.appendColumn` here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. 
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
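For concreteness, the manual `schema.fields :+ StructField(...)` append could collapse to something like the following (a sketch assuming the private[spark] helper `SchemaUtils.appendColumn(schema, colName, dataType, nullable)`; since that helper also guards against a pre-existing output column, the explicit `require` on `outputCol` may become redundant):

```scala
// Sketch only -- appendColumn signature assumed as (schema, colName, dataType, nullable).
SchemaUtils.appendColumn(schema, $(outputCol), inputType, schema($(inputCol)).nullable)
```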
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r61098866 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,231 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. */ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], --- End diff -- I think we can use `SchemaUtils.checkColumnType` here --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
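Putting this together with the `appendColumn` suggestion above, `validateAndTransformSchema` might shrink to roughly the following. Note that `checkColumnType` expects one concrete `DataType`, while the Imputer accepts any `NumericType`, so the closely related `SchemaUtils.checkNumericType` helper is assumed here instead:

```scala
// Sketch, not the final implementation: both helpers are private[spark] utilities whose
// signatures are assumed as checkNumericType(schema, colName) and
// appendColumn(schema, colName, dataType, nullable).
import org.apache.spark.ml.util.SchemaUtils

protected def validateAndTransformSchema(schema: StructType): StructType = {
  SchemaUtils.checkNumericType(schema, $(inputCol))
  SchemaUtils.appendColumn(
    schema, $(outputCol), schema($(inputCol)).dataType, schema($(inputCol)).nullable)
}
```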
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-212723400 @jkbradley I'd appreciate it if you could take another look, especially at the question from @MLnick in the last comment. @MLnick I'm not sure whether we need to wait for Joseph to make the final decision. Could you let me know your preference for the output column type? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-212503267 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56372/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-212503265 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-212503081 **[Test build #56372 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56372/consoleFull)** for PR 11601 at commit [`594c501`](https://github.com/apache/spark/commit/594c501f85cad2a278caee5f08b85deb61272e5d). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-212489019 **[Test build #56372 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56372/consoleFull)** for PR 11601 at commit [`594c501`](https://github.com/apache/spark/commit/594c501f85cad2a278caee5f08b85deb61272e5d). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-212188252 Updated according to the comments. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-210947567 Merged build finished. Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-210947568 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56035/ Test PASSed. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-210947421 **[Test build #56035 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56035/consoleFull)** for PR 11601 at commit [`1718422`](https://github.com/apache/spark/commit/171842210d3ea2e3c97fe803f0a8bb3831063f3f). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/11601#issuecomment-210944565 **[Test build #56035 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56035/consoleFull)** for PR 11601 at commit [`1718422`](https://github.com/apache/spark/commit/171842210d3ea2e3c97fe803f0a8bb3831063f3f). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r59977329 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,230 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mea
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user hhbyyh commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r59977297 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,230 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mea
[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...
Github user MLnick commented on a diff in the pull request: https://github.com/apache/spark/pull/11601#discussion_r59918234 --- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala --- @@ -0,0 +1,230 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.ml.feature + +import org.apache.hadoop.fs.Path + +import org.apache.spark.SparkException +import org.apache.spark.annotation.{Experimental, Since} +import org.apache.spark.ml.{Estimator, Model} +import org.apache.spark.ml.param._ +import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol} +import org.apache.spark.ml.util._ +import org.apache.spark.sql.{DataFrame, Dataset, Row} +import org.apache.spark.sql.functions._ +import org.apache.spark.sql.types._ +import org.apache.spark.sql.types.NumericType + +/** + * Params for [[Imputer]] and [[ImputerModel]]. + */ +private[feature] trait ImputerParams extends Params with HasInputCol with HasOutputCol { + + /** + * The imputation strategy. + * If "mean", then replace missing values using the mean value of the feature. + * If "median", then replace missing values using the median value of the feature. + * If "mode", then replace missing using the most frequent value of the feature. + * Default: mean + * + * @group param + */ + val strategy: Param[String] = new Param(this, "strategy", "strategy for imputation. " + +"If mean, then replace missing values using the mean value of the feature." + +"If median, then replace missing values using the median value of the feature." + +"If mode, then replace missing using the most frequent value of the feature.", + ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray)) + + /** @group getParam */ + def getStrategy: String = $(strategy) + + /** + * The placeholder for the missing values. All occurrences of missingValue will be imputed. + * Default: Double.NaN + * + * @group param + */ + val missingValue: DoubleParam = new DoubleParam(this, "missingValue", +"The placeholder for the missing values. All occurrences of missingValue will be imputed") + + /** @group getParam */ + def getMissingValue: Double = $(missingValue) + + /** Validates and transforms the input schema. 
*/ + protected def validateAndTransformSchema(schema: StructType): StructType = { +val inputType = schema($(inputCol)).dataType +require(inputType.isInstanceOf[NumericType], + s"Input column ${$(inputCol)} must be of NumericType") +require(!schema.fieldNames.contains($(outputCol)), + s"Output column ${$(outputCol)} already exists.") +val outputFields = schema.fields :+ + StructField($(outputCol), inputType, schema($(inputCol)).nullable) +StructType(outputFields) + } + +} + +/** + * :: Experimental :: + * Imputation estimator for completing missing values, either using the mean("mean"), the + * median("median") or the most frequent value("mode") of the column in which the missing + * values are located. + */ +@Experimental +class Imputer @Since("2.0.0")(override val uid: String) + extends Estimator[ImputerModel] with ImputerParams with DefaultParamsWritable { + + @Since("2.0.0") + def this() = this(Identifiable.randomUID("imputer")) + + /** @group setParam */ + def setInputCol(value: String): this.type = set(inputCol, value) + + /** @group setParam */ + def setOutputCol(value: String): this.type = set(outputCol, value) + + /** + * Imputation strategy. Available options are ["mean", "median" and "mode"]. + * @group setParam + */ + def setStrategy(value: String): this.type = set(strategy, value) + + /** @group setParam */ + def setMissingValue(value: Double): this.type = set(missingValue, value) + + setDefault(strategy -> "mea