[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-05-19 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-220254380
  
@hhbyyh really sorry this slipped in the rush around 2.0 QA. I'm afraid it 
will have to be revisited after 2.0 release! But I think it should be in good 
shape to merge soon after that happens.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-05-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216485692
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-05-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216485606
  
**[Test build #57620 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57620/consoleFull)**
 for PR 11601 at commit 
[`4e07431`](https://github.com/apache/spark/commit/4e0743139796ac53df2554cfa53736b8035bae15).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-05-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216485694
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57620/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-05-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216477659
  
**[Test build #57620 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57620/consoleFull)**
 for PR 11601 at commit 
[`4e07431`](https://github.com/apache/spark/commit/4e0743139796ac53df2554cfa53736b8035bae15).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-05-03 Thread hhbyyh
Github user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216472078
  
@MLnick Yes, I've been preparing the doc and examples and they are almost 
ready. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-05-03 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216462472
  
@hhbyyh I think this is far enough along that I'm comfortable putting it in 
2.0, unless @jkbradley or anyone else has an objection. However this will need 
an entry in the HTML doc, as well as a small example. Will you be able to 
ensure those are completed during the initial QA period?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-05-03 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216462512
  
@hhbyyh made a small doc comment too.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-05-03 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61848348
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature. " +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean or the median
+ * of the column in which the missing values are located. The input column 
should be of
+ * DoubleType or FloatType.
+ *
+ * Note that the mean/median value is computed after filtering out missing 
values.
+ * All Null values in the input column are treated as missing, and so are 
also imputed.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mean", missingValue -> Double.NaN)
+
+  override def fit(dataset: Dataset[_]): ImputerModel = {
+transformSchema(dataset.schema, logging = true)
+val ic = col($(inputCol))

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-05-02 Thread hhbyyh
Github user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216435795
  
@MLnick @jkbradley Is it possible to make this into 2.0? Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216011080
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57459/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216011066
  
**[Test build #57459 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57459/consoleFull)**
 for PR 11601 at commit 
[`cca8dd4`](https://github.com/apache/spark/commit/cca8dd41714d79476c2bf23f706012a282c53bcb).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216011079
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-216008178
  
**[Test build #57459 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57459/consoleFull)**
 for PR 11601 at commit 
[`cca8dd4`](https://github.com/apache/spark/commit/cca8dd41714d79476c2bf23f706012a282c53bcb).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-30 Thread hhbyyh
Github user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215975928
  
Thanks for the review @MLnick. I'll create the jira now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-29 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215719811
  
@hhbyyh if you have time, could you create 2 follow up JIRAs for:
* PySpark impl
* adding mode at a later date for categorical features
* investigate efficiency of approaches using DataFrame/Dataset and/or 
approx approaches such as `frequentItems` or Count-Min Sketch (will require an 
update to CMS to return "heavy-hitters").
* investigate if we can use metadata to only allow mode for categorical 
features (or perhaps as an easier alternative, allow mode for only Int/Long 
columns)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-29 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215715233
  
A few small final comments, otherwise LGTM. @jkbradley will leave open for 
a day or two in case you want to take a final pass.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-29 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61577109
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature. " +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean or the
+ * median of the column in which the missing values are located. InputCol 
should be
--- End diff --

I prefer "The input column" rather than "InputCol".


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-29 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61576752
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0)
+)).toDF("id", "value", "expected_mean", "expected_median")
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select("expected_" + strategy, 
"out").collect().foreach {
+   case Row(exp: Double, out: Double) =>
+  assert(exp ~== out absTol 1e-5, s"Imputed values differ. 
Expected: $exp, actual: $out")
+  }
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0 and contains NaN") {
--- End diff --

Can we call this `Imputer should handle NaNs in data when 'missingValue' is 
not NaN` for clarity


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-29 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61575765
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0)
+)).toDF("id", "value", "expected_mean", "expected_median")
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select("expected_" + strategy, 
"out").collect().foreach {
+   case Row(exp: Double, out: Double) =>
+  assert(exp ~== out absTol 1e-5, s"Imputed values differ. 
Expected: $exp, actual: $out")
+  }
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0 and contains NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 3.0, 3.0, 3.0),
+  (2, Double.NaN, Double.NaN, Double.NaN),
+  (3, -1.0, 2.0, 3.0)
+)).toDF("id", "value", "expected_mean", "expected_median")
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1.0)
+  val model = imputer.fit(df)
+  model.transform(df).select("expected_" + strategy, 
"out").collect().foreach {
+case Row(exp: Double, out: Double) =>
+  assert((exp.isNaN && out.isNaN) || (exp ~== out absTol 1e-5),
+s"Imputed values differ. Expected: $exp, actual: $out")
+  }
+}
+  }
+
+  test("Imputer for Float with missing Value -1.0") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0F, 1.0F, 1.0F),
+  (1, 3.0F, 3.0F, 3.0F),
+  (2, 10.0F, 10.0F, 10.0F),
+  (3, 10.0F, 10.0F, 10.0F),
+  (4, -1.0F, 6.0F, 3.0F)
+)).toDF("id", "value", "expected_mean", "expected_median")
+
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1)
+  val model = imputer.fit(df)
+  val result = model.transform(df)
+  model.transform(df).select("expected_" + strategy, 
"out").collect().foreach {
+case Row(exp: Float, out: Float) =>
+  assert(exp == out, s"Imputed values differ. Expected: $exp, 
actual: $out")
+  }
+}
+  }
+
+  test("Imputer should impute null") {
--- End diff --

Could we call this `Imputer should impute null as well as 'missingValue'` 
to clarify


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215656460
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215656462
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57321/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215656374
  
**[Test build #57321 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57321/consoleFull)**
 for PR 11601 at commit 
[`335ded7`](https://github.com/apache/spark/commit/335ded7d92d7ab6f50e2d16348cfde051eb26579).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215650485
  
**[Test build #57321 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57321/consoleFull)**
 for PR 11601 at commit 
[`335ded7`](https://github.com/apache/spark/commit/335ded7d92d7ab6f50e2d16348cfde051eb26579).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215615452
  
**[Test build #57298 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57298/consoleFull)**
 for PR 11601 at commit 
[`aef094b`](https://github.com/apache/spark/commit/aef094bc7b7a00c0ded1b2998b7f98d2bc42c666).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215615467
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57298/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215615465
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215613222
  
**[Test build #57298 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57298/consoleFull)**
 for PR 11601 at commit 
[`aef094b`](https://github.com/apache/spark/commit/aef094bc7b7a00c0ded1b2998b7f98d2bc42c666).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread sethah
Github user sethah commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215503334
  
Other than a couple minor comments, this LGTM pending tests.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61467752
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,122 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0)
+)).toDF("id", "value", "expected_mean", "expected_median")
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select("exp_" + strategy, 
"out").collect().foreach {
+   case Row(exp: Double, out: Double) =>
+  assert(exp ~== out absTol 1e-5, s"Imputed values differ. 
Expected: $exp, actual: $out")
+  }
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0 and contains NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 3.0, 3.0, 3.0),
+  (2, Double.NaN, Double.NaN, Double.NaN),
+  (3, -1.0, 2.0, 3.0)
+)).toDF("id", "value", "expected_mean", "expected_median")
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1.0)
+  val model = imputer.fit(df)
+  model.transform(df).select("exp_" + strategy, 
"out").collect().foreach {
+case Row(exp: Double, out: Double) =>
+  assert((exp.isNaN && out.isNaN) || (exp ~== out absTol 1e-5),
+s"Imputed values differ. Expected: $exp, actual: $out")
+  }
+}
+  }
+
+  test("Imputer for Float with missing Value -1.0") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0F, 1.0F, 1.0F),
+  (1, 3.0F, 3.0F, 3.0F),
+  (2, 10.0F, 10.0F, 10.0F),
+  (3, 10.0F, 10.0F, 10.0F),
+  (4, -1.0F, 6.0F, 3.0F)
+)).toDF("id", "value", "expected_mean", "expected_median")
+
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1)
+  val model = imputer.fit(df)
+  val result = model.transform(df)
+  model.transform(df).select("exp_" + strategy, 
"out").collect().foreach {
+case Row(exp: Float, out: Float) =>
+  assert(exp == out, s"Imputed values differ. Expected: $exp, 
actual: $out")
+  }
+}
+  }
+
+  test("Imputer should impute null") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 4.0, 4.0, 4.0),
+  (1, 10.0, 10.0, 10.0),
+  (2, 10.0, 10.0, 10.0),
+  (3, Double.NaN, 8.0, 10.0),
+  (4, -1.0, 8.0, 10.0)
+)).toDF("id", "value", "expected_mean", "expected_median")
+val df2 = df.selectExpr("*", "IF(value=-1.0, null, value) as 
nullable_value")
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("nullable_value").setOutputCol("out")
+.setStrategy(strategy)
+  val model = imputer.fit(df2)
+  model.transform(df2).select("exp_" + strategy, 
"out").collect().foreach {
--- End diff --

As Jenkins has already pointed out, you'll need to change this to 
`.select("expected_" +

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61455519
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
--- End diff --

There needs to be a space after `.. feature.` here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61455460
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,220 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean or the
+ * median of the column in which the missing values are located. InputCol 
should be
+ * of DoubleType or FloatType.
+ *
+ * Note that the mean/median value is computed after filtering out missing 
values.
+ * All Null values in the input column are treated as missing, and so are 
also imputed.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mean", missingValue -> Double.NaN)
+
+  override def fit(dataset: Dataset[_]): ImputerModel = {
+transformSchema(dataset.schema, logging = true)
+val ic = col($(inputCol))
+

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215335519
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57229/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215335518
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215335486
  
**[Test build #57229 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57229/consoleFull)**
 for PR 11601 at commit 
[`4e1c34a`](https://github.com/apache/spark/commit/4e1c34a77b8e4382c00a0438e00e34f544b591a3).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215330151
  
**[Test build #57229 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57229/consoleFull)**
 for PR 11601 at commit 
[`4e1c34a`](https://github.com/apache/spark/commit/4e1c34a77b8e4382c00a0438e00e34f544b591a3).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-28 Thread hhbyyh
Github user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215330198
  
Documents updated and remove an unit test.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61373605
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean") or the
+ * median("median") of the column in which the missing values are located.
+ *
+ * Note that all the null values will be imputed as well.
--- End diff --

Yes. That's better.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61373554
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean") or the
+ * median("median") of the column in which the missing values are located.
--- End diff --

@sethah, I tried yet I'm afraid it draws more confusion than the help. 
After all, the current behavior is not out of expectation and works for most ( 
if not all) users. I'd prefer to skip the detailed explanation in API document. 


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61373571
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean") or the
+ * median("median") of the column in which the missing values are located.
+ *
--- End diff --

Cool. Thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61365096
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0)
+)).toDF("id", "value", "exp_mean", "exp_median")
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select("exp_" + strategy, 
"out").collect().foreach {
+   case Row(exp: Double, out: Double) =>
+  assert(exp ~== out absTol 1e-5, s"Imputed values differ. 
Expected: $exp, actual: $out")
+  }
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0),
+  (4, -1.0, 2.25, 1.0)
+)).toDF("id", "value", "exp_mean", "exp_median")
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1.0)
+  val model = imputer.fit(df)
+  model.transform(df).select("exp_" + strategy, 
"out").collect().foreach {
+case Row(exp: Double, out: Double) =>
+  assert(exp ~== out absTol 1e-5, s"Impute($strategy) error. 
Expected: $exp, actual: $out")
+  }
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0 and contains NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 3.0, 3.0, 3.0),
+  (2, Double.NaN, Double.NaN, Double.NaN),
+  (3, -1.0, 2.0, 3.0)
--- End diff --

Actually, it always choose the smaller one of the two middle values as I 
saw in some tests.
In this test case, the median is computed from [1, 3, Double.NaN]. And 
Double.NaN is treated as it's greater than Double.MaValue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61321225
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean") or the
+ * median("median") of the column in which the missing values are located.
--- End diff --

Another subtlety I noticed is that, regardless of the `missingValue` param, 
we filter out `NaN` values before computing the mean. I think this is the 
correct behavior, since otherwise you'd just replace missing values with `NaN`. 
I'm not sure if we even need to document it, it might make things more 
confusing. We could note that the "mean" strategy is nan-safe maybe?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61318222
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean") or the
--- End diff --

I don't think it's necessary to say mean ("mean") and median ("median") 
here. Just "mean or the median" will suffice.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61318084
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean") or the
+ * median("median") of the column in which the missing values are located.
--- End diff --

can we add something saying how we impute, i.e. "either using the mean or 
median of the input column, computed after filtering out missing values"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61317809
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean") or the
+ * median("median") of the column in which the missing values are located.
+ *
+ * Note that all the null values will be imputed as well.
--- End diff --

Can we amend the doc slightly to say something like "Note that null values 
in the input column are treated as missing, and so are also imputed"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61315937
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mea

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61308835
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
--- End diff --

unused import


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61308682
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,219 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the approximate median 
value of the feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  final val strategy: Param[String] = new Param(this, "strategy", 
"strategy for imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  final val missingValue: DoubleParam = new DoubleParam(this, 
"missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+SchemaUtils.checkColumnTypes(schema, $(inputCol), Seq(DoubleType, 
FloatType))
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+SchemaUtils.appendColumn(schema, $(outputCol), inputType)
+  }
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean") or the
+ * median("median") of the column in which the missing values are located.
+ *
--- End diff --

Can we document that we only support "Float" and "Double" types for now?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61306820
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
--- End diff --

Sorry to nitpick, but "expected_mean" seems significantly more clear than 
"exp_mean" at the expense of only a few extra characters. Only change it if you 
push more commits since its minor :)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61305806
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0)
+)).toDF("id", "value", "exp_mean", "exp_median")
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select("exp_" + strategy, 
"out").collect().foreach {
+   case Row(exp: Double, out: Double) =>
+  assert(exp ~== out absTol 1e-5, s"Imputed values differ. 
Expected: $exp, actual: $out")
+  }
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0),
+  (4, -1.0, 2.25, 1.0)
+)).toDF("id", "value", "exp_mean", "exp_median")
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1.0)
+  val model = imputer.fit(df)
+  model.transform(df).select("exp_" + strategy, 
"out").collect().foreach {
+case Row(exp: Double, out: Double) =>
+  assert(exp ~== out absTol 1e-5, s"Impute($strategy) error. 
Expected: $exp, actual: $out")
+  }
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0 and contains NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 3.0, 3.0, 3.0),
+  (2, Double.NaN, Double.NaN, Double.NaN),
+  (3, -1.0, 2.0, 3.0)
--- End diff --

Does approx quantile always choose the greater of the two middle values as 
the median? If so, can we add a comment noting that? NumPy computes the median 
of `[1.0, 3.0]` exactly as `2.0`. Future developers might think it's a mistake.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61305479
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,141 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0)
+)).toDF("id", "value", "exp_mean", "exp_median")
+Seq("mean", "median").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select("exp_" + strategy, 
"out").collect().foreach {
+   case Row(exp: Double, out: Double) =>
+  assert(exp ~== out absTol 1e-5, s"Imputed values differ. 
Expected: $exp, actual: $out")
+  }
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0") {
--- End diff --

I'd argue this test is not necessary because of the test right below, where 
we test that it works for integer missing values and leaves `NaN`s alone.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215155942
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215155656
  
**[Test build #57140 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57140/consoleFull)**
 for PR 11601 at commit 
[`053d489`](https://github.com/apache/spark/commit/053d489a70a28674029ee51a69f529e851261c96).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215155949
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57140/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215143852
  
Merged build finished. Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215143820
  
**[Test build #57139 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57139/consoleFull)**
 for PR 11601 at commit 
[`b3633e8`](https://github.com/apache/spark/commit/b3633e8dd0edf47a684aa344ba6a3c43ac0d91fe).
 * This patch **fails MiMa tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215143854
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/57139/
Test FAILed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215143141
  
**[Test build #57140 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57140/consoleFull)**
 for PR 11601 at commit 
[`053d489`](https://github.com/apache/spark/commit/053d489a70a28674029ee51a69f529e851261c96).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215141731
  
**[Test build #57139 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/57139/consoleFull)**
 for PR 11601 at commit 
[`b3633e8`](https://github.com/apache/spark/commit/b3633e8dd0edf47a684aa344ba6a3c43ac0d91fe).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61286231
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0, 1.0 )
+)).toDF("id", "value", "mean", "median", "mode")
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Double, d2: Double) =>
+  assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be 
$d1")
+}
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0, 4.0),
+  (4, -1.0, 2.25, 1.0, 1.0 )
+)).toDF("id", "value", "mean", "median", "mode")
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1.0)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Double, d2: Double) =>
+  assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be 
$d1")
+}
+}
+  }
+
+  test("Imputer for Int with missing Value -1") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1, 1, 1, 1),
+  (1, 3, 3, 3, 3),
+  (2, 10, 10, 10, 10),
+  (3, 10, 10, 10, 10),
+  (4, -1, 6, 3, 10)
+)).toDF("id", "value", "mean", "median", "mode")
+
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Int, d2: Int) =>
+  assert(d1 === d2, s"Imputer ut error: $d2 should be $d1")
+}
+}
+  }
+
+  test("Imputer should impute null") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1, 1, 1, 1),
+  (1, 3, 3, 3, 3),
+  (2, 10, 10, 10, 10),
+  (3, 10, 10, 10, 10),
+  (4, -1, 6, 3, 10)
--- End diff --

added


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61286255
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0, 1.0 )
+)).toDF("id", "value", "mean", "median", "mode")
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Double, d2: Double) =>
+  assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be 
$d1")
+}
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0, 4.0),
+  (4, -1.0, 2.25, 1.0, 1.0 )
+)).toDF("id", "value", "mean", "median", "mode")
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1.0)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Double, d2: Double) =>
+  assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be 
$d1")
+}
+}
+  }
+
+  test("Imputer for Int with missing Value -1") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1, 1, 1, 1),
+  (1, 3, 3, 3, 3),
+  (2, 10, 10, 10, 10),
+  (3, 10, 10, 10, 10),
+  (4, -1, 6, 3, 10)
+)).toDF("id", "value", "mean", "median", "mode")
+
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Int, d2: Int) =>
+  assert(d1 === d2, s"Imputer ut error: $d2 should be $d1")
+}
+}
+  }
+
--- End diff --

added


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61286207
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0, 1.0 )
+)).toDF("id", "value", "mean", "median", "mode")
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Double, d2: Double) =>
+  assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be 
$d1")
--- End diff --

sure, changed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61286179
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
--- End diff --

I get the point, yet data in the format of 
```
val df = sqlContext.createDataFrame( Seq(
  (0, 1.0, 1.0, 1.0),
  (1, 1.0, 1.0, 1.0),
  (2, 3.0, 3.0, 3.0),
  (3, 4.0, 4.0, 4.0),
  (4, Double.NaN, 2.25, 1.0)
)).toDF("id", "value", "exp_mean", "exp_median")
```
provides direct correspondence.

I've updated the columns names.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61282637
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
--- End diff --

Sure, thanks.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-27 Thread hhbyyh
Github user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-215057260
  
Thanks for helping review and the helpful discussion. I've read through the 
comments. Two major changes are:
1. This initial PR does not include support for "mode" strategy. The 
preference is to implement "mode" for categorical features (metadata) and use 
DataFrame API.
2. We'll support only FloatType and DoubleType for type safety.

Preparing update now.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61163112
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mea

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61162298
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61158015
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mea

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61157610
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
--- End diff --

Make all Param vals final


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61157367
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61156296
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61152055
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
--- End diff --

The way these tests are written was pretty confusing to me. It seems like 
the "mean" column should be the mean of some _values_ but really it is the 
expected output when "mean" strategy is used. This is minor since it just 
affects readability and might be more of a personal preference. I think the 
following is clearer:

```scala
val df = sqlContext.createDataFrame( Seq(
  (0, 1.0),
  (1, 1.0),
  (2, 3.0),
  (3, 4.0),
  (4, Double.NaN)
)).toDF("id", "value")
val expectedOutput = Map(
  "mean"->Array(1.0, 1.0, 3.0, 4.0, 2.25),
  "median" -> Array(1.0, 1.0, 3.0, 4.0, 1.0),
  "mode" -> Array(1.0, 1.0, 3.0, 4.0, 1.0))
Seq("mean", "median", "mode").foreach { strategy =>
  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
  val model = imputer.fit(df)
  val result = 
model.transform(df).select("out").collect().map(_.getDouble(0))
  result.zip(expectedOutput(strategy)).foreach { case (actual, 
expected) =>
assert(actual ~== expected absTol 1e-5)
  }
```
Really, just any way of indicating that the extra columns are _expected 
outputs_ would be clearer to me.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61136907
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
--- End diff --

Using median here is fine I think, but can we document that it's actually 
an approximation?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread sethah
Github user sethah commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61136387
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mea

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread BenFradet
Github user BenFradet commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-214792913
  
Lgtm, really looking forward to this.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread BenFradet
Github user BenFradet commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r6326
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
--- End diff --

I think the word "values" is missing in this one.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread BenFradet
Github user BenFradet commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61110533
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
--- End diff --

`checkNumericType` even


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61104774
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mea

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread MLnick
Github user MLnick commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-214780361
  
@hhbyyh sorry for the delay. I've made some comments and suggestions.

In particular, as per one of my inline comments, I'd propose we don't 
support the "mode" option in this first version. Let's see whether any users 
actually need/request it.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61105305
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0, 1.0 )
+)).toDF("id", "value", "mean", "median", "mode")
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Double, d2: Double) =>
+  assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be 
$d1")
+}
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0, 4.0),
+  (4, -1.0, 2.25, 1.0, 1.0 )
+)).toDF("id", "value", "mean", "median", "mode")
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1.0)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Double, d2: Double) =>
+  assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be 
$d1")
+}
+}
+  }
+
+  test("Imputer for Int with missing Value -1") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1, 1, 1, 1),
+  (1, 3, 3, 3, 3),
+  (2, 10, 10, 10, 10),
+  (3, 10, 10, 10, 10),
+  (4, -1, 6, 3, 10)
+)).toDF("id", "value", "mean", "median", "mode")
+
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Int, d2: Int) =>
+  assert(d1 === d2, s"Imputer ut error: $d2 should be $d1")
+}
+}
+  }
+
--- End diff --

we should also have a test for a `non-NaN` missing value, but with `NaN` in 
the dataset, to check that "mean" and "median" behave as we expect.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61105066
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0, 1.0 )
+)).toDF("id", "value", "mean", "median", "mode")
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Double, d2: Double) =>
+  assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be 
$d1")
+}
+}
+  }
+
+  test("Imputer for Double with missing Value -1.0") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0, 4.0),
+  (4, -1.0, 2.25, 1.0, 1.0 )
+)).toDF("id", "value", "mean", "median", "mode")
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1.0)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Double, d2: Double) =>
+  assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be 
$d1")
+}
+}
+  }
+
+  test("Imputer for Int with missing Value -1") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1, 1, 1, 1),
+  (1, 3, 3, 3, 3),
+  (2, 10, 10, 10, 10),
+  (3, 10, 10, 10, 10),
+  (4, -1, 6, 3, 10)
+)).toDF("id", "value", "mean", "median", "mode")
+
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+.setMissingValue(-1)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Int, d2: Int) =>
+  assert(d1 === d2, s"Imputer ut error: $d2 should be $d1")
+}
+}
+  }
+
+  test("Imputer should impute null") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1, 1, 1, 1),
+  (1, 3, 3, 3, 3),
+  (2, 10, 10, 10, 10),
+  (3, 10, 10, 10, 10),
+  (4, -1, 6, 3, 10)
--- End diff --

can we test here a "missing value" as well as null, to see it does the 
right thing.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61103526
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mea

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61101270
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mea

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61099521
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mea

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61100245
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/ml/feature/ImputerSuite.scala ---
@@ -0,0 +1,121 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.ml.feature
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.ml.util.{DefaultReadWriteTest}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.sql.Row
+
+class ImputerSuite extends SparkFunSuite with MLlibTestSparkContext with 
DefaultReadWriteTest {
+
+  test("Imputer for Double with default missing Value NaN") {
+val df = sqlContext.createDataFrame( Seq(
+  (0, 1.0, 1.0, 1.0, 1.0),
+  (1, 1.0, 1.0, 1.0, 1.0),
+  (2, 3.0, 3.0, 3.0, 3.0),
+  (3, 4.0, 4.0, 4.0, 4.0),
+  (4, Double.NaN, 2.25, 1.0, 1.0 )
+)).toDF("id", "value", "mean", "median", "mode")
+Seq("mean", "median", "mode").foreach { strategy =>
+  val imputer = new 
Imputer().setInputCol("value").setOutputCol("out").setStrategy(strategy)
+  val model = imputer.fit(df)
+  model.transform(df).select(strategy, "out").collect()
+.foreach { case Row(d1: Double, d2: Double) =>
+  assert(d1 ~== d2 absTol 1e-5, s"Imputer ut error: $d2 should be 
$d1")
--- End diff --

Could we make this error message more informative, like "Imputed values 
differ. Expected: $d1, actual: $d2"


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61099076
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
--- End diff --

likewise, we can use `SchemaUtils.appendColumn` here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-26 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r61098866
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,231 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
--- End diff --

I think we can use `SchemaUtils.checkColumnType` here


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-20 Thread hhbyyh
Github user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-212723400
  
@jkbradley Appreciate if you can take another look, especially for the 
question from @MLnick in last comment.
@MLnick I'm not sure if we need to wait for Joseph to make the final 
decision. Can I know your preference on output column type?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-212503267
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56372/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-20 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-212503265
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-212503081
  
**[Test build #56372 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56372/consoleFull)**
 for PR 11601 at commit 
[`594c501`](https://github.com/apache/spark/commit/594c501f85cad2a278caee5f08b85deb61272e5d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-20 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-212489019
  
**[Test build #56372 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56372/consoleFull)**
 for PR 11601 at commit 
[`594c501`](https://github.com/apache/spark/commit/594c501f85cad2a278caee5f08b85deb61272e5d).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-19 Thread hhbyyh
Github user hhbyyh commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-212188252
  
Updated according to the comments.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-210947567
  
Merged build finished. Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-16 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-210947568
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/56035/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-210947421
  
**[Test build #56035 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56035/consoleFull)**
 for PR 11601 at commit 
[`1718422`](https://github.com/apache/spark/commit/171842210d3ea2e3c97fe803f0a8bb3831063f3f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-16 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/11601#issuecomment-210944565
  
**[Test build #56035 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/56035/consoleFull)**
 for PR 11601 at commit 
[`1718422`](https://github.com/apache/spark/commit/171842210d3ea2e3c97fe803f0a8bb3831063f3f).


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-16 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r59977329
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mea

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-16 Thread hhbyyh
Github user hhbyyh commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r59977297
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mea

[GitHub] spark pull request: [SPARK-13568] [ML] Create feature transformer ...

2016-04-15 Thread MLnick
Github user MLnick commented on a diff in the pull request:

https://github.com/apache/spark/pull/11601#discussion_r59918234
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/Imputer.scala ---
@@ -0,0 +1,230 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.ml.feature
+
+import org.apache.hadoop.fs.Path
+
+import org.apache.spark.SparkException
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.ml.{Estimator, Model}
+import org.apache.spark.ml.param._
+import org.apache.spark.ml.param.shared.{HasInputCol, HasOutputCol}
+import org.apache.spark.ml.util._
+import org.apache.spark.sql.{DataFrame, Dataset, Row}
+import org.apache.spark.sql.functions._
+import org.apache.spark.sql.types._
+import org.apache.spark.sql.types.NumericType
+
+/**
+ * Params for [[Imputer]] and [[ImputerModel]].
+ */
+private[feature] trait ImputerParams extends Params with HasInputCol with 
HasOutputCol {
+
+  /**
+   * The imputation strategy.
+   * If "mean", then replace missing values using the mean value of the 
feature.
+   * If "median", then replace missing values using the median value of 
the feature.
+   * If "mode", then replace missing using the most frequent value of the 
feature.
+   * Default: mean
+   *
+   * @group param
+   */
+  val strategy: Param[String] = new Param(this, "strategy", "strategy for 
imputation. " +
+"If mean, then replace missing values using the mean value of the 
feature." +
+"If median, then replace missing values using the median value of the 
feature." +
+"If mode, then replace missing using the most frequent value of the 
feature.",
+
ParamValidators.inArray[String](Imputer.supportedStrategyNames.toArray))
+
+  /** @group getParam */
+  def getStrategy: String = $(strategy)
+
+  /**
+   * The placeholder for the missing values. All occurrences of 
missingValue will be imputed.
+   * Default: Double.NaN
+   *
+   * @group param
+   */
+  val missingValue: DoubleParam = new DoubleParam(this, "missingValue",
+"The placeholder for the missing values. All occurrences of 
missingValue will be imputed")
+
+  /** @group getParam */
+  def getMissingValue: Double = $(missingValue)
+
+  /** Validates and transforms the input schema. */
+  protected def validateAndTransformSchema(schema: StructType): StructType 
= {
+val inputType = schema($(inputCol)).dataType
+require(inputType.isInstanceOf[NumericType],
+  s"Input column ${$(inputCol)} must be of NumericType")
+require(!schema.fieldNames.contains($(outputCol)),
+  s"Output column ${$(outputCol)} already exists.")
+val outputFields = schema.fields :+
+  StructField($(outputCol), inputType, schema($(inputCol)).nullable)
+StructType(outputFields)
+  }
+
+}
+
+/**
+ * :: Experimental ::
+ * Imputation estimator for completing missing values, either using the 
mean("mean"), the
+ * median("median") or the most frequent value("mode") of the column in 
which the missing
+ * values are located.
+ */
+@Experimental
+class Imputer @Since("2.0.0")(override val uid: String)
+  extends Estimator[ImputerModel] with ImputerParams with 
DefaultParamsWritable {
+
+  @Since("2.0.0")
+  def this() = this(Identifiable.randomUID("imputer"))
+
+  /** @group setParam */
+  def setInputCol(value: String): this.type = set(inputCol, value)
+
+  /** @group setParam */
+  def setOutputCol(value: String): this.type = set(outputCol, value)
+
+  /**
+   * Imputation strategy. Available options are ["mean", "median" and 
"mode"].
+   * @group setParam
+   */
+  def setStrategy(value: String): this.type = set(strategy, value)
+
+  /** @group setParam */
+  def setMissingValue(value: Double): this.type = set(missingValue, value)
+
+  setDefault(strategy -> "mea

  1   2   >