[GitHub] spark pull request #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Sum...

2018-07-31 Thread zhengruifeng
Github user zhengruifeng closed the pull request at:

https://github.com/apache/spark/pull/19084


---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Sum...

2018-06-14 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19084#discussion_r195429701
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala ---
@@ -117,11 +113,56 @@ class MinMaxScaler @Since("1.5.0") (@Since("1.5.0") 
override val uid: String)
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
 transformSchema(dataset.schema, logging = true)
-val input: RDD[OldVector] = dataset.select($(inputCol)).rdd.map {
-  case Row(v: Vector) => OldVectors.fromML(v)
+
--- End diff --

Rather than copy all that code, I wonder if that utility class can just be 
modified to selectively handle NaN differently?  





[GitHub] spark pull request #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Sum...

2017-12-05 Thread zhengruifeng
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19084#discussion_r154897263
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala ---
@@ -117,11 +113,56 @@ class MinMaxScaler @Since("1.5.0") (@Since("1.5.0") 
override val uid: String)
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
 transformSchema(dataset.schema, logging = true)
-val input: RDD[OldVector] = dataset.select($(inputCol)).rdd.map {
-  case Row(v: Vector) => OldVectors.fromML(v)
+
--- End diff --

`Statistics.colStats(input)` uses `MultivariateOnlineSummarizer` to compute 
the max/min, which ignores `Double.NaN`.

I changed the NaN handling in `MultivariateOnlineSummarizer` in this PR, so I 
had to add a separate implementation in `MinMaxScaler` to keep its existing behavior.
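
For readers following the NaN point, here is a minimal sketch of how a comparison-based running min/max ends up ignoring `Double.NaN`, contrasted with a variant where NaN propagates. The object and method names are hypothetical and this is not the code from Spark or from this PR; the thread does not spell out what the PR's new NaN behavior is, so the second method is only one possible alternative shown for contrast.

```scala
// Illustrative sketch only: names are hypothetical, not Spark's API.
object NaNMinMaxSketch {

  /** Comparison-based running min/max.
    * Any comparison with NaN is false, so a NaN value never updates
    * the running state and is effectively ignored.
    */
  def minMaxIgnoringNaN(values: Seq[Double]): (Double, Double) = {
    var min = Double.PositiveInfinity
    var max = Double.NegativeInfinity
    values.foreach { v =>
      if (v < min) min = v
      if (v > max) max = v
    }
    (min, max)
  }

  /** Running min/max built on math.min / math.max.
    * These return NaN when either argument is NaN, so a single NaN
    * in the input makes both results NaN.
    */
  def minMaxPropagatingNaN(values: Seq[Double]): (Double, Double) =
    values.foldLeft((Double.PositiveInfinity, Double.NegativeInfinity)) {
      case ((mn, mx), v) => (math.min(mn, v), math.max(mx, v))
    }
}
```

Calling both methods on `Seq(1.0, Double.NaN, 3.0)` shows the difference: the first skips the NaN entry, while the second lets it take over both results. A scaler that depends on the first behavior would see different min/max values if the shared summarizer switched to the second.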






[GitHub] spark pull request #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Sum...

2017-12-05 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19084#discussion_r154894819
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala ---
@@ -117,11 +113,56 @@ class MinMaxScaler @Since("1.5.0") (@Since("1.5.0") 
override val uid: String)
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
 transformSchema(dataset.schema, logging = true)
-val input: RDD[OldVector] = dataset.select($(inputCol)).rdd.map {
-  case Row(v: Vector) => OldVectors.fromML(v)
+
--- End diff --

Is this code a copy of `Statistics.colStats(input)`? How does it differ?

