[GitHub] spark pull request #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Sum...
Github user zhengruifeng closed the pull request at: https://github.com/apache/spark/pull/19084 --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Sum...
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19084#discussion_r195429701

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala ---
@@ -117,11 +113,56 @@ class MinMaxScaler @Since("1.5.0") (@Since("1.5.0") override val uid: String)
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
     transformSchema(dataset.schema, logging = true)
-    val input: RDD[OldVector] = dataset.select($(inputCol)).rdd.map {
-      case Row(v: Vector) => OldVectors.fromML(v)
+
--- End diff --

Rather than copy all that code, I wonder if that utility class can just be modified to selectively handle NaN differently?
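To make the suggestion above concrete, here is a minimal sketch of a per-column min/max accumulator with a switch for NaN handling. The class name `ColumnMinMax` and the `ignoreNaN` flag are hypothetical and are not part of Spark; the sketch only illustrates how one utility could serve both behaviors instead of duplicating code.

```scala
// Hypothetical helper, NOT Spark's API: per-column min/max with selectable NaN handling.
final class ColumnMinMax(numCols: Int, ignoreNaN: Boolean) extends Serializable {
  private val mins = Array.fill(numCols)(Double.MaxValue)
  private val maxs = Array.fill(numCols)(Double.MinValue)

  /** Fold one dense row into the running per-column min and max. */
  def add(row: Array[Double]): this.type = {
    require(row.length == numCols, s"expected $numCols columns, got ${row.length}")
    var i = 0
    while (i < numCols) {
      val v = row(i)
      if (ignoreNaN) {
        // Any comparison with NaN is false, so NaN never replaces the running value.
        if (v < mins(i)) mins(i) = v
        if (v > maxs(i)) maxs(i) = v
      } else {
        // math.min / math.max return NaN if either argument is NaN.
        mins(i) = math.min(mins(i), v)
        maxs(i) = math.max(maxs(i), v)
      }
      i += 1
    }
    this
  }

  /** Combine with another partial result, e.g. from a different partition. */
  def merge(other: ColumnMinMax): this.type = {
    require(other.min.length == numCols, "column counts differ")
    val (otherMin, otherMax) = (other.min, other.max)
    var i = 0
    while (i < numCols) {
      // In ignoreNaN mode neither side holds NaN; otherwise NaN propagates here too.
      mins(i) = math.min(mins(i), otherMin(i))
      maxs(i) = math.max(maxs(i), otherMax(i))
      i += 1
    }
    this
  }

  def min: Array[Double] = mins.clone()
  def max: Array[Double] = maxs.clone()
}
```

With `ignoreNaN = true` a column containing `Double.NaN` still reports a finite min and max; with `ignoreNaN = false` that column reports NaN for both.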
[GitHub] spark pull request #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Sum...
Github user zhengruifeng commented on a diff in the pull request:

https://github.com/apache/spark/pull/19084#discussion_r154897263

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala ---
@@ -117,11 +113,56 @@ class MinMaxScaler @Since("1.5.0") (@Since("1.5.0") override val uid: String)
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
     transformSchema(dataset.schema, logging = true)
-    val input: RDD[OldVector] = dataset.select($(inputCol)).rdd.map {
-      case Row(v: Vector) => OldVectors.fromML(v)
+
--- End diff --

`Statistics.colStats(input)` uses `MultivariateOnlineSummarizer` to compute the max/min, which ignores `Double.NaN`. This PR changes the NaN handling in `MultivariateOnlineSummarizer`, so I had to add a separate implementation in `MinMaxScaler` to keep its existing behavior.
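As a small illustration of why a running max/min can silently ignore `Double.NaN`, the sketch below contrasts a comparison-based update with `math.max`. It is not the summarizer's actual code; the object and method names are made up for this example, and it assumes the ignoring behavior comes from plain `<` comparisons.

```scala
// Illustrative only: the names here are hypothetical and not taken from Spark.
object NaNComparisonSketch {
  // Comparison-based update: `cur < v` is false whenever v is NaN,
  // so a NaN input never replaces the running maximum.
  def maxIgnoringNaN(values: Seq[Double]): Double = {
    var cur = Double.MinValue
    for (v <- values) if (cur < v) cur = v
    cur
  }

  // math.max returns NaN if either argument is NaN,
  // so a single NaN input makes the whole result NaN.
  def maxPropagatingNaN(values: Seq[Double]): Double =
    values.foldLeft(Double.MinValue)((a, b) => math.max(a, b))
}
```

The two functions agree on NaN-free input and differ only once a NaN appears, which is the behavioral difference discussed in this thread.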
[GitHub] spark pull request #19084: [SPARK-20711][ML]MultivariateOnlineSummarizer/Sum...
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19084#discussion_r154894819

--- Diff: mllib/src/main/scala/org/apache/spark/ml/feature/MinMaxScaler.scala ---
@@ -117,11 +113,56 @@ class MinMaxScaler @Since("1.5.0") (@Since("1.5.0") override val uid: String)
   @Since("2.0.0")
   override def fit(dataset: Dataset[_]): MinMaxScalerModel = {
     transformSchema(dataset.schema, logging = true)
-    val input: RDD[OldVector] = dataset.select($(inputCol)).rdd.map {
-      case Row(v: Vector) => OldVectors.fromML(v)
+
--- End diff --

Is this code a copy of `Statistics.colStats(input)`? How does it differ?