[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/798#issuecomment-109937398 Note: you might want to rename this to *[FLINK-1844] [ml] - Add Normaliser to ML library* so that JIRA picks up on the issue. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. ---
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31899145 --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala --- @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.scala._ +import org.apache.flink.ml.common.LabeledVector +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{DenseVector, Vector} +import org.apache.flink.test.util.FlinkTestBase +import org.scalatest.{FlatSpec, Matchers} + + +class MinMaxScalerITSuite + extends FlatSpec + with Matchers + with FlinkTestBase { + + behavior of "Flink's MinMax Scaler" + + import MinMaxScalerData._ + + it should "scale the vectors' values to be restricted in the (0.0,1.0) range" in { + +val env = ExecutionEnvironment.getExecutionEnvironment + +val dataSet = env.fromCollection(data) +val minMaxScaler = MinMaxScaler() +minMaxScaler.fit(dataSet) +val scaledVectors = minMaxScaler.transform(dataSet).collect + +scaledVectors.length should equal(data.length) + +for (vector <- scaledVectors) { + val test = vector.asBreeze.forall(fv => { +fv >= 0.0 && fv <= 1.0 --- End diff -- Yes it is. This ensures that anyone who changes the `Transformer` logic will see an error.
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31895883 --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala --- @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.scala._ +import org.apache.flink.ml.common.LabeledVector +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{DenseVector, Vector} +import org.apache.flink.test.util.FlinkTestBase +import org.scalatest.{FlatSpec, Matchers} + + +class MinMaxScalerITSuite + extends FlatSpec + with Matchers + with FlinkTestBase { + + behavior of "Flink's MinMax Scaler" + + import MinMaxScalerData._ + + it should "scale the vectors' values to be restricted in the (0.0,1.0) range" in { + +val env = ExecutionEnvironment.getExecutionEnvironment + +val dataSet = env.fromCollection(data) +val minMaxScaler = MinMaxScaler() +minMaxScaler.fit(dataSet) +val scaledVectors = minMaxScaler.transform(dataSet).collect + +scaledVectors.length should equal(data.length) + +for (vector <- scaledVectors) { + val test = vector.asBreeze.forall(fv => { +fv >= 0.0 && fv <= 1.0 --- End diff -- In this case I will use the same method as in the implementation of the transformer: calculate the min and max of each feature, then apply the formula that I explain in the documentation. Is that OK?
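The approach described in this comment (per-feature min and max, then the documented formula) can be sketched without any Flink dependency. The snippet below is only an illustration under stated assumptions: `ExpectedMinMax`, `scale`, and the sample rows are hypothetical names and data, not code from the pull request, and it assumes non-empty rows of equal length with no constant feature (which would cause a division by zero).

```scala
// Hypothetical helper, not part of the pull request: computes the expected
// min-max scaling of a small in-memory data set with plain Scala collections.
object ExpectedMinMax {

  /** Scales every feature (column) of `rows` into the range [min, max].
    * Assumes non-empty rows of equal length and no constant feature.
    */
  def scale(
      rows: Seq[Array[Double]],
      min: Double = 0.0,
      max: Double = 1.0): Seq[Array[Double]] = {
    val numFeatures = rows.head.length

    // Per-feature minimum and maximum over all rows.
    val featureMin = Array.tabulate(numFeatures)(j => rows.map(_(j)).min)
    val featureMax = Array.tabulate(numFeatures)(j => rows.map(_(j)).max)

    // z = (x - x_min) / (x_max - x_min) * (max - min) + min, feature by feature.
    rows.map { row =>
      Array.tabulate(numFeatures) { j =>
        (row(j) - featureMin(j)) / (featureMax(j) - featureMin(j)) * (max - min) + min
      }
    }
  }

  def main(args: Array[String]): Unit = {
    // Illustrative data only.
    val data = Seq(Array(1.0, 10.0), Array(2.0, 20.0), Array(3.0, 40.0))
    scale(data).foreach(row => println(row.mkString(", ")))
  }
}
```

A test could then compare each collected, scaled vector against the corresponding row of `scale(...)` within a small tolerance, rather than only checking that the values fall inside the target range.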
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/798#issuecomment-109913485 Really good work @fobeligi. The code is really well structured and documented. I have only a few minor comments. Once they are addressed, I think it's good to be merged :-)
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31895785 --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala --- @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.scala._ +import org.apache.flink.ml.common.LabeledVector +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{DenseVector, Vector} +import org.apache.flink.test.util.FlinkTestBase +import org.scalatest.{FlatSpec, Matchers} + + +class MinMaxScalerITSuite + extends FlatSpec + with Matchers + with FlinkTestBase { + + behavior of "Flink's MinMax Scaler" + + import MinMaxScalerData._ + + it should "scale the vectors' values to be restricted in the (0.0,1.0) range" in { + +val env = ExecutionEnvironment.getExecutionEnvironment + +val dataSet = env.fromCollection(data) +val minMaxScaler = MinMaxScaler() +minMaxScaler.fit(dataSet) +val scaledVectors = minMaxScaler.transform(dataSet).collect + +scaledVectors.length should equal(data.length) + +for (vector <- scaledVectors) { + val test = vector.asBreeze.forall(fv => { +fv >= 0.0 && fv <= 1.0 --- End diff -- Maybe we could not only compare whether the data lies between `0` and `1` but also whether the vectors have been correctly scaled.
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31895797 --- Diff: flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala --- @@ -0,0 +1,180 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.scala._ +import org.apache.flink.ml.common.LabeledVector +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{DenseVector, Vector} +import org.apache.flink.test.util.FlinkTestBase +import org.scalatest.{FlatSpec, Matchers} + + +class MinMaxScalerITSuite + extends FlatSpec + with Matchers + with FlinkTestBase { + + behavior of "Flink's MinMax Scaler" + + import MinMaxScalerData._ + + it should "scale the vectors' values to be restricted in the (0.0,1.0) range" in { + +val env = ExecutionEnvironment.getExecutionEnvironment + +val dataSet = env.fromCollection(data) +val minMaxScaler = MinMaxScaler() +minMaxScaler.fit(dataSet) +val scaledVectors = minMaxScaler.transform(dataSet).collect + +scaledVectors.length should equal(data.length) + +for (vector <- scaledVectors) { + val test = vector.asBreeze.forall(fv => { +fv >= 0.0 && fv <= 1.0 + }) + test shouldEqual (true) +} + + } + + it should "scale vectors' values in the (-1.0,1.0) range" in { + +val env = ExecutionEnvironment.getExecutionEnvironment + +val dataSet = env.fromCollection(data2) +val minMaxScaler = MinMaxScaler().setMin(-1.0).setMax(1.0) +minMaxScaler.fit(dataSet) +val scaledVectors = minMaxScaler.transform(dataSet).collect + +scaledVectors.length should equal(data2.length) + +for (labeledVector <- scaledVectors) { + val test = labeledVector.vector.asBreeze.forall(lv => { +lv >= -1.0 && lv <= 1.0 --- End diff -- The same here.
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31895445 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,255 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = (0,1). + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These values are used in the transform step +* to transform the given input data. 
+* +* @tparam T Input data type which is a subtype of [[Vector]] +* @return +*/ + implicit def fitVectorMinMaxScaler[T <: Vector] = new FitOperation[MinMaxSca
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31895095 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,255 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = (0,1). + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These values are used in the transform step +* to transform the given input data. 
+* +* @tparam T Input data type which is a subtype of [[Vector]] +* @return +*/ + implicit def fitVectorMinMaxScaler[T <: Vector] = new FitOperation[MinMaxSca
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31895117 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,255 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = (0,1). + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These values are used in the transform step +* to transform the given input data. 
+* +* @tparam T Input data type which is a subtype of [[Vector]] +* @return +*/ + implicit def fitVectorMinMaxScaler[T <: Vector] = new FitOperation[MinMaxSca
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31894794 --- Diff: docs/libs/ml/minMax_scaler.md --- @@ -0,0 +1,113 @@ +--- +mathjax: include +htmlTitle: FlinkML - MinMax Scaler +title: FlinkML - MinMax Scaler +--- + + +* This will be replaced by the TOC +{:toc} + +## Description + + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max]. + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval. + Given a set of input data $x_1, x_2,... x_n$, with minimum value: + + $$x_{min} = min({x_1, x_2,..., x_n})$$ + + and maximum value: + + $$x_{max} = max({x_1, x_2,..., x_n})$$ + +The scaled data set $z_1, z_2,...,z_n$ will be: + + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$ + +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale. + +## Operations + +`MinMaxScaler` is a `Transformer`. +As such, it supports the `fit` and `transform` operation. + +### Fit + +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`: + +* `fit[T <: Vector]: DataSet[T] => Unit` +* `fit: DataSet[LabeledVector] => Unit` + +### Transform + +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type: + +* `transform[T <: Vector]: DataSet[T] => DataSet[T]` +* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]` + +## Parameters + +The MinMax scaler implementation can be controlled by the following two parameters: + + + + + Parameters + Description + + + + + + Min + + + The minimum value of the range for the scaled data set. (Default value: 0.0) + + + + + Std --- End diff -- Yes, you are right! 
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31894783 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,255 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. + * By default for [[MinMaxScaler]] transformer range = (0,1). + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. 
+ * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These values are used in the transform step +* to transform the given input data. 
+* +* @tparam T Input data type which is a subtype of [[Vector]] +* @return +*/ + implicit def fitVectorMinMaxScaler[T <: Vector] = new FitOperation[MinMaxSca
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31894690 --- Diff: docs/libs/ml/minMax_scaler.md --- @@ -0,0 +1,113 @@ +--- +mathjax: include +htmlTitle: FlinkML - MinMax Scaler +title: FlinkML - MinMax Scaler +--- + + +* This will be replaced by the TOC +{:toc} + +## Description + + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max]. + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval. + Given a set of input data $x_1, x_2,... x_n$, with minimum value: + + $$x_{min} = min({x_1, x_2,..., x_n})$$ + + and maximum value: + + $$x_{max} = max({x_1, x_2,..., x_n})$$ + +The scaled data set $z_1, z_2,...,z_n$ will be: + + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$ + +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale. + +## Operations + +`MinMaxScaler` is a `Transformer`. +As such, it supports the `fit` and `transform` operation. + +### Fit + +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`: + +* `fit[T <: Vector]: DataSet[T] => Unit` +* `fit: DataSet[LabeledVector] => Unit` + +### Transform + +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type: + +* `transform[T <: Vector]: DataSet[T] => DataSet[T]` +* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]` + +## Parameters + +The MinMax scaler implementation can be controlled by the following two parameters: + + + + + Parameters + Description + + + + + + Min + + + The minimum value of the range for the scaled data set. (Default value: 0.0) + + + + + Std --- End diff -- Shouldn't this be called `Max`? 
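As a quick sanity check of the scaling formula quoted from the documentation above, here is a small worked example. The input values and the target range are made up for illustration; they do not come from the pull request.

```latex
% Illustrative input: x = (2, 4, 10), target range [min, max] = [-1, 1]
$$x_{min} = 2, \qquad x_{max} = 10$$

$$z_{1} = \frac{2 - 2}{10 - 2} \left( 1 - (-1) \right) + (-1) = -1$$

$$z_{2} = \frac{4 - 2}{10 - 2} \left( 1 - (-1) \right) + (-1) = 0.25 \cdot 2 - 1 = -0.5$$

$$z_{3} = \frac{10 - 2}{10 - 2} \left( 1 - (-1) \right) + (-1) = 1$$
```

The smallest input lands on the lower bound and the largest on the upper bound, which is what the `Min` and `Max` parameters are meant to control.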
[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library
GitHub user fobeligi opened a pull request: https://github.com/apache/flink/pull/798 [Flink 1844] Add Normaliser to ML library Adds a MinMaxScaler to the ML preprocessing package. The MinMax scaler scales the values to a user-specified range. You can merge this pull request into a Git repository by running: $ git pull https://github.com/fobeligi/incubator-flink FLINK-1844 Alternatively you can review and apply these changes as the patch at: https://github.com/apache/flink/pull/798.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #798 commit 802b9da07a2c3f7c055b4c024aaecbbe647db1cd Author: fobeligi Date: 2015-06-05T21:12:43Z [FLINK-1844] Add MinMaxScaler implementation in the preprocessing package, tests for the corresponding functionality, and documentation. commit e639185108f9bda253e296bae4c6c4269a30d1d0 Author: fobeligi Date: 2015-06-05T22:12:33Z [FLINK-1844] Change second test to use LabeledVectors instead of Vectors