[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578569#comment-14578569 ]

ASF GitHub Bot commented on FLINK-1844:
---

Github user asfgit closed the pull request at:

    https://github.com/apache/flink/pull/798

> Add Normaliser to ML library
>
>
> Key: FLINK-1844
> URL: https://issues.apache.org/jira/browse/FLINK-1844
> Project: Flink
> Issue Type: Improvement
> Components: Machine Learning Library
> Reporter: Faye Beligianni
> Assignee: Faye Beligianni
> Priority: Minor
> Labels: ML, Starter
>
> In many algorithms in ML, the features' values would be better to lie between
> a given range of values, usually in the range (0,1) [1]. Therefore, a
> {{Transformer}} could be implemented to achieve that normalisation.
> Resources:
> [1][http://scikit-learn.org/stable/modules/generated/sklearn.preprocessing.MinMaxScaler.html]

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577360#comment-14577360 ]

ASF GitHub Bot commented on FLINK-1844:
---

Github user fobeligi commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31927083

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
@@ -0,0 +1,254 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import breeze.linalg.{max, min}
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified range.
+ * By default for [[MinMaxScaler]] transformer range = [0,1].
+ *
+ * This transformer takes a subtype of [[Vector]] of values and maps it to a
+ * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
+ *
+ * This transformer can be prepended to all [[Transformer]] and
+ * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
+ * of [[Vector]].
+ *
+ * @example
+ * {{{
+ * val trainingDS: DataSet[Vector] = env.fromCollection(data)
+ * val transformer = MinMaxScaler().setMin(-1.0)
+ *
+ * transformer.fit(trainingDS)
+ * val transformedDS = transformer.transform(trainingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
+ * - [[Max]]: The maximum value of the range of the transformed data set; by default
+ * equal to 1
+ */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
--- End diff --

Yes ^^
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577346#comment-14577346 ]

ASF GitHub Bot commented on FLINK-1844:
---

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31926077

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
@@ -0,0 +1,254 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import breeze.linalg.{max, min}
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified range.
+ * By default for [[MinMaxScaler]] transformer range = [0,1].
+ *
+ * This transformer takes a subtype of [[Vector]] of values and maps it to a
+ * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
+ *
+ * This transformer can be prepended to all [[Transformer]] and
+ * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
+ * of [[Vector]].
+ *
+ * @example
+ * {{{
+ * val trainingDS: DataSet[Vector] = env.fromCollection(data)
+ * val transformer = MinMaxScaler().setMin(-1.0)
+ *
+ * transformer.fit(trainingDS)
+ * val transformedDS = transformer.transform(trainingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
+ * - [[Max]]: The maximum value of the range of the transformed data set; by default
+ * equal to 1
+ */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
--- End diff --

Package private should be ok, since the test is in the same package, right?
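The visibility being discussed here can be sketched in a few lines of Scala. This is only an illustration, not code from the pull request: `preprocessing` is an object standing in for the real package so that the snippet runs as a single script, and `Scaler`, `ScalerCheck` and `learnedRange` are hypothetical names. The qualified modifier `private[preprocessing]` hides the field from outside code while keeping it readable for anything in the same enclosing scope, which is what a test suite in the same package relies on.

```scala
// Sketch only: `preprocessing` is an object standing in for a package so the
// snippet runs as one script; all names here are hypothetical.
object preprocessing {

  class Scaler {
    // Readable and writable anywhere inside `preprocessing`, hidden elsewhere.
    private[preprocessing] var metricsOption: Option[(Double, Double)] = None

    // Learns the (min, max) pair of the data, or None for empty input.
    def fit(data: Seq[Double]): Unit = {
      metricsOption = if (data.isEmpty) None else Some((data.min, data.max))
    }
  }

  // Plays the role of a test suite that lives in the same package.
  object ScalerCheck {
    def learnedRange(scaler: Scaler): Option[(Double, Double)] = scaler.metricsOption
  }
}
```

Code outside `preprocessing` that tried to read `scaler.metricsOption` directly would be rejected by the compiler, while `ScalerCheck` can still inspect the learned values.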
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577329#comment-14577329 ]

ASF GitHub Bot commented on FLINK-1844:
---

Github user fobeligi commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31924947

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
@@ -0,0 +1,254 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import breeze.linalg.{max, min}
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified range.
+ * By default for [[MinMaxScaler]] transformer range = [0,1].
+ *
+ * This transformer takes a subtype of [[Vector]] of values and maps it to a
+ * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
+ *
+ * This transformer can be prepended to all [[Transformer]] and
+ * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
+ * of [[Vector]].
+ *
+ * @example
+ * {{{
+ * val trainingDS: DataSet[Vector] = env.fromCollection(data)
+ * val transformer = MinMaxScaler().setMin(-1.0)
+ *
+ * transformer.fit(trainingDS)
+ * val transformedDS = transformer.transform(trainingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
+ * - [[Max]]: The maximum value of the range of the transformed data set; by default
+ * equal to 1
+ */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
--- End diff --

Hey, if the {{metricsOption}} field is package private then my tests will fail, because in the {{MinMaxScalerITSuite}} I am also testing whether the min and max of each feature have been calculated correctly.
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577298#comment-14577298 ]

ASF GitHub Bot commented on FLINK-1844:
---

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31922838

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
@@ -0,0 +1,254 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import breeze.linalg.{max, min}
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified range.
+ * By default for [[MinMaxScaler]] transformer range = [0,1].
+ *
+ * This transformer takes a subtype of [[Vector]] of values and maps it to a
+ * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
+ *
+ * This transformer can be prepended to all [[Transformer]] and
+ * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
+ * of [[Vector]].
+ *
+ * @example
+ * {{{
+ * val trainingDS: DataSet[Vector] = env.fromCollection(data)
+ * val transformer = MinMaxScaler().setMin(-1.0)
+ *
+ * transformer.fit(trainingDS)
+ * val transformedDS = transformer.transform(trainingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
+ * - [[Max]]: The maximum value of the range of the transformed data set; by default
+ * equal to 1
+ */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
--- End diff --

Will make the field package private.
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577292#comment-14577292 ]

ASF GitHub Bot commented on FLINK-1844:
---

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31922171

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
@@ -0,0 +1,254 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import breeze.linalg.{max, min}
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified range.
+ * By default for [[MinMaxScaler]] transformer range = [0,1].
+ *
+ * This transformer takes a subtype of [[Vector]] of values and maps it to a
+ * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
+ *
+ * This transformer can be prepended to all [[Transformer]] and
+ * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
+ * of [[Vector]].
+ *
+ * @example
+ * {{{
+ * val trainingDS: DataSet[Vector] = env.fromCollection(data)
+ * val transformer = MinMaxScaler().setMin(-1.0)
+ *
+ * transformer.fit(trainingDS)
+ * val transformedDS = transformer.transform(trainingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
+ * - [[Max]]: The maximum value of the range of the transformed data set; by default
+ * equal to 1
+ */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
--- End diff --

As private state, the developer should be able to choose any type. Thus, a `BreezeVector` should be fine here. I was just wondering whether a `DenseVector` would make more sense here. Is it safe to assume that every feature has at least 2 non-zero values?
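For readers following the question about the type of `metricsOption`, the state in question is a pair of vectors holding the minimum and the maximum of every feature. A minimal sketch of how such a pair could be computed is given below, using plain Scala arrays in place of Breeze vectors; `FeatureRange` and `minMax` are hypothetical names and this is not the reduction used in the pull request. Both results carry one entry per feature, which may be one reason a dense representation comes up in the discussion.

```scala
// Sketch only: plain arrays stand in for Breeze vectors; names are hypothetical.
object FeatureRange {

  // Returns (per-feature minimum, per-feature maximum) over all input vectors.
  def minMax(data: Seq[Array[Double]]): (Array[Double], Array[Double]) = {
    require(data.nonEmpty, "need at least one vector")
    val dim = data.head.length
    require(data.forall(_.length == dim), "all vectors must have the same dimension")

    // Start every vector as its own (min, max) pair, then merge pairs elementwise.
    data.map(v => (v, v)).reduce { (a, b) =>
      val mins = a._1.zip(b._1).map { case (x, y) => math.min(x, y) }
      val maxs = a._2.zip(b._2).map { case (x, y) => math.max(x, y) }
      (mins, maxs)
    }
  }
}
```

An entry of the minimum or maximum array is zero only when the corresponding feature's smallest or largest value is exactly zero, so how sparse these arrays turn out depends on the data set.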
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577286#comment-14577286 ]

ASF GitHub Bot commented on FLINK-1844:
---

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31921747

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
@@ -0,0 +1,254 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import breeze.linalg.{max, min}
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified range.
+ * By default for [[MinMaxScaler]] transformer range = [0,1].
+ *
+ * This transformer takes a subtype of [[Vector]] of values and maps it to a
+ * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
+ *
+ * This transformer can be prepended to all [[Transformer]] and
+ * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
+ * of [[Vector]].
+ *
+ * @example
+ * {{{
+ * val trainingDS: DataSet[Vector] = env.fromCollection(data)
+ * val transformer = MinMaxScaler().setMin(-1.0)
+ *
+ * transformer.fit(trainingDS)
+ * val transformedDS = transformer.transform(trainingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
+ * - [[Max]]: The maximum value of the range of the transformed data set; by default
+ * equal to 1
+ */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
+
+  /** Sets the minimum for the range of the transformed data
+    *
+    * @param min the user-specified minimum value.
+    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
+    */
+  def setMin(min: Double): MinMaxScaler = {
+    parameters.add(Min, min)
+    this
+  }
+
+  /** Sets the maximum for the range of the transformed data
+    *
+    * @param max the user-specified maximum value.
+    * @return the MinMaxScaler instance with its maximum value set to the user-specified value.
+    */
+  def setMax(max: Double): MinMaxScaler = {
+    parameters.add(Max, max)
+    this
+  }
+}
+
+object MinMaxScaler {
+
+  // == Parameters =
+
+  case object Min extends Parameter[Double] {
+    override val defaultValue: Option[Double] = Some(0.0)
+  }
+
+  case object Max extends Parameter[Double] {
+    override val defaultValue: Option[Double] = Some(1.0)
+  }
+
+  // Factory methods ==
+
+  def apply(): MinMaxScaler = {
+    new MinMaxScaler()
+  }
+
+  // == Operations =
+
+  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and
+    * maximum of each feature of the training data. These
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577284#comment-14577284 ]

ASF GitHub Bot commented on FLINK-1844:
---

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31921716

--- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala ---
@@ -0,0 +1,254 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements. See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership. The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import breeze.linalg.{max, min}
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified range.
+ * By default for [[MinMaxScaler]] transformer range = [0,1].
+ *
+ * This transformer takes a subtype of [[Vector]] of values and maps it to a
+ * scaled subtype of [[Vector]] such that each feature lies between a user-specified range.
+ *
+ * This transformer can be prepended to all [[Transformer]] and
+ * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype
+ * of [[Vector]].
+ *
+ * @example
+ * {{{
+ * val trainingDS: DataSet[Vector] = env.fromCollection(data)
+ * val transformer = MinMaxScaler().setMin(-1.0)
+ *
+ * transformer.fit(trainingDS)
+ * val transformedDS = transformer.transform(trainingDS)
+ * }}}
+ *
+ * =Parameters=
+ *
+ * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0
+ * - [[Max]]: The maximum value of the range of the transformed data set; by default
+ * equal to 1
+ */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None
+
+  /** Sets the minimum for the range of the transformed data
+    *
+    * @param min the user-specified minimum value.
+    * @return the MinMaxScaler instance with its minimum value set to the user-specified value.
+    */
+  def setMin(min: Double): MinMaxScaler = {
+    parameters.add(Min, min)
+    this
+  }
+
+  /** Sets the maximum for the range of the transformed data
+    *
+    * @param max the user-specified maximum value.
+    * @return the MinMaxScaler instance with its maximum value set to the user-specified value.
+    */
+  def setMax(max: Double): MinMaxScaler = {
+    parameters.add(Max, max)
+    this
+  }
+}
+
+object MinMaxScaler {
+
+  // == Parameters =
+
+  case object Min extends Parameter[Double] {
+    override val defaultValue: Option[Double] = Some(0.0)
+  }
+
+  case object Max extends Parameter[Double] {
+    override val defaultValue: Option[Double] = Some(1.0)
+  }
+
+  // Factory methods ==
+
+  def apply(): MinMaxScaler = {
+    new MinMaxScaler()
+  }
+
+  // == Operations =
+
+  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and
+    * maximum of each feature of the training data. These
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577280#comment-14577280 ]

ASF GitHub Bot commented on FLINK-1844:
---

Github user tillrohrmann commented on a diff in the pull request:

    https://github.com/apache/flink/pull/798#discussion_r31921435

--- Diff: docs/libs/ml/minMax_scaler.md ---
@@ -0,0 +1,112 @@
+---
+mathjax: include
+htmlTitle: FlinkML - MinMax Scaler
+title: FlinkML - MinMax Scaler
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The MinMax scaler scales the given data set, so that all values will lie between a user-specified range [min,max].
+ In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval.
+ Given a set of input data $x_1, x_2,... x_n$, with minimum value:
+
+ $$x_{min} = \min(\{x_1, x_2,..., x_n\})$$
+
+ and maximum value:
+
+ $$x_{max} = \max(\{x_1, x_2,..., x_n\})$$
+
+The scaled data set $z_1, z_2,...,z_n$ will be:
+
+ $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$
+
+where $\textit{min}$ and $\textit{max}$ are the user-specified minimum and maximum values of the range to scale.
+
+## Operations
+
+`MinMaxScaler` is a `Transformer`.
+As such, it supports the `fit` and `transform` operations.
+
+### Fit
+
+MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
+
+* `fit[T <: Vector]: DataSet[T] => Unit`
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Transform
+
+MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type:
+
+* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
+* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
+
+## Parameters
+
+The MinMax scaler implementation can be controlled by the following two parameters:
+
+
+
+
+ Parameters
+ Description
+
+
+
+
+
+ Min
+
+
+ The minimum value of the range for the scaled data set. (Default value: 0.0)
+
+
+
+
+ Max
+
+
+ The maximum value of the range for the scaled data set. (Default value: 1.0)
+
+
+
+
+
+
+## Examples
+
+{% highlight scala %}
+// Create MinMax scaler transformer
+val minMaxscaler = MinMaxScaler()
+  .setMin(-1.0)
--- End diff --

Will address this when merging.
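The scaling formula quoted in the documentation diff above can be tried out on a single feature column with a few lines of plain Scala. This is a minimal sketch and not the FlinkML implementation: `MinMaxSketch` and `scale` are hypothetical names, and the handling of a constant feature (where the formula divides by zero) is an assumption made here for illustration.

```scala
// Sketch only: applies z = (x - xMin) / (xMax - xMin) * (max - min) + min
// to one feature column. Names are hypothetical, not part of FlinkML.
object MinMaxSketch {

  def scale(xs: Seq[Double], min: Double = 0.0, max: Double = 1.0): Seq[Double] = {
    require(xs.nonEmpty, "need at least one value")
    val xMin = xs.min
    val xMax = xs.max
    val range = xMax - xMin

    if (range == 0.0) {
      // Assumption: a constant feature is mapped to the lower bound,
      // because the formula itself is undefined in this case.
      xs.map(_ => min)
    } else {
      xs.map(x => (x - xMin) / range * (max - min) + min)
    }
  }
}
```

With the range used in the documentation example, `MinMaxSketch.scale(Seq(1.0, 2.0, 3.0), min = -1.0)` yields `-1.0, 0.0, 1.0`: the smallest input lands on the lower bound, the largest on the upper bound, and the midpoint in between.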
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577279#comment-14577279 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31921419 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. --- End diff -- You're right. Will add it when I merge it.
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577193#comment-14577193 ] ASF GitHub Bot commented on FLINK-1844: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31914466 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Not right now, so these can remain. I was mostly concerned that this parameter was user-facing, meaning the user had to provide Breeze vectors as parameters, but that is not the case.
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577187#comment-14577187 ] ASF GitHub Bot commented on FLINK-1844: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31913806 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. 
+*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These valu
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577185#comment-14577185 ] ASF GitHub Bot commented on FLINK-1844: --- Github user fobeligi commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31913634 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- I am using the metricsOption vectors internally in the transformer for element-wise subtractions and divisions, so instead of converting between Breeze and flink.ml.math.Vector I keep them as breeze.linalg.Vector. Can I perform the same operations with flink.ml.math.Vector, or do you believe that it would be better to perform the conversions (to/from Breeze vectors) in the functions?
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577145#comment-14577145 ] ASF GitHub Bot commented on FLINK-1844: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31911384 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None + + /** Sets the minimum for the range of the transformed data +* +* @param min the user-specified minimum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. +*/ + def setMin(min: Double): MinMaxScaler = { +parameters.add(Min, min) +this + } + + /** Sets the maximum for the range of the transformed data +* +* @param max the user-specified maximum value. +* @return the MinMaxScaler instance with its minimum value set to the user-specified value. 
+*/ + def setMax(max: Double): MinMaxScaler = { +parameters.add(Max, max) +this + } +} + +object MinMaxScaler { + + // == Parameters = + + case object Min extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(0.0) + } + + case object Max extends Parameter[Double] { +override val defaultValue: Option[Double] = Some(1.0) + } + + // Factory methods == + + def apply(): MinMaxScaler = { +new MinMaxScaler() + } + + // == Operations = + + /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by learning the minimum and +* maximum of each feature of the training data. These valu
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577142#comment-14577142 ] ASF GitHub Bot commented on FLINK-1844: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31911306 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. + * + * @example + * {{{ + * val trainingDS: DataSet[Vector] = env.fromCollection(data) + * val transformer = MinMaxScaler().setMin(-1.0) + * + * transformer.fit(trainingDS) + * val transformedDS = transformer.transform(trainingDS) + * }}} + * + * =Parameters= + * + * - [[Min]]: The minimum value of the range of the transformed data set; by default equal to 0 + * - [[Max]]: The maximum value of the range of the transformed data set; by default + * equal to 1 + */ +class MinMaxScaler extends Transformer[MinMaxScaler] { + + var metricsOption: Option[DataSet[(linalg.Vector[Double], linalg.Vector[Double])]] = None --- End diff -- Are these of breeze.linalg.Vector type? If so, why not use flink.ml.math.Vector?
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577141#comment-14577141 ] ASF GitHub Bot commented on FLINK-1844: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31911162 --- Diff: docs/libs/ml/minMax_scaler.md --- @@ -0,0 +1,112 @@ +--- +mathjax: include +htmlTitle: FlinkML - MinMax Scaler +title: FlinkML - MinMax Scaler +--- + + +* This will be replaced by the TOC +{:toc} + +## Description + + The MinMax scaler scales the given data set, so that all values will lie between a user specified range [min,max]. + In case the user does not provide a specific minimum and maximum value for the scaling range, the MinMax scaler transforms the features of the input data set to lie in the [0,1] interval. + Given a set of input data $x_1, x_2,... x_n$, with minimum value: + + $$x_{min} = min({x_1, x_2,..., x_n})$$ + + and maximum value: + + $$x_{max} = max({x_1, x_2,..., x_n})$$ + +The scaled data set $z_1, z_2,...,z_n$ will be: + + $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min \right ) + min$$ + +where $\textit{min}$ and $\textit{max}$ are the user specified minimum and maximum values of the range to scale. + +## Operations + +`MinMaxScaler` is a `Transformer`. +As such, it supports the `fit` and `transform` operation. 
+ +### Fit + +MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`: + +* `fit[T <: Vector]: DataSet[T] => Unit` +* `fit: DataSet[LabeledVector] => Unit` + +### Transform + +MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into the respective type: + +* `transform[T <: Vector]: DataSet[T] => DataSet[T]` +* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]` + +## Parameters + +The MinMax scaler implementation can be controlled by the following two parameters: + + + + + Parameters + Description + + + + + + Min + + + The minimum value of the range for the scaled data set. (Default value: 0.0) + + + + + Max + + + The maximum value of the range for the scaled data set. (Default value: 1.0) + + + + + + +## Examples + +{% highlight scala %} +// Create MinMax scaler transformer +val minMaxscaler = MinMaxScaler() +.setMin(-1.0) --- End diff -- Indent 2 spaces
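For reference, the scaling rule quoted from the documentation diff, z_i = (x_i - x_min) / (x_max - x_min) * (max - min) + min, can be sketched on plain Scala collections. This is only an illustration and not the code from the pull request: the object `MinMaxSketch`, its `scale` method, and the decision to map a constant feature to the lower bound are assumptions made here, and no Flink or Breeze types are involved.

```scala
// Minimal sketch of per-feature min-max scaling on plain Scala collections.
// Hypothetical helper for illustration; not the MinMaxScaler from the PR.
object MinMaxSketch {

  /** Scales every feature (column) of `data` into the range [lo, hi]. */
  def scale(
      data: Seq[Array[Double]],
      lo: Double = 0.0,
      hi: Double = 1.0): Seq[Array[Double]] = {
    require(data.nonEmpty, "data must not be empty")
    val dims = data.head.length

    // Per-feature minimum and maximum over the whole data set (the "fit" step).
    val mins = Array.tabulate(dims)(j => data.map(_(j)).min)
    val maxs = Array.tabulate(dims)(j => data.map(_(j)).max)

    // Apply z = (x - x_min) / (x_max - x_min) * (hi - lo) + lo (the "transform" step).
    data.map { x =>
      Array.tabulate(dims) { j =>
        val range = maxs(j) - mins(j)
        // Assumption: a constant feature (range == 0) is mapped to `lo`.
        if (range == 0.0) lo else (x(j) - mins(j)) / range * (hi - lo) + lo
      }
    }
  }
}
```

Calling `MinMaxSketch.scale(Seq(Array(1.0, 10.0), Array(3.0, 20.0), Array(5.0, 40.0)), -1.0, 1.0)` maps the first feature to -1.0, 0.0 and 1.0, which mirrors the `setMin(-1.0)` example in the diff with the default maximum of 1.0.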
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577140#comment-14577140 ] ASF GitHub Bot commented on FLINK-1844: --- Github user thvasilo commented on a diff in the pull request: https://github.com/apache/flink/pull/798#discussion_r31911138 --- Diff: flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala --- @@ -0,0 +1,254 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one + * or more contributor license agreements. See the NOTICE file + * distributed with this work for additional information + * regarding copyright ownership. The ASF licenses this file + * to you under the Apache License, Version 2.0 (the + * "License"); you may not use this file except in compliance + * with the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.flink.ml.preprocessing + +import breeze.linalg +import breeze.linalg.{max, min} +import org.apache.flink.api.common.typeinfo.TypeInformation +import org.apache.flink.api.scala._ +import org.apache.flink.ml._ +import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap} +import org.apache.flink.ml.math.Breeze._ +import org.apache.flink.ml.math.{BreezeVectorConverter, Vector} +import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, Transformer} +import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min} + +import scala.reflect.ClassTag + +/** Scales observations, so that all features are in a user-specified range. 
+ * By default for [[MinMaxScaler]] transformer range = [0,1]. + * + * This transformer takes a subtype of [[Vector]] of values and maps it to a + * scaled subtype of [[Vector]] such that each feature lies between a user-specified range. + * + * This transformer can be prepended to all [[Transformer]] and + * [[org.apache.flink.ml.pipeline.Predictor]] implementations which expect as input a subtype + * of [[Vector]]. --- End diff -- Doesn't LabeledVector apply here as well?
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577131#comment-14577131 ] ASF GitHub Bot commented on FLINK-1844: --- Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/798#issuecomment-109983940 The documentation change must also update index.html (the FlinkML landing page) so that the new page is linked from somewhere.
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14577122#comment-14577122 ] ASF GitHub Bot commented on FLINK-1844: --- Github user tillrohrmann commented on the pull request: https://github.com/apache/flink/pull/798#issuecomment-109982291 LGTM. Will merge once Travis gives green light.
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576962#comment-14576962 ] ASF GitHub Bot commented on FLINK-1844: --- Github user thvasilo commented on the pull request: https://github.com/apache/flink/pull/798#issuecomment-109937398 Note: you might want to rename this to *[FLINK-1844] [ml] - Add Normaliser to ML library* so that JIRA picks up on the issue.
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14576282#comment-14576282 ] Faye Beligianni commented on FLINK-1844: Hey [~tvas] I opened a PR for the normaliser, which I named MinMaxScaler. Any comments are welcome! Regarding the two tests that I wrote, I think that maybe they are too simple, as I am only checking whether the numbers are in the user-specified range. Cross-checking the result against a dataset of "expectedScaledVectors" would require calculating the "expectedScaledVectors" with the same method that I used in the implementation of the MinMaxScaler (I wasn't sure if that would be correct).
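One way around the circularity raised in the comment above is to write the expected vectors down by hand for a very small data set, so that the check does not reuse the implementation's own arithmetic. The following is a minimal sketch of that idea under stated assumptions: `ScalerCheck`, the `Scaler` function type and `matchesHandComputed` are hypothetical names, and the scaler under test is passed in as a plain function rather than the Flink `MinMaxScaler` from the pull request.

```scala
// Sketch of a cross-check against hand-computed expectations.
// All names here are hypothetical and only serve the illustration.
object ScalerCheck {

  // A scaler takes (data, min, max) and returns the scaled data.
  type Scaler = (Seq[Array[Double]], Double, Double) => Seq[Array[Double]]

  /** Compares two vectors element-wise with an absolute tolerance. */
  def approxEqual(a: Array[Double], b: Array[Double], eps: Double = 1e-9): Boolean =
    a.length == b.length && a.zip(b).forall { case (x, y) => math.abs(x - y) <= eps }

  /** Returns true if `scaler` reproduces hand-computed values on a tiny data set. */
  def matchesHandComputed(scaler: Scaler): Boolean = {
    val input = Seq(Array(0.0, 2.0), Array(5.0, 4.0), Array(10.0, 10.0))
    // Worked out by hand for the target range [0, 1]:
    //   feature 0: min 0, max 10 -> 0.0, 0.5, 1.0
    //   feature 1: min 2, max 10 -> 0.0, 0.25, 1.0
    val expected = Seq(Array(0.0, 0.0), Array(0.5, 0.25), Array(1.0, 1.0))
    val actual = scaler(input, 0.0, 1.0)
    actual.length == expected.length &&
      actual.zip(expected).forall { case (a, e) => approxEqual(a, e) }
  }
}
```

Because the expected values are literals, a bug in the scaling formula would show up as a mismatch instead of being reproduced on both sides of the comparison; the range check from the existing tests can stay alongside it.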
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14574166#comment-14574166 ] Theodore Vasiloudis commented on FLINK-1844: No worries [~fobeligi], thank you for your contribution. Keep us updated.
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14573629#comment-14573629 ] Faye Beligianni commented on FLINK-1844: Hello [~tvas], I have implemented the main algorithm, but I still have to migrate it to the new ML pipeline and create a test. I am sorry for not looking at this issue for a while; I will finalise it during the weekend.
[jira] [Commented] (FLINK-1844) Add Normaliser to ML library
[ https://issues.apache.org/jira/browse/FLINK-1844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14572941#comment-14572941 ] Theodore Vasiloudis commented on FLINK-1844: Hello Faye, have you managed to make any progress on this? We could really use it for the quickstart examples.