[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread thvasilo
Github user thvasilo commented on the pull request:

https://github.com/apache/flink/pull/798#issuecomment-109937398
  
Note: you might want to rename this to *[FLINK-1844] [ml] - Add Normaliser 
to ML library* so that JIRA picks up on the issue.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31899145
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala
 ---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{DenseVector, Vector}
+import org.apache.flink.test.util.FlinkTestBase
+import org.scalatest.{FlatSpec, Matchers}
+
+
+class MinMaxScalerITSuite
+  extends FlatSpec
+  with Matchers
+  with FlinkTestBase {
+
+  behavior of "Flink's MinMax Scaler"
+
+  import MinMaxScalerData._
+
+  it should "scale the vectors' values to be restricted in the (0.0,1.0) 
range" in {
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val dataSet = env.fromCollection(data)
+val minMaxScaler = MinMaxScaler()
+minMaxScaler.fit(dataSet)
+val scaledVectors = minMaxScaler.transform(dataSet).collect
+
+scaledVectors.length should equal(data.length)
+
+for (vector <- scaledVectors) {
+  val test = vector.asBreeze.forall(fv => {
+fv >= 0.0 && fv <= 1.0
--- End diff --

Yes it is. This assures that if someone changes something of the 
`Transformer` logic, then he will see an error.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread fobeligi
Github user fobeligi commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31895883
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala
 ---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{DenseVector, Vector}
+import org.apache.flink.test.util.FlinkTestBase
+import org.scalatest.{FlatSpec, Matchers}
+
+
+class MinMaxScalerITSuite
+  extends FlatSpec
+  with Matchers
+  with FlinkTestBase {
+
+  behavior of "Flink's MinMax Scaler"
+
+  import MinMaxScalerData._
+
+  it should "scale the vectors' values to be restricted in the (0.0,1.0) 
range" in {
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val dataSet = env.fromCollection(data)
+val minMaxScaler = MinMaxScaler()
+minMaxScaler.fit(dataSet)
+val scaledVectors = minMaxScaler.transform(dataSet).collect
+
+scaledVectors.length should equal(data.length)
+
+for (vector <- scaledVectors) {
+  val test = vector.asBreeze.forall(fv => {
+fv >= 0.0 && fv <= 1.0
--- End diff --

In this case I will use the same method as in the implementation of the 
transformer.
Calculating the min and max of each feature and then applying the formula 
which I explain in the documentation. Is that OK?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on the pull request:

https://github.com/apache/flink/pull/798#issuecomment-109913485
  
Really good work @fobeligi. The code is really well structured and 
documented. I had only some minor comments. When you have them addressed, I 
think it's good to be merged :-)


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31895785
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala
 ---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{DenseVector, Vector}
+import org.apache.flink.test.util.FlinkTestBase
+import org.scalatest.{FlatSpec, Matchers}
+
+
+class MinMaxScalerITSuite
+  extends FlatSpec
+  with Matchers
+  with FlinkTestBase {
+
+  behavior of "Flink's MinMax Scaler"
+
+  import MinMaxScalerData._
+
+  it should "scale the vectors' values to be restricted in the (0.0,1.0) 
range" in {
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val dataSet = env.fromCollection(data)
+val minMaxScaler = MinMaxScaler()
+minMaxScaler.fit(dataSet)
+val scaledVectors = minMaxScaler.transform(dataSet).collect
+
+scaledVectors.length should equal(data.length)
+
+for (vector <- scaledVectors) {
+  val test = vector.asBreeze.forall(fv => {
+fv >= 0.0 && fv <= 1.0
--- End diff --

Maybe we could not only compare whether the data lies between `0` and `1` 
but also whether the vectors have been correctly scaled.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31895797
  
--- Diff: 
flink-staging/flink-ml/src/test/scala/org/apache/flink/ml/preprocessing/MinMaxScalerITSuite.scala
 ---
@@ -0,0 +1,180 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.scala._
+import org.apache.flink.ml.common.LabeledVector
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{DenseVector, Vector}
+import org.apache.flink.test.util.FlinkTestBase
+import org.scalatest.{FlatSpec, Matchers}
+
+
+class MinMaxScalerITSuite
+  extends FlatSpec
+  with Matchers
+  with FlinkTestBase {
+
+  behavior of "Flink's MinMax Scaler"
+
+  import MinMaxScalerData._
+
+  it should "scale the vectors' values to be restricted in the (0.0,1.0) 
range" in {
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val dataSet = env.fromCollection(data)
+val minMaxScaler = MinMaxScaler()
+minMaxScaler.fit(dataSet)
+val scaledVectors = minMaxScaler.transform(dataSet).collect
+
+scaledVectors.length should equal(data.length)
+
+for (vector <- scaledVectors) {
+  val test = vector.asBreeze.forall(fv => {
+fv >= 0.0 && fv <= 1.0
+  })
+  test shouldEqual (true)
+}
+
+  }
+
+  it should "scale vectors' values in the (-1.0,1.0) range" in {
+
+val env = ExecutionEnvironment.getExecutionEnvironment
+
+val dataSet = env.fromCollection(data2)
+val minMaxScaler = MinMaxScaler().setMin(-1.0).setMax(1.0)
+minMaxScaler.fit(dataSet)
+val scaledVectors = minMaxScaler.transform(dataSet).collect
+
+scaledVectors.length should equal(data2.length)
+
+for (labeledVector <- scaledVectors) {
+  val test = labeledVector.vector.asBreeze.forall(lv => {
+lv >= -1.0 && lv <= 1.0
--- End diff --

The same here.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31895445
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala
 ---
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, 
Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified 
range.
+  * By default for [[MinMaxScaler]] transformer range = (0,1).
+  *
+  * This transformer takes a subtype of  [[Vector]] of values and maps it 
to a
+  * scaled subtype of [[Vector]] such that each feature lies between a 
user-specified range.
+  *
+  * This transformer can be prepended to all [[Transformer]] and
+  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which 
expect as input a subtype
+  * of [[Vector]].
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = 
env.fromCollection(data)
+  *   val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0)
+  *
+  *   transformer.fit(trainingDS)
+  *   val transformedDS = transformer.transform(trainingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[Min]]: The minimum value of the range of the transformed data set; 
by default equal to 0
+  * - [[Max]]: The maximum value of the range of the transformed data set; 
by default
+  * equal to 1
+  */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], 
linalg.Vector[Double])]] = None
+
+  /** Sets the minimum for the range of the transformed data
+*
+* @param min the user-specified minimum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMin(min: Double): MinMaxScaler = {
+parameters.add(Min, min)
+this
+  }
+
+  /** Sets the maximum for the range of the transformed data
+*
+* @param max the user-specified maximum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMax(max: Double): MinMaxScaler = {
+parameters.add(Max, max)
+this
+  }
+}
+
+object MinMaxScaler {
+
+  // == Parameters 
=
+
+  case object Min extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(0.0)
+  }
+
+  case object Max extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(1.0)
+  }
+
+  //  Factory methods 
==
+
+  def apply(): MinMaxScaler = {
+new MinMaxScaler()
+  }
+
+  // == Operations 
=
+
+  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by 
learning the minimum and
+* maximum of each feature of the training data. These values are used 
in the transform step
+* to transform the given input data.
+*
+* @tparam T Input data type which is a subtype of [[Vector]]
+* @return
+*/
+  implicit def fitVectorMinMaxScaler[T <: Vector] = new 
FitOperation[MinMaxSca

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31895095
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala
 ---
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, 
Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified 
range.
+  * By default for [[MinMaxScaler]] transformer range = (0,1).
+  *
+  * This transformer takes a subtype of  [[Vector]] of values and maps it 
to a
+  * scaled subtype of [[Vector]] such that each feature lies between a 
user-specified range.
+  *
+  * This transformer can be prepended to all [[Transformer]] and
+  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which 
expect as input a subtype
+  * of [[Vector]].
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = 
env.fromCollection(data)
+  *   val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0)
+  *
+  *   transformer.fit(trainingDS)
+  *   val transformedDS = transformer.transform(trainingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[Min]]: The minimum value of the range of the transformed data set; 
by default equal to 0
+  * - [[Max]]: The maximum value of the range of the transformed data set; 
by default
+  * equal to 1
+  */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], 
linalg.Vector[Double])]] = None
+
+  /** Sets the minimum for the range of the transformed data
+*
+* @param min the user-specified minimum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMin(min: Double): MinMaxScaler = {
+parameters.add(Min, min)
+this
+  }
+
+  /** Sets the maximum for the range of the transformed data
+*
+* @param max the user-specified maximum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMax(max: Double): MinMaxScaler = {
+parameters.add(Max, max)
+this
+  }
+}
+
+object MinMaxScaler {
+
+  // == Parameters 
=
+
+  case object Min extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(0.0)
+  }
+
+  case object Max extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(1.0)
+  }
+
+  //  Factory methods 
==
+
+  def apply(): MinMaxScaler = {
+new MinMaxScaler()
+  }
+
+  // == Operations 
=
+
+  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by 
learning the minimum and
+* maximum of each feature of the training data. These values are used 
in the transform step
+* to transform the given input data.
+*
+* @tparam T Input data type which is a subtype of [[Vector]]
+* @return
+*/
+  implicit def fitVectorMinMaxScaler[T <: Vector] = new 
FitOperation[MinMaxSca

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31895117
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala
 ---
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, 
Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified 
range.
+  * By default for [[MinMaxScaler]] transformer range = (0,1).
+  *
+  * This transformer takes a subtype of  [[Vector]] of values and maps it 
to a
+  * scaled subtype of [[Vector]] such that each feature lies between a 
user-specified range.
+  *
+  * This transformer can be prepended to all [[Transformer]] and
+  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which 
expect as input a subtype
+  * of [[Vector]].
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = 
env.fromCollection(data)
+  *   val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0)
+  *
+  *   transformer.fit(trainingDS)
+  *   val transformedDS = transformer.transform(trainingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[Min]]: The minimum value of the range of the transformed data set; 
by default equal to 0
+  * - [[Max]]: The maximum value of the range of the transformed data set; 
by default
+  * equal to 1
+  */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], 
linalg.Vector[Double])]] = None
+
+  /** Sets the minimum for the range of the transformed data
+*
+* @param min the user-specified minimum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMin(min: Double): MinMaxScaler = {
+parameters.add(Min, min)
+this
+  }
+
+  /** Sets the maximum for the range of the transformed data
+*
+* @param max the user-specified maximum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMax(max: Double): MinMaxScaler = {
+parameters.add(Max, max)
+this
+  }
+}
+
+object MinMaxScaler {
+
+  // == Parameters 
=
+
+  case object Min extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(0.0)
+  }
+
+  case object Max extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(1.0)
+  }
+
+  //  Factory methods 
==
+
+  def apply(): MinMaxScaler = {
+new MinMaxScaler()
+  }
+
+  // == Operations 
=
+
+  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by 
learning the minimum and
+* maximum of each feature of the training data. These values are used 
in the transform step
+* to transform the given input data.
+*
+* @tparam T Input data type which is a subtype of [[Vector]]
+* @return
+*/
+  implicit def fitVectorMinMaxScaler[T <: Vector] = new 
FitOperation[MinMaxSca

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread fobeligi
Github user fobeligi commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31894794
  
--- Diff: docs/libs/ml/minMax_scaler.md ---
@@ -0,0 +1,113 @@
+---
+mathjax: include
+htmlTitle: FlinkML - MinMax Scaler
+title: FlinkML - MinMax Scaler
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The MinMax scaler scales the given data set, so that all values will lie 
between a user specified range [min,max].
+ In case the user does not provide a specific minimum and maximum value 
for the scaling range, the MinMax scaler transforms the features of the input 
data set to lie in the [0,1] interval.
+ Given a set of input data $x_1, x_2,... x_n$, with minimum value:
+
+ $$x_{min} = min({x_1, x_2,..., x_n})$$
+
+ and maximum value:
+
+ $$x_{max} = max({x_1, x_2,..., x_n})$$
+
+The scaled data set $z_1, z_2,...,z_n$ will be:
+
+ $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min 
\right ) + min$$
+
+where $\textit{min}$ and $\textit{max}$ are the user specified minimum and 
maximum values of the range to scale.
+
+## Operations
+
+`MinMaxScaler` is a `Transformer`.
+As such, it supports the `fit` and `transform` operation.
+
+### Fit
+
+MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
+
+* `fit[T <: Vector]: DataSet[T] => Unit`
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Transform
+
+MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into 
the respective type:
+
+* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
+* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
+
+## Parameters
+
+The MinMax scaler implementation can be controlled by the following two 
parameters:
+
+ 
+  
+
+  Parameters
+  Description
+
+  
+
+  
+
+  Min
+  
+
+  The minimum value of the range for the scaled data set. (Default 
value: 0.0)
+
+  
+
+
+  Std
--- End diff --

Yes, you are right!


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31894783
  
--- Diff: 
flink-staging/flink-ml/src/main/scala/org/apache/flink/ml/preprocessing/MinMaxScaler.scala
 ---
@@ -0,0 +1,255 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one
+ * or more contributor license agreements.  See the NOTICE file
+ * distributed with this work for additional information
+ * regarding copyright ownership.  The ASF licenses this file
+ * to you under the Apache License, Version 2.0 (the
+ * "License"); you may not use this file except in compliance
+ * with the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.flink.ml.preprocessing
+
+import breeze.linalg
+import org.apache.flink.api.common.typeinfo.TypeInformation
+import org.apache.flink.api.scala._
+import org.apache.flink.ml._
+import org.apache.flink.ml.common.{LabeledVector, Parameter, ParameterMap}
+import org.apache.flink.ml.math.Breeze._
+import org.apache.flink.ml.math.{BreezeVectorConverter, Vector}
+import org.apache.flink.ml.pipeline.{FitOperation, TransformOperation, 
Transformer}
+import org.apache.flink.ml.preprocessing.MinMaxScaler.{Max, Min}
+
+import scala.reflect.ClassTag
+
+/** Scales observations, so that all features are in a user-specified 
range.
+  * By default for [[MinMaxScaler]] transformer range = (0,1).
+  *
+  * This transformer takes a subtype of  [[Vector]] of values and maps it 
to a
+  * scaled subtype of [[Vector]] such that each feature lies between a 
user-specified range.
+  *
+  * This transformer can be prepended to all [[Transformer]] and
+  * [[org.apache.flink.ml.pipeline.Predictor]] implementations which 
expect as input a subtype
+  * of [[Vector]].
+  *
+  * @example
+  * {{{
+  *   val trainingDS: DataSet[Vector] = 
env.fromCollection(data)
+  *   val transformer = MinMaxScaler().setMin(-1.0).setMax(1.0)
+  *
+  *   transformer.fit(trainingDS)
+  *   val transformedDS = transformer.transform(trainingDS)
+  * }}}
+  *
+  * =Parameters=
+  *
+  * - [[Min]]: The minimum value of the range of the transformed data set; 
by default equal to 0
+  * - [[Max]]: The maximum value of the range of the transformed data set; 
by default
+  * equal to 1
+  */
+class MinMaxScaler extends Transformer[MinMaxScaler] {
+
+  var metricsOption: Option[DataSet[(linalg.Vector[Double], 
linalg.Vector[Double])]] = None
+
+  /** Sets the minimum for the range of the transformed data
+*
+* @param min the user-specified minimum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMin(min: Double): MinMaxScaler = {
+parameters.add(Min, min)
+this
+  }
+
+  /** Sets the maximum for the range of the transformed data
+*
+* @param max the user-specified maximum value.
+* @return the MinMaxScaler instance with its minimum value set to the 
user-specified value.
+*/
+  def setMax(max: Double): MinMaxScaler = {
+parameters.add(Max, max)
+this
+  }
+}
+
+object MinMaxScaler {
+
+  // == Parameters 
=
+
+  case object Min extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(0.0)
+  }
+
+  case object Max extends Parameter[Double] {
+override val defaultValue: Option[Double] = Some(1.0)
+  }
+
+  //  Factory methods 
==
+
+  def apply(): MinMaxScaler = {
+new MinMaxScaler()
+  }
+
+  // == Operations 
=
+
+  /** Trains the [[org.apache.flink.ml.preprocessing.MinMaxScaler]] by 
learning the minimum and
+* maximum of each feature of the training data. These values are used 
in the transform step
+* to transform the given input data.
+*
+* @tparam T Input data type which is a subtype of [[Vector]]
+* @return
+*/
+  implicit def fitVectorMinMaxScaler[T <: Vector] = new 
FitOperation[MinMaxSca

[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-08 Thread tillrohrmann
Github user tillrohrmann commented on a diff in the pull request:

https://github.com/apache/flink/pull/798#discussion_r31894690
  
--- Diff: docs/libs/ml/minMax_scaler.md ---
@@ -0,0 +1,113 @@
+---
+mathjax: include
+htmlTitle: FlinkML - MinMax Scaler
+title: FlinkML - MinMax Scaler
+---
+
+
+* This will be replaced by the TOC
+{:toc}
+
+## Description
+
+ The MinMax scaler scales the given data set, so that all values will lie 
between a user specified range [min,max].
+ In case the user does not provide a specific minimum and maximum value 
for the scaling range, the MinMax scaler transforms the features of the input 
data set to lie in the [0,1] interval.
+ Given a set of input data $x_1, x_2,... x_n$, with minimum value:
+
+ $$x_{min} = min({x_1, x_2,..., x_n})$$
+
+ and maximum value:
+
+ $$x_{max} = max({x_1, x_2,..., x_n})$$
+
+The scaled data set $z_1, z_2,...,z_n$ will be:
+
+ $$z_{i}= \frac{x_{i} - x_{min}}{x_{max} - x_{min}} \left ( max - min 
\right ) + min$$
+
+where $\textit{min}$ and $\textit{max}$ are the user specified minimum and 
maximum values of the range to scale.
+
+## Operations
+
+`MinMaxScaler` is a `Transformer`.
+As such, it supports the `fit` and `transform` operation.
+
+### Fit
+
+MinMaxScaler is trained on all subtypes of `Vector` or `LabeledVector`:
+
+* `fit[T <: Vector]: DataSet[T] => Unit`
+* `fit: DataSet[LabeledVector] => Unit`
+
+### Transform
+
+MinMaxScaler transforms all subtypes of `Vector` or `LabeledVector` into 
the respective type:
+
+* `transform[T <: Vector]: DataSet[T] => DataSet[T]`
+* `transform: DataSet[LabeledVector] => DataSet[LabeledVector]`
+
+## Parameters
+
+The MinMax scaler implementation can be controlled by the following two 
parameters:
+
+ 
+  
+
+  Parameters
+  Description
+
+  
+
+  
+
+  Min
+  
+
+  The minimum value of the range for the scaled data set. (Default 
value: 0.0)
+
+  
+
+
+  Std
--- End diff --

Shouldn't this be called `Max`?


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---


[GitHub] flink pull request: [Flink 1844] Add Normaliser to ML library

2015-06-05 Thread fobeligi
GitHub user fobeligi opened a pull request:

https://github.com/apache/flink/pull/798

[Flink 1844] Add Normaliser to ML library

Adds a MinMaxScaler to the ML preprocessing package. MinMax scaler scales 
the values to a user-specified range.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/fobeligi/incubator-flink FLINK-1844

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/flink/pull/798.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #798


commit 802b9da07a2c3f7c055b4c024aaecbbe647db1cd
Author: fobeligi 
Date:   2015-06-05T21:12:43Z

[FLINK-1844] Add MinMaxScaler implementation in the proprocessing package, 
test for the for the corresponding functionality and documentation.

commit e639185108f9bda253e296bae4c6c4269a30d1d0
Author: fobeligi 
Date:   2015-06-05T22:12:33Z

[FLINK-1844] Change second test to use LabeledVectors instead of Vectors




---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---