[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140861746
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
   .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
+      case DistanceSuite.COSINE => true
+      case _ => false
+    }
+  }
+}
+
+/**
+ * A vector with its norm for fast distance computation.
+ *
+ * @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
+ */
+private[clustering]
+class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
+
+  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
+
+  def this(array: Array[Double]) = this(Vectors.dense(array))
+
+  /** Converts the vector to a dense vector. */
+  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
+}
+
+
+private[spark] abstract class DistanceSuite extends Serializable {
+
+  /**
+   * Returns the index of the closest center to the given point, as well as the squared distance.
+   */
+  def findClosest(
+      centers: TraversableOnce[VectorWithNorm],
+      point: VectorWithNorm): (Int, Double)
+
+  /**
+   * Returns the K-means cost of a given point against the given cluster centers.
+   */
+  def pointCost(
+      centers: TraversableOnce[VectorWithNorm],
+      point: VectorWithNorm): Double =
+    findClosest(centers, point)._2
+
+  /**
+   * Returns whether a center converged or not, given the epsilon parameter.
+   */
+  def isCenterConverged(
+      oldCenter: VectorWithNorm,
+      newCenter: VectorWithNorm,
+      epsilon: Double): Boolean
+
+}
+
+@Since("2.3.0")
+object DistanceSuite {
--- End diff --

About the name, if you have any better suggestion, I'd be happy to change it. Maybe `DistanceMeasure`?
This is not internal because it contains the definitions of the two constants, which users might use to set the desired distance measure.
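For context, a minimal sketch of such a constants object (names, string values, and the `validate` helper are illustrative, not the PR's final API). Holding the user-facing constants is the reason it cannot be fully internal:

```scala
object DistanceMeasure {
  // User-facing string constants for selecting the distance measure.
  val EUCLIDEAN: String = "euclidean"
  val COSINE: String = "cosine"

  // Internal-style validation helper mirroring validateDistanceMeasure above.
  def validate(distanceMeasure: String): Boolean =
    distanceMeasure == EUCLIDEAN || distanceMeasure == COSINE
}
```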


---

To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140861096
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
   .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
+      case DistanceSuite.COSINE => true
+      case _ => false
+    }
+  }
+}
+
+/**
+ * A vector with its norm for fast distance computation.
+ *
+ * @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
+ */
+private[clustering]
+class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
+
+  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
+
+  def this(array: Array[Double]) = this(Vectors.dense(array))
+
+  /** Converts the vector to a dense vector. */
+  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
+}
+
+
+private[spark] abstract class DistanceSuite extends Serializable {
+
+  /**
+   * Returns the index of the closest center to the given point, as well as the squared distance.
+   */
+  def findClosest(
--- End diff --

Even though you are right in theory, if you look at the implementation for the Euclidean distance, the current code contains an optimization that avoids computing the real distance measure, for performance reasons. Thus, dropping this method and introducing a more generic one would cause a performance regression for the Euclidean distance, which is something I'd definitely like to avoid.
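A simplified, self-contained sketch of the optimization being referred to (not the actual Spark code): for the Euclidean case, the precomputed norms give a cheap lower bound on the squared distance, `(||c|| - ||p||)^2 <= ||c - p||^2`, so the exact computation can often be skipped entirely. A generic distance function would forfeit this shortcut.

```scala
case class VectorWithNorm(vector: Array[Double], norm: Double)

def sqDist(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

def findClosest(centers: Seq[VectorWithNorm], point: VectorWithNorm): (Int, Double) = {
  var bestDistance = Double.PositiveInfinity
  var bestIndex = 0
  centers.zipWithIndex.foreach { case (center, i) =>
    // Cheap lower bound on the squared Euclidean distance, from norms alone.
    val diff = center.norm - point.norm
    val lowerBound = diff * diff
    if (lowerBound < bestDistance) {
      // Only compute the exact squared distance when the bound can't rule it out.
      val distance = sqDist(center.vector, point.vector)
      if (distance < bestDistance) {
        bestDistance = distance
        bestIndex = i
      }
    }
  }
  (bestIndex, bestDistance)
}
```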


---




[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140860281
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
   .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
--- End diff --

I just wanted to be consistent with the similar implementation three lines above. Doing the same thing in two different ways only a few lines apart might be very confusing, IMHO.


---




[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140859735
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -260,7 +269,8 @@ class KMeans @Since("1.5.0") (
     maxIter -> 20,
     initMode -> MLlibKMeans.K_MEANS_PARALLEL,
     initSteps -> 2,
-    tol -> 1e-4)
+    tol -> 1e-4,
+    distanceMeasure -> DistanceSuite.EUCLIDEAN)
--- End diff --

Do you have any suggestion for a more appropriate name? Thanks.


---




[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread mgaido91
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140859569
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -71,6 +71,15 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
   @Since("1.5.0")
   def getInitMode: String = $(initMode)
 
+  @Since("2.3.0")
+  final val distanceMeasure = new Param[String](this, "distanceMeasure", "The distance measure. " +
--- End diff --

This would be hard for two main reasons:
 1 - as I will explain later, even though in theory we would only need a function, in practice this is not true, for performance reasons;
 2 - saving and loading a model would be much harder (I'm not sure it would even be feasible).


---




[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140833485
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
   .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
+      case DistanceSuite.COSINE => true
+      case _ => false
+    }
+  }
+}
+
+/**
+ * A vector with its norm for fast distance computation.
+ *
+ * @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
+ */
+private[clustering]
+class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
+
+  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
+
+  def this(array: Array[Double]) = this(Vectors.dense(array))
+
+  /** Converts the vector to a dense vector. */
+  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
+}
+
+
+private[spark] abstract class DistanceSuite extends Serializable {
+
+  /**
+   * Returns the index of the closest center to the given point, as well as the squared distance.
+   */
+  def findClosest(
--- End diff --

Why would something like this vary with the distance function? Finding the closest thing is the same for all definitions of closeness.
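Concretely, the fully generic version the reviewer has in mind might look like the sketch below (illustrative, not proposed code): the selection loop is indeed identical for any distance. What such a version cannot do is exploit distance-specific shortcuts, which is the trade-off discussed elsewhere in this thread.

```scala
// Generic closest-center search over an arbitrary distance function.
def findClosestGeneric[T](
    centers: Seq[T],
    point: T,
    distance: (T, T) => Double): (Int, Double) = {
  var bestIndex = 0
  var bestDistance = Double.PositiveInfinity
  centers.zipWithIndex.foreach { case (c, i) =>
    val d = distance(c, point)
    if (d < bestDistance) {
      bestDistance = d
      bestIndex = i
    }
  }
  (bestIndex, bestDistance)
}
```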


---




[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140832224
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -260,7 +269,8 @@ class KMeans @Since("1.5.0") (
     maxIter -> 20,
     initMode -> MLlibKMeans.K_MEANS_PARALLEL,
     initSteps -> 2,
-    tol -> 1e-4)
+    tol -> 1e-4,
+    distanceMeasure -> DistanceSuite.EUCLIDEAN)
--- End diff --

"DistanceSuite" sounds like a test class, which you can't use here; but it looks like it's an object you added in non-test code. That's confusing.


---




[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r14083
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
   .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
+      case DistanceSuite.COSINE => true
+      case _ => false
+    }
+  }
+}
+
+/**
+ * A vector with its norm for fast distance computation.
+ *
+ * @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
+ */
+private[clustering]
+class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
+
+  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
+
+  def this(array: Array[Double]) = this(Vectors.dense(array))
+
+  /** Converts the vector to a dense vector. */
+  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
+}
+
+
+private[spark] abstract class DistanceSuite extends Serializable {
+
+  /**
+   * Returns the index of the closest center to the given point, as well as the squared distance.
+   */
+  def findClosest(
+      centers: TraversableOnce[VectorWithNorm],
+      point: VectorWithNorm): (Int, Double)
+
+  /**
+   * Returns the K-means cost of a given point against the given cluster centers.
+   */
+  def pointCost(
+      centers: TraversableOnce[VectorWithNorm],
+      point: VectorWithNorm): Double =
+    findClosest(centers, point)._2
+
+  /**
+   * Returns whether a center converged or not, given the epsilon parameter.
+   */
+  def isCenterConverged(
+      oldCenter: VectorWithNorm,
+      newCenter: VectorWithNorm,
+      epsilon: Double): Boolean
+
+}
+
+@Since("2.3.0")
+object DistanceSuite {
--- End diff --

Why is this a public API (and why so named)? This can all be internal.


---




[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140833252
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
   .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
--- End diff --

You can use two labels in one case if the result is the same; that might be clearer. A match is probably overkill here anyway.
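Both suggestions, concretely (a sketch; `EUCLIDEAN` and `COSINE` stand in for the real constants):

```scala
val EUCLIDEAN = "euclidean"
val COSINE = "cosine"

// Two labels sharing one case via the `|` pattern alternative:
def validateWithMatch(distanceMeasure: String): Boolean = distanceMeasure match {
  case EUCLIDEAN | COSINE => true
  case _ => false
}

// Or with no match at all, since this is really just set membership:
val validMeasures = Set(EUCLIDEAN, COSINE)
def validateWithSet(distanceMeasure: String): Boolean =
  validMeasures.contains(distanceMeasure)
```

(Uppercase identifiers in pattern position are stable identifier patterns in Scala, so `case EUCLIDEAN` compares against the constant's value rather than binding a new variable.)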


---




[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140832619
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -40,20 +40,29 @@ import org.apache.spark.util.random.XORShiftRandom
  * to it should be cached by the user.
  */
 @Since("0.8.0")
-class KMeans private (
+class KMeans @Since("2.3.0") private (
--- End diff --

If it's a private constructor, it shouldn't have an API "Since" annotation. You don't need to preserve it for compatibility.


---




[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread srowen
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140832427
  
--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala 
---
@@ -71,6 +71,15 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
   @Since("1.5.0")
   def getInitMode: String = $(initMode)
 
+  @Since("2.3.0")
+  final val distanceMeasure = new Param[String](this, "distanceMeasure", "The distance measure. " +
--- End diff --

Interesting question here -- what about supplying a function as an argument, for full generality? But then that wouldn't translate to PySpark, I guess, and there are probably only 2-3 distance functions that ever make sense here.
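A hypothetical sketch of what the function-as-argument alternative would look like (`KMeansLike` is an illustrative stand-in, not a real class). It is fully general, but a function value cannot be written into ML persistence metadata the way a String param can, and it has no direct PySpark counterpart:

```scala
// Hypothetical: the distance measure is a plain function value.
class KMeansLike(val distance: (Array[Double], Array[Double]) => Double)

val squaredEuclidean = (a: Array[Double], b: Array[Double]) =>
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

val km = new KMeansLike(squaredEuclidean)
```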


---




[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans

2017-09-25 Thread mgaido91
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/19340

[SPARK-22119] Add cosine distance to KMeans

## What changes were proposed in this pull request?

Currently, KMeans assumes that the only possible distance measure is the Euclidean one. This PR adds support for the cosine distance to the KMeans algorithm.
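For reference, the cosine distance between two vectors is 1 minus their cosine similarity. A minimal sketch of the formula (not the PR's exact implementation, which works on MLlib `Vector`s):

```scala
// Cosine distance: 1 - (a . b) / (||a|| * ||b||). Requires non-zero norms.
def cosineDistance(a: Array[Double], b: Array[Double]): Double = {
  require(a.length == b.length, "vectors must have the same dimension")
  val dot = a.zip(b).map { case (x, y) => x * y }.sum
  val normA = math.sqrt(a.map(x => x * x).sum)
  val normB = math.sqrt(b.map(x => x * x).sum)
  require(normA > 0 && normB > 0, "cosine distance is undefined for zero vectors")
  1.0 - dot / (normA * normB)
}
```

Unlike the Euclidean distance, this depends only on the angle between the vectors, which is why parallel vectors of different lengths are at distance 0.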

## How was this patch tested?

Existing and added UTs.


You can merge this pull request into a Git repository by running:

$ git pull https://github.com/mgaido91/spark SPARK-22119

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/19340.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #19340


commit d679adde78426d5517a15640d632ef0588a6b249
Author: Marco Gaido 
Date:   2017-09-11T14:43:53Z

Add the distanceSuite parameter and the cosine distance impl to mllib.KMeans

commit c364ae3c66740cbbfe763f7e4fd10a8abd02ced8
Author: Marco Gaido 
Date:   2017-09-25T12:07:17Z

Add distance measure to ml Kmeans

commit 0e2a9ee48c46f51f55b09f9d60d47ac3325676bd
Author: Marco Gaido 
Date:   2017-09-25T14:40:37Z

Add tests for cosine

commit d8d8c642345b6a7815998ab5409a5e4cd86d9a5d
Author: Marco Gaido 
Date:   2017-09-25T15:05:32Z

fix scalastyle




---
