[GitHub] spark pull request #19340: [SPARK-22119] Add cosine distance to KMeans
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140861746

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
       .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
+      case DistanceSuite.COSINE => true
+      case _ => false
+    }
+  }
+}
+
+/**
+ * A vector with its norm for fast distance computation.
+ *
+ * @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
+ */
+private[clustering]
+class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
+
+  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
+
+  def this(array: Array[Double]) = this(Vectors.dense(array))
+
+  /** Converts the vector to a dense vector. */
+  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
+}
+
+
+private[spark] abstract class DistanceSuite extends Serializable {
+
+  /**
+   * Returns the index of the closest center to the given point, as well as the squared distance.
+   */
+  def findClosest(
+      centers: TraversableOnce[VectorWithNorm],
+      point: VectorWithNorm): (Int, Double)
+
+  /**
+   * Returns the K-means cost of a given point against the given cluster centers.
+   */
+  def pointCost(
+      centers: TraversableOnce[VectorWithNorm],
+      point: VectorWithNorm): Double =
+    findClosest(centers, point)._2
+
+  /**
+   * Returns whether a center converged or not, given the epsilon parameter.
+   */
+  def isCenterConverged(
+      oldCenter: VectorWithNorm,
+      newCenter: VectorWithNorm,
+      epsilon: Double): Boolean
+
+}
+
+@Since("2.3.0")
+object DistanceSuite {
--- End diff --

About the name, if you have any better suggestion, I'd be happy to change it. Maybe `DistanceMeasure`? This is not internal because it contains the definition of the two constants, which users might use to set the right distance measure.

---

- To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org
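For readers following the thread, the abstraction being debated can be sketched outside Scala. The following is a minimal Python illustration of the idea only; all names mirror the proposed `DistanceSuite`/`DistanceMeasure` contract but are hypothetical, and the convergence check is a simplified stand-in, not Spark's actual implementation:

```python
import math

class DistanceMeasure:
    """Hypothetical stand-in for the pluggable distance abstraction:
    each measure decides what 'closest' and 'converged' mean."""
    EUCLIDEAN = "euclidean"
    COSINE = "cosine"

    def distance(self, a, b):
        raise NotImplementedError

    def find_closest(self, centers, point):
        """Return (index, distance) of the center closest to point."""
        best_i, best_d = -1, float("inf")
        for i, c in enumerate(centers):
            d = self.distance(c, point)
            if d < best_d:
                best_i, best_d = i, d
        return best_i, best_d

    def is_center_converged(self, old, new, epsilon):
        # Simplified: a center has converged if it moved by at most epsilon.
        return self.distance(old, new) <= epsilon


class EuclideanDistance(DistanceMeasure):
    def distance(self, a, b):
        return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


class CosineDistance(DistanceMeasure):
    def distance(self, a, b):
        # Cosine distance = 1 - cos(angle between a and b).
        dot = sum(x * y for x, y in zip(a, b))
        na = math.sqrt(sum(x * x for x in a))
        nb = math.sqrt(sum(y * y for y in b))
        return 1.0 - dot / (na * nb)
```

With this shape, the caller only ever talks to the abstract interface, which is why the constants (`EUCLIDEAN`, `COSINE`) are the one user-facing piece.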
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140861096

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
       .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
+      case DistanceSuite.COSINE => true
+      case _ => false
+    }
+  }
+}
+
+/**
+ * A vector with its norm for fast distance computation.
+ *
+ * @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
+ */
+private[clustering]
+class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
+
+  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
+
+  def this(array: Array[Double]) = this(Vectors.dense(array))
+
+  /** Converts the vector to a dense vector. */
+  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
+}
+
+
+private[spark] abstract class DistanceSuite extends Serializable {
+
+  /**
+   * Returns the index of the closest center to the given point, as well as the squared distance.
+   */
+  def findClosest(
--- End diff --

Even though you are right in theory, if you look at the implementation for the Euclidean distance, the current code contains an optimization which doesn't use the real distance measure, for performance reasons. Thus, dropping this method and introducing a more generic one would cause a performance regression for the Euclidean distance, which is something I'd definitely want to avoid.
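The optimization referred to here can be sketched as follows. With precomputed norms, the identity ‖c − p‖² = ‖c‖² + ‖p‖² − 2·(c·p) gives the exact squared distance from one dot product, and (‖c‖ − ‖p‖)² is a cheap lower bound that lets a center be rejected without computing the dot product at all. This is a Python illustration of the idea only, not Spark's `fastSquaredDistance` code:

```python
import math

def with_norm(v):
    """Pair a vector with its precomputed L2 norm."""
    return (v, math.sqrt(sum(x * x for x in v)))

def find_closest_euclidean(centers, point):
    """centers/point are (vector, norm) pairs.

    (|c| - |p|)**2 is a lower bound on |c - p|**2, so the dot product
    is only computed for centers that could still beat the best so far.
    """
    pv, pn = point
    best_i, best_d = -1, float("inf")
    for i, (cv, cn) in enumerate(centers):
        lower = (cn - pn) ** 2            # cheap lower bound
        if lower < best_d:                # only then pay for the dot product
            dot = sum(x * y for x, y in zip(cv, pv))
            d = cn * cn + pn * pn - 2.0 * dot   # exact squared distance
            if d < best_d:
                best_i, best_d = i, d
    return best_i, best_d
```

This is why `findClosest` lives on the distance-measure object: the Euclidean version can exploit the norms, while a generic "distance(a, b)" callback could not.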
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140860281

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
       .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
--- End diff --

I just wanted to be consistent with the similar implementation three lines above. Doing the same thing in two different ways a few lines apart might be very confusing, IMHO.
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140859735

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -260,7 +269,8 @@ class KMeans @Since("1.5.0") (
       maxIter -> 20,
       initMode -> MLlibKMeans.K_MEANS_PARALLEL,
       initSteps -> 2,
-      tol -> 1e-4)
+      tol -> 1e-4,
+      distanceMeasure -> DistanceSuite.EUCLIDEAN)
--- End diff --

Do you have any suggestion for an appropriate name? Thanks.
Github user mgaido91 commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140859569

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -71,6 +71,15 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
   @Since("1.5.0")
   def getInitMode: String = $(initMode)
 
+  @Since("2.3.0")
+  final val distanceMeasure = new Param[String](this, "distanceMeasure", "The distance measure. " +
--- End diff --

This would be hard for two main reasons: 1 - as I will explain later, even though theoretically we would only need a function, in practice this is not true, for performance reasons; 2 - saving and loading a model would be much harder (I'm not sure it would even be feasible).
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140833485

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
       .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
+      case DistanceSuite.COSINE => true
+      case _ => false
+    }
+  }
+}
+
+/**
+ * A vector with its norm for fast distance computation.
+ *
+ * @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
+ */
+private[clustering]
+class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
+
+  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
+
+  def this(array: Array[Double]) = this(Vectors.dense(array))
+
+  /** Converts the vector to a dense vector. */
+  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
+}
+
+
+private[spark] abstract class DistanceSuite extends Serializable {
+
+  /**
+   * Returns the index of the closest center to the given point, as well as the squared distance.
+   */
+  def findClosest(
--- End diff --

Why would something like this vary with the distance function? Finding the closest thing is the same for all definitions of closeness.
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140832224

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -260,7 +269,8 @@ class KMeans @Since("1.5.0") (
       maxIter -> 20,
       initMode -> MLlibKMeans.K_MEANS_PARALLEL,
       initSteps -> 2,
-      tol -> 1e-4)
+      tol -> 1e-4,
+      distanceMeasure -> DistanceSuite.EUCLIDEAN)
--- End diff --

"DistanceSuite" sounds like a test case, which you can't use here, but it looks like it's an object you added in non-test code. That's confusing.
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r14083

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
       .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
+      case DistanceSuite.COSINE => true
+      case _ => false
+    }
+  }
+}
+
+/**
+ * A vector with its norm for fast distance computation.
+ *
+ * @see [[org.apache.spark.mllib.clustering.KMeans#fastSquaredDistance]]
+ */
+private[clustering]
+class VectorWithNorm(val vector: Vector, val norm: Double) extends Serializable {
+
+  def this(vector: Vector) = this(vector, Vectors.norm(vector, 2.0))
+
+  def this(array: Array[Double]) = this(Vectors.dense(array))
+
+  /** Converts the vector to a dense vector. */
+  def toDense: VectorWithNorm = new VectorWithNorm(Vectors.dense(vector.toArray), norm)
+}
+
+
+private[spark] abstract class DistanceSuite extends Serializable {
+
+  /**
+   * Returns the index of the closest center to the given point, as well as the squared distance.
+   */
+  def findClosest(
+      centers: TraversableOnce[VectorWithNorm],
+      point: VectorWithNorm): (Int, Double)
+
+  /**
+   * Returns the K-means cost of a given point against the given cluster centers.
+   */
+  def pointCost(
+      centers: TraversableOnce[VectorWithNorm],
+      point: VectorWithNorm): Double =
+    findClosest(centers, point)._2
+
+  /**
+   * Returns whether a center converged or not, given the epsilon parameter.
+   */
+  def isCenterConverged(
+      oldCenter: VectorWithNorm,
+      newCenter: VectorWithNorm,
+      epsilon: Double): Boolean
+
+}
+
+@Since("2.3.0")
+object DistanceSuite {
--- End diff --

Why is this an API (and why so named)? This can all be internal.
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140833252

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -546,10 +574,88 @@ object KMeans {
       .run(data)
   }
 
+  private[spark] def validateInitMode(initMode: String): Boolean = {
+    initMode match {
+      case KMeans.RANDOM => true
+      case KMeans.K_MEANS_PARALLEL => true
+      case _ => false
+    }
+  }
+
+  private[spark] def validateDistanceMeasure(distanceMeasure: String): Boolean = {
+    distanceMeasure match {
+      case DistanceSuite.EUCLIDEAN => true
--- End diff --

You can use two labels in one statement if the result is the same; that might be clearer. Match is probably overkill anyway.
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140832619

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/KMeans.scala ---
@@ -40,20 +40,29 @@ import org.apache.spark.util.random.XORShiftRandom
  * to it should be cached by the user.
  */
 @Since("0.8.0")
-class KMeans private (
+class KMeans @Since("2.3.0") private (
--- End diff --

If it's a private constructor, it shouldn't have API "Since" annotations. You don't need to preserve it for compatibility.
Github user srowen commented on a diff in the pull request:

https://github.com/apache/spark/pull/19340#discussion_r140832427

--- Diff: mllib/src/main/scala/org/apache/spark/ml/clustering/KMeans.scala ---
@@ -71,6 +71,15 @@ private[clustering] trait KMeansParams extends Params with HasMaxIter with HasFe
   @Since("1.5.0")
   def getInitMode: String = $(initMode)
 
+  @Since("2.3.0")
+  final val distanceMeasure = new Param[String](this, "distanceMeasure", "The distance measure. " +
--- End diff --

Interesting question here -- what about supplying a function as an argument, for full generality? But then that doesn't translate to PySpark I guess, and there are probably only 2-3 distance functions that ever make sense here.
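The "supply a function" alternative can be sketched like this (a Python illustration of the design option, not Spark code; all names are hypothetical):

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]

def euclidean(a: Vector, b: Vector) -> float:
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine(a: Vector, b: Vector) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def assign_points(points: Sequence[Vector],
                  centers: Sequence[Vector],
                  distance: Callable[[Vector, Vector], float]) -> List[int]:
    """One assignment step of Lloyd's algorithm, generic in the distance."""
    return [min(range(len(centers)), key=lambda i: distance(p, centers[i]))
            for p in points]
```

The trade-off raised in the thread is visible here: a function parameter is maximally general, but a function object is awkward to serialize into a saved model and doesn't cross the JVM/Python boundary, whereas a string parameter naming one of a few supported measures does both easily.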
GitHub user mgaido91 opened a pull request:

https://github.com/apache/spark/pull/19340

[SPARK-22119] Add cosine distance to KMeans

## What changes were proposed in this pull request?

Currently, KMeans assumes that the only possible distance measure is the Euclidean one. This PR adds support for the cosine distance to the KMeans algorithm.

## How was this patch tested?

Existing and added UTs.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/mgaido91/spark SPARK-22119

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/19340.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #19340

commit d679adde78426d5517a15640d632ef0588a6b249
Author: Marco Gaido
Date: 2017-09-11T14:43:53Z
Add the distanceSuite parameter and the cosine distance impl to mllib.KMeans

commit c364ae3c66740cbbfe763f7e4fd10a8abd02ced8
Author: Marco Gaido
Date: 2017-09-25T12:07:17Z
Add distance measure to ml Kmeans

commit 0e2a9ee48c46f51f55b09f9d60d47ac3325676bd
Author: Marco Gaido
Date: 2017-09-25T14:40:37Z
Add tests for cosine

commit d8d8c642345b6a7815998ab5409a5e4cd86d9a5d
Author: Marco Gaido
Date: 2017-09-25T15:05:32Z
fix scalastyle
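As a quick illustration of what cosine distance buys conceptually: it depends only on the direction of a vector, not its magnitude, so scaling a point doesn't change which center is closest, unlike Euclidean distance. A small self-contained Python check of this property (not Spark code):

```python
import math

def cosine_distance(a, b):
    """1 - cos(angle between a and b); direction-only, magnitude-free."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (na * nb)

def euclidean_distance(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Scaling a point changes its Euclidean distances but not its cosine ones.
p = [1.0, 1.0]
scaled = [100.0, 100.0]   # same direction as p, 100x the magnitude
center = [3.0, 4.0]
```

This is why cosine distance is popular for text and other sparse high-dimensional data, where document length is often noise and only the direction of the feature vector matters.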