[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155156703 Merged build triggered. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155156753 Merged build started.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155158286 **[Test build #45396 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45396/consoleFull)** for PR 5267 at commit [`75ca2a0`](https://github.com/apache/spark/commit/75ca2a000936fe62fb53293d805b2888c67cb563).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155190941 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155175125 Merged build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155175092 Merged build triggered.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155177514 **[Test build #45406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45406/consoleFull)** for PR 5267 at commit [`29ccdf9`](https://github.com/apache/spark/commit/29ccdf9eaa987530435782d2051acbeda3d3ac36).
Github user yu-iskw commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r44322215
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,489 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import scala.collection.mutable
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.{BLAS, Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * A bisecting k-means algorithm based on the paper "A comparison of document clustering techniques"
+ * by Steinbach, Karypis, and Kumar, with modification to fit Spark.
+ * The algorithm starts from a single cluster that contains all points.
+ * Iteratively it finds divisible clusters on the bottom level and bisects each of them using
+ * k-means, until there are `k` leaf clusters in total or no leaf clusters are divisible.
+ * The bisecting steps of clusters on the same level are grouped together to increase parallelism.
+ * If bisecting all divisible clusters on the bottom level would result more than `k` leaf clusters,
+ * larger clusters get higher priority.
+ *
+ * @param k the desired number of leaf clusters (default: 4). The actual number could be smaller if
+ *          there are no divisible leaf clusters.
+ * @param maxIterations the max number of k-means iterations to split clusters (default: 20)
+ * @param minDivisibleClusterSize the minimum number of points (if >= 1.0) or the minimum proportion
+ *                                of points (if < 1.0) of a divisible cluster (default: 1)
+ * @param seed a random seed (default: hash value of the class name)
+ *
+ * @see [[http://glaros.dtc.umn.edu/gkhome/fetch/papers/docclusterKDDTMW00.pdf
+ *     Steinbach, Karypis, and Kumar, A comparison of document clustering techniques,
+ *     KDD Workshop on Text Mining, 2000.]]
+ */
+@Since("1.6.0")
+@Experimental
+class BisectingKMeans private (
+    private var k: Int,
+    private var maxIterations: Int,
+    private var minDivisibleClusterSize: Double,
+    private var seed: Long) extends Logging {
+
+  import BisectingKMeans._
+
+  /**
+   * Constructs with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(4, 20, 1.0, classOf[BisectingKMeans].getName.##)
+
+  /**
+   * Sets the desired number of leaf clusters (default: 4).
+   * The actual number could be smaller if there are no divisible leaf clusters.
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+    require(k > 0, s"k must be positive but got $k.")
+    this.k = k
+    this
+  }
+
+  /**
+   * Gets the desired number of leaf clusters.
+   */
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the max number of k-means iterations to split clusters (default: 20).
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
+    require(maxIterations > 0, s"maxIterations must be positive but got $maxIterations.")
+    this.maxIterations = maxIterations
+    this
+  }
+
+  /**
+   * Gets the max number of k-means iterations to split clusters.
+   */
+  @Since("1.6.0")
+  def getMaxIterations: Int = this.maxIterations
+
+  /**
+   * Sets the minimum number of points (if >= `1.0`) or the minimum proportion of points
+   * (if < `1.0`) of a divisible cluster (default: 1).
+   */
+  @Since("1.6.0")
+  def setMinDivisibleClusterSize(minDivisibleClusterSize: Double): this.type = {
+    require(minDivisibleClusterSize > 0.0,
+      s"minDivisibleClusterSize must be
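The builder-style API quoted in the diff above can be exercised end to end. The sketch below is illustrative only: the input vectors are toy data, and it assumes a live `SparkContext` named `sc` (e.g. inside `spark-shell`) plus a `setSeed` setter, which the Jenkins class summary for this patch mentions but which is outside the quoted portion of the diff.

```scala
import org.apache.spark.mllib.clustering.BisectingKMeans
import org.apache.spark.mllib.linalg.Vectors

// Toy data: two well-separated groups of 2-D points.
val data = sc.parallelize(Seq(
  Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
  Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))

// Configure via the chained setters shown in the diff.
val bkm = new BisectingKMeans()
  .setK(2)              // desired number of leaf clusters
  .setMaxIterations(20) // k-means iterations per split
  .setSeed(42L)         // assumed setter; see note above

val model = bkm.run(data)
model.clusterCenters.foreach(println)
```

Because the algorithm stops splitting when no leaf cluster is divisible, the fitted model may end up with fewer than `k` leaves, as the `@param k` doc in the diff notes.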
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155174143 **[Test build #45396 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45396/consoleFull)** for PR 5267 at commit [`75ca2a0`](https://github.com/apache/spark/commit/75ca2a000936fe62fb53293d805b2888c67cb563).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * ` * Sets the random seed (default: hash value of the class name).`
  * `class BisectingKMeansModel @Since("1.6.0") (`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155190761 **[Test build #45406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45406/consoleFull)** for PR 5267 at commit [`29ccdf9`](https://github.com/apache/spark/commit/29ccdf9eaa987530435782d2051acbeda3d3ac36).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * ` * Sets the random seed (default: hash value of the class name).`
  * `class BisectingKMeansModel @Since("1.6.0") (`
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155174569 Merged build finished. Test PASSed.
Github user yu-iskw commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r44321863
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
Github user yu-iskw commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155314648 @jkbradley thanks!
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155225558 LGTM. Merged into master. Thanks!
Github user yu-iskw commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155231942 @freeman-lab thank you for your great support!
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/5267
Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155231687 awesome, nice job all!
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155239249 @yu-iskw Thanks for persevering and getting this merged!
Github user yu-iskw commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-155226627 Thank you for merging it!!
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152259858 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152259884 Merged build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152275043 Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152274585 **[Test build #44615 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44615/consoleFull)** for PR 5267 at commit [`a876ba2`](https://github.com/apache/spark/commit/a876ba233e197bb5228a05fe2946523af1b40ee6).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class BisectingKMeansModel @Since("1.6.0") (`
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152275050 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44615/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152342812 Merged build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152342783 Merged build triggered.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152350863 **[Test build #44642 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44642/consoleFull)** for PR 5267 at commit [`5da05d3`](https://github.com/apache/spark/commit/5da05d3fd41d5b730d19f595ca9bf622aaf0a14d).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class BisectingKMeansModel @Since("1.6.0") (`
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152350977 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44642/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152350975 Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152343011 **[Test build #44642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44642/consoleFull)** for PR 5267 at commit [`5da05d3`](https://github.com/apache/spark/commit/5da05d3fd41d5b730d19f595ca9bf622aaf0a14d).
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152262191 **[Test build #44615 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44615/consoleFull)** for PR 5267 at commit [`a876ba2`](https://github.com/apache/spark/commit/a876ba233e197bb5228a05fe2946523af1b40ee6).
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152356488 **[Test build #44646 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44646/consoleFull)** for PR 5267 at commit [`a50689a`](https://github.com/apache/spark/commit/a50689a144bf5801c831d0b2f33eb6435e87f929).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152362401 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44646/
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152362400 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152355068 Merged build triggered.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-152362256 **[Test build #44646 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44646/consoleFull)** for PR 5267 at commit [`a50689a`](https://github.com/apache/spark/commit/a50689a144bf5801c831d0b2f33eb6435e87f929).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151312766 **[Test build #44382 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44382/consoleFull)** for PR 5267 at commit [`2e8226d`](https://github.com/apache/spark/commit/2e8226d509ead4c14940335bc44c110d4c02c52f).
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r43072422

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,744 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
+
+import breeze.linalg.{SparseVector => BSV, Vector => BV, norm => breezeNorm, sum => breezeSum}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Since
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
+ *
+ * The main idea of this algorithm is based on "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * However, we modified it to fit Spark. This algorithm consists of two main parts:
+ *
+ * 1. Split clusters until there are enough clusters to build a cluster tree.
+ * 2. Build a cluster tree as a binary tree from the split clusters.
+ *
+ * First, it splits clusters into their child clusters step by step, without considering whether
+ * each cluster will be included in the final cluster tree. This makes the algorithm more
+ * efficient on Spark, because splitting clusters one by one is very slow. It keeps splitting
+ * until there are enough clusters to build a cluster tree, and stops early if no dividable
+ * clusters remain before that point. In this first part it also calculates costs, such as the
+ * average cost and entropy, for building the cluster tree. A cost indicates how large a cluster
+ * is: the cluster with the maximum cost among all clusters is the largest cluster.
+ *
+ * Second, it builds a binary cluster tree from the result of the first part. The tree starts
+ * with only the root cluster, which includes all points, so there are two candidates that can
+ * be merged into the tree: the children of the root. It picks the larger of the two children
+ * and merges it into the tree. After that there are three candidates: the smaller child of the
+ * root and the two children of the larger child. It keeps picking the largest candidate and
+ * merging it into the cluster tree until the desired number of clusters is reached.
+ *
+ * @param k the desired number of clusters
+ * @param maxIterations the maximum number of iterations used to split clusters
+ * @param seed a random seed
+ */
+@Since("1.6.0")
+class BisectingKMeans private (
+    private var k: Int,
+    private var maxIterations: Int,
+    private var seed: Long) extends Logging {
+
+  import BisectingKMeans._
+
+  /**
+   * Constructs with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(20, 20, 1)
+
+  /**
+   * Sets the number of clusters you want
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+    this.k = k
+    this
+  }
+
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the maximum number of iterations in each clustering step
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
+    this.maxIterations = maxIterations
+    this
+  }
+
+  @Since("1.6.0")
+  def getMaxIterations: Int = this.maxIterations
+
+  /**
+   * Sets the random seed
+   */
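The class comment quoted above describes a two-phase bisecting procedure: repeatedly pick the largest (highest-cost) cluster and split it in two until the desired number of clusters is reached. As a rough illustration of that idea, here is a hypothetical single-machine Python sketch (not the PR's Spark implementation; plain k-means with k=2 stands in for the splitting step, and all names and simplifications are mine):

```python
import random


def kmeans2(points, max_iter=20, seed=1):
    """Split one cluster into two children with plain k-means (k=2).

    Returns a pair of point lists, or None if the split degenerates
    (e.g. duplicate points yield an empty child).
    """
    rng = random.Random(seed)
    centers = rng.sample(points, 2)  # two distinct points as initial centers
    groups = ([], [])
    for _ in range(max_iter):
        groups = ([], [])
        for p in points:
            d0 = sum((a - b) ** 2 for a, b in zip(p, centers[0]))
            d1 = sum((a - b) ** 2 for a, b in zip(p, centers[1]))
            groups[0 if d0 <= d1 else 1].append(p)
        if not groups[0] or not groups[1]:
            return None  # degenerate split: this cluster is not dividable
        # recompute each center as the mean of its group
        centers = [tuple(sum(xs) / len(xs) for xs in zip(*g)) for g in groups]
    return groups


def cost(points):
    """Cost of a cluster: sum of squared distances to its mean."""
    center = [sum(xs) / len(xs) for xs in zip(*points)]
    return sum(sum((a - b) ** 2 for a, b in zip(p, center)) for p in points)


def bisecting_kmeans(points, k, seed=1):
    """Bisect the highest-cost cluster until k clusters exist (or none divide)."""
    clusters = [list(points)]
    while len(clusters) < k:
        # the cluster whose cost is maximal is treated as the largest
        i = max(range(len(clusters)), key=lambda j: cost(clusters[j]))
        target = clusters.pop(i)
        split = kmeans2(target, seed=seed) if len(target) >= 2 else None
        if split is None:
            clusters.append(target)  # no dividable cluster left: stop early
            break
        clusters.extend(split)
    return clusters
```

The sketch splits one cluster at a time, which is exactly the behavior the PR's comment calls out as too slow on Spark; the PR instead splits all leaf clusters of a level in parallel before assembling the tree.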
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151320160 Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151320164 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44382/
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r43072197 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151327314 **[Test build #44387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44387/consoleFull)** for PR 5267 at commit [`165e191`](https://github.com/apache/spark/commit/165e191755bb641f10c604f1ec214248f5a104e7).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151313120 Merged build started.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151313092 Merged build triggered.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151319812 **[Test build #44382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44382/consoleFull)** for PR 5267 at commit [`2e8226d`](https://github.com/apache/spark/commit/2e8226d509ead4c14940335bc44c110d4c02c52f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r43072115 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151311507 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151311530 Merged build started.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151321365 **[Test build #44384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44384/consoleFull)** for PR 5267 at commit [`622499e`](https://github.com/apache/spark/commit/622499e95e2717625eecd11f5c43e0c9eddc820e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r43072937 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151325995 Merged build triggered.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151334250 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44387/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151334153 **[Test build #44387 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44387/consoleFull)** for PR 5267 at commit [`165e191`](https://github.com/apache/spark/commit/165e191755bb641f10c604f1ec214248f5a104e7).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151334247 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151313829 **[Test build #44384 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44384/consoleFull)** for PR 5267 at commit [`622499e`](https://github.com/apache/spark/commit/622499e95e2717625eecd11f5c43e0c9eddc820e).
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r43071613 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -0,0 +1,744 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.{Map, mutable} + +import breeze.linalg.{SparseVector => BSV, Vector => BV, norm => breezeNorm, sum => breezeSum} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Since +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + + +/** + * This is a divisive hierarchical clustering algorithm based on bisecting k-means algorithm. + * + * The main idea of this algorithm is based on "A comparison of document clustering techniques", + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000. + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf + * + * However, we modified it to fit for Spark. This algorithm consists of the two main parts. + * + * 1. Split clusters until the number of clusters will be enough to build a cluster tree + * 2. 
Build a cluster tree as a binary tree from the split clusters + * + * First, it splits clusters into child clusters step by step, without considering whether each + * cluster will be included in the final cluster tree. This makes the algorithm more + * efficient on Spark, since splitting clusters one by one is very slow. It keeps splitting until + * the number of clusters is sufficient to build a cluster tree, or stops early + * when there are no more dividable clusters. In this first part it also calculates + * the costs, such as average cost, entropy and so on, used for building the cluster + * tree. The cost indicates how large a cluster is: the cluster + * whose cost is the maximum of all clusters is the largest cluster. + * + * Second, it builds a cluster tree as a binary tree from the result of the first part. + * The cluster tree starts with only the root cluster, which includes all points. + * So, there are two candidates which can be merged into the cluster tree: the children of + * the root. It picks up the larger child of the two and merges it into the cluster tree. + * After that, there are three candidates to merge: the smaller child of the root and + * the two children of the larger child of the root. It picks up the largest cluster of these + * and merges it into the cluster tree. It continues to pick up the largest of the + * candidates and merge it into the cluster tree until the desired number of clusters is reached.
+ * + * @param k the desired number of clusters + * @param maxIterations the maximum number of iterations to split clusters + * @param seed a random seed + */ +@Since("1.6.0") +class BisectingKMeans private ( +private var k: Int, +private var maxIterations: Int, +private var seed: Long) extends Logging { + + import BisectingKMeans._ + + /** + * Constructs with the default configuration + */ + @Since("1.6.0") + def this() = this(20, 20, 1) + + /** + * Sets the number of clusters you want + */ + @Since("1.6.0") + def setK(k: Int): this.type = { +this.k = k +this + } + + @Since("1.6.0") + def getK: Int = this.k + + /** + * Sets the maximum number of iterations in each clustering step + */ + @Since("1.6.0") + def setMaxIterations(maxIterations: Int): this.type = { +this.maxIterations = maxIterations +this + } + + @Since("1.6.0") + def getMaxIterations: Int = this.maxIterations + + /** + * Sets the random seed + */
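The first phase described in the Scaladoc above can be illustrated in plain Scala, without Spark. The sketch below is hypothetical (all names are illustrative, not the PR's API): it repeatedly bisects the cluster with the largest cost (sum of squared distances to its mean) until the desired number of clusters exists, standing in for the distributed splitting step.

```scala
// Hypothetical, self-contained sketch of the splitting phase:
// repeatedly split the costliest dividable cluster until k clusters exist.
object BisectingSketch {
  type Point = Array[Double]

  def mean(ps: Seq[Point]): Point = {
    val d = ps.head.length
    val m = new Array[Double](d)
    ps.foreach(p => (0 until d).foreach(i => m(i) += p(i)))
    (0 until d).foreach(i => m(i) /= ps.length)
    m
  }

  def sqDist(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Cost of a cluster = sum of squared distances to its mean;
  // the cluster with the maximum cost is treated as the "largest".
  def cost(ps: Seq[Point]): Double = {
    val c = mean(ps)
    ps.map(p => sqDist(p, c)).sum
  }

  // One bisection: a crude 2-way split seeded by two mutually distant points
  // (a stand-in for the inner k-means step, not the real implementation).
  def bisect(ps: Seq[Point]): (Seq[Point], Seq[Point]) = {
    val c = mean(ps)
    val far = ps.maxBy(p => sqDist(p, c))
    val opp = ps.maxBy(p => sqDist(p, far))
    ps.partition(p => sqDist(p, far) <= sqDist(p, opp))
  }

  def cluster(points: Seq[Point], k: Int): Seq[Seq[Point]] = {
    var clusters: Seq[Seq[Point]] = Seq(points)
    // Keep splitting the costliest dividable cluster until k clusters exist
    // or no cluster can be divided further.
    while (clusters.size < k && clusters.exists(_.size > 1)) {
      val target = clusters.filter(_.size > 1).maxBy(cost)
      val (l, r) = bisect(target)
      clusters = clusters.filterNot(_ eq target) ++ Seq(l, r).filter(_.nonEmpty)
    }
    clusters
  }
}
```

On two well-separated groups of points, asking for k = 2 recovers the groups; the real algorithm instead splits all leaf clusters of a level in one pass over the RDD, which is what the comment means by avoiding splitting "one by one".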
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151321462 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44384/ Test PASSed.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r43071633 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -0,0 +1,744 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.{Map, mutable} --- End diff -- `Map` is not needed.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151321461 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-151326036 Merged build started.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150126146 Merged build started.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150126127 Merged build triggered.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150127788 **[Test build #44138 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44138/consoleFull)** for PR 5267 at commit [`3156dd7`](https://github.com/apache/spark/commit/3156dd7e0c4fc4f5d1b1116bb689592eee76ec25).
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150136686 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44138/ Test PASSed.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150136685 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150136544 **[Test build #44138 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44138/consoleFull)** for PR 5267 at commit [`3156dd7`](https://github.com/apache/spark/commit/3156dd7e0c4fc4f5d1b1116bb689592eee76ec25). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class BisectingKMeansModel @Since("1.6.0") (`
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821656 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -0,0 +1,690 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.{Map, mutable} + +import breeze.linalg + .{SparseVector => BSV, Vector => BV, any => breezeAny, norm => breezeNorm, sum => breezeSum} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Since +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + + +/** + * This is a divisive hierarchical clustering algorithm based on bisecting k-means algorithm. + * + * The main idea of this algorithm is based on "A comparison of document clustering techniques", + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000. + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf + * + * However, we modified it to fit for Spark. This algorithm consists of the two main parts. + * + * 1. Split clusters until the number of clusters will be enough to build a cluster tree + * 2. 
Build a cluster tree as a binary tree from the split clusters + * + * First, it splits clusters into child clusters step by step, without considering whether each + * cluster will be included in the final cluster tree. This makes the algorithm more + * efficient on Spark, since splitting clusters one by one is very slow. It keeps splitting until + * the number of clusters is sufficient to build a cluster tree, or stops early + * when there are no more dividable clusters. In this first part it also calculates + * the criteria, such as average cost, entropy and so on, used for building the cluster + * tree. The criterion indicates how large a cluster is: the cluster + * whose criterion is the maximum of all clusters is the largest cluster. + * + * Second, it builds a cluster tree as a binary tree from the result of the first part. + * The cluster tree starts with only the root cluster, which includes all points. + * So, there are two candidates which can be merged into the cluster tree: the children of + * the root. It picks up the larger child of the two and merges it into the cluster tree. + * After that, there are three candidates to merge: the smaller child of the root and + * the two children of the larger child of the root. It picks up the largest cluster of these + * and merges it into the cluster tree. It continues to pick up the largest of the + * candidates and merge it into the cluster tree until the desired number of clusters is reached.
+ * + * @param k the desired number of clusters + * @param clusterMap the pairs of cluster and its index as a Map + * @param maxIterations the maximum number of iterations to split clusters + * @param seed a random seed + */ +@Since("1.6.0") +class BisectingKMeans private ( +private var k: Int, +private var clusterMap: Map[BigInt, BisectingClusterNode], +private var maxIterations: Int, +private var seed: Long) extends Logging { + + /** + * Constructs with the default configuration + */ + @Since("1.6.0") + def this() = this(20, mutable.ListMap.empty[BigInt, BisectingClusterNode], 20, 1) + + /** + * Sets the number of clusters you want + */ + @Since("1.6.0") + def setK(k: Int): this.type = { +this.k = k +this + } + + @Since("1.6.0") + def getK: Int = this.k + + /** + * Sets the maximum number of iterations in each clustering step + */ + @Since("1.6.0") + def setMaxIterations(maxIterations: Int): this.type = { +
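The second phase described above (growing the tree by always adopting the largest candidate) can be sketched in plain Scala. This is an illustrative model, not the PR's API: it assumes phase one has already attached a pre-computed binary split and a criterion value to every cluster node, and it repeatedly expands the leaf with the largest criterion until the desired number of leaf clusters is reached.

```scala
// Hypothetical node type: `criterion` measures cluster size (e.g. average
// cost); `children` holds the pre-computed split from phase one, if any.
final case class ClusterNode(
    criterion: Double,
    children: Option[(ClusterNode, ClusterNode)])

// Grow the tree from the root: at each step, replace the leaf with the
// largest criterion by its two children, until k leaves exist or no leaf
// can be expanded further.
def growTree(root: ClusterNode, k: Int): List[ClusterNode] = {
  var leaves = List(root)
  while (leaves.size < k && leaves.exists(_.children.isDefined)) {
    val target = leaves.filter(_.children.isDefined).maxBy(_.criterion)
    val (l, r) = target.children.get
    leaves = l :: r :: leaves.filterNot(_ eq target)
  }
  leaves
}
```

With k = 2 this yields the root's two children; with k = 3 the larger child is expanded next, matching the candidate-selection order the comment describes.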
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821607 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821750 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821747

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
+
+import breeze.linalg
+  .{SparseVector => BSV, Vector => BV, any => breezeAny, norm => breezeNorm, sum => breezeSum}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Since
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm.
+ *
+ * The main idea of this algorithm is based on "A comparison of document clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * However, we modified it to fit Spark. The algorithm consists of two main parts:
+ *
+ * 1. Split clusters until there are enough clusters to build a cluster tree.
+ * 2. Build a cluster tree as a binary tree from the split clusters.
+ *
+ * First, it splits clusters into their child clusters step by step, without considering whether
+ * each cluster will end up in the final cluster tree. This makes the algorithm more efficient on
+ * Spark, because splitting clusters one at a time is very slow. It keeps splitting until there
+ * are enough clusters to build a cluster tree, or stops early when no dividable clusters remain.
+ * During this first part it also calculates the criteria used to build the cluster tree, such as
+ * average cost and entropy. A criterion measures how large a cluster is; that is, the cluster
+ * whose criterion is the maximum over all clusters is the largest cluster.
+ *
+ * Second, it builds a cluster tree as a binary tree from the result of the first part.
+ * The cluster tree starts with only the root cluster, which contains all points, so there are
+ * two candidates that can be merged into the tree: the children of the root. It picks the larger
+ * of the two and merges it into the cluster tree. After that there are three candidates: the
+ * smaller child of the root and the two children of the larger child. It again picks the largest
+ * candidate and merges it into the cluster tree, and continues picking the largest remaining
+ * candidate in this way until the desired number of clusters is reached.
+ *
+ * @param k the desired number of clusters
+ * @param clusterMap the pairs of cluster and its index as a Map
+ * @param maxIterations the maximal number of iterations to split clusters
+ * @param seed a random seed
+ */
+@Since("1.6.0")
+class BisectingKMeans private (
+    private var k: Int,
+    private var clusterMap: Map[BigInt, BisectingClusterNode],
+    private var maxIterations: Int,
+    private var seed: Long) extends Logging {
+
+  /**
+   * Constructs an instance with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(20, mutable.ListMap.empty[BigInt, BisectingClusterNode], 20, 1)
+
+  /**
+   * Sets the desired number of clusters
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+    this.k = k
+    this
+  }
+
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the maximal number of iterations in each clustering step
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
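The splitting loop described in the Scaladoc above can be sketched in plain, in-memory Scala. This is a hypothetical local illustration, not the Spark implementation: all names (`BisectingSketch`, `bisect`, `cluster`) are invented, it uses total within-cluster cost as the "largest cluster" criterion, and it seeds the 2-means split deterministically instead of using a random seed.

```scala
// Local sketch of bisecting k-means: repeatedly split the cluster with the
// largest total cost using plain 2-means, until k clusters exist or no
// cluster is dividable. Not the Spark/RDD implementation.
object BisectingSketch {
  type Point = Array[Double]

  def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  def mean(pts: Seq[Point]): Point = {
    val d = pts.head.length
    val m = new Array[Double](d)
    for (p <- pts; i <- 0 until d) m(i) += p(i)
    m.map(_ / pts.length)
  }

  // Criterion: total squared distance to the cluster center ("average cost"
  // up to a constant factor).
  def cost(pts: Seq[Point]): Double = {
    val c = mean(pts)
    pts.map(dist2(_, c)).sum
  }

  // Plain 2-means on one cluster, seeded deterministically for reproducibility.
  def bisect(pts: Seq[Point], maxIterations: Int): (Seq[Point], Seq[Point]) = {
    var centers = Array(pts.head, pts.last)
    var groups: (Seq[Point], Seq[Point]) = (Seq.empty, Seq.empty)
    for (_ <- 0 until maxIterations) {
      groups = pts.partition(p => dist2(p, centers(0)) <= dist2(p, centers(1)))
      if (groups._1.nonEmpty && groups._2.nonEmpty)
        centers = Array(mean(groups._1), mean(groups._2))
    }
    groups
  }

  def cluster(pts: Seq[Point], k: Int, maxIterations: Int = 20): Seq[Seq[Point]] = {
    var clusters: Seq[Seq[Point]] = Seq(pts)
    while (clusters.size < k && clusters.exists(_.size > 1)) {
      // Pick the "largest" cluster by the criterion and split it in two.
      val target = clusters.filter(_.size > 1).maxBy(cost)
      val (l, r) = bisect(target, maxIterations)
      if (l.isEmpty || r.isEmpty) return clusters // cluster not dividable
      clusters = clusters.filterNot(_ eq target) :+ l :+ r
    }
    clusters
  }
}
```

The real implementation differs in that it splits all leaf clusters in parallel per pass (rather than one at a time), which is the Spark-specific modification the Scaladoc mentions.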
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42820997 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42820980 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821019 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42820998 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821001 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821931

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala ---
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{Vector => BV, norm => breezeNorm}
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Since
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * This class represents the model of bisecting k-means
+ *
+ * @param node a cluster as a tree node
+ */
+@Since("1.6.0")
+class BisectingKMeansModel @Since("1.6.0") (
+    @Since("1.6.0") val node: BisectingClusterNode
+  ) extends Serializable with Logging {
+
+  @Since("1.6.0")
+  def getClusters: Array[BisectingClusterNode] = this.node.getLeavesNodes
+
+  @Since("1.6.0")
+  def getCenters: Array[Vector] = this.getClusters.map(_.center)
+
+  /**
+   * Predicts the closest cluster for one point
+   */
+  @Since("1.6.0")
+  def predict(vector: Vector): Int = {
--- End diff --

* We need to do this from root to leaf so we don't need to compare the point with all leaf clusters.
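The root-to-leaf prediction the reviewer asks for can be sketched as a simple tree descent. This is a hypothetical illustration with an invented `Node` type and `predictLeaf` helper, not the real `BisectingClusterNode` API: at each internal node, descend into the child whose center is closer, so a lookup costs O(tree depth) distance comparisons instead of one comparison per leaf cluster.

```scala
// Minimal binary cluster-tree node (invented for this sketch; the real model
// uses BisectingClusterNode).
final case class Node(center: Array[Double],
                      left: Option[Node] = None,
                      right: Option[Node] = None)

def dist2(a: Array[Double], b: Array[Double]): Double =
  a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

// Walk from the root toward the closer child until a leaf is reached.
@annotation.tailrec
def predictLeaf(node: Node, point: Array[Double]): Node =
  (node.left, node.right) match {
    case (Some(l), Some(r)) =>
      if (dist2(point, l.center) <= dist2(point, r.center)) predictLeaf(l, point)
      else predictLeaf(r, point)
    case _ => node // leaf reached
  }
```

Because each step halves the candidate set, a balanced tree over k leaf clusters needs only about log2(k) comparisons per point, which is the efficiency argument behind the review comment.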
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821027 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821017 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -0,0 +1,690 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.{Map, mutable} + +import breeze.linalg + .{SparseVector => BSV, Vector => BV, any => breezeAny, norm => breezeNorm, sum => breezeSum} + +import org.apache.spark.{Logging, SparkException} +import org.apache.spark.annotation.Since +import org.apache.spark.mllib.linalg.{Vector, Vectors} +import org.apache.spark.rdd.RDD + + +/** + * This is a divisive hierarchical clustering algorithm based on bisecting k-means algorithm. + * + * The main idea of this algorithm is based on "A comparison of document clustering techniques", + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000. + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf + * + * However, we modified it to fit for Spark. This algorithm consists of the two main parts. + * + * 1. Split clusters until the number of clusters will be enough to build a cluster tree + * 2. 
Build a cluster tree as a binary tree from the split clusters + * + * First, it splits clusters into their child clusters step by step, without considering whether a + * cluster will be included in the final cluster tree or not. This makes the algorithm more + * efficient on Spark, because splitting clusters one by one is very slow. It keeps splitting until + * the number of clusters is enough to build a cluster tree, or stops early when no dividable + * clusters remain before the number of clusters is sufficient. In this first part it also + * calculates the criteria, such as average cost, entropy and so on, used for building the cluster + * tree. The criterion measures how large a cluster is; that is, the cluster + * whose criterion is the maximum over all the clusters is the largest cluster. + * + * Second, it builds a cluster tree as a binary tree from the result of the first part. + * The cluster tree starts with only the root cluster, which includes all points. + * At that point there are two candidates that can be merged into the cluster tree: the children of + * the root. It picks the larger child of the two and merges it into the cluster tree. + * After that, there are three candidates to merge: the smaller child of the root and + * the two children of the larger child of the root. It picks the largest of these candidates + * and merges it into the cluster tree. It continues to pick the largest of the + * candidates and merge it into the cluster tree until the desired number of clusters is reached. 
+ * + * @param k the desired number of clusters + * @param clusterMap the pairs of cluster indices and cluster nodes as a Map + * @param maxIterations the maximal number of iterations to split clusters + * @param seed a random seed + */ +@Since("1.6.0") +class BisectingKMeans private ( +private var k: Int, +private var clusterMap: Map[BigInt, BisectingClusterNode], +private var maxIterations: Int, +private var seed: Long) extends Logging { + + /** + * Constructs with the default configuration + */ + @Since("1.6.0") + def this() = this(20, mutable.ListMap.empty[BigInt, BisectingClusterNode], 20, 1) + + /** + * Sets the number of clusters you want + */ + @Since("1.6.0") + def setK(k: Int): this.type = { +this.k = k +this + } + + @Since("1.6.0") + def getK: Int = this.k + + /** + * Sets the maximal number of iterations in each clustering step + */ + @Since("1.6.0") + def setMaxIterations(maxIterations: Int): this.type = { +
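The two-part strategy described in the Scaladoc above — repeatedly bisect the "largest" cluster by some cost criterion until enough clusters exist — can be sketched locally in plain NumPy. This is an illustrative sketch only, not the distributed Spark implementation under review: the function names and the average-squared-distance criterion are assumptions chosen for the example.

```python
import numpy as np

def bisect(points, seed, max_iterations=20):
    """Split one cluster into two with plain Lloyd's k-means (k = 2)."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), size=2, replace=False)]
    assign = np.zeros(len(points), dtype=int)
    for _ in range(max_iterations):
        # assign every point to the nearer of the two centers
        dists = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=-1)
        assign = dists.argmin(axis=1)
        new_centers = np.array([
            points[assign == j].mean(axis=0) if (assign == j).any() else centers[j]
            for j in range(2)
        ])
        if np.allclose(new_centers, centers):
            break
        centers = new_centers
    return assign

def criterion(points):
    # the "criterion" here: average squared distance to the cluster mean
    return ((points - points.mean(axis=0)) ** 2).sum(axis=-1).mean()

def bisecting_kmeans(points, k, seed=1):
    clusters = [points]  # start from the root cluster holding all points
    while len(clusters) < k:
        # pick the cluster whose criterion is largest and try to bisect it
        i = max(range(len(clusters)), key=lambda j: criterion(clusters[j]))
        target = clusters.pop(i)
        if len(target) < 2:
            clusters.insert(i, target)  # no dividable cluster left; stop early
            break
        assign = bisect(target, seed)
        left, right = target[assign == 0], target[assign == 1]
        if len(left) == 0 or len(right) == 0:
            clusters.insert(i, target)  # degenerate split; stop early
            break
        clusters += [left, right]
    return clusters
```

Note how the loop mirrors the doc's stopping conditions: it either reaches `k` leaf clusters or stops early when no cluster remains dividable.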
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42820987 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- (quoted diff context omitted; it duplicates the excerpt quoted in the first comment above) +private var clusterMap: Map[BigInt, BisectingClusterNode], --- End diff -- * remove it from constructor args, move it to `run` method instead * `BigInt` -> `Long` --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821008 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- (quoted diff context omitted; it duplicates the excerpt quoted in the first comment above, and the review comment itself is truncated in this digest)
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821012 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- (quoted diff context omitted; it duplicates the excerpt quoted in the first comment above, and the review comment itself is truncated in this digest)
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821021 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- (quoted diff context omitted; it duplicates the excerpt quoted in the first comment above, and the review comment itself is truncated in this digest)
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42820995 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- (quoted diff context omitted; it duplicates the excerpt quoted in the first comment above, and the review comment itself is truncated in this digest)
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42820990 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- (quoted diff context omitted; it duplicates the excerpt quoted in the first comment above) +private var seed: Long) extends Logging { + --- End diff -- `import BisectingKMeans._`
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821004 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -0,0 +1,690 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering + +import scala.collection.{Map, mutable} + +import breeze.linalg + .{SparseVector => BSV, Vector => BV, any => breezeAny, norm => breezeNorm, sum => breezeSum} + +import org.apache.spark.{Logging, SparkException} + +import org.apache.spark.annotation.Since + +import org.apache.spark.mllib.linalg.{Vector, Vectors} + +import org.apache.spark.rdd.RDD + + +/** + * This is a divisive hierarchical clustering algorithm based on the bisecting k-means algorithm. + * + * The main idea of this algorithm is based on "A comparison of document clustering techniques", + * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 2000. + * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf + * + * However, we modified it to fit Spark. This algorithm consists of two main parts: + * + * 1. Split clusters until there are enough clusters to build a cluster tree + * 2. Build a cluster tree as a binary tree from the split clusters + * + * First, it splits clusters into their child clusters step by step, without considering whether + * a cluster will end up in the final cluster tree. This makes the algorithm more efficient on + * Spark, because splitting clusters one by one is very slow. It keeps splitting until there are
 * enough clusters to build a cluster tree, or stops early if no dividable clusters remain before
 * that point. During this phase it also calculates the criteria, such as average cost and
 * entropy, used to build the cluster tree. A criterion measures how large a cluster is: the
 * cluster whose criterion is the maximum over all clusters is the largest cluster. + * + * Second, it builds a cluster tree as a binary tree from the result of the first part. + * The cluster tree starts with only the root cluster, which includes all points. So there are + * two candidates that can be merged into the cluster tree: the two children of the root. It
 * picks the larger of the two and merges it into the cluster tree. After that, there are three
 * candidates to merge: the smaller child of the root and the two children of the larger child.
 * It picks the largest of the candidates and merges it into the cluster tree, and continues in
 * this way until the desired number of clusters is reached. + * + * @param k the desired number of clusters + * @param clusterMap the pairs of cluster and its index as Map + * @param maxIterations the maximal number of iterations to split clusters + * @param seed a random seed + */ +@Since("1.6.0") +class BisectingKMeans private ( +private var k: Int, +private var clusterMap: Map[BigInt, BisectingClusterNode], +private var maxIterations: Int, +private var seed: Long) extends Logging { + + /** + * Constructs with the default configuration + */ + @Since("1.6.0") + def this() = this(20, mutable.ListMap.empty[BigInt, BisectingClusterNode], 20, 1) + + /** + * Sets the number of clusters you want + */ + @Since("1.6.0") + def setK(k: Int): this.type = { +this.k = k +this + } + + @Since("1.6.0") + def getK: Int = this.k + + /** + * Sets the maximal number of iterations in each clustering step + */ + @Since("1.6.0") + def setMaxIterations(maxIterations: Int): this.type = { +
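The split phase described in the Scaladoc above (repeatedly bisect the "largest" cluster, where size is measured by a cost criterion, until the desired number of clusters is reached) can be sketched without Spark. The following is a minimal, non-distributed Python sketch, not the PR's actual implementation: the function names are invented, and it uses the sum of squared distances to the cluster mean as the cost criterion and plain 2-means (Lloyd's algorithm) as the bisection step.

```python
import numpy as np

def two_means(points, rng, max_iterations=20):
    """Plain 2-means (Lloyd's algorithm), used here as the bisection step."""
    # Seed with two distinct data points.
    centers = points[rng.choice(len(points), size=2, replace=False)]
    for _ in range(max_iterations):
        # Assign each point to its nearest center (squared Euclidean distance).
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d.argmin(axis=1)
        new = np.array([points[labels == j].mean(axis=0) for j in (0, 1)])
        if np.allclose(new, centers):  # converged
            break
        centers = new
    return points[labels == 0], points[labels == 1]

def bisecting_kmeans(points, k, seed=1, max_iterations=20):
    """Divisive clustering sketch: bisect the highest-cost cluster until
    there are k leaf clusters (or no dividable cluster remains)."""
    rng = np.random.default_rng(seed)
    clusters = [np.asarray(points, dtype=float)]  # start with the root cluster
    while len(clusters) < k:
        # Criterion: total within-cluster cost (sum of squared distances to the mean).
        costs = [np.sum((c - c.mean(axis=0)) ** 2) for c in clusters]
        target = clusters.pop(int(np.argmax(costs)))
        if len(target) < 2:  # not dividable; stop splitting early
            clusters.append(target)
            break
        clusters.extend(two_means(target, rng, max_iterations))
    return clusters
```

For two well-separated blobs and k = 2, a single bisection recovers the blobs regardless of which two points seed the 2-means step. The distributed version in the PR differs substantially (it splits many clusters per pass over the RDD and then assembles the binary tree from the recorded criteria), but the criterion-driven "split the largest" loop is the same idea.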
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42821578 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala --- @@ -0,0 +1,690 @@
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on a diff in the pull request: https://github.com/apache/spark/pull/5267#discussion_r42822007 --- Diff: mllib/src/test/java/org/apache/spark/mllib/clustering/JavaBisectingKMeansSuite.java --- @@ -0,0 +1,123 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + *http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ + +package org.apache.spark.mllib.clustering; + +import com.google.common.collect.Lists; +import org.apache.spark.api.java.JavaRDD; --- End diff -- reorganize imports --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150285114 **[Test build #44157 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44157/consoleFull)** for PR 5267 at commit [`31623ea`](https://github.com/apache/spark/commit/31623ead10d4664c7c5d98b5c72731bcc804bcba). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class BisectingKMeansModel @Since("1.6.0") (`
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150285291 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150285292 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44157/
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150272472 **[Test build #44157 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44157/consoleFull)** for PR 5267 at commit [`31623ea`](https://github.com/apache/spark/commit/31623ead10d4664c7c5d98b5c72731bcc804bcba).
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150270427 Merged build triggered.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-150270450 Merged build started.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user yu-iskw commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-126521482 @freeman-lab It's alright. Thanks!
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user yu-iskw commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-126530412 @mengxr thank you for your feedback. I'll modify them ASAP.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user yu-iskw commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-126519990 @mengxr @freeman-lab where are we with this PR? I really appreciate your support.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user mengxr commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-126527910 @yu-iskw I made one pass, but I think we need more time on this PR. Some high-level comments:

1. Documentation is not sufficient. For example, the criterion used to split clusters is not documented. We only mention that the implementation is based on the paper, but we didn't describe the algorithm or how it differs from the algorithm used in the paper. There are also several public methods without JavaDoc.
2. Many methods should be private rather than package private. If we want to test some utility functions, we can put them under a utility class/object and mark the object package private but its methods public. Otherwise, it is hard to read the unit tests.
3. This PR adds many public APIs, especially on tree cluster nodes. We need to discuss the names as well as how to keep them minimal.
4. Correctness. I didn't check the details, but the variance calculation is wrong (line 337).
5. Code style issues.

For 1), could you add more description about the algorithms? If I have some time before 1.5, I can send you a PR to fix some issues. But we may miss the 1.5 release.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user freeman-lab commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-126520577 sorry for the delay @yu-iskw, i'm going through it today, comments soon
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user yu-iskw commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-125866875 @FlytxtRnD thank you for reviewing it. I have modified them.
[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/5267#issuecomment-125856449 Merged build started.