[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155156703
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org




2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155156753
  
Merged build started.



2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155158286
  
**[Test build #45396 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45396/consoleFull)** for PR 5267 at commit [`75ca2a0`](https://github.com/apache/spark/commit/75ca2a000936fe62fb53293d805b2888c67cb563).



2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155190941
  
Merged build finished. Test PASSed.



2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155175125
  
Merged build started.



2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155175092
  
 Merged build triggered.



2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155177514
  
**[Test build #45406 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45406/consoleFull)** for PR 5267 at commit [`29ccdf9`](https://github.com/apache/spark/commit/29ccdf9eaa987530435782d2051acbeda3d3ac36).



2015-11-09 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r44322215
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,489 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import java.util.Random
+
+import scala.collection.mutable
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.{Experimental, Since}
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.{BLAS, Vector, Vectors}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * A bisecting k-means algorithm based on the paper "A comparison of document clustering
+ * techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.
+ * The algorithm starts from a single cluster that contains all points.
+ * Iteratively it finds divisible clusters on the bottom level and bisects each of them using
+ * k-means, until there are `k` leaf clusters in total or no leaf clusters are divisible.
+ * The bisecting steps of clusters on the same level are grouped together to increase
+ * parallelism. If bisecting all divisible clusters on the bottom level would result more
+ * than `k` leaf clusters, larger clusters get higher priority.
+ *
+ * @param k the desired number of leaf clusters (default: 4). The actual number could be
+ *          smaller if there are no divisible leaf clusters.
+ * @param maxIterations the max number of k-means iterations to split clusters (default: 20)
+ * @param minDivisibleClusterSize the minimum number of points (if >= 1.0) or the minimum
+ *                                proportion of points (if < 1.0) of a divisible cluster
+ *                                (default: 1)
+ * @param seed a random seed (default: hash value of the class name)
+ *
+ * @see [[http://glaros.dtc.umn.edu/gkhome/fetch/papers/docclusterKDDTMW00.pdf
+ *     Steinbach, Karypis, and Kumar, A comparison of document clustering techniques,
+ *     KDD Workshop on Text Mining, 2000.]]
+ */
+@Since("1.6.0")
+@Experimental
+class BisectingKMeans private (
+    private var k: Int,
+    private var maxIterations: Int,
+    private var minDivisibleClusterSize: Double,
+    private var seed: Long) extends Logging {
+
+  import BisectingKMeans._
+
+  /**
+   * Constructs with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(4, 20, 1.0, classOf[BisectingKMeans].getName.##)
+
+  /**
+   * Sets the desired number of leaf clusters (default: 4).
+   * The actual number could be smaller if there are no divisible leaf clusters.
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+    require(k > 0, s"k must be positive but got $k.")
+    this.k = k
+    this
+  }
+
+  /**
+   * Gets the desired number of leaf clusters.
+   */
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the max number of k-means iterations to split clusters (default: 20).
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
+    require(maxIterations > 0, s"maxIterations must be positive but got $maxIterations.")
+    this.maxIterations = maxIterations
+    this
+  }
+
+  /**
+   * Gets the max number of k-means iterations to split clusters.
+   */
+  @Since("1.6.0")
+  def getMaxIterations: Int = this.maxIterations
+
+  /**
+   * Sets the minimum number of points (if >= `1.0`) or the minimum proportion of points
+   * (if < `1.0`) of a divisible cluster (default: 1).
+   */
+  @Since("1.6.0")
+  def setMinDivisibleClusterSize(minDivisibleClusterSize: Double): this.type = {
+    require(minDivisibleClusterSize > 0.0,
+      s"minDivisibleClusterSize must be 
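The scaladoc in the quoted diff describes the control flow: start from one cluster holding all points, repeatedly bisect divisible leaf clusters with k-means until `k` leaves exist or none are divisible, giving larger clusters priority. As a rough, serial illustration of that flow only (this is not the Spark implementation: it uses plain Python lists rather than RDDs, a bare Lloyd's 2-means as the bisection step, and the function names are invented for this sketch):

```python
import random

def kmeans2(points, max_iter=20, seed=0):
    """Plain 2-means (Lloyd's algorithm), used here as the bisection step."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(max_iter):
        groups = ([], [])
        # Assign each point to its nearest center (squared Euclidean distance).
        for p in points:
            d0 = sum((a - b) ** 2 for a, b in zip(p, centers[0]))
            d1 = sum((a - b) ** 2 for a, b in zip(p, centers[1]))
            groups[0 if d0 <= d1 else 1].append(p)
        # Recompute each center as the mean of its group; keep it if the group is empty.
        new_centers = [
            tuple(sum(xs) / len(g) for xs in zip(*g)) if g else c
            for g, c in zip(groups, centers)
        ]
        if new_centers == centers:
            break
        centers = new_centers
    return [g for g in groups if g]  # drop an empty half, if any

def bisecting_kmeans(points, k, min_divisible_size=1):
    """Top-down clustering: repeatedly bisect the largest divisible leaf cluster."""
    leaves = [list(points)]
    while len(leaves) < k:
        leaves.sort(key=len, reverse=True)  # larger clusters get higher priority
        target = leaves[0]
        if len(target) < max(2, min_divisible_size):
            break  # no divisible leaf clusters remain
        halves = kmeans2(target)
        if len(halves) < 2:
            break  # bisection failed (e.g. duplicate points); stop splitting
        leaves = halves + leaves[1:]
    return leaves
```

With four well-separated 2-D points and `k = 2`, this yields two leaf clusters of two points each; the MLlib version additionally groups the bisections of all clusters on the same tree level into one distributed pass, which this sketch does not attempt.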


2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155174143
  
**[Test build #45396 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45396/consoleFull)** for PR 5267 at commit [`75ca2a0`](https://github.com/apache/spark/commit/75ca2a000936fe62fb53293d805b2888c67cb563).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `* Sets the random seed (default: hash value of the class name).`
   * `class BisectingKMeansModel @Since("1.6.0") (`



2015-11-09 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155190761
  
**[Test build #45406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/45406/consoleFull)** for PR 5267 at commit [`29ccdf9`](https://github.com/apache/spark/commit/29ccdf9eaa987530435782d2051acbeda3d3ac36).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `* Sets the random seed (default: hash value of the class name).`
   * `class BisectingKMeansModel @Since("1.6.0") (`



2015-11-09 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155174569
  
Merged build finished. Test PASSed.



2015-11-09 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r44321863
  

2015-11-09 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155314648
  
@jkbradley thanks!



2015-11-09 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155225558
  
LGTM. Merged into master. Thanks!



2015-11-09 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155231942
  
@freeman-lab thank you for your great support!



2015-11-09 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/5267



2015-11-09 Thread freeman-lab
Github user freeman-lab commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155231687
  
awesome, nice job all!



2015-11-09 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155239249
  
@yu-iskw  Thanks for persevering and getting this merged!



2015-11-09 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-155226627
  
Thank you for merging it!!



2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152259858
  
 Merged build triggered.



2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152259884
  
Merged build started.



2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152275043
  
Merged build finished. Test PASSed.



2015-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152274585
  
**[Test build #44615 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44615/consoleFull)** for PR 5267 at commit [`a876ba2`](https://github.com/apache/spark/commit/a876ba233e197bb5228a05fe2946523af1b40ee6).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`



2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152275050
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44615/
Test PASSed.



2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152342812
  
Merged build started.



2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152342783
  
 Merged build triggered.



2015-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152350863
  
**[Test build #44642 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44642/consoleFull)** for PR 5267 at commit [`5da05d3`](https://github.com/apache/spark/commit/5da05d3fd41d5b730d19f595ca9bf622aaf0a14d).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`



2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152350977
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44642/
Test PASSed.



2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152350975
  
Merged build finished. Test PASSed.



2015-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152343011
  
**[Test build #44642 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44642/consoleFull)** for PR 5267 at commit [`5da05d3`](https://github.com/apache/spark/commit/5da05d3fd41d5b730d19f595ca9bf622aaf0a14d).



2015-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152262191
  
**[Test build #44615 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44615/consoleFull)** for PR 5267 at commit [`a876ba2`](https://github.com/apache/spark/commit/a876ba233e197bb5228a05fe2946523af1b40ee6).





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152356488
  
**[Test build #44646 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44646/consoleFull)** for PR 5267 at commit [`a50689a`](https://github.com/apache/spark/commit/a50689a144bf5801c831d0b2f33eb6435e87f929).





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152362401
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44646/
Test PASSed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152362400
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152355068
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-29 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-152362256
  
**[Test build #44646 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44646/consoleFull)** for PR 5267 at commit [`a50689a`](https://github.com/apache/spark/commit/a50689a144bf5801c831d0b2f33eb6435e87f929).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151312766
  
**[Test build #44382 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44382/consoleFull)** for PR 5267 at commit [`2e8226d`](https://github.com/apache/spark/commit/2e8226d509ead4c14940335bc44c110d4c02c52f).





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r43072422
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,744 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
+
+import breeze.linalg.{SparseVector => BSV, Vector => BV, norm => breezeNorm, sum => breezeSum}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Since
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on the bisecting
+ * k-means algorithm.
+ *
+ * The main idea of this algorithm is based on "A comparison of document
+ * clustering techniques", M. Steinbach, G. Karypis and V. Kumar.
+ * Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * However, we modified it to fit Spark. The algorithm consists of two main
+ * parts:
+ *
+ * 1. Split clusters until there are enough clusters to build a cluster tree.
+ * 2. Build a cluster tree as a binary tree from the split clusters.
+ *
+ * First, it splits clusters into their child clusters step by step, without
+ * considering whether a given cluster will be included in the final cluster
+ * tree. This makes the algorithm more efficient on Spark, because splitting
+ * clusters one by one is very slow. It keeps splitting until there are enough
+ * clusters to build a cluster tree, or stops early when no dividable clusters
+ * remain. During this part it also calculates costs, such as average cost and
+ * entropy, for building the cluster tree. A cost measures how large a cluster
+ * is; that is, the cluster whose cost is the maximum over all clusters is the
+ * largest cluster.
+ *
+ * Second, it builds a cluster tree as a binary tree from the result of the
+ * first part. Initially the cluster tree contains only the root cluster,
+ * which includes all points, so there are two candidates that can be merged
+ * into the tree: the children of the root. It picks the larger of the two
+ * children and merges it into the tree. After that, there are three
+ * candidates to merge: the smaller child of the root and the two children of
+ * the larger child. It keeps picking the largest of the candidates and
+ * merging it into the cluster tree until the desired number of clusters is
+ * reached.
+ *
+ * @param k the desired number of clusters
+ * @param maxIterations the maximum number of iterations used to split a cluster
+ * @param seed a random seed
+ */
+@Since("1.6.0")
+class BisectingKMeans private (
+private var k: Int,
+private var maxIterations: Int,
+private var seed: Long) extends Logging {
+
+  import BisectingKMeans._
+
+  /**
+   * Constructs with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(20, 20, 1)
+
+  /**
+   * Sets the number of clusters you want
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+this.k = k
+this
+  }
+
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the number of maximal iterations in each clustering step
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
+this.maxIterations = maxIterations
+this
+  }
+
+  @Since("1.6.0")
+  def getMaxIterations: Int = this.maxIterations
+
+  /**
+   * Sets the random seed
+   */
  
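The Scaladoc above describes two phases: repeatedly split the largest (highest-cost) cluster with a 2-means step, then assemble the split clusters into a binary tree. As a rough illustration of the splitting phase only, here is a minimal single-machine sketch in plain Scala. It is not the Spark implementation under review; the object name `BisectingSketch`, the deterministic seeding of the 2-means split, and the use of in-memory `Seq`s are simplifying assumptions made purely for illustration.

```scala
// Minimal single-machine sketch of the bisecting k-means idea (illustration only,
// NOT the distributed Spark implementation discussed in this PR).
object BisectingSketch {
  type Point = Array[Double]

  // Squared Euclidean distance between two points.
  def dist2(a: Point, b: Point): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum

  // Component-wise mean of a non-empty set of points.
  def mean(ps: Seq[Point]): Point = {
    val d = ps.head.length
    val m = new Array[Double](d)
    ps.foreach(p => for (i <- 0 until d) m(i) += p(i))
    m.map(_ / ps.length)
  }

  // Cost of a cluster: sum of squared distances to its center.
  // The cluster with maximum cost is treated as the "largest" cluster.
  def cost(ps: Seq[Point]): Double = {
    val c = mean(ps)
    ps.map(dist2(_, c)).sum
  }

  // One 2-means split (Lloyd's iterations, deterministically seeded with the
  // first and last points -- a simplifying assumption for this sketch).
  def bisect(ps: Seq[Point], maxIterations: Int): (Seq[Point], Seq[Point]) = {
    var c1 = ps.head
    var c2 = ps.last
    var left: Seq[Point] = Seq.empty
    var right: Seq[Point] = Seq.empty
    for (_ <- 0 until maxIterations) {
      val grouped = ps.partition(p => dist2(p, c1) <= dist2(p, c2))
      left = grouped._1
      right = grouped._2
      if (left.nonEmpty) c1 = mean(left)
      if (right.nonEmpty) c2 = mean(right)
    }
    (left, right)
  }

  // Keep splitting the highest-cost cluster until k clusters exist,
  // or stop early when no cluster can be divided further.
  def cluster(ps: Seq[Point], k: Int, maxIterations: Int = 20): Seq[Seq[Point]] = {
    var clusters: Seq[Seq[Point]] = Seq(ps)
    while (clusters.size < k) {
      val target = clusters.maxBy(cost)
      val (a, b) = bisect(target, maxIterations)
      if (a.isEmpty || b.isEmpty) return clusters // no dividable cluster left
      clusters = clusters.filterNot(_ eq target) :+ a :+ b
    }
    clusters
  }
}
```

For example, clustering four 2-D points drawn from two well-separated groups with `BisectingSketch.cluster(points, 2)` yields two clusters of two points each. Building the binary tree from the split clusters (the second phase in the Scaladoc) is omitted here for brevity.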

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151320160
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151320164
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44382/
Test PASSed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r43072197
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,744 @@

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151327314
  
**[Test build #44387 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44387/consoleFull)** for PR 5267 at commit [`165e191`](https://github.com/apache/spark/commit/165e191755bb641f10c604f1ec214248f5a104e7).





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151313120
  
Merged build started.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151313092
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151319812
  
**[Test build #44382 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44382/consoleFull)** for PR 5267 at commit [`2e8226d`](https://github.com/apache/spark/commit/2e8226d509ead4c14940335bc44c110d4c02c52f).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r43072115
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,744 @@

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151311507
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151311530
  
Merged build started.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151321365
  
**[Test build #44384 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44384/consoleFull)** for PR 5267 at commit [`622499e`](https://github.com/apache/spark/commit/622499e95e2717625eecd11f5c43e0c9eddc820e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r43072937
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,691 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
+
+import breeze.linalg.{SparseVector => BSV, Vector => BV, norm => breezeNorm, sum => breezeSum}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Since
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on the bisecting
+ * k-means algorithm.
+ *
+ * The main idea of this algorithm is based on "A comparison of document
+ * clustering techniques", M. Steinbach, G. Karypis and V. Kumar.
+ * Workshop on Text Mining, KDD, 2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * However, we modified it to fit Spark. The algorithm consists of two main
+ * parts:
+ *
+ * 1. Split clusters until there are enough clusters to build a cluster tree.
+ * 2. Build a cluster tree as a binary tree from the split clusters.
+ *
+ * First, it splits clusters into their child clusters step by step, without
+ * considering whether a given cluster will be included in the final cluster
+ * tree. This makes the algorithm more efficient on Spark, because splitting
+ * clusters one by one is very slow. It keeps splitting until there are enough
+ * clusters to build a cluster tree, or stops early when no dividable clusters
+ * remain. During this part it also calculates costs, such as average cost and
+ * entropy, for building the cluster tree. A cost measures how large a cluster
+ * is; that is, the cluster whose cost is the maximum over all clusters is the
+ * largest cluster.
+ *
+ * Second, it builds a cluster tree as a binary tree from the result of the
+ * first part. Initially the cluster tree contains only the root cluster,
+ * which includes all points, so there are two candidates that can be merged
+ * into the tree: the children of the root. It picks the larger of the two
+ * children and merges it into the tree. After that, there are three
+ * candidates to merge: the smaller child of the root and the two children of
+ * the larger child. It keeps picking the largest of the candidates and
+ * merging it into the cluster tree until the desired number of clusters is
+ * reached.
+ *
+ * @param k the desired number of clusters
+ * @param maxIterations the maximum number of iterations used to split a cluster
+ * @param seed a random seed
+ */
+@Since("1.6.0")
+class BisectingKMeans private (
+private var k: Int,
+private var maxIterations: Int,
+private var seed: Long) extends Logging {
+
+  import BisectingKMeans._
+
+  /**
+   * Constructs with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(2, 20, 1)
+
+  /**
+   * Sets the number of clusters you want
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+this.k = k
+this
+  }
+
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the number of maximal iterations in each clustering step
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
+this.maxIterations = maxIterations
+this
+  }
+
+  @Since("1.6.0")
+  def getMaxIterations: Int = this.maxIterations
+
+  /**
+   * Sets the random seed
+   */
   

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151325995
  
 Merged build triggered.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151334250
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44387/
Test FAILed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151334153
  
**[Test build #44387 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44387/consoleFull)**
 for PR 5267 at commit 
[`165e191`](https://github.com/apache/spark/commit/165e191755bb641f10c604f1ec214248f5a104e7).
 * This patch **fails Spark unit tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151334247
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151313829
  
**[Test build #44384 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44384/consoleFull)**
 for PR 5267 at commit 
[`622499e`](https://github.com/apache/spark/commit/622499e95e2717625eecd11f5c43e0c9eddc820e).





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r43071613
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,744 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
+
+import breeze.linalg.{SparseVector => BSV, Vector => BV, norm => 
breezeNorm, sum => breezeSum}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Since
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bisecting 
k-means algorithm.
+ *
+ * The main idea of this algorithm is based on "A comparison of document 
clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 
2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * However, we modified it to fit Spark. This algorithm consists of two main parts:
+ *
+ * 1. Split clusters until there are enough clusters to build a cluster tree
+ * 2. Build a cluster tree as a binary tree from the split clusters
+ *
+ * First, it splits clusters into their child clusters step by step, without considering
+ * whether a cluster will be included in the final cluster tree. That makes the algorithm
+ * more efficient on Spark, since splitting clusters one by one is very slow. It keeps
+ * splitting until there are enough clusters to build a cluster tree; otherwise, it stops
+ * splitting when there are no divisible clusters left, even if the number of clusters is
+ * not yet sufficient. In this first part it also calculates the costs, such as average
+ * cost, entropy and so on, for building a cluster tree. The cost indicates how large a
+ * cluster is; that is, the cluster whose cost is the maximum of all the clusters is the
+ * largest cluster.
+ *
+ * Second, it builds a cluster tree as a binary tree from the result of the first part.
+ * The cluster tree starts with only the root cluster, which includes all points.
+ * So there are two candidates that can be merged into the cluster tree: the children of
+ * the root. It picks the larger child of the two and merges it into the cluster tree.
+ * After that, there are three candidates to merge: the smaller child of the root and
+ * the two children of the larger child of the root. It picks the largest cluster of the
+ * three and merges it into the cluster tree. It continues to pick the largest of the
+ * candidates and merge it into the cluster tree until the desired number of clusters is
+ * reached.
+ *
+ * @param k the desired number of clusters
+ * @param maxIterations the maximum number of iterations when splitting clusters
+ * @param seed a random seed
+ */
+@Since("1.6.0")
+class BisectingKMeans private (
+private var k: Int,
+private var maxIterations: Int,
+private var seed: Long) extends Logging {
+
+  import BisectingKMeans._
+
+  /**
+   * Constructs with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(20, 20, 1)
+
+  /**
+   * Sets the number of clusters you want
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+this.k = k
+this
+  }
+
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the maximum number of iterations in each clustering step
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
+this.maxIterations = maxIterations
+this
+  }
+
+  @Since("1.6.0")
+  def getMaxIterations: Int = this.maxIterations
+
+  /**
+   * Sets the random seed
+   */
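
The fluent setters quoted above follow the usual MLlib builder pattern. A minimal standalone sketch of the same pattern, with a stand-in `Config` class rather than the real `BisectingKMeans`:

```scala
// Stand-in for the configuration pattern in the quoted diff: a private primary
// constructor, a no-arg auxiliary constructor supplying defaults, and setters
// returning this.type so calls can be chained.
class Config private (var k: Int, var maxIterations: Int, var seed: Long) {
  def this() = this(2, 20, 1)  // default configuration, as in the patch
  def setK(k: Int): this.type = { this.k = k; this }
  def setMaxIterations(maxIterations: Int): this.type = {
    this.maxIterations = maxIterations
    this
  }
  def setSeed(seed: Long): this.type = { this.seed = seed; this }
}

// Chained configuration; unset fields keep their defaults.
val conf = new Config().setK(4).setMaxIterations(50)
```

Returning `this.type` (rather than `Config`) keeps the chain well-typed even if the class is later subclassed, which is why the quoted patch uses it.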
  

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151321462
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44384/
Test PASSed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r43071633
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,744 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
--- End diff --

`Map` is not needed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151321461
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-26 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-151326036
  
Merged build started.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150126146
  
Merged build started.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150126127
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150127788
  
**[Test build #44138 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44138/consoleFull)**
 for PR 5267 at commit 
[`3156dd7`](https://github.com/apache/spark/commit/3156dd7e0c4fc4f5d1b1116bb689592eee76ec25).





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150136686
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44138/
Test PASSed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150136685
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150136544
  
**[Test build #44138 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44138/consoleFull)**
 for PR 5267 at commit 
[`3156dd7`](https://github.com/apache/spark/commit/3156dd7e0c4fc4f5d1b1116bb689592eee76ec25).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821656
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
+
+import breeze.linalg
+  .{SparseVector => BSV, Vector => BV, any => breezeAny, norm => 
breezeNorm, sum => breezeSum}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Since
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bisecting 
k-means algorithm.
+ *
+ * The main idea of this algorithm is based on "A comparison of document 
clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 
2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * However, we modified it to fit Spark. This algorithm consists of two main parts:
+ *
+ * 1. Split clusters until there are enough clusters to build a cluster tree
+ * 2. Build a cluster tree as a binary tree from the split clusters
+ *
+ * First, it splits clusters into their child clusters step by step, without considering
+ * whether a cluster will be included in the final cluster tree. That makes the algorithm
+ * more efficient on Spark, since splitting clusters one by one is very slow. It keeps
+ * splitting until there are enough clusters to build a cluster tree; otherwise, it stops
+ * splitting when there are no divisible clusters left, even if the number of clusters is
+ * not yet sufficient. In this first part it also calculates the criteria, such as average
+ * cost, entropy and so on, for building a cluster tree. The criterion indicates how large
+ * a cluster is; that is, the cluster whose criterion is the maximum of all the clusters
+ * is the largest cluster.
+ *
+ * Second, it builds a cluster tree as a binary tree from the result of the first part.
+ * The cluster tree starts with only the root cluster, which includes all points.
+ * So there are two candidates that can be merged into the cluster tree: the children of
+ * the root. It picks the larger child of the two and merges it into the cluster tree.
+ * After that, there are three candidates to merge: the smaller child of the root and
+ * the two children of the larger child of the root. It picks the largest cluster of the
+ * three and merges it into the cluster tree. It continues to pick the largest of the
+ * candidates and merge it into the cluster tree until the desired number of clusters is
+ * reached.
+ *
+ * @param k the desired number of clusters
+ * @param clusterMap the pairs of clusters and their indices, as a Map
+ * @param maxIterations the maximum number of iterations when splitting clusters
+ * @param seed a random seed
+ */
+@Since("1.6.0")
+class BisectingKMeans private (
+private var k: Int,
+private var clusterMap: Map[BigInt, BisectingClusterNode],
+private var maxIterations: Int,
+private var seed: Long) extends Logging {
+
+  /**
+   * Constructs with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(20, mutable.ListMap.empty[BigInt, 
BisectingClusterNode], 20, 1)
+
+  /**
+   * Sets the number of clusters you want
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+this.k = k
+this
+  }
+
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the maximum number of iterations in each clustering step
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
+
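
The splitting phase in the first part bisects each cluster with a local 2-means step. A toy standalone illustration on one-dimensional points (the `bisect` helper is hypothetical and unrelated to the patch's actual code):

```scala
// Toy sketch of bisecting one cluster: seed two centers from random points,
// then alternate point assignment and center updates for maxIterations rounds.
def bisect(points: Seq[Double], maxIterations: Int, seed: Long): (Seq[Double], Seq[Double]) = {
  require(points.size >= 2, "need at least two points to bisect")
  val rnd = new scala.util.Random(seed)
  var centers = rnd.shuffle(points).take(2)
  var split: (Seq[Double], Seq[Double]) = (points, Seq.empty)
  for (_ <- 0 until maxIterations) {
    // assign each point to its nearest center; ties go to the first center
    split = points.partition(p => math.abs(p - centers.head) <= math.abs(p - centers(1)))
    // recompute each center as its cluster mean, keeping the old one if empty
    centers = centers.zip(Seq(split._1, split._2)).map {
      case (old, c) => if (c.isEmpty) old else c.sum / c.size
    }
  }
  split
}

// Two well-separated groups should end up in different child clusters.
val (left, right) = bisect(Seq(0.0, 0.1, 0.2, 10.0, 10.1, 10.2), 10, 42L)
```

For well-separated data like this, the assignment converges in a couple of rounds regardless of which two points seed the centers.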

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821607
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
+
+import breeze.linalg
+  .{SparseVector => BSV, Vector => BV, any => breezeAny, norm => 
breezeNorm, sum => breezeSum}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Since
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bisecting 
k-means algorithm.
+ *
+ * The main idea of this algorithm is based on "A comparison of document 
clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 
2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * However, we modified it to fit Spark. This algorithm consists of two main parts:
+ *
+ * 1. Split clusters until there are enough clusters to build a cluster tree
+ * 2. Build a cluster tree as a binary tree from the split clusters
+ *
+ * First, it splits clusters into their child clusters step by step, without considering
+ * whether a cluster will be included in the final cluster tree. That makes the algorithm
+ * more efficient on Spark, since splitting clusters one by one is very slow. It keeps
+ * splitting until there are enough clusters to build a cluster tree; otherwise, it stops
+ * splitting when there are no divisible clusters left, even if the number of clusters is
+ * not yet sufficient. In this first part it also calculates the criteria, such as average
+ * cost, entropy and so on, for building a cluster tree. The criterion indicates how large
+ * a cluster is; that is, the cluster whose criterion is the maximum of all the clusters
+ * is the largest cluster.
+ *
+ * Second, it builds a cluster tree as a binary tree from the result of the first part.
+ * The cluster tree starts with only the root cluster, which includes all points.
+ * So there are two candidates that can be merged into the cluster tree: the children of
+ * the root. It picks the larger child of the two and merges it into the cluster tree.
+ * After that, there are three candidates to merge: the smaller child of the root and
+ * the two children of the larger child of the root. It picks the largest cluster of the
+ * three and merges it into the cluster tree. It continues to pick the largest of the
+ * candidates and merge it into the cluster tree until the desired number of clusters is
+ * reached.
+ *
+ * @param k the desired number of clusters
+ * @param clusterMap the pairs of clusters and their indices, as a Map
+ * @param maxIterations the maximum number of iterations when splitting clusters
+ * @param seed a random seed
+ */
+@Since("1.6.0")
+class BisectingKMeans private (
+private var k: Int,
+private var clusterMap: Map[BigInt, BisectingClusterNode],
+private var maxIterations: Int,
+private var seed: Long) extends Logging {
+
+  /**
+   * Constructs with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(20, mutable.ListMap.empty[BigInt, 
BisectingClusterNode], 20, 1)
+
+  /**
+   * Sets the number of clusters you want
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+this.k = k
+this
+  }
+
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the maximum number of iterations in each clustering step
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
+

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821750
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
+
+import breeze.linalg
+  .{SparseVector => BSV, Vector => BV, any => breezeAny, norm => 
breezeNorm, sum => breezeSum}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Since
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bisecting 
k-means algorithm.
+ *
+ * The main idea of this algorithm is based on "A comparison of document 
clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 
2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * However, we modified it to fit Spark. This algorithm consists of two main parts:
+ *
+ * 1. Split clusters until there are enough clusters to build a cluster tree
+ * 2. Build a cluster tree as a binary tree from the split clusters
+ *
+ * First, it splits clusters into their child clusters step by step, without considering
+ * whether a cluster will be included in the final cluster tree. That makes the algorithm
+ * more efficient on Spark, since splitting clusters one by one is very slow. It keeps
+ * splitting until there are enough clusters to build a cluster tree; otherwise, it stops
+ * splitting when there are no divisible clusters left, even if the number of clusters is
+ * not yet sufficient. In this first part it also calculates the criteria, such as average
+ * cost, entropy and so on, for building a cluster tree. The criterion indicates how large
+ * a cluster is; that is, the cluster whose criterion is the maximum of all the clusters
+ * is the largest cluster.
+ *
+ * Second, it builds a cluster tree as a binary tree from the result of the first part.
+ * The cluster tree starts with only the root cluster, which includes all points.
+ * So there are two candidates that can be merged into the cluster tree: the children of
+ * the root. It picks the larger child of the two and merges it into the cluster tree.
+ * After that, there are three candidates to merge: the smaller child of the root and
+ * the two children of the larger child of the root. It picks the largest cluster of the
+ * three and merges it into the cluster tree. It continues to pick the largest of the
+ * candidates and merge it into the cluster tree until the desired number of clusters is
+ * reached.
+ *
+ * @param k the desired number of clusters
+ * @param clusterMap the pairs of clusters and their indices, as a Map
+ * @param maxIterations the maximum number of iterations when splitting clusters
+ * @param seed a random seed
+ */
+@Since("1.6.0")
+class BisectingKMeans private (
+private var k: Int,
+private var clusterMap: Map[BigInt, BisectingClusterNode],
+private var maxIterations: Int,
+private var seed: Long) extends Logging {
+
+  /**
+   * Constructs with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(20, mutable.ListMap.empty[BigInt, 
BisectingClusterNode], 20, 1)
+
+  /**
+   * Sets the number of clusters you want
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+this.k = k
+this
+  }
+
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the maximum number of iterations in each clustering step
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
+

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821747
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
+
+import breeze.linalg
+  .{SparseVector => BSV, Vector => BV, any => breezeAny, norm => 
breezeNorm, sum => breezeSum}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Since
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bisecting 
k-means algorithm.
+ *
+ * The main idea of this algorithm is based on "A comparison of document 
clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 
2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * However, we modified it to fit Spark. The algorithm consists of two main parts:
+ *
+ * 1. Split clusters until there are enough clusters to build a cluster tree
+ * 2. Build a cluster tree as a binary tree from the split clusters
+ *
+ * First, it splits clusters into child clusters level by level, without considering whether a
+ * cluster will end up in the final cluster tree. This makes the algorithm more efficient on
+ * Spark, since splitting clusters one at a time is very slow. It keeps splitting until there
+ * are enough clusters to build a cluster tree, or stops early if no divisible clusters remain.
+ * During this first part it also computes a criterion for each cluster, such as average cost or
+ * entropy, which is used later to build the cluster tree. The criterion measures how large a
+ * cluster is: the cluster whose criterion is the maximum of all clusters is the largest.
+ *
+ * Second, it builds the cluster tree as a binary tree from the result of the first part.
+ * The cluster tree starts with only the root cluster, which contains all points. At this
+ * point there are two candidates that can be merged into the tree: the children of the root.
+ * The algorithm picks the larger of the two and merges it into the tree. After that, there
+ * are three candidates: the smaller child of the root and the two children of the larger
+ * cluster. It picks the largest of these and merges it into the tree. It continues picking
+ * the largest remaining candidate and merging it into the tree until the desired number of
+ * clusters is reached.
+ *
+ * @param k the desired number of clusters
+ * @param clusterMap the pairs of cluster index and cluster node, as a Map
+ * @param maxIterations the maximum number of iterations to split clusters
+ * @param seed a random seed
+ */
+@Since("1.6.0")
+class BisectingKMeans private (
+private var k: Int,
+private var clusterMap: Map[BigInt, BisectingClusterNode],
+private var maxIterations: Int,
+private var seed: Long) extends Logging {
+
+  /**
+   * Constructs with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(20, mutable.ListMap.empty[BigInt, 
BisectingClusterNode], 20, 1)
+
+  /**
+   * Sets the desired number of clusters
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+this.k = k
+this
+  }
+
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the maximum number of iterations in each clustering step
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
+
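The two-part procedure described in the Scaladoc above can be sketched as a greedy loop: repeatedly bisect the cluster with the largest criterion until k clusters exist. This is an illustrative Python sketch, not the PR's Scala implementation; it folds the two parts together, uses within-cluster sum of squared distances as the criterion, and seeds each bisection deterministically instead of with a random seed — all of these are assumptions for the sake of a runnable example.

```python
import numpy as np

def kmeans2(points, iters=20):
    """Split a set of points into two clusters with plain k-means (k = 2).

    Centers are seeded deterministically with the first point and the point
    farthest from it, standing in for the random seed of the real algorithm.
    """
    far = int(np.argmax(np.linalg.norm(points - points[0], axis=1)))
    centers = np.stack([points[0], points[far]])
    for _ in range(iters):
        # assign every point to the nearer of the two centers
        dists = np.linalg.norm(points[:, None, :] - centers[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        if labels.min() == labels.max():      # one cluster swallowed the other
            break
        centers = np.stack([points[labels == j].mean(axis=0) for j in (0, 1)])
    return [points[labels == j] for j in (0, 1)]

def criterion(cluster):
    """'How large the cluster is': within-cluster sum of squared distances."""
    return float(((cluster - cluster.mean(axis=0)) ** 2).sum())

def bisecting_kmeans(points, k):
    """Greedy sketch: repeatedly split the cluster with the largest criterion."""
    clusters = [np.asarray(points, dtype=float)]
    while len(clusters) < k:
        i = max(range(len(clusters)), key=lambda j: criterion(clusters[j]))
        target = clusters.pop(i)
        halves = [h for h in kmeans2(target) if len(h) > 0]
        if len(halves) < 2:                   # cluster is not divisible; stop
            clusters.append(target)
            break
        clusters.extend(halves)
    return clusters
```

The Spark version instead splits all clusters level by level first (so the splits parallelize) and then assembles the binary tree from the precomputed criteria, but the greedy loop above yields the same leaf clusters when every split succeeds.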

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42820997
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42820980
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821019
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42820998
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821001
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821931
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeansModel.scala
 ---
@@ -0,0 +1,137 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import breeze.linalg.{Vector => BV, norm => breezeNorm}
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Since
+import org.apache.spark.api.java.JavaRDD
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * The model produced by the bisecting k-means algorithm.
+ *
+ * @param node the root node of the cluster tree
+ */
+@Since("1.6.0")
+class BisectingKMeansModel @Since("1.6.0") (
+@Since("1.6.0") val node: BisectingClusterNode
+  ) extends Serializable with Logging {
+
+  @Since("1.6.0")
+  def getClusters: Array[BisectingClusterNode] = this.node.getLeavesNodes
+
+  @Since("1.6.0")
+  def getCenters: Array[Vector] = this.getClusters.map(_.center)
+
+  /**
+   * Predicts the closest cluster for a single point
+   */
+  @Since("1.6.0")
+  def predict(vector: Vector): Int = {
--- End diff --

* We need to do this from root to leaf so we don't need to compare the 
point with all leaf clusters.
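The suggestion above — descending from the root rather than scanning every leaf — can be sketched as follows. This is an illustrative Python sketch, not the PR's Scala API; the `Node` fields (`center`, `index`, `children`) are hypothetical stand-ins for `BisectingClusterNode`.

```python
import numpy as np

class Node:
    """A cluster node in the binary cluster tree (hypothetical structure)."""
    def __init__(self, center, index, children=()):
        self.center = np.asarray(center, dtype=float)
        self.index = index          # leaf index; only meaningful for leaves
        self.children = list(children)

def predict(node, point):
    """Walk from the root to a leaf, following the nearer child at each level.

    This compares the point against at most 2 * depth centers rather than all
    leaf centers, i.e. O(log k) comparisons for a balanced tree of k leaves.
    """
    point = np.asarray(point, dtype=float)
    while node.children:
        node = min(node.children,
                   key=lambda child: np.linalg.norm(point - child.center))
    return node.index
```

Note the trade-off: the root-to-leaf path follows locally nearest centers, so in rare cases it can land on a leaf that is not the globally nearest one, which is the price of skipping the full scan.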


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821027
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
+
+import breeze.linalg
+  .{SparseVector => BSV, Vector => BV, any => breezeAny, norm => 
breezeNorm, sum => breezeSum}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Since
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bisecting 
k-means algorithm.
+ *
+ * The main idea of this algorithm is based on "A comparison of document 
clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 
2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * However, we modified it to fit Spark. This algorithm consists of two 
main parts.
+ *
+ * 1. Split clusters until there are enough clusters to build a cluster 
tree.
+ * 2. Build a cluster tree as a binary tree from the split clusters.
+ *
+ * First, it splits clusters into their child clusters step by step, without 
considering whether a
+ * cluster will be included in the final cluster tree. This makes the 
algorithm more efficient on
+ * Spark, because splitting clusters one at a time is very slow. It keeps 
splitting until there are
+ * enough clusters to build a cluster tree, or stops early when no dividable 
clusters remain. In
+ * this first part it also calculates the criteria used to build the cluster 
tree, such as average
+ * cost and entropy. A criterion measures how large a cluster is: the 
cluster whose criterion is
+ * the maximum over all clusters is the largest cluster.
+ *
+ * Second, it builds the cluster tree as a binary tree from the result of 
the first part.
+ * The cluster tree starts with only the root cluster, which includes all 
points, so at first
+ * there are two candidates that can be merged into the tree: the children 
of the root. It picks
+ * the larger of the two and merges it into the tree. After that, there are 
three candidates to
+ * merge: the smaller child of the root and the two children of the larger 
child. It continues to
+ * pick the largest of the candidates and merge it into the cluster tree 
until the desired number
+ * of clusters is reached.
+ *
+ * @param k the desired number of clusters
+ * @param clusterMap the pairs of clusters and their indices as a Map
+ * @param maxIterations the maximum number of iterations to split clusters
+ * @param seed a random seed
+ */
+@Since("1.6.0")
+class BisectingKMeans private (
+private var k: Int,
+private var clusterMap: Map[BigInt, BisectingClusterNode],
+private var maxIterations: Int,
+private var seed: Long) extends Logging {
+
+  /**
+   * Constructs with the default configuration
+   */
+  @Since("1.6.0")
+  def this() = this(20, mutable.ListMap.empty[BigInt, 
BisectingClusterNode], 20, 1)
+
+  /**
+   * Sets the desired number of clusters
+   */
+  @Since("1.6.0")
+  def setK(k: Int): this.type = {
+    this.k = k
+    this
+  }
+
+  @Since("1.6.0")
+  def getK: Int = this.k
+
+  /**
+   * Sets the maximum number of iterations in each clustering step
+   */
+  @Since("1.6.0")
+  def setMaxIterations(maxIterations: Int): this.type = {
+
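The second, tree-building part described in the scaladoc above can be sketched as follows. This is a minimal sketch under assumed names (`Node`, `criterion`, `selectLeaves` are not the PR's API): keep the current set of leaves, and repeatedly replace the splittable leaf with the largest criterion by its two children until the desired number of clusters is reached.

```scala
import scala.collection.mutable

// Hypothetical node: `criterion` measures how large the cluster is;
// `children` holds the two sub-clusters from the bisection (empty at leaves).
case class Node(criterion: Double, children: Seq[Node] = Seq.empty)

// Start with the root as the only leaf; repeatedly promote the largest
// splittable leaf by replacing it with its two children. Each promotion
// increases the leaf count by one, so the loop runs at most k - 1 times.
def selectLeaves(root: Node, k: Int): Seq[Node] = {
  val leaves = mutable.ArrayBuffer(root)
  while (leaves.size < k && leaves.exists(_.children.nonEmpty)) {
    val next = leaves.filter(_.children.nonEmpty).maxBy(_.criterion)
    leaves -= next
    leaves ++= next.children
  }
  leaves.toSeq
}
```

The loop also terminates early when no dividable (splittable) leaves remain, matching the stopping condition in the scaladoc.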

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821017
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
[Quoted diff identical to the earlier message; elided.]

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42820987
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
[Quoted diff identical to the earlier message; elided down to the commented lines.]
+@Since("1.6.0")
+class BisectingKMeans private (
+private var k: Int,
+private var clusterMap: Map[BigInt, BisectingClusterNode],
--- End diff --

* remove it from constructor args, move it to `run` method instead
* `BigInt` -> `Long`
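For context on the `BigInt` -> `Long` suggestion: a plain `Long` suffices if tree nodes are addressed the way binary-heap entries are, with the root at index 1 and the children of node i at 2i and 2i + 1. That addressing is an assumption here (though it is what the eventually merged Spark implementation uses), and it covers trees of depth up to 62, far beyond any practical k.

```scala
// Binary-heap-style node addressing: root = 1, children of i at 2i and 2i + 1.
// With a Long, indices stay valid down to tree depth 62.
def leftChildIndex(index: Long): Long = 2 * index
def rightChildIndex(index: Long): Long = 2 * index + 1
def parentIndex(index: Long): Long = index / 2
```

A `BigInt` key would also force boxing in every map lookup, so `Long` is both sufficient and cheaper.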





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821008
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
[Quoted diff identical to the earlier message; elided.]

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821012
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
[Quoted diff identical to the earlier message; elided.]

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821021
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
[Quoted diff identical to the earlier message; elided.]

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42820995
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
[Quoted diff identical to the earlier message; elided.]

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42820990
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.{Map, mutable}
+
+import breeze.linalg
+  .{SparseVector => BSV, Vector => BV, any => breezeAny, norm => 
breezeNorm, sum => breezeSum}
+
+import org.apache.spark.{Logging, SparkException}
+import org.apache.spark.annotation.Since
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.rdd.RDD
+
+
+/**
+ * This is a divisive hierarchical clustering algorithm based on bisecting 
k-means algorithm.
+ *
+ * The main idea of this algorithm is based on "A comparison of document 
clustering techniques",
+ * M. Steinbach, G. Karypis and V. Kumar. Workshop on Text Mining, KDD, 
2000.
+ * http://cs.fit.edu/~pkc/classes/ml-internet/papers/steinbach00tr.pdf
+ *
+ * However, we modified it to fit Spark. The algorithm consists of two main parts:
+ *
+ * 1. Split clusters until there are enough clusters to build a cluster tree
+ * 2. Build a cluster tree as a binary tree from the split clusters
+ *
+ * First, it splits clusters into their children step by step, without considering whether
+ * each cluster will be included in the final cluster tree. That is because splitting
+ * clusters one by one is very slow, and batching the splits makes the algorithm more
+ * efficient on Spark. It keeps splitting until there are enough clusters to build a
+ * cluster tree, or stops early when no dividable clusters remain. In this phase it also
+ * computes a criterion for each cluster, such as average cost or entropy, for use when
+ * building the tree. The criterion measures how large a cluster is: the cluster whose
+ * criterion is the maximum over all clusters is the largest cluster.
+ *
+ * Second, it builds a cluster tree as a binary tree from the result of the first part.
+ * The cluster tree starts with only the root cluster, which includes all points, so at
+ * first there are two candidates that can be merged into the tree: the two children of
+ * the root. It picks the larger of the two and merges it into the cluster tree. After
+ * that, there are three candidates: the smaller child of the root and the two children
+ * of the merged cluster. It continues to pick the largest of the candidates and merge it
+ * into the cluster tree until the desired number of clusters is reached.
+ *
+ * @param k the desired number of clusters
+ * @param clusterMap a Map of cluster indices to cluster nodes
+ * @param maxIterations the maximum number of iterations for splitting clusters
+ * @param seed a random seed
+ */
+@Since("1.6.0")
+class BisectingKMeans private (
+    private var k: Int,
+    private var clusterMap: Map[BigInt, BisectingClusterNode],
+    private var maxIterations: Int,
+    private var seed: Long) extends Logging {
+
--- End diff --

`import BisectingKMeans._`
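The tree-building loop described in the quoted Scaladoc (maintain a candidate set, always merge the largest candidate until there are k clusters) can be sketched in miniature. This is an editorial illustration in Python, not the PR's actual Scala code; the `criterion` field stands in for the average-cost or entropy measures the doc mentions:

```python
import heapq
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    criterion: float                 # "how large the cluster is" (e.g. average cost)
    left: Optional["Node"] = None    # children produced by the splitting phase
    right: Optional["Node"] = None

def build_tree_leaves(root: Node, k: int) -> int:
    """Greedily expand the tree until it has k leaf clusters; returns the leaf count."""
    leaves = 1                       # the tree starts with only the root cluster
    # max-heap of expandable candidates, keyed by negated criterion
    heap = [(-root.criterion, id(root), root)]
    while heap and leaves < k:
        _, _, node = heapq.heappop(heap)   # largest candidate cluster
        if node.left is None or node.right is None:
            continue                       # not dividable; cannot be expanded
        leaves += 1                        # one leaf is replaced by two children
        for child in (node.left, node.right):
            heapq.heappush(heap, (-child.criterion, id(child), child))
    return leaves
```

If no dividable candidates remain before k leaves are reached, the loop simply stops with fewer clusters, matching the early-stop behavior described above.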


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821004
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42821578
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/BisectingKMeans.scala ---
@@ -0,0 +1,690 @@

[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/5267#discussion_r42822007
  
--- Diff: 
mllib/src/test/java/org/apache/spark/mllib/clustering/JavaBisectingKMeansSuite.java
 ---
@@ -0,0 +1,123 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering;
+
+import com.google.common.collect.Lists;
+import org.apache.spark.api.java.JavaRDD;
--- End diff --

reorganize imports





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150285114
  
**[Test build #44157 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44157/consoleFull)**
 for PR 5267 at commit 
[`31623ea`](https://github.com/apache/spark/commit/31623ead10d4664c7c5d98b5c72731bcc804bcba).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel @Since("1.6.0") (`





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150285291
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150285292
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44157/
Test PASSed.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150272472
  
**[Test build #44157 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44157/consoleFull)**
 for PR 5267 at commit 
[`31623ea`](https://github.com/apache/spark/commit/31623ead10d4664c7c5d98b5c72731bcc804bcba).





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150270427
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-10-22 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-150270450
  
Merged build started.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-07-30 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-126521482
  
@freeman-lab It's alright. Thanks!





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-07-30 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-126530412
  
@mengxr thank you for your feedback. I'll modify them ASAP.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-07-30 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-126519990
  
@mengxr @freeman-lab where are we with this PR? I really appreciate your support.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-07-30 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-126527910
  
@yu-iskw I made one pass, but I think we need more time on this PR. Some 
high-level comments:

1. Documentation is not sufficient. For example, the criterion used to 
split clusters is not documented. We only mention that the implementation is 
based on the paper, but we didn't describe the algorithm nor how it differs 
from the algorithm used in the paper. There are also several public methods 
without JavaDoc.
2. Many methods should be private rather than package private. If we want 
to test some utility functions, we can put them under a utility class/object 
and mark the object package private but methods under it public. Otherwise, it 
is hard to read the unit tests.
3. This PR adds many public APIs, especially on tree cluster nodes. We need 
to discuss the names as well as how to keep them minimal.
4. Correctness. I didn't check the details, but the variance calculation is 
wrong (line 337).
5. Code style issues.

For 1), could you add more description about the algorithms? If I have some 
time before 1.5, I can send you a PR to fix some issues. But we may miss the 
1.5 release.
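Point 4 above flags the variance calculation as wrong. For reference, a numerically stable way to compute a cluster's variance in a single pass is Welford's online algorithm; the sketch below is a generic illustration, not the PR's line-337 code (which is not quoted here):

```python
def cluster_variance(points):
    """Welford's one-pass algorithm: population variance of a 1-D cluster."""
    count, mean, m2 = 0, 0.0, 0.0
    for x in points:
        count += 1
        delta = x - mean
        mean += delta / count
        m2 += delta * (x - mean)   # accumulates the sum of squared deviations
    return m2 / count if count else 0.0
```

Unlike the naive `E[x^2] - E[x]^2` formula, this avoids catastrophic cancellation when the mean is large relative to the spread, which is a common source of subtly wrong variance code.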





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-07-30 Thread freeman-lab
Github user freeman-lab commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-126520577
  
sorry for the delay @yu-iskw, i'm going through it today, comments soon





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-07-29 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-125866875
  
@FlytxtRnD thank you for reviewing it. I have modified them.





[GitHub] spark pull request: [SPARK-6517][mllib] Implement the Algorithm of...

2015-07-29 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/5267#issuecomment-125856449
  
Merged build started.




