[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-206350081
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55108/
Test PASSed.


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-206350080
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-206349899
  
**[Test build #55108 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55108/consoleFull)** for PR 6880 at commit [`c25eae2`](https://github.com/apache/spark/commit/c25eae2eacacf867c666d730e05bc6daa3fe7a78).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-206336389
  
**[Test build #55108 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55108/consoleFull)** for PR 6880 at commit [`c25eae2`](https://github.com/apache/spark/commit/c25eae2eacacf867c666d730e05bc6daa3fe7a78).





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-206169860
  
**[Test build #55098 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55098/consoleFull)** for PR 6880 at commit [`23316d4`](https://github.com/apache/spark/commit/23316d4b6abab5e1f23ae991a22722d7629e9ab2).
 * This patch **fails to build**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-206169896
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/55098/
Test FAILed.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-06 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-206169890
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-06 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-206166387
  
**[Test build #55098 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/55098/consoleFull)** for PR 6880 at commit [`23316d4`](https://github.com/apache/spark/commit/23316d4b6abab5e1f23ae991a22722d7629e9ab2).





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-204633940
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-01 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-204633941
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/54751/
Test FAILed.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-204633939
  
**[Test build #54751 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54751/consoleFull)** for PR 6880 at commit [`b088e46`](https://github.com/apache/spark/commit/b088e460cebf18f2e169661fd9c598e45a0f1bd1).
 * This patch **fails Scala style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2016-04-01 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-204633686
  
**[Test build #54751 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/54751/consoleFull)** for PR 6880 at commit [`b088e46`](https://github.com/apache/spark/commit/b088e460cebf18f2e169661fd9c598e45a0f1bd1).





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-13 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-156399657
  
@yu-iskw @jkbradley Any other review comments, please?





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-03 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-153564736
  
@yu-iskw I didn't understand your comment on the @Since tags. We will wait for further review comments.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-03 Thread yu-iskw
Github user yu-iskw commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-153409313
  
@FlytxtRnD thank you for the update. We should add `@since` tags in the first commit.
Btw, I haven't read the original paper carefully yet. I'll review this PR in terms of the algorithm later.
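
For context, Spark's `@Since` annotation lives in `org.apache.spark.annotation` and records the release in which a public API first appeared. A minimal sketch of what such tags could look like on this PR's model class follows; the constructor shape and the version string `"1.6.0"` are illustrative assumptions, not taken from the patch:

```scala
import org.apache.spark.annotation.Since
import org.apache.spark.mllib.linalg.Vector

// Illustrative only: the actual version would be fixed when the PR is merged.
class DpMeansModel @Since("1.6.0") (
    @Since("1.6.0") val clusterCenters: Array[Vector]) extends Serializable {

  /** Number of clusters in the model. */
  @Since("1.6.0")
  def k: Int = clusterCenters.length
}
```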





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-153337608
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-153337609
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44917/
Test PASSed.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-153337507
  
**[Test build #44917 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44917/consoleFull)** for PR 6880 at commit [`b088e46`](https://github.com/apache/spark/commit/b088e460cebf18f2e169661fd9c598e45a0f1bd1).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class Params(`
  * `class DpMeansModel(`





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-03 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-153329563
  
@yu-iskw The PR is updated. Shall I add @since tags to the methods, or is that done after the PR is merged? Please share any other suggestions.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-03 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-153327592
  
**[Test build #44917 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/44917/consoleFull)** for PR 6880 at commit [`b088e46`](https://github.com/apache/spark/commit/b088e460cebf18f2e169661fd9c598e45a0f1bd1).





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-153327399
  
Merged build started.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-03 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-153327369
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-02 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r43697828
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,279 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs the cluster creation
+ * process, based on the DP-means algorithm, iterating until the maximum
+ * number of iterations is reached or the convergence criterion is satisfied.
+ * With the current global set of centers, it locally creates a new cluster
+ * centered at `x` whenever it encounters an uncovered data point `x`. In a
+ * similar manner, a local cluster center is promoted to a global center
+ * whenever an uncovered local cluster center is found. A data point is said
+ * to be "covered" by a cluster `c` if the distance from the point to the
+ * cluster center of `c` is less than a given lambda value.
+ *
+ * The original paper is "Revisiting k-means: New Algorithms via Bayesian
+ * Nonparametrics" by Brian Kulis and Michael I. Jordan. This implementation
+ * is based on "MLbase: Distributed Machine Learning Made Easy" by Xinghao
+ * Pan, Evan R. Sparks, and Andre Wibisono.
+ *
+ * @param lambdaValue Distance value that controls cluster creation.
+ * @param convergenceTol The threshold value at which convergence is considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ * // TODO
+ * @param maxClusterCount The maximum expected number of clusters.
+ */
+@Experimental
+class DpMeans private (
+    private var lambdaValue: Double,
+    private var convergenceTol: Double,
+    private var maxIterations: Int,
+    private var maxClusterCount: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance with default parameters: {lambdaValue: 1,
+   * convergenceTol: 0.01, maxIterations: 20, maxClusterCount: 1000}.
+   */
+  def this() = this(1, 0.01, 20, 1000)
+
+  /** Sets the value for the lambda parameter, which controls cluster creation. Default: 1 */
+  def setLambdaValue(lambdaValue: Double): this.type = {
+    this.lambdaValue = lambdaValue
+    this
+  }
+
+  /** Returns the lambda value. */
+  def getLambdaValue: Double = lambdaValue
+
+  /** Sets the threshold value at which convergence is considered to have occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+    this.convergenceTol = convergenceTol
+    this
+  }
+
+  /** Returns the convergence threshold value. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Sets the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+    this.maxIterations = maxIterations
+    this
+  }
+
+  /** Returns the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /** Sets the maximum number of clusters expected. Default: 1000 */
+  def setMaxClusterCount(maxClusterCount: Int): this.type = {
+    this.maxClusterCount = maxClusterCount
+    this
+  }
+
+  /** Returns the maximum number of clusters expected. */
+  def getMaxClusterCount: Int = maxClusterCount
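
The coverage rule in the scaladoc above is what drives all cluster creation. Below is a minimal, self-contained Scala sketch of that local step under the stated rule; `DpMeansSketch`, `Point`, and `localDpMeansStep` are illustrative names, not code from this PR:

```scala
import scala.collection.mutable.ArrayBuffer

object DpMeansSketch {
  type Point = Array[Double]

  // Euclidean distance between a point and a candidate center.
  private def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  /**
   * One local pass over the data: a point "covered" by an existing center
   * (distance to some center < lambda) needs no new cluster; an uncovered
   * point becomes the center of a new local cluster.
   */
  def localDpMeansStep(
      points: Seq[Point],
      globalCenters: Seq[Point],
      lambda: Double): Seq[Point] = {
    val localCenters = ArrayBuffer(globalCenters: _*)
    points.foreach { x =>
      val covered = localCenters.exists(c => dist(c, x) < lambda)
      if (!covered) localCenters += x // start a new cluster centered at x
    }
    localCenters.drop(globalCenters.length).toSeq // only the newly created centers
  }
}
```

In the PR itself this pass runs over an RDD with the current global set of centers, and uncovered local centers are then promoted to global ones; a caller would configure a run through the chained setters quoted above, e.g. `new DpMeans().setLambdaValue(2.0).setMaxIterations(50)`.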

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-02 Thread FlytxtRnD
Github user FlytxtRnD commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r43622644
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-01 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-152919710
  
Thank you @yu-iskw for the review comments. Will update the PR ASAP.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-01 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r43591908
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-01 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r43591894
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-01 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r43591263
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-01 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r43590992
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-11-01 Thread yu-iskw
Github user yu-iskw commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r43590450
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-10-19 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-149156272
  
@mengxr Could you please take a look at this?





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-27 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-143643138
  
@mengxr @jkbradley I have incorporated the suggested changes and updated the PR. Could you please take another look?





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-142535150
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-142535152
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42899/
Test PASSed.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-142535047
  
[Test build #42899 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42899/console) for PR 6880 at commit [`e796866`](https://github.com/apache/spark/commit/e7968668c63730cf41c6a0f756853560a073894a).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class Params(`
  * `  case class WeightedPoint(vector: Vector, count: Long)`
  * `class DpMeansModel(`






[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-23 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-142527071
  
[Test build #42899 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/42899/consoleFull) for PR 6880 at commit [`e796866`](https://github.com/apache/spark/commit/e7968668c63730cf41c6a0f756853560a073894a).





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-142526576
  
Merged build started.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-23 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-142526560
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-17 Thread FlytxtRnD
Github user FlytxtRnD commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r39827805
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs the cluster creation
+ * process based on the DP means algorithm, iterating until the maximum
+ * number of iterations is reached or the convergence criterion is
+ * satisfied. With the current global set of centers, it locally creates
+ * a new cluster centered at `x` whenever it encounters an uncovered data
+ * point `x`. In a similar manner, a local cluster center is promoted to
+ * a global center whenever an uncovered local cluster center is found.
+ * A data point is said to be "covered" by a cluster `c` if the distance
+ * from the point to the cluster center of `c` is less than a given
+ * lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy"
+ * by Xinghao Pan, Evan R. Sparks, and Andre Wibisono.
+ *
+ * @param lambda The distance threshold value that controls cluster creation.
+ * @param convergenceTol The threshold value at which convergence is considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+@Experimental
+class DpMeans private (
+    private var lambda: Double,
+    private var convergenceTol: Double,
+    private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance. The default parameters are {lambda: 1,
+   * convergenceTol: 0.01, maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Return the distance threshold that controls cluster creation. Default: 1 */
+  def getLambda(): Double = lambda
+
+  /** Set the distance threshold that controls cluster creation. */
+  def setLambda(lambda: Double): this.type = {
+    this.lambda = lambda
+    this
+  }
+
+  /** Set the threshold value at which convergence is considered to have occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+    this.convergenceTol = convergenceTol
+    this
+  }
+
+  /** Return the threshold value at which convergence is considered to have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+    this.maxIterations = maxIterations
+    this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering.
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+    if (data.getStorageLevel == StorageLevel.NONE) {
+      logWarning("The input data is not directly cached, which may hurt performance if its"
+        + " parent RDDs are also uncached.")
+    }
+
+    // Compute norms and cache them.
+    val norms = data.map(Vectors.norm(_, 2.0))
+    norms.persist()
+    val zippedData = data.zip(norms).map {
+      case (v, norm) => new VectorWithNorm(v, norm)
+    }
+
+// Implementation of DP mean

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-11 Thread FlytxtRnD
Github user FlytxtRnD commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r39255547
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs the cluster creation
+ * process based on the DP means algorithm, iterating until the maximum
+ * number of iterations is reached or the convergence criterion is
+ * satisfied. With the current global set of centers, it locally creates
+ * a new cluster centered at `x` whenever it encounters an uncovered data
+ * point `x`. In a similar manner, a local cluster center is promoted to
+ * a global center whenever an uncovered local cluster center is found.
+ * A data point is said to be "covered" by a cluster `c` if the distance
+ * from the point to the cluster center of `c` is less than a given
+ * lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy"
+ * by Xinghao Pan, Evan R. Sparks, and Andre Wibisono.
+ *
+ * @param lambda The distance threshold value that controls cluster creation.
+ * @param convergenceTol The threshold value at which convergence is considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+@Experimental
+class DpMeans private (
+    private var lambda: Double,
+    private var convergenceTol: Double,
+    private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance. The default parameters are {lambda: 1,
+   * convergenceTol: 0.01, maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Return the distance threshold that controls cluster creation. Default: 1 */
+  def getLambda(): Double = lambda
+
+  /** Set the distance threshold that controls cluster creation. */
+  def setLambda(lambda: Double): this.type = {
+    this.lambda = lambda
+    this
+  }
+
+  /** Set the threshold value at which convergence is considered to have occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+    this.convergenceTol = convergenceTol
+    this
+  }
+
+  /** Return the threshold value at which convergence is considered to have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+    this.maxIterations = maxIterations
+    this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering.
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+    if (data.getStorageLevel == StorageLevel.NONE) {
+      logWarning("The input data is not directly cached, which may hurt performance if its"
+        + " parent RDDs are also uncached.")
+    }
+
+    // Compute norms and cache them.
+    val norms = data.map(Vectors.norm(_, 2.0))
+    norms.persist()
+    val zippedData = data.zip(norms).map {
+      case (v, norm) => new VectorWithNorm(v, norm)
+    }
+
+// Implementation of DP mean

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-10 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-139170303
  
@mengxr Thank you for all the suggestions. Will update soon





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38974586
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs the cluster creation
+ * process based on the DP means algorithm, iterating until the maximum
+ * number of iterations is reached or the convergence criterion is
+ * satisfied. With the current global set of centers, it locally creates
+ * a new cluster centered at `x` whenever it encounters an uncovered data
+ * point `x`. In a similar manner, a local cluster center is promoted to
+ * a global center whenever an uncovered local cluster center is found.
+ * A data point is said to be "covered" by a cluster `c` if the distance
+ * from the point to the cluster center of `c` is less than a given
+ * lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy"
+ * by Xinghao Pan, Evan R. Sparks, and Andre Wibisono.
+ *
+ * @param lambda The distance threshold value that controls cluster creation.
+ * @param convergenceTol The threshold value at which convergence is considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+@Experimental
+class DpMeans private (
+    private var lambda: Double,
+    private var convergenceTol: Double,
+    private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance. The default parameters are {lambda: 1,
+   * convergenceTol: 0.01, maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Return the distance threshold that controls cluster creation. Default: 1 */
+  def getLambda(): Double = lambda
+
+  /** Set the distance threshold that controls cluster creation. */
+  def setLambda(lambda: Double): this.type = {
+    this.lambda = lambda
+    this
+  }
+
+  /** Set the threshold value at which convergence is considered to have occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+    this.convergenceTol = convergenceTol
+    this
+  }
+
+  /** Return the threshold value at which convergence is considered to have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+    this.maxIterations = maxIterations
+    this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering.
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+    if (data.getStorageLevel == StorageLevel.NONE) {
+      logWarning("The input data is not directly cached, which may hurt performance if its"
+        + " parent RDDs are also uncached.")
+    }
+
+    // Compute norms and cache them.
+    val norms = data.map(Vectors.norm(_, 2.0))
+    norms.persist()
+    val zippedData = data.zip(norms).map {
+      case (v, norm) => new VectorWithNorm(v, norm)
+    }
+
+// Implementation of DP means a

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-138687534
  
@FlytxtRnD I made another pass. Please follow the code style guide closely: 
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide. I 
will make another pass on the implementation after your updates.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38953908
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/DpMeansSuite.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.util.Utils
+
+class DpMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  test("single cluster") {
+val data = sc.parallelize(Array(
+  Vectors.dense(0.1, 0.5, 0.7),
+  Vectors.dense(0.2, 0.6, 0.8),
+  Vectors.dense(0.3, 0.7, 0.9)
+))
+val center = Vectors.dense(0.2, 0.6, 0.8)
+
+val model = new DpMeans().setLambda(2).setConvergenceTol(1).run(data)
+assert(model.clusterCenters.head ~== center absTol 1E-5)
+  }
+
+  test("two clusters") {
+val data = sc.parallelize(DpMeansSuite.data)
+val model = new DpMeans().setLambda(12).setConvergenceTol(1).run(data)
+val predictedClusters = model.predict(data).collect()
+
+assert(predictedClusters(0) === predictedClusters(1))
+assert(predictedClusters(0) === predictedClusters(2))
+assert(predictedClusters(6) === predictedClusters(14))
+assert(predictedClusters(8) === predictedClusters(9))
+assert(predictedClusters(0) != predictedClusters(7))
+  }
+
+  test("single cluster with sparse data") {
+val n = 3
+val data = sc.parallelize((1 to 100).flatMap { i =>
+  val x = i / 1000.0
+  Array(
+Vectors.sparse(n, Seq((0, 1.0 + x), (1, 2.0), (2, 6.0))),
+Vectors.sparse(n, Seq((0, 1.0 - x), (1, 2.0), (2, 6.0))),
+Vectors.sparse(n, Seq((0, 1.0), (1, 3.0 + x))),
+Vectors.sparse(n, Seq((0, 1.0), (1, 3.0 - x))),
+Vectors.sparse(n, Seq((0, 1.0), (1, 4.0), (2, 6.0 + x))),
+Vectors.sparse(n, Seq((0, 1.0), (1, 4.0), (2, 6.0 - x)))
+  )
+}, 4)
+data.persist()
+
+val center = Vectors.sparse(n, Seq((0, 1.0), (1, 3.0), (2, 4.0)))
+
+val model = new DpMeans().setLambda(40).setConvergenceTol(1).run(data)
+assert(model.clusterCenters.head == center)
+  }
+
+  object DpMeansSuite extends SparkFunSuite{
+
+val data = Array(
+ Vectors.dense(-5.1971), Vectors.dense(-2.5359), 
Vectors.dense(-3.8220),
+ Vectors.dense(-5.2211), Vectors.dense(-5.0602), Vectors.dense( 
-4.7118),
--- End diff --

remove space before `-4.7118`





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38953865
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/DpMeansSuite.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.util.Utils
+
+class DpMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  test("single cluster") {
+val data = sc.parallelize(Array(
+  Vectors.dense(0.1, 0.5, 0.7),
+  Vectors.dense(0.2, 0.6, 0.8),
+  Vectors.dense(0.3, 0.7, 0.9)
+))
+val center = Vectors.dense(0.2, 0.6, 0.8)
+
+val model = new DpMeans().setLambda(2).setConvergenceTol(1).run(data)
+assert(model.clusterCenters.head ~== center absTol 1E-5)
+  }
+
+  test("two clusters") {
+val data = sc.parallelize(DpMeansSuite.data)
+val model = new DpMeans().setLambda(12).setConvergenceTol(1).run(data)
+val predictedClusters = model.predict(data).collect()
+
+assert(predictedClusters(0) === predictedClusters(1))
+assert(predictedClusters(0) === predictedClusters(2))
+assert(predictedClusters(6) === predictedClusters(14))
+assert(predictedClusters(8) === predictedClusters(9))
+assert(predictedClusters(0) != predictedClusters(7))
+  }
+
+  test("single cluster with sparse data") {
+val n = 3
+val data = sc.parallelize((1 to 100).flatMap { i =>
+  val x = i / 1000.0
+  Array(
+Vectors.sparse(n, Seq((0, 1.0 + x), (1, 2.0), (2, 6.0))),
+Vectors.sparse(n, Seq((0, 1.0 - x), (1, 2.0), (2, 6.0))),
+Vectors.sparse(n, Seq((0, 1.0), (1, 3.0 + x))),
+Vectors.sparse(n, Seq((0, 1.0), (1, 3.0 - x))),
+Vectors.sparse(n, Seq((0, 1.0), (1, 4.0), (2, 6.0 + x))),
+Vectors.sparse(n, Seq((0, 1.0), (1, 4.0), (2, 6.0 - x)))
+  )
+}, 4)
+data.persist()
+
+val center = Vectors.sparse(n, Seq((0, 1.0), (1, 3.0), (2, 4.0)))
+
+val model = new DpMeans().setLambda(40).setConvergenceTol(1).run(data)
+assert(model.clusterCenters.head == center)
+  }
+
+  object DpMeansSuite extends SparkFunSuite{
--- End diff --

remove `extends SparkFunSuite`





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38953796
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/DpMeansSuite.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.util.Utils
+
+class DpMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  test("single cluster") {
+val data = sc.parallelize(Array(
+  Vectors.dense(0.1, 0.5, 0.7),
+  Vectors.dense(0.2, 0.6, 0.8),
+  Vectors.dense(0.3, 0.7, 0.9)
+))
+val center = Vectors.dense(0.2, 0.6, 0.8)
+
+val model = new DpMeans().setLambda(2).setConvergenceTol(1).run(data)
+assert(model.clusterCenters.head ~== center absTol 1E-5)
+  }
+
+  test("two clusters") {
+val data = sc.parallelize(DpMeansSuite.data)
+val model = new DpMeans().setLambda(12).setConvergenceTol(1).run(data)
+val predictedClusters = model.predict(data).collect()
+
+assert(predictedClusters(0) === predictedClusters(1))
+assert(predictedClusters(0) === predictedClusters(2))
+assert(predictedClusters(6) === predictedClusters(14))
+assert(predictedClusters(8) === predictedClusters(9))
+assert(predictedClusters(0) != predictedClusters(7))
--- End diff --

We should add tests to check the in-cluster distances are indeed smaller 
than `lambda`.
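
A minimal sketch of such a test, assuming the `DpMeans`/`DpMeansModel` API quoted in this diff and the suite's `DpMeansSuite.data` fixture (the lambda value is borrowed from the existing "two clusters" test):

~~~scala
test("in-cluster distances are smaller than lambda") {
  val lambda = 12.0
  val data = sc.parallelize(DpMeansSuite.data)
  val model = new DpMeans().setLambda(lambda).setConvergenceTol(1.0).run(data)
  val assignments = model.predict(data).collect()
  data.collect().zip(assignments).foreach { case (point, cluster) =>
    // "Covered" means the distance to the assigned center stays below lambda.
    val center = model.clusterCenters(cluster)
    assert(math.sqrt(Vectors.sqdist(point, center)) < lambda)
  }
}
~~~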





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38953485
  
--- Diff: 
mllib/src/test/scala/org/apache/spark/mllib/clustering/DpMeansSuite.scala ---
@@ -0,0 +1,84 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import org.apache.spark.SparkFunSuite
+import org.apache.spark.mllib.linalg.{DenseVector, SparseVector, Vectors}
+import org.apache.spark.mllib.util.MLlibTestSparkContext
+import org.apache.spark.mllib.util.TestingUtils._
+import org.apache.spark.util.Utils
+
+class DpMeansSuite extends SparkFunSuite with MLlibTestSparkContext {
+
+  test("single cluster") {
+val data = sc.parallelize(Array(
+  Vectors.dense(0.1, 0.5, 0.7),
+  Vectors.dense(0.2, 0.6, 0.8),
+  Vectors.dense(0.3, 0.7, 0.9)
+))
+val center = Vectors.dense(0.2, 0.6, 0.8)
+
+val model = new DpMeans().setLambda(2).setConvergenceTol(1).run(data)
--- End diff --

use `2.0` and `1.0` for float values
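
Applied to the quoted line, that would read:

~~~scala
val model = new DpMeans().setLambda(2.0).setConvergenceTol(1.0).run(data)
~~~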





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38953312
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeansModel.scala ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * A clustering model for DP means. Each point belongs to the cluster with 
the closest center.
+ */
+class DpMeansModel
+(val clusterCenters: Array[Vector]) extends Serializable {
+
+  /** A Java-friendly constructor that takes an Iterable of Vectors. */
+  def this(centers: java.lang.Iterable[Vector]) = 
this(centers.asScala.toArray)
+
+  /** Total number of clusters obtained. */
+  def k: Int = clusterCenters.length
+
+  /** Returns the cluster index that a given point belongs to. */
+  def predict(point: Vector): Int = {
+val centersWithNorm = clusterCentersWithNorm
+DpMeans.assignCluster(centersWithNorm.to[mutable.ArrayBuffer], new 
VectorWithNorm(point))._1
+  }
+
+  /** Maps the points in the given RDD to their closest cluster indices. */
+  def predict(points: RDD[Vector]): RDD[Int] = {
+val centersWithNorm = clusterCentersWithNorm
+val bcCentersWithNorm = points.context.broadcast(centersWithNorm)
+points.map(p => 
DpMeans.assignCluster(bcCentersWithNorm.value.to[mutable.ArrayBuffer],
+ new VectorWithNorm(p))._1)
--- End diff --

fix indentation and break lines to make it easier to read, e.g.

~~~scala
points.map { p =>
  DpMeans.assignCluster(
    bcCentersWithNorm.value.to[mutable.ArrayBuffer], new VectorWithNorm(p)
  )._1
}
~~~
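
A related sketch (an aside, not part of the review comment above): the `.to[mutable.ArrayBuffer]` conversion in the quoted code runs once per record inside the closure, so it can be hoisted to broadcast time:

~~~scala
val bcCenters = points.context.broadcast(centersWithNorm.to[mutable.ArrayBuffer])
points.map { p =>
  DpMeans.assignCluster(bcCenters.value, new VectorWithNorm(p))._1
}
~~~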





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952962
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeansModel.scala ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * A clustering model for DP means. Each point belongs to the cluster with 
the closest center.
+ */
+class DpMeansModel
+(val clusterCenters: Array[Vector]) extends Serializable {
--- End diff --

if this line doesn't fit into L29, we should move `(` to L29:

~~~scala
class DpMeansModel(
    clusterCenters: Array[Vector]) extends Serializable {
  ...
}
~~~

Btw, it is also useful to save `lambda` in the model.
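
A hedged sketch combining both points (making `lambda` a constructor `val` is one option, not necessarily the author's intent):

~~~scala
import org.apache.spark.mllib.linalg.Vector

class DpMeansModel(
    val clusterCenters: Array[Vector],
    val lambda: Double) extends Serializable {

  /** Total number of clusters obtained. */
  def k: Int = clusterCenters.length
}
~~~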





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952794
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs the cluster creation
+ * process based on the DP means algorithm, iterating until the maximum
+ * number of iterations is reached or the convergence criterion is
+ * satisfied. With the current global set of centers, it locally creates
+ * a new cluster centered at `x` whenever it encounters an uncovered data
+ * point `x`. In a similar manner, a local cluster center is promoted to
+ * a global center whenever an uncovered local cluster center is found.
+ * A data point is said to be "covered" by a cluster `c` if the distance
+ * from the point to the cluster center of `c` is less than a given
+ * lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy"
+ * by Xinghao Pan, Evan R. Sparks, and Andre Wibisono.
+ *
+ * @param lambda The distance threshold value that controls cluster creation.
+ * @param convergenceTol The threshold value at which convergence is considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+@Experimental
+class DpMeans private (
+    private var lambda: Double,
+    private var convergenceTol: Double,
+    private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance. The default parameters are {lambda: 1,
+   * convergenceTol: 0.01, maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Return the distance threshold that controls cluster creation. Default: 1 */
+  def getLambda(): Double = lambda
+
+  /** Set the distance threshold that controls cluster creation. */
+  def setLambda(lambda: Double): this.type = {
+    this.lambda = lambda
+    this
+  }
+
+  /** Set the threshold value at which convergence is considered to have occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+    this.convergenceTol = convergenceTol
+    this
+  }
+
+  /** Return the threshold value at which convergence is considered to have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+    this.maxIterations = maxIterations
+    this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering.
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+    if (data.getStorageLevel == StorageLevel.NONE) {
+      logWarning("The input data is not directly cached, which may hurt performance if its"
+        + " parent RDDs are also uncached.")
+    }
+
+    // Compute norms and cache them.
+    val norms = data.map(Vectors.norm(_, 2.0))
+    norms.persist()
+    val zippedData = data.zip(norms).map {
+      case (v, norm) => new VectorWithNorm(v, norm)
+    }
+
+// Implementation of DP means a

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952984
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeansModel.scala ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * A clustering model for DP means. Each point belongs to the cluster with 
the closest center.
+ */
+class DpMeansModel
+(val clusterCenters: Array[Vector]) extends Serializable {
+
+  /** A Java-friendly constructor that takes an Iterable of Vectors. */
+  def this(centers: java.lang.Iterable[Vector]) = 
this(centers.asScala.toArray)
+
+  /** Total number of clusters obtained. */
+  def k: Int = clusterCenters.length
+
+  /** Returns the cluster index that a given point belongs to. */
+  def predict(point: Vector): Int = {
+val centersWithNorm = clusterCentersWithNorm
+DpMeans.assignCluster(centersWithNorm.to[mutable.ArrayBuffer], new 
VectorWithNorm(point))._1
+  }
+
+  /** Maps the points in the given RDD to their closest cluster indices. */
+  def predict(points: RDD[Vector]): RDD[Int] = {
+val centersWithNorm = clusterCentersWithNorm
+val bcCentersWithNorm = points.context.broadcast(centersWithNorm)
+points.map(p => 
DpMeans.assignCluster(bcCentersWithNorm.value.to[mutable.ArrayBuffer],
+ new VectorWithNorm(p))._1)
+  }
+
+  /**
+   * Return the cost (sum of squared distances of points to their nearest 
center) for this
+   * model on the given data.
+   */
+  def computeCost(data: RDD[Vector]): Double = {
+val centersWithNorm = clusterCentersWithNorm
+val bcCentersWithNorm = data.context.broadcast(centersWithNorm)
+data.map(p => 
DpMeans.assignCluster(bcCentersWithNorm.value.to[mutable.ArrayBuffer],
+new VectorWithNorm(p))._2).sum()
+  }
+
+  private def clusterCentersWithNorm: Iterable[VectorWithNorm] =
+clusterCenters.map(new VectorWithNorm(_))
+
+}
+
--- End diff --

remove empty line





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952809
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs the cluster creation
+ * process based on the DP means algorithm, iterating until the maximum
+ * number of iterations is reached or the convergence criterion is
+ * satisfied. With the current global set of centers, it locally creates
+ * a new cluster centered at `x` whenever it encounters an uncovered data
+ * point `x`. In a similar manner, a local cluster center is promoted to
+ * a global center whenever an uncovered local cluster center is found.
+ * A data point is said to be "covered" by a cluster `c` if the distance
+ * from the point to the cluster center of `c` is less than a given
+ * lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy"
+ * by Xinghao Pan, Evan R. Sparks, and Andre Wibisono.
+ *
+ * @param lambda The distance threshold value that controls cluster creation.
+ * @param convergenceTol The threshold value at which convergence is considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+@Experimental
+class DpMeans private (
+    private var lambda: Double,
+    private var convergenceTol: Double,
+    private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance. The default parameters are {lambda: 1,
+   * convergenceTol: 0.01, maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Return the distance threshold that controls cluster creation. Default: 1 */
+  def getLambda(): Double = lambda
+
+  /** Set the distance threshold that controls cluster creation. */
+  def setLambda(lambda: Double): this.type = {
+    this.lambda = lambda
+    this
+  }
+
+  /** Set the threshold value at which convergence is considered to have occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+    this.convergenceTol = convergenceTol
+    this
+  }
+
+  /** Return the threshold value at which convergence is considered to have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+    this.maxIterations = maxIterations
+    this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering.
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+    if (data.getStorageLevel == StorageLevel.NONE) {
+      logWarning("The input data is not directly cached, which may hurt performance if its"
+        + " parent RDDs are also uncached.")
+    }
+
+    // Compute norms and cache them.
+    val norms = data.map(Vectors.norm(_, 2.0))
+    norms.persist()
+    val zippedData = data.zip(norms).map {
+      case (v, norm) => new VectorWithNorm(v, norm)
+    }
+
+// Implementation of DP means a

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952800
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs the cluster creation
+ * process based on the DP means algorithm, iterating until the maximum
+ * number of iterations is reached or the convergence criterion is
+ * satisfied. With the current global set of centers, it locally creates
+ * a new cluster centered at `x` whenever it encounters an uncovered data
+ * point `x`. In a similar manner, a local cluster center is promoted to
+ * a global center whenever an uncovered local cluster center is found.
+ * A data point is said to be "covered" by a cluster `c` if the distance
+ * from the point to the cluster center of `c` is less than a given
+ * lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy"
+ * by Xinghao Pan, Evan R. Sparks, and Andre Wibisono.
+ *
+ * @param lambda The distance threshold value that controls cluster creation.
+ * @param convergenceTol The threshold value at which convergence is considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+@Experimental
+class DpMeans private (
+    private var lambda: Double,
+    private var convergenceTol: Double,
+    private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance. The default parameters are {lambda: 1,
+   * convergenceTol: 0.01, maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Return the distance threshold that controls cluster creation. Default: 1 */
+  def getLambda(): Double = lambda
+
+  /** Set the distance threshold that controls cluster creation. */
+  def setLambda(lambda: Double): this.type = {
+    this.lambda = lambda
+    this
+  }
+
+  /** Set the threshold value at which convergence is considered to have occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+    this.convergenceTol = convergenceTol
+    this
+  }
+
+  /** Return the threshold value at which convergence is considered to have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+    this.maxIterations = maxIterations
+    this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering.
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+    if (data.getStorageLevel == StorageLevel.NONE) {
+      logWarning("The input data is not directly cached, which may hurt performance if its"
+        + " parent RDDs are also uncached.")
+    }
+
+    // Compute norms and cache them.
+    val norms = data.map(Vectors.norm(_, 2.0))
+    norms.persist()
+    val zippedData = data.zip(norms).map {
+      case (v, norm) => new VectorWithNorm(v, norm)
+    }
+
+// Implementation of DP means a

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952815
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeansModel.scala ---
@@ -0,0 +1,67 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.rdd.RDD
+
+/**
+ * A clustering model for DP means. Each point belongs to the cluster with 
the closest center.
--- End diff --

missing doc for `clusterCenters`
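
For example (wording is only a suggestion):

~~~scala
/**
 * A clustering model for DP means. Each point belongs to the cluster with the closest center.
 *
 * @param clusterCenters Centers of the clusters obtained by DP means.
 */
~~~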





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952801
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs the cluster creation
+ * process based on the DP means algorithm, iterating until the maximum
+ * number of iterations is reached or the convergence criterion is
+ * satisfied. With the current global set of centers, it locally creates
+ * a new cluster centered at `x` whenever it encounters an uncovered data
+ * point `x`. In a similar manner, a local cluster center is promoted to
+ * a global center whenever an uncovered local cluster center is found.
+ * A data point is said to be "covered" by a cluster `c` if the distance
+ * from the point to the cluster center of `c` is less than a given
+ * lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy"
+ * by Xinghao Pan, Evan R. Sparks, and Andre Wibisono.
+ *
+ * @param lambda The distance threshold value that controls cluster creation.
+ * @param convergenceTol The threshold value at which convergence is considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+@Experimental
+class DpMeans private (
+    private var lambda: Double,
+    private var convergenceTol: Double,
+    private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance. The default parameters are {lambda: 1,
+   * convergenceTol: 0.01, maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Return the distance threshold that controls cluster creation. Default: 1 */
+  def getLambda(): Double = lambda
+
+  /** Set the distance threshold that controls cluster creation. */
+  def setLambda(lambda: Double): this.type = {
+    this.lambda = lambda
+    this
+  }
+
+  /** Set the threshold value at which convergence is considered to have occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+    this.convergenceTol = convergenceTol
+    this
+  }
+
+  /** Return the threshold value at which convergence is considered to have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+    this.maxIterations = maxIterations
+    this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+if (data.getStorageLevel == StorageLevel.NONE) {
+  logWarning("The input data is not directly cached, which may hurt 
performance if its"
++ " parent RDDs are also uncached.")
+}
+
+// Compute squared norms and cache them.
+val norms = data.map(Vectors.norm(_, 2.0))
+norms.persist()
+val zippedData = data.zip(norms).map {
+  case (v, norm) => new VectorWithNorm(v, norm)
+}
+
+// Implementation of DP means a
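
The scaladoc quoted above pins down the core DP-means rule: a point is "covered" if some existing center lies within distance lambda of it, and an uncovered point spawns a new cluster centered at itself. A minimal single-machine sketch of that assignment step, for illustration only (the name `assignOrCreate` and the mutable `centers` buffer are not from the PR; Euclidean distance is assumed, matching the doc's wording):

  import scala.collection.mutable.ArrayBuffer
  import org.apache.spark.mllib.linalg.{Vector, Vectors}

  // Returns the index of the cluster that covers `point`; if no center is
  // within `lambda`, the point itself is appended as a new center.
  def assignOrCreate(point: Vector, centers: ArrayBuffer[Vector], lambda: Double): Int = {
    val nearest = centers.indices
      .map(i => (math.sqrt(Vectors.sqdist(centers(i), point)), i))
      .sortBy(_._1)
      .headOption
    nearest match {
      case Some((dist, idx)) if dist < lambda => idx   // covered by an existing center
      case _ =>
        centers += point                               // uncovered: promote the point
        centers.length - 1
    }
  }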

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952480
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
(quotes the same truncated DpMeans.scala excerpt as the first message above)

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952500
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
(quotes the same truncated DpMeans.scala excerpt as the first message above)

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952510
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
(quotes the same truncated DpMeans.scala excerpt as the first message above)

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952454
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
(quotes the same truncated DpMeans.scala excerpt as the first message above)

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952539
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
(quotes the same truncated DpMeans.scala excerpt as the first message above)

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952505
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
(quotes the same truncated DpMeans.scala excerpt as the first message above)

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952496
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
(quotes the same truncated DpMeans.scala excerpt as the first message above)

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952319
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
(quotes the same truncated DpMeans.scala excerpt as the first message above)

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952241
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
(quotes the same truncated DpMeans.scala excerpt as the first message above)

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952093
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
(license header, imports, and class scaladoc omitted; identical to the excerpt quoted in the first message above)
+
+@Experimental
+class DpMeans private (
+private var lambda: Double,
+private var convergenceTol: Double,
+private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance.The default parameters are {lambda: 1, 
convergenceTol: 0.01,
+   * maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Set the distance threshold that controls cluster creation. Default: 
1 */
+  def getLambda(): Double = lambda
+
+  /** Return the lambda. */
--- End diff --

Wrong doc. Btw, use `returns`/`sets` in API docs instead of `return`/`set`. 
Please fix the other API docs as well.
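
For context, a corrected pairing along the lines this review asks for would give each accessor its own doc and use the third-person verb forms (illustrative wording only, not the PR's final text):

  /** Returns the distance threshold that controls cluster creation. */
  def getLambda: Double = lambda

  /** Sets the distance threshold that controls cluster creation. Default: 1. */
  def setLambda(lambda: Double): this.type = {
    this.lambda = lambda
    this
  }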


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952012
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,248 @@
(license header and imports omitted; identical to the excerpt quoted in the first message above)
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs cluster creation 
process,
+ * based on DP means algorithm, iterating until the maximum number of 
iterations
+ * is reached or the convergence criteria is satisfied. With the current
+ * global set of centers, it locally creates a new cluster centered at `x`
+ * whenever it encounters an uncovered data point `x`. In a similar manner,
+ * a local cluster center is promoted to a global center whenever an 
uncovered
+ * local cluster center is found. A data point is said to be "covered" by
+ * a cluster `c` if the distance from the point to the cluster center of 
`c`
+ * is less than a given lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy" 
by
+ * Xinghao Pan, Evan R. Sparks, Andre Wibisono
--- End diff --

Though we implement the distributed version described in the MLbase paper, 
it is still worth citing the original paper of DP-means.
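
(The original DP-means paper is Brian Kulis and Michael I. Jordan, "Revisiting 
k-means: New Algorithms via Bayesian Nonparametrics", ICML 2012.)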


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38951952
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DenseDpMeans.scala ---
@@ -0,0 +1,106 @@
(Apache license header omitted; identical to the headers quoted above)
+
+package org.apache.spark.examples.mllib
+
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.DpMeans
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example DP means app. Run with
--- End diff --

`DP means` -> `DP-means`, which is used in the original paper, similar to 
`k-means`.
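
(As a hypothetical usage note: an example app like this would normally be 
launched the same way as the other MLlib examples, e.g. 
`./bin/run-example mllib.DenseDpMeans <input file>`; the actual options are 
whatever the scopt `OptionParser` in the quoted file defines.)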


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952088
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
(license header, imports, and class scaladoc omitted; identical to the excerpt quoted in the first message above)
+
+@Experimental
+class DpMeans private (
+private var lambda: Double,
+private var convergenceTol: Double,
+private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance.The default parameters are {lambda: 1, 
convergenceTol: 0.01,
--- End diff --

space after `.`


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952178
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
(quotes the same truncated DpMeans.scala excerpt as the first message above)

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952167
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs cluster creation 
process,
+ * based on DP means algorithm, iterating until the maximum number of 
iterations
+ * is reached or the convergence criteria is satisfied. With the current
+ * global set of centers, it locally creates a new cluster centered at `x`
+ * whenever it encounters an uncovered data point `x`. In a similar manner,
+ * a local cluster center is promoted to a global center whenever an 
uncovered
+ * local cluster center is found. A data point is said to be "covered" by
+ * a cluster `c` if the distance from the point to the cluster center of 
`c`
+ * is less than a given lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy" 
by
+ * Xinghao Pan, Evan R. Sparks, Andre Wibisono
+ *
+ * @param lambda The distance threshold value that controls cluster 
creation.
+ * @param convergenceTol The threshold value at which convergence is 
considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+
+@Experimental
+class DpMeans private (
+private var lambda: Double,
+private var convergenceTol: Double,
+private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance.The default parameters are {lambda: 1, 
convergenceTol: 0.01,
+   * maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Set the distance threshold that controls cluster creation. Default: 
1 */
+  def getLambda(): Double = lambda
+
+  /** Return the lambda. */
+  def setLambda(lambda: Double): this.type = {
+this.lambda = lambda
+this
+  }
+
+  /** Set the threshold value at which convergence is considered to have 
occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+this.convergenceTol = convergenceTol
+this
+  }
+
+  /** Return the threshold value at which convergence is considered to 
have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+this.maxIterations = maxIterations
+this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+if (data.getStorageLevel == StorageLevel.NONE) {
+  logWarning("The input data is not directly cached, which may hurt 
performance if its"
++ " parent RDDs are also uncached.")
+}
+
+// Compute squared norms and cache them.
+val norms = data.map(Vectors.norm(_, 2.0))
+norms.persist()
+val zippedData = data.zip(norms).map {
+  case (v, norm) => new VectorWithNorm(v, norm)
+}
+
+// Implementation of DP means a
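
The excerpt above is cut off by the archive, but the norm caching it shows has a concrete payoff: with ||a|| and ||b|| precomputed, the squared Euclidean distance needs only one dot product, since ||a - b||^2 = ||a||^2 + ||b||^2 - 2(a . b). The quoted imports include `MLUtils`, whose `fastSquaredDistance` applies this identity with a numerical-precision guard. A toy sketch of the identity on plain arrays:

```scala
// Sketch only: squared Euclidean distance from precomputed norms,
// using ||a - b||^2 = ||a||^2 + ||b||^2 - 2 * (a dot b).
def fastSquaredDistance(a: Array[Double], normA: Double,
                        b: Array[Double], normB: Double): Double = {
  var dot = 0.0
  var i = 0
  while (i < a.length) { dot += a(i) * b(i); i += 1 }
  normA * normA + normB * normB - 2.0 * dot
}
```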

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952320
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs cluster creation 
process,
+ * based on DP means algorithm, iterating until the maximum number of 
iterations
+ * is reached or the convergence criteria is satisfied. With the current
+ * global set of centers, it locally creates a new cluster centered at `x`
+ * whenever it encounters an uncovered data point `x`. In a similar manner,
+ * a local cluster center is promoted to a global center whenever an 
uncovered
+ * local cluster center is found. A data point is said to be "covered" by
+ * a cluster `c` if the distance from the point to the cluster center of 
`c`
+ * is less than a given lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy" 
by
+ * Xinghao Pan, Evan R. Sparks, Andre Wibisono
+ *
+ * @param lambda The distance threshold value that controls cluster 
creation.
+ * @param convergenceTol The threshold value at which convergence is 
considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+
+@Experimental
+class DpMeans private (
+private var lambda: Double,
+private var convergenceTol: Double,
+private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance.The default parameters are {lambda: 1, 
convergenceTol: 0.01,
+   * maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Set the distance threshold that controls cluster creation. Default: 
1 */
+  def getLambda(): Double = lambda
+
+  /** Return the lambda. */
+  def setLambda(lambda: Double): this.type = {
+this.lambda = lambda
+this
+  }
+
+  /** Set the threshold value at which convergence is considered to have 
occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+this.convergenceTol = convergenceTol
+this
+  }
+
+  /** Return the threshold value at which convergence is considered to 
have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+this.maxIterations = maxIterations
+this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+if (data.getStorageLevel == StorageLevel.NONE) {
+  logWarning("The input data is not directly cached, which may hurt 
performance if its"
++ " parent RDDs are also uncached.")
+}
+
+// Compute squared norms and cache them.
+val norms = data.map(Vectors.norm(_, 2.0))
+norms.persist()
+val zippedData = data.zip(norms).map {
+  case (v, norm) => new VectorWithNorm(v, norm)
+}
+
+// Implementation of DP means a
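
This excerpt likewise ends before the main loop, but the `convergenceTol` and `maxIterations` parameters documented above imply a stopping rule of the usual shape. A hypothetical sketch (not code from the PR), where `step` runs one clustering pass and returns its cost:

```scala
// Hypothetical stopping rule: stop once the cost improvement between
// passes drops below `convergenceTol`, or the iteration budget is spent.
def iterateUntilConverged(step: () => Double,
                          convergenceTol: Double,
                          maxIterations: Int): Double = {
  var previousCost = Double.PositiveInfinity
  var cost = previousCost
  var iteration = 0
  var converged = false
  while (iteration < maxIterations && !converged) {
    cost = step()
    converged = previousCost - cost < convergenceTol
    previousCost = cost
    iteration += 1
  }
  cost
}
```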

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38951973
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DenseDpMeans.scala ---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.DpMeans
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example DP means app. Run with
+ * {{{
+ * ./bin/run-example mllib.DenseDpMeans <--lambda> [<--convergenceTol> 
<--maxIterations>] 
+ * }}}
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object DenseDpMeans {
+  case class Params(
+input: String = null,
+lambda: Double = 0.0,
+convergenceTol: Double = 0.01,
+maxIterations: Int = 20) extends AbstractParams[Params]
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params]("DenseDpMeans") {
+  head("DenseDpMeans: Dp means example application.")
+  opt[Double]("lambda")
+.required()
+.text("distance threshold, required")
+.action((x, c) => c.copy(lambda = x))
+  opt[Double]("convergenceTol")
+.abbr("ct")
+.text(s"convergence threshold, default: 
${defaultParams.convergenceTol}")
+.action((x, c) => c.copy(convergenceTol = x))
+  opt[Int]("maxIterations")
+.abbr("iter")
+.text(s"number of iterations, default: 
${defaultParams.maxIterations}")
+.action((x, c) => c.copy(maxIterations = x))
+  arg[String]("")
+.text("path to input data")
+.required()
+.action((x, c) => c.copy(input = x))
+}
+
+parser.parse(args, defaultParams).map { params =>
+  run(params)
+}.getOrElse {
+  sys.exit(1)
+}
+  }
+
+  private def run(params: Params) {
+val conf = new SparkConf().setAppName("DP means example")
+val sc = new SparkContext(conf)
+
+val data = sc.textFile(params.input).map { line =>
+  Vectors.dense(line.trim.split(' ').map(_.toDouble))
+}.cache()
+
+val clusters = new DpMeans()
+  .setLambda(params.lambda)
+  .setConvergenceTol(params.convergenceTol)
+  .setMaxIterations(params.maxIterations)
+  .run(data)
+
+val k = clusters.k
+println(s"Number of Clusters = $k.")
+println()
+
+println("Clusters centers ::")
+for (i <- 0 until clusters.k) {
+  println(clusters.clusterCenters(i))
--- End diff --

fix indentation
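
For reference, the indentation fix being requested would presumably leave the loop with the conventional two-space body indent:

```scala
println("Clusters centers ::")
for (i <- 0 until clusters.k) {
  println(clusters.clusterCenters(i))
}
```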





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952170
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs cluster creation 
process,
+ * based on DP means algorithm, iterating until the maximum number of 
iterations
+ * is reached or the convergence criteria is satisfied. With the current
+ * global set of centers, it locally creates a new cluster centered at `x`
+ * whenever it encounters an uncovered data point `x`. In a similar manner,
+ * a local cluster center is promoted to a global center whenever an 
uncovered
+ * local cluster center is found. A data point is said to be "covered" by
+ * a cluster `c` if the distance from the point to the cluster center of 
`c`
+ * is less than a given lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy" 
by
+ * Xinghao Pan, Evan R. Sparks, Andre Wibisono
+ *
+ * @param lambda The distance threshold value that controls cluster 
creation.
+ * @param convergenceTol The threshold value at which convergence is 
considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+
+@Experimental
+class DpMeans private (
+private var lambda: Double,
+private var convergenceTol: Double,
+private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance.The default parameters are {lambda: 1, 
convergenceTol: 0.01,
+   * maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Set the distance threshold that controls cluster creation. Default: 
1 */
+  def getLambda(): Double = lambda
+
+  /** Return the lambda. */
+  def setLambda(lambda: Double): this.type = {
+this.lambda = lambda
+this
+  }
+
+  /** Set the threshold value at which convergence is considered to have 
occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+this.convergenceTol = convergenceTol
+this
+  }
+
+  /** Return the threshold value at which convergence is considered to 
have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+this.maxIterations = maxIterations
+this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+if (data.getStorageLevel == StorageLevel.NONE) {
+  logWarning("The input data is not directly cached, which may hurt 
performance if its"
++ " parent RDDs are also uncached.")
+}
+
+// Compute squared norms and cache them.
+val norms = data.map(Vectors.norm(_, 2.0))
+norms.persist()
+val zippedData = data.zip(norms).map {
+  case (v, norm) => new VectorWithNorm(v, norm)
+}
+
+// Implementation of DP means a

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952190
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs cluster creation 
process,
+ * based on DP means algorithm, iterating until the maximum number of 
iterations
+ * is reached or the convergence criteria is satisfied. With the current
+ * global set of centers, it locally creates a new cluster centered at `x`
+ * whenever it encounters an uncovered data point `x`. In a similar manner,
+ * a local cluster center is promoted to a global center whenever an 
uncovered
+ * local cluster center is found. A data point is said to be "covered" by
+ * a cluster `c` if the distance from the point to the cluster center of 
`c`
+ * is less than a given lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy" 
by
+ * Xinghao Pan, Evan R. Sparks, Andre Wibisono
+ *
+ * @param lambda The distance threshold value that controls cluster 
creation.
+ * @param convergenceTol The threshold value at which convergence is 
considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+
+@Experimental
+class DpMeans private (
+private var lambda: Double,
+private var convergenceTol: Double,
+private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance.The default parameters are {lambda: 1, 
convergenceTol: 0.01,
+   * maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Set the distance threshold that controls cluster creation. Default: 
1 */
+  def getLambda(): Double = lambda
+
+  /** Return the lambda. */
+  def setLambda(lambda: Double): this.type = {
+this.lambda = lambda
+this
+  }
+
+  /** Set the threshold value at which convergence is considered to have 
occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+this.convergenceTol = convergenceTol
+this
+  }
+
+  /** Return the threshold value at which convergence is considered to 
have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+this.maxIterations = maxIterations
+this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+if (data.getStorageLevel == StorageLevel.NONE) {
+  logWarning("The input data is not directly cached, which may hurt 
performance if its"
++ " parent RDDs are also uncached.")
+}
+
+// Compute squared norms and cache them.
+val norms = data.map(Vectors.norm(_, 2.0))
+norms.persist()
+val zippedData = data.zip(norms).map {
+  case (v, norm) => new VectorWithNorm(v, norm)
+}
+
+// Implementation of DP means a

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952091
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs cluster creation 
process,
+ * based on DP means algorithm, iterating until the maximum number of 
iterations
+ * is reached or the convergence criteria is satisfied. With the current
+ * global set of centers, it locally creates a new cluster centered at `x`
+ * whenever it encounters an uncovered data point `x`. In a similar manner,
+ * a local cluster center is promoted to a global center whenever an 
uncovered
+ * local cluster center is found. A data point is said to be "covered" by
+ * a cluster `c` if the distance from the point to the cluster center of 
`c`
+ * is less than a given lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy" 
by
+ * Xinghao Pan, Evan R. Sparks, Andre Wibisono
+ *
+ * @param lambda The distance threshold value that controls cluster 
creation.
+ * @param convergenceTol The threshold value at which convergence is 
considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+
+@Experimental
+class DpMeans private (
+private var lambda: Double,
+private var convergenceTol: Double,
+private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance.The default parameters are {lambda: 1, 
convergenceTol: 0.01,
+   * maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Set the distance threshold that controls cluster creation. Default: 
1 */
--- End diff --

`Set` -> `Returns`
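
Applied to the quoted pair of accessors, whose scaladoc comments are swapped, the suggestion would presumably read as below; dropping the empty parentheses on `getLambda` would also match the file's other getters, `getConvergenceTol` and `getMaxIterations`:

```scala
/** Returns the distance threshold that controls cluster creation. Default: 1 */
def getLambda: Double = lambda

/** Set the distance threshold that controls cluster creation. */
def setLambda(lambda: Double): this.type = {
  this.lambda = lambda
  this
}
```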





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38951966
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DenseDpMeans.scala ---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.DpMeans
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example DP means app. Run with
+ * {{{
+ * ./bin/run-example mllib.DenseDpMeans <--lambda> [<--convergenceTol> 
<--maxIterations>] 
+ * }}}
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object DenseDpMeans {
+  case class Params(
+input: String = null,
+lambda: Double = 0.0,
+convergenceTol: Double = 0.01,
+maxIterations: Int = 20) extends AbstractParams[Params]
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params]("DenseDpMeans") {
+  head("DenseDpMeans: Dp means example application.")
+  opt[Double]("lambda")
+.required()
+.text("distance threshold, required")
+.action((x, c) => c.copy(lambda = x))
+  opt[Double]("convergenceTol")
+.abbr("ct")
+.text(s"convergence threshold, default: 
${defaultParams.convergenceTol}")
+.action((x, c) => c.copy(convergenceTol = x))
+  opt[Int]("maxIterations")
+.abbr("iter")
+.text(s"number of iterations, default: 
${defaultParams.maxIterations}")
+.action((x, c) => c.copy(maxIterations = x))
+  arg[String]("")
+.text("path to input data")
+.required()
+.action((x, c) => c.copy(input = x))
+}
+
+parser.parse(args, defaultParams).map { params =>
+  run(params)
+}.getOrElse {
+  sys.exit(1)
+}
+  }
+
+  private def run(params: Params) {
+val conf = new SparkConf().setAppName("DP means example")
--- End diff --

`DP-means`





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38951957
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DenseDpMeans.scala ---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.DpMeans
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example DP means app. Run with
+ * {{{
+ * ./bin/run-example mllib.DenseDpMeans <--lambda> [<--convergenceTol> 
<--maxIterations>] 
+ * }}}
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object DenseDpMeans {
+  case class Params(
+input: String = null,
--- End diff --

fix indentation





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38952017
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,247 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs cluster creation 
process,
+ * based on DP means algorithm, iterating until the maximum number of 
iterations
+ * is reached or the convergence criteria is satisfied. With the current
+ * global set of centers, it locally creates a new cluster centered at `x`
+ * whenever it encounters an uncovered data point `x`. In a similar manner,
+ * a local cluster center is promoted to a global center whenever an 
uncovered
+ * local cluster center is found. A data point is said to be "covered" by
+ * a cluster `c` if the distance from the point to the cluster center of 
`c`
+ * is less than a given lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy" 
by
+ * Xinghao Pan, Evan R. Sparks, Andre Wibisono
+ *
+ * @param lambda The distance threshold value that controls cluster 
creation.
--- End diff --

There are two issues with using `lambda` as the parameter name:

1. `lambda` is a keyword in Python.
2. It is used in other algorithms as the regularization parameter.

We can use a more descriptive name like `clusterPenalty`, `maxRadius`, or 
`distanceThreshold`.
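
If such a rename were adopted, a backwards-compatible shim could steer callers to the descriptive name. A hypothetical sketch, with the names and the deprecation version made up rather than taken from the PR:

```scala
// Hypothetical: descriptive setter plus a deprecated alias for the old name.
def setDistanceThreshold(distanceThreshold: Double): this.type = {
  this.lambda = distanceThreshold
  this
}

@deprecated("Use setDistanceThreshold instead", "1.6.0")
def setLambda(lambda: Double): this.type = setDistanceThreshold(lambda)
```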





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-08 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r38951960
  
--- Diff: 
examples/src/main/scala/org/apache/spark/examples/mllib/DenseDpMeans.scala ---
@@ -0,0 +1,106 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.DpMeans
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example DP means app. Run with
+ * {{{
+ * ./bin/run-example mllib.DenseDpMeans <--lambda> [<--convergenceTol> 
<--maxIterations>] 
+ * }}}
+ * If you use it as a template to create your own app, please use 
`spark-submit` to submit your app.
+ */
+object DenseDpMeans {
+  case class Params(
+input: String = null,
+lambda: Double = 0.0,
+convergenceTol: Double = 0.01,
+maxIterations: Int = 20) extends AbstractParams[Params]
+
+  def main(args: Array[String]) {
+val defaultParams = Params()
+
+val parser = new OptionParser[Params]("DenseDpMeans") {
+  head("DenseDpMeans: Dp means example application.")
--- End diff --

`Dp means` -> `DP-means`





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-09-02 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-136956338
  
@mengxr We have updated the JIRA ticket to include the benchmark results as 
well. Could you please take a look and give your suggestions?





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-28 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-125838477
  
Thank you @mengxr. We will take a look at the PR you mentioned. We are 
looking forward to having DP-Means in the 1.6 release. Thanks a lot for your kind 
support.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-28 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-125733942
  
@FlytxtRnD You might need `build/sbt clean` first. Given the review 
bandwidth, we may not be able to make this into 1.5. So I will make another 
pass after the 1.5 feature freeze. In the meantime, it would be super helpful 
if you can help review some other PRs that are on the 1.5 roadmap, e.g. 
https://github.com/apache/spark/pull/5267 (bisecting k-means). Thanks!





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-28 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-125571416
  
@jkbradley To generate the docs, I installed jekyll. The `jekyll build` command 
shows an error:
`[info] Done updating.
[error] (catalyst/compile:compile) Compilation failed
[error] Total time: 601 s, completed 28 Jul, 2015 4:25:49 PM
Moving back into docs dir.
Making directory api/scala
cp -r ../target/scala-2.10/unidoc/. api/scala
jekyll 2.5.3 | Error: No such file or directory - ../target/scala-2.10/unidoc/.`

But `SKIP_API=1 jekyll build` completes successfully. Could you please 
help me solve this?





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-20 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-123170256
  
@FlytxtRnD To generate the docs, I've always used jekyll (following the 
instructions on that same page). I know that builds more than you want, but 
does that at least work?

Sorry this PR is having to wait a bit for full review!





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-19 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-122671172
  
@mengxr @jkbradley Gentle reminder.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-09 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-119924897
  
@mengxr Could you please tell me how to generate the API docs? I ran 
`build/sbt unidoc` as mentioned in 
https://github.com/apache/spark/blob/master/docs/README.md, but it ends in an 
assertion error. Please help.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-09 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-119907064
  
@mengxr I have reduced the PR length so that it would be easier for you to 
review. The style issues have been fixed wherever they were observed. 
I will change the paper name in the next update, and the benchmark results 
will also be ready as soon as possible. 
Could you please review this updated PR and give suggestions?






[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-119486668
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-119486527
  
  [Test build #36766 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36766/console)
 for   PR 6880 at commit 
[`907f4f1`](https://github.com/apache/spark/commit/907f4f1b0c94f35f0a6097c1d35987e12b352e09).
 * This patch **passes all tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
  * `  case class Params(`






[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-08 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-119470721
  
  [Test build #36766 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/36766/consoleFull)
 for   PR 6880 at commit 
[`907f4f1`](https://github.com/apache/spark/commit/907f4f1b0c94f35f0a6097c1d35987e12b352e09).





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-119469656
  
Merged build started.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-08 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-119469583
  
 Merged build triggered.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r33736631
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeansModel.scala ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.json4s.DefaultFormats
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.pmml.PMMLExportable
+import org.apache.spark.mllib.util.{Loader, Saveable}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext
+import org.apache.spark.sql.{Row, SQLContext}
+
+/**
+ * A clustering model for DP means. Each point belongs to the cluster with 
the closest center.
+ */
+class DpMeansModel
--- End diff --

I don't expect users to call clustering models via a generic interface, at 
least for now. So we don't need to address this in this PR.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-01 Thread mengxr
Github user mengxr commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-117854395
  
@FlytxtRnD I haven't checked the implementation yet. Some high-level 
comments:

1. Please follow the code style guide. I saw wrong indentation, extra 
spacing, and stray vertical alignment in your code.
2. Move save/load and the example code to follow-up PRs. Keep this PR small 
to accelerate the code review.
3. Check the generated API doc. Usually this is the simplest way to find 
public APIs that should be private.

On the algorithm part, could you list a few success stories comparing 
k-means and DP-means? Some benchmark results would also help.





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r33736229
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeansModel.scala ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.json4s.DefaultFormats
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.pmml.PMMLExportable
+import org.apache.spark.mllib.util.{Loader, Saveable}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext
+import org.apache.spark.sql.{Row, SQLContext}
+
+/**
+ * A clustering model for DP means. Each point belongs to the cluster with
+ * the closest center.
+ */
+class DpMeansModel(val clusterCenters: Array[Vector])
+  extends Saveable with Serializable with PMMLExportable {
+
+  /** A Java-friendly constructor that takes an Iterable of Vectors. */
+  def this(centers: java.lang.Iterable[Vector]) = this(centers.asScala.toArray)
+
+  /** Total number of clusters obtained. */
+  def k: Int = clusterCenters.length
+
+  /** Returns the cluster index that a given point belongs to. */
+  def predict(point: Vector): Int = {
+    val centersWithNorm = clusterCentersWithNorm
+    DpMeans.assignCluster(centersWithNorm.to[mutable.ArrayBuffer],
+      new VectorWithNorm(point))._1
+  }
+
+  /** Maps the points in the given RDD to their closest cluster indices. */
+  def predict(points: RDD[Vector]): RDD[Int] = {
+    val centersWithNorm = clusterCentersWithNorm
+    val bcCentersWithNorm = points.context.broadcast(centersWithNorm)
+    points.map(p =>
+      DpMeans.assignCluster(bcCentersWithNorm.value.to[mutable.ArrayBuffer],
+        new VectorWithNorm(p))._1)
+  }
+
+  /**
+   * Return the cost (sum of squared distances of points to their nearest
+   * center) for this model on the given data.
+   */
+  def computeCost(data: RDD[Vector]): Double = {
+    val centersWithNorm = clusterCentersWithNorm
+    val bcCentersWithNorm = data.context.broadcast(centersWithNorm)
+    data.map(p =>
+      DpMeans.assignCluster(bcCentersWithNorm.value.to[mutable.ArrayBuffer],
+        new VectorWithNorm(p))._2).sum()
+  }
+
+  private def clusterCentersWithNorm: Iterable[VectorWithNorm] =
+    clusterCenters.map(new VectorWithNorm(_))
+
+  override def save(sc: SparkContext, path: String): Unit = {
+    DpMeansModel.SaveLoadV1_0.save(sc, this, path)
+  }
+
+  override protected def formatVersion: String = "1.0"
+
+}
+
+object DpMeansModel extends Loader[DpMeansModel] {
--- End diff --

Shall we add save/load in a separate PR? Just to reduce the length of this 
PR.


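For readers following the thread, a minimal usage sketch of the model API shown in this diff; the sample points and the lambda value are illustrative only:

```scala
import org.apache.spark.mllib.clustering.DpMeans
import org.apache.spark.mllib.linalg.Vectors
import org.apache.spark.{SparkConf, SparkContext}

object DpMeansExample {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("DpMeansExample"))
    // Two well-separated groups, so lambda = 2.0 should yield two clusters.
    val data = sc.parallelize(Seq(
      Vectors.dense(0.0, 0.0), Vectors.dense(0.1, 0.1),
      Vectors.dense(9.0, 9.0), Vectors.dense(9.1, 9.1)))
    val model = new DpMeans().setLambda(2.0).run(data)
    println(s"clusters: ${model.k}")
    println(s"cost: ${model.computeCost(data)}")
    model.predict(data).collect().foreach(println)  // per-point cluster index
    sc.stop()
  }
}
```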



[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r33736217
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs the cluster creation
+ * process based on the DP means algorithm, iterating until the maximum
+ * number of iterations is reached or the convergence criterion is
+ * satisfied. With the current global set of centers, it locally creates a
+ * new cluster centered at `x` whenever it encounters an uncovered data
+ * point `x`. In a similar manner, a local cluster center is promoted to a
+ * global center whenever an uncovered local cluster center is found. A
+ * data point is said to be "covered" by a cluster `c` if the distance from
+ * the point to the cluster center of `c` is less than a given lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy"
+ * by Xinghao Pan, Evan R. Sparks, Andre Wibisono
--- End diff --

Should it be "Revisiting k-means: New Algorithms via Bayesian 
Nonparametrics" instead? 
http://machinelearning.wustl.edu/mlpapers/papers/ICML2012Kulis_291


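To make the covering rule described in that scaladoc concrete, here is a local, non-distributed sketch; `assignOrCreate` and its signature are invented for illustration and are not the PR's private `DpMeans.assignCluster`:

```scala
import scala.collection.mutable.ArrayBuffer

// A point joins its nearest center if that center covers it (distance <
// lambda); otherwise the point seeds a new cluster. Returns the assigned
// cluster index and the squared distance to its center.
def assignOrCreate(
    centers: ArrayBuffer[Array[Double]],
    point: Array[Double],
    lambda: Double): (Int, Double) = {
  def sqDist(a: Array[Double], b: Array[Double]): Double =
    a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum
  if (centers.isEmpty) {
    centers += point.clone()
    return (0, 0.0)
  }
  val (minSqDist, bestIdx) = centers.map(c => sqDist(c, point)).zipWithIndex.minBy(_._1)
  if (math.sqrt(minSqDist) < lambda) {
    (bestIdx, minSqDist)        // covered: assign to nearest center
  } else {
    centers += point.clone()    // uncovered: promote the point to a center
    (centers.length - 1, 0.0)
  }
}
```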



[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r33736224
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeans.scala ---
@@ -0,0 +1,248 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.mutable.ArrayBuffer
+
+import org.apache.spark.Logging
+import org.apache.spark.annotation.Experimental
+import org.apache.spark.mllib.linalg.{Vector, Vectors}
+import org.apache.spark.mllib.linalg.BLAS.{axpy, scal}
+import org.apache.spark.mllib.util.MLUtils
+import org.apache.spark.rdd.RDD
+import org.apache.spark.storage.StorageLevel
+
+/**
+ * :: Experimental ::
+ *
+ * The Dirichlet process (DP) is a popular non-parametric Bayesian mixture
+ * model that allows for flexible clustering of data without having to
+ * determine the number of clusters in advance.
+ *
+ * Given a set of data points, this class performs the cluster creation
+ * process based on the DP means algorithm, iterating until the maximum
+ * number of iterations is reached or the convergence criterion is
+ * satisfied. With the current global set of centers, it locally creates a
+ * new cluster centered at `x` whenever it encounters an uncovered data
+ * point `x`. In a similar manner, a local cluster center is promoted to a
+ * global center whenever an uncovered local cluster center is found. A
+ * data point is said to be "covered" by a cluster `c` if the distance from
+ * the point to the cluster center of `c` is less than a given lambda value.
+ *
+ * The original paper is "MLbase: Distributed Machine Learning Made Easy"
+ * by Xinghao Pan, Evan R. Sparks, Andre Wibisono
+ *
+ * @param lambda The distance threshold value that controls cluster creation.
+ * @param convergenceTol The threshold value at which convergence is
+ *                       considered to have occurred.
+ * @param maxIterations The maximum number of iterations to perform.
+ */
+@Experimental
+class DpMeans private (
+    private var lambda: Double,
+    private var convergenceTol: Double,
+    private var maxIterations: Int) extends Serializable with Logging {
+
+  /**
+   * Constructs a default instance. The default parameters are
+   * {lambda: 1, convergenceTol: 0.01, maxIterations: 20}.
+   */
+  def this() = this(1, 0.01, 20)
+
+  /** Return the distance threshold (lambda) that controls cluster creation. */
+  def getLambda(): Double = lambda
+
+  /** Set the distance threshold that controls cluster creation. Default: 1 */
+  def setLambda(lambda: Double): this.type = {
+    this.lambda = lambda
+    this
+  }
+
+  /** Set the threshold value at which convergence is considered to have
+   *  occurred. Default: 0.01 */
+  def setConvergenceTol(convergenceTol: Double): this.type = {
+    this.convergenceTol = convergenceTol
+    this
+  }
+
+  /** Return the threshold value at which convergence is considered to
+   *  have occurred. */
+  def getConvergenceTol: Double = convergenceTol
+
+  /** Set the maximum number of iterations. Default: 20 */
+  def setMaxIterations(maxIterations: Int): this.type = {
+    this.maxIterations = maxIterations
+    this
+  }
+
+  /** Return the maximum number of iterations. */
+  def getMaxIterations: Int = maxIterations
+
+  /**
+   * Perform DP means clustering.
+   */
+  def run(data: RDD[Vector]): DpMeansModel = {
+    if (data.getStorageLevel == StorageLevel.NONE) {
+      logWarning("The input data is not directly cached, which may hurt"
+        + " performance if its parent RDDs are also uncached.")
+    }
+
+    // Compute L2 norms and cache them.
+    val norms = data.map(Vectors.norm(_, 2.0))
+    norms.persist()
+    val zippedData = data.zip(norms).map {
+      case (v, norm) => new VectorWithNorm(v, norm)
+    }
+
+    // Implementation of DP means
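
A minimal sketch of the norm-precomputation pattern the run() snippet above uses; `PointWithNorm` is a hypothetical stand-in for mllib's private VectorWithNorm:

```scala
import org.apache.spark.mllib.linalg.{Vector, Vectors}
import org.apache.spark.rdd.RDD

// Hypothetical stand-in for mllib's private VectorWithNorm.
case class PointWithNorm(vector: Vector, norm: Double)

// Precompute each point's L2 norm once and carry it alongside the vector,
// so per-iteration distance computations can reuse it.
def withNorms(data: RDD[Vector]): RDD[PointWithNorm] = {
  val norms = data.map(Vectors.norm(_, 2.0))
  norms.persist()  // the norms are reused across iterations
  data.zip(norms).map { case (v, norm) => PointWithNorm(v, norm) }
}
```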

[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-07-01 Thread mengxr
Github user mengxr commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r33736213
  
--- Diff: examples/src/main/scala/org/apache/spark/examples/mllib/DenseDpMeans.scala ---
@@ -0,0 +1,109 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.examples.mllib
+
+import scopt.OptionParser
+
+import org.apache.spark.mllib.clustering.DpMeans
+import org.apache.spark.mllib.linalg.Vectors
+import org.apache.spark.{SparkConf, SparkContext}
+
+/**
+ * An example DP means app. Run with
+ * {{{
+ * ./bin/run-example mllib.DenseDpMeans <--lambda> [<--convergenceTol> <--maxIterations>]
+ * }}}
+ * If you use it as a template to create your own app, please use
+ * `spark-submit` to submit your app.
+ */
+object DenseDpMeans {
+
+  case class Params(
+ input: String = null,
--- End diff --

Please follow Spark code style guide: 
https://cwiki.apache.org/confluence/display/SPARK/Spark+Code+Style+Guide


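For reference, the style guide calls for a four-space indent on wrapped constructor parameters rather than vertical alignment; a sketch of the conforming layout (the fields beyond `input` are guesses at this app's parameters):

```scala
case class Params(
    input: String = null,
    lambda: Double = 1.0,
    convergenceTol: Double = 0.01,
    maxIterations: Int = 20)
```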



[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-06-29 Thread FlytxtRnD
Github user FlytxtRnD commented on the pull request:

https://github.com/apache/spark/pull/6880#issuecomment-116645007
  
@mengxr Could you please share your comments on this PR?





[GitHub] spark pull request: [SPARK-8402][MLLIB] DP Means Clustering

2015-06-25 Thread sujkh85
Github user sujkh85 commented on a diff in the pull request:

https://github.com/apache/spark/pull/6880#discussion_r33227311
  
--- Diff: mllib/src/main/scala/org/apache/spark/mllib/clustering/DpMeansModel.scala ---
@@ -0,0 +1,149 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ *    http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.spark.mllib.clustering
+
+import scala.collection.JavaConverters._
+import scala.collection.mutable
+
+import org.json4s.DefaultFormats
+import org.json4s.JsonDSL._
+import org.json4s.jackson.JsonMethods._
+
+import org.apache.spark.mllib.linalg.Vector
+import org.apache.spark.mllib.pmml.PMMLExportable
+import org.apache.spark.mllib.util.{Loader, Saveable}
+import org.apache.spark.rdd.RDD
+import org.apache.spark.SparkContext
+import org.apache.spark.sql.{Row, SQLContext}
+
+/**
+ * A clustering model for DP means. Each point belongs to the cluster with
+ * the closest center.
+ */
+class DpMeansModel
--- End diff --


NAVER - http://www.naver.com/

The message you sent to su...@naver.com could not be delivered for the
following reason:

The recipient has blocked mail from your address.








