[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-19 Thread asfgit
Github user asfgit closed the pull request at:

https://github.com/apache/spark/pull/10150


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-19 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-172938259
  
LGTM
Thanks for the PR!
Merging with master





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-172756623
  
**[Test build #2406 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2406/consoleFull)** for PR 10150 at commit [`5eec54b`](https://github.com/apache/spark/commit/5eec54b9072e737d32a38efb8f1c101ae05b3044).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-18 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-172764328
  
**[Test build #2406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2406/consoleFull)** for PR 10150 at commit [`5eec54b`](https://github.com/apache/spark/commit/5eec54b9072e737d32a38efb8f1c101ae05b3044).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-16 Thread holdenk
Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-172234499
  
@jkbradley updated to long link





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-171882962
  
**[Test build #49442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49442/consoleFull)** for PR 10150 at commit [`5eec54b`](https://github.com/apache/spark/commit/5eec54b9072e737d32a38efb8f1c101ae05b3044).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-14 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-171887491
  
**[Test build #49442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49442/consoleFull)** for PR 10150 at commit [`5eec54b`](https://github.com/apache/spark/commit/5eec54b9072e737d32a38efb8f1c101ae05b3044).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-171887744
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49442/
Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-14 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-171887741
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-13 Thread holdenk
Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-171473355
  
@jkbradley is the bit.ly link OK, or should I break the 72-char line length limit (since the discussion on the cleanup JIRA suggests it's maybe not something we care about enforcing)?





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-13 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-171491552
  
I'd prefer to use a long link.  Thanks!





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-13 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49676469
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,129 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
    'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data, 2), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p)
+0
+>>> model.k
+4
+>>> model.computeCost(p)
+0.0
+
+.. versionadded:: 2.0.0
+"""
+
+def __init__(self, java_model):
+super(BisectingKMeansModel, self).__init__(java_model)
+self.centers = [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy
+arrays."""
+return self.centers
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster that each of the points belongs to in this
+model.
+
+:param x: the point (or RDD of points) to determine
+  compute the clusters for.
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("predict", vecs)
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('2.0.0')
+def computeCost(self, x):
+"""
+Return the Bisecting K-means cost (sum of squared distances of
+points to their nearest center) for this model on the given
+data. If provided with an RDD of points returns the sum.
+
+:param point: the point or RDD of points to compute the cost(s).
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("computeCost", vecs)
+
+return self.call("computeCost", _convert_to_vector(x))
+
+
+class BisectingKMeans(object):
+"""
+.. note:: Experimental
+
+A bisecting k-means algorithm based on the paper "A comparison of
+document clustering techniques" by Steinbach, Karypis, and Kumar,
+with modification to fit Spark.
+The algorithm starts from a single cluster that contains all points.
+Iteratively it finds divisible clusters on the bottom level and
+bisects each of them using k-means, until there are `k` leaf
+clusters in total or no leaf clusters are divisible.
+The bisecting steps of clusters on the same level are grouped
+together to increase parallelism. If bisecting all divisible
+clusters on the bottom level would result more than `k` leaf
+clusters, larger clusters get higher priority.
+
+Based on U{http://bit.ly/1OTnFP1} Steinbach, Karypis, and Kumar, A
--- End diff --

I think long lines are OK for links if needed.  I'd prefer no bitly.
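For context on the diff above: the `computeCost` docstring describes the sum of squared distances from points to their nearest center. A minimal NumPy sketch of that cost, outside Spark and with a hypothetical `compute_cost` helper name, looks like this:

```python
import numpy as np

def compute_cost(points, centers):
    # Sum of squared distances from each point to its nearest center,
    # as the computeCost docstring describes (illustrative sketch only,
    # not Spark's implementation).
    points = np.atleast_2d(points)
    centers = np.asarray(centers)
    # Squared distance from every point to every center: (n_points, n_centers)
    d2 = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    # Cost is the sum, over points, of the distance to the closest center.
    return d2.min(axis=1).sum()

centers = [np.array([0.5, 0.5]), np.array([8.5, 8.5])]
print(compute_cost(np.array([[0.0, 0.0]]), centers))  # -> 0.5
```

Passing several points sums the per-point costs, which matches the "If provided with an RDD of points returns the sum" behaviour in the docstring.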





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-12 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49495742
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,129 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
    'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data, 2), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p)
+0
+>>> model.k
+4
+>>> model.computeCost(p)
+0.0
+
+.. versionadded:: 2.0.0
+"""
+
+def __init__(self, java_model):
+super(BisectingKMeansModel, self).__init__(java_model)
+self.centers = [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy
+arrays."""
+return self.centers
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster that each of the points belongs to in this
+model.
+
+:param x: the point (or RDD of points) to determine
+  compute the clusters for.
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("predict", vecs)
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('2.0.0')
+def computeCost(self, x):
+"""
+Return the Bisecting K-means cost (sum of squared distances of
+points to their nearest center) for this model on the given
+data. If provided with an RDD of points returns the sum.
+
+:param point: the point or RDD of points to compute the cost(s).
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("computeCost", vecs)
+
+return self.call("computeCost", _convert_to_vector(x))
+
+
+class BisectingKMeans(object):
+"""
+.. note:: Experimental
+
+A bisecting k-means algorithm based on the paper "A comparison of
+document clustering techniques" by Steinbach, Karypis, and Kumar,
+with modification to fit Spark.
+The algorithm starts from a single cluster that contains all points.
+Iteratively it finds divisible clusters on the bottom level and
+bisects each of them using k-means, until there are `k` leaf
+clusters in total or no leaf clusters are divisible.
+The bisecting steps of clusters on the same level are grouped
+together to increase parallelism. If bisecting all divisible
+clusters on the bottom level would result more than `k` leaf
+clusters, larger clusters get higher priority.
+
+Based on U{http://bit.ly/1OTnFP1} Steinbach, Karypis, and Kumar, A
--- End diff --

Ah ok, I noticed we were using bit.ly links elsewhere in the pydocs since 
the 72 char limit is pretty short. I'll switch this back to the old URL.
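The docstring quoted in the diff above outlines the bisecting strategy: start from a single cluster containing all points, then repeatedly bisect leaf clusters with k-means until there are `k` leaves. As an illustration only (the `kmeans2` and `bisecting_kmeans` names are hypothetical, and Spark additionally groups same-level splits and prioritizes by cost rather than by simple size), a toy NumPy version:

```python
import numpy as np

def kmeans2(points, iters=10, seed=0):
    # Plain 2-means, used here as the bisection step (toy version).
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), 2, replace=False)]
    for _ in range(iters):
        d2 = ((points[:, None] - centers[None]) ** 2).sum(-1)
        labels = d2.argmin(1)
        for j in range(2):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(0)
    return labels

def bisecting_kmeans(points, k):
    # Start from one cluster; repeatedly bisect the largest divisible
    # leaf until there are k leaves (a simplified priority rule).
    leaves = [points]
    while len(leaves) < k:
        leaves.sort(key=len, reverse=True)
        big = leaves.pop(0)
        if len(big) < 2:  # no divisible leaf left
            leaves.append(big)
            break
        labels = kmeans2(big)
        leaves.extend([big[labels == 0], big[labels == 1]])
    return leaves

data = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
print(len(bisecting_kmeans(data, 4)))  # -> 4
```

With `k=4` on these four points, every point ends up in its own leaf cluster, matching the `model.k == 4` doctest in the diff.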





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49493664
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -136,7 +258,10 @@ def predict(self, x):
 def computeCost(self, rdd):
 """
 Return the K-means cost (sum of squared distances of points to
-their nearest center) for this model on the given data.
+their nearest center) for this model on the given
+data.
+
+:param point: the point or RDD of points to compute the cost(s).
--- End diff --

This is only for RDDs, not single points.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49493668
  
--- Diff: python/pyspark/mllib/tests.py ---
@@ -419,6 +419,17 @@ class ListTests(MLlibTestCase):
 as NumPy arrays.
 """
 
+def test_bisecting_kmeans(self):
+from pyspark.mllib.clustering import BisectingKMeans
+data = array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2)
+bskm = BisectingKMeans()
+model = bskm.train(sc.parallelize(data, 2), k=4)
+p = array([0.0, 0.0])
+rdd_p = self.sc.parallelize([p])
+self.assertEqual(model.predict(p), model.predict(rdd_p).first())
+self.assertEqual(model.computeCost(p), model.computeCost(rdd_p))
--- End diff --

I'm surprised this works.  Shouldn't you have to call first() on the RDD?  
IIRC, assertEqual is from unittest, which won't understand RDDs.  (I also want 
to make sure this is being run.)
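For what the test above is comparing: `predict` on a single point returns the index of the nearest cluster center, while the RDD path returns an RDD of such indices (hence the `.first()`). Conceptually, the single-point behaviour is just a nearest-center lookup; a sketch with a hypothetical `predict` helper, not Spark's code:

```python
import numpy as np

def predict(point, centers):
    # Index of the center closest to the point (squared Euclidean
    # distance) -- the single-point behaviour the test compares
    # against the RDD path. Illustrative only.
    d2 = ((np.asarray(centers) - np.asarray(point)) ** 2).sum(axis=1)
    return int(d2.argmin())

centers = [np.array([0.5, 0.5]), np.array([8.5, 8.5])]
print(predict([0.0, 0.0], centers))  # -> 0
```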





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49501929
  
--- Diff: python/pyspark/mllib/tests.py ---
@@ -419,6 +419,17 @@ class ListTests(MLlibTestCase):
 as NumPy arrays.
 """
 
+def test_bisecting_kmeans(self):
+from pyspark.mllib.clustering import BisectingKMeans
+data = array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2)
+bskm = BisectingKMeans()
+model = bskm.train(sc.parallelize(data, 2), k=4)
+p = array([0.0, 0.0])
+rdd_p = self.sc.parallelize([p])
+self.assertEqual(model.predict(p), model.predict(rdd_p).first())
+self.assertEqual(model.computeCost(p), model.computeCost(rdd_p))
--- End diff --

Um nevermind...that was silly





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-171047719
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-12 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-171047588
  
**[Test build #49254 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49254/consoleFull)** for PR 10150 at commit [`ba5b467`](https://github.com/apache/spark/commit/ba5b467628ca9e5d27af8a5d2a7bd52ea242c03c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-12 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-171047722
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49254/
Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-12 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-171036044
  
**[Test build #49254 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49254/consoleFull)** for PR 10150 at commit [`ba5b467`](https://github.com/apache/spark/commit/ba5b467628ca9e5d27af8a5d2a7bd52ea242c03c).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-12 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-171005137
  
Thanks for adding the unit test!  I just had a few comments.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-12 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49494004
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,129 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
    'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data, 2), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p)
+0
+>>> model.k
+4
+>>> model.computeCost(p)
+0.0
+
+.. versionadded:: 2.0.0
+"""
+
+def __init__(self, java_model):
+super(BisectingKMeansModel, self).__init__(java_model)
+self.centers = [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy
+arrays."""
+return self.centers
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster that each of the points belongs to in this
+model.
+
+:param x: the point (or RDD of points) to determine
+  compute the clusters for.
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("predict", vecs)
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('2.0.0')
+def computeCost(self, x):
+"""
+Return the Bisecting K-means cost (sum of squared distances of
+points to their nearest center) for this model on the given
+data. If provided with an RDD of points returns the sum.
+
+:param point: the point or RDD of points to compute the cost(s).
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("computeCost", vecs)
+
+return self.call("computeCost", _convert_to_vector(x))
+
+
+class BisectingKMeans(object):
+"""
+.. note:: Experimental
+
+A bisecting k-means algorithm based on the paper "A comparison of
+document clustering techniques" by Steinbach, Karypis, and Kumar,
+with modification to fit Spark.
+The algorithm starts from a single cluster that contains all points.
+Iteratively it finds divisible clusters on the bottom level and
+bisects each of them using k-means, until there are `k` leaf
+clusters in total or no leaf clusters are divisible.
+The bisecting steps of clusters on the same level are grouped
+together to increase parallelism. If bisecting all divisible
+clusters on the bottom level would result more than `k` leaf
+clusters, larger clusters get higher priority.
+
+Based on U{http://bit.ly/1OTnFP1} Steinbach, Karypis, and Kumar, A
--- End diff --

I'd prefer to keep the original link.  Bitly links might not make people 
happy since it's less clear what you're linking to.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-12 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49503542
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,129 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
    'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+    """
+    .. note:: Experimental
+
+    A clustering model derived from the bisecting k-means method.
+
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.train(sc.parallelize(data, 2), k=4)
+    >>> p = array([0.0, 0.0])
+    >>> model.predict(p)
+    0
+    >>> model.k
+    4
+    >>> model.computeCost(p)
+    0.0
+
+    .. versionadded:: 2.0.0
+    """
+
+    def __init__(self, java_model):
+        super(BisectingKMeansModel, self).__init__(java_model)
+        self.centers = [c.toArray() for c in self.call("clusterCenters")]
+
+    @property
+    @since('2.0.0')
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy
+        arrays."""
+        return self.centers
+
+    @property
+    @since('2.0.0')
+    def k(self):
+        """Get the number of clusters."""
+        return self.call("k")
+
+    @since('2.0.0')
+    def predict(self, x):
+        """
+        Find the cluster that each of the points belongs to in this
+        model.
+
+        :param x: the point (or RDD of points) for which to
+          determine the cluster(s).
+        """
+        if isinstance(x, RDD):
+            vecs = x.map(_convert_to_vector)
+            return self.call("predict", vecs)
+
+        x = _convert_to_vector(x)
+        return self.call("predict", x)
+
+    @since('2.0.0')
+    def computeCost(self, x):
+        """
+        Return the Bisecting K-means cost (sum of squared distances of
+        points to their nearest center) for this model on the given
+        data. If provided with an RDD of points, returns the sum.
+
+        :param x: the point (or RDD of points) for which to compute
+          the cost(s).
+        """
+        if isinstance(x, RDD):
+            vecs = x.map(_convert_to_vector)
+            return self.call("computeCost", vecs)
+
+        return self.call("computeCost", _convert_to_vector(x))
+
+
+class BisectingKMeans(object):
+    """
+    .. note:: Experimental
+
+    A bisecting k-means algorithm based on the paper "A comparison of
+    document clustering techniques" by Steinbach, Karypis, and Kumar,
+    with modification to fit Spark.
+    The algorithm starts from a single cluster that contains all points.
+    Iteratively it finds divisible clusters on the bottom level and
+    bisects each of them using k-means, until there are `k` leaf
+    clusters in total or no leaf clusters are divisible.
+    The bisecting steps of clusters on the same level are grouped
+    together to increase parallelism. If bisecting all divisible
+    clusters on the bottom level would result in more than `k` leaf
+    clusters, larger clusters get higher priority.
+
+    Based on U{http://bit.ly/1OTnFP1} Steinbach, Karypis, and Kumar, A
--- End diff --

Ah, actually the original URL is over 72 characters. Since we use bit.ly
links elsewhere, do you think it would be OK to keep it here?
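The bisecting strategy described in the docstring above (start from one cluster; repeatedly split the largest divisible cluster with 2-means until `k` leaf clusters exist) can be sketched in plain NumPy. This is an illustrative single-machine sketch, not Spark's distributed implementation; the helper names are made up:

```python
import numpy as np

def two_means(points, n_iter=20, seed=0):
    """Split one cluster into two with a few plain Lloyd iterations."""
    rng = np.random.default_rng(seed)
    centers = points[rng.choice(len(points), 2, replace=False)]
    for _ in range(n_iter):
        # assign each point to the nearer of the two centers
        d = ((points[:, None, :] - centers[None, :, :]) ** 2).sum(-1)
        labels = d.argmin(1)
        for j in (0, 1):
            if (labels == j).any():
                centers[j] = points[labels == j].mean(0)
    return [points[labels == j] for j in (0, 1) if (labels == j).any()]

def bisecting_kmeans(points, k):
    """Start from a single cluster; repeatedly bisect, largest first."""
    clusters = [points]
    while len(clusters) < k:
        # larger clusters get higher priority, as in the docstring
        clusters.sort(key=len, reverse=True)
        target = clusters.pop(0)
        if len(target) < 2:          # no divisible cluster is left
            clusters.append(target)
            break
        clusters.extend(two_means(target))
    return [c.mean(0) for c in clusters]

data = np.array([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0], [8.0, 9.0]])
centers = bisecting_kmeans(data, k=2)
print(len(centers))  # 2
```

On this well-separated toy data (the same points as the doctest) the two leaf centers converge to the means of `{[0,0],[1,1]}` and `{[9,8],[8,9]}`.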


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastruct...@apache.org or file a JIRA ticket
with INFRA.
---

-
To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org
For additional commands, e-mail: reviews-h...@spark.apache.org



[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-11 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49389929
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("predict", vecs)
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('2.0.0')
+def computeCost(self, point):
+"""
+Return the Bisecting K-means cost (sum of squared distances of 
points to
+their nearest center) for this model on the given data.
+
+:param point: the point to compute the cost to
+"""
+return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+"""
+.. note:: Experimental
+
+A bisecting k-means algorithm based on the paper "A comparison of 
document clustering
--- End diff --

Yeah, it may be 72.  IMO it'd be nice to add a lint rule after the cleanup 
JIRA gets fixed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-170728609
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49181/
Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-170728607
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-170728335
  
**[Test build #49181 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49181/consoleFull)**
 for PR 10150 at commit 
[`c902d93`](https://github.com/apache/spark/commit/c902d93f34e0da9f240286e00b7f0d907334f7a9).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-11 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49391345
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("predict", vecs)
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('2.0.0')
+def computeCost(self, point):
+"""
+Return the Bisecting K-means cost (sum of squared distances of 
points to
+their nearest center) for this model on the given data.
+
+:param point: the point to compute the cost to
+"""
+return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+"""
+.. note:: Experimental
+
+A bisecting k-means algorithm based on the paper "A comparison of 
document clustering
--- End diff --

Sounds good. I've got https://issues.apache.org/jira/browse/SPARK-12731 to 
track this and I'll add it to my next to do in my tools hacking time.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-170703468
  
**[Test build #49174 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49174/consoleFull)**
 for PR 10150 at commit 
[`0f17577`](https://github.com/apache/spark/commit/0f17577b0b08aff4e2bea775820086273cf7f169).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-170705376
  
**[Test build #49174 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49174/consoleFull)**
 for PR 10150 at commit 
[`0f17577`](https://github.com/apache/spark/commit/0f17577b0b08aff4e2bea775820086273cf7f169).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-170705382
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-11 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-170705386
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49174/
Test FAILed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-11 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-170707177
  
**[Test build #49181 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49181/consoleFull)**
 for PR 10150 at commit 
[`c902d93`](https://github.com/apache/spark/commit/c902d93f34e0da9f240286e00b7f0d907334f7a9).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49249961
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("predict", vecs)
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('2.0.0')
+def computeCost(self, point):
+"""
+Return the Bisecting K-means cost (sum of squared distances of 
points to
+their nearest center) for this model on the given data.
+
+:param point: the point to compute the cost to
+"""
+return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
--- End diff --

inherit from object





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49253655
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
--- End diff --

Agreed. This is, however, the same text as used in KMeansModel, so I'll
update that one's docstring as well.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49253296
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("predict", vecs)
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('2.0.0')
+def computeCost(self, point):
+"""
+Return the Bisecting K-means cost (sum of squared distances of 
points to
+their nearest center) for this model on the given data.
+
+:param point: the point to compute the cost to
+"""
+return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+"""
+.. note:: Experimental
+
+A bisecting k-means algorithm based on the paper "A comparison of 
document clustering
--- End diff --

Are we sure on the 74? Looking at pep8/pep257, it says 72 (although we
extended the length for code lines, so maybe we changed that too). We could
try to add a lint rule for this in the future.
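One hypothetical way to express such a rule: pycodestyle's `max-doc-length` option enables the W505 ("doc line too long") check, which flake8 can pick up from project config. A sketch of what that might look like (assuming the project's 100-character code-line limit; not actual Spark configuration):

```ini
[flake8]
# code lines may run to 100 chars, per the project's Python style,
# but docstring/comment lines are capped at 72 (enables W505)
max-line-length = 100
max-doc-length = 72
```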





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49255412
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("predict", vecs)
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('2.0.0')
+def computeCost(self, point):
+"""
+Return the Bisecting K-means cost (sum of squared distances of 
points to
+their nearest center) for this model on the given data.
+
+:param point: the point to compute the cost to
+"""
+return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+"""
+.. note:: Experimental
+
+A bisecting k-means algorithm based on the paper "A comparison of 
document clustering
--- End diff --

Also, we have ~380 docstring lines over a length of 72; I'll file a cleanup
JIRA for this.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49251302
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
--- End diff --

Specify the number of partitions for sc.parallelize; not doing so has caused
flaky tests in the past (because of randomization interacting with
partitioning).
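The flakiness stems from the fact that how `parallelize` slices a local collection depends on the partition count, which in turn changes any per-partition sampling. A rough illustration of the slicing (mirroring, to my understanding, the `data[i*n//numSlices : (i+1)*n//numSlices]` formula Spark uses; this is not Spark's actual code):

```python
def parallelize_chunks(data, num_slices):
    # Split a local list into num_slices contiguous partitions the way
    # Spark's parallelize does: slice i holds
    # data[i*n // num_slices : (i+1)*n // num_slices].
    n = len(data)
    return [data[(i * n) // num_slices:((i + 1) * n) // num_slices]
            for i in range(num_slices)]

data = list(range(4))
print(parallelize_chunks(data, 2))  # [[0, 1], [2, 3]]
print(parallelize_chunks(data, 3))  # [[0], [1], [2, 3]]
```

With the partition count pinned (e.g. `sc.parallelize(data, 2)`), the chunking, and hence any per-partition random initialization, is identical on every run.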





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-170156045
  
@holdenk Thanks for the PR!  That's all for now.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-170149735
  
@yanboliang Thanks for reviewing!  I'll review now too





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49251289
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
--- End diff --

It'd be nice to store the centers right after training the model.  I could 
imagine users calling this method within a closure.
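The concern above can be illustrated with a plain-Python sketch (class and names hypothetical, not the actual Spark wrapper): if the centers are cached as ordinary Python data right after training, a closure can call predict() without ever touching the JVM-backed object.

```python
class LocalCenterModel:
    """Illustrative sketch only: caches cluster centers as plain Python
    lists at construction time, so a closure shipped to workers can call
    predict() without needing the JVM-backed wrapper."""

    def __init__(self, centers):
        # Store centers eagerly, once, instead of fetching them from the
        # JVM on every clusterCenters access.
        self.centers = [list(c) for c in centers]

    def predict(self, point):
        # Assign the point to the nearest center by squared Euclidean
        # distance, computed purely locally.
        best, best_dist = -1, float("inf")
        for i, c in enumerate(self.centers):
            d = sum((a - b) ** 2 for a, b in zip(c, point))
            if d < best_dist:
                best, best_dist = i, d
        return best


model = LocalCenterModel([[0.0, 0.0], [8.5, 8.5]])
print(model.predict([1.0, 1.0]))  # -> 0
print(model.predict([9.0, 8.0]))  # -> 1
```

Because `self.centers` is ordinary picklable data, the method is safe to use inside an RDD closure, which is exactly the scenario the review comment anticipates.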





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49251287
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -120,6 +120,23 @@ private[python] class PythonMLLibAPI extends 
Serializable {
   }
 
   /**
+   * Java stub for Python mllib BisectingKMeans.run()
+   */
+  def trainBisectingKMeans(
+data: JavaRDD[Vector],
--- End diff --

fix indentation





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49251292
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("predict", vecs)
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('2.0.0')
+def computeCost(self, point):
--- End diff --

It'd be nice to support RDDs here too.
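A local sketch of what an RDD-accepting computeCost would aggregate (function name and shapes hypothetical): the bisecting k-means cost is the sum, over all points, of the squared distance to each point's nearest center.

```python
def compute_cost(points, centers):
    """Sum of squared distances from each point to its nearest center.
    Works on any iterable of points; a distributed overload would map
    the per-point minimum over an RDD and sum the results."""
    total = 0.0
    for p in points:
        # Distance to the nearest center for this point.
        total += min(
            sum((a - b) ** 2 for a, b in zip(c, p)) for c in centers
        )
    return total


centers = [[0.0, 0.0], [8.0, 8.0]]
# Costs per point: 0.0, 2.0, 1.0 -> total 3.0
print(compute_cost([[0.0, 0.0], [1.0, 1.0], [9.0, 8.0]], centers))  # -> 3.0
```

The single-point form in the diff is the special case of a one-element iterable; supporting RDDs is the same computation with the outer loop distributed.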





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49251291
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
--- End diff --

Confusing doc; reword.  Also fix indentation on next line.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49251290
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
--- End diff --

This sounds like 1 point only.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49251288
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
--- End diff --

I'd write this as more of an example than a unit test.  It's good to 
exercise all functionality, but unit test code should go in tests.py.  (We have 
been inconsistent about this, but it'd be good to improve.)
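A generic illustration of the distinction being drawn (not Spark code; the function is hypothetical): a docstring example should read like usage, while exhaustive checks belong in a separate test file. Python's doctest module runs the former.

```python
import doctest


def nearest(point, centers):
    """Return the index of the center closest to ``point``.

    Example-style doctest: it demonstrates the API rather than
    exercising edge cases.

    >>> nearest((0.0, 0.0), [(0.0, 0.0), (9.0, 9.0)])
    0
    """
    dists = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centers]
    return dists.index(min(dists))


# Exhaustive edge-case assertions would live in tests.py instead.
result = doctest.testmod()
print(result.failed)  # -> 0
```

Keeping doctests short and readable is the point of the review comment; the doctest runner still verifies them, so the example doubles as a smoke test without becoming a unit-test suite.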





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49251293
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("predict", vecs)
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('2.0.0')
+def computeCost(self, point):
+"""
+Return the Bisecting K-means cost (sum of squared distances of 
points to
+their nearest center) for this model on the given data.
+
+:param point: the point to compute the cost to
+"""
+return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+"""
+.. note:: Experimental
+
+A bisecting k-means algorithm based on the paper "A comparison of 
document clustering
--- End diff --

I believe we try to limit doc lines in Python to <= 80 chars (unlike code, 
which is <= 100 chars).  Could you please update this and other parts?





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-08 Thread jkbradley
Github user jkbradley commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49252449
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> p = array([0.0, 0.0])
+>>> model.predict(p) == model.predict(p)
+True
+>>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+vecs = x.map(_convert_to_vector)
+return self.call("predict", vecs)
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('2.0.0')
+def computeCost(self, point):
+"""
+Return the Bisecting K-means cost (sum of squared distances of 
points to
+their nearest center) for this model on the given data.
+
+:param point: the point to compute the cost to
+"""
+return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+"""
+.. note:: Experimental
+
+A bisecting k-means algorithm based on the paper "A comparison of 
document clustering
--- End diff --

Update: It should actually be 74 chars.  You can check with ```pydoc 
pyspark``` from the spark/python directory and changing the terminal size to 80 
chars wide.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49140391
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,116 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+return x.map(self.predict(x))
--- End diff --

Ah yes, it should be; I'll add a docstring test for this method.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49149092
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,116 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+return x.map(self.predict(x))
--- End diff --

Ah, it seems that the JavaModelWrapper call method being used won't work on the 
workers. I'll have to port the predict method over.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49140159
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,116 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+return x.map(self.predict(x))
--- End diff --

Also, maybe you can add a test for this case in the docstring.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread thunterdb
Github user thunterdb commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r49140117
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,116 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.train(sc.parallelize(data), k=4)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.train(sc.parallelize(data), k=2)
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 2.0.0
+"""
+
+@property
+@since('2.0.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('2.0.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('2.0.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+return x.map(self.predict(x))
--- End diff --

I am not sure I understand this line, shouldn't it be `x.map(self.predict)`?
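The bug being pointed out can be shown with the built-in map() as a stand-in for RDD.map (the predict function here is a hypothetical pure-Python one): `x.map(self.predict(x))` calls predict first and passes its return value, an int, where map expects a callable.

```python
def predict(p):
    # Toy stand-in for a model's predict: cluster 0 left of x=5, else 1.
    return 0 if p[0] < 5 else 1


points = [[0.0, 0.0], [9.0, 8.0]]

# Buggy shape from the diff: predict is *called* immediately, so map()
# receives an int, not a function -- a TypeError at runtime.
# list(map(predict(points), points))  # TypeError: 'int' is not callable

# Correct shape: pass the function object itself.
print(list(map(predict, points)))  # -> [0, 1]
```

Even with the corrected shape, the later comments note a second problem: a bound method that calls into the JVM cannot be shipped to workers, which is why the PR ends up porting predict to operate on locally held centers.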





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-169623692
  
Looks fine to me. cc @jkbradley 





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread holdenk
Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-169864895
  
@thunterdb fixed the issue with predicting on RDDs :)





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-169865674
  
**[Test build #48995 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48995/consoleFull)**
 for PR 10150 at commit 
[`dc1c885`](https://github.com/apache/spark/commit/dc1c885ee3675087c4ccf6c8113e1d74350c1aac).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-169866477
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-169866478
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48994/
Test FAILed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-169874290
  
**[Test build #48995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48995/consoleFull)** for PR 10150 at commit [`dc1c885`](https://github.com/apache/spark/commit/dc1c885ee3675087c4ccf6c8113e1d74350c1aac).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-169874558
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48995/
Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-07 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-169874556
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-168931946
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48742/
Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-168931761
  
**[Test build #48742 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48742/consoleFull)** for PR 10150 at commit [`0310efe`](https://github.com/apache/spark/commit/0310efeec1a202733b40a50085178ec1b1d97409).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-168931943
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-05 Thread holdenk
Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-169095113
  
@yanboliang I've added the since annotations.
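For context on the annotations being discussed: PySpark provides a `since` decorator that records the version a public API was added in by appending a `versionadded` note to the method's docstring. A rough, minimal reimplementation of the idea (not PySpark's actual code, and `BisectingKMeansExample` is an illustrative name) looks like this:

```python
def since(version):
    """Minimal stand-in for PySpark's since decorator: record the version
    an API was added in by appending a note to its docstring."""
    def decorator(f):
        f.__doc__ = (f.__doc__ or "").rstrip() + (
            "\n\n.. versionadded:: %s\n" % version)
        return f
    return decorator


class BisectingKMeansExample:
    """Illustrative class; the real annotations live in pyspark.mllib.clustering."""

    @since("2.0.0")
    def setK(self, k):
        """Set the number of leaf clusters."""
        self.k = k
        return self
```

Sphinx then renders the appended `.. versionadded::` directive in the generated API docs, which is why the annotation lives in the docstring rather than in separate metadata.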





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-04 Thread yanboliang
Github user yanboliang commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-168643020
  
@holdenk Sorry for the late response, and thanks for the updates.
The PR looks good to me; note that we now add ```since('2.0.0')``` to public classes and functions.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-04 Thread holdenk
Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-168919548
  
Sounds good. This PR was made back before the 1.6 branch was cut, so I didn't put any annotations on it. I'll update them.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2016-01-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-168922090
  
**[Test build #48742 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48742/consoleFull)** for PR 10150 at commit [`0310efe`](https://github.com/apache/spark/commit/0310efeec1a202733b40a50085178ec1b1d97409).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-168054823
  
**[Test build #48493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48493/consoleFull)** for PR 10150 at commit [`57471e6`](https://github.com/apache/spark/commit/57471e676c982285718bc4e3161932dc1509695c).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-168062435
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-30 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-168062436
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48493/
Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-30 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-168062281
  
**[Test build #48493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48493/consoleFull)** for PR 10150 at commit [`57471e6`](https://github.com/apache/spark/commit/57471e676c982285718bc4e3161932dc1509695c).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-166513946
  
Merged build finished. Test FAILed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-166513927
  
**[Test build #48163 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48163/consoleFull)** for PR 10150 at commit [`fa6367c`](https://github.com/apache/spark/commit/fa6367c03cfe9734505d621c38b6c9e90f1e598b).
 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-21 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-166513365
  
**[Test build #48163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48163/consoleFull)** for PR 10150 at commit [`fa6367c`](https://github.com/apache/spark/commit/fa6367c03cfe9734505d621c38b6c9e90f1e598b).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-21 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-166513950
  
Test FAILed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48163/
Test FAILed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-19 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r48097828
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,158 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-           'PowerIterationClusteringModel', 'PowerIterationClustering',
-           'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+           'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+           'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
            'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+    """
+    .. note:: Experimental
+
+    A clustering model derived from the bisecting k-means method.
+
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.run(sc.parallelize(data))
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0]))
+    True
+    >>> model.k
+    4
+    >>> model.computeCost(array([0.0, 0.0]))
+    0.0
+    >>> model.k == len(model.clusterCenters)
+    True
+    >>> model = bskm.setK(2).run(sc.parallelize(data))
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
+    True
+    >>> model.k
+    2
+    """
+
+    @property
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy arrays."""
+        return [c.toArray() for c in self.call("clusterCenters")]
+
+    @property
+    def k(self):
+        """Get the number of clusters."""
+        return self.call("k")
+
+    def predict(self, x):
+        """
+        Find the cluster to which x belongs in this model.
+
+        :param x: Either the point to determine the cluster for or an RDD of points to determine
+            the clusters for.
+        """
+        if isinstance(x, RDD):
+            return x.map(self.predict)
+
+        x = _convert_to_vector(x)
+        return self.call("predict", x)
+
+    def computeCost(self, point):
+        """
+        Return the Bisecting K-means cost (sum of squared distances of points to
+        their nearest center) for this model on the given data.
+
+        :param point: the point to compute the cost to
+        """
+        return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+    """
+    A bisecting k-means algorithm based on the paper "A comparison of document clustering
+    techniques" by Steinbach, Karypis, and Kumar, with modification to fit Spark.
+    The algorithm starts from a single cluster that contains all points.
+    Iteratively it finds divisible clusters on the bottom level and bisects each of them using
+    k-means, until there are `k` leaf clusters in total or no leaf clusters are divisible.
+    The bisecting steps of clusters on the same level are grouped together to increase parallelism.
+    If bisecting all divisible clusters on the bottom level would result in more than `k` leaf
+    clusters, larger clusters get higher priority.
+
+    Based on [[http://glaros.dtc.umn.edu/gkhome/fetch/papers/docclusterKDDTMW00.pdf
+    Steinbach, Karypis, and Kumar, A comparison of document clustering techniques,
+    KDD Workshop on Text Mining, 2000.]]
+    """
+    def __init__(self):
+        self.k = 4
+        self.maxIterations = 20
+        self.minDivisibleClusterSize = 1.0
+        self.seed = -1888008604  # classOf[BisectingKMeans].getName.##
+
+    def setK(self, k):
+        """
+        Set the number of leaf clusters.
+
+        :param k: the desired number of leaf clusters (default: 4). The actual number could be
+            smaller if there are no divisible leaf clusters.
+        """
+        self.k = k
+        return self
+
+    def getK(self):
+        """Return the desired number of leaf clusters."""
+        return self.k
+
+    def setMaxIterations(self, maxIterations):
+        """
+        Set the maximum number of iterations.
+
+        :param maxIterations: the max number of k-means iterations to split clusters (default: 20)
+        """
+        self.maxIterations = maxIterations
+        return self
+
+    def getMaxIterations(self):
+        """Return the maximum number of iterations."""
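The class docstring above summarizes the bisecting procedure. As a concrete illustration, here is a toy, driver-side sketch of that procedure in plain Python (no Spark): `two_means` and `bisecting_kmeans` are illustrative names, the sketch ignores `minDivisibleClusterSize` and seed handling, and the real implementation runs the bisecting steps of a level in parallel.

```python
import random


def two_means(points, iterations=20, seed=0):
    """Split one cluster into two with plain Lloyd's k-means (k = 2)."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    groups = ([], [])
    for _ in range(iterations):
        groups = ([], [])
        for p in points:
            # Assign each point to its nearest center.
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[dists.index(min(dists))].append(p)
        if not groups[0] or not groups[1]:
            return None  # degenerate split: treat the cluster as not divisible
        # Recompute each center as the coordinate-wise mean of its group.
        centers = [tuple(sum(xs) / len(g) for xs in zip(*g)) for g in groups]
    return groups


def bisecting_kmeans(points, k=4):
    """Grow leaf clusters by repeatedly bisecting a divisible leaf."""
    leaves = [list(points)]
    while len(leaves) < k:
        # Larger clusters get higher priority for the next bisection.
        leaves.sort(key=len, reverse=True)
        for i, leaf in enumerate(leaves):
            split = two_means(leaf) if len(leaf) > 1 else None
            if split is not None:
                leaves.pop(i)
                leaves.extend(list(g) for g in split)
                break
        else:
            break  # no leaf is divisible; stop with fewer than k leaves
    return leaves
```

On the four points from the doctest above, `bisecting_kmeans(data, k=2)` separates the two natural pairs, and `k=4` (the default) drills down to singleton leaves.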

[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-15 Thread 3ourroom
Github user 3ourroom commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r47619760
  
--- Diff: python/pyspark/mllib/clustering.py ---

[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-15 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r47618583
  
--- Diff: python/pyspark/mllib/clustering.py ---

[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-14 Thread holdenk
Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-164526117
  
ping?





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-163791273
  
**[Test build #47548 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47548/consoleFull)** for PR 10150 at commit [`7fe3152`](https://github.com/apache/spark/commit/7fe3152692f56deae53bc3bd89887a5a2c2ffe5e).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-163791405
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-10 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-163791408
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47548/
Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-10 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-163783796
  
**[Test build #47548 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47548/consoleFull)**
 for PR 10150 at commit 
[`7fe3152`](https://github.com/apache/spark/commit/7fe3152692f56deae53bc3bd89887a5a2c2ffe5e).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-10 Thread holdenk
Github user holdenk commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r47298375
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,175 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.run(sc.parallelize(data))
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
--- End diff --

It's a sanity check - I can take it out if you want.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-09 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r47070153
  
--- Diff: 
mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -121,6 +121,22 @@ private[python] class PythonMLLibAPI extends 
Serializable {
   }
 
   /**
+   * Java stub for Python mllib BisectingKMeans.run()
+   */
+  def trainBisectingKMeans(
+data: JavaRDD[Vector],
+k: Int,
+maxIterations: Int,
+minDivisibleClusterSize: Double,
+seed: Long): BisectingKMeansModel = {
+new BisectingKMeans()
+  .setK(k)
+  .setMaxIterations(maxIterations)
+  .setMinDivisibleClusterSize(minDivisibleClusterSize)
+  .setSeed(seed).run(data)
--- End diff --

nit: new line.
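The Scala stub above configures the estimator by chaining setters before calling `run`. A minimal pure-Python sketch of that chained-setter pattern (class and method names here are illustrative, not the actual PySpark wrapper):

```python
class BisectingKMeansParams:
    """Illustrative holder for the parameters the Java stub receives."""

    def __init__(self):
        # Defaults mirror the Scala signature shown in the diff above.
        self.k = 4
        self.maxIterations = 20
        self.minDivisibleClusterSize = 1.0
        self.seed = 0

    def setK(self, k):
        self.k = k
        return self  # returning self allows setter chaining

    def setMaxIterations(self, maxIterations):
        self.maxIterations = maxIterations
        return self

    def setMinDivisibleClusterSize(self, size):
        self.minDivisibleClusterSize = size
        return self

    def setSeed(self, seed):
        self.seed = seed
        return self
```

With this shape, `BisectingKMeansParams().setK(2).setSeed(7)` reads the same way as the Scala chain in the stub.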





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-09 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r47070681
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,175 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.run(sc.parallelize(data))
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.setK(2).run(sc.parallelize(data))
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 1.6.0
--- End diff --

I don't think this can make it into 1.6, so the version-related annotations should be removed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-09 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r47070564
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,175 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.run(sc.parallelize(data))
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
--- End diff --

I don't see the purpose of this test case. The same point is always assigned to the same center, under any conditions.
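The reviewer's objection can be illustrated without Spark: a stronger doctest would assert that nearby points share a cluster while distant points do not, rather than comparing a point's prediction to itself. A minimal nearest-center `predict` sketch (hypothetical helper, not the PySpark API):

```python
def predict(centers, point):
    # Assign a point to the index of its nearest center
    # (by squared Euclidean distance).
    return min(
        range(len(centers)),
        key=lambda i: sum((c - x) ** 2 for c, x in zip(centers[i], point)),
    )

centers = [[0.5, 0.5], [8.5, 8.5]]  # two well-separated centers
```

With these centers, `predict(centers, [0.0, 0.0]) == predict(centers, [1.0, 1.0])` holds while `predict(centers, [0.0, 0.0]) != predict(centers, [9.0, 8.0])`, which actually exercises the clustering rather than trivially comparing a point with itself.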





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-09 Thread yanboliang
Github user yanboliang commented on a diff in the pull request:

https://github.com/apache/spark/pull/10150#discussion_r47073596
  
--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,175 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, 
JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 
'GaussianMixture',
-   'PowerIterationClusteringModel', 'PowerIterationClustering',
-   'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 
'KMeans',
+   'GaussianMixtureModel', 'GaussianMixture', 
'PowerIterationClusteringModel',
+   'PowerIterationClustering', 'StreamingKMeans', 
'StreamingKMeansModel',
'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+"""
+.. note:: Experimental
+
+A clustering model derived from the bisecting k-means method.
+
+>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+>>> bskm = BisectingKMeans()
+>>> model = bskm.run(sc.parallelize(data))
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 
0.0]))
+True
+>>> model.k
+4
+>>> model.computeCost(array([0.0, 0.0]))
+0.0
+>>> model.k == len(model.clusterCenters)
+True
+>>> model = bskm.setK(2).run(sc.parallelize(data))
+>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 
1.0]))
+True
+>>> model.k
+2
+
+.. versionadded:: 1.6.0
+"""
+
+@property
+@since('1.6.0')
+def clusterCenters(self):
+"""Get the cluster centers, represented as a list of NumPy 
arrays."""
+return [c.toArray() for c in self.call("clusterCenters")]
+
+@property
+@since('1.6.0')
+def k(self):
+"""Get the number of clusters"""
+return self.call("k")
+
+@since('1.6.0')
+def predict(self, x):
+"""
+Find the cluster to which x belongs in this model.
+
+:param x: Either the point to determine the cluster for or an RDD 
of points to determine
+the clusters for.
+"""
+if isinstance(x, RDD):
+return x.map(self.predict(x))
+
+x = _convert_to_vector(x)
+return self.call("predict", x)
+
+@since('1.6.0')
+def computeCost(self, point):
+"""
+Return the Bisecting K-means cost (sum of squared distances of 
points to
+their nearest center) for this model on the given data.
+
+:param point: the point to compute the cost to
+"""
+return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+"""
+A bisecting k-means algorithm based on the paper "A comparison of 
document clustering
+techniques" by Steinbach, Karypis, and Kumar, with modification to fit 
Spark.
+The algorithm starts from a single cluster that contains all points.
+Iteratively it finds divisible clusters on the bottom level and 
bisects each of them using
+k-means, until there are `k` leaf clusters in total or no leaf 
clusters are divisible.
+The bisecting steps of clusters on the same level are grouped together 
to increase parallelism.
+If bisecting all divisible clusters on the bottom level would result 
more than `k` leaf
+clusters, larger clusters get higher priority.
+
+Based on 
[[http://glaros.dtc.umn.edu/gkhome/fetch/papers/docclusterKDDTMW00.pdf
+Steinbach, Karypis, and Kumar, A comparison of document clustering 
techniques,
+KDD Workshop on Text Mining, 2000.]]
+
+.. versionadded:: 1.6.0
+"""
+def __init__(self):
+self.k = 4
+self.maxIterations = 20
+self.minDivisibleClusterSize = 1.0
+self.seed = 42
--- End diff --

I found that the default value of ```seed``` on the Scala side is ```classOf[BisectingKMeans].getName.##```; I think we should keep the two consistent.
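To match the Scala default (`classOf[BisectingKMeans].getName.##`, i.e. the JVM hash of the class name) rather than an arbitrary literal like 42, the Python side could derive its default seed the same way. A sketch, assuming we replicate Java's `String.hashCode` (Python's built-in `hash` is randomized per process, so it is not a suitable stand-in):

```python
def jvm_string_hash(s):
    # Replicates java.lang.String.hashCode: h = 31*h + char code,
    # with 32-bit signed overflow semantics.
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h - 0x100000000 if h >= 0x80000000 else h

# Hypothetical default, mirroring the Scala side's class-name-derived seed.
DEFAULT_SEED = jvm_string_hash(
    "org.apache.spark.mllib.clustering.BisectingKMeans")
```

This keeps the Python default deterministic and aligned with the JVM value, at the cost of hard-coding the class name string.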





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-162257194
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-05 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-162257197
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47232/
Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-162257155
  
**[Test build #47232 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47232/consoleFull)**
 for PR 10150 at commit 
[`0fd962c`](https://github.com/apache/spark/commit/0fd962ca67aaa923a5087f592a2acb34d4c89d07).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-05 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-162255249
  
**[Test build #47232 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47232/consoleFull)**
 for PR 10150 at commit 
[`0fd962c`](https://github.com/apache/spark/commit/0fd962ca67aaa923a5087f592a2acb34d4c89d07).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-04 Thread holdenk
Github user holdenk commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-162101912
  
Just noticed an issue with the cost function; will close and reopen once it's fixed and tested.
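For reference, the quantity `computeCost` is meant to return - the sum of squared distances from each point to its nearest center - can be sketched in plain Python (illustrative helper, not the actual implementation being fixed):

```python
def compute_cost(centers, points):
    # WSSSE: for each point, take the squared Euclidean distance to its
    # nearest center, then sum over all points.
    return sum(
        min(sum((c - x) ** 2 for c, x in zip(center, p))
            for center in centers)
        for p in points
    )
```

For example, `compute_cost([[0.0, 0.0]], [[3.0, 4.0]])` gives `25.0` (squared distance 3**2 + 4**2), and the cost is zero whenever every point coincides with some center.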





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-04 Thread holdenk
Github user holdenk closed the pull request at:

https://github.com/apache/spark/pull/10150





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-04 Thread holdenk
GitHub user holdenk opened a pull request:

https://github.com/apache/spark/pull/10150

[SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k means

From the coverage issues for 1.6 : Add Python API for 
mllib.clustering.BisectingKMeans.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/holdenk/spark 
SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10150.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10150


commit 9b95e944f943a31ea8e969faa80662bce1080bdd
Author: Holden Karau 
Date:   2015-12-02T16:07:39Z

Some progress, not a lot

commit 427e487ca3e2ad27b692d9acd40fbd8a9b726312
Author: Holden Karau 
Date:   2015-12-03T04:20:55Z

murh some murh. airplain code isn't very good but it distracts me - start 
adding a class for calling bisectingkmeans. I don't really like how 
BisectingKMeans is set up (its different from many of the others which is fnur) 
but trying to decide if I should make the python API more closely match the 
Scala API or match the rest of the Python API. These are questions for after 
I've slept perhaps.

commit f5a40c85a2b91b4c93a66db2c15164bb57db44d6
Author: Holden Karau 
Date:   2015-12-04T21:01:58Z

A bunch of works towards getting BisectingKMeans in PySpark

commit d3e4c1a6a19e8cb0d19bbe5feab48c7655d48a00
Author: Holden Karau 
Date:   2015-12-04T22:23:08Z

Add a bit more pydoc descriptions, fix the prediction call, and verify with 
different k







[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-162108002
  
Test PASSed.
Refer to this link for build results (access rights to CI server needed): 
https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47208/
Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-162107904
  
**[Test build #47208 has 
finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47208/consoleFull)**
 for PR 10150 at commit 
[`d3e4c1a`](https://github.com/apache/spark/commit/d3e4c1a6a19e8cb0d19bbe5feab48c7655d48a00).
 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds the following public classes _(experimental)_:
   * `class BisectingKMeansModel(JavaModelWrapper):`





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-04 Thread AmplabJenkins
Github user AmplabJenkins commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-162108001
  
Merged build finished. Test PASSed.





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-04 Thread SparkQA
Github user SparkQA commented on the pull request:

https://github.com/apache/spark/pull/10150#issuecomment-162101253
  
**[Test build #47208 has 
started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47208/consoleFull)**
 for PR 10150 at commit 
[`d3e4c1a`](https://github.com/apache/spark/commit/d3e4c1a6a19e8cb0d19bbe5feab48c7655d48a00).





[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...

2015-12-04 Thread holdenk
GitHub user holdenk reopened a pull request:

https://github.com/apache/spark/pull/10150

[SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k means

From the coverage issues for 1.6 : Add Python API for 
mllib.clustering.BisectingKMeans.

You can merge this pull request into a Git repository by running:

$ git pull https://github.com/holdenk/spark 
SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans

Alternatively you can review and apply these changes as the patch at:

https://github.com/apache/spark/pull/10150.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

This closes #10150


commit 9b95e944f943a31ea8e969faa80662bce1080bdd
Author: Holden Karau 
Date:   2015-12-02T16:07:39Z

Some progress, not a lot

commit 427e487ca3e2ad27b692d9acd40fbd8a9b726312
Author: Holden Karau 
Date:   2015-12-03T04:20:55Z

murh some murh. airplain code isn't very good but it distracts me - start 
adding a class for calling bisectingkmeans. I don't really like how 
BisectingKMeans is set up (its different from many of the others which is fnur) 
but trying to decide if I should make the python API more closely match the 
Scala API or match the rest of the Python API. These are questions for after 
I've slept perhaps.

commit f5a40c85a2b91b4c93a66db2c15164bb57db44d6
Author: Holden Karau 
Date:   2015-12-04T21:01:58Z

A bunch of works towards getting BisectingKMeans in PySpark

commit d3e4c1a6a19e8cb0d19bbe5feab48c7655d48a00
Author: Holden Karau 
Date:   2015-12-04T22:23:08Z

Add a bit more pydoc descriptions, fix the prediction call, and verify with 
different k

commit 868c4a7931834fe2bf85ccabe97a640f4bff4dc2
Author: Holden Karau 
Date:   2015-12-05T04:25:17Z

fix compute cost






