[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user asfgit closed the pull request at: https://github.com/apache/spark/pull/10150 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-172938259 LGTM Thanks for the PR! Merging with master
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-172764328 **[Test build #2406 has finished](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2406/consoleFull)** for PR 10150 at commit [`5eec54b`](https://github.com/apache/spark/commit/5eec54b9072e737d32a38efb8f1c101ae05b3044). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-172756623 **[Test build #2406 has started](https://amplab.cs.berkeley.edu/jenkins/job/NewSparkPullRequestBuilder/2406/consoleFull)** for PR 10150 at commit [`5eec54b`](https://github.com/apache/spark/commit/5eec54b9072e737d32a38efb8f1c101ae05b3044).
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-172234499 @jkbradley updated to long link
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-171887491 **[Test build #49442 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49442/consoleFull)** for PR 10150 at commit [`5eec54b`](https://github.com/apache/spark/commit/5eec54b9072e737d32a38efb8f1c101ae05b3044). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-171887744 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49442/ Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-171887741 Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-171882962 **[Test build #49442 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49442/consoleFull)** for PR 10150 at commit [`5eec54b`](https://github.com/apache/spark/commit/5eec54b9072e737d32a38efb8f1c101ae05b3044).
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-171491552 I'd prefer to use a long link. Thanks!
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49676469

--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,129 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream

-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-           'PowerIterationClusteringModel', 'PowerIterationClustering',
-           'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+           'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+           'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
            'LDA', 'LDAModel']


 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+    """
+    .. note:: Experimental
+
+    A clustering model derived from the bisecting k-means method.
+
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.train(sc.parallelize(data, 2), k=4)
+    >>> p = array([0.0, 0.0])
+    >>> model.predict(p)
+    0
+    >>> model.k
+    4
+    >>> model.computeCost(p)
+    0.0
+
+    .. versionadded:: 2.0.0
+    """
+
+    def __init__(self, java_model):
+        super(BisectingKMeansModel, self).__init__(java_model)
+        self.centers = [c.toArray() for c in self.call("clusterCenters")]
+
+    @property
+    @since('2.0.0')
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy
+        arrays."""
+        return self.centers
+
+    @property
+    @since('2.0.0')
+    def k(self):
+        """Get the number of clusters"""
+        return self.call("k")
+
+    @since('2.0.0')
+    def predict(self, x):
+        """
+        Find the cluster that each of the points belongs to in this
+        model.
+
+        :param x: the point (or RDD of points) to determine
+                  compute the clusters for.
+        """
+        if isinstance(x, RDD):
+            vecs = x.map(_convert_to_vector)
+            return self.call("predict", vecs)
+
+        x = _convert_to_vector(x)
+        return self.call("predict", x)
+
+    @since('2.0.0')
+    def computeCost(self, x):
+        """
+        Return the Bisecting K-means cost (sum of squared distances of
+        points to their nearest center) for this model on the given
+        data. If provided with an RDD of points returns the sum.
+
+        :param point: the point or RDD of points to compute the cost(s).
+        """
+        if isinstance(x, RDD):
+            vecs = x.map(_convert_to_vector)
+            return self.call("computeCost", vecs)
+
+        return self.call("computeCost", _convert_to_vector(x))
+
+
+class BisectingKMeans(object):
+    """
+    .. note:: Experimental
+
+    A bisecting k-means algorithm based on the paper "A comparison of
+    document clustering techniques" by Steinbach, Karypis, and Kumar,
+    with modification to fit Spark.
+    The algorithm starts from a single cluster that contains all points.
+    Iteratively it finds divisible clusters on the bottom level and
+    bisects each of them using k-means, until there are `k` leaf
+    clusters in total or no leaf clusters are divisible.
+    The bisecting steps of clusters on the same level are grouped
+    together to increase parallelism. If bisecting all divisible
+    clusters on the bottom level would result more than `k` leaf
+    clusters, larger clusters get higher priority.
+
+    Based on U{http://bit.ly/1OTnFP1} Steinbach, Karypis, and Kumar, A
--- End diff --
I think long lines are OK for links if needed. I'd prefer no bitly.
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-171473355 @jkbradley is the bit.ly link ok, or should I break the 72 char limit (since the discussion on the cleanup JIRA suggests it's maybe not something we care about enforcing)?
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-171047722 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49254/ Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-171047588 **[Test build #49254 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49254/consoleFull)** for PR 10150 at commit [`ba5b467`](https://github.com/apache/spark/commit/ba5b467628ca9e5d27af8a5d2a7bd52ea242c03c). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-171047719 Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-171036044 **[Test build #49254 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49254/consoleFull)** for PR 10150 at commit [`ba5b467`](https://github.com/apache/spark/commit/ba5b467628ca9e5d27af8a5d2a7bd52ea242c03c).
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49503542

--- Diff: python/pyspark/mllib/clustering.py ---
(diff identical to the one quoted above; the comment concerns this line:)
+    Based on U{http://bit.ly/1OTnFP1} Steinbach, Karypis, and Kumar, A
--- End diff --
ah actually the original URL is over 72 characters - since we use bit.ly links elsewhere do you think it would be ok to keep it here?
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49501929

--- Diff: python/pyspark/mllib/tests.py ---
@@ -419,6 +419,17 @@ class ListTests(MLlibTestCase):
         as NumPy arrays.
         """

+    def test_bisecting_kmeans(self):
+        from pyspark.mllib.clustering import BisectingKMeans
+        data = array([0.0, 0.0, 1.0, 1.0, 9.0, 8.0, 8.0, 9.0]).reshape(4, 2)
+        bskm = BisectingKMeans()
+        model = bskm.train(sc.parallelize(data, 2), k=4)
+        p = array([0.0, 0.0])
+        rdd_p = self.sc.parallelize([p])
+        self.assertEqual(model.predict(p), model.predict(rdd_p).first())
+        self.assertEqual(model.computeCost(p), model.computeCost(rdd_p))
--- End diff --
Um nevermind...that was silly
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49495742

--- Diff: python/pyspark/mllib/clustering.py ---
(diff identical to the one quoted above; the comment concerns this line:)
+    Based on U{http://bit.ly/1OTnFP1} Steinbach, Karypis, and Kumar, A
--- End diff --
Ah ok, I noticed we were using bit.ly links elsewhere in the pydocs since the 72 char limit is pretty short. I'll switch this back to the old URL.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49494004

--- Diff: python/pyspark/mllib/clustering.py ---
(diff identical to the one quoted above; the comment concerns this line:)
+    Based on U{http://bit.ly/1OTnFP1} Steinbach, Karypis, and Kumar, A
--- End diff --
I'd prefer to keep the original link. Bitly links might not make people happy since it's less clear what you're linking to.
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-171005137 Thanks for adding the unit test! I just had a few comments.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49493664

--- Diff: python/pyspark/mllib/clustering.py ---
@@ -136,7 +258,10 @@ def predict(self, x):
     def computeCost(self, rdd):
         """
         Return the K-means cost (sum of squared distances of points to
-        their nearest center) for this model on the given data.
+        their nearest center) for this model on the given
+        data.
+
+        :param point: the point or RDD of points to compute the cost(s).
--- End diff --
This is only for RDDs, not single points.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49493668

--- Diff: python/pyspark/mllib/tests.py ---
(diff identical to the one quoted above; the comment concerns these lines:)
+        self.assertEqual(model.predict(p), model.predict(rdd_p).first())
+        self.assertEqual(model.computeCost(p), model.computeCost(rdd_p))
--- End diff --
I'm surprised this works. Shouldn't you have to call first() on the RDD? IIRC, assertEqual is from unittest, which won't understand RDDs. (I also want to make sure this is being run.)
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170728607

Merged build finished. Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170728609

Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49181/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170728335

**[Test build #49181 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49181/consoleFull)** for PR 10150 at commit [`c902d93`](https://github.com/apache/spark/commit/c902d93f34e0da9f240286e00b7f0d907334f7a9).

 * This patch passes all tests.
 * This patch merges cleanly.
 * This patch adds no public classes.
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49391345

--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,120 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream
 
-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-           'PowerIterationClusteringModel', 'PowerIterationClustering',
-           'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+           'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+           'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
            'LDA', 'LDAModel']
 
 
 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+    """
+    .. note:: Experimental
+
+    A clustering model derived from the bisecting k-means method.
+
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.train(sc.parallelize(data), k=4)
+    >>> p = array([0.0, 0.0])
+    >>> model.predict(p) == model.predict(p)
+    True
+    >>> model.predict(sc.parallelize([p])).first() == model.predict(p)
+    True
+    >>> model.k
+    4
+    >>> model.computeCost(array([0.0, 0.0]))
+    0.0
+    >>> model.k == len(model.clusterCenters)
+    True
+    >>> model = bskm.train(sc.parallelize(data), k=2)
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
+    True
+    >>> model.k
+    2
+
+    .. versionadded:: 2.0.0
+    """
+
+    @property
+    @since('2.0.0')
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy arrays."""
+        return [c.toArray() for c in self.call("clusterCenters")]
+
+    @property
+    @since('2.0.0')
+    def k(self):
+        """Get the number of clusters"""
+        return self.call("k")
+
+    @since('2.0.0')
+    def predict(self, x):
+        """
+        Find the cluster to which x belongs in this model.
+
+        :param x: Either the point to determine the cluster for or an RDD of points to determine
+        the clusters for.
+        """
+        if isinstance(x, RDD):
+            vecs = x.map(_convert_to_vector)
+            return self.call("predict", vecs)
+
+        x = _convert_to_vector(x)
+        return self.call("predict", x)
+
+    @since('2.0.0')
+    def computeCost(self, point):
+        """
+        Return the Bisecting K-means cost (sum of squared distances of points to
+        their nearest center) for this model on the given data.
+
+        :param point: the point to compute the cost to
+        """
+        return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+    """
+    .. note:: Experimental
+
+    A bisecting k-means algorithm based on the paper "A comparison of document clustering
--- End diff --

Sounds good. I've got https://issues.apache.org/jira/browse/SPARK-12731 to track this, and I'll add it to my to-do list for my next tools hacking time.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49389929

--- Diff: python/pyspark/mllib/clustering.py ---
+    A bisecting k-means algorithm based on the paper "A comparison of document clustering
--- End diff --

Yeah, it may be 72. IMO it'd be nice to add a lint rule after the cleanup JIRA gets fixed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170707177

**[Test build #49181 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49181/consoleFull)** for PR 10150 at commit [`c902d93`](https://github.com/apache/spark/commit/c902d93f34e0da9f240286e00b7f0d907334f7a9).
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170705382

Merged build finished. Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170705376

**[Test build #49174 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49174/consoleFull)** for PR 10150 at commit [`0f17577`](https://github.com/apache/spark/commit/0f17577b0b08aff4e2bea775820086273cf7f169).

 * This patch **fails Python style tests**.
 * This patch merges cleanly.
 * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170705386

Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/49174/
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170703468

**[Test build #49174 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/49174/consoleFull)** for PR 10150 at commit [`0f17577`](https://github.com/apache/spark/commit/0f17577b0b08aff4e2bea775820086273cf7f169).
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49255412

--- Diff: python/pyspark/mllib/clustering.py ---
+    A bisecting k-means algorithm based on the paper "A comparison of document clustering
--- End diff --

Also, we have ~380 docstring lines over length 72. I'll file a cleanup JIRA for this.
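The lint rule discussed above could start as small as the sketch below. This is illustrative only, not an actual Spark lint script, and 72 is simply the limit under discussion; reported line numbers assume the docstring opens on the line the string literal starts:

```python
import ast

def long_docstring_lines(source, limit=72):
    """Yield (lineno, line) for docstring lines longer than `limit` chars."""
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, (ast.Module, ast.ClassDef,
                             ast.FunctionDef, ast.AsyncFunctionDef)):
            doc = ast.get_docstring(node, clean=False)
            if doc is None:
                continue
            # node.body[0] is the docstring expression itself
            start = node.body[0].lineno
            for offset, line in enumerate(doc.splitlines()):
                if len(line) > limit:
                    yield start + offset, line
```

Pointing this at each file under python/pyspark would give the per-line report a cleanup JIRA needs.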
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49253655

--- Diff: python/pyspark/mllib/clustering.py ---
+        Find the cluster to which x belongs in this model.
--- End diff --

Agreed; this is, however, the same text as used in KMeansModel, so I'll update that one's docstring as well.
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49253296

--- Diff: python/pyspark/mllib/clustering.py ---
+    A bisecting k-means algorithm based on the paper "A comparison of document clustering
--- End diff --

Are we sure about 74? Looking at PEP 8/PEP 257, they say 72 (although we extended the length for code lines, so maybe we changed that too). We could try to add a lint rule for this in the future.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49252449

--- Diff: python/pyspark/mllib/clustering.py ---
+    A bisecting k-means algorithm based on the paper "A comparison of document clustering
--- End diff --

Update: It should actually be 74 chars. You can check by running `pydoc pyspark` from the spark/python directory with the terminal set to 80 chars wide.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251302

--- Diff: python/pyspark/mllib/clustering.py ---
+    >>> model = bskm.train(sc.parallelize(data), k=4)
--- End diff --

Specify the number of partitions for sc.parallelize; not doing so has caused flaky tests in the past (because of randomization interacting with partitioning).
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170156045

@holdenk Thanks for the PR! That's all for now.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251291

--- Diff: python/pyspark/mllib/clustering.py ---
+        :param x: Either the point to determine the cluster for or an RDD of points to determine
+        the clusters for.
--- End diff --

Confusing doc; reword. Also fix the indentation on the next line.
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251293 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) +True +>>> model.predict(sc.parallelize([p])).first() == model.predict(p) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. 
+ +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. +""" +if isinstance(x, RDD): +vecs = x.map(_convert_to_vector) +return self.call("predict", vecs) + +x = _convert_to_vector(x) +return self.call("predict", x) + +@since('2.0.0') +def computeCost(self, point): +""" +Return the Bisecting K-means cost (sum of squared distances of points to +their nearest center) for this model on the given data. + +:param point: the point to compute the cost to +""" +return self.call("computeCost", _convert_to_vector(point)) + + +class BisectingKMeans: +""" +.. note:: Experimental + +A bisecting k-means algorithm based on the paper "A comparison of document clustering --- End diff -- I believe we try to limit doc lines in Python to <= 80 chars (unlike code, which is <= 100 chars). Could you please update this and other parts?
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251288 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) --- End diff -- I'd write this as more of an example than a unit test. It's good to exercise all functionality, but unit test code should go in tests.py. (We have been inconsistent about this, but it'd be good to improve.)
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251290 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) +True +>>> model.predict(sc.parallelize([p])).first() == model.predict(p) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. --- End diff -- This sounds like 1 point only. 
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251287 --- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala --- @@ -120,6 +120,23 @@ private[python] class PythonMLLibAPI extends Serializable { } /** + * Java stub for Python mllib BisectingKMeans.run() + */ + def trainBisectingKMeans( +data: JavaRDD[Vector], --- End diff -- fix indentation
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251292 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) +True +>>> model.predict(sc.parallelize([p])).first() == model.predict(p) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. 
+ +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. +""" +if isinstance(x, RDD): +vecs = x.map(_convert_to_vector) +return self.call("predict", vecs) + +x = _convert_to_vector(x) +return self.call("predict", x) + +@since('2.0.0') +def computeCost(self, point): --- End diff -- It'd be nice to support RDDs here too.
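For context, `computeCost` in the diff above takes only a single vector; supporting an RDD, as suggested, would mean summing the squared distance from every point to its nearest center. A minimal pure-Python sketch of that cost computation (plain lists stand in for Spark vectors and RDDs; the names are illustrative, not the actual MLlib API):

```python
def squared_dist(p, c):
    """Squared Euclidean distance between two equal-length sequences."""
    return sum((pi - ci) ** 2 for pi, ci in zip(p, c))

def compute_cost(points, centers):
    """Sum of squared distances of each point to its nearest center."""
    return sum(min(squared_dist(p, c) for c in centers) for p in points)

centers = [[0.0, 0.0], [9.0, 9.0]]
compute_cost([[0.0, 0.0], [1.0, 1.0]], centers)  # -> 2.0
```

For an RDD the outer `sum(...)` would become a `map` of the per-point minimum followed by a `sum()` action.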
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49251289 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) +True +>>> model.predict(sc.parallelize([p])).first() == model.predict(p) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] --- End diff -- It'd be nice to store the centers right after training the model. I could imagine users calling this method within a closure.
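The suggestion above (store the centers right after training) matters because the `clusterCenters` property in the diff calls back into the JVM on every read, which fails inside a worker closure. A hypothetical sketch of the eager-snapshot pattern, with plain Python standing in for the JVM-backed model (the class name and constructor argument are illustrative):

```python
class EagerCentersModel(object):
    """Hypothetical wrapper that copies the cluster centers once at
    construction time, so later reads need no JVM round-trip and the
    model can be used inside RDD closures."""

    def __init__(self, java_centers):
        # Eager copy; `java_centers` stands in for the py4j-backed list.
        self._centers = [list(c) for c in java_centers]

    @property
    def clusterCenters(self):
        return self._centers

model = EagerCentersModel([[0.0, 0.0], [9.0, 9.0]])
model.clusterCenters  # -> [[0.0, 0.0], [9.0, 9.0]]
```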
If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user jkbradley commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49249961 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,120 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> p = array([0.0, 0.0]) +>>> model.predict(p) == model.predict(p) +True +>>> model.predict(sc.parallelize([p])).first() == model.predict(p) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. 
+ +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. +""" +if isinstance(x, RDD): +vecs = x.map(_convert_to_vector) +return self.call("predict", vecs) + +x = _convert_to_vector(x) +return self.call("predict", x) + +@since('2.0.0') +def computeCost(self, point): +""" +Return the Bisecting K-means cost (sum of squared distances of points to +their nearest center) for this model on the given data. + +:param point: the point to compute the cost to +""" +return self.call("computeCost", _convert_to_vector(point)) + + +class BisectingKMeans: --- End diff -- inherit from object
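The "inherit from object" comment refers to Python 2, where a bare `class BisectingKMeans:` creates a classic class; inheriting from `object` yields a new-style class with the full descriptor protocol (needed for `property`, `super()`, and friends). A small illustration (class name is illustrative):

```python
class NewStyle(object):  # new-style in Python 2; the default in Python 3
    @property
    def k(self):
        # Properties rely on the descriptor protocol of new-style classes.
        return 2

NewStyle().k  # -> 2
```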
Github user jkbradley commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-170149735 @yanboliang Thanks for reviewing! I'll review now too
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-169874556 Merged build finished. Test PASSed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-169874290 **[Test build #48995 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48995/consoleFull)** for PR 10150 at commit [`dc1c885`](https://github.com/apache/spark/commit/dc1c885ee3675087c4ccf6c8113e1d74350c1aac). * This patch passes all tests. * This patch merges cleanly. * This patch adds no public classes.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-169874558 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48995/ Test PASSed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-169866477 Merged build finished. Test FAILed.
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-169866478 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48994/ Test FAILed.
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-169865674 **[Test build #48995 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48995/consoleFull)** for PR 10150 at commit [`dc1c885`](https://github.com/apache/spark/commit/dc1c885ee3675087c4ccf6c8113e1d74350c1aac).
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-169864895 @thunterdb fixed the issue with predicting on RDDs :)
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49149092 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,116 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0])) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. + +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. 
+""" +if isinstance(x, RDD): +return x.map(self.predict(x)) --- End diff -- Ah seems that the JavaModelWraper call method being used won't work on the workers. I'll have to port the predict method over. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49140391 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,116 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0])) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. + +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. 
+""" +if isinstance(x, RDD): +return x.map(self.predict(x)) --- End diff -- Ah yes it should be, I'll ad a docstring test for this method. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49140159 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,116 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0])) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. + +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. 
+""" +if isinstance(x, RDD): +return x.map(self.predict(x)) --- End diff -- Also, maybe you can add a test for this case in the docstring. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user thunterdb commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r49140117 --- Diff: python/pyspark/mllib/clustering.py --- @@ -38,13 +38,116 @@ from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable from pyspark.streaming import DStream -__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture', - 'PowerIterationClusteringModel', 'PowerIterationClustering', - 'StreamingKMeans', 'StreamingKMeansModel', +__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans', + 'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel', + 'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel', 'LDA', 'LDAModel'] @inherit_doc +class BisectingKMeansModel(JavaModelWrapper): +""" +.. note:: Experimental + +A clustering model derived from the bisecting k-means method. + +>>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2) +>>> bskm = BisectingKMeans() +>>> model = bskm.train(sc.parallelize(data), k=4) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0])) +True +>>> model.k +4 +>>> model.computeCost(array([0.0, 0.0])) +0.0 +>>> model.k == len(model.clusterCenters) +True +>>> model = bskm.train(sc.parallelize(data), k=2) +>>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0])) +True +>>> model.k +2 + +.. versionadded:: 2.0.0 +""" + +@property +@since('2.0.0') +def clusterCenters(self): +"""Get the cluster centers, represented as a list of NumPy arrays.""" +return [c.toArray() for c in self.call("clusterCenters")] + +@property +@since('2.0.0') +def k(self): +"""Get the number of clusters""" +return self.call("k") + +@since('2.0.0') +def predict(self, x): +""" +Find the cluster to which x belongs in this model. + +:param x: Either the point to determine the cluster for or an RDD of points to determine +the clusters for. 
+""" +if isinstance(x, RDD): +return x.map(self.predict(x)) --- End diff -- I am not sure I understand this line, shouldn't it be `x.map(self.predict)`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-169623692 Looks fine to me. cc @jkbradley
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-169095113 @yanboliang I've added the since annotations.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168931943 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168931946 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48742/ Test PASSed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168931761 **[Test build #48742 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48742/consoleFull)** for PR 10150 at commit [`0310efe`](https://github.com/apache/spark/commit/0310efeec1a202733b40a50085178ec1b1d97409).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168922090 **[Test build #48742 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48742/consoleFull)** for PR 10150 at commit [`0310efe`](https://github.com/apache/spark/commit/0310efeec1a202733b40a50085178ec1b1d97409).
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168919548 Sounds good, this PR was made back before the 1.6 branch was cut so I didn't have any annotations on it. I'll update them.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user yanboliang commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168643020 @holdenk Sorry for the late response, and thanks for the updates. The PR looks good to me; note that we now add ```since('2.0.0')``` to public classes and functions.
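For readers unfamiliar with the annotations being discussed: their effect can be sketched with a minimal decorator (a simplified, hypothetical stand-in for the real `pyspark.since`, which additionally normalizes docstring indentation and handles properties):

```python
def since(version):
    """Simplified sketch of a PySpark-style `since` decorator: it records
    the version an API was added in by appending a Sphinx directive to the
    wrapped function's docstring."""
    def decorator(f):
        f.__doc__ = (f.__doc__ or "").rstrip() + (
            "\n\n.. versionadded:: %s\n" % version)
        return f
    return decorator


@since('2.0.0')
def k():
    """Get the number of clusters."""
    return 4
```

After decoration, `k.__doc__` carries a `.. versionadded:: 2.0.0` directive that Sphinx renders in the API docs, while the function's behavior is unchanged.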
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168062435 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168062436 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48493/ Test PASSed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168062281 **[Test build #48493 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48493/consoleFull)** for PR 10150 at commit [`57471e6`](https://github.com/apache/spark/commit/57471e676c982285718bc4e3161932dc1509695c).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds no public classes.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-168054823 **[Test build #48493 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48493/consoleFull)** for PR 10150 at commit [`57471e6`](https://github.com/apache/spark/commit/57471e676c982285718bc4e3161932dc1509695c).
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-166513950 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/48163/ Test FAILed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-166513946 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-166513927 **[Test build #48163 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48163/consoleFull)** for PR 10150 at commit [`fa6367c`](https://github.com/apache/spark/commit/fa6367c03cfe9734505d621c38b6c9e90f1e598b).
* This patch **fails Python style tests**.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class BisectingKMeansModel(JavaModelWrapper):`
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-166513365 **[Test build #48163 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/48163/consoleFull)** for PR 10150 at commit [`fa6367c`](https://github.com/apache/spark/commit/fa6367c03cfe9734505d621c38b6c9e90f1e598b).
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r48097828

--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,158 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream

-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-           'PowerIterationClusteringModel', 'PowerIterationClustering',
-           'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+           'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+           'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
            'LDA', 'LDAModel']


 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+    """
+    .. note:: Experimental
+
+    A clustering model derived from the bisecting k-means method.
+
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.run(sc.parallelize(data))
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0]))
+    True
+    >>> model.k
+    4
+    >>> model.computeCost(array([0.0, 0.0]))
+    0.0
+    >>> model.k == len(model.clusterCenters)
+    True
+    >>> model = bskm.setK(2).run(sc.parallelize(data))
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
+    True
+    >>> model.k
+    2
+    """
+
+    @property
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy arrays."""
+        return [c.toArray() for c in self.call("clusterCenters")]
+
+    @property
+    def k(self):
+        """Get the number of clusters"""
+        return self.call("k")
+
+    def predict(self, x):
+        """
+        Find the cluster to which x belongs in this model.
+
+        :param x: Either the point to determine the cluster for or an RDD of
+            points to determine the clusters for.
+        """
+        if isinstance(x, RDD):
+            return x.map(self.predict(x))
+
+        x = _convert_to_vector(x)
+        return self.call("predict", x)
+
+    def computeCost(self, point):
+        """
+        Return the Bisecting K-means cost (sum of squared distances of points to
+        their nearest center) for this model on the given data.
+
+        :param point: the point to compute the cost to
+        """
+        return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+    """
+    A bisecting k-means algorithm based on the paper "A comparison of document
+    clustering techniques" by Steinbach, Karypis, and Kumar, with modification to
+    fit Spark. The algorithm starts from a single cluster that contains all points.
+    Iteratively it finds divisible clusters on the bottom level and bisects each of
+    them using k-means, until there are `k` leaf clusters in total or no leaf
+    clusters are divisible. The bisecting steps of clusters on the same level are
+    grouped together to increase parallelism. If bisecting all divisible clusters
+    on the bottom level would result in more than `k` leaf clusters, larger
+    clusters get higher priority.
+
+    Based on [[http://glaros.dtc.umn.edu/gkhome/fetch/papers/docclusterKDDTMW00.pdf
+    Steinbach, Karypis, and Kumar, A comparison of document clustering techniques,
+    KDD Workshop on Text Mining, 2000.]]
+    """
+    def __init__(self):
+        self.k = 4
+        self.maxIterations = 20
+        self.minDivisibleClusterSize = 1.0
+        self.seed = -1888008604  # classOf[BisectingKMeans].getName.##
+
+    def setK(self, k):
+        """
+        Set the number of leaf clusters.
+
+        :param k: the desired number of leaf clusters (default: 4). The actual
+            number could be smaller if there are no divisible leaf clusters.
+        """
+        self.k = k
+        return self
+
+    def getK(self):
+        """Return the desired number of leaf clusters."""
+        return self.k
+
+    def setMaxIterations(self, maxIterations):
+        """
+        Set the maximum number of iterations.
+
+        :param maxIterations: the max number of k-means iterations to split
+            clusters (default: 20)
+        """
+        self.maxIterations = maxIterations
+        return self
+
+    def getMaxIterations(self):
+        """Return the maximum number of iterations."""
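The `BisectingKMeans` docstring quoted above summarizes the procedure: start from one cluster holding every point, then repeatedly bisect divisible leaf clusters with 2-means, giving larger clusters priority, until `k` leaves exist. A minimal pure-Python sketch of that idea (illustrative only, not the PySpark/Scala implementation; all names are made up):

```python
import random


def kmeans2(points, iters=20, seed=0):
    """Plain 2-means (Lloyd's algorithm), used as the bisection step."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    groups = [[], []]
    for _ in range(iters):
        groups = [[], []]
        for p in points:
            # Assign each point to its nearest center (squared distance).
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            groups[dists.index(min(dists))].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old center if a group went empty
                centers[i] = tuple(sum(xs) / len(g) for xs in zip(*g))
    return groups


def bisecting_kmeans(points, k=4):
    """Start from one cluster holding all points; repeatedly bisect the
    largest divisible leaf until there are k leaves or none are divisible."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)  # larger clusters get higher priority
        if len(largest) < 2:
            break  # no divisible leaf clusters remain
        clusters.remove(largest)
        parts = [g for g in kmeans2(largest) if g]
        if len(parts) < 2:
            clusters.append(largest)
            break  # bisection failed to split; stop
        clusters.extend(parts)
    return clusters
```

On the four doctest points, requesting `k=2` splits them into the two obvious pairs, matching the behavior the doctest asserts via `model.predict`.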
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user 3ourroom commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r47619760 --- Diff: python/pyspark/mllib/clustering.py --- (same diff hunk as quoted above)
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r47618583 --- Diff: python/pyspark/mllib/clustering.py --- (same diff hunk as quoted above)
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-164526117 ping?
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-163791405 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-163791408 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47548/ Test PASSed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-163791273 **[Test build #47548 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47548/consoleFull)** for PR 10150 at commit [`7fe3152`](https://github.com/apache/spark/commit/7fe3152692f56deae53bc3bd89887a5a2c2ffe5e).
* This patch passes all tests.
* This patch merges cleanly.
* This patch adds the following public classes _(experimental)_:
  * `class BisectingKMeansModel(JavaModelWrapper):`
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-163783796 **[Test build #47548 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47548/consoleFull)** for PR 10150 at commit [`7fe3152`](https://github.com/apache/spark/commit/7fe3152692f56deae53bc3bd89887a5a2c2ffe5e).
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user holdenk commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r47298375

--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,175 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream

-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-           'PowerIterationClusteringModel', 'PowerIterationClustering',
-           'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+           'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+           'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
            'LDA', 'LDAModel']


 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+    """
+    .. note:: Experimental
+
+    A clustering model derived from the bisecting k-means method.
+
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.run(sc.parallelize(data))
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0]))

--- End diff --

It's a sanity check - I can take it out if you want.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r47073596

--- Diff: python/pyspark/mllib/clustering.py ---
@@ -38,13 +38,175 @@
 from pyspark.mllib.util import Saveable, Loader, inherit_doc, JavaLoader, JavaSaveable
 from pyspark.streaming import DStream

-__all__ = ['KMeansModel', 'KMeans', 'GaussianMixtureModel', 'GaussianMixture',
-           'PowerIterationClusteringModel', 'PowerIterationClustering',
-           'StreamingKMeans', 'StreamingKMeansModel',
+__all__ = ['BisectingKMeansModel', 'BisectingKMeans', 'KMeansModel', 'KMeans',
+           'GaussianMixtureModel', 'GaussianMixture', 'PowerIterationClusteringModel',
+           'PowerIterationClustering', 'StreamingKMeans', 'StreamingKMeansModel',
            'LDA', 'LDAModel']


 @inherit_doc
+class BisectingKMeansModel(JavaModelWrapper):
+    """
+    .. note:: Experimental
+
+    A clustering model derived from the bisecting k-means method.
+
+    >>> data = array([0.0,0.0, 1.0,1.0, 9.0,8.0, 8.0,9.0]).reshape(4, 2)
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.run(sc.parallelize(data))
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0]))
+    True
+    >>> model.k
+    4
+    >>> model.computeCost(array([0.0, 0.0]))
+    0.0
+    >>> model.k == len(model.clusterCenters)
+    True
+    >>> model = bskm.setK(2).run(sc.parallelize(data))
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([1.0, 1.0]))
+    True
+    >>> model.k
+    2
+
+    .. versionadded:: 1.6.0
+    """
+
+    @property
+    @since('1.6.0')
+    def clusterCenters(self):
+        """Get the cluster centers, represented as a list of NumPy arrays."""
+        return [c.toArray() for c in self.call("clusterCenters")]
+
+    @property
+    @since('1.6.0')
+    def k(self):
+        """Get the number of clusters."""
+        return self.call("k")
+
+    @since('1.6.0')
+    def predict(self, x):
+        """
+        Find the cluster to which x belongs in this model.
+
+        :param x: Either a single point for which to determine the cluster,
+                  or an RDD of points for which to determine the clusters.
+        """
+        if isinstance(x, RDD):
+            return self.call("predict", x.map(_convert_to_vector))
+
+        x = _convert_to_vector(x)
+        return self.call("predict", x)
+
+    @since('1.6.0')
+    def computeCost(self, point):
+        """
+        Return the bisecting k-means cost (the sum of squared distances of
+        points to their nearest center) for this model on the given data.
+
+        :param point: the point for which to compute the cost
+        """
+        return self.call("computeCost", _convert_to_vector(point))
+
+
+class BisectingKMeans:
+    """
+    A bisecting k-means algorithm based on the paper "A comparison of document clustering
+    techniques" by Steinbach, Karypis, and Kumar, with modifications to fit Spark.
+    The algorithm starts from a single cluster that contains all points.
+    Iteratively it finds divisible clusters on the bottom level and bisects each of them using
+    k-means, until there are `k` leaf clusters in total or no leaf clusters are divisible.
+    The bisecting steps of clusters on the same level are grouped together to increase
+    parallelism. If bisecting all divisible clusters on the bottom level would result in more
+    than `k` leaf clusters, larger clusters get higher priority.
+
+    Based on [[http://glaros.dtc.umn.edu/gkhome/fetch/papers/docclusterKDDTMW00.pdf
+    Steinbach, Karypis, and Kumar, A comparison of document clustering techniques,
+    KDD Workshop on Text Mining, 2000.]]
+
+    .. versionadded:: 1.6.0
+    """
+    def __init__(self):
+        self.k = 4
+        self.maxIterations = 20
+        self.minDivisibleClusterSize = 1.0
+        self.seed = 42
--- End diff --

I found that the default value of `seed` on the Scala side is `classOf[BisectingKMeans].getName.##`; I think the two sides should be kept consistent. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. 
--- - To unsubscribe, e-mail: reviews-unsubscr...@spark.apache.org For additional commands, e-mail: reviews-h...@spark.apache.org
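The bisecting strategy described in the `BisectingKMeans` docstring above (start from one cluster holding every point, then repeatedly split a divisible leaf with 2-means until there are `k` leaves or nothing is divisible) can be sketched in plain Python. This is an illustrative toy, not the Spark implementation: the names, the fixed RNG seed, and the "split the largest leaf" rule are simplifications of the real level-by-level, larger-clusters-first behavior.

```python
import random

def _dist2(a, b):
    """Squared Euclidean distance between two points (tuples)."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def _mean(points):
    """Component-wise mean of a non-empty list of points."""
    n = len(points)
    return tuple(sum(p[i] for p in points) / n for i in range(len(points[0])))

def _bisect(points, iters=20, seed=42):
    """Split points into two groups using plain Lloyd's k-means with k=2."""
    rng = random.Random(seed)
    centers = rng.sample(points, 2)
    for _ in range(iters):
        groups = ([], [])
        for p in points:
            groups[0 if _dist2(p, centers[0]) <= _dist2(p, centers[1]) else 1].append(p)
        if not groups[0] or not groups[1]:
            return [list(points)]     # degenerate split: cluster not divisible
        centers = [_mean(g) for g in groups]
    return [groups[0], groups[1]]

def bisecting_kmeans(points, k=4):
    """Start from one cluster; repeatedly bisect the largest leaf until k leaves."""
    clusters = [list(points)]
    while len(clusters) < k:
        largest = max(clusters, key=len)
        if len(largest) < 2:
            break                     # nothing left to divide
        parts = _bisect(largest)
        if len(parts) == 1:
            break                     # largest leaf turned out not to be divisible
        clusters.remove(largest)
        clusters.extend(parts)
    return clusters
```

On the four points used in the doctest above, `bisecting_kmeans(..., k=2)` recovers the two natural pairs, and `k=4` ends with four singleton leaves.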
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r47070681

--- Diff: python/pyspark/mllib/clustering.py ---
+    >>> model.k
+    2
+
+    .. versionadded:: 1.6.0
--- End diff --

I think this can no longer make it into the 1.6 release, so the version-related marks should be deleted.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r47070564

--- Diff: python/pyspark/mllib/clustering.py ---
+    >>> bskm = BisectingKMeans()
+    >>> model = bskm.run(sc.parallelize(data))
+    >>> model.predict(array([0.0, 0.0])) == model.predict(array([0.0, 0.0]))
--- End diff --

I don't understand the purpose of this test case: the same point is always assigned to the same center, under any conditions, so this comparison is trivially true.
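The reviewer's point can be shown concretely: nearest-center assignment is a pure function of the point and the fitted centers, so calling `predict` twice on the same point must agree no matter how the model was trained. A tiny illustrative helper (hypothetical, not the PySpark API) makes the difference between the trivial check and an informative one visible:

```python
def nearest_center(point, centers):
    """Index of the center closest to point (squared Euclidean distance)."""
    return min(range(len(centers)),
               key=lambda i: sum((p - c) ** 2 for p, c in zip(point, centers[i])))

centers = [(0.5, 0.5), (8.5, 8.5)]
# Trivially true for any model -- this is the doctest the reviewer questions:
assert nearest_center((0.0, 0.0), centers) == nearest_center((0.0, 0.0), centers)
# A more informative doctest compares *different* points in different clusters:
assert nearest_center((0.0, 0.0), centers) != nearest_center((9.0, 8.0), centers)
```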
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user yanboliang commented on a diff in the pull request: https://github.com/apache/spark/pull/10150#discussion_r47070153

--- Diff: mllib/src/main/scala/org/apache/spark/mllib/api/python/PythonMLLibAPI.scala ---
@@ -121,6 +121,22 @@ private[python] class PythonMLLibAPI extends Serializable {
   }

   /**
+   * Java stub for Python mllib BisectingKMeans.run()
+   */
+  def trainBisectingKMeans(
+      data: JavaRDD[Vector],
+      k: Int,
+      maxIterations: Int,
+      minDivisibleClusterSize: Double,
+      seed: Long): BisectingKMeansModel = {
+    new BisectingKMeans()
+      .setK(k)
+      .setMaxIterations(maxIterations)
+      .setMinDivisibleClusterSize(minDivisibleClusterSize)
+      .setSeed(seed).run(data)
--- End diff --

nit: new line (`.run(data)` should go on its own line).
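As noted earlier in the thread, the Scala default for `seed` is `classOf[BisectingKMeans].getName.##`, i.e. the JVM `String.hashCode` of the fully qualified class name. If the Python wrapper wanted to mirror that default instead of hard-coding `42`, the JVM hash could be reproduced in Python. This is a sketch under that assumption, not code from the PR:

```python
def java_string_hashcode(s):
    """Java's String.hashCode: h = h * 31 + ord(ch), in signed 32-bit arithmetic."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    # Fold the unsigned 32-bit value back into Java's signed int range.
    return h - 0x100000000 if h >= 0x80000000 else h

# Hypothetical Python-side default seed, mirroring the Scala side:
DEFAULT_SEED = java_string_hashcode("org.apache.spark.mllib.clustering.BisectingKMeans")
```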
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-162257194 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-162257197 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47232/
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-162257155 **[Test build #47232 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47232/consoleFull)** for PR 10150 at commit [`0fd962c`](https://github.com/apache/spark/commit/0fd962ca67aaa923a5087f592a2acb34d4c89d07). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class BisectingKMeansModel(JavaModelWrapper):`
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-162255249 **[Test build #47232 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47232/consoleFull)** for PR 10150 at commit [`0fd962c`](https://github.com/apache/spark/commit/0fd962ca67aaa923a5087f592a2acb34d4c89d07).
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-162139505 Test FAILed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47216/
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-162139503 Merged build finished. Test FAILed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-162139499 **[Test build #47216 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47216/consoleFull)** for PR 10150 at commit [`868c4a7`](https://github.com/apache/spark/commit/868c4a7931834fe2bf85ccabe97a640f4bff4dc2). * This patch **fails Python style tests**. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class BisectingKMeansModel(JavaModelWrapper):`
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-162139389 **[Test build #47216 has started](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47216/consoleFull)** for PR 10150 at commit [`868c4a7`](https://github.com/apache/spark/commit/868c4a7931834fe2bf85ccabe97a640f4bff4dc2).
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
GitHub user holdenk reopened a pull request: https://github.com/apache/spark/pull/10150

[SPARK-11944][PYSPARK][MLLIB] python mllib.clustering.bisecting k means

From the coverage issues for 1.6: Add Python API for mllib.clustering.BisectingKMeans.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/holdenk/spark SPARK-11937-python-api-coverage-SPARK-11944-python-mllib.clustering.BisectingKMeans

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/10150.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #10150

commit 9b95e944f943a31ea8e969faa80662bce1080bdd
Author: Holden Karau
Date: 2015-12-02T16:07:39Z

    Some progress, not a lot

commit 427e487ca3e2ad27b692d9acd40fbd8a9b726312
Author: Holden Karau
Date: 2015-12-03T04:20:55Z

    murh some murh. airplain code isn't very good but it distracts me - start adding a class for calling bisectingkmeans. I don't really like how BisectingKMeans is set up (its different from many of the others which is fnur) but trying to decide if I should make the python API more closely match the Scala API or match the rest of the Python API. These are questions for after I've slept perhaps.

commit f5a40c85a2b91b4c93a66db2c15164bb57db44d6
Author: Holden Karau
Date: 2015-12-04T21:01:58Z

    A bunch of works towards getting BisectingKMeans in PySpark

commit d3e4c1a6a19e8cb0d19bbe5feab48c7655d48a00
Author: Holden Karau
Date: 2015-12-04T22:23:08Z

    Add a bit more pydoc descriptions, fix the prediction call, and verify with different k

commit 868c4a7931834fe2bf85ccabe97a640f4bff4dc2
Author: Holden Karau
Date: 2015-12-05T04:25:17Z

    fix compute cost
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-162108001 Merged build finished. Test PASSed.
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user AmplabJenkins commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-162108002 Test PASSed. Refer to this link for build results (access rights to CI server needed): https://amplab.cs.berkeley.edu/jenkins//job/SparkPullRequestBuilder/47208/
[GitHub] spark pull request: [SPARK-11944][PYSPARK][MLLIB] python mllib.clu...
Github user SparkQA commented on the pull request: https://github.com/apache/spark/pull/10150#issuecomment-162107904 **[Test build #47208 has finished](https://amplab.cs.berkeley.edu/jenkins/job/SparkPullRequestBuilder/47208/consoleFull)** for PR 10150 at commit [`d3e4c1a`](https://github.com/apache/spark/commit/d3e4c1a6a19e8cb0d19bbe5feab48c7655d48a00). * This patch passes all tests. * This patch merges cleanly. * This patch adds the following public classes _(experimental)_: * `class BisectingKMeansModel(JavaModelWrapper):`