Repository: spark
Updated Branches:
  refs/heads/master 96c5eeec3 -> 0effe180f


[SPARK-8765] [MLLIB] Fix PySpark PowerIterationClustering test issue

The PySpark PowerIterationClustering test fails due to bad demo data: when the input graph is too small, PowerIterationClustering behaves non-deterministically, so the expected cluster assignments are not reproducible across runs.
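
The fix enlarges the demo affinity graph and makes the doctest compare relative cluster memberships instead of hard-coded labels, since the numeric label PIC assigns to each cluster is arbitrary. A minimal self-contained sketch of the same idea, assuming a local Spark installation with the 1.x MLlib Python API (the SparkContext setup and variable names are illustrative, not part of the patch):

    from pyspark import SparkContext
    from pyspark.mllib.clustering import PowerIterationClustering

    sc = SparkContext("local[2]", "pic-demo")

    # (src, dst, similarity) edges: vertices 0-3 form a tight clique, vertices
    # 4-15 form a long ring, and a single weak edge (3, 4, 0.1) joins the two,
    # so the 2-way split PIC recovers is stable across runs.
    similarities = [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0), (1, 2, 1.0), (1, 3, 1.0),
                    (2, 3, 1.0), (3, 4, 0.1), (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0),
                    (6, 7, 1.0), (7, 8, 1.0), (8, 9, 1.0), (9, 10, 1.0), (10, 11, 1.0),
                    (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0)]

    model = PowerIterationClustering.train(sc.parallelize(similarities, 2),
                                           k=2, maxIterations=100)

    # Cluster labels are arbitrary, so assert that vertices expected to end up
    # together share a label rather than asserting specific label values.
    result = sorted(model.assignments().collect(), key=lambda x: x.id)
    assert result[0].cluster == result[1].cluster == result[2].cluster == result[3].cluster
    assert result[4].cluster == result[5].cluster == result[6].cluster == result[7].cluster

    sc.stop()

The weak 0.1 edge keeps the two groups loosely connected, while the larger vertex count avoids the instability seen with the original five-node demo graph.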

Author: Yanbo Liang <yblia...@gmail.com>

Closes #7177 from yanboliang/spark-8765 and squashes the following commits:

392ae54 [Yanbo Liang] fix model.assignments output
5ec3f1e [Yanbo Liang] fix PySpark PowerIterationClustering test issue


Project: http://git-wip-us.apache.org/repos/asf/spark/repo
Commit: http://git-wip-us.apache.org/repos/asf/spark/commit/0effe180
Tree: http://git-wip-us.apache.org/repos/asf/spark/tree/0effe180
Diff: http://git-wip-us.apache.org/repos/asf/spark/diff/0effe180

Branch: refs/heads/master
Commit: 0effe180f4c2cf37af1012b33b43912bdecaf756
Parents: 96c5eee
Author: Yanbo Liang <yblia...@gmail.com>
Authored: Mon Jul 6 16:15:12 2015 -0700
Committer: Xiangrui Meng <m...@databricks.com>
Committed: Mon Jul 6 16:15:12 2015 -0700

----------------------------------------------------------------------
 python/pyspark/mllib/clustering.py | 16 ++++++++++++++--
 1 file changed, 14 insertions(+), 2 deletions(-)
----------------------------------------------------------------------


http://git-wip-us.apache.org/repos/asf/spark/blob/0effe180/python/pyspark/mllib/clustering.py
----------------------------------------------------------------------
diff --git a/python/pyspark/mllib/clustering.py b/python/pyspark/mllib/clustering.py
index a3eab63..ed4d78a 100644
--- a/python/pyspark/mllib/clustering.py
+++ b/python/pyspark/mllib/clustering.py
@@ -282,18 +282,30 @@ class PowerIterationClusteringModel(JavaModelWrapper, JavaSaveable, JavaLoader):
 
     Model produced by [[PowerIterationClustering]].
 
-    >>> data = [(0, 1, 1.0), (0, 2, 1.0), (1, 3, 1.0), (2, 3, 1.0),
-    ...     (0, 3, 1.0), (1, 2, 1.0), (0, 4, 0.1)]
+    >>> data = [(0, 1, 1.0), (0, 2, 1.0), (0, 3, 1.0), (1, 2, 1.0), (1, 3, 1.0),
+    ... (2, 3, 1.0), (3, 4, 0.1), (4, 5, 1.0), (4, 15, 1.0), (5, 6, 1.0),
+    ... (6, 7, 1.0), (7, 8, 1.0), (8, 9, 1.0), (9, 10, 1.0), (10, 11, 1.0),
+    ... (11, 12, 1.0), (12, 13, 1.0), (13, 14, 1.0), (14, 15, 1.0)]
     >>> rdd = sc.parallelize(data, 2)
     >>> model = PowerIterationClustering.train(rdd, 2, 100)
     >>> model.k
     2
+    >>> result = sorted(model.assignments().collect(), key=lambda x: x.id)
+    >>> result[0].cluster == result[1].cluster == result[2].cluster == result[3].cluster
+    True
+    >>> result[4].cluster == result[5].cluster == result[6].cluster == result[7].cluster
+    True
     >>> import os, tempfile
     >>> path = tempfile.mkdtemp()
     >>> model.save(sc, path)
     >>> sameModel = PowerIterationClusteringModel.load(sc, path)
     >>> sameModel.k
     2
+    >>> result = sorted(model.assignments().collect(), key=lambda x: x.id)
+    >>> result[0].cluster == result[1].cluster == result[2].cluster == result[3].cluster
+    True
+    >>> result[4].cluster == result[5].cluster == result[6].cluster == result[7].cluster
+    True
     >>> from shutil import rmtree
     >>> try:
     ...     rmtree(path)


