Github user WeichenXu123 commented on a diff in the pull request:

    https://github.com/apache/spark/pull/21513#discussion_r194214831

    --- Diff: python/pyspark/ml/clustering.py ---
    @@ -1156,6 +1157,204 @@ def getKeepLastCheckpoint(self):
             return self.getOrDefault(self.keepLastCheckpoint)

    +@inherit_doc
    +class PowerIterationClustering(HasMaxIter, HasWeightCol, JavaParams, JavaMLReadable,
    +                               JavaMLWritable):
    +    """
    +    .. note:: Experimental
    +
    +    Power Iteration Clustering (PIC), a scalable graph clustering algorithm developed by
    +    <a href=http://www.icml2010.org/papers/387.pdf>Lin and Cohen</a>. From the abstract:
    +    PIC finds a very low-dimensional embedding of a dataset using truncated power
    +    iteration on a normalized pair-wise similarity matrix of the data.
    +
    +    This class is not yet an Estimator/Transformer, use `assignClusters` method to run the
    +    PowerIterationClustering algorithm.
    +
    +    .. seealso:: `Wikipedia on Spectral clustering \
    +        <http://en.wikipedia.org/wiki/Spectral_clustering>`_
    +
    +    >>> from pyspark.sql.types import DoubleType, LongType, StructField, StructType
    +    >>> import math
    +    >>> def genCircle(r, n):
    +    ...     points = []
    +    ...     for i in range(0, n):
    +    ...         theta = 2.0 * math.pi * i / n
    +    ...         points.append((r * math.cos(theta), r * math.sin(theta)))
    +    ...     return points
    +    >>> def sim(x, y):
    +    ...     dist = (x[0] - y[0]) * (x[0] - y[0]) + (x[1] - y[1]) * (x[1] - y[1])
    +    ...     return math.exp(-dist / 2.0)
    +    >>> r1 = 1.0
    +    >>> n1 = 10
    +    >>> r2 = 4.0
    +    >>> n2 = 40
    +    >>> n = n1 + n2
    +    >>> points = genCircle(r1, n1) + genCircle(r2, n2)
    +    >>> data = [(i, j, sim(points[i], points[j])) for i in range(1, n) for j in range(0, i)]
    +    >>> rdd = sc.parallelize(data, 2)
    +    >>> schema = StructType([StructField("src", LongType(), False), \
    +    ...     StructField("dst", LongType(), True), \
    +    ...     StructField("weight", DoubleType(), True)])
    +    >>> df = spark.createDataFrame(rdd, schema)
    --- End diff --

The test code here is too complex to live in a doctest. Could you change it to something like `df = sc.parallelize(...).toDF(...)` and use a small, hardcoded dataset instead?
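A minimal sketch of the kind of hardcoded dataset the review asks for (the edge weights and graph shape below are illustrative, not taken from the PR; the column names match the `src`/`dst`/`weight` schema in the quoted diff):

```python
# Hypothetical small similarity graph: six weighted edges over four vertices,
# listed as (src, dst, weight) with src > dst, as in the original doctest.
data = [(1, 0, 0.5), (2, 0, 0.5), (2, 1, 0.7),
        (3, 0, 0.5), (3, 1, 0.7), (3, 2, 0.9)]

# Sanity-check the hardcoded edges before handing them to Spark.
assert all(src > dst for src, dst, _ in data)
assert all(0.0 < w <= 1.0 for _, _, w in data)

# In the doctest this would become a DataFrame in one line, e.g.:
#   df = spark.createDataFrame(data, ["src", "dst", "weight"])
# which avoids the explicit StructType schema and the circle-generation helpers.
```

Whether the exact values matter depends on what `assignClusters` is expected to return in the doctest's expected output, so the weights would need to be chosen to give a stable clustering.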