[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14279170#comment-14279170 ]
Muhammad-Ali A'rabi commented on SPARK-5226: -------------------------------------------- This is DBSCAN algorithm: {noformat} DBSCAN(D, eps, MinPts) C = 0 for each unvisited point P in dataset D mark P as visited NeighborPts = regionQuery(P, eps) if sizeof(NeighborPts) < MinPts mark P as NOISE else C = next cluster expandCluster(P, NeighborPts, C, eps, MinPts) expandCluster(P, NeighborPts, C, eps, MinPts) add P to cluster C for each point P' in NeighborPts if P' is not visited mark P' as visited NeighborPts' = regionQuery(P', eps) if sizeof(NeighborPts') >= MinPts NeighborPts = NeighborPts joined with NeighborPts' if P' is not yet member of any cluster add P' to cluster C regionQuery(P, eps) return all points within P's eps-neighborhood (including P) {noformat} As you can see, there are just two parameters. There is two ways of implementation. First one is faster (O(n log n), and requires more memory (O(n^2)). The other way is slower (O(n^2)) and requires less memory (O(n)). But I prefer the first one, as we are not short one memory. There are two phases of running: * Preprocessing. In this phase a distance matrix for all point is created and distances between every two points is calculated. Very parallel. * Main Process. In this phase the algorithm will run, as described in pseudo-code, and two foreach's are parallelized. Region queries are done very fast (O(1)), because of preprocessing. > Add DBSCAN Clustering Algorithm to MLlib > ---------------------------------------- > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib > Reporter: Muhammad-Ali A'rabi > Priority: Minor > Labels: DBSCAN > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) --------------------------------------------------------------------- To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org