[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292399#comment-14292399 ]
Dmitriy Lyubimov commented on SPARK-5226:
-----------------------------------------

All recent attempts in the literature to parallelize DBSCAN (or similar DeLiClu-type algorithms) that I have read about partition the task into smaller subtasks, solve each one individually, and merge the results back together (see the MR.Scan paper, for example). Merging is, of course, the new and tricky part.

As far as I understand, they all pretty much restrict their scope to Euclidean distances and capitalize on the resulting notions of Euclidean geometry in order to solve the partition and merge problems, which substantially reduces the attractiveness of the general algorithm.

However, a naive, straightforward port of the simple DBSCAN algorithm is not terribly practical for big data because of the total complexity of the problem (or the impracticality of building something like a huge distributed R-tree index on shared-nothing programming models).

> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-5226
>                 URL: https://issues.apache.org/jira/browse/SPARK-5226
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Muhammad-Ali A'rabi
>            Priority: Minor
>              Labels: DBSCAN
>
> MLlib is all k-means now, and I think we should add some new clustering
> algorithms to it. First candidate is DBSCAN as I think.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org
For additional commands, e-mail: issues-h...@spark.apache.org