[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14292399#comment-14292399
 ] 

Dmitriy Lyubimov commented on SPARK-5226:
-----------------------------------------

All recent attempts in the literature to parallelize DBSCAN (or similar 
DeLiClu-type algorithms) that I have read about partition the task into 
smaller subtasks, solve each one individually, and merge the results back 
together (see the MR.Scan paper, for example). The merge step is, of course, 
the new and tricky part.

As far as I understand, they all restrict their scope to Euclidean distances 
and capitalize on the properties of Euclidean geometry that follow from that 
in order to solve the partition and merge problems. This makes them 
substantially less attractive as general-purpose algorithms. However, a naive, 
straightforward port of the simple DBSCAN algorithm is not terribly practical 
for big data because of the total complexity of the problem (or the 
impracticality of building something like a huge distributed R-tree index on 
shared-nothing programming models).
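
To make the complexity point concrete, here is a minimal single-machine 
sketch of the simple DBSCAN algorithm with a pluggable distance function and 
no spatial index. It is only an illustration, not a proposal for MLlib: the 
names naive_dbscan, region_query, eps and min_pts are mine. Without an index, 
every neighborhood query is a linear scan, so n points cost about n^2 
distance evaluations in total.

```python
from typing import Callable, List, Sequence, TypeVar

T = TypeVar("T")
NOISE = -1       # label for points that belong to no cluster
UNVISITED = -2   # internal marker, never returned to the caller


def naive_dbscan(points: Sequence[T],
                 dist: Callable[[T, T], float],
                 eps: float,
                 min_pts: int) -> List[int]:
    """Label each point with a cluster id (0, 1, ...) or NOISE.

    Works with any distance function, but every neighborhood query
    scans all points, so the total cost is quadratic in len(points).
    """
    n = len(points)
    labels = [UNVISITED] * n

    def region_query(i: int) -> List[int]:
        # Linear scan: n distance evaluations per query, no index.
        return [j for j in range(n) if dist(points[i], points[j]) <= eps]

    cluster = 0
    for i in range(n):
        if labels[i] != UNVISITED:
            continue
        neighbors = region_query(i)
        if len(neighbors) < min_pts:
            labels[i] = NOISE          # may become a border point later
            continue
        labels[i] = cluster            # i is a core point: start a cluster
        queue = list(neighbors)
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster    # border point reached from a core point
            if labels[j] != UNVISITED:
                continue
            labels[j] = cluster
            j_neighbors = region_query(j)
            if len(j_neighbors) >= min_pts:
                queue.extend(j_neighbors)  # j is core too: keep expanding
        cluster += 1
    return labels


if __name__ == "__main__":
    data = [0.0, 0.5, 1.0, 10.0, 10.5, 11.0, 50.0]
    print(naive_dbscan(data, lambda a, b: abs(a - b), eps=0.6, min_pts=2))
    # prints [0, 0, 0, 1, 1, 1, -1]
```

The dist argument can be any function, which is the generality that the 
Euclidean-only partition-and-merge schemes give up. The price is the linear 
scan inside region_query, which is what stops this version from scaling.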

> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-5226
>                 URL: https://issues.apache.org/jira/browse/SPARK-5226
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Muhammad-Ali A'rabi
>            Priority: Minor
>              Labels: DBSCAN
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. I think the first candidate is DBSCAN.



