[ 
https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290523#comment-14290523
 ] 

Muhammad-Ali A'rabi commented on SPARK-5226:
--------------------------------------------

That's right. For very large data sets, it won't be a good implementation.
The lookup is O(log n), actually. In a preprocessing phase we build a sorted 
map or something similar, and given a radius we can retrieve all points within 
that distance in O(log n).
If we use the first implementation, each region query has to calculate many 
distances, some of which have surely been calculated before.
We can implement both approaches, and users can pick either one depending on 
their needs.
We could also store each vector with its norm and use the upper bound, but I 
don't trust this method yet and have to test it.
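
A minimal sketch of the two region-query strategies compared above, written in 
Python for brevity. It is not the proposed MLlib code. The function names, the 
sample points, and the reading of the "sorted map" as one sorted distance list 
per point are all assumptions made for illustration.

```python
from bisect import bisect_right
from math import dist  # Euclidean distance, Python 3.8+


def build_sorted_neighbors(points):
    """Preprocessing (assumed design): for every point, sort all other
    points by distance. Stores n * (n - 1) distances in total."""
    table = []
    for i, p in enumerate(points):
        pairs = sorted((dist(p, q), j) for j, q in enumerate(points) if j != i)
        distances = [d for d, _ in pairs]
        indices = [j for _, j in pairs]
        table.append((distances, indices))
    return table


def region_query_sorted(table, i, eps):
    """Binary-search the cutoff position for eps, then slice the
    precomputed neighbor list of point i."""
    distances, indices = table[i]
    cut = bisect_right(distances, eps)
    return indices[:cut]


def region_query_brute(points, i, eps):
    """The other approach: recompute every distance on each query."""
    p = points[i]
    return [j for j, q in enumerate(points) if j != i and dist(p, q) <= eps]


# Hypothetical sample data.
points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0), (5.0, 5.0)]
table = build_sorted_neighbors(points)

near_sorted = region_query_sorted(table, 0, 2.0)
near_brute = region_query_brute(points, 0, 2.0)
```

In this sketch the binary search finds the cutoff in O(log n), and returning 
the k matching neighbors costs a further O(k). The precomputed table grows 
quadratically with the number of points, which is the cost that makes this 
approach unattractive for very large data sets.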

> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-5226
>                 URL: https://issues.apache.org/jira/browse/SPARK-5226
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Muhammad-Ali A'rabi
>            Priority: Minor
>              Labels: DBSCAN
>
> MLlib is all k-means now, and I think we should add some new clustering 
> algorithms to it. I think the first candidate is DBSCAN.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
