[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14290523#comment-14290523 ]
Muhammad-Ali A'rabi commented on SPARK-5226:
--------------------------------------------

That's right. For very large data sets, it won't be a good implementation. The lookup is O(log n), actually: in the preprocessing phase we build a sorted map (or something similar), and given a radius we can retrieve all points at a smaller distance in O(log n).

If we use the first implementation, each region query has to calculate a lot of distances, and some of them have surely been calculated before. We can implement both approaches and let users pick whichever suits their needs.

We could also store each vector with its norm and use the upper bound, but I don't trust this method yet and have to test it.

> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-5226
>                 URL: https://issues.apache.org/jira/browse/SPARK-5226
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Muhammad-Ali A'rabi
>            Priority: Minor
>              Labels: DBSCAN
>
> MLlib is all k-means now, and I think we should add some new clustering
> algorithms to it. I think the first candidate is DBSCAN.
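
The two region-query strategies discussed in the comment above can be sketched roughly as follows. This is a minimal single-machine illustration in Python, not the proposed MLlib code: the function names (region_query_naive, build_sorted_neighbors, region_query_indexed), the Euclidean metric, and the use of "<= eps" as the neighborhood test are all assumptions made for the example. The precomputed table costs O(n^2 log n) time and O(n^2) memory to build, which is why it is unlikely to suit very large data; the binary search then finds the cutoff in O(log n), and copying out the k neighbors adds O(k).

```python
import bisect
import math


def euclidean(p, q):
    """Euclidean distance between two equal-length coordinate tuples."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))


def region_query_naive(points, i, eps):
    """First approach: recompute every distance on each query, O(n) per call."""
    return [j for j, q in enumerate(points) if euclidean(points[i], q) <= eps]


def build_sorted_neighbors(points):
    """Preprocessing: for each point, store (distance, index) pairs sorted by distance."""
    return [
        sorted((euclidean(p, q), j) for j, q in enumerate(points))
        for p in points
    ]


def region_query_indexed(table, i, eps):
    """Second approach: binary-search the precomputed row for the eps cutoff."""
    row = table[i]
    # (eps, inf) sorts after every (d, j) with d <= eps, so `cut` is the
    # position of the first entry whose distance exceeds eps.
    cut = bisect.bisect_right(row, (eps, math.inf))
    return [j for _, j in row[:cut]]


if __name__ == "__main__":
    pts = [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0), (5.0, 5.0)]
    table = build_sorted_neighbors(pts)
    print(sorted(region_query_naive(pts, 0, 1.0)))
    print(sorted(region_query_indexed(table, 0, 1.0)))
```

Both functions return the same neighbor set (including the query point itself); they differ only in where the distance computations happen, which is the trade-off the comment suggests leaving to the user.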