[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15087301#comment-15087301 ]
mustafa elbehery commented on SPARK-5226:
-----------------------------------------

I tried Aliaksei's implementation on 500 MB of GPS trajectories, and the algorithm never finished, although it worked very well on the provided sample data. When I drew scatter plots of both datasets (the sample data and the trajectory data), I found that the sample data's distribution was Gaussian, while mine was heavily skewed.

Moreover, this implementation has a bottleneck: all the partitions are merged together in a single reduce step, which effectively makes the algorithm serial again.

I have posted a better implementation below that avoids this bottleneck; I hope it is more helpful.

> Add DBSCAN Clustering Algorithm to MLlib
> ----------------------------------------
>
>                 Key: SPARK-5226
>                 URL: https://issues.apache.org/jira/browse/SPARK-5226
>             Project: Spark
>          Issue Type: New Feature
>          Components: MLlib
>            Reporter: Muhammad-Ali A'rabi
>            Priority: Minor
>              Labels: DBSCAN, clustering
>
> MLlib is all k-means now, and I think we should add some new clustering
> algorithms to it. First candidate is DBSCAN as I think.
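
To illustrate the reduce-step bottleneck described in the comment above, here is a minimal sketch of one common way to avoid it. This is not the implementation the commenter refers to, and every name in it (merge_local_labels, local_labels, shared_points) is hypothetical. The idea is that each partition clusters its own points and only the labels of points lying in the overlap between neighbouring partitions are sent to the driver, where a union-find links them. The sketch is plain Python rather than Spark code, and it assumes that every shared point justifies a merge; a faithful DBSCAN merge would probably only link clusters through core points.

```python
def find(parent, x):
    # Follow parent links to the root, halving the path as we go.
    while parent[x] != x:
        parent[x] = parent[parent[x]]
        x = parent[x]
    return x


def merge_local_labels(local_labels, shared_points):
    """Map per-partition cluster labels to global cluster ids.

    local_labels:  iterable of (partition_id, local_cluster_id) tuples
                   (hypothetical label format).
    shared_points: dict of point_id -> list of labels that the point
                   received in overlapping partitions. Every label here
                   is assumed to also appear in local_labels.
    """
    parent = {label: label for label in local_labels}

    # Only the small set of overlap labels is processed centrally,
    # not the full point set of every partition.
    for labels in shared_points.values():
        if not labels:
            continue
        first = labels[0]
        for other in labels[1:]:
            root_a, root_b = find(parent, first), find(parent, other)
            if root_a != root_b:
                parent[root_b] = root_a

    # Assign consecutive global ids, one per union-find root.
    global_ids = {}
    result = {}
    for label in sorted(parent):
        root = find(parent, label)
        result[label] = global_ids.setdefault(root, len(global_ids))
    return result


labels = [(0, 0), (0, 1), (1, 0), (2, 0)]
shared = {"p7": [(0, 1), (1, 0)]}  # point p7 was clustered in partitions 0 and 1
mapping = merge_local_labels(labels, shared)
```

In this example the clusters (0, 1) and (1, 0) receive the same global id because they share point p7, while the other two clusters stay separate. The central step scales with the number of overlap labels instead of the size of the data, which is the property the single reduce step lacks.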