[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15880786#comment-15880786 ] Xiangrui Meng commented on SPARK-5226: -- I closed this ticket as "Won't Do" due to DBSCAN's high complexity and hence bad scalability as documented in http://staff.itee.uq.edu.au/taoyf/paper/sigmod15-dbscan.pdf. > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.15#6346) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15087301#comment-15087301 ] mustafa elbehery commented on SPARK-5226: - I have tried to use Aliaksei's implementation on 500MB of GPS Trajectories. The algorithm never finished. Though, his implementation worked very well on the provided sample data. When I have created a scatter plot for both datasets; sample data && trajectories data, I found out that his data's distribution was Gaussian, while mine was very skewed. Moreover, this implementation has a bottleneck, because basically all the partition are merged together in a reduce step, which leads turns the algorithm into Serial again !!!.. I have commented below a better implementation to avoid this bottleneck, hope it would be more helpful. > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15086591#comment-15086591 ] Joseph K. Bradley commented on SPARK-5226: -- Aliaksei did create a package: [http://spark-packages.org/package/alitouka/spark_dbscan] It'd be great to get feedback on it & other implementations. It may be a little while before we have reviewer bandwidth to get dbscan into MLlib itself, but a well-tested package would make doing so much easier! > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067200#comment-15067200 ] mustafa elbehery commented on SPARK-5226: - Better Implementation, based on research paper for parallel DBSCAN can be found here. https://github.com/irvingc/dbscan-on-spark .. The approach solved bottleneck of reduce step, in which discovered clusters are merged. Hope it helps. > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15067177#comment-15067177 ] Stephen Boesch commented on SPARK-5226: --- It seems that Aliaksei has not been able to add to spark-packages. So I presume DBScan is still an open ticket ? > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=15005355#comment-15005355 ] mustafa elbehery commented on SPARK-5226: - Hello, I would like to use DBSCAN on spark. [~alitouka] I have tried to use ur implementation, on 500 MG of data. However, I think the **Population of partition index** step is to expensive. Is this implementation is going to be online soon, Regards. > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=14725284#comment-14725284 ] Rajeev Reddy commented on SPARK-5226: - Hello Aliaksei Litouka, I have looked into your implementation you are taking coordinate points i.e double as input for clustering can you please tell me how I can extend this for clustering a set of Text Documents > Add DBSCAN Clustering Algorithm to MLlib > > > Key: SPARK-5226 > URL: https://issues.apache.org/jira/browse/SPARK-5226 > Project: Spark > Issue Type: New Feature > Components: MLlib >Reporter: Muhammad-Ali A'rabi >Priority: Minor > Labels: DBSCAN, clustering > > MLlib is all k-means now, and I think we should add some new clustering > algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14301792#comment-14301792 ] Xiangrui Meng commented on SPARK-5226: -- [~alitouka] Thanks for implementing DBSCAN on top of Spark! I'd like to recommend you registering it as a package on http://spark-packages.org, so it is more visible to the community. Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14299902#comment-14299902 ] Aliaksei Litouka commented on SPARK-5226: - [~angellandros] , FYI - I attempted to implement DBSCAN algorithm on top of Spark. My implementation is limited to Euclidean and Manhattan distance measures but maybe you'll find something helpful - https://github.com/alitouka/spark_dbscan . Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14292399#comment-14292399 ] Dmitriy Lyubimov commented on SPARK-5226: - All attempts to parallelize dbscan in literature lately (or similar DeLiClu type of things) i read about include partitioning the task into smaller subtasks, solving each on individual level and merging it all back (see MR.Scan paper for example). Merging is of course is the new and the tricky thing. As far as i understand, they all pretty much have limitations to reduce scope to euclidean distances and captitalize on notions of euclidean geometry resulting from that, in order to solve partition and merge problems. Which substantially reduces attractiveness of general algorithm. However, the naive straightforward port of simple DBScan algorithm is not terribly practical for big data because of total complexity of the problem (or impracticality of building something like huge distributed R-tree index system on shared-nothing programming models). Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14290523#comment-14290523 ] Muhammad-Ali A'rabi commented on SPARK-5226: That's right. For very huge data, it won't be a good implementation. It is O(log n), actually. In preprocessing phase, we created a sorted map or something, and with a radius, we can retrieve all points with less distance in O(log n). If we use the first implementation, for each region query we have to calculate lots of distances, and some of them are surely calculated before. We can have both ways implemented, and user may use any of them depending on their need. We can also use vector with norm and use the upper bound. But I don't trust this method and have to test it. Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14284391#comment-14284391 ] Xiangrui Meng commented on SPARK-5226: -- The preprocessing step is parallel but very expensive. The distance matrix will have n^2 entries, which means it would be hard to handle beyond 1 million points. Is it true? In the main process, could you elaborate on how region queries are performed, and why it is O(1)? Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14279170#comment-14279170 ] Muhammad-Ali A'rabi commented on SPARK-5226: This is DBSCAN algorithm: {noformat} DBSCAN(D, eps, MinPts) C = 0 for each unvisited point P in dataset D mark P as visited NeighborPts = regionQuery(P, eps) if sizeof(NeighborPts) MinPts mark P as NOISE else C = next cluster expandCluster(P, NeighborPts, C, eps, MinPts) expandCluster(P, NeighborPts, C, eps, MinPts) add P to cluster C for each point P' in NeighborPts if P' is not visited mark P' as visited NeighborPts' = regionQuery(P', eps) if sizeof(NeighborPts') = MinPts NeighborPts = NeighborPts joined with NeighborPts' if P' is not yet member of any cluster add P' to cluster C regionQuery(P, eps) return all points within P's eps-neighborhood (including P) {noformat} As you can see, there are just two parameters. There is two ways of implementation. First one is faster (O(n log n), and requires more memory (O(n^2)). The other way is slower (O(n^2)) and requires less memory (O(n)). But I prefer the first one, as we are not short one memory. There are two phases of running: * Preprocessing. In this phase a distance matrix for all point is created and distances between every two points is calculated. Very parallel. * Main Process. In this phase the algorithm will run, as described in pseudo-code, and two foreach's are parallelized. Region queries are done very fast (O(1)), because of preprocessing. Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14277932#comment-14277932 ] Xiangrui Meng commented on SPARK-5226: -- [~angellandros] Before you start coding, could you describe the computation and communication cost of a parallel implementation of DBSCAN as well as the proposed APIs (parameters, methods)? It would help us decide whether to include it in MLlib. Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14278305#comment-14278305 ] Muhammad-Ali A'rabi commented on SPARK-5226: Yeah, of course. It will take me a day or two, so I commented to say that I'm working on it. Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Reporter: Muhammad-Ali A'rabi Priority: Minor Labels: DBSCAN MLlib is all k-means now, and I think we should add some new clustering algorithms to it. First candidate is DBSCAN as I think. -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org
[jira] [Commented] (SPARK-5226) Add DBSCAN Clustering Algorithm to MLlib
[ https://issues.apache.org/jira/browse/SPARK-5226?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=14275981#comment-14275981 ] Muhammad-Ali A'rabi commented on SPARK-5226: Although I can't assign this task to myself, I am interested to do it. Add DBSCAN Clustering Algorithm to MLlib Key: SPARK-5226 URL: https://issues.apache.org/jira/browse/SPARK-5226 Project: Spark Issue Type: New Feature Components: MLlib Affects Versions: 1.2.0 Reporter: Muhammad-Ali A'rabi Priority: Minor -- This message was sent by Atlassian JIRA (v6.3.4#6332) - To unsubscribe, e-mail: issues-unsubscr...@spark.apache.org For additional commands, e-mail: issues-h...@spark.apache.org