To be clear: I do not recommend Method 2 at all for Mahout; it was only meant to be illustrative. To keep the discussion concrete, rough sketches of both methods follow below the quoted message.
On Sat, Aug 26, 2017 at 12:28 PM, Aditya <adityasarma...@gmail.com> wrote:

> Hello all,
>
> The InCoreDBSCAN code has been pushed to the PR and can be found here:
> <https://github.com/apache/mahout/pull/334>. I'm also working on including
> the rtree module to reduce the computational complexity of the existing
> code. Once the RTree code has been pushed, it can be used by other
> algorithms / modules as well. With respect to the distributed version of
> DBSCAN, Trevor and I have been discussing it for a while now, and we have
> concluded that an *exact distributed algorithm would be really expensive*
> due to the large amount of data copying involved.
>
> The design for the distributed version can be found in this Google doc:
> <https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJcci6JM4PyXLoNlVF0Yno8/edit?usp=sharing>.
> I urge all of you to have a look at it, because if we come up with ideas
> for implementing the exact version, that would be golden!
>
> Apart from that, Trevor and I have come up with a few ideas for the design
> of the *approximate distributed DBSCAN algorithm*. I have given details
> about the two methods below. Please go through them and give your thoughts
> on each one. The ideas are still raw and need some brainstorming.
>
> It helps to be familiar with the idea of an R-tree before proceeding. The
> information on the Wikipedia page <https://en.wikipedia.org/wiki/R-tree>
> is more than sufficient for this email.
>
> *Method 1: Iterative method*
> *1.* In each partition, create rectangles that describe the data in that
> partition.
> *2.* The driver collects information about all of the rectangles (a
> rectangle is denoted by the coordinates of its bottom-left and top-right
> corners).
> *3.* The driver tries to rebalance the collected rectangles by checking
> for overlaps, containment, etc., and modifies the input rectangles
> accordingly.
> *4.* Do cluster assignment according to the rebalanced rectangles in each
> partition and check how many points end up in each cluster.
> *5.* If the clustering is not satisfactory (this can be determined by
> standard cluster-validity indexes), repeat the process.
>
> *Method 2: Sampling method*
> *1.* If the given data is very large, select a sample of the data points
> and generate clusters on the sampled dataset.
> *2.* Assign each point in the full dataset to the clusters computed on the
> sample. The assignment of points to clusters will be heuristic-based.
>
> Now for the important part: we need feedback on the proposed ideas and,
> possibly, discussion on what the heuristics should be. I hope all of you
> contribute to the brainstorming process.
>
> PS: @Trevor: If I've missed something, please add it.
>
> Best Regards,
> Aditya
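To give a concrete anchor for Method 1, here is a minimal, self-contained Scala sketch of the driver-side rebalancing step (step 3 in the quoted list). The Rectangle class, the merge-by-union rule, and the fixed-point loop are my own illustrative choices, not code from PR 334 or the design doc:

// Illustrative sketch only: Rectangle, overlaps, union and the rebalance
// loop are hypothetical names, not the actual Mahout implementation.
case class Rectangle(min: Array[Double], max: Array[Double]) {

  // Two axis-aligned rectangles overlap iff their intervals overlap on every dimension.
  def overlaps(other: Rectangle): Boolean =
    min.indices.forall(d => min(d) <= other.max(d) && other.min(d) <= max(d))

  // Smallest rectangle containing both -- used to merge overlapping regions.
  def union(other: Rectangle): Rectangle =
    Rectangle(
      min.zip(other.min).map { case (a, b) => math.min(a, b) },
      max.zip(other.max).map { case (a, b) => math.max(a, b) })
}

object RebalanceSketch {

  // Driver-side step 3: merge overlapping / contained rectangles collected
  // from the partitions until no two rectangles overlap (O(n^2) fixed point).
  @annotation.tailrec
  def rebalance(rects: List[Rectangle]): List[Rectangle] = {
    val merged = rects.foldLeft(List.empty[Rectangle]) { (acc, r) =>
      val (hits, misses) = acc.partition(_.overlaps(r))
      hits.foldLeft(r)(_ union _) :: misses
    }
    if (merged.size == rects.size) merged else rebalance(merged)
  }

  def main(args: Array[String]): Unit = {
    // Three rectangles as they might be reported by three partitions.
    val collected = List(
      Rectangle(Array(0.0, 0.0), Array(2.0, 2.0)),
      Rectangle(Array(1.5, 1.5), Array(4.0, 4.0)), // overlaps the first
      Rectangle(Array(10.0, 10.0), Array(12.0, 12.0)))
    rebalance(collected).foreach(r =>
      println(s"(${r.min.mkString(", ")}) .. (${r.max.mkString(", ")})"))
  }
}

A real implementation would also need to carry cluster information along with each rectangle and push the rebalanced rectangles back to the partitions for the assignment and validation passes (steps 4 and 5).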
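And, purely for illustration given my caveat above, a sketch of Method 2. The sampling rate, the clusterSample stand-in for the in-core DBSCAN, and the nearest-clustered-sample-within-eps assignment rule are all assumptions on my part; the actual heuristic is exactly what this thread is asking for feedback on.

// Illustrative sketch only: the sampling, the clusterSample stand-in and the
// nearest-neighbour-within-eps rule are assumptions, not the proposed design.
import scala.util.Random

object SamplingSketch {

  type Point = Array[Double]

  def dist(a: Point, b: Point): Double =
    math.sqrt(a.zip(b).map { case (x, y) => (x - y) * (x - y) }.sum)

  // Step 1: cluster only a random sample of the data. clusterSample stands in
  // for an in-core DBSCAN; it returns one label per sampled point, -1 = noise.
  // Step 2: every point inherits the label of its nearest clustered sample
  // point, but only if that neighbour is within eps; otherwise it is noise.
  def fit(data: Seq[Point],
          sampleFraction: Double,
          eps: Double,
          clusterSample: Seq[Point] => Seq[Int],
          rng: Random = new Random(42)): Seq[Int] = {
    val sample = data.filter(_ => rng.nextDouble() < sampleFraction)
    val labelledSample = sample.zip(clusterSample(sample)).filter(_._2 != -1)

    data.map { p =>
      if (labelledSample.isEmpty) -1
      else {
        val (nearest, label) = labelledSample.minBy { case (s, _) => dist(p, s) }
        if (dist(p, nearest) <= eps) label else -1
      }
    }
  }
}

The open question is what that assignment heuristic should actually be, which is the feedback Aditya is asking for.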