To be clear: I do not recommend Method 2 at all for Mahout; I was just trying
to be illustrative.





On Sat, Aug 26, 2017 at 12:28 PM, Aditya <adityasarma...@gmail.com> wrote:

> Hello all,
>
> The InCoreDBSCAN code has been pushed to the PR and can be found here
> <https://github.com/apache/mahout/pull/334>. I'm also working on including
> the rtree module to reduce the computational complexity of the existing
> code. Once the RTree code has been pushed, it can be used for other
> algorithms / modules as well. With respect to the distributed version of
> DBSCAN, Trevor and I have been having discussions for a while now, and we
> have come to the conclusion that an *exact distributed algorithm* *would be
> really expensive* due to the large amount of data copying involved.
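>
> As a rough illustration of why the RTree helps (this is only a sketch with a
> hypothetical SpatialIndex interface, not the actual rtree module): DBSCAN
> spends most of its time answering eps-neighborhood queries, and a spatial
> index lets us answer each query with a bounding-box search instead of a full
> scan over all points.
>
>   // Sketch only: hypothetical spatial-index interface, not code from the PR.
>   case class Point(coords: Array[Double])
>   case class Box(lo: Array[Double], hi: Array[Double])
>
>   trait SpatialIndex {
>     // returns all indexed points that fall inside the box
>     def search(box: Box): Seq[Point]
>   }
>
>   // DBSCAN's region query: instead of scanning all n points, ask the index
>   // for the eps-box around p and then filter by true Euclidean distance.
>   def epsNeighborhood(p: Point, eps: Double, index: SpatialIndex): Seq[Point] = {
>     val box = Box(p.coords.map(_ - eps), p.coords.map(_ + eps))
>     index.search(box).filter { q =>
>       val d2 = p.coords.zip(q.coords).map { case (a, b) => (a - b) * (a - b) }.sum
>       d2 <= eps * eps
>     }
>   }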
>
> The design for the distributed version can be found in this Google doc
> <https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJcci6JM4PyXLoNlVF0Yno8/edit?usp=sharing>.
> I urge all of you to have a look at it, because if we get ideas to
> implement the exact version, that'd be golden!
>
> Apart from that, Trevor and I have come up with a few approaches to the design
> of the *approximate distributed DBSCAN algorithm*. I have given details
> about two of the methods we came up with. Please go through them and give your
> thoughts on each one. The ideas are still raw and need some
> brainstorming.
>
> It helps if you are familiar with the idea of an RTree before proceeding.
> The information given in the Wikipedia page
> <https://en.wikipedia.org/wiki/R-tree> is more than sufficient for this
> email.
>
>
>
> *Method 1: Iterative method*
> *1. *In each partition, create rectangles that properly describe the data in
> that partition.
> *2. *The driver collects information about all of the rectangles (a
> rectangle is denoted by the coordinates of its bottom-left corner and its
> top-right corner).
> *3. *The driver tries to re-balance the collected rectangles by checking
> for overlaps / containment etc. and modifies the input rectangles accordingly.
> *4. *Do cluster assignment according to the rebalanced rectangles in each
> partition and check how many points end up in each cluster.
> *5. *If the clustering is not satisfactory (this can be determined by a
> standard cluster-validity index), repeat the process. A rough sketch of this
> loop is given below.
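>
> A minimal sketch of the driver loop described in Method 1, using plain Scala
> collections to stand in for an RDD's partitions. The names rebalance and
> clusteringIsGood are hypothetical placeholders (not code from the PR); how
> exactly they should work is part of what needs brainstorming.
>
>   case class Point(coords: Array[Double])
>   case class Rect(lo: Array[Double], hi: Array[Double]) {
>     def contains(p: Point): Boolean =
>       p.coords.indices.forall(i => p.coords(i) >= lo(i) && p.coords(i) <= hi(i))
>   }
>
>   // Step 1: each (non-empty) partition summarizes its data as a bounding rectangle.
>   def boundingRect(part: Seq[Point]): Rect = {
>     val dims = part.head.coords.indices
>     Rect(dims.map(i => part.map(_.coords(i)).min).toArray,
>          dims.map(i => part.map(_.coords(i)).max).toArray)
>   }
>
>   // Steps 2-5: the driver collects the rectangles, rebalances them, assigns
>   // points to the rectangle that contains them (index -1 = unassigned/noise),
>   // and repeats until the clustering looks good enough or we give up.
>   def iterate(partitions: Seq[Seq[Point]],
>               rebalance: Seq[Rect] => Seq[Rect],
>               clusteringIsGood: Map[Int, Seq[Point]] => Boolean,
>               maxIter: Int = 10): Map[Int, Seq[Point]] = {
>     var rects = rebalance(partitions.map(boundingRect))
>     var assignment = partitions.flatten.groupBy(p => rects.indexWhere(_.contains(p)))
>     var iter = 1
>     while (!clusteringIsGood(assignment) && iter < maxIter) {
>       rects = rebalance(rects)
>       assignment = partitions.flatten.groupBy(p => rects.indexWhere(_.contains(p)))
>       iter += 1
>     }
>     assignment
>   }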
>
>
> *Method 2: Sampling method*
> *1. *If the given data is really large, select a sample of the data
> points and generate clusters on the sampled dataset.
> *2. *Assign each point in the full dataset to the clusters computed on the
> sample. The assignment of points to clusters will be heuristic-based; a
> sketch is given below.
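>
> A minimal sketch of the sampling approach, again with hypothetical names:
> inCoreDbscan stands in for the in-core implementation from the PR, and the
> assignment rule shown (nearest sampled point within eps, otherwise noise) is
> just one possible heuristic, not a settled choice.
>
>   import scala.util.Random
>
>   case class Point(coords: Array[Double])
>
>   def distance(a: Point, b: Point): Double =
>     math.sqrt(a.coords.zip(b.coords).map { case (x, y) => (x - y) * (x - y) }.sum)
>
>   // Step 1: cluster a small sample of the full dataset with the in-core code.
>   // Step 2: assign every remaining point heuristically, here to the cluster of
>   // its nearest sampled point if that point is within eps, else to noise (-1).
>   def approximateDbscan(data: Seq[Point],
>                         sampleFraction: Double,
>                         eps: Double,
>                         inCoreDbscan: Seq[Point] => Map[Point, Int]): Map[Point, Int] = {
>     val sample = Random.shuffle(data).take((data.size * sampleFraction).toInt)
>     val sampleClusters = inCoreDbscan(sample)
>     data.map { p =>
>       sampleClusters.get(p) match {
>         case Some(cluster) => p -> cluster      // point was part of the sample
>         case None =>
>           val (nearest, d) = sample.map(s => (s, distance(p, s))).minBy(_._2)
>           p -> (if (d <= eps) sampleClusters(nearest) else -1)
>       }
>     }.toMap
>   }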
>
> Now for the important part: we need feedback on the proposed ideas and, in
> particular, discussion of what the heuristics should be. I hope all of you
> will contribute to the brainstorming process.
>
> PS: @Trevor: If I've missed something, please add it.
>
> Best Regards,
> Aditya
>
