Density based Clustering in Mahout

Aditya Thu, 22 Jun 2017 16:44:56 -0700

Hello everyone,

I've been working for the past few weeks on coding an in-core DBSCAN
algorithm.


A more efficient version with an O(n*log(n)) complexity does exist but it
uses the R-Tree data structure to index the data. I have a few
concerns/questions and I'm hoping you would be able to help me out.

1. Based on my knowledge, using an indexing data structure like an R-Tree
or a Kd-Tree is the only way to reduce the complexity of the dbscan
algorithm. If there's any other method that you are familiar with, please
let me know.

2. From what I've read in the book Apache Mahout: Beyond MapReduce written
by Andrew and Dmitry, I don't see how I can directly exploit the existing
data structures and operations to get the functionality of an R-Tree.

3. On the off chance that an R-Tree module has to built in Mahout, because
it is not possible to easily use existing operations I need some insights
as to how to go about it. I learned that everything in Mahout at the end
should be serializable to a vector. I can't fathom how to do that
intuitively in the case of an R-Tree

There are a couple of other concerns that need to be discussed but these
are vital at the moment.

PS: The research paper that proposed the DBSCAN algorithm can be found here
<https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.

Thanks!

-Aditya

Density based Clustering in Mahout

Reply via email to