Hello everyone, I've been working for the past few weeks on coding an in-core DBSCAN algorithm.
A more efficient version with an O(n*log(n)) complexity does exist but it uses the R-Tree data structure to index the data. I have a few concerns/questions and I'm hoping you would be able to help me out. 1. Based on my knowledge, using an indexing data structure like an R-Tree or a Kd-Tree is the only way to reduce the complexity of the dbscan algorithm. If there's any other method that you are familiar with, please let me know. 2. From what I've read in the book Apache Mahout: Beyond MapReduce written by Andrew and Dmitry, I don't see how I can directly exploit the existing data structures and operations to get the functionality of an R-Tree. 3. On the off chance that an R-Tree module has to built in Mahout, because it is not possible to easily use existing operations I need some insights as to how to go about it. I learned that everything in Mahout at the end should be serializable to a vector. I can't fathom how to do that intuitively in the case of an R-Tree There are a couple of other concerns that need to be discussed but these are vital at the moment. PS: The research paper that proposed the DBSCAN algorithm can be found here <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>. Thanks! -Aditya
