Re: Density based Clustering in Mahout

Aditya Wed, 05 Jul 2017 00:10:17 -0700

***Important** **Do read** *
Hello everyone,

Trevor and I have been discussing as to how to effectively represent an
R-Tree in Mahout. Turns out there is a method to represent a Binary Search
Tree (BST) in the form of an ancestor matrix. This
<http://www.geeksforgeeks.org/construct-ancestor-matrix-from-a-given-binary-tree/>
and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor-matrix/>
show the conversion logic from a tree representation to a matrix
representation and vice versa.

In a distributed scenario, I know of the following design
<https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJcci6JM4PyXLoNlVF0Yno8/edit?usp=sharing>
which is fairly efficient and intuitive to understand.

Now the point for debate is this:
**Please read the design provided in the link above before proceeding**
The R-Tree will always be a local entity, no where in the algorithm is
there a need / necessity to have a distributed R-Tree kind of scenario. On
the other hand, the data points as well as the union find data structure
need to be stored in a DRM like data structure and they very well can be
represented in the form of a matrix. (A union find data structure basically
can be implemented using a vector)

1. Why not build an R-Tree module in the form of a normal tree with two
children and a key-value pair? (I'm not sure if this is allowed in Mahout,
so veterans please chip in). Since an R-tree will always be an in-core
entity.

2. If 1. is not done, then the method followed for the matrix
representation of a BST should be followed. Only, the elements and
conditions will be changed. On an abstract sense, Matrix representation of
an R-Tree and matrix representation of a Binary Search Tree are analogous.
But in this case, the construction and search costs for the Matrix version
of an R-Tree will be costlier.

*PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if you
could offer your insights.*

Thanks,
Aditya

On Sat, Jun 24, 2017 at 3:41 AM, Trevor Grant <trevor.d.gr...@gmail.com>
wrote:

> What if you had Arrays of Matrices, or Arrays of Arrays of Matrices?  (e.g.
> 3d and 4d tensors)?
>
> I implemented these for the MLPs (still WIP)
>
> https://github.com/apache/mahout/pull/323/files#diff-
> cd8a7c5e2cf42b91b5aa47c96daf19c0R25
>
> But those functions were specifically to overcome the challenges you
> describe wrt neural network weight sets.
>
> If you could use those as is- that would be awesome, if not maybe you'll
> find inspiration there.
>
>
>
> On Thu, Jun 22, 2017 at 6:43 PM, Aditya <adityasarma...@gmail.com> wrote:
>
> > Hello everyone,
> >
> > I've been working for the past few weeks on coding an in-core DBSCAN
> > algorithm.
> >
> > A more efficient version with an O(n*log(n)) complexity does exist but it
> > uses the R-Tree data structure to index the data. I have a few
> > concerns/questions and I'm hoping you would be able to help me out.
> >
> > 1. Based on my knowledge, using an indexing data structure like an R-Tree
> > or a Kd-Tree is the only way to reduce the complexity of the dbscan
> > algorithm. If there's any other method that you are familiar with, please
> > let me know.
> >
> > 2. From what I've read in the book Apache Mahout: Beyond MapReduce
> written
> > by Andrew and Dmitry, I don't see how I can directly exploit the existing
> > data structures and operations to get the functionality of an R-Tree.
> >
> > 3. On the off chance that an R-Tree module has to built in Mahout,
> because
> > it is not possible to easily use existing operations I need some insights
> > as to how to go about it. I learned that everything in Mahout at the end
> > should be serializable to a vector. I can't fathom how to do that
> > intuitively in the case of an R-Tree
> >
> > There are a couple of other concerns that need to be discussed but these
> > are vital at the moment.
> >
> > PS: The research paper that proposed the DBSCAN algorithm can be found
> here
> > <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.
> >
> > Thanks!
> >
> > -Aditya
> >
>

Re: Density based Clustering in Mahout

Reply via email to