Re: Density based Clustering in Mahout

Trevor Grant Thu, 06 Jul 2017 09:45:40 -0700

To Dmitriy's point (2)- I think it is acceptable to create an R-Tree
structure, that will exist only within the algorithm for doing in-core
operations, (or maybe it lives slightly outside of the algorithm so we
don't need to recreate trees for DBSCAN, Random Forrests, other tree-based
algorithms- e.g. we can reuse the same trees for various algorithms.)  BUT
Trees only exist WITHIN the in-core, i.e. we don't want to modify the
allReduceBlock to accept Matrices OR Trees, that will get out of hand
fast.  Please anyone chime in to correct me/argue against.


Aditya and I have talked offline, but the paper in question is [1]

The data-scructure that needs to be serialized is referred to as the
disjoint-set data structure.  Trees are used locally, but the disjoint-set
is the structure that is used during the reduce phase to merge.  This is
the dataset that will have to be serialized into a vector or matrix to
remain complaint with current implementation of allReduceBlock.

As far as I'm concerned- it would also be permisable to create a
disjoint-set data structure, and implement the FIND and UNION methods per
the referenced paper. The only requirement for Mahout compatability is to
be able to serialize/deserialize this structure into a vector/matrix (I'm
sure we can do this, and I will help Aditya once he makes the basic
structure)

So really, we've stumbled into a more important philosophical question- and
that is: Is it acceptable to create objects which make the internals of
algorithms easier to read and work with, so long as they may be serialized
to incore matrices/vectors? I am +1, and if it is decided this is not
acceptable, I need to go back and alter (or drop) things like the CanopyFn
[2] of the Canopy Clustering Algorithm.

[1]
http://www.ece.northwestern.edu/~choudhar/Publications/ANewScalableParallelDBSCANAlgorithmUsingDisjointSetDataStructure.pdf
[2]
https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/Canopy.scala#L129

On Wed, Jul 5, 2017 at 11:07 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote:

> PS i read a few papers, including i believe that of Google's, on
> partitioning of the DBScan problem for parallelization. It did not fit my
> purposes though as they inherently assumed that every cluster problems had
> enough centroids to figure to be efficiently partitioned. In effect it
> amounted to similar effects as the probabilistic sketches techniques
> mentioned in the book, but with much more headache for the bang. I
> eventually turned to solving problems pre-sketched one way or another
> (including for density clustering problems).
>
> On Wed, Jul 5, 2017 at 8:59 AM, Dmitriy Lyubimov <dlie...@gmail.com>
> wrote:
>
> > (1) I abandoned any attempts at DBScan and implemented another density
> > algorithm itself (can't say which, subject to patent restrictions). The
> > reason being, i couldn't immediately figure how to parallelize it
> > efficiently (aside from data structure discussions), the base algorithm
> is
> > inherently iterative.
> >
> > (2) Samsara provides R-base level algebra, not general data structure
> > support. IMO it would not pay to adjust it to do that any more than
> trying
> > to fit R base to do it. I did implement spatial structures standardized
> on
> > using Samsara in terms of carrying out computations (mostly in-memory),
> but
> > those were still data structures in their on right.
> >
> > (3) like i said before, experience tells me that using collection of 2d
> > tensors (esp. sparse tensors) is fine instead of trying to introduce n-d
> > tensor. The fundamental difference here is mostly with sparse operations
> > where n-d sparse tensor could intelligently execute those. But this is
> not
> > supported properly pretty much by any library i know, so all the
> difference
> > in most cases that say they support it directly is just putting a tensor
> > api over collection of tensors. Practicality of having dense n-d tensors
> is
> > also a bit questioned, since it immediately increases requirements for
> > processing memory quantity of a single tensor instance, whereas
> collection
> > of tensors could be represented as retaining streaming properties. etc.
> > etc. Overall it sounds to me like a solution in a search of a problem
> > (given my admittedly very limited experience as a practitioner in math).
> >
> >
> > On Wed, Jul 5, 2017 at 12:09 AM, Aditya <adityasarma...@gmail.com>
> wrote:
> >
> >> ***Important** **Do read** *
> >> Hello everyone,
> >>
> >> Trevor and I have been discussing as to how to effectively represent an
> >> R-Tree in Mahout. Turns out there is a method to represent a Binary
> Search
> >> Tree (BST) in the form of an ancestor matrix. This
> >> <http://www.geeksforgeeks.org/construct-ancestor-matrix-from
> >> -a-given-binary-tree/>
> >> and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor-m
> >> atrix/>
> >> show the conversion logic from a tree representation to a matrix
> >> representation and vice versa.
> >>
> >> In a distributed scenario, I know of the following design
> >> <https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJc
> >> ci6JM4PyXLoNlVF0Yno8/edit?usp=sharing>
> >> which is fairly efficient and intuitive to understand.
> >>
> >> Now the point for debate is this:
> >> **Please read the design provided in the link above before proceeding**
> >> The R-Tree will always be a local entity, no where in the algorithm is
> >> there a need / necessity to have a distributed R-Tree kind of scenario.
> On
> >> the other hand, the data points as well as the union find data structure
> >> need to be stored in a DRM like data structure and they very well can be
> >> represented in the form of a matrix. (A union find data structure
> >> basically
> >> can be implemented using a vector)
> >>
> >> 1. Why not build an R-Tree module in the form of a normal tree with two
> >> children and a key-value pair? (I'm not sure if this is allowed in
> Mahout,
> >> so veterans please chip in). Since an R-tree will always be an in-core
> >> entity.
> >>
> >> 2. If 1. is not done, then the method followed for the matrix
> >> representation of a BST should be followed. Only, the elements and
> >> conditions will be changed. On an abstract sense, Matrix representation
> of
> >> an R-Tree and matrix representation of a Binary Search Tree are
> analogous.
> >> But in this case, the construction and search costs for the Matrix
> version
> >> of an R-Tree will be costlier.
> >>
> >>
> >> *PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if
> you
> >> could offer your insights.*
> >>
> >> Thanks,
> >> Aditya
> >>
> >>
> >> On Sat, Jun 24, 2017 at 3:41 AM, Trevor Grant <trevor.d.gr...@gmail.com
> >
> >> wrote:
> >>
> >> > What if you had Arrays of Matrices, or Arrays of Arrays of Matrices?
> >> (e.g.
> >> > 3d and 4d tensors)?
> >> >
> >> > I implemented these for the MLPs (still WIP)
> >> >
> >> > https://github.com/apache/mahout/pull/323/files#diff-
> >> > cd8a7c5e2cf42b91b5aa47c96daf19c0R25
> >> >
> >> > But those functions were specifically to overcome the challenges you
> >> > describe wrt neural network weight sets.
> >> >
> >> > If you could use those as is- that would be awesome, if not maybe
> you'll
> >> > find inspiration there.
> >> >
> >> >
> >> >
> >> > On Thu, Jun 22, 2017 at 6:43 PM, Aditya <adityasarma...@gmail.com>
> >> wrote:
> >> >
> >> > > Hello everyone,
> >> > >
> >> > > I've been working for the past few weeks on coding an in-core DBSCAN
> >> > > algorithm.
> >> > >
> >> > > A more efficient version with an O(n*log(n)) complexity does exist
> >> but it
> >> > > uses the R-Tree data structure to index the data. I have a few
> >> > > concerns/questions and I'm hoping you would be able to help me out.
> >> > >
> >> > > 1. Based on my knowledge, using an indexing data structure like an
> >> R-Tree
> >> > > or a Kd-Tree is the only way to reduce the complexity of the dbscan
> >> > > algorithm. If there's any other method that you are familiar with,
> >> please
> >> > > let me know.
> >> > >
> >> > > 2. From what I've read in the book Apache Mahout: Beyond MapReduce
> >> > written
> >> > > by Andrew and Dmitry, I don't see how I can directly exploit the
> >> existing
> >> > > data structures and operations to get the functionality of an
> R-Tree.
> >> > >
> >> > > 3. On the off chance that an R-Tree module has to built in Mahout,
> >> > because
> >> > > it is not possible to easily use existing operations I need some
> >> insights
> >> > > as to how to go about it. I learned that everything in Mahout at the
> >> end
> >> > > should be serializable to a vector. I can't fathom how to do that
> >> > > intuitively in the case of an R-Tree
> >> > >
> >> > > There are a couple of other concerns that need to be discussed but
> >> these
> >> > > are vital at the moment.
> >> > >
> >> > > PS: The research paper that proposed the DBSCAN algorithm can be
> found
> >> > here
> >> > > <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.
> >> > >
> >> > > Thanks!
> >> > >
> >> > > -Aditya
> >> > >
> >> >
> >>
> >
> >
>

Re: Density based Clustering in Mahout

Reply via email to