Re: Density based Clustering in Mahout

Dmitriy Lyubimov Wed, 05 Jul 2017 09:00:03 -0700

(1) I abandoned any attempts at DBScan and implemented another density
algorithm itself (can't say which, subject to patent restrictions). The
reason being, i couldn't immediately figure how to parallelize it
efficiently (aside from data structure discussions), the base algorithm is
inherently iterative.


(2) Samsara provides R-base level algebra, not general data structure
support. IMO it would not pay to adjust it to do that any more than trying
to fit R base to do it. I did implement spatial structures standardized on
using Samsara in terms of carrying out computations (mostly in-memory), but
those were still data structures in their on right.

(3) like i said before, experience tells me that using collection of 2d
tensors (esp. sparse tensors) is fine instead of trying to introduce n-d
tensor. The fundamental difference here is mostly with sparse operations
where n-d sparse tensor could intelligently execute those. But this is not
supported properly pretty much by any library i know, so all the difference
in most cases that say they support it directly is just putting a tensor
api over collection of tensors. Practicality of having dense n-d tensors is
also a bit questioned, since it immediately increases requirements for
processing memory quantity of a single tensor instance, whereas collection
of tensors could be represented as retaining streaming properties. etc.
etc. Overall it sounds to me like a solution in a search of a problem
(given my admittedly very limited experience as a practitioner in math).


On Wed, Jul 5, 2017 at 12:09 AM, Aditya <adityasarma...@gmail.com> wrote:

> ***Important** **Do read** *
> Hello everyone,
>
> Trevor and I have been discussing as to how to effectively represent an
> R-Tree in Mahout. Turns out there is a method to represent a Binary Search
> Tree (BST) in the form of an ancestor matrix. This
> <http://www.geeksforgeeks.org/construct-ancestor-matrix-
> from-a-given-binary-tree/>
> and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor-
> matrix/>
> show the conversion logic from a tree representation to a matrix
> representation and vice versa.
>
> In a distributed scenario, I know of the following design
> <https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJcci6JM4
> PyXLoNlVF0Yno8/edit?usp=sharing>
> which is fairly efficient and intuitive to understand.
>
> Now the point for debate is this:
> **Please read the design provided in the link above before proceeding**
> The R-Tree will always be a local entity, no where in the algorithm is
> there a need / necessity to have a distributed R-Tree kind of scenario. On
> the other hand, the data points as well as the union find data structure
> need to be stored in a DRM like data structure and they very well can be
> represented in the form of a matrix. (A union find data structure basically
> can be implemented using a vector)
>
> 1. Why not build an R-Tree module in the form of a normal tree with two
> children and a key-value pair? (I'm not sure if this is allowed in Mahout,
> so veterans please chip in). Since an R-tree will always be an in-core
> entity.
>
> 2. If 1. is not done, then the method followed for the matrix
> representation of a BST should be followed. Only, the elements and
> conditions will be changed. On an abstract sense, Matrix representation of
> an R-Tree and matrix representation of a Binary Search Tree are analogous.
> But in this case, the construction and search costs for the Matrix version
> of an R-Tree will be costlier.
>
>
> *PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if you
> could offer your insights.*
>
> Thanks,
> Aditya
>
>
> On Sat, Jun 24, 2017 at 3:41 AM, Trevor Grant <trevor.d.gr...@gmail.com>
> wrote:
>
> > What if you had Arrays of Matrices, or Arrays of Arrays of Matrices?
> (e.g.
> > 3d and 4d tensors)?
> >
> > I implemented these for the MLPs (still WIP)
> >
> > https://github.com/apache/mahout/pull/323/files#diff-
> > cd8a7c5e2cf42b91b5aa47c96daf19c0R25
> >
> > But those functions were specifically to overcome the challenges you
> > describe wrt neural network weight sets.
> >
> > If you could use those as is- that would be awesome, if not maybe you'll
> > find inspiration there.
> >
> >
> >
> > On Thu, Jun 22, 2017 at 6:43 PM, Aditya <adityasarma...@gmail.com>
> wrote:
> >
> > > Hello everyone,
> > >
> > > I've been working for the past few weeks on coding an in-core DBSCAN
> > > algorithm.
> > >
> > > A more efficient version with an O(n*log(n)) complexity does exist but
> it
> > > uses the R-Tree data structure to index the data. I have a few
> > > concerns/questions and I'm hoping you would be able to help me out.
> > >
> > > 1. Based on my knowledge, using an indexing data structure like an
> R-Tree
> > > or a Kd-Tree is the only way to reduce the complexity of the dbscan
> > > algorithm. If there's any other method that you are familiar with,
> please
> > > let me know.
> > >
> > > 2. From what I've read in the book Apache Mahout: Beyond MapReduce
> > written
> > > by Andrew and Dmitry, I don't see how I can directly exploit the
> existing
> > > data structures and operations to get the functionality of an R-Tree.
> > >
> > > 3. On the off chance that an R-Tree module has to built in Mahout,
> > because
> > > it is not possible to easily use existing operations I need some
> insights
> > > as to how to go about it. I learned that everything in Mahout at the
> end
> > > should be serializable to a vector. I can't fathom how to do that
> > > intuitively in the case of an R-Tree
> > >
> > > There are a couple of other concerns that need to be discussed but
> these
> > > are vital at the moment.
> > >
> > > PS: The research paper that proposed the DBSCAN algorithm can be found
> > here
> > > <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.
> > >
> > > Thanks!
> > >
> > > -Aditya
> > >
> >
>

Re: Density based Clustering in Mahout

Reply via email to