Re: Density based Clustering in Mahout

Dmitriy Lyubimov Wed, 05 Jul 2017 09:07:21 -0700

PS i read a few papers, including i believe that of Google's, on
partitioning of the DBScan problem for parallelization. It did not fit my
purposes though as they inherently assumed that every cluster problems had
enough centroids to figure to be efficiently partitioned. In effect it
amounted to similar effects as the probabilistic sketches techniques
mentioned in the book, but with much more headache for the bang. I
eventually turned to solving problems pre-sketched one way or another
(including for density clustering problems).


On Wed, Jul 5, 2017 at 8:59 AM, Dmitriy Lyubimov <[email protected]> wrote:

> (1) I abandoned any attempts at DBScan and implemented another density
> algorithm itself (can't say which, subject to patent restrictions). The
> reason being, i couldn't immediately figure how to parallelize it
> efficiently (aside from data structure discussions), the base algorithm is
> inherently iterative.
>
> (2) Samsara provides R-base level algebra, not general data structure
> support. IMO it would not pay to adjust it to do that any more than trying
> to fit R base to do it. I did implement spatial structures standardized on
> using Samsara in terms of carrying out computations (mostly in-memory), but
> those were still data structures in their on right.
>
> (3) like i said before, experience tells me that using collection of 2d
> tensors (esp. sparse tensors) is fine instead of trying to introduce n-d
> tensor. The fundamental difference here is mostly with sparse operations
> where n-d sparse tensor could intelligently execute those. But this is not
> supported properly pretty much by any library i know, so all the difference
> in most cases that say they support it directly is just putting a tensor
> api over collection of tensors. Practicality of having dense n-d tensors is
> also a bit questioned, since it immediately increases requirements for
> processing memory quantity of a single tensor instance, whereas collection
> of tensors could be represented as retaining streaming properties. etc.
> etc. Overall it sounds to me like a solution in a search of a problem
> (given my admittedly very limited experience as a practitioner in math).
>
>
> On Wed, Jul 5, 2017 at 12:09 AM, Aditya <[email protected]> wrote:
>
>> ***Important** **Do read** *
>> Hello everyone,
>>
>> Trevor and I have been discussing as to how to effectively represent an
>> R-Tree in Mahout. Turns out there is a method to represent a Binary Search
>> Tree (BST) in the form of an ancestor matrix. This
>> <http://www.geeksforgeeks.org/construct-ancestor-matrix-from
>> -a-given-binary-tree/>
>> and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor-m
>> atrix/>
>> show the conversion logic from a tree representation to a matrix
>> representation and vice versa.
>>
>> In a distributed scenario, I know of the following design
>> <https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJc
>> ci6JM4PyXLoNlVF0Yno8/edit?usp=sharing>
>> which is fairly efficient and intuitive to understand.
>>
>> Now the point for debate is this:
>> **Please read the design provided in the link above before proceeding**
>> The R-Tree will always be a local entity, no where in the algorithm is
>> there a need / necessity to have a distributed R-Tree kind of scenario. On
>> the other hand, the data points as well as the union find data structure
>> need to be stored in a DRM like data structure and they very well can be
>> represented in the form of a matrix. (A union find data structure
>> basically
>> can be implemented using a vector)
>>
>> 1. Why not build an R-Tree module in the form of a normal tree with two
>> children and a key-value pair? (I'm not sure if this is allowed in Mahout,
>> so veterans please chip in). Since an R-tree will always be an in-core
>> entity.
>>
>> 2. If 1. is not done, then the method followed for the matrix
>> representation of a BST should be followed. Only, the elements and
>> conditions will be changed. On an abstract sense, Matrix representation of
>> an R-Tree and matrix representation of a Binary Search Tree are analogous.
>> But in this case, the construction and search costs for the Matrix version
>> of an R-Tree will be costlier.
>>
>>
>> *PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if you
>> could offer your insights.*
>>
>> Thanks,
>> Aditya
>>
>>
>> On Sat, Jun 24, 2017 at 3:41 AM, Trevor Grant <[email protected]>
>> wrote:
>>
>> > What if you had Arrays of Matrices, or Arrays of Arrays of Matrices?
>> (e.g.
>> > 3d and 4d tensors)?
>> >
>> > I implemented these for the MLPs (still WIP)
>> >
>> > https://github.com/apache/mahout/pull/323/files#diff-
>> > cd8a7c5e2cf42b91b5aa47c96daf19c0R25
>> >
>> > But those functions were specifically to overcome the challenges you
>> > describe wrt neural network weight sets.
>> >
>> > If you could use those as is- that would be awesome, if not maybe you'll
>> > find inspiration there.
>> >
>> >
>> >
>> > On Thu, Jun 22, 2017 at 6:43 PM, Aditya <[email protected]>
>> wrote:
>> >
>> > > Hello everyone,
>> > >
>> > > I've been working for the past few weeks on coding an in-core DBSCAN
>> > > algorithm.
>> > >
>> > > A more efficient version with an O(n*log(n)) complexity does exist
>> but it
>> > > uses the R-Tree data structure to index the data. I have a few
>> > > concerns/questions and I'm hoping you would be able to help me out.
>> > >
>> > > 1. Based on my knowledge, using an indexing data structure like an
>> R-Tree
>> > > or a Kd-Tree is the only way to reduce the complexity of the dbscan
>> > > algorithm. If there's any other method that you are familiar with,
>> please
>> > > let me know.
>> > >
>> > > 2. From what I've read in the book Apache Mahout: Beyond MapReduce
>> > written
>> > > by Andrew and Dmitry, I don't see how I can directly exploit the
>> existing
>> > > data structures and operations to get the functionality of an R-Tree.
>> > >
>> > > 3. On the off chance that an R-Tree module has to built in Mahout,
>> > because
>> > > it is not possible to easily use existing operations I need some
>> insights
>> > > as to how to go about it. I learned that everything in Mahout at the
>> end
>> > > should be serializable to a vector. I can't fathom how to do that
>> > > intuitively in the case of an R-Tree
>> > >
>> > > There are a couple of other concerns that need to be discussed but
>> these
>> > > are vital at the moment.
>> > >
>> > > PS: The research paper that proposed the DBSCAN algorithm can be found
>> > here
>> > > <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>.
>> > >
>> > > Thanks!
>> > >
>> > > -Aditya
>> > >
>> >
>>
>
>

Re: Density based Clustering in Mahout

Reply via email to