To Dmitriy's point (2)- I think it is acceptable to create an R-Tree structure, that will exist only within the algorithm for doing in-core operations, (or maybe it lives slightly outside of the algorithm so we don't need to recreate trees for DBSCAN, Random Forrests, other tree-based algorithms- e.g. we can reuse the same trees for various algorithms.) BUT Trees only exist WITHIN the in-core, i.e. we don't want to modify the allReduceBlock to accept Matrices OR Trees, that will get out of hand fast. Please anyone chime in to correct me/argue against.
Aditya and I have talked offline, but the paper in question is [1] The data-scructure that needs to be serialized is referred to as the disjoint-set data structure. Trees are used locally, but the disjoint-set is the structure that is used during the reduce phase to merge. This is the dataset that will have to be serialized into a vector or matrix to remain complaint with current implementation of allReduceBlock. As far as I'm concerned- it would also be permisable to create a disjoint-set data structure, and implement the FIND and UNION methods per the referenced paper. The only requirement for Mahout compatability is to be able to serialize/deserialize this structure into a vector/matrix (I'm sure we can do this, and I will help Aditya once he makes the basic structure) So really, we've stumbled into a more important philosophical question- and that is: Is it acceptable to create objects which make the internals of algorithms easier to read and work with, so long as they may be serialized to incore matrices/vectors? I am +1, and if it is decided this is not acceptable, I need to go back and alter (or drop) things like the CanopyFn [2] of the Canopy Clustering Algorithm. [1] http://www.ece.northwestern.edu/~choudhar/Publications/ANewScalableParallelDBSCANAlgorithmUsingDisjointSetDataStructure.pdf [2] https://github.com/apache/mahout/blob/master/math-scala/src/main/scala/org/apache/mahout/math/algorithms/clustering/Canopy.scala#L129 On Wed, Jul 5, 2017 at 11:07 AM, Dmitriy Lyubimov <dlie...@gmail.com> wrote: > PS i read a few papers, including i believe that of Google's, on > partitioning of the DBScan problem for parallelization. It did not fit my > purposes though as they inherently assumed that every cluster problems had > enough centroids to figure to be efficiently partitioned. In effect it > amounted to similar effects as the probabilistic sketches techniques > mentioned in the book, but with much more headache for the bang. I > eventually turned to solving problems pre-sketched one way or another > (including for density clustering problems). > > On Wed, Jul 5, 2017 at 8:59 AM, Dmitriy Lyubimov <dlie...@gmail.com> > wrote: > > > (1) I abandoned any attempts at DBScan and implemented another density > > algorithm itself (can't say which, subject to patent restrictions). The > > reason being, i couldn't immediately figure how to parallelize it > > efficiently (aside from data structure discussions), the base algorithm > is > > inherently iterative. > > > > (2) Samsara provides R-base level algebra, not general data structure > > support. IMO it would not pay to adjust it to do that any more than > trying > > to fit R base to do it. I did implement spatial structures standardized > on > > using Samsara in terms of carrying out computations (mostly in-memory), > but > > those were still data structures in their on right. > > > > (3) like i said before, experience tells me that using collection of 2d > > tensors (esp. sparse tensors) is fine instead of trying to introduce n-d > > tensor. The fundamental difference here is mostly with sparse operations > > where n-d sparse tensor could intelligently execute those. But this is > not > > supported properly pretty much by any library i know, so all the > difference > > in most cases that say they support it directly is just putting a tensor > > api over collection of tensors. Practicality of having dense n-d tensors > is > > also a bit questioned, since it immediately increases requirements for > > processing memory quantity of a single tensor instance, whereas > collection > > of tensors could be represented as retaining streaming properties. etc. > > etc. Overall it sounds to me like a solution in a search of a problem > > (given my admittedly very limited experience as a practitioner in math). > > > > > > On Wed, Jul 5, 2017 at 12:09 AM, Aditya <adityasarma...@gmail.com> > wrote: > > > >> ***Important** **Do read** * > >> Hello everyone, > >> > >> Trevor and I have been discussing as to how to effectively represent an > >> R-Tree in Mahout. Turns out there is a method to represent a Binary > Search > >> Tree (BST) in the form of an ancestor matrix. This > >> <http://www.geeksforgeeks.org/construct-ancestor-matrix-from > >> -a-given-binary-tree/> > >> and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor-m > >> atrix/> > >> show the conversion logic from a tree representation to a matrix > >> representation and vice versa. > >> > >> In a distributed scenario, I know of the following design > >> <https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJc > >> ci6JM4PyXLoNlVF0Yno8/edit?usp=sharing> > >> which is fairly efficient and intuitive to understand. > >> > >> Now the point for debate is this: > >> **Please read the design provided in the link above before proceeding** > >> The R-Tree will always be a local entity, no where in the algorithm is > >> there a need / necessity to have a distributed R-Tree kind of scenario. > On > >> the other hand, the data points as well as the union find data structure > >> need to be stored in a DRM like data structure and they very well can be > >> represented in the form of a matrix. (A union find data structure > >> basically > >> can be implemented using a vector) > >> > >> 1. Why not build an R-Tree module in the form of a normal tree with two > >> children and a key-value pair? (I'm not sure if this is allowed in > Mahout, > >> so veterans please chip in). Since an R-tree will always be an in-core > >> entity. > >> > >> 2. If 1. is not done, then the method followed for the matrix > >> representation of a BST should be followed. Only, the elements and > >> conditions will be changed. On an abstract sense, Matrix representation > of > >> an R-Tree and matrix representation of a Binary Search Tree are > analogous. > >> But in this case, the construction and search costs for the Matrix > version > >> of an R-Tree will be costlier. > >> > >> > >> *PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if > you > >> could offer your insights.* > >> > >> Thanks, > >> Aditya > >> > >> > >> On Sat, Jun 24, 2017 at 3:41 AM, Trevor Grant <trevor.d.gr...@gmail.com > > > >> wrote: > >> > >> > What if you had Arrays of Matrices, or Arrays of Arrays of Matrices? > >> (e.g. > >> > 3d and 4d tensors)? > >> > > >> > I implemented these for the MLPs (still WIP) > >> > > >> > https://github.com/apache/mahout/pull/323/files#diff- > >> > cd8a7c5e2cf42b91b5aa47c96daf19c0R25 > >> > > >> > But those functions were specifically to overcome the challenges you > >> > describe wrt neural network weight sets. > >> > > >> > If you could use those as is- that would be awesome, if not maybe > you'll > >> > find inspiration there. > >> > > >> > > >> > > >> > On Thu, Jun 22, 2017 at 6:43 PM, Aditya <adityasarma...@gmail.com> > >> wrote: > >> > > >> > > Hello everyone, > >> > > > >> > > I've been working for the past few weeks on coding an in-core DBSCAN > >> > > algorithm. > >> > > > >> > > A more efficient version with an O(n*log(n)) complexity does exist > >> but it > >> > > uses the R-Tree data structure to index the data. I have a few > >> > > concerns/questions and I'm hoping you would be able to help me out. > >> > > > >> > > 1. Based on my knowledge, using an indexing data structure like an > >> R-Tree > >> > > or a Kd-Tree is the only way to reduce the complexity of the dbscan > >> > > algorithm. If there's any other method that you are familiar with, > >> please > >> > > let me know. > >> > > > >> > > 2. From what I've read in the book Apache Mahout: Beyond MapReduce > >> > written > >> > > by Andrew and Dmitry, I don't see how I can directly exploit the > >> existing > >> > > data structures and operations to get the functionality of an > R-Tree. > >> > > > >> > > 3. On the off chance that an R-Tree module has to built in Mahout, > >> > because > >> > > it is not possible to easily use existing operations I need some > >> insights > >> > > as to how to go about it. I learned that everything in Mahout at the > >> end > >> > > should be serializable to a vector. I can't fathom how to do that > >> > > intuitively in the case of an R-Tree > >> > > > >> > > There are a couple of other concerns that need to be discussed but > >> these > >> > > are vital at the moment. > >> > > > >> > > PS: The research paper that proposed the DBSCAN algorithm can be > found > >> > here > >> > > <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>. > >> > > > >> > > Thanks! > >> > > > >> > > -Aditya > >> > > > >> > > >> > > > > >