(1) I abandoned any attempts at DBScan and implemented another density algorithm itself (can't say which, subject to patent restrictions). The reason being, i couldn't immediately figure how to parallelize it efficiently (aside from data structure discussions), the base algorithm is inherently iterative.
(2) Samsara provides R-base level algebra, not general data structure support. IMO it would not pay to adjust it to do that any more than trying to fit R base to do it. I did implement spatial structures standardized on using Samsara in terms of carrying out computations (mostly in-memory), but those were still data structures in their on right. (3) like i said before, experience tells me that using collection of 2d tensors (esp. sparse tensors) is fine instead of trying to introduce n-d tensor. The fundamental difference here is mostly with sparse operations where n-d sparse tensor could intelligently execute those. But this is not supported properly pretty much by any library i know, so all the difference in most cases that say they support it directly is just putting a tensor api over collection of tensors. Practicality of having dense n-d tensors is also a bit questioned, since it immediately increases requirements for processing memory quantity of a single tensor instance, whereas collection of tensors could be represented as retaining streaming properties. etc. etc. Overall it sounds to me like a solution in a search of a problem (given my admittedly very limited experience as a practitioner in math). On Wed, Jul 5, 2017 at 12:09 AM, Aditya <adityasarma...@gmail.com> wrote: > ***Important** **Do read** * > Hello everyone, > > Trevor and I have been discussing as to how to effectively represent an > R-Tree in Mahout. Turns out there is a method to represent a Binary Search > Tree (BST) in the form of an ancestor matrix. This > <http://www.geeksforgeeks.org/construct-ancestor-matrix- > from-a-given-binary-tree/> > and this <http://www.geeksforgeeks.org/construct-tree-from-ancestor- > matrix/> > show the conversion logic from a tree representation to a matrix > representation and vice versa. > > In a distributed scenario, I know of the following design > <https://docs.google.com/document/d/1SfMIt8hYENwlm328rSGAMJcci6JM4 > PyXLoNlVF0Yno8/edit?usp=sharing> > which is fairly efficient and intuitive to understand. > > Now the point for debate is this: > **Please read the design provided in the link above before proceeding** > The R-Tree will always be a local entity, no where in the algorithm is > there a need / necessity to have a distributed R-Tree kind of scenario. On > the other hand, the data points as well as the union find data structure > need to be stored in a DRM like data structure and they very well can be > represented in the form of a matrix. (A union find data structure basically > can be implemented using a vector) > > 1. Why not build an R-Tree module in the form of a normal tree with two > children and a key-value pair? (I'm not sure if this is allowed in Mahout, > so veterans please chip in). Since an R-tree will always be an in-core > entity. > > 2. If 1. is not done, then the method followed for the matrix > representation of a BST should be followed. Only, the elements and > conditions will be changed. On an abstract sense, Matrix representation of > an R-Tree and matrix representation of a Binary Search Tree are analogous. > But in this case, the construction and search costs for the Matrix version > of an R-Tree will be costlier. > > > *PS: Shannon, Dmitry, Andrew P, Andrew M and Trevor, it'd be great if you > could offer your insights.* > > Thanks, > Aditya > > > On Sat, Jun 24, 2017 at 3:41 AM, Trevor Grant <trevor.d.gr...@gmail.com> > wrote: > > > What if you had Arrays of Matrices, or Arrays of Arrays of Matrices? > (e.g. > > 3d and 4d tensors)? > > > > I implemented these for the MLPs (still WIP) > > > > https://github.com/apache/mahout/pull/323/files#diff- > > cd8a7c5e2cf42b91b5aa47c96daf19c0R25 > > > > But those functions were specifically to overcome the challenges you > > describe wrt neural network weight sets. > > > > If you could use those as is- that would be awesome, if not maybe you'll > > find inspiration there. > > > > > > > > On Thu, Jun 22, 2017 at 6:43 PM, Aditya <adityasarma...@gmail.com> > wrote: > > > > > Hello everyone, > > > > > > I've been working for the past few weeks on coding an in-core DBSCAN > > > algorithm. > > > > > > A more efficient version with an O(n*log(n)) complexity does exist but > it > > > uses the R-Tree data structure to index the data. I have a few > > > concerns/questions and I'm hoping you would be able to help me out. > > > > > > 1. Based on my knowledge, using an indexing data structure like an > R-Tree > > > or a Kd-Tree is the only way to reduce the complexity of the dbscan > > > algorithm. If there's any other method that you are familiar with, > please > > > let me know. > > > > > > 2. From what I've read in the book Apache Mahout: Beyond MapReduce > > written > > > by Andrew and Dmitry, I don't see how I can directly exploit the > existing > > > data structures and operations to get the functionality of an R-Tree. > > > > > > 3. On the off chance that an R-Tree module has to built in Mahout, > > because > > > it is not possible to easily use existing operations I need some > insights > > > as to how to go about it. I learned that everything in Mahout at the > end > > > should be serializable to a vector. I can't fathom how to do that > > > intuitively in the case of an R-Tree > > > > > > There are a couple of other concerns that need to be discussed but > these > > > are vital at the moment. > > > > > > PS: The research paper that proposed the DBSCAN algorithm can be found > > here > > > <https://www.aaai.org/Papers/KDD/1996/KDD96-037.pdf>. > > > > > > Thanks! > > > > > > -Aditya > > > > > >