Alexey, I am not sure if it will be a proper fit for you, but I think it is worth a try.
Apache Ignite has an option to index geospatial data using third-party libraries. Note that it is not included in the default Ignite build; the module is activated via the lgpl profile. The index is located in the ignite-geospatial module and uses an R-tree implementation underneath. One downside of this module is that the geospatial index is not yet supported with Ignite native persistence.

Hope this helps!

--AG

On Thu, Aug 16, 2018 at 6:21, Alexey Zinoviev <zaleslaw....@gmail.com> wrote:

> Sorry for the delay, dear Pavel and Denis.
>
> Yes, I need some kind of indexing to speed up KNN algorithms during model training.
>
> In my draft solution I've implemented building a k-d tree (https://en.wikipedia.org/wiki/K-d_tree) on the training data set.
> It collects information about the data held in our ML-specific Datasets (distributed across nodes over an Ignite cache).
>
> Pavel, you asked me some questions, and I've prepared answers.
>
> 1) Should this index be in-memory only, or do you want to persist it?
> *Maybe it should be persisted (to reuse it for subsequent predictions)*
>
> 2) In general, an index can be implemented in two ways: per-partition and per-node.
> *Thank you for your explanation. Of course, per-node is better.*
>
> 3) I think K-d tree is preferable
> *You are absolutely right, it should be a K-d tree*
>
> 4) Could you please share the use cases you're trying to solve with QuadTree, with close-to-real data and examples of queries?
>
> I need to solve the *k-nearest neighbors search task* on a set of vectors with unique keys stored in an Ignite cache (the training set). During the training phase I'm going to build a temporary index as a k-d tree (based on the distance between vectors).
>
> The distance metric is a parameter too.
>
> After that, in the prediction phase, the *k-nearest neighbors search task* is currently solved by brute-force iteration over all vectors to find the *k nearest neighbors*.
> I'd like to improve the search part by querying the index to extract the closest vectors.
>
> Of course, it's a kind of experiment, and maybe it's a bad idea to patch SQL internals to solve this private task, but as you mentioned, it can be a good point of collaboration between the ML and SQL components.
>
> Can I take one of the implemented indexes as a reference example and implement it myself?
> Could you recommend something as a starting point?
>
> Thanks for your help.
>
> 2018-08-04 11:18 GMT+06:00 Denis Magda <dma...@apache.org>:
>
> > Alexey, are you working on some new ML/DL APIs/algorithms? Please elaborate on what you'd like to add to Ignite.
> >
> > --
> > Denis
> >
> > On Wed, Aug 1, 2018 at 3:10 PM Pavel Kovalenko <jokse...@gmail.com> wrote:
> >
> > > Hello Alexey,
> > >
> > > It's not so difficult to implement a new type of data indexing, but if you want to achieve good performance in a distributed environment, you need strong knowledge of the data you're indexing and of the kinds of queries you want to execute.
> > > Should this index be in-memory only, or do you want to persist it?
> > > In the case of persistence, your index should fit our page memory model requirements.
> > > In both cases your index should be ready to work in a concurrent environment.
> > >
> > > In general, an index can be implemented in two ways: per-partition and per-node.
> > > Per-partition may be efficient if you have a lot of points (x,y) that together represent one big object, e.g. an image. In this case all of these points must be in one partition, so that a query which, e.g., computes the intersection of images executes on one node. But if you have multiple images, your index will also contain points from other objects, which will overload it.
> > > Per-node may be efficient if you have a lot of points (x,y) that are independent of each other and that you use, e.g., as spatial data. But in this case, I think a K-d tree is preferable, as it can be used more widely.
> > >
> > > Could you please share the use cases you're trying to solve with QuadTree, with close-to-real data and examples of queries? Right now the question is too abstract, and it's hard to understand how it should be implemented to achieve good results.
> > >
> > > 2018-08-01 16:45 GMT+03:00 Alexey Zinoviev <zaleslaw....@gmail.com>:
> > >
> > > > Hi, Igniters.
> > > >
> > > > Currently I'm working on various math stuff on top of Apache Ignite, and in a few tasks I need to implement something like this in memory:
> > > >
> > > > https://en.wikipedia.org/wiki/Quadtree
> > > >
> > > > I didn't find such an index in Apache Ignite, but maybe somebody is already developing it?
> > > >
> > > > Is it difficult to add a new index type to our distributed SQL (from the point of view of various infrastructure issues and so on)? P.S. I'm not worried about the math here, because I've implemented it many times in a non-distributed version.
> > > >
> > > > It would be great to hear any thoughts you have, and maybe somebody could help with the implementation.
> > > >
> > > > zaleslaw
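
For readers following the thread, the approach Alexey describes above (build a k-d tree over the training vectors, then answer k-nearest-neighbor queries from the tree instead of iterating over every vector, with the distance metric as a parameter) can be sketched roughly as follows. This is a single-node, in-memory illustration only, not Ignite code: the class name `KdTreeSketch` and the `DistanceMetric` interface are hypothetical names invented for this example, and nothing here touches Ignite caches, partitions, or SQL indexes. The pruning step also assumes the metric is never smaller than the absolute difference along any single coordinate, which holds for Euclidean and Manhattan distance but not for arbitrary metrics.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Minimal in-memory k-d tree for k-nearest-neighbor search (illustrative sketch only). */
public class KdTreeSketch {
    /** Hypothetical metric interface for this example; not an Ignite API. */
    public interface DistanceMetric {
        double distance(double[] a, double[] b);
    }

    public static final DistanceMetric EUCLIDEAN = (a, b) -> {
        double sum = 0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    };

    private static final class Node {
        final double[] point;
        final int axis;
        Node left, right;
        Node(double[] point, int axis) { this.point = point; this.axis = axis; }
    }

    private final Node root;
    private final DistanceMetric metric;

    public KdTreeSketch(List<double[]> points, DistanceMetric metric) {
        this.metric = metric;
        this.root = build(new ArrayList<>(points), 0);
    }

    /** Splits on the median along one axis, cycling through the axes by depth. */
    private static Node build(List<double[]> pts, int depth) {
        if (pts.isEmpty()) return null;
        int axis = depth % pts.get(0).length;
        pts.sort(Comparator.comparingDouble((double[] p) -> p[axis]));
        int mid = pts.size() / 2;
        Node node = new Node(pts.get(mid), axis);
        node.left = build(new ArrayList<>(pts.subList(0, mid)), depth + 1);
        node.right = build(new ArrayList<>(pts.subList(mid + 1, pts.size())), depth + 1);
        return node;
    }

    /** Returns up to k stored points closest to the query, nearest first. */
    public List<double[]> nearest(double[] query, int k) {
        if (k <= 0) return new ArrayList<>();
        // Max-heap on distance: the head is the worst of the current best k.
        PriorityQueue<double[]> best = new PriorityQueue<>(
            Comparator.comparingDouble((double[] p) -> metric.distance(query, p)).reversed());
        search(root, query, k, best);
        List<double[]> result = new ArrayList<>(best);
        result.sort(Comparator.comparingDouble((double[] p) -> metric.distance(query, p)));
        return result;
    }

    private void search(Node node, double[] query, int k, PriorityQueue<double[]> best) {
        if (node == null) return;
        best.add(node.point);
        if (best.size() > k) best.poll(); // drop the current farthest candidate
        double diff = query[node.axis] - node.point[node.axis];
        Node near = diff < 0 ? node.left : node.right;
        Node far = diff < 0 ? node.right : node.left;
        search(near, query, k, best);
        // Visit the far side only if the splitting plane is closer than the worst
        // candidate (assumes metric >= per-axis absolute difference, see above).
        if (best.size() < k || Math.abs(diff) < metric.distance(query, best.peek())) {
            search(far, query, k, best);
        }
    }

    public static void main(String[] args) {
        List<double[]> training = Arrays.asList(
            new double[] {2, 3}, new double[] {5, 4}, new double[] {9, 6},
            new double[] {4, 7}, new double[] {8, 1}, new double[] {7, 2});
        KdTreeSketch tree = new KdTreeSketch(training, EUCLIDEAN);
        for (double[] p : tree.nearest(new double[] {9, 2}, 2)) {
            System.out.println(Arrays.toString(p));
        }
    }
}
```

A per-node variant in the sense Pavel describes would presumably build one such tree per node over that node's local vectors and merge the per-node top-k results, but that part, along with persistence and concurrency, is outside this sketch.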