Re: [Distributed SQL] Do we have a plan to implement QuadTree index?

Alexey Goncharuk Thu, 16 Aug 2018 01:11:06 -0700

Alexey,

I am not sure if it will be a proper fir for you, but I think it worth a
try.


Apache Ignite has an option to index geospatial data using third-party
libraries (note that it is not included in the default Ignite build, the
module is activated via the lgpl profile). The index is located in
Ignite-geospatial module and uses an r-tree implementation underneath. One
downside of this module is that the geospatial index is not supported for
the Ignite native persistence yet.

Hope this helps!
--AG

чт, 16 авг. 2018 г. в 6:21, Alexey Zinoviev <[email protected]>:

> Sorry, for the delay, dear Pavel and Denis.
>
> Yes, I need a kind of indexing to improve KNN algorithms during training
> the model.
>
> In my draft solution I've implemented building of
> https://en.wikipedia.org/wiki/K-d_tree
> <https://en.wikipedia.org/wiki/K-d_tree> on the training data set.
> It collects the information about data distributed in our specific ML
> Datasets (distributed by nodes over Ignite cache)
>
> Pavel, you ask me any questions and I've prepared answers.
>
> 1) Should be this index in-memory only or you want to persist it?
> *Maybe it should be persisted (to reuse that for next predictions)*
>
> 2) In general index can be implemented in two ways per-partition and
> per-node.
> *Thank you for your explanation. Of course the per-node is better.*
>
> 3) I think K-d tree is preferable
> *You are absolutely right, it should be K-d tree*
>
> 4) Could you please share use cases you're trying to solve with QuadTree?
> With
> close to real data and examples of queries?
>
> I need to solve *k-nearest neighbors search task *on a set of vectors with
> unique keys presented in Ignite Cache (training set),
> during the training phase I'm going to build a temporary index as a KD-Tree
> (based on distance between vectors).
>
> The distance metric is a parameter too.
>
> After that, in prediction phase the *k-nearest neighbors search task *is
> solved by brute-force iteration over all vectors to find the *k-nearest
> neighbors.*
> I'd like to improve the search part with queries to index to extract
> closest vectors.
>
> Of course, it's kind of experiment, and maybe it's bad idea to patch SQL
> internals to solve this private task, but as you mentioned it can be a good
> point of collaboration between ML and SQL components.
>
> Can I get one of the implemented indices as a primary example and implement
> it by myself?
> Could you recommend something as an start point for me?
>
> Thanks for your help.
>
>
>
>
> 2018-08-04 11:18 GMT+06:00 Denis Magda <[email protected]>:
>
> > Alexey, are you working on some new ML/DL APIs/algorithms? Please
> elaborate
> > what you'd like to add to Ignite.
> >
> > --
> > Denis
> >
> > On Wed, Aug 1, 2018 at 3:10 PM Pavel Kovalenko <[email protected]>
> wrote:
> >
> > > Hello Alexey,
> > >
> > > It's not so difficult to implement new type of indexing of data, but if
> > you
> > > want to reach performance in distributed environment you need to have
> > > strong knowledge of a data you're indexing and what kind of queries you
> > > want to execute.
> > > Should be this index in-memory only or you want to persist it?
> > > In case of persistence your index should fit our page memory model
> > > requirements.
> > > In both cases your index should be ready to work in concurrent
> > environment.
> > >
> > > In general index can be implemented in two ways per-partition and
> > per-node.
> > > Per-partition may be efficient if you have a lot of points (x,y)
> > > representing a big one, e.g. image. In this case it's required that all
> > of
> > > these points will be in one partition that query e.g. makes images
> > > intersection will execute in one node. But if you have multiple images,
> > > your index will contain also another points from other object and will
> > > overload it.
> > > Per-node may be efficient if you have a lot of points (x,y) that are
> > > independent of each other, that you will use it as spatial e.g.. But in
> > > this case, I think K-d tree is preferable as it can be used in more
> wide
> > > way.
> > >
> > > Could you please share use cases you're trying to solve with QuadTree?
> > With
> > > close to real data and examples of queries? Because now the question is
> > too
> > > abstract and it's hard to understand how it should be implemented to
> > reach
> > > good results.
> > >
> > >
> > > 2018-08-01 16:45 GMT+03:00 Alexey Zinoviev <[email protected]>:
> > >
> > > > Hi, Igniters.
> > > >
> > > > Currently I'm working on different math stuff over the Apache Ignite
> > and
> > > in
> > > > a few tasks I need to implement in memory something like this
> > > >
> > > > https://en.wikipedia.org/wiki/Quadtree
> > > >
> > > > I didn't find such index in Apache Ignite, but maybe it's under
> > > development
> > > > by somebody?
> > > >
> > > > Is it a difficult to add a new index type to our distributed SQL
> (from
> > > > point of view of different infrastructure issues and so on P.S I
> don't
> > > > worry the math stuff here because I've implemented it many times in
> > > > non-distributed version)?
> > > >
> > > > It will be great to hear any kind of your thoughts and maybe somebody
> > > could
> > > > help with implementation
> > > >
> > > > zaleslaw
> > > >
> > >
> >
>

Re: [Distributed SQL] Do we have a plan to implement QuadTree index?

Reply via email to