Re: [Distributed SQL] Do we have a plan to implement QuadTree index?

Vladimir Ozerov Thu, 16 Aug 2018 01:17:29 -0700

Hi Alex,

>From what I see there is some interest to K-D indexes in SQL world.
Postgres has it. MySQL users asks about it from time to time, and usually
use some simpler spatial solutions ask MySQL lacks this index type. We
definitely may consider integrating it with SQL, but this would be big
enough feature, requiring changes in almost all SQL components.


For this reason I would put SQL case aside for now and focus on
implementation for ML. I had a talk with Yuriy Babak some time ago and
explained how new index type can be added to the product. May be Yuriy has
some thoughts on how to do that properly with respect to ML roadmap and
plans.

On Thu, Aug 16, 2018 at 6:21 AM Alexey Zinoviev <[email protected]>
wrote:

> Sorry, for the delay, dear Pavel and Denis.
>
> Yes, I need a kind of indexing to improve KNN algorithms during training
> the model.
>
> In my draft solution I've implemented building of
> https://en.wikipedia.org/wiki/K-d_tree
> <https://en.wikipedia.org/wiki/K-d_tree> on the training data set.
> It collects the information about data distributed in our specific ML
> Datasets (distributed by nodes over Ignite cache)
>
> Pavel, you ask me any questions and I've prepared answers.
>
> 1) Should be this index in-memory only or you want to persist it?
> *Maybe it should be persisted (to reuse that for next predictions)*
>
> 2) In general index can be implemented in two ways per-partition and
> per-node.
> *Thank you for your explanation. Of course the per-node is better.*
>
> 3) I think K-d tree is preferable
> *You are absolutely right, it should be K-d tree*
>
> 4) Could you please share use cases you're trying to solve with QuadTree?
> With
> close to real data and examples of queries?
>
> I need to solve *k-nearest neighbors search task *on a set of vectors with
> unique keys presented in Ignite Cache (training set),
> during the training phase I'm going to build a temporary index as a KD-Tree
> (based on distance between vectors).
>
> The distance metric is a parameter too.
>
> After that, in prediction phase the *k-nearest neighbors search task *is
> solved by brute-force iteration over all vectors to find the *k-nearest
> neighbors.*
> I'd like to improve the search part with queries to index to extract
> closest vectors.
>
> Of course, it's kind of experiment, and maybe it's bad idea to patch SQL
> internals to solve this private task, but as you mentioned it can be a good
> point of collaboration between ML and SQL components.
>
> Can I get one of the implemented indices as a primary example and implement
> it by myself?
> Could you recommend something as an start point for me?
>
> Thanks for your help.
>
>
>
>
> 2018-08-04 11:18 GMT+06:00 Denis Magda <[email protected]>:
>
> > Alexey, are you working on some new ML/DL APIs/algorithms? Please
> elaborate
> > what you'd like to add to Ignite.
> >
> > --
> > Denis
> >
> > On Wed, Aug 1, 2018 at 3:10 PM Pavel Kovalenko <[email protected]>
> wrote:
> >
> > > Hello Alexey,
> > >
> > > It's not so difficult to implement new type of indexing of data, but if
> > you
> > > want to reach performance in distributed environment you need to have
> > > strong knowledge of a data you're indexing and what kind of queries you
> > > want to execute.
> > > Should be this index in-memory only or you want to persist it?
> > > In case of persistence your index should fit our page memory model
> > > requirements.
> > > In both cases your index should be ready to work in concurrent
> > environment.
> > >
> > > In general index can be implemented in two ways per-partition and
> > per-node.
> > > Per-partition may be efficient if you have a lot of points (x,y)
> > > representing a big one, e.g. image. In this case it's required that all
> > of
> > > these points will be in one partition that query e.g. makes images
> > > intersection will execute in one node. But if you have multiple images,
> > > your index will contain also another points from other object and will
> > > overload it.
> > > Per-node may be efficient if you have a lot of points (x,y) that are
> > > independent of each other, that you will use it as spatial e.g.. But in
> > > this case, I think K-d tree is preferable as it can be used in more
> wide
> > > way.
> > >
> > > Could you please share use cases you're trying to solve with QuadTree?
> > With
> > > close to real data and examples of queries? Because now the question is
> > too
> > > abstract and it's hard to understand how it should be implemented to
> > reach
> > > good results.
> > >
> > >
> > > 2018-08-01 16:45 GMT+03:00 Alexey Zinoviev <[email protected]>:
> > >
> > > > Hi, Igniters.
> > > >
> > > > Currently I'm working on different math stuff over the Apache Ignite
> > and
> > > in
> > > > a few tasks I need to implement in memory something like this
> > > >
> > > > https://en.wikipedia.org/wiki/Quadtree
> > > >
> > > > I didn't find such index in Apache Ignite, but maybe it's under
> > > development
> > > > by somebody?
> > > >
> > > > Is it a difficult to add a new index type to our distributed SQL
> (from
> > > > point of view of different infrastructure issues and so on P.S I
> don't
> > > > worry the math stuff here because I've implemented it many times in
> > > > non-distributed version)?
> > > >
> > > > It will be great to hear any kind of your thoughts and maybe somebody
> > > could
> > > > help with implementation
> > > >
> > > > zaleslaw
> > > >
> > >
> >
>

Re: [Distributed SQL] Do we have a plan to implement QuadTree index?

Reply via email to