Re: [Distributed SQL] Do we have a plan to implement QuadTree index?

Alexey Zinoviev Thu, 16 Aug 2018 04:27:01 -0700

Vladimir, yes I remember this discussion with Yuri. Of course I couldn't
estimate the changes to add new index type as you and thank you for your
consultation.


Maybe it's best approach to continue with my draft where it's implemented
via ML specific data structures over the cache.

2018-08-16 14:16 GMT+06:00 Vladimir Ozerov <voze...@gridgain.com>:

> Hi Alex,
>
> From what I see there is some interest to K-D indexes in SQL world.
> Postgres has it. MySQL users asks about it from time to time, and usually
> use some simpler spatial solutions ask MySQL lacks this index type. We
> definitely may consider integrating it with SQL, but this would be big
> enough feature, requiring changes in almost all SQL components.
>
> For this reason I would put SQL case aside for now and focus on
> implementation for ML. I had a talk with Yuriy Babak some time ago and
> explained how new index type can be added to the product. May be Yuriy has
> some thoughts on how to do that properly with respect to ML roadmap and
> plans.
>
> On Thu, Aug 16, 2018 at 6:21 AM Alexey Zinoviev <zaleslaw....@gmail.com>
> wrote:
>
> > Sorry, for the delay, dear Pavel and Denis.
> >
> > Yes, I need a kind of indexing to improve KNN algorithms during training
> > the model.
> >
> > In my draft solution I've implemented building of
> > https://en.wikipedia.org/wiki/K-d_tree
> > <https://en.wikipedia.org/wiki/K-d_tree> on the training data set.
> > It collects the information about data distributed in our specific ML
> > Datasets (distributed by nodes over Ignite cache)
> >
> > Pavel, you ask me any questions and I've prepared answers.
> >
> > 1) Should be this index in-memory only or you want to persist it?
> > *Maybe it should be persisted (to reuse that for next predictions)*
> >
> > 2) In general index can be implemented in two ways per-partition and
> > per-node.
> > *Thank you for your explanation. Of course the per-node is better.*
> >
> > 3) I think K-d tree is preferable
> > *You are absolutely right, it should be K-d tree*
> >
> > 4) Could you please share use cases you're trying to solve with QuadTree?
> > With
> > close to real data and examples of queries?
> >
> > I need to solve *k-nearest neighbors search task *on a set of vectors
> with
> > unique keys presented in Ignite Cache (training set),
> > during the training phase I'm going to build a temporary index as a
> KD-Tree
> > (based on distance between vectors).
> >
> > The distance metric is a parameter too.
> >
> > After that, in prediction phase the *k-nearest neighbors search task *is
> > solved by brute-force iteration over all vectors to find the *k-nearest
> > neighbors.*
> > I'd like to improve the search part with queries to index to extract
> > closest vectors.
> >
> > Of course, it's kind of experiment, and maybe it's bad idea to patch SQL
> > internals to solve this private task, but as you mentioned it can be a
> good
> > point of collaboration between ML and SQL components.
> >
> > Can I get one of the implemented indices as a primary example and
> implement
> > it by myself?
> > Could you recommend something as an start point for me?
> >
> > Thanks for your help.
> >
> >
> >
> >
> > 2018-08-04 11:18 GMT+06:00 Denis Magda <dma...@apache.org>:
> >
> > > Alexey, are you working on some new ML/DL APIs/algorithms? Please
> > elaborate
> > > what you'd like to add to Ignite.
> > >
> > > --
> > > Denis
> > >
> > > On Wed, Aug 1, 2018 at 3:10 PM Pavel Kovalenko <jokse...@gmail.com>
> > wrote:
> > >
> > > > Hello Alexey,
> > > >
> > > > It's not so difficult to implement new type of indexing of data, but
> if
> > > you
> > > > want to reach performance in distributed environment you need to have
> > > > strong knowledge of a data you're indexing and what kind of queries
> you
> > > > want to execute.
> > > > Should be this index in-memory only or you want to persist it?
> > > > In case of persistence your index should fit our page memory model
> > > > requirements.
> > > > In both cases your index should be ready to work in concurrent
> > > environment.
> > > >
> > > > In general index can be implemented in two ways per-partition and
> > > per-node.
> > > > Per-partition may be efficient if you have a lot of points (x,y)
> > > > representing a big one, e.g. image. In this case it's required that
> all
> > > of
> > > > these points will be in one partition that query e.g. makes images
> > > > intersection will execute in one node. But if you have multiple
> images,
> > > > your index will contain also another points from other object and
> will
> > > > overload it.
> > > > Per-node may be efficient if you have a lot of points (x,y) that are
> > > > independent of each other, that you will use it as spatial e.g.. But
> in
> > > > this case, I think K-d tree is preferable as it can be used in more
> > wide
> > > > way.
> > > >
> > > > Could you please share use cases you're trying to solve with
> QuadTree?
> > > With
> > > > close to real data and examples of queries? Because now the question
> is
> > > too
> > > > abstract and it's hard to understand how it should be implemented to
> > > reach
> > > > good results.
> > > >
> > > >
> > > > 2018-08-01 16:45 GMT+03:00 Alexey Zinoviev <zaleslaw....@gmail.com>:
> > > >
> > > > > Hi, Igniters.
> > > > >
> > > > > Currently I'm working on different math stuff over the Apache
> Ignite
> > > and
> > > > in
> > > > > a few tasks I need to implement in memory something like this
> > > > >
> > > > > https://en.wikipedia.org/wiki/Quadtree
> > > > >
> > > > > I didn't find such index in Apache Ignite, but maybe it's under
> > > > development
> > > > > by somebody?
> > > > >
> > > > > Is it a difficult to add a new index type to our distributed SQL
> > (from
> > > > > point of view of different infrastructure issues and so on P.S I
> > don't
> > > > > worry the math stuff here because I've implemented it many times in
> > > > > non-distributed version)?
> > > > >
> > > > > It will be great to hear any kind of your thoughts and maybe
> somebody
> > > > could
> > > > > help with implementation
> > > > >
> > > > > zaleslaw
> > > > >
> > > >
> > >
> >
>

Re: [Distributed SQL] Do we have a plan to implement QuadTree index?

Reply via email to