Re: Text Queries Support

Ilya Kasnacheev Tue, 27 Jul 2021 03:04:40 -0700

Hello!

Let me try to answer the questions below, since I have not seen anybody do
so, and we may not all be on the same page.



On Fri, 23 Jul 2021 at 13:56, Andrey Mashenkov <andrey.mashen...@gmail.com> wrote:

> Atri,
>
> As of now, the potential capabilities are not clear to me.
> At first glance, I see the following topics that must be covered first:
>
> General questions
> * How can the Lucene index be split among the nodes?
>
In the same fashion as SQL indexes: each node might hold the index only for
its primary partitions.


> * If we have a single index for all partitions on a particular node,
> then how will index records be aware of partitioning?
>
I'm not sure; how does our SQL deal with it? If there is a scenario where
some keys are no longer primary, we can perhaps filter them out of the
results and, in the meantime, exclude them from the index.
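
To illustrate the filtering idea, here is a rough sketch in plain Java. It is
not Ignite or Lucene code: PrimaryFilter, partition() and primaryOnly() are
made-up names, and the hash-based partition() only stands in for the real
affinity function.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Set;

final class PrimaryFilter {
    /** Hypothetical stand-in for the affinity function: maps a key to a partition. */
    static int partition(Object key, int partitions) {
        return Math.floorMod(key.hashCode(), partitions);
    }

    /** Keeps only the hits whose partition is currently primary on this node. */
    static List<String> primaryOnly(List<String> hitKeys, Set<Integer> primaryParts, int partitions) {
        List<String> res = new ArrayList<>();

        for (String key : hitKeys) {
            // Hits from backup (or no longer owned) partitions are dropped here,
            // so the reducer does not see duplicates.
            if (primaryParts.contains(partition(key, partitions)))
                res.add(key);
        }

        return res;
    }
}
```

The same check could in principle be used both at query time (dropping hits)
and later, in the background, to remove stale entries from the index.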


> This is important to filter out backup records from the results to avoid
> duplicates.
> * How results from several nodes can be merged on the Reduce stage?
>
This is actually the primary use case for Lucene/Solr; results are usually
merged by relevance/score.
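
As a minimal sketch of such a reduce step, assuming every node returns its
hits already sorted by descending score and that scores are comparable across
nodes. This is plain Java with hypothetical names (ScoreMerge, ScoredHit,
mergeByScore), not the actual Ignite reducer:

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

final class ScoreMerge {
    /** Hypothetical hit: a cache key plus the relevance score computed on the node. */
    static final class ScoredHit {
        final String key;
        final float score;

        ScoredHit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Cursor over one node's results, which are sorted by descending score. */
    private static final class Cursor {
        final List<ScoredHit> hits;
        int pos;

        Cursor(List<ScoredHit> hits) {
            this.hits = hits;
        }

        ScoredHit head() {
            return hits.get(pos);
        }
    }

    /** K-way merge of per-node result lists into one list of at most limit hits. */
    static List<ScoredHit> mergeByScore(List<List<ScoredHit>> perNode, int limit) {
        // The cursor whose current hit has the highest score is polled first.
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
            Comparator.comparingDouble((Cursor c) -> c.head().score).reversed());

        for (List<ScoredHit> hits : perNode) {
            if (!hits.isEmpty())
                queue.add(new Cursor(hits));
        }

        List<ScoredHit> merged = new ArrayList<>();

        while (!queue.isEmpty() && merged.size() < limit) {
            Cursor c = queue.poll();

            merged.add(c.head());

            // Re-queue the cursor only if that node still has hits left.
            if (++c.pos < c.hits.size())
                queue.add(c);
        }

        return merged;
    }
}
```

Only the head of each node's list is compared at any time, so the reducer
never needs to re-sort the full result set.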


> * Does Lucene support something like a JOIN operation, or others that may
> require data from another partition or index?
> If so, then it looks like a multistep query with merging of results at
> intermediate stages, and requires detailed investigation and design.
> It is ok if Ignite will have some limitations here, but we would like to
> know about them at the early stage.
>
Lucene has block join, which allows it to store related data close together.
Lucene also has a regular join, but I don't see any use case for it, since we
can do an SQL join as well.



> * How can Lucene files be mapped to page memory effectively? Is it even
> possible? Otherwise, how do we deal with potential OOM on large queries and
> memory capacity planning?
>
I think Lucene is pretty good here; this is a must for information
retrieval, since there is usually a lot of data.


>
> Persistence.
> * How and what consistency guarantees could we have/expect?
> Seems, we may not be able to write physical records for Lucene index to our
> WAL. What can we do with this?
>
I think we should be able to do it in the same fashion as we do with SQL
indexes: during WAL recovery, also update the Lucene index. On clean
shutdown, assume that the index is okay. If the Lucene index is removed, then
do a full rebuild, like we do with index.bin.


>
> Transactions.
> * Will we support transactions?
> * Should Lucene be aware of Transaction and track mvcc (or whatever)
> versions for the records?
> * What will be consistency guarantees?
>
I think the answer here is NO. Text search is not expected to be
transactionally up-to-date. It is expected to be eventually complete, so it
is OK if it takes a split second for entries to become searchable.

The traditional way to update text indexes is batching.
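
A minimal sketch of what batching could look like, in plain Java. The names
are hypothetical: BatchingIndexer and its sink callback are not existing
Ignite or Lucene classes, and in a real implementation the sink would write
the batch to the text index and make it searchable.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Consumer;

final class BatchingIndexer {
    private final int batchSize;

    /** Hypothetical callback that applies one batch of documents to the text index. */
    private final Consumer<List<String>> sink;

    private final List<String> pending = new ArrayList<>();

    BatchingIndexer(int batchSize, Consumer<List<String>> sink) {
        this.batchSize = batchSize;
        this.sink = sink;
    }

    /** Buffers a document; it becomes searchable only after the next flush. */
    void add(String doc) {
        pending.add(doc);

        if (pending.size() >= batchSize)
            flush();
    }

    /** Pushes all buffered documents to the index as one batch. */
    void flush() {
        if (pending.isEmpty())
            return;

        // Hand over a copy so that the sink never sees later modifications.
        sink.accept(new ArrayList<>(pending));
        pending.clear();
    }
}
```

A timer calling flush() periodically would bound the delay before entries
become searchable, which matches the "split second" expectation above.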


>
> UX
> * How to add FullText search queries syntax into Calcite?
> * AFAIK, the Lucene index has many properties for tuning. How will the user
> configure the index?
> * How and where to store the settings? Which are cluster-wide and which are
> local to a particular node?
> * Will all the settings be immutable? Can they be changed on the fly? After
> node/grid restart?
> * Any limitations on query syntax?
>
Solr and Elasticsearch spent a lot of time on this, and the field is huge
here. They have really extensive query languages. On the bright side, most
of the "settings" are dynamic and cached, so if you need a different
filtering of your data, all you need to do is request it once. The ones that
aren't dynamic usually concern how data is prepared before being put into the
index (stemming, tokenizing, etc.); changing them will require an index
rebuild. I don't see why settings would not be shared across the cluster.


> SQL
> * Will we support FullText search in SQL?
> * How to integrate Lucene index into Calcite? What is the cost model?
> Splitting rules? Traits?
> * What about consistency with DDL operations, e.g. column rename?
> Ignite indices will operate on column IDs, so a rename operation will not
> affect the index.
>

Regards,

-- 



>
>
> With all of this, you can go with the IEP (or even some short summary) and
> further POC and implementation.
> That's a big deal, so let's discuss what could be done here.
>
> On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <a...@apache.org> wrote:
>
> > I am actually happy to drive the feature for Ignite 3. FTS is very
> > important for me and I think Ignite users will benefit from it
> > greatly.
> >
> > If it makes sense to be focusing on Ignite 3 for this capability, I am
> > eager to contribute there and lead the development.
> >
> > Please share your thoughts.
> >
> > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > <andrey.mashen...@gmail.com> wrote:
> > >
> > > Hi Atri,
> > >
> > > All the Jira tickets we have on the Full-text search (FTS) thing are
> > > targeted to Ignite 2.
> > >
> > > AFAIK, we want, but we have NOT committed to FTS support in
> > > Ignite 3, yet.
> > > By the way, we are getting requests for this thing from the user
> > > side, and definitely, FTS would be a valuable feature for Ignite.
> > >
> > > It would be great if someone wants to drive it; any help will be
> > > appreciated.
> > >
> > >
> > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <a...@apache.org> wrote:
> > >
> > > > Hello,
> > > >
> > > > An update, please. I am working through persistence of Lucene
> > > > index using Ignite Dictionary, and will be asking some questions
> > > > soon.
> > > >
> > > > I had one doubt - where does this change go? Ignite 3?
> > > >
> > > > Also, I know we want to build native support for text searches in
> > > > Ignite 3. Is the work I am proposing here part of that, or will
> > > > that be a separate effort?
> > > >
> > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev,
> > > > <ilya.kasnach...@gmail.com> wrote:
> > > >
> > > > > Hello!
> > > > >
> > > > > I think that number one is the most important one, then maybe it
> > > > > will see more use and other deficiencies become more apparent,
> > > > > leading to more tickets and visibility.
> > > > >
> > > > > Maybe 2. and 3. will even use a different approach when
> > > > > persistence is implemented.
> > > > >
> > > > > Regards,
> > > > > --
> > > > > Ilya Kasnacheev
> > > > >
> > > > >
> > > > > On Mon, 28 Jun 2021 at 14:34, Atri Sharma <a...@apache.org> wrote:
> > > > >
> > > > > > Hello Again!
> > > > > >
> > > > > > I have been looking into the aforementioned and here are my
> > > > > > follow up thoughts:
> > > > > >
> > > > > > 1. Support persistence of Lucene indexes.
> > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401 (Needs
> > > > > > fixing of moving partitions first)
> > > > > > 3. Figure out how to return scores from nodes and use them as
> > > > > > sort parameters on the coordinator node
> > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > >
> > > > > > Please let me know if this looks ok to make text queries
> > > > > > functional?
> > > > > >
> > > > > > Atri
> > > > > >
> > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > <alexey.scherbak...@gmail.com> wrote:
> > > > > > >
> > > > > > > Hi.
> > > > > > >
> > > > > > > One of the biggest issues with text queries is a lack of
> > > > > > > support for lucene indices persistence, which makes this
> > > > > > > functionality useless if a persistence is enabled.
> > > > > > >
> > > > > > > I would first take care of it.
> > > > > > >
> > > > > > > On Mon, 21 Jun 2021 at 12:16, Maksim Timonin
> > > > > > > <timonin.ma...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Hi, Atri!
> > > > > > > >
> > > > > > > > You're right, actually there is a lack of support for
> > > > > > > > TextQueries. For the last ticket I'm doing I see some obvious
> > > > > > > > issues with them (no page size support, for example). I'm glad
> > > > > > > > that somebody wants to maintain this functionality. Thanks a
> > > > > > > > lot!
> > > > > > > >
> > > > > > > > For the MergeSort algorithm there is already a patch [1]. It's
> > > > > > > > currently on review. This patch introduces an abstract reducer
> > > > > > > > for CacheQueries with 2 implementations (unordered,
> > > > > > > > merge-sort). Then TextQuery leverages MergeSort to order
> > > > > > > > results from multiple nodes by score. This patch also fixes
> > > > > > > > the pageSize issue I've mentioned before. Could you please
> > > > > > > > check if it fully matches your idea? Any issues or comments
> > > > > > > > are welcome.
> > > > > > > >
> > > > > > > > I've prepared this ticket because I need the MergeSort
> > > > > > > > algorithm for the new type of queries I'm implementing
> > > > > > > > (IndexQuery, it should also provide ordered results over
> > > > > > > > multiple nodes). Currently I'm not planning to go further with
> > > > > > > > TextQuery, so if you're going to support this it'll be a great
> > > > > > > > contribution, I think.
> > > > > > > >
> > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > >
> > > > > > > >
> > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <a...@apache.org>
> > > > > > > > wrote:
> > > > > > > >
> > > > > > > > > Hi All,
> > > > > > > > >
> > > > > > > > > I have been looking into our text queries support and see
> > > > > > > > > that it has limited community support.
> > > > > > > > >
> > > > > > > > > Therefore, I volunteer to be the maintainer of the module and
> > > > > > > > > work on enhancing it further.
> > > > > > > > >
> > > > > > > > > First goal would be to move to Lucene 8.x, then work on sorted
> > > > > > > > > reduce - merge across nodes. Fundamentally, this is doable
> > > > > > > > > since Lucene ranks documents according to their score, and
> > > > > > > > > documents are returned in the order of their score. Since the
> > > > > > > > > scoring function is homogeneous, this means that across nodes,
> > > > > > > > > we can compare scores and merge sort.
> > > > > > > > >
> > > > > > > > > Please let me know if I can take this up.
> > > > > > > > >
> > > > > > > > > Atri
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > > Regards,
> > > > > > > > >
> > > > > > > > > Atri
> > > > > > > > > Apache Concerted
> > > > > > > > >
> > > > > > > >
> > > > > > >
> > > > > > >
> > > > > > > --
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Alexei Scherbakov
> > > > > >
> > > > > > --
> > > > > > Regards,
> > > > > >
> > > > > > Atri
> > > > > > Apache Concerted
> > > > > >
> > > > >
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Andrey V. Mashenkov
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
>
>
> --
> Best regards,
> Andrey V. Mashenkov
>
