Re: Text Queries Support

Atri Sharma Tue, 27 Jul 2021 09:19:23 -0700

Andrey,

> Per-partition Lucene index looks simple to implement, but it may require
> per-partition SQL to make full-text search expressions work correctly
> within the SQL query.
I think that as long as we follow the map - reduce process that we
already do for other queries, we should be fine.


> Per-partition SQL index may kill the performance. We already tried to do
> that in Ignite 2. The QueryParallelism feature helps to speed up some
> data-intensive queries,
> but hurts performance in simple cases, and at some point (e.g. when the
> number of segments exceeds the number of CPUs) performance degrades rapidly
> as the number of segments increases.

Yeah, that is always the case, but a global index will be a nightmare
in terms of concurrency, and pessimistic concurrency control, coupled
with the metadata requirements, will kill the benefits anyway.
What were the specific issues with the per-partition index?
>
> AFAIK, Lucene widely uses bitmap indices that are easy to merge.
> Maybe the map-reduce technique underneath FTS expressions and some hacks
> will add only minimal overhead.

Lucene uses many types of indices, but the point here is that
per-partition Lucene indices can return docIDs and we can merge them in
the reduce phase. So we are abstracted from the specifics of the internal
index being used to serve the query.
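
To make the reduce step concrete, here is a minimal sketch in plain Java
with no Lucene dependency. The names ReduceMerge, Hit and merge are
hypothetical and not existing Ignite or Lucene APIs. It assumes each
partition returns its hits already sorted by descending score, that every
hit carries a globally unique id (for example the cache key), and that
scores are comparable across partitions.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.PriorityQueue;
import java.util.Set;

/** Hypothetical reduce-side merge of per-partition full-text hits. */
public class ReduceMerge {
    /** One hit from a partition-local index; globalId is assumed cluster-unique. */
    public static final class Hit {
        public final String globalId;
        public final float score;

        public Hit(String globalId, float score) {
            this.globalId = globalId;
            this.score = score;
        }
    }

    /** Read position inside one partition's hit list. */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) { this.hits = hits; }

        Hit head() { return hits.get(pos); }
    }

    /**
     * K-way merge by descending score. A globalId seen earlier (for example
     * the same entry returned by a primary and a backup copy) is skipped.
     */
    public static List<Hit> merge(List<List<Hit>> perPartition, int limit) {
        PriorityQueue<Cursor> queue = new PriorityQueue<>(
            (a, b) -> Float.compare(b.head().score, a.head().score));
        for (List<Hit> hits : perPartition) {
            if (!hits.isEmpty()) queue.add(new Cursor(hits));
        }

        Set<String> seen = new HashSet<>();
        List<Hit> out = new ArrayList<>();
        while (!queue.isEmpty() && out.size() < limit) {
            Cursor c = queue.poll();
            Hit h = c.head();
            if (seen.add(h.globalId)) out.add(h);
            c.pos++;                                   // advance only while the cursor is outside the queue
            if (c.pos < c.hits.size()) queue.add(c);   // re-insert with its new head
        }
        return out;
    }
}
```

The reducer only sees (id, score) pairs, so it does not depend on which
index structure Lucene used inside a partition.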

>
> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> > Lucene indices. The important thing here is to not treat Lucene
> > indices as source of truth.
> To use the WAL we should either map Lucene files to our page memory or be
> aware of the Lucene file structure.
> The first looks tricky, as we should guarantee a contiguous address space
> in page memory to mirror a Lucene file. Maybe a separate managed memory
> segment with its own rules?

Why not use Lucene's MMapDirectory and map it to our storage classes?

>
> >> Transactions.
> >> * Will we support transactions?
> > Lucene has no concept of transactions.
> Yes, but we have.
> Lucene index may be non-transactional, but users never expect to see
> uncommitted data.
> How does this connect with transactional SQL?
We could perform the Lucene writes as part of the transaction and
acknowledge only once they succeed or fail. WDYT?
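
A rough sketch of what I mean, in plain Java. Everything here is
hypothetical: TxTextIndex and its methods are illustrative names, and a
HashMap stands in for the partition-local Lucene index. Updates are staged
per transaction and applied to the searchable index only on commit, so a
search never observes uncommitted data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch: index updates become visible to searches only at commit. */
public class TxTextIndex {
    /** Stand-in for a partition-local text index: key -> indexed text. */
    private final Map<String, String> committed = new HashMap<>();

    /** Updates staged per transaction id; never consulted by search(). */
    private final Map<Long, Map<String, String>> pending = new HashMap<>();

    /** Records an update inside a transaction without exposing it to readers. */
    public synchronized void stage(long txId, String key, String text) {
        pending.computeIfAbsent(txId, id -> new HashMap<>()).put(key, text);
    }

    /** Applies the staged updates; the caller acknowledges the write after this returns. */
    public synchronized void commit(long txId) {
        Map<String, String> staged = pending.remove(txId);
        if (staged != null) committed.putAll(staged);
    }

    /** Discards the staged updates of a rolled-back transaction. */
    public synchronized void rollback(long txId) {
        pending.remove(txId);
    }

    /** Naive substring match over committed entries only. */
    public synchronized List<String> search(String term) {
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, String> e : committed.entrySet()) {
            if (e.getValue().contains(term)) keys.add(e.getKey());
        }
        return keys;
    }
}
```

The cost is that a transaction does not see its own text-index writes
until it commits; whether that is acceptable is part of the question.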
>
> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <[email protected]> wrote:
>
> > Sorry, I planned on creating a Wiki page for this, but it makes more
> > sense to be replying here.
> >
> > > * How Lucene index can be split among the nodes?
> >
> > We can have partition level indices on each node.
> >
> > > * If we'll have a single index for all partitions on the particular node,
> > > then how index records will be aware of partitioning?
> >
> > Index records don't need to be aware of partitioning -- each Lucene
> > index is independent.
> >
> > > This is important to filter out backup records from the results to avoid
> > > duplicates.
> >
> > We can merge documents from different nodes and remove duplicates as
> > long as docIDs are globally unique.
> >
> > > * How results from several nodes can be merged on the Reduce stage?
> >
> > As long as documents have a globally unique docID, Lucene has merge
> > functions that can merge results from multiple partial results.
> >
> > > * Does Lucene support something like a JOIN operation or others that may
> > > require data from another partition or index?
> >
> > As illustrated by Ilya, Block-Join works for us.
> >
> > > If so, then it looks like a multistep query with merging results on
> > > intermediate stages and requires detailed investigation and design.
> > > It is ok if Ignite will have some limitations here, but we would like to
> > > know about them at the early stage.
> >
> > > * How can Lucene files be mapped to the page memory effectively? Is it
> > > even possible?
> >
> > Lucene has Directory implementations which allow storing Lucene
> > indices on different kinds of file structures. It has an
> > MMapDirectory that we could use?
> >
> > > Otherwise, how to deal with potential OOM on large queries and memory
> > > capacity planning?
> >
> > We can use Lucene's MMapDirectory.
> >
> > >
> > > Persistence.
> > > * How and what consistency guarantees could we have/expect?
> >
> > Lucene does not have a WAL, but it is append-only.
> >
> > > It seems we may not be able to write physical records for the Lucene
> > > index to our WAL. What can we do about this?
> >
> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> > Lucene indices. The important thing here is to not treat Lucene
> > indices as source of truth.
> > >
> > > Transactions.
> > > * Will we support transactions?
> > Lucene has no concept of transactions.
> >
> > > * Should Lucene be aware of Transaction and track mvcc (or whatever)
> > > versions for the records?
> > No
> > > * What will be consistency guarantees?
> > We can acknowledge writes back only after Lucene index is updated.
> > >
> > > UX
> > > * How to add FullText search queries syntax into Calcite?
> > Postgres's FTS functions are a good reference.
> > > * AFAIK, the Lucene index has many properties for tuning. How will the
> > > user configure the index?
> > Most of those properties can be cluster level and exposed as a new
> > sub-config for Ignite.
> > > * How and where to store the settings? Which are cluster-wide and which
> > > are local to a particular node?
> > All can be cluster level.
> > > * Will all the settings be immutable? Can they be changed on the fly?
> > > After a node/grid restart?
> > They should be applied post restart.
> >
> > > * Any limitations on query syntax?
> > It depends on how we model our queries for text search.
> >
> > >
> > > SQL
> > > * Will we support FullText search in SQL?
> > We need custom functions for it. See Postgres's FTS functions.
> > > * How to integrate Lucene index into Calcite? What is the cost model?
> > There cannot be any cost model since there are no alternative access
> > paths for a text query. If we see a text query, we have to use the Lucene
> > index or return an error. So we need to model text search as a set of UDFs.
> >
> > > Splitting rules? Traits?
> > Please see my reply above.
> > >
> > >
> > > > With all of this, you can go with the IEP (or even some short summary)
> > > > and further POC and implementation.
> > > That's a big deal, so let's discuss what could be done here.
> > >
> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <[email protected]> wrote:
> > >
> > > > I am actually happy to drive the feature for Ignite 3. FTS is very
> > > > important for me and I think Ignite users will benefit from it
> > > > greatly.
> > > >
> > > > If it makes sense to be focusing on Ignite 3 for this capability, I am
> > > > eager to contribute there and lead the development.
> > > >
> > > > Please share your thoughts.
> > > >
> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > > > <[email protected]> wrote:
> > > > >
> > > > > Hi Atri,
> > > > >
> > > > > All the Jira tickets we have on the Full-text search (FTS) thing are
> > > > > targeted to Ignite 2.
> > > > >
> > > > > > AFAIK, we want FTS support in Ignite 3, but we have NOT committed to
> > > > > > it yet.
> > > > > > By the way, we are getting requests for this from the user side, and
> > > > > > FTS would definitely be a valuable feature for Ignite.
> > > > > >
> > > > > > It would be great if someone wants to drive it; any help will be
> > > > > > appreciated.
> > > > >
> > > > >
> > > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <[email protected]> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > > An update, please. I am working through persistence of the Lucene
> > > > > > > index using Ignite Dictionary, and will be asking some questions soon.
> > > > > > >
> > > > > > > I had one question: where does this change go? Ignite 3?
> > > > > > >
> > > > > > > Also, I know we want to build native support for text searches in
> > > > > > > Ignite 3. Is the work I am proposing here part of that, or will that
> > > > > > > be a separate effort?
> > > > > >
> > > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <[email protected]> wrote:
> > > > > >
> > > > > > > Hello!
> > > > > > >
> > > > > > > > I think that number one is the most important one; then maybe it
> > > > > > > > will see more use and other deficiencies will become more apparent,
> > > > > > > > leading to more tickets and visibility.
> > > > > > > >
> > > > > > > > Maybe 2. and 3. will even use a different approach when persistence
> > > > > > > > is implemented.
> > > > > > >
> > > > > > > Regards,
> > > > > > > --
> > > > > > > Ilya Kasnacheev
> > > > > > >
> > > > > > >
> > > > > > > > On Mon, Jun 28, 2021 at 14:34, Atri Sharma <[email protected]> wrote:
> > > > > > >
> > > > > > > > Hello Again!
> > > > > > > >
> > > > > > > > > I have been looking into the aforementioned and here are my
> > > > > > > > > follow-up thoughts:
> > > > > > > > >
> > > > > > > > > 1. Support persistence of Lucene indexes.
> > > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401 (Needs fixing
> > > > > > > > > of moving partitions first)
> > > > > > > > > 3. Figure out how to return scores from nodes and use them as sort
> > > > > > > > > parameters on the coordinator node
> > > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > > > > >
> > > > > > > > > Please let me know if this looks ok to make text queries functional?
> > > > > > > >
> > > > > > > > Atri
> > > > > > > >
> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > > > <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > Hi.
> > > > > > > > >
> > > > > > > > > > One of the biggest issues with text queries is a lack of support
> > > > > > > > > > for Lucene index persistence, which makes this functionality
> > > > > > > > > > useless if persistence is enabled.
> > > > > > > > >
> > > > > > > > > I would first take care of it.
> > > > > > > > >
> > > > > > > > > > On Mon, Jun 21, 2021 at 12:16, Maksim Timonin <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi, Atri!
> > > > > > > > > >
> > > > > > > > > > > You're right, there is a lack of support for TextQueries. In
> > > > > > > > > > > the last ticket I worked on I saw some obvious issues with them
> > > > > > > > > > > (no page size support, for example). I'm glad that somebody
> > > > > > > > > > > wants to maintain this functionality. Thanks a lot!
> > > > > > > > > >
> > > > > > > > > > > For the MergeSort algorithm there is already a patch [1]. It's
> > > > > > > > > > > currently in review. This patch introduces an abstract reducer
> > > > > > > > > > > for CacheQueries with 2 implementations (unordered, merge-sort).
> > > > > > > > > > > TextQuery then relies on MergeSort to order results from
> > > > > > > > > > > multiple nodes by score. This patch also fixes the pageSize
> > > > > > > > > > > issue I mentioned before. Could you please check whether it
> > > > > > > > > > > fully matches your idea? Any issues or comments are welcome.
> > > > > > > > > >
> > > > > > > > > > > I've prepared this ticket because I need the MergeSort
> > > > > > > > > > > algorithm for the new type of queries I'm implementing
> > > > > > > > > > > (IndexQuery, which should also provide ordered results over
> > > > > > > > > > > multiple nodes). Currently I'm not planning to go further with
> > > > > > > > > > > TextQuery, so if you're going to support it, that will be a
> > > > > > > > > > > great contribution, I think.
> > > > > > > > > >
> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > > > >
> > > > > > > > > >
> > > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <[email protected]> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hi All,
> > > > > > > > > > >
> > > > > > > > > > > > I have been looking into our text queries support and see
> > > > > > > > > > > > that it has limited community support.
> > > > > > > > > > > >
> > > > > > > > > > > > Therefore, I volunteer to be the maintainer of the module
> > > > > > > > > > > > and work on enhancing it further.
> > > > > > > > > > > >
> > > > > > > > > > > > First goal would be to move to Lucene 8.x, then work on
> > > > > > > > > > > > sorted reduce-merge across nodes. Fundamentally, this is
> > > > > > > > > > > > doable since Lucene ranks documents according to their
> > > > > > > > > > > > score, and documents are returned in the order of their
> > > > > > > > > > > > score. Since the scoring function is homogeneous, this
> > > > > > > > > > > > means that across nodes, we can compare scores and merge
> > > > > > > > > > > > sort.
> > > > > > > > > > >
> > > > > > > > > > > Please let me know if I can take this up.
> > > > > > > > > > >
> > > > > > > > > > > Atri
> > > > > > > > > > >
> > > > > > > > > > > --
> > > > > > > > > > > Regards,
> > > > > > > > > > >
> > > > > > > > > > > Atri
> > > > > > > > > > > Apache Concerted
> > > > > > > > > > >
> > > > > > > > > >
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > --
> > > > > > > > >
> > > > > > > > > Best regards,
> > > > > > > > > Alexei Scherbakov
> > > > > > > >
> > > > > > > > --
> > > > > > > > Regards,
> > > > > > > >
> > > > > > > > Atri
> > > > > > > > Apache Concerted
> > > > > > > >
> > > > > > >
> > > > > >
> > > > >
> > > > >
> > > > > --
> > > > > Best regards,
> > > > > Andrey V. Mashenkov
> > > >
> > > > --
> > > > Regards,
> > > >
> > > > Atri
> > > > Apache Concerted
> > > >
> > >
> > >
> > > --
> > > Best regards,
> > > Andrey V. Mashenkov
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
>
>
> --
> Best regards,
> Andrey V. Mashenkov

-- 
Regards,

Atri
Apache Concerted
