Re: Text Queries Support

Atri Sharma Mon, 02 Aug 2021 01:34:25 -0700

Hi Ivan,

Would you like to propose an alternative to Lucene?


Atri

On Mon, 2 Aug 2021, 13:48 Ivan Pavlukhin, <[email protected]> wrote:

> Folks,
>
> Sorry if I have not read the thread thoroughly enough, but do we consider
> Lucene the obviously right choice? In my understanding, Ignite history
> has clearly shown that the "fastest feature implementation" is not usually
> the best one, and text queries are one example of this. Aren't we about
> to make the same mistake again? FTS is a huge feature, and I do not
> believe there is an easy win for it.
>
> 2021-07-27 19:18 GMT+03:00, Atri Sharma <[email protected]>:
> > Andrey,
> >
> >> A per-partition Lucene index looks simple to implement, but it may require
> >> per-partition SQL to make full-text search expressions work correctly
> >> within the SQL query.
> > I think that as long as we follow the map - reduce process that we
> > already do for other queries, we should be fine.
> >
> >> A per-partition SQL index may kill performance. We already tried that
> >> in Ignite 2. The QueryParallelism feature helps to speed up some
> >> data-intensive queries, but it hurts performance in simple cases, and
> >> at some point (e.g. when the number of segments exceeds the number of
> >> CPUs) performance degrades rapidly as the number of segments grows.
> >
> > Yeah, that is always the case, but a global index will be a nightmare
> > in terms of concurrency, and pessimistic concurrency control, coupled
> > with the metadata requirements, will kill the benefits anyway.
> > What were the specific issues with the per-partition index?
> >>
> >> AFAIK, Lucene widely uses bitmap indices, which are easy to merge.
> >> Maybe the map-reduce technique underneath FTS expressions, plus some
> >> hacks, will add only minimal overhead.
> >
> > Lucene uses many types of indices, but the key point here is that
> > per-partition Lucene indices can return docIDs, which we can merge in the
> > reduce phase. That keeps us abstracted from the specifics of the
> > internal index being used to serve the query.
> >
> >>
> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> > Lucene indices. The important thing here is to not treat Lucene
> >> > indices as source of truth.
> >> To use the WAL, we should either relay Lucene files to our Page memory
> >> or be aware of the Lucene file structure.
> >> The first looks tricky, as we would have to guarantee a contiguous
> >> address space in Page memory to reflect a Lucene file. Maybe a separate
> >> managed memory segment with its own rules?
> >
> > Why not use Lucene's MMapDirectory and map it to our storage classes?
> >
> >>
> >> >> Transactions.
> >> >> * Will we support transactions?
> >> > Lucene has no concept of transactions.
> >> Yes, but we have.
> >> The Lucene index may be non-transactional, but users never expect to see
> >> uncommitted data.
> >> How does this connect with transactional SQL?
> > We could have the Lucene writes done as a part of transactions and ack
> > back only when it succeeds/fails. WDYT?
> >>
> >> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <[email protected]> wrote:
> >>
> >> > Sorry, I planned on creating a Wiki page for this, but it makes more
> >> > sense to be replying here.
> >> >
> >> > > * How Lucene index can be split among the nodes?
> >> >
> >> > We can have partition level indices on each node.
> >> >
> >> > > * If we'll have a single index for all partitions on the particular
> >> > > node,
> >> > > then how index records will be aware of partitioning?
> >> >
> >> > Index records don't need to be aware of partitioning -- each Lucene
> >> > index is independent.
> >> >
> >> > > This is important to filter out backup records from the results to
> >> > > avoid
> >> > > duplicates.
> >> >
> >> > We can merge documents from different nodes and remove duplicates as
> >> > long as docIDs are globally unique.
> >> >
> >> > > * How results from several nodes can be merged on the Reduce stage?
> >> >
> >> > As long as documents have a globally unique docID, Lucene has merge
> >> > functions that can merge results from multiple partial results.
> >> >
> >> > > * Does Lucene support something like a JOIN operation, or others that
> >> > > may require data from another partition or index?
> >> >
> >> > As illustrated by Ilya, Block-Join works for us.
> >> >
> >> > > If so, then it looks like a multistep query that merges results at
> >> > > intermediate stages, and it requires detailed investigation and design.
> >> > > It is OK if Ignite has some limitations here, but we would like to
> >> > > know about them at an early stage.
> >> >
> >> > > * How can Lucene files be mapped to the page memory effectively? Is it
> >> > > even possible?
> >> >
> >> > Lucene has Directory implementations which allow storing Lucene
> >> > indices on different kinds of file structures. It has an
> >> > MMapDirectory that we could use?
> >> >
> >> > > Otherwise, how do we deal with potential OOM on large queries and
> >> > > with memory capacity planning?
> >> >
> >> > We can use Lucene's MMapDirectory.
> >> >
> >> > >
> >> > > Persistence.
> >> > > * How and what consistency guarantees could we have/expect?
> >> >
> >> > Lucene does not have a WAL, but it is append-only.
> >> >
> >> > > It seems we may not be able to write physical records for the Lucene
> >> > > index to our WAL. What can we do about this?
> >> >
> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> > Lucene indices. The important thing here is to not treat Lucene
> >> > indices as source of truth.
> >> > >
> >> > > Transactions.
> >> > > * Will we support transactions?
> >> > Lucene has no concept of transactions.
> >> >
> >> > > * Should Lucene be aware of Transaction and track mvcc (or whatever)
> >> > > versions for the records?
> >> > No
> >> > > * What will be consistency guarantees?
> >> > We can acknowledge writes back only after Lucene index is updated.
> >> > >
> >> > > UX
> >> > > * How to add FullText search queries syntax into Calcite?
> >> > Postgres's FTS functions are a good reference.
> >> > > * AFAIK, the Lucene index has many properties for tuning. How will
> >> > > the user configure the index?
> >> > Most of those properties can be cluster-level and exposed as a new
> >> > sub-config for Ignite.
> >> > > * How and where to store the settings? Which are cluster-wide and
> >> > > which are local to a particular node?
> >> > All can be cluster level.
> >> > > * Will all the settings be immutable? Can they be changed on the fly?
> >> > > After a node/grid restart?
> >> > They should be applied post restart.
> >> >
> >> > > * Any limitations on query syntax?
> >> > It depends on how we model our queries for text search.
> >> >
> >> > >
> >> > > SQL
> >> > > * Will we support FullText search in SQL?
> >> > We need custom functions for it. See Postgres's FTS functions.
> >> > > * How to integrate the Lucene index into Calcite? What is the cost model?
> >> > There cannot be any cost model since there are no alternative paths
> >> > for a text query. If we see a text query, we have to use the Lucene
> >> > index or return an error. So we need to model text search as a set
> >> > of UDFs.
> >> > > Splitting rules? Traits?
> >> > Please see my reply above.
> >> > >
> >> > >
> >> > > With all of this, you can go with the IEP (or even some short
> >> > > summary) and further POC and implementation.
> >> > > That's a big deal, so let's discuss what could be done here.
> >> > >
> >> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <[email protected]> wrote:
> >> > >
> >> > > > I am actually happy to drive the feature for Ignite 3. FTS is very
> >> > > > important for me and I think Ignite users will benefit from it
> >> > > > greatly.
> >> > > >
> >> > > > If it makes sense to focus on Ignite 3 for this capability, I am
> >> > > > eager to contribute there and lead the development.
> >> > > >
> >> > > > Please share your thoughts.
> >> > > >
> >> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> >> > > > <[email protected]> wrote:
> >> > > > >
> >> > > > > Hi Atri,
> >> > > > >
> >> > > > > All the Jira tickets we have on Full-text search (FTS) are
> >> > > > > targeted to Ignite 2.
> >> > > > >
> >> > > > > AFAIK, we want FTS support in Ignite 3, but we have NOT committed
> >> > > > > to it yet.
> >> > > > > By the way, we are getting requests for this from the user side,
> >> > > > > and FTS would definitely be a valuable feature for Ignite.
> >> > > > >
> >> > > > > It will be great if someone wants to drive it; any help will be
> >> > > > > appreciated.
> >> > > > >
> >> > > > >
> >> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <[email protected]> wrote:
> >> > > > >
> >> > > > > > Hello,
> >> > > > > >
> >> > > > > > An update, please. I am working through persistence of the Lucene
> >> > > > > > index using Ignite Dictionary, and will be asking some questions soon.
> >> > > > > >
> >> > > > > > I had one question: where does this change go? Ignite 3?
> >> > > > > >
> >> > > > > > Also, I know we want to build native support for text searches in
> >> > > > > > Ignite 3. Is the work I am proposing here part of that, or will
> >> > > > > > that be a separate effort?
> >> > > > > >
> >> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev, <[email protected]>
> >> > > > > > wrote:
> >> > > > > >
> >> > > > > > > Hello!
> >> > > > > > >
> >> > > > > > > I think that number one is the most important one; then maybe
> >> > > > > > > it will see more use and other deficiencies will become more
> >> > > > > > > apparent, leading to more tickets and visibility.
> >> > > > > > >
> >> > > > > > > Maybe 2. and 3. will even use a different approach when
> >> > > > > > > persistence is implemented.
> >> > > > > > >
> >> > > > > > > Regards,
> >> > > > > > > --
> >> > > > > > > Ilya Kasnacheev
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Mon, 28 Jun 2021 at 14:34, Atri Sharma <[email protected]>:
> >> > > > > > >
> >> > > > > > > > Hello Again!
> >> > > > > > > >
> >> > > > > > > > I have been looking into the aforementioned and here are my
> >> > > > > > > > follow-up thoughts:
> >> > > > > > > >
> >> > > > > > > > 1. Support persistence of Lucene indexes.
> >> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401
> >> > > > > > > > (needs fixing of moving partitions first)
> >> > > > > > > > 3. Figure out how to return scores from nodes and use them as
> >> > > > > > > > sort parameters on the coordinator node
> >> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> >> > > > > > > >
> >> > > > > > > > Please let me know if this looks OK to make text queries
> >> > > > > > > > functional.
> >> > > > > > > >
> >> > > > > > > > Atri
> >> > > > > > > >
> >> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> >> > > > > > > > <[email protected]> wrote:
> >> > > > > > > > >
> >> > > > > > > > > Hi.
> >> > > > > > > > >
> >> > > > > > > > > One of the biggest issues with text queries is the lack of
> >> > > > > > > > > support for Lucene index persistence, which makes this
> >> > > > > > > > > functionality useless if persistence is enabled.
> >> > > > > > > > >
> >> > > > > > > > > I would first take care of it.
> >> > > > > > > > >
> >> > > > > > > > > Mon, 21 Jun 2021 at 12:16, Maksim Timonin <[email protected]>:
> >> > > > > > > > >
> >> > > > > > > > > > Hi, Atri!
> >> > > > > > > > > >
> >> > > > > > > > > > You're right, there is actually a lack of support for
> >> > > > > > > > > > TextQueries. In the last ticket I'm working on, I see some
> >> > > > > > > > > > obvious issues with them (no page size support, for example).
> >> > > > > > > > > > I'm glad that somebody wants to maintain this functionality.
> >> > > > > > > > > > Thanks a lot!
> >> > > > > > > > > >
> >> > > > > > > > > > For the MergeSort algorithm, there is already a patch [1].
> >> > > > > > > > > > It's currently in review. This patch introduces an abstract
> >> > > > > > > > > > reducer for CacheQueries with 2 implementations (unordered,
> >> > > > > > > > > > merge-sort). TextQuery then leverages MergeSort to order
> >> > > > > > > > > > results from multiple nodes by score. This patch also fixes
> >> > > > > > > > > > the pageSize issue I mentioned before. Could you please check
> >> > > > > > > > > > if it fully matches your idea? Any issues or comments are
> >> > > > > > > > > > welcome.
> >> > > > > > > > > >
> >> > > > > > > > > > I prepared this ticket because I need the MergeSort algorithm
> >> > > > > > > > > > for the new type of queries I'm implementing (IndexQuery,
> >> > > > > > > > > > which should also provide ordered results over multiple
> >> > > > > > > > > > nodes). Currently I'm not planning to go further with
> >> > > > > > > > > > TextQuery, so if you're going to support it, that will be a
> >> > > > > > > > > > great contribution, I think.
> >> > > > > > > > > >
> >> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> >> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <[email protected]>
> >> > > > > > > > > > wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hi All,
> >> > > > > > > > > > >
> >> > > > > > > > > > > I have been looking into our text queries support and see
> >> > > > > > > > > > > that it has limited community support.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the module
> >> > > > > > > > > > > and work on enhancing it further.
> >> > > > > > > > > > >
> >> > > > > > > > > > > The first goal would be to move to Lucene 8.x, then work on
> >> > > > > > > > > > > sorted reduce-merge across nodes. Fundamentally, this is
> >> > > > > > > > > > > doable since Lucene ranks documents according to their
> >> > > > > > > > > > > score, and documents are returned in the order of their
> >> > > > > > > > > > > score. Since the scoring function is homogeneous, this
> >> > > > > > > > > > > means that across nodes, we can compare scores and merge
> >> > > > > > > > > > > sort.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Please let me know if I can take this up.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Atri
> >> > > > > > > > > > >
> >> > > > > > > > > > > --
> >> > > > > > > > > > > Regards,
> >> > > > > > > > > > >
> >> > > > > > > > > > > Atri
> >> > > > > > > > > > > Apache Concerted
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > --
> >> > > > > > > > >
> >> > > > > > > > > Best regards,
> >> > > > > > > > > Alexei Scherbakov
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Regards,
> >> > > > > > > >
> >> > > > > > > > Atri
> >> > > > > > > > Apache Concerted
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Best regards,
> >> > > > > Andrey V. Mashenkov
> >> > > >
> >> > > > --
> >> > > > Regards,
> >> > > >
> >> > > > Atri
> >> > > > Apache Concerted
> >> > > >
> >> > >
> >> > >
> >> > > --
> >> > > Best regards,
> >> > > Andrey V. Mashenkov
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > Atri
> >> > Apache Concerted
> >> >
> >>
> >>
> >> --
> >> Best regards,
> >> Andrey V. Mashenkov
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
>
>
> --
>
> Best regards,
> Ivan Pavlukhin
>
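The reduce step discussed in the thread (each node returns hits ordered by score, and the coordinator merges them while dropping duplicates by a globally unique docID) can be illustrated with a minimal sketch. This is not Ignite or Lucene code: the `merge_hits` function, the `(score, doc_id)` tuple layout, and the sample data are all hypothetical, and the sketch assumes that scores are comparable across nodes, as the thread suggests.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical hit shape: (score, doc_id). doc_id is assumed globally unique.
Hit = Tuple[float, str]


def merge_hits(per_node_hits: Iterable[List[Hit]], limit: int) -> List[Hit]:
    """Merge per-node hit lists into a global top-`limit` list.

    Each input list must already be sorted by descending score.
    Duplicate doc_ids (e.g. the same entry returned by a primary and a
    backup copy) are kept only once, at their first occurrence.
    """
    if limit <= 0:
        return []

    # heapq.merge expects ascending order, so merge on the negated score.
    merged = heapq.merge(*per_node_hits, key=lambda hit: -hit[0])

    seen = set()
    result: List[Hit] = []
    for score, doc_id in merged:
        if doc_id in seen:
            continue  # duplicate from another node, skip it
        seen.add(doc_id)
        result.append((score, doc_id))
        if len(result) == limit:
            break
    return result


if __name__ == "__main__":
    node_a = [(0.9, "doc-1"), (0.4, "doc-3")]
    node_b = [(0.9, "doc-1"), (0.7, "doc-2"), (0.1, "doc-4")]
    print(merge_hits([node_a, node_b], limit=3))
    # prints [(0.9, 'doc-1'), (0.7, 'doc-2'), (0.4, 'doc-3')]
```

Because the merge is lazy, the coordinator stops consuming node results once `limit` unique hits are collected, which is the property a paged reducer would rely on.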
