Re: Text queries/indexes (GridLuceneIndex, @QueryTextFiled)

Alexei Scherbakov Fri, 30 Aug 2019 00:38:40 -0700

Yuriy,

Note that one of the major blockers for text queries is [1], which makes
Lucene indexes unusable with persistence and is the main reason for the
discontinuation. It should probably be addressed first to make text queries
a valid product feature.


Distributed sorting and advanced querying are indeed not trivial tasks.
Some kind of merging must be implemented on the query-originating node.
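
To make the merge step concrete, here is a minimal sketch in plain Java. It
assumes each data node returns its hits already sorted by descending score
and already truncated to the limit; the originating node then does a k-way
merge and applies the limit again. TopNMerge, Hit and merge are hypothetical
names used for illustration only, not Ignite or Lucene API.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of a top-N merge on the originating node. */
class TopNMerge {
    /** A scored hit as a data node might return it (assumed shape). */
    static final class Hit {
        public final String key;
        public final float score;

        public Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Read position inside one node's sorted result list. */
    private static final class Cursor {
        final List<Hit> hits;
        int pos;

        Cursor(List<Hit> hits) {
            this.hits = hits;
        }

        Hit head() {
            return hits.get(pos);
        }
    }

    /**
     * Merges per-node lists into one global top-N list.
     * Each input list must already be sorted by the same order.
     */
    static List<Hit> merge(List<List<Hit>> perNode, int limit) {
        // Higher score first; ties broken by key so reruns are stable.
        Comparator<Hit> order = (a, b) -> {
            int c = Float.compare(b.score, a.score);
            return c != 0 ? c : a.key.compareTo(b.key);
        };

        PriorityQueue<Cursor> queue =
            new PriorityQueue<>((x, y) -> order.compare(x.head(), y.head()));

        for (List<Hit> hits : perNode) {
            if (!hits.isEmpty())
                queue.add(new Cursor(hits));
        }

        List<Hit> result = new ArrayList<>();

        // Repeatedly take the best remaining head until the limit is hit.
        while (result.size() < limit && !queue.isEmpty()) {
            Cursor c = queue.poll();
            result.add(c.head());
            c.pos++;

            if (c.pos < c.hits.size())
                queue.add(c);
        }

        return result;
    }
}
```

The tie-break on the key is what keeps the output the same between runs,
regardless of the order in which node responses arrive.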

[1] https://issues.apache.org/jira/browse/IGNITE-5371

On Thu, Aug 29, 2019 at 23:38, Denis Magda <dma...@apache.org> wrote:

> Yuriy,
>
> If you are ready to take over the full-text search indexes then please go
> ahead. The primary reasons why the community wants to discontinue them first
> (and, probably, resurrect later) are the limitations listed by Andrey and
> minimal support from the community end.
>
> -
> Denis
>
>
> On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov <
> andrey.mashen...@gmail.com>
> wrote:
>
> > Hi Yuriy,
> >
> > Unfortunately, there is a plan to discontinue TextQueries in Ignite [1].
> > The motivation is that text indexes are not persistent, not transactional,
> > and can't be used together with SQL or inside SQL,
> > and there is a lack of interest from the community side.
> > You are welcome to take on these issues and make TextQueries great.
> >
> > 1. PageSize can't be used to limit the result set.
> > Query results are returned from the data node to the client-side cursor
> > page by page, and this parameter is designed to control the page size.
> > The query is supposed to execute lazily on the server side, so the full
> > result set is not expected to be loaded into memory on the server side at
> > once, but page by page.
> > Do you mean you found that Lucene loads the entire result set into memory
> > before the first page is sent to the client?
> >
> > I'd think a new parameter should be added to limit the result. The best
> > solution is to use query language commands for this, e.g. "LIMIT/OFFSET"
> > in SQL.
> >
> > This task doesn't look trivial. A query is a distributed operation: the
> > same user query will be executed on the data nodes,
> > and then the results from all nodes should be correctly merged before
> > being returned via the client cursor.
> > So, LIMIT should be applied on every node and then again in the merge phase.
> >
> > Also, this may be non-obvious, but limiting results makes no sense without
> > sorting,
> > as there is no guarantee that every subsequent query run will return the
> > same data, because of page reordering.
> > Basically, the merge phase receives results from data nodes asynchronously,
> > and messages from different nodes can't be ordered.
> >
> > 2.
> > a. "tokenize" param name (for @QueryTextField) looks more verbose, doesn't
> > it?
> > b,c. What about distributed queries? How will partial results from nodes
> > be merged?
> > Does Lucene allow configuring a comparator for data sorting?
> > Which comparator should Ignite choose to sort results in the merge phase?
> >
> > 3. For now, the Lucene engine is not configurable at all. E.g. it is
> > impossible to configure the Tokenizer.
> > I'd think about possible ways to configure the engine first, and only then
> > go further to discuss/implement complex features
> > that may depend on the engine config.
> >
> >
> >
> > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <shul...@gmail.com> wrote:
> >
> > > Dear community,
> > >
> > > By starting this thread I'd like to open a discussion that would lead to
> > > contributions in the subject area.
> > >
> > > Ignite has indexing capabilities, backed up by different mechanisms,
> > > including Lucene.
> > >
> > > Currently, Lucene 7.5.0 is used (last year's release).
> > > This is a widespread and mature technology that covers the text search
> > > area and beyond (e.g. spatial data indexing).
> > >
> > > My goal is to *expose more Lucene functionality to Ignite indexing and
> > > query mechanisms for text data*.
> > >
> > > It's quite a simple request at the current stage. It comes from our
> > > project's needs but, I believe, will be useful for a lot more people.
> > > Let's walk through the items and vote on or discuss Jira tickets for them.
> > >
> > > 1.[trivial] Use dataQuery.getPageSize() to limit search response items
> > > inside GridLuceneIndex.query(). Currently it calls
> > > IndexSearcher.search(query, *Integer.MAX_VALUE*), so basically all scored
> > > matches will be returned, which we do not need in most cases.
> > >
> > > 2.[simple] Add sorting. Then a more capable search call can be
> > > executed: *IndexSearcher.search(query, count, sort)*
> > > Implementation steps:
> > > a) Introduce a boolean *sortField* parameter in the *@QueryTextField*
> > > annotation. If *true*, the field will be indexed but not tokenized.
> > > Number types are preferred here.
> > > b) Add a *sort* collection to the *TextQuery* constructor. It should
> > > define the desired sort fields used for querying.
> > > c) Implement Lucene sort usage in GridLuceneIndex.query().
> > >
> > > 3.[moderate] Build complex queries with *TextQuery*, including
> > > term/query boosting.
> > > *This section is for voting only, as it requires more detailed work. It
> > > should be extended if the community is interested in it.*
> > >
> > > Looking forward to your comments!
> > >
> > > BR,
> > > Yuriy Shuliha
> > >
> >
> >
> > --
> > Best regards,
> > Andrey V. Mashenkov
> >
>


-- 

Best regards,
Alexei Scherbakov
