Yuriy,

Thank you, fine with it.

Fri, Oct 4, 2019 at 11:01, Yuriy Shuliga <shul...@gmail.com>:
>
> Ivan,
>
> Yes, your observation is correct.
>
> This behavior has been there from the very beginning, since Lucene indexing was implemented for distributed queries.
> Implementation of the *limit* solves the problem of redundant response size. Without it *ALL* of the records are fetched each time; that is not good, especially for loose patterns.
> In order to solve the relevance issue, correct sorting should be implemented.
>
> Y.
>
> Fri, Oct 4, 2019 at 10:45 Ivan Pavlukhin <vololo...@gmail.com> wrote:
>
> > Yuriy,
> >
> > Am I getting it right that in your PR, if we have a limit N, then the returned items (at most N) will not be strictly the most relevant ones? E.g. if one node returned N items faster than others but with not so good relevance?
> >
> > Thu, Oct 3, 2019 at 17:47, Andrey Mashenkov <andrey.mashen...@gmail.com>:
> > >
> > > Yuri,
> > >
> > > I'm done with the review.
> > > No crime found, only a trivial compatibility bug.
> > >
> > > On Thu, Oct 3, 2019 at 3:54 PM Yuriy Shuliga <shul...@gmail.com> wrote:
> > >
> > > > Denis,
> > > >
> > > > Thank you for your attention to this.
> > > > As for now, the https://issues.apache.org/jira/browse/IGNITE-12189 ticket is still pending review.
> > > > Do we have a chance to move it forward somehow?
> > > >
> > > > BR,
> > > > Yuriy Shuliha
> > > >
> > > > Mon, Sep 30, 2019 at 23:35 Denis Magda <dma...@apache.org> wrote:
> > > >
> > > > > Yuriy,
> > > > >
> > > > > I've seen you opening a pull-request with the first changes:
> > > > > https://issues.apache.org/jira/browse/IGNITE-12189
> > > > >
> > > > > Alex Scherbakov and Ivan, are you the right guys to do the review?
> > > > >
> > > > > -
> > > > > Denis
> > > > >
> > > > > On Fri, Sep 27, 2019 at 8:48 AM Павлухин Иван <vololo...@gmail.com> wrote:
> > > > >
> > > > > > Yuriy,
> > > > > >
> > > > > > Thank you for providing details! Quite interesting.
> > > > > >
> > > > > > Yes, we already have support for distributed limit and for merging sorted subresults for SQL queries. E.g. ReduceIndexSorted and MergeStreamIterator are used for merging sorted streams.
> > > > > >
> > > > > > Could you please also clarify about score/relevance? Is it provided by the Lucene engine for each query result? I am thinking about how to do a sorted merge properly in this case.
> > > > > >
> > > > > > Wed, Sep 25, 2019 at 18:56, Yuriy Shuliga <shul...@gmail.com>:
> > > > > > >
> > > > > > > Ivan,
> > > > > > >
> > > > > > > Thank you for the interesting question!
> > > > > > >
> > > > > > > Text searches (or full-text searches) are mostly human-oriented, and the point of the user's interest is the topmost part of the response.
> > > > > > > Then the user can read it, evaluate it and use the given records for further purposes.
> > > > > > >
> > > > > > > Particularly in our case, we use Ignite for operations with financial data, and there is a lot of text stuff like asset names, fin. instruments, companies etc.
> > > > > > > In order to operate with this quickly and reliably, users are used to working with text search, type-ahead completions, suggestions.
> > > > > > >
> > > > > > > For these purposes we are indexing particular string data in separate caches.
> > > > > > >
> > > > > > > Sorting capabilities and response size limitations are very important there, as our API has to provide the most relevant information within a limited size.
> > > > > > >
> > > > > > > Now let me comment on the Ignite/Lucene perspective.
> > > > > > > Actually Ignite queries Lucene, and it returns *TopDocs.scoreDocs* already sorted by *score* (relevance). So the most relevant documents are on the top.
> > > > > > > And currently the responses of distributed queries from different nodes are merged into the final query cursor queue in an arbitrary way.
> > > > > > > So in fact we already have the score order ruined here. Also, Ignite requests all possible documents from Lucene, which is redundant and not good for performance.
> > > > > > >
> > > > > > > I'm implementing the *limit* parameter as part of *TextQuery* and have to note that we still have to add sorting to text query processing in order to get usable results.
> > > > > > >
> > > > > > > The *limit* parameter itself should improve part of the issues above, but definitely, sorting by document score at least should be implemented along with the limit.
> > > > > > >
> > > > > > > This is a pretty short commentary; if you still have any questions, please ask, do not hesitate)
> > > > > > >
> > > > > > > BR,
> > > > > > > Yuriy Shuliha
> > > > > > >
> > > > > > > Thu, Sep 19, 2019 at 11:38 Павлухин Иван <vololo...@gmail.com> wrote:
> > > > > > >
> > > > > > > > Yuriy,
> > > > > > > >
> > > > > > > > Greatly appreciate your interest.
> > > > > > > >
> > > > > > > > Could you please elaborate a little bit about sorting? What tasks does it help to solve and how? It would be great to provide an example.
> > > > > > > >
> > > > > > > > Wed, Sep 18, 2019 at 09:39, Alexei Scherbakov <alexey.scherbak...@gmail.com>:
> > > > > > > > >
> > > > > > > > > Denis,
> > > > > > > > >
> > > > > > > > > I like the idea of throwing an exception for enabled text queries on persistent caches.
> > > > > > > > >
> > > > > > > > > Also I'm fine with the proposed limit for unsorted searches.
> > > > > > > > >
> > > > > > > > > Yury, please proceed with ticket creation.
> > > > > > > > >
> > > > > > > > > Tue, Sep 17, 2019, 22:06 Denis Magda <dma...@apache.org>:
> > > > > > > > >
> > > > > > > > > > Igniters,
> > > > > > > > > >
> > > > > > > > > > I see nothing wrong with Yury's proposal in regards to full-text search API evolution as long as Yury is ready to push it forward.
> > > > > > > > > >
> > > > > > > > > > As for the in-memory mode only, it makes total sense for in-memory data grid deployments when Ignite caches data of an underlying DB like Postgres.
> > > > > > > > > > As part of the changes, I would simply throw an exception (by default) if one attempts to use text indices with the native persistence enabled.
> > > > > > > > > > If the person is ready to live with that limitation, then an explicit configuration change is needed to get around the exception.
> > > > > > > > > >
> > > > > > > > > > Thoughts?
> > > > > > > > > >
> > > > > > > > > > -
> > > > > > > > > > Denis
> > > > > > > > > >
> > > > > > > > > > On Tue, Sep 17, 2019 at 7:44 AM Yuriy Shuliga <shul...@gmail.com> wrote:
> > > > > > > > > >
> > > > > > > > > > > Hello to all again,
> > > > > > > > > > >
> > > > > > > > > > > Thank you for the important comments and notes given below!
> > > > > > > > > > >
> > > > > > > > > > > Let me answer and continue the discussion.
> > > > > > > > > > >
> > > > > > > > > > > (I) Overall needs in Lucene indexing
> > > > > > > > > > >
> > > > > > > > > > > Alexei has referred to https://issues.apache.org/jira/browse/IGNITE-5371 where the absence of index persistence was declared as an obstacle to further development.
> > > > > > > > > > >
> > > > > > > > > > > a) This ticket is already closed as not valid.
> > > > > > > > > > > b) There are definite needs (in our project as well) for just in-memory indexing of selected data.
> > > > > > > > > > > We intend to use search capabilities for fetching a limited amount of records that should be used in type-ahead search / suggestions.
> > > > > > > > > > > Not all of the data will be indexed and there is no need for the Lucene index to be persistent. Hope this is a common pattern of text-search usage.
> > > > > > > > > > >
> > > > > > > > > > > (II) Necessary fixes in current implementation.
> > > > > > > > > > >
> > > > > > > > > > > a) Implementation of correct *limit* (*offset* seems to be not required in text-search tasks for now)
> > > > > > > > > > > I have investigated the data flow for distributed text queries. It was a simple test prefix query, like 'name'='ene*'.
> > > > > > > > > > > For now each server node returns all response records to the client node, and the response may contain thousands or hundreds of thousands of records.
> > > > > > > > > > > Even if we need only the first 10-100.
> > > > > > > > > > > Again, all the results are added to the queue in GridCacheQueryFutureAdapter in arbitrary order, by pages.
> > > > > > > > > > > I did not find here any means to deliver a deterministic result.
> > > > > > > > > > > So implementing limit as part of the query (and of GridCacheQueryRequest) will not change the nature of the response but will limit the load on nodes and networking.
> > > > > > > > > > >
> > > > > > > > > > > Can we consider opening a ticket for this?
> > > > > > > > > > >
> > > > > > > > > > > (III) Further extension of Lucene API exposition to Ignite
> > > > > > > > > > >
> > > > > > > > > > > a) Sorting
> > > > > > > > > > > The solution for this could be:
> > > > > > > > > > > - Make entities comparable
> > > > > > > > > > > - Add custom comparator to entity
> > > > > > > > > > > - Add annotations to mark sorted fields for Lucene indexing
> > > > > > > > > > > - Use comparators when merging responses or reducing to desired limit on client node.
> > > > > > > > > > > This will require the full result set to be loaded into memory, though it can be used for relatively small limits.
> > > > > > > > > > >
> > > > > > > > > > > BR,
> > > > > > > > > > > Yuriy Shuliha
> > > > > > > > > > >
> > > > > > > > > > > Fri, Aug 30, 2019 at 10:37 Alexei Scherbakov <alexey.scherbak...@gmail.com> wrote:
> > > > > > > > > > >
> > > > > > > > > > > > Yuriy,
> > > > > > > > > > > >
> > > > > > > > > > > > Note that one of the major blockers for text queries is [1], which makes Lucene indexes unusable with persistence and is the main reason for discontinuation.
> > > > > > > > > > > > Probably it should be addressed first to make text queries a valid product feature.
> > > > > > > > > > > >
> > > > > > > > > > > > Distributed sorting and advanced querying is indeed not a trivial task. Some kind of merging must be implemented on the query originating node.
> > > > > > > > > > > >
> > > > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-5371
> > > > > > > > > > > >
> > > > > > > > > > > > Thu, Aug 29, 2019 at 23:38, Denis Magda <dma...@apache.org>:
> > > > > > > > > > > >
> > > > > > > > > > > > > Yuriy,
> > > > > > > > > > > > >
> > > > > > > > > > > > > If you are ready to take over the full-text search indexes then please go ahead. The primary reasons why the community wants to discontinue them first (and, probably, resurrect them later) are the limitations listed by Andrey and minimal support from the community end.
> > > > > > > > > > > > >
> > > > > > > > > > > > > -
> > > > > > > > > > > > > Denis
> > > > > > > > > > > > >
> > > > > > > > > > > > > On Thu, Aug 29, 2019 at 1:29 PM Andrey Mashenkov <andrey.mashen...@gmail.com> wrote:
> > > > > > > > > > > > >
> > > > > > > > > > > > > > Hi Yuriy,
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Unfortunately, there is a plan to discontinue TextQueries in Ignite [1].
> > > > > > > > > > > > > > The motivation here is that text indexes are not persistent, not transactional and can't be used together with SQL or inside SQL,
> > > > > > > > > > > > > > and there is a lack of interest from the community side.
> > > > > > > > > > > > > > You are welcome to take on these issues and make TextQueries great.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 1. PageSize can't be used to limit the result set.
> > > > > > > > > > > > > > Query results are returned from a data node to the client-side cursor in a page-by-page manner, and this parameter is designed to control the page size. It is supposed that the query executes lazily on the server side, and it is not expected that the full result set is loaded into memory on the server side at once, but by pages.
> > > > > > > > > > > > > > Do you mean you found that Lucene loads the entire result set into memory before the first page is sent to the client?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > I'd think a new parameter should be added to limit the result. The best solution is to use query language commands for this, e.g. "LIMIT/OFFSET" in SQL.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > This task doesn't look trivial.
> > > > > > > > > > > > > > A query is a distributed operation: the same user query will be executed on the data nodes, and then the results from all nodes should be correctly merged before being returned via the client cursor.
> > > > > > > > > > > > > > So, LIMIT should be applied on every node and then again in the merge phase.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > Also, this may be non-obvious: limiting results makes no sense without sorting, as there is no guarantee that every next query run will return the same data, because of page reordering.
> > > > > > > > > > > > > > Basically, the merge phase receives results from data nodes asynchronously, and messages from different nodes can't be ordered.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 2.
> > > > > > > > > > > > > > a. A "tokenize" param name (for @QueryTextField) looks more verbose, doesn't it?
> > > > > > > > > > > > > > b,c. What about a distributed query? How will partial results from nodes be merged?
> > > > > > > > > > > > > > Does Lucene allow configuring a comparator for data sorting?
> > > > > > > > > > > > > > What comparator should Ignite choose to sort the result in the merge phase?
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > 3. For now the Lucene engine is not configurable at all. E.g.
> > > > > > > > > > > > > > it is impossible to configure the Tokenizer.
> > > > > > > > > > > > > > I'd think about possible ways to configure the engine at first and only then go further to discuss/implement complex features that may depend on the engine config.
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > On Thu, Aug 29, 2019 at 8:17 PM Yuriy Shuliga <shul...@gmail.com> wrote:
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Dear community,
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > By starting this chain I'd like to open a discussion that would lead to contributions in the subj. area.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Ignite has indexing capabilities, backed up by different mechanisms, including Lucene.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Currently, Lucene 7.5.0 is used (past year release).
> > > > > > > > > > > > > > > This is a widespread and mature technology that covers the text search area and beyond (e.g. spatial data indexing).
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > My goal is to *expose more Lucene functionality to Ignite indexing and query mechanisms for text data*.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > It's quite a simple request at the current stage.
> > > > > > > > > > > > > > > It is coming from our project's needs but, I believe, will be useful for a lot more people.
> > > > > > > > > > > > > > > Let's walk through the items and vote on or discuss Jira tickets for them.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 1.[trivial] Use dataQuery.getPageSize() to limit search response items inside GridLuceneIndex.query(). Currently it is calling IndexSearcher.search(query, *Integer.MAX_VALUE*), so basically all scored matches will be returned, which we do not need in most cases.
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 2.[simple] Add sorting. Then a more capable search call can be executed: *IndexSearcher.search(query, count, sort)*
> > > > > > > > > > > > > > > Implementation steps:
> > > > > > > > > > > > > > > a) Introduce a boolean *sortField* parameter in the *@QueryTextField* annotation. If *true*, the field will be indexed but not tokenized. Number types are preferred here.
> > > > > > > > > > > > > > > b) Add a *sort* collection to the *TextQuery* constructor. It should define the desired sort fields used for querying.
> > > > > > > > > > > > > > > c) Implement Lucene sort usage in GridLuceneIndex.query().
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > 3.[moderate] Build complex queries with *TextQuery*, including terms/queries boosting.
> > > > > > > > > > > > > > > *This section is for voting only, as it requires more detailed work. It should be extended if the community is interested in it.*
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > Looking forward to your comments!
> > > > > > > > > > > > > > >
> > > > > > > > > > > > > > > BR,
> > > > > > > > > > > > > > > Yuriy Shuliha
> > > > > > > > > > > > > >
> > > > > > > > > > > > > > --
> > > > > > > > > > > > > > Best regards,
> > > > > > > > > > > > > > Andrey V. Mashenkov
> > > > > > > > > > > >
> > > > > > > > > > > > --
> > > > > > > > > > > > Best regards,
> > > > > > > > > > > > Alexei Scherbakov
> > > > > > > >
> > > > > > > > --
> > > > > > > > Best regards,
> > > > > > > > Ivan Pavlukhin
> > > > > >
> > > > > > --
> > > > > > Best regards,
> > > > > > Ivan Pavlukhin
> > >
> > > --
> > > Best regards,
> > > Andrey V. Mashenkov
> >
> > --
> > Best regards,
> > Ivan Pavlukhin

--
Best regards,
Ivan Pavlukhin
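
The merge step discussed throughout this thread (apply the limit on every node, then merge by score on the query originating node) could look roughly like the sketch below. It is only an illustration, not Ignite code: the ScoreMerge and Hit names are hypothetical, and it assumes that every node has already returned its complete, limited response sorted by descending score, the way Lucene orders TopDocs.scoreDocs.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical sketch of a score-ordered merge with a limit; not part of Ignite. */
public class ScoreMerge {
    /** Hypothetical holder for one search hit: a cache key plus its relevance score. */
    public static final class Hit {
        public final String key;
        public final float score;

        public Hit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Cursor over one node's response; the response must be sorted by descending score. */
    private static final class Cursor {
        Hit head;
        final Iterator<Hit> rest;

        Cursor(Iterator<Hit> it) {
            this.rest = it;
            this.head = it.next();
        }
    }

    /** Returns at most {@code limit} hits with the highest scores across all node responses. */
    public static List<Hit> mergeTopN(List<List<Hit>> perNode, int limit) {
        // Max-heap keyed by the score of each cursor's current head element.
        PriorityQueue<Cursor> heap =
            new PriorityQueue<>((a, b) -> Float.compare(b.head.score, a.head.score));

        for (List<Hit> nodeHits : perNode) {
            if (!nodeHits.isEmpty())
                heap.add(new Cursor(nodeHits.iterator()));
        }

        List<Hit> merged = new ArrayList<>();

        while (merged.size() < limit && !heap.isEmpty()) {
            Cursor c = heap.poll();   // Node whose next hit currently has the best score.
            merged.add(c.head);

            if (c.rest.hasNext()) {   // Advance that node's cursor and put it back.
                c.head = c.rest.next();
                heap.add(c);
            }
        }

        return merged;
    }

    public static void main(String[] args) {
        List<Hit> nodeA = Arrays.asList(new Hit("a1", 0.9f), new Hit("a2", 0.4f));
        List<Hit> nodeB = Arrays.asList(new Hit("b1", 0.8f), new Hit("b2", 0.7f), new Hit("b3", 0.1f));

        for (Hit h : mergeTopN(Arrays.asList(nodeA, nodeB), 3))
            System.out.println(h.key + " " + h.score);
    }
}
```

With a limit of 3 the example yields a1, b1, b2, regardless of which node answered first. Because every node only needs to send its own top N hits, the originating node holds at most N hits per node in memory, which is why a sketch like this is practical mainly for relatively small limits.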