Hi Ivan,

Would you like to propose an alternative to Lucene?

Atri

On Mon, 2 Aug 2021, 13:48 Ivan Pavlukhin, <vololo...@gmail.com> wrote:

> Folks,
>
> Sorry if I did not read the thread thoroughly enough, but do we consider
> Lucene the obviously right choice? In my understanding, Ignite history
> has clearly shown that the "fastest feature implementation" is usually not
> the best one, and text queries are one example of this. Are we not trying
> to make the same mistake again? FTS is a huge feature, and I do not believe
> there is an easy win for it.
>
> 2021-07-27 19:18 GMT+03:00, Atri Sharma <a...@apache.org>:
> > Andrey,
> >
> >> Per-partition Lucene index looks simple to implement, but it may require
> >> per-partition SQL to make full-text search expressions work correctly
> >> within the SQL query.
> > I think that as long as we follow the map-reduce process that we
> > already do for other queries, we should be fine.
> >
> >> Per-partition SQL index may kill the performance. We already tried to do
> >> that in Ignite 2. However, the QueryParallelism feature helps to speed up
> >> some data-intensive queries, but hurts the performance in simple cases,
> >> and at some point (e.g. segments > number of CPUs) the performance
> >> rapidly degrades with the increasing number of segments.
> >
> > Yeah, that is always the case, but a global index will be a nightmare
> > in terms of concurrency, and pessimistic concurrency control will
> > kill the benefits anyway, coupled with the metadata requirements.
> > What were the specific issues with the per-partition index?
> >>
> >> AFAIK, Lucene widely uses bitmap indices that are easy to merge.
> >> Maybe the map-reduce technique underneath FTS expressions and some hacks
> >> will add a minimal overhead.
> >
> > Lucene uses many types of indices, but the point here is that
> > per-partition Lucene indices can return docIDs and we can merge them in
> > the reduce phase. So we are abstracted from the specifics of the internal
> > index being used to serve the query.
> >
> >>
> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> > Lucene indices. The important thing here is to not treat Lucene
> >> > indices as source of truth.
> >> To use WAL we either should relay Lucene files to our Page memory or be
> >> aware of the Lucene file structure.
> >> The first looks tricky, as we should guarantee a contiguous address space
> >> in Page memory for reflecting a Lucene file. Maybe a separate managed
> >> memory segment with its own rules?
> >
> > Why not use Lucene's MMapDirectory and map it to our storage classes?
> >
> >>
> >> >> Transactions.
> >> >> * Will we support transactions?
> >> > Lucene has no concept of transactions.
> >> Yes, but we have.
> >> Lucene index may be non-transactional, but users never expect to see
> >> uncommitted data.
> >> How does this connect with transactional SQL?
> > We could have the Lucene writes done as a part of transactions and ack
> > back only when it succeeds/fails. WDYT?
> >>
> >> On Tue, Jul 27, 2021 at 1:36 PM Atri Sharma <a...@apache.org> wrote:
> >>
> >> > Sorry, I planned on creating a Wiki page for this, but it makes more
> >> > sense to be replying here.
> >> >
> >> > > * How can a Lucene index be split among the nodes?
> >> >
> >> > We can have partition level indices on each node.
> >> >
> >> > > * If we'll have a single index for all partitions on the particular
> >> > > node, then how will index records be aware of partitioning?
> >> >
> >> > Index records don't need to be aware of partitioning -- each Lucene
> >> > index is independent.
> >> >
> >> > > This is important to filter out backup records from the results to
> >> > > avoid duplicates.
> >> >
> >> > We can merge documents from different nodes and remove duplicates as
> >> > long as docIDs are globally unique.
> >> >
> >> > > * How can results from several nodes be merged on the Reduce stage?
> >> >
> >> > As long as documents have a globally unique docID, Lucene has merge
> >> > functions that can merge results from multiple partial results.
> >> >
> >> > > * Does Lucene support something like a JOIN operation or others that
> >> > > may require data from another partition or index?
> >> >
> >> > As illustrated by Ilya, Block-Join works for us.
> >> >
> >> > > If so, then it looks like a multistep query with merging results on
> >> > > intermediate stages and requires detailed investigation and design.
> >> > > It is ok if Ignite will have some limitations here, but we would like
> >> > > to know about them at the early stage.
> >> >
> >> > > * How to effectively map Lucene files to the page memory? Is it even
> >> > > possible?
> >> >
> >> > Lucene has Directory implementations which allow storing Lucene
> >> > indices on different kinds of file structures. It has an
> >> > MMapDirectory that we could use?
> >> >
> >> > > Otherwise, how to deal with potential OOM on large queries and memory
> >> > > capacity planning?
> >> >
> >> > We can use Lucene's MMapDirectory.
> >> >
> >> > >
> >> > > Persistence.
> >> > > * How and what consistency guarantees could we have/expect?
> >> >
> >> > Lucene does not have WAL logs but is append only
> >> >
> >> > > Seems, we may not be able to write physical records for Lucene index
> >> > > to our WAL. What can we do with this?
> >> >
> >> > As illustrated by Ilya, we can use Ignite's WAL records to rebuild
> >> > Lucene indices. The important thing here is to not treat Lucene
> >> > indices as source of truth.
> >> > >
> >> > > Transactions.
> >> > > * Will we support transactions?
> >> > Lucene has no concept of transactions.
> >> >
> >> > > * Should Lucene be aware of Transaction and track mvcc (or whatever)
> >> > > versions for the records?
> >> > No
> >> > > * What will be consistency guarantees?
> >> > We can acknowledge writes back only after Lucene index is updated.
> >> > >
> >> > > UX
> >> > > * How to add FullText search queries syntax into Calcite?
> >> > Postgres's FTS functions are a good reference.
> >> > > * AFAIK, the Lucene index has many properties for tuning. How will
> >> > > the user configure the index?
> >> > Most of those properties can be cluster level and exposed as a new sub
> >> > config for Ignite.
> >> > > * How and where to store the settings? What are cluster-wide and what
> >> > > are local to the particular node?
> >> > All can be cluster level.
> >> > > * Will all the settings be immutable? Can they be changed on the fly?
> >> > > After node/grid restart?
> >> > They should be applied post restart.
> >> >
> >> > > * Any limitations on query syntax?
> >> > It depends on how we model our queries for text search.
> >> >
> >> > >
> >> > > SQL
> >> > > * Will we support FullText search in SQL?
> >> > We need custom functions for it. See Postgres's FTS functions.
> >> > > * How to integrate Lucene index into Calcite? What is the cost model?
> >> > There cannot be any cost model since there are no paths for a text
> >> > query. If we see a text query, we have to use Lucene index or return
> >> > an error. In this way, we need to model text search as a set of UDFs
> >> >
> >> > > Splitting rules? Traits?
> >> > Please see my reply above.
> >> > >
> >> > >
> >> > > With all of this, you can go with the IEP (or even some short
> >> > > summary) and further POC and implementation.
> >> > > That's a big deal, so let's discuss what could be done here.
> >> > >
> >> > > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <a...@apache.org> wrote:
> >> > >
> >> > > > I am actually happy to drive the feature for Ignite 3. FTS is very
> >> > > > important for me and I think Ignite users will benefit from it
> >> > > > greatly.
> >> > > >
> >> > > > If it makes sense to be focusing on Ignite 3 for this capability, I
> >> > > > am eager to contribute there and lead the development.
> >> > > >
> >> > > > Please share your thoughts.
> >> > > >
> >> > > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> >> > > > <andrey.mashen...@gmail.com> wrote:
> >> > > > >
> >> > > > > Hi Atri,
> >> > > > >
> >> > > > > All the Jira tickets we have on the Full-text search (FTS) thing
> >> > > > > are targeted to Ignite 2.
> >> > > > >
> >> > > > > AFAIK, we want, but we have NOT committed to FTS support in
> >> > > > > Ignite 3, yet.
> >> > > > > By the way, we are getting requests for this thing from the user
> >> > > > > side, and definitely, FTS would be a valuable feature for Ignite.
> >> > > > >
> >> > > > > It would be great if someone wants to drive it; any help will be
> >> > > > > appreciated.
> >> > > > >
> >> > > > >
> >> > > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <a...@apache.org> wrote:
> >> > > > >
> >> > > > > > Hello,
> >> > > > > >
> >> > > > > > An update, please. I am working through persistence of Lucene
> >> > > > > > index using Ignite Dictionary, and will be asking some
> >> > > > > > questions soon.
> >> > > > > >
> >> > > > > > I had one doubt -- where does this change go? Ignite 3?
> >> > > > > >
> >> > > > > > Also, I know we want to build native support for text searches
> >> > > > > > in Ignite 3. Is the work I am proposing here part of that, or
> >> > > > > > will that be a separate effort?
> >> > > > > >
> >> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev,
> >> > > > > > <ilya.kasnach...@gmail.com> wrote:
> >> > > > > >
> >> > > > > > > Hello!
> >> > > > > > >
> >> > > > > > > I think that number one is the most important one, then maybe
> >> > > > > > > it will see more use and other deficiencies become more
> >> > > > > > > apparent, leading to more tickets and visibility.
> >> > > > > > >
> >> > > > > > > Maybe 2. and 3. will even use a different approach when
> >> > > > > > > persistence is implemented.
> >> > > > > > >
> >> > > > > > > Regards,
> >> > > > > > > --
> >> > > > > > > Ilya Kasnacheev
> >> > > > > > >
> >> > > > > > >
> >> > > > > > > Mon, 28 Jun 2021 at 14:34, Atri Sharma <a...@apache.org>:
> >> > > > > > >
> >> > > > > > > > Hello Again!
> >> > > > > > > >
> >> > > > > > > > I have been looking into the aforementioned and here are my
> >> > > > > > > > follow up thoughts:
> >> > > > > > > >
> >> > > > > > > > 1. Support persistence of Lucene indexes.
> >> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401
> >> > > > > > > > (Needs fixing of moving partitions first)
> >> > > > > > > > 3. Figure out how to return scores from nodes and use them
> >> > > > > > > > as sort parameters on the coordinator node
> >> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> >> > > > > > > >
> >> > > > > > > > Please let me know if this looks ok to make text queries
> >> > > > > > > > functional?
> >> > > > > > > >
> >> > > > > > > > Atri
> >> > > > > > > >
> >> > > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> >> > > > > > > > <alexey.scherbak...@gmail.com> wrote:
> >> > > > > > > > >
> >> > > > > > > > > Hi.
> >> > > > > > > > >
> >> > > > > > > > > One of the biggest issues with text queries is a lack of
> >> > > > > > > > > support for Lucene indices persistence, which makes this
> >> > > > > > > > > functionality useless if persistence is enabled.
> >> > > > > > > > >
> >> > > > > > > > > I would first take care of it.
> >> > > > > > > > >
> >> > > > > > > > > Mon, 21 Jun 2021 at 12:16, Maksim Timonin
> >> > > > > > > > > <timonin.ma...@gmail.com>:
> >> > > > > > > > >
> >> > > > > > > > > > Hi, Atri!
> >> > > > > > > > > >
> >> > > > > > > > > > You're right, actually there is a lack of support for
> >> > > > > > > > > > TextQueries. For the last ticket I'm doing I see some
> >> > > > > > > > > > obvious issues with them (no page size support, for
> >> > > > > > > > > > example). I'm glad that somebody wants to maintain this
> >> > > > > > > > > > functionality. Thanks a lot!
> >> > > > > > > > > >
> >> > > > > > > > > > For the MergeSort algorithm there is already a patch
> >> > > > > > > > > > for that [1]. It's currently on review. This patch
> >> > > > > > > > > > introduces an abstract reducer for CacheQueries with 2
> >> > > > > > > > > > implementations (unordered, merge-sort). Then TextQuery
> >> > > > > > > > > > leverages MergeSort to order results from multiple
> >> > > > > > > > > > nodes by score. This patch also fixes the pageSize
> >> > > > > > > > > > issue I've mentioned before. Could you please check if
> >> > > > > > > > > > it fully matches your idea? Any issues or comments are
> >> > > > > > > > > > welcome.
> >> > > > > > > > > >
> >> > > > > > > > > > I've prepared this ticket, because I need the MergeSort
> >> > > > > > > > > > algorithm for the new type of queries I'm implementing
> >> > > > > > > > > > (IndexQuery, it should also provide ordered results
> >> > > > > > > > > > over multiple nodes). Currently I'm not planning to go
> >> > > > > > > > > > further with TextQuery, so if you're going to support
> >> > > > > > > > > > this it'll be a great contribution, I think.
> >> > > > > > > > > >
> >> > > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> >> > > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> >> > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma
> >> > > > > > > > > > <a...@apache.org> wrote:
> >> > > > > > > > > >
> >> > > > > > > > > > > Hi All,
> >> > > > > > > > > > >
> >> > > > > > > > > > > I have been looking into our text queries support and
> >> > > > > > > > > > > see that it has limited community support.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the
> >> > > > > > > > > > > module and work on enhancing it further.
> >> > > > > > > > > > >
> >> > > > > > > > > > > First goal would be to move to Lucene 8.x, then work
> >> > > > > > > > > > > on sorted reduce - merge across nodes. Fundamentally,
> >> > > > > > > > > > > this is doable since Lucene ranks documents according
> >> > > > > > > > > > > to their score, and documents are returned in the
> >> > > > > > > > > > > order of their score.
> >> > > > > > > > > > > Since the scoring function is homogeneous, this
> >> > > > > > > > > > > means that across nodes, we can compare scores and
> >> > > > > > > > > > > merge sort.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Please let me know if I can take this up.
> >> > > > > > > > > > >
> >> > > > > > > > > > > Atri
> >> > > > > > > > > > >
> >> > > > > > > > > > > --
> >> > > > > > > > > > > Regards,
> >> > > > > > > > > > >
> >> > > > > > > > > > > Atri
> >> > > > > > > > > > > Apache Concerted
> >> > > > > > > > > > >
> >> > > > > > > > > >
> >> > > > > > > > >
> >> > > > > > > > > --
> >> > > > > > > > >
> >> > > > > > > > > Best regards,
> >> > > > > > > > > Alexei Scherbakov
> >> > > > > > > >
> >> > > > > > > > --
> >> > > > > > > > Regards,
> >> > > > > > > >
> >> > > > > > > > Atri
> >> > > > > > > > Apache Concerted
> >> > > > > > > >
> >> > > > > > >
> >> > > > > >
> >> > > > >
> >> > > > > --
> >> > > > > Best regards,
> >> > > > > Andrey V. Mashenkov
> >> > > >
> >> > > > --
> >> > > > Regards,
> >> > > >
> >> > > > Atri
> >> > > > Apache Concerted
> >> > > >
> >> > >
> >> > > --
> >> > > Best regards,
> >> > > Andrey V. Mashenkov
> >> >
> >> > --
> >> > Regards,
> >> >
> >> > Atri
> >> > Apache Concerted
> >> >
> >>
> >> --
> >> Best regards,
> >> Andrey V. Mashenkov
> >
> > --
> > Regards,
> >
> > Atri
> > Apache Concerted
> >
>
> --
>
> Best regards,
> Ivan Pavlukhin
>
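
The sorted reduce discussed in this thread (each partition returns hits already ordered by score, and the reducer merges them and drops duplicates by docID) can be sketched roughly as follows. This is only an illustration in Python, not Ignite or Lucene code: the `(doc_id, score)` tuple shape, the `merge_hits` name, and the `limit` parameter are assumptions made up for the example, and it presumes that scores from different partitions are directly comparable and that docIDs are globally unique.

```python
import heapq
from typing import Iterable, List, Tuple

# Hypothetical hit shape for this sketch: (globally unique doc id, score).
Hit = Tuple[str, float]


def merge_hits(per_partition: Iterable[List[Hit]], limit: int) -> List[Hit]:
    """Merge per-partition hit lists into one list ordered by descending score.

    Each input list must already be sorted by descending score, which is
    what the map phase is assumed to return. Duplicates (for example the
    same document coming from a primary and a backup partition) are
    dropped by doc id; the first occurrence wins.
    """
    if limit <= 0:
        return []

    # heapq.merge performs a lazy k-way merge of already sorted inputs.
    # Negating the score turns its ascending order into descending order.
    merged = heapq.merge(*per_partition, key=lambda hit: -hit[1])

    seen = set()
    result: List[Hit] = []
    for doc_id, score in merged:
        if doc_id in seen:
            continue  # duplicate from another partition
        seen.add(doc_id)
        result.append((doc_id, score))
        if len(result) == limit:
            break  # stop early; the rest of the inputs is never consumed
    return result
```

Because the merge is lazy, the reducer only pulls as many hits from each partition as it needs to fill `limit`, which is the property that makes paging over a merge-sorted result practical.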