Re: Text Queries Support

Atri Sharma Fri, 23 Jul 2021 10:58:36 -0700

The standard ways to handle text-based searches in SQL are the
CONTAINS operator, the LIKE operator, or specific functions
(REGEXP_MATCHES, for example). I do not think any of these are supported by
Calcite at the moment.


On Fri, Jul 23, 2021 at 11:20 PM Valentin Kulichenko
<[email protected]> wrote:
>
> In my experience, one of the biggest usability issues with the current
> support of text queries is that they are completely decoupled from SQL.
> I.e. you can either execute a SQL query OR a text query. Modern databases,
> on the other hand, typically allow creating text-based indexes within
> regular tables and then using those indexes within regular SQL queries.
> Here is an example from Oracle:
> https://docs.oracle.com/cd/B10501_01/text.920/a96517/cdefault.htm
>
> I believe this is something we can look into in the scope of Ignite 3.
> Andrey, does Calcite have any support for this? What's your view on this?
>
> -Val
>
> On Fri, Jul 23, 2021 at 3:56 AM Andrey Mashenkov <[email protected]>
> wrote:
>
> > Atri,
> >
> > First of all, I'd recommend going through the Ignite ticket to gather
> > information about the current implementation issues and users' wants.
> > Then look at the code to get a complete understanding of how things work now,
> > which may help in future decisions.
> >
> > Since we use an outdated Lucene version, some things may be irrelevant for
> > the latest Lucene version.
> > So you will need expertise in the internals of a modern Lucene version to
> > understand what capabilities, guarantees, and limitations Lucene has and
> > could bring to Ignite.
> > That expertise could be gained from the Lucene project code or the Lucene
> > project dev list.
> >
> >
> > For now, the potential capabilities are not clear to me.
> > At first glance, I see the following topics that must be covered first:
> >
> > General questions
> > * How can a Lucene index be split among the nodes?
> > * If we have a single index for all partitions on a particular node,
> > then how will index records be aware of partitioning?
> > This is important to filter out backup records from the results to avoid
> > duplicates.
> > * How can results from several nodes be merged at the Reduce stage?
> > * Does Lucene support something like a JOIN operation, or others that may
> > require data from another partition or index?
> > If so, then it looks like a multistep query that merges results at
> > intermediate stages, and it requires detailed investigation and design.
> > It is OK if Ignite has some limitations here, but we would like to
> > know about them at an early stage.
> > * How can Lucene files be mapped to the page memory effectively? Is it even
> > possible? Otherwise, how do we deal with potential OOM on large queries and
> > with memory capacity planning?
> >
> > Persistence.
> > * How and what consistency guarantees could we have/expect?
> > It seems we may not be able to write physical records for the Lucene index
> > to our WAL. What can we do about this?
> >
> > Transactions.
> > * Will we support transactions?
> > * Should Lucene be aware of transactions and track MVCC (or whatever)
> > versions for the records?
> > * What will the consistency guarantees be?
> >
> > UX
> > * How do we add full-text search query syntax to Calcite?
> > * AFAIK, the Lucene index has many properties for tuning. How will the user
> > configure the index?
> > * How and where do we store the settings? Which are cluster-wide and which
> > are local to a particular node?
> > * Will all the settings be immutable? Can they be changed on the fly? After
> > a node/grid restart?
> > * Any limitations on query syntax?
> >
> > SQL
> > * Will we support FullText search in SQL?
> > * How to integrate Lucene index into Calcite? What is the cost model?
> > Splitting rules? Traits?
> > * What about consistency with DDL operations, e.g. column rename?
> > Ignite indices will operate on column IDs, so a rename operation will not
> > affect the index.
> >
> >
> > With all of this, you can go with the IEP (or even some short summary) and
> > further POC and implementation.
> > That's a big deal, so let's discuss what could be done here.
> >
> > On Fri, Jul 23, 2021 at 12:58 PM Atri Sharma <[email protected]> wrote:
> >
> > > I am actually happy to drive the feature for Ignite 3. FTS is very
> > > important for me and I think Ignite users will benefit from it
> > > greatly.
> > >
> > > If it makes sense to be focusing on Ignite 3 for this capability, I am
> > > eager to contribute there and lead the development.
> > >
> > > Please share your thoughts.
> > >
> > > On Fri, Jul 23, 2021 at 3:21 PM Andrey Mashenkov
> > > <[email protected]> wrote:
> > > >
> > > > Hi Atri,
> > > >
> > > > > All the Jira tickets we have on the full-text search (FTS) feature are
> > > > > targeted at Ignite 2.
> > > > >
> > > > > AFAIK, we want FTS support in Ignite 3, but we have NOT committed to it
> > > > > yet. By the way, we are getting requests for this from the user side,
> > > > > and FTS would definitely be a valuable feature for Ignite.
> > > > >
> > > > > It would be great if someone wants to drive it; any help will be
> > > > > appreciated.
> > > >
> > > >
> > > > On Fri, Jul 23, 2021 at 12:12 PM Atri Sharma <[email protected]> wrote:
> > > >
> > > > > Hello,
> > > > >
> > > > > > An update, please. I am working through persistence of the Lucene index
> > > > > > using Ignite Dictionary, and will be asking some questions soon.
> > > > > >
> > > > > > I had one question: where does this change go? Ignite 3?
> > > > > >
> > > > > > Also, I know we want to build native support for text searches in
> > > > > > Ignite 3. Is the work I am proposing here part of that, or will that
> > > > > > be a separate effort?
> > > > >
> > > > > > On Mon, 28 Jun 2021, 19:20 Ilya Kasnacheev <[email protected]> wrote:
> > > > >
> > > > > > Hello!
> > > > > >
> > > > > > > I think that number one is the most important one; then maybe it will
> > > > > > > see more use and other deficiencies will become more apparent, leading
> > > > > > > to more tickets and visibility.
> > > > > > >
> > > > > > > Maybe 2. and 3. will even use a different approach when persistence is
> > > > > > > implemented.
> > > > > >
> > > > > > Regards,
> > > > > > --
> > > > > > Ilya Kasnacheev
> > > > > >
> > > > > >
> > > > > > > On Mon, Jun 28, 2021 at 14:34, Atri Sharma <[email protected]> wrote:
> > > > > >
> > > > > > > Hello Again!
> > > > > > >
> > > > > > > > I have been looking into the aforementioned and here are my follow-up
> > > > > > > > thoughts:
> > > > > > > >
> > > > > > > > 1. Support persistence of Lucene indexes.
> > > > > > > > 2. https://issues.apache.org/jira/browse/IGNITE-12401 (Needs fixing of
> > > > > > > > moving partitions first)
> > > > > > > > 3. Figure out how to return scores from nodes and use them as sort
> > > > > > > > parameters on the coordinator node
> > > > > > > > (https://issues.apache.org/jira/browse/IGNITE-12291)
> > > > > > > >
> > > > > > > > Please let me know if this looks OK to make text queries functional.
> > > > > > >
> > > > > > > Atri
> > > > > > >
> > > > > > > On Mon, Jun 21, 2021 at 2:49 PM Alexei Scherbakov
> > > > > > > <[email protected]> wrote:
> > > > > > > >
> > > > > > > > Hi.
> > > > > > > >
> > > > > > > > > One of the biggest issues with text queries is the lack of support
> > > > > > > > > for persistence of Lucene indices, which makes this functionality
> > > > > > > > > useless if persistence is enabled.
> > > > > > > >
> > > > > > > > I would first take care of it.
> > > > > > > >
> > > > > > > > > On Mon, Jun 21, 2021 at 12:16, Maksim Timonin <[email protected]> wrote:
> > > > > > > >
> > > > > > > > > Hi, Atri!
> > > > > > > > >
> > > > > > > > > > You're right, there is actually a lack of support for TextQueries.
> > > > > > > > > > In the last ticket I'm working on I see some obvious issues with
> > > > > > > > > > them (no page size support, for example). I'm glad that somebody
> > > > > > > > > > wants to maintain this functionality. Thanks a lot!
> > > > > > > > > >
> > > > > > > > > > For the MergeSort algorithm there is already a patch [1]. It's
> > > > > > > > > > currently in review. This patch introduces an abstract reducer for
> > > > > > > > > > CacheQueries with 2 implementations (unordered, merge-sort).
> > > > > > > > > > TextQuery then leverages MergeSort to order results from multiple
> > > > > > > > > > nodes by score. This patch also fixes the pageSize issue I
> > > > > > > > > > mentioned before. Could you please check whether it fully matches
> > > > > > > > > > your idea? Any issues or comments are welcome.
> > > > > > > > > >
> > > > > > > > > > I've prepared this ticket because I need the MergeSort algorithm
> > > > > > > > > > for the new type of queries I'm implementing (IndexQuery, which
> > > > > > > > > > should also provide ordered results over multiple nodes). Currently
> > > > > > > > > > I'm not planning to go further with TextQuery, so if you're going
> > > > > > > > > > to support this it'll be a great contribution, I think.
> > > > > > > > >
> > > > > > > > > [1] https://issues.apache.org/jira/browse/IGNITE-14703
> > > > > > > > > [2] https://github.com/apache/ignite/pull/9081
> > > > > > > > >
> > > > > > > > >
> > > > > > > > > > On Mon, Jun 21, 2021 at 11:11 AM Atri Sharma <[email protected]> wrote:
> > > > > > > > >
> > > > > > > > > > Hi All,
> > > > > > > > > >
> > > > > > > > > > > I have been looking into our text queries support and see that
> > > > > > > > > > > it has limited community support.
> > > > > > > > > > >
> > > > > > > > > > > Therefore, I volunteer to be the maintainer of the module and
> > > > > > > > > > > work on enhancing it further.
> > > > > > > > > > >
> > > > > > > > > > > The first goal would be to move to Lucene 8.x, then work on a
> > > > > > > > > > > sorted reduce-merge across nodes. Fundamentally, this is doable
> > > > > > > > > > > since Lucene ranks documents according to their score, and
> > > > > > > > > > > documents are returned in the order of their score. Since the
> > > > > > > > > > > scoring function is homogeneous, this means that across nodes,
> > > > > > > > > > > we can compare scores and merge sort.
> > > > > > > > > >
> > > > > > > > > > Please let me know if I can take this up.
> > > > > > > > > >
> > > > > > > > > > Atri
> > > > > > > > > >
> > > > > > > > > > --
> > > > > > > > > > Regards,
> > > > > > > > > >
> > > > > > > > > > Atri
> > > > > > > > > > Apache Concerted
> > > > > > > > > >
> > > > > > > > >
> > > > > > > >
> > > > > > > >
> > > > > > > > --
> > > > > > > >
> > > > > > > > Best regards,
> > > > > > > > Alexei Scherbakov
> > > > > > >
> > > > > > > --
> > > > > > > Regards,
> > > > > > >
> > > > > > > Atri
> > > > > > > Apache Concerted
> > > > > > >
> > > > > >
> > > > >
> > > >
> > > >
> > > > --
> > > > Best regards,
> > > > Andrey V. Mashenkov
> > >
> > > --
> > > Regards,
> > >
> > > Atri
> > > Apache Concerted
> > >
> >
> >
> > --
> > Best regards,
> > Andrey V. Mashenkov
> >
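
For illustration, the sorted reduce discussed further down the thread
(each node returns hits already ordered by Lucene score, and the
coordinator merges them) could be sketched roughly as below. This is only
a minimal sketch: ScoreMerge, ScoredHit, Cursor and mergeByScore are
hypothetical names, not Ignite or Lucene API, and it assumes that scores
from different nodes are directly comparable, as the thread does.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical coordinator-side reducer; not part of Ignite or Lucene. */
final class ScoreMerge {
    /** Hypothetical result row: a cache key plus the score computed on a node. */
    static final class ScoredHit {
        final String key;
        final float score;

        ScoredHit(String key, float score) {
            this.key = key;
            this.score = score;
        }
    }

    /** Current head of one node's result stream plus the rest of that stream. */
    private static final class Cursor {
        final ScoredHit head;
        final Iterator<ScoredHit> rest;

        Cursor(ScoredHit head, Iterator<ScoredHit> rest) {
            this.head = head;
            this.rest = rest;
        }
    }

    /**
     * K-way merge of per-node lists, each already sorted by descending score.
     * Returns at most {@code limit} hits in descending score order.
     */
    static List<ScoredHit> mergeByScore(List<List<ScoredHit>> perNode, int limit) {
        // Max-heap on the head score of each node's stream.
        PriorityQueue<Cursor> heap = new PriorityQueue<>(
            Comparator.comparingDouble((Cursor c) -> c.head.score).reversed());

        for (List<ScoredHit> hits : perNode) {
            Iterator<ScoredHit> it = hits.iterator();
            if (it.hasNext())
                heap.add(new Cursor(it.next(), it));
        }

        List<ScoredHit> out = new ArrayList<>();
        while (!heap.isEmpty() && out.size() < limit) {
            Cursor c = heap.poll();
            out.add(c.head);
            // Advance the stream we just consumed from, if it has more hits.
            if (c.rest.hasNext())
                heap.add(new Cursor(c.rest.next(), c.rest));
        }
        return out;
    }
}
```

Only the head of each node's list sits in the heap at any time, so the
coordinator does not need to buffer full result sets before returning the
first page.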

-- 
Regards,

Atri
Apache Concerted
