Re: Tokenization and SAI query syntax

Caleb Rackliffe Wed, 02 Aug 2023 20:56:42 -0700

For what it's worth, I'd very much like to completely remove SASI from the
codebase for 6.0. The only remaining functionality gaps at the moment are
LIKE (prefix/suffix) queries and its limited tokenization
capabilities, both of which already have SAI Phase 2 Jiras.


On Wed, Aug 2, 2023 at 7:20 PM Jeremiah Jordan <[email protected]>
wrote:

> SASI just uses “=” for the tokenized equality matching, which is the exact
> thing this discussion is about changing/not liking.
>
> > On Aug 2, 2023, at 7:18 PM, J. D. Jordan <[email protected]>
> wrote:
> >
> > I do not think LIKE actually applies here. LIKE is used for prefix,
> contains, or suffix searches in SASI depending on the index type.
> >
> > This is about exact matching of tokens.
> >
> >> On Aug 2, 2023, at 5:53 PM, Jon Haddad <[email protected]>
> wrote:
> >>
> >> Certain bits of functionality also already exist on the SASI side of
> things, but I'm not sure how much overlap there is.  Currently, there's a
> LIKE keyword that handles token matching, although it seems to have some
> differences from the feature set in SAI.
> >>
> >> That said, there seems to be enough of an overlap that it would make
> sense to consider using LIKE in the same manner, doesn't it?  I think it
> would be a little odd if we have different syntax for different indexes.
> >>
> >> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
> >>
> >> I think one complication here is that there seems to be a desire, that
> I very much agree with, to expose as much of the underlying flexibility of
> Lucene as possible.  If it means we use Caleb's suggestion, I'd ask
> that the queries that SASI and SAI both support use the same syntax, even
> if it means there are two ways of writing the same query.  To use Caleb's
> example, this would mean supporting both LIKE and the `expr` column.
> >>
> >> Jon
> >>
> >>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
> >>> Here are some additional bits of prior art, if anyone finds them
> useful:
> >>>
> >>>
> >>> The Stratio Lucene Index -
> >>> https://github.com/Stratio/cassandra-lucene-index#examples
> >>>
> >>> Stratio was the reason C* added the "expr" functionality. They embedded
> >>> something similar to ElasticSearch JSON, which probably isn't my
> favorite
> >>> choice, but it's there.
> >>>
> >>>
> >>> The ElasticSearch match query syntax -
> >>>
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
> >>>
> >>> Again, not my favorite. It's verbose, and probably too powerful for us.
> >>>
> >>>
> >>> ElasticSearch's documentation for the basic Lucene query syntax -
> >>>
> https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html#query-string-syntax
> >>>
> >>> One idea is to take the basic Lucene index, which it seems we already
> have
> >>> some support for, and feed it to "expr". This is nice for two reasons:
> >>>
> >>> 1.) People can just write Lucene queries if they already know how.
> >>> 2.) No changes to the grammar.
> >>>
> >>> Lucene has distinct concepts of filtering and querying, and this is
> kind of
> >>> the latter. I'm not sure how, for example, we would want "expr" to
> interact
> >>> w/ filters on other column indexes in vanilla CQL space...
> >>>
> >>>
> >>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie <[email protected]>
> wrote:
> >>>>
> >>>> `column CONTAINS term`. Contains is used by both Java and Python for
> >>>> substring searches, so at least some users will be surprised by
> term-based
> >>>> behavior.
> >>>>
> >>>> I wonder whether users are in their "programming language" headspace
> or in
> >>>> their "querying a database" headspace when interacting with CQL? i.e.
> this
> >>>> would only present confusion if we expected users to be thinking in
> the
> >>>> idioms of their respective programming languages. If they're thinking
> in
> >>>> terms of SQL, MATCHES would probably end up confusing them a bit
> since it
> >>>> doesn't match the general structure of the MATCH operator.
> >>>>
> >>>> That said, I also think CONTAINS loses something important that you
> allude
> >>>> to here Jonathan:
> >>>>
> >>>> with corresponding query-time tokenization and analysis.  This means
> that
> >>>> the query term is not always a substring of the original string!
> Besides
> >>>> obvious transformations like lowercasing, you have things like
> >>>> PhoneticFilter available as well.
> >>>>
> >>>> So to me, neither MATCHES nor CONTAINS is a particularly great
> candidate.
> >>>>
> >>>> So +1 to the "I don't actually hate it" sentiment on:
> >>>>
> >>>> `column : term`. Inspired by Lucene’s syntax
> >>>>
> >>>>
> >>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
> >>>>
> >>>>
> >>>> I have a strong preference not to use the name of an SQL operator,
> since
> >>>> it precludes us later providing the SQL standard operator to users.
> >>>>
> >>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
> >>>>
> >>>>
> >>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña <[email protected]>
> wrote:
> >>>>
> >>>>
> >>>> `column = term` is definitely problematic because it creates an
> >>>> ambiguity when the queried column belongs to the primary key. For some
> >>>> queries we wouldn't know whether the user wants a primary key query
> using
> >>>> regular equality or an index query using the analyzer.
> >>>>
> >>>> `term_matches(column, term)` seems quite clear and hard to
> misinterpret,
> >>>> but it's quite long to write and its implementation will be
> challenging
> >>>> since we would need a bunch of special casing around SelectStatement
> and
> >>>> functions.
> >>>>
> >>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem
> to
> >>>> evoke different behaviours to what they would have.
> >>>>
> >>>> `column LIKE :term:` seems a bit redundant compared to just using
> `column
> >>>> : term`, and we are still introducing a new symbol.
> >>>>
> >>>> I think I like `column : term` the most, because it's brief, it's
> similar
> >>>> to the equivalent Lucene syntax, and it doesn't seem to clash with
> other
> >>>> different meanings that I can think of.
> >>>>
> >>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <[email protected]>
> wrote:
> >>>>
> >>>> Hi all,
> >>>>
> >>>> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
> >>>> aligning around phase 2 features.
> >>>>
> >>>> In particular, we need to nail down the syntax for doing non-exact
> string
> >>>> matches.  We have a proof of concept that includes full Lucene
> analyzer and
> >>>> filter functionality – just the text transformation pieces, none of
> the
> >>>> storage parts – which is the gold standard in this space.  For
> example, the
> >>>> StandardAnalyzer [1] lowercases all terms and removes stopwords
> (common
> >>>> words like “a”, “is”, “the” that are usually not useful to search
> >>>> against).  Lucene also has classes that offer stemming, special case
> >>>> handling for email, and many languages besides English [2].
> >>>>
> >>>> What syntax should we use to express “rows whose analyzed tokens match
> >>>> this search term?”
> >>>>
> >>>> The syntax must be clear that we want to look for this term within the
> >>>> column data using the configured index with corresponding query-time
> >>>> tokenization and analysis.  This means that the query term is not
> always a
> >>>> substring of the original string!  Besides obvious transformations
> like
> >>>> lowercasing, you have things like PhoneticFilter available as well.
> >>>>
> >>>> Here are my thoughts on some of the options:
> >>>>
> >>>> `column = term`.  This is what the POC does today and it’s super
> confusing
> >>>> to overload = to mean something other than exact equality.  I am not
> a fan.
> >>>>
> >>>> `column LIKE term` or `column LIKE %term%`. The closest SQL operator,
> but
> >>>> neither the wildcarded nor unwildcarded syntax matches the semantics
> of
> >>>> term-based search.
> >>>>
> >>>> `column MATCHES term`. I rather like this one, although Mike points
> out
> >>>> that “match” has a meaning in the context of regular expressions that
> could
> >>>> cause confusion here.
> >>>>
> >>>> `column CONTAINS term`. Contains is used by both Java and Python for
> >>>> substring searches, so at least some users will be surprised by
> term-based
> >>>> behavior.
> >>>>
> >>>> `term_matches(column, term)`. PostgreSQL FTS makes you use functions
> like
> >>>> this for everything.  It’s pretty clunky, and we would need to make
> the
> >>>> amazingly hairy SelectStatement even hairier to handle “use a function
> >>>> result in a predicate” like this.
> >>>>
> >>>> `column : term`. Inspired by Lucene’s syntax.  I don’t actually hate
> it.
> >>>>
> >>>> `column LIKE :term:`. Stick with the LIKE operator but add a new
> symbol to
> >>>> indicate term matching.  Arguably more SQL-ish than a new bare symbol
> >>>> operator.
> >>>>
> >>>> [1]
> >>>>
> https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
> >>>> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html
> >>>>
> >>>> --
> >>>> Jonathan Ellis
> >>>> co-founder, http://www.datastax.com
> >>>> @spyced
> >>>>
> >>>>
> >>>>
> >>>
>
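
For readers skimming the thread, the point in the original message that an analyzed query term is not always a substring of the stored string can be shown with a minimal Python sketch. This is not Lucene and not the SAI proof of concept: `STOPWORDS`, `analyze`, and the trailing-"s" rule are hypothetical stand-ins for an analyzer chain, and `term_matches` only borrows its name from the function-style option discussed above.

```python
import re

# Hypothetical stopword list; Lucene's analyzers ship their own.
STOPWORDS = {"a", "an", "is", "the"}


def analyze(text):
    """Toy analyzer: tokenize, lowercase, drop stopwords, strip a trailing 's'.

    The plural stripping is a crude stand-in for stemming, used only to
    show a transformation that changes the term itself.
    """
    tokens = re.findall(r"[A-Za-z0-9]+", text.lower())
    terms = []
    for tok in tokens:
        if tok in STOPWORDS:
            continue
        if len(tok) > 3 and tok.endswith("s"):
            tok = tok[:-1]
        terms.append(tok)
    return terms


def term_matches(column_value, query):
    """True if every analyzed query term appears among the analyzed column terms."""
    indexed = set(analyze(column_value))
    query_terms = analyze(query)
    # A query made only of stopwords analyzes to nothing and matches nothing.
    return bool(query_terms) and all(t in indexed for t in query_terms)


if __name__ == "__main__":
    value = "A lazy dog"
    print(term_matches(value, "Dogs"))  # analyzed term match
    print("Dogs" in value)              # plain substring test
    print(value == "Dogs")              # exact equality, what `=` normally means
```

Under these assumptions, `term_matches("A lazy dog", "Dogs")` is true even though "Dogs" is neither equal to nor a substring of the stored value, which is the gap between term-based matching and the usual reading of `=`, `LIKE`, or `CONTAINS`.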
