Re: Tokenization and SAI query syntax

Jon Haddad Wed, 02 Aug 2023 15:51:39 -0700

Certain bits of functionality also already exist on the SASI side of things, 
but I'm not sure how much overlap there is.  Currently, there's a LIKE keyword 
that handles token matching, although it seems to have some differences from 
the feature set in SAI.


That said, there seems to be enough of an overlap that it would make sense to 
consider using LIKE in the same manner, doesn't it?  I think it would be a 
little odd if we have different syntax for different indexes.  

https://github.com/apache/cassandra/blob/trunk/doc/SASI.md

I think one complication here is that there seems to be a desire, that I very 
much agree with, to expose as much of the underlying flexibility of Lucene as 
possible.  If it means we use Caleb's suggestion, I'd ask that the queries 
that SASI and SAI both support use the same syntax, even if it means there 
are two ways of writing the same query.  To use Caleb's example, this would 
mean supporting both LIKE and the `expr` column.  

Jon

On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
> Here are some additional bits of prior art, if anyone finds them useful:
> 
> 
> The Stratio Lucene Index -
> https://github.com/Stratio/cassandra-lucene-index#examples
> 
> Stratio was the reason C* added the "expr" functionality. They embedded
> something similar to ElasticSearch JSON, which probably isn't my favorite
> choice, but it's there.
> 
> 
> The ElasticSearch match query syntax -
> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
> 
> Again, not my favorite. It's verbose, and probably too powerful for us.
> 
> 
> ElasticSearch's documentation for the basic Lucene query syntax -
> https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html#query-string-syntax
> 
> One idea is to take the basic Lucene query syntax, which it seems we already
> have some support for, and feed it to "expr". This is nice for two reasons:
> 
> 1.) People can just write Lucene queries if they already know how.
> 2.) No changes to the grammar.
> 
> Lucene has distinct concepts of filtering and querying, and this is kind of
> the latter. I'm not sure how, for example, we would want "expr" to interact
> w/ filters on other column indexes in vanilla CQL space...
> 
> 
> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie <[email protected]> wrote:
> 
> > `column CONTAINS term`. Contains is used by both Java and Python for
> > substring searches, so at least some users will be surprised by term-based
> > behavior.
> >
> > I wonder whether users are in their "programming language" headspace or in
> > their "querying a database" headspace when interacting with CQL? i.e. this
> > would only present confusion if we expected users to be thinking in the
> > idioms of their respective programming languages. If they're thinking in
> > terms of SQL, MATCHES would probably end up confusing them a bit since it
> > doesn't match the general structure of the MATCH operator.
> >
> > That said, I also think CONTAINS loses something important that you allude
> > to here Jonathan:
> >
> > with corresponding query-time tokenization and analysis.  This means that
> > the query term is not always a substring of the original string!  Besides
> > obvious transformations like lowercasing, you have things like
> > PhoneticFilter available as well.
> >
> > So to me, neither MATCHES nor CONTAINS is a particularly great candidate.
> >
> > So +1 to the "I don't actually hate it" sentiment on:
> >
> > `column : term`. Inspired by Lucene’s syntax
> >
> >
> > On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
> >
> >
> > I have a strong preference not to use the name of an SQL operator, since
> > it precludes us later providing the SQL standard operator to users.
> >
> > What about CONTAINS TOKEN term? Or CONTAINS TERM term?
> >
> >
> > On 24 Jul 2023, at 13:34, Andrés de la Peña <[email protected]> wrote:
> >
> > 
> > `column = term` is definitely problematic because it creates an
> > ambiguity when the queried column belongs to the primary key. For some
> > queries we wouldn't know whether the user wants a primary key query using
> > regular equality or an index query using the analyzer.
> >
> > `term_matches(column, term)` seems quite clear and hard to misinterpret,
> > but it's quite long to write and its implementation will be challenging
> > since we would need a bunch of special casing around SelectStatement and
> > functions.
> >
> > LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to
> > evoke behaviours different from the ones they would actually have.
> >
> > `column LIKE :term:` seems a bit redundant compared to just using `column
> > : term`, and we are still introducing a new symbol.
> >
> > I think I like `column : term` the most, because it's brief, it's similar
> > to the equivalent Lucene syntax, and it doesn't seem to clash with other
> > meanings that I can think of.
> >
> > On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <[email protected]> wrote:
> >
> > Hi all,
> >
> > With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
> > aligning around phase 2 features.
> >
> > In particular, we need to nail down the syntax for doing non-exact string
> > matches.  We have a proof of concept that includes full Lucene analyzer and
> > filter functionality – just the text transformation pieces, none of the
> > storage parts – which is the gold standard in this space.  For example, the
> > StandardAnalyzer [1] lowercases all terms and removes stopwords (common
> > words like “a”, “is”, “the” that are usually not useful to search
> > against).  Lucene also has classes that offer stemming, special case
> > handling for email, and many languages besides English [2].
> >
> > What syntax should we use to express “rows whose analyzed tokens match
> > this search term?”
> >
> > The syntax must be clear that we want to look for this term within the
> > column data using the configured index with corresponding query-time
> > tokenization and analysis.  This means that the query term is not always a
> > substring of the original string!  Besides obvious transformations like
> > lowercasing, you have things like PhoneticFilter available as well.
> >
> > Here are my thoughts on some of the options:
> >
> > `column = term`.  This is what the POC does today and it’s super confusing
> > to overload = to mean something other than exact equality.  I am not a fan.
> >
> > `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but
> > neither the wildcarded nor unwildcarded syntax matches the semantics of
> > term-based search.
> >
> > `column MATCHES term`. I rather like this one, although Mike points out
> > that “match” has a meaning in the context of regular expressions that could
> > cause confusion here.
> >
> > `column CONTAINS term`. Contains is used by both Java and Python for
> > substring searches, so at least some users will be surprised by term-based
> > behavior.
> >
> > `term_matches(column, term)`. Postgresql FTS makes you use functions like
> > this for everything.  It’s pretty clunky, and we would need to make the
> > amazingly hairy SelectStatement even hairier to handle “use a function
> > result in a predicate” like this.
> >
> > `column : term`. Inspired by Lucene’s syntax.  I don’t actually hate it.
> >
> > `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to
> > indicate term matching.  Arguably more SQL-ish than a new bare symbol
> > operator.
> >
> > [1]
> > https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
> > [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html
> >
> > --
> > Jonathan Ellis
> > co-founder, http://www.datastax.com
> > @spyced
> >
> >
> >
>
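
For anyone skimming the thread: the point in the original message, that with 
query-time analysis the query term is not always a substring of the original 
string, can be illustrated with a rough sketch.  This is a toy analyzer in 
Python, not Lucene and not the SAI proof of concept; the stopword list and 
the names analyze, term_matches and substring_matches are made up for 
illustration only.

```python
import re

# Hypothetical stopword list for this sketch; a real Lucene analyzer is
# configured differently.
STOPWORDS = {"a", "an", "is", "the", "of"}


def analyze(text):
    """Toy analyzer: split on non-alphanumerics, lowercase, drop stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]


def term_matches(column_value, query):
    """True if every analyzed query token is among the column's analyzed tokens."""
    column_tokens = set(analyze(column_value))
    query_tokens = analyze(query)
    return bool(query_tokens) and all(t in column_tokens for t in query_tokens)


def substring_matches(column_value, query):
    """Plain substring search, the behavior Java/Python 'contains' suggests."""
    return query in column_value


row = "The Quick Brown Fox"

# Term match succeeds after lowercasing, although "QUICK" is not a substring.
print(term_matches(row, "QUICK"), substring_matches(row, "QUICK"))

# Substring match succeeds inside "Brown", but "row" is not a token.
print(term_matches(row, "row"), substring_matches(row, "row"))
```

The two searches disagree in both directions here, which is the surprise 
that operators like CONTAINS or LIKE could cause for term-based matching.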
