Tokenization and SAI query syntax

Jonathan Ellis Mon, 24 Jul 2023 05:13:03 -0700

Hi all,

With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
aligning around phase 2 features.

In particular, we need to nail down the syntax for doing non-exact string
matches. We have a proof of concept that includes full Lucene analyzer and
filter functionality – just the text transformation pieces, none of the
storage parts – which is the gold standard in this space. For example, the
StandardAnalyzer [1] lowercases all terms and removes stopwords (common
words like “a”, “is”, “the” that are usually not useful to search
against). Lucene also has classes that offer stemming, special case
handling for email, and many languages besides English [2].

What syntax should we use to express “rows whose analyzed tokens match this
search term?”

The syntax must be clear that we want to look for this term within the
column data using the configured index with corresponding query-time
tokenization and analysis. This means that the query term is not always a
substring of the original string! Besides obvious transformations like
lowercasing, you have things like PhoneticFilter available as well.

Here are my thoughts on some of the options:

`column = term`. This is what the POC does today and it’s super confusing
to overload = to mean something other than exact equality. I am not a fan.

`column LIKE term` or `column LIKE %term%`. The closest SQL operator, but
neither the wildcarded nor unwildcarded syntax matches the semantics of
term-based search.

`column MATCHES term`. I rather like this one, although Mike points out
that “match” has a meaning in the context of regular expressions that could
cause confusion here.

`column CONTAINS term`. Contains is used by both Java and Python for
substring searches, so at least some users will be surprised by term-based
behavior.

`term_matches(column, term)`. Postgresql FTS makes you use functions like
this for everything. It’s pretty clunky, and we would need to make the
amazingly hairy SelectStatement even hairier to handle “use a function
result in a predicate” like this.

`column : term`. Inspired by Lucene’s syntax. I don’t actually hate it.

`column LIKE :term:`. Stick with the LIKE operator but add a new symbol to
indicate term matching. Arguably more SQL-ish than a new bare symbol
operator.

[1]
https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
[2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Tokenization and SAI query syntax

Reply via email to