Re: Tokenization and SAI query syntax

Andrés de la Peña Mon, 24 Jul 2023 05:34:44 -0700

`column = term` is definitively problematic because it creates an ambiguity
when the queried column belongs to the primary key. For some queries we
wouldn't know whether the user wants a primary key query using regular
equality or an index query using the analyzer.


`term_matches(column, term)` seems quite clear and hard to misinterpret,
but it's quite long to write and its implementation will be challenging
since we would need a bunch of special casing around SelectStatement and
functions.

LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to
evoke different behaviours to what they would have.

`column LIKE :term:` seems a bit redundant compared to just using `column :
term`, and we are still introducing a new symbol.

I think I like `column : term` the most, because it's brief, it's similar
to the equivalent Lucene's syntax, and it doesn't seem to clash with other
different meanings that I can think of.

On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <jbel...@gmail.com> wrote:

> Hi all,
>
> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
> aligning around phase 2 features.
>
> In particular, we need to nail down the syntax for doing non-exact string
> matches.  We have a proof of concept that includes full Lucene analyzer and
> filter functionality – just the text transformation pieces, none of the
> storage parts – which is the gold standard in this space.  For example, the
> StandardAnalyzer [1] lowercases all terms and removes stopwords (common
> words like “a”, “is”, “the” that are usually not useful to search
> against).  Lucene also has classes that offer stemming, special case
> handling for email, and many languages besides English [2].
>
> What syntax should we use to express “rows whose analyzed tokens match
> this search term?”
>
> The syntax must be clear that we want to look for this term within the
> column data using the configured index with corresponding query-time
> tokenization and analysis.  This means that the query term is not always a
> substring of the original string!  Besides obvious transformations like
> lowercasing, you have things like PhoneticFilter available as well.
>
> Here are my thoughts on some of the options:
>
> `column = term`.  This is what the POC does today and it’s super confusing
> to overload = to mean something other than exact equality.  I am not a fan.
>
> `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but
> neither the wildcarded nor unwildcarded syntax matches the semantics of
> term-based search.
>
> `column MATCHES term`. I rather like this one, although Mike points out
> that “match” has a meaning in the context of regular expressions that could
> cause confusion here.
>
> `column CONTAINS term`. Contains is used by both Java and Python for
> substring searches, so at least some users will be surprised by term-based
> behavior.
>
> `term_matches(column, term)`. Postgresql FTS makes you use functions like
> this for everything.  It’s pretty clunky, and we would need to make the
> amazingly hairy SelectStatement even hairier to handle “use a function
> result in a predicate” like this.
>
> `column : term`. Inspired by Lucene’s syntax.  I don’t actually hate it.
>
> `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to
> indicate term matching.  Arguably more SQL-ish than a new bare symbol
> operator.
>
> [1]
> https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>

Re: Tokenization and SAI query syntax

Reply via email to