`column = term` is definitively problematic because it creates an ambiguity when the queried column belongs to the primary key. For some queries we wouldn't know whether the user wants a primary key query using regular equality or an index query using the analyzer.
`term_matches(column, term)` seems quite clear and hard to misinterpret, but it's quite long to write and its implementation will be challenging since we would need a bunch of special casing around SelectStatement and functions. LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to evoke different behaviours to what they would have. `column LIKE :term:` seems a bit redundant compared to just using `column : term`, and we are still introducing a new symbol. I think I like `column : term` the most, because it's brief, it's similar to the equivalent Lucene's syntax, and it doesn't seem to clash with other different meanings that I can think of. On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <jbel...@gmail.com> wrote: > Hi all, > > With phase 1 of SAI wrapping up, I’d like to start the ball rolling on > aligning around phase 2 features. > > In particular, we need to nail down the syntax for doing non-exact string > matches. We have a proof of concept that includes full Lucene analyzer and > filter functionality – just the text transformation pieces, none of the > storage parts – which is the gold standard in this space. For example, the > StandardAnalyzer [1] lowercases all terms and removes stopwords (common > words like “a”, “is”, “the” that are usually not useful to search > against). Lucene also has classes that offer stemming, special case > handling for email, and many languages besides English [2]. > > What syntax should we use to express “rows whose analyzed tokens match > this search term?” > > The syntax must be clear that we want to look for this term within the > column data using the configured index with corresponding query-time > tokenization and analysis. This means that the query term is not always a > substring of the original string! Besides obvious transformations like > lowercasing, you have things like PhoneticFilter available as well. > > Here are my thoughts on some of the options: > > `column = term`. This is what the POC does today and it’s super confusing > to overload = to mean something other than exact equality. I am not a fan. > > `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but > neither the wildcarded nor unwildcarded syntax matches the semantics of > term-based search. > > `column MATCHES term`. I rather like this one, although Mike points out > that “match” has a meaning in the context of regular expressions that could > cause confusion here. > > `column CONTAINS term`. Contains is used by both Java and Python for > substring searches, so at least some users will be surprised by term-based > behavior. > > `term_matches(column, term)`. Postgresql FTS makes you use functions like > this for everything. It’s pretty clunky, and we would need to make the > amazingly hairy SelectStatement even hairier to handle “use a function > result in a predicate” like this. > > `column : term`. Inspired by Lucene’s syntax. I don’t actually hate it. > > `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to > indicate term matching. Arguably more SQL-ish than a new bare symbol > operator. > > [1] > https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html > [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html > > -- > Jonathan Ellis > co-founder, http://www.datastax.com > @spyced >