I do not think LIKE actually applies here. LIKE is used for prefix, contains, or suffix searches in SASI depending on the index type.
This is about exact matching of tokens. > On Aug 2, 2023, at 5:53 PM, Jon Haddad <rustyrazorbl...@apache.org> wrote: > > Certain bits of functionality also already exist on the SASI side of things, > but I'm not sure how much overlap there is. Currently, there's a LIKE > keyword that handles token matching, although it seems to have some > differences from the feature set in SAI. > > That said, there seems to be enough of an overlap that it would make sense to > consider using LIKE in the same manner, doesn't it? I think it would be a > little odd if we have different syntax for different indexes. > > https://github.com/apache/cassandra/blob/trunk/doc/SASI.md > > I think one complication here is that there seems to be a desire, that I very > much agree with, to expose as much of the underlying flexibility of Lucene as > much as possible. If it means we use Caleb's suggestion, I'd ask that the > queries that SASI and SAI both support use the same syntax, even if it means > there's two ways of writing the same query. To use Caleb's example, this > would mean supporting both LIKE and the `expr` column. > > Jon > >> On 2023/08/01 19:17:11 Caleb Rackliffe wrote: >> Here are some additional bits of prior art, if anyone finds them useful: >> >> >> The Stratio Lucene Index - >> https://github.com/Stratio/cassandra-lucene-index#examples >> >> Stratio was the reason C* added the "expr" functionality. They embedded >> something similar to ElasticSearch JSON, which probably isn't my favorite >> choice, but it's there. >> >> >> The ElasticSearch match query syntax - >> https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html__;!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAf0MsxZ9$ >> >> >> Again, not my favorite. It's verbose, and probably too powerful for us. >> >> >> ElasticSearch's documentation for the basic Lucene query syntax - >> https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html*query-string-syntax__;Iw!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAXEPP1sK$ >> >> >> One idea is to take the basic Lucene index, which it seems we already have >> some support for, and feed it to "expr". This is nice for two reasons: >> >> 1.) People can just write Lucene queries if they already know how. >> 2.) No changes to the grammar. >> >> Lucene has distinct concepts of filtering and querying, and this is kind of >> the latter. I'm not sure how, for example, we would want "expr" to interact >> w/ filters on other column indexes in vanilla CQL space... >> >> >>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie <jmcken...@apache.org> wrote: >>> >>> `column CONTAINS term`. Contains is used by both Java and Python for >>> substring searches, so at least some users will be surprised by term-based >>> behavior. >>> >>> I wonder whether users are in their "programming language" headspace or in >>> their "querying a database" headspace when interacting with CQL? i.e. this >>> would only present confusion if we expected users to be thinking in the >>> idioms of their respective programming languages. If they're thinking in >>> terms of SQL, MATCHES would probably end up confusing them a bit since it >>> doesn't match the general structure of the MATCH operator. >>> >>> That said, I also think CONTAINS loses something important that you allude >>> to here Jonathan: >>> >>> with corresponding query-time tokenization and analysis. This means that >>> the query term is not always a substring of the original string! Besides >>> obvious transformations like lowercasing, you have things like >>> PhoneticFilter available as well. >>> >>> So to me, neither MATCHES nor CONTAINS are particularly great candidates. >>> >>> So +1 to the "I don't actually hate it" sentiment on: >>> >>> column : term`. Inspired by Lucene’s syntax >>> >>> >>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote: >>> >>> >>> I have a strong preference not to use the name of an SQL operator, since >>> it precludes us later providing the SQL standard operator to users. >>> >>> What about CONTAINS TOKEN term? Or CONTAINS TERM term? >>> >>> >>>> On 24 Jul 2023, at 13:34, Andrés de la Peña <adelap...@apache.org> wrote: >>> >>> >>> `column = term` is definitively problematic because it creates an >>> ambiguity when the queried column belongs to the primary key. For some >>> queries we wouldn't know whether the user wants a primary key query using >>> regular equality or an index query using the analyzer. >>> >>> `term_matches(column, term)` seems quite clear and hard to misinterpret, >>> but it's quite long to write and its implementation will be challenging >>> since we would need a bunch of special casing around SelectStatement and >>> functions. >>> >>> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to >>> evoke different behaviours to what they would have. >>> >>> `column LIKE :term:` seems a bit redundant compared to just using `column >>> : term`, and we are still introducing a new symbol. >>> >>> I think I like `column : term` the most, because it's brief, it's similar >>> to the equivalent Lucene's syntax, and it doesn't seem to clash with other >>> different meanings that I can think of. >>> >>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <jbel...@gmail.com> wrote: >>> >>> Hi all, >>> >>> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on >>> aligning around phase 2 features. >>> >>> In particular, we need to nail down the syntax for doing non-exact string >>> matches. We have a proof of concept that includes full Lucene analyzer and >>> filter functionality – just the text transformation pieces, none of the >>> storage parts – which is the gold standard in this space. For example, the >>> StandardAnalyzer [1] lowercases all terms and removes stopwords (common >>> words like “a”, “is”, “the” that are usually not useful to search >>> against). Lucene also has classes that offer stemming, special case >>> handling for email, and many languages besides English [2]. >>> >>> What syntax should we use to express “rows whose analyzed tokens match >>> this search term?” >>> >>> The syntax must be clear that we want to look for this term within the >>> column data using the configured index with corresponding query-time >>> tokenization and analysis. This means that the query term is not always a >>> substring of the original string! Besides obvious transformations like >>> lowercasing, you have things like PhoneticFilter available as well. >>> >>> Here are my thoughts on some of the options: >>> >>> `column = term`. This is what the POC does today and it’s super confusing >>> to overload = to mean something other than exact equality. I am not a fan. >>> >>> `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but >>> neither the wildcarded nor unwildcarded syntax matches the semantics of >>> term-based search. >>> >>> `column MATCHES term`. I rather like this one, although Mike points out >>> that “match” has a meaning in the context of regular expressions that could >>> cause confusion here. >>> >>> `column CONTAINS term`. Contains is used by both Java and Python for >>> substring searches, so at least some users will be surprised by term-based >>> behavior. >>> >>> `term_matches(column, term)`. Postgresql FTS makes you use functions like >>> this for everything. It’s pretty clunky, and we would need to make the >>> amazingly hairy SelectStatement even hairier to handle “use a function >>> result in a predicate” like this. >>> >>> `column : term`. Inspired by Lucene’s syntax. I don’t actually hate it. >>> >>> `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to >>> indicate term matching. Arguably more SQL-ish than a new bare symbol >>> operator. >>> >>> [1] >>> https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html >>> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html >>> >>> -- >>> Jonathan Ellis >>> co-founder, http://www.datastax.com >>> @spyced >>> >>> >>> >>