For what it's worth, I'd very much like to completely remove SASI from the codebase for 6.0. The only remaining functionality gaps at the moment are LIKE (prefix/suffix) queries and its limited tokenization capabilities, both of which already have SAI Phase 2 Jiras.
On Wed, Aug 2, 2023 at 7:20 PM Jeremiah Jordan <jerem...@datastax.com> wrote: > SASI just uses “=“ for the tokenized equality matching, which is the exact > thing this discussion is about changing/not liking. > > > On Aug 2, 2023, at 7:18 PM, J. D. Jordan <jeremiah.jor...@gmail.com> > wrote: > > > > I do not think LIKE actually applies here. LIKE is used for prefix, > contains, or suffix searches in SASI depending on the index type. > > > > This is about exact matching of tokens. > > > >> On Aug 2, 2023, at 5:53 PM, Jon Haddad <rustyrazorbl...@apache.org> > wrote: > >> > >> Certain bits of functionality also already exist on the SASI side of > things, but I'm not sure how much overlap there is. Currently, there's a > LIKE keyword that handles token matching, although it seems to have some > differences from the feature set in SAI. > >> > >> That said, there seems to be enough of an overlap that it would make > sense to consider using LIKE in the same manner, doesn't it? I think it > would be a little odd if we have different syntax for different indexes. > >> > >> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md > >> > >> I think one complication here is that there seems to be a desire, that > I very much agree with, to expose as much of the underlying flexibility of > Lucene as much as possible. If it means we use Caleb's suggestion, I'd ask > that the queries that SASI and SAI both support use the same syntax, even > if it means there's two ways of writing the same query. To use Caleb's > example, this would mean supporting both LIKE and the `expr` column. > >> > >> Jon > >> > >>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote: > >>> Here are some additional bits of prior art, if anyone finds them > useful: > >>> > >>> > >>> The Stratio Lucene Index - > >>> https://github.com/Stratio/cassandra-lucene-index#examples > >>> > >>> Stratio was the reason C* added the "expr" functionality. They embedded > >>> something similar to ElasticSearch JSON, which probably isn't my > favorite > >>> choice, but it's there. > >>> > >>> > >>> The ElasticSearch match query syntax - > >>> > https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html__;!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAf0MsxZ9$ > >>> > >>> Again, not my favorite. It's verbose, and probably too powerful for us. > >>> > >>> > >>> ElasticSearch's documentation for the basic Lucene query syntax - > >>> > https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html*query-string-syntax__;Iw!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAXEPP1sK$ > >>> > >>> One idea is to take the basic Lucene index, which it seems we already > have > >>> some support for, and feed it to "expr". This is nice for two reasons: > >>> > >>> 1.) People can just write Lucene queries if they already know how. > >>> 2.) No changes to the grammar. > >>> > >>> Lucene has distinct concepts of filtering and querying, and this is > kind of > >>> the latter. I'm not sure how, for example, we would want "expr" to > interact > >>> w/ filters on other column indexes in vanilla CQL space... > >>> > >>> > >>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie <jmcken...@apache.org> > wrote: > >>>> > >>>> `column CONTAINS term`. Contains is used by both Java and Python for > >>>> substring searches, so at least some users will be surprised by > term-based > >>>> behavior. > >>>> > >>>> I wonder whether users are in their "programming language" headspace > or in > >>>> their "querying a database" headspace when interacting with CQL? i.e. > this > >>>> would only present confusion if we expected users to be thinking in > the > >>>> idioms of their respective programming languages. If they're thinking > in > >>>> terms of SQL, MATCHES would probably end up confusing them a bit > since it > >>>> doesn't match the general structure of the MATCH operator. > >>>> > >>>> That said, I also think CONTAINS loses something important that you > allude > >>>> to here Jonathan: > >>>> > >>>> with corresponding query-time tokenization and analysis. This means > that > >>>> the query term is not always a substring of the original string! > Besides > >>>> obvious transformations like lowercasing, you have things like > >>>> PhoneticFilter available as well. > >>>> > >>>> So to me, neither MATCHES nor CONTAINS are particularly great > candidates. > >>>> > >>>> So +1 to the "I don't actually hate it" sentiment on: > >>>> > >>>> column : term`. Inspired by Lucene’s syntax > >>>> > >>>> > >>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote: > >>>> > >>>> > >>>> I have a strong preference not to use the name of an SQL operator, > since > >>>> it precludes us later providing the SQL standard operator to users. > >>>> > >>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term? > >>>> > >>>> > >>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña <adelap...@apache.org> > wrote: > >>>> > >>>> > >>>> `column = term` is definitively problematic because it creates an > >>>> ambiguity when the queried column belongs to the primary key. For some > >>>> queries we wouldn't know whether the user wants a primary key query > using > >>>> regular equality or an index query using the analyzer. > >>>> > >>>> `term_matches(column, term)` seems quite clear and hard to > misinterpret, > >>>> but it's quite long to write and its implementation will be > challenging > >>>> since we would need a bunch of special casing around SelectStatement > and > >>>> functions. > >>>> > >>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem > to > >>>> evoke different behaviours to what they would have. > >>>> > >>>> `column LIKE :term:` seems a bit redundant compared to just using > `column > >>>> : term`, and we are still introducing a new symbol. > >>>> > >>>> I think I like `column : term` the most, because it's brief, it's > similar > >>>> to the equivalent Lucene's syntax, and it doesn't seem to clash with > other > >>>> different meanings that I can think of. > >>>> > >>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <jbel...@gmail.com> > wrote: > >>>> > >>>> Hi all, > >>>> > >>>> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on > >>>> aligning around phase 2 features. > >>>> > >>>> In particular, we need to nail down the syntax for doing non-exact > string > >>>> matches. We have a proof of concept that includes full Lucene > analyzer and > >>>> filter functionality – just the text transformation pieces, none of > the > >>>> storage parts – which is the gold standard in this space. For > example, the > >>>> StandardAnalyzer [1] lowercases all terms and removes stopwords > (common > >>>> words like “a”, “is”, “the” that are usually not useful to search > >>>> against). Lucene also has classes that offer stemming, special case > >>>> handling for email, and many languages besides English [2]. > >>>> > >>>> What syntax should we use to express “rows whose analyzed tokens match > >>>> this search term?” > >>>> > >>>> The syntax must be clear that we want to look for this term within the > >>>> column data using the configured index with corresponding query-time > >>>> tokenization and analysis. This means that the query term is not > always a > >>>> substring of the original string! Besides obvious transformations > like > >>>> lowercasing, you have things like PhoneticFilter available as well. > >>>> > >>>> Here are my thoughts on some of the options: > >>>> > >>>> `column = term`. This is what the POC does today and it’s super > confusing > >>>> to overload = to mean something other than exact equality. I am not > a fan. > >>>> > >>>> `column LIKE term` or `column LIKE %term%`. The closest SQL operator, > but > >>>> neither the wildcarded nor unwildcarded syntax matches the semantics > of > >>>> term-based search. > >>>> > >>>> `column MATCHES term`. I rather like this one, although Mike points > out > >>>> that “match” has a meaning in the context of regular expressions that > could > >>>> cause confusion here. > >>>> > >>>> `column CONTAINS term`. Contains is used by both Java and Python for > >>>> substring searches, so at least some users will be surprised by > term-based > >>>> behavior. > >>>> > >>>> `term_matches(column, term)`. Postgresql FTS makes you use functions > like > >>>> this for everything. It’s pretty clunky, and we would need to make > the > >>>> amazingly hairy SelectStatement even hairier to handle “use a function > >>>> result in a predicate” like this. > >>>> > >>>> `column : term`. Inspired by Lucene’s syntax. I don’t actually hate > it. > >>>> > >>>> `column LIKE :term:`. Stick with the LIKE operator but add a new > symbol to > >>>> indicate term matching. Arguably more SQL-ish than a new bare symbol > >>>> operator. > >>>> > >>>> [1] > >>>> > https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html > >>>> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html > >>>> > >>>> -- > >>>> Jonathan Ellis > >>>> co-founder, http://www.datastax.com > >>>> @spyced > >>>> > >>>> > >>>> > >>> >