Why not start with SQLish operators supported by many databases (LIKE and CONTAINS)?
On Mon, Aug 7, 2023 at 10:01 PM J. D. Jordan <jeremiah.jor...@gmail.com> wrote: > I am also -1 on directly exposing lucene like syntax here. Besides being > ugly, SAI is not lucene, I do not think we should start using lucene syntax > for it, it will make people think they can do everything else lucene allows. > > On Aug 7, 2023, at 5:13 AM, Benedict <bened...@apache.org> wrote: > > > I’m strongly opposed to : > > It is very dissimilar to our current operators. CQL is already not the > prettiest language, but let’s not make it a total mish mash. > > > > On 7 Aug 2023, at 10:59, Mike Adamson <madam...@datastax.com> wrote: > > > I am also in agreement with 'column : token' in that 'I don't hate it' but > I'd like to offer an alternative to this in 'column HAS token'. HAS is > currently not a keyword that we use so wouldn't cause any brain conflicts. > > While I don't hate ':' I have a particular dislike of the lucene search > syntax because of its terseness and lack of easy readability. > > Saying that, I'm happy to do with ':' if that is the decision. > > On Fri, 4 Aug 2023 at 00:23, Jon Haddad <rustyrazorbl...@apache.org> > wrote: > >> Assuming SAI is a superset of SASI, and we were to set up something so >> that SASI indexes auto convert to SAI, this gives even more weight to my >> point regarding how differing behavior for the same syntax can lead to >> issues. Imo the best case scenario results in the user not even noticing >> their indexes have changed. >> >> An (maybe better?) alternative is to add a flag to the index >> configuration for "compatibility mod", which might address the concerns >> around using an equality operator when it actually is a partial match. >> >> For what it's worth, I'm in agreement that = should mean full equality >> and not token match. >> >> On 2023/08/03 03:56:23 Caleb Rackliffe wrote: >> > For what it's worth, I'd very much like to completely remove SASI from >> the >> > codebase for 6.0. The only remaining functionality gaps at the moment >> are >> > LIKE (prefix/suffix) queries and its limited tokenization >> > capabilities, both of which already have SAI Phase 2 Jiras. >> > >> > On Wed, Aug 2, 2023 at 7:20 PM Jeremiah Jordan <jerem...@datastax.com> >> > wrote: >> > >> > > SASI just uses “=“ for the tokenized equality matching, which is the >> exact >> > > thing this discussion is about changing/not liking. >> > > >> > > > On Aug 2, 2023, at 7:18 PM, J. D. Jordan <jeremiah.jor...@gmail.com >> > >> > > wrote: >> > > > >> > > > I do not think LIKE actually applies here. LIKE is used for prefix, >> > > contains, or suffix searches in SASI depending on the index type. >> > > > >> > > > This is about exact matching of tokens. >> > > > >> > > >> On Aug 2, 2023, at 5:53 PM, Jon Haddad <rustyrazorbl...@apache.org >> > >> > > wrote: >> > > >> >> > > >> Certain bits of functionality also already exist on the SASI side >> of >> > > things, but I'm not sure how much overlap there is. Currently, >> there's a >> > > LIKE keyword that handles token matching, although it seems to have >> some >> > > differences from the feature set in SAI. >> > > >> >> > > >> That said, there seems to be enough of an overlap that it would >> make >> > > sense to consider using LIKE in the same manner, doesn't it? I think >> it >> > > would be a little odd if we have different syntax for different >> indexes. >> > > >> >> > > >> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md >> > > >> >> > > >> I think one complication here is that there seems to be a desire, >> that >> > > I very much agree with, to expose as much of the underlying >> flexibility of >> > > Lucene as much as possible. If it means we use Caleb's suggestion, >> I'd ask >> > > that the queries that SASI and SAI both support use the same syntax, >> even >> > > if it means there's two ways of writing the same query. To use >> Caleb's >> > > example, this would mean supporting both LIKE and the `expr` column. >> > > >> >> > > >> Jon >> > > >> >> > > >>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote: >> > > >>> Here are some additional bits of prior art, if anyone finds them >> > > useful: >> > > >>> >> > > >>> >> > > >>> The Stratio Lucene Index - >> > > >>> https://github.com/Stratio/cassandra-lucene-index#examples >> > > >>> >> > > >>> Stratio was the reason C* added the "expr" functionality. They >> embedded >> > > >>> something similar to ElasticSearch JSON, which probably isn't my >> > > favorite >> > > >>> choice, but it's there. >> > > >>> >> > > >>> >> > > >>> The ElasticSearch match query syntax - >> > > >>> >> > > >> https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html__;!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAf0MsxZ9$ >> > > >>> >> > > >>> Again, not my favorite. It's verbose, and probably too powerful >> for us. >> > > >>> >> > > >>> >> > > >>> ElasticSearch's documentation for the basic Lucene query syntax - >> > > >>> >> > > >> https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html*query-string-syntax__;Iw!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAXEPP1sK$ >> > > >>> >> > > >>> One idea is to take the basic Lucene index, which it seems we >> already >> > > have >> > > >>> some support for, and feed it to "expr". This is nice for two >> reasons: >> > > >>> >> > > >>> 1.) People can just write Lucene queries if they already know how. >> > > >>> 2.) No changes to the grammar. >> > > >>> >> > > >>> Lucene has distinct concepts of filtering and querying, and this >> is >> > > kind of >> > > >>> the latter. I'm not sure how, for example, we would want "expr" to >> > > interact >> > > >>> w/ filters on other column indexes in vanilla CQL space... >> > > >>> >> > > >>> >> > > >>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie < >> jmcken...@apache.org> >> > > wrote: >> > > >>>> >> > > >>>> `column CONTAINS term`. Contains is used by both Java and Python >> for >> > > >>>> substring searches, so at least some users will be surprised by >> > > term-based >> > > >>>> behavior. >> > > >>>> >> > > >>>> I wonder whether users are in their "programming language" >> headspace >> > > or in >> > > >>>> their "querying a database" headspace when interacting with CQL? >> i.e. >> > > this >> > > >>>> would only present confusion if we expected users to be thinking >> in >> > > the >> > > >>>> idioms of their respective programming languages. If they're >> thinking >> > > in >> > > >>>> terms of SQL, MATCHES would probably end up confusing them a bit >> > > since it >> > > >>>> doesn't match the general structure of the MATCH operator. >> > > >>>> >> > > >>>> That said, I also think CONTAINS loses something important that >> you >> > > allude >> > > >>>> to here Jonathan: >> > > >>>> >> > > >>>> with corresponding query-time tokenization and analysis. This >> means >> > > that >> > > >>>> the query term is not always a substring of the original string! >> > > Besides >> > > >>>> obvious transformations like lowercasing, you have things like >> > > >>>> PhoneticFilter available as well. >> > > >>>> >> > > >>>> So to me, neither MATCHES nor CONTAINS are particularly great >> > > candidates. >> > > >>>> >> > > >>>> So +1 to the "I don't actually hate it" sentiment on: >> > > >>>> >> > > >>>> column : term`. Inspired by Lucene’s syntax >> > > >>>> >> > > >>>> >> > > >>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote: >> > > >>>> >> > > >>>> >> > > >>>> I have a strong preference not to use the name of an SQL >> operator, >> > > since >> > > >>>> it precludes us later providing the SQL standard operator to >> users. >> > > >>>> >> > > >>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term? >> > > >>>> >> > > >>>> >> > > >>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña < >> adelap...@apache.org> >> > > wrote: >> > > >>>> >> > > >>>> >> > > >>>> `column = term` is definitively problematic because it creates an >> > > >>>> ambiguity when the queried column belongs to the primary key. >> For some >> > > >>>> queries we wouldn't know whether the user wants a primary key >> query >> > > using >> > > >>>> regular equality or an index query using the analyzer. >> > > >>>> >> > > >>>> `term_matches(column, term)` seems quite clear and hard to >> > > misinterpret, >> > > >>>> but it's quite long to write and its implementation will be >> > > challenging >> > > >>>> since we would need a bunch of special casing around >> SelectStatement >> > > and >> > > >>>> functions. >> > > >>>> >> > > >>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they >> seem >> > > to >> > > >>>> evoke different behaviours to what they would have. >> > > >>>> >> > > >>>> `column LIKE :term:` seems a bit redundant compared to just using >> > > `column >> > > >>>> : term`, and we are still introducing a new symbol. >> > > >>>> >> > > >>>> I think I like `column : term` the most, because it's brief, it's >> > > similar >> > > >>>> to the equivalent Lucene's syntax, and it doesn't seem to clash >> with >> > > other >> > > >>>> different meanings that I can think of. >> > > >>>> >> > > >>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <jbel...@gmail.com >> > >> > > wrote: >> > > >>>> >> > > >>>> Hi all, >> > > >>>> >> > > >>>> With phase 1 of SAI wrapping up, I’d like to start the ball >> rolling on >> > > >>>> aligning around phase 2 features. >> > > >>>> >> > > >>>> In particular, we need to nail down the syntax for doing >> non-exact >> > > string >> > > >>>> matches. We have a proof of concept that includes full Lucene >> > > analyzer and >> > > >>>> filter functionality – just the text transformation pieces, none >> of >> > > the >> > > >>>> storage parts – which is the gold standard in this space. For >> > > example, the >> > > >>>> StandardAnalyzer [1] lowercases all terms and removes stopwords >> > > (common >> > > >>>> words like “a”, “is”, “the” that are usually not useful to search >> > > >>>> against). Lucene also has classes that offer stemming, special >> case >> > > >>>> handling for email, and many languages besides English [2]. >> > > >>>> >> > > >>>> What syntax should we use to express “rows whose analyzed tokens >> match >> > > >>>> this search term?” >> > > >>>> >> > > >>>> The syntax must be clear that we want to look for this term >> within the >> > > >>>> column data using the configured index with corresponding >> query-time >> > > >>>> tokenization and analysis. This means that the query term is not >> > > always a >> > > >>>> substring of the original string! Besides obvious >> transformations >> > > like >> > > >>>> lowercasing, you have things like PhoneticFilter available as >> well. >> > > >>>> >> > > >>>> Here are my thoughts on some of the options: >> > > >>>> >> > > >>>> `column = term`. This is what the POC does today and it’s super >> > > confusing >> > > >>>> to overload = to mean something other than exact equality. I am >> not >> > > a fan. >> > > >>>> >> > > >>>> `column LIKE term` or `column LIKE %term%`. The closest SQL >> operator, >> > > but >> > > >>>> neither the wildcarded nor unwildcarded syntax matches the >> semantics >> > > of >> > > >>>> term-based search. >> > > >>>> >> > > >>>> `column MATCHES term`. I rather like this one, although Mike >> points >> > > out >> > > >>>> that “match” has a meaning in the context of regular expressions >> that >> > > could >> > > >>>> cause confusion here. >> > > >>>> >> > > >>>> `column CONTAINS term`. Contains is used by both Java and Python >> for >> > > >>>> substring searches, so at least some users will be surprised by >> > > term-based >> > > >>>> behavior. >> > > >>>> >> > > >>>> `term_matches(column, term)`. Postgresql FTS makes you use >> functions >> > > like >> > > >>>> this for everything. It’s pretty clunky, and we would need to >> make >> > > the >> > > >>>> amazingly hairy SelectStatement even hairier to handle “use a >> function >> > > >>>> result in a predicate” like this. >> > > >>>> >> > > >>>> `column : term`. Inspired by Lucene’s syntax. I don’t actually >> hate >> > > it. >> > > >>>> >> > > >>>> `column LIKE :term:`. Stick with the LIKE operator but add a new >> > > symbol to >> > > >>>> indicate term matching. Arguably more SQL-ish than a new bare >> symbol >> > > >>>> operator. >> > > >>>> >> > > >>>> [1] >> > > >>>> >> > > >> https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html >> > > >>>> [2] >> https://lucene.apache.org/core/9_7_0/analysis/common/index.html >> > > >>>> >> > > >>>> -- >> > > >>>> Jonathan Ellis >> > > >>>> co-founder, http://www.datastax.com >> > > >>>> @spyced >> > > >>>> >> > > >>>> >> > > >>>> >> > > >>> >> > > >> > >> > > > -- > [image: DataStax Logo Square] <https://www.datastax.com/> *Mike Adamson* > Engineering > > +1 650 389 6000 <16503896000> | datastax.com <https://www.datastax.com/> > Find DataStax Online: [image: LinkedIn Logo] > <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_datastax&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=uHzE4WhPViSF0rsjSxKhfwGDU1Bo7USObSc_aIcgelo&s=akx0E6l2bnTjOvA-YxtonbW0M4b6bNg4nRwmcHNDo4Q&e=> > [image: Facebook Logo] > <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.facebook.com_datastax&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=uHzE4WhPViSF0rsjSxKhfwGDU1Bo7USObSc_aIcgelo&s=ncMlB41-6hHuqx-EhnM83-KVtjMegQ9c2l2zDzHAxiU&e=> > [image: Twitter Logo] <https://twitter.com/DataStax> [image: RSS > Feed] <https://www.datastax.com/blog/rss.xml> [image: Github Logo] > <https://github.com/datastax> > > -- Regards, Atri Apache Concerted