We’ve already started down the path of using a git sub-module for the Accord library. That could be an option at some point.
> On Aug 13, 2023, at 12:53 PM, Jon Haddad <rustyrazorbl...@apache.org> wrote:
>
> Functions make sense to me too. In addition to the reasons listed, if we acknowledge that functions in predicates are inevitable, then it makes total sense to use them here. I think this is the most forward thinking approach.
>
> Assuming this happens, one thing that would be great down the line would be if the CQL parser was broken out into a subproject with an artifact published so the soon to be additional complexity of parsing CQL didn't have to be pushed to every single end user like it does today. I'm not trying to expand the scope right now, just laying an idea down for the future.
>
> Jon
>
>> On 2023/08/07 21:26:40 Josh McKenzie wrote:
>> Been chatting a bit w/Caleb about this offline and poking around to better educate myself.
>>
>>> using functions (ignoring the implementation complexity) at least removes ambiguity.
>> This, plus using functions lets us kick the can down the road a bit in terms of landing on an integrated grammar we agree on. It seems to me there's a tension between:
>> 1. "SQL-like" (i.e. postgres-like)
>> 2. "Indexing and Search domain-specific-like" (i.e. lucene syntax which, as Benedict points out, doesn't really jell w/what we have in CQL at this point), and
>> 3. ??? Some other YOLO CQL / C* specific thing where we go our own road
>> I don't think we're really going to know what our feature-set in terms of indexing is going to look like or the shape it's going to take for a while, so backing ourselves into any of the 3 corners above right now feels very premature to me.
>>
>> So I'm coming around to the expr / method call approach to preserve that flexibility. It's maximally explicit and preserves optionality at the expense of being clunky. For now.
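To make the function-style option a bit more concrete, here is a minimal Python sketch of how token and phrase predicates might differ. The names `token_match` and `phrase_match` and the "John Smith landed in Virginia" example come from Caleb's message quoted in this thread; the semantics shown here and the toy `analyze` helper are assumptions for illustration only, not how SAI or CQL implements anything.

```python
import re

def analyze(text):
    """Toy analyzer (an assumption): lowercase, split on non-alphanumerics."""
    return [t for t in re.split(r"[^0-9a-z]+", text.lower()) if t]

def token_match(column_value, term):
    """True if every analyzed token of `term` occurs in the analyzed column."""
    column_tokens = set(analyze(column_value))
    return all(t in column_tokens for t in analyze(term))

def phrase_match(column_value, phrase):
    """True if the analyzed phrase occurs as a consecutive run of tokens."""
    tokens, wanted = analyze(column_value), analyze(phrase)
    if not wanted:
        return False
    return any(tokens[i:i + len(wanted)] == wanted
               for i in range(len(tokens) - len(wanted) + 1))

row = "John Smith landed in Virginia"
print(token_match(row, "smith john"))   # token order is irrelevant
print(phrase_match(row, "John Smith"))  # tokens must be adjacent and ordered
print(phrase_match(row, "Smith John"))  # same tokens, wrong order
```

Separate function names keep the two behaviours explicit at the call site, which is the "removes ambiguity" argument made in the thread.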
>>
>> On Mon, Aug 7, 2023, at 4:00 PM, Caleb Rackliffe wrote:
>>>> I do not think we should start using lucene syntax for it, it will make people think they can do everything else lucene allows.
>>>
>>> I'm sure we won't be supporting everything Lucene allows, but this is going to evolve. Right off the bat, if you introduce support for tokenization and filtering, someone is, for example, going to ask for phrase queries. ("John Smith landed in Virginia" is tokenized, but someone wants to match exactly on "John Smith".) The whole point of the Vector project is to do relevance, right? Are we going to do term boosting? Do we need queries like "field: quick brown +fox -news" where fox must be present, news cannot be present, and quick and brown increase relevance?
>>>
>>> SASI uses "=" and "LIKE" in a way that assumes the user understands the tokenization scheme in use on the target field. I understand that's a bit ambiguous.
>>>
>>> If we object to allowing expr embedding of a subset of the Lucene syntax, I can't imagine we're okay w/ then jamming a subset of that syntax into the main CQL grammar.
>>>
>>> If we want to do this in non-expr CQL space, I think using functions (ignoring the implementation complexity) at least removes ambiguity. "token_match", "phrase_match", "token_like", "=", and "LIKE" would all be pretty clear, although there may be other problems. For instance, what happens when I try to use "token_match" on an indexed field whose analyzer does not tokenize? We obviously can't use the index, so we'd be reduced to requiring a filtering query, but maybe that's fine. My point is that, if we're going to make write and read analyzers symmetrical, there's really no way to make the semantics of our queries totally independent of analysis. (ex. "field : foo bar" behaves differently w/ read tokenization than it does without.
>>> It could even be an OR or AND query w/ tokenization, depending on our defaults.)
>>>
>>> On Mon, Aug 7, 2023 at 12:55 PM Atri Sharma <a...@apache.org> wrote:
>>>> Why not start with SQLish operators supported by many databases (LIKE and CONTAINS)?
>>>>
>>>> On Mon, Aug 7, 2023 at 10:01 PM J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>>>>>
>>>>> I am also -1 on directly exposing lucene-like syntax here. Besides being ugly, SAI is not lucene, I do not think we should start using lucene syntax for it, it will make people think they can do everything else lucene allows.
>>>>>
>>>>>> On Aug 7, 2023, at 5:13 AM, Benedict <bened...@apache.org> wrote:
>>>>>>
>>>>>> I’m strongly opposed to :
>>>>>>
>>>>>> It is very dissimilar to our current operators. CQL is already not the prettiest language, but let’s not make it a total mish mash.
>>>>>>
>>>>>>> On 7 Aug 2023, at 10:59, Mike Adamson <madam...@datastax.com> wrote:
>>>>>>>
>>>>>>> I am also in agreement with 'column : token' in that 'I don't hate it' but I'd like to offer an alternative to this in 'column HAS token'. HAS is currently not a keyword that we use so wouldn't cause any brain conflicts.
>>>>>>>
>>>>>>> While I don't hate ':' I have a particular dislike of the lucene search syntax because of its terseness and lack of easy readability.
>>>>>>>
>>>>>>> Saying that, I'm happy to go with ':' if that is the decision.
>>>>>>>
>>>>>>> On Fri, 4 Aug 2023 at 00:23, Jon Haddad <rustyrazorbl...@apache.org> wrote:
>>>>>>>> Assuming SAI is a superset of SASI, and we were to set up something so that SASI indexes auto convert to SAI, this gives even more weight to my point regarding how differing behavior for the same syntax can lead to issues. Imo the best case scenario results in the user not even noticing their indexes have changed.
>>>>>>>>
>>>>>>>> A (maybe better?)
>>>>>>>> alternative is to add a flag to the index configuration for "compatibility mode", which might address the concerns around using an equality operator when it actually is a partial match.
>>>>>>>>
>>>>>>>> For what it's worth, I'm in agreement that = should mean full equality and not token match.
>>>>>>>>
>>>>>>>> On 2023/08/03 03:56:23 Caleb Rackliffe wrote:
>>>>>>>>> For what it's worth, I'd very much like to completely remove SASI from the codebase for 6.0. The only remaining functionality gaps at the moment are LIKE (prefix/suffix) queries and its limited tokenization capabilities, both of which already have SAI Phase 2 Jiras.
>>>>>>>>>
>>>>>>>>> On Wed, Aug 2, 2023 at 7:20 PM Jeremiah Jordan <jerem...@datastax.com> wrote:
>>>>>>>>>
>>>>>>>>>> SASI just uses “=” for the tokenized equality matching, which is the exact thing this discussion is about changing/not liking.
>>>>>>>>>>
>>>>>>>>>>> On Aug 2, 2023, at 7:18 PM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>>>>>>>>>>>
>>>>>>>>>>> I do not think LIKE actually applies here. LIKE is used for prefix, contains, or suffix searches in SASI depending on the index type.
>>>>>>>>>>>
>>>>>>>>>>> This is about exact matching of tokens.
>>>>>>>>>>>
>>>>>>>>>>>> On Aug 2, 2023, at 5:53 PM, Jon Haddad <rustyrazorbl...@apache.org> wrote:
>>>>>>>>>>>>
>>>>>>>>>>>> Certain bits of functionality also already exist on the SASI side of things, but I'm not sure how much overlap there is. Currently, there's a LIKE keyword that handles token matching, although it seems to have some differences from the feature set in SAI.
>>>>>>>>>>>>
>>>>>>>>>>>> That said, there seems to be enough of an overlap that it would make sense to consider using LIKE in the same manner, doesn't it?
>>>>>>>>>>>> I think it would be a little odd if we have different syntax for different indexes.
>>>>>>>>>>>>
>>>>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
>>>>>>>>>>>>
>>>>>>>>>>>> I think one complication here is that there seems to be a desire, that I very much agree with, to expose as much of the underlying flexibility of Lucene as possible. If it means we use Caleb's suggestion, I'd ask that the queries that SASI and SAI both support use the same syntax, even if it means there's two ways of writing the same query. To use Caleb's example, this would mean supporting both LIKE and the `expr` column.
>>>>>>>>>>>>
>>>>>>>>>>>> Jon
>>>>>>>>>>>>
>>>>>>>>>>>>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
>>>>>>>>>>>>> Here are some additional bits of prior art, if anyone finds them useful:
>>>>>>>>>>>>>
>>>>>>>>>>>>> The Stratio Lucene Index - https://github.com/Stratio/cassandra-lucene-index#examples
>>>>>>>>>>>>>
>>>>>>>>>>>>> Stratio was the reason C* added the "expr" functionality. They embedded something similar to ElasticSearch JSON, which probably isn't my favorite choice, but it's there.
>>>>>>>>>>>>>
>>>>>>>>>>>>> The ElasticSearch match query syntax - https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
>>>>>>>>>>>>>
>>>>>>>>>>>>> Again, not my favorite. It's verbose, and probably too powerful for us.
>>>>>>>>>>>>>
>>>>>>>>>>>>> ElasticSearch's documentation for the basic Lucene query syntax - https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html#query-string-syntax
>>>>>>>>>>>>>
>>>>>>>>>>>>> One idea is to take the basic Lucene index, which it seems we already have some support for, and feed it to "expr". This is nice for two reasons:
>>>>>>>>>>>>>
>>>>>>>>>>>>> 1.) People can just write Lucene queries if they already know how.
>>>>>>>>>>>>> 2.) No changes to the grammar.
>>>>>>>>>>>>>
>>>>>>>>>>>>> Lucene has distinct concepts of filtering and querying, and this is kind of the latter. I'm not sure how, for example, we would want "expr" to interact w/ filters on other column indexes in vanilla CQL space...
>>>>>>>>>>>>>
>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `column CONTAINS term`. Contains is used by both Java and Python for substring searches, so at least some users will be surprised by term-based behavior.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I wonder whether users are in their "programming language" headspace or in their "querying a database" headspace when interacting with CQL? i.e. this would only present confusion if we expected users to be thinking in the idioms of their respective programming languages.
>>>>>>>>>>>>>> If they're thinking in terms of SQL, MATCHES would probably end up confusing them a bit since it doesn't match the general structure of the MATCH operator.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> That said, I also think CONTAINS loses something important that you allude to here Jonathan:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> with corresponding query-time tokenization and analysis. This means that the query term is not always a substring of the original string! Besides obvious transformations like lowercasing, you have things like PhoneticFilter available as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So to me, neither MATCHES nor CONTAINS are particularly great candidates.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> So +1 to the "I don't actually hate it" sentiment on:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `column : term`. Inspired by Lucene’s syntax
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I have a strong preference not to use the name of an SQL operator, since it precludes us later providing the SQL standard operator to users.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña <adelap...@apache.org> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `column = term` is definitely problematic because it creates an ambiguity when the queried column belongs to the primary key. For some queries we wouldn't know whether the user wants a primary key query using regular equality or an index query using the analyzer.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `term_matches(column, term)` seems quite clear and hard to misinterpret, but it's quite long to write and its implementation will be challenging since we would need a bunch of special casing around SelectStatement and functions.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to evoke different behaviours to what they would have.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `column LIKE :term:` seems a bit redundant compared to just using `column : term`, and we are still introducing a new symbol.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> I think I like `column : term` the most, because it's brief, it's similar to the equivalent Lucene syntax, and it doesn't seem to clash with other different meanings that I can think of.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on aligning around phase 2 features.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> In particular, we need to nail down the syntax for doing non-exact string matches. We have a proof of concept that includes full Lucene analyzer and filter functionality – just the text transformation pieces, none of the storage parts – which is the gold standard in this space. For example, the StandardAnalyzer [1] lowercases all terms and removes stopwords (common words like “a”, “is”, “the” that are usually not useful to search against).
>>>>>>>>>>>>>> Lucene also has classes that offer stemming, special case handling for email, and many languages besides English [2].
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> What syntax should we use to express “rows whose analyzed tokens match this search term?”
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> The syntax must be clear that we want to look for this term within the column data using the configured index with corresponding query-time tokenization and analysis. This means that the query term is not always a substring of the original string! Besides obvious transformations like lowercasing, you have things like PhoneticFilter available as well.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> Here are my thoughts on some of the options:
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `column = term`. This is what the POC does today and it’s super confusing to overload = to mean something other than exact equality. I am not a fan.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but neither the wildcarded nor unwildcarded syntax matches the semantics of term-based search.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `column MATCHES term`. I rather like this one, although Mike points out that “match” has a meaning in the context of regular expressions that could cause confusion here.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `column CONTAINS term`. Contains is used by both Java and Python for substring searches, so at least some users will be surprised by term-based behavior.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `term_matches(column, term)`.
>>>>>>>>>>>>>> PostgreSQL FTS makes you use functions like this for everything. It’s pretty clunky, and we would need to make the amazingly hairy SelectStatement even hairier to handle “use a function result in a predicate” like this.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `column : term`. Inspired by Lucene’s syntax. I don’t actually hate it.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to indicate term matching. Arguably more SQL-ish than a new bare symbol operator.
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> [1] https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
>>>>>>>>>>>>>> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>>>>>>>>>>>>>>
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Jonathan Ellis
>>>>>>>>>>>>>> co-founder, http://www.datastax.com
>>>>>>>>>>>>>> @spyced
>>>>>>>
>>>>>>> --
>>>>>>> Mike Adamson
>>>>>>> Engineering | datastax.com
>>>>
>>>> --
>>>> Regards,
>>>> Atri
>>>> Apache Concerted
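
For anyone skimming the thread, the point that an analyzed query term "is not always a substring of the original string" can be illustrated with a small Python sketch. Everything here is a toy assumption: the stopword set reuses the examples from the original post, the suffix-stripping `stem` is a crude stand-in for a real stemmer, and `term_matches` only borrows the name of the function-style option discussed in the thread. It is not Lucene's or SAI's behaviour.

```python
STOPWORDS = {"a", "is", "the"}   # the stopword examples quoted in the thread
SUFFIXES = ("ning", "ing", "s")  # crude stand-in for a real stemmer

def stem(token):
    """Strip the first matching suffix, keeping at least three characters."""
    for suffix in SUFFIXES:
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def analyze(text):
    """Lowercase, split on whitespace, drop stopwords, then stem."""
    tokens = [t.strip(".,!?") for t in text.lower().split()]
    return [stem(t) for t in tokens if t and t not in STOPWORDS]

def term_matches(column_value, term):
    """True if the analyzed query is non-empty and all its tokens are indexed."""
    column_tokens = set(analyze(column_value))
    query_tokens = analyze(term)
    return bool(query_tokens) and all(t in column_tokens for t in query_tokens)

row = "She runs the marathon"
print("Running" in row)              # plain substring test
print(term_matches(row, "Running"))  # analyzed match via the shared stem "run"
print(term_matches(row, "the"))      # stopword-only query analyzes to nothing
```

The substring test and the analyzed match disagree on "Running", which is the surprise that operators such as CONTAINS or LIKE might cause if they were reused for term-based search.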