We’ve already started down the path of using a git sub-module for the Accord 
library. That could be an option at some point.

> On Aug 13, 2023, at 12:53 PM, Jon Haddad <rustyrazorbl...@apache.org> wrote:
> Functions make sense to me too.  In addition to the reasons listed, if we 
> acknowledge that functions in predicates are inevitable, then it makes total 
> sense to use them here.  I think this is the most forward thinking approach.
> Assuming this happens, one thing that would be great down the line would be 
> if the CQL parser was broken out into a subproject with an artifact published 
> so the soon to be additional complexity of parsing CQL didn't have to be 
> pushed to every single end user like it does today.  I'm not trying to expand 
> the scope right now, just laying an idea down for the future.  
> Jon
>> On 2023/08/07 21:26:40 Josh McKenzie wrote:
>> Been chatting a bit w/Caleb about this offline and poking around to better 
>> educate myself.
>>> using functions (ignoring the implementation complexity) at least removes 
>>> ambiguity. 
>> This, plus using functions lets us kick the can down the road a bit in terms 
>> of landing on an integrated grammar we agree on. It seems to me there's a 
>> tension between:
>> 1. "SQL-like" (i.e. postgres-like)
>> 2. "Indexing and Search domain-specific-like" (i.e. lucene syntax which, as 
>> Benedict points out, doesn't really jell w/what we have in CQL at this 
>> point), and
>> 3. ??? Some other YOLO CQL / C* specific thing where we go our own road
>> I don't think we're really going to know what our feature-set in terms of 
>> indexing is going to look like or the shape it's going to take for a while, 
>> so backing ourselves into any of the 3 corners above right now feels very 
>> premature to me.
>> So I'm coming around to the expr / method call approach to preserve that 
>> flexibility. It's maximally explicit and preserves optionality at the 
>> expense of being clunky. For now.
>> On Mon, Aug 7, 2023, at 4:00 PM, Caleb Rackliffe wrote:
>>>> I do not think we should start using lucene syntax for it, it will make 
>>>> people think they can do everything else lucene allows.
>>> I'm sure we won't be supporting everything Lucene allows, but this is going 
>>> to evolve. Right off the bat, if you introduce support for tokenization and 
>>> filtering, someone is, for example, going to ask for phrase queries. ("John 
>>> Smith landed in Virginia" is tokenized, but someone wants to match exactly 
>>> on "John Smith".) The whole point of the Vector project is to do relevance, 
>>> right? Are we going to do term boosting? Do we need queries like "field: 
>>> quick brown +fox -news" where fox must be present, news cannot be present, 
>>> and quick and brown increase relevance?
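
To make the "+fox -news" example concrete, here is a minimal Python sketch of required, prohibited, and optional terms. The function name `evaluate` and the count-based score are assumptions for illustration only; this is not Lucene's scoring model and not anything that exists in SAI.

```python
def evaluate(doc_tokens, should=(), must=(), must_not=()):
    """Return a toy relevance score, or None if the document is excluded.

    must     -> every term has to be present ("+fox")
    must_not -> no term may be present ("-news")
    should   -> optional terms; each hit raises the score ("quick brown")
    """
    tokens = set(doc_tokens)
    if any(term not in tokens for term in must):
        return None
    if any(term in tokens for term in must_not):
        return None
    # Stand-in scoring: one point per optional term found.
    return sum(1 for term in should if term in tokens)


# "field: quick brown +fox -news" expressed with the hypothetical helper.
QUERY = {"should": ("quick", "brown"), "must": ("fox",), "must_not": ("news",)}

print(evaluate("the quick red fox".split(), **QUERY))     # 1
print(evaluate("quick brown fox".split(), **QUERY))       # 2
print(evaluate("quick brown fox news".split(), **QUERY))  # None
print(evaluate("quick brown dog".split(), **QUERY))       # None
```

Even this toy version needs three distinct term roles plus a score, which is the kind of surface area that would have to land somewhere in the grammar or in function arguments.
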
>>> SASI uses "=" and "LIKE" in a way that assumes the user understands the 
>>> tokenization scheme in use on the target field. I understand that's a bit 
>>> ambiguous.
>>> If we object to allowing expr embedding of a subset of the Lucene syntax, I 
>>> can't imagine we're okay w/ then jamming a subset of that syntax into the 
>>> main CQL grammar.
>>> If we want to do this in non-expr CQL space, I think using functions 
>>> (ignoring the implementation complexity) at least removes ambiguity. 
>>> "token_match", "phrase_match", "token_like", "=", and "LIKE" would all be 
>>> pretty clear, although there may be other problems. For instance, what 
>>> happens when I try to use "token_match" on an indexed field whose analyzer 
>>> does not tokenize? We obviously can't use the index, so we'd be reduced to 
>>> requiring a filtering query, but maybe that's fine. My point is that, if 
>>> we're going to make write and read analyzers symmetrical, there's really no 
>>> way to make the semantics of our queries totally independent of analysis. 
>>> (ex. "field : foo bar" behaves differently w/ read tokenization than it 
>>> does without. It could even be an OR or AND query w/ tokenization, 
>>> depending on our defaults.)
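
A rough Python sketch of the distinction drawn above between token matching and phrase matching, including the AND/OR question. The names `token_match` and `phrase_match` are borrowed from the message; their signatures, the `require_all` flag, and the whitespace tokenizer are assumptions made for this example, not a proposed implementation.

```python
def tokenize(text):
    # Assumed analyzer: lowercase, then split on whitespace.
    return text.lower().split()


def token_match(value, query, require_all=True):
    """Match analyzed query tokens against the value's tokens.

    require_all=True gives AND semantics, False gives OR semantics.
    """
    value_tokens = set(tokenize(value))
    hits = [token in value_tokens for token in tokenize(query)]
    if not hits:
        return False
    return all(hits) if require_all else any(hits)


def phrase_match(value, query):
    """True if the query tokens occur consecutively and in order."""
    v, q = tokenize(value), tokenize(query)
    if not q:
        return False
    return any(v[i:i + len(q)] == q for i in range(len(v) - len(q) + 1))


doc = "John Smith landed in Virginia"
print(token_match(doc, "smith john"))                       # True
print(phrase_match(doc, "john smith"))                      # True
print(phrase_match(doc, "smith john"))                      # False
print(token_match(doc, "john texas"))                       # False (AND)
print(token_match(doc, "john texas", require_all=False))    # True (OR)
```

The last two lines show why the default matters: the same query text flips from no match to match depending on whether multiple tokens are combined with AND or OR.
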
>>> On Mon, Aug 7, 2023 at 12:55 PM Atri Sharma <a...@apache.org> wrote:
>>>> Why not start with SQLish operators supported by many databases (LIKE and 
>>>> On Mon, Aug 7, 2023 at 10:01 PM J. D. Jordan <jeremiah.jor...@gmail.com> 
>>>> wrote:
>>>>> I am also -1 on directly exposing lucene like syntax here. Besides being 
>>>>> ugly, SAI is not lucene, I do not think we should start using lucene 
>>>>> syntax for it, it will make people think they can do everything else 
>>>>> lucene allows.
>>>>>> On Aug 7, 2023, at 5:13 AM, Benedict <bened...@apache.org> wrote:
>>>>>> I’m strongly opposed to : 
>>>>>> It is very dissimilar to our current operators. CQL is already not the 
>>>>>> prettiest language, but let’s not make it a total mish mash.
>>>>>>> On 7 Aug 2023, at 10:59, Mike Adamson <madam...@datastax.com> wrote:
>>>>>>> I am also in agreement with 'column : token' in that 'I don't hate it' 
>>>>>>> but I'd like to offer an alternative to this in 'column HAS token'. HAS 
>>>>>>> is currently not a keyword that we use so wouldn't cause any brain 
>>>>>>> conflicts.
>>>>>>> While I don't hate ':' I have a particular dislike of the lucene search 
>>>>>>> syntax because of its terseness and lack of easy readability. 
>>>>>>> Saying that, I'm happy to do with ':' if that is the decision. 
>>>>>>> On Fri, 4 Aug 2023 at 00:23, Jon Haddad <rustyrazorbl...@apache.org> 
>>>>>>> wrote:
>>>>>>>> Assuming SAI is a superset of SASI, and we were to set up something so 
>>>>>>>> that SASI indexes auto convert to SAI, this gives even more weight to 
>>>>>>>> my point regarding how differing behavior for the same syntax can lead 
>>>>>>>> to issues.  Imo the best case scenario results in the user not even 
>>>>>>>> noticing their indexes have changed.
>>>>>>>> A (maybe better?) alternative is to add a flag to the index 
>>>>>>>> configuration for "compatibility mode", which might address the 
>>>>>>>> concerns around using an equality operator when it actually is a 
>>>>>>>> partial match.
>>>>>>>> For what it's worth, I'm in agreement that = should mean full equality 
>>>>>>>> and not token match.
>>>>>>>> On 2023/08/03 03:56:23 Caleb Rackliffe wrote:
>>>>>>>>> For what it's worth, I'd very much like to completely remove SASI from
>>>>>>>>> the codebase for 6.0. The only remaining functionality gaps at the
>>>>>>>>> moment are LIKE (prefix/suffix) queries and its limited tokenization
>>>>>>>>> capabilities, both of which already have SAI Phase 2 Jiras.
>>>>>>>>> On Wed, Aug 2, 2023 at 7:20 PM Jeremiah Jordan <jerem...@datastax.com> wrote:
>>>>>>>>>> SASI just uses “=” for the tokenized equality matching, which is the
>>>>>>>>>> exact thing this discussion is about changing/not liking.
>>>>>>>>>>> On Aug 2, 2023, at 7:18 PM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
>>>>>>>>>>> I do not think LIKE actually applies here. LIKE is used for prefix,
>>>>>>>>>>> contains, or suffix searches in SASI depending on the index type.
>>>>>>>>>>> This is about exact matching of tokens.
>>>>>>>>>>>> On Aug 2, 2023, at 5:53 PM, Jon Haddad <rustyrazorbl...@apache.org> wrote:
>>>>>>>>>>>> Certain bits of functionality also already exist on the SASI side of
>>>>>>>>>>>> things, but I'm not sure how much overlap there is.  Currently, there's a
>>>>>>>>>>>> LIKE keyword that handles token matching, although it seems to have some
>>>>>>>>>>>> differences from the feature set in SAI.
>>>>>>>>>>>> That said, there seems to be enough of an overlap that it would make
>>>>>>>>>>>> sense to consider using LIKE in the same manner, doesn't it?  I think it
>>>>>>>>>>>> would be a little odd if we have different syntax for different indexes.
>>>>>>>>>>>> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
>>>>>>>>>>>> I think one complication here is that there seems to be a desire, that
>>>>>>>>>>>> I very much agree with, to expose as much of the underlying flexibility of
>>>>>>>>>>>> Lucene as possible.  If it means we use Caleb's suggestion, I'd ask
>>>>>>>>>>>> that the queries that SASI and SAI both support use the same syntax, even
>>>>>>>>>>>> if it means there are two ways of writing the same query.  To use Caleb's
>>>>>>>>>>>> example, this would mean supporting both LIKE and the `expr` column.
>>>>>>>>>>>> Jon
>>>>>>>>>>>>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
>>>>>>>>>>>>> Here are some additional bits of prior art, if anyone finds them useful:
>>>>>>>>>>>>> The Stratio Lucene Index -
>>>>>>>>>>>>> https://github.com/Stratio/cassandra-lucene-index#examples
>>>>>>>>>>>>> Stratio was the reason C* added the "expr" functionality. They embedded
>>>>>>>>>>>>> something similar to ElasticSearch JSON, which probably isn't my favorite
>>>>>>>>>>>>> choice, but it's there.
>>>>>>>>>>>>> The ElasticSearch match query syntax -
>>>>>>>>>>>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
>>>>>>>>>>>>> Again, not my favorite. It's verbose, and probably too powerful for us.
>>>>>>>>>>>>> ElasticSearch's documentation for the basic Lucene query syntax -
>>>>>>>>>>>>> https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html#query-string-syntax
>>>>>>>>>>>>> One idea is to take the basic Lucene index, which it seems we already have
>>>>>>>>>>>>> some support for, and feed it to "expr". This is nice for two reasons:
>>>>>>>>>>>>> 1.) People can just write Lucene queries if they already know how.
>>>>>>>>>>>>> 2.) No changes to the grammar.
>>>>>>>>>>>>> Lucene has distinct concepts of filtering and querying, and this is kind of
>>>>>>>>>>>>> the latter. I'm not sure how, for example, we would want "expr" to interact
>>>>>>>>>>>>> w/ filters on other column indexes in vanilla CQL space...
>>>>>>>>>>>>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>>>>>>>>>>>> `column CONTAINS term`. Contains is used by both Java and Python for
>>>>>>>>>>>>>> substring searches, so at least some users will be surprised by term-based
>>>>>>>>>>>>>> behavior.
>>>>>>>>>>>>>> I wonder whether users are in their "programming language" headspace or in
>>>>>>>>>>>>>> their "querying a database" headspace when interacting with CQL? i.e. this
>>>>>>>>>>>>>> would only present confusion if we expected users to be thinking in the
>>>>>>>>>>>>>> idioms of their respective programming languages. If they're thinking in
>>>>>>>>>>>>>> terms of SQL, MATCHES would probably end up confusing them a bit since it
>>>>>>>>>>>>>> doesn't match the general structure of the MATCH operator.
>>>>>>>>>>>>>> That said, I also think CONTAINS loses something important that you allude
>>>>>>>>>>>>>> to here Jonathan:
>>>>>>>>>>>>>> with corresponding query-time tokenization and analysis.  This means that
>>>>>>>>>>>>>> the query term is not always a substring of the original string!  Besides
>>>>>>>>>>>>>> obvious transformations like lowercasing, you have things like
>>>>>>>>>>>>>> PhoneticFilter available as well.
>>>>>>>>>>>>>> So to me, neither MATCHES nor CONTAINS are particularly great candidates.
>>>>>>>>>>>>>> So +1 to the "I don't actually hate it" sentiment on:
>>>>>>>>>>>>>> `column : term`. Inspired by Lucene’s syntax
>>>>>>>>>>>>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
>>>>>>>>>>>>>> I have a strong preference not to use the name of an SQL operator, since
>>>>>>>>>>>>>> it precludes us later providing the SQL standard operator to users.
>>>>>>>>>>>>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
>>>>>>>>>>>>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña <adelap...@apache.org> wrote:
>>>>>>>>>>>>>> `column = term` is definitively problematic because it creates an
>>>>>>>>>>>>>> ambiguity when the queried column belongs to the primary key. For some
>>>>>>>>>>>>>> queries we wouldn't know whether the user wants a primary key query using
>>>>>>>>>>>>>> regular equality or an index query using the analyzer.
>>>>>>>>>>>>>> `term_matches(column, term)` seems quite clear and hard to misinterpret,
>>>>>>>>>>>>>> but it's quite long to write and its implementation will be challenging
>>>>>>>>>>>>>> since we would need a bunch of special casing around SelectStatement and
>>>>>>>>>>>>>> functions.
>>>>>>>>>>>>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to
>>>>>>>>>>>>>> evoke different behaviours to what they would have.
>>>>>>>>>>>>>> `column LIKE :term:` seems a bit redundant compared to just using
>>>>>>>>>>>>>> `column : term`, and we are still introducing a new symbol.
>>>>>>>>>>>>>> I think I like `column : term` the most, because it's brief, it's similar
>>>>>>>>>>>>>> to the equivalent Lucene's syntax, and it doesn't seem to clash with other
>>>>>>>>>>>>>> different meanings that I can think of.
>>>>>>>>>>>>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>>>>>>>>>>>> Hi all,
>>>>>>>>>>>>>> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
>>>>>>>>>>>>>> aligning around phase 2 features.
>>>>>>>>>>>>>> In particular, we need to nail down the syntax for doing non-exact string
>>>>>>>>>>>>>> matches.  We have a proof of concept that includes full Lucene analyzer and
>>>>>>>>>>>>>> filter functionality – just the text transformation pieces, none of the
>>>>>>>>>>>>>> storage parts – which is the gold standard in this space.  For example, the
>>>>>>>>>>>>>> StandardAnalyzer [1] lowercases all terms and removes stopwords (common
>>>>>>>>>>>>>> words like “a”, “is”, “the” that are usually not useful to search
>>>>>>>>>>>>>> against).  Lucene also has classes that offer stemming, special case
>>>>>>>>>>>>>> handling for email, and many languages besides English [2].
>>>>>>>>>>>>>> What syntax should we use to express “rows whose analyzed tokens match
>>>>>>>>>>>>>> this search term?”
>>>>>>>>>>>>>> The syntax must be clear that we want to look for this term within the
>>>>>>>>>>>>>> column data using the configured index with corresponding query-time
>>>>>>>>>>>>>> tokenization and analysis.  This means that the query term is not always a
>>>>>>>>>>>>>> substring of the original string!  Besides obvious transformations like
>>>>>>>>>>>>>> lowercasing, you have things like PhoneticFilter available as well.
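
For anyone skimming, a small Python sketch of why an analyzed match is not a substring test. The stop-word list, the regex tokenizer, and the helper names `analyze` and `term_matches` are all hypothetical stand-ins; this does not call Lucene and only mimics the lowercasing and stop-word removal described above.

```python
import re

# Hypothetical stop-word list, chosen only for this example.
STOP_WORDS = {"a", "an", "is", "the", "in", "of"}


def analyze(text):
    """Split on non-alphanumerics, lowercase, and drop stop words."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens if t.lower() not in STOP_WORDS]


def term_matches(column_value, query):
    """True if every analyzed query token is among the column's tokens.

    The same analyzer runs on both sides, i.e. write and read analysis
    are symmetrical in this sketch.
    """
    column_tokens = set(analyze(column_value))
    query_tokens = analyze(query)
    return bool(query_tokens) and all(t in column_tokens for t in query_tokens)


row = "The Quick brown fox"
print(analyze(row))                 # ['quick', 'brown', 'fox']
print(term_matches(row, "QUICK"))   # True: matches after analysis
print("QUICK" in row)               # False: not a substring of the raw value
print(term_matches(row, "qui"))     # False: a partial token is not a term
```

So a term match can succeed where a substring test fails, and fail where a substring test succeeds, which is the mismatch behind the concerns about LIKE and CONTAINS.
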
>>>>>>>>>>>>>> Here are my thoughts on some of the options:
>>>>>>>>>>>>>> `column = term`.  This is what the POC does today and it’s super confusing
>>>>>>>>>>>>>> to overload = to mean something other than exact equality.  I am not a fan.
>>>>>>>>>>>>>> `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but
>>>>>>>>>>>>>> neither the wildcarded nor unwildcarded syntax matches the semantics of
>>>>>>>>>>>>>> term-based search.
>>>>>>>>>>>>>> `column MATCHES term`. I rather like this one, although Mike points out
>>>>>>>>>>>>>> that “match” has a meaning in the context of regular expressions that could
>>>>>>>>>>>>>> cause confusion here.
>>>>>>>>>>>>>> `column CONTAINS term`. Contains is used by both Java and Python for
>>>>>>>>>>>>>> substring searches, so at least some users will be surprised by term-based
>>>>>>>>>>>>>> behavior.
>>>>>>>>>>>>>> `term_matches(column, term)`. Postgresql FTS makes you use functions like
>>>>>>>>>>>>>> this for everything.  It’s pretty clunky, and we would need to make the
>>>>>>>>>>>>>> amazingly hairy SelectStatement even hairier to handle “use a function
>>>>>>>>>>>>>> result in a predicate” like this.
>>>>>>>>>>>>>> `column : term`. Inspired by Lucene’s syntax.  I don’t actually hate it.
>>>>>>>>>>>>>> `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to
>>>>>>>>>>>>>> indicate term matching.  Arguably more SQL-ish than a new bare symbol
>>>>>>>>>>>>>> operator.
>>>>>>>>>>>>>> [1] https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
>>>>>>>>>>>>>> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>>>>>>>>>>>>>> --
>>>>>>>>>>>>>> Jonathan Ellis
>>>>>>>>>>>>>> co-founder, http://www.datastax.com
>>>>>>>>>>>>>> @spyced
>>>>>>> --
>>>>>>> *Mike Adamson*
>>>>>>> Engineering
>>>> --
>>>> Regards,
>>>> Atri
>>>> Apache Concerted
