Re: Tokenization and SAI query syntax

Jeremiah Jordan Wed, 02 Aug 2023 17:20:32 -0700

SASI just uses “=” for the tokenized equality matching, which is the exact 
thing this discussion is about changing/not liking.


> On Aug 2, 2023, at 7:18 PM, J. D. Jordan <jeremiah.jor...@gmail.com> wrote:
> 
> I do not think LIKE actually applies here. LIKE is used for prefix, 
> contains, or suffix searches in SASI depending on the index type.
> 
> This is about exact matching of tokens.
> 
>> On Aug 2, 2023, at 5:53 PM, Jon Haddad <rustyrazorbl...@apache.org> wrote:
>> 
>> Certain bits of functionality also already exist on the SASI side of 
>> things, but I'm not sure how much overlap there is.  Currently, there's a 
>> LIKE keyword that handles token matching, although it seems to have some 
>> differences from the feature set in SAI.  
>> 
>> That said, there seems to be enough of an overlap that it would make sense 
>> to consider using LIKE in the same manner, doesn't it?  I think it would be 
>> a little odd if we have different syntax for different indexes.  
>> 
>> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
>> 
>> I think one complication here is that there seems to be a desire, that I 
>> very much agree with, to expose as much of the underlying flexibility of 
>> Lucene as possible.  If it means we use Caleb's suggestion, I'd ask 
>> that the queries that SASI and SAI both support use the same syntax, even if 
>> it means there are two ways of writing the same query.  To use Caleb's 
>> example, this would mean supporting both LIKE and the `expr` column.  
>> 
>> Jon
>> 
>>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
>>> Here are some additional bits of prior art, if anyone finds them useful:
>>> 
>>> 
>>> The Stratio Lucene Index -
>>> https://github.com/Stratio/cassandra-lucene-index#examples
>>> 
>>> Stratio was the reason C* added the "expr" functionality. They embedded
>>> something similar to ElasticSearch JSON, which probably isn't my favorite
>>> choice, but it's there.
>>> 
>>> 
>>> The ElasticSearch match query syntax -
>>> https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html
>>>  
>>> 
>>> Again, not my favorite. It's verbose, and probably too powerful for us.
>>> 
>>> 
>>> ElasticSearch's documentation for the basic Lucene query syntax -
>>> https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html#query-string-syntax
>>>  
>>> 
>>> One idea is to take the basic Lucene index, which it seems we already have
>>> some support for, and feed it to "expr". This is nice for two reasons:
>>> 
>>> 1.) People can just write Lucene queries if they already know how.
>>> 2.) No changes to the grammar.
>>> 
>>> Lucene has distinct concepts of filtering and querying, and this is kind of
>>> the latter. I'm not sure how, for example, we would want "expr" to interact
>>> w/ filters on other column indexes in vanilla CQL space...
>>> 
>>> 
>>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie <jmcken...@apache.org> wrote:
>>>> 
>>>> `column CONTAINS term`. Contains is used by both Java and Python for
>>>> substring searches, so at least some users will be surprised by term-based
>>>> behavior.
>>>> 
>>>> I wonder whether users are in their "programming language" headspace or in
>>>> their "querying a database" headspace when interacting with CQL? i.e. this
>>>> would only present confusion if we expected users to be thinking in the
>>>> idioms of their respective programming languages. If they're thinking in
>>>> terms of SQL, MATCHES would probably end up confusing them a bit since it
>>>> doesn't match the general structure of the MATCH operator.
>>>> 
>>>> That said, I also think CONTAINS loses something important that you allude
>>>> to here Jonathan:
>>>> 
>>>> with corresponding query-time tokenization and analysis.  This means that
>>>> the query term is not always a substring of the original string!  Besides
>>>> obvious transformations like lowercasing, you have things like
>>>> PhoneticFilter available as well.
>>>> 
>>>> So to me, neither MATCHES nor CONTAINS are particularly great candidates.
>>>> 
>>>> So +1 to the "I don't actually hate it" sentiment on:
>>>> 
>>>> `column : term`. Inspired by Lucene’s syntax
>>>> 
>>>> 
>>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
>>>> 
>>>> 
>>>> I have a strong preference not to use the name of an SQL operator, since
>>>> it precludes us later providing the SQL standard operator to users.
>>>> 
>>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
>>>> 
>>>> 
>>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña <adelap...@apache.org> wrote:
>>>> 
>>>> 
>>>> `column = term` is definitively problematic because it creates an
>>>> ambiguity when the queried column belongs to the primary key. For some
>>>> queries we wouldn't know whether the user wants a primary key query using
>>>> regular equality or an index query using the analyzer.
>>>> 
>>>> `term_matches(column, term)` seems quite clear and hard to misinterpret,
>>>> but it's quite long to write and its implementation will be challenging
>>>> since we would need a bunch of special casing around SelectStatement and
>>>> functions.
>>>> 
>>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to
>>>> evoke different behaviours to what they would have.
>>>> 
>>>> `column LIKE :term:` seems a bit redundant compared to just using `column
>>>> : term`, and we are still introducing a new symbol.
>>>> 
>>>> I think I like `column : term` the most, because it's brief, it's similar
>>>> to the equivalent Lucene syntax, and it doesn't seem to clash with any
>>>> other meanings that I can think of.
>>>> 
>>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <jbel...@gmail.com> wrote:
>>>> 
>>>> Hi all,
>>>> 
>>>> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
>>>> aligning around phase 2 features.
>>>> 
>>>> In particular, we need to nail down the syntax for doing non-exact string
>>>> matches.  We have a proof of concept that includes full Lucene analyzer and
>>>> filter functionality – just the text transformation pieces, none of the
>>>> storage parts – which is the gold standard in this space.  For example, the
>>>> StandardAnalyzer [1] lowercases all terms and removes stopwords (common
>>>> words like “a”, “is”, “the” that are usually not useful to search
>>>> against).  Lucene also has classes that offer stemming, special case
>>>> handling for email, and many languages besides English [2].
>>>> 
>>>> What syntax should we use to express “rows whose analyzed tokens match
>>>> this search term?”
>>>> 
>>>> The syntax must be clear that we want to look for this term within the
>>>> column data using the configured index with corresponding query-time
>>>> tokenization and analysis.  This means that the query term is not always a
>>>> substring of the original string!  Besides obvious transformations like
>>>> lowercasing, you have things like PhoneticFilter available as well.
>>>> 
>>>> Here are my thoughts on some of the options:
>>>> 
>>>> `column = term`.  This is what the POC does today and it’s super confusing
>>>> to overload = to mean something other than exact equality.  I am not a fan.
>>>> 
>>>> `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but
>>>> neither the wildcarded nor unwildcarded syntax matches the semantics of
>>>> term-based search.
>>>> 
>>>> `column MATCHES term`. I rather like this one, although Mike points out
>>>> that “match” has a meaning in the context of regular expressions that could
>>>> cause confusion here.
>>>> 
>>>> `column CONTAINS term`. Contains is used by both Java and Python for
>>>> substring searches, so at least some users will be surprised by term-based
>>>> behavior.
>>>> 
>>>> `term_matches(column, term)`. PostgreSQL FTS makes you use functions like
>>>> this for everything.  It’s pretty clunky, and we would need to make the
>>>> amazingly hairy SelectStatement even hairier to handle “use a function
>>>> result in a predicate” like this.
>>>> 
>>>> `column : term`. Inspired by Lucene’s syntax.  I don’t actually hate it.
>>>> 
>>>> `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to
>>>> indicate term matching.  Arguably more SQL-ish than a new bare symbol
>>>> operator.
>>>> 
>>>> [1]
>>>> https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
>>>> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>>>> 
>>>> --
>>>> Jonathan Ellis
>>>> co-founder, http://www.datastax.com
>>>> @spyced
>>>> 
>>>> 
>>>> 
>>>
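
The point made at the start of the thread, that an analyzed query term is not
always a substring of the original string, can be illustrated with a minimal
sketch. This is plain Python, not Lucene or SAI code: the `analyze` and
`term_matches` helpers and the stopword list are hypothetical stand-ins for a
real analyzer chain, which would also offer stemming and phonetic filters.

```python
import re

# Hypothetical stopword list for illustration; a real analyzer has its own set.
STOPWORDS = {"a", "an", "is", "the", "of"}


def analyze(text):
    """Toy analyzer: split on non-alphanumerics, lowercase, drop stopwords."""
    tokens = re.findall(r"[A-Za-z0-9]+", text)
    return [t.lower() for t in tokens if t.lower() not in STOPWORDS]


def term_matches(column_value, query):
    """True if every analyzed query token is among the column's analyzed tokens.

    The same analysis is applied at index time (column_value) and at query
    time (query), which is the behavior the thread is trying to name.
    """
    indexed = set(analyze(column_value))
    query_tokens = analyze(query)
    return bool(query_tokens) and all(t in indexed for t in query_tokens)


value = "The Quick-Brown fox"

# Term match succeeds although "QUICK" is not a substring of the stored value.
print(term_matches(value, "QUICK"), "QUICK" in value)

# "uick" is a substring of the stored value but is not one of its tokens.
print(term_matches(value, "uick"), "uick" in value)
```

Here `QUICK` matches through analysis even though it is not a substring of the
stored value, while `uick` is a substring but not a token. That gap is why the
substring-flavored readings of LIKE and CONTAINS were considered misleading
for term-based search.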
