Re: Tokenization and SAI query syntax

Caleb Rackliffe Mon, 07 Aug 2023 13:01:17 -0700

> I do not think we should start using lucene syntax for it, it will make
people think they can do everything else lucene allows.


I'm sure we won't be supporting everything Lucene allows, but this is going
to evolve. Right off the bat, if you introduce support for tokenization and
filtering, someone is, for example, going to ask for phrase queries. ("John
Smith landed in Virginia" is tokenized, but someone wants to match exactly
on "John Smith".) The whole point of the Vector project is to do relevance,
right? Are we going to do term boosting? Do we need queries like "field:
quick brown +fox -news" where fox must be present, news cannot be present,
and quick and brown increase relevance?

SASI uses "=" and "LIKE" in a way that assumes the user understands the
tokenization scheme in use on the target field. I understand that's a bit
ambiguous.

If we object to allowing expr embedding of a subset of the Lucene syntax, I
can't imagine we're okay w/ then jamming a subset of that syntax into the
main CQL grammar.

If we want to do this in non-expr CQL space, I think using functions
(ignoring the implementation complexity) at least removes ambiguity.
"token_match", "phrase_match", "token_like", "=", and "LIKE" would all be
pretty clear, although there may be other problems. For instance, what
happens when I try to use "token_match" on an indexed field whose analyzer
does not tokenize? We obviously can't use the index, so we'd be reduced to
requiring a filtering query, but maybe that's fine. My point is that, if
we're going to make write and read analyzers symmetrical, there's really no
way to make the semantics of our queries totally independent of analysis.
(ex. "field : foo bar" behaves differently w/ read tokenization than it
does without. It could even be an OR or AND query w/ tokenization,
depending on our defaults.)

On Mon, Aug 7, 2023 at 12:55 PM Atri Sharma <a...@apache.org> wrote:

> Why not start with SQLish operators supported by many databases (LIKE and
> CONTAINS)?
>
> On Mon, Aug 7, 2023 at 10:01 PM J. D. Jordan <jeremiah.jor...@gmail.com>
> wrote:
>
>> I am also -1 on directly exposing lucene like syntax here. Besides being
>> ugly, SAI is not lucene, I do not think we should start using lucene syntax
>> for it, it will make people think they can do everything else lucene allows.
>>
>> On Aug 7, 2023, at 5:13 AM, Benedict <bened...@apache.org> wrote:
>>
>> 
>> I’m strongly opposed to :
>>
>> It is very dissimilar to our current operators. CQL is already not the
>> prettiest language, but let’s not make it a total mish mash.
>>
>>
>>
>> On 7 Aug 2023, at 10:59, Mike Adamson <madam...@datastax.com> wrote:
>>
>> 
>> I am also in agreement with 'column : token' in that 'I don't hate it'
>> but I'd like to offer an alternative to this in 'column HAS token'. HAS is
>> currently not a keyword that we use so wouldn't cause any brain conflicts.
>>
>> While I don't hate ':' I have a particular dislike of the lucene search
>> syntax because of its terseness and lack of easy readability.
>>
>> Saying that, I'm happy to do with ':' if that is the decision.
>>
>> On Fri, 4 Aug 2023 at 00:23, Jon Haddad <rustyrazorbl...@apache.org>
>> wrote:
>>
>>> Assuming SAI is a superset of SASI, and we were to set up something so
>>> that SASI indexes auto convert to SAI, this gives even more weight to my
>>> point regarding how differing behavior for the same syntax can lead to
>>> issues.  Imo the best case scenario results in the user not even noticing
>>> their indexes have changed.
>>>
>>> An (maybe better?) alternative is to add a flag to the index
>>> configuration for "compatibility mod", which might address the concerns
>>> around using an equality operator when it actually is a partial match.
>>>
>>> For what it's worth, I'm in agreement that = should mean full equality
>>> and not token match.
>>>
>>> On 2023/08/03 03:56:23 Caleb Rackliffe wrote:
>>> > For what it's worth, I'd very much like to completely remove SASI from
>>> the
>>> > codebase for 6.0. The only remaining functionality gaps at the moment
>>> are
>>> > LIKE (prefix/suffix) queries and its limited tokenization
>>> > capabilities, both of which already have SAI Phase 2 Jiras.
>>> >
>>> > On Wed, Aug 2, 2023 at 7:20 PM Jeremiah Jordan <jerem...@datastax.com>
>>> > wrote:
>>> >
>>> > > SASI just uses “=“ for the tokenized equality matching, which is the
>>> exact
>>> > > thing this discussion is about changing/not liking.
>>> > >
>>> > > > On Aug 2, 2023, at 7:18 PM, J. D. Jordan <
>>> jeremiah.jor...@gmail.com>
>>> > > wrote:
>>> > > >
>>> > > > I do not think LIKE actually applies here. LIKE is used for
>>> prefix,
>>> > > contains, or suffix searches in SASI depending on the index type.
>>> > > >
>>> > > > This is about exact matching of tokens.
>>> > > >
>>> > > >> On Aug 2, 2023, at 5:53 PM, Jon Haddad <
>>> rustyrazorbl...@apache.org>
>>> > > wrote:
>>> > > >>
>>> > > >> Certain bits of functionality also already exist on the SASI
>>> side of
>>> > > things, but I'm not sure how much overlap there is.  Currently,
>>> there's a
>>> > > LIKE keyword that handles token matching, although it seems to have
>>> some
>>> > > differences from the feature set in SAI.
>>> > > >>
>>> > > >> That said, there seems to be enough of an overlap that it would
>>> make
>>> > > sense to consider using LIKE in the same manner, doesn't it?  I
>>> think it
>>> > > would be a little odd if we have different syntax for different
>>> indexes.
>>> > > >>
>>> > > >> https://github.com/apache/cassandra/blob/trunk/doc/SASI.md
>>> > > >>
>>> > > >> I think one complication here is that there seems to be a desire,
>>> that
>>> > > I very much agree with, to expose as much of the underlying
>>> flexibility of
>>> > > Lucene as much as possible.  If it means we use Caleb's suggestion,
>>> I'd ask
>>> > > that the queries that SASI and SAI both support use the same syntax,
>>> even
>>> > > if it means there's two ways of writing the same query.  To use
>>> Caleb's
>>> > > example, this would mean supporting both LIKE and the `expr` column.
>>> > > >>
>>> > > >> Jon
>>> > > >>
>>> > > >>>> On 2023/08/01 19:17:11 Caleb Rackliffe wrote:
>>> > > >>> Here are some additional bits of prior art, if anyone finds them
>>> > > useful:
>>> > > >>>
>>> > > >>>
>>> > > >>> The Stratio Lucene Index -
>>> > > >>> https://github.com/Stratio/cassandra-lucene-index#examples
>>> > > >>>
>>> > > >>> Stratio was the reason C* added the "expr" functionality. They
>>> embedded
>>> > > >>> something similar to ElasticSearch JSON, which probably isn't my
>>> > > favorite
>>> > > >>> choice, but it's there.
>>> > > >>>
>>> > > >>>
>>> > > >>> The ElasticSearch match query syntax -
>>> > > >>>
>>> > >
>>> https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html__;!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAf0MsxZ9$
>>> > > >>>
>>> > > >>> Again, not my favorite. It's verbose, and probably too powerful
>>> for us.
>>> > > >>>
>>> > > >>>
>>> > > >>> ElasticSearch's documentation for the basic Lucene query syntax -
>>> > > >>>
>>> > >
>>> https://urldefense.com/v3/__https://www.elastic.co/guide/en/elasticsearch/reference/8.9/query-dsl-query-string-query.html*query-string-syntax__;Iw!!PbtH5S7Ebw!ZHwYJ2xkivwTzYgjkp5QFAzALXCWPqkga6GBD-m2aK3j06ioSCRPsdZD0CIe50VpRrtW-1rY_m6lrSpp7zVlAXEPP1sK$
>>> > > >>>
>>> > > >>> One idea is to take the basic Lucene index, which it seems we
>>> already
>>> > > have
>>> > > >>> some support for, and feed it to "expr". This is nice for two
>>> reasons:
>>> > > >>>
>>> > > >>> 1.) People can just write Lucene queries if they already know
>>> how.
>>> > > >>> 2.) No changes to the grammar.
>>> > > >>>
>>> > > >>> Lucene has distinct concepts of filtering and querying, and this
>>> is
>>> > > kind of
>>> > > >>> the latter. I'm not sure how, for example, we would want "expr"
>>> to
>>> > > interact
>>> > > >>> w/ filters on other column indexes in vanilla CQL space...
>>> > > >>>
>>> > > >>>
>>> > > >>>> On Mon, Jul 24, 2023 at 9:37 AM Josh McKenzie <
>>> jmcken...@apache.org>
>>> > > wrote:
>>> > > >>>>
>>> > > >>>> `column CONTAINS term`. Contains is used by both Java and
>>> Python for
>>> > > >>>> substring searches, so at least some users will be surprised by
>>> > > term-based
>>> > > >>>> behavior.
>>> > > >>>>
>>> > > >>>> I wonder whether users are in their "programming language"
>>> headspace
>>> > > or in
>>> > > >>>> their "querying a database" headspace when interacting with
>>> CQL? i.e.
>>> > > this
>>> > > >>>> would only present confusion if we expected users to be
>>> thinking in
>>> > > the
>>> > > >>>> idioms of their respective programming languages. If they're
>>> thinking
>>> > > in
>>> > > >>>> terms of SQL, MATCHES would probably end up confusing them a bit
>>> > > since it
>>> > > >>>> doesn't match the general structure of the MATCH operator.
>>> > > >>>>
>>> > > >>>> That said, I also think CONTAINS loses something important that
>>> you
>>> > > allude
>>> > > >>>> to here Jonathan:
>>> > > >>>>
>>> > > >>>> with corresponding query-time tokenization and analysis.  This
>>> means
>>> > > that
>>> > > >>>> the query term is not always a substring of the original string!
>>> > > Besides
>>> > > >>>> obvious transformations like lowercasing, you have things like
>>> > > >>>> PhoneticFilter available as well.
>>> > > >>>>
>>> > > >>>> So to me, neither MATCHES nor CONTAINS are particularly great
>>> > > candidates.
>>> > > >>>>
>>> > > >>>> So +1 to the "I don't actually hate it" sentiment on:
>>> > > >>>>
>>> > > >>>> column : term`. Inspired by Lucene’s syntax
>>> > > >>>>
>>> > > >>>>
>>> > > >>>>> On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
>>> > > >>>>
>>> > > >>>>
>>> > > >>>> I have a strong preference not to use the name of an SQL
>>> operator,
>>> > > since
>>> > > >>>> it precludes us later providing the SQL standard operator to
>>> users.
>>> > > >>>>
>>> > > >>>> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
>>> > > >>>>
>>> > > >>>>
>>> > > >>>>> On 24 Jul 2023, at 13:34, Andrés de la Peña <
>>> adelap...@apache.org>
>>> > > wrote:
>>> > > >>>>
>>> > > >>>> 
>>> > > >>>> `column = term` is definitively problematic because it creates
>>> an
>>> > > >>>> ambiguity when the queried column belongs to the primary key.
>>> For some
>>> > > >>>> queries we wouldn't know whether the user wants a primary key
>>> query
>>> > > using
>>> > > >>>> regular equality or an index query using the analyzer.
>>> > > >>>>
>>> > > >>>> `term_matches(column, term)` seems quite clear and hard to
>>> > > misinterpret,
>>> > > >>>> but it's quite long to write and its implementation will be
>>> > > challenging
>>> > > >>>> since we would need a bunch of special casing around
>>> SelectStatement
>>> > > and
>>> > > >>>> functions.
>>> > > >>>>
>>> > > >>>> LIKE, MATCHES and CONTAINS could be a bit misleading since they
>>> seem
>>> > > to
>>> > > >>>> evoke different behaviours to what they would have.
>>> > > >>>>
>>> > > >>>> `column LIKE :term:` seems a bit redundant compared to just
>>> using
>>> > > `column
>>> > > >>>> : term`, and we are still introducing a new symbol.
>>> > > >>>>
>>> > > >>>> I think I like `column : term` the most, because it's brief,
>>> it's
>>> > > similar
>>> > > >>>> to the equivalent Lucene's syntax, and it doesn't seem to clash
>>> with
>>> > > other
>>> > > >>>> different meanings that I can think of.
>>> > > >>>>
>>> > > >>>>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis <
>>> jbel...@gmail.com>
>>> > > wrote:
>>> > > >>>>
>>> > > >>>> Hi all,
>>> > > >>>>
>>> > > >>>> With phase 1 of SAI wrapping up, I’d like to start the ball
>>> rolling on
>>> > > >>>> aligning around phase 2 features.
>>> > > >>>>
>>> > > >>>> In particular, we need to nail down the syntax for doing
>>> non-exact
>>> > > string
>>> > > >>>> matches.  We have a proof of concept that includes full Lucene
>>> > > analyzer and
>>> > > >>>> filter functionality – just the text transformation pieces,
>>> none of
>>> > > the
>>> > > >>>> storage parts – which is the gold standard in this space.  For
>>> > > example, the
>>> > > >>>> StandardAnalyzer [1] lowercases all terms and removes stopwords
>>> > > (common
>>> > > >>>> words like “a”, “is”, “the” that are usually not useful to
>>> search
>>> > > >>>> against).  Lucene also has classes that offer stemming, special
>>> case
>>> > > >>>> handling for email, and many languages besides English [2].
>>> > > >>>>
>>> > > >>>> What syntax should we use to express “rows whose analyzed
>>> tokens match
>>> > > >>>> this search term?”
>>> > > >>>>
>>> > > >>>> The syntax must be clear that we want to look for this term
>>> within the
>>> > > >>>> column data using the configured index with corresponding
>>> query-time
>>> > > >>>> tokenization and analysis.  This means that the query term is
>>> not
>>> > > always a
>>> > > >>>> substring of the original string!  Besides obvious
>>> transformations
>>> > > like
>>> > > >>>> lowercasing, you have things like PhoneticFilter available as
>>> well.
>>> > > >>>>
>>> > > >>>> Here are my thoughts on some of the options:
>>> > > >>>>
>>> > > >>>> `column = term`.  This is what the POC does today and it’s super
>>> > > confusing
>>> > > >>>> to overload = to mean something other than exact equality.  I
>>> am not
>>> > > a fan.
>>> > > >>>>
>>> > > >>>> `column LIKE term` or `column LIKE %term%`. The closest SQL
>>> operator,
>>> > > but
>>> > > >>>> neither the wildcarded nor unwildcarded syntax matches the
>>> semantics
>>> > > of
>>> > > >>>> term-based search.
>>> > > >>>>
>>> > > >>>> `column MATCHES term`. I rather like this one, although Mike
>>> points
>>> > > out
>>> > > >>>> that “match” has a meaning in the context of regular
>>> expressions that
>>> > > could
>>> > > >>>> cause confusion here.
>>> > > >>>>
>>> > > >>>> `column CONTAINS term`. Contains is used by both Java and
>>> Python for
>>> > > >>>> substring searches, so at least some users will be surprised by
>>> > > term-based
>>> > > >>>> behavior.
>>> > > >>>>
>>> > > >>>> `term_matches(column, term)`. Postgresql FTS makes you use
>>> functions
>>> > > like
>>> > > >>>> this for everything.  It’s pretty clunky, and we would need to
>>> make
>>> > > the
>>> > > >>>> amazingly hairy SelectStatement even hairier to handle “use a
>>> function
>>> > > >>>> result in a predicate” like this.
>>> > > >>>>
>>> > > >>>> `column : term`. Inspired by Lucene’s syntax.  I don’t actually
>>> hate
>>> > > it.
>>> > > >>>>
>>> > > >>>> `column LIKE :term:`. Stick with the LIKE operator but add a new
>>> > > symbol to
>>> > > >>>> indicate term matching.  Arguably more SQL-ish than a new bare
>>> symbol
>>> > > >>>> operator.
>>> > > >>>>
>>> > > >>>> [1]
>>> > > >>>>
>>> > >
>>> https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
>>> > > >>>> [2]
>>> https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>>> > > >>>>
>>> > > >>>> --
>>> > > >>>> Jonathan Ellis
>>> > > >>>> co-founder, http://www.datastax.com
>>> > > >>>> @spyced
>>> > > >>>>
>>> > > >>>>
>>> > > >>>>
>>> > > >>>
>>> > >
>>> >
>>>
>>
>>
>> --
>> [image: DataStax Logo Square] <https://www.datastax.com/> *Mike Adamson*
>> Engineering
>>
>> +1 650 389 6000 <16503896000> | datastax.com <https://www.datastax.com/>
>> Find DataStax Online: [image: LinkedIn Logo]
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.linkedin.com_company_datastax&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=uHzE4WhPViSF0rsjSxKhfwGDU1Bo7USObSc_aIcgelo&s=akx0E6l2bnTjOvA-YxtonbW0M4b6bNg4nRwmcHNDo4Q&e=>
>>    [image: Facebook Logo]
>> <https://urldefense.proofpoint.com/v2/url?u=https-3A__www.facebook.com_datastax&d=DwMFaQ&c=adz96Xi0w1RHqtPMowiL2g&r=IFj3MdIKYLLXIUhYdUGB0cTzTlxyCb7_VUmICBaYilU&m=uHzE4WhPViSF0rsjSxKhfwGDU1Bo7USObSc_aIcgelo&s=ncMlB41-6hHuqx-EhnM83-KVtjMegQ9c2l2zDzHAxiU&e=>
>>    [image: Twitter Logo] <https://twitter.com/DataStax>   [image: RSS
>> Feed] <https://www.datastax.com/blog/rss.xml>   [image: Github Logo]
>> <https://github.com/datastax>
>>
>>
>
> --
> Regards,
>
> Atri
> Apache Concerted
>

Re: Tokenization and SAI query syntax

Reply via email to