Re: [ANNOUNCEMENT] Expect failures today. Dropping JDK 8 and adding JDK 11

2023-07-24 Thread C. Scott Andreas
Ekaterina, thank you for spearheading JDK17 support for Apache Cassandra!
Exciting to get to this point.

- Scott

On Jul 24, 2023, at 7:11 PM, Ekaterina Dimitrova wrote:

> Good news!
> After run #1638-39 you should not see anything else failing than SSLFactory
> test class. This known issue will be fixed by potentially adding  bounty
> castle. More info in CASSANDRA-17992 and this netty PR:
> https://github.com/netty/netty/issues/10317
> We can probably mark the test class with @Ignore, but knowing how easily
> those are forgotten and 17992 being already in review, I prefer not to do
> it.
>
> The only new failure I found in #1636 is a rare flaky test we never saw in
> CircleCI before. (unit tests were running only there; they were not enabled
> in Jenkins until we cleaned them ). Ticket already opened -  CASSANDRA-18685
>
> Last but not least, eclipse-warnings is already removed (it doesn't work
> with post JDK8 versions), but the new static analysis from Checker
> Framework is already in review and soon to land in trunk - CASSANDRA-18239
>
> As usual - if you have any questions or concerns, please do let me know.
> Last but not least - thank you to everyone who helped in one way or another
> with this effort!!
>
> On Mon, 24 Jul 2023 at 16:37, Ekaterina Dimitrova wrote:
>
>> Ninja fix was required for Jenkins, new build started #1636
>>
>> On Mon, 24 Jul 2023 at 15:42, Ekaterina Dimitrova wrote:
>>
>>> Done!
>>>
>>> All commits from 18255 are in.
>>> The first run to monitor will be in Jenkins #1635
>>>
>>> There will be still fixes to be applied for some unit and in-jvm tests
>>> that were pending on the drop but I will do it when I see Jenkins kicking
>>> in this run properly.  (Which are those can be seen in CASSANDRA-16895,
>>> there is a table in its description)
>>>
>>> I will keep you posted on any new developments.
>>>
>>> On Mon, 24 Jul 2023 at 14:52, Ekaterina Dimitrova wrote:
>>>
>>>> Starting commits for 18255. Please put on hold any trunk commits. I will
>>>> let you know when it is done. Thank you
>>>>
>>>> On Mon, 24 Jul 2023 at 11:29, Ekaterina Dimitrova wrote:
>>>>
>>>>> Hi everyone,
>>>>>
>>>>> Happy Monday!
>>>>>
>>>>> I am working on dropping JDK 8 and adding JDK17 on trunk in both CI
>>>>> systems today.
>>>>> This requires numerous patches in a few repos so you will be seeing
>>>>> more failures in CI throughout the day today, but it shouldn’t be
>>>>> anything more 🤞 than what we have listed in the table of failures in
>>>>> CASSANDRA-16895’s description. I will be applying the fixes one by one
>>>>> today.
>>>>> I will keep you posted with updates. Also, please, do let me know if
>>>>> you have any questions or concerns.
>>>>>
>>>>> Best regards,
>>>>> Ekaterina






Re: [ANNOUNCEMENT] Expect failures today. Dropping JDK 8 and adding JDK 11

2023-07-24 Thread Ekaterina Dimitrova
Good news!
After runs #1638-39 you should not see anything failing other than the
SSLFactory test class. This known issue will potentially be fixed by adding
Bouncy Castle. More info in CASSANDRA-17992 and this netty issue:
https://github.com/netty/netty/issues/10317
We can probably mark the test class with @Ignore, but knowing how easily
those are forgotten and with 17992 already in review, I prefer not to do
it.

The only new failure I found in #1636 is a rare flaky test we never saw in
CircleCI before (unit tests were running only there; they were not enabled
in Jenkins until we cleaned them up). Ticket already opened: CASSANDRA-18685

Also, eclipse-warnings is already removed (it doesn't work with post-JDK 8
versions), but the new static analysis from Checker Framework is already in
review and soon to land in trunk: CASSANDRA-18239

As usual, if you have any questions or concerns, please do let me know.
Last but not least, thank you to everyone who helped in one way or another
with this effort!!

On Mon, 24 Jul 2023 at 16:37, Ekaterina Dimitrova 
wrote:

> Ninja fix was required for Jenkins, new build started #1636
>
> On Mon, 24 Jul 2023 at 15:42, Ekaterina Dimitrova 
> wrote:
>
>> Done!
>>
>> All commits from 18255 are in.
>> The first run to monitor will be in Jenkins #1635
>>
>> There will be still fixes to be applied for some unit and in-jvm tests
>> that were pending on the drop but I will do it when I see Jenkins kicking
>> in this run properly.  (Which are those can be seen in CASSANDRA-16895,
>> there is a table in its description)
>>
>> I will keep you posted on any new developments.
>>
>>
>> On Mon, 24 Jul 2023 at 14:52, Ekaterina Dimitrova 
>> wrote:
>>
>>> Starting commits for 18255. Please put on hold any trunk commits. I will
>>> let you know when it is done. Thank you
>>>
>>> On Mon, 24 Jul 2023 at 11:29, Ekaterina Dimitrova 
>>> wrote:
>>>
>>>> Hi everyone,
>>>>
>>>> Happy Monday!
>>>>
>>>> I am working on dropping JDK 8 and adding JDK17 on trunk in both CI
>>>> systems today.
>>>> This requires numerous patches in a few repos so you will be seeing
>>>> more failures in CI throughout the day today, but it shouldn’t be anything
>>>> more 🤞 than what we have listed in the table of failures in
>>>> CASSANDRA-16895’s description. I will be applying the fixes one by one
>>>> today.
>>>> I will keep you posted with updates. Also, please, do let me know if
>>>> you have any questions or concerns.
>>>>
>>>> Best regards,
>>>> Ekaterina





Re: [Discuss] Repair inside C*

2023-07-24 Thread Jaydeep Chovatia
To clarify the timing of the repair solution: the one described in the
article is not a recent development. We were hitting some high-priority
production challenges back in early 2018, and to address them we developed
and rolled out the solution in production within just a few months. It was
developed and productized by Q3 2018 and, of course, has continued to
evolve since. Usually, we first explore existing solutions we can leverage,
but when we started our journey in early 2018, most of the available
solutions were sidecar-based. We have nothing against the sidecar approach;
it was purely a business decision, as we wanted to avoid a dependency on
the control plane. Every solution has its own deep context, merits, and
pros and cons; they are all great solutions!

My appeal to the community members is to think one more time about having
repair in open-source Cassandra itself. As mentioned in my previous
email, we are fine with any solution getting adopted; the important aspect
is to have a repair solution in OSS Cassandra itself!

Yours Faithfully,
Jaydeep

On Mon, Jul 24, 2023 at 3:46 PM Jaydeep Chovatia 
wrote:

> Hi German,
>
> The goal is always to backport our learnings back to the community. For
> example, I have already successfully backported the following two
> enhancements/bug fixes back to the Open Source Cassandra, which are
> described in the article. I am already currently working on open-source a
> few more enhancements mentioned in the article back to the open-source.
>
>    1. https://issues.apache.org/jira/browse/CASSANDRA-18555
>    2. https://issues.apache.org/jira/browse/CASSANDRA-13740
>
> There is definitely heavy interest in having the repair solution inside
> the Open Source Cassandra itself, very much like Compaction. As I write
> this email, we are internally working on a one-pager proposal doc to all
> the community members on having a repair inside the OSS Apache Cassandra
> along with our private fork - I will share it soon.
>
> Generally, we are ok with any solution getting adopted (either Joey's
> solution or our repair solution or any other solution). The primary
> motivation is to have the repair embedded inside the open-source Cassandra
> itself, so we can retire all various privately developed solutions
> eventually :)
>
> I am also happy to help (drive conversation, discussion, etc.) in any way
> to have a repair solution adopted inside Cassandra itself, please let me
> know. Happy to help!
>
> Yours Faithfully,
> Jaydeep
>
> On Mon, Jul 24, 2023 at 1:44 PM German Eichberger via dev <
> dev@cassandra.apache.org> wrote:
>
>> All,
>>
>> We had a brief discussion in [2] about the Uber article [1] where they
>> talk about having integrated repair into Cassandra and how great that is. I
>> expressed my disappointment that they didn't work with the community on
>> that (Uber, if you are listening time to make amends 🙂) and it turns out
>> Joey already had the idea and wrote the code [3] - so I wanted to start a
>> discussion to gauge interest and maybe how to revive that effort.
>>
>> Thanks,
>> German
>>
>> [1]
>> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
>> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
>> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
>>
>


July Contributor Meeting: Topic change

2023-07-24 Thread Hugh Lashbrooke
Hi all,

There has been a last-minute change of topic for the July Contributor
Meeting!

Caleb Rackliffe will be talking about CEP-7: Storage Attached Index, rather
than the previously advertised CEP-15. He'll talk through the history of
CEP-7, look at the current work as it approaches merge into trunk, and take
any questions you may have.

The event will be at the same time and place - that's 25 July @ 10am PT via
the Zoom link on Confluence.

See you all there!

-- 
Hugh Lashbrooke
Director of Community, Constantia.io



Re: [Discuss] Repair inside C*

2023-07-24 Thread Jaydeep Chovatia
Hi German,

The goal is always to backport our learnings to the community. For
example, I have already successfully backported the following two
enhancements/bug fixes, which are described in the article, to open-source
Cassandra. I am currently working on open-sourcing a few more of the
enhancements mentioned in the article.

   1. https://issues.apache.org/jira/browse/CASSANDRA-18555
   2. https://issues.apache.org/jira/browse/CASSANDRA-13740

There is definitely heavy interest in having a repair solution inside
open-source Cassandra itself, very much like compaction. As I write this
email, we are internally working on a one-pager proposal doc for the
community members on having repair inside OSS Apache Cassandra, along
with our private fork - I will share it soon.

Generally, we are OK with any solution getting adopted (either Joey's
solution or our repair solution or any other solution). The primary
motivation is to have repair embedded inside open-source Cassandra
itself, so we can eventually retire all the various privately developed
solutions :)

I am also happy to help (drive conversation, discussion, etc.) in any way
to get a repair solution adopted inside Cassandra itself. Please let me
know. Happy to help!

Yours Faithfully,
Jaydeep

On Mon, Jul 24, 2023 at 1:44 PM German Eichberger via dev <
dev@cassandra.apache.org> wrote:

> All,
>
> We had a brief discussion in [2] about the Uber article [1] where they
> talk about having integrated repair into Cassandra and how great that is. I
> expressed my disappointment that they didn't work with the community on
> that (Uber, if you are listening time to make amends 🙂) and it turns out
> Joey already had the idea and wrote the code [3] - so I wanted to start a
> discussion to gauge interest and maybe how to revive that effort.
>
> Thanks,
> German
>
> [1]
> https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
> [2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
> [3] https://issues.apache.org/jira/browse/CASSANDRA-14346
>


[Discuss] Repair inside C*

2023-07-24 Thread German Eichberger via dev
All,

We had a brief discussion in [2] about the Uber article [1] where they talk
about having integrated repair into Cassandra and how great that is. I
expressed my disappointment that they didn't work with the community on that
(Uber, if you are listening, time to make amends 🙂) and it turns out Joey
already had the idea and wrote the code [3] - so I wanted to start a discussion
to gauge interest and maybe figure out how to revive that effort.

Thanks,
German

[1] https://www.uber.com/blog/how-uber-optimized-cassandra-operations-at-scale/
[2] https://the-asf.slack.com/archives/CK23JSY2K/p1690225062383619
[3] https://issues.apache.org/jira/browse/CASSANDRA-14346


Re: [ANNOUNCEMENT] Expect failures today. Dropping JDK 8 and adding JDK 11

2023-07-24 Thread Ekaterina Dimitrova
Ninja fix was required for Jenkins, new build started #1636

On Mon, 24 Jul 2023 at 15:42, Ekaterina Dimitrova 
wrote:

> Done!
>
> All commits from 18255 are in.
> The first run to monitor will be in Jenkins #1635
>
> There will be still fixes to be applied for some unit and in-jvm tests
> that were pending on the drop but I will do it when I see Jenkins kicking
> in this run properly.  (Which are those can be seen in CASSANDRA-16895,
> there is a table in its description)
>
> I will keep you posted on any new developments.
>
>
> On Mon, 24 Jul 2023 at 14:52, Ekaterina Dimitrova 
> wrote:
>
>> Starting commits for 18255. Please put on hold any trunk commits. I will
>> let you know when it is done. Thank you
>>
>> On Mon, 24 Jul 2023 at 11:29, Ekaterina Dimitrova 
>> wrote:
>>
>>> Hi everyone,
>>>
>>> Happy Monday!
>>>
>>> I am working on dropping JDK 8 and adding JDK17 on trunk in both CI
>>> systems today.
>>> This requires numerous patches in a few repos so you will be seeing more
>>> failures in CI throughout the day today, but it shouldn’t be anything more
>>> 🤞 than what we have listed in the table of failures in CASSANDRA-16895’s
>>> description. I will be applying the fixes one by one today.
>>> I will keep you posted with updates. Also, please, do let me know if you
>>> have any questions or concerns.
>>>
>>> Best regards,
>>> Ekaterina
>>>
>>>
>>>


Re: [ANNOUNCEMENT] Expect failures today. Dropping JDK 8 and adding JDK 11

2023-07-24 Thread Ekaterina Dimitrova
Done!

All commits from 18255 are in.
The first run to monitor will be in Jenkins #1635

There will be still fixes to be applied for some unit and in-jvm tests that
were pending on the drop but I will do it when I see Jenkins kicking in
this run properly.  (Which are those can be seen in CASSANDRA-16895, there
is a table in its description)

I will keep you posted on any new developments.


On Mon, 24 Jul 2023 at 14:52, Ekaterina Dimitrova 
wrote:

> Starting commits for 18255. Please put on hold any trunk commits. I will
> let you know when it is done. Thank you
>
> On Mon, 24 Jul 2023 at 11:29, Ekaterina Dimitrova 
> wrote:
>
>> Hi everyone,
>>
>> Happy Monday!
>>
>> I am working on dropping JDK 8 and adding JDK17 on trunk in both CI
>> systems today.
>> This requires numerous patches in a few repos so you will be seeing more
>> failures in CI throughout the day today, but it shouldn’t be anything more
>> 🤞 than what we have listed in the table of failures in CASSANDRA-16895’s
>> description. I will be applying the fixes one by one today.
>> I will keep you posted with updates. Also, please, do let me know if you
>> have any questions or concerns.
>>
>> Best regards,
>> Ekaterina
>>
>>
>>


Re: [VOTE] CEP-34: mTLS based client and internode authenticators

2023-07-24 Thread Jyothsna Konisa
Thank you everyone! Voting passes with 8 +1s and no -1. Closing this thread
now.
Jyothsna Konisa.

On Sat, Jul 22, 2023 at 3:14 AM Brandon Williams  wrote:

> +1
>
> Kind Regards,
> Brandon
>
> On Fri, Jul 21, 2023 at 11:58 AM Jyothsna Konisa 
> wrote:
> >
> > Hi Everyone!
> >
> > I would like to start a vote thread for CEP-34.
> >
> > Proposal:
> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-34%3A+mTLS+based+client+and+internode+authenticators
> > JIRA:
> > https://issues.apache.org/jira/browse/CASSANDRA-18554
> > Draft Implementation: https://github.com/apache/cassandra/pull/2372
> > Discussion:
> > https://lists.apache.org/thread/pnfg65r76rbbs70hwhsz94ds6yo2042f
> >
> > The vote will be open for 72 hours. A vote passes if there are at least
> > 3 binding +1s and no binding vetoes.
> >
> > Thanks,
> > Jyothsna Konisa.
>


Re: [ANNOUNCEMENT] Expect failures today. Dropping JDK 8 and adding JDK 11

2023-07-24 Thread Ekaterina Dimitrova
Starting commits for 18255. Please put on hold any trunk commits. I will
let you know when it is done. Thank you

On Mon, 24 Jul 2023 at 11:29, Ekaterina Dimitrova 
wrote:

> Hi everyone,
>
> Happy Monday!
>
> I am working on dropping JDK 8 and adding JDK17 on trunk in both CI
> systems today.
> This requires numerous patches in a few repos so you will be seeing more
> failures in CI throughout the day today, but it shouldn’t be anything more
> 🤞 than what we have listed in the table of failures in CASSANDRA-16895’s
> description. I will be applying the fixes one by one today.
> I will keep you posted with updates. Also, please, do let me know if you
> have any questions or concerns.
>
> Best regards,
> Ekaterina
>
>
>


[RELEASE] Apache Cassandra 4.1.3 released

2023-07-24 Thread Miklosovic, Stefan
The Cassandra team is pleased to announce the release of Apache Cassandra 
version 4.1.3.

Apache Cassandra is a fully distributed database. It is the right choice when 
you need scalability and high availability without compromising performance.

 http://cassandra.apache.org/

Downloads of source and binary distributions are listed in our download section:

 http://cassandra.apache.org/download/

This version is a bug fix release[1] on the 4.1 series. As always, please pay
attention to the release notes[2] and let us know[3] if you encounter any
problems.

[WARNING] Debian and RedHat package repositories have moved! Debian 
/etc/apt/sources.list.d/cassandra.sources.list and RedHat 
/etc/yum.repos.d/cassandra.repo files must be updated to the new repository 
URLs. For Debian it is now https://debian.cassandra.apache.org . For RedHat it 
is now https://redhat.cassandra.apache.org/41x/ .

Enjoy!

[1]: CHANGES.txt 
https://github.com/apache/cassandra/blob/cassandra-4.1.3/CHANGES.txt
[2]: NEWS.txt https://github.com/apache/cassandra/blob/cassandra-4.1.3/NEWS.txt
[3]: https://issues.apache.org/jira/browse/CASSANDRA

[ANNOUNCEMENT] Expect failures today. Dropping JDK 8 and adding JDK 11

2023-07-24 Thread Ekaterina Dimitrova
Hi everyone,

Happy Monday!

I am working on dropping JDK 8 and adding JDK17 on trunk in both CI systems
today.
This requires numerous patches in a few repos so you will be seeing more
failures in CI throughout the day today, but it shouldn’t be anything more
🤞 than what we have listed in the table of failures in CASSANDRA-16895’s
description. I will be applying the fixes one by one today.
I will keep you posted with updates. Also, please, do let me know if you
have any questions or concerns.

Best regards,
Ekaterina


Re: Tokenization and SAI query syntax

2023-07-24 Thread Josh McKenzie
> `column CONTAINS term`. Contains is used by both Java and Python for 
> substring searches, so at least some users will be surprised by term-based 
> behavior.
I wonder whether users are in their "programming language" headspace or in 
their "querying a database" headspace when interacting with CQL? i.e. this 
would only present confusion if we expected users to be thinking in the idioms 
of their respective programming languages. If they're thinking in terms of SQL, 
MATCHES would probably end up confusing them a bit since it doesn't match the 
general structure of the MATCH operator.

That said, I also think CONTAINS loses something important that you allude to 
here Jonathan:
> with corresponding query-time tokenization and analysis.  This means that the 
> query term is not always a substring of the original string!  Besides obvious 
> transformations like lowercasing, you have things like PhoneticFilter 
> available as well.
So to me, neither MATCHES nor CONTAINS is a particularly great candidate.

So +1 to the "I don't actually hate it" sentiment on:
> `column : term`. Inspired by Lucene’s syntax

On Mon, Jul 24, 2023, at 8:35 AM, Benedict wrote:
> 
> I have a strong preference not to use the name of an SQL operator, since it 
> precludes us later providing the SQL standard operator to users.
> 
> What about CONTAINS TOKEN term? Or CONTAINS TERM term?
> 
> 
>> On 24 Jul 2023, at 13:34, Andrés de la Peña  wrote:
>> 
>> `column = term` is definitively problematic because it creates an ambiguity 
>> when the queried column belongs to the primary key. For some queries we 
>> wouldn't know whether the user wants a primary key query using regular 
>> equality or an index query using the analyzer.
>> 
>> `term_matches(column, term)` seems quite clear and hard to misinterpret, but 
>> it's quite long to write and its implementation will be challenging since we 
>> would need a bunch of special casing around SelectStatement and functions.
>> 
>> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to 
>> evoke different behaviours to what they would have.
>> 
>> `column LIKE :term:` seems a bit redundant compared to just using `column : 
>> term`, and we are still introducing a new symbol.
>> 
>> I think I like `column : term` the most, because it's brief, it's similar to 
>> the equivalent Lucene's syntax, and it doesn't seem to clash with other 
>> different meanings that I can think of.
>> 
>> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis  wrote:
>>> Hi all,
>>> 
>>> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on 
>>> aligning around phase 2 features.
>>> 
>>> In particular, we need to nail down the syntax for doing non-exact string 
>>> matches.  We have a proof of concept that includes full Lucene analyzer and 
>>> filter functionality – just the text transformation pieces, none of the 
>>> storage parts – which is the gold standard in this space.  For example, the 
>>> StandardAnalyzer [1] lowercases all terms and removes stopwords (common 
>>> words like “a”, “is”, “the” that are usually not useful to search against). 
>>>  Lucene also has classes that offer stemming, special case handling for 
>>> email, and many languages besides English [2].
>>> 
>>> What syntax should we use to express “rows whose analyzed tokens match this 
>>> search term?”
>>> 
>>> The syntax must be clear that we want to look for this term within the 
>>> column data using the configured index with corresponding query-time 
>>> tokenization and analysis.  This means that the query term is not always a 
>>> substring of the original string!  Besides obvious transformations like 
>>> lowercasing, you have things like PhoneticFilter available as well.
>>> 
>>> Here are my thoughts on some of the options:
>>> 
>>> `column = term`.  This is what the POC does today and it’s super confusing 
>>> to overload = to mean something other than exact equality.  I am not a fan.
>>> 
>>> `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but 
>>> neither the wildcarded nor unwildcarded syntax matches the semantics of 
>>> term-based search.
>>> 
>>> `column MATCHES term`. I rather like this one, although Mike points out 
>>> that “match” has a meaning in the context of regular expressions that could 
>>> cause confusion here.
>>> 
>>> `column CONTAINS term`. Contains is used by both Java and Python for 
>>> substring searches, so at least some users will be surprised by term-based 
>>> behavior.
>>> 
>>> `term_matches(column, term)`. Postgresql FTS makes you use functions like 
>>> this for everything.  It’s pretty clunky, and we would need to make the 
>>> amazingly hairy SelectStatement even hairier to handle “use a function 
>>> result in a predicate” like this.
>>> 
>>> `column : term`. Inspired by Lucene’s syntax.  I don’t actually hate it.
>>> 
>>> `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to 
>>> indicate term matching.  Arguably more SQL-ish than 

Re: Tokenization and SAI query syntax

2023-07-24 Thread Benedict
I have a strong preference not to use the name of an SQL operator, since it
precludes us later providing the SQL standard operator to users.

What about CONTAINS TOKEN term? Or CONTAINS TERM term?

On 24 Jul 2023, at 13:34, Andrés de la Peña wrote:

> `column = term` is definitively problematic because it creates an ambiguity
> when the queried column belongs to the primary key. For some queries we
> wouldn't know whether the user wants a primary key query using regular
> equality or an index query using the analyzer.
>
> `term_matches(column, term)` seems quite clear and hard to misinterpret,
> but it's quite long to write and its implementation will be challenging
> since we would need a bunch of special casing around SelectStatement and
> functions.
>
> LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to
> evoke different behaviours to what they would have.
>
> `column LIKE :term:` seems a bit redundant compared to just using `column :
> term`, and we are still introducing a new symbol.
>
> I think I like `column : term` the most, because it's brief, it's similar
> to the equivalent Lucene's syntax, and it doesn't seem to clash with other
> different meanings that I can think of.
>
> On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis wrote:
>
>> Hi all,
>>
>> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
>> aligning around phase 2 features.
>>
>> In particular, we need to nail down the syntax for doing non-exact string
>> matches.  We have a proof of concept that includes full Lucene analyzer and
>> filter functionality – just the text transformation pieces, none of the
>> storage parts – which is the gold standard in this space.  For example, the
>> StandardAnalyzer [1] lowercases all terms and removes stopwords (common
>> words like “a”, “is”, “the” that are usually not useful to search
>> against).  Lucene also has classes that offer stemming, special case
>> handling for email, and many languages besides English [2].
>>
>> What syntax should we use to express “rows whose analyzed tokens match
>> this search term?”
>>
>> The syntax must be clear that we want to look for this term within the
>> column data using the configured index with corresponding query-time
>> tokenization and analysis.  This means that the query term is not always a
>> substring of the original string!  Besides obvious transformations like
>> lowercasing, you have things like PhoneticFilter available as well.
>>
>> Here are my thoughts on some of the options:
>>
>> `column = term`.  This is what the POC does today and it’s super confusing
>> to overload = to mean something other than exact equality.  I am not a fan.
>>
>> `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but
>> neither the wildcarded nor unwildcarded syntax matches the semantics of
>> term-based search.
>>
>> `column MATCHES term`. I rather like this one, although Mike points out
>> that “match” has a meaning in the context of regular expressions that could
>> cause confusion here.
>>
>> `column CONTAINS term`. Contains is used by both Java and Python for
>> substring searches, so at least some users will be surprised by term-based
>> behavior.
>>
>> `term_matches(column, term)`. Postgresql FTS makes you use functions like
>> this for everything.  It’s pretty clunky, and we would need to make the
>> amazingly hairy SelectStatement even hairier to handle “use a function
>> result in a predicate” like this.
>>
>> `column : term`. Inspired by Lucene’s syntax.  I don’t actually hate it.
>>
>> `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to
>> indicate term matching.  Arguably more SQL-ish than a new bare symbol
>> operator.
>>
>> [1]
>> https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
>> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced



Re: Tokenization and SAI query syntax

2023-07-24 Thread Andrés de la Peña
`column = term` is definitively problematic because it creates an ambiguity
when the queried column belongs to the primary key. For some queries we
wouldn't know whether the user wants a primary key query using regular
equality or an index query using the analyzer.

`term_matches(column, term)` seems quite clear and hard to misinterpret,
but it's quite long to write and its implementation will be challenging
since we would need a bunch of special casing around SelectStatement and
functions.

LIKE, MATCHES and CONTAINS could be a bit misleading since they seem to
evoke different behaviours to what they would have.

`column LIKE :term:` seems a bit redundant compared to just using `column :
term`, and we are still introducing a new symbol.

I think I like `column : term` the most, because it's brief, it's similar
to the equivalent Lucene's syntax, and it doesn't seem to clash with other
different meanings that I can think of.

On Mon, 24 Jul 2023 at 13:13, Jonathan Ellis  wrote:

> Hi all,
>
> With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
> aligning around phase 2 features.
>
> In particular, we need to nail down the syntax for doing non-exact string
> matches.  We have a proof of concept that includes full Lucene analyzer and
> filter functionality – just the text transformation pieces, none of the
> storage parts – which is the gold standard in this space.  For example, the
> StandardAnalyzer [1] lowercases all terms and removes stopwords (common
> words like “a”, “is”, “the” that are usually not useful to search
> against).  Lucene also has classes that offer stemming, special case
> handling for email, and many languages besides English [2].
>
> What syntax should we use to express “rows whose analyzed tokens match
> this search term?”
>
> The syntax must be clear that we want to look for this term within the
> column data using the configured index with corresponding query-time
> tokenization and analysis.  This means that the query term is not always a
> substring of the original string!  Besides obvious transformations like
> lowercasing, you have things like PhoneticFilter available as well.
>
> Here are my thoughts on some of the options:
>
> `column = term`.  This is what the POC does today and it’s super confusing
> to overload = to mean something other than exact equality.  I am not a fan.
>
> `column LIKE term` or `column LIKE %term%`. The closest SQL operator, but
> neither the wildcarded nor unwildcarded syntax matches the semantics of
> term-based search.
>
> `column MATCHES term`. I rather like this one, although Mike points out
> that “match” has a meaning in the context of regular expressions that could
> cause confusion here.
>
> `column CONTAINS term`. Contains is used by both Java and Python for
> substring searches, so at least some users will be surprised by term-based
> behavior.
>
> `term_matches(column, term)`. Postgresql FTS makes you use functions like
> this for everything.  It’s pretty clunky, and we would need to make the
> amazingly hairy SelectStatement even hairier to handle “use a function
> result in a predicate” like this.
>
> `column : term`. Inspired by Lucene’s syntax.  I don’t actually hate it.
>
> `column LIKE :term:`. Stick with the LIKE operator but add a new symbol to
> indicate term matching.  Arguably more SQL-ish than a new bare symbol
> operator.
>
> [1]
> https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
> [2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


Tokenization and SAI query syntax

2023-07-24 Thread Jonathan Ellis
Hi all,

With phase 1 of SAI wrapping up, I’d like to start the ball rolling on
aligning around phase 2 features.

In particular, we need to nail down the syntax for doing non-exact string
matches.  We have a proof of concept that includes full Lucene analyzer and
filter functionality – just the text transformation pieces, none of the
storage parts – which is the gold standard in this space.  For example, the
StandardAnalyzer [1] lowercases all terms and removes stopwords (common
words like “a”, “is”, “the” that are usually not useful to search
against).  Lucene also has classes that offer stemming, special case
handling for email, and many languages besides English [2].

What syntax should we use to express “rows whose analyzed tokens match this
search term?”

The syntax must be clear that we want to look for this term within the
column data using the configured index with corresponding query-time
tokenization and analysis.  This means that the query term is not always a
substring of the original string!  Besides obvious transformations like
lowercasing, you have things like PhoneticFilter available as well.
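
To make the "not always a substring" point concrete, here is a toy sketch in
Python. It is not Lucene and not the SAI proof of concept: the stopword list,
the crude suffix stripping, and the function names (including this
`term_matches`) are all made up for illustration. It only shows how analyzing
both the column value and the query term can produce a match where a plain
substring check would not.

```python
# Toy analyzer: NOT Lucene and NOT the SAI implementation, illustration only.
import re

# Only the three example stopwords from the text; real lists are longer.
STOPWORDS = {"a", "is", "the"}


def crude_stem(token):
    """Rough stand-in for a real stemmer: strip a few common suffixes."""
    for suffix in ("ing", "es", "s"):
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token


def analyze(text):
    """Lowercase, split on non-alphanumerics, drop stopwords, then stem."""
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [crude_stem(t) for t in tokens if t not in STOPWORDS]


def term_matches(column_value, query):
    """True if every analyzed query token is among the column's analyzed tokens."""
    indexed = set(analyze(column_value))
    return all(t in indexed for t in analyze(query))


row = "The quick brown fox is jumping over fences"
print(analyze(row))                # tokens after analysis
print(term_matches(row, "JUMPS"))  # True: both sides analyze to "jump"
print("JUMPS" in row)              # False: not a substring of the original
```

Whatever syntax we pick has to signal that this kind of analyzed comparison is
happening, rather than equality or substring containment.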

Here are my thoughts on some of the options:

`column = term`.  This is what the POC does today and it’s super confusing
to overload = to mean something other than exact equality.  I am not a fan.

`column LIKE term` or `column LIKE %term%`. The closest SQL operator, but
neither the wildcarded nor unwildcarded syntax matches the semantics of
term-based search.

`column MATCHES term`. I rather like this one, although Mike points out
that “match” has a meaning in the context of regular expressions that could
cause confusion here.

`column CONTAINS term`. Contains is used by both Java and Python for
substring searches, so at least some users will be surprised by term-based
behavior.

`term_matches(column, term)`. Postgresql FTS makes you use functions like
this for everything.  It’s pretty clunky, and we would need to make the
amazingly hairy SelectStatement even hairier to handle “use a function
result in a predicate” like this.

`column : term`. Inspired by Lucene’s syntax.  I don’t actually hate it.

`column LIKE :term:`. Stick with the LIKE operator but add a new symbol to
indicate term matching.  Arguably more SQL-ish than a new bare symbol
operator.

[1]
https://lucene.apache.org/core/9_7_0/core/org/apache/lucene/analysis/standard/StandardAnalyzer.html
[2] https://lucene.apache.org/core/9_7_0/analysis/common/index.html

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


[RESULT][VOTE] Release Apache Cassandra 4.1.3

2023-07-24 Thread Miklosovic, Stefan
The vote passes with three binding and one non-binding +1s.

https://lists.apache.org/thread/8ot3wjc88k0rhx1m9m58k0bp4msbjw6w

[DISCUSS] Tiered Storage

2023-07-24 Thread Claude Warren, Jr via dev
I have been thinking about tiered storage wherein infrequently used data
can be moved off to slow (cold) storage (like S3).  I think that CEP-17 in
conjunction with CEP-21 provides an opportunity for an interesting approach.

As I understand it, CEP-17 clarified the SSTable interface(s) so that
alternative implementations are possible, most notably CEP-25 (trie format
sstables).  CEP-21 provides a mechanism by which specific primary key
blocks can be assigned to specific servers.

It seems to me that we could implement an SSTable format that reads/writes
S3 storage and then use CEP-21 to direct specific keys to servers that
implement that storage.

I use primary key because I don't think we can reasonably partition the
records onto cold storage using any other method.
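
As a rough sketch of what I mean by directing keys to tiers (Python, purely
hypothetical: the range boundaries, tier names, and the `tier_for_key`
function are invented here and are not CEP-21's API), the routing could be as
simple as a sorted lookup from primary key ranges to a storage tier:

```python
import bisect

# Hypothetical placement table, invented for illustration; not CEP-21's API.
# Sorted inclusive upper bounds of key ranges and the tier serving each range.
RANGE_UPPER_BOUNDS = [1000, 5000, 10000]
RANGE_TIERS = ["hot-local-disk", "hot-local-disk", "cold-s3"]


def tier_for_key(key):
    """Return the tier whose range (previous bound, bound] contains key."""
    idx = bisect.bisect_left(RANGE_UPPER_BOUNDS, key)
    if idx == len(RANGE_UPPER_BOUNDS):
        raise KeyError(f"key {key} is outside all configured ranges")
    return RANGE_TIERS[idx]


print(tier_for_key(42))    # falls in the first (hot) range
print(tier_for_key(7500))  # falls in the last (cold) range
```

A real design would have to work on partition tokens rather than raw integers
and would need a way to move ranges between tiers, which is where I expect
most of the difficulty to be.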

I think that records on the cold storage may be deleted, and may be updated
but both operations may take significant time and would require compaction
to be run at some point.  I expect that compaction would be very slow.

I am certain there are issues with this approach and am looking for
feedback before progressing an architecture proposal.

Thanks,
Claude