TAC Applications for Community Over Code North America and Asia now open

2023-06-16 Thread Gavin McDonald
Hi All,

(This email goes out to all our user and dev project mailing lists, so you
may receive this email more than once.)

The Travel Assistance Committee has opened up applications to help get
people to the following events:


*Community Over Code Asia 2023 - *
*August 18th to August 20th in Beijing, China*

Applications for this event close on the 6th of July, so time is short;
please apply as soon as possible. TAC is prioritising applications from the
Asia and Oceania regions.

More details on this event can be found at:
https://apachecon.com/acasia2023/

For more information on how to apply, please read: https://tac.apache.org/


*Community Over Code North America - *
*October 7th to October 10th in Halifax, Canada*

Applications for this event close on the 22nd of July. We expect many
applications, so please do apply as soon as you can. TAC is prioritising
applications from the North and South America regions.

More details on this event can be found at: https://communityovercode.org/

For more information on how to apply, please read: https://tac.apache.org/


*Have you applied to be a Speaker?*

If you have applied or intend to apply as a Speaker at either of these
events, and think you may require assistance for Travel and/or
Accommodation, TAC advises that you do not wait until you have been notified
of your speaker status, and that you apply early. Should you not be accepted
as a speaker and still wish to attend, you can amend your application to
include Conference fees, or you may withdraw your application.

The call for presentations for Halifax is here:
https://communityovercode.org/call-for-presentations/
and you have until the 13th of July to apply.

The call for presentations for Beijing is here:
https://apachecon.com/acasia2023/cfp.html
and you have until the 18th June to apply.

*IMPORTANT Note on Visas:*

It is important that you apply for a Visa as soon as possible. Do not wait
until you know whether you have been accepted for Travel Assistance: due to
current wait times for Interviews in some Countries, waiting that long may
be too late, so please apply for a Visa right away. Contact
tac-ap...@tac.apache.org if you need any more information or assistance in
this area.

*Spread the Word!!*

TAC encourages you to spread the word about Travel Assistance to get to
these events, so feel free to repost as you see fit on Social Media, at
work, schools, universities, etc.

Thank You and hope to see you all soon

Gavin McDonald on behalf of the ASF Travel Assistance Committee.


Re: [VOTE] CEP-8 Datastax Drivers Donation

2023-06-16 Thread Aleksey Yeshchenko
+1

> On 15 Jun 2023, at 15:19, Chris Lohfink  wrote:
> 
> +1
> 
> On Wed, Jun 14, 2023 at 9:05 PM Jon Haddad wrote:
>> +1
>> 
>> On 2023/06/13 14:14:35 Jeremy Hanna wrote:
>> > Calling for a vote on CEP-8 [1].
>> > 
>> > To clarify the intent, as Benjamin said in the discussion thread [2], the 
>> > goal of this vote is simply to ensure that the community is in favor of 
>> > the donation. Nothing more.
>> > The plan is to introduce the drivers, one by one. Each driver donation 
>> > will need to be accepted first by the PMC members, as it is the case for 
>> > any donation. Therefore the PMC should have full control on the pace at 
>> > which new drivers are accepted.
>> > 
>> > If this vote passes, we can start this process for the Java driver under 
>> > the direction of the PMC.
>> > 
>> > Jeremy
>> > 
>> > 1. 
>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation
>> > 2. https://lists.apache.org/thread/opt630do09phh7hlt28odztxdv6g58dp



Re: [DISCUSSIONS] Replace ant eclipse-warnings with CheckerFramework

2023-06-16 Thread Aleksey Yeshchenko
Sounds like a clear improvement to me. If I’m remembering correctly, this 
check only once flagged a legitimate issue I had missed. All other instances 
have just been annoyances, forcing us to add a redundant suppression annotation. 

> On 15 Jun 2023, at 19:01, Ekaterina Dimitrova  wrote:
> 
> Hi everyone,
> Happy Thursday!
> Some time ago, Jacek raised the point that ant eclipse-warnings is generating 
> too many false positives and not really working as expected. (CASSANDRA-18239)
> Reminder: ant eclipse-warnings is a task we run with the goal to check 
> Cassandra code - static analysis to warn on unsafe use of AutoCloseable 
> instances; it checks against two related compiler options.
> While trying to upgrade the ECJ compiler that we use for this task 
> (CASSANDRA-18190) so we can switch the task from running it with JDK8 to 
> JDK11 in preparation for dropping JDK8, I hit the following issues:
> - The latest version of ECJ is throwing more than 300 Potential Resource Leak 
> warnings. I looked at 10-15, and they were all false positives. 
> - Even if we file a bug report to the Eclipse community, JDK11 support is 
> about to be removed with the next version of the compiler.
> 
> So I shared this information with Jacek. He came up with a different solution:
> It seems we already pull in CheckerFramework through Guava, with an MIT 
> license, which appears to be acceptable according to this link -  
> https://www.apache.org/legal/resolved.html#category-a
> He already has an initial integration with Cassandra which shows the 
> following:
> - CheckerFramework does not understand @SuppressWarnings("resource") (there 
> is a different annotation to be used), so it is immediately visible that it 
> does not report all those false positives that eclipse-warnings does. On the 
> flip side, the feedback I got is that what it has flagged so far is something 
> we should investigate.
> - Also, there are additional annotations like @Owning that let you fix many 
> problems at once because the tool understands that the ownership of the 
> resources was passed to another entity; It also enables you to do something 
> impossible with eclipse-warnings - you can tell the tool that there is 
> another method that needs to be called to release the resources, like 
> release, free, disconnect, etc.
> - the tool works with JDK8, JDK11, JDK17, and JDK20, so we can backport it 
> even to older branches (while at the same time keeping eclipse-warnings there)
> - It takes 8 minutes to run, though, so we should not run it with every test. 
> Some reorganization of the ant tasks will be covered, since even for 
> eclipse-warnings it was odd to call it on every single test run locally by 
> default.
> 
> 
> If there are no concerns, we will continue replacing ant eclipse-warnings 
> with the CheckerFramework as part of CASSANDRA-18239 and CASSANDRA-18190 in 
> trunk.
> Best regards,
> Ekaterina
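
The ownership-transfer idea described in the quoted proposal can be pictured with a minimal Java sketch. This is only an illustration, not code from the Cassandra patch: Buffer, BufferHolder and OwnershipSketch are hypothetical names, and the two annotations are local stand-ins declared so that the snippet compiles without the Checker Framework on the classpath. With the real tool one would import org.checkerframework.checker.mustcall.qual.Owning and org.checkerframework.checker.mustcall.qual.InheritableMustCall instead, and the checker may require further annotations (for example @EnsuresCalledMethods on the release method) before it accepts the class.

```java
import java.lang.annotation.ElementType;
import java.lang.annotation.Target;

// Stand-in annotations (assumption: they only mimic the names of the real
// Checker Framework annotations so that this sketch is self-contained).
@Target({ElementType.PARAMETER, ElementType.FIELD})
@interface Owning {}

@Target(ElementType.TYPE)
@interface InheritableMustCall {
    String[] value();
}

// Hypothetical resource whose cleanup method is release(), not close().
@InheritableMustCall("release")
class Buffer {
    private boolean released = false;

    void release() {
        released = true;
    }

    boolean isReleased() {
        return released;
    }
}

// Hypothetical holder that takes over the obligation to release the buffer.
@InheritableMustCall("release")
class BufferHolder {
    private final @Owning Buffer buffer;

    // @Owning on the parameter documents that the caller hands over ownership.
    BufferHolder(@Owning Buffer buffer) {
        this.buffer = buffer;
    }

    // Releasing the holder discharges the obligation for the wrapped buffer.
    void release() {
        buffer.release();
    }
}

class OwnershipSketch {
    static Buffer demo() {
        Buffer b = new Buffer();
        BufferHolder holder = new BufferHolder(b); // ownership moves to the holder
        holder.release();                          // the holder releases the buffer
        return b;
    }

    public static void main(String[] args) {
        System.out.println(demo().isReleased());
    }
}
```

The point of the sketch is that the caller of new BufferHolder(b) no longer needs its own suppression annotation: the obligation is expressed once, on the type and on the owning parameter.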



Re: [DISCUSSIONS] Replace ant eclipse-warnings with CheckerFramework

2023-06-16 Thread Jacek Lewandowski
An additional question is whether we want to run the checks against the whole
project or just against the files changed between the feature branch and the
target release branch.


- - -- --- -  -
Jacek Lewandowski


On Fri, 16 Jun 2023 at 13:09, Aleksey Yeshchenko wrote:

> Sounds like a clear improvement to me. Only once this check flagged a
> legitimate issue I missed, if I’m remembering correctly. All other
> instances have just been annoyances, forcing to add a redundant suppressed
> annotation.
>
> On 15 Jun 2023, at 19:01, Ekaterina Dimitrova 
> wrote:
>
> Hi everyone,
> Happy Thursday!
> Some time ago, Jacek raised the point that ant eclipse-warnings is generating 
> too many false positives and not really working as expected. (CASSANDRA-18239)
>
> Reminder: ant eclipse-warnings is a task we run with the goal to check 
> Cassandra code - static analysis to warn on unsafe use of AutoCloseable 
> instances; checks against two related particular compiler options
>
> While trying to upgrade ECJ compiler that we use for this task 
> (CASSANDRA-18190) so we can switch the task from running it with JDK8 to 
> JDK11 in preparation for dropping JDK8, I hit the following issues:
> - the latest version of ECJ is throwing more than 300 Potential Resource Leak 
> warnings. I looked at 10-15, and they were all false positives.
> - Even if we file a bug report to the Eclipse community, JDK11 is about to be 
> removed with the next version of the compiler
>
> So I shared this information with Jacek. He came up with a different solution:
> It seems we already pull through Guava CheckerFramework with an MIT license, 
> which appears to be acceptable according to this link -  
> https://www.apache.org/legal/resolved.html#category-a
> He already has an initial integration with Cassandra which shows the 
> following:
> - CheckerFramework does not understand the @SuppressWarnings("resource") 
> (there is a different one to be used), so it is immediately visible how it 
> does not report all those false positives that eclipse-warnings does. On the 
> flip side, I got the feedback that what it has witnessed so far is something 
> we should investigate.
> - Also, there are additional annotations like @Owning that let you fix many 
> problems at once because the tool understands that the ownership of the 
> resources was passed to another entity; It also enables you to do something 
> impossible with eclipse-warnings - you can tell the tool that there is 
> another method that needs to be called to release the resources, like 
> release, free, disconnect, etc.
> - the tool works with JDK8, JDK11, JDK17, and JDK20, so we can backport it 
> even to older branches (while at the same time keeping eclipse-warnings there)
> - though it runs 8 minutes so, we should not run it with every test, some 
> reorganization around ant tasks will be covered as even for eclipse-warnings 
> it was weird to call it on every single test run locally by default
>
>
> If there are no concerns, we will continue replacing ant eclipse-warnings 
> with the CheckerFramework as part of CASSANDRA-18239 and CASSANDRA-18190 in 
> trunk.
>
> Best regards,
>
> Ekaterina
>
>
>


Re: [DISCUSSIONS] Replace ant eclipse-warnings with CheckerFramework

2023-06-16 Thread Ekaterina Dimitrova
I think this is a great idea and it will probably reduce the time to run
it. Thank you!

On Fri, 16 Jun 2023 at 7:40, Jacek Lewandowski 
wrote:

> Additional question is whether we want to run the checks against the whole
> project or just against the file changes between the feature branch and the
> target release branch?
>
>
> - - -- --- -  -
> Jacek Lewandowski
>
>
> On Fri, 16 Jun 2023 at 13:09, Aleksey Yeshchenko wrote:
>
>> Sounds like a clear improvement to me. Only once this check flagged a
>> legitimate issue I missed, if I’m remembering correctly. All other
>> instances have just been annoyances, forcing to add a redundant suppressed
>> annotation.
>>
>> On 15 Jun 2023, at 19:01, Ekaterina Dimitrova 
>> wrote:
>>
>> Hi everyone,
>> Happy Thursday!
>> Some time ago, Jacek raised the point that ant eclipse-warnings is 
>> generating too many false positives and not really working as expected. 
>> (CASSANDRA-18239)
>>
>> Reminder: ant eclipse-warnings is a task we run with the goal to check 
>> Cassandra code - static analysis to warn on unsafe use of AutoCloseable 
>> instances; checks against two related particular compiler options
>>
>> While trying to upgrade ECJ compiler that we use for this task 
>> (CASSANDRA-18190) so we can switch the task from running it with JDK8 to 
>> JDK11 in preparation for dropping JDK8, I hit the following issues:
>> - the latest version of ECJ is throwing more than 300 Potential Resource 
>> Leak warnings. I looked at 10-15, and they were all false positives.
>> - Even if we file a bug report to the Eclipse community, JDK11 is about to 
>> be removed with the next version of the compiler
>>
>> So I shared this information with Jacek. He came up with a different 
>> solution:
>> It seems we already pull through Guava CheckerFramework with an MIT license, 
>> which appears to be acceptable according to this link -  
>> https://www.apache.org/legal/resolved.html#category-a
>> He already has an initial integration with Cassandra which shows the 
>> following:
>> - CheckerFramework does not understand the @SuppressWarnings("resource") 
>> (there is a different one to be used), so it is immediately visible how it 
>> does not report all those false positives that eclipse-warnings does. On the 
>> flip side, I got the feedback that what it has witnessed so far is something 
>> we should investigate.
>> - Also, there are additional annotations like @Owning that let you fix many 
>> problems at once because the tool understands that the ownership of the 
>> resources was passed to another entity; It also enables you to do something 
>> impossible with eclipse-warnings - you can tell the tool that there is 
>> another method that needs to be called to release the resources, like 
>> release, free, disconnect, etc.
>> - the tool works with JDK8, JDK11, JDK17, and JDK20, so we can backport it 
>> even to older branches (while at the same time keeping eclipse-warnings 
>> there)
>> - though it runs 8 minutes so, we should not run it with every test, some 
>> reorganization around ant tasks will be covered as even for eclipse-warnings 
>> it was weird to call it on every single test run locally by default
>>
>>
>> If there are no concerns, we will continue replacing ant eclipse-warnings 
>> with the CheckerFramework as part of CASSANDRA-18239 and CASSANDRA-18190 in 
>> trunk.
>>
>> Best regards,
>>
>> Ekaterina
>>
>>
>>


Re: [DISCUSSIONS] Replace ant eclipse-warnings with CheckerFramework

2023-06-16 Thread Ekaterina Dimitrova
Got so excited that I forgot to say which of the two options exactly I
meant - running the analysis only on changed files after the initial full
pass is done sounds like a good improvement to me.

On Fri, 16 Jun 2023 at 7:43, Ekaterina Dimitrova 
wrote:

> I think this is a great idea and it will probably reduce the time to run
> it. Thank you!
>
> On Fri, 16 Jun 2023 at 7:40, Jacek Lewandowski <
> lewandowski.ja...@gmail.com> wrote:
>
>> Additional question is whether we want to run the checks against the
>> whole project or just against the file changes between the feature branch
>> and the target release branch?
>>
>>
>> - - -- --- -  -
>> Jacek Lewandowski
>>
>>
>> On Fri, 16 Jun 2023 at 13:09, Aleksey Yeshchenko wrote:
>>
>>> Sounds like a clear improvement to me. Only once this check flagged a
>>> legitimate issue I missed, if I’m remembering correctly. All other
>>> instances have just been annoyances, forcing to add a redundant suppressed
>>> annotation.
>>>
>>> On 15 Jun 2023, at 19:01, Ekaterina Dimitrova 
>>> wrote:
>>>
>>> Hi everyone,
>>> Happy Thursday!
>>> Some time ago, Jacek raised the point that ant eclipse-warnings is 
>>> generating too many false positives and not really working as expected. 
>>> (CASSANDRA-18239)
>>>
>>> Reminder: ant eclipse-warnings is a task we run with the goal to check 
>>> Cassandra code - static analysis to warn on unsafe use of AutoCloseable 
>>> instances; checks against two related particular compiler options
>>>
>>> While trying to upgrade ECJ compiler that we use for this task 
>>> (CASSANDRA-18190) so we can switch the task from running it with JDK8 to 
>>> JDK11 in preparation for dropping JDK8, I hit the following issues:
>>> - the latest version of ECJ is throwing more than 300 Potential Resource 
>>> Leak warnings. I looked at 10-15, and they were all false positives.
>>> - Even if we file a bug report to the Eclipse community, JDK11 is about to 
>>> be removed with the next version of the compiler
>>>
>>> So I shared this information with Jacek. He came up with a different 
>>> solution:
>>> It seems we already pull through Guava CheckerFramework with an MIT 
>>> license, which appears to be acceptable according to this link -  
>>> https://www.apache.org/legal/resolved.html#category-a
>>> He already has an initial integration with Cassandra which shows the 
>>> following:
>>> - CheckerFramework does not understand the @SuppressWarnings("resource") 
>>> (there is a different one to be used), so it is immediately visible how it 
>>> does not report all those false positives that eclipse-warnings does. On 
>>> the flip side, I got the feedback that what it has witnessed so far is 
>>> something we should investigate.
>>> - Also, there are additional annotations like @Owning that let you fix many 
>>> problems at once because the tool understands that the ownership of the 
>>> resources was passed to another entity; It also enables you to do something 
>>> impossible with eclipse-warnings - you can tell the tool that there is 
>>> another method that needs to be called to release the resources, like 
>>> release, free, disconnect, etc.
>>> - the tool works with JDK8, JDK11, JDK17, and JDK20, so we can backport it 
>>> even to older branches (while at the same time keeping eclipse-warnings 
>>> there)
>>> - though it runs 8 minutes so, we should not run it with every test, some 
>>> reorganization around ant tasks will be covered as even for 
>>> eclipse-warnings it was weird to call it on every single test run locally 
>>> by default
>>>
>>>
>>> If there are no concerns, we will continue replacing ant eclipse-warnings 
>>> with the CheckerFramework as part of CASSANDRA-18239 and CASSANDRA-18190 in 
>>> trunk.
>>>
>>> Best regards,
>>>
>>> Ekaterina
>>>
>>>
>>>


Re: [VOTE] CEP-30 ANN Vector Search

2023-06-16 Thread Andrew Cobley (Staff)
Hi,

I’ve got a question and a request about this CEP.

In the example:


SELECT * FROM test.foo WHERE j ANN OF [3.4, 7.8, 9.1] limit 1;

I presume that limit n will return the n nearest neighbours?

If that’s the case, what order will they be in? Is it possible to reverse the 
order?

Secondly, would it be possible to return the calculated distances? This might 
be particularly important if there are n returned neighbours.

Andy

From: Patrick McFadin 
Sent: 15 June 2023 01:03
To: dev@cassandra.apache.org 
Subject: Re: [VOTE] CEP-30 ANN Vector Search




CAUTION: This email originated from outside the University of Dundee. Do not 
click links or open attachments unless you recognise the sender's email address 
and know the content is safe.

Andy,

Good to see you on the ML again! CEP-30 is slated for release with 5.0 later in 
the year. Until then, you'll need to do a local build or try it out in a 
preview in Astra. A few of us have been talking about creating a preview docker 
image since there is some interest in having it run in k8ssandra. In any case, 
this is very alpha code and should be treated as such. Reporting errors or 
unusual results would be greatly appreciated!

Patrick



On Wed, Jun 14, 2023 at 7:10 AM Andrew Cobley (Staff) 
<a.e.cob...@dundee.ac.uk> wrote:

Hi All,



Great news that this has gone through. I’m wondering if we have a timescale 
for this making it to Beta or release? I’m asking because we have a project 
that would benefit from this approach.



Andy





From: Jonathan Ellis <jbel...@gmail.com>
Date: Tuesday, 30 May 2023 at 14:44
To: dev <dev@cassandra.apache.org>
Subject: Re: [VOTE] CEP-30 ANN Vector Search



Thanks, all.  Closing the vote as accepted with 8 binding +1 (including me) and 
11 non-binding votes.



On Thu, May 25, 2023 at 10:45 AM Jonathan Ellis <jbel...@gmail.com> wrote:

Let's make this official.

CEP: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes



POC that demonstrates all the big rocks, including distributed queries: 
https://github.com/datastax/cassandra/tree/cep-vsearch

--

Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


--

Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

The University of Dundee is a registered Scottish Charity, No: SC015096



Re: [DISCUSSIONS] Replace ant eclipse-warnings with CheckerFramework

2023-06-16 Thread Jacek Lewandowski
Later, we may enable more checks than just leaked resources to improve the
code gradually.

- - -- --- -  -
Jacek Lewandowski


On Fri, 16 Jun 2023 at 13:48, Ekaterina Dimitrova wrote:

> Got so excited that I forgot to say which of the two options exactly  I
> meant - running the analysis only on changed files after the initial full
> pass is done sounds like a good improvement to me
>
> On Fri, 16 Jun 2023 at 7:43, Ekaterina Dimitrova 
> wrote:
>
>> I think this is a great idea and it will probably reduce the time to run
>> it. Thank you!
>>
>> On Fri, 16 Jun 2023 at 7:40, Jacek Lewandowski <
>> lewandowski.ja...@gmail.com> wrote:
>>
>>> Additional question is whether we want to run the checks against the
>>> whole project or just against the file changes between the feature branch
>>> and the target release branch?
>>>
>>>
>>> - - -- --- -  -
>>> Jacek Lewandowski
>>>
>>>
>>> On Fri, 16 Jun 2023 at 13:09, Aleksey Yeshchenko wrote:
>>>
>>>> Sounds like a clear improvement to me. Only once this check flagged a
>>>> legitimate issue I missed, if I’m remembering correctly. All other
>>>> instances have just been annoyances, forcing to add a redundant suppressed
>>>> annotation.
>>>>
>>>> On 15 Jun 2023, at 19:01, Ekaterina Dimitrova 
>>>> wrote:
>>>>
>>>> Hi everyone,
>>>> Happy Thursday!
>>>> Some time ago, Jacek raised the point that ant eclipse-warnings is 
>>>> generating too many false positives and not really working as expected. 
>>>> (CASSANDRA-18239)
>>>>
>>>> Reminder: ant eclipse-warnings is a task we run with the goal to check 
>>>> Cassandra code - static analysis to warn on unsafe use of AutoCloseable 
>>>> instances; checks against two related particular compiler options
>>>>
>>>> While trying to upgrade ECJ compiler that we use for this task 
>>>> (CASSANDRA-18190) so we can switch the task from running it with JDK8 to 
>>>> JDK11 in preparation for dropping JDK8, I hit the following issues:
>>>> - the latest version of ECJ is throwing more than 300 Potential Resource 
>>>> Leak warnings. I looked at 10-15, and they were all false positives.
>>>> - Even if we file a bug report to the Eclipse community, JDK11 is about to 
>>>> be removed with the next version of the compiler
>>>>
>>>> So I shared this information with Jacek. He came up with a different 
>>>> solution:
>>>> It seems we already pull through Guava CheckerFramework with an MIT 
>>>> license, which appears to be acceptable according to this link -  
>>>> https://www.apache.org/legal/resolved.html#category-a
>>>> He already has an initial integration with Cassandra which shows the 
>>>> following:
>>>> - CheckerFramework does not understand the @SuppressWarnings("resource") 
>>>> (there is a different one to be used), so it is immediately visible how it 
>>>> does not report all those false positives that eclipse-warnings does. On 
>>>> the flip side, I got the feedback that what it has witnessed so far is 
>>>> something we should investigate.
>>>> - Also, there are additional annotations like @Owning that let you fix 
>>>> many problems at once because the tool understands that the ownership of 
>>>> the resources was passed to another entity; It also enables you to do 
>>>> something impossible with eclipse-warnings - you can tell the tool that 
>>>> there is another method that needs to be called to release the resources, 
>>>> like release, free, disconnect, etc.
>>>> - the tool works with JDK8, JDK11, JDK17, and JDK20, so we can backport it 
>>>> even to older branches (while at the same time keeping eclipse-warnings 
>>>> there)
>>>> - though it runs 8 minutes so, we should not run it with every test, some 
>>>> reorganization around ant tasks will be covered as even for 
>>>> eclipse-warnings it was weird to call it on every single test run locally 
>>>> by default
>>>>
>>>>
>>>> If there are no concerns, we will continue replacing ant eclipse-warnings 
>>>> with the CheckerFramework as part of CASSANDRA-18239 and CASSANDRA-18190 
>>>> in trunk.
>>>>
>>>> Best regards,
>>>>
>>>> Ekaterina





Re: [DISCUSSIONS] Replace ant eclipse-warnings with CheckerFramework

2023-06-16 Thread Jeremiah Jordan
 +1 from me.

On Jun 15, 2023 at 1:01:01 PM, Ekaterina Dimitrova 
wrote:

> Hi everyone,
> Happy Thursday!
> Some time ago, Jacek raised the point that ant eclipse-warnings is generating 
> too many false positives and not really working as expected. (CASSANDRA-18239)
>
> Reminder: ant eclipse-warnings is a task we run with the goal to check 
> Cassandra code - static analysis to warn on unsafe use of AutoCloseable 
> instances; checks against two related particular compiler options
>
> While trying to upgrade ECJ compiler that we use for this task 
> (CASSANDRA-18190) so we can switch the task from running it with JDK8 to 
> JDK11 in preparation for dropping JDK8, I hit the following issues:
> - the latest version of ECJ is throwing more than 300 Potential Resource Leak 
> warnings. I looked at 10-15, and they were all false positives.
> - Even if we file a bug report to the Eclipse community, JDK11 is about to be 
> removed with the next version of the compiler
>
> So I shared this information with Jacek. He came up with a different solution:
> It seems we already pull through Guava CheckerFramework with an MIT license, 
> which appears to be acceptable according to this link -  
> https://www.apache.org/legal/resolved.html#category-a
> He already has an initial integration with Cassandra which shows the 
> following:
> - CheckerFramework does not understand the @SuppressWarnings("resource") 
> (there is a different one to be used), so it is immediately visible how it 
> does not report all those false positives that eclipse-warnings does. On the 
> flip side, I got the feedback that what it has witnessed so far is something 
> we should investigate.
> - Also, there are additional annotations like @Owning that let you fix many 
> problems at once because the tool understands that the ownership of the 
> resources was passed to another entity; It also enables you to do something 
> impossible with eclipse-warnings - you can tell the tool that there is 
> another method that needs to be called to release the resources, like 
> release, free, disconnect, etc.
> - the tool works with JDK8, JDK11, JDK17, and JDK20, so we can backport it 
> even to older branches (while at the same time keeping eclipse-warnings there)
> - though it runs 8 minutes so, we should not run it with every test, some 
> reorganization around ant tasks will be covered as even for eclipse-warnings 
> it was weird to call it on every single test run locally by default
>
>
> If there are no concerns, we will continue replacing ant eclipse-warnings 
> with the CheckerFramework as part of CASSANDRA-18239 and CASSANDRA-18190 in 
> trunk.
>
> Best regards,
>
> Ekaterina
>
>


Re: [DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-16 Thread Andrés de la Peña
It seems we agree on removing the default value for the old thresholds, and
on not counting system keyspaces/tables in the new ones.

The old thresholds were on active duty for around ten months, and they have
been deprecated for around a year. They will have been deprecated for
longer by the time we release 5.0. If we want to keep them in perpetuity, I
guess the plan would be:

- Remove the default value of the old thresholds in Config.java to make
them disabled by default.
- Remove the old thresholds from the default cassandra.yaml, although old
yamls can still have them.
- Use converters (@Replaces tag in Config.java) to read the old threshold
values (if present) and apply them to the new guardrails.
- During the conversion from the old thresholds to the new guardrails,
subtract the current number of system keyspaces/tables from the old value.
For example, 150 tables in the old threshold translate to 103 tables in the
new guardrail, considering that there are 47 system tables.
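
The arithmetic in the last step can be written down as a small Java sketch. This is an assumption-laden illustration, not the actual converter that would sit behind the @Replaces tag in Config.java: the class and method names are hypothetical, an empty OptionalInt stands for "old threshold not set" (the disabled-by-default case), and clamping at zero when the old value is smaller than the number of system tables is one possible choice rather than an agreed behavior.

```java
import java.util.OptionalInt;

// Hypothetical helper; it only mirrors the subtraction described above.
class ThresholdConversionSketch {

    // Converts an old threshold, which counted system tables, into a value for
    // the new guardrail, which counts user tables only.
    static OptionalInt toUserTableGuardrail(OptionalInt oldThreshold, int systemTableCount) {
        if (!oldThreshold.isPresent()) {
            // Old threshold absent from the yaml: leave the guardrail unset.
            return OptionalInt.empty();
        }
        // Assumption: never produce a negative guardrail value.
        int converted = Math.max(0, oldThreshold.getAsInt() - systemTableCount);
        return OptionalInt.of(converted);
    }

    public static void main(String[] args) {
        // The example from the email: 150 tables with 47 system tables.
        System.out.println(toUserTableGuardrail(OptionalInt.of(150), 47).getAsInt());
    }
}
```

With the numbers from the example, an old threshold of 150 and 47 system tables give a guardrail of 103 user tables.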

Does this sound good?

On Wed, 14 Jun 2023 at 17:26, David Capwell  wrote:

> That's problematic because the new thresholds we added in CASSANDRA-17147
> don't include system tables. Do you think we should change that?
>
>
> I wouldn’t change the semantics of the config as it’s already live.  I
> guess where I am coming from is that logically we have to think about the
> system tables, so to your point, if we think 150 is too much and the system
> already exposes 50… then we should recommend no more than 100….
>
> I find it's better for usability to not count the system tables and just
> say "It's recommended not to have more than 100 tables. This doesn't
> include system tables.”
>
>
> I am fine with this framing… internally we think about 150 but
> publicly speak 100 (due to our 50 tables)...
>
>
> On Jun 14, 2023, at 8:29 AM, Josh McKenzie  wrote:
>
> In my opinion including system tables defeats that purpose because it
> forces users to know details about the system tables.
>
> Perhaps having a unit test that caps our system tables at some value and
> keeping the guardrail user-scope specific would be a better approach. I see
> your point about leaking internal details to users, specifically on things
> they can't control at this point.
>
> On Wed, Jun 14, 2023, at 8:19 AM, Andrés de la Peña wrote:
>
> > Default value I agree with you; features should be off by default!  If
> we remove the default then we disable the feature by default (which im cool
> with) and for anyone who changed the config, they would keep their behavior
>
>
> I'm glad we agree on at least removing the default value if we keep the
> deprecated properties.
>
> > With that, I kinda don’t agree that including system tables is a
> mistake, as we add more we allow less for user tables before we start to
> have issues….
>
>
> That's problematic because the new thresholds we added in CASSANDRA-17147
> don't include system tables. Do you think we should change that?
>
> I still think it's better not to include the system tables in the count.
> The thresholds on the number of keyspaces/tables/rows/columns/tombstones
> are just guidance since they cannot be exactly related to exact resource
> consumption. The main purpose of those thresholds is to prevent obvious
> antipatterns such as creating thousands of tables. A benefit of expressing
> the guardrails in terms of the number of schema entities, rather than
> counting the memory usage of those entities, is that they are easy to
> understand and reason about. In my opinion including system tables defeats
> that purpose because it forces users to know details about the system
> tables. The fact that those details change between versions doesn't help.
> Including system tables is not going to make the thresholds precise in
> terms of measuring memory consumption because that depends on other
> factors, such as the columns they store.
>
> Including system tables also imposes a minimum threshold value, like in
> 5.0 you cannot set a threshold value under 45 tables without triggering it
> with an empty db. For other thresholds, this can be more tricky. That would
> be the case of the guardrail on the number of columns in a partition, where
> you would need to know the size of the widest row in the system tables,
> which can change over time.
>
> I guess that if system tables were to be counted, a recommendation for the
> threshold would say something like "It's recommended not to have more than
> 150 tables. The system already includes 45 tables for internal usage, so
> you shouldn't create more than 105 user tables". I find it's better for
> usability to not count the system tables and just say "It's recommended not
> to have more than 100 tables. This doesn't include system tables."
>
> On Tue, 13 Jun 2023 at 23:51, Josh McKenzie  wrote:
>
>
> Warning that too many tables (including system) may have negative behavior
> I think is fine
>
> This reminds me of the current situation with our tests where we just keep
> addin

Re: [DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-16 Thread Ekaterina Dimitrova
Hi all,
I was following the discussion. What Andres just summarized sounds
reasonable to me. Let’s just not forget to document also all this.
Thank you
Ekaterina

On Fri, 16 Jun 2023 at 10:16, Andrés de la Peña 
wrote:

> It seems we agree on removing the default value for the old thresholds,
> and don't count system keyspaces/tables on the new ones.
>
> The old thresholds were on active duty for around ten months, and they
> have been deprecated for around a year. They will have been deprecated for
> longer by the time we release 5.0. If we want to keep them in perpetuity, I
> guess the plan would be:
>
> - Remove the default value of the old thresholds in Config.java to make
> them disabled by default.
> - Remove the old thresholds from the default cassandra.yaml, although old
> yamls can still have them.
> - Use converters (@Replaces tag in Config.java) to read the old threshold
> values (if present) and apply them to the new guardrails.
> - During the conversion from the old thresholds to the new guardrails,
> subtract the current number of system keyspace/tables from the old value.
> For example, 150 tables in the old threshold translate to 103 tables in the
> new guardrail, considering that there are 47 system tables.
>
> Does this sound good?
>
> On Wed, 14 Jun 2023 at 17:26, David Capwell  wrote:
>
>> That's problematic because the new thresholds we added in CASSANDRA-17147
>> don't include system tables. Do you think we should change that?
>>
>>
>> I wouldn’t change the semantics of the config as it’s already live.  I
>> guess where I am coming from is that logically we have to think about the
>> system tables, so to your point, if we think 150 is too much and the system
>> already exposes 50… then we should recommend no more than 100….
>>
>> I find it's better for usability to not count the system tables and just
>> say "It's recommended not to have more than 100 tables. This doesn't
>> include system tables.”
>>
>>
>> I am fine with this framing… internally we think about 150 but
>> publicly speak 100 (due to our 50 tables)...
>>
>>
>> On Jun 14, 2023, at 8:29 AM, Josh McKenzie  wrote:
>>
>> In my opinion including system tables defeats that purpose because it
>> forces users to know details about the system tables.
>>
>> Perhaps having a unit test that caps our system tables at some value and
>> keeping the guardrail user-scope specific would be a better approach. I see
>> your point about leaking internal details to users, specifically on things
>> they can't control at this point.
>>
>> On Wed, Jun 14, 2023, at 8:19 AM, Andrés de la Peña wrote:
>>
>> > Default value I agree with you; features should be off by default!  If
>> we remove the default then we disable the feature by default (which im cool
>> with) and for anyone who changed the config, they would keep their behavior
>>
>>
>> I'm glad we agree on at least removing the default value if we keep the
>> deprecated properties.
>>
>> > With that, I kinda don’t agree that including system tables is a
>> mistake, as we add more we allow less for user tables before we start to
>> have issues….
>>
>>
>> That's problematic because the new thresholds we added in CASSANDRA-17147
>> don't include system tables. Do you think we should change that?
>>
>> I still think it's better not to include the system tables in the count.
>> The thresholds on the number of keyspaces/tables/rows/columns/tombstones
>> are just guidance since they cannot be exactly related to exact resource
>> consumption. The main purpose of those thresholds is to prevent obvious
>> antipatterns such as creating thousands of tables. A benefit of expressing
>> the guardrails in terms of the number of schema entities, rather than
>> counting the memory usage of those entities, is that they are easy to
>> understand and reason about. In my opinion including system tables defeats
>> that purpose because it forces users to know details about the system
>> tables. The fact that those details change between versions doesn't help.
>> Including system tables is not going to make the thresholds precise in
>> terms of measuring memory consumption because that depends on other
>> factors, such as the columns they store.
>>
>> Including system tables also imposes a minimum threshold value, like in
>> 5.0 you cannot set a threshold value under 45 tables without triggering it
>> with an empty db. For other thresholds, this can be more tricky. That would
>> be the case of the guardrail on the number of columns in a partition, where
>> you would need to know the size of the widest row in the system tables,
>> which can change over time.
>>
>> I guess that if system tables were to be counted, a recommendation for
>> the threshold would say something like "It's recommended not to have more
>> than 150 tables. The system already includes 45 tables for internal usage,
>> so you shouldn't create more than 105 user tables". I find it's better for
>> usability to not count the system tables and just say

Re: [DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-16 Thread Josh McKenzie
> I was following the discussion. What Andres just summarized sounds reasonable 
> to me. Let’s just not forget to document also all this.
+1 here. Maybe also add a warning to the log to let users know we subtracted 
system tables from that count since they used the old param and try and point 
them to the new one?

On Fri, Jun 16, 2023, at 10:57 AM, Ekaterina Dimitrova wrote:
> Hi all,
> I was following the discussion. What Andres just summarized sounds reasonable 
> to me. Let’s just not forget to document also all this.
> Thank you
> Ekaterina
> 
> On Fri, 16 Jun 2023 at 10:16, Andrés de la Peña  wrote:
>> It seems we agree on removing the default value for the old thresholds, and 
>> don't count system keyspaces/tables on the new ones.
>> 
>> The old thresholds were on active duty for around ten months, and they have 
>> been deprecated for around a year. They will have been deprecated for longer 
>> by the time we release 5.0. If we want to keep them in perpetuity, I guess 
>> the plan would be:
>> 
>> - Remove the default value of the old thresholds in Config.java to make them 
>> disabled by default.
>> - Remove the old thresholds from the default cassandra.yaml, although old 
>> yamls can still have them.
>> - Use converters (@Replaces tag in Config.java) to read the old threshold 
>> values (if present) and apply them to the new guardrails.
>> - During the conversion from the old thresholds to the new guardrails, 
>> subtract the current number of system keyspace/tables from the old value. 
>> For example, 150 tables in the old threshold translate to 103 tables in the 
>> new guardrail, considering that there are 47 system tables.
>> 
>> Does this sound good?
>> 
>> On Wed, 14 Jun 2023 at 17:26, David Capwell  wrote:
>>>> That's problematic because the new thresholds we added in CASSANDRA-17147
>>>> don't include system tables. Do you think we should change that?
>>> 
>>> I wouldn’t change the semantics of the config as it’s already live.  I 
>>> guess where I am coming from is that logically we have to think about the 
>>> system tables, so to your point, if we think 150 is too much and the system 
>>> already exposes 50… then we should recommend no more than 100…. 
>>> 
>>>> I find it's better for usability to not count the system tables and just
>>>> say "It's recommended not to have more than 100 tables. This doesn't
>>>> include system tables.”
>>> 
>>> I am fine with this framing… internally we think about 150 but publicly 
>>> speak 100 (due to our 50 tables)...
>>> 
>>> 
>>>> On Jun 14, 2023, at 8:29 AM, Josh McKenzie wrote:
>>>>
>>>>> In my opinion including system tables defeats that purpose because it
>>>>> forces users to know details about the system tables.
>>>> Perhaps having a unit test that caps our system tables at some value and
>>>> keeping the guardrail user-scope specific would be a better approach. I
>>>> see your point about leaking internal details to users, specifically on
>>>> things they can't control at this point.
>>>>
>>>> On Wed, Jun 14, 2023, at 8:19 AM, Andrés de la Peña wrote:
>>>>>> > Default value I agree with you; features should be off by default!  If
>>>>>> > we remove the default then we disable the feature by default (which im
>>>>>> > cool with) and for anyone who changed the config, they would keep
>>>>>> > their behavior
>>>>>
>>>>> I'm glad we agree on at least removing the default value if we keep the
>>>>> deprecated properties.
>>>>>
>>>>>> > With that, I kinda don’t agree that including system tables is a
>>>>>> > mistake, as we add more we allow less for user tables before we start
>>>>>> > to have issues….
>>>>>
>>>>> That's problematic because the new thresholds we added in CASSANDRA-17147
>>>>> don't include system tables. Do you think we should change that?
>>>>>
>>>>> I still think it's better not to include the system tables in the count.
>>>>> The thresholds on the number of keyspaces/tables/rows/columns/tombstones
>>>>> are just guidance since they cannot be exactly related to exact resource
>>>>> consumption. The main purpose of those thresholds is to prevent obvious
>>>>> antipatterns such as creating thousands of tables. A benefit of
>>>>> expressing the guardrails in terms of the number of schema entities,
>>>>> rather than counting the memory usage of those entities, is that they are
>>>>> easy to understand and reason about. In my opinion including system
>>>>> tables defeats that purpose because it forces users to know details about
>>>>> the system tables. The fact that those details change between versions
>>>>> doesn't help. Including system tables is not going to make the thresholds
>>>>> precise in terms of measuring memory consumption because that depends on
>>>>> other factors, such as the columns they store.
>>>>>
>>>>> Including system tables also imposes a minimum threshold value, like in
>>>>> 5.0 you cannot set a threshold value under 45 tables without triggering
>>>>> it with an empty

Re: [DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-16 Thread Dan Jatnieks
Hi all,

Apologies for the late reply; I didn't mean to start a thread and then
disappear - it was unintended and I feel bad about that.

I've been taking notes to summarize the discussion points and it matches
what Andres already listed, so I'm glad for that. And thank you Andres for
doing that - much appreciated!

The plan Andres outlined also sounds good to me. I was not aware
of @Replaces before, and now that I learned it, I agree it should be used
here.

Converting the old threshold values by subtracting the system
keyspace/tables makes sense to keep the existing guardrail semantics - and
including a message about that will be a good step to reduce any confusion
about how the, possibly odd-looking, new value was determined.
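
For illustration, the conversion and the accompanying message could look
roughly like the sketch below. This is not the actual Cassandra converter
code: the class and method names, the use of java.util.logging, the -1
"disabled" value, and clamping at zero when the old value is smaller than
the number of system tables are all assumptions made for the example.

```java
import java.util.logging.Logger;

/** Hypothetical helper; not part of Cassandra. */
final class LegacyThresholdConverter
{
    private static final Logger LOGGER =
        Logger.getLogger(LegacyThresholdConverter.class.getName());

    /** Assumed sentinel meaning "guardrail disabled". */
    static final int DISABLED = -1;

    /**
     * Converts an old threshold that counted system entries into a
     * user-scoped threshold by subtracting the system entry count.
     */
    static int toUserScopedThreshold(int oldThreshold, int systemCount,
                                     String oldName, String newName)
    {
        // A negative old value is treated as "not set" (assumption).
        if (oldThreshold < 0)
            return DISABLED;

        // Clamp at zero so the result is never negative (assumption; the
        // thread did not discuss old values below the system count).
        int converted = Math.max(0, oldThreshold - systemCount);

        // Explain where the possibly odd-looking new value comes from.
        LOGGER.warning(String.format(
            "%s is deprecated; its value %d included %d system entries, so it "
            + "was converted to %s = %d. Please use %s instead.",
            oldName, oldThreshold, systemCount, newName, converted, newName));
        return converted;
    }
}
```

With the numbers from Andres' example, an old value of 150 and 47 system
tables gives a new value of 103. The new property name is whatever the
caller passes in; nothing here depends on a particular name.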

Dan


On Fri, Jun 16, 2023 at 9:58 AM Ekaterina Dimitrova 
wrote:

> Hi all,
> I was following the discussion. What Andres just summarized sounds
> reasonable to me. Let’s just not forget to document also all this.
> Thank you
> Ekaterina
>
> On Fri, 16 Jun 2023 at 10:16, Andrés de la Peña 
> wrote:
>
>> It seems we agree on removing the default value for the old thresholds,
>> and don't count system keyspaces/tables on the new ones.
>>
>> The old thresholds were on active duty for around ten months, and they
>> have been deprecated for around a year. They will have been deprecated for
>> longer by the time we release 5.0. If we want to keep them in perpetuity, I
>> guess the plan would be:
>>
>> - Remove the default value of the old thresholds in Config.java to make
>> them disabled by default.
>> - Remove the old thresholds from the default cassandra.yaml, although old
>> yamls can still have them.
>> - Use converters (@Replaces tag in Config.java) to read the old threshold
>> values (if present) and apply them to the new guardrails.
>> - During the conversion from the old thresholds to the new guardrails,
>> subtract the current number of system keyspace/tables from the old value.
>> For example, 150 tables in the old threshold translate to 103 tables in the
>> new guardrail, considering that there are 47 system tables.
>>
>> Does this sound good?
>>
>> On Wed, 14 Jun 2023 at 17:26, David Capwell  wrote:
>>
>>> That's problematic because the new thresholds we added in
>>> CASSANDRA-17147 don't include system tables. Do you think we should change
>>> that?
>>>
>>>
>>> I wouldn’t change the semantics of the config as it’s already live.  I
>>> guess where I am coming from is that logically we have to think about the
>>> system tables, so to your point, if we think 150 is too much and the system
>>> already exposes 50… then we should recommend no more than 100….
>>>
>>> I find it's better for usability to not count the system tables and just
>>> say "It's recommended not to have more than 100 tables. This doesn't
>>> include system tables.”
>>>
>>>
>>> I am fine with this framing… internally we think about 150 but
>>> publicly speak 100 (due to our 50 tables)...
>>>
>>>
>>> On Jun 14, 2023, at 8:29 AM, Josh McKenzie  wrote:
>>>
>>> In my opinion including system tables defeats that purpose because it
>>> forces users to know details about the system tables.
>>>
>>> Perhaps having a unit test that caps our system tables at some value and
>>> keeping the guardrail user-scope specific would be a better approach. I see
>>> your point about leaking internal details to users, specifically on things
>>> they can't control at this point.
>>>
>>> On Wed, Jun 14, 2023, at 8:19 AM, Andrés de la Peña wrote:
>>>
>>> > Default value I agree with you; features should be off by default!  If
>>> we remove the default then we disable the feature by default (which im cool
>>> with) and for anyone who changed the config, they would keep their behavior
>>>
>>>
>>> I'm glad we agree on at least removing the default value if we keep the
>>> deprecated properties.
>>>
>>> > With that, I kinda don’t agree that including system tables is a
>>> mistake, as we add more we allow less for user tables before we start to
>>> have issues….
>>>
>>>
>>> That's problematic because the new thresholds we added in
>>> CASSANDRA-17147 don't include system tables. Do you think we should change
>>> that?
>>>
>>> I still think it's better not to include the system tables in the count.
>>> The thresholds on the number of keyspaces/tables/rows/columns/tombstones
>>> are just guidance since they cannot be exactly related to exact resource
>>> consumption. The main purpose of those thresholds is to prevent obvious
>>> antipatterns such as creating thousands of tables. A benefit of expressing
>>> the guardrails in terms of the number of schema entities, rather than
>>> counting the memory usage of those entities, is that they are easy to
>>> understand and reason about. In my opinion including system tables defeats
>>> that purpose because it forces users to know details about the system
>>> tables. The fact that those details change between versions doesn't help.
>>> Including system tables is not going to make the thresholds p

Re: [VOTE] CEP-8 Datastax Drivers Donation

2023-06-16 Thread Jeremy Hanna
After 72 hours, this is the summary of the voting:

binding +1 votes: 21
non-binding +1 votes: 9
-1 votes: 0
vetoes: 0

It looks like CEP-8 has passed at long last.

Thanks everyone!

Jeremy

> On Jun 16, 2023, at 6:04 AM, Aleksey Yeshchenko  wrote:
> 
> +1
> 
>> On 15 Jun 2023, at 15:19, Chris Lohfink  wrote:
>> 
>> +1
>> 
>> On Wed, Jun 14, 2023 at 9:05 PM Jon Haddad wrote:
>>> +1
>>> 
>>> On 2023/06/13 14:14:35 Jeremy Hanna wrote:
>>> > Calling for a vote on CEP-8 [1].
>>> > 
>>> > To clarify the intent, as Benjamin said in the discussion thread [2], the 
>>> > goal of this vote is simply to ensure that the community is in favor of 
>>> > the donation. Nothing more.
>>> > The plan is to introduce the drivers, one by one. Each driver donation 
>>> > will need to be accepted first by the PMC members, as it is the case for 
>>> > any donation. Therefore the PMC should have full control on the pace at 
>>> > which new drivers are accepted.
>>> > 
>>> > If this vote passes, we can start this process for the Java driver under 
>>> > the direction of the PMC.
>>> > 
>>> > Jeremy
>>> > 
>>> > 1. 
>>> > https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-8%3A+Datastax+Drivers+Donation
>>> > 2. https://lists.apache.org/thread/opt630do09phh7hlt28odztxdv6g58dp
> 



Re: [DISCUSS] Remove deprecated keyspace_count_warn_threshold and table_count_warn_threshold

2023-06-16 Thread David Capwell
> Does this sound good?

Sounds good to me

> On Jun 16, 2023, at 8:30 AM, Dan Jatnieks  wrote:
> 
> Hi all,
> 
> Apologies for the late reply; I didn't mean to start a thread and then 
> disappear - it was unintended and I feel bad about that.
> 
> I've been taking notes to summarize the discussion points and it matches what 
> Andres already listed, so I'm glad for that. And thank you Andres for doing 
> that - much appreciated!
> 
> The plan Andres outlined also sounds good to me. I was not aware of @Replaces 
> before, and now that I learned it, I agree it should be used here.
> 
> Converting the old threshold values by subtracting the system keyspace/tables 
> makes sense to keep the existing guardrail semantics - and including a 
> message about that will be a good step to reduce any confusion about how the, 
> possibly odd-looking, new value was determined.
> 
> Dan
> 
> 
> On Fri, Jun 16, 2023 at 9:58 AM Ekaterina Dimitrova wrote:
>> Hi all,
>> I was following the discussion. What Andres just summarized sounds 
>> reasonable to me. Let’s just not forget to document also all this.
>> Thank you
>> Ekaterina
>> 
>> On Fri, 16 Jun 2023 at 10:16, Andrés de la Peña wrote:
>>> It seems we agree on removing the default value for the old thresholds, and 
>>> don't count system keyspaces/tables on the new ones.
>>> 
>>> The old thresholds were on active duty for around ten months, and they have 
>>> been deprecated for around a year. They will have been deprecated for 
>>> longer by the time we release 5.0. If we want to keep them in perpetuity, I 
>>> guess the plan would be:
>>> 
>>> - Remove the default value of the old thresholds in Config.java to make 
>>> them disabled by default.
>>> - Remove the old thresholds from the default cassandra.yaml, although old 
>>> yamls can still have them.
>>> - Use converters (@Replaces tag in Config.java) to read the old threshold 
>>> values (if present) and apply them to the new guardrails.
>>> - During the conversion from the old thresholds to the new guardrails, 
>>> subtract the current number of system keyspace/tables from the old value. 
>>> For example, 150 tables in the old threshold translate to 103 tables in the 
>>> new guardrail, considering that there are 47 system tables.
>>> 
>>> Does this sound good?
>>> 
>>> On Wed, 14 Jun 2023 at 17:26, David Capwell wrote:
>>>>> That's problematic because the new thresholds we added in CASSANDRA-17147
>>>>> don't include system tables. Do you think we should change that?
>>>>
>>>> I wouldn’t change the semantics of the config as it’s already live.  I
>>>> guess where I am coming from is that logically we have to think about the
>>>> system tables, so to your point, if we think 150 is too much and the
>>>> system already exposes 50… then we should recommend no more than 100….
>>>>
>>>>> I find it's better for usability to not count the system tables and just
>>>>> say "It's recommended not to have more than 100 tables. This doesn't
>>>>> include system tables.”
>>>>
>>>>
>>>> I am fine with this framing… internally we think about 150 but publicly
>>>> speak 100 (due to our 50 tables)...
>>>>
>>>>
>>>>> On Jun 14, 2023, at 8:29 AM, Josh McKenzie wrote:
>>>>>
>>>>>> In my opinion including system tables defeats that purpose because it
>>>>>> forces users to know details about the system tables.
>>>>> Perhaps having a unit test that caps our system tables at some value and
>>>>> keeping the guardrail user-scope specific would be a better approach. I
>>>>> see your point about leaking internal details to users, specifically on
>>>>> things they can't control at this point.
>>>>>
>>>>> On Wed, Jun 14, 2023, at 8:19 AM, Andrés de la Peña wrote:
>>>>>> > Default value I agree with you; features should be off by default!  If
>>>>>> > we remove the default then we disable the feature by default (which im
>>>>>> > cool with) and for anyone who changed the config, they would keep
>>>>>> > their behavior
>>>>>>
>>>>>> I'm glad we agree on at least removing the default value if we keep the
>>>>>> deprecated properties.
>>>>>>
>>>>>> > With that, I kinda don’t agree that including system tables is a
>>>>>> > mistake, as we add more we allow less for user tables before we start
>>>>>> > to have issues….
>>>>>>
>>>>>> That's problematic because the new thresholds we added in
>>>>>> CASSANDRA-17147 don't include system tables. Do you think we should
>>>>>> change that?
>>>>>>
>>>>>> I still think it's better not to include the system tables in the count.
>>>>>> The thresholds on the number of keyspaces/tables/rows/columns/tombstones
>>>>>> are just guidance since they cannot be exactly related to exact resource
>>>>>> consumption. The main purpose of those thresholds is to prevent obvious
>>>>>> antipatterns such as creating thousands of tables. A benefit of
>>>>>>

Re: [VOTE] CEP-30 ANN Vector Search

2023-06-16 Thread Jonathan Ellis
Correct.  They will be ordered closest-first.

Unfortunately it's not possible for the near or medium future to do
farthest-first.  HNSW index gets to log(n) time by only keeping a subset of
the closest neighbors for each vector.  So you'd need a separate index with
an inverse-cosine similarity metric, and it's not possible today to use a
custom metric function.

(This has been GA for over a year in Elastic and Solr and so far nobody has
needed farthest-first badly enough to add this as an option to the
underlying Lucene library.)

You can get the distances back today, like this:

SELECT my_text, similarity_cosine(my_embedding, ?)
FROM my_table
ORDER BY my_embedding ANN OF ? LIMIT 2

Then just pass the query vector into both bind variables.
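
To make "closest-first" concrete, here is a brute-force sketch in Java of
what the query above asks for: score every row against the query vector,
sort by descending cosine similarity, and keep the first LIMIT rows. It only
illustrates the ordering. The real index is approximate (HNSW) and does not
scan every row, the class and method names are made up for the example, the
vectors are assumed to be non-zero, and the plain cosine computed here is
not necessarily scaled the same way as the value similarity_cosine returns.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical illustration; not Cassandra code. */
final class ClosestFirstSketch
{
    /** Plain cosine similarity of two non-zero vectors of equal length. */
    static double cosine(float[] a, float[] b)
    {
        if (a.length != b.length)
            throw new IllegalArgumentException("vector lengths differ");
        double dot = 0, normA = 0, normB = 0;
        for (int i = 0; i < a.length; i++)
        {
            dot += a[i] * b[i];
            normA += a[i] * a[i];
            normB += b[i] * b[i];
        }
        return dot / (Math.sqrt(normA) * Math.sqrt(normB));
    }

    /** Indexes of the {@code limit} rows most similar to the query, closest first. */
    static List<Integer> nearest(float[][] rows, float[] query, int limit)
    {
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < rows.length; i++)
            order.add(i);
        // Negate the similarity so the most similar row sorts first.
        order.sort(Comparator.comparingDouble((Integer i) -> -cosine(rows[i], query)));
        return new ArrayList<>(order.subList(0, Math.min(limit, order.size())));
    }
}
```

Reversing this list gives the retrieved neighbours farthest-first, but that
is only a reordering of the n closest rows, not the n farthest rows in the
table, which is the case that would need a different index.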

On Fri, Jun 16, 2023 at 7:09 AM Andrew Cobley (Staff) <
a.e.cob...@dundee.ac.uk> wrote:

> Hi,
>
> I’ve got a question and a request about this CEP
>
> In the example:
>
> SELECT * FROM test.foo WHERE j ANN OF [3.4, 7.8, 9.1] limit 1;
>
>
> I presume that limit n will return the n nearest neighbours?
>
> If that’s the case what order will they be in? Is it possible to reverse
> the order?
>
> Secondly would it be possible to return the calculated distances?  This
> might be particularly important if there are n returned neighbours?
>
> Andy
> --
> *From:* Patrick McFadin 
> *Sent:* 15 June 2023 01:03
> *To:* dev@cassandra.apache.org 
> *Subject:* Re: [VOTE] CEP-30 ANN Vector Search
>
>
>
>
> CAUTION: This email originated from outside the University of Dundee. Do
> not click links or open attachments unless you recognise the sender's email
> address and know the content is safe.
> Andy,
>
> Good to see you on the ML again! CEP-30 is slated for release with 5.0
> later in the year. Until then, you'll need to do a local build or try it
> out in a preview in Astra. A few of us have been talking about creating a
> preview docker image since there is some interest in having it run in
> k8ssandra. In any case, this is very alpha code and should be treated as
> such. Reporting errors or unusual results would be greatly appreciated!
>
> Patrick
>
>
>
> On Wed, Jun 14, 2023 at 7:10 AM Andrew Cobley (Staff) <
> a.e.cob...@dundee.ac.uk> wrote:
>
> Hi All,
>
>
>
> Great news this has gone through, I’m wondering if we have a timescale for
> this making it to Beta or release ?  I’m asking because we have a project
> that would benefit from this approach.
>
>
>
> Andy
>
>
>
>
>
> *From: *Jonathan Ellis 
> *Date: *Tuesday, 30 May 2023 at 14:44
> *To: *dev 
> *Subject: *Re: [VOTE] CEP-30 ANN Vector Search
>
>
>
>
> Thanks, all.  Closing the vote as accepted with 8 binding +1 (including
> me) and 11 non-binding votes.
>
>
>
> On Thu, May 25, 2023 at 10:45 AM Jonathan Ellis  wrote:
>
> Let's make this official.
>
>
> CEP:
> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>
>
>
> POC that demonstrates all the big rocks, including distributed queries:
> https://github.com/datastax/cassandra/tree/cep-vsearch
>
>
> --
>
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
>
>
> --
>
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
> The University of Dundee is a registered Scottish Charity, No: SC015096
>
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


Re: [VOTE] CEP-30 ANN Vector Search

2023-06-16 Thread Andrew Cobley (Staff)
Thanks Jonathan,

That’s good to know.

Andy


From: Jonathan Ellis 
Date: Friday, 16 June 2023 at 18:04
To: dev@cassandra.apache.org 
Subject: Re: [VOTE] CEP-30 ANN Vector Search

Correct.  They will be ordered closest-first.

Unfortunately it's not possible for the near or medium future to do 
farthest-first.  HNSW index gets to log(n) time by only keeping a subset of the 
closest neighbors for each vector.  So you'd need a separate index with an
inverse-cosine similarity metric, and it's not possible today to use a custom
metric function.

(This has been GA for over a year in Elastic and Solr and so far nobody has 
needed farthest-first badly enough to add this as an option to the underlying 
Lucene library.)

You can get the distances back today, like this:

SELECT my_text, similarity_cosine(my_embedding, ?)
FROM my_table
ORDER BY my_embedding ANN OF ? LIMIT 2

Then just pass the query vector into both bind variables.

On Fri, Jun 16, 2023 at 7:09 AM Andrew Cobley (Staff)
<a.e.cob...@dundee.ac.uk> wrote:
Hi,

I’ve got a question and a request about this CEP

In the example:


SELECT * FROM test.foo WHERE j ANN OF [3.4, 7.8, 9.1] limit 1;

I presume that limit n will return the n nearest neighbours?

If that’s the case what order will they be in? Is it possible to reverse the
order?

Secondly would it be possible to return the calculated distances?  This might
be particularly important if there are n returned neighbours?

Andy

From: Patrick McFadin <pmcfa...@gmail.com>
Sent: 15 June 2023 01:03
To: dev@cassandra.apache.org
Subject: Re: [VOTE] CEP-30 ANN Vector Search




Andy,

Good to see you on the ML again! CEP-30 is slated for release with 5.0 later in 
the year. Until then, you'll need to do a local build or try it out in a 
preview in Astra. A few of us have been talking about creating a preview docker 
image since there is some interest in having it run in k8ssandra. In any case, 
this is very alpha code and should be treated as such. Reporting errors or 
unusual results would be greatly appreciated!

Patrick



On Wed, Jun 14, 2023 at 7:10 AM Andrew Cobley (Staff)
<a.e.cob...@dundee.ac.uk> wrote:

Hi All,



Great news this has gone through, I’m wondering if we have a timescale for this
making it to Beta or release ?  I’m asking because we have a project that would 
benefit from this approach.



Andy





From: Jonathan Ellis <jbel...@gmail.com>
Date: Tuesday, 30 May 2023 at 14:44
To: dev <dev@cassandra.apache.org>
Subject: Re: [VOTE] CEP-30 ANN Vector Search




Thanks, all.  Closing the vote as accepted with 8 binding +1 (including me) and 
11 non-binding votes.



On Thu, May 25, 2023 at 10:45 AM Jonathan Ellis
<jbel...@gmail.com> wrote:

Let's make this official.

CEP: 
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes



POC that demonstrates all the big rocks, including distributed queries: 
https://github.com/datastax/cassandra/tree/cep-vsearch

--

Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


--

Jonathan Ellis
co-founder, http://www.datastax.com
@spyced



--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
