Re: [DISCUSS] New data type for vector search

2023-04-26 Thread Andrés de la Peña
If we are going to use FLOAT[N] as sugar for another CQL data type, maybe
tuples are more convenient than lists. So FLOAT[N] could be equivalent to
TUPLE.

Unlike collections, tuples have a fixed size, are always frozen, and I
think they don't support random access. These properties seem desirable
for vectors.

Tuples, however, support null values, whereas collections don't. I mean,
you can remove elements from a collection, but I think you are never going
to see an explicit null in the collection. Tuples don't allow removing a
value, but the entire tuple can be written with null values. Like in INSERT
INTO t (key, tuple) VALUES (0,  (1, null, 3)).
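
To make the contrast concrete, here is a minimal Python sketch of the two behaviours described above. It is a toy model, not Cassandra code: `write_tuple` and `write_list` are hypothetical names, and rejecting a null inside a list with an error is an assumption made for illustration.

```python
from typing import List, Optional, Sequence, Tuple


def write_tuple(values: Sequence[Optional[float]], size: int) -> Tuple[Optional[float], ...]:
    """Toy model of a tuple column: fixed arity, nulls allowed per position."""
    if len(values) != size:
        raise ValueError(f"expected {size} fields, got {len(values)}")
    return tuple(values)


def write_list(values: Sequence[Optional[float]]) -> List[float]:
    """Toy model of a list column: any length, but no explicit nulls."""
    if any(v is None for v in values):
        raise ValueError("null is not allowed inside a collection")
    return list(values)


# Mirrors VALUES (0, (1, null, 3)): the tuple keeps the null in position 1.
stored = write_tuple([1.0, None, 3.0], size=3)
```

If vectors were sugar for tuples, something would still have to decide whether a null in the middle of a vector is acceptable for indexing.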

On Wed, 26 Apr 2023 at 21:53, Mick Semb Wever  wrote:

> My inclination then would be to say you declare an ARRAY (which
>> is semantic sugar for FROZEN<LIST<FLOAT>>). This is very consistent with
>> our existing style. We then simply permit such columns to define ANN
>> indexes.
>>
>
>
> So long as nulls aren't a problem as David questions, an alternative is:
>
>  FLOAT[N] as semantic sugar for LIST<FLOAT>
>
> And ANN requiring FROZEN
>
> Maybe taking a poll in a few days would help keep this
> moving forward.
>


Re: [DISCUSS] New data type for vector search

2023-04-26 Thread Mick Semb Wever
>
> My inclination then would be to say you declare an ARRAY (which
> is semantic sugar for FROZEN<LIST<FLOAT>>). This is very consistent with
> our existing style. We then simply permit such columns to define ANN
> indexes.
>


So long as nulls aren't a problem as David questions, an alternative is:

 FLOAT[N] as semantic sugar for LIST<FLOAT>

And ANN requiring FROZEN

Maybe taking a poll in a few days would help keep this
moving forward.


Re: Adding vector search to SAI with heirarchical navigable small world graph index

2023-04-26 Thread J. D. Jordan
If we look to PostgreSQL, it allows defining arrays using FLOAT[N] or FLOAT 
ARRAY[N].

So that is an extra point for me in favor of just using FLOAT[N].

From my quick search, neither Oracle* nor MySQL directly supports arrays in 
columns.

* Oracle supports declaring a custom type using VARRAY and then using that type 
for a column.
CREATE TYPE float_array AS VARRAY(100) OF FLOAT;

> On Apr 26, 2023, at 12:17 PM, David Capwell  wrote:
> 
> 
>> 
>> DENSE seems to just be an array? So very similar to a frozen list, but with 
>> a fixed size?
> 
> How I read the doc, DENSE = ARRAY, but I knew that couldn’t be the case, so 
> when I read the code, it’s a fixed-size array…. So the real syntax was “DENSE 
> FLOAT32[42]”
> 
> Not a fan of the type naming, and feel that a fixed size array could be 
> useful for other cases as well, so think we can improve here (personally 
> prefer float[42], text[42], etc… vector maybe closer to our 
> existing syntax but not a fan).
> 
>> I guess this is an excellent example to explore the minima of what 
>> constitutes a CEP
> 
> For the ANN change itself, a CEP feels like it makes sense.  Are we going to 
> depend on Lucene’s HNSW or build our own?  How do we validate this for 
> correctness?  What does correctness mean in a distributed context?  Is this 
> going to be pluggable (big push recently to offer pluggability)?
> 
> 
>> On Apr 26, 2023, at 7:37 AM, Patrick McFadin  wrote:
>> 
>> I guess this is an excellent example to explore the minima of what 
>> constitutes a CEP. So far, CEPs have been some large changes, so where does 
>> something like this fit? (Wait. Did I beat Benedict to a Bike Shed? I think 
>> I did.)
>> 
>> This is a list of everything needed for a CEP:
>> 
>> Status
>> Scope
>> Goals
>> Approach
>> Timeline
>> Mailing list / Slack channels
>> Related JIRA tickets
>> Motivation
>> Audience
>> Proposed Changes
>> New or Changed Public Interfaces
>> Compatibility, Deprecation, and Migration Plan
>> Test Plan
>> Rejected Alternatives
>> 
>> This is a big enough change to provide information for each element. Going 
>> back to the spirit of why we started CEPs, we wanted to avoid a mega-commit 
>> without some shaping and agreement before code goes into trunk. I don't have 
>> a clear indication of where that line lies. From our own wiki: "It is highly 
>> recommended to pursue a CEP for significant user-facing or changes that cut 
>> across multiple subsystems." That seems to fit here. Part of my motivation 
>> is being clear with potential new contributors by example and encouraging 
>> more awesomeness.  
>> 
>> The changes for operators:
>> - New drivers
>> - New guardrails?
>> - Indexing == storage requirements
>> 
>> Patrick
>> 
>> On Tue, Apr 25, 2023 at 10:53 PM Mick Semb Wever  wrote:
>> I was soo happy when I saw this, I know many users are going to be 
>> thrilled about it.
>> 
>> 
>> On Wed, 26 Apr 2023 at 05:15, Patrick McFadin  wrote:
>> Not sure if this is what you are saying, Josh, but I believe this needs to 
>> be its own CEP. It's a change in CQL syntax and changes how clusters 
>> operate. The change needs to be documented and voted on. Jonathan, you know 
>> how to find me if you want me to help write it. :) 
>> 
>> I'd be fine with just a DISCUSS thread to agree to the CQL change, since it 
>> (`DENSE FLOAT32`) appears to be minimal and the overall patch builds on 
>> SAI. As Henrik mentioned there's other SAI extensions being added too 
>> without CEPs.  Can you elaborate on how you see this changing how the 
>> cluster operates?
>> 
>> This will be easier to decide once we have a patch to look at, but that 
>> depends on a CEP-7 base (e.g. no feature branch exists). If we do want a CEP 
>> we need to allow a few weeks to get it through, but that can happen in 
>> parallel and maybe drafting up something now will be valuable anyway for an 
>> eventual CEP that proposes the more complete features (e.g. 
>> cosine_similarity(…)). 
>> 
> 


Re: [DISCUSS] New data type for vector search

2023-04-26 Thread David Capwell
Benedict’s comments also make me question: can any of the values in the vector 
be null?  The patch sent works with float arrays, so null isn’t possible… is 
null not valid for a vector type?  If so, this would help justify why a 
vector is not an array or a list (both allow null).
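
A minimal Python sketch of why a packed float array leaves no room for null. It assumes the vector is stored as N consecutive 32-bit floats; the big-endian layout and the helper name `pack_float_vector` are illustrative assumptions, not the patch's actual serialization.

```python
import struct
from typing import Sequence


def pack_float_vector(values: Sequence[float]) -> bytes:
    """Pack values as big-endian 32-bit floats, 4 bytes per element."""
    return struct.pack(f">{len(values)}f", *values)


# Three floats occupy exactly 3 * 4 bytes: every bit pattern is a float,
# so there is no spare encoding that could mean "null".
encoded = pack_float_vector([1.0, 2.5, -3.0])

try:
    pack_float_vector([1.0, None, 3.0])
    null_accepted = True
except struct.error:
    # The packed representation simply cannot express a missing element.
    null_accepted = False
```

Supporting nulls would need extra per-element metadata (a presence bitmap or length prefixes), which is the cost that list and tuple encodings already pay.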

> On Apr 26, 2023, at 10:50 AM, David Capwell  wrote:
> 
> Thanks for starting this thread!
> 
>> In the initial commits and thread, this was DENSE FLOAT32. Nobody really 
>> loved that, so we considered a bunch of alternatives, including
>> 
>> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which 
>> would make it familiar for many users. However, this syntax raises the 
>> question of why arrays cannot be created for other types.  Additionally, the 
>> expectation for an array is to provide random access to its contents, which 
>> is not supported for vectors.
>> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense 
>> vectors, not sparse ones. However, since Lucene had sparse vector support in 
>> the past but removed it for lack of compelling use cases, it is unlikely 
>> that it will be added back, making the "DENSE" qualifier less relevant.
>> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with 
>> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the 
>> reasons mentioned above.
>> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less 
>> natural word order.
>> - `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again 
>> this would imply that random access is supported, which we want to avoid 
>> doing.
>> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and 
>> could make it difficult to add other vector types, such as byte vectors 
>> (already supported by Lucene), in the future.
> 
> I didn’t look closely enough when I saw your patch: is this type multicell or 
> not?  Aka is this acting like a frozen<list<float>> of fixed size?  I had 
> assumed it’s non-multicell…. Main reason I ask this now is this pushback on 
> random access…. Let’s say I have the following table
> 
> CREATE TABLE fluffy_kittens (
>   pk int PRIMARY KEY,
>   vector FLOAT[42] -- don’t ask why fluffy kittens need a vector, they just do!
> )
> 
> If I do the following query, I would expect it to work
> 
> SELECT vector[7] FROM fluffy_kittens WHERE pk=0; -- 7 is less than 42
> 
> While working on Accord’s CQL integration, Caleb and I kept getting bitten by 
> frozen vs non-frozen behavior: so many cases just stopped working on frozen 
> collections and should be easy to add (we force the user to load the full 
> value already, so why can we not touch it?).
> 
> Now, back to the random access comment, assuming this is not multicell why 
> would random access be blocked?  If the type isValueLengthFixed() == true 
> then random access should be simple (else it does require walking the array 
> in-order or to fully deserialize the BB (if working with Lucene I assume we 
> already deserialized out of BB)).  I am just trying to flesh out if there is 
> a limitation not being brought up or is this trying to limit the scope of 
> access for easier testing?
> 
>> However, this syntax raises the question of why arrays cannot be created for 
>> other types
> 
> Left this comment in the other thread, why not?  This could be useful outside 
> the float use case, so having a new “VectorType(AbstractType elements, int 
> size)” is easier/better than a float-only version.  I also did a lot of work 
> to fuzz test our type system, so just adding that into the existing generator 
> would get good coverage right off the bat (have another fuzz tester I have 
> not contributed yet, it was done for Accord… it fuzz tests the AST, so would 
> be easy to add this there, that would test type specific access, which the 
> existing tests don’t)
> 
>> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow 
>> consistency if we add other float types like FLOAT16 or FLOAT64
> 
> I do not think we should add a new FLOAT32 type, but I am cool with an alias 
> that has FLOAT32 point to FLOAT.  One negative of this is that the code paths 
> where we return schema back to users would do FLOAT even if user wrote 
> FLOAT32… other than that negative I don’t see any other problems.
> 
>> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance 
>> of clarity, conciseness, and extensibility. It is more natural in its word 
>> order than the original proposal and avoids unnecessary qualifiers, while 
>> still being clear about the data type it represents. Finally, this syntax is 
>> straightforwardly extensible should we choose to support other vector types 
>> in the future.
> 
> My preference is TYPE[n_dimension] but I am ok with this syntax if others 
> prefer it.  I don’t agree that this extra verbosity adds more clarity, there 
> seems to be an assumption that this will tell users that random access isn’t 
> allowed and only blessed types are

Re: [DISCUSS] New data type for vector search

2023-04-26 Thread David Capwell
Thanks for starting this thread!

> In the initial commits and thread, this was DENSE FLOAT32. Nobody really 
> loved that, so we considered a bunch of alternatives, including
> 
> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which 
> would make it familiar for many users. However, this syntax raises the 
> question of why arrays cannot be created for other types.  Additionally, the 
> expectation for an array is to provide random access to its contents, which 
> is not supported for vectors.
> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense 
> vectors, not sparse ones. However, since Lucene had sparse vector support in 
> the past but removed it for lack of compelling use cases, it is unlikely that 
> it will be added back, making the "DENSE" qualifier less relevant.
> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with 
> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the 
> reasons mentioned above.
> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less 
> natural word order.
> - `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again 
> this would imply that random access is supported, which we want to avoid 
> doing.
> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and 
> could make it difficult to add other vector types, such as byte vectors 
> (already supported by Lucene), in the future.

I didn’t look closely enough when I saw your patch: is this type multicell or 
not?  Aka is this acting like a frozen<list<float>> of fixed size?  I had 
assumed it’s non-multicell…. Main reason I ask this now is this pushback on 
random access…. Let’s say I have the following table

CREATE TABLE fluffy_kittens (
  pk int PRIMARY KEY,
  vector FLOAT[42] -- don’t ask why fluffy kittens need a vector, they just do!
)

If I do the following query, I would expect it to work

SELECT vector[7] FROM fluffy_kittens WHERE pk=0; -- 7 is less than 42

While working on Accord’s CQL integration, Caleb and I kept getting bitten by 
frozen vs non-frozen behavior: so many cases just stopped working on frozen 
collections and should be easy to add (we force the user to load the full 
value already, so why can we not touch it?).

Now, back to the random access comment, assuming this is not multicell why 
would random access be blocked?  If the type isValueLengthFixed() == true then 
random access should be simple (else it does require walking the array in-order 
or to fully deserialize the BB (if working with Lucene I assume we already 
deserialized out of BB)).  I am just trying to flesh out if there is a 
limitation not being brought up or is this trying to limit the scope of access 
for easier testing?
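
The fixed-length versus variable-length distinction above can be sketched in Python as follows. This is an illustration under assumptions, not Cassandra's serialization: elements are taken to be big-endian 32-bit floats in the fixed case and length-prefixed byte strings in the variable case, and all function names are hypothetical.

```python
import struct

FLOAT_SIZE = 4  # bytes per 32-bit float


def pack_vector(values) -> bytes:
    """Serialize floats back to back, with no per-element header."""
    return struct.pack(f">{len(values)}f", *values)


def element_at(buffer: bytes, index: int) -> float:
    """Fixed-length elements: jump straight to index * FLOAT_SIZE."""
    count = len(buffer) // FLOAT_SIZE
    if not 0 <= index < count:
        raise IndexError(f"index {index} out of range for vector of size {count}")
    return struct.unpack_from(">f", buffer, index * FLOAT_SIZE)[0]


def element_at_variable(buffer: bytes, index: int) -> bytes:
    """Variable-length elements (4-byte length prefix each): walk in order."""
    offset = 0
    for _ in range(index):
        (length,) = struct.unpack_from(">i", buffer, offset)
        offset += 4 + length
    (length,) = struct.unpack_from(">i", buffer, offset)
    return buffer[offset + 4 : offset + 4 + length]


# Analogue of SELECT vector[7] on a FLOAT[42] column.
buf = pack_vector([float(i) for i in range(42)])
seventh = element_at(buf, 7)
```

In the fixed-length case the lookup cost does not depend on the index, which is why blocking random access looks like a scoping choice rather than a technical limitation.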

> However, this syntax raises the question of why arrays cannot be created for 
> other types

Left this comment in the other thread, why not?  This could be useful outside 
the float use case, so having a new “VectorType(AbstractType elements, int 
size)” is easier/better than a float-only version.  I also did a lot of work to 
fuzz test our type system, so just adding that into the existing generator 
would get good coverage right off the bat (have another fuzz tester I have not 
contributed yet, it was done for Accord… it fuzz tests the AST, so would be 
easy to add this there, that would test type specific access, which the 
existing tests don’t)
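
A rough Python sketch of what a vector type parameterized by element type and size could look like, plus the kind of round-trip property a type-system fuzz test would exercise. The class names, the `struct`-based encoding and the `fuzz_round_trip` helper are all hypothetical stand-ins, not the Cassandra `AbstractType` API or the existing generators.

```python
import random
import struct
from dataclasses import dataclass


@dataclass(frozen=True)
class ElementType:
    """Hypothetical stand-in for an element type with a fixed-width encoding."""
    name: str
    fmt: str  # struct format character, e.g. "f" for float, "i" for int


@dataclass(frozen=True)
class VectorType:
    """Fixed-size vector over any ElementType, not only floats."""
    element: ElementType
    size: int

    def serialize(self, values) -> bytes:
        if len(values) != self.size:
            raise ValueError(f"expected {self.size} elements, got {len(values)}")
        return struct.pack(f">{self.size}{self.element.fmt}", *values)

    def deserialize(self, data: bytes) -> list:
        return list(struct.unpack(f">{self.size}{self.element.fmt}", data))


FLOAT = ElementType("float", "f")
INT = ElementType("int", "i")


def fuzz_round_trip(iterations: int, seed: int) -> bool:
    """Generate random int vectors and check serialize/deserialize round-trips."""
    rng = random.Random(seed)
    for _ in range(iterations):
        size = rng.randint(1, 8)
        vector_type = VectorType(INT, size)
        values = [rng.randint(-2**31, 2**31 - 1) for _ in range(size)]
        if vector_type.deserialize(vector_type.serialize(values)) != values:
            return False
    return True
```

Because the size check and the encoding live in one place, adding another element type is a one-line declaration rather than a new vector class.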

> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow 
> consistency if we add other float types like FLOAT16 or FLOAT64

I do not think we should add a new FLOAT32 type, but I am cool with an alias 
that has FLOAT32 point to FLOAT.  One negative of this is that the code paths 
where we return schema back to users would do FLOAT even if user wrote FLOAT32… 
other than that negative I don’t see any other problems.
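
The downside mentioned above can be shown with a small Python sketch. The alias table and function names are hypothetical; the only point is that once an alias is resolved at parse time, the schema returned to the user carries the canonical name.

```python
# Hypothetical alias table: alias name -> canonical CQL type name.
TYPE_ALIASES = {"FLOAT32": "FLOAT"}


def resolve_type(name: str) -> str:
    """Return the canonical type name for a (possibly aliased) type name."""
    upper = name.upper()
    return TYPE_ALIASES.get(upper, upper)


def describe_column(declared_type: str) -> str:
    """What a schema description would show: only the canonical name survives."""
    return resolve_type(declared_type)


# The user wrote FLOAT32, but the schema comes back as FLOAT.
shown = describe_column("FLOAT32")
```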

> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance 
> of clarity, conciseness, and extensibility. It is more natural in its word 
> order than the original proposal and avoids unnecessary qualifiers, while 
> still being clear about the data type it represents. Finally, this syntax is 
> straightforwardly extensible should we choose to support other vector types in 
> the future.

My preference is TYPE[n_dimension] but I am ok with this syntax if others 
prefer it.  I don’t agree that this extra verbosity adds more clarity, there 
seems to be an assumption that this will tell users that random access isn’t 
allowed and only blessed types are allowed… both points I feel are not valid 
(or not seen anything published why they should be valid).  There is a 
difference between what a type “could” do and what we implement day 1, I 
wouldn’t want to add more verbosity because of intentions of the day 1 
implementation. 


> On Apr 26, 2023, at 7:31 AM, Jonathan Ellis  wrote:
> 
> Hi all,
> 
> Splitting this out per the suggestion in the initial VS thread so we can work 
> on driver support in parallel with the 

Re: [DISCUSS] New data type for vector search

2023-04-26 Thread Benedict Elliott Smith
I think we need to briefly step back and think about what the syntax means
and how it fits into existing syntax.

It seems that the dimensionality verbiage assumes we’re logically
introducing N vector fields, so that each row adopts a value for all of the
vector fields or none. But in practice we are actually introducing a
fixed-length frozen list in Cassandra terms, and our API treats this as a
per-row array/vector rather than a number of column vectors.

My inclination then would be to say you declare an ARRAY (which is semantic
sugar for FROZEN<LIST<FLOAT>>). This is very consistent with our existing
style. We then simply permit such columns to define ANN indexes.

Otherwise, I think we should lean into the idea that this is a set of N
vectors, as “dimensions” makes limited sense when discussing an array
length. In this case I would lean towards declaring e.g. 1500 FLOAT
VECTORS, maybe. But then I think we should reconsider our presentation a
little, and perhaps the result set should treat each vector as a separate
field (or something like this).

On 26 Apr 2023, at 15:31, Jonathan Ellis  wrote:

> Hi all,
>
> Splitting this out per the suggestion in the initial VS thread so we can
> work on driver support in parallel with the server-side changes.
>
> I propose adding a new data type for vector search indexes:
>
> FLOAT VECTOR[N_DIMENSIONS]
>
> In the initial commits and thread, this was DENSE FLOAT32. Nobody really
> loved that, so we considered a bunch of alternatives, including
>
> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which
> would make it familiar for many users. However, this syntax raises the
> question of why arrays cannot be created for other types.  Additionally,
> the expectation for an array is to provide random access to its contents,
> which is not supported for vectors.
> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense
> vectors, not sparse ones. However, since Lucene had sparse vector support
> in the past but removed it for lack of compelling use cases, it is unlikely
> that it will be added back, making the "DENSE" qualifier less relevant.
> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with
> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the
> reasons mentioned above.
> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a
> less natural word order.
> - `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again
> this would imply that random access is supported, which we want to avoid
> doing.
> - `VECTOR[N]`: This syntax is not very clear about the vector's contents
> and could make it difficult to add other vector types, such as byte vectors
> (already supported by Lucene), in the future.
>
> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow
> consistency if we add other float types like FLOAT16 or FLOAT64, both of
> which are sometimes used in ML. However, we already have a CQL data type
> for a 64-bit float (`DOUBLE`), so it would make more sense to add future
> variants (which remain hypothetical at this point) along that line instead.
>
> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best
> balance of clarity, conciseness, and extensibility. It is more natural in
> its word order than the original proposal and avoids unnecessary
> qualifiers, while still being clear about the data type it represents.
> Finally, this syntax is straightforwardly extensible should we choose to
> support other vector types in the future.
>
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced


Re: Adding vector search to SAI with heirarchical navigable small world graph index

2023-04-26 Thread David Capwell
> DENSE seems to just be an array? So very similar to a frozen list, but with a 
> fixed size?

How I read the doc, DENSE = ARRAY, but I knew that couldn’t be the case, so 
when I read the code, it’s a fixed-size array…. So the real syntax was “DENSE 
FLOAT32[42]”

Not a fan of the type naming, and feel that a fixed size array could be useful 
for other cases as well, so think we can improve here (personally prefer 
float[42], text[42], etc… vector maybe closer to our existing syntax 
but not a fan).

> I guess this is an excellent example to explore the minima of what 
> constitutes a CEP

For the ANN change itself, a CEP feels like it makes sense.  Are we going to 
depend on Lucene’s HNSW or build our own?  How do we validate this for 
correctness?  What does correctness mean in a distributed context?  Is this 
going to be pluggable (big push recently to offer pluggability)?
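
One common way to frame the correctness question for an approximate index is recall against an exact brute-force search. The Python sketch below illustrates that idea only; it says nothing about how the patch is actually tested, and the function names and the tiny data set are made up for the example.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def exact_top_k(query: Sequence[float], vectors: Sequence[Sequence[float]], k: int) -> List[int]:
    """Brute-force ground truth: indices of the k most similar vectors."""
    ranked = sorted(
        range(len(vectors)),
        key=lambda i: cosine_similarity(query, vectors[i]),
        reverse=True,
    )
    return ranked[:k]


def recall_at_k(approx: Sequence[int], exact: Sequence[int]) -> float:
    """Fraction of the true top-k that the approximate result also returned."""
    return len(set(approx) & set(exact)) / len(exact)


vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [-1.0, 0.0]]
truth = exact_top_k([1.0, 0.0], vectors, k=2)
# A hypothetical ANN result that found one of the two true neighbours.
score = recall_at_k([0, 2], truth)
```

In a distributed context the same measurement would have to be taken on the merged result across replicas and token ranges, which is part of what the question above is asking.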


> On Apr 26, 2023, at 7:37 AM, Patrick McFadin  wrote:
> 
> I guess this is an excellent example to explore the minima of what 
> constitutes a CEP. So far, CEPs have been some large changes, so where does 
> something like this fit? (Wait. Did I beat Benedict to a Bike Shed? I think I 
> did.)
> 
> This is a list of everything needed for a CEP:
> 
> Status
> Scope
> Goals
> Approach
> Timeline
> Mailing list / Slack channels
> Related JIRA tickets
> Motivation
> Audience
> Proposed Changes
> New or Changed Public Interfaces
> Compatibility, Deprecation, and Migration Plan
> Test Plan
> Rejected Alternatives
> 
> This is a big enough change to provide information for each element. Going 
> back to the spirit of why we started CEPs, we wanted to avoid a mega-commit 
> without some shaping and agreement before code goes into trunk. I don't have 
> a clear indication of where that line lies. From our own wiki: "It is highly 
> recommended to pursue a CEP for significant user-facing or changes that cut 
> across multiple subsystems." That seems to fit here. Part of my motivation is 
> being clear with potential new contributors by example and encouraging more 
> awesomeness.  
> 
> The changes for operators:
>  - New drivers
>  - New guardrails?
>  - Indexing == storage requirements
> 
> Patrick
> 
> On Tue, Apr 25, 2023 at 10:53 PM Mick Semb Wever  wrote:
> I was soo happy when I saw this, I know many users are going to be 
> thrilled about it.
> 
> 
> On Wed, 26 Apr 2023 at 05:15, Patrick McFadin  wrote:
> Not sure if this is what you are saying, Josh, but I believe this needs to be 
> its own CEP. It's a change in CQL syntax and changes how clusters operate. 
> The change needs to be documented and voted on. Jonathan, you know how to 
> find me if you want me to help write it. :) 
> 
> I'd be fine with just a DISCUSS thread to agree to the CQL change, since it 
> (`DENSE FLOAT32`) appears to be minimal and the overall patch builds on 
> SAI. As Henrik mentioned there's other SAI extensions being added too without 
> CEPs.  Can you elaborate on how you see this changing how the cluster 
> operates?
> 
> This will be easier to decide once we have a patch to look at, but that 
> depends on a CEP-7 base (e.g. no feature branch exists). If we do want a CEP 
> we need to allow a few weeks to get it through, but that can happen in 
> parallel and maybe drafting up something now will be valuable anyway for an 
> eventual CEP that proposes the more complete features (e.g. 
> cosine_similarity(…)). 
> 



Re: Adding vector search to SAI with heirarchical navigable small world graph index

2023-04-26 Thread Patrick McFadin
I guess this is an excellent example to explore the minima of what
constitutes a CEP. So far, CEPs have been some large changes, so where does
something like this fit? (Wait. Did I beat Benedict to a Bike Shed? I think
I did.)

This is a list of everything needed for a CEP:

Status
Scope
Goals
Approach
Timeline
Mailing list / Slack channels
Related JIRA tickets
Motivation
Audience
Proposed Changes
New or Changed Public Interfaces
Compatibility, Deprecation, and Migration Plan
Test Plan
Rejected Alternatives

This is a big enough change to provide information for each element. Going
back to the spirit of why we started CEPs, we wanted to avoid a mega-commit
without some shaping and agreement before code goes into trunk. I don't
have a clear indication of where that line lies. From our own wiki: "It is
highly recommended to pursue a CEP for significant user-facing or changes
that cut across multiple subsystems." That seems to fit here. Part of my
motivation is being clear with potential new contributors by example and
encouraging more awesomeness.

The changes for operators:
 - New drivers
 - New guardrails?
 - Indexing == storage requirements

Patrick

On Tue, Apr 25, 2023 at 10:53 PM Mick Semb Wever  wrote:

> I was soo happy when I saw this, I know many users are going to be
> thrilled about it.
>
>
> On Wed, 26 Apr 2023 at 05:15, Patrick McFadin  wrote:
>
>> Not sure if this is what you are saying, Josh, but I believe this needs
>> to be its own CEP. It's a change in CQL syntax and changes how clusters
>> operate. The change needs to be documented and voted on. Jonathan, you know
>> how to find me if you want me to help write it. :)
>>
>
> I'd be fine with just a DISCUSS thread to agree to the CQL change, since
> it (`DENSE FLOAT32`) appears to be minimal and the overall patch
> builds on SAI. As Henrik mentioned there's other SAI extensions being
> added too without CEPs.  Can you elaborate on how you see this changing how
> the cluster operates?
>
> This will be easier to decide once we have a patch to look at, but that
> depends on a CEP-7 base (e.g. no feature branch exists). If we do want a
> CEP we need to allow a few weeks to get it through, but that can happen in
> parallel and maybe drafting up something now will be valuable anyway for an
> eventual CEP that proposes the more complete features (e.g.
> cosine_similarity(…)).
>
>
>


[DISCUSS] New data type for vector search

2023-04-26 Thread Jonathan Ellis
Hi all,

Splitting this out per the suggestion in the initial VS thread so we can
work on driver support in parallel with the server-side changes.

I propose adding a new data type for vector search indexes:

FLOAT VECTOR[N_DIMENSIONS]

In the initial commits and thread, this was DENSE FLOAT32. Nobody really
loved that, so we considered a bunch of alternatives, including

- `FLOAT[N]`: This minimal option resembles C and Java array syntax, which
would make it familiar for many users. However, this syntax raises the
question of why arrays cannot be created for other types.  Additionally,
the expectation for an array is to provide random access to its contents,
which is not supported for vectors.
- `DENSE FLOAT[N]`: This option clarifies that we are supporting dense
vectors, not sparse ones. However, since Lucene had sparse vector support
in the past but removed it for lack of compelling use cases, it is unlikely
that it will be added back, making the "DENSE" qualifier less relevant.
- `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with
the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the
reasons mentioned above.
- `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a
less natural word order.
- `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again
this would imply that random access is supported, which we want to avoid
doing.
- `VECTOR[N]`: This syntax is not very clear about the vector's contents
and could make it difficult to add other vector types, such as byte vectors
(already supported by Lucene), in the future.

Finally, the original qualifier of 32 in `FLOAT32` was intended to allow
consistency if we add other float types like FLOAT16 or FLOAT64, both of
which are sometimes used in ML. However, we already have a CQL data type
for a 64-bit float (`DOUBLE`), so it would make more sense to add future
variants (which remain hypothetical at this point) along that line instead.

Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best
balance of clarity, conciseness, and extensibility. It is more natural in
its word order than the original proposal and avoids unnecessary
qualifiers, while still being clear about the data type it represents.
Finally, this syntax is straightforwardly extensible should we choose to
support other vector types in the future.
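
As a rough illustration of the extensibility argument, the Python sketch below parses declarations of the proposed `<ELEMENT> VECTOR[N]` shape. It is not how the CQL grammar is implemented; the regular expression, the `VectorDecl` type and the `BYTE` element name are assumptions for the example.

```python
import re
from typing import NamedTuple


class VectorDecl(NamedTuple):
    element_type: str
    dimensions: int


# <element type> VECTOR [ <positive integer> ], case-insensitive.
_DECL = re.compile(
    r"^\s*(?P<elem>[A-Za-z_]\w*)\s+VECTOR\s*\[\s*(?P<n>\d+)\s*\]\s*$",
    re.IGNORECASE,
)


def parse_vector_decl(text: str) -> VectorDecl:
    """Parse e.g. 'FLOAT VECTOR[1536]' into its element type and dimension count."""
    match = _DECL.match(text)
    if match is None:
        raise ValueError(f"not a vector declaration: {text!r}")
    dimensions = int(match.group("n"))
    if dimensions <= 0:
        raise ValueError("dimension count must be positive")
    return VectorDecl(match.group("elem").upper(), dimensions)


float_decl = parse_vector_decl("FLOAT VECTOR[1536]")
byte_decl = parse_vector_decl("byte vector[ 8 ]")  # a hypothetical future byte vector
```

The element type is just a leading token, so a new vector type would not need a new keyword or word order.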

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


Re: [EXTERNAL] Re: (CVE only) support for 3,11 beyond published EOL

2023-04-26 Thread Mick Semb Wever
On Sat, 15 Apr 2023 at 03:17, C. Scott Andreas  wrote:

> If there’s lack of clarity around EOL policy and dates, we should
> absolutely make this clear.
>


Fix is here:
https://github.com/thelastpickle/cassandra-website/tree/mck/update-5-0_dates_download_page


w/ html generated here:
https://raw.githack.com/thelastpickle/cassandra-website/mck/update-5-0_dates_download_page_generated/content/_/download.html


I'll merge this tomorrow if there's no further input.


Re: Adding vector search to SAI with heirarchical navigable small world graph index

2023-04-26 Thread Benedict
We probably at least need to bike shed naming as we already have FLOAT,
DOUBLE, and LIST - which are similar/overlapping types, and we should be
consistent.

If we introduce FLOAT32 we probably need that to be an alias of FLOAT and
introduce FLOAT64 to alias DOUBLE for consistency.

DENSE seems to just be an array? So very similar to a frozen list, but with
a fixed size?

On 26 Apr 2023, at 06:53, Mick Semb Wever  wrote:

> I was soo happy when I saw this, I know many users are going to be
> thrilled about it.
>
> On Wed, 26 Apr 2023 at 05:15, Patrick McFadin  wrote:
>
>> Not sure if this is what you are saying, Josh, but I believe this needs
>> to be its own CEP. It's a change in CQL syntax and changes how clusters
>> operate. The change needs to be documented and voted on. Jonathan, you know
>> how to find me if you want me to help write it. :)
>
> I'd be fine with just a DISCUSS thread to agree to the CQL change, since
> it (`DENSE FLOAT32`) appears to be minimal and the overall patch builds on
> SAI. As Henrik mentioned there's other SAI extensions being added too
> without CEPs.  Can you elaborate on how you see this changing how the
> cluster operates?
>
> This will be easier to decide once we have a patch to look at, but that
> depends on a CEP-7 base (e.g. no feature branch exists). If we do want a
> CEP we need to allow a few weeks to get it through, but that can happen in
> parallel and maybe drafting up something now will be valuable anyway for an
> eventual CEP that proposes the more complete features (e.g.
> cosine_similarity(…)).