Re: [DISCUSS] New data type for vector search

2023-04-27 Thread steve landiss via dev
 
+1

On Thursday, April 27, 2023 at 07:36:19 PM PDT, Caleb Rackliffe wrote:
 
 I don’t have a lot to add here, other than to say I’m broadly in agreement w/ 
David on syntax preference, element selectability, and making this a new type 
that roughly corresponds to a primitive (non-null-allowing) array.


On Apr 27, 2023, at 9:18 PM, Anthony Grasso  wrote:



It would be strange for this declaration to look different from other 
collection types. We may want to reconsider using the collection syntax. I also 
like the idea of the vector dimensions being declared with the VECTOR keyword. 
An alternative syntax option to explore is:

VECTOR[size]

On Fri, 28 Apr 2023 at 10:49, Josh McKenzie wrote:

From a machine learning perspective, vectors are a well-known concept that are 
effectively immutable fixed-length n-dimensional values that are then later 
used either as part of a model or in conjunction with a model after the fact.

While we could have this be non-frozen and not call it a vector, I'd be 
inclined to still make the argument for a layer of syntactic sugar on top that 
met ML users where they were with concepts they understood rather than forcing 
them through the cognitive lift of figuring out the Cassandra specific 
contortions to replicate something that's ubiquitous in their space. We did the 
same "Cassandra-first" approach with our JSON support and that didn't do us any 
favors in terms of adoption and usage as far as I know.

So is the goal here to provide something specific and idiomatic for the ML 
community or is the goal to make a primitive that's C*-centric that then 
another layer can write to? I personally argue for the former; I don't see this 
specific data type going away any time soon.
On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:


but as you point out it has the problem of allowing nulls.


If nulls are not allowed for the elements, then either we need  a) a new type, 
or b) add some way to say elements may not be null…. As much as I do like b, I 
am leaning towards new type for this use case.

So, to flesh out the type requirements I have seen so far

1) represents a fixed size array of element type
* on write path we will need to validate this
2) element may not be null
* on write path we will need to validate this
3) “frozen” (is this really a requirement for the type or is this just simpler 
for the ANN work?  I feel that this shouldn’t be a requirement)
4) works for all types (my requirement; original proposal is float only, but 
could logically expand to primitive types)

Anything else?


The key thing about a vector is that unlike lists or tuples you really don't 
care about individual elements, you care about doing vector and matrix 
multiplications with the thing as a unit. 


That maybe true for this use case, but “should” this be true for the type 
itself?  I feel like no… if a user wants the Nth element of a vector why would 
we block them?  I am not saying the first patch, or even 5.0 adds support for 
index access, I am just trying to push back saying that the type should not 
block this.


(Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT VECTOR[N].)


Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I prefer 
this syntax but that limitation may not be desired for all use cases… we could 
always add LIST and ARRAY later to address that case.

In terms of syntax I have seen, here is my ordered preference:

1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this 
semantic…. Could even be NON NULL TYPE[size]


On Apr 27, 2023, at 9:00 AM, Benedict  wrote:


That’s a bounded ring buffer, not a fixed length array.

This definitely isn’t a tuple because the types are all the same, which is 
pretty crucial for matrix operations. Matrix libraries generally work on arrays 
of known dimensionality, or sparse representations.

Whether we draw any semantic link between the frozen list and whatever we do 
here, it is fundamentally a frozen list with a restriction on its size. What 
we’re defining here are “statically” sized arrays, whereas a frozen list is 
essentially a dynamically sized array.

I do not think vector is a good name because vector is used in some other 
popular languages to mean a (dynamic) list, which is confusing when we also 
have a list concept.

I’m fine with just using the FLOAT[N] syntax, and drawing no direct link with 
list. Though it is a bit strange that this particular type declaration looks so 
different to other collection types.


On 27 Apr 2023, at 16:48, Jeff Jirsa  wrote:





On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis  wrote:

It's been a while, so I may be missing something, but do we already have 
fixed-size lists?  If not, I don't see why we'd try to make this fit into a 
List-shaped problem.


We do not. The proposal got closed as wont-fix  
https://issues.apache.org/jira/browse/CASSAND

Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Caleb Rackliffe
I don’t have a lot to add here, other than to say I’m broadly in agreement w/ 
David on syntax preference, element selectability, and making this a new type 
that roughly corresponds to a primitive (non-null-allowing) array.

On Apr 27, 2023, at 9:18 PM, Anthony Grasso wrote:

It would be strange for this declaration to look different from other 
collection types. We may want to reconsider using the collection syntax. I also 
like the idea of the vector dimensions being declared with the VECTOR keyword. 
An alternative syntax option to explore is:

VECTOR[size]

On Fri, 28 Apr 2023 at 10:49, Josh McKenzie wrote:

From a machine learning perspective, vectors are a well-known concept that are 
effectively immutable fixed-length n-dimensional values that are then later 
used either as part of a model or in conjunction with a model after the fact.

While we could have this be non-frozen and not call it a vector, I'd be 
inclined to still make the argument for a layer of syntactic sugar on top that 
met ML users where they were with concepts they understood rather than forcing 
them through the cognitive lift of figuring out the Cassandra specific 
contortions to replicate something that's ubiquitous in their space. We did the 
same "Cassandra-first" approach with our JSON support and that didn't do us any 
favors in terms of adoption and usage as far as I know.

So is the goal here to provide something specific and idiomatic for the ML 
community or is the goal to make a primitive that's C*-centric that then 
another layer can write to? I personally argue for the former; I don't see this 
specific data type going away any time soon.

On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:

but as you point out it has the problem of allowing nulls.

If nulls are not allowed for the elements, then either we need  a) a new type, 
or b) add some way to say elements may not be null…. As much as I do like b, I 
am leaning towards new type for this use case.

So, to flesh out the type requirements I have seen so far

1) represents a fixed size array of element type
* on write path we will need to validate this
2) element may not be null
* on write path we will need to validate this
3) “frozen” (is this really a requirement for the type or is this just simpler 
for the ANN work?  I feel that this shouldn’t be a requirement)
4) works for all types (my requirement; original proposal is float only, but 
could logically expand to primitive types)

Anything else?

The key thing about a vector is that unlike lists or tuples you really don't 
care about individual elements, you care about doing vector and matrix 
multiplications with the thing as a unit. 

That maybe true for this use case, but “should” this be true for the type 
itself?  I feel like no… if a user wants the Nth element of a vector why would 
we block them?  I am not saying the first patch, or even 5.0 adds support for 
index access, I am just trying to push back saying that the type should not 
block this.

(Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT VECTOR[N].)

Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I prefer 
this syntax but that limitation may not be desired for all use cases… we could 
always add LIST and ARRAY later to address that case.

In terms of syntax I have seen, here is my ordered preference:

1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this 
semantic…. Could even be NON NULL TYPE[size]

On Apr 27, 2023, at 9:00 AM, Benedict wrote:

That’s a bounded ring buffer, not a fixed length array.

This definitely isn’t a tuple because the types are all the same, which is 
pretty crucial for matrix operations. Matrix libraries generally work on arrays 
of known dimensionality, or sparse representations.

Whether we draw any semantic link between the frozen list and whatever we do 
here, it is fundamentally a frozen list with a restriction on its size. What 
we’re defining here are “statically” sized arrays, whereas a frozen list is 
essentially a dynamically sized array.

I do not think vector is a good name because vector is used in some other 
popular languages to mean a (dynamic) list, which is confusing when we also 
have a list concept.

I’m fine with just using the FLOAT[N] syntax, and drawing no direct link with 
list. Though it is a bit strange that this particular type declaration looks so 
different to other collection types.

On 27 Apr 2023, at 16:48, Jeff Jirsa wrote:

On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis wrote:

It's been a while, so I may be missing something, but do we already have 
fixed-size lists?  If not, I don't see why we'd try to make this fit into a 
List-shaped problem.

We do not. The proposal got closed as wont-fix  
https://issues.apache.org/jira/browse/CASSANDRA-9110


Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Anthony Grasso
It would be strange for this declaration to look different from other
collection types. We may want to reconsider using the collection syntax. I
also like the idea of the vector dimensions being declared with the VECTOR
keyword. An alternative syntax option to explore is:

VECTOR[size]

On Fri, 28 Apr 2023 at 10:49, Josh McKenzie  wrote:

> From a machine learning perspective, vectors are a well-known concept that
> are effectively immutable fixed-length n-dimensional values that are then
> later used either as part of a model or in conjunction with a model after
> the fact.
>
> While we could have this be non-frozen and not call it a vector, I'd be
> inclined to still make the argument for a layer of syntactic sugar on top
> that met ML users where they were with concepts they understood rather than
> forcing them through the cognitive lift of figuring out the Cassandra
> specific contortions to replicate something that's ubiquitous in their
> space. We did the same "Cassandra-first" approach with our JSON support and
> that didn't do us any favors in terms of adoption and usage as far as I
> know.
>
> So is the goal here to provide something specific and idiomatic for the ML
> community or is the goal to make a primitive that's C*-centric that then
> another layer can write to? I personally argue for the former; I don't see
> this specific data type going away any time soon.
>
> On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>
> but as you point out it has the problem of allowing nulls.
>
>
> If nulls are not allowed for the elements, then either we need  a) a new
> type, or b) add some way to say elements may not be null…. As much as I do
> like b, I am leaning towards new type for this use case.
>
> So, to flesh out the type requirements I have seen so far
>
> 1) represents a fixed size array of element type
> * on write path we will need to validate this
> 2) element may not be null
> * on write path we will need to validate this
> 3) “frozen” (is this really a requirement for the type or is this
> just simpler for the ANN work?  I feel that this shouldn’t be a requirement)
> 4) works for all types (my requirement; original proposal is float only,
> but could logically expand to primitive types)
>
> Anything else?
>
> The key thing about a vector is that unlike lists or tuples you really
> don't care about individual elements, you care about doing vector and
> matrix multiplications with the thing as a unit.
>
>
> That maybe true for this use case, but “should” this be true for the type
> itself?  I feel like no… if a user wants the Nth element of a vector why
> would we block them?  I am not saying the first patch, or even 5.0 adds
> support for index access, I am just trying to push back saying that the
> type should not block this.
>
> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT
> VECTOR[N].)
>
>
> Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I
> prefer this syntax but that limitation may not be desired for all use
> cases… we could always add LIST and ARRAY later
> to address that case.
>
> In terms of syntax I have seen, here is my ordered preference:
>
> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this
> semantic…. Could even be NON NULL TYPE[size]
>
> On Apr 27, 2023, at 9:00 AM, Benedict  wrote:
>
>
> That’s a bounded ring buffer, not a fixed length array.
>
> This definitely isn’t a tuple because the types are all the same, which is
> pretty crucial for matrix operations. Matrix libraries generally work on
> arrays of known dimensionality, or sparse representations.
>
> Whether we draw any semantic link between the frozen list and whatever we
> do here, it is fundamentally a frozen list with a restriction on its size.
> What we’re defining here are “statically” sized arrays, whereas a frozen
> list is essentially a dynamically sized array.
>
> I do not think vector is a good name because vector is used in some other
> popular languages to mean a (dynamic) list, which is confusing when we also
> have a list concept.
>
> I’m fine with just using the FLOAT[N] syntax, and drawing no direct link
> with list. Though it is a bit strange that this particular type declaration
> looks so different to other collection types.
>
> On 27 Apr 2023, at 16:48, Jeff Jirsa  wrote:
>
> 
>
>
> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis  wrote:
>
> It's been a while, so I may be missing something, but do we already have
> fixed-size lists?  If not, I don't see why we'd try to make this fit into a
> List-shaped problem.
>
>
> We do not. The proposal got closed as wont-fix
> https://issues.apache.org/jira/browse/CASSANDRA-9110
>
>
>
>


Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Josh McKenzie
From a machine learning perspective, vectors are a well-known concept that are 
effectively immutable fixed-length n-dimensional values that are then later 
used either as part of a model or in conjunction with a model after the fact.

While we could have this be non-frozen and not call it a vector, I'd be 
inclined to still make the argument for a layer of syntactic sugar on top that 
met ML users where they were with concepts they understood rather than forcing 
them through the cognitive lift of figuring out the Cassandra specific 
contortions to replicate something that's ubiquitous in their space. We did the 
same "Cassandra-first" approach with our JSON support and that didn't do us any 
favors in terms of adoption and usage as far as I know.

So is the goal here to provide something specific and idiomatic for the ML 
community or is the goal to make a primitive that's C*-centric that then 
another layer can write to? I personally argue for the former; I don't see this 
specific data type going away any time soon.

On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>> but as you point out it has the problem of allowing nulls.
> 
> If nulls are not allowed for the elements, then either we need  a) a new 
> type, or b) add some way to say elements may not be null…. As much as I do 
> like b, I am leaning towards new type for this use case.
> 
> So, to flesh out the type requirements I have seen so far
> 
> 1) represents a fixed size array of element type
> * on write path we will need to validate this
> 2) element may not be null
> * on write path we will need to validate this
> 3) “frozen” (is this really a requirement for the type or is this just 
> simpler for the ANN work?  I feel that this shouldn’t be a requirement)
> 4) works for all types (my requirement; original proposal is float only, but 
> could logically expand to primitive types)
> 
> Anything else?
> 
>> The key thing about a vector is that unlike lists or tuples you really don't 
>> care about individual elements, you care about doing vector and matrix 
>> multiplications with the thing as a unit. 
> 
> That maybe true for this use case, but “should” this be true for the type 
> itself?  I feel like no… if a user wants the Nth element of a vector why 
> would we block them?  I am not saying the first patch, or even 5.0 adds 
> support for index access, I am just trying to push back saying that the type 
> should not block this.
> 
>> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT 
>> VECTOR[N].)
> 
> Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I 
> prefer this syntax but that limitation may not be desired for all use cases… 
> we could always add LIST and ARRAY later to address that 
> case.
> 
> In terms of syntax I have seen, here is my ordered preference:
> 
> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this 
> semantic…. Could even be NON NULL TYPE[size]
> 
>> On Apr 27, 2023, at 9:00 AM, Benedict  wrote:
>> 
>> 
>> That’s a bounded ring buffer, not a fixed length array.
>> 
>> This definitely isn’t a tuple because the types are all the same, which is 
>> pretty crucial for matrix operations. Matrix libraries generally work on 
>> arrays of known dimensionality, or sparse representations.
>> 
>> Whether we draw any semantic link between the frozen list and whatever we do 
>> here, it is fundamentally a frozen list with a restriction on its size. What 
>> we’re defining here are “statically” sized arrays, whereas a frozen list is 
>> essentially a dynamically sized array.
>> 
>> I do not think vector is a good name because vector is used in some other 
>> popular languages to mean a (dynamic) list, which is confusing when we also 
>> have a list concept.
>> 
>> I’m fine with just using the FLOAT[N] syntax, and drawing no direct link 
>> with list. Though it is a bit strange that this particular type declaration 
>> looks so different to other collection types.
>> 
>>> On 27 Apr 2023, at 16:48, Jeff Jirsa  wrote:
>>> 
>>> 
>>> 
>>> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis  wrote:
>>>> It's been a while, so I may be missing something, but do we already have 
>>>> fixed-size lists?  If not, I don't see why we'd try to make this fit into 
>>>> a List-shaped problem.
>>> 
>>> We do not. The proposal got closed as wont-fix  
>>> https://issues.apache.org/jira/browse/CASSANDRA-9110
>>> 
>>> 


RE: Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-04-27 Thread Francisco Guerrero
Hi folks,


We have updated the confluence page with the source code for CEP-28.
There are two repositories with contributions. One is the patch [1]
for Cassandra Sidecar with the bulk APIs that enable the Cassandra
Spark Analytics library. The second is a new repository [2] with
contributions to the Cassandra Spark Analytics code.

We also have a README markdown file that you can follow to give the
code a try:


https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md


Best,

- Francisco


[1] Apache Cassandra Sidecar bulk APIs source code:
https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis

[2] Apache Cassandra Spark Analytics source code:
https://github.com/frankgh/cassandra-analytics


On 2023/04/05 15:18:07 Doug Rohrer wrote:
> Sorry for the delay in responding here - yes, we can add some diagrams to
> the CEP - I’ll try to get that done by end-of-week.
>
> Thanks,
>
> Doug
>
> > On Mar 28, 2023, at 1:14 PM, J. D. Jordan wrote:
> >
> > Maybe some data flow diagrams could be added to the cep showing some
> > example operations for read/write?
> >
> >> On Mar 28, 2023, at 11:35 AM, Yifan Cai wrote:
> >>
> >> A lot of great discussions!
> >>
> >> On the sidecar front, especially what the role sidecar plays in terms
> >> of this CEP, I feel there might be some confusion. Once the code is
> >> published, we should have clarity.
> >> Sidecar does not read sstables nor do any coordination for analytics
> >> queries. It is local to the companion Cassandra instance. For bulk
> >> read, it takes snapshots and streams sstables to spark workers to read.
> >> For bulk write, it imports the sstables uploaded from spark workers.
> >> All commands are existing jmx/nodetool functionalities from Cassandra.
> >> Sidecar adds the http interface to them. It might be an over simplified
> >> description. The complex computation is performed in spark clusters
> >> only.
> >>
> >> In the long run, Cassandra might evolve into a database that does both
> >> OLTP and OLAP. (Not what this thread aims for)
> >> At the current stage, Spark is very suited for analytic purposes.
> >>
> >> On Tue, Mar 28, 2023 at 9:06 AM Benedict wrote:
> >>> I disagree with the first claim, as the process has all the
> >>> information it chooses to utilise about which resources it’s using
> >>> and what it’s using those resources for.
> >>>
> >>> The inability to isolate GC domains is something we cannot address,
> >>> but also probably not a problem if we were doing everything with
> >>> memory management as well as we could be.
> >>>
> >>> But, not worth detailing this thread for. Today we do very little
> >>> well on this front within the process, and a separate process is well
> >>> justified given the state of play.
> >>>
> >>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker wrote:
> >>>>
> >>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch wrote:
> >>>> ...
> >>>> > > I think we might be underselling how valuable JVM isolation is,
> >>>> > > especially for analytics queries that are going to pass the entire
> >>>> > > dataset through heap somewhat constantly.
> >>>>
> >>>> Big +1 here. The JVM simply does not have significant granularity of
> >>>> control for resource utilization, but this is explicitly a feature
> >>>> of separate processes. Add in being able to separate GC domains and
> >>>> you can avoid a lot of noisy neighbor in-VM behavior for the
> >>>> disparate workloads.
> >>>>
> >>>> Cheers,
> >>>>
> >>>> Derek
> >>>>
> >>>> --
> >>>> +---+
> >>>> | Derek Chen-Becker |
> >>>> | GPG Key available at https://keybase.io/dchenbecker and |
> >>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> >>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC |
> >>>> +---+
>
-- 
Francisco Guerrero


Re: [DISCUSS] New data type for vector search

2023-04-27 Thread David Capwell
> but as you point out it has the problem of allowing nulls.

If nulls are not allowed for the elements, then either we need  a) a new type, 
or b) add some way to say elements may not be null…. As much as I do like b, I 
am leaning towards new type for this use case.

So, to flesh out the type requirements I have seen so far

1) represents a fixed size array of element type
* on write path we will need to validate this
2) element may not be null
* on write path we will need to validate this
3) “frozen” (is this really a requirement for the type or is this just simpler 
for the ANN work?  I feel that this shouldn’t be a requirement)
4) works for all types (my requirement; original proposal is float only, but 
could logically expand to primitive types)

Anything else?
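
A minimal sketch of what requirements 1 and 2 could look like as a write-path check, assuming a float element type. The function name `validate_vector` and its parameters are hypothetical and are not Cassandra APIs; returning a tuple is only a convenience here and does not settle the "frozen" question in requirement 3.

```python
from typing import Optional, Sequence, Tuple


def validate_vector(value: Optional[Sequence[Optional[float]]],
                    dimension: int) -> Tuple[float, ...]:
    """Hypothetical write-path validation for a fixed-size, non-null vector."""
    if value is None:
        raise ValueError("vector value may not be null")
    # Requirement 1: the number of elements must match the declared size.
    if len(value) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(value)}")
    # Requirement 2: no element may be null.
    for index, element in enumerate(value):
        if element is None:
            raise ValueError(f"element {index} may not be null")
    return tuple(float(element) for element in value)
```

Under these assumptions, `validate_vector([1, 2.5, 3], 3)` is accepted, while a two-element value or a value containing `None` is rejected before it is stored.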

> The key thing about a vector is that unlike lists or tuples you really don't 
> care about individual elements, you care about doing vector and matrix 
> multiplications with the thing as a unit. 

That may be true for this use case, but “should” this be true for the type 
itself?  I feel like no… if a user wants the Nth element of a vector why would 
we block them?  I am not saying the first patch, or even 5.0 adds support for 
index access, I am just trying to push back saying that the type should not 
block this.

> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT 
> VECTOR[N].)

Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I prefer 
this syntax but that limitation may not be desired for all use cases… we could 
always add LIST and ARRAY later to address that case.

In terms of syntax I have seen, here is my ordered preference:

1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this 
semantic…. Could even be NON NULL TYPE[size]
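
To make option 1 concrete, here is a rough sketch of how a `TYPE[size]` declaration could be split into an element type and a fixed size. It is an illustration only: `VectorType`, `parse_vector_decl`, and the regular expression are hypothetical names, and this is not the CQL grammar.

```python
import re
from typing import NamedTuple


class VectorType(NamedTuple):
    """Hypothetical parsed form of a TYPE[size] declaration."""
    element_type: str
    size: int


# Matches an identifier followed by a bracketed non-negative integer.
_DECLARATION = re.compile(r"^\s*([A-Za-z_]\w*)\s*\[\s*(\d+)\s*\]\s*$")


def parse_vector_decl(declaration: str) -> VectorType:
    match = _DECLARATION.match(declaration)
    if match is None:
        raise ValueError(f"not a TYPE[size] declaration: {declaration!r}")
    size = int(match.group(2))
    if size <= 0:
        raise ValueError("size must be a positive integer")
    return VectorType(match.group(1).upper(), size)
```

With this sketch, `parse_vector_decl("float[3]")` yields an element type of `FLOAT` and a size of 3, so the size is part of the declared type rather than a property of each value.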

> On Apr 27, 2023, at 9:00 AM, Benedict  wrote:
> 
> That’s a bounded ring buffer, not a fixed length array.
> 
> This definitely isn’t a tuple because the types are all the same, which is 
> pretty crucial for matrix operations. Matrix libraries generally work on 
> arrays of known dimensionality, or sparse representations.
> 
> Whether we draw any semantic link between the frozen list and whatever we do 
> here, it is fundamentally a frozen list with a restriction on its size. What 
> we’re defining here are “statically” sized arrays, whereas a frozen list is 
> essentially a dynamically sized array.
> 
> I do not think vector is a good name because vector is used in some other 
> popular languages to mean a (dynamic) list, which is confusing when we also 
> have a list concept.
> 
> I’m fine with just using the FLOAT[N] syntax, and drawing no direct link with 
> list. Though it is a bit strange that this particular type declaration looks 
> so different to other collection types.
> 
>> On 27 Apr 2023, at 16:48, Jeff Jirsa  wrote:
>> 
>> 
>> 
>> 
>> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis wrote:
>>> It's been a while, so I may be missing something, but do we already have 
>>> fixed-size lists?  If not, I don't see why we'd try to make this fit into a 
>>> List-shaped problem.
>> 
>> We do not. The proposal got closed as wont-fix  
>> https://issues.apache.org/jira/browse/CASSANDRA-9110
>> 
>> 



Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Benedict
That’s a bounded ring buffer, not a fixed length array.

This definitely isn’t a tuple because the types are all the same, which is 
pretty crucial for matrix operations. Matrix libraries generally work on arrays 
of known dimensionality, or sparse representations.

Whether we draw any semantic link between the frozen list and whatever we do 
here, it is fundamentally a frozen list with a restriction on its size. What 
we’re defining here are “statically” sized arrays, whereas a frozen list is 
essentially a dynamically sized array.

I do not think vector is a good name because vector is used in some other 
popular languages to mean a (dynamic) list, which is confusing when we also 
have a list concept.

I’m fine with just using the FLOAT[N] syntax, and drawing no direct link with 
list. Though it is a bit strange that this particular type declaration looks so 
different to other collection types.

On 27 Apr 2023, at 16:48, Jeff Jirsa wrote:

On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis wrote:

It's been a while, so I may be missing something, but do we already have 
fixed-size lists?  If not, I don't see why we'd try to make this fit into a 
List-shaped problem.

We do not. The proposal got closed as wont-fix  
https://issues.apache.org/jira/browse/CASSANDRA-9110


Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Jeff Jirsa
On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis  wrote:

> It's been a while, so I may be missing something, but do we already have
> fixed-size lists?  If not, I don't see why we'd try to make this fit into a
> List-shaped problem.
>

We do not. The proposal got closed as wont-fix
https://issues.apache.org/jira/browse/CASSANDRA-9110


Re: [DISCUSS] New data type for vector search

2023-04-27 Thread Jonathan Ellis
It's been a while, so I may be missing something, but do we already have
fixed-size lists?  If not, I don't see why we'd try to make this fit into a
List-shaped problem.

A tuple would be a better fit from that perspective, but as you point out
it has the problem of allowing nulls.

The key thing about a vector is that unlike lists or tuples you really
don't care about individual elements, you care about doing vector and
matrix multiplications with the thing as a unit.  That's the key reason
that it makes more sense to me as a separate type.
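
As an illustration of treating the vector as a unit, here is a small sketch of two whole-vector operations. The helper names `dot` and `cosine_similarity` are hypothetical and chosen for this example only; nothing here reflects how Cassandra or an ANN index would compute them.

```python
import math
from typing import Sequence


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product; only defined when both vectors have the same length."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return sum(x * y for x, y in zip(a, b))


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two non-zero vectors."""
    norm_a = math.sqrt(dot(a, a))
    norm_b = math.sqrt(dot(b, b))
    if norm_a == 0 or norm_b == 0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot(a, b) / (norm_a * norm_b)
```

Both functions consume every element and fail outright on a dimension mismatch, which is why a fixed size and non-null elements matter more here than per-element access.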

(Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT
VECTOR[N].)


On Wed, Apr 26, 2023 at 4:31 PM Andrés de la Peña 
wrote:

> If we are going to use FLOAT[N] as sugar for another CQL data type, maybe
> tuples are more convenient than lists. So FLOAT[N] could be equivalent to
> TUPLE.
>
> Unlike collections, tuples have a fixed size, they are always
> frozen and I think they don't support random access. These properties seem
> desirable for vectors.
>
> Tuples, however, support null values, whereas collections don't. I mean,
> you can remove elements from a collection, but I think you are never going
> to see an explicit null in the collection. Tuples don't allow removing a
> value, but the entire tuple can be written with null values, as in INSERT
> INTO t (key, tuple) VALUES (0, (1, null, 3)).
>
> On Wed, 26 Apr 2023 at 21:53, Mick Semb Wever  wrote:
>
>>> My inclination then would be to say you declare an ARRAY (which
>>> is semantic sugar for FROZEN>). This is very consistent with
>>> our existing style. We then simply permit such columns to define ANN
>>> indexes.
>>>
>>
>>
>> So long as nulls aren't a problem as David questions, an alternative is:
>>
>>  FLOAT[N] as semantic sugar for LIST
>>
>> And ANN requiring FROZEN
>>
>> Maybe taking a poll in a few days will be positive to keep this
>> moving forward.
>>
>

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced