Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
\o/

Bring it in team. Group hug.

Now if you'll excuse me, I'm going to go build my preso on how Cassandra is
the only distributed database where you can do vector search in an ACID
transaction.

Patrick

On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis  wrote:

> I had a call with David.  We agreed that we want a "vector" data type with
> these properties
>
> - Fixed length
> - No nulls
> - Random access not supported
>
> Where we disagreed was on my proposal to restrict vectors to only numeric
> data.  David's points were that
>
> (1) He has a use case today for a data type with the other vector
> properties,
> (2) It doesn't seem reasonable to create two data types with the same
> properties, one of which is restricted to numerics, and
> (3) The restrictions that I want for numeric vectors make more sense at
> the index and function level, than at the type level.
>
> I'm ready to concede that David has the better case here and move forward
> with a vector implementation without that restriction.
>
> On Tue, May 2, 2023 at 4:03 PM David Capwell  wrote:
>
>>  How about it, David? Did you already make this?
>>
>>
>> I checked out the patch, fixed serialize/deserialize, added the
>> constraints, then added a composeForFloat(ByteBuffer), with this the impact
>> to the POC patch was the following
>>
>> 1) move away from VectorType.instance.serializer().deserialize(bb) to
>> type.composeForFloat(bb), both return float[]
>> 2) change the index validate logic to move away from checking VectorType
>> and instead check for that plus the element type == FloatType.  I didn’t
>> bother to do this as it’s trivial
>>
>> David. End this argument. SHOW THE CODE!
>>
>>
>> If this argument ends and people are cool with vector supporting abstract
>> type, more than glad to help get this in.
>>
>> On May 2, 2023, at 1:53 PM, Jeremy Hanna 
>> wrote:
>>
>> I'm all for bringing more functionality to the masses sooner, but the
>> original idea has a very very specific use case.  Do we have use cases for
>> a general purpose Vector/Array data structure?  If so, awesome.  I just
>> wondered if generalizing provides value, beyond being straightforward to
>> implement.  I'm just trying to be sensitive to the database code
>> maintenance and driver support for general types versus a single type for a
>> specific, well defined purpose.
>>
>> If it could easily be a plugin, that's great - but the full picture
>> involves drivers that need to support it or you end up getting binary blobs
>> you have to decode client side and then do stuff with.  So ideally if you
>> have a well defined use case that you can build into the database, having
>> it just be part of the database and associated drivers - that makes the
>> experience much much better.
>>
>> I'm not trying to say B couldn't be valuable or that a plugin couldn't be
>> feasible.  I'm just trying to enlarge the picture a bit to see what that
>> means for this use case and for the supporting drivers/clients.
>>
>> On May 2, 2023, at 3:04 PM, Benedict  wrote:
>>
>> But it’s so trivial it was already implemented by David in the span of
>> ten minutes? If anything, we’re slowing progress down by refusing to do the
>> extra types, as we’re busy arguing about it rather than delivering a
>> feature?
>>
>> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
>> support types beyond float. Not that we should start with float.
>>
>> So, this whole debate is a mess, I think. But hey ho.
>>
>> On 2 May 2023, at 20:57, Patrick McFadin  wrote:
>>
>> 
>> I'll speak up on that one. If you look at my ranked voting, that is where
>> my head is. I get accused of scope creep (a lot) and looking at the initial
>> proposal Jonathan put on the ML it was mostly "Developers are adopting
>> vector search at a furious pace and I think I have a simple way of adding
>> support to keep Cassandra relevant for these use cases" Instead of just
>> focusing on this use case, I feel the arguments have bike shedded into
>> scope creep which means it will take forever to get into the project.
>>
>> My preference is to see one thing validated with an MVP and get it into
>> the hands of developers sooner so we can continue to iterate based on
>> actual usage.
>>
>> It doesn't say your points are wrong or your opinions are broken, I'm
>> voting for what I think will be awesome for users sooner.
>>
>> Patrick
>>
>> On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:
>>
>>> Could folk voting against a general purpose type (that could well be
>>> called a vector) briefly explain their reasoning?
>>>
>>> We established in the other thread that it’s technically trivial,
>>> meaning folk must think it is strictly superior to only support float
>>> rather than eg all numeric types (note: for the type, not the ANN).
>>>
>>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>>> touch on this, mostly just endorsing the idea of a vector.
>>>
>>>
>>> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
>>>
>>

Re: [POLL] Vector type for ML

2023-05-02 Thread Dinesh Joshi
I'm also in favor of having a general data type that is not tied to numeric 
data types alone.

On 2023/05/02 22:27:24 Jonathan Ellis wrote:
> I had a call with David.  We agreed that we want a "vector" data type with
> these properties
> 
> - Fixed length
> - No nulls
> - Random access not supported
> 
> Where we disagreed was on my proposal to restrict vectors to only numeric
> data.  David's points were that
> 
> (1) He has a use case today for a data type with the other vector
> properties,
> (2) It doesn't seem reasonable to create two data types with the same
> properties, one of which is restricted to numerics, and
> (3) The restrictions that I want for numeric vectors make more sense at the
> index and function level, than at the type level.
> 
> I'm ready to concede that David has the better case here and move forward
> with a vector implementation without that restriction.
> 
> On Tue, May 2, 2023 at 4:03 PM David Capwell  wrote:
> 
> >  How about it, David? Did you already make this?
> >
> >
> > I checked out the patch, fixed serialize/deserialize, added the
> > constraints, then added a composeForFloat(ByteBuffer), with this the impact
> > to the POC patch was the following
> >
> > 1) move away from VectorType.instance.serializer().deserialize(bb) to
> > type.composeForFloat(bb), both return float[]
> > 2) change the index validate logic to move away from checking VectorType
> > and instead check for that plus the element type == FloatType.  I didn’t
> > bother to do this as it’s trivial
> >
> > David. End this argument. SHOW THE CODE!
> >
> >
> > If this argument ends and people are cool with vector supporting abstract
> > type, more than glad to help get this in.
> >
> > On May 2, 2023, at 1:53 PM, Jeremy Hanna 
> > wrote:
> >
> > I'm all for bringing more functionality to the masses sooner, but the
> > original idea has a very very specific use case.  Do we have use cases for
> > a general purpose Vector/Array data structure?  If so, awesome.  I just
> > wondered if generalizing provides value, beyond being straightforward to
> > implement.  I'm just trying to be sensitive to the database code
> > maintenance and driver support for general types versus a single type for a
> > specific, well defined purpose.
> >
> > If it could easily be a plugin, that's great - but the full picture
> > involves drivers that need to support it or you end up getting binary blobs
> > you have to decode client side and then do stuff with.  So ideally if you
> > have a well defined use case that you can build into the database, having
> > it just be part of the database and associated drivers - that makes the
> > experience much much better.
> >
> > I'm not trying to say B couldn't be valuable or that a plugin couldn't be
> > feasible.  I'm just trying to enlarge the picture a bit to see what that
> > means for this use case and for the supporting drivers/clients.
> >
> > On May 2, 2023, at 3:04 PM, Benedict  wrote:
> >
> > But it’s so trivial it was already implemented by David in the span of ten
> > minutes? If anything, we’re slowing progress down by refusing to do the
> > extra types, as we’re busy arguing about it rather than delivering a
> > feature?
> >
> > FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
> > support types beyond float. Not that we should start with float.
> >
> > So, this whole debate is a mess, I think. But hey ho.
> >
> > On 2 May 2023, at 20:57, Patrick McFadin  wrote:
> >
> > 
> > I'll speak up on that one. If you look at my ranked voting, that is where
> > my head is. I get accused of scope creep (a lot) and looking at the initial
> > proposal Jonathan put on the ML it was mostly "Developers are adopting
> > vector search at a furious pace and I think I have a simple way of adding
> > support to keep Cassandra relevant for these use cases" Instead of just
> > focusing on this use case, I feel the arguments have bike shedded into
> > scope creep which means it will take forever to get into the project.
> >
> > My preference is to see one thing validated with an MVP and get it into
> > the hands of developers sooner so we can continue to iterate based on
> > actual usage.
> >
> > It doesn't say your points are wrong or your opinions are broken, I'm
> > voting for what I think will be awesome for users sooner.
> >
> > Patrick
> >
> > On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:
> >
> >> Could folk voting against a general purpose type (that could well be
> >> called a vector) briefly explain their reasoning?
> >>
> >> We established in the other thread that it’s technically trivial, meaning
> >> folk must think it is strictly superior to only support float rather than
> >> eg all numeric types (note: for the type, not the ANN).
> >>
> >> I am surprised, and the blurbs accompanying votes so far don’t seem to
> >> touch on this, mostly just endorsing the idea of a vector.
> >>
> >>
> >> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
> >>
> >> 
> >> A > B > C 

Re: [POLL] Vector type for ML

2023-05-02 Thread Jonathan Ellis
I had a call with David.  We agreed that we want a "vector" data type with
these properties

- Fixed length
- No nulls
- Random access not supported

Where we disagreed was on my proposal to restrict vectors to only numeric
data.  David's points were that

(1) He has a use case today for a data type with the other vector
properties,
(2) It doesn't seem reasonable to create two data types with the same
properties, one of which is restricted to numerics, and
(3) The restrictions that I want for numeric vectors make more sense at the
index and function level, than at the type level.

I'm ready to concede that David has the better case here and move forward
with a vector implementation without that restriction.
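
To make the agreed properties concrete, here is a minimal Java sketch. It
is not the patch discussed in this thread: FixedVector, of, dimension and
dot are hypothetical names chosen for illustration. It assumes the length
and no-nulls rules are enforced once at construction, that elements are
only reachable by iteration (no get(i)), and that the numeric restriction
lives on the function (dot) rather than on the type.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.Iterator;
import java.util.List;
import java.util.Objects;

// Hypothetical illustration only; not Cassandra's VectorType.
final class FixedVector<T> implements Iterable<T> {
    private final List<T> elements;

    private FixedVector(List<T> elements) {
        this.elements = elements;
    }

    // Fixed length and no nulls are checked here, for any element type.
    static <T> FixedVector<T> of(int dimension, List<T> values) {
        if (values.size() != dimension)
            throw new IllegalArgumentException(
                "expected " + dimension + " elements, got " + values.size());
        List<T> copy = new ArrayList<>(dimension);
        for (T v : values)
            copy.add(Objects.requireNonNull(v, "vector elements must not be null"));
        return new FixedVector<>(Collections.unmodifiableList(copy));
    }

    int dimension() {
        return elements.size();
    }

    // Sequential access only; there is deliberately no get(int).
    @Override
    public Iterator<T> iterator() {
        return elements.iterator();
    }

    // The numeric restriction sits on the function, not on the type:
    // a FixedVector<String> is a valid vector but cannot be passed here.
    static double dot(FixedVector<? extends Number> a, FixedVector<? extends Number> b) {
        if (a.dimension() != b.dimension())
            throw new IllegalArgumentException("dimension mismatch");
        Iterator<? extends Number> ia = a.iterator();
        Iterator<? extends Number> ib = b.iterator();
        double sum = 0;
        while (ia.hasNext())
            sum += ia.next().doubleValue() * ib.next().doubleValue();
        return sum;
    }

    public static void main(String[] args) {
        FixedVector<Float> a = FixedVector.of(2, Arrays.asList(1.0f, 2.0f));
        FixedVector<Float> b = FixedVector.of(2, Arrays.asList(3.0f, 4.0f));
        System.out.println(dot(a, b));
        FixedVector<String> tags = FixedVector.of(2, Arrays.asList("x", "y"));
        System.out.println(tags.dimension());
    }
}
```

With this shape, a vector of text is still a legal value, while anything
that needs arithmetic rejects it at the function boundary.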

On Tue, May 2, 2023 at 4:03 PM David Capwell  wrote:

>  How about it, David? Did you already make this?
>
>
> I checked out the patch, fixed serialize/deserialize, added the
> constraints, then added a composeForFloat(ByteBuffer), with this the impact
> to the POC patch was the following
>
> 1) move away from VectorType.instance.serializer().deserialize(bb) to
> type.composeForFloat(bb), both return float[]
> 2) change the index validate logic to move away from checking VectorType
> and instead check for that plus the element type == FloatType.  I didn’t
> bother to do this as it’s trivial
>
> David. End this argument. SHOW THE CODE!
>
>
> If this argument ends and people are cool with vector supporting abstract
> type, more than glad to help get this in.
>
> On May 2, 2023, at 1:53 PM, Jeremy Hanna 
> wrote:
>
> I'm all for bringing more functionality to the masses sooner, but the
> original idea has a very very specific use case.  Do we have use cases for
> a general purpose Vector/Array data structure?  If so, awesome.  I just
> wondered if generalizing provides value, beyond being straightforward to
> implement.  I'm just trying to be sensitive to the database code
> maintenance and driver support for general types versus a single type for a
> specific, well defined purpose.
>
> If it could easily be a plugin, that's great - but the full picture
> involves drivers that need to support it or you end up getting binary blobs
> you have to decode client side and then do stuff with.  So ideally if you
> have a well defined use case that you can build into the database, having
> it just be part of the database and associated drivers - that makes the
> experience much much better.
>
> I'm not trying to say B couldn't be valuable or that a plugin couldn't be
> feasible.  I'm just trying to enlarge the picture a bit to see what that
> means for this use case and for the supporting drivers/clients.
>
> On May 2, 2023, at 3:04 PM, Benedict  wrote:
>
> But it’s so trivial it was already implemented by David in the span of ten
> minutes? If anything, we’re slowing progress down by refusing to do the
> extra types, as we’re busy arguing about it rather than delivering a
> feature?
>
> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
> support types beyond float. Not that we should start with float.
>
> So, this whole debate is a mess, I think. But hey ho.
>
> On 2 May 2023, at 20:57, Patrick McFadin  wrote:
>
> 
> I'll speak up on that one. If you look at my ranked voting, that is where
> my head is. I get accused of scope creep (a lot) and looking at the initial
> proposal Jonathan put on the ML it was mostly "Developers are adopting
> vector search at a furious pace and I think I have a simple way of adding
> support to keep Cassandra relevant for these use cases" Instead of just
> focusing on this use case, I feel the arguments have bike shedded into
> scope creep which means it will take forever to get into the project.
>
> My preference is to see one thing validated with an MVP and get it into
> the hands of developers sooner so we can continue to iterate based on
> actual usage.
>
> It doesn't say your points are wrong or your opinions are broken, I'm
> voting for what I think will be awesome for users sooner.
>
> Patrick
>
> On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:
>
>> Could folk voting against a general purpose type (that could well be
>> called a vector) briefly explain their reasoning?
>>
>> We established in the other thread that it’s technically trivial, meaning
>> folk must think it is strictly superior to only support float rather than
>> eg all numeric types (note: for the type, not the ANN).
>>
>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>> touch on this, mostly just endorsing the idea of a vector.
>>
>>
>> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
>>
>> 
>> A > B > C on both polls.
>>
>> Having talked to several users in the community that are highly excited
>> about this change, this gets to what developers want to do at Cassandra
>> scale: store embeddings and retrieve them.
>>
>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña 
>> wrote:
>>
>>> A > B > C
>>>
>>> I don't think that ML is such a niche application that it c

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Dinesh Joshi
We're reusing existing Cassandra code so the performance characteristics for 
parsing should be the same as Cassandra. I will need to check if we have 
benchmarks. If we do, we'll add it to the CEP wiki page.

On 2023/05/02 19:52:28 Sebastian Estevez wrote:
> Hey Dinesh,
> 
> Yeah it makes sense that the sstable streaming is network bound since it's
> mostly just moving files.
> 
> Do you have any performance stats on the sstable parsing side inside spark?
> 
> --Seb
> 
> On Tue, May 2, 2023 at 3:31 PM Dinesh Joshi  wrote:
> 
> > It is line rate / network bound. We have a patch out in vert.x that should
> > use the zero copy path for it. But it's not a strict prereq for it.


Re: [POLL] Vector type for ML

2023-05-02 Thread David Capwell
>  How about it, David? Did you already make this?

I checked out the patch, fixed serialize/deserialize, added the constraints, 
then added a composeForFloat(ByteBuffer), with this the impact to the POC patch 
was the following

1) move away from VectorType.instance.serializer().deserialize(bb) to 
type.composeForFloat(bb), both return float[]
2) change the index validate logic to move away from checking VectorType and 
instead check for that plus the element type == FloatType.  I didn’t bother to 
do this as it’s trivial
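
The two changes can be sketched roughly as follows. This is a standalone
illustration, not the patch itself: only the method name composeForFloat
and the idea of checking the element type come from the description above.
The classes VectorSketch, VectorType and ElementType, the supportsAnnIndex
helper, and the wire layout (big-endian float32 values packed back to
back) are assumptions made for the example.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical stand-ins; these are not Cassandra's real classes.
class VectorSketch {
    enum ElementType { FLOAT, INT, TEXT }

    static final class VectorType {
        final ElementType elementType;
        final int dimension;

        VectorType(ElementType elementType, int dimension) {
            if (dimension <= 0)
                throw new IllegalArgumentException("dimension must be positive");
            this.elementType = elementType;
            this.dimension = dimension;
        }

        // (1) Decode straight to float[]; assumes dimension * 4 bytes of
        // big-endian float32 values with no per-element length prefix.
        float[] composeForFloat(ByteBuffer bb) {
            if (elementType != ElementType.FLOAT)
                throw new IllegalStateException("element type is " + elementType);
            int expected = dimension * Float.BYTES;
            if (bb.remaining() != expected)
                throw new IllegalArgumentException(
                    "expected " + expected + " bytes, got " + bb.remaining());
            ByteBuffer view = bb.duplicate(); // leave the caller's position alone
            float[] out = new float[dimension];
            for (int i = 0; i < dimension; i++)
                out[i] = view.getFloat();
            return out;
        }
    }

    // (2) Index validation: a vector type AND a float element type.
    static boolean supportsAnnIndex(Object type) {
        return type instanceof VectorType
            && ((VectorType) type).elementType == ElementType.FLOAT;
    }

    public static void main(String[] args) {
        VectorType type = new VectorType(ElementType.FLOAT, 3);
        ByteBuffer bb = ByteBuffer.allocate(3 * Float.BYTES);
        bb.putFloat(0.1f).putFloat(0.2f).putFloat(0.3f);
        bb.flip();
        System.out.println(Arrays.toString(type.composeForFloat(bb)));
        System.out.println(supportsAnnIndex(type));
        System.out.println(supportsAnnIndex(new VectorType(ElementType.TEXT, 3)));
    }
}
```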

> David. End this argument. SHOW THE CODE! 

If this argument ends and people are cool with vector supporting abstract type, 
more than glad to help get this in.

> On May 2, 2023, at 1:53 PM, Jeremy Hanna  wrote:
> 
> I'm all for bringing more functionality to the masses sooner, but the 
> original idea has a very very specific use case.  Do we have use cases for a 
> general purpose Vector/Array data structure?  If so, awesome.  I just 
> wondered if generalizing provides value, beyond being straightforward to 
> implement.  I'm just trying to be sensitive to the database code maintenance 
> and driver support for general types versus a single type for a specific, 
> well defined purpose.
> 
> If it could easily be a plugin, that's great - but the full picture involves 
> drivers that need to support it or you end up getting binary blobs you have 
> to decode client side and then do stuff with.  So ideally if you have a well 
> defined use case that you can build into the database, having it just be part 
> of the database and associated drivers - that makes the experience much much 
> better.
> 
> I'm not trying to say B couldn't be valuable or that a plugin couldn't be 
> feasible.  I'm just trying to enlarge the picture a bit to see what that 
> means for this use case and for the supporting drivers/clients.
> 
>> On May 2, 2023, at 3:04 PM, Benedict  wrote:
>> 
>> But it’s so trivial it was already implemented by David in the span of ten 
>> minutes? If anything, we’re slowing progress down by refusing to do the 
>> extra types, as we’re busy arguing about it rather than delivering a feature?
>> 
>> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever) 
>> support types beyond float. Not that we should start with float.
>> 
>> So, this whole debate is a mess, I think. But hey ho.
>> 
>>> On 2 May 2023, at 20:57, Patrick McFadin  wrote:
>>> 
>>> 
>>> I'll speak up on that one. If you look at my ranked voting, that is where 
>>> my head is. I get accused of scope creep (a lot) and looking at the initial 
>>> proposal Jonathan put on the ML it was mostly "Developers are adopting 
>>> vector search at a furious pace and I think I have a simple way of adding 
>>> support to keep Cassandra relevant for these use cases" Instead of just 
>>> focusing on this use case, I feel the arguments have bike shedded into 
>>> scope creep which means it will take forever to get into the project.
>>> 
>>> My preference is to see one thing validated with an MVP and get it into the 
>>> hands of developers sooner so we can continue to iterate based on actual 
>>> usage. 
>>> 
>>> It doesn't say your points are wrong or your opinions are broken, I'm 
>>> voting for what I think will be awesome for users sooner. 
>>> 
>>> Patrick
>>> 
>>> On Tue, May 2, 2023 at 12:29 PM Benedict wrote:
>>>> Could folk voting against a general purpose type (that could well be
>>>> called a vector) briefly explain their reasoning?
>>>>
>>>> We established in the other thread that it’s technically trivial, meaning
>>>> folk must think it is strictly superior to only support float rather than
>>>> eg all numeric types (note: for the type, not the ANN).
>>>>
>>>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>>>> touch on this, mostly just endorsing the idea of a vector.
>>>>
>>>>> On 2 May 2023, at 20:20, Patrick McFadin wrote:
>>>>>
>>>>> A > B > C on both polls.
>>>>>
>>>>> Having talked to several users in the community that are highly excited
>>>>> about this change, this gets to what developers want to do at Cassandra
>>>>> scale: store embeddings and retrieve them.
>>>>>
>>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña wrote:
>>>>>> A > B > C
>>>>>>
>>>>>> I don't think that ML is such a niche application that it can't have its
>>>>>> own CQL data type. Also, vectors are mathematical elements that have
>>>>>> more applications than ML.
>>>>>>
>>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever wrote:
>>>>>>>
>>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis wrote:
>>>>>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>>>>>> machine learning use cases, specifically feature and embedding vectors
>>>>>>>> for training, inference, an

Re: [POLL] Vector type for ML

2023-05-02 Thread Jeremy Hanna
I'm all for bringing more functionality to the masses sooner, but the original 
idea has a very very specific use case.  Do we have use cases for a general 
purpose Vector/Array data structure?  If so, awesome.  I just wondered if 
generalizing provides value, beyond being straightforward to implement.  I'm 
just trying to be sensitive to the database code maintenance and driver support 
for general types versus a single type for a specific, well defined purpose.

If it could easily be a plugin, that's great - but the full picture involves 
drivers that need to support it or you end up getting binary blobs you have to 
decode client side and then do stuff with.  So ideally if you have a well 
defined use case that you can build into the database, having it just be part 
of the database and associated drivers - that makes the experience much much 
better.

I'm not trying to say B couldn't be valuable or that a plugin couldn't be 
feasible.  I'm just trying to enlarge the picture a bit to see what that means 
for this use case and for the supporting drivers/clients.

> On May 2, 2023, at 3:04 PM, Benedict  wrote:
> 
> But it’s so trivial it was already implemented by David in the span of ten 
> minutes? If anything, we’re slowing progress down by refusing to do the extra 
> types, as we’re busy arguing about it rather than delivering a feature?
> 
> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever) 
> support types beyond float. Not that we should start with float.
> 
> So, this whole debate is a mess, I think. But hey ho.
> 
>> On 2 May 2023, at 20:57, Patrick McFadin  wrote:
>> 
>> 
>> I'll speak up on that one. If you look at my ranked voting, that is where my 
>> head is. I get accused of scope creep (a lot) and looking at the initial 
>> proposal Jonathan put on the ML it was mostly "Developers are adopting 
>> vector search at a furious pace and I think I have a simple way of adding 
>> support to keep Cassandra relevant for these use cases" Instead of just 
>> focusing on this use case, I feel the arguments have bike shedded into scope 
>> creep which means it will take forever to get into the project.
>> 
>> My preference is to see one thing validated with an MVP and get it into the 
>> hands of developers sooner so we can continue to iterate based on actual 
>> usage. 
>> 
>> It doesn't say your points are wrong or your opinions are broken, I'm voting 
>> for what I think will be awesome for users sooner. 
>> 
>> Patrick
>> 
>> On Tue, May 2, 2023 at 12:29 PM Benedict wrote:
>>> Could folk voting against a general purpose type (that could well be called 
>>> a vector) briefly explain their reasoning?
>>> 
>>> We established in the other thread that it’s technically trivial, meaning 
>>> folk must think it is strictly superior to only support float rather than 
>>> eg all numeric types (note: for the type, not the ANN). 
>>> 
>>> I am surprised, and the blurbs accompanying votes so far don’t seem to 
>>> touch on this, mostly just endorsing the idea of a vector.
>>> 
>>> 
>>>> On 2 May 2023, at 20:20, Patrick McFadin wrote:
>>>>
>>>> A > B > C on both polls.
>>>>
>>>> Having talked to several users in the community that are highly excited
>>>> about this change, this gets to what developers want to do at Cassandra
>>>> scale: store embeddings and retrieve them.
>>>>
>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña wrote:
>>>>> A > B > C
>>>>>
>>>>> I don't think that ML is such a niche application that it can't have its
>>>>> own CQL data type. Also, vectors are mathematical elements that have more
>>>>> applications than ML.
>>>>>
>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever wrote:
>>>>>>
>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis wrote:
>>>>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>>>>> machine learning use cases, specifically feature and embedding vectors
>>>>>>> for training, inference, and vector search?
>>>>>>>
>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>>>> types, with no nulls allowed, and with no need for random access. The
>>>>>>> ML industry overwhelmingly uses float32 vectors, to the point that the
>>>>>>> industry-leading special-purpose vector database ONLY supports that
>>>>>>> data type.
>>>>>>>
>>>>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>>>>> thread at
>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>>
>>>>>>> Please rank the discussed options from most preferred option to least,
>>>>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or
>>>>>>> C > B = A (C is my preference, followed by B or A approximately equally.)
>>>>>>>
>>>>>>> (A) I am in favor of adding a vector type for float
Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
Yeah, it's a bit of a mess but mailing list yo. People reading this would
have no idea we are friends. ;) (Which we are, for anyone reading this
later!)

I must have missed the point of this already being done. How about it,
David? Did you already make this?

"FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
support types beyond float. Not that we should start with float"
That is not my interpretation and I can definitely see how that may be
frustrating. If B is pretty much done then we are good. My concern, as
noted earlier, is the scope creep component that will delay this happening
for much longer.

David. End this argument. SHOW THE CODE!

Patrick


On Tue, May 2, 2023 at 1:04 PM Benedict  wrote:

> But it’s so trivial it was already implemented by David in the span of ten
> minutes? If anything, we’re slowing progress down by refusing to do the
> extra types, as we’re busy arguing about it rather than delivering a
> feature?
>
> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
> support types beyond float. Not that we should start with float.
>
> So, this whole debate is a mess, I think. But hey ho.
>
> On 2 May 2023, at 20:57, Patrick McFadin  wrote:
>
> 
> I'll speak up on that one. If you look at my ranked voting, that is where
> my head is. I get accused of scope creep (a lot) and looking at the initial
> proposal Jonathan put on the ML it was mostly "Developers are adopting
> vector search at a furious pace and I think I have a simple way of adding
> support to keep Cassandra relevant for these use cases" Instead of just
> focusing on this use case, I feel the arguments have bike shedded into
> scope creep which means it will take forever to get into the project.
>
> My preference is to see one thing validated with an MVP and get it into
> the hands of developers sooner so we can continue to iterate based on
> actual usage.
>
> It doesn't say your points are wrong or your opinions are broken, I'm
> voting for what I think will be awesome for users sooner.
>
> Patrick
>
> On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:
>
>> Could folk voting against a general purpose type (that could well be
>> called a vector) briefly explain their reasoning?
>>
>> We established in the other thread that it’s technically trivial, meaning
>> folk must think it is strictly superior to only support float rather than
>> eg all numeric types (note: for the type, not the ANN).
>>
>> I am surprised, and the blurbs accompanying votes so far don’t seem to
>> touch on this, mostly just endorsing the idea of a vector.
>>
>>
>> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
>>
>> 
>> A > B > C on both polls.
>>
>> Having talked to several users in the community that are highly excited
>> about this change, this gets to what developers want to do at Cassandra
>> scale: store embeddings and retrieve them.
>>
>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña 
>> wrote:
>>
>>> A > B > C
>>>
>>> I don't think that ML is such a niche application that it can't have its
>>> own CQL data type. Also, vectors are mathematical elements that have more
>>> applications than ML.
>>>
>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:
>>>


>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis wrote:
>>>>
>>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>>> machine learning use cases, specifically feature and embedding vectors for
>>>>> training, inference, and vector search?
>>>>>
>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>> types, with no nulls allowed, and with no need for random access. The ML
>>>>> industry overwhelmingly uses float32 vectors, to the point that the
>>>>> industry-leading special-purpose vector database ONLY supports that data
>>>>> type.
>>>>>
>>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>>> thread at
>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>
>>>>> Please rank the discussed options from most preferred option to least,
>>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or
>>>>> C > B = A (C is my preference, followed by B or A approximately equally.)
>>>>>
>>>>> (A) I am in favor of adding a vector type for floats; I do not believe
>>>>> we need to tie it to any particular implementation details.
>>>>>
>>>>> (B) I am okay with adding a vector type but I believe we must add
>>>>> array types that compose with all Cassandra types first, and make vectors
>>>>> a special case of arrays-without-null-elements.
>>>>>
>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>
>>>> A > B > C
>>>>
>>>> B is stated as "must add array types…".  I think this is a bit loaded.
>>>> If B was the (A + the implementation needs to be a non-null frozen float32
>>>> array, serialisation forward compatible with other frozen arrays later
>>>> implemented) I would put this before (A).  Especia

Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
But it’s so trivial it was already implemented by David in the span of ten
minutes? If anything, we’re slowing progress down by refusing to do the extra
types, as we’re busy arguing about it rather than delivering a feature?

FWIW, my interpretation of the votes today is that we SHOULD NOT (ever)
support types beyond float. Not that we should start with float.

So, this whole debate is a mess, I think. But hey ho.

On 2 May 2023, at 20:57, Patrick McFadin wrote:

> I'll speak up on that one. If you look at my ranked voting, that is where
> my head is. I get accused of scope creep (a lot) and looking at the initial
> proposal Jonathan put on the ML it was mostly "Developers are adopting
> vector search at a furious pace and I think I have a simple way of adding
> support to keep Cassandra relevant for these use cases" Instead of just
> focusing on this use case, I feel the arguments have bike shedded into
> scope creep which means it will take forever to get into the project.
>
> My preference is to see one thing validated with an MVP and get it into the
> hands of developers sooner so we can continue to iterate based on actual
> usage.
>
> It doesn't say your points are wrong or your opinions are broken, I'm
> voting for what I think will be awesome for users sooner.
>
> Patrick
>
> On Tue, May 2, 2023 at 12:29 PM Benedict wrote:
>
>> Could folk voting against a general purpose type (that could well be
>> called a vector) briefly explain their reasoning?
>>
>> We established in the other thread that it’s technically trivial, meaning
>> folk must think it is strictly superior to only support float rather than
>> eg all numeric types (note: for the type, not the ANN). I am surprised,
>> and the blurbs accompanying votes so far don’t seem to touch on this,
>> mostly just endorsing the idea of a vector.
>>
>> On 2 May 2023, at 20:20, Patrick McFadin wrote:
>>
>>> A > B > C on both polls.
>>>
>>> Having talked to several users in the community that are highly excited
>>> about this change, this gets to what developers want to do at Cassandra
>>> scale: store embeddings and retrieve them.
>>>
>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña wrote:
>>>
>>>> A > B > C
>>>>
>>>> I don't think that ML is such a niche application that it can't have
>>>> its own CQL data type. Also, vectors are mathematical elements that
>>>> have more applications than ML.
>>>>
>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever wrote:
>>>>
>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis wrote:
>>>>>
>>>>>> Should we add a vector type to Cassandra designed to meet the needs
>>>>>> of machine learning use cases, specifically feature and embedding
>>>>>> vectors for training, inference, and vector search?
>>>>>>
>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>>>> types, with no nulls allowed, and with no need for random access.
>>>>>> The ML industry overwhelmingly uses float32 vectors, to the point
>>>>>> that the industry-leading special-purpose vector database ONLY
>>>>>> supports that data type.
>>>>>>
>>>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>>>> thread at
>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>
>>>>>> Please rank the discussed options from most preferred option to
>>>>>> least, e.g., A > B > C (A is my preference, followed by B, followed
>>>>>> by C) or C > B = A (C is my preference, followed by B or A
>>>>>> approximately equally.)
>>>>>>
>>>>>> (A) I am in favor of adding a vector type for floats; I do not
>>>>>> believe we need to tie it to any particular implementation details.
>>>>>>
>>>>>> (B) I am okay with adding a vector type but I believe we must add
>>>>>> array types that compose with all Cassandra types first, and make
>>>>>> vectors a special case of arrays-without-null-elements.
>>>>>>
>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>
>>>>> A  > B > C
>>>>>
>>>>> B is stated as "must add array types…".  I think this is a bit
>>>>> loaded.  If B was the (A + the implementation needs to be a non-null
>>>>> frozen float32 array, serialisation forward compatible with other
>>>>> frozen arrays later implemented) I would put this before (A).
>>>>> Especially because it's been shown already this is easy to implement.





Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
I'll speak up on that one. If you look at my ranked voting, that is where
my head is. I get accused of scope creep (a lot) and looking at the initial
proposal Jonathan put on the ML it was mostly "Developers are adopting
vector search at a furious pace and I think I have a simple way of adding
support to keep Cassandra relevant for these use cases" Instead of just
focusing on this use case, I feel the arguments have bike shedded into
scope creep which means it will take forever to get into the project.

My preference is to see one thing validated with an MVP and get it into the
hands of developers sooner so we can continue to iterate based on actual
usage.

It doesn't say your points are wrong or your opinions are broken, I'm
voting for what I think will be awesome for users sooner.

Patrick

On Tue, May 2, 2023 at 12:29 PM Benedict  wrote:

> Could folk voting against a general purpose type (that could well be
> called a vector) briefly explain their reasoning?
>
> We established in the other thread that it’s technically trivial, meaning
> folk must think it is strictly superior to only support float rather than
> eg all numeric types (note: for the type, not the ANN).
>
> I am surprised, and the blurbs accompanying votes so far don’t seem to
> touch on this, mostly just endorsing the idea of a vector.
>
>
> On 2 May 2023, at 20:20, Patrick McFadin  wrote:
>
> 
> A > B > C on both polls.
>
> Having talked to several users in the community that are highly excited
> about this change, this gets to what developers want to do at Cassandra
> scale: store embeddings and retrieve them.
>
> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña 
> wrote:
>
>> A > B > C
>>
>> I don't think that ML is such a niche application that it can't have its
>> own CQL data type. Also, vectors are mathematical elements that have more
>> applications than ML.
>>
>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:
>>
>>>
>>>
>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>>>
>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>> machine learning use cases, specifically feature and embedding vectors for
>>>> training, inference, and vector search?
>>>>
>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>> types, with no nulls allowed, and with no need for random access. The ML
>>>> industry overwhelmingly uses float32 vectors, to the point that the
>>>> industry-leading special-purpose vector database ONLY supports that data
>>>> type.
>>>>
>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>> thread at
>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>
>>>> Please rank the discussed options from most preferred option to least,
>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
>>>> = A (C is my preference, followed by B or A approximately equally.)
>>>>
>>>> (A) I am in favor of adding a vector type for floats; I do not believe
>>>> we need to tie it to any particular implementation details.
>>>>
>>>> (B) I am okay with adding a vector type but I believe we must add array
>>>> types that compose with all Cassandra types first, and make vectors a
>>>> special case of arrays-without-null-elements.
>>>>
>>>> (C) I am not in favor of adding a built-in vector type.
>>>>
>>>
>>>
>>>
>>> A  > B > C
>>>
>>> B is stated as "must add array types…".  I think this is a bit loaded.
>>> If B was the (A + the implementation needs to be a non-null frozen float32
>>> array, serialisation forward compatible with other frozen arrays later
>>> implemented) I would put this before (A).  Especially because it's been
>>> shown already this is easy to implement.
>>>
>>>
>>>
>>


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Sebastian Estevez
Hey Dinesh,

Yeah it makes sense that the sstable streaming is network bound since it's
mostly just moving files.

Do you have any performance stats on the sstable parsing side inside spark?

--Seb

On Tue, May 2, 2023 at 3:31 PM Dinesh Joshi  wrote:

> It is line rate / network bound. We have a patch out in vert.x that should
> use the zero copy path for it. But it's not a strict prereq for it.
>
> On 2023/05/02 15:39:02 Sebastian Estevez wrote:
> > Hi folks,
> >
> > Great stuff thanks for sharing.
> >
> > The performance numbers I've seen so far are for the sidecar streaming
> > sstables (seems like this is just network bound?). What kind of perf are
> > you seeing at the Spark executors (at the per task level)?
> >
> > --Seb
> >
> > On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi  wrote:
> >
> > > Does anybody have any questions that we could answer about this proposal?
> > >
> > > On Apr 27, 2023, at 1:24 PM, Francisco Guerrero <frank.guerr...@gmail.com>
> > > wrote:
> > >
> > > Hi folks,
> > >
> > > We have updated the confluence page with the source code for CEP-28.
> > > There are two repositories with contributions. One is the patch [1]
> > > for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> > > Spark Analytics library. The second is a new repository [2] with
> > > contributions to the Cassandra Spark Analytics code
> > >
> > > We also have a README markdown file that you can follow to give the
> > > code a try:
> > >
> > >
> > >
> > > https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
> > >
> > > Best,
> > > - Francisco
> > >
> > > [1] Apache Cassandra Sidecar bulk APIs source code:
> > > https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> > > [2] Apache Cassandra Spark Analytics source code:
> > > https://github.com/frankgh/cassandra-analytics
> > >
> > >
> > > On 2023/04/05 15:18:07 Doug Rohrer wrote:
> > > > Sorry for the delay in responding here - yes, we can add some diagrams
> > > > to the CEP - I’ll try to get that done by end-of-week.
> > > >
> > > > Thanks,
> > > >
> > > > Doug
> > > >
> > > > > On Mar 28, 2023, at 1:14 PM, J. D. Jordan wrote:
> > > > >
> > > > > Maybe some data flow diagrams could be added to the cep showing
> > > > > some example operations for read/write?
> > > > >
> > > > >> On Mar 28, 2023, at 11:35 AM, Yifan Cai wrote:
> > > > >>
> > > > >> A lot of great discussions!
> > > > >>
> > > > >> On the sidecar front, especially what the role sidecar plays in
> > > > >> terms of this CEP, I feel there might be some confusion. Once the
> > > > >> code is published, we should have clarity.
> > > > >> Sidecar does not read sstables nor do any coordination for
> > > > >> analytics queries. It is local to the companion Cassandra
> > > > >> instance. For bulk read, it takes snapshots and streams sstables
> > > > >> to spark workers to read. For bulk write, it imports the sstables
> > > > >> uploaded from spark workers. All commands are existing
> > > > >> jmx/nodetool functionalities from Cassandra. Sidecar adds the
> > > > >> http interface to them. It might be an over simplified
> > > > >> description. The complex computation is performed in spark
> > > > >> clusters only.
> > > > >>
> > > > >> In the long run, Cassandra might evolve into a database that does
> > > > >> both OLTP and OLAP. (Not what this thread aims for)
> > > > >> At the current stage, Spark is very suited for analytic purposes.
> > > > >>
> > > > >> On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org>
> > > > >> wrote:
> > > > >>> I disagree with the first claim, as the process has all the
> > > > >>> information it chooses to utilise about which resources it’s
> > > > >>> using and what it’s using those resources for.
> > > > >>>
> > > > >>> The inability to isolate GC domains is something we cannot
> > > > >>> address, but also probably not a problem if we were doing
> > > > >>> everything with memory management as well as we could be.
> > > > >>>
> > > > >>> But, not worth detailing this thread for. Today we do very
> > > > >>> little well on this front within the process, and a separate
> > > > >>> process is well justified given the state of play.
> > > > >>>
> > > > >>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker
> > > > >>>> <de...@chen-becker.org> wrote:
> > > > >>>>
> > > > >>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch
> > > > >>>> <joe.e.ly...@gmail.com> wrote:
> > > > >>>> ...
> > > > >>>>
> > > > >>>> > > I think we might be underselling how valuable JVM isolation is,
> > > > >>>> > > especially for analytics queries that are going to pass the entire
> > > > >>>> > > dataset through heap somewhat constantly.
> > > > >>>>
> > > > >>>> Big +1 here. The JVM simply does not have significant
> > > > >>>> granularity of control for resource utilization, but this is
> > > > >>>> explicitly a feature of separate processes. Add in being able
> > > > >>>> to separate GC domains and you can avoid a lot of noisy
> > > > >>>> neighbor in-VM behavior for the disparate workloads.
> > > > >>>>
> > > > >>>> Cheers,
> > > > >>>>
> > > > >>>> Derek
> > > > >>>>
> > > > >>>> --
> > > > >>>> +--

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Dinesh Joshi
It is line rate / network bound. We have a patch out in vert.x that should use 
the zero copy path for it. But it's not a strict prereq for it.

On 2023/05/02 15:39:02 Sebastian Estevez wrote:
> Hi folks,
> 
> Great stuff thanks for sharing.
> 
> The performance numbers I've seen so far are for the sidecar streaming
> sstables (seems like this is just network bound?). What kind of perf are
> you seeing at the Spark executors (at the per task level)?
> 
> --Seb
> 
> On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi  wrote:
> 
> > Does anybody have any questions that we could answer about this proposal?
> >
> > On Apr 27, 2023, at 1:24 PM, Francisco Guerrero 
> > wrote:
> >
> > Hi folks,
> >
> > We have updated the confluence page with the source code for CEP-28.
> > There are two repositories with contributions. One is the patch [1]
> > for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> > Spark Analytics library. The second is a new repository [2] with
> > contributions to the Cassandra Spark Analytics code
> >
> > We also have a README markdown file that you can follow to give the
> > code a try:
> >
> >
> > https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
> >
> > Best,
> > - Francisco
> >
> > [1] Apache Cassandra Sidecar bulk APIs source code:
> > https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> > [2] Apache Cassandra Spark Analytics source code:
> > https://github.com/frankgh/cassandra-analytics
> >
> >
> > On 2023/04/05 15:18:07 Doug Rohrer wrote:
> > > Sorry for the delay in responding here - yes, we can add some diagrams
> > > to the CEP - I’ll try to get that done by end-of-week.
> > >
> > > Thanks,
> > >
> > > Doug
> > >
> > > > On Mar 28, 2023, at 1:14 PM, J. D. Jordan wrote:
> > > >
> > > > Maybe some data flow diagrams could be added to the cep showing some
> > > > example operations for read/write?
> > > >
> > > >> On Mar 28, 2023, at 11:35 AM, Yifan Cai wrote:
> > > >>
> > > >> A lot of great discussions!
> > > >>
> > > >> On the sidecar front, especially what the role sidecar plays in
> > > >> terms of this CEP, I feel there might be some confusion. Once the
> > > >> code is published, we should have clarity.
> > > >> Sidecar does not read sstables nor do any coordination for analytics
> > > >> queries. It is local to the companion Cassandra instance. For bulk
> > > >> read, it takes snapshots and streams sstables to spark workers to
> > > >> read. For bulk write, it imports the sstables uploaded from spark
> > > >> workers. All commands are existing jmx/nodetool functionalities from
> > > >> Cassandra. Sidecar adds the http interface to them. It might be an
> > > >> over simplified description. The complex computation is performed in
> > > >> spark clusters only.
> > > >>
> > > >> In the long run, Cassandra might evolve into a database that does
> > > >> both OLTP and OLAP. (Not what this thread aims for)
> > > >> At the current stage, Spark is very suited for analytic purposes.
> > > >>
> > > >> On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org> wrote:
> > > >>> I disagree with the first claim, as the process has all the
> > > >>> information it chooses to utilise about which resources it’s using
> > > >>> and what it’s using those resources for.
> > > >>>
> > > >>> The inability to isolate GC domains is something we cannot address,
> > > >>> but also probably not a problem if we were doing everything with
> > > >>> memory management as well as we could be.
> > > >>>
> > > >>> But, not worth detailing this thread for. Today we do very little
> > > >>> well on this front within the process, and a separate process is
> > > >>> well justified given the state of play.
> > > >>>
> > > >>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker
> > > >>>> <de...@chen-becker.org> wrote:
> > > >>>>
> > > >>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch
> > > >>>> <joe.e.ly...@gmail.com> wrote:
> > > >>>> ...
> > > >>>>
> > > >>>> > > I think we might be underselling how valuable JVM isolation is,
> > > >>>> > > especially for analytics queries that are going to pass the entire
> > > >>>> > > dataset through heap somewhat constantly.
> > > >>>>
> > > >>>> Big +1 here. The JVM simply does not have significant granularity
> > > >>>> of control for resource utilization, but this is explicitly a
> > > >>>> feature of separate processes. Add in being able to separate GC
> > > >>>> domains and you can avoid a lot of noisy neighbor in-VM behavior
> > > >>>> for the disparate workloads.
> > > >>>>
> > > >>>> Cheers,
> > > >>>>
> > > >>>> Derek
> > > >>>>
> > > >>>> --
> > > >>>> +---+
> > > >>>> | Derek Chen-Becker |
> > > >>>> | GPG Key available at https://keybase.io/dchenbecker and |
> > > >>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> > > >>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC |
> > > >>>> +---+
> > > >>>>
> > >
> > --
> > Francisco Guerrero
> >
> >
> >
> 
> -- 
> All the best,
> 
> Sebastián
> 


Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
Could folk voting against a general purpose type (that could well be called a
vector) briefly explain their reasoning?

We established in the other thread that it’s technically trivial, meaning
folk must think it is strictly superior to only support float rather than eg
all numeric types (note: for the type, not the ANN). I am surprised, and the
blurbs accompanying votes so far don’t seem to touch on this, mostly just
endorsing the idea of a vector.

On 2 May 2023, at 20:20, Patrick McFadin wrote:

> A > B > C on both polls.
>
> Having talked to several users in the community that are highly excited
> about this change, this gets to what developers want to do at Cassandra
> scale: store embeddings and retrieve them.
>
> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña wrote:
>
>> A > B > C
>>
>> I don't think that ML is such a niche application that it can't have its
>> own CQL data type. Also, vectors are mathematical elements that have more
>> applications than ML.
>>
>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever wrote:
>>
>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis wrote:
>>>
>>>> Should we add a vector type to Cassandra designed to meet the needs of
>>>> machine learning use cases, specifically feature and embedding vectors
>>>> for training, inference, and vector search?
>>>>
>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>>> types, with no nulls allowed, and with no need for random access. The
>>>> ML industry overwhelmingly uses float32 vectors, to the point that the
>>>> industry-leading special-purpose vector database ONLY supports that
>>>> data type.
>>>>
>>>> This poll is to gauge consensus subsequent to the recent discussion
>>>> thread at
>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>
>>>> Please rank the discussed options from most preferred option to least,
>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or
>>>> C > B = A (C is my preference, followed by B or A approximately
>>>> equally.)
>>>>
>>>> (A) I am in favor of adding a vector type for floats; I do not believe
>>>> we need to tie it to any particular implementation details.
>>>>
>>>> (B) I am okay with adding a vector type but I believe we must add array
>>>> types that compose with all Cassandra types first, and make vectors a
>>>> special case of arrays-without-null-elements.
>>>>
>>>> (C) I am not in favor of adding a built-in vector type.
>>>
>>> A  > B > C
>>>
>>> B is stated as "must add array types…".  I think this is a bit loaded.
>>> If B was the (A + the implementation needs to be a non-null frozen
>>> float32 array, serialisation forward compatible with other frozen arrays
>>> later implemented) I would put this before (A).  Especially because it's
>>> been shown already this is easy to implement.




Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
A > B > C on both polls.

Having talked to several users in the community that are highly excited
about this change, this gets to what developers want to do at Cassandra
scale: store embeddings and retrieve them.

On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña 
wrote:

> A > B > C
>
> I don't think that ML is such a niche application that it can't have its
> own CQL data type. Also, vectors are mathematical elements that have more
> applications than ML.
>
> On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:
>
>>
>>
>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>>
>>> Should we add a vector type to Cassandra designed to meet the needs of
>>> machine learning use cases, specifically feature and embedding vectors for
>>> training, inference, and vector search?
>>>
>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric
>>> types, with no nulls allowed, and with no need for random access. The ML
>>> industry overwhelmingly uses float32 vectors, to the point that the
>>> industry-leading special-purpose vector database ONLY supports that data
>>> type.
>>>
>>> This poll is to gauge consensus subsequent to the recent discussion
>>> thread at
>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>
>>> Please rank the discussed options from most preferred option to least,
>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
>>> = A (C is my preference, followed by B or A approximately equally.)
>>>
>>> (A) I am in favor of adding a vector type for floats; I do not believe
>>> we need to tie it to any particular implementation details.
>>>
>>> (B) I am okay with adding a vector type but I believe we must add array
>>> types that compose with all Cassandra types first, and make vectors a
>>> special case of arrays-without-null-elements.
>>>
>>> (C) I am not in favor of adding a built-in vector type.
>>>
>>
>>
>>
>> A  > B > C
>>
>> B is stated as "must add array types…".  I think this is a bit loaded.
>> If B was the (A + the implementation needs to be a non-null frozen float32
>> array, serialisation forward compatible with other frozen arrays later
>> implemented) I would put this before (A).  Especially because it's been
>> shown already this is easy to implement.
>>
>>
>>
>


Re: [POLL] Vector type for ML

2023-05-02 Thread Andrés de la Peña
A > B > C

I don't think that ML is such a niche application that it can't have its
own CQL data type. Also, vectors are mathematical elements that have more
applications than ML.

On Tue, 2 May 2023 at 19:15, Mick Semb Wever  wrote:

>
>
> On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:
>
>> Should we add a vector type to Cassandra designed to meet the needs of
>> machine learning use cases, specifically feature and embedding vectors for
>> training, inference, and vector search?
>>
>> ML vectors are fixed-dimension (fixed-length) sequences of numeric types,
>> with no nulls allowed, and with no need for random access. The ML industry
>> overwhelmingly uses float32 vectors, to the point that the industry-leading
>> special-purpose vector database ONLY supports that data type.
>>
>> This poll is to gauge consensus subsequent to the recent discussion
>> thread at
>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>
>> Please rank the discussed options from most preferred option to least,
>> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
>> = A (C is my preference, followed by B or A approximately equally.)
>>
>> (A) I am in favor of adding a vector type for floats; I do not believe we
>> need to tie it to any particular implementation details.
>>
>> (B) I am okay with adding a vector type but I believe we must add array
>> types that compose with all Cassandra types first, and make vectors a
>> special case of arrays-without-null-elements.
>>
>> (C) I am not in favor of adding a built-in vector type.
>>
>
>
>
> A  > B > C
>
> B is stated as "must add array types…".  I think this is a bit loaded.  If
> B was the (A + the implementation needs to be a non-null frozen float32
> array, serialisation forward compatible with other frozen arrays later
> implemented) I would put this before (A).  Especially because it's been
> shown already this is easy to implement.
>
>
>


Re: [POLL] Vector type for ML

2023-05-02 Thread Mick Semb Wever
On Tue, 2 May 2023 at 17:14, Jonathan Ellis  wrote:

> Should we add a vector type to Cassandra designed to meet the needs of
> machine learning use cases, specifically feature and embedding vectors for
> training, inference, and vector search?
>
> ML vectors are fixed-dimension (fixed-length) sequences of numeric types,
> with no nulls allowed, and with no need for random access. The ML industry
> overwhelmingly uses float32 vectors, to the point that the industry-leading
> special-purpose vector database ONLY supports that data type.
>
> This poll is to gauge consensus subsequent to the recent discussion thread
> at https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>
> Please rank the discussed options from most preferred option to least,
> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
> = A (C is my preference, followed by B or A approximately equally.)
>
> (A) I am in favor of adding a vector type for floats; I do not believe we
> need to tie it to any particular implementation details.
>
> (B) I am okay with adding a vector type but I believe we must add array
> types that compose with all Cassandra types first, and make vectors a
> special case of arrays-without-null-elements.
>
> (C) I am not in favor of adding a built-in vector type.
>



A  > B > C

B is stated as "must add array types…".  I think this is a bit loaded.  If
B was the (A + the implementation needs to be a non-null frozen float32
array, serialisation forward compatible with other frozen arrays later
implemented) I would put this before (A).  Especially because it's been
shown already this is easy to implement.
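
For readers following the thread, the "non-null frozen float32 array" idea
above can be pictured with a small sketch. This is an illustration under
stated assumptions, not the patch mentioned in the thread: the class name
FloatVectorCodec and its methods are hypothetical, and the byte layout
(dimension x 4 bytes, big-endian, with no per-element length or null marker)
is an assumption made for the example, not Cassandra's actual wire format.

```java
import java.nio.ByteBuffer;

/** Hypothetical codec for a fixed-length, non-null float32 vector (illustration only). */
final class FloatVectorCodec {
    private final int dimension;

    FloatVectorCodec(int dimension) {
        if (dimension <= 0)
            throw new IllegalArgumentException("dimension must be positive");
        this.dimension = dimension;
    }

    /** Packs the elements back to back: 4 bytes each, no length prefix, no null marker. */
    ByteBuffer serialize(float[] values) {
        // Fixed length: reject anything that is not exactly `dimension` elements.
        // A primitive float[] cannot hold null elements, so "no nulls" holds by construction.
        if (values == null || values.length != dimension)
            throw new IllegalArgumentException("expected exactly " + dimension + " elements");
        ByteBuffer out = ByteBuffer.allocate(dimension * Float.BYTES);
        for (float v : values)
            out.putFloat(v);
        out.flip(); // make the written bytes readable from position 0
        return out;
    }

    /** Reads the whole vector in one pass without moving the buffer's position. */
    float[] deserialize(ByteBuffer in) {
        if (in.remaining() != dimension * Float.BYTES)
            throw new IllegalArgumentException("expected " + (dimension * Float.BYTES) + " bytes");
        float[] values = new float[dimension];
        for (int i = 0; i < dimension; i++)
            values[i] = in.getFloat(in.position() + i * Float.BYTES); // absolute read
        return values;
    }
}
```

Because every element has the same width and none can be absent, the
serialized size depends only on the dimension, which is what would let a
layout like this stay compatible with a more general frozen array type added
later. A caller would build one codec per column dimension, for example
new FloatVectorCodec(3).serialize(new float[] {1.0f, 2.5f, -3.0f}).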


Re: [POLL] Vector type for ML

2023-05-02 Thread David Capwell
> B) Should we introduce a type that is general purpose, and supports all 
> Cassandra types, so that this may be used to support ML (and perhaps other) 
> workloads

I vote B only as well...

> On May 2, 2023, at 9:02 AM, Benedict  wrote:
> 
> This is not the poll I thought we would be conducting, and I don’t really 
> support its framing. There are two parallel questions: what the functionality 
> should be and how they should be exposed. This poll compresses the 
> optionality poorly.
> 
> Whether or not we support a “vector” concept (or something isomorphic with 
> it), the first question this poll wants to answer is:
> 
> A) Should we introduce a new CQL collection type that is unique to ML and 
> *only* supports float32
> B) Should we introduce a type that is general purpose, and supports all 
> Cassandra types, so that this may be used to support ML (and perhaps other) 
> workloads
> C) Should we not introduce new types to CQL at all
> 
> For this question, I vote B only.
> 
> Once this question is answered it makes sense to answer how it will be 
> exposed semantically/syntactically. 
> 
> 
>> On 2 May 2023, at 16:43, Jonathan Ellis  wrote:
>> 
>> 
>> My preference: A > B > C.  Vectors are distinct enough from arrays that we 
>> should not make adding the latter a prerequisite for adding the former.
>> 
>> On Tue, May 2, 2023 at 10:13 AM Jonathan Ellis wrote:
>>> Should we add a vector type to Cassandra designed to meet the needs of 
>>> machine learning use cases, specifically feature and embedding vectors for 
>>> training, inference, and vector search?  
>>> 
>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric types, 
>>> with no nulls allowed, and with no need for random access. The ML industry 
>>> overwhelmingly uses float32 vectors, to the point that the industry-leading 
>>> special-purpose vector database ONLY supports that data type.
>>> 
>>> This poll is to gauge consensus subsequent to the recent discussion thread 
>>> at https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>> 
>>> Please rank the discussed options from most preferred option to least, 
>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B 
>>> = A (C is my preference, followed by B or A approximately equally.)
>>> 
>>> (A) I am in favor of adding a vector type for floats; I do not believe we 
>>> need to tie it to any particular implementation details.
>>> 
>>> (B) I am okay with adding a vector type but I believe we must add array 
>>> types that compose with all Cassandra types first, and make vectors a 
>>> special case of arrays-without-null-elements.
>>> 
>>> (C) I am not in favor of adding a built-in vector type.
>>> 
>>> -- 
>>> Jonathan Ellis
>>> co-founder, http://www.datastax.com 
>>> @spyced
>> 
>> 
>> -- 
>> Jonathan Ellis
>> co-founder, http://www.datastax.com 
>> @spyced



Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
This is not the poll I thought we would be conducting, and I don’t really
support its framing. There are two parallel questions: what the functionality
should be and how they should be exposed. This poll compresses the
optionality poorly.

Whether or not we support a “vector” concept (or something isomorphic with
it), the first question this poll wants to answer is:

A) Should we introduce a new CQL collection type that is unique to ML and
*only* supports float32
B) Should we introduce a type that is general purpose, and supports all
Cassandra types, so that this may be used to support ML (and perhaps other)
workloads
C) Should we not introduce new types to CQL at all

For this question, I vote B only.

Once this question is answered it makes sense to answer how it will be
exposed semantically/syntactically.

On 2 May 2023, at 16:43, Jonathan Ellis wrote:

> My preference: A > B > C.  Vectors are distinct enough from arrays that we
> should not make adding the latter a prerequisite for adding the former.
>
> On Tue, May 2, 2023 at 10:13 AM Jonathan Ellis wrote:
>
>> Should we add a vector type to Cassandra designed to meet the needs of
>> machine learning use cases, specifically feature and embedding vectors for
>> training, inference, and vector search?
>>
>> ML vectors are fixed-dimension (fixed-length) sequences of numeric types,
>> with no nulls allowed, and with no need for random access. The ML industry
>> overwhelmingly uses float32 vectors, to the point that the
>> industry-leading special-purpose vector database ONLY supports that data
>> type.
>>
>> This poll is to gauge consensus subsequent to the recent discussion
>> thread at
>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>
>> Please rank the discussed options from most preferred option to least,
>> e.g., A > B > C (A is my preference, followed by B, followed by C) or
>> C > B = A (C is my preference, followed by B or A approximately equally.)
>>
>> (A) I am in favor of adding a vector type for floats; I do not believe we
>> need to tie it to any particular implementation details.
>>
>> (B) I am okay with adding a vector type but I believe we must add array
>> types that compose with all Cassandra types first, and make vectors a
>> special case of arrays-without-null-elements.
>>
>> (C) I am not in favor of adding a built-in vector type.
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced


Re: [POLL] Vector type for ML

2023-05-02 Thread Jonathan Ellis
My preference: A > B > C.  Vectors are distinct enough from arrays that we
should not make adding the latter a prerequisite for adding the former.

On Tue, May 2, 2023 at 10:13 AM Jonathan Ellis  wrote:

> Should we add a vector type to Cassandra designed to meet the needs of
> machine learning use cases, specifically feature and embedding vectors for
> training, inference, and vector search?
>
> ML vectors are fixed-dimension (fixed-length) sequences of numeric types,
> with no nulls allowed, and with no need for random access. The ML industry
> overwhelmingly uses float32 vectors, to the point that the industry-leading
> special-purpose vector database ONLY supports that data type.
>
> This poll is to gauge consensus subsequent to the recent discussion thread
> at https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>
> Please rank the discussed options from most preferred option to least,
> e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
> = A (C is my preference, followed by B or A approximately equally.)
>
> (A) I am in favor of adding a vector type for floats; I do not believe we
> need to tie it to any particular implementation details.
>
> (B) I am okay with adding a vector type but I believe we must add array
> types that compose with all Cassandra types first, and make vectors a
> special case of arrays-without-null-elements.
>
> (C) I am not in favor of adding a built-in vector type.
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>


-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced


Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Sebastian Estevez
Hi folks,

Great stuff thanks for sharing.

The performance numbers I've seen so far are for the sidecar streaming
sstables (seems like this is just network bound?). What kind of perf are
you seeing at the Spark executors (at the per task level)?

--Seb

On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi  wrote:

> Does anybody have any questions that we could answer about this proposal?
>
> On Apr 27, 2023, at 1:24 PM, Francisco Guerrero 
> wrote:
>
> Hi folks,
>
> We have updated the confluence page with the source code for CEP-28.
> There are two repositories with contributions. One is the patch [1]
> for Cassandra Sidecar with the bulk APIs that enable the Cassandra
> Spark Analytics library. The second is a new repository [2] with
> contributions to the Cassandra Spark Analytics code.
>
> We also have a README markdown file that you can follow to give the
> code a try:
>
>
> https://github.com/frankgh/cassandra-analytics/blob/trunk/cassandra-analytics-core-example/README.md
>
> Best,
> - Francisco
>
> [1] Apache Cassandra Sidecar bulk APIs source code:
> https://github.com/frankgh/cassandra-sidecar/tree/CEP-28-bulk-apis
> [2] Apache Cassandra Spark Analytics source code:
> https://github.com/frankgh/cassandra-analytics
>
>
> On 2023/04/05 15:18:07 Doug Rohrer wrote:
> > Sorry for the delay in responding here - yes, we can add some diagrams
> > to the CEP - I’ll try to get that done by end-of-week.
> >
> > Thanks,
> >
> > Doug
> >
> > > On Mar 28, 2023, at 1:14 PM, J. D. Jordan wrote:
> > >
> > > Maybe some data flow diagrams could be added to the cep showing some
> > > example operations for read/write?
> > >
> > >> On Mar 28, 2023, at 11:35 AM, Yifan Cai wrote:
> > >>
> > >> A lot of great discussions!
> > >>
> > >> On the sidecar front, especially what the role sidecar plays in
> > >> terms of this CEP, I feel there might be some confusion. Once the
> > >> code is published, we should have clarity.
> > >> Sidecar does not read sstables nor do any coordination for analytics
> > >> queries. It is local to the companion Cassandra instance. For bulk
> > >> read, it takes snapshots and streams sstables to spark workers to
> > >> read. For bulk write, it imports the sstables uploaded from spark
> > >> workers. All commands are existing jmx/nodetool functionalities from
> > >> Cassandra. Sidecar adds the http interface to them. It might be an
> > >> over simplified description. The complex computation is performed in
> > >> spark clusters only.
> > >>
> > >> In the long run, Cassandra might evolve into a database that does
> > >> both OLTP and OLAP. (Not what this thread aims for)
> > >> At the current stage, Spark is very suited for analytic purposes.
> > >>
> > >> On Tue, Mar 28, 2023 at 9:06 AM Benedict <bened...@apache.org> wrote:
> > >>> I disagree with the first claim, as the process has all the
> > >>> information it chooses to utilise about which resources it’s using
> > >>> and what it’s using those resources for.
> > >>>
> > >>> The inability to isolate GC domains is something we cannot address,
> > >>> but also probably not a problem if we were doing everything with
> > >>> memory management as well as we could be.
> > >>>
> > >>> But, not worth detailing this thread for. Today we do very little
> > >>> well on this front within the process, and a separate process is
> > >>> well justified given the state of play.
> > >>>
> > >>>> On 28 Mar 2023, at 16:38, Derek Chen-Becker <de...@chen-becker.org> wrote:
> > >>>>
> > >>>> On Tue, Mar 28, 2023 at 9:03 AM Joseph Lynch <joe.e.ly...@gmail.com> wrote:
> > >>>> ...
> > >>>> > > I think we might be underselling how valuable JVM isolation is,
> > >>>> > > especially for analytics queries that are going to pass the entire
> > >>>> > > dataset through heap somewhat constantly.
> > >>>>
> > >>>> Big +1 here. The JVM simply does not have significant granularity
> > >>>> of control for resource utilization, but this is explicitly a
> > >>>> feature of separate processes. Add in being able to separate GC
> > >>>> domains and you can avoid a lot of noisy neighbor in-VM behavior
> > >>>> for the disparate workloads.
> > >>>>
> > >>>> Cheers,
> > >>>>
> > >>>> Derek
> > >>>>
> > >>>> --
> > >>>> +---+
> > >>>> | Derek Chen-Becker |
> > >>>> | GPG Key available at https://keybase.io/dchenbecker and |
> > >>>> | https://pgp.mit.edu/pks/lookup?search=derek%40chen-becker.org |
> > >>>> | Fngrprnt: EB8A 6480 F0A3 C8EB C1E7 7F42 AFC5 AFEE 96E4 6ACC |
> > >>>> +---+
>
> --
> Francisco Guerrero
>
>
>

-- 
All the best,

Sebastián


[POLL] Vector type for ML

2023-05-02 Thread Jonathan Ellis
Should we add a vector type to Cassandra designed to meet the needs of
machine learning use cases, specifically feature and embedding vectors for
training, inference, and vector search?

ML vectors are fixed-dimension (fixed-length) sequences of numeric types,
with no nulls allowed, and with no need for random access. The ML industry
overwhelmingly uses float32 vectors, to the point that the industry-leading
special-purpose vector database ONLY supports that data type.

This poll is to gauge consensus subsequent to the recent discussion thread
at https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.

Please rank the discussed options from most preferred option to least,
e.g., A > B > C (A is my preference, followed by B, followed by C) or C > B
= A (C is my preference, followed by B or A approximately equally.)

(A) I am in favor of adding a vector type for floats; I do not believe we
need to tie it to any particular implementation details.

(B) I am okay with adding a vector type but I believe we must add array
types that compose with all Cassandra types first, and make vectors a
special case of arrays-without-null-elements.

(C) I am not in favor of adding a built-in vector type.

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
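
The properties the poll lists (fixed dimension, no nulls, no random access, float32 elements) can be illustrated with a small value class. This is only a sketch under stated assumptions: `FloatVectorSketch` and its methods are hypothetical names invented for illustration and are not taken from any Cassandra patch.

```java
import java.util.Arrays;

// Hypothetical illustration only; not the VectorType from any patch.
final class FloatVectorSketch {
    // Primitive float32 elements cannot be null, so "no nulls" holds by construction.
    private final float[] values;

    FloatVectorSketch(int dimension, float[] values) {
        // Fixed dimension: reject anything that is not exactly `dimension` long.
        if (values == null || values.length != dimension)
            throw new IllegalArgumentException("expected exactly " + dimension + " elements");
        this.values = Arrays.copyOf(values, dimension);
    }

    int dimension() {
        return values.length;
    }

    // Whole-value read only; there is deliberately no get(i) or set(i, v),
    // mirroring "no need for random access".
    float[] toArray() {
        return Arrays.copyOf(values, values.length);
    }

    public static void main(String[] args) {
        FloatVectorSketch v = new FloatVectorSketch(3, new float[] {0.1f, 0.2f, 0.3f});
        System.out.println("dimension = " + v.dimension());
        try {
            new FloatVectorSketch(3, new float[] {0.1f, 0.2f});
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

The vector is always written and read as one unit, which is what would let a type like this skip the per-element bookkeeping a general collection needs.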


Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Benedict
If we agree we’re delivering some general purpose array type, that supports
all types as elements (ie, is logically equivalent to a frozen list of fixed
length, however it is actually implemented), I think we are in technical
agreement and it’s just a matter of presentation.

At which point I think we should simply collect the possible syntax options
and put them to a poll. I’m not keen on vector for previously stated
reasons, but it’s probably not worth litigating further and we should let
the silent majority adjudicate.

On 2 May 2023, at 12:43, Jonathan Ellis wrote:

> To make sure I understand correctly -- are you saying that you're fine with
> a vector type, but you want to see it implemented as a special case of
> arrays, or that you are not fine with a vector type because you would
> prefer to only add arrays and that should be "good enough" for ML?
>
> On Mon, May 1, 2023 at 4:27 PM Benedict wrote:
>
> A data type plug-in is actually really easy today, I think? But,
> developing further hooks should probably be thought through as they’re
> necessary.
>
> I think in this case it would be simpler to deliver a general purpose
> type, which is why I’m trying to propose types that would be acceptable.
>
> I also think we’re pretty close to agreement, really?
>
> But if not, let’s flesh out potential plug-in requirements.
>
> On 1 May 2023, at 21:58, Josh McKenzie wrote:
>
> If we want to make an ML-specific data type, it should be in an ML plug-in.
>
> How can we encourage a healthier plug-in ecosystem? As far as I know it's
> been pretty anemic historically:
>
> cassandra:
> https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
> postgres: https://www.postgresql.org/docs/current/contrib.html
>
> I'm really interested to hear if there's more in the ecosystem I'm not
> aware of or if there's been strides made in this regard; users in the
> ecosystem being able to write durable extensions to Cassandra that they can
> then distribute and gain momentum could potentially be a great incubator
> for new features or functionality in the ecosystem.
>
> If our support for extensions remains as bare as I believe it to be, I
> wouldn't recommend anyone go that route.
>
> On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:
>
> I have explained repeatedly why I am opposed to ML-specific data types. If
> we want to make an ML-specific data type, it should be in an ML plug-in. We
> should not pollute the general purpose language with hastily-considered
> features that target specific bandwagons - at best partially - no matter
> how exciting the bandwagon.
>
> I think a simple and easy case can be made for fixed length array types
> that do not seem to create random bits of cruft in the language that dangle
> by themselves should this play not pan out. This is an easy way for this
> effort to make progress without negatively impacting the language.
>
> That is, unless we want to start supporting totally random types for every
> use case at the top level language layer. I don’t think this is a good
> idea, personally, and I’m quite confident we would now be regretting this
> approach had it been taken for earlier bandwagons.
>
> Nor do I think anyone’s priors about how successful this effort will be
> should matter. As a matter of principle, we should simply never deliver a
> specialist functionality as a high level CQL language feature without at
> least baking it for several years as a plug-in.
>
> On 1 May 2023, at 21:03, Mick Semb Wever wrote:
>
> Yes!  What you (David) and Benedict write beautifully supports `VECTOR
> FLOAT[n]` imho.
>
> You are definitely bringing up valid implementation details, and that can
> be dealt with during patch review. This thread is about the CQL API
> addition.
>
> No matter which way the technical review goes with the implementation
> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
> idiomatic approach and the best long-term CQL API.  It's a win-win
> situation – no matter how you look at it imho it is the best solution api
> wise.
>
> Unless the suggestion is that an ideal implementation can give us a better
> CQL API – but I don't see what that could be.   Maybe the suggestion is we
> deny the possibility of using the VECTOR keyword and bring us back to
> something like `NON-NULL FROZEN`.   This is odd to me because
> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
> patch's audience and their idioms.  I have no problems with introducing
> such an alias to meet the ML crowd.
>
> Another way I think of this is
>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>  `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the
> general-use plumbing cql apis.
>
> This would allow implementation details to be moved out of this thread and
> to the review phase.
>
> On Mon, 1 May 2023 at 20:57, David Capwell wrote:
>
> > I think it is totally reasonable that the ANN patch (and Jonathan) is
> not asked to implement on top of, or towards, other array (or other) new
> data types.
>
> This impacts serialization, if you do not think about this day 1 you then
> can’t add later on

Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Jonathan Ellis
To make sure I understand correctly -- are you saying that you're fine with
a vector type, but you want to see it implemented as a special case of
arrays, or that you are not fine with a vector type because you would
prefer to only add arrays and that should be "good enough" for ML?

On Mon, May 1, 2023 at 4:27 PM Benedict  wrote:

> A data type plug-in is actually really easy today, I think? But,
> developing further hooks should probably be thought through as they’re
> necessary.
>
> I think in this case it would be simpler to deliver a general purpose
> type, which is why I’m trying to propose types that would be acceptable.
>
> I also think we’re pretty close to agreement, really?
>
> But if not, let’s flesh out potential plug-in requirements.
>
>
> On 1 May 2023, at 21:58, Josh McKenzie  wrote:
>
> 
>
> If we want to make an ML-specific data type, it should be in an ML plug-in.
>
> How can we encourage a healthier plug-in ecosystem? As far as I know it's
> been pretty anemic historically:
>
> cassandra:
> https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
> postgres: https://www.postgresql.org/docs/current/contrib.html
>
> I'm really interested to hear if there's more in the ecosystem I'm not
> aware of or if there's been strides made in this regard; users in the
> ecosystem being able to write durable extensions to Cassandra that they can
> then distribute and gain momentum could potentially be a great incubator
> for new features or functionality in the ecosystem.
>
> If our support for extensions remains as bare as I believe it to be, I
> wouldn't recommend anyone go that route.
>
> On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:
>
>
> I have explained repeatedly why I am opposed to ML-specific data types. If
> we want to make an ML-specific data type, it should be in an ML plug-in. We
> should not pollute the general purpose language with hastily-considered
> features that target specific bandwagons - at best partially - no matter
> how exciting the bandwagon.
>
> I think a simple and easy case can be made for fixed length array types
> that do not seem to create random bits of cruft in the language that dangle
> by themselves should this play not pan out. This is an easy way for this
> effort to make progress without negatively impacting the language.
>
> That is, unless we want to start supporting totally random types for every
> use case at the top level language layer. I don’t think this is a good
> idea, personally, and I’m quite confident we would now be regretting this
> approach had it been taken for earlier bandwagons.
>
> Nor do I think anyone’s priors about how successful this effort will be
> should matter. As a matter of principle, we should simply never deliver a
> specialist functionality as a high level CQL language feature without at
> least baking it for several years as a plug-in.
>
> On 1 May 2023, at 21:03, Mick Semb Wever  wrote:
>
> 
>
> Yes!  What you (David) and Benedict write beautifully supports `VECTOR
> FLOAT[n]` imho.
>
> You are definitely bringing up valid implementation details, and that can
> be dealt with during patch review. This thread is about the CQL API
> addition.
>
> No matter which way the technical review goes with the implementation
> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
> idiomatic approach and the best long-term CQL API.  It's a win-win
> situation – no matter how you look at it imho it is the best solution api
> wise.
>
> Unless the suggestion is that an ideal implementation can give us a better
> CQL API – but I don't see what that could be.   Maybe the suggestion is we
> deny the possibility of using the VECTOR keyword and bring us back to
> something like `NON-NULL FROZEN`.   This is odd to me because
> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
> patch's audience and their idioms.  I have no problems with introducing
> such an alias to meet the ML crowd.
>
> Another way I think of this is
>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>  `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the
> general-use plumbing cql apis.
>
> This would allow implementation details to be moved out of this thread and
> to the review phase.
>
>
>
>
> On Mon, 1 May 2023 at 20:57, David Capwell  wrote:
>
> > I think it is totally reasonable that the ANN patch (and Jonathan) is
> not asked to implement on top of, or towards, other array (or other) new
> data types.
>
>
> This impacts serialization, if you do not think about this day 1 you then
> can’t add later on without having to worry about migration and versioning…
>
> Honestly I wanted to better understand the cost to be generic and the
> impact to ANN, so I took
> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
> and made it handle every requirement I have listed so far (size, null, all
> types)… the current patch has several bugs at the type level that would
> nee

Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Mick Semb Wever
I have no problem with `VECTOR` hanging around forever as an alias for
`NON-NULL FROZEN`.  Even without ANN, it makes sense and will stick with
new C* users.

A plug-in system would be great, but it shouldn't hold back this work imho.



On Mon, 1 May 2023 at 22:17, Benedict  wrote:

> I have explained repeatedly why I am opposed to ML-specific data types. If
> we want to make an ML-specific data type, it should be in an ML plug-in. We
> should not pollute the general purpose language with hastily-considered
> features that target specific bandwagons - at best partially - no matter
> how exciting the bandwagon.
>
> I think a simple and easy case can be made for fixed length array types
> that do not seem to create random bits of cruft in the language that dangle
> by themselves should this play not pan out. This is an easy way for this
> effort to make progress without negatively impacting the language.
>
> That is, unless we want to start supporting totally random types for every
> use case at the top level language layer. I don’t think this is a good
> idea, personally, and I’m quite confident we would now be regretting this
> approach had it been taken for earlier bandwagons.
>
> Nor do I think anyone’s priors about how successful this effort will be
> should matter. As a matter of principle, we should simply never deliver a
> specialist functionality as a high level CQL language feature without at
> least baking it for several years as a plug-in.
>
> On 1 May 2023, at 21:03, Mick Semb Wever  wrote:
>
> 
>
> Yes!  What you (David) and Benedict write beautifully supports `VECTOR
> FLOAT[n]` imho.
>
> You are definitely bringing up valid implementation details, and that can
> be dealt with during patch review. This thread is about the CQL API
> addition.
>
> No matter which way the technical review goes with the implementation
> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML
> idiomatic approach and the best long-term CQL API.  It's a win-win
> situation – no matter how you look at it imho it is the best solution api
> wise.
>
> Unless the suggestion is that an ideal implementation can give us a better
> CQL API – but I don't see what that could be.   Maybe the suggestion is we
> deny the possibility of using the VECTOR keyword and bring us back to
> something like `NON-NULL FROZEN`.   This is odd to me because
> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the
> patch's audience and their idioms.  I have no problems with introducing
> such an alias to meet the ML crowd.
>
> Another way I think of this is
>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>  `NON-NULL FROZEN` and `FROZEN` and `FLOAT[n]` are the
> general-use plumbing cql apis.
>
> This would allow implementation details to be moved out of this thread and
> to the review phase.
>
>
>
>
> On Mon, 1 May 2023 at 20:57, David Capwell  wrote:
>
>> > I think it is totally reasonable that the ANN patch (and Jonathan) is
>> not asked to implement on top of, or towards, other array (or other) new
>> data types.
>>
>>
>> This impacts serialization, if you do not think about this day 1 you then
>> can’t add later on without having to worry about migration and versioning…
>>
>> Honestly I wanted to better understand the cost to be generic and the
>> impact to ANN, so I took
>> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
>> and made it handle every requirement I have listed so far (size, null, all
>> types)… the current patch has several bugs at the type level that would
>> need to be fixed, so had to fix those as well…. Total time to do this was
>> 10 minutes… and this includes adding a method "public float[]
>> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic
>> small (change VectorType.Serializer.instance.deserialize(buffer) to
>> type.composeAsFloats(buffer))….
>>
>> Did this have any impact to the final ByteBuffer?  Nope, it had identical
>> layout for the FloatType case, but works for all types…. I didn’t change
>> the fact we store the size (felt this could be removed, but then we could
>> never support expanding the vector in the future…)
>>
>> So, given the fact it takes a few minutes to implement all these
>> requirements, I do find it very reasonable to push back and say we should
>> make sure the new type is not leaking details from a special ANN index…. We
>> have spent more time debating this than it takes to support… we also have
>> fuzz testing on trunk so just updating
>> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new
>> type means we get type coverage as well…
>>
>> I have zero issues helping to review this patch and make sure the testing
>> is on-par with existing types (this is a strong requirement for me)
>>
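The generic-element approach David describes above, with a `composeAsFloats(ByteBuffer)` fast path for the float case, might look roughly like the following. This is a minimal sketch and not the code from the patch: the 4-byte element count prefix, the `ElementCodec` interface and the class name `GenericVectorSketch` are all assumptions made for illustration; only the method name `composeAsFloats` and the fact that the size is stored come from the message.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;
import java.util.List;

// Hypothetical sketch; the real VectorType layout may differ.
final class GenericVectorSketch {
    /** Assumed fixed-width element codec standing in for a Cassandra element type. */
    interface ElementCodec<T> {
        int width();
        void write(ByteBuffer out, T value);
    }

    static final ElementCodec<Float> FLOAT = new ElementCodec<Float>() {
        public int width() { return Float.BYTES; }
        public void write(ByteBuffer out, Float value) { out.putFloat(value); }
    };

    /** Assumed layout: int32 element count, then the elements back to back. */
    static <T> ByteBuffer serialize(List<T> values, int dimension, ElementCodec<T> codec) {
        if (values.size() != dimension)
            throw new IllegalArgumentException("expected exactly " + dimension + " elements");
        ByteBuffer out = ByteBuffer.allocate(Integer.BYTES + dimension * codec.width());
        out.putInt(dimension);
        for (T value : values) {
            if (value == null)
                throw new IllegalArgumentException("null elements are not allowed");
            codec.write(out, value);
        }
        out.flip();
        return out;
    }

    /** Float fast path: decode straight into float[] without boxing. */
    static float[] composeAsFloats(ByteBuffer bytes) {
        ByteBuffer in = bytes.duplicate(); // leave the caller's position untouched
        int size = in.getInt();
        if (size < 0 || in.remaining() != size * Float.BYTES)
            throw new IllegalArgumentException("malformed vector");
        float[] result = new float[size];
        for (int i = 0; i < size; i++)
            result[i] = in.getFloat();
        return result;
    }

    public static void main(String[] args) {
        ByteBuffer bb = serialize(Arrays.asList(0.5f, 1.5f, -2.0f), 3, FLOAT);
        System.out.println(Arrays.toString(composeAsFloats(bb)));
    }
}
```

In this sketch the size and null checks live in the generic `serialize` path, while an index that only cares about floats can call `composeAsFloats` and never touch the generic element handling.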
>>
>> > On May 1, 2023, at 10:40 AM, Mick Semb Wever  wrote:
>> >
>> >
>> > > But suggesting that Jonathan should work on implementing general
>> purpose arrays see