Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
\o/ Bring it in team. Group hug. Now if you'll excuse me, I'm going to go build my preso on how Cassandra is the only distributed database you can do vector search in an ACID transaction. Patrick On Tue, May 2, 2023 at 3:27 PM Jonathan Ellis wrote: > I had a call with David. We agreed that w

Re: [POLL] Vector type for ML

2023-05-02 Thread Dinesh Joshi
I'm also in favor of having a general data type that is not tied to numeric data types alone. On 2023/05/02 22:27:24 Jonathan Ellis wrote: > I had a call with David. We agreed that we want a "vector" data type with > these properties > > - Fixed length > - No nulls > - Random access not support

Re: [POLL] Vector type for ML

2023-05-02 Thread Jonathan Ellis
I had a call with David. We agreed that we want a "vector" data type with these properties - Fixed length - No nulls - Random access not supported Where we disagreed was on my proposal to restrict vectors to only numeric data. David's points were that (1) He has a use case today for a data typ
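
As an illustration only, the first two agreed properties (fixed length, no nulls) can be sketched as a small validation step. This is not Cassandra code and not the patch discussed in the thread. The helper name `make_vector` and its `dimension` parameter are hypothetical, and the third property (no random access) is not modeled, since a Python tuple still allows indexing.

```python
from typing import Optional, Sequence, Tuple


def make_vector(values: Sequence[Optional[float]], dimension: int) -> Tuple[float, ...]:
    """Hypothetical helper: validate a fixed-length, null-free vector.

    Assumption for illustration: `dimension` is declared once, as a
    column's type would be, and every written value must match it.
    """
    # Fixed length: reject anything that is not exactly `dimension` long.
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    # No nulls: every element must be present.
    if any(v is None for v in values):
        raise ValueError("vector elements must not be null")
    # Return an immutable tuple so the value is replaced as a whole.
    return tuple(values)
```

Element types are deliberately left unchecked here, because whether to restrict vectors to numeric data is exactly the point on which the thread disagrees.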

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Dinesh Joshi
We're reusing existing Cassandra code so the performance characteristics for parsing should be the same as Cassandra. I will need to check if we have benchmarks. If we do, we'll add it to the CEP wiki page. On 2023/05/02 19:52:28 Sebastian Estevez wrote: > Hey Dinesh, > > Yeah it makes sense th

Re: [POLL] Vector type for ML

2023-05-02 Thread David Capwell
> How about it, David? Did you already make this? I checked out the patch, fixed serialize/deserialize, added the constraints, then added a composeForFloat(ByteBuffer), with this the impact to the POC patch was the following 1) move away from VectorType.instance.serializer().deserialize(bb) to

Re: [POLL] Vector type for ML

2023-05-02 Thread Jeremy Hanna
I'm all for bringing more functionality to the masses sooner, but the original idea has a very very specific use case. Do we have use cases for a general purpose Vector/Array data structure? If so, awesome. I just wondered if generalizing provides value, beyond being straightforward to implem

Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
Yeah, it's a bit of a mess but mailing list yo. People reading this would have no idea we are friends. ;) (Which we are, for anyone reading this later!) I must have missed the point of this already being done. How about it, David? Did you already make this? "FWIW, my interpretation of the votes t

Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
But it’s so trivial it was already implemented by David in the span of ten minutes? If anything, we’re slowing progress down by refusing to do the extra types, as we’re busy arguing about it rather than delivering a feature? FWIW, my interpretation of the votes today is that we SHOULD NOT (ever) sup

Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
I'll speak up on that one. If you look at my ranked voting, that is where my head is. I get accused of scope creep (a lot) and looking at the initial proposal Jonathan put on the ML it was mostly "Developers are adopting vector search at a furious pace and I think I have a simple way of adding supp

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Sebastian Estevez
Hey Dinesh, Yeah it makes sense that the sstable streaming is network bound since it's mostly just moving files. Do you have any performance stats on the sstable parsing side inside spark? --Seb On Tue, May 2, 2023 at 3:31 PM Dinesh Joshi wrote: > It is line rate / network bound. We have a pa

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Dinesh Joshi
It is line rate / network bound. We have a patch out in vert.x that should use the zero copy path for it. But it's not a strict prereq for it. On 2023/05/02 15:39:02 Sebastian Estevez wrote: > Hi folks, > > Great stuff thanks for sharing. > > The performance numbers I've seen so far are for the

Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
Could folk voting against a general purpose type (that could well be called a vector) briefly explain their reasoning? We established in the other thread that it’s technically trivial, meaning folk must think it is strictly superior to only support float rather than eg all numeric types (note: for t

Re: [POLL] Vector type for ML

2023-05-02 Thread Patrick McFadin
A > B > C on both polls. Having talked to several users in the community that are highly excited about this change, this gets to what developers want to do at Cassandra scale: store embeddings and retrieve them. On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña wrote: > A > B > C > > I don't th

Re: [POLL] Vector type for ML

2023-05-02 Thread Andrés de la Peña
A > B > C I don't think that ML is such a niche application that it can't have its own CQL data type. Also, vectors are mathematical elements that have more applications than ML. On Tue, 2 May 2023 at 19:15, Mick Semb Wever wrote: > > > On Tue, 2 May 2023 at 17:14, Jonathan Ellis wrote: > >> S

Re: [POLL] Vector type for ML

2023-05-02 Thread Mick Semb Wever
On Tue, 2 May 2023 at 17:14, Jonathan Ellis wrote: > Should we add a vector type to Cassandra designed to meet the needs of > machine learning use cases, specifically feature and embedding vectors for > training, inference, and vector search? > > ML vectors are fixed-dimension (fixed-length) sequ

Re: [POLL] Vector type for ML

2023-05-02 Thread David Capwell
> B) Should we introduce a type that is general purpose, and supports all > Cassandra types, so that this may be used to support ML (and perhaps other) > workloads I vote B only as well... > On May 2, 2023, at 9:02 AM, Benedict wrote: > > This is not the poll I thought we would be conducting,

Re: [POLL] Vector type for ML

2023-05-02 Thread Benedict
This is not the poll I thought we would be conducting, and I don’t really support its framing. There are two parallel questions: what the functionality should be and how it should be exposed. This poll compresses the optionality poorly. Whether or not we support a “vector” concept (or something is

Re: [POLL] Vector type for ML

2023-05-02 Thread Jonathan Ellis
My preference: A > B > C. Vectors are distinct enough from arrays that we should not make adding the latter a prerequisite for adding the former. On Tue, May 2, 2023 at 10:13 AM Jonathan Ellis wrote: > Should we add a vector type to Cassandra designed to meet the needs of > machine learning use

Re: [DISCUSS] CEP-28: Reading and Writing Cassandra Data with Spark Bulk Analytics

2023-05-02 Thread Sebastian Estevez
Hi folks, Great stuff thanks for sharing. The performance numbers I've seen so far are for the sidecar streaming sstables (seems like this is just network bound?). What kind of perf are you seeing at the Spark executors (at the per task level)? --Seb On Mon, May 1, 2023 at 3:50 PM Dinesh Joshi

[POLL] Vector type for ML

2023-05-02 Thread Jonathan Ellis
Should we add a vector type to Cassandra designed to meet the needs of machine learning use cases, specifically feature and embedding vectors for training, inference, and vector search? ML vectors are fixed-dimension (fixed-length) sequences of numeric types, with no nulls allowed, and with no nee

Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Benedict
If we agree we’re delivering some general purpose array type, that supports all types as elements (ie, is logically equivalent to a frozen list of fixed length, however it is actually implemented), I think we are in technical agreement and it’s just a matter of presentation. At which point I think we

Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Jonathan Ellis
To make sure I understand correctly -- are you saying that you're fine with a vector type, but you want to see it implemented as a special case of arrays, or that you are not fine with a vector type because you would prefer to only add arrays and that should be "good enough" for ML? On Mon, May 1,

Re: [DISCUSS] New data type for vector search

2023-05-02 Thread Mick Semb Wever
I have no problem with `VECTOR` hanging around forever as an alias for `NON-NULL FROZEN`. Even without ANN, it makes sense and will stick with new C* users. A plug-in system would be great, but it shouldn't hold back this work imho. On Mon, 1 May 2023 at 22:17, Benedict wrote: > I have expla