Re: [POLL] Vector type for ML

Jeremy Hanna Tue, 02 May 2023 13:53:24 -0700

I'm all for bringing more functionality to the masses sooner, but the original 
idea has a very specific use case.  Do we have use cases for a general-purpose 
Vector/Array data structure?  If so, awesome.  I just wondered whether 
generalizing provides value beyond being straightforward to implement.  I'm 
just trying to be sensitive to the database code maintenance and driver support 
needed for general types versus a single type for a specific, well-defined 
purpose.


If it could easily be a plugin, that's great - but the full picture involves 
drivers that need to support it, or you end up getting binary blobs you have to 
decode client side and then do stuff with.  So ideally, if you have a 
well-defined use case that you can build into the database, having it be part 
of the database and the associated drivers makes the experience much better.
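
To make the "binary blobs you have to decode client side" point concrete, here 
is a minimal sketch in Python of what an application would have to do without 
driver support.  The layout (big-endian float32 values packed back to back, a 
fixed dimension of 3) and the names encode_vector/decode_vector are assumptions 
for illustration only, not Cassandra's actual serialization or any driver's 
API.

```python
import struct

# Hypothetical fixed dimension, chosen only for this illustration.
DIMENSION = 3


def encode_vector(values, dimension=DIMENSION):
    """Pack a fixed-length list of floats into a blob of float32 values.

    Assumed layout: big-endian float32, no length prefix, no nulls.
    """
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    if any(v is None for v in values):
        raise ValueError("null elements are not allowed")
    return struct.pack(f">{dimension}f", *values)


def decode_vector(blob, dimension=DIMENSION):
    """Unpack a blob produced by encode_vector back into a list of floats."""
    expected_size = 4 * dimension  # 4 bytes per float32
    if len(blob) != expected_size:
        raise ValueError(f"expected {expected_size} bytes, got {len(blob)}")
    return list(struct.unpack(f">{dimension}f", blob))


blob = encode_vector([1.0, 2.5, -3.0])
vector = decode_vector(blob)
```

Every client in every language would need its own copy of this logic and would 
have to agree on the layout out of band, which is the maintenance cost I mean.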

I'm not trying to say B couldn't be valuable or that a plugin couldn't be 
feasible.  I'm just trying to enlarge the picture a bit to see what that means 
for this use case and for the supporting drivers/clients.

> On May 2, 2023, at 3:04 PM, Benedict <[email protected]> wrote:
> 
> But it’s so trivial it was already implemented by David in the span of ten 
> minutes? If anything, we’re slowing progress down by refusing to do the extra 
> types, as we’re busy arguing about it rather than delivering a feature?
> 
> FWIW, my interpretation of the votes today is that we SHOULD NOT (ever) 
> support types beyond float. Not that we should start with float.
> 
> So, this whole debate is a mess, I think. But hey ho.
> 
>> On 2 May 2023, at 20:57, Patrick McFadin <[email protected]> wrote:
>> 
>> 
>> I'll speak up on that one. If you look at my ranked voting, that is where my 
>> head is. I get accused of scope creep (a lot), and looking at the initial 
>> proposal Jonathan put on the ML, it was mostly "Developers are adopting 
>> vector search at a furious pace and I think I have a simple way of adding 
>> support to keep Cassandra relevant for these use cases." Instead of just 
>> focusing on this use case, I feel the arguments have bikeshedded into scope 
>> creep, which means it will take forever to get into the project.
>> 
>> My preference is to see one thing validated with an MVP and get it into the 
>> hands of developers sooner so we can continue to iterate based on actual 
>> usage. 
>> 
>> That doesn't mean your points are wrong or your opinions are broken; I'm 
>> voting for what I think will be awesome for users sooner. 
>> 
>> Patrick
>> 
>> On Tue, May 2, 2023 at 12:29 PM Benedict <[email protected]> wrote:
>>> Could folk voting against a general purpose type (that could well be called 
>>> a vector) briefly explain their reasoning?
>>> 
>>> We established in the other thread that it’s technically trivial, meaning 
>>> folk must think it is strictly superior to only support float rather than 
>>> eg all numeric types (note: for the type, not the ANN). 
>>> 
>>> I am surprised, and the blurbs accompanying votes so far don’t seem to 
>>> touch on this, mostly just endorsing the idea of a vector.
>>> 
>>> 
>>>> On 2 May 2023, at 20:20, Patrick McFadin <[email protected]> wrote:
>>>> 
>>>> 
>>>> A > B > C on both polls. 
>>>> 
>>>> Having talked to several users in the community that are highly excited 
>>>> about this change, this gets to what developers want to do at Cassandra 
>>>> scale: store embeddings and retrieve them. 
>>>> 
>>>> On Tue, May 2, 2023 at 11:47 AM Andrés de la Peña <[email protected]> wrote:
>>>>> A > B > C
>>>>> 
>>>>> I don't think that ML is such a niche application that it can't have its 
>>>>> own CQL data type. Also, vectors are mathematical elements that have more 
>>>>> applications than ML.
>>>>> 
>>>>> On Tue, 2 May 2023 at 19:15, Mick Semb Wever <[email protected]> wrote:
>>>>>> 
>>>>>> 
>>>>>> On Tue, 2 May 2023 at 17:14, Jonathan Ellis <[email protected]> wrote:
>>>>>>> Should we add a vector type to Cassandra designed to meet the needs of 
>>>>>>> machine learning use cases, specifically feature and embedding vectors 
>>>>>>> for training, inference, and vector search?  
>>>>>>> 
>>>>>>> ML vectors are fixed-dimension (fixed-length) sequences of numeric 
>>>>>>> types, with no nulls allowed, and with no need for random access. The 
>>>>>>> ML industry overwhelmingly uses float32 vectors, to the point that the 
>>>>>>> industry-leading special-purpose vector database ONLY supports that 
>>>>>>> data type.
>>>>>>> 
>>>>>>> This poll is to gauge consensus subsequent to the recent discussion 
>>>>>>> thread at 
>>>>>>> https://lists.apache.org/thread/0lj1nk9jbhkf1rlgqcvxqzfyntdjrnk0.
>>>>>>> 
>>>>>>> Please rank the discussed options from most preferred option to least, 
>>>>>>> e.g., A > B > C (A is my preference, followed by B, followed by C) or 
>>>>>>> C > B = A (C is my preference, followed by B or A approximately equally).
>>>>>>> 
>>>>>>> (A) I am in favor of adding a vector type for floats; I do not believe 
>>>>>>> we need to tie it to any particular implementation details.
>>>>>>> 
>>>>>>> (B) I am okay with adding a vector type but I believe we must add array 
>>>>>>> types that compose with all Cassandra types first, and make vectors a 
>>>>>>> special case of arrays-without-null-elements.
>>>>>>> 
>>>>>>> (C) I am not in favor of adding a built-in vector type.
>>>>>> 
>>>>>> 
>>>>>> 
>>>>>> A  > B > C
>>>>>> 
>>>>>> B is stated as "must add array types…".  I think this is a bit loaded.  
>>>>>> If B were (A + the implementation must be a non-null frozen float32 
>>>>>> array, with serialisation forward compatible with other frozen arrays 
>>>>>> implemented later), I would put it before (A), especially because it 
>>>>>> has already been shown that this is easy to implement.
>>>>>> 
>>>>>>
