Re: [DISCUSS] New data type for vector search

David Capwell Mon, 01 May 2023 15:12:55 -0700

> A data type plug-in is actually really easy today, I think? 

Sadly not, the client reads the class from our schema tables and has to have 
duplicate logic to serialize/deserialize results… types are easy to add if you 
are ok with client not understanding them (and will some clients fail due to 
every language having its own logic?)


> On May 1, 2023, at 2:26 PM, Benedict <[email protected]> wrote:
> 
> A data type plug-in is actually really easy today, I think? But, developing 
> further hooks should probably be thought through as they’re necessary. 
> 
> I think in this case it would be simpler to deliver a general purpose type, 
> which is why I’m trying to propose types that would be acceptable.
> 
> I also think we’re pretty close to agreement, really?
> 
> But if not, let’s flesh out potential plug-in requirements.
> 
> 
>> On 1 May 2023, at 21:58, Josh McKenzie <[email protected]> wrote:
>> 
>> 
>>> 
>>> If we want to make an ML-specific data type, it should be in an ML plug-in.
>> How can we encourage a healthier plug-in ecosystem? As far as I know it's 
>> been pretty anemic historically:
>> 
>> cassandra: 
>> https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
>> postgres: https://www.postgresql.org/docs/current/contrib.html
>> 
>> I'm really interested to hear if there's more in the ecosystem I'm not aware 
>> of or if there's been strides made in this regard; users in the ecosystem 
>> being able to write durable extensions to Cassandra that they can then 
>> distribute and gain momentum could potentially be a great incubator for new 
>> features or functionality in the ecosystem.
>> 
>> If our support for extensions remains as bare as I believe it to be, I 
>> wouldn't recommend anyone go that route.
>> 
>> On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:
>>> 
>>> I have explained repeatedly why I am opposed to ML-specific data types. If 
>>> we want to make an ML-specific data type, it should be in an ML plug-in. We 
>>> should not pollute the general purpose language with hastily-considered 
>>> features that target specific bandwagons - at best partially - no matter 
>>> how exciting the bandwagon.
>>> 
>>> I think a simple and easy case can be made for fixed length array types 
>>> that do not seem to create random bits of cruft in the language that dangle 
>>> by themselves should this play not pan out. This is an easy way for this 
>>> effort to make progress without negatively impacting the language.
>>> 
>>> That is, unless we want to start supporting totally random types for every 
>>> use case at the top level language layer. I don’t think this is a good 
>>> idea, personally, and I’m quite confident we would now be regretting this 
>>> approach had it been taken for earlier bandwagons.
>>> 
>>> Nor do I think anyone’s priors about how successful this effort will be 
>>> should matter. As a matter of principle, we should simply never deliver a 
>>> specialist functionality as a high level CQL language feature without at 
>>> least baking it for several years as a plug-in.
>>> 
>>>> On 1 May 2023, at 21:03, Mick Semb Wever <[email protected]> wrote:
>>>> 
>>>> 
>>>> Yes!  What you (David) and Benedict write beautifully supports `VECTOR 
>>>> FLOAT[n]` imho.
>>>> 
>>>> You are definitely bringing up valid implementation details, and that can 
>>>> be dealt with during patch review. This thread is about the CQL API 
>>>> addition.  
>>>> 
>>>> No matter which way the technical review goes with the implementation 
>>>> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML 
>>>> idiomatic approach and the best long-term CQL API.  It's a win-win 
>>>> situation – no matter how you look at it imho it is the best solution api 
>>>> wise.  
>>>> 
>>>> Unless the suggestion is that an ideal implementation can give us a better 
>>>> CQL API – but I don't see what that could be.   Maybe the suggestion is we 
>>>> deny the possibility of using the VECTOR keyword and bring us back to 
>>>> something like `NON-NULL FROZEN<FLOAT[n]>`.   This is odd to me because 
>>>> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the 
>>>> patch's audience and their idioms.  I have no problems with introducing 
>>>> such an alias to meet the ML crowd.
>>>> 
>>>> Another way I think of this is
>>>>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>>>>  `NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the 
>>>> general-use plumbing cql apis. 
>>>> 
>>>> This would allow implementation details to be moved out of this thread and 
>>>> to the review phase.
>>>> 
>>>> 
>>>> 
>>>> 
>>>> On Mon, 1 May 2023 at 20:57, David Capwell <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> > I think it is totally reasonable that the ANN patch (and Jonathan) is 
>>>> > not asked to implement on top of, or towards, other array (or other) new 
>>>> > data types.
>>>> 
>>>> 
>>>> This impacts serialization, if you do not think about this day 1 you then 
>>>> can’t add later on without having to worry about migration and versioning… 
>>>> 
>>>> Honestly I wanted to better understand the cost to be generic and the 
>>>> impact to ANN, so I took 
>>>> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
>>>>  and made it handle every requirement I have listed so far (size, null, 
>>>> all types)… the current patch has several bugs at the type level that 
>>>> would need to be fixed, so had to fix those as well…. Total time to do 
>>>> this was 10 minutes… and this includes adding a method "public float[] 
>>>> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic 
>>>> small (change VectorType.Serializer.instance.deserialize(buffer) to 
>>>> type.composeAsFloats(buffer))….
>>>> 
>>>> Did this have any impact to the final ByteBuffer?  Nope, it had identical 
>>>> layout for the FloatType case, but works for all types…. I didn’t change 
>>>> the fact we store the size (felt this could be removed, but then we could 
>>>> never support expanding the vector in the future…)
>>>> 
>>>> So, given the fact it takes a few minutes to implement all these 
>>>> requirements, I do find it very reasonable to push back and say we should 
>>>> make sure the new type is not leaking details from a special ANN index…. 
>>>> We have spent more time debating this than it takes to support… we also 
>>>> have fuzz testing on trunk so just updating 
>>>> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new 
>>>> type means we get type coverage as well…
>>>> 
>>>> I have zero issues helping to review this patch and make sure the testing 
>>>> is on-par with existing types (this is a strong requirement for me)
>>>> 
>>>> 
>>>> > On May 1, 2023, at 10:40 AM, Mick Semb Wever <[email protected] 
>>>> > <mailto:[email protected]>> wrote:
>>>> > 
>>>> > 
>>>> > > But suggesting that Jonathan should work on implementing general 
>>>> > > purpose arrays seems to fall outside the scope of this discussion, 
>>>> > > since the result of such work wouldn't even fill the need Jonathan is 
>>>> > > targeting for here. 
>>>> > 
>>>> > Every comment I have made so far I have argued that the v1 work doesn’t 
>>>> > need to do some things, but that the limitations proposed so far are not 
>>>> > real requirements; there is a big difference between what “could be 
>>>> > allowed” and what is implemented day one… I am pushing back on what 
>>>> > “could be allowed”, so far every justification has been that it slows 
>>>> > down the ANN work…
>>>> > 
>>>> > Simple examples of this already exists in C* (every example could be 
>>>> > enhanced logically, we just have yet to put in the work)
>>>> > 
>>>> > * updating a element of a list is only allowed for multi-cell
>>>> > * appending to a list is only allowed for multi-cell
>>>> > * etc.
>>>> > 
>>>> > By saying that the type "shall not support", you actively block future 
>>>> > work and future possibilities...
>>>> > 
>>>> > 
>>>> > 
>>>> > I am coming around strongly to the `VECTOR FLOAT[n]` option.
>>>> > 
>>>> > This gives Jonathan the simplest path right now with ths ANN work, while 
>>>> > also ensuring the CQL API gets the best future potential.
>>>> > 
>>>> > With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null 
>>>> > and frozen, and that allows both today and in the future, as desired, 
>>>> > for its implementation to be entirely different to `FLOAT[n]`.  This 
>>>> > addresses a number of people's concerns that we meet ML's idioms head on.
>>>> > 
>>>> > IMHO it feels like it will fit into the ideal future CQL , where all 
>>>> > `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and 
>>>> > maybe VECTOR BYTE[n]). This will also permit in the future 
>>>> > `FROZEN<primitive[n]>` if we wanted nulls in frozen arrays.
>>>> > 
>>>> > I think it is totally reasonable that the ANN patch (and Jonathan) is 
>>>> > not asked to implement on top of, or towards, other array (or other) new 
>>>> > data types.
>>>> > 
>>>> > I also think it is correct that we think about the evolution of CQL's 
>>>> > API,  and how it might exist in the future when we have both ml vectors 
>>>> > and general use arrays.
>>

Re: [DISCUSS] New data type for vector search

Reply via email to