Re: [DISCUSS] New data type for vector search

Josh McKenzie Mon, 01 May 2023 13:58:34 -0700

> If we want to make an ML-specific data type, it should be in an ML plug-in.

How can we encourage a healthier plug-in ecosystem? As far as I know it's been 
pretty anemic historically:


cassandra: https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html
postgres: https://www.postgresql.org/docs/current/contrib.html

I'm really interested to hear if there's more in the ecosystem I'm not aware of 
or if strides have been made in this regard. Users being able to write durable 
extensions to Cassandra, which they can then distribute and build momentum 
around, could be a great incubator for new features or functionality in the 
ecosystem.

If our support for extensions remains as bare as I believe it to be, I wouldn't 
recommend anyone go that route.

On Mon, May 1, 2023, at 4:17 PM, Benedict wrote:
> 
> I have explained repeatedly why I am opposed to ML-specific data types. If we 
> want to make an ML-specific data type, it should be in an ML plug-in. We 
> should not pollute the general purpose language with hastily-considered 
> features that target specific bandwagons - at best partially - no matter how 
> exciting the bandwagon.
> 
> I think a simple and easy case can be made for fixed length array types that 
> do not seem to create random bits of cruft in the language that dangle by 
> themselves should this play not pan out. This is an easy way for this effort 
> to make progress without negatively impacting the language.
> 
> That is, unless we want to start supporting totally random types for every 
> use case at the top level language layer. I don’t think this is a good idea, 
> personally, and I’m quite confident we would now be regretting this approach 
> had it been taken for earlier bandwagons.
> 
> Nor do I think anyone’s priors about how successful this effort will be 
> should matter. As a matter of principle, we should simply never deliver a 
> specialist functionality as a high level CQL language feature without at 
> least baking it for several years as a plug-in.
> 
>> On 1 May 2023, at 21:03, Mick Semb Wever <m...@apache.org> wrote:
>> 
>> 
>> Yes!  What you (David) and Benedict write beautifully supports `VECTOR 
>> FLOAT[n]` imho.
>> 
>> You are definitely bringing up valid implementation details, and that can be 
>> dealt with during patch review. This thread is about the CQL API addition.  
>> 
>> No matter which way the technical review goes with the implementation 
>> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML 
>> idiomatic approach and the best long-term CQL API.  It's a win-win situation 
>> – no matter how you look at it imho it is the best solution api wise.  
>> 
>> Unless the suggestion is that an ideal implementation can give us a better 
>> CQL API – but I don't see what that could be.   Maybe the suggestion is we 
>> deny the possibility of using the VECTOR keyword and bring us back to 
>> something like `NON-NULL FROZEN<FLOAT[n]>`.   This is odd to me because 
>> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the 
>> patch's audience and their idioms.  I have no problems with introducing such 
>> an alias to meet the ML crowd.
>> 
>> Another way I think of this is
>>  `VECTOR FLOAT[n]` is the porcelain ML cql api,
>>  `NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the 
>> general-use plumbing cql apis. 
>> 
>> This would allow implementation details to be moved out of this thread and 
>> to the review phase.
>> 
>> 
>> 
>> 
>> On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com> wrote:
>>> > I think it is totally reasonable that the ANN patch (and Jonathan) is not 
>>> > asked to implement on top of, or towards, other array (or other) new data 
>>> > types.
>>> 
>>> 
>>> This impacts serialization: if you do not think about this on day 1, you 
>>> can’t add it later without having to worry about migration and versioning… 
>>> 
>>> Honestly I wanted to better understand the cost to be generic and the 
>>> impact to ANN, so I took 
>>> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
>>>  and made it handle every requirement I have listed so far (size, null, all 
>>> types)… the current patch has several bugs at the type level that would 
>>> need to be fixed, so I had to fix those as well… Total time to do this was 
>>> 10 minutes, and this includes adding a method `public float[] 
>>> composeAsFloats(ByteBuffer bytes)`, which made the change to existing logic 
>>> small (change `VectorType.Serializer.instance.deserialize(buffer)` to 
>>> `type.composeAsFloats(buffer)`).
>>> 
>>> Did this have any impact on the final ByteBuffer?  Nope, it had an identical 
>>> layout for the FloatType case, but works for all types… I didn’t change 
>>> the fact that we store the size (I felt this could be removed, but then we 
>>> could never support expanding the vector in the future…)
>>> 
>>> So, given the fact it takes a few minutes to implement all these 
>>> requirements, I do find it very reasonable to push back and say we should 
>>> make sure the new type is not leaking details from a special ANN index…. We 
>>> have spent more time debating this than it takes to support… we also have 
>>> fuzz testing on trunk so just updating 
>>> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new 
>>> type means we get type coverage as well…
>>> 
>>> I have zero issues helping to review this patch and make sure the testing 
>>> is on-par with existing types (this is a strong requirement for me)
>>> 
>>> 
>>> > On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org> wrote:
>>> > 
>>> > 
>>> > > But suggesting that Jonathan should work on implementing general 
>>> > > purpose arrays seems to fall outside the scope of this discussion, 
>>> > > since the result of such work wouldn't even fill the need Jonathan is 
>>> > > targeting for here. 
>>> > 
>>> > In every comment I have made so far, I have argued that the v1 work 
>>> > doesn’t need to do some things, but that the limitations proposed so far 
>>> > are not real requirements; there is a big difference between what “could 
>>> > be allowed” and what is implemented on day one… I am pushing back on what 
>>> > “could be allowed”; so far every justification has been that it slows 
>>> > down the ANN work…
>>> > 
>>> > Simple examples of this already exist in C* (every example could be 
>>> > enhanced logically; we just have yet to put in the work)
>>> > 
>>> > * updating an element of a list is only allowed for multi-cell
>>> > * appending to a list is only allowed for multi-cell
>>> > * etc.
>>> > 
>>> > By saying that the type "shall not support", you actively block future 
>>> > work and future possibilities...
>>> > 
>>> > 
>>> > 
>>> > I am coming around strongly to the `VECTOR FLOAT[n]` option.
>>> > 
>>> > This gives Jonathan the simplest path right now with the ANN work, while 
>>> > also ensuring the CQL API gets the best future potential.
>>> > 
>>> > With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null 
>>> > and frozen, and that allows both today and in the future, as desired, for 
>>> > its implementation to be entirely different to `FLOAT[n]`.  This 
>>> > addresses a number of people's concerns that we meet ML's idioms head on.
>>> > 
>>> > IMHO it feels like it will fit into the ideal future CQL, where all 
>>> > `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and 
>>> > maybe VECTOR BYTE[n]). This will also permit in the future 
>>> > `FROZEN<primitive[n]>` if we wanted nulls in frozen arrays.
>>> > 
>>> > I think it is totally reasonable that the ANN patch (and Jonathan) is not 
>>> > asked to implement on top of, or towards, other array (or other) new data 
>>> > types.
>>> > 
>>> > I also think it is correct that we think about the evolution of CQL's 
>>> > API,  and how it might exist in the future when we have both ml vectors 
>>> > and general use arrays.
