Re: [DISCUSS] New data type for vector search

David Capwell Mon, 01 May 2023 11:56:20 -0700

> I think it is totally reasonable that the ANN patch (and Jonathan) is not 
> asked to implement on top of, or towards, other array (or other) new data 
> types.

This impacts serialization, if you do not think about this day 1 you then can’t 
add later on without having to worry about migration and versioning… 

Honestly I wanted to better understand the cost to be generic and the impact to 
ANN, so I took 
https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java
 and made it handle every requirement I have listed so far (size, null, all 
types)… the current patch has several bugs at the type level that would need to 
be fixed, so had to fix those as well…. Total time to do this was 10 minutes… 
and this includes adding a method "public float[] composeAsFloats(ByteBuffer 
bytes)” which made the change to existing logic small (change 
VectorType.Serializer.instance.deserialize(buffer) to 
type.composeAsFloats(buffer))….

Did this have any impact to the final ByteBuffer?  Nope, it had identical 
layout for the FloatType case, but works for all types…. I didn’t change the 
fact we store the size (felt this could be removed, but then we could never 
support expanding the vector in the future…)

So, given the fact it takes a few minutes to implement all these requirements, 
I do find it very reasonable to push back and say we should make sure the new 
type is not leaking details from a special ANN index…. We have spent more time 
debating this than it takes to support… we also have fuzz testing on trunk so 
just updating org.apache.cassandra.utils.AbstractTypeGenerators to know about 
this new type means we get type coverage as well…

I have zero issues helping to review this patch and make sure the testing is 
on-par with existing types (this is a strong requirement for me)

> On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org> wrote:
> 
> 
> > But suggesting that Jonathan should work on implementing general purpose 
> > arrays seems to fall outside the scope of this discussion, since the result 
> > of such work wouldn't even fill the need Jonathan is targeting for here. 
> 
> Every comment I have made so far I have argued that the v1 work doesn’t need 
> to do some things, but that the limitations proposed so far are not real 
> requirements; there is a big difference between what “could be allowed” and 
> what is implemented day one… I am pushing back on what “could be allowed”, so 
> far every justification has been that it slows down the ANN work…
> 
> Simple examples of this already exists in C* (every example could be enhanced 
> logically, we just have yet to put in the work)
> 
> * updating a element of a list is only allowed for multi-cell
> * appending to a list is only allowed for multi-cell
> * etc.
> 
> By saying that the type "shall not support", you actively block future work 
> and future possibilities...
> 
> 
> 
> I am coming around strongly to the `VECTOR FLOAT[n]` option.
> 
> This gives Jonathan the simplest path right now with ths ANN work, while also 
> ensuring the CQL API gets the best future potential.
> 
> With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null and 
> frozen, and that allows both today and in the future, as desired, for its 
> implementation to be entirely different to `FLOAT[n]`.  This addresses a 
> number of people's concerns that we meet ML's idioms head on.
> 
> IMHO it feels like it will fit into the ideal future CQL , where all 
> `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and maybe 
> VECTOR BYTE[n]). This will also permit in the future `FROZEN<primitive[n]>` 
> if we wanted nulls in frozen arrays.
> 
> I think it is totally reasonable that the ANN patch (and Jonathan) is not 
> asked to implement on top of, or towards, other array (or other) new data 
> types.
> 
> I also think it is correct that we think about the evolution of CQL's API,  
> and how it might exist in the future when we have both ml vectors and general 
> use arrays.

Re: [DISCUSS] New data type for vector search

Reply via email to