> I think it is totally reasonable that the ANN patch (and Jonathan) is not > asked to implement on top of, or towards, other array (or other) new data > types.
This impacts serialization, if you do not think about this day 1 you then can’t add later on without having to worry about migration and versioning… Honestly I wanted to better understand the cost to be generic and the impact to ANN, so I took https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java and made it handle every requirement I have listed so far (size, null, all types)… the current patch has several bugs at the type level that would need to be fixed, so had to fix those as well…. Total time to do this was 10 minutes… and this includes adding a method "public float[] composeAsFloats(ByteBuffer bytes)” which made the change to existing logic small (change VectorType.Serializer.instance.deserialize(buffer) to type.composeAsFloats(buffer))…. Did this have any impact to the final ByteBuffer? Nope, it had identical layout for the FloatType case, but works for all types…. I didn’t change the fact we store the size (felt this could be removed, but then we could never support expanding the vector in the future…) So, given the fact it takes a few minutes to implement all these requirements, I do find it very reasonable to push back and say we should make sure the new type is not leaking details from a special ANN index…. We have spent more time debating this than it takes to support… we also have fuzz testing on trunk so just updating org.apache.cassandra.utils.AbstractTypeGenerators to know about this new type means we get type coverage as well… I have zero issues helping to review this patch and make sure the testing is on-par with existing types (this is a strong requirement for me) > On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org> wrote: > > > > But suggesting that Jonathan should work on implementing general purpose > > arrays seems to fall outside the scope of this discussion, since the result > > of such work wouldn't even fill the need Jonathan is targeting for here. > > Every comment I have made so far I have argued that the v1 work doesn’t need > to do some things, but that the limitations proposed so far are not real > requirements; there is a big difference between what “could be allowed” and > what is implemented day one… I am pushing back on what “could be allowed”, so > far every justification has been that it slows down the ANN work… > > Simple examples of this already exists in C* (every example could be enhanced > logically, we just have yet to put in the work) > > * updating a element of a list is only allowed for multi-cell > * appending to a list is only allowed for multi-cell > * etc. > > By saying that the type "shall not support", you actively block future work > and future possibilities... > > > > I am coming around strongly to the `VECTOR FLOAT[n]` option. > > This gives Jonathan the simplest path right now with ths ANN work, while also > ensuring the CQL API gets the best future potential. > > With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null and > frozen, and that allows both today and in the future, as desired, for its > implementation to be entirely different to `FLOAT[n]`. This addresses a > number of people's concerns that we meet ML's idioms head on. > > IMHO it feels like it will fit into the ideal future CQL , where all > `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and maybe > VECTOR BYTE[n]). This will also permit in the future `FROZEN<primitive[n]>` > if we wanted nulls in frozen arrays. > > I think it is totally reasonable that the ANN patch (and Jonathan) is not > asked to implement on top of, or towards, other array (or other) new data > types. > > I also think it is correct that we think about the evolution of CQL's API, > and how it might exist in the future when we have both ml vectors and general > use arrays.