> A data type plug-in is actually really easy today, I think? Sadly not, the client reads the class from our schema tables and has to have duplicate logic to serialize/deserialize results… types are easy to add if you are ok with client not understanding them (and will some clients fail due to every language having its own logic?)
> On May 1, 2023, at 2:26 PM, Benedict <bened...@apache.org> wrote: > > A data type plug-in is actually really easy today, I think? But, developing > further hooks should probably be thought through as they’re necessary. > > I think in this case it would be simpler to deliver a general purpose type, > which is why I’m trying to propose types that would be acceptable. > > I also think we’re pretty close to agreement, really? > > But if not, let’s flesh out potential plug-in requirements. > > >> On 1 May 2023, at 21:58, Josh McKenzie <jmcken...@apache.org> wrote: >> >> >>> >>> If we want to make an ML-specific data type, it should be in an ML plug-in. >> How can we encourage a healthier plug-in ecosystem? As far as I know it's >> been pretty anemic historically: >> >> cassandra: >> https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html >> postgres: https://www.postgresql.org/docs/current/contrib.html >> >> I'm really interested to hear if there's more in the ecosystem I'm not aware >> of or if there's been strides made in this regard; users in the ecosystem >> being able to write durable extensions to Cassandra that they can then >> distribute and gain momentum could potentially be a great incubator for new >> features or functionality in the ecosystem. >> >> If our support for extensions remains as bare as I believe it to be, I >> wouldn't recommend anyone go that route. >> >> On Mon, May 1, 2023, at 4:17 PM, Benedict wrote: >>> >>> I have explained repeatedly why I am opposed to ML-specific data types. If >>> we want to make an ML-specific data type, it should be in an ML plug-in. We >>> should not pollute the general purpose language with hastily-considered >>> features that target specific bandwagons - at best partially - no matter >>> how exciting the bandwagon. >>> >>> I think a simple and easy case can be made for fixed length array types >>> that do not seem to create random bits of cruft in the language that dangle >>> by themselves should this play not pan out. This is an easy way for this >>> effort to make progress without negatively impacting the language. >>> >>> That is, unless we want to start supporting totally random types for every >>> use case at the top level language layer. I don’t think this is a good >>> idea, personally, and I’m quite confident we would now be regretting this >>> approach had it been taken for earlier bandwagons. >>> >>> Nor do I think anyone’s priors about how successful this effort will be >>> should matter. As a matter of principle, we should simply never deliver a >>> specialist functionality as a high level CQL language feature without at >>> least baking it for several years as a plug-in. >>> >>>> On 1 May 2023, at 21:03, Mick Semb Wever <m...@apache.org> wrote: >>>> >>>> >>>> Yes! What you (David) and Benedict write beautifully supports `VECTOR >>>> FLOAT[n]` imho. >>>> >>>> You are definitely bringing up valid implementation details, and that can >>>> be dealt with during patch review. This thread is about the CQL API >>>> addition. >>>> >>>> No matter which way the technical review goes with the implementation >>>> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML >>>> idiomatic approach and the best long-term CQL API. It's a win-win >>>> situation – no matter how you look at it imho it is the best solution api >>>> wise. >>>> >>>> Unless the suggestion is that an ideal implementation can give us a better >>>> CQL API – but I don't see what that could be. Maybe the suggestion is we >>>> deny the possibility of using the VECTOR keyword and bring us back to >>>> something like `NON-NULL FROZEN<FLOAT[n]>`. This is odd to me because >>>> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the >>>> patch's audience and their idioms. I have no problems with introducing >>>> such an alias to meet the ML crowd. >>>> >>>> Another way I think of this is >>>> `VECTOR FLOAT[n]` is the porcelain ML cql api, >>>> `NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the >>>> general-use plumbing cql apis. >>>> >>>> This would allow implementation details to be moved out of this thread and >>>> to the review phase. >>>> >>>> >>>> >>>> >>>> On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com >>>> <mailto:dcapw...@apple.com>> wrote: >>>> > I think it is totally reasonable that the ANN patch (and Jonathan) is >>>> > not asked to implement on top of, or towards, other array (or other) new >>>> > data types. >>>> >>>> >>>> This impacts serialization, if you do not think about this day 1 you then >>>> can’t add later on without having to worry about migration and versioning… >>>> >>>> Honestly I wanted to better understand the cost to be generic and the >>>> impact to ANN, so I took >>>> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java >>>> and made it handle every requirement I have listed so far (size, null, >>>> all types)… the current patch has several bugs at the type level that >>>> would need to be fixed, so had to fix those as well…. Total time to do >>>> this was 10 minutes… and this includes adding a method "public float[] >>>> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic >>>> small (change VectorType.Serializer.instance.deserialize(buffer) to >>>> type.composeAsFloats(buffer))…. >>>> >>>> Did this have any impact to the final ByteBuffer? Nope, it had identical >>>> layout for the FloatType case, but works for all types…. I didn’t change >>>> the fact we store the size (felt this could be removed, but then we could >>>> never support expanding the vector in the future…) >>>> >>>> So, given the fact it takes a few minutes to implement all these >>>> requirements, I do find it very reasonable to push back and say we should >>>> make sure the new type is not leaking details from a special ANN index…. >>>> We have spent more time debating this than it takes to support… we also >>>> have fuzz testing on trunk so just updating >>>> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new >>>> type means we get type coverage as well… >>>> >>>> I have zero issues helping to review this patch and make sure the testing >>>> is on-par with existing types (this is a strong requirement for me) >>>> >>>> >>>> > On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org >>>> > <mailto:m...@apache.org>> wrote: >>>> > >>>> > >>>> > > But suggesting that Jonathan should work on implementing general >>>> > > purpose arrays seems to fall outside the scope of this discussion, >>>> > > since the result of such work wouldn't even fill the need Jonathan is >>>> > > targeting for here. >>>> > >>>> > Every comment I have made so far I have argued that the v1 work doesn’t >>>> > need to do some things, but that the limitations proposed so far are not >>>> > real requirements; there is a big difference between what “could be >>>> > allowed” and what is implemented day one… I am pushing back on what >>>> > “could be allowed”, so far every justification has been that it slows >>>> > down the ANN work… >>>> > >>>> > Simple examples of this already exists in C* (every example could be >>>> > enhanced logically, we just have yet to put in the work) >>>> > >>>> > * updating a element of a list is only allowed for multi-cell >>>> > * appending to a list is only allowed for multi-cell >>>> > * etc. >>>> > >>>> > By saying that the type "shall not support", you actively block future >>>> > work and future possibilities... >>>> > >>>> > >>>> > >>>> > I am coming around strongly to the `VECTOR FLOAT[n]` option. >>>> > >>>> > This gives Jonathan the simplest path right now with ths ANN work, while >>>> > also ensuring the CQL API gets the best future potential. >>>> > >>>> > With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null >>>> > and frozen, and that allows both today and in the future, as desired, >>>> > for its implementation to be entirely different to `FLOAT[n]`. This >>>> > addresses a number of people's concerns that we meet ML's idioms head on. >>>> > >>>> > IMHO it feels like it will fit into the ideal future CQL , where all >>>> > `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and >>>> > maybe VECTOR BYTE[n]). This will also permit in the future >>>> > `FROZEN<primitive[n]>` if we wanted nulls in frozen arrays. >>>> > >>>> > I think it is totally reasonable that the ANN patch (and Jonathan) is >>>> > not asked to implement on top of, or towards, other array (or other) new >>>> > data types. >>>> > >>>> > I also think it is correct that we think about the evolution of CQL's >>>> > API, and how it might exist in the future when we have both ml vectors >>>> > and general use arrays. >>