> If we want to make an ML-specific data type, it should be in an ML plug-in. How can we encourage a healthier plug-in ecosystem? As far as I know it's been pretty anemic historically:
cassandra: https://cassandra.apache.org/doc/latest/cassandra/plugins/index.html postgres: https://www.postgresql.org/docs/current/contrib.html I'm really interested to hear if there's more in the ecosystem I'm not aware of or if there's been strides made in this regard; users in the ecosystem being able to write durable extensions to Cassandra that they can then distribute and gain momentum could potentially be a great incubator for new features or functionality in the ecosystem. If our support for extensions remains as bare as I believe it to be, I wouldn't recommend anyone go that route. On Mon, May 1, 2023, at 4:17 PM, Benedict wrote: > > I have explained repeatedly why I am opposed to ML-specific data types. If we > want to make an ML-specific data type, it should be in an ML plug-in. We > should not pollute the general purpose language with hastily-considered > features that target specific bandwagons - at best partially - no matter how > exciting the bandwagon. > > I think a simple and easy case can be made for fixed length array types that > do not seem to create random bits of cruft in the language that dangle by > themselves should this play not pan out. This is an easy way for this effort > to make progress without negatively impacting the language. > > That is, unless we want to start supporting totally random types for every > use case at the top level language layer. I don’t think this is a good idea, > personally, and I’m quite confident we would now be regretting this approach > had it been taken for earlier bandwagons. > > Nor do I think anyone’s priors about how successful this effort will be > should matter. As a matter of principle, we should simply never deliver a > specialist functionality as a high level CQL language feature without at > least baking it for several years as a plug-in. > >> On 1 May 2023, at 21:03, Mick Semb Wever <m...@apache.org> wrote: >> >> >> Yes! What you (David) and Benedict write beautifully supports `VECTOR >> FLOAT[n]` imho. >> >> You are definitely bringing up valid implementation details, and that can be >> dealt with during patch review. This thread is about the CQL API addition. >> >> No matter which way the technical review goes with the implementation >> details, `VECTOR FLOAT[n]` does not limit it, and gives us the most ML >> idiomatic approach and the best long-term CQL API. It's a win-win situation >> – no matter how you look at it imho it is the best solution api wise. >> >> Unless the suggestion is that an ideal implementation can give us a better >> CQL API – but I don't see what that could be. Maybe the suggestion is we >> deny the possibility of using the VECTOR keyword and bring us back to >> something like `NON-NULL FROZEN<FLOAT[n]>`. This is odd to me because >> `VECTOR` here can be just an alias for `NON-NULL FROZEN` while meeting the >> patch's audience and their idioms. I have no problems with introducing such >> an alias to meet the ML crowd. >> >> Another way I think of this is >> `VECTOR FLOAT[n]` is the porcelain ML cql api, >> `NON-NULL FROZEN<FLOAT[n]>` and `FROZEN<FLOAT[n]>` and `FLOAT[n]` are the >> general-use plumbing cql apis. >> >> This would allow implementation details to be moved out of this thread and >> to the review phase. >> >> >> >> >> On Mon, 1 May 2023 at 20:57, David Capwell <dcapw...@apple.com> wrote: >>> > I think it is totally reasonable that the ANN patch (and Jonathan) is not >>> > asked to implement on top of, or towards, other array (or other) new data >>> > types. >>> >>> >>> This impacts serialization, if you do not think about this day 1 you then >>> can’t add later on without having to worry about migration and versioning… >>> >>> Honestly I wanted to better understand the cost to be generic and the >>> impact to ANN, so I took >>> https://github.com/jbellis/cassandra/blob/vsearch/src/java/org/apache/cassandra/db/marshal/VectorType.java >>> and made it handle every requirement I have listed so far (size, null, all >>> types)… the current patch has several bugs at the type level that would >>> need to be fixed, so had to fix those as well…. Total time to do this was >>> 10 minutes… and this includes adding a method "public float[] >>> composeAsFloats(ByteBuffer bytes)” which made the change to existing logic >>> small (change VectorType.Serializer.instance.deserialize(buffer) to >>> type.composeAsFloats(buffer))…. >>> >>> Did this have any impact to the final ByteBuffer? Nope, it had identical >>> layout for the FloatType case, but works for all types…. I didn’t change >>> the fact we store the size (felt this could be removed, but then we could >>> never support expanding the vector in the future…) >>> >>> So, given the fact it takes a few minutes to implement all these >>> requirements, I do find it very reasonable to push back and say we should >>> make sure the new type is not leaking details from a special ANN index…. We >>> have spent more time debating this than it takes to support… we also have >>> fuzz testing on trunk so just updating >>> org.apache.cassandra.utils.AbstractTypeGenerators to know about this new >>> type means we get type coverage as well… >>> >>> I have zero issues helping to review this patch and make sure the testing >>> is on-par with existing types (this is a strong requirement for me) >>> >>> >>> > On May 1, 2023, at 10:40 AM, Mick Semb Wever <m...@apache.org> wrote: >>> > >>> > >>> > > But suggesting that Jonathan should work on implementing general >>> > > purpose arrays seems to fall outside the scope of this discussion, >>> > > since the result of such work wouldn't even fill the need Jonathan is >>> > > targeting for here. >>> > >>> > Every comment I have made so far I have argued that the v1 work doesn’t >>> > need to do some things, but that the limitations proposed so far are not >>> > real requirements; there is a big difference between what “could be >>> > allowed” and what is implemented day one… I am pushing back on what >>> > “could be allowed”, so far every justification has been that it slows >>> > down the ANN work… >>> > >>> > Simple examples of this already exists in C* (every example could be >>> > enhanced logically, we just have yet to put in the work) >>> > >>> > * updating a element of a list is only allowed for multi-cell >>> > * appending to a list is only allowed for multi-cell >>> > * etc. >>> > >>> > By saying that the type "shall not support", you actively block future >>> > work and future possibilities... >>> > >>> > >>> > >>> > I am coming around strongly to the `VECTOR FLOAT[n]` option. >>> > >>> > This gives Jonathan the simplest path right now with ths ANN work, while >>> > also ensuring the CQL API gets the best future potential. >>> > >>> > With `VECTOR FLOAT[n]` the 'vector' is the ml sugar that means non-null >>> > and frozen, and that allows both today and in the future, as desired, for >>> > its implementation to be entirely different to `FLOAT[n]`. This >>> > addresses a number of people's concerns that we meet ML's idioms head on. >>> > >>> > IMHO it feels like it will fit into the ideal future CQL , where all >>> > `primitive[N]` are implemented, and where we have VECTOR FLOAT[n] (and >>> > maybe VECTOR BYTE[n]). This will also permit in the future >>> > `FROZEN<primitive[n]>` if we wanted nulls in frozen arrays. >>> > >>> > I think it is totally reasonable that the ANN patch (and Jonathan) is not >>> > asked to implement on top of, or towards, other array (or other) new data >>> > types. >>> > >>> > I also think it is correct that we think about the evolution of CQL's >>> > API, and how it might exist in the future when we have both ml vectors >>> > and general use arrays.