I'm proposing a vector data type for ML use cases. It's not the same thing as an array or a list and it's not supposed to be.
While it's true that it would be possible to build a vector type on top of an array type, it's not necessary to do it that way, and given the lack of interest in an array type for its own sake I don't see why we would want to make that a requirement.

It's relevant that pgvector, which among the systems offering vector search is based on the most similar system to Cassandra in terms of its query language, adds a vector data type that only supports floats *even though postgresql already has an array data type* because the semantics are different. Random access doesn't make sense, string and collection and other datatypes don't make sense, typical ordered indexes don't make sense, etc. It's just a different beast from arrays, for a different use case.

On Fri, Apr 28, 2023 at 10:40 AM Benedict <bened...@apache.org> wrote:

> But you’re proposing introducing a general purpose type - this isn’t an ML
> plug-in, it’s modifying the core language in a manner that makes targeting
> your workload easier. Which is fine, but that means you have to consider
> its impact on the general language, not just your target use case.
>
> On 28 Apr 2023, at 16:29, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> That's exactly right.
>
> In particular it makes no sense at all from an ML perspective to have
> vector types of anything other than numerics. And as I mentioned in the
> POC thread (but I did not mention here), float is overwhelmingly the most
> frequently used vector type, to the point that Pinecone (by far the most
> popular vector search engine) ONLY supports that type.
>
> Lucene and Elastic also add support for vectors of bytes (8-bit ints),
> which are useful for optimizing models that you have already built with
> floats, but we have no reasonable path towards supporting indexing and
> searches against any other vector type.
>
> So in order of what makes sense to me:
>
> 1. Add a vector type for just floats; consider adding bytes later if
> demand materializes.
> This gives us 99% of the value and limits the scope so
> we can deliver quickly.
>
> 2. Add a vector type for floats or bytes. This gives us another 1% of
> value in exchange for an extra 20% or so of effort.
>
> 3. Add a vector type for all numeric primitives, but you can only index
> floats and bytes. I think this is confusing to users and a bad idea.
>
> 4. Add a vector type that composes with all Cassandra types. I can't see
> a reason to do this, nobody wants it, and we killed the most similar
> proposal in the past as wontfix.
>
> On Thu, Apr 27, 2023 at 7:49 PM Josh McKenzie <jmcken...@apache.org>
> wrote:
>
>> From a machine learning perspective, vectors are a well-known concept
>> that are effectively immutable fixed-length n-dimensional values that are
>> then later used either as part of a model or in conjunction with a model
>> after the fact.
>>
>> While we could have this be non-frozen and not call it a vector, I'd be
>> inclined to still make the argument for a layer of syntactic sugar on top
>> that met ML users where they were with concepts they understood rather than
>> forcing them through the cognitive lift of figuring out the Cassandra
>> specific contortions to replicate something that's ubiquitous in their
>> space. We did the same "Cassandra-first" approach with our JSON support and
>> that didn't do us any favors in terms of adoption and usage as far as I
>> know.
>>
>> So is the goal here to provide something specific and idiomatic for the
>> ML community or is the goal to make a primitive that's C*-centric that then
>> another layer can write to? I personally argue for the former; I don't see
>> this specific data type going away any time soon.
>>
>> On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>>
>> but as you point out it has the problem of allowing nulls.
>>
>> If nulls are not allowed for the elements, then either we need a) a new
>> type, or b) add some way to say elements may not be null….
>> As much as I do
>> like b, I am leaning towards new type for this use case.
>>
>> So, to flesh out the type requirements I have seen so far
>>
>> 1) represents a fixed size array of element type
>>    * on write path we will need to validate this
>> 2) element may not be null
>>    * on write path we will need to validate this
>> 3) “frozen” (is this really a requirement for the type or is this
>> just simpler for the ANN work? I feel that this shouldn’t be a requirement)
>> 4) works for all types (my requirement; original proposal is float only,
>> but could logically expand to primitive types)
>>
>> Anything else?
>>
>> The key thing about a vector is that unlike lists or tuples you really
>> don't care about individual elements, you care about doing vector and
>> matrix multiplications with the thing as a unit.
>>
>> That maybe true for this use case, but “should” this be true for the type
>> itself? I feel like no… if a user wants the Nth element of a vector why
>> would we block them? I am not saying the first patch, or even 5.0 adds
>> support for index access, I am just trying to push back saying that the
>> type should not block this.
>>
>> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT
>> VECTOR[N].)
>>
>> Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I
>> prefer this syntax but that limitation may not be desired for all use
>> cases… we could always add LIST<TYPE, N> and ARRAY<TYPE, N> later
>> to address that case.
>>
>> In terms of syntax I have seen, here is my ordered preference:
>>
>> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
>> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this
>> semantic…. Could even be NON NULL TYPE[size]
>>
>> On Apr 27, 2023, at 9:00 AM, Benedict <bened...@apache.org> wrote:
>>
>> That’s a bounded ring buffer, not a fixed length array.
>>
>> This definitely isn’t a tuple because the types are all the same, which
>> is pretty crucial for matrix operations. Matrix libraries generally work on
>> arrays of known dimensionality, or sparse representations.
>>
>> Whether we draw any semantic link between the frozen list and whatever we
>> do here, it is fundamentally a frozen list with a restriction on its size.
>> What we’re defining here are “statically” sized arrays, whereas a frozen
>> list is essentially a dynamically sized array.
>>
>> I do not think vector is a good name because vector is used in some other
>> popular languages to mean a (dynamic) list, which is confusing when we also
>> have a list concept.
>>
>> I’m fine with just using the FLOAT[N] syntax, and drawing no direct link
>> with list. Though it is a bit strange that this particular type declaration
>> looks so different to other collection types.
>>
>> On 27 Apr 2023, at 16:48, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> It's been a while, so I may be missing something, but do we already have
>> fixed-size lists? If not, I don't see why we'd try to make this fit into a
>> List-shaped problem.
>>
>> We do not. The proposal got closed as wont-fix
>> https://issues.apache.org/jira/browse/CASSANDRA-9110
>>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced

--
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
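
For illustration only: the requirements discussed in the thread above (fixed size, no null elements, float-only, checked on the write path, then used as a unit rather than element by element) could be sketched roughly as below. This is a minimal Python sketch, not Cassandra code and not a proposed API; the names `validate_vector` and `dot` are hypothetical and exist only for this example.

```python
import math
from typing import Optional, Sequence, Tuple


def validate_vector(values: Sequence[Optional[float]], dimension: int) -> Tuple[float, ...]:
    """Hypothetical write-path check for a fixed-length, non-null float vector."""
    # Requirement 1: the length must match the declared dimension exactly.
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    checked = []
    for i, v in enumerate(values):
        # Requirement 2: individual elements may not be null.
        if v is None:
            raise ValueError(f"element {i} must not be null")
        # Float-only: reject non-numeric values (bool is an int subclass, so exclude it).
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise ValueError(f"element {i} is not numeric: {v!r}")
        checked.append(float(v))
    # A tuple is immutable, loosely mirroring the "frozen" semantics.
    return tuple(checked)


def dot(a: Sequence[float], b: Sequence[float]) -> float:
    """Dot product: the vector is consumed as a whole, not by random access."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    return math.fsum(x * y for x, y in zip(a, b))
```

With these assumed helpers, `validate_vector([1, 2.5, 3], 3)` would be accepted and stored as floats, while a list of the wrong length or one containing `None` would be rejected before it is ever written.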