I'm proposing a vector data type for ML use cases.  It's not the same thing
as an array or a list and it's not supposed to be.

While it's true that it would be possible to build a vector type on top of
an array type, it's not necessary to do it that way, and given the lack of
interest in an array type for its own sake I don't see why we would want to
make that a requirement.

It's relevant that pgvector adds a vector data type that only supports
floats *even though PostgreSQL already has an array data type*, because the
semantics are different.  (Among the systems offering vector search,
pgvector is the one built on the system whose query language is most
similar to Cassandra's.)  Random access doesn't make sense; strings,
collections, and other data types don't make sense as elements; typical
ordered indexes don't make sense; etc.  It's just a different beast from
arrays, for a different use case.
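
To make the difference in semantics concrete, here is a minimal sketch in
plain Python.  It is illustrative only: `make_vector` and
`cosine_similarity` are hypothetical names I made up for this example, not
Cassandra or pgvector APIs, and the write-path checks (fixed dimension, no
nulls) are my reading of the requirements discussed in the quoted thread
below, not a settled design.

```python
import math
from typing import Optional, Sequence, Tuple


def make_vector(values: Sequence[Optional[float]], dimension: int) -> Tuple[float, ...]:
    """Hypothetical write-path validation: fixed size, no nulls, numeric only.

    Returns an immutable tuple, mirroring the "frozen" behavior discussed
    in the thread.
    """
    if len(values) != dimension:
        raise ValueError(f"expected {dimension} elements, got {len(values)}")
    out = []
    for v in values:
        if v is None:
            raise ValueError("vector elements may not be null")
        if isinstance(v, bool) or not isinstance(v, (int, float)):
            raise TypeError(f"vector elements must be numeric, got {type(v).__name__}")
        out.append(float(v))
    return tuple(out)


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Compare two vectors as whole units; no single element is meaningful alone."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)


# "Ordering" only exists relative to a query vector, which is why a typical
# ordered index over the stored values does not help.
query = make_vector([1.0, 0.0, 0.0], 3)
rows = {
    "a": make_vector([0.9, 0.1, 0.0], 3),
    "b": make_vector([0.0, 1.0, 0.0], 3),
}
ranked = sorted(rows, key=lambda k: cosine_similarity(query, rows[k]), reverse=True)
```

The ranking changes with every query vector, so the stored values have no
useful total order of their own.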

On Fri, Apr 28, 2023 at 10:40 AM Benedict <bened...@apache.org> wrote:

> But you’re proposing introducing a general purpose type - this isn’t an ML
> plug-in, it’s modifying the core language in a manner that makes targeting
> your workload easier. Which is fine, but that means you have to consider
> its impact on the general language, not just your target use case.
>
> On 28 Apr 2023, at 16:29, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>
> That's exactly right.
>
> In particular it makes no sense at all from an ML perspective to have
> vector types of anything other than numerics.  And as I mentioned in the
> POC thread (but I did not mention here), float is overwhelmingly the most
> frequently used vector type, to the point that Pinecone (by far the most
> popular vector search engine) ONLY supports that type.
>
> Lucene and Elastic also add support for vectors of bytes (8-bit ints),
> which are useful for optimizing models that you have already built with
> floats, but we have no reasonable path towards supporting indexing and
> searches against any other vector type.
>
> So in order of what makes sense to me:
>
> 1. Add a vector type for just floats; consider adding bytes later if
> demand materializes. This gives us 99% of the value and limits the scope so
> we can deliver quickly.
>
> 2. Add a vector type for floats or bytes. This gives us another 1% of
> value in exchange for an extra 20% or so of effort.
>
> 3. Add a vector type for all numeric primitives, but you can only index
> floats and bytes.  I think this is confusing to users and a bad idea.
>
> 4. Add a vector type that composes with all Cassandra types.  I can't see
> a reason to do this, nobody wants it, and we killed the most similar
> proposal in the past as wontfix.
>
> On Thu, Apr 27, 2023 at 7:49 PM Josh McKenzie <jmcken...@apache.org>
> wrote:
>
>> From a machine learning perspective, vectors are a well-known concept:
>> effectively immutable, fixed-length, n-dimensional values that are later
>> used either as part of a model or in conjunction with a model after the
>> fact.
>>
>> While we could have this be non-frozen and not call it a vector, I'd be
>> inclined to still make the argument for a layer of syntactic sugar on top
>> that met ML users where they were with concepts they understood rather than
>> forcing them through the cognitive lift of figuring out the
>> Cassandra-specific contortions to replicate something that's ubiquitous in their
>> space. We did the same "Cassandra-first" approach with our JSON support and
>> that didn't do us any favors in terms of adoption and usage as far as I
>> know.
>>
>> So is the goal here to provide something specific and idiomatic for the
>> ML community or is the goal to make a primitive that's C*-centric that then
>> another layer can write to? I personally argue for the former; I don't see
>> this specific data type going away any time soon.
>>
>> On Thu, Apr 27, 2023, at 12:39 PM, David Capwell wrote:
>>
>> but as you point out it has the problem of allowing nulls.
>>
>>
>> If nulls are not allowed for the elements, then we need either a) a new
>> type, or b) some way to say elements may not be null. As much as I do
>> like b, I am leaning towards a new type for this use case.
>>
>> So, to flesh out the type requirements I have seen so far
>>
>> 1) represents a fixed size array of element type
>> * on write path we will need to validate this
>> 2) element may not be null
>> * on write path we will need to validate this
>> 3) “frozen” (is this really a requirement for the type or is this
>> just simpler for the ANN work?  I feel that this shouldn’t be a requirement)
>> 4) works for all types (my requirement; original proposal is float only,
>> but could logically expand to primitive types)
>>
>> Anything else?
>>
>> The key thing about a vector is that unlike lists or tuples you really
>> don't care about individual elements, you care about doing vector and
>> matrix multiplications with the thing as a unit.
>>
>>
>> That may be true for this use case, but “should” this be true for the type
>> itself?  I feel like no… if a user wants the Nth element of a vector, why
>> would we block them?  I am not saying the first patch, or even 5.0, adds
>> support for index access; I am just trying to push back and say that the
>> type should not block this.
>>
>> (Maybe this is making the case for VECTOR FLOAT[N] rather than FLOAT
>> VECTOR[N].)
>>
>>
>> Now that nulls are not allowed, I have mixed feelings about FLOAT[N], I
>> prefer this syntax but that limitation may not be desired for all use
>> cases… we could always add LIST<TYPE, N> and ARRAY<TYPE, N> later
>> to address that case.
>>
>> In terms of syntax I have seen, here is my ordered preference:
>>
>> 1) TYPE[size] - have mixed feelings due to non-null, but still prefer it
>> 2) QUALIFIER TYPE[size] - QUALIFIER is just a Term we use to denote this
>> semantic…. Could even be NON NULL TYPE[size]
>>
>> On Apr 27, 2023, at 9:00 AM, Benedict <bened...@apache.org> wrote:
>>
>>
>> That’s a bounded ring buffer, not a fixed length array.
>>
>> This definitely isn’t a tuple because the types are all the same, which
>> is pretty crucial for matrix operations. Matrix libraries generally work on
>> arrays of known dimensionality, or sparse representations.
>>
>> Whether we draw any semantic link between the frozen list and whatever we
>> do here, it is fundamentally a frozen list with a restriction on its size.
>> What we’re defining here are “statically” sized arrays, whereas a frozen
>> list is essentially a dynamically sized array.
>>
>> I do not think vector is a good name because vector is used in some other
>> popular languages to mean a (dynamic) list, which is confusing when we also
>> have a list concept.
>>
>> I’m fine with just using the FLOAT[N] syntax, and drawing no direct link
>> with list. Though it is a bit strange that this particular type declaration
>> looks so different to other collection types.
>>
>> On 27 Apr 2023, at 16:48, Jeff Jirsa <jji...@gmail.com> wrote:
>>
>>
>>
>>
>> On Thu, Apr 27, 2023 at 7:39 AM Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> It's been a while, so I may be missing something, but do we already have
>> fixed-size lists?  If not, I don't see why we'd try to make this fit into a
>> List-shaped problem.
>>
>>
>> We do not. The proposal got closed as wont-fix
>> https://issues.apache.org/jira/browse/CASSANDRA-9110
>>
>>
>>
>>
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced
>
>

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced
