Re: [DISCUSS] New data type for vector search

David Capwell Wed, 26 Apr 2023 11:03:12 -0700

Benedict’s comments also make me wonder: can any of the values in the vector 
be null?  The patch sent works with float arrays, so null isn’t possible. Is 
null not valid for a vector type?  If so, this would help justify why a 
vector is not an array or a list (both allow null).
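
To make the null question concrete, here is a minimal Java sketch of the difference I mean. The class and method names are made up for illustration and are not from the patch; it only shows that a primitive float[] has no way to represent a missing element, while a boxed list does.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; none of these names come from the patch.
final class NullableElements {
    // Every slot of a primitive float[] holds a value (0.0f by default);
    // there is no way to store null in it.
    static float[] primitiveVector(int dimensions) {
        return new float[dimensions];
    }

    // A boxed list, by contrast, can carry null elements.
    static List<Float> boxedListOfNulls(int size) {
        List<Float> values = new ArrayList<>();
        for (int i = 0; i < size; i++) {
            values.add(null);
        }
        return values;
    }
}
```

So if the vector type is backed by float[], "no nulls inside the vector" falls out of the representation rather than being an extra validation rule.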


> On Apr 26, 2023, at 10:50 AM, David Capwell <[email protected]> wrote:
> 
> Thanks for starting this thread!
> 
>> In the initial commits and thread, this was DENSE FLOAT32. Nobody really 
>> loved that, so we considered a bunch of alternatives, including
>> 
>> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which 
>> would make it familiar for many users. However, this syntax raises the 
>> question of why arrays cannot be created for other types.  Additionally, the 
>> expectation for an array is to provide random access to its contents, which 
>> is not supported for vectors.
>> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense 
>> vectors, not sparse ones. However, since Lucene had sparse vector support in 
>> the past but removed it for lack of compelling use cases, it is unlikely 
>> that it will be added back, making the "DENSE" qualifier less relevant.
>> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with 
>> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the 
>> reasons mentioned above.
>> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less 
>> natural word order.
>> - `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again 
>> this would imply that random access is supported, which we want to avoid 
>> doing.
>> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and 
>> could make it difficult to add other vector types, such as byte vectors 
>> (already supported by Lucene), in the future.
> 
> I didn’t look closely enough when I saw your patch: is this type multicell or 
> not?  That is, does this act like a frozen<array<float>> of fixed size?  I had 
> assumed it’s non-multicell. The main reason I ask now is the pushback on 
> random access. Let’s say I have the following table:
> 
> CREATE TABLE fluffy_kittens (
>   pk int PRIMARY KEY,
>   vector FLOAT[42] -- don’t ask why fluffy kittens need a vector, they just do!
> );
> 
> If I do the following query, I would expect it to work:
> 
> SELECT vector[7] FROM fluffy_kittens WHERE pk=0; -- 7 is less than 42
> 
> While working on Accord’s CQL integration, Caleb and I kept getting bitten by 
> frozen vs non-frozen behavior: many cases just stop working on frozen 
> collections even though they should be easy to add (we already force the user 
> to load the full value, so why can we not touch it?).
> 
> Now, back to the random access comment: assuming this is not multicell, why 
> would random access be blocked?  If the type has isValueLengthFixed() == true 
> then random access should be simple (otherwise it requires walking the array 
> in order or fully deserializing the BB; if working with Lucene I assume we 
> have already deserialized out of the BB).  I am just trying to flesh out 
> whether there is a limitation not being brought up, or whether this is trying 
> to limit the scope of access for easier testing.
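
A minimal Java sketch of why I say fixed-length elements make random access simple. It assumes a serialization of back-to-back 4-byte floats with no per-element length prefix; the class and method names are hypothetical, and the patch may lay the bytes out differently.

```java
import java.nio.ByteBuffer;

// Hypothetical sketch; assumes floats are stored back to back, 4 bytes each.
final class FixedLengthVectorAccess {
    // Serialize the values with no per-element length prefix.
    static ByteBuffer serialize(float[] values) {
        ByteBuffer bb = ByteBuffer.allocate(values.length * Float.BYTES);
        for (float v : values) {
            bb.putFloat(v);
        }
        bb.flip();
        return bb;
    }

    // Element i starts at a computable offset, so there is no need to walk
    // the earlier elements or to deserialize the whole value.
    static float elementAt(ByteBuffer serialized, int index) {
        int dimensions = serialized.remaining() / Float.BYTES;
        if (index < 0 || index >= dimensions) {
            throw new IndexOutOfBoundsException(
                "index " + index + " not in [0, " + dimensions + ")");
        }
        // Absolute read: does not move the buffer's position.
        return serialized.getFloat(serialized.position() + index * Float.BYTES);
    }
}
```

With variable-length elements the offset cannot be computed this way, which is the "walk the array in order" case mentioned above.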
> 
>> However, this syntax raises the question of why arrays cannot be created for 
>> other types
> 
> I left this comment in the other thread: why not?  This could be useful outside 
> the float use case, so having a new “VectorType(AbstractType<T> elements, int 
> size)” is easier/better than a float-only version.  I also did a lot of work 
> to fuzz test our type system, so just adding that into the existing generator 
> would get good coverage right off the bat. (I have another fuzz tester that I 
> have not contributed yet; it was done for Accord. It fuzz tests the AST, so it 
> would be easy to add this there, and that would test type-specific access, 
> which the existing tests don’t.)
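
A rough Java sketch of the shape I have in mind for an element-type-agnostic vector. FixedWidthCodec is a hypothetical stand-in and is NOT Cassandra’s AbstractType API; GenericVector and its members are likewise invented for illustration, and the sketch only covers fixed-width, non-null elements.

```java
import java.nio.ByteBuffer;
import java.util.List;

// Hypothetical stand-in for an element type with a fixed serialized width.
interface FixedWidthCodec<T> {
    int width();
    void write(ByteBuffer out, T value);
    T read(ByteBuffer in, int offset);
}

// Hypothetical vector parameterized by element type and size.
final class GenericVector<T> {
    static final FixedWidthCodec<Float> FLOAT = new FixedWidthCodec<Float>() {
        public int width() { return Float.BYTES; }
        public void write(ByteBuffer out, Float value) { out.putFloat(value); }
        public Float read(ByteBuffer in, int offset) { return in.getFloat(offset); }
    };

    static final FixedWidthCodec<Integer> INT = new FixedWidthCodec<Integer>() {
        public int width() { return Integer.BYTES; }
        public void write(ByteBuffer out, Integer value) { out.putInt(value); }
        public Integer read(ByteBuffer in, int offset) { return in.getInt(offset); }
    };

    private final FixedWidthCodec<T> elements;
    private final int size;

    GenericVector(FixedWidthCodec<T> elements, int size) {
        this.elements = elements;
        this.size = size;
    }

    // Reject values whose length differs from the declared dimension count.
    ByteBuffer serialize(List<T> values) {
        if (values.size() != size) {
            throw new IllegalArgumentException(
                "expected " + size + " elements, got " + values.size());
        }
        ByteBuffer out = ByteBuffer.allocate(size * elements.width());
        for (T value : values) {
            elements.write(out, value);
        }
        out.flip();
        return out;
    }

    // Random access by computed offset, independent of the element type.
    T get(ByteBuffer serialized, int index) {
        if (index < 0 || index >= size) {
            throw new IndexOutOfBoundsException("index " + index);
        }
        return elements.read(serialized, serialized.position() + index * elements.width());
    }
}
```

The point is only that nothing in the vector logic itself is float specific; the element type supplies the width and the read/write.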
> 
>> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow 
>> consistency if we add other float types like FLOAT16 or FLOAT64
> 
> I do not think we should add a new FLOAT32 type, but I am cool with an alias 
> that has FLOAT32 point to FLOAT.  One negative of this is that the code paths 
> where we return schema back to users would show FLOAT even if the user wrote 
> FLOAT32. Other than that negative I don’t see any other problems.
> 
>> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance 
>> of clarity, conciseness, and extensibility. It is more natural in its word 
>> order than the original proposal and avoids unnecessary qualifiers, while 
>> still being clear about the data type it represents. Finally, this syntax is 
>> straightforwardly extensible should we choose to support other vector types 
>> in the future.
> 
> My preference is TYPE[n_dimension] but I am ok with this syntax if others 
> prefer it.  I don’t agree that this extra verbosity adds more clarity; there 
> seems to be an assumption that this will tell users that random access isn’t 
> allowed and only blessed types are allowed. I feel neither point is valid 
> (or at least I have not seen anything published on why they should be).  
> There is a difference between what a type “could” do and what we implement on 
> day 1; I wouldn’t want to add more verbosity because of the intentions of the 
> day 1 implementation.
> 
> 
>> On Apr 26, 2023, at 7:31 AM, Jonathan Ellis <[email protected]> wrote:
>> 
>> Hi all,
>> 
>> Splitting this out per the suggestion in the initial VS thread so we can 
>> work on driver support in parallel with the server-side changes.
>> 
>> I propose adding a new data type for vector search indexes:
>> 
>> FLOAT VECTOR[N_DIMENSIONS]
>> 
>> In the initial commits and thread, this was DENSE FLOAT32. Nobody really 
>> loved that, so we considered a bunch of alternatives, including
>> 
>> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which 
>> would make it familiar for many users. However, this syntax raises the 
>> question of why arrays cannot be created for other types.  Additionally, the 
>> expectation for an array is to provide random access to its contents, which 
>> is not supported for vectors.
>> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense 
>> vectors, not sparse ones. However, since Lucene had sparse vector support in 
>> the past but removed it for lack of compelling use cases, it is unlikely 
>> that it will be added back, making the "DENSE" qualifier less relevant.
>> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with 
>> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the 
>> reasons mentioned above.
>> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less 
>> natural word order.
>> - `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again 
>> this would imply that random access is supported, which we want to avoid 
>> doing.
>> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and 
>> could make it difficult to add other vector types, such as byte vectors 
>> (already supported by Lucene), in the future.
>> 
>> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow 
>> consistency if we add other float types like FLOAT16 or FLOAT64, both of 
>> which are sometimes used in ML. However, we already have a CQL data type for 
>> a 64-bit float (`DOUBLE`), so it would make more sense to add future 
>> variants (which remain hypothetical at this point) along that line instead.
>> 
>> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance 
>> of clarity, conciseness, and extensibility. It is more natural in its word 
>> order than the original proposal and avoids unnecessary qualifiers, while 
>> still being clear about the data type it represents. Finally, this syntax is 
>> straightforwardly extensible should we choose to support other vector types 
>> in the future.
>> 
>> -- 
>> Jonathan Ellis
>> co-founder, http://www.datastax.com <http://www.datastax.com/>
>> @spyced
>
