Re: [DISCUSS] New data type for vector search

David Capwell Wed, 26 Apr 2023 10:50:42 -0700

Thanks for starting this thread!

> In the initial commits and thread, this was DENSE FLOAT32. Nobody really 
> loved that, so we considered a bunch of alternatives, including
> 
> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which 
> would make it familiar for many users. However, this syntax raises the 
> question of why arrays cannot be created for other types.  Additionally, the 
> expectation for an array is to provide random access to its contents, which 
> is not supported for vectors.
> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense 
> vectors, not sparse ones. However, since Lucene had sparse vector support in 
> the past but removed it for lack of compelling use cases, it is unlikely that 
> it will be added back, making the "DENSE" qualifier less relevant.
> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with 
> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the 
> reasons mentioned above.
> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less 
> natural word order.
> - `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again 
> this would imply that random access is supported, which we want to avoid 
> doing.
> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and 
> could make it difficult to add other vector types, such as byte vectors 
> (already supported by Lucene), in the future.

I didn’t look closely enough when I saw your patch: is this type multicell or 
not?  That is, does it act like a fixed-size frozen<array<float>>?  I had 
assumed it’s non-multicell. The main reason I ask now is the pushback on 
random access. Let’s say I have the following table:

CREATE TABLE fluffy_kittens (
  pk int PRIMARY KEY,
  vector FLOAT[42] -- don’t ask why fluffy kittens need a vector, they just do!
);

If I do the following query, I would expect it to work

SELECT vector[7] FROM fluffy_kittens WHERE pk=0; -- 7 is less than 42

While working on Accord’s CQL integration, Caleb and I kept getting bitten by 
frozen vs. non-frozen behavior: many operations simply do not work on frozen 
collections, even though they should be easy to add (we already force the user 
to load the full value, so why can we not touch it?).

Now, back to the random access comment: assuming this is not multicell, why 
would random access be blocked?  If the type’s isValueLengthFixed() == true, 
then random access should be simple (otherwise it requires walking the array in 
order or fully deserializing the BB; if we are working with Lucene, I assume we 
have already deserialized out of the BB).  I am just trying to flesh out 
whether there is a limitation not being brought up, or whether this is trying 
to limit the scope of access for easier testing.
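
As an illustration of the fixed-length case, here is a minimal sketch of 
element access on a non-multicell vector serialized as consecutive 4-byte 
floats. FixedFloatVector and its methods are hypothetical names invented for 
this example, not Cassandra code, and the layout (no per-element length 
prefix) is an assumption.

```java
import java.nio.ByteBuffer;

// Hypothetical helper, not Cassandra code: illustrates element access on a
// frozen, fixed-size vector serialized as consecutive 4-byte floats.
final class FixedFloatVector {
    private FixedFloatVector() {}

    // Serialize the values back to back with no per-element length prefix.
    static ByteBuffer serialize(float[] values) {
        ByteBuffer bb = ByteBuffer.allocate(values.length * Float.BYTES);
        for (float v : values) {
            bb.putFloat(v);
        }
        bb.flip();
        return bb;
    }

    // Random access: every element has the same width, so the offset of
    // element `index` is a multiplication and nothing before it is decoded.
    static float get(ByteBuffer bb, int dimensions, int index) {
        if (index < 0 || index >= dimensions) {
            throw new IndexOutOfBoundsException(
                "index " + index + " not in [0, " + dimensions + ")");
        }
        // Absolute read: does not move the buffer's position.
        return bb.getFloat(bb.position() + index * Float.BYTES);
    }

    public static void main(String[] args) {
        float[] values = new float[42];
        for (int i = 0; i < values.length; i++) {
            values[i] = i * 0.5f;
        }
        ByteBuffer bb = serialize(values);
        System.out.println(get(bb, 42, 7)); // element 7 of 42
    }
}
```

With variable-length elements the offset cannot be computed this way, which is 
the case that would require walking the serialized value in order.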

> However, this syntax raises the question of why arrays cannot be created for 
> other types

I left this comment in the other thread: why not?  This could be useful outside 
the float use case, so having a new “VectorType(AbstractType<T> elements, int 
size)” is easier/better than a float-only version.  I also did a lot of work to 
fuzz test our type system, so just adding this to the existing generator would 
get good coverage right off the bat. (I have another fuzz tester that I have 
not contributed yet; it was done for Accord and fuzz tests the AST, so it would 
be easy to add this there. That would test type-specific access, which the 
existing tests don’t.)
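
A rough sketch of what an element-generic, fixed-size vector could look like 
follows. ElementCodec is a hypothetical stand-in for AbstractType<T> that only 
models fixed-width elements, and VectorTypeSketch and Codecs are made-up names 
for this example; none of it reflects the actual patch.

```java
import java.nio.ByteBuffer;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for AbstractType<T>; only fixed-width elements are modeled.
interface ElementCodec<T> {
    int width();                          // serialized size of one element, in bytes
    void write(ByteBuffer out, T value);  // relative write, advances position
    T read(ByteBuffer in, int offset);    // absolute read, position unchanged
}

// Hypothetical codecs for two element types.
final class Codecs {
    private Codecs() {}

    static final ElementCodec<Float> FLOAT = new ElementCodec<Float>() {
        public int width() { return Float.BYTES; }
        public void write(ByteBuffer out, Float value) { out.putFloat(value); }
        public Float read(ByteBuffer in, int offset) { return in.getFloat(offset); }
    };

    static final ElementCodec<Integer> INT = new ElementCodec<Integer>() {
        public int width() { return Integer.BYTES; }
        public void write(ByteBuffer out, Integer value) { out.putInt(value); }
        public Integer read(ByteBuffer in, int offset) { return in.getInt(offset); }
    };
}

// Sketch of a vector type parameterized by element type and dimension count.
final class VectorTypeSketch<T> {
    private final ElementCodec<T> elements;
    private final int size;

    VectorTypeSketch(ElementCodec<T> elements, int size) {
        if (size <= 0) {
            throw new IllegalArgumentException("size must be positive");
        }
        this.elements = elements;
        this.size = size;
    }

    // Reject values whose length does not match the declared dimension.
    ByteBuffer serialize(List<T> values) {
        if (values.size() != size) {
            throw new IllegalArgumentException(
                "expected " + size + " elements, got " + values.size());
        }
        ByteBuffer out = ByteBuffer.allocate(size * elements.width());
        for (T value : values) {
            elements.write(out, value);
        }
        out.flip();
        return out;
    }

    List<T> deserialize(ByteBuffer in) {
        List<T> values = new ArrayList<>(size);
        for (int i = 0; i < size; i++) {
            values.add(elements.read(in, in.position() + i * elements.width()));
        }
        return values;
    }
}
```

The same class then covers float and int vectors by swapping the codec, for 
example new VectorTypeSketch<>(Codecs.FLOAT, 42), which is the kind of shape a 
type-system fuzz generator could exercise across element types.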

> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow 
> consistency if we add other float types like FLOAT16 or FLOAT64

I do not think we should add a new FLOAT32 type, but I am cool with an alias 
that has FLOAT32 point to FLOAT.  One negative is that the code paths where we 
return the schema back to users would show FLOAT even if the user wrote 
FLOAT32; other than that, I don’t see any other problems.

> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance 
> of clarity, conciseness, and extensibility. It is more natural in its word 
> order than the original proposal and avoids unnecessary qualifiers, while 
> still being clear about the data type it represents. Finally, this syntax is 
> straightforwardly extensible should we choose to support other vector types in 
> the future.

My preference is TYPE[n_dimension], but I am OK with this syntax if others 
prefer it.  I don’t agree that the extra verbosity adds clarity; there seems to 
be an assumption that it will tell users that random access isn’t allowed and 
that only blessed types are allowed. I feel neither point is valid (or at least 
I have not seen anything published on why they should be).  There is a 
difference between what a type “could” do and what we implement on day 1, and I 
wouldn’t want to add verbosity because of the intentions of the day 1 
implementation.

> On Apr 26, 2023, at 7:31 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
> 
> Hi all,
> 
> Splitting this out per the suggestion in the initial VS thread so we can work 
> on driver support in parallel with the server-side changes.
> 
> I propose adding a new data type for vector search indexes:
> 
> FLOAT VECTOR[N_DIMENSIONS]
> 
> In the initial commits and thread, this was DENSE FLOAT32. Nobody really 
> loved that, so we considered a bunch of alternatives, including
> 
> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which 
> would make it familiar for many users. However, this syntax raises the 
> question of why arrays cannot be created for other types.  Additionally, the 
> expectation for an array is to provide random access to its contents, which 
> is not supported for vectors.
> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense 
> vectors, not sparse ones. However, since Lucene had sparse vector support in 
> the past but removed it for lack of compelling use cases, it is unlikely that 
> it will be added back, making the "DENSE" qualifier less relevant.
> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with 
> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the 
> reasons mentioned above.
> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less 
> natural word order.
> - `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again 
> this would imply that random access is supported, which we want to avoid 
> doing.
> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and 
> could make it difficult to add other vector types, such as byte vectors 
> (already supported by Lucene), in the future.
> 
> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow 
> consistency if we add other float types like FLOAT16 or FLOAT64, both of 
> which are sometimes used in ML. However, we already have a CQL data type for 
> a 64-bit float (`DOUBLE`), so it would make more sense to add future variants 
> (which remain hypothetical at this point) along that line instead.
> 
> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance 
> of clarity, conciseness, and extensibility. It is more natural in its word 
> order than the original proposal and avoids unnecessary qualifiers, while 
> still being clear about the data type it represents. Finally, this syntax is 
> straightforwardly extensible should we choose to support other vector types in 
> the future.
> 
> -- 
> Jonathan Ellis
> co-founder, http://www.datastax.com <http://www.datastax.com/>
> @spyced
