Thanks for starting this thread!

> In the initial commits and thread, this was DENSE FLOAT32. Nobody really
> loved that, so we considered a bunch of alternatives, including
>
> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which
> would make it familiar for many users. However, this syntax raises the
> question of why arrays cannot be created for other types. Additionally, the
> expectation for an array is to provide random access to its contents, which
> is not supported for vectors.
> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense
> vectors, not sparse ones. However, since Lucene had sparse vector support in
> the past but removed it for lack of compelling use cases, it is unlikely that
> it will be added back, making the "DENSE" qualifier less relevant.
> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with
> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the
> reasons mentioned above.
> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less
> natural word order.
> - `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again
> this would imply that random access is supported, which we want to avoid
> doing.
> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and
> could make it difficult to add other vector types, such as byte vectors
> (already supported by Lucene), in the future.

I didn't look closely enough when I saw your patch: is this type multicell or not? In other words, does it act like a fixed-size frozen<array<float>>? I had assumed it's non-multicell.

The main reason I ask now is the pushback on random access. Let's say I have the following table:

    CREATE TABLE fluffy_kittens (
        pk int PRIMARY KEY,
        vector FLOAT[42] -- don't ask why fluffy kittens need a vector, they just do!
    );

If I run the following query, I would expect it to work:

    SELECT vector[7] FROM fluffy_kittens WHERE pk=0; -- 7 is less than 42

While working on Accord's CQL integration, Caleb and I kept getting bitten by frozen vs. non-frozen behavior. Many cases just stop working on frozen collections, even though they should be easy to support (we already force the user to load the full value, so why can we not touch it?).

Now, back to the random access comment: assuming this is not multicell, why would random access be blocked? If the type's isValueLengthFixed() == true, then random access should be simple (otherwise it requires either walking the array in order or fully deserializing the BB; if we are working with Lucene, I assume we have already deserialized out of the BB). I am just trying to flesh out whether there is a limitation that has not been brought up, or whether this is an attempt to limit the scope of access for easier testing.

> However, this syntax raises the question of why arrays cannot be created for
> other types

I left this comment in the other thread: why not? This could be useful outside the float use case, so having a new "VectorType(AbstractType<T> elements, int size)" is easier/better than a float-only version.
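
A minimal sketch of the fixed-length random access described above, assuming "BB" refers to a java.nio.ByteBuffer holding the elements back to back as 4-byte floats. The class and method names (FixedVectorAccess, floatAt, serialize) are hypothetical and are not taken from the patch or from the existing type system:

```java
import java.nio.ByteBuffer;

// Hypothetical illustration only; not code from the patch.
final class FixedVectorAccess {
    // Reads element `index` directly from the serialized form. Because every
    // element has the same length, the offset is a single multiplication and
    // nothing else in the buffer has to be decoded.
    static float floatAt(ByteBuffer serialized, int index, int dimensions) {
        if (index < 0 || index >= dimensions)
            throw new IndexOutOfBoundsException(
                "index " + index + " is out of range for " + dimensions + " dimensions");
        // Absolute read: the buffer's position is left unchanged.
        return serialized.getFloat(serialized.position() + index * Float.BYTES);
    }

    // Assumed layout for this sketch: the floats written one after another.
    static ByteBuffer serialize(float[] values) {
        ByteBuffer bb = ByteBuffer.allocate(values.length * Float.BYTES);
        for (float v : values)
            bb.putFloat(v);
        bb.flip();
        return bb;
    }

    public static void main(String[] args) {
        float[] values = new float[42];
        for (int i = 0; i < values.length; i++)
            values[i] = i * 0.5f;
        ByteBuffer bb = serialize(values);
        System.out.println(floatAt(bb, 7, 42)); // prints 3.5
    }
}
```

With variable-length elements the offset of element 7 is not known up front, which is the "walk the array in order" case mentioned above.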

I also did a lot of work to fuzz test our type system, so just adding this to the existing generator would give good coverage right off the bat. (I have another fuzz tester that I have not contributed yet. It was done for Accord and fuzz tests the AST, so it would be easy to add this there too; that would test type-specific access, which the existing tests don't.)

> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow
> consistency if we add other float types like FLOAT16 or FLOAT64

I do not think we should add a new FLOAT32 type, but I am cool with an alias that has FLOAT32 point to FLOAT. One negative is that the code paths where we return the schema to users would say FLOAT even if the user wrote FLOAT32. Other than that, I don't see any problems.

> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance
> of clarity, conciseness, and extensibility. It is more natural in its word
> order than the original proposal and avoids unnecessary qualifiers, while
> still being clear about the data type it represents. Finally, this syntax is
> straightforwardly extensible should we choose to support other vector types in
> the future.

My preference is TYPE[n_dimension], but I am OK with this syntax if others prefer it. I don't agree that the extra verbosity adds clarity. There seems to be an assumption that it will tell users that random access isn't allowed and that only blessed types are allowed. I feel neither point is valid (or at least I have not seen anything published explaining why they should be). There is a difference between what a type "could" do and what we implement on day 1, and I wouldn't want to add more verbosity because of the intentions of the day 1 implementation.

> On Apr 26, 2023, at 7:31 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> Hi all,
>
> Splitting this out per the suggestion in the initial VS thread so we can work
> on driver support in parallel with the server-side changes.
>
> I propose adding a new data type for vector search indexes:
>
> FLOAT VECTOR[N_DIMENSIONS]
>
> In the initial commits and thread, this was DENSE FLOAT32. Nobody really
> loved that, so we considered a bunch of alternatives, including
>
> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which
> would make it familiar for many users. However, this syntax raises the
> question of why arrays cannot be created for other types. Additionally, the
> expectation for an array is to provide random access to its contents, which
> is not supported for vectors.
> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense
> vectors, not sparse ones. However, since Lucene had sparse vector support in
> the past but removed it for lack of compelling use cases, it is unlikely that
> it will be added back, making the "DENSE" qualifier less relevant.
> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with
> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the
> reasons mentioned above.
> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less
> natural word order.
> - `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again
> this would imply that random access is supported, which we want to avoid
> doing.
> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and
> could make it difficult to add other vector types, such as byte vectors
> (already supported by Lucene), in the future.
>
> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow
> consistency if we add other float types like FLOAT16 or FLOAT64, both of
> which are sometimes used in ML. However, we already have a CQL data type for
> a 64-bit float (`DOUBLE`), so it would make more sense to add future variants
> (which remain hypothetical at this point) along that line instead.
>
> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance
> of clarity, conciseness, and extensibility. It is more natural in its word
> order than the original proposal and avoids unnecessary qualifiers, while
> still being clear about the data type it represents. Finally, this syntax is
> straightforwardly extensible should we choose to support other vector types in
> the future.
>
> --
> Jonathan Ellis
> co-founder, http://www.datastax.com
> @spyced