Benedict’s comments also make me wonder: can any of the values in the vector be null? The patch as sent works with float arrays, so null isn’t possible. Is null not valid for a vector type? If so, that would help justify why a vector is not an array or a list (both allow null).
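To make the null question concrete in Java terms, here is a minimal sketch (not taken from the patch; the class and helper names are made up for illustration). It only shows the language-level difference between a primitive `float[]` and a boxed `List<Float>`, and says nothing about what CQL should permit.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration class; nothing here comes from the patch.
class NullElements {
    // Hypothetical helper: counts the null elements in a boxed list.
    static int countNulls(List<Float> values) {
        int nulls = 0;
        for (Float v : values) {
            if (v == null) {
                nulls++;
            }
        }
        return nulls;
    }

    public static void main(String[] args) {
        // Primitive array: every slot holds a float (default 0.0f), never null.
        float[] vector = new float[3];
        // vector[0] = null; // would not compile: null is not a float

        // Boxed list: individual elements may be null.
        List<Float> list = new ArrayList<>();
        list.add(1.0f);
        list.add(null);

        System.out.println(vector[0]);        // prints 0.0
        System.out.println(countNulls(list)); // prints 1
    }
}
```

So if the on-disk and in-memory representation is a plain float array, "no null elements" falls out of the representation rather than needing a separate validation rule.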
> On Apr 26, 2023, at 10:50 AM, David Capwell <dcapw...@apple.com> wrote:
>
> Thanks for starting this thread!
>
>> In the initial commits and thread, this was DENSE FLOAT32. Nobody really
>> loved that, so we considered a bunch of alternatives, including
>>
>> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which
>> would make it familiar for many users. However, this syntax raises the
>> question of why arrays cannot be created for other types. Additionally, the
>> expectation for an array is to provide random access to its contents, which
>> is not supported for vectors.
>> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense
>> vectors, not sparse ones. However, since Lucene had sparse vector support in
>> the past but removed it for lack of compelling use cases, it is unlikely
>> that it will be added back, making the "DENSE" qualifier less relevant.
>> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with
>> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the
>> reasons mentioned above.
>> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less
>> natural word order.
>> - `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again
>> this would imply that random access is supported, which we want to avoid
>> doing.
>> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and
>> could make it difficult to add other vector types, such as byte vectors
>> (already supported by Lucene), in the future.
>
> I didn’t look closely enough when I saw your patch: is this type multicell or
> not? That is, does it act like a frozen<array<float>> of fixed size? I had
> assumed it’s non-multicell. The main reason I ask now is the pushback on
> random access. Let’s say I have the following table:
>
> CREATE TABLE fluffy_kittens (
>   pk int PRIMARY KEY,
>   vector FLOAT[42] -- don’t ask why fluffy kittens need a vector, they just do!
> )
>
> If I do the following query, I would expect it to work:
>
> SELECT vector[7] FROM fluffy_kittens WHERE pk=0; -- 7 is less than 42
>
> While working on Accord’s CQL integration, Caleb and I kept getting bitten by
> frozen vs. non-frozen behavior: many cases just stopped working on frozen
> collections, yet should be easy to add (we already force the user to load the
> full value, so why can we not touch it?).
>
> Now, back to the random access comment: assuming this is not multicell, why
> would random access be blocked? If the type has isValueLengthFixed() == true,
> then random access should be simple (otherwise it requires walking the array
> in order or fully deserializing the BB; if working with Lucene, I assume we
> have already deserialized out of the BB). I am just trying to flesh out
> whether there is a limitation not being brought up, or whether this is trying
> to limit the scope of access for easier testing.
>
>> However, this syntax raises the question of why arrays cannot be created for
>> other types
>
> I left this comment in the other thread: why not? This could be useful outside
> the float use case, so having a new "VectorType(AbstractType<T> elements, int
> size)" is easier/better than a float-only version. I also did a lot of work
> to fuzz test our type system, so just adding that into the existing generator
> would get good coverage right off the bat. (I have another fuzz tester I have
> not contributed yet; it was done for Accord. It fuzz tests the AST, so it
> would be easy to add this there, and that would test type-specific access,
> which the existing tests don’t.)
>
>> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow
>> consistency if we add other float types like FLOAT16 or FLOAT64
>
> I do not think we should add a new FLOAT32 type, but I am cool with an alias
> that has FLOAT32 point to FLOAT. One negative of this is that the code paths
> where we return schema back to users would show FLOAT even if the user wrote
> FLOAT32. Other than that negative, I don’t see any other problems.
>
>> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance
>> of clarity, conciseness, and extensibility. It is more natural in its word
>> order than the original proposal and avoids unnecessary qualifiers, while
>> still being clear about the data type it represents. Finally, this syntax is
>> straightforwardly extensible should we choose to support other vector types
>> in the future.
>
> My preference is TYPE[n_dimension], but I am ok with this syntax if others
> prefer it. I don’t agree that this extra verbosity adds more clarity. There
> seems to be an assumption that it will tell users that random access isn’t
> allowed and that only blessed types are allowed; I feel neither point is
> valid (or I have not seen anything published on why they should be valid).
> There is a difference between what a type “could” do and what we implement on
> day 1, and I wouldn’t want to add more verbosity because of the intentions of
> the day 1 implementation.
>
>
>> On Apr 26, 2023, at 7:31 AM, Jonathan Ellis <jbel...@gmail.com> wrote:
>>
>> Hi all,
>>
>> Splitting this out per the suggestion in the initial VS thread so we can
>> work on driver support in parallel with the server-side changes.
>>
>> I propose adding a new data type for vector search indexes:
>>
>> FLOAT VECTOR[N_DIMENSIONS]
>>
>> In the initial commits and thread, this was DENSE FLOAT32. Nobody really
>> loved that, so we considered a bunch of alternatives, including
>>
>> - `FLOAT[N]`: This minimal option resembles C and Java array syntax, which
>> would make it familiar for many users. However, this syntax raises the
>> question of why arrays cannot be created for other types. Additionally, the
>> expectation for an array is to provide random access to its contents, which
>> is not supported for vectors.
>> - `DENSE FLOAT[N]`: This option clarifies that we are supporting dense
>> vectors, not sparse ones. However, since Lucene had sparse vector support in
>> the past but removed it for lack of compelling use cases, it is unlikely
>> that it will be added back, making the "DENSE" qualifier less relevant.
>> - `DENSE FLOAT VECTOR[N]`: This is the most verbose option and aligns with
>> the CQL/SQL spirit. However, the "DENSE" qualifier is unnecessary for the
>> reasons mentioned above.
>> - `VECTOR FLOAT[N]`: This option omits the "DENSE" qualifier, but has a less
>> natural word order.
>> - `VECTOR<FLOAT, N>`: This follows the syntax of our Collections, but again
>> this would imply that random access is supported, which we want to avoid
>> doing.
>> - `VECTOR[N]`: This syntax is not very clear about the vector's contents and
>> could make it difficult to add other vector types, such as byte vectors
>> (already supported by Lucene), in the future.
>>
>> Finally, the original qualifier of 32 in `FLOAT32` was intended to allow
>> consistency if we add other float types like FLOAT16 or FLOAT64, both of
>> which are sometimes used in ML. However, we already have a CQL data type for
>> a 64-bit float (`DOUBLE`), so it would make more sense to add future
>> variants (which remain hypothetical at this point) along that line instead.
>>
>> Thus, we believe that `FLOAT VECTOR[N_DIMENSIONS]` provides the best balance
>> of clarity, conciseness, and extensibility. It is more natural in its word
>> order than the original proposal and avoids unnecessary qualifiers, while
>> still being clear about the data type it represents. Finally, this syntax is
>> straightforwardly extensible should we choose to support other vector types
>> in the future.
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
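
On the random-access point in the quoted reply: when every element has the same serialized length, reading element i is just an offset computation, with no need to walk or deserialize the whole value. The sketch below is a rough illustration under an assumed layout (N big-endian 4-byte floats back to back, no length prefix); that layout, the class, and the method names are all made up here and are not the patch's actual serialization format.

```java
import java.nio.ByteBuffer;

// Hypothetical illustration class; the layout is an assumption, not the patch's format.
class FixedLengthAccess {
    // Packs the floats back to back, 4 bytes each, in the buffer's default
    // big-endian order, and returns a buffer ready for reading.
    static ByteBuffer serialize(float[] values) {
        ByteBuffer bb = ByteBuffer.allocate(values.length * Float.BYTES);
        for (float v : values) {
            bb.putFloat(v);
        }
        bb.flip();
        return bb;
    }

    // Reads element `index` with an absolute get at a computed offset.
    // The buffer's position is left untouched and no other element is read.
    static float elementAt(ByteBuffer bb, int index) {
        int count = bb.remaining() / Float.BYTES;
        if (index < 0 || index >= count) {
            throw new IndexOutOfBoundsException("index " + index + ", size " + count);
        }
        return bb.getFloat(bb.position() + index * Float.BYTES);
    }

    public static void main(String[] args) {
        ByteBuffer bb = serialize(new float[] {1.5f, 2.5f, 3.5f});
        System.out.println(elementAt(bb, 2)); // prints 3.5
    }
}
```

With variable-length elements the offset cannot be computed this way, which is the case where walking the value in order (or fully deserializing it) would be needed.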