Interesting proposal, Jonathan. Will grok it over the weekend and play around with the branch.
Do you intend to make this part of CEP-7, or an incremental update to SAI once it is committed?

> On Apr 21, 2023, at 2:19 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
>
> Happy Friday, everyone!
>
> Rich text formatting ahead; I've attached a PDF for those who prefer that.
>
> I propose adding approximate nearest neighbor (ANN) vector search capability to Apache Cassandra via storage-attached indexes (SAI). This is a medium-sized effort that will significantly enhance Cassandra’s functionality, particularly for AI use cases. This addition will not only provide a new and important feature for existing Cassandra users, but also attract new users to the community from the AI space, further expanding Cassandra’s reach and relevance.
>
> Introduction
>
> Vector search is a powerful document search technique that enables developers to quickly find relevant content within an extensive collection of documents. It is useful as a standalone technique, but it is particularly hot now because it significantly enhances the performance of LLMs.
>
> Vector search uses ML models to match the semantics of a question rather than just the words it contains, avoiding the classic false positives and false negatives associated with term-based search. Alessandro Benedetti gives some good examples in his excellent talk <https://www.youtube.com/watch?v=z-i8mOHAhlU>.
>
> You can search across any set of vectors, which are just ordered sets of numbers. In the context of natural language queries and document search, we are specifically concerned with a type of vector called an embedding.
>
> An embedding is a high-dimensional vector that captures the underlying semantic relationships and contextual information of words or phrases. Embeddings are generated by ML models trained for this purpose; OpenAI provides an API to do this, but open-source and self-hostable models like BERT are also popular.
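[To make the embedding idea concrete, here is a minimal sketch of comparing two embeddings with cosine similarity. The vectors and labels below are made up for illustration; real models such as BERT or OpenAI's embedding API emit hundreds to thousands of dimensions.]

```python
import math

def cosine_similarity(a, b):
    """Values near 1.0 mean 'pointing the same way' (semantically
    similar); values near 0 mean unrelated."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

# Hypothetical 4-dimensional embeddings; the numbers are invented
# purely to illustrate the comparison.
query_vec = [0.1, 0.9, 0.2, 0.4]   # "how do I reset my password"
close_doc = [0.1, 0.8, 0.3, 0.5]   # a password-recovery help page
far_doc   = [0.9, 0.1, 0.7, 0.0]   # an unrelated billing page

print(cosine_similarity(query_vec, close_doc))  # ≈ 0.99
print(cosine_similarity(query_vec, far_doc))    # ≈ 0.28
```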
> Creating more accurate and smaller embeddings is an active research area in ML.
>
> Large language models (LLMs) can be described as a mile wide and an inch deep. They are not experts on any narrow domain (although they will hallucinate that they are, sometimes convincingly). You can remedy this by giving the LLM additional context for your query, but the context window is small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you want to be very selective about giving the LLM the most relevant possible information.
>
> Vector search is red-hot now because it allows us to easily answer the question “what are the most relevant documents to provide as context” by performing a similarity search between the embedding vector of the query and those of your document universe. Doing exact search is prohibitively expensive, since you necessarily have to compare against each and every document; this is intractable when you have billions or trillions of docs. However, there are well-understood algorithms for turning this into a logarithmic problem if you are willing to accept approximately the most similar documents. This is the “approximate nearest neighbor” problem. (You will see these referred to as kNN – k nearest neighbors – or ANN.)
>
> Pinecone DB has a good example of what this looks like in Python code <https://docs.pinecone.io/docs/gen-qa-openai>.
>
> Vector search is the foundation underlying effectively all of the AI applications that are launching now. This is particularly relevant to Apache Cassandra users, who tend to manage the types of large datasets that benefit the most from fast similarity search. Adding vector search to Cassandra’s unique strengths of scale, reliability, and low latency will further enhance its appeal and effectiveness for these users while also making it more attractive to newcomers looking to harness AI’s potential.
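[For reference, this is the O(n) exact search that the proposal calls intractable at scale, sketched as brute-force kNN over a toy corpus. Everything here (the `exact_knn` helper, the 2-dimensional "embeddings") is hypothetical, for illustration only; HNSW and similar ANN structures answer the same question approximately without scoring every document.]

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) *
                  math.sqrt(sum(x * x for x in b)))

def exact_knn(query, docs, k):
    """Exact k-nearest-neighbor search: score every single document.

    This is the brute force that becomes intractable at billions of
    documents; ANN indexes trade a little accuracy for roughly
    logarithmic query time.
    """
    ranked = sorted(docs, key=lambda d: cosine_similarity(query, docs[d]),
                    reverse=True)
    return ranked[:k]

# Toy 2-dimensional vectors; real embeddings have far more dimensions.
docs = {"a": [1.0, 0.0], "b": [0.0, 1.0], "c": [0.7, 0.7]}
print(exact_knn([0.9, 0.1], docs, k=2))  # ['a', 'c']
```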
> The faster we deliver vector search, the more valuable it will be for this expanding user base.
>
> Requirements
>
> - Perform vector search as outlined in the Pinecone example above
> - Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL type
>   - This is also useful for “classic” ML applications that derive and serve their own feature vectors
> - Add ANN (approximate nearest neighbor) search
> - Work with normal Cassandra data flow
>   - Inserting one row at a time is fine; cannot require batch ingest
>   - Updating and deleting rows is also fine
> - Must compose with other SAI predicates as well as partition keys
>
> Not requirements
>
> - Other datatypes besides Float32
>   - Pinecone supports only Float32, and it’s hugely ahead in mindshare, so let’s make things easy on ourselves and follow their precedent.
> - I don’t want to scope creep beyond ANN. In particular, I don’t want to wait for ORDER BY to get exact search in as well.
>
> How we can do this
>
> There is exactly one production-quality, actively maintained, state-of-the-art ANN library for Java: Lucene’s HNSW. Elastic has been GA with ANN search backed by this library for just over a year; Solr added it not long after. As usual with Lucene code, it is not super concerned with being usable by third parties like Cassandra, but HNSW offers public APIs that are low-level enough that we don’t have to bring in Lucene baggage we don’t want.
>
> The SAI framework is flexible enough to allow plugging in an HNSW vector index while enjoying the baseline benefits of SAI and playing nice with the other SAI index types.
>
> What I have today
>
> I have a pre-alpha branch that demonstrates:
>
> - A new data type: DENSE FLOAT32 (CQL), DenseFloat32Type (internally)
> - A new SAI index: VectorMemtableIndex
> - A new CQL operator: ANN
>
> So this works:
>
> cqlsh> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 'replication_factor': 1};
> cqlsh> use test;
>
> cqlsh:test> create table foo(i int primary key, j dense float32);
> cqlsh:test> create custom index ann_index on foo(j) using 'StorageAttachedIndex';
>
> cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
> cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
> cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
> cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;
>
>  i | j
> ---+---------------------------------------------------------
>  5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
>
> For anything else you can imagine asking “does it do X?”, the answer is no, it doesn’t. It doesn’t save to disk, it doesn’t compose with other predicates, and it doesn’t have driver support. That last is why the answer comes back as raw hex bytes. The CQL parser handling the inserts is server-side and doesn’t need driver support, which is why we can do inserts and queries at all.
>
> (If you’re tempted to try this out, remember that “it doesn’t save to disk” means that if you do inserts first and then create the index, instead of the other way around, it will break, because we flush existing data as part of adding a new index to a table. So it’s quite fragile at the moment!)
>
> The branch is public at https://github.com/datastax/cassandra/tree/vsearch <https://github.com/jbellis/cassandra/tree/vsearch>. I based it on the DataStax converged Cassandra repo because that’s the only place to get fully working SAI for now. We plan to upstream vector search with the rest of SAI as soon as it’s ready.
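[Until driver support lands, the raw bytes above can be decoded by hand. The layout appears to be a 4-byte big-endian element count followed by that many big-endian IEEE-754 float32 values; that is inferred from this one output, not a documented format, so treat it as a guess.]

```python
import struct

# Raw bytes returned by the SELECT above (row 5, vector [23, 18, 3.9]).
raw = b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'

# Assumed layout: uint32 count, then `count` big-endian float32s.
(count,) = struct.unpack_from('>I', raw, 0)
vector = struct.unpack_from(f'>{count}f', raw, 4)

print(count)                           # 3
print([round(v, 2) for v in vector])   # [23.0, 18.0, 3.9]
```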
> What’s next
>
> I think we can get to alpha pretty quickly:
>
> - Add support for flushing, and for serving queries from disk without reading the whole graph into memory. (Lucene does this already; we will probably have to rewrite a bunch of the code for use in Cassandra, but the design is vetted and works.)
> - Make sure that scatter/gather works across nodes. This should Just Work, but I’ve already needed to tweak a couple of things that I thought would just work, so I’m calling this out.
>
> And then to be feature complete we would want to add:
>
> - CQL support for specifying the dimensionality of a column up front, e.g. DENSE FLOAT32(1500).
>   - Pinecone accepts the first vector as the Correct dimensions and throws an error if you give it one that doesn’t match, but this doesn’t work in a distributed system where the second vector might go to a different node than the first. Elastic requires specifying it up front, like I propose here.
> - Driver support
> - Support for updating the vector in a row that hasn’t been flushed yet, either by updating HNSW to support deletes, or by adding some kind of invalidation marker to the overwritten vector.
> - More performant inserts. Currently I have a big synchronized lock around the HNSW graph. So we either need to shard it, like we do for the other SAI indexes, or add fine-grained locking like ConcurrentSkipListMap’s to make HnswGraphBuilder concurrent-ish. I prefer the second option, since it allows us to avoid building the graph once in memory and then a second time on flush, but it might be more work than it appears.
> - Add index configurability options: similarity function and HNSW parameters M and ef.
> - Expose the similarity functions to CQL so you can SELECT x, cosine_similarity(x, query_vector) FROM …
>
> Special thanks to
>
> Mike Adamson and Zhao Yang, who made substantial contributions to the branch you see here and provided indispensable help understanding SAI.
>
> <Vector search for Cassandra.pdf>