Interesting proposal Jonathan. Will grok it over the weekend and play around 
with the branch.

Do you intend to make this part of CEP-7, or land it as an incremental update 
to SAI once that is committed?

> On Apr 21, 2023, at 2:19 PM, Jonathan Ellis <jbel...@gmail.com> wrote:
> 
> Happy Friday, everyone!
> 
> Rich text formatting ahead, I've attached a PDF for those who prefer that.
> 
> I propose adding approximate nearest neighbor (ANN) vector search capability 
> to Apache Cassandra via storage-attached indexes (SAI). This is a 
> medium-sized effort that will significantly enhance Cassandra’s 
> functionality, particularly for AI use cases. This addition will not only 
> provide a new and important feature for existing Cassandra users, but also 
> attract new users to the community from the AI space, further expanding 
> Cassandra’s reach and relevance.
> Introduction
> Vector search is a powerful document search technique that lets developers 
> quickly find relevant content within an extensive collection of documents. 
> It is useful as a standalone technique, but it is particularly hot right now 
> because it significantly enhances the performance of LLMs.
> 
> Vector search uses ML models to match the semantics of a question rather than 
> just the words it contains, avoiding the classic false positives and false 
> negatives associated with term-based search.  Alessandro Benedetti gives some 
> good examples in his excellent talk 
> <https://www.youtube.com/watch?v=z-i8mOHAhlU>:
> 
> <image.png>
> 
> <image.png>
> 
> You can search across any set of vectors, which are just ordered sets of 
> numbers.  In the context of natural language queries and document search, we 
> are specifically concerned with a type of vector called an embedding.  
> 
> An embedding is a high-dimensional vector that captures the underlying 
> semantic relationships and contextual information of words or phrases. 
> Embeddings are generated by ML models trained for this purpose; OpenAI 
> provides an API to do this, but open-source and self-hostable models like 
> BERT are also popular. Creating more accurate and smaller embeddings is an 
> active research area in ML.
> 
> Large language models (LLMs) can be described as a mile wide and an inch 
> deep. They are not experts on any narrow domain (although they will 
> hallucinate that they are, sometimes convincingly).  You can remedy this by 
> giving the LLM additional context for your query, but the context window is 
> small (4k tokens for GPT-3.5, up to 32k for GPT-4), so you want to be very 
> selective about giving the LLM the most relevant possible information.
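> 
> To make "be selective" concrete, here is a toy sketch of the packing step: 
> rank candidate documents by relevance and keep only what fits the context 
> budget.  Plain Python; the word-overlap score is a stand-in for real 
> embedding similarity, and a character budget stands in for the model's 
> token limit.
> 
```python
# Toy sketch of context selection for an LLM prompt.
# Word overlap stands in for embedding similarity; a character
# budget stands in for the model's token limit.

def overlap_score(query: str, doc: str) -> int:
    words = set(doc.lower().replace(".", "").split())
    return len(set(query.lower().split()) & words)

def build_context(query: str, docs: list, budget: int) -> str:
    ranked = sorted(docs, key=lambda d: overlap_score(query, d), reverse=True)
    picked, used = [], 0
    for doc in ranked:  # greedily pack the most relevant docs that fit
        if used + len(doc) <= budget:
            picked.append(doc)
            used += len(doc)
    return "\n".join(picked)

docs = [
    "Cassandra is a distributed database with tunable consistency.",
    "HNSW is a graph algorithm for approximate nearest neighbor search.",
    "The weather in Austin is warm today.",
]
context = build_context("how does nearest neighbor search work", docs, budget=80)
```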
> 
> Vector search is red-hot now because it allows us to easily answer the 
> question “what are the most relevant documents to provide as context” by 
> performing a similarity search between the embeddings vector of the query, 
> and those of your document universe.  Doing exact search is prohibitively 
> expensive, since you necessarily have to compare with each and every 
> document; this is intractable when you have billions or trillions of docs.  
> However, there are well-understood algorithms for turning this into a 
> logarithmic problem if you are willing to accept approximately the most 
> similar documents.  This is the “approximate nearest neighbor” problem.  (You 
> will see these referred to as kNN – k nearest neighbors – or ANN.)
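> 
> To ground that, here is what the exact version looks like in a few lines of 
> plain Python (illustrative names, not code from the branch): score every 
> stored vector against the query and keep the closest k.  This linear scan 
> is exactly what ANN indexes avoid.
> 
```python
import heapq
import math

# Exact (brute-force) k-nearest-neighbor search: compare the query
# against every stored vector.  Cost is O(N * dimensions) per query,
# which is what makes ANN indexes like HNSW necessary at scale.

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def exact_knn(query, vectors, k):
    """vectors: dict of id -> vector; returns ids of the k closest."""
    return heapq.nsmallest(k, vectors, key=lambda i: euclidean(query, vectors[i]))

vectors = {"a": [1.0, 0.0], "b": [0.9, 0.1], "c": [0.0, 1.0]}
nearest = exact_knn([1.0, 0.05], vectors, k=2)
# nearest == ["a", "b"]
```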
> 
> Pinecone DB has a good example of what this looks like in Python code 
> <https://docs.pinecone.io/docs/gen-qa-openai>.
> 
> Vector search is the foundation underlying effectively all of the AI 
> applications that are launching now.  This is particularly relevant to Apache 
> Cassandra users, who tend to manage the types of large datasets that benefit 
> the most from fast similarity search. Adding vector search to Cassandra’s 
> unique strengths of scale, reliability, and low latency, will further enhance 
> its appeal and effectiveness for these users while also making it more 
> attractive to newcomers looking to harness AI’s potential.  The faster we 
> deliver vector search, the more valuable it will be for this expanding user 
> base.
> Requirements
> Perform vector search as outlined in the Pinecone example above
> Support Float32 embeddings in the form of a new DENSE FLOAT32 CQL type
> This is also useful for “classic” ML applications that derive and serve their 
> own feature vectors
> Add ANN (approximate nearest neighbor) search.
> Work with normal Cassandra data flow
> Inserting one row at a time is fine; cannot require batch ingest
> Updating, deleting rows is also fine
> Must compose with other SAI predicates as well as partition keys
> Not requirements
> Other datatypes besides Float32
> Pinecone supports only Float32, and it's hugely ahead in mindshare, so let's 
> make things easy on ourselves and follow their precedent.
> I don't want to let scope creep beyond ANN. In particular, I don't want to 
> wait for ORDER BY to get exact search in as well.
> How we can do this
> There is exactly one production quality, actively maintained, 
> state-of-the-art ANN library for Java: Lucene’s HNSW.  Elastic has been GA 
> with ANN search backed by this library for just over a year; Solr added it 
> not long after.  As usual with Lucene code, it is not especially concerned 
> with being usable by third parties like Cassandra, but HNSW offers public 
> APIs that are low-level enough that we don't have to bring in Lucene baggage 
> that we don't want.
> 
> The SAI framework is flexible enough to allow plugging in an HNSW vector 
> index while enjoying the baseline benefits of SAI and playing nice with the 
> other SAI index types.
> What I have today
> I have a pre-alpha branch that demonstrates
> A new data type, DENSE FLOAT32 (CQL) DenseFloat32Type (internally).
> A new SAI index, VectorMemtableIndex
> A new CQL operator, ANN
> 
> So this works:
> cqlsh> create KEYSPACE test WITH replication = {'class': 'SimpleStrategy', 
> 'replication_factor': 1};
> cqlsh> use test;
> 
> cqlsh:test> create table foo(i int primary key, j dense float32);
> cqlsh:test> create custom index ann_index on foo(j) using 
> 'StorageAttachedIndex';
> 
> cqlsh:test> insert into foo (i, j) values (1, [8, 2.3, 58]);
> cqlsh:test> insert into foo (i, j) values (2, [1.2, 3.4, 5.6]);
> cqlsh:test> insert into foo (i, j) values (5, [23, 18, 3.9]);
> cqlsh:test> select * from foo where j ann [3.4, 7.8, 9.1] limit 1;
> 
>  i | j
> ---+---------------------------------------------------------
>  5 | b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
> 
> For anything else you can imagine asking "does it do X?", the answer is no, 
> it doesn't.  It doesn't save to disk, it doesn't compose with other 
> predicates, it doesn't have driver support.  That last is why the answer 
> comes back as raw hex bytes.  The CQL parsing for the inserts happens 
> server-side and doesn't need driver support, which is why we can do inserts 
> and queries at all.
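> 
> You can decode those bytes by hand, though.  If we assume the layout is a 
> big-endian int32 element count followed by that many big-endian IEEE-754 
> float32 values (inferred from the bytes shown, not checked against the 
> serializer), the row round-trips to the vector inserted for i=5:
> 
```python
import struct

# Decode the raw bytes cqlsh printed for row i=5.  Assumed layout:
# big-endian int32 element count, then that many big-endian float32s.
raw = b'\x00\x00\x00\x03A\xb8\x00\x00A\x90\x00\x00@y\x99\x9a'
count = struct.unpack_from('>i', raw)[0]             # 3 elements
values = struct.unpack_from('>%df' % count, raw, 4)
# values is approximately (23.0, 18.0, 3.9)
```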
> 
> (If you’re tempted to try this out, remember that “it doesn’t save to disk” 
> means that if you do inserts first and then create the index, instead of the 
> other way around, it will break, because we flush existing data as part of 
> adding a new index to a table.  So it’s quite fragile at the moment!)
> 
> The branch is public at https://github.com/datastax/cassandra/tree/vsearch 
> <https://github.com/jbellis/cassandra/tree/vsearch>
> I based it on the DataStax converged cassandra repo because that’s the only 
> place to get fully working SAI for now.  We plan to upstream vector search 
> with the rest of SAI as soon as it’s ready.
> What’s next
> I think we can get to alpha pretty quickly:
> Add support for flushing, and for serving queries from disk without reading 
> the whole graph into memory.  (Lucene already does this; we will probably 
> have to rewrite a bunch of the code for use in Cassandra, but the design is 
> vetted and works.)
> Make sure that scatter/gather works across nodes.  This should Just Work, 
> but I've already needed to tweak a couple of things that I thought would 
> just work, so I'm calling it out.
> And then to be feature complete we would want to add:
> CQL support for specifying the dimensionality of a column up front, e.g. 
> DENSE FLOAT32(1500).
> Pinecone treats the first vector it receives as defining the correct 
> dimensionality and throws an error if you give it one that doesn't match, 
> but this doesn't work in a distributed system, where the second vector might 
> go to a different node than the first.  Elastic requires specifying it up 
> front, as I propose here.
> Driver support
> Support for updating the vector in a row that hasn’t been flushed yet, either 
> by updating HNSW to support deletes, or by adding some kind of invalidation 
> marker to the overwritten vector.
> More performant inserts.  Currently I have a big synchronized lock around 
> the HNSW graph.  So we either need to shard it, like we do for the other SAI 
> indexes, or add fine-grained locking in the style of ConcurrentSkipListMap 
> to make HnswGraphBuilder concurrent-ish.  I prefer the second option, since 
> it allows us to avoid building the graph once in memory and then a second 
> time on flush, but it might be more work than it appears.
> Add index configurability options: similarity function and HNSW parameters M 
> and ef.
> Expose the similarity functions to CQL so you can SELECT x, 
> cosine_similarity(x, query_vector) FROM …
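> 
> For reference, the math behind that last one is tiny.  A sketch of what a 
> cosine_similarity function computes (the standard formula, not code from 
> the branch):
> 
```python
import math

# Cosine similarity: dot(a, b) / (|a| * |b|).
# 1.0 means same direction, 0.0 means orthogonal.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

sim = cosine_similarity([1.0, 2.0], [2.0, 4.0])  # parallel vectors -> ~1.0
```
> 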
> Special thanks to
> Mike Adamson and Zhao Yang made substantial contributions to the branch you 
> see here and provided indispensable help understanding SAI.
> 
> <Vector search for Cassandra.pdf>
