Just wanted to add that I don't have any special knowledge of CEP-30 beyond what Jonathan posted and just trying to help clarify and answer questions as I can with some knowledge and experience from DSE Search and SAI. Thanks to Caleb for helping validate some things as well. And to be clear about partial results - the default with DSE Search at least is to fail a query if it can't get the full token range coverage. However there is an option to allow for shards being unavailable and return partial results.
> On May 9, 2023, at 3:38 PM, Jeremy Hanna <jeremy.hanna1...@gmail.com> wrote: > > I talked to David and some others in slack to hopefully clarify: > > With SAI, can you have partial results? When you have a query that is > non-key based, you need to have full token range coverage of the results. If > that isn't possible, will Vector Search/SAI return partial results? > > Anything can happen in the implementation, but for scoring, it may not make > sense to return partial results because it's misleading. For non-global > queries, it could or couldn't return partial results depending on > implementation/configuration. In DSE you could have partial results > depending on the options. However I couldn't find partial results defined > in CEP-7 or CEP-30. > > The other questions are about scoring. > > First, how is ordering/scoring done? > > Each replica returns back to the coordinator a sorted set of results and the > coordinator will have to see all of the results globally in order to do a > global ordering. You can't know what the top result is unless you've seen > everything. As to the scoring, I'm not sure how that will get calculated. > > Second, if I am ordering the results like for a Vector Search and I want to > have the top 1 result. How is the scoring done and what happens if there are > 20 that have the same score? How will the coordinator decide which 1 is > returned out of 20? > > It returns results in token/partition and then clustering order. > >> On May 9, 2023, at 2:53 PM, Caleb Rackliffe <calebrackli...@gmail.com> wrote: >> >> Anyone on this ML who still remembers DSE Search (or has experience w/ >> Elastic or SolrCloud) probably also knows that there are some significant >> pieces of an optimized scatter/gather apparatus for IR (even without >> sorting, which also doesn't exist yet) that do not exist in C* or it's range >> query system (which SAI and all other 2i implementations use). SAI, like all >> C* 2i implementations, is still a local index, and as that is the case, >> anything built on it will perform best in partition-scoped (at least on the >> read side) use-cases. (On the bright side, the project is moving toward >> larger partitions being a possibility.) With smaller clusters or use-cases >> that are extremely write-heavy/read-light, it's possible that the full >> scatter/gather won't be too onerous, especially w/ a few small tweaks (on >> top of a non-vnode cluster) to a.) keep fanout minimal and b.) keep >> range/index queries to a single pass to minimize latency. >> >> Whatever we do, we just need to avoid a situation down the road where users >> don't understand these nuances and hit a wall where they try to use this in >> a way that is fundamentally incompatible w/ the way the database >> scales/works. (I've done my best to call this out in all discussions around >> SAI over time, and there may even end up being further guardrails put in >> place to make it even harder to misuse it...but I digress.) >> >> Having said all that, I don't fundamentally have a problem w/ the proposal. >> >> On Tue, May 9, 2023 at 2:11 PM Benedict <bened...@apache.org >> <mailto:bened...@apache.org>> wrote: >>> HNSW can in principle be made into a distributed index. But that would be >>> quite a different paradigm to SAI. >>> >>>> On 9 May 2023, at 19:30, Patrick McFadin <pmcfa...@gmail.com >>>> <mailto:pmcfa...@gmail.com>> wrote: >>>> >>>> >>>> Under the goals section, there is this line: >>>> >>>> Scatter/gather across replicas, combining topK from each to get global >>>> topK. >>>> >>>> But what I'm hearing is, exactly how will that happen? Maybe this is an >>>> SAI question too. How is that verified in SAI? >>>> >>>> On Tue, May 9, 2023 at 11:07 AM David Capwell <dcapw...@apple.com >>>> <mailto:dcapw...@apple.com>> wrote: >>>>> Approach section doesn’t go over how this will handle cross replica >>>>> search, this would be good to flesh out… given results have a real >>>>> ranking, the current 2i logic may yield incorrect results… so would think >>>>> we need num_ranges / rf queries in the best case, with some new >>>>> capability to sort the results? If my assumption is correct, then how >>>>> errors are handled should also be fleshed out… Example: 1k cluster >>>>> without vnode and RF=3, so 333 queries fanned out to match, then >>>>> coordinator needs to sort… if 1 of the queries fails and can’t fall back >>>>> to peers… does the query fail (I assume so)? >>>>> >>>>>> On May 8, 2023, at 7:20 PM, Jonathan Ellis <jbel...@gmail.com >>>>>> <mailto:jbel...@gmail.com>> wrote: >>>>>> >>>>>> Hi all, >>>>>> >>>>>> Following the recent discussion threads, I would like to propose CEP-30 >>>>>> to add Approximate Nearest Neighbor (ANN) Vector Search via >>>>>> Storage-Attached Indexes (SAI) to Apache Cassandra. >>>>>> >>>>>> The primary goal of this proposal is to implement ANN vector search >>>>>> capabilities, making Cassandra more useful to AI developers and >>>>>> organizations managing large datasets that can benefit from fast >>>>>> similarity search. >>>>>> >>>>>> The implementation will leverage Lucene's Hierarchical Navigable Small >>>>>> World (HNSW) library and introduce a new CQL data type for vector >>>>>> embeddings, a new SAI index for ANN search functionality, and a new CQL >>>>>> operator for performing ANN search queries. >>>>>> >>>>>> We are targeting the 5.0 release for this feature, in conjunction with >>>>>> the release of SAI. The proposed changes will maintain compatibility >>>>>> with existing Cassandra functionality and compose well with the >>>>>> already-approved SAI features. >>>>>> >>>>>> Please find the full CEP document here: >>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes >>>>>> >>>>>> -- >>>>>> Jonathan Ellis >>>>>> co-founder, http://www.datastax.com <http://www.datastax.com/> >>>>>> @spyced >>>>> >