Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Jeremy Hanna Tue, 09 May 2023 14:39:09 -0700

Just wanted to add that I don't have any special knowledge of CEP-30 beyond 
what Jonathan posted and just trying to help clarify and answer questions as I 
can with some knowledge and experience from DSE Search and SAI.  Thanks to 
Caleb for helping validate some things as well.  And to be clear about partial 
results - the default with DSE Search at least is to fail a query if it can't 
get the full token range coverage.  However there is an option to allow for 
shards being unavailable and return partial results.


> On May 9, 2023, at 3:38 PM, Jeremy Hanna <[email protected]> wrote:
> 
> I talked to David and some others in slack to hopefully clarify:
> 
> With SAI, can you have partial results?  When you have a query that is 
> non-key based, you need to have full token range coverage of the results.  If 
> that isn't possible, will Vector Search/SAI return partial results?
> 
> Anything can happen in the implementation, but for scoring, it may not make 
> sense to return partial results because it's misleading.  For non-global 
> queries, it could or couldn't return partial results depending on 
> implementation/configuration.  In DSE you could have partial results 
> depending on the options.   However I couldn't find partial results defined 
> in CEP-7 or CEP-30.
> 
> The other questions are about scoring.
> 
> First, how is ordering/scoring done?
> 
> Each replica returns back to the coordinator a sorted set of results and the 
> coordinator will have to see all of the results globally in order to do a 
> global ordering.  You can't know what the top result is unless you've seen 
> everything.  As to the scoring, I'm not sure how that will get calculated.
> 
> Second, if I am ordering the results like for a Vector Search and I want to 
> have the top 1 result.  How is the scoring done and what happens if there are 
> 20 that have the same score?  How will the coordinator decide which 1 is 
> returned out of 20?
> 
> It returns results in token/partition and then clustering order.
> 
>> On May 9, 2023, at 2:53 PM, Caleb Rackliffe <[email protected]> wrote:
>> 
>> Anyone on this ML who still remembers DSE Search (or has experience w/ 
>> Elastic or SolrCloud) probably also knows that there are some significant 
>> pieces of an optimized scatter/gather apparatus for IR (even without 
>> sorting, which also doesn't exist yet) that do not exist in C* or it's range 
>> query system (which SAI and all other 2i implementations use). SAI, like all 
>> C* 2i implementations, is still a local index, and as that is the case, 
>> anything built on it will perform best in partition-scoped (at least on the 
>> read side) use-cases. (On the bright side, the project is moving toward 
>> larger partitions being a possibility.) With smaller clusters or use-cases 
>> that are extremely write-heavy/read-light, it's possible that the full 
>> scatter/gather won't be too onerous, especially w/ a few small tweaks (on 
>> top of a non-vnode cluster) to a.) keep fanout minimal and b.) keep 
>> range/index queries to a single pass to minimize latency.
>> 
>> Whatever we do, we just need to avoid a situation down the road where users 
>> don't understand these nuances and hit a wall where they try to use this in 
>> a way that is fundamentally incompatible w/ the way the database 
>> scales/works. (I've done my best to call this out in all discussions around 
>> SAI over time, and there may even end up being further guardrails put in 
>> place to make it even harder to misuse it...but I digress.)
>> 
>> Having said all that, I don't fundamentally have a problem w/ the proposal.
>> 
>> On Tue, May 9, 2023 at 2:11 PM Benedict <[email protected] 
>> <mailto:[email protected]>> wrote:
>>> HNSW can in principle be made into a distributed index. But that would be 
>>> quite a different paradigm to SAI.
>>> 
>>>> On 9 May 2023, at 19:30, Patrick McFadin <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>> 
>>>> 
>>>> Under the goals section, there is this line:
>>>> 
>>>> Scatter/gather across replicas, combining topK from each to get global 
>>>> topK.
>>>> 
>>>> But what I'm hearing is, exactly how will that happen? Maybe this is an 
>>>> SAI question too. How is that verified in SAI?
>>>> 
>>>> On Tue, May 9, 2023 at 11:07 AM David Capwell <[email protected] 
>>>> <mailto:[email protected]>> wrote:
>>>>> Approach section doesn’t go over how this will handle cross replica 
>>>>> search, this would be good to flesh out… given results have a real 
>>>>> ranking, the current 2i logic may yield incorrect results… so would think 
>>>>> we need num_ranges / rf queries in the best case, with some new 
>>>>> capability to sort the results?  If my assumption is correct, then how 
>>>>> errors are handled should also be fleshed out… Example: 1k cluster 
>>>>> without vnode and RF=3, so 333 queries fanned out to match, then 
>>>>> coordinator needs to sort… if 1 of the queries fails and can’t fall back 
>>>>> to peers… does the query fail (I assume so)?
>>>>> 
>>>>>> On May 8, 2023, at 7:20 PM, Jonathan Ellis <[email protected] 
>>>>>> <mailto:[email protected]>> wrote:
>>>>>> 
>>>>>> Hi all,
>>>>>> 
>>>>>> Following the recent discussion threads, I would like to propose CEP-30 
>>>>>> to add Approximate Nearest Neighbor (ANN) Vector Search via 
>>>>>> Storage-Attached Indexes (SAI) to Apache Cassandra.
>>>>>> 
>>>>>> The primary goal of this proposal is to implement ANN vector search 
>>>>>> capabilities, making Cassandra more useful to AI developers and 
>>>>>> organizations managing large datasets that can benefit from fast 
>>>>>> similarity search.
>>>>>> 
>>>>>> The implementation will leverage Lucene's Hierarchical Navigable Small 
>>>>>> World (HNSW) library and introduce a new CQL data type for vector 
>>>>>> embeddings, a new SAI index for ANN search functionality, and a new CQL 
>>>>>> operator for performing ANN search queries.
>>>>>> 
>>>>>> We are targeting the 5.0 release for this feature, in conjunction with 
>>>>>> the release of SAI. The proposed changes will maintain compatibility 
>>>>>> with existing Cassandra functionality and compose well with the 
>>>>>> already-approved SAI features.
>>>>>> 
>>>>>> Please find the full CEP document here: 
>>>>>> https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
>>>>>> 
>>>>>> -- 
>>>>>> Jonathan Ellis
>>>>>> co-founder, http://www.datastax.com <http://www.datastax.com/>
>>>>>> @spyced
>>>>> 
>

Re: CEP-30: Approximate Nearest Neighbor(ANN) Vector Search via Storage-Attached Indexes

Reply via email to