michaeljmarshall opened a new pull request, #4615:
URL: https://github.com/apache/cassandra/pull/4615

   (cherry picked from commit f2e34117d6a313be14156d92a3485f29424144c1)
   
   This PR implements the feature proposed in 
https://issues.apache.org/jira/browse/CASSANDRA-20086.
   
   ## Motivation
   
   Current ANN search does the following:
   
   1. Get top k vectors from each index segment (memory or sstable)
   2. Sort them in Primary Key order
   3. Consume and materialize all of of them (`LIMIT * numSegments`)
   4. Re-query with higher LIMIT if shadow keys are encountered 
(updates/overwrites)
   5. Re-score to sort by similarity score
   6. Take LIMIT best vectors
   7. Sort by Primary Key
   
   This flow is both inefficient due to multiple sorts/requeries and can 
produce invalid results in cases where a row has an update to a vector to a 
worse score.
   
   The new flow:
   
   1. Get top k vectors from each index segment (memory or sstable)
   2. Build score ordered iterators
   3. Merge those iterators, sorting by score descending
   4. Consume, materialize, and validate rows from the global iterator until we 
have LIMIT rows
   5. If an iterator is exhausted during step 4, requery the exhausted 
iterator's graph to get more vectors, using a bitset to ignore already visited 
nodes, until the graph is exhausted.
   6. Sort LIMIT rows by primary key.
   
   This change removes two sort operations. It also fixes all update cases by 
asserting that a row's position within the score ordered iterator is only valid 
if the vector cell's materialized value is from the same sstable.
   
   ## Other minor changes
   
   1.  Increased test coverage by parameterizing the logic to test brute force, 
graph search, and "graph's choice".
   2. Implemented custom brute force iterator that utilizes two priority queues 
to do the PQ based ranking then the full resolution ranking
   3. Fixed a bug in the mapping of `limit` to `topk` where we could get `topk 
< limit` for limit 1000
   
   There are still a number of TODOs. Some of these are left as possible jira 
tickets to create while others are questions for reviewers.
   
   I plan on completing some basic bench marks next week to demonstrate the 
value of this change. For now, I am opening this to start to get some reviews.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: [email protected]

For queries about this service, please contact Infrastructure at:
[email protected]


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to