[ 
https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17057911#comment-17057911
 ] 

Jim Ferenczi commented on LUCENE-9004:
--------------------------------------

I like this issue a lot and all the discussions around it, thanks all for 
working on this!

I'd like to share some of the findings we had while working on similar 
solutions for Elasticsearch. The known limitations of graph-based approaches 
are the computational cost of building the graph and the memory needed to 
store the neighbors of each node. Regarding the computational cost, inserting 
a new node is equivalent to running a query, so we should expect the number of 
comparisons needed to insert a node into the graph to grow logarithmically 
with the size of the data. We ran some tests to check the number of 
comparisons needed at the million scale and found that this number doesn't 
vary much across the datasets in the ann-benchmarks repo. To get good 
performance at search time, the efConstruction parameter needs to be set high 
(200 to 800 in the best results) while M (the maximum number of neighbors per 
node) can remain lower (16 to 64). This led to around 10k comparisons on 
average for the ann-benchmarks datasets in the 1-10M range.
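 
To make the link between build cost and query cost concrete, here is a minimal 
sketch of an HNSW-style insertion under the assumptions above. The class and 
method names (search, link, maxConn) are hypothetical, not an existing Lucene 
API:

{code:java}
import java.util.List;

/** Sketch only: hypothetical names, not an existing Lucene API. */
abstract class HnswBuilderSketch {
  final int maxConn;          // "M" above: max neighbors kept per node (16-64)
  final int efConstruction;   // candidate queue size while inserting (200-800)

  HnswBuilderSketch(int maxConn, int efConstruction) {
    this.maxConn = maxConn;
    this.efConstruction = efConstruction;
  }

  /** Greedy graph search, the same routine a query runs; returns doc ids sorted by distance. */
  abstract List<Integer> search(float[] vector, int ef);

  /** Adds a directed link and prunes the target's neighbor list back to maxConn if needed. */
  abstract void link(int fromDoc, int toDoc);

  /** Inserting a node is a query followed by linking, hence build cost tracks query cost. */
  void insert(int docId, float[] vector) {
    List<Integer> nearest = search(vector, efConstruction);
    for (int neighbor : nearest.subList(0, Math.min(maxConn, nearest.size()))) {
      link(docId, neighbor);    // link both ways so the graph stays navigable
      link(neighbor, docId);
    }
  }
}
{code}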

10k comparisons for the 1-10M range at query time is very compelling. Users 
can also trade some recall for performance and get acceptable results in the 
1-10k comparison range. However, these trade-offs are more difficult to apply 
at build time, where the quality of the graph is important to maintain. I 
mainly see this cost as fixed, given the logarithmic growth verified in the 
papers on small-world graph approaches. This is the main trade-off users need 
to make when adopting graph-based approaches: building will be slow.

 

Regarding the memory consumption, I have mixed feelings. The fact that we need 
to keep the M nearest neighbors per node should not be a problem at search 
time, since the graph can be static and accessed through a file. The 
random-read nature of a query over the graph will require disk seeks, but we 
retrieve M neighbors on each read, so we're not talking about tiny random 
reads, and the filesystem cache will keep the hot nodes in memory (the upper 
layers of the hierarchical graph?). I am saying this because it seems we're 
expecting to load the entire graph into RAM at some point; I don't think this 
is needed at query time, hence my comment.
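 
As a minimal sketch of what query-time access through a file could look like: 
the file name, the flat fixed-size layout, and the class below are 
hypothetical; only the Directory/IndexInput calls are existing Lucene APIs.

{code:java}
import java.io.IOException;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.IOContext;
import org.apache.lucene.store.IndexInput;

/** Sketch only: assumes a hypothetical flat file where node i stores exactly maxConn int doc ids. */
final class OnDiskNeighbors {
  private final IndexInput in;
  private final int maxConn;

  OnDiskNeighbors(Directory dir, String fileName, int maxConn, IOContext context)
      throws IOException {
    this.in = dir.openInput(fileName, context);
    this.maxConn = maxConn;
  }

  /** One seek plus one contiguous read of maxConn * 4 bytes; hot nodes stay in the OS page cache. */
  int[] neighborsOf(int node) throws IOException {
    in.seek((long) node * maxConn * Integer.BYTES);
    int[] neighbors = new int[maxConn];
    for (int i = 0; i < maxConn; i++) {
      neighbors[i] = in.readInt();
    }
    return neighbors;
  }
}
{code}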

The tricky part, in my opinion, is at build time, where the graph is updated 
dynamically. This requires more efficient access and the ability to modify the 
data. We also need to keep the distance to each neighbor, so the total cost is 
N*M*8 bytes, where N is the total number of documents, M the maximum number of 
neighbors per node, and 8 the cost of keeping a doc id and a distance per 
neighbor (int + float). The formula is slightly more complicated for a 
hierarchical graph but doesn't change the scale. This memory requirement seems 
acceptable for medium-sized graphs in the 1-10M range but can become 
problematic when building large graphs of hundreds of millions of nodes. 
Considering the logarithmic growth of the number of operations needed to find 
a local minimum as the dataset grows, building large graphs is encouraged, at 
the expense of more memory. I don't know what would be acceptable, but 
requiring tens of gigabytes of heap memory to build such a graph doesn't seem 
compelling to me. Considering that the benefits of using a graph are already 
visible in the 1-10M range, I also wonder if we could make a compromise and 
cap the size of the graphs we build. Instead of having one graph per segment, 
we'd build N graphs, depending on how much memory the user is willing to 
allocate for the build and the total number of docs in the segment. Obviously, 
searching these graphs sequentially would be more costly than searching a 
single giant graph. However, this could also have interesting properties when 
merging segments, since we wouldn't need to rebuild graphs that have already 
reached the maximum allowed size (assuming there are no deleted documents).
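 
To put the N*M*8 estimate in perspective, here is a back-of-the-envelope 
calculation; the document counts and M values below are only illustrative.

{code:java}
/** Back-of-the-envelope check of the N * M * (int + float) estimate above. */
public final class GraphBuildMemoryEstimate {
  static long bytes(long numDocs, int maxConn) {
    return numDocs * maxConn * (Integer.BYTES + Float.BYTES); // doc id + distance per neighbor
  }

  public static void main(String[] args) {
    // ~1.2 GiB for 10M docs with M=16: manageable.
    System.out.printf("10M docs, M=16:  %.1f GiB%n", bytes(10_000_000L, 16) / Math.pow(1024, 3));
    // ~95 GiB for 200M docs with M=64: problematic as heap memory.
    System.out.printf("200M docs, M=64: %.1f GiB%n", bytes(200_000_000L, 64) / Math.pow(1024, 3));
  }
}
{code}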

I wanted to raise this idea so that we don't limit ourselves to one graph 
covering an entire segment, which could hold up to 2 billion vectors (the 
maximum allowed per index in Lucene).

> Approximate nearest vector search
> ---------------------------------
>
>                 Key: LUCENE-9004
>                 URL: https://issues.apache.org/jira/browse/LUCENE-9004
>             Project: Lucene - Core
>          Issue Type: New Feature
>            Reporter: Michael Sokolov
>            Priority: Major
>         Attachments: hnsw_layered_graph.png
>
>          Time Spent: 3h 20m
>  Remaining Estimate: 0h
>
> "Semantic" search based on machine-learned vector "embeddings" representing 
> terms, queries and documents is becoming a must-have feature for a modern 
> search engine. SOLR-12890 is exploring various approaches to this, including 
> providing vector-based scoring functions. This is a spinoff issue from that.
> The idea here is to explore approximate nearest-neighbor search. Researchers 
> have found an approach based on navigating a graph that partially encodes the 
> nearest neighbor relation at multiple scales can provide accuracy > 95% (as 
> compared to exact nearest neighbor calculations) at a reasonable cost. This 
> issue will explore implementing HNSW (hierarchical navigable small-world) 
> graphs for the purpose of approximate nearest vector search (often referred 
> to as KNN or k-nearest-neighbor search).
> At a high level the way this algorithm works is this. First assume you have a 
> graph that has a partial encoding of the nearest neighbor relation, with some 
> short and some long-distance links. If this graph is built in the right way 
> (has the hierarchical navigable small world property), then you can 
> efficiently traverse it to find nearest neighbors (approximately) in log N 
> time where N is the number of nodes in the graph. I believe this idea was 
> pioneered in  [1]. The great insight in that paper is that if you use the 
> graph search algorithm to find the K nearest neighbors of a new document 
> while indexing, and then link those neighbors (undirectedly, ie both ways) to 
> the new document, then the graph that emerges will have the desired 
> properties.
> The implementation I propose for Lucene is as follows. We need two new data 
> structures to encode the vectors and the graph. We can encode vectors using a 
> light wrapper around {{BinaryDocValues}} (we also want to encode the vector 
> dimension and have efficient conversion from bytes to floats). For the graph 
> we can use {{SortedNumericDocValues}} where the values we encode are the 
> docids of the related documents. Encoding the interdocument relations using 
> docids directly will make it relatively fast to traverse the graph since we 
> won't need to lookup through an id-field indirection. This choice limits us 
> to building a graph-per-segment since it would be impractical to maintain a 
> global graph for the whole index in the face of segment merges. However 
> graph-per-segment is very natural at search time - we can traverse each 
> segment's graph independently and merge results as we do today for term-based 
> search.
> At index time, however, merging graphs is somewhat challenging. While 
> indexing we build a graph incrementally, performing searches to construct 
> links among neighbors. When merging segments we must construct a new graph 
> containing elements of all the merged segments. Ideally we would somehow 
> preserve the work done when building the initial graphs, but at least as a 
> start I'd propose we construct a new graph from scratch when merging. The 
> process is going to be  limited, at least initially, to graphs that can fit 
> in RAM since we require random access to the entire graph while constructing 
> it: In order to add links bidirectionally we must continually update existing 
> documents.
> I think we want to express this API to users as a single joint 
> {{KnnGraphField}} abstraction that joins together the vectors and the graph 
> as a single joint field type. Mostly it just looks like a vector-valued 
> field, but has this graph attached to it.
> I'll push a branch with my POC and would love to hear comments. It has many 
> nocommits, basic design is not really set, there is no Query implementation 
> and no integration with IndexSearcher, but it does work by some measure using 
> a standalone test class. I've tested with uniform random vectors and on my 
> laptop indexed 10K documents in around 10 seconds and searched them at 95% 
> recall (compared with exact nearest-neighbor baseline) at around 250 QPS. I 
> haven't made any attempt to use multithreaded search for this, but it is 
> amenable to per-segment concurrency.
> [1] 
> [https://www.semanticscholar.org/paper/Efficient-and-robust-approximate-nearest-neighbor-Malkov-Yashunin/699a2e3b653c69aff5cf7a9923793b974f8ca164]
>  
> *UPDATES:*
>  * (1/12/2020) The up-to-date branch is: 
> [https://github.com/apache/lucene-solr/tree/jira/lucene-9004-aknn-2]



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
