Hi Jim, thanks for taking a look.

Yes, it's quite possible that replacing HNSW entirely with
FreshDiskANN-style Vamana would make sense.  I started with this approach
in part to better understand how much difference there was in applying a
Vamana pass to just using the HNSW level 0.  (It's still unclear, because
the difference in search algorithm quality complicates things.)

I'm not sure what property you're thinking of re efficient merging, AFAICT
the DiskANN algorithms are still designed node-at-a-time and the
alpha-pruning is basically a generalization of the HNSW diversity
heuristic, so if there were a way to merge multiple Vamana graphs
efficiently you should be able to do the same thing level-at-a-time with
HNSW.

I'm using the low-level HNSW code to build Cassandra indexes [1] [2], I was
pleased to find that Lucene is useful beyond Elastic and Solr.  I've opened
a PR to contribute my concurrent HNSW index here:
https://github.com/apache/lucene/pull/12421

[1]
https://cwiki.apache.org/confluence/display/CASSANDRA/CEP-30%3A+Approximate+Nearest+Neighbor%28ANN%29+Vector+Search+via+Storage-Attached+Indexes
[2] https://www.datastax.com/blog/vector-search-is-generally-available

On Sat, Aug 5, 2023 at 10:08 AM jim ferenczi <jim.feren...@gmail.com> wrote:

> Hi Jonathan,
>
> Could you provide further clarification on your goal? The current
> description is unclear.
>
> Why construct an HNSW graph only to 'optimize' it into a Vamana graph? Why
> not directly build a Vamana graph? This paper
> <https://arxiv.org/pdf/2105.09613.pdf> provides guidance for streaming
> graph construction.
>
> An intriguing feature of the Vamana graph is its potential for more
> efficient merging of multiple graphs compared to rebuilding the entire HNSW
> graph for each merge. While I'm unsure of the process, it seems simpler to
> work within a single level.
>
> I'm also concerned about using HNSW code beyond the index writer. Flushing
> to disk is integral to the index codec. It might be advisable to employ a
> standard index writer rather than extracting code for external use,
> potentially disrupting the indexing chain.
>
> Best regards,
> Jim
>
> On Sat, 5 Aug 2023 at 22:47, Jonathan Ellis <jbel...@gmail.com> wrote:
>
>> Hi all,
>>
>> I put FINGER on pause to try out different graph construction methods.
>> Vamana (the in-memory component of DiskANN) looks interesting, especially
>> because it's a two-pass algorithm and the first pass is very similar to
>> "build L0 of hnsw."  So there is potentially a good fit there for using
>> HNSW in Cassandra to index and query vectors "online" but when we flush to
>> disk, optimize it with Vamana for about half as much work as if we started
>> from zero.
>>
>> Here's what I'm seeing on the nytimes-256 dataset, at two different beam
>> widths:
>>
>> HNSW: top 100 recall 0.8166, build 42.98s, query 14.89s. 42148760 nodes
>> visited
>>
>> Vamana@2.0: top 100 recall 0.9050, build 15.71s, query 13.03s. 39525880
>> nodes visited
>>
>> Not bad, right?  But here's the thing, that's not using exactly the same
>> search algorithm.  When I point the HnswGraphSearcher.searchLevel at the
>> Vamana graph, it's much worse:
>>
>> Vamona@2.0: top 100 recall 0.7323, build 16.17s, query 9.45s. 44599250
>> nodes visited
>>
>> It is faster, mostly (?) because I've barely put any time into optimizing
>> the Vamana search method.  But recall is much worse.
>>
>> I've attached clips of the algorithm descriptions.  Setting aside a bit
>> of noise, the main difference is that hnsw says, "stop expanding nodes when
>> none of the candidates are closer to the query than the worst of the topk
>> found so far."  While Vamana keeps the topk results and candidates in the
>> *same* set [1] and says, "stop expanding nodes when none of this top L set
>> are unvisited" (where L >= k).  In the results above I manually set L to
>> 4*k, since that makes the visited counts close to even.
>> .
>> Does anyone have any intuition for why, when pointed at the same graph,
>> this does better with the same (slightly fewer) nodes visited?
>>
>> If you'd rather play with the code, rough cut is here:
>> https://github.com/jbellis/lucene/tree/concurrent-vamana, search code is
>> VamanaGraphBuilder.greedySearch.  Test harness branch at
>> https://github.com/jbellis/hnswdemo/tree/vamana.
>>
>> [1] I implemented this as FixedNeighborArray.
>>
>>
>> --
>> Jonathan Ellis
>> co-founder, http://www.datastax.com
>> @spyced
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
>> For additional commands, e-mail: dev-h...@lucene.apache.org
>
>

-- 
Jonathan Ellis
co-founder, http://www.datastax.com
@spyced

Reply via email to