Can we import an HNSW graph into a Lucene index?
Hi all,

We use Lucene's HNSW graph search extensively for ANN search. One issue we keep running into is long index build times with higher-dimensional vectors. To address this, we are exploring ways to build the HNSW index on the GPU and merge it into an existing Lucene index to serve queries. For example, NVIDIA's cuVS library supports building a CAGRA <https://arxiv.org/pdf/2308.15136> index and converting it into an hnswlib graph.

My idea is this: once the HNSW graph is built on the GPU, we import it into Lucene. For that we need the graph vertices and their connections, which we can then write out in a Lucene-compatible segment file format. We would also map the doc IDs to embeddings and update the FieldInfos.

I would like feedback from the community on whether this sounds feasible, along with any implementation pointers you might have.

Thanks,
Anand Kotriwal
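One step of the import described above, remapping an externally built graph into Lucene's node numbering, can be sketched as follows. This is a minimal illustration under stated assumptions, not Lucene or cuVS API: the class and method names are hypothetical, external node labels are assumed to be dense integers 0..n-1, only a single graph level is handled, and the target numbering is assumed to assign vector ordinals in increasing doc ID order within a segment. Writing the actual segment files is out of scope here.

```java
import java.util.Arrays;

/** Hypothetical sketch: remap an externally built graph level into doc-ID-ordered ordinals. */
final class GraphImportSketch {

  /** Returns labelToOrd, where ordinals are assigned in increasing doc ID order. */
  static int[] ordinalsByDocId(int[] labelToDocId) {
    int n = labelToDocId.length;
    Integer[] labels = new Integer[n];
    for (int i = 0; i < n; i++) {
      labels[i] = i;
    }
    // Sort external labels by the doc ID each one maps to.
    Arrays.sort(labels, (a, b) -> Integer.compare(labelToDocId[a], labelToDocId[b]));
    int[] labelToOrd = new int[n];
    for (int ord = 0; ord < n; ord++) {
      labelToOrd[labels[ord]] = ord;
    }
    return labelToOrd;
  }

  /** Rewrites one level's adjacency lists from label space into ordinal space. */
  static int[][] remapLevel(int[][] neighborsByLabel, int[] labelToOrd) {
    int[][] neighborsByOrd = new int[neighborsByLabel.length][];
    for (int label = 0; label < neighborsByLabel.length; label++) {
      int[] src = neighborsByLabel[label];
      int[] dst = new int[src.length];
      for (int i = 0; i < src.length; i++) {
        dst[i] = labelToOrd[src[i]]; // neighbor IDs change together with node IDs
      }
      neighborsByOrd[labelToOrd[label]] = dst;
    }
    return neighborsByOrd;
  }

  public static void main(String[] args) {
    int[] labelToDocId = {30, 10, 20};          // made-up doc IDs for three vectors
    int[][] neighbors = {{1, 2}, {0}, {0, 1}};  // made-up level-0 adjacency lists
    int[] labelToOrd = ordinalsByDocId(labelToDocId);
    System.out.println(Arrays.toString(labelToOrd));
    System.out.println(Arrays.deepToString(remapLevel(neighbors, labelToOrd)));
  }
}
```

A real import would repeat the remapping for every level of the graph and carry the entry point over as well; upper levels contain only a subset of the nodes, which this sketch does not model.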
Add custom merge policy to Lucene sandbox?
Hi,

We (at Amazon Product Search) have customized TieredMergePolicy to:

1. make it easy to configure merge-on-commit merges, and
2. ensure that no merged segment accidentally becomes too large a percentage of the total index, which would harm effective within-query concurrency and long-pole query latencies.

If the community finds this useful, I am happy to open a Jira issue and contribute a patch that adds it to Lucene's sandbox module.

Thanks,
Anand
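The second point can be illustrated with a small size-cap calculation. This is only a sketch of the general idea, not the customized policy itself: the class, method, and parameter names are hypothetical, and the rule shown (cap a merged segment at a fixed fraction of the total index size, bounded by an absolute maximum) is an assumed interpretation of "too big a percentage of the total index".

```java
/** Hypothetical sketch: cap merged segment size at a fraction of the total index size. */
final class MergeSizeCapSketch {

  /** Largest merged segment allowed, in bytes. */
  static long maxMergedSegmentBytes(
      long totalIndexBytes, double maxFractionOfIndex, long absoluteMaxBytes) {
    long byFraction = (long) (totalIndexBytes * maxFractionOfIndex);
    return Math.min(byFraction, absoluteMaxBytes);
  }

  /** True if merging the given segments keeps the result under the cap. */
  static boolean allowMerge(
      long[] segmentBytes, long totalIndexBytes, double maxFractionOfIndex, long absoluteMaxBytes) {
    long mergedBytes = 0;
    for (long bytes : segmentBytes) {
      mergedBytes += bytes; // upper bound: ignores space reclaimed from deleted docs
    }
    return mergedBytes
        <= maxMergedSegmentBytes(totalIndexBytes, maxFractionOfIndex, absoluteMaxBytes);
  }

  public static void main(String[] args) {
    long total = 1000L;
    System.out.println(allowMerge(new long[] {100, 100}, total, 0.25, 5000L));
    System.out.println(allowMerge(new long[] {200, 100}, total, 0.25, 5000L));
  }
}
```

The reasoning behind such a cap: when a query is executed concurrently across segments, the largest segment bounds how long the slowest worker takes, so keeping every segment below a fraction of the index keeps the work more evenly divisible.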
Re: Faster advance on Vector Values
Sure! I created https://issues.apache.org/jira/browse/LUCENE-9674 and attached a PR to the issue.

Thanks,
Anand

On Mon, Jan 18, 2021 at 6:14 AM Michael Sokolov wrote:
> Thanks for the suggestion! This will be a nice improvement for use cases wanting to retrieve vectors for a sparse set of documents, e.g. when incorporating a vector-based score as a scoring signal. Would you mind opening an issue, Anand?
>
> On Sat, Jan 16, 2021 at 9:07 AM Anand Kotriwal wrote:
> > Hi,
> >
> > Our team is using the recently introduced Lucene90Codec support for vectors. We have a use case to quickly scan a segment for documents having vectors. While implementing it, we noticed that the advance function in the class Lucene90VectorReader does a linear search for the target document.
> >
> > I have a proposal to make it faster: we can implement a binary search over the "ordToDoc" array, which will make the advance operation take logarithmic time.
> >
> > I would like to seek ideas and suggestions from the community. I have an implementation on my private fork that implements the above idea. I can open a PR if the idea sounds reasonable.
> >
> > Thanks!
> > Anand Kotriwal
Faster advance on Vector Values
Hi,

Our team is using the recently introduced Lucene90Codec support for vectors. We have a use case that needs to quickly scan a segment for documents that have vectors. While implementing it, we noticed that the advance method in the class Lucene90VectorReader does a linear search for the target document.

I have a proposal to make it faster: we can implement a binary search over the "ordToDoc" array, which makes the advance operation take logarithmic time instead of linear time in the number of vectors.

I would like to hear ideas and suggestions from the community. I have an implementation of the above idea on my private fork, and I can open a PR if the idea sounds reasonable.

Thanks!
Anand Kotriwal
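The proposal can be sketched in isolation as follows. This is an illustrative stand-alone class, not the code from the fork or the PR: the class name is hypothetical, "ordToDoc" is assumed to be a strictly increasing array of doc IDs indexed by vector ordinal, and NO_MORE_DOCS is defined locally with the same value Lucene uses for DocIdSetIterator.NO_MORE_DOCS.

```java
import java.util.Arrays;

/** Hypothetical sketch: advance() over a sorted ordToDoc array using binary search. */
final class OrdToDocAdvanceSketch {
  // Same value as Lucene's DocIdSetIterator.NO_MORE_DOCS; redefined to stay self-contained.
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] ordToDoc; // assumed strictly increasing, indexed by vector ordinal
  private int ord = -1;         // current ordinal; -1 means not yet positioned

  OrdToDocAdvanceSketch(int[] ordToDoc) {
    this.ordToDoc = ordToDoc;
  }

  int docID() {
    if (ord == -1) {
      return -1;
    }
    return ord >= ordToDoc.length ? NO_MORE_DOCS : ordToDoc[ord];
  }

  /** Moves to the first doc ID >= target that lies after the current position. */
  int advance(int target) {
    int from = ord + 1; // iterators only move forward, so search the remaining suffix
    if (from >= ordToDoc.length) {
      ord = ordToDoc.length;
      return NO_MORE_DOCS;
    }
    int idx = Arrays.binarySearch(ordToDoc, from, ordToDoc.length, target);
    if (idx < 0) {
      idx = -idx - 1; // not found: insertion point is the first doc ID greater than target
    }
    ord = idx;
    return docID();
  }

  public static void main(String[] args) {
    OrdToDocAdvanceSketch it = new OrdToDocAdvanceSketch(new int[] {2, 5, 9, 40});
    System.out.println(it.advance(0));
    System.out.println(it.advance(5));
    System.out.println(it.advance(10));
    System.out.println(it.advance(41) == NO_MORE_DOCS);
  }
}
```

Restricting the search to the part of the array after the current ordinal keeps the iterator forward-only and makes each call cost O(log n) in the number of remaining vectors, compared with a linear scan.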