Hi Derek,

I'm not sure how your image embeddings were generated, but as you probably
know, only experiment in each case can determine how far you can reduce
the dimensions and/or compress the encoding values of each dimension
before nearest-neighbour scoring degrades too much.  But I'd hazard a
guess that encoding the 512 float vector values as 512 bytes, using 512
code books generated by k-means clustering on each dimension (or fewer
code books if you're lucky - as I mentioned, the ada-002 value
distributions for our use case meant just 2 code books were needed for its
1536 values when we tried that approach, before we moved on to PQ coding),
would preserve almost all of the original embedding information and reduce
your HNSW index size by 75%, at the cost of requiring a custom similarity
class to use the code books.
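In case it helps make the idea concrete, here is a minimal sketch (Python,
toy data, illustrative names only - not our actual code, which lives in a
Lucene similarity) of per-dimension codebook quantisation: run 1-D k-means
on each dimension's values to get up to 256 centroids, then store each
float as the single byte indexing its nearest centroid.

```python
import random

def kmeans_1d(values, k, iters=20):
    """Plain 1-D k-means; returns a sorted list of k centroids."""
    centroids = sorted(random.sample(values, k))
    for _ in range(iters):
        # assign each value to its nearest centroid
        buckets = [[] for _ in range(k)]
        for v in values:
            i = min(range(k), key=lambda c: abs(v - centroids[c]))
            buckets[i].append(v)
        # move each centroid to the mean of its bucket (keep it if empty)
        centroids = [sum(b) / len(b) if b else centroids[i]
                     for i, b in enumerate(buckets)]
    return sorted(centroids)

def build_codebooks(embeddings, k=256):
    """One codebook (list of centroids) per dimension."""
    dims = len(embeddings[0])
    return [kmeans_1d([e[d] for e in embeddings], k) for d in range(dims)]

def encode(vector, codebooks):
    """Each float becomes the one-byte index of its nearest centroid."""
    return bytes(min(range(len(cb)), key=lambda i: abs(v - cb[i]))
                 for v, cb in zip(vector, codebooks))

def decode(code, codebooks):
    """Expand each byte back into its centroid value."""
    return [cb[b] for b, cb in zip(code, codebooks)]
```

A 512-dimension float vector encoded this way occupies 512 bytes instead
of 2048, which is where the 75% index-size reduction comes from.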

For the index I mentioned (160M docs, ada-002 embeddings of 1536 floats
represented as 512 bytes using PQ coding, 3 floats per byte), the HNSW
index (the .vex, .vem and .vec files) is about 87GB.  If all I am doing is
kNN queries and retrieving a document id from each result, the OS file
cache readily caches everything on a 2018-era Intel i7-9800 (8 cores, 16
threads) with 128GB DDR: there is no IO after the initial cache
population.  With a 4GB heap for Lucene, a search beamwidth ("k") of 3 and
16 search threads, a sustained rate of about 32 queries/sec is
maintained.  Yes, it is CPU intensive, because those 512 bytes still get
expanded to 1536 floats which need multiplying and summing, and over 24K
"probes" are required on average to build the result set for each query.
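To illustrate where that CPU time goes, here is a toy sketch (Python,
hypothetical names - the real code is a custom Lucene similarity in Java)
of the work done per comparison: each code byte indexes its group's
256-entry table of 3-float centroids, rebuilding the 1536 floats that are
then multiplied against the query vector.

```python
# Toy sketch of the per-comparison cost of a PQ-coded dot product:
# every code byte is expanded, via its group's 256-entry centroid table,
# into 3 floats which are multiplied against the matching query slice.
def pq_dot(code, query, tables):
    """code: N bytes; query: 3*N floats; tables[g][b] -> (x, y, z) centroid."""
    total = 0.0
    for g, b in enumerate(code):
        cx, cy, cz = tables[g][b]       # expand one byte into 3 floats
        q = 3 * g
        total += cx * query[q] + cy * query[q + 1] + cz * query[q + 2]
    return total
```

Implementations can avoid re-expanding on every comparison by
precomputing, once per query, the partial dot product of each query slice
against all 256 centroids of its group (the "asymmetric distance
computation" idea in the Jegou et al. paper linked below), which turns
each comparison into 512 table lookups and adds.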

HNSW certainly issues smallish random probes across its index, and query
rates (and CPU usage) decline rapidly if the index doesn't fit in memory,
even with 16 NVMe lanes into the CPU.  If you can move some memory
allocation from the SOLR JVM to the file cache, that may help.

The only time I've needed a big JVM was when constructing the index:
towards the end of the build, some segment merges required a lot of
memory.  I guess that with multiple segment merges happening in the JVM,
each dealing with multiple large in-memory representations of their
incoming and outgoing segments' HNSW graphs, a lot of heap is required!

best regards

Kent Fitch

On Wed, Mar 1, 2023 at 2:23 AM Derek C <[email protected]> wrote:

> Hi Kent,
>
> That's very interesting.  We have been thinking about reducing
> (down-scaling) our dense vectors from 512 to 64 dimensions, perhaps
> using PCA.  We have about 2.5 million documents, and in some testing
> (with Apache JMeter) we found that after about 10 concurrent requests we
> start to have performance problems (SOLR seems to stall until we reduce
> the load for a while), so reduced embedding sizes may really help with
> this.
>
> Just out of curiosity - when you were testing with up to 160M documents
> with 512-long embeddings, were you using a single massive computer?
> I've found that performance is OK/usable with 64GBytes of RAM, where
> SOLR has 30GBytes and the O/S has the remainder, with the SOLR
> collection/core being around 20GBytes, so within the amount the O/S can
> cache for disk I/O.
>
> Derek
>
> On Mon, Feb 27, 2023 at 5:16 AM Kent Fitch <[email protected]> wrote:
>
> > Hi Derek,
> >
> > I have been trying a few settings with HNSW in Lucene/SOLR, and whilst my
> > experiences may not be directly relevant to you, they may provide some
> > background.
> >
> > My tests have been with an index of up to 160M records, each
> > containing a 512 element byte embedding.  The original embeddings were
> > of text articles (average length about 450 words) generated by
> > openAI's ada-002 as 1536 floats, but then encoded as 512 bytes, with
> > each group of 3 floats encoded as 1 byte using PQ encoding, following
> > the method described here:
> > https://lear.inrialpes.fr/pubs/2011/JDS11/jegou_searching_with_quantization.pdf
> > The motivation for PQ encoding is basically to reduce index size.  A
> > first attempt at encoding the floats as bytes worked well (I tried to
> > minimise error by analysing the distribution of float values across
> > the 1536 dimensions, and noticed that all but 5 of the dimensions had
> > a very narrow range for most embeddings; using k-means clustering to
> > find 256 values for those dimensions, and another 256 values for the 5
> > "outlier" dimensions, yielded good results).  However, each vector
> > still occupied 1536 bytes, and HNSW really needs these to be in
> > memory, as otherwise the IOs to even the RAID 10 NVMe devices
> > connected to their own PCIE3 lanes will cause slow query rates.  So
> > quantising 3 floats into 1 byte was attractive.  Again, I used k-means
> > on each of the 512 groups of 3 floats to get 256 "centroids" to
> > minimise error.  The downside of this approach is the need to define a
> > custom similarity that reads, at initiation, the 512 centroid tables
> > (each with 256 mappings to expand a byte code to 3 floating-point
> > numbers representing a "centroid" point).
> >
> > Anyway, the loss caused by this mapping is real but not particularly
> > consequential: some result lists are slightly degraded/reordered, but
> > HNSW is an "approximate nearest neighbour" search anyway.
> >
> > How sure are you that the unexpected search results you are reporting are
> > caused by the HNSW ANN rather than the encoding?  For example, if you run
> > an exhaustive search on your 2m records to find the "real" nearest
> > neighbours to some point representing some base document, how do the
> > results differ from your HNSW search with various search beamwidths
> > (provided as the "k" parameter on the KnnByteVectorQuery constructor)?
> >
> > Although not directly relevant to your use-case, here are results I'm
> > seeing on an index of 160M documents with an ada-002 embedding
> > quantised to 512 bytes, using a recent (11 Feb 23) Lucene build with
> > an "M" of 64, a construction "beamwidth" of 120, and a custom
> > similarity:
> >
> > with a search "k" of 1, the "real" closest match is returned 56% of
> > the time and requires 18K similarity comparisons;
> > with a search "k" of 2, the "real" closest match is returned as the
> > top match 61% of the time and requires 22K comparisons;
> > with "k" of 3, 64%, 24K comparisons;
> > "k" of 5, 70%, 29K;
> > "k" of 10, 78%, 37K;
> > "k" of 20, 87%, 48K;
> > "k" of 50, 94%, 63K;
> > "k" of 120, 97%, 121K.
> >
> > The nature of the embeddings I loaded is that many are very similar
> > (basically, randomish variations on a much smaller set of "base"
> > articles, as we couldn't afford to get embeddings for 160M articles
> > for this test - we are just trying to test whether Lucene's HNSW is
> > feasible for our use-case), so in the overwhelming majority of
> > "misses", the top article is indeed very similar to the article
> > sought.  That is, for our use case, the results are satisfactory, even
> > with the "down-scaling" of the embedding to 512 bytes.
> >
> > best regards
> >
> > Kent Fitch
> >
> >
> >
> > On Mon, Feb 27, 2023 at 5:02 AM Derek C <[email protected]> wrote:
> >
> > > Hi all,
> > >
> > > I'm a bit uncertain how KNN with HNSW works in SOLR with dense vector
> > > fields and searching.
> > >
> > > Recently I've been doing tests loading dense vectors after
> > > inferencing [images] and then checking by eye the closest matches,
> > > and the results look funny (very similar images not being the
> > > nearest results, as I'd normally expect).
> > >
> > > I'm unclear about HNSW in general (like what are the best policies,
> > > or a good guide or starting point, for choosing hnswMaxConnections
> > > and hnswBeamWidth values if you know the dense vector length (512)
> > > and you know you have 2 million+ documents).
> > >
> > > But one thing I'm wondering right now is: with a dataset built up
> > > over time, where documents have been added and removed over time,
> > > can this affect the KNN search (i.e. would it be better if all
> > > documents, or at least the dense vector field, had been indexed
> > > fresh)?
> > >
> > > BTW I haven't yet moved from SOLR 9.0 to 9.1, but I do read that
> > > the HNSW codec has changed in some way so a reindex is required - I
> > > should probably try 9.1 (I would prioritise this if anyone says 9.1
> > > is better quality or better performance for KNN searches!).
> > >
> > > Thanks for any info!
> > >
> > > Derek
> > >
> > > --
> > > Derek Conniffe
> > > Harvey Software Systems Ltd T/A HSSL
> > > Telephone (IRL): 086 856 3823
> > > Telephone (US): (650) 449 6044
> > > Skype: dconnrt
> > > Email: [email protected]
> > >
> > >
> > > *Disclaimer:* This email and any files transmitted with it are
> > confidential
> > > and intended solely for the use of the individual or entity to whom
> they
> > > are addressed. If you have received this email in error please delete
> it
> > > (if you are not the intended recipient you are notified that
> disclosing,
> > > copying, distributing or taking any action in reliance on the contents
> of
> > > this information is strictly prohibited).
> > > *Warning*: Although HSSL have taken reasonable precautions to ensure no
> > > viruses are present in this email, HSSL cannot accept responsibility
> for
> > > any loss or damage arising from the use of this email or attachments.
> > > P For the Environment, please only print this email if necessary.
> > >
> >
>
>
> --
> Derek Conniffe
> Harvey Software Systems Ltd T/A HSSL
> Telephone (IRL): 086 856 3823
> Telephone (US): (650) 449 6044
> Skype: dconnrt
> Email: [email protected]
>
>
>
