[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)
[ https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397827#comment-17397827 ] Zach Chen commented on LUCENE-10002:

Hi [~jpountz] [~gsmiller], I have created a PR for this that deprecates the collector API in favor of the collector manager API, along with some initial refactoring of tests and of the parts of DrillSideways that use TopScoreDocCollector & TopFieldCollector, moving them to the latter API. I plan to submit more PRs afterward for other areas of the codebase. Please note that I did first try to remove the collector API entirely, but that resulted in far more changes than I'm comfortable with in a single PR; this API is also so commonly used that users may prefer a more gradual deprecation / transition period. Hence I reverted my previous effort and adopted a phased approach.

> Remove IndexSearcher#search(Query,Collector) in favor of
> IndexSearcher#search(Query,CollectorManager)
> ---------------------------------------------------------
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Adrien Grand
> Priority: Minor
> Time Spent: 10m
> Remaining Estimate: 0h
>
> It's a bit trappy that you can create an IndexSearcher with an executor, but
> that it would always search on the caller thread when calling
> {{IndexSearcher#search(Query,Collector)}}.
> Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to
> {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory
> methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to
> return a {{CollectorManager}} instead of a {{Collector}}?

--
This message was sent by Atlassian Jira (v8.3.4#803005)
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] zacharymorn opened a new pull request #240: LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)
zacharymorn opened a new pull request #240:
URL: https://github.com/apache/lucene/pull/240

# Description / Solution
Deprecate IndexSearcher#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager), with the following changes:
1. Factor out TopScoreDocCollectorManager, TopFieldCollectorManager, TotalHitCountCollectorManager and FixedBitSetCollector
2. Refactor some tests to use the above collector managers
3. Refactor DrillSideways to use the extracted collector managers

# Tests
Passed updated tests with `./gradlew clean; ./gradlew check -Ptests.nightly=true` (currently there are nocommit messages that fail precommit, though)

# Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [ ] I have added tests for my changes.

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
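For readers following the thread, here is a hedged sketch of the two call styles being discussed, assuming the existing Lucene 8.x/9.x search API (`TopScoreDocCollector.createSharedManager` is the pre-existing manager factory, not one of the new classes this PR extracts):

```java
import java.io.IOException;

import org.apache.lucene.search.CollectorManager;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.TopScoreDocCollector;

public class SearchApiSketch {

  // Deprecated style: a single Collector instance. Even if the searcher was
  // constructed with an executor, this always collects on the caller thread —
  // the "trap" described in the issue.
  static TopDocs oldStyle(IndexSearcher searcher, Query query) throws IOException {
    TopScoreDocCollector collector = TopScoreDocCollector.create(10, Integer.MAX_VALUE);
    searcher.search(query, collector);
    return collector.topDocs();
  }

  // CollectorManager style: the manager creates one collector per leaf slice
  // and reduces the partial results, so the searcher's executor can actually
  // parallelize the search.
  static TopDocs newStyle(IndexSearcher searcher, Query query) throws IOException {
    CollectorManager<TopScoreDocCollector, TopDocs> manager =
        TopScoreDocCollector.createSharedManager(10, null, Integer.MAX_VALUE);
    return searcher.search(query, manager);
  }
}
```

This sketch requires the Lucene core jar on the classpath; it is meant to illustrate the API shape, not the PR's exact refactoring.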
[jira] [Commented] (LUCENE-10048) Bypass total frequency check if field uses custom term frequency
[ https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397782#comment-17397782 ] Robert Muir commented on LUCENE-10048:

btw, there are a lot of alternatives you can look into to avoid having > 2^32 tokens inside a single document's field. For example, you could use more fields, or you could encode the thing in a payload. But I don't understand what you are doing; to me it honestly sounds like Lucene may not be the right solution at all. Or maybe it is worth looking at the Lucene 9 vector format or something very different.

> Bypass total frequency check if field uses custom term frequency
> ----------------------------------------------------------------
>
> Key: LUCENE-10048
> URL: https://issues.apache.org/jira/browse/LUCENE-10048
> Project: Lucene - Core
> Issue Type: Improvement
> Reporter: Tony Xu
> Priority: Minor
>
> For all fields whose index option is not *IndexOptions.NONE*, there is a
> check on the per-field total token count (i.e. field length) to ensure we
> don't index too many tokens. This is done by accumulating each token's
> *TermFrequencyAttribute*.
>
> Given that Lucene currently allows a custom term frequency to be attached to
> each token, and that usage of the frequency can be pretty wild, it is
> possible for this check to fail with only a few tokens that have large
> frequencies. Currently Lucene will skip indexing the whole document.
> *"foo| bar|"*
>
> What should be the way to inform the indexing chain not to check the field
> length?
> A related observation: when custom term frequency is in use, the user is not
> likely to use the similarity for this field. Maybe we can offer a way to
> specify that, too?
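For context, custom term frequencies of the kind the description's truncated "foo|… bar|…" example hints at are typically attached via Lucene's DelimitedTermFrequencyTokenFilter, which parses a trailing `|freq` suffix off each token. A hedged sketch (the analyzer wiring here is illustrative, not from the issue):

```java
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.DelimitedTermFrequencyTokenFilter;

public class CustomTermFreqAnalyzer {
  // With this analyzer, input like "foo|5 bar|3" indexes term "foo" with
  // term frequency 5 and "bar" with 3. The indexing chain sums these
  // frequencies into the per-field token count, which is where the overflow
  // check under discussion fires.
  static Analyzer build() {
    return new Analyzer() {
      @Override
      protected TokenStreamComponents createComponents(String fieldName) {
        WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
        TokenStream filtered = new DelimitedTermFrequencyTokenFilter(tokenizer);
        return new TokenStreamComponents(tokenizer, filtered);
      }
    };
  }
}
```

This sketch requires the Lucene core and analysis-common jars on the classpath.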
[jira] [Commented] (LUCENE-10048) Bypass total frequency check if field uses custom term frequency
[ https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397774#comment-17397774 ] Robert Muir commented on LUCENE-10048:

Whether or not you use the similarity or scorer is irrelevant. I'm referring to term/field statistic values stored in the segment itself. It doesn't matter which parts of Lucene you do or don't use here; overflowing these values will essentially behave as corruption. That's why I repeat myself every time a JIRA issue is opened that tries to bypass these checks.
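The failure mode Robert describes is plain integer overflow of accumulated statistics. A minimal standalone sketch (illustrative only, not Lucene's actual accounting code):

```java
public class FreqOverflowDemo {
    // Sum term frequencies the way a naive 32-bit counter would.
    static int accumulate(int[] freqs) {
        int total = 0;
        for (int f : freqs) {
            total += f; // silently wraps past Integer.MAX_VALUE
        }
        return total;
    }

    public static void main(String[] args) {
        // Four tokens, each with a custom term frequency of 2^30: the true
        // field length is 2^32, but the int counter wraps around to 0, so
        // any statistic derived from it behaves as corruption.
        int[] freqs = {1 << 30, 1 << 30, 1 << 30, 1 << 30};
        System.out.println(FreqOverflowDemo.accumulate(freqs));
    }
}
```

This is why a handful of tokens with very large custom frequencies can trip the same check as billions of ordinary tokens.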
[jira] [Commented] (LUCENE-10048) Bypass total frequency check if field uses custom term frequency
[ https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397771#comment-17397771 ] Ankur commented on LUCENE-10048:

@[~rcmuir] Consider the case where these term-document level scoring factors are computed in an offline process, indexed in Lucene, and accessed at query time by a ranking function that does not rely on Lucene's [Scorer|https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/Scorer.html] and [Similarity|https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/similarities/Similarity.html] abstractions. What counts as a reasonable value is up to the offline process that serves the needs of the ranking function, and is outside our control. A single term-document scoring factor can still be less than {{Integer.MAX_VALUE}}, but the sum of all such factors for a document could easily exceed the {{Integer.MAX_VALUE}} range.

Without this, our only option (I think) is to use {{BinaryDocValues}} and implement mechanisms to serialize/deserialize term-document level scoring factors at indexing and search time ourselves. With that approach we don't get the space efficiency that comes with the highly optimized terms dictionary and the integer compression techniques used to encode postings data (at least not without significant work).

Maybe we can keep the restriction that a custom term frequency must be less than {{Integer.MAX_VALUE}}, but relax the check on the per-field total token count for this expert use case?
[jira] [Commented] (LUCENE-9937) ann-benchmarks results for HNSW search
[ https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397664#comment-17397664 ] Michael Sokolov commented on LUCENE-9937:

I'm glad you found the batch setup was confounding things! It's hard to do an A/A comparison between these different implementations!

> ann-benchmarks results for HNSW search
> --------------------------------------
>
> Key: LUCENE-9937
> URL: https://issues.apache.org/jira/browse/LUCENE-9937
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Julie Tibshirani
> Priority: Minor
>
> *NOTE: These benchmarks contained an error that made hnswlib perform too
> well. See the last comment for correct results.*
> I hooked up our HNSW implementation to
> [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used
> repo for benchmarking nearest neighbor search libraries against large
> datasets. I found the results interesting and opened this issue to share and
> discuss. My benchmarking code can be found
> [here|https://github.com/jtibshirani/lucene/pull/1] – it's hacky and not easy
> to commit but I'm happy to help anyone get set up with it.
> Approaches
> * LuceneVectorsOnly: a baseline that only indexes vectors, and performs a
> brute force scan to determine nearest neighbors
> * LuceneHnsw: our HNSW implementation
> * [hnswlib|https://github.com/nmslib/hnswlib]: a C++ HNSW implementation
> from the author of the paper
>
> Datasets
> * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128,
> comparing euclidean distance
> * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100,
> comparing cosine similarity
>
> *Results on sift-128-euclidean*
> Parameters: M=16, efConstruction=500
> {code:java}
> Approach                 Recall        QPS
> LuceneVectorsOnly()      1.000       6.764
> LuceneHnsw(n_cands=10)   0.603    7736.968
> LuceneHnsw(n_cands=50)   0.890    3605.000
> LuceneHnsw(n_cands=100)  0.953    2237.429
> LuceneHnsw(n_cands=500)  0.996     570.900
> LuceneHnsw(n_cands=800)  0.998     379.589
> hnswlib(n_cands=10)      0.713   69662.756
> hnswlib(n_cands=50)      0.950   28021.582
> hnswlib(n_cands=100)     0.985   16108.538
> hnswlib(n_cands=500)     1.000    4115.886
> hnswlib(n_cands=800)     1.000    2729.680
> {code}
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=500
> {code:java}
> Approach                 Recall        QPS
> LuceneVectorsOnly()      1.000       6.722
> LuceneHnsw(n_cands=10)   0.507    5036.236
> LuceneHnsw(n_cands=50)   0.760    2099.850
> LuceneHnsw(n_cands=100)  0.833    1233.690
> LuceneHnsw(n_cands=500)  0.941     309.077
> LuceneHnsw(n_cands=800)  0.961     203.782
> hnswlib(n_cands=10)      0.597   43543.345
> hnswlib(n_cands=50)      0.832   14719.165
> hnswlib(n_cands=100)     0.897    8299.948
> hnswlib(n_cands=500)     0.981    1931.985
> hnswlib(n_cands=800)     0.991     881.752
> {code}
> Notes on benchmark:
> * By default, the ann-benchmarks repo retrieves 10 nearest neighbors and
> computes the recall against the true neighbors. The recall calculation has a
> small 'fudge factor' that allows neighbors that are within a small epsilon of
> the best distance. Queries are executed serially to obtain the QPS.
> * I chose parameters where hnswlib performed well, then passed these same
> parameters to Lucene HNSW. For index-time parameters, I set maxConn as M and
> beamWidth as efConstruction. For search parameters, I set k to k, and fanout
> as (num_cands - k) so that the beam search is of size num_cands. Note that
> our default value for beamWidth is 16, which is really low – I wasn't able
> to obtain acceptable recall until I bumped it closer to 500 to match the
> hnswlib default.
> * I force-merged to one segment before running searches since this gives the
> best recall + QPS, and also to match hnswlib.
>
> Some observations:
> * It'd be really nice to extend luceneutil to measure vector search recall
> in addition to latency. That would help ensure we're benchmarking a more
> realistic scenario, instead of accidentally indexing/searching at a very low
> recall. Tracking recall would also guard against subtle, unintentional
> changes to the algorithm. It's easy to make an accidental change while
> refactoring, and with approximate algorithms, unit tests don't guard as well
> against this.
> * Lucene HNSW gives a great speed-up over the baseline without sacrificing
> too much recall. But it doesn't perform as well as hnswlib in terms of both
> recall and QPS. We wouldn't expect the results to line up perfectly, since
> Lucene doesn't actually implement HNSW – the current algorithm isn't
> actually hierarchical and only uses a single graph layer. Might this
> difference indicate we're leaving performance 'on the table' by not using
> layers, which (I don't think) would add much index time or space? Or are
> there other algorithmic improvements that would help close the gap?
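The recall measurement described in the notes (intersection with the true neighbors, plus a small epsilon allowance on distance) can be sketched in plain Java. The additive epsilon and the method names here are illustrative assumptions, not ann-benchmarks' exact code:

```java
public class RecallAtK {
    // approxDists / trueDists: distances of the returned / true top-k
    // neighbors, each sorted ascending. An approximate neighbor counts as a
    // hit if its distance is within eps of the k-th true nearest distance —
    // the 'fudge factor' mentioned above.
    static double recall(float[] approxDists, float[] trueDists, int k, float eps) {
        float threshold = trueDists[k - 1] + eps;
        int hits = 0;
        for (int i = 0; i < k && i < approxDists.length; i++) {
            if (approxDists[i] <= threshold) {
                hits++;
            }
        }
        return (double) hits / k;
    }

    public static void main(String[] args) {
        float[] exact = {0.10f, 0.20f, 0.30f};
        float[] approx = {0.10f, 0.20f, 0.90f}; // third neighbor is too far
        System.out.println(RecallAtK.recall(approx, exact, 3, 1e-3f));
    }
}
```

Tracking a metric like this in luceneutil would catch the "accidentally searching at very low recall" scenario described above.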
[jira] [Commented] (LUCENE-9614) Implement KNN Query
[ https://issues.apache.org/jira/browse/LUCENE-9614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397613#comment-17397613 ] Michael Sokolov commented on LUCENE-9614:

UPDATE: we went with {{score(q,d) = 1 / (1 + |q - d|)}} to normalize "reversed" scores into [0, 1]; thanks for the suggestion, [~jtibshirani]

> Implement KNN Query
> -------------------
>
> Key: LUCENE-9614
> URL: https://issues.apache.org/jira/browse/LUCENE-9614
> Project: Lucene - Core
> Issue Type: New Feature
> Reporter: Michael Sokolov
> Priority: Major
> Time Spent: 2.5h
> Remaining Estimate: 0h
>
> Now we have a vector index format, and one vector indexing/KNN search
> implementation, but the interface is low-level: you can search across a
> single segment only. We would like to expose a Query implementation.
> Initially, we want to support a usage where the KnnVectorQuery selects the
> k-nearest neighbors without regard to any other constraints, and these can
> then be filtered as part of an enclosing Boolean or other query.
> Later we will want to explore some kind of filtering *while* performing
> vector search, or a re-entrant search process that can yield further
> results. Because of the nature of knn search (all documents having any
> vector value match), it is more like a ranking than a filtering operation,
> and it doesn't really make sense to provide an iterator interface that can
> be merged in the usual way, in docid order, skipping ahead. It's not yet
> clear how to satisfy a query that is "k nearest neighbors satisfying some
> arbitrary Query", at least not without realizing a complete bitset for the
> Query. But this is for a later issue; *this* issue is just about performing
> the knn search in isolation, computing a set of (some given) K nearest
> neighbors, and providing an iterator over those.
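The chosen normalization is easy to state in standalone Java. The distance helper below is illustrative; Lucene applies the formula to its own distance computation:

```java
public class KnnScore {
    // Map a "reversed" distance (smaller is better) to a score where larger
    // is better, per score(q,d) = 1 / (1 + |q - d|). Distance 0 maps to 1,
    // and the score decreases monotonically toward 0 as distance grows.
    static float score(float distance) {
        return 1f / (1f + distance);
    }

    // Plain Euclidean distance, for illustration only.
    static float euclidean(float[] q, float[] d) {
        float sum = 0f;
        for (int i = 0; i < q.length; i++) {
            float diff = q[i] - d[i];
            sum += diff * diff;
        }
        return (float) Math.sqrt(sum);
    }

    public static void main(String[] args) {
        float[] q = {0f, 0f};
        float[] d = {3f, 4f}; // distance 5 from q
        System.out.println(KnnScore.score(KnnScore.euclidean(q, d)));
    }
}
```

Monotonicity is the key property: ordering hits by this score preserves the nearest-first ordering by distance.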
[jira] [Commented] (LUCENE-9937) ann-benchmarks results for HNSW search
[ https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397555#comment-17397555 ] Robert Muir commented on LUCENE-9937:

we can help some of those hotspots with vectorization etc improvements (at least with #dims such as 100/128), but we have to wait for openjdk to make the crap available to us. they now incubate this stuff for a third time, so it seems java 19 at the very earliest?: http://openjdk.java.net/jeps/8269306
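To make the vectorization point concrete, here is a hedged sketch of the kind of distance kernel the incubating JDK Vector API would allow. This assumes the `jdk.incubator.vector` module (JDK 16+); the API may shift between incubations, and this is not Lucene code:

```java
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

public class SimdDistance {
    private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

    // Squared Euclidean distance, processed SPECIES.length() lanes at a time.
    static float squaredDistance(float[] a, float[] b) {
        float sum = 0f;
        int i = 0;
        int bound = SPECIES.loopBound(a.length);
        for (; i < bound; i += SPECIES.length()) {
            FloatVector va = FloatVector.fromArray(SPECIES, a, i);
            FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
            FloatVector diff = va.sub(vb);
            sum += diff.mul(diff).reduceLanes(VectorOperators.ADD);
        }
        for (; i < a.length; i++) { // scalar tail for leftover dimensions
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }
}
```

Compiling and running this requires `--add-modules jdk.incubator.vector`, which is why plain Lucene releases could not depend on it at the time of this thread.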
[jira] [Resolved] (LUCENE-9941) ann-benchmarks results for HNSW indexing
[ https://issues.apache.org/jira/browse/LUCENE-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani resolved LUCENE-9941.
Resolution: Done

> ann-benchmarks results for HNSW indexing
> ----------------------------------------
>
> Key: LUCENE-9941
> URL: https://issues.apache.org/jira/browse/LUCENE-9941
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Julie Tibshirani
> Priority: Minor
>
> This is a continuation of LUCENE-9937, but for HNSW index performance.
> Approaches
> * LuceneVectorsOnly: a baseline that only indexes vectors
> * LuceneHnsw: our HNSW implementation, with a force merge to one segment
> * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
> * hnswlib: a C++ HNSW implementation from the author of the paper
>
> Datasets
> * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128,
> comparing euclidean distance
> * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100,
> comparing cosine similarity
>
> *Results on sift-128-euclidean*
> Parameters: M=16, efConstruction=500
> {code:java}
> Approach                Index time (sec)
> LuceneVectorsOnly                  14.93
> LuceneHnsw                       3191.16
> LuceneHnswNoForceMerge           1194.31
> hnswlib                           311.09
> {code}
> *Results on glove-100-angular*
> Parameters: M=32, efConstruction=500
> {code:java}
> Approach                Index time (sec)
> LuceneVectorsOnly                  14.17
> LuceneHnsw                       8940.41
> LuceneHnswNoForceMerge           3623.68
> hnswlib                           587.23
> {code}
> We force merge to one segment to emulate a case where vectors aren't
> continually being indexed. In these situations, it seems likely users would
> force merge to optimize search speed: searching a single large graph is
> expected to be faster than searching several small ones serially. To see how
> long the force merge takes, we can subtract LuceneHnswNoForceMerge from
> LuceneHnsw.
> The construction parameters match those in LUCENE-9937 and are optimized for
> search recall + QPS instead of index speed, as I figured this would be a
> common set-up.
> Some observations:
> * In cases where segments are eventually force merged, we do a lot of extra
> work building intermediate graphs that are eventually merged away. This is a
> difficult problem, and one that's been raised in the past. As a simple step,
> I wonder if we should not build graphs for segments that are below a certain
> size. For sufficiently small segments, it could be a better trade-off to
> avoid building a graph and support nearest-neighbor search through a
> brute-force scan.
> * Indexing is slow compared to what we're used to for other formats, even if
> we disregard the extra work mentioned above. For sift-128-euclidean, building
> only the final graph takes ~33 min, whereas for glove-100-angular it's ~88
> min.
> * As a note, graph indexing uses ANN searches in order to add each new
> vector to the graph, so the slower search speed of Lucene relative to
> hnswlib may contribute to slower indexing.
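The "brute-force scan for small segments" idea in the observations can be sketched in plain Java. This is a hypothetical standalone helper, not Lucene code:

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.PriorityQueue;

public class BruteForceKnn {
    // Exact k-nearest-neighbor scan by squared Euclidean distance.
    // Returns the ordinals of the k closest vectors, nearest first.
    // Cost is O(n * dim) per query — fine for small segments, which is the
    // trade-off discussed above.
    static int[] search(float[][] vectors, float[] query, int k) {
        // Max-heap on distance: the worst of the current best-k sits on top.
        PriorityQueue<float[]> heap =
            new PriorityQueue<>(Comparator.comparingDouble((float[] e) -> e[0]).reversed());
        for (int ord = 0; ord < vectors.length; ord++) {
            float d = 0f;
            for (int j = 0; j < query.length; j++) {
                float diff = vectors[ord][j] - query[j];
                d += diff * diff;
            }
            if (heap.size() < k) {
                heap.add(new float[] {d, ord});
            } else if (d < heap.peek()[0]) {
                heap.poll();
                heap.add(new float[] {d, ord});
            }
        }
        int[] result = new int[heap.size()];
        for (int i = result.length - 1; i >= 0; i--) {
            result[i] = (int) heap.poll()[1]; // heap pops worst-first, so fill from the back
        }
        return result;
    }

    public static void main(String[] args) {
        float[][] vectors = {{0f, 0f}, {1f, 0f}, {5f, 5f}, {0.5f, 0f}};
        System.out.println(Arrays.toString(BruteForceKnn.search(vectors, new float[] {0f, 0f}, 2)));
    }
}
```

Unlike a graph, this needs no build step at flush time, which is exactly what makes it attractive for segments below some size threshold.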
[jira] [Resolved] (LUCENE-9937) ann-benchmarks results for HNSW search
[ https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani resolved LUCENE-9937.
Resolution: Done
[GitHub] [lucene-solr] thelabdude merged pull request #2552: SOLR-9853: Ability to project multi-valued fields in SQL query results
thelabdude merged pull request #2552:
URL: https://github.com/apache/lucene-solr/pull/2552

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
[jira] [Comment Edited] (LUCENE-9937) ann-benchmarks results for HNSW search
[ https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397415#comment-17397415 ] Julie Tibshirani edited comment on LUCENE-9937 at 8/11/21, 3:10 PM:

It turns out my benchmarks were way too generous to hnswlib! I was running them in "--batch" mode, which sends all queries at once. I reran hnswlib in the default "serial" mode, and now see results much closer to yours. Although I had disabled multi-threading, some parallelism must have snuck in. We're still behind in terms of recall, but the QPS gap is much smaller.

*Results on sift-128-euclidean*
Parameters: M=16, efConstruction=500
{code:java}
Approach                 Recall        QPS
LuceneVectorsOnly()      1.000       6.764
LuceneHnsw(n_cands=10)   0.603    9045.574
LuceneHnsw(n_cands=50)   0.890    3789.169
LuceneHnsw(n_cands=100)  0.942    2333.714
LuceneHnsw(n_cands=500)  0.996     579.850
LuceneHnsw(n_cands=800)  0.998     386.105
hnswlib(n_cands=10)      0.710   22628.980
hnswlib(n_cands=50)      0.950    8640.336
hnswlib(n_cands=100)     0.985    5026.611
hnswlib(n_cands=500)     1.000    1289.108
hnswlib(n_cands=800)     1.000     856.515
{code}
*Results on glove-100-angular*
Parameters: M=32, efConstruction=500
{code:java}
Approach                 Recall        QPS
LuceneVectorsOnly()      1.000       6.722
LuceneHnsw(n_cands=10)   0.507    5036.236
LuceneHnsw(n_cands=50)   0.760    2099.850
LuceneHnsw(n_cands=100)  0.833    1233.690
LuceneHnsw(n_cands=500)  0.941     309.077
LuceneHnsw(n_cands=800)  0.961     203.782
hnswlib(n_cands=10)      0.601   12877.861
hnswlib(n_cands=50)      0.833    4553.943
hnswlib(n_cands=100)     0.896    2618.098
hnswlib(n_cands=500)     0.981     634.301
hnswlib(n_cands=800)     0.991     412.333
{code}

I'm going to close out this issue since I think we've learned what we can from digging into these benchmarks. My thoughts on conclusions:
* Lucene HNSW gives a great speed-up over the baseline at good recall.
* We may be missing important algorithmic details from hnswlib that hurt recall and cause us to perform more computations.
* Hot spots: decoding vectors from their on-disk format and performing distance calculations.
> ann-benchmarks results for HNSW search > -- > > Key: LUCENE-9937 > URL: https://issues.apache.org/jira/browse/LUCENE-9937 > Project: Lucene - Core > Issue Type: Task >Reporter: Julie Tibshirani >Priority: Minor > > *NOTE: These benchmarks contained an error that made hnswlib perform too > well. See the last comment for correct results.* > I hooked up our HNSW implementation to > [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used > repo for benchmarking nearest neighbor search libraries against large > datasets. I found
[jira] [Updated] (LUCENE-9937) ann-benchmarks results for HNSW search
[ https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-9937: - Description: *NOTE: These benchmarks contained an error that made hnswlib perform too well. See the last comment for correct results.*

I hooked up our HNSW implementation to [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used repo for benchmarking nearest neighbor search libraries against large datasets. I found the results interesting and opened this issue to share and discuss. My benchmarking code can be found [here|https://github.com/jtibshirani/lucene/pull/1] – it's hacky and not easy to commit but I’m happy to help anyone get set up with it.

Approaches
* LuceneVectorsOnly: a baseline that only indexes vectors, and performs a brute force scan to determine nearest neighbors
* LuceneHnsw: our HNSW implementation
* [hnswlib|https://github.com/nmslib/hnswlib]: a C++ HNSW implementation from the author of the paper

Datasets
* sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, comparing euclidean distance
* glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, comparing cosine similarity

*Results on sift-128-euclidean*
Parameters: M=16, efConstruction=500
{code:java}
Approach                 Recall       QPS
LuceneVectorsOnly()      1.000      6.764
LuceneHnsw(n_cands=10)   0.603   7736.968
LuceneHnsw(n_cands=50)   0.890   3605.000
LuceneHnsw(n_cands=100)  0.953   2237.429
LuceneHnsw(n_cands=500)  0.996    570.900
LuceneHnsw(n_cands=800)  0.998    379.589
hnswlib(n_cands=10)      0.713  69662.756
hnswlib(n_cands=50)      0.950  28021.582
hnswlib(n_cands=100)     0.985  16108.538
hnswlib(n_cands=500)     1.000   4115.886
hnswlib(n_cands=800)     1.000   2729.680
{code}

*Results on glove-100-angular*
Parameters: M=32, efConstruction=500
{code:java}
Approach                 Recall       QPS
LuceneVectorsOnly()      1.000      6.722
LuceneHnsw(n_cands=10)   0.507   5036.236
LuceneHnsw(n_cands=50)   0.760   2099.850
LuceneHnsw(n_cands=100)  0.833   1233.690
LuceneHnsw(n_cands=500)  0.941    309.077
LuceneHnsw(n_cands=800)  0.961    203.782
hnswlib(n_cands=10)      0.597  43543.345
hnswlib(n_cands=50)      0.832  14719.165
hnswlib(n_cands=100)     0.897   8299.948
hnswlib(n_cands=500)     0.981   1931.985
hnswlib(n_cands=800)     0.991    881.752
{code}

Notes on the benchmark:
* By default, the ann-benchmarks repo retrieves 10 nearest neighbors and computes the recall against the true neighbors. The recall calculation has a small 'fudge factor' that allows neighbors that are within a small epsilon of the best distance. Queries are executed serially to obtain the QPS.
* I chose parameters where hnswlib performed well, then passed these same parameters to Lucene HNSW. For index-time parameters, I set maxConn as M and beamWidth as efConstruction. For search parameters, I set k to k, and fanout as (num_cands - k) so that the beam search is of size num_cands. Note that our default value for beamWidth is 16, which is really low – I wasn’t able to obtain acceptable recall until I bumped it closer to 500 to match the hnswlib default.
* I force-merged to one segment before running searches, since this gives the best recall + QPS and also matches hnswlib.

Some observations:
* It'd be really nice to extend luceneutil to measure vector search recall in addition to latency. That would help ensure we’re benchmarking a more realistic scenario, instead of accidentally indexing/searching at a very low recall. Tracking recall would also guard against subtle, unintentional changes to the algorithm. It's easy to make an accidental change while refactoring, and with approximate algorithms, unit tests don't guard as well against this.
* Lucene HNSW gives a great speed-up over the baseline without sacrificing too much recall. But it doesn't perform as well as hnswlib in terms of both recall and QPS. We wouldn’t expect the results to line up perfectly, since Lucene doesn't actually implement HNSW – the current algorithm isn't hierarchical and only uses a single graph layer. Does this difference indicate we're leaving performance 'on the table' by not using layers, which (I don't think) would add much index time or space? Or are there other algorithmic improvements that would help close the gap?
* Setting beamWidth to 500 *really* slowed down indexing. I'll open a separate issue with indexing speed results, keeping this one focused on search.
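The epsilon 'fudge factor' in the recall calculation mentioned in the notes can be made concrete with a small sketch. This is an illustration of how an ann-benchmarks-style recall with an epsilon allowance could be computed; the class and method names are ours, not from the benchmark code.

```java
import java.util.Arrays;

public class EpsilonRecall {
    /**
     * Fraction of approximate results whose distance is within (1 + epsilon)
     * of the k-th true nearest distance. Illustrative sketch of the 'fudge
     * factor' idea, not the actual ann-benchmarks implementation.
     */
    public static double recall(double[] trueDists, double[] approxDists, double epsilon) {
        // trueDists holds the sorted distances of the true k nearest neighbors,
        // so the last entry is the k-th best distance.
        double threshold = trueDists[trueDists.length - 1] * (1 + epsilon);
        long withinThreshold = Arrays.stream(approxDists).filter(d -> d <= threshold).count();
        return (double) withinThreshold / trueDists.length;
    }

    public static void main(String[] args) {
        double[] trueDists = {1.0, 2.0, 3.0};   // distances of the true 3-NN
        double[] approxDists = {1.0, 2.0, 3.1}; // distances from the ANN index
        // 3.1 is within 5% of the 3rd-best distance, so it still counts.
        System.out.println(recall(trueDists, approxDists, 0.05)); // prints 1.0
    }
}
```

Without the epsilon allowance, the third result (3.1 vs. 3.0) would be counted as a miss even though it is effectively as good as the true neighbor.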
[GitHub] [lucene-solr] thelabdude opened a new pull request #2552: SOLR-9853: Ability to project multi-valued fields in SQL query results
thelabdude opened a new pull request #2552: URL: https://github.com/apache/lucene-solr/pull/2552 Backport of https://github.com/apache/solr/pull/252 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a change in pull request #235: LUCENE-9614: add KnnVectorQuery implementation
msokolov commented on a change in pull request #235: URL: https://github.com/apache/lucene/pull/235#discussion_r686891113 ## File path: lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java ## @@ -43,9 +43,9 @@ public float compare(float[] v1, float[] v2) { }; /** - * If true, the scores associated with vector comparisons are in reverse order; that is, lower - * scores represent more similar vectors. Otherwise, if false, higher scores represent more - * similar vectors. + * If true, the scores associated with vector comparisons are nonnegative and in reverse order; + * that is, lower scores represent more similar vectors. Otherwise, if false, higher scores + * represent more similar vectors, and scores may be negative or positive. Review comment: heh, in fact this edit was never needed since we never changed the similarity scores (as I mentioned above), and only the scores returned from search... so I'll indeed revert -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
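The 'reversed' score ordering being discussed is easy to demonstrate: for euclidean-style comparisons, a smaller raw value means a closer vector, so "lower score" means "more similar". A minimal sketch with illustrative names, not Lucene's actual VectorSimilarityFunction API:

```java
public class ReversedScores {
    // Squared euclidean distance: lower values mean *more* similar vectors,
    // i.e. the "reversed" ordering discussed in the review thread.
    public static float squaredDistance(float[] a, float[] b) {
        float sum = 0;
        for (int i = 0; i < a.length; i++) {
            float d = a[i] - b[i];
            sum += d * d;
        }
        return sum;
    }

    public static void main(String[] args) {
        float[] query = {1f, 0f};
        float[] near = {1f, 0.1f};
        float[] far = {0f, 1f};
        // The nearer vector gets the lower score.
        System.out.println(squaredDistance(query, near) < squaredDistance(query, far)); // prints true
    }
}
```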
[jira] [Updated] (LUCENE-10043) Decrease default for LRUQueryCache#skipCacheFactor?
[ https://issues.apache.org/jira/browse/LUCENE-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani updated LUCENE-10043: -- Fix Version/s: 8.10 > Decrease default for LRUQueryCache#skipCacheFactor? > --- > > Key: LUCENE-10043 > URL: https://issues.apache.org/jira/browse/LUCENE-10043 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Minor > Fix For: 8.10 > > Time Spent: 1h 20m > Remaining Estimate: 0h > > In LUCENE-9002 we introduced logic to skip caching a clause if it would be > too expensive compared to the usual query cost. Specifically, we avoid > caching a clause if its cost is estimated to be a factor higher than the lead > iterator's: > {code} > // skip cache operation which would slow query down too much > if (cost / skipCacheFactor > leadCost) { > return supplier.get(leadCost); > } > {code} > Choosing good defaults is hard! We've seen some examples in Elasticsearch > where caching a query clause causes a major slowdown, contributing to poor > tail latencies. It made me think that the default 'skipCacheFactor' of 250 > may be too high -- interpreted simply, this means we'll cache a clause even > if it is ~250 times more expensive than running the top-level query on its > own. Would it make sense to decrease this to 10 or so? It seems okay to err > on the side of less caching for individual clauses, especially since any > parent 'BooleanQuery' is already eligible for caching? > As a note, the interpretation "~250 times more expensive than running the > top-level query on its own" isn't perfectly accurate. The true cost doesn't > depend only on the number of matched documents, but also on the cost of > matching itself. Making it even more complex, some queries like > 'IndexOrDocValuesQuery' have different matching strategies based on whether > they're used as a lead iterator or verifier. 
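The quoted heuristic boils down to a single comparison. A minimal standalone sketch of the decision, with constant and method names that are illustrative rather than Lucene's actual LRUQueryCache internals:

```java
public class SkipCacheDemo {
    // Sketch of the caching heuristic quoted above. The factor here reflects
    // the new default of 10 proposed in this issue (previously 250).
    static final int SKIP_CACHE_FACTOR = 10;

    // Skip caching when the clause is estimated to be more than
    // SKIP_CACHE_FACTOR times as expensive as the lead iterator.
    public static boolean shouldSkipCaching(long clauseCost, long leadCost) {
        return clauseCost / SKIP_CACHE_FACTOR > leadCost;
    }

    public static void main(String[] args) {
        // A clause 250x the lead cost: cached under the old default of 250,
        // but skipped under the new default of 10.
        System.out.println(shouldSkipCaching(250_000, 1_000)); // prints true
        // A clause only 5x the lead cost is still considered cheap enough to cache.
        System.out.println(shouldSkipCaching(5_000, 1_000));   // prints false
    }
}
```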
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10043) Decrease default for LRUQueryCache#skipCacheFactor?
[ https://issues.apache.org/jira/browse/LUCENE-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Julie Tibshirani resolved LUCENE-10043. --- Resolution: Fixed > Decrease default for LRUQueryCache#skipCacheFactor? > --- > > Key: LUCENE-10043 > URL: https://issues.apache.org/jira/browse/LUCENE-10043 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Minor > Time Spent: 1h 20m > Remaining Estimate: 0h > 
-- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10035) Simple text codec add multi level skip list data
[ https://issues.apache.org/jira/browse/LUCENE-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuda updated LUCENE-10035: -- Attachment: (was: LUCENE-10035.patch) > Simple text codec add multi level skip list data > -- > > Key: LUCENE-10035 > URL: https://issues.apache.org/jira/browse/LUCENE-10035 > Project: Lucene - Core > Issue Type: Wish > Components: core/codecs >Affects Versions: main (9.0) >Reporter: wuda >Priority: Major > Labels: Impact, MultiLevelSkipList, SimpleTextCodec > Attachments: LUCENE-10035.patch > > Time Spent: 2h > Remaining Estimate: 0h > > The SimpleText codec adds skip list data (including impacts) to help explain the > index format; it is for debugging, curiosity, and transparency only! When a term's > docFreq is greater than or equal to SimpleTextSkipWriter.BLOCK_SIZE (default > value is 8), the SimpleText codec will write a skip list, and the *.pst (simple text > term dictionary)* file will look like this: > {code:java} > field title > term args > doc 2 > freq 2 > pos 7 > pos 10 > ## we omit docs for better view .. > doc 98 > freq 2 > pos 2 > pos 6 > skipList > ? > level 1 > skipDoc 65 > skipDocFP 949 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 13 > impacts_end > ? 
> level 0 > skipDoc 17 > skipDocFP 284 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impacts_end > skipDoc 34 > skipDocFP 624 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 14 > impacts_end > skipDoc 65 > skipDocFP 949 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 12 > impact > freq 3 > norm 13 > impacts_end > skipDoc 90 > skipDocFP 1311 > impacts > impact > freq 1 > norm 2 > impact > freq 2 > norm 10 > impact > freq 3 > norm 13 > impact > freq 4 > norm 14 > impacts_end > END > checksum 000829315543 > {code} > Compared with the previous format, we add *skipList, level, skipDoc, skipDocFP, impacts, > impact, freq, norm* nodes; at the same time, the SimpleText codec can support > advanceShallow at search time. > > h2. Why are there question mark symbols in the file? > Because *MultiLevelSkipListWriter* writes "length" and "childPointer" > as VLongs. > h1. Does this speed up the search process? > No! It supports advanceShallow at search time, but why doesn't it speed up search? Because > the skip list data comes after the docs (see the file described before), we must iterate > over all docs before reading the skip list data, so it never speeds up search time. There is > no "skipOffset" to read the skip list data directly, but as mentioned before, this codec is > for debugging, curiosity, and transparency only! If this is a problem, maybe > next time I can add a "skipOffset" so we can read the skip list data directly. > > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
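The multi-level structure in the dump above is read top-down: start at the sparsest level and keep the farthest skip point that still precedes the target doc. A minimal sketch of that lookup over the skipDoc values from the example; this is an illustration, not Lucene's MultiLevelSkipListReader (in particular it ignores the file pointers and child pointers).

```java
public class SkipListDemo {
    // skipDocs[level][i] = last doc id covered by skip entry i at that level.
    // Level 0 is the densest level, higher indices are sparser, and entries
    // within a level are sorted by doc id.
    public static int lastSkipDocBefore(int[][] skipDocs, int target) {
        int best = -1;
        // Scan from the top (sparsest) level down, keeping the farthest
        // skip point that is still before the target doc.
        for (int level = skipDocs.length - 1; level >= 0; level--) {
            for (int doc : skipDocs[level]) {
                if (doc >= target) {
                    break; // entries are sorted; later ones can't help
                }
                best = Math.max(best, doc);
            }
        }
        return best;
    }

    public static void main(String[] args) {
        // From the example dump: level 1 has skipDoc 65; level 0 has 17, 34, 65, 90.
        int[][] skipDocs = { {17, 34, 65, 90}, {65} };
        // To reach doc 70, the best skip point is doc 65 (skipDocFP 949).
        System.out.println(lastSkipDocBefore(skipDocs, 70)); // prints 65
    }
}
```

In the real format, each skip entry's skipDocFP would then be used to seek into the postings, which is exactly what the SimpleText layout above cannot do without a skipOffset.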
[jira] [Updated] (LUCENE-10035) Simple text codec add multi level skip list data
[ https://issues.apache.org/jira/browse/LUCENE-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] wuda updated LUCENE-10035: -- Attachment: LUCENE-10035.patch > Simple text codec add multi level skip list data > -- > > Key: LUCENE-10035 > URL: https://issues.apache.org/jira/browse/LUCENE-10035 > Project: Lucene - Core > Issue Type: Wish > Components: core/codecs >Affects Versions: main (9.0) >Reporter: wuda >Priority: Major > Labels: Impact, MultiLevelSkipList, SimpleTextCodec > Attachments: LUCENE-10035.patch > > Time Spent: 2h > Remaining Estimate: 0h > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on a change in pull request #235: LUCENE-9614: add KnnVectorQuery implementation
jtibshirani commented on a change in pull request #235: URL: https://github.com/apache/lucene/pull/235#discussion_r686817275 ## File path: lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java ## @@ -43,9 +43,9 @@ public float compare(float[] v1, float[] v2) { }; /** - * If true, the scores associated with vector comparisons are in reverse order; that is, lower - * scores represent more similar vectors. Otherwise, if false, higher scores represent more - * similar vectors. + * If true, the scores associated with vector comparisons are nonnegative and in reverse order; + * that is, lower scores represent more similar vectors. Otherwise, if false, higher scores + * represent more similar vectors, and scores may be negative or positive. Review comment: Oh I had misunderstood this docs update. Sounds good to save your idea for a follow-up. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10043) Decrease default for LRUQueryCache#skipCacheFactor?
[ https://issues.apache.org/jira/browse/LUCENE-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397339#comment-17397339 ] ASF subversion and git services commented on LUCENE-10043: -- Commit 2e3620fe0a70d7e1dc3261112ea314ab5512bd3f in lucene-solr's branch refs/heads/branch_8x from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2e3620f ] LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10 In LUCENE-9002 we introduced logic to skip caching a clause if it would be too expensive compared to the usual query cost. Specifically, we avoid caching a clause if its cost is estimated to be a 250x higher than the lead iterator's. We've found that the default of 250 is quite high and can lead to poor tail latencies. This PR decreases it to 10 to cache more conservatively > Decrease default for LRUQueryCache#skipCacheFactor? > --- > > Key: LUCENE-10043 > URL: https://issues.apache.org/jira/browse/LUCENE-10043 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Minor > Time Spent: 1h 20m > Remaining Estimate: 0h > > In LUCENE-9002 we introduced logic to skip caching a clause if it would be > too expensive compared to the usual query cost. Specifically, we avoid > caching a clause if its cost is estimated to be a factor higher than the lead > iterator's: > {code} > // skip cache operation which would slow query down too much > if (cost / skipCacheFactor > leadCost) { > return supplier.get(leadCost); > } > {code} > Choosing good defaults is hard! We've seen some examples in Elasticsearch > where caching a query clause causes a major slowdown, contributing to poor > tail latencies. It made me think that the default 'skipCacheFactor' of 250 > may be too high -- interpreted simply, this means we'll cache a clause even > if it is ~250 times more expensive than running the top-level query on its > own. Would it make sense to decrease this to 10 or so? 
It seems okay to err > on the side of less caching for individual clauses, especially since any > parent 'BooleanQuery' is already eligible for caching? > As a note, the interpretation "~250 times more expensive than running the > top-level query on its own" isn't perfectly accurate. The true cost doesn't > depend only on the number of matched documents, but also on the cost of matching > itself. Making it even more complex, some queries like > 'IndexOrDocValuesQuery' have different matching strategies based on whether > they're used as a lead iterator or verifier.
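The cost ratio check quoted in the issue can be made concrete with a standalone sketch (the class, method name, and numbers below are illustrative; the real check lives inside LRUQueryCache, where skipCacheFactor need not be an integer):

```java
// Illustrative sketch of the LRUQueryCache skip-caching decision:
// caching is skipped when the clause's cost exceeds skipCacheFactor
// times the cost of the lead iterator driving the top-level query.
public class SkipCacheDecision {
  static boolean shouldSkipCaching(long cost, long leadCost, int skipCacheFactor) {
    // skip cache operation which would slow query down too much
    return cost / skipCacheFactor > leadCost;
  }

  public static void main(String[] args) {
    long leadCost = 1_000;     // e.g. a selective term query
    long clauseCost = 100_000; // e.g. a broad range query, 100x more expensive
    // With the old default of 250, the 100x-more-expensive clause is still cached:
    System.out.println(shouldSkipCaching(clauseCost, leadCost, 250)); // false
    // With the new default of 10, caching is skipped:
    System.out.println(shouldSkipCaching(clauseCost, leadCost, 10));  // true
  }
}
```

This shows why the change caches more conservatively: a clause now has to be only ~10x (rather than ~250x) more expensive than the lead iterator before caching is skipped.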
[jira] [Commented] (LUCENE-9002) query caching leads to absurdly slow queries
[ https://issues.apache.org/jira/browse/LUCENE-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397340#comment-17397340 ] ASF subversion and git services commented on LUCENE-9002: - Commit 2e3620fe0a70d7e1dc3261112ea314ab5512bd3f in lucene-solr's branch refs/heads/branch_8x from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2e3620f ] LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10 In LUCENE-9002 we introduced logic to skip caching a clause if it would be too expensive compared to the usual query cost. Specifically, we avoid caching a clause if its cost is estimated to be a 250x higher than the lead iterator's. We've found that the default of 250 is quite high and can lead to poor tail latencies. This PR decreases it to 10 to cache more conservatively > query caching leads to absurdly slow queries > > > Key: LUCENE-9002 > URL: https://issues.apache.org/jira/browse/LUCENE-9002 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: 7.7.2, 8.2 >Reporter: Guoqiang Jiang >Priority: Major > Labels: cache, performance > Time Spent: 9h 20m > Remaining Estimate: 0h > > *Description* > We have dozens of ES clusters(based on Lucene) for metric scenarios. Most of > the queries are like this: _host_ip:10.10.10.10 AND timestamp:[2019-10-01 > 00:00:00 TO 2019-10-05 23:59:59]_. And we frequently encounter some absurdly > slow queries. > *Solution* > For a long time range query(e.g. 5 days), each range query will consume tens > of megabytes of memory and spend hundreds of milliseconds to cache, but the > benefits are not obvious. And those large cache entries will cause frequent > cache eviction. So it's better to skip the caching action directly when > large range query appears with a selective lead iterator. 
[GitHub] [lucene] jtibshirani commented on a change in pull request #232: LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10
jtibshirani commented on a change in pull request #232: URL: https://github.com/apache/lucene/pull/232#discussion_r686810436 ## File path: lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java ## @@ -146,7 +146,7 @@ public LRUQueryCache( * of the top-level query will be cached in order to not hurt latency too much because of caching. */ public LRUQueryCache(int maxSize, long maxRamBytesUsed) { -this(maxSize, maxRamBytesUsed, new MinSegmentSizePredicate(1, .03f), 250); +this(maxSize, maxRamBytesUsed, new MinSegmentSizePredicate(1, .03f), 10); Review comment: I looked into this and now think the change isn't worth it -- it requires a new method on `QueryCachingPolicy` but doesn't help clarify or clean-up very much. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on a change in pull request #235: LUCENE-9614: add KnnVectorQuery implementation
msokolov commented on a change in pull request #235: URL: https://github.com/apache/lucene/pull/235#discussion_r686734774 ## File path: lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java ## @@ -43,9 +43,9 @@ public float compare(float[] v1, float[] v2) { }; /** - * If true, the scores associated with vector comparisons are in reverse order; that is, lower - * scores represent more similar vectors. Otherwise, if false, higher scores represent more - * similar vectors. + * If true, the scores associated with vector comparisons are nonnegative and in reverse order; + * that is, lower scores represent more similar vectors. Otherwise, if false, higher scores + * represent more similar vectors, and scores may be negative or positive. Review comment: So the current patch did not change the `VectorSimilarity.compare` function implementations. It does the `1/1+x` normalization in `VectorsReader.search`, so I think the comment is still valid. I suppose we could impose a requirement on similarity functions that they always return [0,1] in ascending-similar order and eliminate the whole notion of reversed similarities. That would be a nice simplification to the API, and I think we've shown that it is achievable. But if we do decide that's best, I'd like to do it separately since it will touch a bunch of places that aren't changed yet in this PR. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
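The `1 / (1 + x)` normalization discussed in this thread can be sketched in isolation (a standalone illustrative class, not the actual VectorsReader.search code):

```java
// Illustrative sketch: converting a reversed similarity (lower = more
// similar, e.g. a nonnegative Euclidean distance) into a score where
// higher = more similar. A distance d in [0, inf) maps to a score in (0, 1].
public class ScoreNormalization {
  static float normalizeReversed(float distance) {
    return 1f / (1f + distance);
  }

  public static void main(String[] args) {
    System.out.println(normalizeReversed(0f)); // identical vectors -> 1.0
    System.out.println(normalizeReversed(3f)); // more distant -> 0.25
  }
}
```

This is what makes the reversed-order similarities usable as ascending document scores without changing the `compare` implementations themselves, as msokolov notes above.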
[jira] [Commented] (LUCENE-10043) Decrease default for LRUQueryCache#skipCacheFactor?
[ https://issues.apache.org/jira/browse/LUCENE-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397258#comment-17397258 ] ASF subversion and git services commented on LUCENE-10043: -- Commit a9fb5a965dffd2600f7da7700e6429ebf19e14d6 in lucene's branch refs/heads/main from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a9fb5a9 ] LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10 (#232) In LUCENE-9002 we introduced logic to skip caching a clause if it would be too expensive compared to the usual query cost. Specifically, we avoid caching a clause if its cost is estimated to be a 250x higher than the lead iterator's. We've found that the default of 250 is quite high and can lead to poor tail latencies. This PR decreases it to 10 to cache more conservatively. > Decrease default for LRUQueryCache#skipCacheFactor? > --- > > Key: LUCENE-10043 > URL: https://issues.apache.org/jira/browse/LUCENE-10043 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Minor > Time Spent: 1h > Remaining Estimate: 0h > > In LUCENE-9002 we introduced logic to skip caching a clause if it would be > too expensive compared to the usual query cost. Specifically, we avoid > caching a clause if its cost is estimated to be a factor higher than the lead > iterator's: > {code} > // skip cache operation which would slow query down too much > if (cost / skipCacheFactor > leadCost) { > return supplier.get(leadCost); > } > {code} > Choosing good defaults is hard! We've seen some examples in Elasticsearch > where caching a query clause causes a major slowdown, contributing to poor > tail latencies. It made me think that the default 'skipCacheFactor' of 250 > may be too high -- interpreted simply, this means we'll cache a clause even > if it is ~250 times more expensive than running the top-level query on its > own. Would it make sense to decrease this to 10 or so? 
It seems okay to err > on the side of less caching for individual clauses, especially since any > parent 'BooleanQuery' is already eligible for caching? > As a note, the interpretation "~250 times more expensive than running the > top-level query on its own" isn't perfectly accurate. The true cost doesn't > depend only on the number of matched documents, but also on the cost of matching > itself. Making it even more complex, some queries like > 'IndexOrDocValuesQuery' have different matching strategies based on whether > they're used as a lead iterator or verifier.
[jira] [Commented] (LUCENE-9002) query caching leads to absurdly slow queries
[ https://issues.apache.org/jira/browse/LUCENE-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397259#comment-17397259 ] ASF subversion and git services commented on LUCENE-9002: - Commit a9fb5a965dffd2600f7da7700e6429ebf19e14d6 in lucene's branch refs/heads/main from Julie Tibshirani [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a9fb5a9 ] LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10 (#232) In LUCENE-9002 we introduced logic to skip caching a clause if it would be too expensive compared to the usual query cost. Specifically, we avoid caching a clause if its cost is estimated to be a 250x higher than the lead iterator's. We've found that the default of 250 is quite high and can lead to poor tail latencies. This PR decreases it to 10 to cache more conservatively. > query caching leads to absurdly slow queries > > > Key: LUCENE-9002 > URL: https://issues.apache.org/jira/browse/LUCENE-9002 > Project: Lucene - Core > Issue Type: Improvement > Components: core/search >Affects Versions: 7.7.2, 8.2 >Reporter: Guoqiang Jiang >Priority: Major > Labels: cache, performance > Time Spent: 9h 20m > Remaining Estimate: 0h > > *Description* > We have dozens of ES clusters(based on Lucene) for metric scenarios. Most of > the queries are like this: _host_ip:10.10.10.10 AND timestamp:[2019-10-01 > 00:00:00 TO 2019-10-05 23:59:59]_. And we frequently encounter some absurdly > slow queries. > *Solution* > For a long time range query(e.g. 5 days), each range query will consume tens > of megabytes of memory and spend hundreds of milliseconds to cache, but the > benefits are not obvious. And those large cache entries will cause frequent > cache eviction. So it's better to skip the caching action directly when > large range query appears with a selective lead iterator. 
[GitHub] [lucene] jtibshirani merged pull request #232: LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10
jtibshirani merged pull request #232: URL: https://github.com/apache/lucene/pull/232
[GitHub] [lucene] jtibshirani commented on pull request #232: LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10
jtibshirani commented on pull request #232: URL: https://github.com/apache/lucene/pull/232#issuecomment-896706043 Thanks for the review.
[GitHub] [lucene] jtibshirani commented on a change in pull request #232: LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10
jtibshirani commented on a change in pull request #232: URL: https://github.com/apache/lucene/pull/232#discussion_r686703128 ## File path: lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java ## @@ -146,7 +146,7 @@ public LRUQueryCache( * of the top-level query will be cached in order to not hurt latency too much because of caching. */ public LRUQueryCache(int maxSize, long maxRamBytesUsed) { -this(maxSize, maxRamBytesUsed, new MinSegmentSizePredicate(1, .03f), 250); +this(maxSize, maxRamBytesUsed, new MinSegmentSizePredicate(1, .03f), 10); Review comment: I plan to do this refactor in a follow-up on 9.0 only, to avoid changing the APIs too much on 8.x. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jtibshirani commented on a change in pull request #235: LUCENE-9614: add KnnVectorQuery implementation
jtibshirani commented on a change in pull request #235: URL: https://github.com/apache/lucene/pull/235#discussion_r686637119 ## File path: lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java ## @@ -43,9 +43,9 @@ public float compare(float[] v1, float[] v2) { }; /** - * If true, the scores associated with vector comparisons are in reverse order; that is, lower - * scores represent more similar vectors. Otherwise, if false, higher scores represent more - * similar vectors. + * If true, the scores associated with vector comparisons are nonnegative and in reverse order; + * that is, lower scores represent more similar vectors. Otherwise, if false, higher scores + * represent more similar vectors, and scores may be negative or positive. Review comment: I think we can revert part of this since you went with `1 / (1 + x)`. ## File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java ## @@ -0,0 +1,318 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. 
+ */ +package org.apache.lucene.search; + +import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS; + +import java.io.IOException; +import java.util.Arrays; +import java.util.Comparator; +import java.util.Objects; +import org.apache.lucene.codecs.KnnVectorsReader; +import org.apache.lucene.document.KnnVectorField; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.LeafReaderContext; + +/** Uses {@link KnnVectorsReader#search} to perform nearest neighbour search. */ +public class KnnVectorQuery extends Query { + + private static final TopDocs NO_RESULTS = + new TopDocs(new TotalHits(0, TotalHits.Relation.EQUAL_TO), new ScoreDoc[0]); + + private final String field; + private final float[] target; + private final int k; + + /** + * Find the k nearest documents to the target vector according to the vectors in the + * given field. + * + * @param field a field that has been indexed as a {@link KnnVectorField}. + * @param target the target of the search + * @param k the number of documents to find + * @throws IllegalArgumentException if k is less than 1 + */ + public KnnVectorQuery(String field, float[] target, int k) { +this.field = field; +this.target = target; +this.k = k; +if (k < 1) { + throw new IllegalArgumentException("k must be at least 1, got: " + k); +} + } + + @Override + public Query rewrite(IndexReader reader) throws IOException { +int boundedK = Math.min(k, reader.numDocs()); +TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()]; +for (LeafReaderContext ctx : reader.leaves()) { + // Calculate kPerLeaf as an overestimate of the expected number of the closest k documents in + // this leaf + int expectedKPerLeaf = Math.max(1, boundedK * ctx.reader().numDocs() / reader.numDocs()); + // Increase to include 3 std. deviations of a Binomial distribution. 
+ int kPerLeaf = (int) (expectedKPerLeaf + 3 * Math.sqrt(expectedKPerLeaf)); + perLeafResults[ctx.ord] = searchLeaf(ctx, kPerLeaf); +} +// Merge sort the results +TopDocs topK = TopDocs.merge(boundedK, perLeafResults); +// re-query any outlier segments (normally there should be none). +topK = checkForOutlierSegments(reader, topK, perLeafResults); +if (topK.scoreDocs.length == 0) { + return new MatchNoDocsQuery(); +} +return createRewrittenQuery(reader, topK); + } + + private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf) throws IOException { +TopDocs results = ctx.reader().searchNearestVectors(field, target, kPerLeaf); +if (results == null) { + return NO_RESULTS; +} +if (ctx.docBase > 0) { + for (ScoreDoc scoreDoc : results.scoreDocs) { +scoreDoc.doc += ctx.docBase; + } +} +return results; + } + + private TopDocs checkForOutlierSegments(IndexReader reader, TopDocs topK, TopDocs[] perLeaf) + throws IOException { +int k = topK.scoreDocs.length; +if (k == 0) { + return topK; +} +float minScore = topK.scoreDocs[topK.scoreDocs.length - 1].score; +boolean rescored =
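The per-leaf k overestimate in the quoted rewrite can be checked with concrete numbers (standalone sketch extracted from the quoted arithmetic; the segment sizes are invented for illustration):

```java
// Sketch of the per-leaf k calculation from the quoted KnnVectorQuery#rewrite:
// scale k by the leaf's share of documents, then pad by 3 standard deviations
// of a Binomial distribution so a leaf is unlikely to under-deliver results.
public class PerLeafK {
  static int kPerLeaf(int k, int leafDocs, int totalDocs) {
    int expected = Math.max(1, k * leafDocs / totalDocs);
    return (int) (expected + 3 * Math.sqrt(expected));
  }

  public static void main(String[] args) {
    // k = 100 over a 1M-doc index split into a 500k leaf and a 250k leaf:
    System.out.println(kPerLeaf(100, 500_000, 1_000_000)); // expected 50, padded to 71
    System.out.println(kPerLeaf(100, 250_000, 1_000_000)); // expected 25, padded to 40
  }
}
```

The padding is what makes the follow-up outlier check above rarely necessary: each leaf is asked for comfortably more than its expected share of the global top k before results are merge-sorted.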