[jira] [Commented] (LUCENE-10002) Remove IndexSearcher#search(Query,Collector) in favor of IndexSearcher#search(Query,CollectorManager)

2021-08-11 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397827#comment-17397827
 ] 

Zach Chen commented on LUCENE-10002:


Hi [~jpountz] [~gsmiller], I have created a PR for this that deprecates the 
collector API in favor of the collector manager API, along with some initial 
refactoring of tests and of the parts of DrillSideways that use 
TopScoreDocCollector & TopFieldCollector so that they use the latter API. I 
plan to submit more PRs afterward for other areas of the codebase.

Please note that I did first try to remove the collector API entirely, but that 
ended up requiring far more changes than I'm comfortable with in a single PR, 
and I also feel this API is so commonly used that users may prefer a more 
gradual deprecation / transition period. Hence I reverted my previous effort 
and adopted a phased approach.
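
For anyone following along, here is a minimal sketch of the kind of migration 
this encourages, assuming only the existing CollectorManager interface and 
TotalHitCountCollector (the dedicated manager classes are only being extracted 
in the PR): instead of passing a Collector, callers supply a CollectorManager 
so an IndexSearcher built with an executor can search slices concurrently and 
then reduce the per-slice results.

{code:java}
import java.io.IOException;
import java.util.Collection;
import org.apache.lucene.search.CollectorManager;
import org.apache.lucene.search.TotalHitCountCollector;

// Sketch only: one collector is created per slice/thread, and reduce() merges them.
final class TotalHitsManager implements CollectorManager<TotalHitCountCollector, Integer> {
  @Override
  public TotalHitCountCollector newCollector() {
    return new TotalHitCountCollector();
  }

  @Override
  public Integer reduce(Collection<TotalHitCountCollector> collectors) throws IOException {
    int total = 0;
    for (TotalHitCountCollector collector : collectors) {
      total += collector.getTotalHits();
    }
    return total;
  }
}

// Usage: int totalHits = searcher.search(query, new TotalHitsManager());
// instead of: searcher.search(query, new TotalHitCountCollector());
{code}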

> Remove IndexSearcher#search(Query,Collector) in favor of 
> IndexSearcher#search(Query,CollectorManager)
> -
>
> Key: LUCENE-10002
> URL: https://issues.apache.org/jira/browse/LUCENE-10002
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> It's a bit trappy that you can create an IndexSearcher with an executor, but 
> that it would always search on the caller thread when calling 
> {{IndexSearcher#search(Query,Collector)}}.
>  Let's remove {{IndexSearcher#search(Query,Collector)}}, point our users to 
> {{IndexSearcher#search(Query,CollectorManager)}} instead, and change factory 
> methods of our main collectors (e.g. {{TopScoreDocCollector#create}}) to 
> return a {{CollectorManager}} instead of a {{Collector}}?






[GitHub] [lucene] zacharymorn opened a new pull request #240: LUCENE-10002: Deprecate IndexSearch#search(Query, Collector) in favor of IndexSearcher#search(Query, CollectorManager)

2021-08-11 Thread GitBox


zacharymorn opened a new pull request #240:
URL: https://github.com/apache/lucene/pull/240


   # Description / Solution
   Deprecate IndexSearcher#search(Query, Collector) in favor of 
IndexSearcher#search(Query, CollectorManager), with the following changes:
   1. Refactor out TopScoreDocCollectorManager, TopFieldCollectorManager, 
TotalHitCountCollectorManager and FixedBitSetCollector
   2. Refactor some tests to use the above collector managers
   3. Refactor DrillSideways to use the extracted collector managers
   
   # Tests
   Passed updated tests with `./gradlew clean; ./gradlew check 
-Ptests.nightly=true` (currently there are nocommit messages that fail 
precommit, though)
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   





[jira] [Commented] (LUCENE-10048) Bypass total frequency check if field uses custom term frequency

2021-08-11 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397782#comment-17397782
 ] 

Robert Muir commented on LUCENE-10048:
--

btw, there are a lot of alternatives you can look into to avoid having > 2^32 
tokens inside a single document's field. For example, you could use more 
fields, you could encode the thing in the payload, etc.

But I don't understand what you are doing; to me it honestly sounds like 
Lucene may not be the right solution at all. Or maybe it is worth looking at 
the Lucene 9 vector format or something very different.
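
As a rough illustration of the payload alternative (a sketch only; the filter 
name is made up and the per-token weight is assumed to come from the 
application): the weight is stored in the token's payload rather than as a 
term frequency, so it never feeds into the overflow-prone field-length 
statistics.

{code:java}
import java.io.IOException;
import org.apache.lucene.analysis.TokenFilter;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.tokenattributes.PayloadAttribute;
import org.apache.lucene.util.BytesRef;

// Hypothetical filter: stores a large per-token weight in the payload instead of
// overloading the custom term frequency.
final class WeightPayloadFilter extends TokenFilter {
  private final PayloadAttribute payloadAtt = addAttribute(PayloadAttribute.class);
  private final long weight;

  WeightPayloadFilter(TokenStream in, long weight) {
    super(in);
    this.weight = weight;
  }

  @Override
  public boolean incrementToken() throws IOException {
    if (input.incrementToken() == false) {
      return false;
    }
    byte[] bytes = new byte[Long.BYTES];
    for (int i = 0; i < Long.BYTES; i++) {
      bytes[i] = (byte) (weight >>> (56 - 8 * i)); // big-endian encoding
    }
    payloadAtt.setPayload(new BytesRef(bytes)); // decoded from postings at query time
    return true;
  }
}
{code}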

> Bypass total frequency check if field uses custom term frequency
> 
>
> Key: LUCENE-10048
> URL: https://issues.apache.org/jira/browse/LUCENE-10048
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tony Xu
>Priority: Minor
>
> For all fields whose index option is not *IndexOptions.NONE*, there is a 
> check on the per-field total token count (i.e. field length) to ensure we 
> don't index too many tokens. This is done by accumulating each token's 
> *TermFrequencyAttribute*.
>  
> Given that Lucene currently allows a custom term frequency to be attached to 
> each token, and that the usage of the frequency can be pretty wild, it is 
> possible for the check to fail with only a few tokens that have large 
> frequencies. When that happens, Lucene will skip indexing the whole 
> document.
> *"foo| bar|"*
>  
> What should be the way to inform the indexing chain not to check the field 
> length?
> A related observation: when custom term frequency is in use, the user is not 
> likely to use the similarity for this field. Maybe we can offer a way to 
> specify that, too?
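
For readers unfamiliar with the feature under discussion, a sketch of how such 
frequencies are typically attached (the example numbers in the comment are 
made up; the elided values in the quoted snippet above are unknown): tokens of 
the form "term|freq" carry an application-supplied frequency, and two tokens 
like "foo|2000000000 bar|2000000000" already sum past Integer.MAX_VALUE.

{code:java}
import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.core.WhitespaceTokenizer;
import org.apache.lucene.analysis.miscellaneous.DelimitedTermFrequencyTokenFilter;

final class CustomTermFreqAnalyzer extends Analyzer {
  // The filter strips the "|freq" suffix from each whitespace-separated token and
  // exposes it via TermFrequencyAttribute, which the field-length check accumulates.
  @Override
  protected TokenStreamComponents createComponents(String fieldName) {
    WhitespaceTokenizer tokenizer = new WhitespaceTokenizer();
    TokenStream filtered = new DelimitedTermFrequencyTokenFilter(tokenizer);
    return new TokenStreamComponents(tokenizer, filtered);
  }
}
{code}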






[jira] [Commented] (LUCENE-10048) Bypass total frequency check if field uses custom term frequency

2021-08-11 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397774#comment-17397774
 ] 

Robert Muir commented on LUCENE-10048:
--

Whether or not you use the similarity or scorer is irrelevant. I'm referring to 
the term/field statistic values stored in the segment itself.
It doesn't matter which parts of Lucene you do or don't use here; overflowing 
these values will essentially behave as corruption.

That's why I repeat myself every time a JIRA issue is opened that tries to 
bypass these checks.

> Bypass total frequency check if field uses custom term frequency
> 
>
> Key: LUCENE-10048
> URL: https://issues.apache.org/jira/browse/LUCENE-10048
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tony Xu
>Priority: Minor
>
> For all fields whose index option is not *IndexOptions.NONE*, there is a 
> check on the per-field total token count (i.e. field length) to ensure we 
> don't index too many tokens. This is done by accumulating each token's 
> *TermFrequencyAttribute*.
>  
> Given that Lucene currently allows a custom term frequency to be attached to 
> each token, and that the usage of the frequency can be pretty wild, it is 
> possible for the check to fail with only a few tokens that have large 
> frequencies. When that happens, Lucene will skip indexing the whole 
> document.
> *"foo| bar|"*
>  
> What should be the way to inform the indexing chain not to check the field 
> length?
> A related observation: when custom term frequency is in use, the user is not 
> likely to use the similarity for this field. Maybe we can offer a way to 
> specify that, too?






[jira] [Commented] (LUCENE-10048) Bypass total frequency check if field uses custom term frequency

2021-08-11 Thread Ankur (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10048?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397771#comment-17397771
 ] 

Ankur commented on LUCENE-10048:


@[~rcmuir]

Consider the case where these term-document level scoring factors are computed 
in an offline process, indexed in Lucene and accessed at query time by a 
ranking function that does not rely on Lucene's 
[Scorer|https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/Scorer.html]
 and 
[Similarity|https://lucene.apache.org/core/8_9_0/core/org/apache/lucene/search/similarities/Similarity.html]
 abstractions.

What is considered reasonable is up to the offline process that serves the 
needs of the ranking function and is outside our control. A single 
term-document scoring factor can still be less than {{Integer.MAX_VALUE}}, but 
the sum of all such factors for a document could easily exceed 
{{Integer.MAX_VALUE}}.

Without the bypass, our only option (I think) is to use {{BinaryDocValues}} and 
implement mechanisms to serialize/deserialize term-document level scoring 
factors at indexing and searching time ourselves. With that approach we don't 
get the space efficiency that comes with the highly optimized terms dictionary 
and the integer compression techniques used to encode postings data (at least 
not without significant work).

Maybe we can keep the restriction that a custom term frequency must be less 
than {{Integer.MAX_VALUE}} but relax the check on the per-field total token 
count for the expert use case?
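
To make the trade-off concrete, here is a minimal sketch of that 
BinaryDocValues fallback (the field name and the flat long[] encoding are 
placeholder assumptions; a real implementation would encode a term-to-factor 
mapping):

{code:java}
import java.io.IOException;
import java.nio.ByteBuffer;
import org.apache.lucene.document.BinaryDocValuesField;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.BinaryDocValues;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.util.BytesRef;

final class ScoringFactorsCodec {
  // Indexing side: pack the pre-computed per-document factors into one binary doc value.
  static void addFactors(Document doc, long[] factors) {
    ByteBuffer buf = ByteBuffer.allocate(factors.length * Long.BYTES);
    for (long factor : factors) {
      buf.putLong(factor);
    }
    doc.add(new BinaryDocValuesField("scoring_factors", new BytesRef(buf.array())));
  }

  // Search side: decode the blob for one document inside the custom ranking function.
  static long[] readFactors(LeafReader reader, int docId) throws IOException {
    BinaryDocValues values = reader.getBinaryDocValues("scoring_factors");
    if (values == null || values.advanceExact(docId) == false) {
      return new long[0];
    }
    BytesRef blob = values.binaryValue();
    ByteBuffer buf = ByteBuffer.wrap(blob.bytes, blob.offset, blob.length);
    long[] factors = new long[blob.length / Long.BYTES];
    for (int i = 0; i < factors.length; i++) {
      factors[i] = buf.getLong();
    }
    return factors;
  }
}
{code}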
  

 

> Bypass total frequency check if field uses custom term frequency
> 
>
> Key: LUCENE-10048
> URL: https://issues.apache.org/jira/browse/LUCENE-10048
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Tony Xu
>Priority: Minor
>
> For all fields whose index option is not *IndexOptions.NONE*, there is a 
> check on the per-field total token count (i.e. field length) to ensure we 
> don't index too many tokens. This is done by accumulating each token's 
> *TermFrequencyAttribute*.
>  
> Given that Lucene currently allows a custom term frequency to be attached to 
> each token, and that the usage of the frequency can be pretty wild, it is 
> possible for the check to fail with only a few tokens that have large 
> frequencies. When that happens, Lucene will skip indexing the whole 
> document.
> *"foo| bar|"*
>  
> What should be the way to inform the indexing chain not to check the field 
> length?
> A related observation: when custom term frequency is in use, the user is not 
> likely to use the similarity for this field. Maybe we can offer a way to 
> specify that, too?






[jira] [Commented] (LUCENE-9937) ann-benchmarks results for HNSW search

2021-08-11 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397664#comment-17397664
 ] 

Michael Sokolov commented on LUCENE-9937:
-

I'm glad you found the batch setup was confounding things! It's hard to do an 
A/A comparison between these different implementations!

> ann-benchmarks results for HNSW search
> --
>
> Key: LUCENE-9937
> URL: https://issues.apache.org/jira/browse/LUCENE-9937
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>
> *NOTE: These benchmarks contained an error that made hnswlib perform too 
> well. See the last comment for correct results.*
> I hooked up our HNSW implementation to 
> [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used 
> repo for benchmarking nearest neighbor search libraries against large 
> datasets. I found the results interesting and opened this issue to share and 
> discuss. My benchmarking code can be found 
> [here|https://github.com/jtibshirani/lucene/pull/1] – it's hacky and not easy 
> to commit but I’m happy to help anyone get set up with it.
> Approaches
>  * LuceneVectorsOnly: a baseline that only indexes vectors, and performs a 
> brute force scan to determine nearest neighbors
>  * LuceneHnsw: our HNSW implementation
>  * [hnswlib|https://github.com/nmslib/hnswlib]: a C++ HNSW implementation 
> from the author of the paper
> Datasets
>  * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, 
> comparing euclidean distance
>  * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, 
> comparing cosine similarity
> *Results on sift-128-euclidean*
>  Parameters: M=16, efConstruction=500
> {code:java}
> Approach                Recall  QPS
> LuceneVectorsOnly()     1.000   6.764
> LuceneHnsw(n_cands=10)  0.603   7736.968
> LuceneHnsw(n_cands=50)  0.890   3605.000
> LuceneHnsw(n_cands=100) 0.953   2237.429
> LuceneHnsw(n_cands=500) 0.996   570.900
> LuceneHnsw(n_cands=800) 0.998   379.589
> hnswlib(n_cands=10)     0.713   69662.756
> hnswlib(n_cands=50)     0.950   28021.582
> hnswlib(n_cands=100)    0.985   16108.538
> hnswlib(n_cands=500)    1.000   4115.886
> hnswlib(n_cands=800)    1.000   2729.680
> {code}
> *Results on glove-100-angular*
>  Parameters: M=32, efConstruction=500
> {code:java}
> Approach                Recall  QPS
> LuceneVectorsOnly()     1.000   6.722
> LuceneHnsw(n_cands=10)  0.507   5036.236
> LuceneHnsw(n_cands=50)  0.760   2099.850
> LuceneHnsw(n_cands=100) 0.833   1233.690
> LuceneHnsw(n_cands=500) 0.941   309.077
> LuceneHnsw(n_cands=800) 0.961   203.782
> hnswlib(n_cands=10)     0.597   43543.345
> hnswlib(n_cands=50)     0.832   14719.165
> hnswlib(n_cands=100)    0.897   8299.948
> hnswlib(n_cands=500)    0.981   1931.985
> hnswlib(n_cands=800)    0.991   881.752
> {code}
> Notes on benchmark:
>  * By default, the ann-benchmarks repo retrieves 10 nearest neighbors and 
> computes the recall against the true neighbors. The recall calculation has a 
> small 'fudge factor' that allows neighbors that are within a small epsilon of 
> the best distance. Queries are executed serially to obtain the QPS.
>  * I chose parameters where hnswlib performed well, then passed these same 
> parameters to Lucene HNSW. For index-time parameters, I set maxConn as M and 
> beamWidth as efConstruction. For search parameters, I set k to k, and fanout 
> as (num_cands - k) so that the beam search is of size num_cands. Note that 
> our default value for beamWidth is 16, which is really low – I wasn’t able to 
> obtain acceptable recall until I bumped it to closer to 500 to match the 
> hnswlib default.
>  * I force-merged to one segment before running searches since this gives the 
> best recall + QPS, and also to match hnswlib.
> Some observations:
>  * It'd be really nice to extend luceneutil to measure vector search recall 
> in addition to latency. That would help ensure we’re benchmarking a more 
> realistic scenario, instead of accidentally indexing/ searching at a very low 
> recall. Tracking recall would also guard against subtle, unintentional 
> changes to the algorithm. It's easy to make an accidental change while 
> refactoring, and with approximate algorithms, unit tests don't guard as well 
> against this.
>  * Lucene HNSW gives a great speed-up over the baseline without sacrificing 
> too much recall. But it doesn't perform as well as hnswlib in terms of both 
> recall and QPS. We wouldn’t expect the results to line up perfectly, since 
> Lucene doesn't actually implement HNSW – the current algorithm isn't actually 
> hierarchical and only uses a single graph layer. Might this difference 
> indicate we're leaving performance 'on the table' by not using layers, which 
> (I don't think) adds much index time or space? Or are 

[jira] [Commented] (LUCENE-9614) Implement KNN Query

2021-08-11 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397613#comment-17397613
 ] 

Michael Sokolov commented on LUCENE-9614:
-

UPDATE: we went with

{{score(q,d) = 1 / (1 + |q - d|)}}

to normalize "reversed" scores into [0, 1]: thanks for the suggestion, 
[~jtibshirani]
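
For reference, a tiny sketch of that mapping (not the committed code, just the 
formula above): distances in [0, ∞) become scores in (0, 1], with smaller 
distances producing higher scores.

{code:java}
final class VectorScoreUtil {
  // |q - d| is the distance between query and document vectors; identical vectors
  // score 1.0 and far-away vectors approach 0, so higher scores mean "more similar".
  static float distanceToScore(float distance) {
    return 1f / (1f + distance);
  }
}
{code}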

> Implement KNN Query
> ---
>
> Key: LUCENE-9614
> URL: https://issues.apache.org/jira/browse/LUCENE-9614
> Project: Lucene - Core
>  Issue Type: New Feature
>Reporter: Michael Sokolov
>Priority: Major
>  Time Spent: 2.5h
>  Remaining Estimate: 0h
>
> Now we have a vector index format, and one vector indexing/KNN search 
> implementation, but the interface is low-level: you can search across a 
> single segment only. We would like to expose a Query implementation. 
> Initially, we want to support a usage where the KnnVectorQuery selects the 
> k-nearest neighbors without regard to any other constraints, and these can 
> then be filtered as part of an enclosing Boolean or other query.
> Later we will want to explore some kind of filtering *while* performing 
> vector search, or a re-entrant search process that can yield further results. 
> Because of the nature of knn search (all documents having any vector value 
> match), it is more like a ranking than a filtering operation, and it doesn't 
> really make sense to provide an iterator interface that can be merged in the 
> usual way, in docid order, skipping ahead. It's not yet clear how to satisfy 
> a query that is "k nearest neighbors satsifying some arbitrary Query", at 
> least not without realizing a complete bitset for the Query. But this is for 
> a later issue; *this* issue is just about performing the knn search in 
> isolation, computing a set of (some given) K nearest neighbors, and providing 
> an iterator over those.
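
A hedged sketch of the intended usage described above, assuming the 
KnnVectorQuery(field, target, k) constructor shape from the linked PR (the 
field name and k below are placeholders): the k nearest neighbors are 
retrieved first and then intersected with another clause in an enclosing 
BooleanQuery.

{code:java}
import java.io.IOException;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.KnnVectorQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;

final class KnnQueryExample {
  // Retrieve the 10 nearest neighbors of queryVector, then keep only those that
  // also match the filter clause of the enclosing boolean query.
  static TopDocs searchWithFilter(IndexSearcher searcher, float[] queryVector, Query filter)
      throws IOException {
    Query knn = new KnnVectorQuery("vector", queryVector, 10);
    BooleanQuery combined =
        new BooleanQuery.Builder()
            .add(knn, BooleanClause.Occur.MUST)
            .add(filter, BooleanClause.Occur.FILTER)
            .build();
    return searcher.search(combined, 10);
  }
}
{code}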






[jira] [Commented] (LUCENE-9937) ann-benchmarks results for HNSW search

2021-08-11 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397555#comment-17397555
 ] 

Robert Muir commented on LUCENE-9937:
-

We can help some of those hotspots with vectorization etc. improvements (at 
least with #dims such as 100/128), but we have to wait for OpenJDK to make the 
crap available to us.

They are now incubating this stuff for a third time, so it seems Java 19 at the 
very earliest: http://openjdk.java.net/jeps/8269306
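
For the curious, a sketch of what that could look like with the incubating 
jdk.incubator.vector API (requires --add-modules jdk.incubator.vector and is 
not something Lucene can ship until the API stabilizes):

{code:java}
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class VectorizedDotProduct {
  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  // Dot product is the hot inner loop of the vector similarity functions; the Panama
  // Vector API lets the JIT emit SIMD instructions for the main loop below.
  static float dotProduct(float[] a, float[] b) {
    FloatVector acc = FloatVector.zero(SPECIES);
    int i = 0;
    int bound = SPECIES.loopBound(a.length);
    for (; i < bound; i += SPECIES.length()) {
      FloatVector va = FloatVector.fromArray(SPECIES, a, i);
      FloatVector vb = FloatVector.fromArray(SPECIES, b, i);
      acc = va.fma(vb, acc);
    }
    float sum = acc.reduceLanes(VectorOperators.ADD);
    for (; i < a.length; i++) { // scalar tail for lengths like 100 that aren't lane-aligned
      sum += a[i] * b[i];
    }
    return sum;
  }
}
{code}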

> ann-benchmarks results for HNSW search
> --
>
> Key: LUCENE-9937
> URL: https://issues.apache.org/jira/browse/LUCENE-9937
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>
> *NOTE: These benchmarks contained an error that made hnswlib perform too 
> well. See the last comment for correct results.*
> I hooked up our HNSW implementation to 
> [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used 
> repo for benchmarking nearest neighbor search libraries against large 
> datasets. I found the results interesting and opened this issue to share and 
> discuss. My benchmarking code can be found 
> [here|https://github.com/jtibshirani/lucene/pull/1] – it's hacky and not easy 
> to commit but I’m happy to help anyone get set up with it.
> Approaches
>  * LuceneVectorsOnly: a baseline that only indexes vectors, and performs a 
> brute force scan to determine nearest neighbors
>  * LuceneHnsw: our HNSW implementation
>  * [hnswlib|https://github.com/nmslib/hnswlib]: a C++ HNSW implementation 
> from the author of the paper
> Datasets
>  * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, 
> comparing euclidean distance
>  * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, 
> comparing cosine similarity
> *Results on sift-128-euclidean*
>  Parameters: M=16, efConstruction=500
> {code:java}
> Approach                Recall  QPS
> LuceneVectorsOnly()     1.000   6.764
> LuceneHnsw(n_cands=10)  0.603   7736.968
> LuceneHnsw(n_cands=50)  0.890   3605.000
> LuceneHnsw(n_cands=100) 0.953   2237.429
> LuceneHnsw(n_cands=500) 0.996   570.900
> LuceneHnsw(n_cands=800) 0.998   379.589
> hnswlib(n_cands=10)     0.713   69662.756
> hnswlib(n_cands=50)     0.950   28021.582
> hnswlib(n_cands=100)    0.985   16108.538
> hnswlib(n_cands=500)    1.000   4115.886
> hnswlib(n_cands=800)    1.000   2729.680
> {code}
> *Results on glove-100-angular*
>  Parameters: M=32, efConstruction=500
> {code:java}
> Approach                Recall  QPS
> LuceneVectorsOnly()     1.000   6.722
> LuceneHnsw(n_cands=10)  0.507   5036.236
> LuceneHnsw(n_cands=50)  0.760   2099.850
> LuceneHnsw(n_cands=100) 0.833   1233.690
> LuceneHnsw(n_cands=500) 0.941   309.077
> LuceneHnsw(n_cands=800) 0.961   203.782
> hnswlib(n_cands=10)     0.597   43543.345
> hnswlib(n_cands=50)     0.832   14719.165
> hnswlib(n_cands=100)    0.897   8299.948
> hnswlib(n_cands=500)    0.981   1931.985
> hnswlib(n_cands=800)    0.991   881.752
> {code}
> Notes on benchmark:
>  * By default, the ann-benchmarks repo retrieves 10 nearest neighbors and 
> computes the recall against the true neighbors. The recall calculation has a 
> small 'fudge factor' that allows neighbors that are within a small epsilon of 
> the best distance. Queries are executed serially to obtain the QPS.
>  * I chose parameters where hnswlib performed well, then passed these same 
> parameters to Lucene HNSW. For index-time parameters, I set maxConn as M and 
> beamWidth as efConstruction. For search parameters, I set k to k, and fanout 
> as (num_cands - k) so that the beam search is of size num_cands. Note that 
> our default value for beamWidth is 16, which is really low – I wasn’t able to 
> obtain acceptable recall until I bumped it to closer to 500 to match the 
> hnswlib default.
>  * I force-merged to one segment before running searches since this gives the 
> best recall + QPS, and also to match hnswlib.
> Some observations:
>  * It'd be really nice to extend luceneutil to measure vector search recall 
> in addition to latency. That would help ensure we’re benchmarking a more 
> realistic scenario, instead of accidentally indexing/ searching at a very low 
> recall. Tracking recall would also guard against subtle, unintentional 
> changes to the algorithm. It's easy to make an accidental change while 
> refactoring, and with approximate algorithms, unit tests don't guard as well 
> against this.
>  * Lucene HNSW gives a great speed-up over the baseline without sacrificing 
> too much recall. But it doesn't perform as well as hnswlib in terms of both 
> recall and QPS. We wouldn’t expect the results to line up perfectly, since 
> Lucene doesn't actually implement HNSW – the current algorithm isn't actually 
> hierarchical and only uses a single graph 

[jira] [Resolved] (LUCENE-9941) ann-benchmarks results for HNSW indexing

2021-08-11 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9941?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani resolved LUCENE-9941.
--
Resolution: Done

> ann-benchmarks results for HNSW indexing
> 
>
> Key: LUCENE-9941
> URL: https://issues.apache.org/jira/browse/LUCENE-9941
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>
> This is a continuation of LUCENE-9937, but for HNSW index performance.
> Approaches
>  * LuceneVectorsOnly: a baseline that only indexes vectors
>  * LuceneHnsw: our HNSW implementation, with a force merge to one segment
>  * LuceneHnswNoForceMerge: our HNSW implementation without the force merge
>  * hnswlib: a C++ HNSW implementation from the author of the paper
> Datasets
>  * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, 
> comparing euclidean distance
>  * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, 
> comparing cosine similarity
> *Results on sift-128-euclidean*
>  Parameters: M=16, efConstruction=500
> {code:java}
> Approach                 Index time (sec)
> LuceneVectorsOnly        14.93
> LuceneHnsw               3191.16
> LuceneHnswNoForceMerge   1194.31
> hnswlib                  311.09
> {code}
> *Results on glove-100-angular*
>  Parameters: M=32, efConstruction=500
> {code:java}
> Approach                 Index time (sec)
> LuceneVectorsOnly        14.17
> LuceneHnsw               8940.41
> LuceneHnswNoForceMerge   3623.68
> hnswlib                  587.23
> {code}
> We force merge to one segment to emulate a case where vectors aren't 
> continually being indexed. In these situations, it seems likely users would 
> force merge to optimize search speed: searching a single large graph is 
> expected to be faster than searching several small ones serially. To see how 
> long the force merge takes, we can subtract LuceneHnswNoForceMerge from 
> LuceneHnsw.
> The construction parameters match those in LUCENE-9937 and are optimized for 
> search recall + QPS instead of index speed, as I figured this would be a 
> common set-up.
> Some observations:
>  * In cases when segments are eventually force merged, we do a lot of extra 
> work building intermediate graphs that are eventually merged away. This is a 
> difficult problem, and one that's been raised in the past. As a simple step, 
> I wonder if we should not build graphs for segments that are below a certain 
> size. For sufficiently small segments, it could be a better trade-off to 
> avoid building a graph and support nearest-neighbor search through a 
> brute-force scan?
>  * Indexing is slow compared to what we're used to for other formats, even if 
> we disregard the extra work mentioned above. For sift-128-euclidean, building 
> only the final graph takes ~33 min, whereas for glove-100-angular it's ~88 
> min.
>  * As a note, graph indexing uses ANN searches in order to add each new 
> vector to the graph. So Lucene's slower search speed relative to hnswlib 
> may also contribute to its slower indexing.






[jira] [Resolved] (LUCENE-9937) ann-benchmarks results for HNSW search

2021-08-11 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani resolved LUCENE-9937.
--
Resolution: Done

> ann-benchmarks results for HNSW search
> --
>
> Key: LUCENE-9937
> URL: https://issues.apache.org/jira/browse/LUCENE-9937
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>
> *NOTE: These benchmarks contained an error that made hnswlib perform too 
> well. See the last comment for correct results.*
> I hooked up our HNSW implementation to 
> [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used 
> repo for benchmarking nearest neighbor search libraries against large 
> datasets. I found the results interesting and opened this issue to share and 
> discuss. My benchmarking code can be found 
> [here|https://github.com/jtibshirani/lucene/pull/1] – it's hacky and not easy 
> to commit but I’m happy to help anyone get set up with it.
> Approaches
>  * LuceneVectorsOnly: a baseline that only indexes vectors, and performs a 
> brute force scan to determine nearest neighbors
>  * LuceneHnsw: our HNSW implementation
>  * [hnswlib|https://github.com/nmslib/hnswlib]: a C++ HNSW implementation 
> from the author of the paper
> Datasets
>  * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, 
> comparing euclidean distance
>  * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, 
> comparing cosine similarity
> *Results on sift-128-euclidean*
>  Parameters: M=16, efConstruction=500
> {code:java}
> Approach                Recall  QPS
> LuceneVectorsOnly()     1.000   6.764
> LuceneHnsw(n_cands=10)  0.603   7736.968
> LuceneHnsw(n_cands=50)  0.890   3605.000
> LuceneHnsw(n_cands=100) 0.953   2237.429
> LuceneHnsw(n_cands=500) 0.996   570.900
> LuceneHnsw(n_cands=800) 0.998   379.589
> hnswlib(n_cands=10)     0.713   69662.756
> hnswlib(n_cands=50)     0.950   28021.582
> hnswlib(n_cands=100)    0.985   16108.538
> hnswlib(n_cands=500)    1.000   4115.886
> hnswlib(n_cands=800)    1.000   2729.680
> {code}
> *Results on glove-100-angular*
>  Parameters: M=32, efConstruction=500
> {code:java}
> Approach                Recall  QPS
> LuceneVectorsOnly()     1.000   6.722
> LuceneHnsw(n_cands=10)  0.507   5036.236
> LuceneHnsw(n_cands=50)  0.760   2099.850
> LuceneHnsw(n_cands=100) 0.833   1233.690
> LuceneHnsw(n_cands=500) 0.941   309.077
> LuceneHnsw(n_cands=800) 0.961   203.782
> hnswlib(n_cands=10)     0.597   43543.345
> hnswlib(n_cands=50)     0.832   14719.165
> hnswlib(n_cands=100)    0.897   8299.948
> hnswlib(n_cands=500)    0.981   1931.985
> hnswlib(n_cands=800)    0.991   881.752
> {code}
> Notes on benchmark:
>  * By default, the ann-benchmarks repo retrieves 10 nearest neighbors and 
> computes the recall against the true neighbors. The recall calculation has a 
> small 'fudge factor' that allows neighbors that are within a small epsilon of 
> the best distance. Queries are executed serially to obtain the QPS.
>  * I chose parameters where hnswlib performed well, then passed these same 
> parameters to Lucene HNSW. For index-time parameters, I set maxConn as M and 
> beamWidth as efConstruction. For search parameters, I set k to k, and fanout 
> as (num_cands - k) so that the beam search is of size num_cands. Note that 
> our default value for beamWidth is 16, which is really low – I wasn’t able to 
> obtain acceptable recall until I bumped it to closer to 500 to match the 
> hnswlib default.
>  * I force-merged to one segment before running searches since this gives the 
> best recall + QPS, and also to match hnswlib.
> Some observations:
>  * It'd be really nice to extend luceneutil to measure vector search recall 
> in addition to latency. That would help ensure we’re benchmarking a more 
> realistic scenario, instead of accidentally indexing/ searching at a very low 
> recall. Tracking recall would also guard against subtle, unintentional 
> changes to the algorithm. It's easy to make an accidental change while 
> refactoring, and with approximate algorithms, unit tests don't guard as well 
> against this.
>  * Lucene HNSW gives a great speed-up over the baseline without sacrificing 
> too much recall. But it doesn't perform as well as hnswlib in terms of both 
> recall and QPS. We wouldn’t expect the results to line up perfectly, since 
> Lucene doesn't actually implement HNSW – the current algorithm isn't actually 
> hierarchical and only uses a single graph layer. Might this difference 
> indicate we're leaving performance 'on the table' by not using layers, which 
> (I don't think) adds much index time or space? Or are there other algorithmic 
> improvements that would help close the gap?
>  * Setting beamWidth to 500 *really* slowed down indexing. I'll open a 
> 

[GitHub] [lucene-solr] thelabdude merged pull request #2552: SOLR-9853: Ability to project multi-valued fields in SQL query results

2021-08-11 Thread GitBox


thelabdude merged pull request #2552:
URL: https://github.com/apache/lucene-solr/pull/2552


   





[jira] [Comment Edited] (LUCENE-9937) ann-benchmarks results for HNSW search

2021-08-11 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397415#comment-17397415
 ] 

Julie Tibshirani edited comment on LUCENE-9937 at 8/11/21, 3:10 PM:


It turns out my benchmarks were way too generous to hnswlib!! I was running 
them in "--batch" mode, which sends all queries at once. I reran hnswlib in the 
default "serial" mode, and now see more similar results to you. Although I had 
disabled multi-threading, some parallelism must have snuck in. We're still 
behind in terms of recall, but the QPS gap is much less.

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.764

LuceneHnsw(n_cands=10)   0.603   9045.574
LuceneHnsw(n_cands=50)   0.890   3789.169
LuceneHnsw(n_cands=100)  0.942   2333.714
LuceneHnsw(n_cands=500)  0.996   579.850
LuceneHnsw(n_cands=800)  0.998   386.105

hnswlib(n_cands=10)      0.710   22628.980
hnswlib(n_cands=50)      0.950   8640.336
hnswlib(n_cands=100)     0.985   5026.611
hnswlib(n_cands=500)     1.000   1289.108
hnswlib(n_cands=800)     1.000   856.515
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.722

LuceneHnsw(n_cands=10)   0.507   5036.236
LuceneHnsw(n_cands=50)   0.760   2099.850
LuceneHnsw(n_cands=100)  0.833   1233.690
LuceneHnsw(n_cands=500)  0.941   309.077
LuceneHnsw(n_cands=800)  0.961   203.782

hnswlib(n_cands=10)      0.601   12877.861
hnswlib(n_cands=50)      0.833   4553.943
hnswlib(n_cands=100)     0.896   2618.098
hnswlib(n_cands=500)     0.981   634.301
hnswlib(n_cands=800)     0.991   412.333
{code}
I'm going to close out this issue since I think we've learned what we can from 
digging into these benchmarks. My thoughts on conclusions:
 * Lucene HNSW gives a great speed-up over the baseline at good recall.
 * We may be missing important algorithmic details from hnswlib that hurt 
recall and cause us to perform more computations.
 * Hot spots: decoding vectors from their on-disk format and performing 
distance calculations.
  


was (Author: julietibs):
It turns out my benchmarks were way too generous to hnswlib!! I was running 
them in "--batch" mode, which sends all queries at once. I reran hnswlib in the 
default "serial" mode, and now see more similar results to you. Although I had 
disabled multi-threading, some parallelism must have snuck in. We're still 
behind in terms of recall, but the QPS gap is much less.

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.764

LuceneHnsw(n_cands=10)   0.603   9045.574
LuceneHnsw(n_cands=50)   0.890   3789.169
LuceneHnsw(n_cands=100)  0.942   2333.714
LuceneHnsw(n_cands=500)  0.996   579.850
LuceneHnsw(n_cands=800)  0.998   386.105

hnswlib(n_cands=10)      0.710   22628.980
hnswlib(n_cands=50)      0.950   8640.336
hnswlib(n_cands=100)     0.985   5026.611
hnswlib(n_cands=500)     1.000   1289.108
hnswlib(n_cands=800)     1.000   856.515
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.722

LuceneHnsw(n_cands=10)   0.507   5036.236
LuceneHnsw(n_cands=50)   0.760   2099.850
LuceneHnsw(n_cands=100)  0.833   1233.690
LuceneHnsw(n_cands=500)  0.941   309.077
LuceneHnsw(n_cands=800)  0.961   203.782

hnswlib(n_cands=10)      0.601   12877.861
hnswlib(n_cands=50)      0.833   4553.943
hnswlib(n_cands=100)     0.896   2618.098
hnswlib(n_cands=500)     0.981   634.301
hnswlib(n_cands=800)     0.991   412.333
{code}
I'm going to close out this issue since I think we've learned what we can from 
digging into these benchmarks. My thoughts on conclusions:
 * Lucene HNSW gives a great speed-up over the baseline at good recall.
 * We may be missing important algorithmic details from hnswlib that hurt 
recall and cause us to perform more computations.
 * Hot spots: decoding vectors from their on-disk format and performing 
distance calculations.
  

> ann-benchmarks results for HNSW search
> --
>
> Key: LUCENE-9937
> URL: https://issues.apache.org/jira/browse/LUCENE-9937
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>
> *NOTE: These benchmarks contained an error that made hnswlib perform too 
> well. See the last comment for correct results.*
> I hooked up our HNSW implementation to 
> [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used 
> repo for benchmarking nearest neighbor search libraries against large 
> datasets. I found 

[jira] [Comment Edited] (LUCENE-9937) ann-benchmarks results for HNSW search

2021-08-11 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397415#comment-17397415
 ] 

Julie Tibshirani edited comment on LUCENE-9937 at 8/11/21, 3:06 PM:


It turns out my benchmarks were way too generous to hnswlib!! I was running 
them in "--batch" mode, which sends all queries at once. I reran hnswlib in the 
default "serial" mode, and now see more similar results to you. Although I had 
disabled multi-threading, some parallelism must have snuck in. We're still 
behind in terms of recall, but the QPS gap is much less.

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.764

LuceneHnsw(n_cands=10)   0.603   9045.574
LuceneHnsw(n_cands=50)   0.890   3789.169
LuceneHnsw(n_cands=100)  0.942   2333.714
LuceneHnsw(n_cands=500)  0.996   579.850
LuceneHnsw(n_cands=800)  0.998   386.105

hnswlib(n_cands=10)      0.710   22628.980
hnswlib(n_cands=50)      0.950   8640.336
hnswlib(n_cands=100)     0.985   5026.611
hnswlib(n_cands=500)     1.000   1289.108
hnswlib(n_cands=800)     1.000   856.515
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.722

LuceneHnsw(n_cands=10)   0.507   5036.236
LuceneHnsw(n_cands=50)   0.760   2099.850
LuceneHnsw(n_cands=100)  0.833   1233.690
LuceneHnsw(n_cands=500)  0.941   309.077
LuceneHnsw(n_cands=800)  0.961   203.782

hnswlib(n_cands=10)      0.601   12877.861
hnswlib(n_cands=50)      0.833   4553.943
hnswlib(n_cands=100)     0.896   2618.098
hnswlib(n_cands=500)     0.981   634.301
hnswlib(n_cands=800)     0.991   412.333
{code}
I'm going to close out this issue since I think we've learned what we can from 
digging into these benchmarks. My thoughts on conclusions:
 * Lucene HNSW gives a great speed-up over the baseline at good recall.
 * We may be missing important algorithmic details from hnswlib that hurt 
recall and cause us to perform more computations.
 * Hot spots: decoding vectors from their on-disk format and performing 
distance calculations.
  


was (Author: julietibs):
It turns out my benchmarks were way too generous to hnswlib!! I was running 
them in "--batch" mode, which sends all queries at once. I reran hnswlib in the 
default "serial" mode, and now see more similar results to you. Although I had 
disabled multi-threading, some parallelism must have snuck in. We're still 
behind in terms of recall, but the QPS gap is much less.

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.764

LuceneHnsw(n_cands=10)   0.603   9045.574
LuceneHnsw(n_cands=50)   0.890   3789.169
LuceneHnsw(n_cands=100)  0.942   2333.714
LuceneHnsw(n_cands=500)  0.996   579.850
LuceneHnsw(n_cands=800)  0.998   386.105

hnswlib(n_cands=10)      0.710   22628.980
hnswlib(n_cands=50)      0.950   8640.336
hnswlib(n_cands=100)     0.985   5026.611
hnswlib(n_cands=500)     1.000   1289.108
hnswlib(n_cands=800)     1.000   856.515
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.722

LuceneHnsw(n_cands=10)   0.507   5036.236
LuceneHnsw(n_cands=50)   0.760   2099.850
LuceneHnsw(n_cands=100)  0.833   1233.690
LuceneHnsw(n_cands=500)  0.941   309.077
LuceneHnsw(n_cands=800)  0.961   203.782

hnswlib(n_cands=10)      0.601   12877.861
hnswlib(n_cands=50)      0.833   4553.943
hnswlib(n_cands=100)     0.896   2618.098
hnswlib(n_cands=500)     0.981   634.301
hnswlib(n_cands=800)     0.991   412.333
{code}
I'm going to close out this issue since I think we've learned what we can from 
digging into these benchmarks. My thoughts on conclusions:
 * Lucene HNSW gives a great speed-up over the baseline at good recall.
 * We may be missing important algorithmic details from hnswlib that hurt 
recall and cause us to perform more computations.
 * Hot spots: decoding vectors from their on-disk format and performing 
distance calculations.
  

> ann-benchmarks results for HNSW search
> --
>
> Key: LUCENE-9937
> URL: https://issues.apache.org/jira/browse/LUCENE-9937
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>
> *NOTE: These benchmarks contained an error that made hnswlib perform too 
> well. See the last comment for correct results.*
> I hooked up our HNSW implementation to 
> [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used 
> repo for benchmarking nearest neighbor search libraries against large 
> datasets. I found the 

[jira] [Updated] (LUCENE-9937) ann-benchmarks results for HNSW search

2021-08-11 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-9937:
-
Description: 
*NOTE: These benchmarks contained an error that made hnswlib perform too well. 
See the last comment for correct results.*

I hooked up our HNSW implementation to 
[ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used repo 
for benchmarking nearest neighbor search libraries against large datasets. I 
found the results interesting and opened this issue to share and discuss. My 
benchmarking code can be found 
[here|https://github.com/jtibshirani/lucene/pull/1] – it's hacky and not easy 
to commit but I’m happy to help anyone get set up with it.

Approaches
 * LuceneVectorsOnly: a baseline that only indexes vectors, and performs a 
brute force scan to determine nearest neighbors
 * LuceneHnsw: our HNSW implementation
 * [hnswlib|https://github.com/nmslib/hnswlib]: a C++ HNSW implementation from 
the author of the paper

Datasets
 * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, comparing 
euclidean distance
 * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, comparing 
cosine similarity

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.764

LuceneHnsw(n_cands=10)   0.603   7736.968
LuceneHnsw(n_cands=50)   0.890   3605.000
LuceneHnsw(n_cands=100)  0.953   2237.429
LuceneHnsw(n_cands=500)  0.996   570.900
LuceneHnsw(n_cands=800)  0.998   379.589

hnswlib(n_cands=10)      0.713   69662.756
hnswlib(n_cands=50)      0.950   28021.582
hnswlib(n_cands=100)     0.985   16108.538
hnswlib(n_cands=500)     1.000   4115.886
hnswlib(n_cands=800)     1.000   2729.680
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.722

LuceneHnsw(n_cands=10)   0.507   5036.236
LuceneHnsw(n_cands=50)   0.760   2099.850
LuceneHnsw(n_cands=100)  0.833   1233.690
LuceneHnsw(n_cands=500)  0.941   309.077
LuceneHnsw(n_cands=800)  0.961   203.782

hnswlib(n_cands=10)      0.597   43543.345
hnswlib(n_cands=50)      0.832   14719.165
hnswlib(n_cands=100)     0.897   8299.948
hnswlib(n_cands=500)     0.981   1931.985
hnswlib(n_cands=800)     0.991   881.752
{code}
Notes on benchmark:
 * By default, the ann-benchmarks repo retrieves 10 nearest neighbors and 
computes the recall against the true neighbors. The recall calculation has a 
small 'fudge factor' that allows neighbors that are within a small epsilon of 
the best distance. Queries are executed serially to obtain the QPS.
 * I chose parameters where hnswlib performed well, then passed these same 
parameters to Lucene HNSW. For index-time parameters, I set maxConn as M and 
beamWidth as efConstruction. For search parameters, I set k to k, and fanout as 
(num_cands - k) so that the beam search is of size num_cands. Note that our 
default value for beamWidth is 16, which is really low – I wasn’t able to 
obtain acceptable recall until I bumped it to closer to 500 to match the 
hnswlib default.
 * I force-merged to one segment before running searches since this gives the 
best recall + QPS, and also to match hnswlib.

Some observations:
 * It'd be really nice to extend luceneutil to measure vector search recall in 
addition to latency. That would help ensure we’re benchmarking a more realistic 
scenario, instead of accidentally indexing/ searching at a very low recall. 
Tracking recall would also guard against subtle, unintentional changes to the 
algorithm. It's easy to make an accidental change while refactoring, and with 
approximate algorithms, unit tests don't guard as well against this.
 * Lucene HNSW gives a great speed-up over the baseline without sacrificing too 
much recall. But it doesn't perform as well as hnswlib in terms of both recall 
and QPS. We wouldn’t expect the results to line up perfectly, since Lucene 
doesn't actually implement HNSW – the current algorithm isn't actually 
hierarchical and only uses a single graph layer. Might this difference 
indicate we're leaving performance 'on the table' by not using layers, which (I 
don't think) adds much index time or space? Or are there other algorithmic 
improvements that would help close the gap?
 * Setting beamWidth to 500 *really* slowed down indexing. I'll open a separate 
issue with indexing speed results, keeping this one focused on search.

  was:
I hooked up our HNSW implementation to 
[ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used repo 
for benchmarking nearest neighbor search libraries against large datasets. I 
found the results interesting and opened this issue to share and discuss. My 
benchmarking code can be found 
[here|https://github.com/jtibshirani/lucene/pull/1] – it's hacky and not easy 
to commit but I’m 

[jira] [Comment Edited] (LUCENE-9937) ann-benchmarks results for HNSW search

2021-08-11 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397415#comment-17397415
 ] 

Julie Tibshirani edited comment on LUCENE-9937 at 8/11/21, 3:04 PM:


It turns out my benchmarks were way too generous to hnswlib!! I was running 
them in "--batch" mode, which sends all queries at once. I reran hnswlib in the 
default "serial" mode, and now see more similar results to you. Although I had 
disabled multi-threading, some parallelism must have snuck in. We're still 
behind in terms of recall, but the QPS gap is much less.

*Results on sift-128-euclidean*
 Parameters: M=16, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.764

LuceneHnsw(n_cands=10)   0.603   9045.574
LuceneHnsw(n_cands=50)   0.890   3789.169
LuceneHnsw(n_cands=100)  0.942   2333.714
LuceneHnsw(n_cands=500)  0.996   579.850
LuceneHnsw(n_cands=800)  0.998   386.105

hnswlib(n_cands=10)      0.710   22628.980
hnswlib(n_cands=50)      0.950   8640.336
hnswlib(n_cands=100)     0.985   5026.611
hnswlib(n_cands=500)     1.000   1289.108
hnswlib(n_cands=800)     1.000   856.515
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.722

LuceneHnsw(n_cands=10)   0.507   5036.236
LuceneHnsw(n_cands=50)   0.760   2099.850
LuceneHnsw(n_cands=100)  0.833   1233.690
LuceneHnsw(n_cands=500)  0.941   309.077
LuceneHnsw(n_cands=800)  0.961   203.782

hnswlib(n_cands=10)      0.601   12877.861
hnswlib(n_cands=50)      0.833   4553.943
hnswlib(n_cands=100)     0.896   2618.098
hnswlib(n_cands=500)     0.981   634.301
hnswlib(n_cands=800)     0.991   412.333
{code}
I'm going to close out this issue since I think we've learned what we can from 
digging into these benchmarks. My thoughts on conclusions:
 * Lucene HNSW gives a great speed-up over the baseline at good recall.
 * We may be missing important algorithmic details from hnswlib that hurt 
recall and cause us to perform more computations.
 * Hot spots: decoding vectors from their on-disk format and performing 
distance calculations.
  


was (Author: julietibs):
It turns out my benchmarks were way too generous to hnswlib!! I was running 
them in "--batch" mode, which sends all queries at once. I reran hnswlib in the 
default "serial" mode, and now see more similar results to you. Although I had 
disabled multi-threading, some parallelism must have snuck in. We're still 
behind in terms of recall, but the QPS gap is much less.

 *sift-128-euclidean, M=16, efConstruction=500*
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.764

LuceneHnsw(n_cands=10)   0.603   9045.574
LuceneHnsw(n_cands=50)   0.890   3789.169
LuceneHnsw(n_cands=100)  0.942   2333.714
LuceneHnsw(n_cands=500)  0.996   579.850
LuceneHnsw(n_cands=800)  0.998   386.105

hnswlib(n_cands=10)      0.710   22628.980
hnswlib(n_cands=50)      0.950   8640.336
hnswlib(n_cands=100)     0.985   5026.611
hnswlib(n_cands=500)     1.000   1289.108
hnswlib(n_cands=800)     1.000   856.515
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.722

LuceneHnsw(n_cands=10)   0.507   5036.236
LuceneHnsw(n_cands=50)   0.760   2099.850
LuceneHnsw(n_cands=100)  0.833   1233.690
LuceneHnsw(n_cands=500)  0.941   309.077
LuceneHnsw(n_cands=800)  0.961   203.782

hnswlib(n_cands=10)      0.601   12877.861
hnswlib(n_cands=50)      0.833   4553.943
hnswlib(n_cands=100)     0.896   2618.098
hnswlib(n_cands=500)     0.981   634.301
hnswlib(n_cands=800)     0.991   412.333
{code}
I'm going to close out this issue since I think we've learned what we can from 
digging into these benchmarks. My thoughts on conclusions:
 * Lucene HNSW gives a great speed-up over the baseline at good recall.
 * We may be missing important algorithmic details from hnswlib that hurt 
recall and cause us to perform more computations.
 * Hot spots: decoding vectors from their on-disk format and performing 
distance calculations.
  

> ann-benchmarks results for HNSW search
> --
>
> Key: LUCENE-9937
> URL: https://issues.apache.org/jira/browse/LUCENE-9937
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>
> I hooked up our HNSW implementation to 
> [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used 
> repo for benchmarking nearest neighbor search libraries against large 
> datasets. I found the results interesting and opened this issue to share and 
> discuss. My benchmarking code can be found 
> 

[jira] [Commented] (LUCENE-9937) ann-benchmarks results for HNSW search

2021-08-11 Thread Julie Tibshirani (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397415#comment-17397415
 ] 

Julie Tibshirani commented on LUCENE-9937:
--

It turns out my benchmarks were way too generous to hnswlib!! I was running 
them in "--batch" mode, which sends all queries at once. I reran hnswlib in the 
default "serial" mode, and now see more similar results to you. Although I had 
disabled multi-threading, some parallelism must have snuck in. We're still 
behind in terms of recall, but the QPS gap is much less.

 *sift-128-euclidean, M=16, efConstruction=500*
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.764

LuceneHnsw(n_cands=10)   0.603   9045.574
LuceneHnsw(n_cands=50)   0.890   3789.169
LuceneHnsw(n_cands=100)  0.942   2333.714
LuceneHnsw(n_cands=500)  0.996   579.850
LuceneHnsw(n_cands=800)  0.998   386.105

hnswlib(n_cands=10)      0.710   22628.980
hnswlib(n_cands=50)      0.950   8640.336
hnswlib(n_cands=100)     0.985   5026.611
hnswlib(n_cands=500)     1.000   1289.108
hnswlib(n_cands=800)     1.000   856.515
{code}
*Results on glove-100-angular*
 Parameters: M=32, efConstruction=500
{code:java}
Approach                 Recall  QPS
LuceneVectorsOnly()      1.000   6.722

LuceneHnsw(n_cands=10)   0.507   5036.236
LuceneHnsw(n_cands=50)   0.760   2099.850
LuceneHnsw(n_cands=100)  0.833   1233.690
LuceneHnsw(n_cands=500)  0.941   309.077
LuceneHnsw(n_cands=800)  0.961   203.782

hnswlib(n_cands=10)      0.601   12877.861
hnswlib(n_cands=50)      0.833   4553.943
hnswlib(n_cands=100)     0.896   2618.098
hnswlib(n_cands=500)     0.981   634.301
hnswlib(n_cands=800)     0.991   412.333
{code}
I'm going to close out this issue since I think we've learned what we can from 
digging into these benchmarks. My thoughts on conclusions:
 * Lucene HNSW gives a great speed-up over the baseline at good recall.
 * We may be missing important algorithmic details from hnswlib that hurt 
recall and cause us to perform more computations.
 * Hot spots: decoding vectors from their on-disk format and performing 
distance calculations.
  

> ann-benchmarks results for HNSW search
> --
>
> Key: LUCENE-9937
> URL: https://issues.apache.org/jira/browse/LUCENE-9937
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Julie Tibshirani
>Priority: Minor
>
> I hooked up our HNSW implementation to 
> [ann-benchmarks|https://github.com/erikbern/ann-benchmarks], a widely used 
> repo for benchmarking nearest neighbor search libraries against large 
> datasets. I found the results interesting and opened this issue to share and 
> discuss. My benchmarking code can be found 
> [here|https://github.com/jtibshirani/lucene/pull/1] – it's hacky and not easy 
> to commit but I’m happy to help anyone get set up with it.
> Approaches
>  * LuceneVectorsOnly: a baseline that only indexes vectors, and performs a 
> brute force scan to determine nearest neighbors
>  * LuceneHnsw: our HNSW implementation
>  * [hnswlib|https://github.com/nmslib/hnswlib]: a C++ HNSW implementation 
> from the author of the paper
> Datasets
>  * sift-128-euclidean: 1 million SIFT feature vectors, dimension 128, 
> comparing euclidean distance
>  * glove-100-angular: ~1.2 million GloVe word vectors, dimension 100, 
> comparing cosine similarity
> *Results on sift-128-euclidean*
>  Parameters: M=16, efConstruction=500
> {code:java}
> Approach                Recall  QPS
> LuceneVectorsOnly()     1.000   6.764
> LuceneHnsw(n_cands=10)  0.603   7736.968
> LuceneHnsw(n_cands=50)  0.890   3605.000
> LuceneHnsw(n_cands=100) 0.953   2237.429
> LuceneHnsw(n_cands=500) 0.996   570.900
> LuceneHnsw(n_cands=800) 0.998   379.589
> hnswlib(n_cands=10)     0.713   69662.756
> hnswlib(n_cands=50)     0.950   28021.582
> hnswlib(n_cands=100)    0.985   16108.538
> hnswlib(n_cands=500)    1.000   4115.886
> hnswlib(n_cands=800)    1.000   2729.680
> {code}
> *Results on glove-100-angular*
>  Parameters: M=32, efConstruction=500
> {code:java}
> Approach                Recall  QPS
> LuceneVectorsOnly()     1.000   6.722
> LuceneHnsw(n_cands=10)  0.507   5036.236
> LuceneHnsw(n_cands=50)  0.760   2099.850
> LuceneHnsw(n_cands=100) 0.833   1233.690
> LuceneHnsw(n_cands=500) 0.941   309.077
> LuceneHnsw(n_cands=800) 0.961   203.782
> hnswlib(n_cands=10)     0.597   43543.345
> hnswlib(n_cands=50)     0.832   14719.165
> hnswlib(n_cands=100)    0.897   8299.948
> hnswlib(n_cands=500)    0.981   1931.985
> hnswlib(n_cands=800)    0.991   881.752
> {code}
> Notes on benchmark:
>  * By default, the ann-benchmarks repo retrieves 10 nearest neighbors and 
> computes the recall against the true neighbors. The recall calculation has a 
> small 'fudge factor' that allows neighbors that are within a small epsilon of 
> 

[GitHub] [lucene-solr] thelabdude opened a new pull request #2552: SOLR-9853: Ability to project multi-valued fields in SQL query results

2021-08-11 Thread GitBox


thelabdude opened a new pull request #2552:
URL: https://github.com/apache/lucene-solr/pull/2552


   Backport of https://github.com/apache/solr/pull/252





[GitHub] [lucene] msokolov commented on a change in pull request #235: LUCENE-9614: add KnnVectorQuery implementation

2021-08-11 Thread GitBox


msokolov commented on a change in pull request #235:
URL: https://github.com/apache/lucene/pull/235#discussion_r686891113



##
File path: 
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java
##
@@ -43,9 +43,9 @@ public float compare(float[] v1, float[] v2) {
   };
 
   /**
-   * If true, the scores associated with vector comparisons are in reverse 
order; that is, lower
-   * scores represent more similar vectors. Otherwise, if false, higher scores 
represent more
-   * similar vectors.
+   * If true, the scores associated with vector comparisons are nonnegative 
and in reverse order;
+   * that is, lower scores represent more similar vectors. Otherwise, if 
false, higher scores
+   * represent more similar vectors, and scores may be negative or positive.

Review comment:
   heh, in fact this edit was never needed since we never changed the 
similarity scores (as I mentioned above), and only the scores returned from 
search... so I'll indeed revert




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10043) Decrease default for LRUQueryCache#skipCacheFactor?

2021-08-11 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani updated LUCENE-10043:
--
Fix Version/s: 8.10

> Decrease default for LRUQueryCache#skipCacheFactor?
> ---
>
> Key: LUCENE-10043
> URL: https://issues.apache.org/jira/browse/LUCENE-10043
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
> Fix For: 8.10
>
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In LUCENE-9002 we introduced logic to skip caching a clause if it would be 
> too expensive compared to the usual query cost. Specifically, we avoid 
> caching a clause if its cost is estimated to be a factor higher than the lead 
> iterator's:
> {code}
>   // skip cache operation which would slow query down too much
>   if (cost / skipCacheFactor > leadCost) {
> return supplier.get(leadCost);
>   }
> {code}
> Choosing good defaults is hard! We've seen some examples in Elasticsearch 
> where caching a query clause causes a major slowdown, contributing to poor 
> tail latencies. It made me think that the default 'skipCacheFactor' of 250 
> may be too high -- interpreted simply, this means we'll cache a clause even 
> if it is ~250 times more expensive than running the top-level query on its 
> own. Would it make sense to decrease this to 10 or so? It seems okay to err 
> on the side of less caching for individual clauses, especially since any 
> parent 'BooleanQuery' is already eligible for caching?
> As a note, the interpretation "~250 times more expensive than running the 
> top-level query on its own" isn't perfectly accurate. The true cost doesn't 
> depend only on the number of matched documents, but also on the cost of matching 
> itself. Making it even more complex, some queries like 
> 'IndexOrDocValuesQuery' have different matching strategies based on whether 
> they're used as a lead iterator or verifier. 
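
As a rough illustration of the trade-off discussed above, here is a minimal sketch of the skip decision itself; the class name, the example costs, and the factors 250 and 10 are purely illustrative and not taken from any benchmark:

{code:java}
// Illustrative only: reproduces the shape of the skip-cache check quoted above.
public class SkipCacheFactorDemo {

  // cost: estimated cost of the clause considered for caching
  // leadCost: cost of the lead iterator of the top-level query
  static boolean wouldSkipCaching(long cost, long leadCost, long skipCacheFactor) {
    // Skip caching when the clause is more than skipCacheFactor times as
    // expensive as the lead iterator.
    return cost / skipCacheFactor > leadCost;
  }

  public static void main(String[] args) {
    long leadCost = 1_000;    // hypothetical lead iterator cost
    long clauseCost = 50_000; // hypothetical clause cost, 50x the lead
    System.out.println(wouldSkipCaching(clauseCost, leadCost, 250)); // false -> clause would still be cached
    System.out.println(wouldSkipCaching(clauseCost, leadCost, 10));  // true  -> caching would be skipped
  }
}
{code}

Applications that want a different factor can pass it as the last argument of the longer LRUQueryCache constructor that appears in the review diffs further down in this digest.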



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10043) Decrease default for LRUQueryCache#skipCacheFactor?

2021-08-11 Thread Julie Tibshirani (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Julie Tibshirani resolved LUCENE-10043.
---
Resolution: Fixed

> Decrease default for LRUQueryCache#skipCacheFactor?
> ---
>
> Key: LUCENE-10043
> URL: https://issues.apache.org/jira/browse/LUCENE-10043
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In LUCENE-9002 we introduced logic to skip caching a clause if it would be 
> too expensive compared to the usual query cost. Specifically, we avoid 
> caching a clause if its cost is estimated to be a factor higher than the lead 
> iterator's:
> {code}
>   // skip cache operation which would slow query down too much
>   if (cost / skipCacheFactor > leadCost) {
> return supplier.get(leadCost);
>   }
> {code}
> Choosing good defaults is hard! We've seen some examples in Elasticsearch 
> where caching a query clause causes a major slowdown, contributing to poor 
> tail latencies. It made me think that the default 'skipCacheFactor' of 250 
> may be too high -- interpreted simply, this means we'll cache a clause even 
> if it is ~250 times more expensive than running the top-level query on its 
> own. Would it make sense to decrease this to 10 or so? It seems okay to err 
> on the side of less caching for individual clauses, especially since any 
> parent 'BooleanQuery' is already eligible for caching?
> As a note, the interpretation "~250 times more expensive than running the 
> top-level query on its own" isn't perfectly accurate. The true cost doesn't 
> depend only on the number of matched documents, but also on the cost of matching 
> itself. Making it even more complex, some queries like 
> 'IndexOrDocValuesQuery' have different matching strategies based on whether 
> they're used as a lead iterator or verifier. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10035) Simple text codec add multi level skip list data

2021-08-11 Thread wuda (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuda updated LUCENE-10035:
--
Attachment: (was: LUCENE-10035.patch)

> Simple text codec add  multi level skip list data 
> --
>
> Key: LUCENE-10035
> URL: https://issues.apache.org/jira/browse/LUCENE-10035
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: wuda
>Priority: Major
>  Labels: Impact, MultiLevelSkipList, SimpleTextCodec
> Attachments: LUCENE-10035.patch
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The simple text codec now adds skip list data (including impacts) to help with 
> understanding the index format, for debugging, curiosity, and transparency only. 
> When a term's docFreq is greater than or equal to SimpleTextSkipWriter.BLOCK_SIZE 
> (default value 8), the simple text codec writes a skip list, and the *.pst (simple 
> text term dictionary)* file looks like this:
> {code:java}
> field title
>   term args
> doc 2
>   freq 2
>   pos 7
>   pos 10
> ## we omit docs for better view ..
> doc 98
>   freq 2
>   pos 2
>   pos 6
> skipList 
> ?
>   level 1
> skipDoc 65
> skipDocFP 949
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
>   impact 
> freq 3
> norm 13
> impacts_end 
> ?
>   level 0
> skipDoc 17
> skipDocFP 284
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
> impacts_end 
> skipDoc 34
> skipDocFP 624
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
>   impact 
> freq 3
> norm 14
> impacts_end 
> skipDoc 65
> skipDocFP 949
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
>   impact 
> freq 3
> norm 13
> impacts_end 
> skipDoc 90
> skipDocFP 1311
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 10
>   impact 
> freq 3
> norm 13
>   impact 
> freq 4
> norm 14
> impacts_end 
> END
> checksum 000829315543
> {code}
> Compared with the previous format, we add *skipList, level, skipDoc, skipDocFP, impacts, 
> impact, freq, norm* nodes; at the same time, the simple text codec can now support 
> advanceShallow at search time.
>  
> h2. Why is there a question mark symbol in the file?
> Because *MultiLevelSkipListWriter* writes "length" and "childPointer" 
> as VLong.
> h2. Does this speed up the search process?
> No. It supports advanceShallow during search, but it does not make search faster, because 
> the skip list data comes after the docs (see the file shown above): all docs must be 
> iterated before the skip list data can be read, so search time never improves. There is 
> no "skipOffset" to jump directly to the skip list data, but as mentioned before, this is 
> for debugging, curiosity, and transparency only. If this is a problem, maybe 
> next time I can add a "skipOffset" so the skip list data can be read directly.
>  
>  
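
For readers unfamiliar with how the impact data in the dump above is consumed, the following is a hedged sketch (not part of the patch) of reading impacts through advanceShallow with Lucene's generic impacts API, using the "title" field and the term "args" from the example; the class and variable names are illustrative:

{code:java}
import java.io.IOException;
import org.apache.lucene.index.Impact;
import org.apache.lucene.index.Impacts;
import org.apache.lucene.index.ImpactsEnum;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

public class ImpactsDemo {
  // Prints the (freq, norm) pairs recorded in the skip data, without iterating
  // the postings themselves. Assumes the "title" field exists in this leaf.
  static void printImpacts(LeafReader reader) throws IOException {
    TermsEnum termsEnum = reader.terms("title").iterator();
    if (termsEnum.seekExact(new BytesRef("args")) == false) {
      return;
    }
    ImpactsEnum impactsEnum = termsEnum.impacts(PostingsEnum.FREQS);
    // Declare interest in docs up to 65 without actually advancing the postings.
    impactsEnum.advanceShallow(65);
    Impacts impacts = impactsEnum.getImpacts();
    for (int level = 0; level < impacts.numLevels(); level++) {
      System.out.println("level " + level + " covers docs up to " + impacts.getDocIdUpTo(level));
      for (Impact impact : impacts.getImpacts(level)) {
        // (freq, norm) pairs like those shown in the .pst dump above
        System.out.println("  freq=" + impact.freq + " norm=" + impact.norm);
      }
    }
  }
}
{code}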



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10035) Simple text codec add multi level skip list data

2021-08-11 Thread wuda (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10035?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

wuda updated LUCENE-10035:
--
Attachment: LUCENE-10035.patch

> Simple text codec add  multi level skip list data 
> --
>
> Key: LUCENE-10035
> URL: https://issues.apache.org/jira/browse/LUCENE-10035
> Project: Lucene - Core
>  Issue Type: Wish
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: wuda
>Priority: Major
>  Labels: Impact, MultiLevelSkipList, SimpleTextCodec
> Attachments: LUCENE-10035.patch
>
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> The simple text codec now adds skip list data (including impacts) to help with 
> understanding the index format, for debugging, curiosity, and transparency only. 
> When a term's docFreq is greater than or equal to SimpleTextSkipWriter.BLOCK_SIZE 
> (default value 8), the simple text codec writes a skip list, and the *.pst (simple 
> text term dictionary)* file looks like this:
> {code:java}
> field title
>   term args
> doc 2
>   freq 2
>   pos 7
>   pos 10
> ## we omit docs for better view ..
> doc 98
>   freq 2
>   pos 2
>   pos 6
> skipList 
> ?
>   level 1
> skipDoc 65
> skipDocFP 949
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
>   impact 
> freq 3
> norm 13
> impacts_end 
> ?
>   level 0
> skipDoc 17
> skipDocFP 284
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
> impacts_end 
> skipDoc 34
> skipDocFP 624
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
>   impact 
> freq 3
> norm 14
> impacts_end 
> skipDoc 65
> skipDocFP 949
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 12
>   impact 
> freq 3
> norm 13
> impacts_end 
> skipDoc 90
> skipDocFP 1311
> impacts 
>   impact 
> freq 1
> norm 2
>   impact 
> freq 2
> norm 10
>   impact 
> freq 3
> norm 13
>   impact 
> freq 4
> norm 14
> impacts_end 
> END
> checksum 000829315543
> {code}
> Compared with the previous format, we add *skipList, level, skipDoc, skipDocFP, impacts, 
> impact, freq, norm* nodes; at the same time, the simple text codec can now support 
> advanceShallow at search time.
>  
> h2. Why is there a question mark symbol in the file?
> Because *MultiLevelSkipListWriter* writes "length" and "childPointer" 
> as VLong.
> h2. Does this speed up the search process?
> No. It supports advanceShallow during search, but it does not make search faster, because 
> the skip list data comes after the docs (see the file shown above): all docs must be 
> iterated before the skip list data can be read, so search time never improves. There is 
> no "skipOffset" to jump directly to the skip list data, but as mentioned before, this is 
> for debugging, curiosity, and transparency only. If this is a problem, maybe 
> next time I can add a "skipOffset" so the skip list data can be read directly.
>  
>  



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on a change in pull request #235: LUCENE-9614: add KnnVectorQuery implementation

2021-08-11 Thread GitBox


jtibshirani commented on a change in pull request #235:
URL: https://github.com/apache/lucene/pull/235#discussion_r686817275



##
File path: 
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java
##
@@ -43,9 +43,9 @@ public float compare(float[] v1, float[] v2) {
   };
 
   /**
-   * If true, the scores associated with vector comparisons are in reverse 
order; that is, lower
-   * scores represent more similar vectors. Otherwise, if false, higher scores 
represent more
-   * similar vectors.
+   * If true, the scores associated with vector comparisons are nonnegative 
and in reverse order;
+   * that is, lower scores represent more similar vectors. Otherwise, if 
false, higher scores
+   * represent more similar vectors, and scores may be negative or positive.

Review comment:
   Oh I had misunderstood this docs update. Sounds good to save your idea 
for a follow-up.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10043) Decrease default for LRUQueryCache#skipCacheFactor?

2021-08-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397339#comment-17397339
 ] 

ASF subversion and git services commented on LUCENE-10043:
--

Commit 2e3620fe0a70d7e1dc3261112ea314ab5512bd3f in lucene-solr's branch 
refs/heads/branch_8x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2e3620f ]

LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10

In LUCENE-9002 we introduced logic to skip caching a clause if it would be too
expensive compared to the usual query cost. Specifically, we avoid caching a
clause if its cost is estimated to be 250x higher than the lead iterator's.
We've found that the default of 250 is quite high and can lead to poor tail
latencies. This PR decreases it to 10 to cache more conservatively


> Decrease default for LRUQueryCache#skipCacheFactor?
> ---
>
> Key: LUCENE-10043
> URL: https://issues.apache.org/jira/browse/LUCENE-10043
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>  Time Spent: 1h 20m
>  Remaining Estimate: 0h
>
> In LUCENE-9002 we introduced logic to skip caching a clause if it would be 
> too expensive compared to the usual query cost. Specifically, we avoid 
> caching a clause if its cost is estimated to be a factor higher than the lead 
> iterator's:
> {code}
>   // skip cache operation which would slow query down too much
>   if (cost / skipCacheFactor > leadCost) {
> return supplier.get(leadCost);
>   }
> {code}
> Choosing good defaults is hard! We've seen some examples in Elasticsearch 
> where caching a query clause causes a major slowdown, contributing to poor 
> tail latencies. It made me think that the default 'skipCacheFactor' of 250 
> may be too high -- interpreted simply, this means we'll cache a clause even 
> if it is ~250 times more expensive than running the top-level query on its 
> own. Would it make sense to decrease this to 10 or so? It seems okay to err 
> on the side of less caching for individual clauses, especially since any 
> parent 'BooleanQuery' is already eligible for caching?
> As a note, the interpretation "~250 times more expensive than running the 
> top-level query on its own" isn't perfectly accurate. The true cost doesn't 
> depend only on the number of matched documents, but also on the cost of matching 
> itself. Making it even more complex, some queries like 
> 'IndexOrDocValuesQuery' have different matching strategies based on whether 
> they're used as a lead iterator or verifier. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9002) query caching leads to absurdly slow queries

2021-08-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397340#comment-17397340
 ] 

ASF subversion and git services commented on LUCENE-9002:
-

Commit 2e3620fe0a70d7e1dc3261112ea314ab5512bd3f in lucene-solr's branch 
refs/heads/branch_8x from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=2e3620f ]

LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10

In LUCENE-9002 we introduced logic to skip caching a clause if it would be too
expensive compared to the usual query cost. Specifically, we avoid caching a
clause if its cost is estimated to be 250x higher than the lead iterator's.
We've found that the default of 250 is quite high and can lead to poor tail
latencies. This PR decreases it to 10 to cache more conservatively


> query caching leads to absurdly slow queries
> 
>
> Key: LUCENE-9002
> URL: https://issues.apache.org/jira/browse/LUCENE-9002
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 7.7.2, 8.2
>Reporter: Guoqiang Jiang
>Priority: Major
>  Labels: cache, performance
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> *Description*
> We have dozens of ES clusters(based on Lucene) for metric scenarios. Most of 
> the queries are like this: _host_ip:10.10.10.10 AND timestamp:[2019-10-01 
> 00:00:00 TO 2019-10-05 23:59:59]_. And we frequently encounter some absurdly 
> slow queries.
> *Solution*
> For a long time range query(e.g. 5 days), each range query will consume tens 
> of megabytes of memory and spend hundreds of milliseconds to cache, but the 
> benefits are not obvious. And those large cache entries will cause frequent 
> cache eviction. So it's better to  skip the caching action directly when 
> large range query appears with a selective lead iterator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on a change in pull request #232: LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10

2021-08-11 Thread GitBox


jtibshirani commented on a change in pull request #232:
URL: https://github.com/apache/lucene/pull/232#discussion_r686810436



##
File path: lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java
##
@@ -146,7 +146,7 @@ public LRUQueryCache(
* of the top-level query will be cached in order to not hurt latency too 
much because of caching.
*/
   public LRUQueryCache(int maxSize, long maxRamBytesUsed) {
-this(maxSize, maxRamBytesUsed, new MinSegmentSizePredicate(1, .03f), 
250);
+this(maxSize, maxRamBytesUsed, new MinSegmentSizePredicate(1, .03f), 
10);

Review comment:
   I looked into this and now think the change isn't worth it -- it 
requires a new method on `QueryCachingPolicy` but doesn't help clarify or 
clean things up very much.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] msokolov commented on a change in pull request #235: LUCENE-9614: add KnnVectorQuery implementation

2021-08-11 Thread GitBox


msokolov commented on a change in pull request #235:
URL: https://github.com/apache/lucene/pull/235#discussion_r686734774



##
File path: 
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java
##
@@ -43,9 +43,9 @@ public float compare(float[] v1, float[] v2) {
   };
 
   /**
-   * If true, the scores associated with vector comparisons are in reverse 
order; that is, lower
-   * scores represent more similar vectors. Otherwise, if false, higher scores 
represent more
-   * similar vectors.
+   * If true, the scores associated with vector comparisons are nonnegative 
and in reverse order;
+   * that is, lower scores represent more similar vectors. Otherwise, if 
false, higher scores
+   * represent more similar vectors, and scores may be negative or positive.

Review comment:
   So the current patch did not change the `VectorSimilarity.compare` 
function implementations. It does the `1/(1+x)` normalization in 
`VectorsReader.search`, so I think the comment is still valid. I suppose we 
could impose a requirement on similarity functions that they always return 
[0,1] in ascending-similar order and eliminate the whole notion of reversed 
similarities. That would be a nice simplification to the API, and I think we've 
shown that it is achievable. But if we do decide that's best, I'd like to do it 
separately since it will touch a bunch of places that aren't changed yet in 
this PR.
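
For readers following along, a rough sketch of the kind of mapping being described (illustrative, not the exact code in the vectors reader): a reversed similarity such as a Euclidean distance d >= 0 can be turned into a score in (0, 1] where higher means more similar, while an already ascending-similar score is passed through unchanged.

{code:java}
// Illustrative helper, not taken from the PR: maps a raw similarity value to a
// query score that is always "higher is better".
public class ScoreMappingSketch {
  static float toQueryScore(float rawValue, boolean reversed) {
    if (reversed) {
      // lower raw value = more similar, and rawValue >= 0, so 1 / (1 + x)
      // flips the order and bounds the score to (0, 1]
      return 1f / (1f + rawValue);
    }
    return rawValue; // already higher-is-more-similar
  }
}
{code}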




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10043) Decrease default for LRUQueryCache#skipCacheFactor?

2021-08-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397258#comment-17397258
 ] 

ASF subversion and git services commented on LUCENE-10043:
--

Commit a9fb5a965dffd2600f7da7700e6429ebf19e14d6 in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a9fb5a9 ]

LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10 (#232)

In LUCENE-9002 we introduced logic to skip caching a clause if it would be too
expensive compared to the usual query cost. Specifically, we avoid caching a
clause if its cost is estimated to be 250x higher than the lead iterator's.
We've found that the default of 250 is quite high and can lead to poor tail
latencies. This PR decreases it to 10 to cache more conservatively.

> Decrease default for LRUQueryCache#skipCacheFactor?
> ---
>
> Key: LUCENE-10043
> URL: https://issues.apache.org/jira/browse/LUCENE-10043
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Minor
>  Time Spent: 1h
>  Remaining Estimate: 0h
>
> In LUCENE-9002 we introduced logic to skip caching a clause if it would be 
> too expensive compared to the usual query cost. Specifically, we avoid 
> caching a clause if its cost is estimated to be a factor higher than the lead 
> iterator's:
> {code}
>   // skip cache operation which would slow query down too much
>   if (cost / skipCacheFactor > leadCost) {
> return supplier.get(leadCost);
>   }
> {code}
> Choosing good defaults is hard! We've seen some examples in Elasticsearch 
> where caching a query clause causes a major slowdown, contributing to poor 
> tail latencies. It made me think that the default 'skipCacheFactor' of 250 
> may be too high -- interpreted simply, this means we'll cache a clause even 
> if it is ~250 times more expensive than running the top-level query on its 
> own. Would it make sense to decrease this to 10 or so? It seems okay to err 
> on the side of less caching for individual clauses, especially since any 
> parent 'BooleanQuery' is already eligible for caching?
> As a note, the interpretation "~250 times more expensive than running the 
> top-level query on its own" isn't perfectly accurate. The true cost doesn't 
> depend only on the number of matched documents, but also on the cost of matching 
> itself. Making it even more complex, some queries like 
> 'IndexOrDocValuesQuery' have different matching strategies based on whether 
> they're used as a lead iterator or verifier. 



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9002) query caching leads to absurdly slow queries

2021-08-11 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9002?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17397259#comment-17397259
 ] 

ASF subversion and git services commented on LUCENE-9002:
-

Commit a9fb5a965dffd2600f7da7700e6429ebf19e14d6 in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=a9fb5a9 ]

LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10 (#232)

In LUCENE-9002 we introduced logic to skip caching a clause if it would be too
expensive compared to the usual query cost. Specifically, we avoid caching a
clause if its cost is estimated to be 250x higher than the lead iterator's.
We've found that the default of 250 is quite high and can lead to poor tail
latencies. This PR decreases it to 10 to cache more conservatively.

> query caching leads to absurdly slow queries
> 
>
> Key: LUCENE-9002
> URL: https://issues.apache.org/jira/browse/LUCENE-9002
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/search
>Affects Versions: 7.7.2, 8.2
>Reporter: Guoqiang Jiang
>Priority: Major
>  Labels: cache, performance
>  Time Spent: 9h 20m
>  Remaining Estimate: 0h
>
> *Description*
> We have dozens of ES clusters(based on Lucene) for metric scenarios. Most of 
> the queries are like this: _host_ip:10.10.10.10 AND timestamp:[2019-10-01 
> 00:00:00 TO 2019-10-05 23:59:59]_. And we frequently encounter some absurdly 
> slow queries.
> *Solution*
> For a long time range query(e.g. 5 days), each range query will consume tens 
> of megabytes of memory and spend hundreds of milliseconds to cache, but the 
> benefits are not obvious. And those large cache entries will cause frequent 
> cache eviction. So it's better to  skip the caching action directly when 
> large range query appears with a selective lead iterator.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani merged pull request #232: LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10

2021-08-11 Thread GitBox


jtibshirani merged pull request #232:
URL: https://github.com/apache/lucene/pull/232


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on pull request #232: LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10

2021-08-11 Thread GitBox


jtibshirani commented on pull request #232:
URL: https://github.com/apache/lucene/pull/232#issuecomment-896706043


   Thanks for the review.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on a change in pull request #232: LUCENE-10043: Decrease default LRUQueryCache#skipCacheFactor to 10

2021-08-11 Thread GitBox


jtibshirani commented on a change in pull request #232:
URL: https://github.com/apache/lucene/pull/232#discussion_r686703128



##
File path: lucene/core/src/java/org/apache/lucene/search/LRUQueryCache.java
##
@@ -146,7 +146,7 @@ public LRUQueryCache(
* of the top-level query will be cached in order to not hurt latency too 
much because of caching.
*/
   public LRUQueryCache(int maxSize, long maxRamBytesUsed) {
-this(maxSize, maxRamBytesUsed, new MinSegmentSizePredicate(1, .03f), 
250);
+this(maxSize, maxRamBytesUsed, new MinSegmentSizePredicate(1, .03f), 
10);

Review comment:
   I plan to do this refactor in a follow-up on 9.0 only, to avoid changing 
the APIs too much on 8.x.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on a change in pull request #235: LUCENE-9614: add KnnVectorQuery implementation

2021-08-11 Thread GitBox


jtibshirani commented on a change in pull request #235:
URL: https://github.com/apache/lucene/pull/235#discussion_r686637119



##
File path: 
lucene/core/src/java/org/apache/lucene/index/VectorSimilarityFunction.java
##
@@ -43,9 +43,9 @@ public float compare(float[] v1, float[] v2) {
   };
 
   /**
-   * If true, the scores associated with vector comparisons are in reverse 
order; that is, lower
-   * scores represent more similar vectors. Otherwise, if false, higher scores 
represent more
-   * similar vectors.
+   * If true, the scores associated with vector comparisons are nonnegative 
and in reverse order;
+   * that is, lower scores represent more similar vectors. Otherwise, if 
false, higher scores
+   * represent more similar vectors, and scores may be negative or positive.

Review comment:
   I think we can revert part of this since you went with `1 / (1 + x)`.

##
File path: lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java
##
@@ -0,0 +1,318 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search;
+
+import static org.apache.lucene.search.DocIdSetIterator.NO_MORE_DOCS;
+
+import java.io.IOException;
+import java.util.Arrays;
+import java.util.Comparator;
+import java.util.Objects;
+import org.apache.lucene.codecs.KnnVectorsReader;
+import org.apache.lucene.document.KnnVectorField;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.LeafReaderContext;
+
+/** Uses {@link KnnVectorsReader#search} to perform nearest Neighbour search. 
*/
+public class KnnVectorQuery extends Query {
+
+  private static final TopDocs NO_RESULTS =
+  new TopDocs(new TotalHits(0, TotalHits.Relation.EQUAL_TO), new 
ScoreDoc[0]);
+
+  private final String field;
+  private final float[] target;
+  private final int k;
+
+  /**
+   * Find the k nearest documents to the target vector according 
to the vectors in the
+   * given field. target vector.
+   *
+   * @param field a field that has been indexed as a {@link KnnVectorField}.
+   * @param target the target of the search
+   * @param k the number of documents to find
+   * @throws IllegalArgumentException if k is less than 1
+   */
+  public KnnVectorQuery(String field, float[] target, int k) {
+this.field = field;
+this.target = target;
+this.k = k;
+if (k < 1) {
+  throw new IllegalArgumentException("k must be at least 1, got: " + k);
+}
+  }
+
+  @Override
+  public Query rewrite(IndexReader reader) throws IOException {
+int boundedK = Math.min(k, reader.numDocs());
+TopDocs[] perLeafResults = new TopDocs[reader.leaves().size()];
+for (LeafReaderContext ctx : reader.leaves()) {
+  // Calculate kPerLeaf as an overestimate of the expected number of the 
closest k documents in
+  // this leaf
+  int expectedKPerLeaf = Math.max(1, boundedK * ctx.reader().numDocs() / 
reader.numDocs());
+  // Increase to include 3 std. deviations of a Binomial distribution.
+  int kPerLeaf = (int) (expectedKPerLeaf + 3 * 
Math.sqrt(expectedKPerLeaf));
+  perLeafResults[ctx.ord] = searchLeaf(ctx, kPerLeaf);
+}
+// Merge sort the results
+TopDocs topK = TopDocs.merge(boundedK, perLeafResults);
+// re-query any outlier segments (normally there should be none).
+topK = checkForOutlierSegments(reader, topK, perLeafResults);
+if (topK.scoreDocs.length == 0) {
+  return new MatchNoDocsQuery();
+}
+return createRewrittenQuery(reader, topK);
+  }
+
+  private TopDocs searchLeaf(LeafReaderContext ctx, int kPerLeaf) throws 
IOException {
+TopDocs results = ctx.reader().searchNearestVectors(field, target, 
kPerLeaf);
+if (results == null) {
+  return NO_RESULTS;
+}
+if (ctx.docBase > 0) {
+  for (ScoreDoc scoreDoc : results.scoreDocs) {
+scoreDoc.doc += ctx.docBase;
+  }
+}
+return results;
+  }
+
+  private TopDocs checkForOutlierSegments(IndexReader reader, TopDocs topK, 
TopDocs[] perLeaf)
+  throws IOException {
+int k = topK.scoreDocs.length;
+if (k == 0) {
+  return topK;
+}
+float minScore = topK.scoreDocs[topK.scoreDocs.length - 1].score;
+boolean rescored =