Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]
github-actions[bot] commented on PR #12966: URL: https://github.com/apache/lucene/pull/12966#issuecomment-1922531005

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Made the UnifiedHighlighter's hasUnrecognizedQuery function processes FunctionQuery the same way as MatchAllDocsQuery and MatchNoDocsQuery queries for performance reasons. [lucene]
github-actions[bot] commented on PR #12938: URL: https://github.com/apache/lucene/pull/12938#issuecomment-1922531062

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Fix DV update files referenced by merge will be deleted by concurrent flush [lucene]
github-actions[bot] commented on PR #13017: URL: https://github.com/apache/lucene/pull/13017#issuecomment-1922530938

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
benwtrent commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1475060677

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
   filterWeight = null;
 }
+final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;
+KnnCollectorManager knnCollectorManager = getKnnCollectorManager(k, supportsConcurrency);

Review Comment: I think this interface should accept the `indexSearcher` as the parameter and not `supportsConcurrency` or `multipleLeaves`
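A minimal sketch of what the suggested signature change could look like, with simplified stand-in types (these are not the real Lucene classes — `IndexSearcherStandIn` and the one-method interfaces below are assumptions for illustration): the factory takes the searcher itself, so the manager can derive whatever it needs rather than being handed a precomputed boolean.

```java
import java.io.IOException;

// Stand-ins for the real Lucene types, simplified for this sketch.
interface KnnCollector {}

interface KnnCollectorManager {
  KnnCollector newCollector(int visitLimit) throws IOException;
}

class IndexSearcherStandIn {
  private final int sliceCount;
  IndexSearcherStandIn(int sliceCount) { this.sliceCount = sliceCount; }
  int getSliceCount() { return sliceCount; }
}

public class KnnManagerSketch {
  // Suggested shape: the searcher is the parameter, not a boolean flag.
  static KnnCollectorManager getKnnCollectorManager(int k, IndexSearcherStandIn searcher) {
    // The factory, not the caller, decides what "supports concurrency" means.
    boolean multiThreaded = searcher.getSliceCount() > 1;
    System.out.println("multiThreaded=" + multiThreaded);
    return visitLimit -> new KnnCollector() {};
  }

  public static void main(String[] args) throws IOException {
    KnnCollector c = getKnnCollectorManager(10, new IndexSearcherStandIn(4)).newCollector(1000);
    if (c == null) throw new AssertionError("collector not created");
  }
}
```

The benefit of this shape is that adding a new signal later (segment count, executor, filter) changes no call sites — only the factory body.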
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
benwtrent commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1475057976

## lucene/core/src/java/org/apache/lucene/search/knn/KnnCollectorManager.java:

@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search.knn;
+
+import java.io.IOException;
+import org.apache.lucene.search.KnnCollector;
+import org.apache.lucene.util.BitSet;
+
+/**
+ * KnnCollectorManager responsible for creating {@link KnnCollector} instances. Useful to create
+ * {@link KnnCollector} instances that share global state across leaves, such a global queue of
+ * results collected so far.
+ */
+public abstract class KnnCollectorManager {

Review Comment: Do we need these generics? This also seems like it should be an `interface`

## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:

@@ -123,7 +124,16 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept
 }

 @Override
- protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit)
+ protected KnnCollectorManager getKnnCollectorManager(int k, boolean supportsConcurrency) {
+   return new DiversifyingNearestChildrenKnnCollectorManager(k);

Review Comment: If we adjust the interface, this manager could know about `BitSetProducer parentsFilter;` and abstract that away from this query.

## lucene/core/src/java/org/apache/lucene/index/LeafReader.java:

@@ -277,27 +273,24 @@
  *
  * @param field the vector field to search
  * @param target the vector-valued query
- * @param k the number of docs to return
  * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null}
  *     if they are all allowed to match.
- * @param visitedLimit the maximum number of nodes that the search is allowed to visit
+ * @param knnCollector collector with settings for gathering the vector results.
  * @return the k nearest neighbor documents, along with their (searchStrategy-specific) scores.
  * @lucene.experimental
  */
 public final TopDocs searchNearestVectors(
-     String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException {

Review Comment: same here, this shouldn't be mutated at all.

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnCollector.java:

@@ -23,7 +23,7 @@
  */
 public abstract class AbstractKnnCollector implements KnnCollector {
-  private long visitedCount;
+  long visitedCount;

Review Comment: I think this should be protected, not package private. Only sub-classes should be able to read it.

## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:

@@ -27,25 +29,78 @@
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  // greediness of globally non-competitive search: (0,1]
+  private static final float DEFAULT_GREEDINESS = 0.9f;
+  // the local queue of the results with the highest similarities collected so far in the current
+  // segment

Review Comment: I think this should be a separate collector. Something like `MultiLeafTopKnnCollector`. There is so little code from the original collector still around, it seems weird to me. We should have two, one that shares information, another that doesn't. This allows us to remove all the `null` values in the ctor.
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1474964972

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
   filterWeight = null;
 }
+final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;

Review Comment: I copied this from how we do it for [top docs collectors](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L478). But you are right: because we see speedups even in sequential runs, your suggestion makes sense.
Re: [PR] Backport SOLR-14765 to branch_8_11 [lucene-solr]
HoustonPutman commented on PR #2682: URL: https://github.com/apache/lucene-solr/pull/2682#issuecomment-1921919670

Hey @risdenk are you still planning on getting this in?
Re: [PR] Fix knn vector visit limit fence post error [lucene]
jpountz commented on PR #13058: URL: https://github.com/apache/lucene/pull/13058#issuecomment-1921879126

Yeah, it's annoying, but I agree that adding more APIs to fix this is overkill. Can you update your change to keep cost = cardinality, and only do the +1 when calling `approximateSearch()`?
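A small sketch of the fix being requested, with hypothetical names (the real change lives in Lucene's kNN query code): the reported cost stays equal to the filter's cardinality, and the `+1` is applied only at the point where the visit limit is handed to the approximate search, so visiting exactly `cardinality` nodes is no longer mistaken for early termination.

```java
public class VisitLimitSketch {
  // Stand-in for the real graph search: reports early termination only
  // when it had to stop because the visit budget ran out.
  static boolean approximateSearch(int nodesActuallyVisited, int visitedLimit) {
    return nodesActuallyVisited >= visitedLimit;
  }

  public static void main(String[] args) {
    int cardinality = 100;   // docs matching the pre-filter
    int cost = cardinality;  // unchanged: cost == cardinality

    // The +1 happens only at the call site. Visiting all `cardinality`
    // filtered docs now stays strictly below the limit.
    boolean earlyTerminated = approximateSearch(cardinality, cost + 1);
    if (earlyTerminated) {
      throw new AssertionError("visiting every filtered doc must not count as early termination");
    }
    System.out.println("earlyTerminated=" + earlyTerminated);
  }
}
```

Keeping `cost` untouched matters because other components may consume it as a cardinality estimate; the off-by-one correction is a concern of the search call alone.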
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
jimczi commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1474802862

## lucene/core/src/java/org/apache/lucene/index/LeafReader.java:

@@ -236,27 +235,24 @@ public final PostingsEnum postings(Term term) throws IOException {
  *
  * @param field the vector field to search
  * @param target the vector-valued query
- * @param k the number of docs to return
  * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null}
  *     if they are all allowed to match.
- * @param visitedLimit the maximum number of nodes that the search is allowed to visit
+ * @param knnCollector collector with settings for gathering the vector results.
  * @return the k nearest neighbor documents, along with their (searchStrategy-specific) scores.
  * @lucene.experimental
  */
 public final TopDocs searchNearestVectors(
-     String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException {
+     String field, float[] target, Bits acceptDocs, KnnCollector knnCollector) throws IOException {

Review Comment: Is it a leftover? There's already a variant with the `KnnCollector` in the abstract functions. I don't think we should change this signature, it's ok if this version doesn't use the global queue imo.

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
   filterWeight = null;
 }
+final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;
+KnnCollectorManager knnCollectorManager = getKnnCollectorManager(k, supportsConcurrency);
+
 TaskExecutor taskExecutor = indexSearcher.getTaskExecutor();
 List<LeafReaderContext> leafReaderContexts = reader.leaves();
 List<Callable<TopDocs>> tasks = new ArrayList<>(leafReaderContexts.size());
 for (LeafReaderContext context : leafReaderContexts) {
-  tasks.add(() -> searchLeaf(context, filterWeight));
+  tasks.add(() -> searchLeaf(context, filterWeight, knnCollectorManager));
 }
 TopDocs[] perLeafResults = taskExecutor.invokeAll(tasks).toArray(TopDocs[]::new);
 // Merge sort the results
 TopDocs topK = mergeLeafResults(perLeafResults);
+if (infoStream.isEnabled("KnnVectorQuery")) {
+  infoStream.message("KnnVectorQuery", "visited:" + topK.totalHits.value);

Review Comment: looks like a leftover too?

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
   filterWeight = null;
 }
+final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;

Review Comment: A slice can have multiple segments so it should rather check the reader's leaves here and maybe call the boolean `isMultiSegments`?
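The distinction the last comment raises can be made concrete with two counters (values here are illustrative, not taken from any real index): a single searcher slice may cover several segments (leaves), so a slice-count check and a leaf-count check can disagree.

```java
public class MultiSegmentCheck {
  public static void main(String[] args) {
    // Example: one slice that happens to group three segments.
    int sliceCount = 1;
    int leafCount = 3;

    boolean supportsConcurrency = sliceCount > 1; // the check in the PR
    boolean isMultiSegments = leafCount > 1;      // the check suggested in review

    // With one slice over three segments the two checks give different answers.
    if (supportsConcurrency == isMultiSegments) {
      throw new AssertionError("expected the two checks to disagree for this example");
    }
    System.out.println("slices=" + sliceCount + " leaves=" + leafCount
        + " supportsConcurrency=" + supportsConcurrency
        + " isMultiSegments=" + isMultiSegments);
  }
}
```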
Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]
benwtrent commented on issue #13065: URL: https://github.com/apache/lucene/issues/13065#issuecomment-1921775426

For the analyzer, are you meaning something that tokenizes into an embedding? Or just creates the tokens (wordpiece + dictionary)?
Re: [PR] Fix normalization in TeluguAnalyzer [lucene]
jpountz merged PR #13059: URL: https://github.com/apache/lucene/pull/13059
[I] Contributing a deep-learning, BERT-based analyzer [lucene]
lmessinger opened a new issue, #13065: URL: https://github.com/apache/lucene/issues/13065

### Description

Hi, we are building an open-source custom Hebrew/Arabic analyzer (lemmatizer and stopwords), based on a BERT model. We'd like to contribute this to this repository. How can we do that and be accepted? Can we compile it to native code and use JNI or [Panama](https://openjdk.org/projects/panama/)? If not, what is the best approach? https://github.com/apache/lucene/issues/12502#issuecomment-1675084211

@uschindler would be very happy to hear what you think
Re: [PR] Optimize counts on two clause term disjunctions [lucene]
jpountz merged PR #13036: URL: https://github.com/apache/lucene/pull/13036
Re: [PR] Fix knn vector visit limit fence post error [lucene]
benwtrent commented on PR #13058: URL: https://github.com/apache/lucene/pull/13058#issuecomment-1921169639

> "exactly visitLimit hits have been collected but I collected all that I needed" and "exactly visitLimit hits have been collected but I would need to collect more to be done"?

@jpountz I suppose the key thing is that the collector doesn't know if more candidates were left to explore. Being able to tell it this would extend its API and require some more code changes in the searcher. Basically, "earlyTerminated" would no longer be tied to numVisited; whatever is doing the exploring would have to increment the visited count and, when the visit limit is reached, flag "earlyTerminated" if it indeed terminated early. This seems like a bigger complication to me with little benefit.
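For concreteness, here is a hedged sketch of the API extension the comment weighs and declines — not Lucene's actual `KnnCollector` API, and all names (`FlaggedKnnCollector`, `markEarlyTerminated`) are illustrative. The explorer, not the collector, would flag early termination, so `earlyTerminated` decouples from `visitedCount == visitLimit`.

```java
public class FlaggedKnnCollector {
  private final long visitLimit;
  private long visitedCount;
  private boolean earlyTerminated; // set by whatever drives the exploration

  FlaggedKnnCollector(long visitLimit) { this.visitLimit = visitLimit; }

  // Returns false once the budget is exhausted; the caller decides what that means.
  boolean incVisitedCount(int count) {
    visitedCount += count;
    return visitedCount < visitLimit;
  }

  // Called by the explorer only if candidates were actually left unexplored.
  void markEarlyTerminated() { earlyTerminated = true; }

  // No longer inferred from visitedCount reaching the limit.
  boolean earlyTerminated() { return earlyTerminated; }

  public static void main(String[] args) {
    FlaggedKnnCollector c = new FlaggedKnnCollector(10);
    // Budget used up exactly, but the explorer had nothing left to visit
    // and therefore never flags early termination:
    c.incVisitedCount(10);
    if (c.earlyTerminated()) throw new AssertionError("should not be early-terminated");
    System.out.println("earlyTerminated=" + c.earlyTerminated());
  }
}
```

The cost the comment points at is visible even in this toy: every caller of `incVisitedCount` must now also decide whether to call `markEarlyTerminated`, spreading the bookkeeping across the searcher.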
Re: [I] Explore moving HNSW's NeighborQueue to a radix heap [LUCENE-10383] [lucene]
benwtrent commented on issue #11419: URL: https://github.com/apache/lucene/issues/11419#issuecomment-1921160057

> What was the decision behind adding these candidates `outOfOrder`?

Speed: once we know things are sorted, we know they have been checked for diversity. But with any optimization, measurement is important. If you can get the same graph results but faster by adjusting how we track diverse candidates & using a radix heap, I say go forth and benchmark :)
Re: [PR] Fix broken loop in TestDocumentsWriterStallControl.assertState() [lucene]
s1monw merged PR #13062: URL: https://github.com/apache/lucene/pull/13062
Re: [I] TestDocumentsWriterStallControl.assertState() does not do what it appears it would [lucene]
s1monw closed issue #13061: TestDocumentsWriterStallControl.assertState() does not do what it appears it would URL: https://github.com/apache/lucene/issues/13061
Re: [I] HnwsGraph creates disconnected components [lucene]
benwtrent commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1921135758

@nitirajrathore very interesting results. This sort of indicates to me that no matter the heuristic, we just need a second pass over the graph to ensure connectedness and fix it up :/ Even though the graph is different, the idea is the same here: https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java

I will see about testing your new "new heuristic with remove otherhalf and honour max-conn" as if I am reading this correctly, seems the most promising. But, I suspect, even in extreme scenarios, we will still need a second pass over the graph to ensure graph connectedness.
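The "second pass to ensure connectedness" can be sketched on a toy adjacency-list graph: BFS from the entry point, collect unreachable nodes, then attach them. This is only the detect-and-fix shape the comment describes — real HNSW repair would choose attachment points by vector similarity and respect max-conn, neither of which this sketch attempts.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class ConnectednessPass {
  // BFS from `entry`; returns the nodes it could not reach.
  static List<Integer> unreachable(List<List<Integer>> adj, int entry) {
    boolean[] seen = new boolean[adj.size()];
    ArrayDeque<Integer> queue = new ArrayDeque<>();
    seen[entry] = true;
    queue.add(entry);
    while (!queue.isEmpty()) {
      int node = queue.poll();
      for (int nbr : adj.get(node)) {
        if (!seen[nbr]) { seen[nbr] = true; queue.add(nbr); }
      }
    }
    List<Integer> missing = new ArrayList<>();
    for (int i = 0; i < adj.size(); i++) {
      if (!seen[i]) missing.add(i);
    }
    return missing;
  }

  public static void main(String[] args) {
    // Nodes 0-2 form one component; 3-4 form a disconnected one.
    List<List<Integer>> adj = new ArrayList<>();
    for (int i = 0; i < 5; i++) adj.add(new ArrayList<>());
    adj.get(0).add(1); adj.get(1).add(2);
    adj.get(3).add(4); adj.get(4).add(3);

    List<Integer> missing = unreachable(adj, 0);
    if (missing.size() != 2) throw new AssertionError("expected 2 unreachable nodes");

    // Naive fix-up: link each stranded node to the entry point (both ways).
    for (int n : missing) { adj.get(0).add(n); adj.get(n).add(0); }
    if (!unreachable(adj, 0).isEmpty()) throw new AssertionError("graph still disconnected");
    System.out.println("reconnected " + missing.size() + " nodes");
  }
}
```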
Re: [PR] Optimize counts on two clause term disjunctions [lucene]
jfreden commented on PR #13036: URL: https://github.com/apache/lucene/pull/13036#issuecomment-1920986116

Thank you @jpountz ! I've pushed changes to the tests, added the comment and also added an entry to `CHANGES.txt`.
Re: [I] HnwsGraph creates disconnected components [lucene]
nitirajrathore commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1920943319

Hi @benwtrent, I left Amazon, but I was able to run some tests with an open dataset and also with an Amazon dataset before leaving. I cannot share a whole lot of detail about the Amazon dataset, but some aggregated results are shown. Rest assured, I am picking this issue up on priority now.

I was able to run tests trying out different heuristics. Basically, I tried the Lucene `default` diversity check, which generates disconnected nodes even with 50K nodes of the minilm dataset. I further tried different variations as suggested in the HNSW paper. The `extended candidates` approach reduced the disconnectedness but did not eliminate it completely, and it increased the indexing time by many times. Next, the `keepPruned candidates` approach, keeping max-conn connections, increased the disconnected nodes. So I tried `keepPruned candidates` keeping pruned connections only up to max-conn/2. This also did not show any improvement.

Next, I tried the newly proposed heuristic but without the component of removing the two-way connections, so basically the new heuristic with just removing one side of the duplex connection in the graph. Interestingly, this also did not change the disconnectedness. This was a bit surprising to me. Then I tried `new heuristic with remove-otherhalf`, basically removing the two-way connections completely. This completely removed disconnectedness, and the number of disconnected nodes at all levels was zero. But unfortunately this means that the edges at some nodes can grow beyond max-conn. I did not get a chance to find the counts and distribution of total connections at each node beyond max-conn, but I assume this is something that we may not want in Lucene. So I thought maybe removing duplex edges (i.e. the remove-otherhalf) is the key to decreasing disconnectedness. So I tried the `default` algo and `keep pruned`, both with `remove-otherhalf`.
Interestingly, those two experiments also did not decrease the number of disconnected nodes. Now I was left with `new heuristic with remove other half` as the best option. To make sure that total connections per node do not grow beyond max-conn, I modified the algo to remove some random node in case it is not able to find the best node to remove (`new heuristic with remove otherhalf and honour max-conn`). This did help keep the overall disconnectedness at zero, but it did still show some nodes at level1 to be disconnected (see comments) for the minilm dataset. I tried max-conn=8, max-conn=16, num_docs=50K, and num_docs=100K. All gave zero overall disconnectedness and zero disconnected nodes at level0, but there were some disconnected nodes in the level1 graph.

Anyway, I also ran the experiment on the Amazon dataset for `new heuristic with remove otherhalf and honour max-conn`. I had to make code changes again to adapt it to the Lucene 9.7 version. I saw similar results. Here also the `default` algorithm gave a lot of disconnected nodes, but the new heuristic gave zero disconnected nodes at level0 and zero overall disconnected nodes. But at level1, and sometimes at level2 as well, there were some nodes that were disconnected. I am more confident of the new heuristic now, but won't mind running more tests and perf tests.

PR with all heuristics: https://github.com/apache/lucene/pull/12783/commits

| Algo | No. of disconnected nodes at zeroth level | %age overall disconnected (nodes) | Comments | Index time |
| --- | --- | --- | --- | --- |
| Baseline Equivalent | 90 | 0.1740 (87 nodes) | Same as baseline but with some extra parameters to allow more experiments | 33s |
| Baseline | 90 | 0.1740 (87) | Exactly the code as in production | |
| Extend Candidates | 39
Re: [PR] Optimize counts on two clause term disjunctions [lucene]
jpountz commented on code in PR #13036: URL: https://github.com/apache/lucene/pull/13036#discussion_r1474129857

## lucene/core/src/test/org/apache/lucene/search/TestBooleanQuery.java:

@@ -962,6 +962,118 @@ public void testDisjunctionMatchesCount() throws IOException {
   dir.close();
 }

+ public void testTwoClauseTermDisjunctionCountOptimization() throws Exception {
+   int largerTermCount = RandomNumbers.randomIntBetween(random(), 11, 100);
+   int smallerTermCount = RandomNumbers.randomIntBetween(random(), 1, (largerTermCount - 1) / 10);
+
+   List<String[]> docContent = new ArrayList<>(largerTermCount + smallerTermCount);
+
+   for (int i = 0; i < largerTermCount; i++) {
+     docContent.add(new String[] {"large"});
+   }
+
+   for (int i = 0; i < smallerTermCount; i++) {
+     docContent.add(new String[] {"small", "also small"});
+   }
+
+   try (Directory dir = newDirectory()) {
+     try (IndexWriter w =
+         new IndexWriter(dir, newIndexWriterConfig().setMergePolicy(newLogMergePolicy()))) {
+       for (String[] values : docContent) {
+         Document doc = new Document();
+         for (String value : values) {
+           doc.add(new StringField("foo", value, Field.Store.NO));
+         }
+         w.addDocument(doc);
+       }
+       w.forceMerge(1);
+     }
+
+     try (IndexReader reader = DirectoryReader.open(dir)) {
+       final int[] countInvocations = new int[] {0};
+       IndexSearcher countingIndexSearcher =
+           new IndexSearcher(reader) {
+             @Override
+             public int count(Query query) throws IOException {
+               countInvocations[0]++;
+               return super.count(query);
+             }
+           };
+
+       {
+         // Test no matches in either term
+         countInvocations[0] = 0;
+         BooleanQuery query =
+             new BooleanQuery.Builder()
+                 .add(new TermQuery(new Term("foo", "no match")), BooleanClause.Occur.SHOULD)
+                 .add(new TermQuery(new Term("foo", "also no match")), BooleanClause.Occur.SHOULD)
+                 .build();
+
+         assertEquals(0, countingIndexSearcher.count(query));
+         assertEquals(3, countInvocations[0]);
+       }
+       {
+         // Test match in one term
+         countInvocations[0] = 0;
+         BooleanQuery query =
+             new BooleanQuery.Builder()
+                 .add(new TermQuery(new Term("foo", "no match")), BooleanClause.Occur.SHOULD)
+                 .add(new TermQuery(new Term("foo", "small")), BooleanClause.Occur.SHOULD)
+                 .build();
+
+         assertEquals(smallerTermCount, countingIndexSearcher.count(query));
+         assertEquals(3, countInvocations[0]);
+       }

Review Comment: Can you do the same with clauses in the opposite order?
Re: [PR] Optimize counts on two clause term disjunctions [lucene]
jfreden commented on code in PR #13036: URL: https://github.com/apache/lucene/pull/13036#discussion_r1474087852

## lucene/core/src/test/org/apache/lucene/search/TestBooleanQuery.java:

@@ -962,6 +962,46 @@ public void testDisjunctionMatchesCount() throws IOException {
   dir.close();
 }

+ public void testTwoClauseTermDisjunctionCountOptimization() throws Exception {
+   List<String[]> docContent =
+       Arrays.asList(
+           new String[] {"A", "B"},
+           new String[] {"A"},
+           new String[] {},
+           new String[] {"A", "B", "C"},
+           new String[] {"B"},
+           new String[] {"B", "C"});
+
+   try (Directory dir = newDirectory()) {
+     try (IndexWriter w =
+         new IndexWriter(dir, newIndexWriterConfig().setMergePolicy(newLogMergePolicy()))) {
+       for (String[] values : docContent) {
+         Document doc = new Document();
+         for (String value : values) {
+           doc.add(new StringField("foo", value, Field.Store.NO));
+         }
+         w.addDocument(doc);
+       }
+       w.forceMerge(1);
+     }
+
+     try (IndexReader reader = DirectoryReader.open(dir)) {
+       IndexSearcher searcher = newSearcher(reader);
+
+       Query query =
+           new BooleanQuery.Builder()
+               .add(new TermQuery(new Term("foo", "A")), BooleanClause.Occur.SHOULD)
+               .add(new TermQuery(new Term("foo", "B")), BooleanClause.Occur.SHOULD)
+               .setMinimumNumberShouldMatch(1)
+               .build();
+
+       int count = searcher.count(query);
+       assertEquals(5, count);

Review Comment: Thank you! Added tests for that and also implemented a `countingIndexSearcher` that overrides `count` to make sure the recursion that is triggered by the optimization happens.
Re: [PR] Fix knn vector visit limit fence post error [lucene]
jpountz commented on PR #13058: URL: https://github.com/apache/lucene/pull/13058#issuecomment-1920761327

Hmm, sorry it still doesn't feel completely right... It feels like the issue is that the collector doesn't distinguish between "exactly visitLimit hits have been collected but I collected all that I needed" and "exactly visitLimit hits have been collected but I would need to collect more to be done"?