Re: [PR] Reduce duplication in taxonomy facets; always do counts [lucene]
github-actions[bot] commented on PR #12966: URL: https://github.com/apache/lucene/pull/12966#issuecomment-1922531005

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!

-- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
Re: [PR] Made the UnifiedHighlighter's hasUnrecognizedQuery function processes FunctionQuery the same way as MatchAllDocsQuery and MatchNoDocsQuery queries for performance reasons. [lucene]
github-actions[bot] commented on PR #12938: URL: https://github.com/apache/lucene/pull/12938#issuecomment-1922531062

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Fix DV update files referenced by merge will be deleted by concurrent flush [lucene]
github-actions[bot] commented on PR #13017: URL: https://github.com/apache/lucene/pull/13017#issuecomment-1922530938

This PR has not had activity in the past 2 weeks, labeling it as stale. If the PR is waiting for review, notify the d...@lucene.apache.org list. Thank you for your contribution!
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
benwtrent commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1475060677

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
   filterWeight = null;
 }
+final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;
+KnnCollectorManager knnCollectorManager = getKnnCollectorManager(k, supportsConcurrency);

Review Comment: I think this interface should accept the `indexSearcher` as the parameter and not `supportsConcurrency` or `multipleLeaves`
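A minimal sketch of what the suggested signature change could look like, with simplified stand-in types (these are not the real Lucene classes — `IndexSearcherStandIn` and the one-method interfaces below are assumptions for illustration): the factory takes the searcher itself, so the manager can derive whatever it needs rather than being handed a precomputed boolean.

```java
import java.io.IOException;

// Stand-ins for the real Lucene types, simplified for this sketch.
interface KnnCollector {}

interface KnnCollectorManager {
  KnnCollector newCollector(int visitLimit) throws IOException;
}

class IndexSearcherStandIn {
  private final int sliceCount;
  IndexSearcherStandIn(int sliceCount) { this.sliceCount = sliceCount; }
  int getSliceCount() { return sliceCount; }
}

public class KnnManagerSketch {
  // Suggested shape: the searcher is the parameter, not a boolean flag.
  static KnnCollectorManager getKnnCollectorManager(int k, IndexSearcherStandIn searcher) {
    // The factory, not the caller, decides what "supports concurrency" means.
    boolean multiThreaded = searcher.getSliceCount() > 1;
    System.out.println("multiThreaded=" + multiThreaded);
    return visitLimit -> new KnnCollector() {};
  }

  public static void main(String[] args) throws IOException {
    KnnCollector c = getKnnCollectorManager(10, new IndexSearcherStandIn(4)).newCollector(1000);
    if (c == null) throw new AssertionError("collector not created");
  }
}
```

The benefit of this shape is that adding a new signal later (segment count, executor, filter) changes no call sites — only the factory body.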
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
benwtrent commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1475057976

## lucene/core/src/java/org/apache/lucene/search/knn/KnnCollectorManager.java:

@@ -0,0 +1,38 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ *     http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.search.knn;
+
+import java.io.IOException;
+import org.apache.lucene.search.KnnCollector;
+import org.apache.lucene.util.BitSet;
+
+/**
+ * KnnCollectorManager responsible for creating {@link KnnCollector} instances. Useful to create
+ * {@link KnnCollector} instances that share global state across leaves, such a global queue of
+ * results collected so far.
+ */
+public abstract class KnnCollectorManager {

Review Comment: Do we need these generics? This also seems like it should be an `interface`

## lucene/join/src/java/org/apache/lucene/search/join/DiversifyingChildrenByteKnnVectorQuery.java:

@@ -123,7 +124,16 @@ protected TopDocs exactSearch(LeafReaderContext context, DocIdSetIterator accept
 }

 @Override
- protected TopDocs approximateSearch(LeafReaderContext context, Bits acceptDocs, int visitedLimit)
+ protected KnnCollectorManager getKnnCollectorManager(int k, boolean supportsConcurrency) {
+   return new DiversifyingNearestChildrenKnnCollectorManager(k);

Review Comment: If we adjust the interface, this manager could know about `BitSetProducer parentsFilter;` and abstract that away from this query.

## lucene/core/src/java/org/apache/lucene/index/LeafReader.java:

@@ -277,27 +273,24 @@
  *
  * @param field the vector field to search
  * @param target the vector-valued query
- * @param k the number of docs to return
  * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null}
  *     if they are all allowed to match.
- * @param visitedLimit the maximum number of nodes that the search is allowed to visit
+ * @param knnCollector collector with settings for gathering the vector results.
  * @return the k nearest neighbor documents, along with their (searchStrategy-specific) scores.
  * @lucene.experimental
  */
 public final TopDocs searchNearestVectors(
-     String field, byte[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException {

Review Comment: same here, this shouldn't be mutated at all.

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnCollector.java:

@@ -23,7 +23,7 @@
  */
 public abstract class AbstractKnnCollector implements KnnCollector {
-  private long visitedCount;
+  long visitedCount;

Review Comment: I think this should be protected, not package private. Only sub-classes should be able to read it.

## lucene/core/src/java/org/apache/lucene/search/TopKnnCollector.java:

@@ -27,25 +29,78 @@
  */
 public final class TopKnnCollector extends AbstractKnnCollector {
+  // greediness of globally non-competitive search: (0,1]
+  private static final float DEFAULT_GREEDINESS = 0.9f;
+  // the local queue of the results with the highest similarities collected so far in the current
+  // segment

Review Comment: I think this should be a separate collector. Something like `MultiLeafTopKnnCollector`. There is so little code from the original collector still around, it seems weird to me. We should have two, one that shares information, another that doesn't. This allows us to remove all the `null` values in the ctor.
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
mayya-sharipova commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1474964972

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
   filterWeight = null;
 }
+final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;

Review Comment: I copied this from how we do it for [top docs collectors](https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/search/IndexSearcher.java#L478). But you are right: because we see speedups even in sequential runs, your suggestion makes sense.
Re: [PR] Backport SOLR-14765 to branch_8_11 [lucene-solr]
HoustonPutman commented on PR #2682: URL: https://github.com/apache/lucene-solr/pull/2682#issuecomment-1921919670

Hey @risdenk are you still planning on getting this in?
Re: [PR] Fix knn vector visit limit fence post error [lucene]
jpountz commented on PR #13058: URL: https://github.com/apache/lucene/pull/13058#issuecomment-1921879126

Yeah, it's annoying, but I agree that adding more APIs to fix this is overkill. Can you update your change to keep cost = cardinality, and only do the +1 when calling `approximateSearch()`?
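A small sketch of the fix being requested, with hypothetical names (the real change lives in Lucene's kNN query code): the reported cost stays equal to the filter's cardinality, and the `+1` is applied only at the point where the visit limit is handed to the approximate search, so visiting exactly `cardinality` nodes is no longer mistaken for early termination.

```java
public class VisitLimitSketch {
  // Stand-in for the real graph search: reports early termination only
  // when it had to stop because the visit budget ran out.
  static boolean approximateSearch(int nodesActuallyVisited, int visitedLimit) {
    return nodesActuallyVisited >= visitedLimit;
  }

  public static void main(String[] args) {
    int cardinality = 100;   // docs matching the pre-filter
    int cost = cardinality;  // unchanged: cost == cardinality

    // The +1 happens only at the call site. Visiting all `cardinality`
    // filtered docs now stays strictly below the limit.
    boolean earlyTerminated = approximateSearch(cardinality, cost + 1);
    if (earlyTerminated) {
      throw new AssertionError("visiting every filtered doc must not count as early termination");
    }
    System.out.println("earlyTerminated=" + earlyTerminated);
  }
}
```

Keeping `cost` untouched matters because other components may consume it as a cardinality estimate; the off-by-one correction is a concern of the search call alone.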
Re: [PR] Speedup concurrent multi-segment HNWS graph search 2 [lucene]
jimczi commented on code in PR #12962: URL: https://github.com/apache/lucene/pull/12962#discussion_r1474802862

## lucene/core/src/java/org/apache/lucene/index/LeafReader.java:

@@ -236,27 +235,24 @@ public final PostingsEnum postings(Term term) throws IOException {
  *
  * @param field the vector field to search
  * @param target the vector-valued query
- * @param k the number of docs to return
  * @param acceptDocs {@link Bits} that represents the allowed documents to match, or {@code null}
  *     if they are all allowed to match.
- * @param visitedLimit the maximum number of nodes that the search is allowed to visit
+ * @param knnCollector collector with settings for gathering the vector results.
  * @return the k nearest neighbor documents, along with their (searchStrategy-specific) scores.
  * @lucene.experimental
  */
 public final TopDocs searchNearestVectors(
-     String field, float[] target, int k, Bits acceptDocs, int visitedLimit) throws IOException {
+     String field, float[] target, Bits acceptDocs, KnnCollector knnCollector) throws IOException {

Review Comment: Is it a leftover? There's already a variant with the `KnnCollector` in the abstract functions. I don't think we should change this signature, it's ok if this version doesn't use the global queue imo.

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
   filterWeight = null;
 }
+final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;
+KnnCollectorManager knnCollectorManager = getKnnCollectorManager(k, supportsConcurrency);
+
 TaskExecutor taskExecutor = indexSearcher.getTaskExecutor();
 List<LeafReaderContext> leafReaderContexts = reader.leaves();
 List<Callable<TopDocs>> tasks = new ArrayList<>(leafReaderContexts.size());
 for (LeafReaderContext context : leafReaderContexts) {
-  tasks.add(() -> searchLeaf(context, filterWeight));
+  tasks.add(() -> searchLeaf(context, filterWeight, knnCollectorManager));
 }
 TopDocs[] perLeafResults = taskExecutor.invokeAll(tasks).toArray(TopDocs[]::new);
 // Merge sort the results
 TopDocs topK = mergeLeafResults(perLeafResults);
+if (infoStream.isEnabled("KnnVectorQuery")) {
+  infoStream.message("KnnVectorQuery", "visited:" + topK.totalHits.value);

Review Comment: looks like a leftover too?

## lucene/core/src/java/org/apache/lucene/search/AbstractKnnVectorQuery.java:

@@ -79,24 +83,32 @@ public Query rewrite(IndexSearcher indexSearcher) throws IOException {
   filterWeight = null;
 }
+final boolean supportsConcurrency = indexSearcher.getSlices().length > 1;

Review Comment: A slice can have multiple segments so it should rather check the reader's leaves here and maybe call the boolean `isMultiSegments`?
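The distinction the last comment raises can be made concrete with two counters (values here are illustrative, not taken from any real index): a single searcher slice may cover several segments (leaves), so a slice-count check and a leaf-count check can disagree.

```java
public class MultiSegmentCheck {
  public static void main(String[] args) {
    // Example: one slice that happens to group three segments.
    int sliceCount = 1;
    int leafCount = 3;

    boolean supportsConcurrency = sliceCount > 1; // the check in the PR
    boolean isMultiSegments = leafCount > 1;      // the check suggested in review

    // With one slice over three segments the two checks give different answers.
    if (supportsConcurrency == isMultiSegments) {
      throw new AssertionError("expected the two checks to disagree for this example");
    }
    System.out.println("slices=" + sliceCount + " leaves=" + leafCount
        + " supportsConcurrency=" + supportsConcurrency
        + " isMultiSegments=" + isMultiSegments);
  }
}
```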
Re: [I] Contributing a deep-learning, BERT-based analyzer [lucene]
benwtrent commented on issue #13065: URL: https://github.com/apache/lucene/issues/13065#issuecomment-1921775426

For the analyzer, are you meaning something that tokenizes into an embedding? Or just creates the tokens (wordpiece + dictionary)?
Re: [PR] Fix normalization in TeluguAnalyzer [lucene]
jpountz merged PR #13059: URL: https://github.com/apache/lucene/pull/13059
[I] Contributing a deep-learning, BERT-based analyzer [lucene]
lmessinger opened a new issue, #13065: URL: https://github.com/apache/lucene/issues/13065

### Description

Hi, we are building an open-source custom Hebrew/Arabic analyzer (lemmatizer and stopwords), based on a BERT model. We'd like to contribute this to this repository. How can we do that and be accepted? Can we compile it to native code and use JNI or [Panama](https://openjdk.org/projects/panama/)? If not, what is the best approach? https://github.com/apache/lucene/issues/12502#issuecomment-1675084211

@uschindler would be very happy to hear what you think
Re: [PR] Optimize counts on two clause term disjunctions [lucene]
jpountz merged PR #13036: URL: https://github.com/apache/lucene/pull/13036
Re: [PR] Fix knn vector visit limit fence post error [lucene]
benwtrent commented on PR #13058: URL: https://github.com/apache/lucene/pull/13058#issuecomment-1921169639

> "exactly visitLimit hits have been collected but I collected all that I needed" and "exactly visitLimit hits have been collected but I would need to collect more to be done"?

@jpountz I suppose the key thing is that the collector doesn't know if more candidates were left to explore. Being able to tell it this would extend its API and require some more code changes in the searcher. Basically, "earlyTerminated" would no longer be tied to numVisited; whatever is doing the exploring would have to increment the visited count and, when the visit limit is reached, flag "earlyTerminated" if it indeed terminated early. This seems like a bigger complication to me with little benefit.
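For concreteness, here is a hedged sketch of the API extension the comment weighs and declines — not Lucene's actual `KnnCollector` API, and all names (`FlaggedKnnCollector`, `markEarlyTerminated`) are illustrative. The explorer, not the collector, would flag early termination, so `earlyTerminated` decouples from `visitedCount == visitLimit`.

```java
public class FlaggedKnnCollector {
  private final long visitLimit;
  private long visitedCount;
  private boolean earlyTerminated; // set by whatever drives the exploration

  FlaggedKnnCollector(long visitLimit) { this.visitLimit = visitLimit; }

  // Returns false once the budget is exhausted; the caller decides what that means.
  boolean incVisitedCount(int count) {
    visitedCount += count;
    return visitedCount < visitLimit;
  }

  // Called by the explorer only if candidates were actually left unexplored.
  void markEarlyTerminated() { earlyTerminated = true; }

  // No longer inferred from visitedCount reaching the limit.
  boolean earlyTerminated() { return earlyTerminated; }

  public static void main(String[] args) {
    FlaggedKnnCollector c = new FlaggedKnnCollector(10);
    // Budget used up exactly, but the explorer had nothing left to visit
    // and therefore never flags early termination:
    c.incVisitedCount(10);
    if (c.earlyTerminated()) throw new AssertionError("should not be early-terminated");
    System.out.println("earlyTerminated=" + c.earlyTerminated());
  }
}
```

The cost the comment points at is visible even in this toy: every caller of `incVisitedCount` must now also decide whether to call `markEarlyTerminated`, spreading the bookkeeping across the searcher.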
Re: [I] Explore moving HNSW's NeighborQueue to a radix heap [LUCENE-10383] [lucene]
benwtrent commented on issue #11419: URL: https://github.com/apache/lucene/issues/11419#issuecomment-1921160057

> What was the decision behind adding these candidates `outOfOrder`?

Speed: once we know things are sorted, we know they have been checked for diversity. But with any optimization, measurement is important. If you can get the same graph results but faster by adjusting how we track diverse candidates & using a radix heap, I say go forth and benchmark :)
Re: [PR] Fix broken loop in TestDocumentsWriterStallControl.assertState() [lucene]
s1monw merged PR #13062: URL: https://github.com/apache/lucene/pull/13062
Re: [I] TestDocumentsWriterStallControl.assertState() does not do what it appears it would [lucene]
s1monw closed issue #13061: TestDocumentsWriterStallControl.assertState() does not do what it appears it would URL: https://github.com/apache/lucene/issues/13061
Re: [I] HnwsGraph creates disconnected components [lucene]
benwtrent commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1921135758

@nitirajrathore very interesting results. This sort of indicates to me that no matter the heuristic, we just need a second pass over the graph to ensure connectedness and fix it up :/ Even though the graph is different, the idea is the same here: https://github.com/jbellis/jvector/blob/main/jvector-base/src/main/java/io/github/jbellis/jvector/graph/GraphIndexBuilder.java

I will see about testing your new "new heuristic with remove otherhalf and honour max-conn" as if I am reading this correctly, seems the most promising. But, I suspect, even in extreme scenarios, we will still need a second pass over the graph to ensure graph connectedness.
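The "second pass to ensure connectedness" can be sketched on a toy adjacency-list graph: BFS from the entry point, collect unreachable nodes, then attach them. This is only the detect-and-fix shape the comment describes — real HNSW repair would choose attachment points by vector similarity and respect max-conn, neither of which this sketch attempts.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;

public class ConnectednessPass {
  // BFS from `entry`; returns the nodes it could not reach.
  static List<Integer> unreachable(List<List<Integer>> adj, int entry) {
    boolean[] seen = new boolean[adj.size()];
    ArrayDeque<Integer> queue = new ArrayDeque<>();
    seen[entry] = true;
    queue.add(entry);
    while (!queue.isEmpty()) {
      int node = queue.poll();
      for (int nbr : adj.get(node)) {
        if (!seen[nbr]) { seen[nbr] = true; queue.add(nbr); }
      }
    }
    List<Integer> missing = new ArrayList<>();
    for (int i = 0; i < adj.size(); i++) {
      if (!seen[i]) missing.add(i);
    }
    return missing;
  }

  public static void main(String[] args) {
    // Nodes 0-2 form one component; 3-4 form a disconnected one.
    List<List<Integer>> adj = new ArrayList<>();
    for (int i = 0; i < 5; i++) adj.add(new ArrayList<>());
    adj.get(0).add(1); adj.get(1).add(2);
    adj.get(3).add(4); adj.get(4).add(3);

    List<Integer> missing = unreachable(adj, 0);
    if (missing.size() != 2) throw new AssertionError("expected 2 unreachable nodes");

    // Naive fix-up: link each stranded node to the entry point (both ways).
    for (int n : missing) { adj.get(0).add(n); adj.get(n).add(0); }
    if (!unreachable(adj, 0).isEmpty()) throw new AssertionError("graph still disconnected");
    System.out.println("reconnected " + missing.size() + " nodes");
  }
}
```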
Re: [PR] Optimize counts on two clause term disjunctions [lucene]
jfreden commented on PR #13036: URL: https://github.com/apache/lucene/pull/13036#issuecomment-1920986116

Thank you @jpountz ! I've pushed changes to the tests, added the comment and also added an entry to `CHANGES.txt`.
Re: [I] HnwsGraph creates disconnected components [lucene]
nitirajrathore commented on issue #12627: URL: https://github.com/apache/lucene/issues/12627#issuecomment-1920943319

Hi @benwtrent, I left Amazon, but I was able to run some tests with an open dataset and also with an Amazon dataset before leaving. I cannot share a whole lot of detail about the Amazon dataset, but some aggregated results are shown. Rest assured, I am picking this issue up on priority now.

I was able to run tests trying out different heuristics. Basically, I tried the Lucene `default` diversity check, which generates disconnected nodes even with 50K nodes of the minilm dataset. I further tried different variations as suggested in the HNSW paper. The `extended candidates` approach reduced the disconnectedness but did not eliminate it completely, and it increased the indexing time by many times. Next, the `keepPruned candidates` approach, keeping max-conn connections, increased the disconnected nodes. So I tried `keepPruned candidates` keeping pruned connections only up to max-conn/2. This also did not show any improvement.

Next, I tried the newly proposed heuristic but without the component of removing the two-way connections, so basically the new heuristic with just removing one side of the duplex connection in the graph. Interestingly, this also did not change the disconnectedness. This was a bit surprising to me. Then I tried `new heuristic with remove-otherhalf`, basically removing the two-way connections completely. This completely removed disconnectedness, and the number of disconnected nodes at all levels was zero. But unfortunately this means that the edges at some nodes can grow beyond max-conn. I did not get a chance to find the counts and distribution of total connections at each node beyond max-conn, but I assume this is something that we may not want in Lucene. So I thought maybe removing duplex edges (i.e. the remove-otherhalf) is the key to decreasing disconnectedness. So I tried the `default` algo and `keep pruned`, both with `remove-otherhalf`.
Interestingly, those two experiments also did not decrease the number of disconnected nodes. Now I was left with `new heuristic with remove other half` as the best option. To make sure that total connections per node do not grow beyond max-conn, I modified the algo to remove some random node in case it is not able to find the best node to remove (`new heuristic with remove otherhalf and honour max-conn`). This did help keep the overall disconnectedness at zero, but it did still show some nodes at level1 to be disconnected (see comments) for the minilm dataset. I tried max-conn=8, max-conn=16, num_docs=50K, and num_docs=100K. All gave zero overall disconnectedness and zero disconnected nodes at level0, but there were some disconnected nodes in the level1 graph.

Anyway, I also ran the experiment on the Amazon dataset for `new heuristic with remove otherhalf and honour max-conn`. I had to make code changes again to adapt it to the Lucene 9.7 version. I saw similar results. Here also the `default` algorithm gave a lot of disconnected nodes, but the new heuristic gave zero disconnected nodes at level0 and zero overall disconnected nodes. But at level1, and sometimes at level2 as well, there were some nodes that were disconnected. I am more confident of the new heuristic now, but won't mind running more tests and perf tests.

PR with all heuristics: https://github.com/apache/lucene/pull/12783/commits

| Algo | No. of disconnected nodes at zeroth level | %age overall disconnected (nodes) | Comments | Index time |
| --- | --- | --- | --- | --- |
| Baseline Equivalent | 90 | 0.1740 (87 nodes) | Same as baseline but with some extra parameters to allow more experiments | 33s |
| Baseline | 90 | 0.1740 (87) | Exactly the code as in production | |
| Extend Candidates | 39
Re: [PR] Optimize counts on two clause term disjunctions [lucene]
jpountz commented on code in PR #13036: URL: https://github.com/apache/lucene/pull/13036#discussion_r1474129857

## lucene/core/src/test/org/apache/lucene/search/TestBooleanQuery.java:

@@ -962,6 +962,118 @@ public void testDisjunctionMatchesCount() throws IOException {
   dir.close();
 }

+ public void testTwoClauseTermDisjunctionCountOptimization() throws Exception {
+   int largerTermCount = RandomNumbers.randomIntBetween(random(), 11, 100);
+   int smallerTermCount = RandomNumbers.randomIntBetween(random(), 1, (largerTermCount - 1) / 10);
+
+   List<String[]> docContent = new ArrayList<>(largerTermCount + smallerTermCount);
+
+   for (int i = 0; i < largerTermCount; i++) {
+     docContent.add(new String[] {"large"});
+   }
+
+   for (int i = 0; i < smallerTermCount; i++) {
+     docContent.add(new String[] {"small", "also small"});
+   }
+
+   try (Directory dir = newDirectory()) {
+     try (IndexWriter w =
+         new IndexWriter(dir, newIndexWriterConfig().setMergePolicy(newLogMergePolicy()))) {
+       for (String[] values : docContent) {
+         Document doc = new Document();
+         for (String value : values) {
+           doc.add(new StringField("foo", value, Field.Store.NO));
+         }
+         w.addDocument(doc);
+       }
+       w.forceMerge(1);
+     }
+
+     try (IndexReader reader = DirectoryReader.open(dir)) {
+       final int[] countInvocations = new int[] {0};
+       IndexSearcher countingIndexSearcher =
+           new IndexSearcher(reader) {
+             @Override
+             public int count(Query query) throws IOException {
+               countInvocations[0]++;
+               return super.count(query);
+             }
+           };
+
+       {
+         // Test no matches in either term
+         countInvocations[0] = 0;
+         BooleanQuery query =
+             new BooleanQuery.Builder()
+                 .add(new TermQuery(new Term("foo", "no match")), BooleanClause.Occur.SHOULD)
+                 .add(new TermQuery(new Term("foo", "also no match")), BooleanClause.Occur.SHOULD)
+                 .build();
+
+         assertEquals(0, countingIndexSearcher.count(query));
+         assertEquals(3, countInvocations[0]);
+       }
+       {
+         // Test match in one term
+         countInvocations[0] = 0;
+         BooleanQuery query =
+             new BooleanQuery.Builder()
+                 .add(new TermQuery(new Term("foo", "no match")), BooleanClause.Occur.SHOULD)
+                 .add(new TermQuery(new Term("foo", "small")), BooleanClause.Occur.SHOULD)
+                 .build();
+
+         assertEquals(smallerTermCount, countingIndexSearcher.count(query));
+         assertEquals(3, countInvocations[0]);
+       }

Review Comment: Can you do the same with clauses in the opposite order?
Re: [PR] Optimize counts on two clause term disjunctions [lucene]
jfreden commented on code in PR #13036: URL: https://github.com/apache/lucene/pull/13036#discussion_r1474087852

## lucene/core/src/test/org/apache/lucene/search/TestBooleanQuery.java:

@@ -962,6 +962,46 @@ public void testDisjunctionMatchesCount() throws IOException {
   dir.close();
 }

+ public void testTwoClauseTermDisjunctionCountOptimization() throws Exception {
+   List<String[]> docContent =
+       Arrays.asList(
+           new String[] {"A", "B"},
+           new String[] {"A"},
+           new String[] {},
+           new String[] {"A", "B", "C"},
+           new String[] {"B"},
+           new String[] {"B", "C"});
+
+   try (Directory dir = newDirectory()) {
+     try (IndexWriter w =
+         new IndexWriter(dir, newIndexWriterConfig().setMergePolicy(newLogMergePolicy()))) {
+       for (String[] values : docContent) {
+         Document doc = new Document();
+         for (String value : values) {
+           doc.add(new StringField("foo", value, Field.Store.NO));
+         }
+         w.addDocument(doc);
+       }
+       w.forceMerge(1);
+     }
+
+     try (IndexReader reader = DirectoryReader.open(dir)) {
+       IndexSearcher searcher = newSearcher(reader);
+
+       Query query =
+           new BooleanQuery.Builder()
+               .add(new TermQuery(new Term("foo", "A")), BooleanClause.Occur.SHOULD)
+               .add(new TermQuery(new Term("foo", "B")), BooleanClause.Occur.SHOULD)
+               .setMinimumNumberShouldMatch(1)
+               .build();
+
+       int count = searcher.count(query);
+       assertEquals(5, count);

Review Comment: Thank you! Added tests for that and also implemented a `countingIndexSearcher` that overrides `count` to make sure the recursion that is triggered by the optimization happens.
Re: [PR] Fix knn vector visit limit fence post error [lucene]
jpountz commented on PR #13058: URL: https://github.com/apache/lucene/pull/13058#issuecomment-1920761327

Hmm, sorry it still doesn't feel completely right... It feels like the issue is that the collector doesn't distinguish between "exactly visitLimit hits have been collected but I collected all that I needed" and "exactly visitLimit hits have been collected but I would need to collect more to be done"?