[GitHub] [lucene] jtibshirani commented on a diff in pull request #1073: fix VectorUtil.dotProductScore normalization

2022-08-19 Thread GitBox


jtibshirani commented on code in PR #1073:
URL: https://github.com/apache/lucene/pull/1073#discussion_r950632314


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -270,7 +270,8 @@ public static float dotProduct(BytesRef a, BytesRef b) {
    */
   public static float dotProductScore(BytesRef a, BytesRef b) {
     // divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
-    return (1 + dotProduct(a, b)) / (float) (a.length * (1 << 15));
+    float maxValue = (float) (a.length * (1 << 14));
+    return (maxValue + dotProduct(a, b)) / (2 * maxValue);

Review Comment:
   Could this just be `1 + (dotProduct(a, b) / maxValue)` ?
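For context, the effect of the fix is easy to demonstrate in isolation. The sketch below (illustrative class and method names, not Lucene code) compares the two normalizations on plain byte arrays; the old formula maps the dot-product range [-len*2^14, len*2^14] to roughly [-0.5, 0.5] rather than into [0, 1].

```java
public class DotProductScoreSketch {
  // Same loop as VectorUtil.dotProduct, over plain byte arrays.
  static int dotProduct(byte[] a, byte[] b) {
    int total = 0;
    for (int i = 0; i < a.length; i++) {
      total += a[i] * b[i];
    }
    return total;
  }

  // Pre-fix normalization: divides by len * 2^15 but only shifts by 1,
  // so the result spans roughly [-0.5, 0.5] instead of [0, 1].
  static float oldScore(byte[] a, byte[] b) {
    return (1 + dotProduct(a, b)) / (float) (a.length * (1 << 15));
  }

  // Post-fix normalization: shift by the maximum absolute dot product
  // (len * 2^14) before dividing by twice that value.
  static float newScore(byte[] a, byte[] b) {
    float maxValue = (float) (a.length * (1 << 14));
    return (maxValue + dotProduct(a, b)) / (2 * maxValue);
  }

  public static void main(String[] args) {
    byte[] v = {-128, -128}; // dot(v, v) = 32768, the maximum for length 2
    System.out.println(oldScore(v, v)); // ~0.50002: top score lands mid-range
    System.out.println(newScore(v, v)); // 1.0: top score maps to the top of [0, 1]
  }
}
```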



-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org


-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jtibshirani commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-19 Thread GitBox


jtibshirani commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r950631408


##
lucene/core/src/java/org/apache/lucene/codecs/KnnFieldVectorsWriter.java:
##
@@ -20,8 +20,12 @@
 import java.io.IOException;
 import org.apache.lucene.util.Accountable;
 
-/** Vectors' writer for a field */
-public abstract class KnnFieldVectorsWriter implements Accountable {
+/**
+ * Vectors' writer for a field
+ *
+ * @param <T> an array type; the type of vectors to be written
+ */
+public abstract class KnnFieldVectorsWriter<T> implements Accountable {

Review Comment:
   Oh okay, I must have missed those objections from an earlier PR draft. I 
understand and appreciate your "progress over perfection" approach here. I've 
been trying out refactorings so I could suggest cleaner ways forward, but 
nothing has stuck so far... I'll continue to chip away at it.






[GitHub] [lucene] jtibshirani commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-19 Thread GitBox


jtibshirani commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r950631270


##
lucene/core/src/java/org/apache/lucene/util/VectorUtil.java:
##
@@ -213,4 +243,48 @@ public static void add(float[] u, float[] v) {
       u[i] += v[i];
     }
   }
+
+  /**
+   * Dot product computed over signed bytes.
+   *
+   * @param a bytes containing a vector
+   * @param b bytes containing another vector, of the same dimension
+   * @return the value of the dot product of the two vectors
+   */
+  public static float dotProduct(BytesRef a, BytesRef b) {
+    assert a.length == b.length;
+    int total = 0;
+    int aOffset = a.offset, bOffset = b.offset;
+    for (int i = 0; i < a.length; i++) {
+      total += a.bytes[aOffset++] * b.bytes[bOffset++];
+    }
+    return total;
+  }
+
+  /**
+   * Dot product score computed over signed bytes, scaled to be in [0, 1].
+   *
+   * @param a bytes containing a vector
+   * @param b bytes containing another vector, of the same dimension
+   * @return the value of the similarity function applied to the two vectors
+   */
+  public static float dotProductScore(BytesRef a, BytesRef b) {
+    // divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
+    return (1 + dotProduct(a, b)) / (float) (a.length * (1 << 15));
+  }
+
+  /**
+   * Convert a floating point vector to an array of bytes using casting; the vector values should be
+   * in [-128,127]
+   *
+   * @param vector a vector
+   * @return a new BytesRef containing the vector's values cast to byte.
+   */
+  public static BytesRef toBytesRef(float[] vector) {

Review Comment:
   Then maybe we could do a check in `Lucene94HnswVectorsReader`? The main 
thing I had in mind was if a user accidentally provides a query where the 
elements don't fall in the range [-128, 127]. We would silently cast the floats 
to bytes, which could result in incorrect search results.
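A minimal illustration of that silent-cast hazard (hypothetical helper, not Lucene's actual `toBytesRef`): Java's narrowing float-to-byte conversion wraps out-of-range values instead of failing.

```java
public class ByteCastSketch {
  // The per-element conversion a toBytesRef-style cast applies.
  static byte castToByte(float v) {
    return (byte) v; // float -> int -> byte narrowing; wraps silently
  }

  public static void main(String[] args) {
    System.out.println(castToByte(100.0f)); // 100: in range, fine
    System.out.println(castToByte(200.0f)); // -56: out of [-128, 127], wraps around
  }
}
```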






[GitHub] [lucene] jtibshirani commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-19 Thread GitBox


jtibshirani commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r950631004


##
lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsFormat.java:
##
@@ -76,6 +78,15 @@ public static KnnVectorsFormat forName(String name) {
   /** Returns a {@link KnnVectorsReader} to read the vectors from the index. */
   public abstract KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException;
 
+  /**
+   * Returns the current KnnVectorsFormat version number. Indexes written using the format will be
+   * "stamped" with this version.
+   */
+  public int currentVersion() {

Review Comment:
   Here's what I had in mind: https://github.com/apache/lucene/pull/1077. We 
would just move the checks to test classes.






[GitHub] [lucene] jtibshirani opened a new pull request, #1077: Remove KnnVectorsFormat#currentVersion

2022-08-19 Thread GitBox


jtibshirani opened a new pull request, #1077:
URL: https://github.com/apache/lucene/pull/1077

   These internal versions only make sense within a codec definition, and aren't
   meant to be exposed and compared across codecs. Since this method is only used
   in tests, we can move the check to the test classes instead.





[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?

2022-08-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582097#comment-17582097
 ] 

ASF subversion and git services commented on LUCENE-9583:
-

Commit 8308688d786cd6c55fcbe4e59f67966f385989a2 in lucene's branch 
refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8308688d786 ]

LUCENE-9583: Remove RandomAccessVectorValuesProducer (#1071)

This change folds the `RandomAccessVectorValuesProducer` interface into
`RandomAccessVectorValues`. This reduces the number of interfaces and clarifies
the cloning/copying behavior.

This is a small simplification related to LUCENE-9583, but does not address the
main issue.

> How should we expose VectorValues.RandomAccess?
> ---
>
> Key: LUCENE-9583
> URL: https://issues.apache.org/jira/browse/LUCENE-9583
> Project: Lucene - Core
>  Issue Type: Improvement
>Affects Versions: 9.0
>Reporter: Michael Sokolov
>Assignee: Julie Tibshirani
>Priority: Major
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> In the newly-added {{VectorValues}} API, we have a {{RandomAccess}} 
> sub-interface. [~jtibshirani] pointed out this is not needed by some 
> vector-indexing strategies which can operate solely using a forward-iterator 
> (it is needed by HNSW), and so in the interest of simplifying the public API 
> we should not expose this internal detail (which by the way surfaces internal 
> ordinals that are somewhat uninteresting outside the random access API).
> I looked into how to move this inside the HNSW-specific code and remembered 
> that we do also currently make use of the RA API when merging vector fields 
> over sorted indexes. Without it, we would need to load all vectors into RAM  
> while flushing/merging, as we currently do in 
> {{BinaryDocValuesWriter.BinaryDVs}}. I wonder if it's worth paying this cost 
> for the simpler API.
> Another thing I noticed while reviewing this is that I moved the KNN 
> {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}} 
>  to {{VectorValues.RandomAccess}}. This I think we could move back, and 
> handle the HNSW requirements for search elsewhere. I wonder if that would 
> alleviate the major concern here? 



--
This message was sent by Atlassian Jira
(v8.20.10#820010)




[GitHub] [lucene] jtibshirani merged pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer

2022-08-19 Thread GitBox


jtibshirani merged PR #1071:
URL: https://github.com/apache/lucene/pull/1071





[GitHub] [lucene] jtibshirani commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-19 Thread GitBox


jtibshirani commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r950624287


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -133,22 +130,21 @@ private TopDocs searchLeaf(LeafReaderContext ctx, Weight filterWeight) throws IOException {
       return NO_RESULTS;
     }
 
-    BitSet bitSet = createBitSet(scorer.iterator(), liveDocs, maxDoc);
-    BitSetIterator filterIterator = new BitSetIterator(bitSet, bitSet.cardinality());
+    BitSet acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc);
 
-    if (filterIterator.cost() <= k) {
+    if (acceptDocs.cardinality() <= k) {

Review Comment:
   That approach works for me.






[GitHub] [lucene] jtibshirani commented on a diff in pull request #1076: Add safety checks to KnnVectorField; fixed issue with copying BytesRef

2022-08-19 Thread GitBox


jtibshirani commented on code in PR #1076:
URL: https://github.com/apache/lucene/pull/1076#discussion_r950623197


##
lucene/core/src/java/org/apache/lucene/index/VectorEncoding.java:
##
@@ -21,12 +21,8 @@
 public enum VectorEncoding {
 
   /**
-   * Encodes vector using 8 bits of precision per sample. Use only with DOT_PRODUCT similarity.
-   * NOTE: this can enable significant storage savings and faster searches, at the cost of some
-   * possible loss of precision. In order to use it, all vectors must be of the same norm, as
-   * measured by the sum of the squares of the scalar values, and those values must be in the range
-   * [-128, 127]. This applies to both document and query vectors. Using nonconforming vectors can
-   * result in errors or poor search results.
+   * Encodes vector using 8 bits of precision per sample. NOTE: this can enable significant storage

Review Comment:
   Maybe we could keep the part about how query vectors need to have all values 
within [-128, 127]? Because if they don't, we'll just cast to a byte and the 
results could be really surprising.






[GitHub] [lucene] jtibshirani commented on pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer

2022-08-19 Thread GitBox


jtibshirani commented on PR #1071:
URL: https://github.com/apache/lucene/pull/1071#issuecomment-1221189385

   Thank you for the reviews. I'm going to merge since it seems you're both on 
board with the change.





[GitHub] [lucene] jtibshirani commented on a diff in pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer

2022-08-19 Thread GitBox


jtibshirani commented on code in PR #1071:
URL: https://github.com/apache/lucene/pull/1071#discussion_r950622069


##
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##
@@ -308,41 +302,6 @@ private void printFanoutHist(Path indexPath) throws IOException {
     }
   }
 
-  @SuppressWarnings("unchecked")
-  private void dumpGraph(Path docsPath) throws IOException {

Review Comment:
   Let me know anytime you start to miss it and I'll restore it! For now I'll 
remove it to keep things simple. If we keep it, I'll need to refactor 
`BinaryFileVectors` -- which is doable but slightly tricky. 









[GitHub] [lucene] jtibshirani commented on a diff in pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer

2022-08-19 Thread GitBox


jtibshirani commented on code in PR #1071:
URL: https://github.com/apache/lucene/pull/1071#discussion_r950621473


##
lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java:
##
@@ -133,7 +132,7 @@ private HnswGraphBuilder(
    * accessor for the vectors
    */
   public OnHeapHnswGraph build(RandomAccessVectorValues vectors) throws IOException {
-    if (vectors == vectorValues) {
+    if (vectors == this.vectors) {

Review Comment:
    will fix






[GitHub] [lucene] jtibshirani commented on a diff in pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer

2022-08-19 Thread GitBox


jtibshirani commented on code in PR #1071:
URL: https://github.com/apache/lucene/pull/1071#discussion_r950621376


##
lucene/core/src/java/org/apache/lucene/index/VectorValues.java:
##
@@ -192,36 +176,5 @@ public int advance(int target) throws IOException {
     public long cost() {
       return size();
     }
-
-    @Override
-    public RandomAccessVectorValues randomAccess() throws IOException {

Review Comment:
   Indeed we never reached a great resolution there! I'm trying to chip away at 
some of these tricky design questions, we'll see how it goes...






[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing

2022-08-19 Thread Jira


[ 
https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582032#comment-17582032
 ] 

Luís Filipe Nassif commented on LUCENE-8118:


I did some workarounds in our project to avoid this. But as suggested before,
it would be nice to throw a better Exception and update the
IndexWriter.addDocuments(Iterable) documentation to warn users they shouldn't
pass huge collections or documents to that method at once.

> ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing
> -
>
> Key: LUCENE-8118
> URL: https://issues.apache.org/jira/browse/LUCENE-8118
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 7.2
> Environment: Debian/Stretch
> java version "1.8.0_144"
> Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
> Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
>Reporter: Laura Dietz
>Priority: Major
> Attachments: LUCENE-8118_test.patch
>
>  Time Spent: 2h 40m
>  Remaining Estimate: 0h
>
> Indexing a large collection of about 20 million paragraph-sized documents 
> results in an ArrayIndexOutOfBoundsException in 
> org.apache.lucene.index.TermsHashPerField.writeByte  (full stack trace 
> below). 
> The bug is possibly related to issues described in 
> [here|http://lucene.472066.n3.nabble.com/ArrayIndexOutOfBoundsException-65536-td3661945.html]
>   and [SOLR-10936|https://issues.apache.org/jira/browse/SOLR-10936] -- but I 
> am not using SOLR, I am directly using Lucene Core.
> The issue can be reproduced using code from  [GitHub 
> trec-car-tools-example|https://github.com/TREMA-UNH/trec-car-tools/tree/lucene-bug/trec-car-tools-example]
>  
> - compile with `mvn compile assembly:single`
> - run with `java -cp 
> ./target/treccar-tools-example-0.1-jar-with-dependencies.jar 
> edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir`
> Where paragraphCorpus.cbor is contained in this 
> [archive|http://trec-car.cs.unh.edu/datareleases/v2.0-snapshot/archive-paragraphCorpus.tar.xz]
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
> at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
> at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
> at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
> at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
> at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
> at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
> at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
> at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
> at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
> at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)
> at

[GitHub] [lucene] msokolov commented on a diff in pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in TermInSetQuery

2022-08-19 Thread GitBox


msokolov commented on code in PR #1058:
URL: https://github.com/apache/lucene/pull/1058#discussion_r950509327


##
lucene/CHANGES.txt:
##
@@ -95,6 +95,9 @@ Improvements
 -
 * LUCENE-10592: Build HNSW Graph on indexing. (Mayya Sharipova, Adrien Grand, Julie Tibshirani)
 
+* LUCENE-10207: TermInSetQuery can now provide a ScoreSupplier with cost estimation, making it
+  usable in TermInSetQuery. (Greg Miller)

Review Comment:
   this comment is confusing -- do you really mean that now TISQ is usable in 
TISQ?






[GitHub] [lucene] msokolov commented on a diff in pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in TermInSetQuery

2022-08-19 Thread GitBox


msokolov commented on code in PR #1058:
URL: https://github.com/apache/lucene/pull/1058#discussion_r950508835


##
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##
@@ -345,15 +345,62 @@ public BulkScorer bulkScorer(LeafReaderContext context) throws IOException {
   }
 
   @Override
-  public Scorer scorer(LeafReaderContext context) throws IOException {
-    final WeightOrDocIdSet weightOrBitSet = rewrite(context);
-    if (weightOrBitSet == null) {
-      return null;
-    } else if (weightOrBitSet.weight != null) {
-      return weightOrBitSet.weight.scorer(context);
-    } else {
-      return scorer(weightOrBitSet.set);
+  public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException {
+    // Cost estimation reasoning is:
+    //   1. Assume every query term matches at least one document (queryTermsCount).
+    //   2. Determine the total number of docs beyond the first one for each term.
+    //      That count provides a ceiling on the number of extra docs that could match beyond
+    //      that first one. (We omit the first since it's already been counted in #1).
+    // This approach still provides correct worst-case cost in general, but provides tighter
+    // estimates for primary-key-like fields. See: LUCENE-10207
+
+    // TODO: This cost estimation may grossly overestimate since we have no index statistics
+    // for the specific query terms. While it's nice to avoid the cost of intersecting the
+    // query terms with the index, it could be beneficial to do that work and get better
+    // cost estimates.
+    final long cost;
+    final long queryTermsCount = termData.size();
+    Terms indexTerms = context.reader().terms(field);
+    long potentialExtraCost = indexTerms.getSumDocFreq();
+    final long indexedTermCount = indexTerms.size();
+    if (indexedTermCount != -1) {
+      potentialExtraCost -= indexedTermCount;
     }
+    cost = queryTermsCount + potentialExtraCost;
+
+    final Weight weight = this;
+    return new ScorerSupplier() {
+      @Override
+      public Scorer get(long leadCost) throws IOException {
+        WeightOrDocIdSet weightOrDocIdSet = rewrite(context);
+        if (weightOrDocIdSet == null) {
+          return null;
+        }
+
+        final Scorer scorer;
+        if (weightOrDocIdSet.weight != null) {
+          scorer = weightOrDocIdSet.weight.scorer(context);
+        } else {
+          scorer = scorer(weightOrDocIdSet.set);
+        }
+
+        return Objects.requireNonNullElseGet(
+            scorer,
+            () -> new ConstantScoreScorer(weight, score(), scoreMode, DocIdSetIterator.empty()));
+      }
+
+      @Override
+      public long cost() {
+        return cost;

Review Comment:
   yeah that's fine. I guess I thought it had to do with the number of documents, but
if it is just sort of abstractly "how expensive is this", like in terms of
operations, then it makes sense to let it blow up
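The estimation arithmetic in the diff above can be exercised in isolation. The class name and numbers below are illustrative (not Lucene code); the formula is the one from the diff: queryTermsCount plus the postings beyond one doc per indexed term.

```java
public class TermInSetCostSketch {
  // Worst-case cost estimate: every query term matches at least one doc,
  // plus a ceiling of (sumDocFreq - indexedTermCount) extra matches.
  static long cost(long queryTermsCount, long sumDocFreq, long indexedTermCount) {
    long potentialExtraCost = sumDocFreq;
    if (indexedTermCount != -1) { // Terms#size may be unavailable (-1)
      potentialExtraCost -= indexedTermCount;
    }
    return queryTermsCount + potentialExtraCost;
  }

  public static void main(String[] args) {
    // 3 query terms; field has 400 unique terms and 1000 total postings.
    System.out.println(cost(3, 1000, 400)); // 603
    // Primary-key-like field: every term has docFreq == 1, so sumDocFreq equals
    // indexedTermCount and the estimate collapses to the query term count.
    System.out.println(cost(3, 1000, 1000)); // 3
  }
}
```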






[GitHub] [lucene] msokolov commented on a diff in pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer

2022-08-19 Thread GitBox


msokolov commented on code in PR #1071:
URL: https://github.com/apache/lucene/pull/1071#discussion_r950507613


##
lucene/core/src/java/org/apache/lucene/index/VectorValues.java:
##
@@ -192,36 +176,5 @@ public int advance(int target) throws IOException {
     public long cost() {
       return size();
     }
-
-    @Override
-    public RandomAccessVectorValues randomAccess() throws IOException {

Review Comment:
   This is kind of funny to me since it is kind of back to a much earlier
iteration where these were all on the same interface, but I split them apart as
a way to try and hide the random access that people didn't like :) Although it
was never very successful. Anyway, I am +1 to get rid of the fig leaf



##
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##
@@ -308,41 +302,6 @@ private void printFanoutHist(Path indexPath) throws IOException {
     }
   }
 
-  @SuppressWarnings("unchecked")
-  private void dumpGraph(Path docsPath) throws IOException {

Review Comment:
   hm I guess we can always restore. This had been pretty useful in early times 
when nothing was working ...



##
lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java:
##
@@ -783,66 +742,6 @@ private static void usage() {
     System.exit(1);
   }
 
-  class BinaryFileVectors implements RandomAccessVectorValuesProducer, Closeable {

Review Comment:
   if it's not used, I think we can remove?






[GitHub] [lucene] msokolov commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-19 Thread GitBox


msokolov commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r950503788


##
lucene/core/src/java/org/apache/lucene/document/KnnVectorField.java:
##
@@ -117,6 +160,21 @@ public KnnVectorField(String name, float[] vector, FieldType fieldType) {
     fieldsData = vector;
   }
 
+  /**
+   * Creates a numeric vector field. Fields are single-valued: each document has either one value or
+   * no value. Vectors of a single field share the same dimension and similarity function.
+   *
+   * @param name field name
+   * @param vector value
+   * @param fieldType field type
+   * @throws IllegalArgumentException if any parameter is null, or the vector is empty or has
+   *     dimension > 1024.
+   */
+  public KnnVectorField(String name, BytesRef vector, FieldType fieldType) {

Review Comment:
   https://github.com/apache/lucene/pull/1076 for this; also addresses the
javadoc comment below and fixes a bug






[GitHub] [lucene] msokolov opened a new pull request, #1076: Add safety checks to KnnVectorField; fixed issue with copying BytesRef

2022-08-19 Thread GitBox


msokolov opened a new pull request, #1076:
URL: https://github.com/apache/lucene/pull/1076

   - Adds some type safety checks to KnnVectorField
   - Adds a unit test to exercise the safety checks in TestField
   - The unit test uncovered a bad bug where I had used a length instead of a
     "to" position in ArrayUtil.copyOfSubArray, so I also fixed that





[GitHub] [lucene] msokolov commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-19 Thread GitBox


msokolov commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r950423611


##
lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java:
##
@@ -249,6 +261,29 @@ private void writeSortingField(FieldWriter fieldData, int maxDoc, Sorter.DocMap
         mockGraph);
   }
 
+  private long writeSortedFloat32Vectors(FieldWriter fieldData, int[] ordMap) throws IOException {
+    long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES);
+    final ByteBuffer buffer =
+        ByteBuffer.allocate(fieldData.dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
+    final BytesRef binaryValue = new BytesRef(buffer.array());
+    for (int ordinal : ordMap) {
+      float[] vector = (float[]) fieldData.vectors.get(ordinal);
+      buffer.asFloatBuffer().put(vector);
+      vectorData.writeBytes(binaryValue.bytes, binaryValue.offset, binaryValue.length);
+    }
+    return vectorDataOffset;
+  }
+
+  private long writeSortedByteVectors(FieldWriter fieldData, int[] ordMap) throws IOException {
+    long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES);
+    for (int ordinal : ordMap) {
+      byte[] vector = (byte[]) fieldData.vectors.get(ordinal);

Review Comment:
   I posted a fix here https://github.com/apache/lucene/pull/1074






[GitHub] [lucene] msokolov commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-19 Thread GitBox


msokolov commented on code in PR #1054:
URL: https://github.com/apache/lucene/pull/1054#discussion_r950423184


##
lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java:
##
@@ -133,22 +130,21 @@ private TopDocs searchLeaf(LeafReaderContext ctx, Weight 
filterWeight) throws IO
   return NO_RESULTS;
 }
 
-BitSet bitSet = createBitSet(scorer.iterator(), liveDocs, maxDoc);
-BitSetIterator filterIterator = new BitSetIterator(bitSet, 
bitSet.cardinality());
+BitSet acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc);
 
-if (filterIterator.cost() <= k) {
+if (acceptDocs.cardinality() <= k) {

Review Comment:
   I think what bothered me was that we were creating a `BitSetIterator` even 
when we didn't use it, except as a holder for the cost, which also we had to 
cast. How's this? https://github.com/apache/lucene/pull/1075
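   The pattern described here — compute the cardinality once, and only build an iterator in the branch that actually needs one — can be sketched with `java.util.BitSet` standing in for Lucene's `org.apache.lucene.util.BitSet`. The `selectDocs` helper below is a hypothetical illustration, not the KnnVectorQuery code:

```java
import java.util.BitSet;

public class CardinalityOnce {
    // Count the accepted docs once and reuse the value; no iterator is
    // constructed just to serve as a holder for the cost.
    static int[] selectDocs(BitSet acceptDocs, int k) {
        int cost = acceptDocs.cardinality();
        if (cost <= k) {
            // Few enough candidates: return them all directly.
            return acceptDocs.stream().toArray();
        }
        // Otherwise a real implementation would run the approximate search
        // over acceptDocs; this stub just returns the first k set bits.
        int[] result = new int[k];
        int doc = -1;
        for (int i = 0; i < k; i++) {
            doc = acceptDocs.nextSetBit(doc + 1);
            result[i] = doc;
        }
        return result;
    }

    public static void main(String[] args) {
        BitSet bits = new BitSet();
        bits.set(3); bits.set(7); bits.set(11);
        System.out.println(java.util.Arrays.toString(selectDocs(bits, 10))); // prints [3, 7, 11]
        System.out.println(java.util.Arrays.toString(selectDocs(bits, 2)));  // prints [3, 7]
    }
}
```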






[GitHub] [lucene] msokolov opened a new pull request, #1075: don't call BitSet.cardinality() more than needed

2022-08-19 Thread GitBox


msokolov opened a new pull request, #1075:
URL: https://github.com/apache/lucene/pull/1075






[GitHub] [lucene] msokolov opened a new pull request, #1074: Fix for bad cast when sorting a KnnVectors index over BytesRef

2022-08-19 Thread GitBox


msokolov opened a new pull request, #1074:
URL: https://github.com/apache/lucene/pull/1074

   Thanks @jtibshirani for noticing this one! Clearly we were missing some 
tests, so I beefed up BaseKnnVectorsFormatTestCase a bit, adding a specific 
sorted index test over bytes and also a testRandomBytes. Probably some 
additional coverage would be a good idea.
   





[GitHub] [lucene] msokolov commented on pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits

2022-08-19 Thread GitBox


msokolov commented on PR #1054:
URL: https://github.com/apache/lucene/pull/1054#issuecomment-1220873853

   Thanks so much for the review @jtibshirani -- I will post some PRs with 
fixes today, although I may not be able to address everything you raised





[jira] [Commented] (LUCENE-10318) Reuse HNSW graphs when merging segments?

2022-08-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581942#comment-17581942
 ] 

Michael Sokolov commented on LUCENE-10318:
--

Another idea I played with at one point was to preserve all the graphs from the 
existing segments (remapping their ordinals) and linking them together with 
additional links. But a lot of links needed to be created in order to get close 
to the recall for a new "from scratch" graph, and I struggled to get any 
improvement. At the time I wasn't even concerning myself about deletions.

> Reuse HNSW graphs when merging segments?
> 
>
> Key: LUCENE-10318
> URL: https://issues.apache.org/jira/browse/LUCENE-10318
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Julie Tibshirani
>Priority: Major
>  Labels: vector-based-search
>
> Currently when merging segments, the HNSW vectors format rebuilds the entire 
> graph from scratch. In general, building these graphs is very expensive, and 
> it'd be nice to optimize it in any way we can. I was wondering if during 
> merge, we could choose the largest segment with no deletes, and load its HNSW 
> graph into heap. Then we'd add vectors from the other segments to this graph, 
> through the normal build process. This could cut down on the number of 
> operations we need to perform when building the graph.
> This is just an early idea, I haven't run experiments to see if it would 
> help. I'd guess that whether it helps would also depend on details of the 
> MergePolicy.



--
This message was sent by Atlassian Jira
(v8.20.10#820010)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file

2022-08-19 Thread Michael Sokolov (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581941#comment-17581941
 ] 

Michael Sokolov commented on LUCENE-10681:
--

I think so; please close this one if you agree

> ArrayIndexOutOfBoundsException while indexing large binary file
> ---
>
> Key: LUCENE-10681
> URL: https://issues.apache.org/jira/browse/LUCENE-10681
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: core/index
>Affects Versions: 9.2
> Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1
>Reporter: Luís Filipe Nassif
>Priority: Major
>
> Hello,
> I looked for a similar issue but didn't find one, so I'm creating this; sorry 
> if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 recently, 
> and a user reported the error below while indexing a huge binary file in a 
> parent-child schema, where strings extracted from the huge binary file (using 
> the `strings` command) are indexed as thousands of ~10MB child text docs under 
> the parent metadata document:
>  
> {noformat}
> Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of 
> bounds for length 71428
>     at 
> org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432)
>  ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at 
> org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) 
> ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - 
> romseygeek - 2022-05-19 15:10:13]
>     at iped.engine.task.index.IndexTask.process(IndexTask.java:148) 
> ~[iped-engine-4.0.2.jar:?]
>     at 
> iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) 
> ~[iped-engine-4.0.2.jar:?]{noformat}
>  
> This seems like an integer overflow to me, but I'm not sure... It didn't 
> happen with the previous lucene-5.5.5, and indexing files like this is pretty 
> common for us. With lucene-5.5.5 we used to break that huge file up manually 
> before indexing and call the IndexWriter.addDocument(Document) method several 
> times, once per 10MB chunk; now we are using the 
> IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts?




[GitHub] [lucene] msokolov opened a new pull request, #1073: fix VectorUtil.dotProductScore normalization

2022-08-19 Thread GitBox


msokolov opened a new pull request, #1073:
URL: https://github.com/apache/lucene/pull/1073

   There was a thinko in the recently-pushed VectorUtil.dotProductScore: it 
didn't apply the needed offset to the score. It was adding 1 when it should 
have added `-128 * -128 * dimension`.
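   A minimal standalone sketch of the corrected normalization (mirroring the diff shown in the review thread, not the actual Lucene source): each product of two signed bytes lies in `[-2^14, 2^14]`, so shifting the dot product by `maxValue = len * 2^14` and dividing by `2 * maxValue` maps the score into `[0, 1]`:

```java
public class DotProductScoreDemo {
    static float dotProductScore(byte[] a, byte[] b) {
        int dot = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i]; // bytes promote to int; no overflow for realistic dims
        }
        // 2^14 = (-128) * (-128) is the largest possible per-component product,
        // so |dot| <= a.length * 2^14.
        float maxValue = (float) (a.length * (1 << 14));
        return (maxValue + dot) / (2 * maxValue); // shift and scale into [0, 1]
    }

    public static void main(String[] args) {
        byte[] v = {-128, -128, -128};
        System.out.println(dotProductScore(v, v));                  // prints 1.0
        System.out.println(dotProductScore(v, new byte[]{0, 0, 0})); // prints 0.5
    }
}
```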





[GitHub] [lucene] LuXugang commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

2022-08-19 Thread GitBox


LuXugang commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r950307252


##
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##
@@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) 
throws IOException {
   builder.add(docs);

Review Comment:
   Hi @gsmiller , I mean that the `termsEnum` in line 295 was not involved in 
the `reader.maxDoc() == termsEnum.docFreq()` calculation.






[GitHub] [lucene] LuXugang commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

2022-08-19 Thread GitBox


LuXugang commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r950306914


##
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##
@@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) 
throws IOException {
   builder.add(docs);

Review Comment:
   Hi @gsmiller , I mean that the `termsEnum` in line 295 was not involved in 
the `reader.maxDoc() == termsEnum.docFreq()` calculation.






[GitHub] [lucene] gsmiller commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

2022-08-19 Thread GitBox


gsmiller commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r950251125


##
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##
@@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) 
throws IOException {
   builder.add(docs);

Review Comment:
   @LuXugang I'm not sure I follow your question. `builder.add(docs)` will add 
all docs for the term's `PostingsEnum` to the `DocIdSetBuilder`. Or are you 
referring to the initialization of the builder here: `builder = new 
DocIdSetBuilder(reader.maxDoc(), terms);`? In this initialization, 
`reader.maxDoc()` is used to specify the number of possible docids (used to 
size the underlying bitset). `reader.maxDoc()` returns "one greater than the 
largest possible document number" (according to javadoc), so this should be 
correct.
   
   Sorry if I'm misunderstanding your question.
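   The optimization under discussion — a term whose `docFreq` equals the segment's `maxDoc` must match every document, so per-term postings iteration can be skipped — can be illustrated with a toy model. `matchesAllDocs` and the doc-frequency map are hypothetical stand-ins, not Lucene's API:

```java
import java.util.Map;

public class MatchAllShortCircuit {
    // If any queried term's document frequency equals the segment's maxDoc,
    // that term alone covers the whole segment, and the disjunction can be
    // rewritten to a match-all without touching any postings lists.
    static boolean matchesAllDocs(Map<String, Integer> docFreqs,
                                  String[] queryTerms, int maxDoc) {
        for (String term : queryTerms) {
            Integer df = docFreqs.get(term);
            if (df != null && df == maxDoc) {
                return true;
            }
        }
        return false;
    }

    public static void main(String[] args) {
        // Toy segment with maxDoc = 10: "a" appears in every doc, "b" in four.
        Map<String, Integer> dfs = Map.of("a", 10, "b", 4);
        System.out.println(matchesAllDocs(dfs, new String[]{"a", "b"}, 10)); // prints true
        System.out.println(matchesAllDocs(dfs, new String[]{"b"}, 10));      // prints false
    }
}
```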






[GitHub] [lucene-solr] ceciliassis closed pull request #450: Add rule exception for "imento" and "mento" suffix

2022-08-19 Thread GitBox


ceciliassis closed pull request #450: Add rule exception for "imento" and 
"mento" suffix
URL: https://github.com/apache/lucene-solr/pull/450





[GitHub] [lucene] LuXugang commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment

2022-08-19 Thread GitBox


LuXugang commented on code in PR #1062:
URL: https://github.com/apache/lucene/pull/1062#discussion_r949980320


##
lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java:
##
@@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) 
throws IOException {
   builder.add(docs);

Review Comment:
   It seems like the `reader.maxDoc() == termsEnum.docFreq()` check was skipped 
for this term?


