[GitHub] [lucene] jtibshirani commented on a diff in pull request #1073: fix VectorUtil.dotProductScore normalization
jtibshirani commented on code in PR #1073: URL: https://github.com/apache/lucene/pull/1073#discussion_r950632314

## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ##

@@ -270,7 +270,8 @@ public static float dotProduct(BytesRef a, BytesRef b) {
   public static float dotProductScore(BytesRef a, BytesRef b) {
     // divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
-    return (1 + dotProduct(a, b)) / (float) (a.length * (1 << 15));
+    float maxValue = (float) (a.length * (1 << 14));
+    return (maxValue + dotProduct(a, b)) / (2 * maxValue);

Review Comment: Could this just be `1 + (dotProduct(a, b) / maxValue)`?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
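The arithmetic behind the fix can be checked with a small standalone sketch. This is illustrative only, not the actual Lucene source: since each byte lies in [-128, 127], the largest absolute value of a per-dimension product is 128 * 128 = 2^14, so the raw dot product lies in [-len * 2^14, len * 2^14] and the shifted/scaled form lands in [0, 1].

```java
// Standalone sketch of the normalization in PR #1073 (illustrative, not Lucene's source).
public class DotScoreSketch {
  public static float dotProductScore(byte[] a, byte[] b) {
    int dot = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i];
    }
    // 1 << 14 == 2^14 == 128 * 128, the maximum absolute value of a product of two
    // signed bytes, so dot lies in [-len * 2^14, len * 2^14]
    float maxValue = (float) (a.length * (1 << 14));
    // shift into [0, 2 * maxValue], then scale into [0, 1]
    return (maxValue + dot) / (2 * maxValue);
  }

  public static void main(String[] args) {
    byte[] v = {-128, -128};
    System.out.println(dotProductScore(v, v)); // maximally aligned vectors score 1.0
  }
}
```

With the previous divisor of `a.length * (1 << 15)` and an offset of 1 instead of maxValue, fully anti-aligned vectors would not map to 0, which is what the patch corrects.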
[GitHub] [lucene] jtibshirani commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
jtibshirani commented on code in PR #1054: URL: https://github.com/apache/lucene/pull/1054#discussion_r950631408

## lucene/core/src/java/org/apache/lucene/codecs/KnnFieldVectorsWriter.java: ##

@@ -20,8 +20,12 @@ import java.io.IOException; import org.apache.lucene.util.Accountable;
-/** Vectors' writer for a field */
-public abstract class KnnFieldVectorsWriter implements Accountable {
+/**
+ * Vectors' writer for a field
+ *
+ * @param <T> an array type; the type of vectors to be written
+ */
+public abstract class KnnFieldVectorsWriter<T> implements Accountable {

Review Comment: Oh okay, I must have missed those objections from an earlier PR draft. I understand and appreciate your "progress over perfection" approach here. I've been trying out refactorings so I could suggest cleaner ways forward, but nothing has stuck so far... I'll continue to chip away at it.
[GitHub] [lucene] jtibshirani commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
jtibshirani commented on code in PR #1054: URL: https://github.com/apache/lucene/pull/1054#discussion_r950631270

## lucene/core/src/java/org/apache/lucene/util/VectorUtil.java: ##

@@ -213,4 +243,48 @@ public static void add(float[] u, float[] v) { u[i] += v[i]; } }
+
+  /**
+   * Dot product computed over signed bytes.
+   *
+   * @param a bytes containing a vector
+   * @param b bytes containing another vector, of the same dimension
+   * @return the value of the dot product of the two vectors
+   */
+  public static float dotProduct(BytesRef a, BytesRef b) {
+    assert a.length == b.length;
+    int total = 0;
+    int aOffset = a.offset, bOffset = b.offset;
+    for (int i = 0; i < a.length; i++) {
+      total += a.bytes[aOffset++] * b.bytes[bOffset++];
+    }
+    return total;
+  }
+
+  /**
+   * Dot product score computed over signed bytes, scaled to be in [0, 1].
+   *
+   * @param a bytes containing a vector
+   * @param b bytes containing another vector, of the same dimension
+   * @return the value of the similarity function applied to the two vectors
+   */
+  public static float dotProductScore(BytesRef a, BytesRef b) {
+    // divide by 2 * 2^14 (maximum absolute value of product of 2 signed bytes) * len
+    return (1 + dotProduct(a, b)) / (float) (a.length * (1 << 15));
+  }
+
+  /**
+   * Convert a floating point vector to an array of bytes using casting; the vector values should be
+   * in [-128,127]
+   *
+   * @param vector a vector
+   * @return a new BytesRef containing the vector's values cast to byte.
+   */
+  public static BytesRef toBytesRef(float[] vector) {

Review Comment: Then maybe we could do a check in `Lucene94HnswVectorsReader`? The main thing I had in mind was a user accidentally providing a query whose elements don't fall in the range [-128, 127]. We would silently cast the floats to bytes, which could result in incorrect search results.
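The silent-cast hazard raised in this comment is easy to demonstrate with a minimal sketch. The helper below is hypothetical (it is not Lucene's `toBytesRef`); it only shows what Java's narrowing `(byte)` conversion does to out-of-range floats.

```java
// Minimal sketch of a checkless float-to-byte conversion (hypothetical helper,
// not Lucene's toBytesRef). Java narrows float -> int -> byte, keeping only the
// low 8 bits, so out-of-range query values wrap silently instead of failing.
public class CastSketch {
  public static byte[] toBytes(float[] vector) {
    byte[] out = new byte[vector.length];
    for (int i = 0; i < vector.length; i++) {
      out[i] = (byte) vector[i]; // no range check: out-of-range values wrap
    }
    return out;
  }

  public static void main(String[] args) {
    // 300.0f is outside [-128, 127]; the narrowing cast silently turns it into 44
    System.out.println(toBytes(new float[] {300.0f})[0]);
  }
}
```

A query vector mangled this way would produce plausible-looking but wrong similarity scores, which is why the thread argues for an explicit range check.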
[GitHub] [lucene] jtibshirani commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
jtibshirani commented on code in PR #1054: URL: https://github.com/apache/lucene/pull/1054#discussion_r950631004

## lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsFormat.java: ##

@@ -76,6 +78,15 @@ public static KnnVectorsFormat forName(String name) { /** Returns a {@link KnnVectorsReader} to read the vectors from the index. */ public abstract KnnVectorsReader fieldsReader(SegmentReadState state) throws IOException;
+  /**
+   * Returns the current KnnVectorsFormat version number. Indexes written using the format will be
+   * "stamped" with this version.
+   */
+  public int currentVersion() {

Review Comment: Here's what I had in mind: https://github.com/apache/lucene/pull/1077. We would just move the checks to test classes.
[GitHub] [lucene] jtibshirani opened a new pull request, #1077: Remove KnnVectorsFormat#currentVersion
jtibshirani opened a new pull request, #1077: URL: https://github.com/apache/lucene/pull/1077

These internal versions only make sense within a codec definition, and aren't meant to be exposed and compared across codecs. Since this method is only used in tests, we can move the check to the test classes instead.
[jira] [Commented] (LUCENE-9583) How should we expose VectorValues.RandomAccess?
[ https://issues.apache.org/jira/browse/LUCENE-9583?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582097#comment-17582097 ]

ASF subversion and git services commented on LUCENE-9583:

Commit 8308688d786cd6c55fcbe4e59f67966f385989a2 in lucene's branch refs/heads/main from Julie Tibshirani
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=8308688d786 ]

LUCENE-9583: Remove RandomAccessVectorValuesProducer (#1071)

This change folds the `RandomAccessVectorValuesProducer` interface into `RandomAccessVectorValues`. This reduces the number of interfaces and clarifies the cloning/copying behavior. This is a small simplification related to LUCENE-9583, but does not address the main issue.

> How should we expose VectorValues.RandomAccess?
> ---
> Key: LUCENE-9583
> URL: https://issues.apache.org/jira/browse/LUCENE-9583
> Project: Lucene - Core
> Issue Type: Improvement
> Affects Versions: 9.0
> Reporter: Michael Sokolov
> Assignee: Julie Tibshirani
> Priority: Major
> Time Spent: 2h
> Remaining Estimate: 0h
>
> In the newly-added {{VectorValues}} API, we have a {{RandomAccess}} sub-interface. [~jtibshirani] pointed out this is not needed by some vector-indexing strategies which can operate solely using a forward-iterator (it is needed by HNSW), and so in the interest of simplifying the public API we should not expose this internal detail (which by the way surfaces internal ordinals that are somewhat uninteresting outside the random access API).
> I looked into how to move this inside the HNSW-specific code and remembered that we do also currently make use of the RA API when merging vector fields over sorted indexes. Without it, we would need to load all vectors into RAM while flushing/merging, as we currently do in {{BinaryDocValuesWriter.BinaryDVs}}. I wonder if it's worth paying this cost for the simpler API.
> Another thing I noticed while reviewing this is that I moved the KNN {{search(float[] target, int topK, int fanout)}} method from {{VectorValues}} to {{VectorValues.RandomAccess}}. This I think we could move back, and handle the HNSW requirements for search elsewhere. I wonder if that would alleviate the major concern here?

--
This message was sent by Atlassian Jira (v8.20.10#820010)
[GitHub] [lucene] jtibshirani merged pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer
jtibshirani merged PR #1071: URL: https://github.com/apache/lucene/pull/1071
[GitHub] [lucene] jtibshirani commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
jtibshirani commented on code in PR #1054: URL: https://github.com/apache/lucene/pull/1054#discussion_r950624287

## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ##

@@ -133,22 +130,21 @@ private TopDocs searchLeaf(LeafReaderContext ctx, Weight filterWeight) throws IO return NO_RESULTS; }
-    BitSet bitSet = createBitSet(scorer.iterator(), liveDocs, maxDoc);
-    BitSetIterator filterIterator = new BitSetIterator(bitSet, bitSet.cardinality());
+    BitSet acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc);

-    if (filterIterator.cost() <= k) {
+    if (acceptDocs.cardinality() <= k) {

Review Comment: That approach works for me.
[GitHub] [lucene] jtibshirani commented on a diff in pull request #1076: Add safety checks to KnnVectorField; fixed issue with copying BytesRef
jtibshirani commented on code in PR #1076: URL: https://github.com/apache/lucene/pull/1076#discussion_r950623197

## lucene/core/src/java/org/apache/lucene/index/VectorEncoding.java: ##

@@ -21,12 +21,8 @@ public enum VectorEncoding {
   /**
-   * Encodes vector using 8 bits of precision per sample. Use only with DOT_PRODUCT similarity.
-   * NOTE: this can enable significant storage savings and faster searches, at the cost of some
-   * possible loss of precision. In order to use it, all vectors must be of the same norm, as
-   * measured by the sum of the squares of the scalar values, and those values must be in the range
-   * [-128, 127]. This applies to both document and query vectors. Using nonconforming vectors can
-   * result in errors or poor search results.
+   * Encodes vector using 8 bits of precision per sample. NOTE: this can enable significant storage

Review Comment: Maybe we could keep the part about how query vectors need to have all values within [-128, 127]? Because if they don't, we'll just cast to a byte and the results could be really surprising.
[GitHub] [lucene] jtibshirani commented on pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer
jtibshirani commented on PR #1071: URL: https://github.com/apache/lucene/pull/1071#issuecomment-1221189385

Thank you for the reviews. I'm going to merge since it seems you're both on board with the change.
[GitHub] [lucene] jtibshirani commented on a diff in pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer
jtibshirani commented on code in PR #1071: URL: https://github.com/apache/lucene/pull/1071#discussion_r950622069

## lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java: ##

@@ -308,41 +302,6 @@ private void printFanoutHist(Path indexPath) throws IOException { } }
-  @SuppressWarnings("unchecked")
-  private void dumpGraph(Path docsPath) throws IOException {

Review Comment: Let me know anytime you start to miss it and I'll restore it! For now I'll remove it to keep things simple. If we keep it, I'll need to refactor `BinaryFileVectors` -- which is doable but slightly tricky.
[GitHub] [lucene] jtibshirani commented on a diff in pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer
jtibshirani commented on code in PR #1071: URL: https://github.com/apache/lucene/pull/1071#discussion_r950621473

## lucene/core/src/java/org/apache/lucene/util/hnsw/HnswGraphBuilder.java: ##

@@ -133,7 +132,7 @@ private HnswGraphBuilder( * accessor for the vectors */ public OnHeapHnswGraph build(RandomAccessVectorValues vectors) throws IOException {
-    if (vectors == vectorValues) {
+    if (vectors == this.vectors) {

Review Comment: will fix
[GitHub] [lucene] jtibshirani commented on a diff in pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer
jtibshirani commented on code in PR #1071: URL: https://github.com/apache/lucene/pull/1071#discussion_r950621376

## lucene/core/src/java/org/apache/lucene/index/VectorValues.java: ##

@@ -192,36 +176,5 @@ public int advance(int target) throws IOException { public long cost() { return size(); }
-
-    @Override
-    public RandomAccessVectorValues randomAccess() throws IOException {

Review Comment: Indeed we never reached a great resolution there! I'm trying to chip away at some of these tricky design questions, we'll see how it goes...
[jira] [Commented] (LUCENE-8118) ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing
[ https://issues.apache.org/jira/browse/LUCENE-8118?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17582032#comment-17582032 ]

Luís Filipe Nassif commented on LUCENE-8118:

I did some workarounds in our project to avoid this. But as suggested before, it would be nice to throw a better exception and update the IndexWriter.addDocuments(Iterable) documentation to warn users they shouldn't pass huge collections or documents to that method at once.

> ArrayIndexOutOfBoundsException in TermsHashPerField.writeByte during indexing
> -
> Key: LUCENE-8118
> URL: https://issues.apache.org/jira/browse/LUCENE-8118
> Project: Lucene - Core
> Issue Type: Bug
> Components: core/index
> Affects Versions: 7.2
> Environment: Debian/Stretch
>   java version "1.8.0_144"
>   Java(TM) SE Runtime Environment (build 1.8.0_144-b01)
>   Java HotSpot(TM) 64-Bit Server VM (build 25.144-b01, mixed mode)
> Reporter: Laura Dietz
> Priority: Major
> Attachments: LUCENE-8118_test.patch
> Time Spent: 2h 40m
> Remaining Estimate: 0h
>
> Indexing a large collection of about 20 million paragraph-sized documents results in an ArrayIndexOutOfBoundsException in org.apache.lucene.index.TermsHashPerField.writeByte (full stack trace below).
> The bug is possibly related to issues described [here|http://lucene.472066.n3.nabble.com/ArrayIndexOutOfBoundsException-65536-td3661945.html] and in [SOLR-10936|https://issues.apache.org/jira/browse/SOLR-10936] -- but I am not using Solr, I am directly using Lucene Core.
> The issue can be reproduced using code from [GitHub trec-car-tools-example|https://github.com/TREMA-UNH/trec-car-tools/tree/lucene-bug/trec-car-tools-example]
> - compile with `mvn compile assembly:single`
> - run with `java -cp ./target/treccar-tools-example-0.1-jar-with-dependencies.jar edu.unh.cs.TrecCarBuildLuceneIndex paragraphs paragraphCorpus.cbor indexDir`
> Where paragraphCorpus.cbor is contained in this [archive|http://trec-car.cs.unh.edu/datareleases/v2.0-snapshot/archive-paragraphCorpus.tar.xz]
>
> Exception in thread "main" java.lang.ArrayIndexOutOfBoundsException: -65536
>   at org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:198)
>   at org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:224)
>   at org.apache.lucene.index.FreqProxTermsWriterPerField.addTerm(FreqProxTermsWriterPerField.java:159)
>   at org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:185)
>   at org.apache.lucene.index.DefaultIndexingChain$PerField.invert(DefaultIndexingChain.java:786)
>   at org.apache.lucene.index.DefaultIndexingChain.processField(DefaultIndexingChain.java:430)
>   at org.apache.lucene.index.DefaultIndexingChain.processDocument(DefaultIndexingChain.java:392)
>   at org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:281)
>   at org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:451)
>   at org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532)
>   at
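The workaround the commenter describes (not handing one huge collection to a single addDocuments call) amounts to batching. A generic chunking helper along these lines is a minimal sketch of that idea; the class and method names are hypothetical and not from the issue:

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical batching helper: split a large document collection into bounded
// batches rather than passing one huge Iterable to IndexWriter.addDocuments.
public class BatchSketch {
  public static <T> List<List<T>> chunk(List<T> items, int batchSize) {
    List<List<T>> batches = new ArrayList<>();
    for (int i = 0; i < items.size(); i += batchSize) {
      // subList views the backing list; the last batch may be shorter
      batches.add(items.subList(i, Math.min(i + batchSize, items.size())));
    }
    return batches;
  }
}
```

Each batch would then be indexed with its own addDocuments call, keeping per-call buffer growth bounded.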
[GitHub] [lucene] msokolov commented on a diff in pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in TermInSetQuery
msokolov commented on code in PR #1058: URL: https://github.com/apache/lucene/pull/1058#discussion_r950509327

## lucene/CHANGES.txt: ##

@@ -95,6 +95,9 @@ Improvements
 * LUCENE-10592: Build HNSW Graph on indexing. (Mayya Sharipova, Adrien Grand, Julie Tibshirani)
+* LUCENE-10207: TermInSetQuery can now provide a ScoreSupplier with cost estimation, making it
+  usable in TermInSetQuery. (Greg Miller)

Review Comment: this comment is confusing -- do you really mean that now TISQ is usable in TISQ?
[GitHub] [lucene] msokolov commented on a diff in pull request #1058: LUCENE-10207: TermInSetQuery now provides a ScoreSupplier with cost estimation for use in TermInSetQuery
msokolov commented on code in PR #1058: URL: https://github.com/apache/lucene/pull/1058#discussion_r950508835

## lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: ##

@@ -345,15 +345,62 @@ public BulkScorer bulkScorer(LeafReaderContext context) throws IOException { }
   @Override
-  public Scorer scorer(LeafReaderContext context) throws IOException {
-    final WeightOrDocIdSet weightOrBitSet = rewrite(context);
-    if (weightOrBitSet == null) {
-      return null;
-    } else if (weightOrBitSet.weight != null) {
-      return weightOrBitSet.weight.scorer(context);
-    } else {
-      return scorer(weightOrBitSet.set);
+  public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException {
+    // Cost estimation reasoning is:
+    // 1. Assume every query term matches at least one document (queryTermsCount).
+    // 2. Determine the total number of docs beyond the first one for each term.
+    //    That count provides a ceiling on the number of extra docs that could match beyond
+    //    that first one. (We omit the first since it's already been counted in #1.)
+    // This approach still provides correct worst-case cost in general, but provides tighter
+    // estimates for primary-key-like fields. See: LUCENE-10207

+    // TODO: This cost estimation may grossly overestimate since we have no index statistics
+    // for the specific query terms. While it's nice to avoid the cost of intersecting the
+    // query terms with the index, it could be beneficial to do that work and get better
+    // cost estimates.
+    final long cost;
+    final long queryTermsCount = termData.size();
+    Terms indexTerms = context.reader().terms(field);
+    long potentialExtraCost = indexTerms.getSumDocFreq();
+    final long indexedTermCount = indexTerms.size();
+    if (indexedTermCount != -1) {
+      potentialExtraCost -= indexedTermCount;
+    }
+    cost = queryTermsCount + potentialExtraCost;
+
+    final Weight weight = this;
+    return new ScorerSupplier() {
+      @Override
+      public Scorer get(long leadCost) throws IOException {
+        WeightOrDocIdSet weightOrDocIdSet = rewrite(context);
+        if (weightOrDocIdSet == null) {
+          return null;
+        }
+
+        final Scorer scorer;
+        if (weightOrDocIdSet.weight != null) {
+          scorer = weightOrDocIdSet.weight.scorer(context);
+        } else {
+          scorer = scorer(weightOrDocIdSet.set);
+        }
+
+        return Objects.requireNonNullElseGet(
+            scorer,
+            () -> new ConstantScoreScorer(weight, score(), scoreMode, DocIdSetIterator.empty()));
+      }
+
+      @Override
+      public long cost() {
+        return cost;

Review Comment: yeah, that's fine. I guess I thought it had to do with the number of documents, but if it is just sort of abstractly "how expensive is this", like in terms of operations, then it makes sense to let it blow up.
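The estimate computed in the hunk above can be isolated into a tiny pure function. The sketch below is illustrative (the names mirror the diff, but it is not the committed Lucene code): sumDocFreq counts every posting in the field, and subtracting one posting per indexed term leaves the "extra" docs each term could match beyond its first.

```java
// Sketch of the TermInSetQuery cost estimate from the diff above (illustrative).
public class CostSketch {
  public static long estimateCost(long queryTermCount, long sumDocFreq, long indexedTermCount) {
    long potentialExtraCost = sumDocFreq;
    if (indexedTermCount != -1) { // Terms.size() may be unavailable (-1)
      potentialExtraCost -= indexedTermCount;
    }
    // one doc assumed per query term, plus the worst-case extras
    return queryTermCount + potentialExtraCost;
  }

  public static void main(String[] args) {
    // 5 query terms against a field with 100 postings spread over 20 indexed terms
    System.out.println(estimateCost(5, 100, 20)); // prints 85
  }
}
```

For a primary-key-like field, sumDocFreq equals the term count, the extras collapse to zero, and the estimate tightens to exactly the number of query terms, which is the point of LUCENE-10207.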
[GitHub] [lucene] msokolov commented on a diff in pull request #1071: LUCENE-9583: Remove RandomAccessVectorValuesProducer
msokolov commented on code in PR #1071: URL: https://github.com/apache/lucene/pull/1071#discussion_r950507613

## lucene/core/src/java/org/apache/lucene/index/VectorValues.java: ##

@@ -192,36 +176,5 @@ public int advance(int target) throws IOException { public long cost() { return size(); }
-
-    @Override
-    public RandomAccessVectorValues randomAccess() throws IOException {

Review Comment: This is kind of funny to me, since it is back to a much earlier iteration where these were all on the same interface, but I split them apart as a way to try to hide the random access that people didn't like :) Although it was never very successful. Anyway, I am +1 to get rid of the fig leaf.

## lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java: ##

@@ -308,41 +302,6 @@ private void printFanoutHist(Path indexPath) throws IOException { } }
-  @SuppressWarnings("unchecked")
-  private void dumpGraph(Path docsPath) throws IOException {

Review Comment: hm, I guess we can always restore. This had been pretty useful in early times when nothing was working...

## lucene/core/src/test/org/apache/lucene/util/hnsw/KnnGraphTester.java: ##

@@ -783,66 +742,6 @@ private static void usage() { System.exit(1); }
-  class BinaryFileVectors implements RandomAccessVectorValuesProducer, Closeable {

Review Comment: if it's not used, I think we can remove?
[GitHub] [lucene] msokolov commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
msokolov commented on code in PR #1054: URL: https://github.com/apache/lucene/pull/1054#discussion_r950503788

## lucene/core/src/java/org/apache/lucene/document/KnnVectorField.java: ##

@@ -117,6 +160,21 @@ public KnnVectorField(String name, float[] vector, FieldType fieldType) { fieldsData = vector; }
+  /**
+   * Creates a numeric vector field. Fields are single-valued: each document has either one value or
+   * no value. Vectors of a single field share the same dimension and similarity function.
+   *
+   * @param name field name
+   * @param vector value
+   * @param fieldType field type
+   * @throws IllegalArgumentException if any parameter is null, or the vector is empty or has
+   *     dimension > 1024.
+   */
+  public KnnVectorField(String name, BytesRef vector, FieldType fieldType) {

Review Comment: https://github.com/apache/lucene/pull/1076 for this; it also addresses the javadoc comment below and fixes a bug
[GitHub] [lucene] msokolov opened a new pull request, #1076: Add safety checks to KnnVectorField; fixed issue with copying BytesRef
msokolov opened a new pull request, #1076: URL: https://github.com/apache/lucene/pull/1076

Adds some type-safety checks to KnnVectorField, and adds a unit test to exercise those checks in TestField. The unit test uncovered a bad bug where I had used a length instead of a "to" position in ArrayUtil.copyOfSubArray, so I also fixed that.
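The length-versus-"to" confusion behind that bug is a classic off-by-API mistake. A minimal sketch using the JDK's Arrays.copyOfRange for illustration (which, like Lucene's ArrayUtil.copyOfSubArray, takes a from/to pair, not a from/length pair; the helper names here are made up):

```java
import java.util.Arrays;

// Sketch of the from/to vs. from/length confusion (JDK Arrays.copyOfRange used
// for illustration; buggyCopy/fixedCopy are hypothetical names).
public class CopyRangeSketch {
  static byte[] bytes = {10, 20, 30, 40, 50};

  // Bug pattern: passing the length where the exclusive "to" index is expected.
  static byte[] buggyCopy(int offset, int length) {
    return Arrays.copyOfRange(bytes, offset, length); // WRONG: length treated as "to"
  }

  // Correct: the second bound is offset + length.
  static byte[] fixedCopy(int offset, int length) {
    return Arrays.copyOfRange(bytes, offset, offset + length);
  }

  public static void main(String[] args) {
    System.out.println(buggyCopy(2, 2).length);           // 0: to == from, empty result
    System.out.println(Arrays.toString(fixedCopy(2, 2))); // [30, 40]
  }
}
```

The buggy form silently returns a truncated (here, empty) copy whenever length <= offset, which is exactly the kind of failure a unit test over non-zero offsets catches.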
[GitHub] [lucene] msokolov commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
msokolov commented on code in PR #1054: URL: https://github.com/apache/lucene/pull/1054#discussion_r950423611

## lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94HnswVectorsWriter.java: ##

@@ -249,6 +261,29 @@ private void writeSortingField(FieldWriter fieldData, int maxDoc, Sorter.DocMap mockGraph); }
+  private long writeSortedFloat32Vectors(FieldWriter fieldData, int[] ordMap) throws IOException {
+    long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES);
+    final ByteBuffer buffer =
+        ByteBuffer.allocate(fieldData.dim * Float.BYTES).order(ByteOrder.LITTLE_ENDIAN);
+    final BytesRef binaryValue = new BytesRef(buffer.array());
+    for (int ordinal : ordMap) {
+      float[] vector = (float[]) fieldData.vectors.get(ordinal);
+      buffer.asFloatBuffer().put(vector);
+      vectorData.writeBytes(binaryValue.bytes, binaryValue.offset, binaryValue.length);
+    }
+    return vectorDataOffset;
+  }
+
+  private long writeSortedByteVectors(FieldWriter fieldData, int[] ordMap) throws IOException {
+    long vectorDataOffset = vectorData.alignFilePointer(Float.BYTES);
+    for (int ordinal : ordMap) {
+      byte[] vector = (byte[]) fieldData.vectors.get(ordinal);

Review Comment: I posted a fix here: https://github.com/apache/lucene/pull/1074
[GitHub] [lucene] msokolov commented on a diff in pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
msokolov commented on code in PR #1054: URL: https://github.com/apache/lucene/pull/1054#discussion_r950423184 ## lucene/core/src/java/org/apache/lucene/search/KnnVectorQuery.java: ## @@ -133,22 +130,21 @@ private TopDocs searchLeaf(LeafReaderContext ctx, Weight filterWeight) throws IO return NO_RESULTS; } -BitSet bitSet = createBitSet(scorer.iterator(), liveDocs, maxDoc); -BitSetIterator filterIterator = new BitSetIterator(bitSet, bitSet.cardinality()); +BitSet acceptDocs = createBitSet(scorer.iterator(), liveDocs, maxDoc); -if (filterIterator.cost() <= k) { +if (acceptDocs.cardinality() <= k) { Review Comment: I think what bothered me was that we were creating a `BitSetIterator` even when we didn't use it, except as a holder for the cost, which also we had to cast. How's this? https://github.com/apache/lucene/pull/1075
[GitHub] [lucene] msokolov opened a new pull request, #1075: don't call BitSet.cardinality() more than needed
msokolov opened a new pull request, #1075: URL: https://github.com/apache/lucene/pull/1075 ### Description (or a Jira issue link if you have one)
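The pattern behind PR #1075 can be sketched with java.util.BitSet: cardinality() is a linear scan over the bitset's words, so it pays to compute it once and reuse the value both for the k comparison and as the iterator cost. The helper below is a hypothetical stand-in for the KnnVectorQuery logic, not the actual patch:

```java
import java.util.BitSet;

public class CardinalityOnceDemo {
  // Compute cardinality() once (a single pass over the bitset) and reuse
  // the cached value, rather than calling it again for the iterator cost.
  // Returns the cached cost on the "exact search" path, -1 on the
  // "fall back to approximate HNSW search" path.
  static int selectStrategy(BitSet acceptDocs, int k) {
    int cost = acceptDocs.cardinality(); // one O(maxDoc/64) scan
    if (cost <= k) {
      return cost; // few enough candidates: scan them exactly
    }
    return -1;
  }

  public static void main(String[] args) {
    BitSet accept = new BitSet(100);
    accept.set(0, 5); // 5 candidate docs
    System.out.println(selectStrategy(accept, 10)); // exact path: 5
    System.out.println(selectStrategy(accept, 3));  // approximate path: -1
  }
}
```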
[GitHub] [lucene] msokolov opened a new pull request, #1074: Fix for bad cast when sorting a KnnVectors index over BytesRef
msokolov opened a new pull request, #1074: URL: https://github.com/apache/lucene/pull/1074 Thanks @jtibshirani for noticing this one! Clearly we were missing some tests, so I beefed up BaseKnnVectorsFormatTestCase a bit, adding a specific sorted index test over bytes and also a testRandomBytes. Probably some additional coverage would be a good idea. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] msokolov commented on pull request #1054: LUCENE-10577: enable quantization of HNSW vectors to 8 bits
msokolov commented on PR #1054: URL: https://github.com/apache/lucene/pull/1054#issuecomment-1220873853 Thanks so much for the review @jtibshirani -- I will post some PRs with fixes today, although I may not be able to address everything you raised -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10318) Reuse HNSW graphs when merging segments?
[ https://issues.apache.org/jira/browse/LUCENE-10318?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581942#comment-17581942 ] Michael Sokolov commented on LUCENE-10318: -- Another idea I played with at one point was to preserve all the graphs from the existing segments (remapping their ordinals) and linking them together with additional links. But a lot of links needed to be created in order to get close to the recall for a new "from scratch" graph, and I struggled to get any improvement. At the time I wasn't even concerning myself about deletions. > Reuse HNSW graphs when merging segments? > > > Key: LUCENE-10318 > URL: https://issues.apache.org/jira/browse/LUCENE-10318 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Julie Tibshirani >Priority: Major > Labels: vector-based-search > > Currently when merging segments, the HNSW vectors format rebuilds the entire > graph from scratch. In general, building these graphs is very expensive, and > it'd be nice to optimize it in any way we can. I was wondering if during > merge, we could choose the largest segment with no deletes, and load its HNSW > graph into heap. Then we'd add vectors from the other segments to this graph, > through the normal build process. This could cut down on the number of > operations we need to perform when building the graph. > This is just an early idea, I haven't run experiments to see if it would > help. I'd guess that whether it helps would also depend on details of the > MergePolicy. -- This message was sent by Atlassian Jira (v8.20.10#820010) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10681) ArrayIndexOutOfBoundsException while indexing large binary file
[ https://issues.apache.org/jira/browse/LUCENE-10681?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17581941#comment-17581941 ] Michael Sokolov commented on LUCENE-10681: -- I think so; please close this one if you agree > ArrayIndexOutOfBoundsException while indexing large binary file > --- > > Key: LUCENE-10681 > URL: https://issues.apache.org/jira/browse/LUCENE-10681 > Project: Lucene - Core > Issue Type: Bug > Components: core/index >Affects Versions: 9.2 > Environment: Ubuntu 20.04 (LTS), java x64 version 11.0.16.1 >Reporter: Luís Filipe Nassif >Priority: Major > > Hello, > I looked for a similar issue, but didn't find one, so I'm creating this, > sorry if it was reported before. We upgraded from Lucene-5.5.5 to 9.2.0 > recently and an user reported error below while indexing a huge binary file > in a parent-children schema where strings extracted from the huge binary file > (using strings command) are indexed as thousands of ~10MB children text docs > of the parent metadata document: > > {noformat} > Caused by: java.lang.ArrayIndexOutOfBoundsException: Index -65536 out of > bounds for length 71428 > at > org.apache.lucene.index.TermsHashPerField.writeByte(TermsHashPerField.java:219) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.TermsHashPerField.writeVInt(TermsHashPerField.java:241) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.FreqProxTermsWriterPerField.writeProx(FreqProxTermsWriterPerField.java:86) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.FreqProxTermsWriterPerField.newTerm(FreqProxTermsWriterPerField.java:127) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > 
org.apache.lucene.index.TermsHashPerField.initStreamSlices(TermsHashPerField.java:175) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.TermsHashPerField.add(TermsHashPerField.java:198) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain$PerField.invert(IndexingChain.java:1224) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain.processField(IndexingChain.java:729) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexingChain.processDocument(IndexingChain.java:620) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.DocumentsWriterPerThread.updateDocuments(DocumentsWriterPerThread.java:241) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.DocumentsWriter.updateDocuments(DocumentsWriter.java:432) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexWriter.updateDocuments(IndexWriter.java:1532) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at > org.apache.lucene.index.IndexWriter.addDocuments(IndexWriter.java:1503) > ~[lucene-core-9.2.0.jar:9.2.0 ba8c3a806ada3d7b3c34d408e449a92376a8481b - > romseygeek - 2022-05-19 15:10:13] > at iped.engine.task.index.IndexTask.process(IndexTask.java:148) > ~[iped-engine-4.0.2.jar:?] 
> at > iped.engine.task.AbstractTask.processMonitorTimeout(AbstractTask.java:250) > ~[iped-engine-4.0.2.jar:?]{noformat} > > This seems an integer overflow to me, not sure... It didn't use to happen > with previous lucene-5.5.5 and indexing files like this is pretty common to > us, although with lucene-5.5.5 we used to break that huge file manually > before indexing and to index using IndexWriter.addDocument(Document) method > several times for each 10MB chunk, now we are using the > IndexWriter.addDocuments(Iterable) method with lucene-9.2.0... Any thoughts?
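A negative array index like the `Index -65536` in the trace is the classic signature of silent int wraparound. The snippet below demonstrates the general mechanism only; it is not the actual TermsHashPerField code path, and whether this is the root cause of the report above is unconfirmed:

```java
public class IntOverflowDemo {
  public static void main(String[] args) {
    byte[] pool = new byte[16];

    // An int accumulator wraps silently once it passes
    // Integer.MAX_VALUE (about 2 GiB of accumulated bytes).
    int offset = Integer.MAX_VALUE;
    offset += 1; // wraps to Integer.MIN_VALUE, a negative value
    System.out.println(offset); // -2147483648

    // Using the wrapped value as an array index then fails with a
    // negative index, as in the reported ArrayIndexOutOfBoundsException.
    try {
      pool[offset] = 0;
    } catch (ArrayIndexOutOfBoundsException e) {
      System.out.println("negative index: " + (offset < 0)); // true
    }
  }
}
```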
[GitHub] [lucene] msokolov opened a new pull request, #1073: fix VectorUtil.dotProductScore normalization
msokolov opened a new pull request, #1073: URL: https://github.com/apache/lucene/pull/1073 There was a thinko in the recently-pushed VectorUtil.dotProductScore. It didn't apply the needed offset to the score: it was adding 1 when it should have added `-128 * -128 * dimension`.
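The normalization from the patch under review can be checked numerically: each signed-byte product lies in [-16384, 16384] (|-128 * -128| = 2^14), so a dot product over `len` dimensions lies in [-len * 2^14, len * 2^14], and the score maps that range onto [0, 1]. A standalone sketch of the corrected formula (assuming the `maxValue` definition shown in the diff, not the final committed code):

```java
public class DotProductScoreDemo {
  // Corrected normalization: score = (maxValue + dot) / (2 * maxValue),
  // where maxValue = len * 2^14 is the largest possible |dot product|
  // of two signed-byte vectors of length len.
  static float dotProductScore(byte[] a, byte[] b) {
    int dot = 0;
    for (int i = 0; i < a.length; i++) {
      dot += a[i] * b[i]; // bytes promote to int; no overflow for small len
    }
    float maxValue = a.length * (1 << 14);
    return (maxValue + dot) / (2 * maxValue);
  }

  public static void main(String[] args) {
    byte[] min = {-128, -128};
    byte[] max = {127, 127};
    // Maximal dot product (2 * 16384 = 32768) maps to exactly 1.0.
    System.out.println(dotProductScore(min, min)); // 1.0
    // Near-minimal dot product (2 * -16256) maps close to 0.
    System.out.println(dotProductScore(min, max)); // 0.00390625
  }
}
```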
[GitHub] [lucene] LuXugang commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment
LuXugang commented on code in PR #1062: URL: https://github.com/apache/lucene/pull/1062#discussion_r950307252 ## lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: ## @@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException { builder.add(docs); Review Comment: Hi @gsmiller , I mean the `termsEnum` in line 295 was not involved in the `reader.maxDoc() == termsEnum.docFreq()` calculation.
[GitHub] [lucene] gsmiller commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment
gsmiller commented on code in PR #1062: URL: https://github.com/apache/lucene/pull/1062#discussion_r950251125 ## lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: ## @@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException { builder.add(docs); Review Comment: @LuXugang I'm not sure I follow your question. `builder.add(docs)` will add all docs for the term's `PostingsEnum` to the `DocIdSetBuilder`. Or are you referring to the initialization of the builder here: `builder = new DocIdSetBuilder(reader.maxDoc(), terms);`? In this initialization, `reader.maxDoc()` is used to specify the number of possible docids (used to size the underlying bitset). `reader.maxDoc()` returns "one greater than the largest possible document number" (according to javadoc), so this should be correct. Sorry if I'm misunderstanding your question.
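The optimization this PR discusses rests on a simple invariant: if a term's docFreq equals the reader's maxDoc, the term occurs in every document, so no per-document bitset needs to be built. The helper below is a hypothetical sketch of that check, not the actual TermInSetQuery code; the deletion guard is an assumption of this sketch (docFreq is not deletion-aware, so the shortcut would overmatch on segments with deletes):

```java
public class MatchAllCheckDemo {
  // If docFreq == maxDoc, every document contains the term and a
  // match-all iterator can stand in for the postings. Guard on
  // deletions, since docFreq still counts deleted documents.
  static boolean matchesAllDocs(int docFreq, int maxDoc, boolean hasDeletions) {
    return hasDeletions == false && docFreq == maxDoc;
  }

  public static void main(String[] args) {
    System.out.println(matchesAllDocs(100, 100, false)); // true
    System.out.println(matchesAllDocs(99, 100, false));  // false
    System.out.println(matchesAllDocs(100, 100, true));  // false
  }
}
```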
[GitHub] [lucene-solr] ceciliassis closed pull request #450: Add rule exception for "imento" and "mento" suffix
ceciliassis closed pull request #450: Add rule exception for "imento" and "mento" suffix URL: https://github.com/apache/lucene-solr/pull/450
[GitHub] [lucene] LuXugang commented on a diff in pull request #1062: Optimize TermInSetQuery for terms that match all docs in a segment
LuXugang commented on code in PR #1062: URL: https://github.com/apache/lucene/pull/1062#discussion_r949980320 ## lucene/core/src/java/org/apache/lucene/search/TermInSetQuery.java: ## @@ -293,6 +296,9 @@ private WeightOrDocIdSet rewrite(LeafReaderContext context) throws IOException { builder.add(docs); Review Comment: It seems like this term's `reader.maxDoc() == termsEnum.docFreq()` check was skipped?