[jira] [Comment Edited] (SOLR-10506) Possible memory leak upon collection reload
[ https://issues.apache.org/jira/browse/SOLR-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019264#comment-17019264 ] Vinh Le edited comment on SOLR-10506 at 1/20/20 7:14 AM: - In 7.7.2, some SolrCore instances are still not released after being removed. !image-2020-01-20-14-51-26-411.png|width=880,height=538! was (Author: vinhlh): In 7.7.2, some SolrCore instances are still not released after being removed. !image-2020-01-20-14-51-26-411.png! > Possible memory leak upon collection reload > --- > > Key: SOLR-10506 > URL: https://issues.apache.org/jira/browse/SOLR-10506 > Project: Solr > Issue Type: Bug > Components: Server >Affects Versions: 6.5 >Reporter: Torsten Bøgh Köster >Assignee: Christine Poerschke >Priority: Major > Fix For: 6.6.6, 7.0 > > Attachments: SOLR-10506.patch, image-2020-01-20-14-51-26-411.png, > solr_collection_reload_13_cores.png, solr_gc_path_via_zk_WatchManager.png > > Time Spent: 20m > Remaining Estimate: 0h > > Upon manual Solr collection reloading, the closed {{SolrCore}} is not fully released by the garbage collector, as a strong reference to the > {{ZkIndexSchemaReader}} is held in a ZooKeeper {{Watcher}} that watches for > schema changes. > In our case, this leads to a massive memory leak, as managed resources are > still referenced by the closed {{SolrCore}}. Our SolrCloud environment > utilizes rather large managed resources (synonyms, stopwords). To reproduce, > we fired our environment up and reloaded the collection 13 times. As a result > we fully exhausted our heap. A closer look with the YourKit profiler revealed > 13 {{SolrCore}} instances, still holding strong references to the garbage > collection root (see screenshot 1). > Each {{SolrCore}} instance holds a single path with strong references to the > GC root via a {{Watcher}} in {{ZkIndexSchemaReader}} (see screenshot 2). The > {{ZkIndexSchemaReader}} registers a close hook in the {{SolrCore}}, but the > ZooKeeper watcher is not removed upon core close. > We supplied a GitHub pull request > (https://github.com/apache/lucene-solr/pull/197) that extracts the ZooKeeper > {{Watcher}} into a static inner class. To eliminate the memory leak, the schema > reader is held inside a {{WeakReference}}, and the reference is explicitly > removed on core close. > Initially I wanted to supply a test case but unfortunately did not find a > good starting point ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
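For readers skimming the thread, here is a minimal sketch of the pattern the pull request describes: the ZooKeeper {{Watcher}} becomes a standalone static class that reaches the schema reader only through a {{WeakReference}}, which a core close hook clears eagerly. Only the {{org.apache.zookeeper.Watcher}} API below is real; {{SchemaReader}}, {{reloadSchema}}, and {{discardReaderReference}} are illustrative stand-ins, not the actual patch.
{code:java}
import java.lang.ref.WeakReference;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Illustrative stand-in for ZkIndexSchemaReader (assumed name and shape).
class SchemaReader {
  void reloadSchema() { /* re-read the managed schema from ZooKeeper */ }
}

/**
 * A standalone watcher holds no implicit reference to an enclosing object,
 * so a closed SolrCore is no longer pinned to the GC root through
 * ZooKeeper's watch manager.
 */
final class SchemaWatcher implements Watcher {
  private final WeakReference<SchemaReader> readerRef;

  SchemaWatcher(SchemaReader reader) {
    this.readerRef = new WeakReference<>(reader);
  }

  @Override
  public void process(WatchedEvent event) {
    SchemaReader reader = readerRef.get();
    if (reader == null) {
      return; // core already closed and collected; nothing to do
    }
    reader.reloadSchema();
  }

  /** Called from a SolrCore close hook so the reader can be collected eagerly. */
  void discardReaderReference() {
    readerRef.clear();
  }
}
{code}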
[jira] [Commented] (SOLR-10506) Possible memory leak upon collection reload
[ https://issues.apache.org/jira/browse/SOLR-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019264#comment-17019264 ] Vinh Le commented on SOLR-10506: In 7.7.2, some SolrCore instances are still not released after being removed (the alias actually switched to another one). !image-2020-01-20-14-51-26-411.png! > Possible memory leak upon collection reload > --- > > Key: SOLR-10506 > URL: https://issues.apache.org/jira/browse/SOLR-10506 > Project: Solr > Issue Type: Bug > Components: Server >Affects Versions: 6.5 >Reporter: Torsten Bøgh Köster >Assignee: Christine Poerschke >Priority: Major > Fix For: 6.6.6, 7.0 > > Attachments: SOLR-10506.patch, image-2020-01-20-14-51-26-411.png, > solr_collection_reload_13_cores.png, solr_gc_path_via_zk_WatchManager.png > > Time Spent: 20m > Remaining Estimate: 0h > > Upon manual Solr collection reloading, the closed {{SolrCore}} is not fully released by the garbage collector, as a strong reference to the > {{ZkIndexSchemaReader}} is held in a ZooKeeper {{Watcher}} that watches for > schema changes. > In our case, this leads to a massive memory leak, as managed resources are > still referenced by the closed {{SolrCore}}. Our SolrCloud environment > utilizes rather large managed resources (synonyms, stopwords). To reproduce, > we fired our environment up and reloaded the collection 13 times. As a result > we fully exhausted our heap. A closer look with the YourKit profiler revealed > 13 {{SolrCore}} instances, still holding strong references to the garbage > collection root (see screenshot 1). > Each {{SolrCore}} instance holds a single path with strong references to the > GC root via a {{Watcher}} in {{ZkIndexSchemaReader}} (see screenshot 2). The > {{ZkIndexSchemaReader}} registers a close hook in the {{SolrCore}}, but the > ZooKeeper watcher is not removed upon core close. > We supplied a GitHub pull request > (https://github.com/apache/lucene-solr/pull/197) that extracts the ZooKeeper > {{Watcher}} into a static inner class. To eliminate the memory leak, the schema > reader is held inside a {{WeakReference}}, and the reference is explicitly > removed on core close. > Initially I wanted to supply a test case but unfortunately did not find a > good starting point ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-10506) Possible memory leak upon collection reload
[ https://issues.apache.org/jira/browse/SOLR-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinh Le updated SOLR-10506: --- Attachment: image-2020-01-20-14-51-26-411.png > Possible memory leak upon collection reload > --- > > Key: SOLR-10506 > URL: https://issues.apache.org/jira/browse/SOLR-10506 > Project: Solr > Issue Type: Bug > Components: Server >Affects Versions: 6.5 >Reporter: Torsten Bøgh Köster >Assignee: Christine Poerschke >Priority: Major > Fix For: 6.6.6, 7.0 > > Attachments: SOLR-10506.patch, image-2020-01-20-14-51-26-411.png, > solr_collection_reload_13_cores.png, solr_gc_path_via_zk_WatchManager.png > > Time Spent: 20m > Remaining Estimate: 0h > > Upon manual Solr collection reloading, the closed {{SolrCore}} is not fully released by the garbage collector, as a strong reference to the > {{ZkIndexSchemaReader}} is held in a ZooKeeper {{Watcher}} that watches for > schema changes. > In our case, this leads to a massive memory leak, as managed resources are > still referenced by the closed {{SolrCore}}. Our SolrCloud environment > utilizes rather large managed resources (synonyms, stopwords). To reproduce, > we fired our environment up and reloaded the collection 13 times. As a result > we fully exhausted our heap. A closer look with the YourKit profiler revealed > 13 {{SolrCore}} instances, still holding strong references to the garbage > collection root (see screenshot 1). > Each {{SolrCore}} instance holds a single path with strong references to the > GC root via a {{Watcher}} in {{ZkIndexSchemaReader}} (see screenshot 2). The > {{ZkIndexSchemaReader}} registers a close hook in the {{SolrCore}}, but the > ZooKeeper watcher is not removed upon core close. > We supplied a GitHub pull request > (https://github.com/apache/lucene-solr/pull/197) that extracts the ZooKeeper > {{Watcher}} into a static inner class. To eliminate the memory leak, the schema > reader is held inside a {{WeakReference}}, and the reference is explicitly > removed on core close. > Initially I wanted to supply a test case but unfortunately did not find a > good starting point ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize
dweiss commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize URL: https://github.com/apache/lucene-solr/pull/1184#discussion_r368388656 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/FrozenIntSet.java ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.automaton; + +public final class FrozenIntSet extends IntSet { + final int state; + + public FrozenIntSet(int[] values, int hashCode, int state) { +this.values = values; +this.hashCode = hashCode; +this.state = state; + } + + public FrozenIntSet(int num, int state) { +this.values = new int[] { num }; +this.state = state; +this.hashCode = 683 + num; Review comment: oh, ok. drop it; the less code to understand the better. I don't think it'll be a particular gain here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize
dweiss commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize URL: https://github.com/apache/lucene-solr/pull/1184#discussion_r368388354 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/SortedIntSet.java ## @@ -151,126 +141,23 @@ public void computeHash() { } } + /** + * Create a FrozenIntSet from the current values in this IntSet. + * + * Note: Must call computeHash() before calling this method + * + * @param state the state to save + * @return a FrozenIntSet that has the same values and hashCode as this set + */ public FrozenIntSet freeze(int state) { final int[] c = new int[upto]; System.arraycopy(values, 0, c, 0, upto); Review comment: You are looking at the Java code, but the difference is that Arrays.copyOf is (I believe) a JVM intrinsic, so it should be replaced with more optimized code. The allocation+arraycopy pair may well be optimized too, for that matter, but that requires execution-graph analysis, whereas copyOf is straightforward. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
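A minimal, generic illustration of the two idioms being compared (plain Java, not part of the patch):

```java
import java.util.Arrays;

class CopyIdioms {
  // Idiom in the patch as written: explicit allocation plus System.arraycopy.
  static int[] copyManual(int[] values, int upto) {
    final int[] c = new int[upto];
    System.arraycopy(values, 0, c, 0, upto);
    return c;
  }

  // Suggested alternative: Arrays.copyOf is a HotSpot JIT intrinsic, so the
  // allocation and the copy are handed to the VM as one recognized operation.
  static int[] copyViaCopyOf(int[] values, int upto) {
    return Arrays.copyOf(values, upto);
  }
}
```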
[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin-Chun Zhang updated LUCENE-9136: --- Description: Representation learning (RL) has been an established discipline in the machine learning space for decades, but it has drawn tremendous attention lately with the emergence of deep learning. The central problem of RL is to determine an optimal representation of the input data. By embedding the data into a high-dimensional vector, the vector retrieval (VR) method is then applied to search for the relevant items. With the rapid development of RL over the past few years, the technique has been used extensively in industry, from online advertising to computer vision and speech recognition. There exist many open source implementations of VR algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various choices for potential users. However, the aforementioned implementations are all written in C++, with no plan for supporting a Java interface, making them hard to integrate into Java projects, and hard for those who are not familiar with C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. The algorithms for vector retrieval can be roughly classified into four categories, # Tree-based algorithms, such as KD-tree; # Hashing methods, such as LSH (Locality-Sensitive Hashing); # Product quantization algorithms, such as IVFFlat; # Graph-based algorithms, such as HNSW, SSG, NSG; where IVFFlat and HNSW are the most popular ones among all the VR algorithms. Recently, the implementation of HNSW (Hierarchical Navigable Small World, LUCENE-9004) for Lucene has made great progress. The issue has drawn the attention of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. As an alternative for solving ANN similarity search problems, IVFFlat is also very popular with many users and supporters. Compared with HNSW, IVFFlat has a smaller index size but requires k-means clustering, while HNSW is faster at query time (no training required) but requires extra storage for saving graphs [indexing 1M vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]]. Another advantage is that IVFFlat can be faster and more accurate when GPU parallel computing is enabled (currently not supported in Java). Both algorithms have their merits and demerits. Since HNSW is now under development, it may be better to provide both implementations (HNSW && IVFFlat) for potential users who are faced with very different scenarios and want more choices. was: Representation learning (RL) has been an established discipline in the machine learning space for decades, but it has drawn tremendous attention lately with the emergence of deep learning. The central problem of RL is to determine an optimal representation of the input data. By embedding the data into a high-dimensional vector, the vector retrieval (VR) method is then applied to search for the relevant items. With the rapid development of RL over the past few years, the technique has been used extensively in industry, from online advertising to computer vision and speech recognition. There exist many open source implementations of VR algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various choices for potential users. However, the aforementioned implementations are all written in C++, with no plan for supporting a Java interface, making them hard to integrate into Java projects, and hard for those who are not familiar with C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. 
The algorithms for vector retrieval can be roughly classified into four categories, # Tree-based algorithms, such as KD-tree; # Hashing methods, such as LSH (Locality-Sensitive Hashing); # Product quantization algorithms, such as IVFFlat; # Graph-based algorithms, such as HNSW, SSG, NSG; where IVFFlat and HNSW are the most popular ones among all the VR algorithms. Recently, the implementation of HNSW (Hierarchical Navigable Small World, LUCENE-9004) for Lucene has made great progress. The issue has drawn the attention of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. As an alternative for solving ANN similarity search problems, IVFFlat is also very popular with many users and supporters. Compared with HNSW, IVFFlat has a smaller index size but requires k-means clustering, while HNSW is faster at query time (no training required) but requires extra storage for saving graphs [indexing 1M vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]]. Another advantage is that IVFFlat can be faster and more accurate when GPU parallel computing is enabled (currently not supported in Java). Both algorithms have their merits and demerits. Since HNSW is now under development, it may be better to provide both implementations (HNSW && IVFFlat) for potential users who are faced with very different scenarios and want more choices.
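To make the IVFFlat/HNSW trade-off concrete, here is a toy sketch of the IVFFlat idea: a coarse quantizer (e.g. k-means centroids) plus one inverted list of raw vectors per centroid, where a query scans only the {{nProbe}} nearest lists. This illustrates the algorithm family only; it is not the proposed Lucene implementation, and every name in it is made up.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/**
 * Toy IVFFlat: vectors are assigned to the nearest of k coarse centroids
 * ("IVF"), stored uncompressed ("Flat"), and a query scans only the
 * nProbe nearest inverted lists instead of the whole collection.
 */
class ToyIvfFlat {
  private final float[][] centroids;       // k centroids, trained offline (e.g. k-means)
  private final List<List<float[]>> lists; // one inverted list per centroid

  ToyIvfFlat(float[][] centroids) {
    this.centroids = centroids;
    this.lists = new ArrayList<>();
    for (int i = 0; i < centroids.length; i++) {
      lists.add(new ArrayList<>());
    }
  }

  void add(float[] vector) {
    lists.get(nearestCentroid(vector)).add(vector);
  }

  /** Return the closest stored vector found in the nProbe nearest lists. */
  float[] search(float[] query, int nProbe) {
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble(i -> l2(query, centroids[i])));
    float bestDist = Float.MAX_VALUE;
    float[] best = null;
    for (int p = 0; p < Math.min(nProbe, order.length); p++) {
      for (float[] v : lists.get(order[p])) {
        float d = l2(query, v);
        if (d < bestDist) {
          bestDist = d;
          best = v;
        }
      }
    }
    return best; // approximate: the true nearest vector may live in an unprobed list
  }

  private int nearestCentroid(float[] v) {
    int best = 0;
    float bestDist = Float.MAX_VALUE;
    for (int i = 0; i < centroids.length; i++) {
      float d = l2(v, centroids[i]);
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    return best;
  }

  private static float l2(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
{code}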
[GitHub] [lucene-solr] irvingzhang commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging
irvingzhang commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging URL: https://github.com/apache/lucene-solr/pull/1169#discussion_r368349044 ## File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java ## @@ -92,7 +108,277 @@ public void testSingleDocRecall() throws Exception { iw.commit(); assertConsistentGraph(iw, values); - assertRecall(dir, 0, values[0]); + assertRecall(dir, 1, values[0]); +} + } + + public void testDocsDeletionAndRecall() throws Exception { +/** + * {@code KnnExactVectorValueWeight} applies in-set (i.e. the query vector is exactly in the index) + * deletion strategy to filter all unmatched results searched by {@link org.apache.lucene.search.KnnGraphQuery.KnnExactVectorValueQuery}, + * and deletes at most ef*segmentCnt vectors that are the same to the specified queryVector. + */ +final class KnnExactVectorValueWeight extends ConstantScoreWeight { + private final String field; + private final ScoreMode scoreMode; + private final float[] queryVector; + private final int ef; + + KnnExactVectorValueWeight(Query query, float score, ScoreMode scoreMode, String field, float[] queryVector, int ef) { +super(query, score); +this.field = field; +this.scoreMode = scoreMode; +this.queryVector = queryVector; +this.ef = ef; + } + + /** + * Returns a {@link Scorer} which can iterate in order over all matching + * documents and assign them a score. + * + * NOTE: null can be returned if no documents will be scored by this + * query. + * + * NOTE: The returned {@link Scorer} does not have + * {@link LeafReader#getLiveDocs()} applied, they need to be checked on top. + * + * @param context the {@link LeafReaderContext} for which to return the {@link Scorer}. + * @return a {@link Scorer} which scores documents in/out-of order. + * @throws IOException if there is a low-level I/O error + */ + @Override + public Scorer scorer(LeafReaderContext context) throws IOException { +ScorerSupplier supplier = scorerSupplier(context); +if (supplier == null) { + return null; +} +return supplier.get(Long.MAX_VALUE); + } + + @Override + public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException { +FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field); +int numDimensions = fi.getVectorNumDimensions(); +if (numDimensions != queryVector.length) { + throw new IllegalArgumentException("field=\"" + field + "\" was indexed with dimensions=" + numDimensions + + "; this is incompatible with query dimensions=" + queryVector.length); +} + +final HNSWGraphReader hnswReader = new HNSWGraphReader(field, context); +final VectorValues vectorValues = context.reader().getVectorValues(field); +if (vectorValues == null) { + // No docs in this segment/field indexed any vector values + return null; +} + +final Weight weight = this; +return new ScorerSupplier() { + @Override + public Scorer get(long leadCost) throws IOException { +final Neighbors neighbors = hnswReader.searchNeighbors(queryVector, ef, vectorValues); + +if (neighbors.size() > 0) { + Neighbor top = neighbors.top(); + if (top.distance() > 0) { +neighbors.clear(); + } else { +final List toDeleteNeighbors = new ArrayList<>(neighbors.size()); Review comment: Yes, and thanks. I hope to test some cases where segments contain deleted vectors. The classes KnnExactVectorValueQuery and KnnExactVectorValueWeight are added because I expect the deleted vector values are deterministic, making the assertions meet in any execution. 
The two classes are just used for my test case, so I put them in the test file. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize
madrob commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize URL: https://github.com/apache/lucene-solr/pull/1184#discussion_r368349481 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/SortedIntSet.java ## @@ -151,126 +141,23 @@ public void computeHash() { } } + /** + * Create a FrozenIntSet from the current values in this IntSet. + * + * Note: Must call computeHash() before calling this method + * + * @param state the state to save + * @return a FrozenIntSet that has the same values and hashCode as this set + */ public FrozenIntSet freeze(int state) { final int[] c = new int[upto]; System.arraycopy(values, 0, c, 0, upto); Review comment: Arrays.copyOf does it in two steps, same as here. FutureArrays has compare, equals, and mismatch that I see? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize
madrob commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize URL: https://github.com/apache/lucene-solr/pull/1184#discussion_r368348312 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/FrozenIntSet.java ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.automaton; + +public final class FrozenIntSet extends IntSet { + final int state; + + public FrozenIntSet(int[] values, int hashCode, int state) { +this.values = values; +this.hashCode = hashCode; +this.state = state; + } + + public FrozenIntSet(int num, int state) { +this.values = new int[] { num }; +this.state = state; +this.hashCode = 683 + num; Review comment: It's a shortcut for the other constructor; it uses the same hash calculation as SortedIntSet.calculateHash, but specialized down to a single value. I'll see if we can easily drop this code, since I agree that it adds complexity for maintainers. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019066#comment-17019066 ] Michael Sokolov commented on LUCENE-9004: - I'll second the thanks, [~jtibshirani]. There's clearly active work going on, and it may be too soon to declare a single winner in this complex space. I do think there is a need to focus on higher-dimensional cases, since Lucene already has well-developed support for dim <= 8 via KD-tree, but nothing for higher dimensions. One thing that surprises me a bit about some evaluations I'm seeing is that they report Precision@1 (and sometimes even when operating over the training set?!). I wonder if anyone has looked at a metric that includes the top 10 (say), and penalizes more distant matches? For example, MSE over normalized vectors would enable one to distinguish among results that have the same "precision" yet where one has vectors that are closer than the other. Re: deletions, yeah, we have not addressed that. The only thing that makes sense to me for deletions is to prune them while searching. TBH I'm not sure how to plumb livedocs into the query, or if this is somehow untenable? Supposing we do that, it would impose some operational constraints, in that if a lot of documents are deleted, performance will drop substantially, but I think that is probably OK. Users will just have to understand the limitation? We'll have to understand the impact as deletions accumulate. I think the issue about filtering against other queries is more challenging, since we typically don't have an up-front bitset to filter against. In a sense the ANN query is the most expensive, because *every* document is a potential match. Perhaps the thing to do is come up with an estimate of a radius R bounding the top K (around the query vector), based on the approximate top K we find, and then allow advancing to a document, even if it was not returned by graph search, so long as its distance is <= R. This would not truly answer the question "top K closest documents satisfying these constraints," though. For that I don't see what we could do other than forcing computation of a bitset and then passing that into the graph search (like for deletions). > Approximate nearest vector search > - > > Key: LUCENE-9004 > URL: https://issues.apache.org/jira/browse/LUCENE-9004 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Attachments: hnsw_layered_graph.png > > Time Spent: 2.5h > Remaining Estimate: 0h > > "Semantic" search based on machine-learned vector "embeddings" representing > terms, queries and documents is becoming a must-have feature for a modern > search engine. SOLR-12890 is exploring various approaches to this, including > providing vector-based scoring functions. This is a spinoff issue from that. > The idea here is to explore approximate nearest-neighbor search. Researchers > have found that an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). > At a high level the way this algorithm works is this.
First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, i.e. both ways) to > the new document, then the graph that emerges will have the desired > properties. > The implementation I propose for Lucene is as follows. We need two new data > structures to encode the vectors and the graph. We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to look up through an id-field indirection. This
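On the deletions point above, "prune them while searching" could be as simple as checking each graph candidate against the segment's live-docs bitset before collecting it. The sketch below uses only stock Lucene APIs ({{LeafReader#getLiveDocs()}} returns a {{Bits}} whose set bits are live documents, or null when the segment has no deletions); the {{Candidate}} type is a made-up placeholder for whatever the graph search returns, not a committed API.
{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.util.Bits;

class LiveDocsPruning {

  /** Hypothetical graph-search hit: a segment-local docid plus its distance. */
  static final class Candidate {
    final int docId;
    final float distance;

    Candidate(int docId, float distance) {
      this.docId = docId;
      this.distance = distance;
    }
  }

  /**
   * Drop hits pointing at deleted documents. getLiveDocs() returns null
   * when the segment has no deletions, so the common case stays cheap.
   */
  static List<Candidate> pruneDeleted(List<Candidate> approximateTopK, LeafReaderContext context) {
    Bits liveDocs = context.reader().getLiveDocs();
    if (liveDocs == null) {
      return approximateTopK; // no deletions in this segment
    }
    List<Candidate> live = new ArrayList<>(approximateTopK.size());
    for (Candidate c : approximateTopK) {
      if (liveDocs.get(c.docId)) {
        live.add(c);
      }
    }
    // May now hold fewer than K hits; a caller could widen the beam and retry.
    return live;
  }
}
{code}
As the comment notes, heavily deleted segments would yield fewer live hits per probe, so search cost rises as deletions accumulate.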
[GitHub] [lucene-solr] msokolov commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging
msokolov commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging URL: https://github.com/apache/lucene-solr/pull/1169#discussion_r368330855 ## File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java ## @@ -92,7 +108,277 @@ public void testSingleDocRecall() throws Exception { iw.commit(); assertConsistentGraph(iw, values); - assertRecall(dir, 0, values[0]); + assertRecall(dir, 1, values[0]); +} + } + + public void testDocsDeletionAndRecall() throws Exception { +/** + * {@code KnnExactVectorValueWeight} applies in-set (i.e. the query vector is exactly in the index) + * deletion strategy to filter all unmatched results searched by {@link org.apache.lucene.search.KnnGraphQuery.KnnExactVectorValueQuery}, + * and deletes at most ef*segmentCnt vectors that are the same to the specified queryVector. + */ +final class KnnExactVectorValueWeight extends ConstantScoreWeight { + private final String field; + private final ScoreMode scoreMode; + private final float[] queryVector; + private final int ef; + + KnnExactVectorValueWeight(Query query, float score, ScoreMode scoreMode, String field, float[] queryVector, int ef) { +super(query, score); +this.field = field; +this.scoreMode = scoreMode; +this.queryVector = queryVector; +this.ef = ef; + } + + /** + * Returns a {@link Scorer} which can iterate in order over all matching + * documents and assign them a score. + * + * NOTE: null can be returned if no documents will be scored by this + * query. + * + * NOTE: The returned {@link Scorer} does not have + * {@link LeafReader#getLiveDocs()} applied, they need to be checked on top. + * + * @param context the {@link LeafReaderContext} for which to return the {@link Scorer}. + * @return a {@link Scorer} which scores documents in/out-of order. + * @throws IOException if there is a low-level I/O error + */ + @Override + public Scorer scorer(LeafReaderContext context) throws IOException { +ScorerSupplier supplier = scorerSupplier(context); +if (supplier == null) { + return null; +} +return supplier.get(Long.MAX_VALUE); + } + + @Override + public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException { +FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field); +int numDimensions = fi.getVectorNumDimensions(); +if (numDimensions != queryVector.length) { + throw new IllegalArgumentException("field=\"" + field + "\" was indexed with dimensions=" + numDimensions + + "; this is incompatible with query dimensions=" + queryVector.length); +} + +final HNSWGraphReader hnswReader = new HNSWGraphReader(field, context); +final VectorValues vectorValues = context.reader().getVectorValues(field); +if (vectorValues == null) { + // No docs in this segment/field indexed any vector values + return null; +} + +final Weight weight = this; +return new ScorerSupplier() { + @Override + public Scorer get(long leadCost) throws IOException { +final Neighbors neighbors = hnswReader.searchNeighbors(queryVector, ef, vectorValues); + +if (neighbors.size() > 0) { + Neighbor top = neighbors.top(); + if (top.distance() > 0) { +neighbors.clear(); + } else { +final List toDeleteNeighbors = new ArrayList<>(neighbors.size()); Review comment: You are -- finding exact matches to the input vector, right? I don't understand what this has to do with deletion. I'm also unclear why we want to have an exact match query in the first place. What problem is it solving that we could not solve with a hashmap lookup? And ... 
it is implemented here in a test file. Is this supporting testing in some way? Thanks, I feel I must be missing some essential thing here... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dsmiley commented on issue #1166: SOLR-14040: shareSchema support for SolrCloud
dsmiley commented on issue #1166: SOLR-14040: shareSchema support for SolrCloud URL: https://github.com/apache/lucene-solr/pull/1166#issuecomment-576052694 Also, we should probably add some protections to prevent sharing of core-specific things. For example, if shareSchema=true, then we might want to log a warning if there are core-specific lib dirs or lib directives in solrconfig.xml. This is an old issue and less likely in a SolrCloud scenario. We might also want to block core-specific properties from being expanded. And use of SolrCoreAware when loading schema components ought to log a warning. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-12325) introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet
[ https://issues.apache.org/jira/browse/SOLR-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Khludnev updated SOLR-12325: Attachment: SOLR-12325.patch Status: Open (was: Open) > introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet > -- > > Key: SOLR-12325 > URL: https://issues.apache.org/jira/browse/SOLR-12325 > Project: Solr > Issue Type: New Feature > Components: Facet Module >Reporter: Mikhail Khludnev >Priority: Major > Attachments: SOLR-12325.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > It might be a faster twin for {{uniqueBlock(\_root_)}}. Please utilise the built-in > query parsing method; don't invent your own. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-12325) introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet
[ https://issues.apache.org/jira/browse/SOLR-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Khludnev updated SOLR-12325: Status: Patch Available (was: Open) > introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet > -- > > Key: SOLR-12325 > URL: https://issues.apache.org/jira/browse/SOLR-12325 > Project: Solr > Issue Type: New Feature > Components: Facet Module >Reporter: Mikhail Khludnev >Priority: Major > Attachments: SOLR-12325.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > It might be a faster twin for {{uniqueBlock(\_root_)}}. Please utilise the built-in > query parsing method; don't invent your own. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9134) Port ant-regenerate tasks to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson updated LUCENE-9134: --- Attachment: core_regen.patch Status: Open (was: Open) This adds two regenerate tasks for the lucene/core build file (plus cleans up a couple of nocommits from my PR). I decided to put up a patch rather than a PR because I'm getting confused about which is which; both are WIPs, but separate. This looks like it succeeds on createPackedIntSources and createLevAutomaton. I have not yet tried to deal with the jflex bits. NOTE: I had to munge the path in createLevAutomata.py (sys.path.insert); I'm not clear that this is the right thing to do... > Port ant-regenerate tasks to Gradle build > - > > Key: LUCENE-9134 > URL: https://issues.apache.org/jira/browse/LUCENE-9134 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Erick Erickson >Assignee: Erick Erickson >Priority: Major > Attachments: LUCENE-9134.patch, core_regen.patch, gen-kuromoji.patch > > Time Spent: 50m > Remaining Estimate: 0h > > Here are the "regenerate" targets I found in the ant version. There are a > couple that I don't have evidence for or against being rebuilt. > // Very top level > {code:java} > ./build.xml: > ./build.xml: failonerror="true"> > ./build.xml: depends="regenerate,-check-after-regeneration"/> > {code} > // top level Lucene. This includes the core/build.xml and > test-framework/build.xml files > {code:java} > ./lucene/build.xml: > ./lucene/build.xml: inheritall="false"> > ./lucene/build.xml: > {code} > // This one has quite a number of customizations to > {code:java} > ./lucene/core/build.xml: depends="createLevAutomata,createPackedIntSources,jflex"/> > {code} > // This one has a bunch of code modifications _after_ javacc is run on > certain of the > // output files. Save this one for last? > {code:java} > ./lucene/queryparser/build.xml: > {code} > // the files under ../lucene/analysis... are pretty self-contained. I expect > these could be done as a unit > {code:java} > ./lucene/analysis/build.xml: > ./lucene/analysis/build.xml: > ./lucene/analysis/common/build.xml: depends="jflex,unicode-data"/> > ./lucene/analysis/icu/build.xml: depends="gen-utr30-data-files,gennorm2,genrbbi"/> > ./lucene/analysis/kuromoji/build.xml: depends="build-dict"/> > ./lucene/analysis/nori/build.xml: depends="build-dict"/> > ./lucene/analysis/opennlp/build.xml: depends="train-test-models"/> > {code} > > // These _are_ regenerated from the top-level regenerate target, but for -- > LUCENE-9080//the changes were only in imports so there are no > //corresponding files checked in in that JIRA > {code:java} > ./lucene/expressions/build.xml: depends="run-antlr"/> > {code} > // Apparently unrelated to ./lucene/analysis/opennlp/build.xml > "train-test-models" target > // Apparently not rebuilt from the top level, but _are_ regenerated when > executed from > // ./solr/contrib/langid > {code:java} > ./solr/contrib/langid/build.xml: depends="train-test-models"/> > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019012#comment-17019012 ] Michael Froh commented on LUCENE-8962: -- Here's a before-and-after comparison of the average number of segments searched per request since I applied this change (with a TieredMergePolicy subclass that tries to merge all segments smaller than 100MB into a single segment on commit, with floorSegmentMB of 500). It lowers the overall count and, in particular, significantly reduces the variance. !LUCENE-8962_demo.png! > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Attachments: LUCENE-8962_demo.png > > Time Spent: 1h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will write many small segments during {{refresh}}, and this then > adds search-time cost, as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter}}'s > refresh to optionally kick off the merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
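For readers trying to picture the setup, a sketch of such a policy might look like the following. It leans on the merge-on-commit hook ({{MergePolicy#findFullFlushMerges}}) that this issue is working toward, so treat it as an illustration of the idea rather than code that compiles against any released Lucene; the 100MB threshold simply mirrors the numbers above.
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.TieredMergePolicy;

/**
 * On each full flush (commit/refresh), collapse every segment under ~100MB
 * into a single segment so the new reader sees fewer tiny segments.
 */
class SmallSegmentCollapsingMergePolicy extends TieredMergePolicy {
  private static final long SMALL_SEGMENT_BYTES = 100L * 1024 * 1024;

  @Override
  public MergeSpecification findFullFlushMerges(MergeTrigger mergeTrigger,
                                                SegmentInfos segmentInfos,
                                                MergeContext mergeContext) throws IOException {
    List<SegmentCommitInfo> smallSegments = new ArrayList<>();
    for (SegmentCommitInfo info : segmentInfos) {
      // size() discounts deleted docs; skip segments already being merged elsewhere
      if (!mergeContext.getMergingSegments().contains(info)
          && size(info, mergeContext) < SMALL_SEGMENT_BYTES) {
        smallSegments.add(info);
      }
    }
    if (smallSegments.size() < 2) {
      return null; // nothing worth coalescing before the reader opens
    }
    MergeSpecification spec = new MergeSpecification();
    spec.add(new OneMerge(smallSegments));
    return spec;
  }
}
{code}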
[jira] [Updated] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Froh updated LUCENE-8962: - Attachment: LUCENE-8962_demo.png > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Attachments: LUCENE-8962_demo.png > > Time Spent: 1h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will write many small segments during {{refresh}}, and this then > adds search-time cost, as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter}}'s > refresh to optionally kick off the merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] mocobeta commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging
mocobeta commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging URL: https://github.com/apache/lucene-solr/pull/1169#discussion_r368289325 ## File path: lucene/core/src/java/org/apache/lucene/search/KnnScoreWeight.java ## @@ -18,6 +18,8 @@ package org.apache.lucene.search; import java.io.IOException; +import java.util.ArrayList; +import java.util.List; Review comment: Can you please remove those unused imports? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] mocobeta commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging
mocobeta commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging URL: https://github.com/apache/lucene-solr/pull/1169#discussion_r368289244 ## File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java ## @@ -92,7 +108,277 @@ public void testSingleDocRecall() throws Exception { iw.commit(); assertConsistentGraph(iw, values); - assertRecall(dir, 0, values[0]); + assertRecall(dir, 1, values[0]); +} + } + + public void testDocsDeletionAndRecall() throws Exception { +/** + * {@code KnnExactVectorValueWeight} applies in-set (i.e. the query vector is exactly in the index) + * deletion strategy to filter all unmatched results searched by {@link org.apache.lucene.search.KnnGraphQuery.KnnExactVectorValueQuery}, + * and deletes at most ef*segmentCnt vectors that are the same to the specified queryVector. + */ +final class KnnExactVectorValueWeight extends ConstantScoreWeight { Review comment: Thanks, it looks almost okay to me but the Weight and Query classes can be (and should be) *static* classes. It would look like this: ``` public class TestKnnGraph extends LuceneTestCase { private static final class KnnExactVectorValueWeight extends ConstantScoreWeight { } private static final class KnnExactVectorValueQuery extends Query { } public void testDocsDeletionAndRecall() throws Exception { Query query = new KnnExactVectorValueQuery(...); } } ``` Please avoid *non-static* inner classes whenever you can do so, because they consume extra memory and object references ;) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
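To spell out the memory point with a generic example (not from the patch): a non-static inner class instance carries a hidden `Outer.this` reference, which keeps the entire enclosing object reachable for as long as the inner instance lives.

```java
public class Outer {
  private final byte[] big = new byte[1 << 20]; // 1 MB kept alive by any Inner

  // Non-static inner class: each instance stores an implicit Outer.this
  // reference, so 'big' cannot be collected while an Inner instance lives.
  class Inner {}

  // Static nested class: no hidden reference to the enclosing instance.
  static class Nested {}

  public static void main(String[] args) {
    Inner inner = new Outer().new Inner(); // pins the Outer (and its 1 MB)
    Nested nested = new Nested();          // pins nothing extra
    System.out.println(inner + " " + nested);
  }
}
```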