[jira] [Comment Edited] (SOLR-10506) Possible memory leak upon collection reload
[ https://issues.apache.org/jira/browse/SOLR-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019264#comment-17019264 ] Vinh Le edited comment on SOLR-10506 at 1/20/20 7:14 AM: - In 7.7.2, some SolrCore instances are still not released after being removed. !image-2020-01-20-14-51-26-411.png|width=880,height=538! was (Author: vinhlh): In 7.7.2, some SolrCore instances are still not released after being removed. !image-2020-01-20-14-51-26-411.png! > Possible memory leak upon collection reload > --- > > Key: SOLR-10506 > URL: https://issues.apache.org/jira/browse/SOLR-10506 > Project: Solr > Issue Type: Bug > Components: Server >Affects Versions: 6.5 >Reporter: Torsten Bøgh Köster >Assignee: Christine Poerschke >Priority: Major > Fix For: 6.6.6, 7.0 > > Attachments: SOLR-10506.patch, image-2020-01-20-14-51-26-411.png, > solr_collection_reload_13_cores.png, solr_gc_path_via_zk_WatchManager.png > > Time Spent: 20m > Remaining Estimate: 0h > > Upon manual Solr collection reloading, the closed {{SolrCore}} is not fully released by the garbage collector, as a strong reference to the > {{ZkIndexSchemaReader}} is held in a ZooKeeper {{Watcher}} that watches for > schema changes. > In our case, this leads to a massive memory leak, as managed resources are > still referenced by the closed {{SolrCore}}. Our SolrCloud environment > utilizes rather large managed resources (synonyms, stopwords). To reproduce, > we fired our environment up and reloaded the collection 13 times. As a result > we fully exhausted our heap. A closer look with the YourKit profiler revealed > 13 {{SolrCore}} instances, still holding strong references to the garbage > collection root (see screenshot 1). > Each {{SolrCore}} instance holds a single path with strong references to the > GC root via a {{Watcher}} in {{ZkIndexSchemaReader}} (see screenshot 2). The > {{ZkIndexSchemaReader}} registers a close hook in the {{SolrCore}}, but the > ZooKeeper watcher is not removed upon core close. > We supplied a GitHub pull request > (https://github.com/apache/lucene-solr/pull/197) that extracts the ZooKeeper > {{Watcher}} into a static inner class. To eliminate the memory leak, the schema > reader is held inside a {{WeakReference}}, and the reference is explicitly > removed on core close. > Initially I wanted to supply a test case but unfortunately did not find a > good starting point ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
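For readers skimming the thread, here is a minimal sketch of the pattern the pull request describes: the ZooKeeper {{Watcher}} becomes a standalone static class that reaches the schema reader only through a {{WeakReference}}, which a core close hook clears eagerly. Only the {{org.apache.zookeeper.Watcher}} API below is real; {{SchemaReader}}, {{reloadSchema}}, and {{discardReaderReference}} are illustrative stand-ins, not the actual patch.
{code:java}
import java.lang.ref.WeakReference;

import org.apache.zookeeper.WatchedEvent;
import org.apache.zookeeper.Watcher;

// Illustrative stand-in for ZkIndexSchemaReader (assumed name and shape).
class SchemaReader {
  void reloadSchema() { /* re-read the managed schema from ZooKeeper */ }
}

/**
 * A standalone watcher holds no implicit reference to an enclosing object,
 * so a closed SolrCore is no longer pinned to the GC root through
 * ZooKeeper's watch manager.
 */
final class SchemaWatcher implements Watcher {
  private final WeakReference<SchemaReader> readerRef;

  SchemaWatcher(SchemaReader reader) {
    this.readerRef = new WeakReference<>(reader);
  }

  @Override
  public void process(WatchedEvent event) {
    SchemaReader reader = readerRef.get();
    if (reader == null) {
      return; // core already closed and collected; nothing to do
    }
    reader.reloadSchema();
  }

  /** Called from a SolrCore close hook so the reader can be collected eagerly. */
  void discardReaderReference() {
    readerRef.clear();
  }
}
{code}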
[jira] [Commented] (SOLR-10506) Possible memory leak upon collection reload
[ https://issues.apache.org/jira/browse/SOLR-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019264#comment-17019264 ] Vinh Le commented on SOLR-10506: In 7.7.2, some SolrCore instances are still not released after being removed (the alias actually switched to another one). !image-2020-01-20-14-51-26-411.png! > Possible memory leak upon collection reload > --- > > Key: SOLR-10506 > URL: https://issues.apache.org/jira/browse/SOLR-10506 > Project: Solr > Issue Type: Bug > Components: Server >Affects Versions: 6.5 >Reporter: Torsten Bøgh Köster >Assignee: Christine Poerschke >Priority: Major > Fix For: 6.6.6, 7.0 > > Attachments: SOLR-10506.patch, image-2020-01-20-14-51-26-411.png, > solr_collection_reload_13_cores.png, solr_gc_path_via_zk_WatchManager.png > > Time Spent: 20m > Remaining Estimate: 0h > > Upon manual Solr collection reloading, the closed {{SolrCore}} is not fully released by the garbage collector, as a strong reference to the > {{ZkIndexSchemaReader}} is held in a ZooKeeper {{Watcher}} that watches for > schema changes. > In our case, this leads to a massive memory leak, as managed resources are > still referenced by the closed {{SolrCore}}. Our SolrCloud environment > utilizes rather large managed resources (synonyms, stopwords). To reproduce, > we fired our environment up and reloaded the collection 13 times. As a result > we fully exhausted our heap. A closer look with the YourKit profiler revealed > 13 {{SolrCore}} instances, still holding strong references to the garbage > collection root (see screenshot 1). > Each {{SolrCore}} instance holds a single path with strong references to the > GC root via a {{Watcher}} in {{ZkIndexSchemaReader}} (see screenshot 2). The > {{ZkIndexSchemaReader}} registers a close hook in the {{SolrCore}}, but the > ZooKeeper watcher is not removed upon core close. > We supplied a GitHub pull request > (https://github.com/apache/lucene-solr/pull/197) that extracts the ZooKeeper > {{Watcher}} into a static inner class. To eliminate the memory leak, the schema > reader is held inside a {{WeakReference}}, and the reference is explicitly > removed on core close. > Initially I wanted to supply a test case but unfortunately did not find a > good starting point ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-10506) Possible memory leak upon collection reload
[ https://issues.apache.org/jira/browse/SOLR-10506?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Vinh Le updated SOLR-10506: --- Attachment: image-2020-01-20-14-51-26-411.png > Possible memory leak upon collection reload > --- > > Key: SOLR-10506 > URL: https://issues.apache.org/jira/browse/SOLR-10506 > Project: Solr > Issue Type: Bug > Components: Server >Affects Versions: 6.5 >Reporter: Torsten Bøgh Köster >Assignee: Christine Poerschke >Priority: Major > Fix For: 6.6.6, 7.0 > > Attachments: SOLR-10506.patch, image-2020-01-20-14-51-26-411.png, > solr_collection_reload_13_cores.png, solr_gc_path_via_zk_WatchManager.png > > Time Spent: 20m > Remaining Estimate: 0h > > Upon manual Solr collection reloading, the closed {{SolrCore}} is not fully released by the garbage collector, as a strong reference to the > {{ZkIndexSchemaReader}} is held in a ZooKeeper {{Watcher}} that watches for > schema changes. > In our case, this leads to a massive memory leak, as managed resources are > still referenced by the closed {{SolrCore}}. Our SolrCloud environment > utilizes rather large managed resources (synonyms, stopwords). To reproduce, > we fired our environment up and reloaded the collection 13 times. As a result > we fully exhausted our heap. A closer look with the YourKit profiler revealed > 13 {{SolrCore}} instances, still holding strong references to the garbage > collection root (see screenshot 1). > Each {{SolrCore}} instance holds a single path with strong references to the > GC root via a {{Watcher}} in {{ZkIndexSchemaReader}} (see screenshot 2). The > {{ZkIndexSchemaReader}} registers a close hook in the {{SolrCore}}, but the > ZooKeeper watcher is not removed upon core close. > We supplied a GitHub pull request > (https://github.com/apache/lucene-solr/pull/197) that extracts the ZooKeeper > {{Watcher}} into a static inner class. To eliminate the memory leak, the schema > reader is held inside a {{WeakReference}}, and the reference is explicitly > removed on core close. > Initially I wanted to supply a test case but unfortunately did not find a > good starting point ... -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize
dweiss commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize URL: https://github.com/apache/lucene-solr/pull/1184#discussion_r368388656 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/FrozenIntSet.java ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.automaton; + +public final class FrozenIntSet extends IntSet { + final int state; + + public FrozenIntSet(int[] values, int hashCode, int state) { +this.values = values; +this.hashCode = hashCode; +this.state = state; + } + + public FrozenIntSet(int num, int state) { +this.values = new int[] { num }; +this.state = state; +this.hashCode = 683 + num; Review comment: oh, ok. drop it; the less code to understand the better. I don't think it'll be a particular gain here. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dweiss commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize
dweiss commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize URL: https://github.com/apache/lucene-solr/pull/1184#discussion_r368388354 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/SortedIntSet.java ## @@ -151,126 +141,23 @@ public void computeHash() { } } + /** + * Create a FrozenIntSet from the current values in this IntSet. + * + * Note: Must call computeHash() before calling this method + * + * @param state the state to save + * @return a FrozenIntSet that has the same values and hashCode as this set + */ public FrozenIntSet freeze(int state) { final int[] c = new int[upto]; System.arraycopy(values, 0, c, 0, upto); Review comment: You are looking at the Java code, but the difference is that Arrays.copyOf is (I believe) a JVM intrinsic, so it should be replaced with more optimized code. The allocation+arraycopy pair may well be optimized too, for that matter, but that requires execution-graph analysis, whereas copyOf is straightforward. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
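A minimal, generic illustration of the two idioms being compared (plain Java, not part of the patch):

```java
import java.util.Arrays;

class CopyIdioms {
  // Idiom in the patch as written: explicit allocation plus System.arraycopy.
  static int[] copyManual(int[] values, int upto) {
    final int[] c = new int[upto];
    System.arraycopy(values, 0, c, 0, upto);
    return c;
  }

  // Suggested alternative: Arrays.copyOf is a HotSpot JIT intrinsic, so the
  // allocation and the copy are handed to the VM as one recognized operation.
  static int[] copyViaCopyOf(int[] values, int upto) {
    return Arrays.copyOf(values, upto);
  }
}
```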
[jira] [Updated] (LUCENE-9136) Introduce IVFFlat to Lucene for ANN similarity search
[ https://issues.apache.org/jira/browse/LUCENE-9136?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Xin-Chun Zhang updated LUCENE-9136: --- Description: Representation learning (RL) has been an established discipline in the machine learning space for decades, but it has drawn tremendous attention lately with the emergence of deep learning. The central problem of RL is to determine an optimal representation of the input data. By embedding the data into a high-dimensional vector, the vector retrieval (VR) method is then applied to search for the relevant items. With the rapid development of RL over the past few years, the technique has been used extensively in industry, from online advertising to computer vision and speech recognition. There exist many open source implementations of VR algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various choices for potential users. However, the aforementioned implementations are all written in C++, with no plan for supporting a Java interface, making them hard to integrate into Java projects, and hard for those who are not familiar with C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. The algorithms for vector retrieval can be roughly classified into four categories, # Tree-based algorithms, such as KD-tree; # Hashing methods, such as LSH (Locality-Sensitive Hashing); # Product quantization algorithms, such as IVFFlat; # Graph-based algorithms, such as HNSW, SSG, NSG; where IVFFlat and HNSW are the most popular ones among all the VR algorithms. Recently, the implementation of HNSW (Hierarchical Navigable Small World, LUCENE-9004) for Lucene has made great progress. The issue has drawn the attention of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. As an alternative for solving ANN similarity search problems, IVFFlat is also very popular with many users and supporters. Compared with HNSW, IVFFlat has a smaller index size but requires k-means clustering, while HNSW is faster at query time (no training required) but requires extra storage for saving graphs [indexing 1M vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]]. Another advantage is that IVFFlat can be faster and more accurate when GPU parallel computing is enabled (currently not supported in Java). Both algorithms have their merits and demerits. Since HNSW is now under development, it may be better to provide both implementations (HNSW && IVFFlat) for potential users who are faced with very different scenarios and want more choices. was: Representation learning (RL) has been an established discipline in the machine learning space for decades, but it has drawn tremendous attention lately with the emergence of deep learning. The central problem of RL is to determine an optimal representation of the input data. By embedding the data into a high-dimensional vector, the vector retrieval (VR) method is then applied to search for the relevant items. With the rapid development of RL over the past few years, the technique has been used extensively in industry, from online advertising to computer vision and speech recognition. There exist many open source implementations of VR algorithms, such as Facebook's FAISS and Microsoft's SPTAG, providing various choices for potential users. However, the aforementioned implementations are all written in C++, with no plan for supporting a Java interface, making them hard to integrate into Java projects, and hard for those who are not familiar with C/C++ [[https://github.com/facebookresearch/faiss/issues/105]]. 
The algorithms for vector retrieval can be roughly classified into four categories, # Tree-based algorithms, such as KD-tree; # Hashing methods, such as LSH (Locality-Sensitive Hashing); # Product quantization algorithms, such as IVFFlat; # Graph-based algorithms, such as HNSW, SSG, NSG; where IVFFlat and HNSW are the most popular ones among all the VR algorithms. Recently, the implementation of HNSW (Hierarchical Navigable Small World, LUCENE-9004) for Lucene has made great progress. The issue has drawn the attention of those who are interested in Lucene or hope to use HNSW with Solr/Lucene. As an alternative for solving ANN similarity search problems, IVFFlat is also very popular with many users and supporters. Compared with HNSW, IVFFlat has a smaller index size but requires k-means clustering, while HNSW is faster at query time (no training required) but requires extra storage for saving graphs [indexing 1M vectors|[https://github.com/facebookresearch/faiss/wiki/Indexing-1M-vectors]]. Another advantage is that IVFFlat can be faster and more accurate when GPU parallel computing is enabled (currently not supported in Java). Both algorithms have their merits and demerits. Since HNSW is now under development, it may be better to provide both implementations (HNSW && IVFFlat) for potential users who are faced with very different scenarios and want more choices.
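To make the IVFFlat/HNSW trade-off concrete, here is a toy sketch of the IVFFlat idea: a coarse quantizer (e.g. k-means centroids) plus one inverted list of raw vectors per centroid, where a query scans only the {{nProbe}} nearest lists. This illustrates the algorithm family only; it is not the proposed Lucene implementation, and every name in it is made up.
{code:java}
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/**
 * Toy IVFFlat: vectors are assigned to the nearest of k coarse centroids
 * ("IVF"), stored uncompressed ("Flat"), and a query scans only the
 * nProbe nearest inverted lists instead of the whole collection.
 */
class ToyIvfFlat {
  private final float[][] centroids;       // k centroids, trained offline (e.g. k-means)
  private final List<List<float[]>> lists; // one inverted list per centroid

  ToyIvfFlat(float[][] centroids) {
    this.centroids = centroids;
    this.lists = new ArrayList<>();
    for (int i = 0; i < centroids.length; i++) {
      lists.add(new ArrayList<>());
    }
  }

  void add(float[] vector) {
    lists.get(nearestCentroid(vector)).add(vector);
  }

  /** Return the closest stored vector found in the nProbe nearest lists. */
  float[] search(float[] query, int nProbe) {
    Integer[] order = new Integer[centroids.length];
    for (int i = 0; i < order.length; i++) {
      order[i] = i;
    }
    Arrays.sort(order, Comparator.comparingDouble(i -> l2(query, centroids[i])));
    float bestDist = Float.MAX_VALUE;
    float[] best = null;
    for (int p = 0; p < Math.min(nProbe, order.length); p++) {
      for (float[] v : lists.get(order[p])) {
        float d = l2(query, v);
        if (d < bestDist) {
          bestDist = d;
          best = v;
        }
      }
    }
    return best; // approximate: the true nearest vector may live in an unprobed list
  }

  private int nearestCentroid(float[] v) {
    int best = 0;
    float bestDist = Float.MAX_VALUE;
    for (int i = 0; i < centroids.length; i++) {
      float d = l2(v, centroids[i]);
      if (d < bestDist) {
        bestDist = d;
        best = i;
      }
    }
    return best;
  }

  private static float l2(float[] a, float[] b) {
    float sum = 0;
    for (int i = 0; i < a.length; i++) {
      float diff = a[i] - b[i];
      sum += diff * diff;
    }
    return sum;
  }
}
{code}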
[GitHub] [lucene-solr] irvingzhang commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging
irvingzhang commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging URL: https://github.com/apache/lucene-solr/pull/1169#discussion_r368349044 ## File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java ## @@ -92,7 +108,277 @@ public void testSingleDocRecall() throws Exception { iw.commit(); assertConsistentGraph(iw, values); - assertRecall(dir, 0, values[0]); + assertRecall(dir, 1, values[0]); +} + } + + public void testDocsDeletionAndRecall() throws Exception { +/** + * {@code KnnExactVectorValueWeight} applies in-set (i.e. the query vector is exactly in the index) + * deletion strategy to filter all unmatched results searched by {@link org.apache.lucene.search.KnnGraphQuery.KnnExactVectorValueQuery}, + * and deletes at most ef*segmentCnt vectors that are the same to the specified queryVector. + */ +final class KnnExactVectorValueWeight extends ConstantScoreWeight { + private final String field; + private final ScoreMode scoreMode; + private final float[] queryVector; + private final int ef; + + KnnExactVectorValueWeight(Query query, float score, ScoreMode scoreMode, String field, float[] queryVector, int ef) { +super(query, score); +this.field = field; +this.scoreMode = scoreMode; +this.queryVector = queryVector; +this.ef = ef; + } + + /** + * Returns a {@link Scorer} which can iterate in order over all matching + * documents and assign them a score. + * + * NOTE: null can be returned if no documents will be scored by this + * query. + * + * NOTE: The returned {@link Scorer} does not have + * {@link LeafReader#getLiveDocs()} applied, they need to be checked on top. + * + * @param context the {@link LeafReaderContext} for which to return the {@link Scorer}. + * @return a {@link Scorer} which scores documents in/out-of order. + * @throws IOException if there is a low-level I/O error + */ + @Override + public Scorer scorer(LeafReaderContext context) throws IOException { +ScorerSupplier supplier = scorerSupplier(context); +if (supplier == null) { + return null; +} +return supplier.get(Long.MAX_VALUE); + } + + @Override + public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException { +FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field); +int numDimensions = fi.getVectorNumDimensions(); +if (numDimensions != queryVector.length) { + throw new IllegalArgumentException("field=\"" + field + "\" was indexed with dimensions=" + numDimensions + + "; this is incompatible with query dimensions=" + queryVector.length); +} + +final HNSWGraphReader hnswReader = new HNSWGraphReader(field, context); +final VectorValues vectorValues = context.reader().getVectorValues(field); +if (vectorValues == null) { + // No docs in this segment/field indexed any vector values + return null; +} + +final Weight weight = this; +return new ScorerSupplier() { + @Override + public Scorer get(long leadCost) throws IOException { +final Neighbors neighbors = hnswReader.searchNeighbors(queryVector, ef, vectorValues); + +if (neighbors.size() > 0) { + Neighbor top = neighbors.top(); + if (top.distance() > 0) { +neighbors.clear(); + } else { +final List toDeleteNeighbors = new ArrayList<>(neighbors.size()); Review comment: Yes, and thanks. I hope to test some cases where segments contain deleted vectors. The classes KnnExactVectorValueQuery and KnnExactVectorValueWeight are added because I expect the deleted vector values are deterministic, making the assertions meet in any execution. 
The two classes are just used for my test case, so I put them in the test file. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize
madrob commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize URL: https://github.com/apache/lucene-solr/pull/1184#discussion_r368349481 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/SortedIntSet.java ## @@ -151,126 +141,23 @@ public void computeHash() { } } + /** + * Create a FrozenIntSet from the current values in this IntSet. + * + * Note: Must call computeHash() before calling this method + * + * @param state the state to save + * @return a FrozenIntSet that has the same values and hashCode as this set + */ public FrozenIntSet freeze(int state) { final int[] c = new int[upto]; System.arraycopy(values, 0, c, 0, upto); Review comment: Arrays.copyOf does it in two steps, same as here. FutureArrays has compare, equals, and mismatch that I see? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] madrob commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize
madrob commented on a change in pull request #1184: LUCENE-9142 Refactor IntSet operations for determinize URL: https://github.com/apache/lucene-solr/pull/1184#discussion_r368348312 ## File path: lucene/core/src/java/org/apache/lucene/util/automaton/FrozenIntSet.java ## @@ -0,0 +1,38 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.util.automaton; + +public final class FrozenIntSet extends IntSet { + final int state; + + public FrozenIntSet(int[] values, int hashCode, int state) { +this.values = values; +this.hashCode = hashCode; +this.state = state; + } + + public FrozenIntSet(int num, int state) { +this.values = new int[] { num }; +this.state = state; +this.hashCode = 683 + num; Review comment: It's a shortcut for the other constructor; it uses the same hash calculation as SortedIntSet.calculateHash, but specialized down to a single value. I'll see if we can easily drop this code, since I agree that it adds complexity for maintainers. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9004) Approximate nearest vector search
[ https://issues.apache.org/jira/browse/LUCENE-9004?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019066#comment-17019066 ] Michael Sokolov commented on LUCENE-9004: - I'll second the thanks, [~jtibshirani]. There's clearly active work going on, and it may be too soon to declare a single winner in this complex space. I do think there is a need to focus on higher-dimensional cases, since Lucene already has well-developed support for dim <= 8 via KD-tree, but nothing for higher dimensions. One thing that surprises me a bit about some evaluations I'm seeing is that they report Precision@1 (and sometimes even when operating over the training set?!). I wonder if anyone has looked at a metric that includes the top 10 (say), and penalizes more distant matches? For example, MSE over normalized vectors would enable one to distinguish among results that have the same "precision" yet where one has vectors that are closer than the other. Re: deletions, yeah, we have not addressed that. The only thing that makes sense to me for deletions is to prune them while searching. TBH I'm not sure how to plumb livedocs into the query, or if this is somehow untenable? Supposing we do that, it would impose some operational constraints, in that if a lot of documents are deleted, performance will drop substantially, but I think that is probably OK. Users will just have to understand the limitation? We'll have to understand the impact as deletions accumulate. I think the issue about filtering against other queries is more challenging, since we typically don't have an up-front bitset to filter against. In a sense the ANN query is the most expensive, because *every* document is a potential match. Perhaps the thing to do is come up with an estimate of a radius R bounding the top K (around the query vector), based on the approximate top K we find, and then allow advancing to a document, even if it was not returned by graph search, so long as its distance is <= R. This would not truly answer the question "top K closest documents satisfying these constraints," though. For that I don't see what we could do other than forcing computation of a bitset and then passing that into the graph search (like for deletions). > Approximate nearest vector search > - > > Key: LUCENE-9004 > URL: https://issues.apache.org/jira/browse/LUCENE-9004 > Project: Lucene - Core > Issue Type: New Feature >Reporter: Michael Sokolov >Priority: Major > Attachments: hnsw_layered_graph.png > > Time Spent: 2.5h > Remaining Estimate: 0h > > "Semantic" search based on machine-learned vector "embeddings" representing > terms, queries and documents is becoming a must-have feature for a modern > search engine. SOLR-12890 is exploring various approaches to this, including > providing vector-based scoring functions. This is a spinoff issue from that. > The idea here is to explore approximate nearest-neighbor search. Researchers > have found that an approach based on navigating a graph that partially encodes the > nearest neighbor relation at multiple scales can provide accuracy > 95% (as > compared to exact nearest neighbor calculations) at a reasonable cost. This > issue will explore implementing HNSW (hierarchical navigable small-world) > graphs for the purpose of approximate nearest vector search (often referred > to as KNN or k-nearest-neighbor search). > At a high level the way this algorithm works is this.
First assume you have a > graph that has a partial encoding of the nearest neighbor relation, with some > short and some long-distance links. If this graph is built in the right way > (has the hierarchical navigable small world property), then you can > efficiently traverse it to find nearest neighbors (approximately) in log N > time where N is the number of nodes in the graph. I believe this idea was > pioneered in [1]. The great insight in that paper is that if you use the > graph search algorithm to find the K nearest neighbors of a new document > while indexing, and then link those neighbors (undirectedly, i.e. both ways) to > the new document, then the graph that emerges will have the desired > properties. > The implementation I propose for Lucene is as follows. We need two new data > structures to encode the vectors and the graph. We can encode vectors using a > light wrapper around {{BinaryDocValues}} (we also want to encode the vector > dimension and have efficient conversion from bytes to floats). For the graph > we can use {{SortedNumericDocValues}} where the values we encode are the > docids of the related documents. Encoding the interdocument relations using > docids directly will make it relatively fast to traverse the graph since we > won't need to look up through an id-field indirection. This
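On the deletions point above, "prune them while searching" could be as simple as checking each graph candidate against the segment's live-docs bitset before collecting it. The sketch below uses only stock Lucene APIs ({{LeafReader#getLiveDocs()}} returns a {{Bits}} whose set bits are live documents, or null when the segment has no deletions); the {{Candidate}} type is a made-up placeholder for whatever the graph search returns, not a committed API.
{code:java}
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.LeafReaderContext;
import org.apache.lucene.util.Bits;

class LiveDocsPruning {

  /** Hypothetical graph-search hit: a segment-local docid plus its distance. */
  static final class Candidate {
    final int docId;
    final float distance;

    Candidate(int docId, float distance) {
      this.docId = docId;
      this.distance = distance;
    }
  }

  /**
   * Drop hits pointing at deleted documents. getLiveDocs() returns null
   * when the segment has no deletions, so the common case stays cheap.
   */
  static List<Candidate> pruneDeleted(List<Candidate> approximateTopK, LeafReaderContext context) {
    Bits liveDocs = context.reader().getLiveDocs();
    if (liveDocs == null) {
      return approximateTopK; // no deletions in this segment
    }
    List<Candidate> live = new ArrayList<>(approximateTopK.size());
    for (Candidate c : approximateTopK) {
      if (liveDocs.get(c.docId)) {
        live.add(c);
      }
    }
    // May now hold fewer than K hits; a caller could widen the beam and retry.
    return live;
  }
}
{code}
As the comment notes, heavily deleted segments would yield fewer live hits per probe, so search cost rises as deletions accumulate.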
[GitHub] [lucene-solr] msokolov commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging
msokolov commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging URL: https://github.com/apache/lucene-solr/pull/1169#discussion_r368330855 ## File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java ## @@ -92,7 +108,277 @@ public void testSingleDocRecall() throws Exception { iw.commit(); assertConsistentGraph(iw, values); - assertRecall(dir, 0, values[0]); + assertRecall(dir, 1, values[0]); +} + } + + public void testDocsDeletionAndRecall() throws Exception { +/** + * {@code KnnExactVectorValueWeight} applies in-set (i.e. the query vector is exactly in the index) + * deletion strategy to filter all unmatched results searched by {@link org.apache.lucene.search.KnnGraphQuery.KnnExactVectorValueQuery}, + * and deletes at most ef*segmentCnt vectors that are the same to the specified queryVector. + */ +final class KnnExactVectorValueWeight extends ConstantScoreWeight { + private final String field; + private final ScoreMode scoreMode; + private final float[] queryVector; + private final int ef; + + KnnExactVectorValueWeight(Query query, float score, ScoreMode scoreMode, String field, float[] queryVector, int ef) { +super(query, score); +this.field = field; +this.scoreMode = scoreMode; +this.queryVector = queryVector; +this.ef = ef; + } + + /** + * Returns a {@link Scorer} which can iterate in order over all matching + * documents and assign them a score. + * + * NOTE: null can be returned if no documents will be scored by this + * query. + * + * NOTE: The returned {@link Scorer} does not have + * {@link LeafReader#getLiveDocs()} applied, they need to be checked on top. + * + * @param context the {@link LeafReaderContext} for which to return the {@link Scorer}. + * @return a {@link Scorer} which scores documents in/out-of order. + * @throws IOException if there is a low-level I/O error + */ + @Override + public Scorer scorer(LeafReaderContext context) throws IOException { +ScorerSupplier supplier = scorerSupplier(context); +if (supplier == null) { + return null; +} +return supplier.get(Long.MAX_VALUE); + } + + @Override + public ScorerSupplier scorerSupplier(LeafReaderContext context) throws IOException { +FieldInfo fi = context.reader().getFieldInfos().fieldInfo(field); +int numDimensions = fi.getVectorNumDimensions(); +if (numDimensions != queryVector.length) { + throw new IllegalArgumentException("field=\"" + field + "\" was indexed with dimensions=" + numDimensions + + "; this is incompatible with query dimensions=" + queryVector.length); +} + +final HNSWGraphReader hnswReader = new HNSWGraphReader(field, context); +final VectorValues vectorValues = context.reader().getVectorValues(field); +if (vectorValues == null) { + // No docs in this segment/field indexed any vector values + return null; +} + +final Weight weight = this; +return new ScorerSupplier() { + @Override + public Scorer get(long leadCost) throws IOException { +final Neighbors neighbors = hnswReader.searchNeighbors(queryVector, ef, vectorValues); + +if (neighbors.size() > 0) { + Neighbor top = neighbors.top(); + if (top.distance() > 0) { +neighbors.clear(); + } else { +final List toDeleteNeighbors = new ArrayList<>(neighbors.size()); Review comment: You are -- finding exact matches to the input vector, right? I don't understand what this has to do with deletion. I'm also unclear why we want to have an exact match query in the first place. What problem is it solving that we could not solve with a hashmap lookup? And ... 
it is implemented here in a test file. Is this supporting testing in some way? Thanks, I feel I must be missing some essential thing here... This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] dsmiley commented on issue #1166: SOLR-14040: shareSchema support for SolrCloud
dsmiley commented on issue #1166: SOLR-14040: shareSchema support for SolrCloud URL: https://github.com/apache/lucene-solr/pull/1166#issuecomment-576052694 Also, we should probably add some protections to prevent sharing of core-specific things. For example, if shareSchema=true, then we might want to log a warning if there are core-specific lib dirs or lib directives in solrconfig.xml. This is an old issue and less likely in a SolrCloud scenario. We might also want to block core-specific properties from being expanded. And use of SolrCoreAware when loading schema components ought to log a warning. This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-12325) introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet
[ https://issues.apache.org/jira/browse/SOLR-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Khludnev updated SOLR-12325: Attachment: SOLR-12325.patch Status: Open (was: Open) > introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet > -- > > Key: SOLR-12325 > URL: https://issues.apache.org/jira/browse/SOLR-12325 > Project: Solr > Issue Type: New Feature > Components: Facet Module >Reporter: Mikhail Khludnev >Priority: Major > Attachments: SOLR-12325.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > It might be a faster twin for {{uniqueBlock(\_root_)}}. Please utilise the built-in > query parsing method; don't invent your own. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (SOLR-12325) introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet
[ https://issues.apache.org/jira/browse/SOLR-12325?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Mikhail Khludnev updated SOLR-12325: Status: Patch Available (was: Open) > introduce uniqueBlockQuery(parent:true) aggregation for JSON Facet > -- > > Key: SOLR-12325 > URL: https://issues.apache.org/jira/browse/SOLR-12325 > Project: Solr > Issue Type: New Feature > Components: Facet Module >Reporter: Mikhail Khludnev >Priority: Major > Attachments: SOLR-12325.patch > > Time Spent: 1.5h > Remaining Estimate: 0h > > It might be a faster twin for {{uniqueBlock(\_root_)}}. Please utilise the built-in > query parsing method; don't invent your own. -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-9134) Port ant-regenerate tasks to Gradle build
[ https://issues.apache.org/jira/browse/LUCENE-9134?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Erick Erickson updated LUCENE-9134: --- Attachment: core_regen.patch Status: Open (was: Open) This adds two regenerate tasks for the lucene/core build file (plus cleans up a couple of nocommits from my PR). I decided to put up a patch rather than a PR because I'm getting confused about which is which; both are WIPs, but separate. This looks like it succeeds on createPackedIntSources and createLevAutomaton. I have not yet tried to deal with the jflex bits. NOTE: I had to munge the path in createLevAutomata.py (sys.path.insert); I'm not clear that this is the right thing to do... > Port ant-regenerate tasks to Gradle build > - > > Key: LUCENE-9134 > URL: https://issues.apache.org/jira/browse/LUCENE-9134 > Project: Lucene - Core > Issue Type: Sub-task >Reporter: Erick Erickson >Assignee: Erick Erickson >Priority: Major > Attachments: LUCENE-9134.patch, core_regen.patch, gen-kuromoji.patch > > Time Spent: 50m > Remaining Estimate: 0h > > Here are the "regenerate" targets I found in the ant version. There are a > couple that I don't have evidence for or against being rebuilt. > // Very top level > {code:java} > ./build.xml: > ./build.xml: failonerror="true"> > ./build.xml: depends="regenerate,-check-after-regeneration"/> > {code} > // top level Lucene. This includes the core/build.xml and > test-framework/build.xml files > {code:java} > ./lucene/build.xml: > ./lucene/build.xml: inheritall="false"> > ./lucene/build.xml: > {code} > // This one has quite a number of customizations to > {code:java} > ./lucene/core/build.xml: depends="createLevAutomata,createPackedIntSources,jflex"/> > {code} > // This one has a bunch of code modifications _after_ javacc is run on > certain of the > // output files. Save this one for last? > {code:java} > ./lucene/queryparser/build.xml: > {code} > // the files under ../lucene/analysis... are pretty self-contained. I expect > these could be done as a unit > {code:java} > ./lucene/analysis/build.xml: > ./lucene/analysis/build.xml: > ./lucene/analysis/common/build.xml: depends="jflex,unicode-data"/> > ./lucene/analysis/icu/build.xml: depends="gen-utr30-data-files,gennorm2,genrbbi"/> > ./lucene/analysis/kuromoji/build.xml: depends="build-dict"/> > ./lucene/analysis/nori/build.xml: depends="build-dict"/> > ./lucene/analysis/opennlp/build.xml: depends="train-test-models"/> > {code} > > // These _are_ regenerated from the top-level regenerate target, but for -- > LUCENE-9080//the changes were only in imports so there are no > //corresponding files checked in in that JIRA > {code:java} > ./lucene/expressions/build.xml: depends="run-antlr"/> > {code} > // Apparently unrelated to ./lucene/analysis/opennlp/build.xml > "train-test-models" target > // Apparently not rebuilt from the top level, but _are_ regenerated when > executed from > // ./solr/contrib/langid > {code:java} > ./solr/contrib/langid/build.xml: depends="train-test-models"/> > {code} > -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17019012#comment-17019012 ] Michael Froh commented on LUCENE-8962: -- Here's a before-and-after comparison of the average number of segments searched per request since I applied this change (with a TieredMergePolicy subclass that tries to merge all segments smaller than 100MB into a single segment on commit, with floorSegmentMB of 500). It lowers the overall count and, in particular, significantly reduces the variance. !LUCENE-8962_demo.png! > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Attachments: LUCENE-8962_demo.png > > Time Spent: 1h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will write many small segments during {{refresh}}, and this then > adds search-time cost, as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter}}'s > refresh to optionally kick off the merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
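For readers trying to picture the setup, a sketch of such a policy might look like the following. It leans on the merge-on-commit hook ({{MergePolicy#findFullFlushMerges}}) that this issue is working toward, so treat it as an illustration of the idea rather than code that compiles against any released Lucene; the 100MB threshold simply mirrors the numbers above.
{code:java}
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.index.MergeTrigger;
import org.apache.lucene.index.SegmentCommitInfo;
import org.apache.lucene.index.SegmentInfos;
import org.apache.lucene.index.TieredMergePolicy;

/**
 * On each full flush (commit/refresh), collapse every segment under ~100MB
 * into a single segment so the new reader sees fewer tiny segments.
 */
class SmallSegmentCollapsingMergePolicy extends TieredMergePolicy {
  private static final long SMALL_SEGMENT_BYTES = 100L * 1024 * 1024;

  @Override
  public MergeSpecification findFullFlushMerges(MergeTrigger mergeTrigger,
                                                SegmentInfos segmentInfos,
                                                MergeContext mergeContext) throws IOException {
    List<SegmentCommitInfo> smallSegments = new ArrayList<>();
    for (SegmentCommitInfo info : segmentInfos) {
      // size() discounts deleted docs; skip segments already being merged elsewhere
      if (!mergeContext.getMergingSegments().contains(info)
          && size(info, mergeContext) < SMALL_SEGMENT_BYTES) {
        smallSegments.add(info);
      }
    }
    if (smallSegments.size() < 2) {
      return null; // nothing worth coalescing before the reader opens
    }
    MergeSpecification spec = new MergeSpecification();
    spec.add(new OneMerge(smallSegments));
    return spec;
  }
}
{code}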
[jira] [Updated] (LUCENE-8962) Can we merge small segments during refresh, for faster searching?
[ https://issues.apache.org/jira/browse/LUCENE-8962?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael Froh updated LUCENE-8962: - Attachment: LUCENE-8962_demo.png > Can we merge small segments during refresh, for faster searching? > - > > Key: LUCENE-8962 > URL: https://issues.apache.org/jira/browse/LUCENE-8962 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Michael McCandless >Priority: Major > Attachments: LUCENE-8962_demo.png > > Time Spent: 1h 40m > Remaining Estimate: 0h > > With near-real-time search we ask {{IndexWriter}} to write all in-memory > segments to disk and open an {{IndexReader}} to search them, and this is > typically a quick operation. > However, when you use many threads for concurrent indexing, {{IndexWriter}} > will write many small segments during {{refresh}}, and this then > adds search-time cost, as searching must visit all of these tiny segments. > The merge policy would normally quickly coalesce these small segments if > given a little time ... so, could we somehow improve {{IndexWriter}}'s > refresh to optionally kick off the merge policy to merge segments below some > threshold before opening the near-real-time reader? It'd be a bit tricky > because while we are waiting for merges, indexing may continue, and new > segments may be flushed, but those new segments shouldn't be included in the > point-in-time segments returned by refresh ... > One could almost do this on top of Lucene today, with a custom merge policy, > and some hackity logic to have the merge policy target small segments just > written by refresh, but it's tricky to then open a near-real-time reader, > excluding newly flushed but including newly merged segments since the refresh > originally finished ... > I'm not yet sure how best to solve this, so I wanted to open an issue for > discussion! -- This message was sent by Atlassian Jira (v8.3.4#803005) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] mocobeta commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging
mocobeta commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging URL: https://github.com/apache/lucene-solr/pull/1169#discussion_r368289325 ## File path: lucene/core/src/java/org/apache/lucene/search/KnnScoreWeight.java ## @@ -18,6 +18,8 @@ package org.apache.lucene.search; import java.io.IOException; +import java.util.ArrayList; +import java.util.List; Review comment: Can you please remove those unused imports? This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] mocobeta commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging
mocobeta commented on a change in pull request #1169: LUCENE-9004: A minor feature and patch -- support deleting vector values and fix segments merging URL: https://github.com/apache/lucene-solr/pull/1169#discussion_r368289244 ## File path: lucene/core/src/test/org/apache/lucene/index/TestKnnGraph.java ## @@ -92,7 +108,277 @@ public void testSingleDocRecall() throws Exception { iw.commit(); assertConsistentGraph(iw, values); - assertRecall(dir, 0, values[0]); + assertRecall(dir, 1, values[0]); +} + } + + public void testDocsDeletionAndRecall() throws Exception { +/** + * {@code KnnExactVectorValueWeight} applies in-set (i.e. the query vector is exactly in the index) + * deletion strategy to filter all unmatched results searched by {@link org.apache.lucene.search.KnnGraphQuery.KnnExactVectorValueQuery}, + * and deletes at most ef*segmentCnt vectors that are the same to the specified queryVector. + */ +final class KnnExactVectorValueWeight extends ConstantScoreWeight { Review comment: Thanks, it looks almost okay to me but the Weight and Query classes can be (and should be) *static* classes. It would look like this: ``` public class TestKnnGraph extends LuceneTestCase { private static final class KnnExactVectorValueWeight extends ConstantScoreWeight { } private static final class KnnExactVectorValueQuery extends Query { } public void testDocsDeletionAndRecall() throws Exception { Query query = new KnnExactVectorValueQuery(...); } } ``` Please avoid *non-static* inner classes whenever you can do so, because they consume extra memory and object references ;) This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. For queries about this service, please contact Infrastructure at: us...@infra.apache.org With regards, Apache Git Services - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
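To spell out the memory point with a generic example (not from the patch): a non-static inner class instance carries a hidden `Outer.this` reference, which keeps the entire enclosing object reachable for as long as the inner instance lives.

```java
public class Outer {
  private final byte[] big = new byte[1 << 20]; // 1 MB kept alive by any Inner

  // Non-static inner class: each instance stores an implicit Outer.this
  // reference, so 'big' cannot be collected while an Inner instance lives.
  class Inner {}

  // Static nested class: no hidden reference to the enclosing instance.
  static class Nested {}

  public static void main(String[] args) {
    Inner inner = new Outer().new Inner(); // pins the Outer (and its 1 MB)
    Nested nested = new Nested();          // pins nothing extra
    System.out.println(inner + " " + nested);
  }
}
```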