[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-19 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17348046#comment-17348046
 ] 

Zach Chen commented on LUCENE-9335:
---

{quote}Actually this matches my expectation. BMM and BMW differ in that BMM 
only makes a decision about which scorers lead iteration once per block, while 
BMW needs to make decisions on every document. So BMM collects more documents 
than BMW but BMW takes the risk that trying to be too smart makes things slower 
than a simpler approach.
{quote}
Ok I also took a further look at the TopDocsCollector code, and confirmed that 
I had an incorrect understanding of "collect" and "hit count" here earlier. 
This (and Michael's earlier response) totally makes sense now!
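For anyone following along, the difference in decision frequency can be sketched with a toy pruner (this is an illustrative model, not Lucene's actual BMM/BMW code; the block layout and scoring here are invented): a block-level strategy makes one skip/collect decision per postings block by comparing the block's precomputed max score against the current k-th best score.

```java
import java.util.*;

class BlockMaxSketch {
  /** One toy postings block: scores plus the block's precomputed max score. */
  static final class Block {
    final double[] scores;
    final double maxScore;
    Block(double... scores) {
      this.scores = scores;
      double m = 0;
      for (double s : scores) m = Math.max(m, s);
      this.maxScore = m;
    }
  }

  /** Top-k scores over all blocks, skipping a whole block with a single
   *  comparison when its maxScore cannot beat the current k-th best. */
  static double[] topK(List<Block> blocks, int k) {
    PriorityQueue<Double> heap = new PriorityQueue<>(); // min-heap of the best k
    for (Block b : blocks) {
      if (heap.size() == k && b.maxScore <= heap.peek()) continue; // one decision per block
      for (double s : b.scores) {
        if (heap.size() < k) heap.add(s);
        else if (s > heap.peek()) { heap.poll(); heap.add(s); }
      }
    }
    double[] out = new double[heap.size()];
    for (int i = out.length - 1; i >= 0; i--) out[i] = heap.poll();
    return out;
  }
}
```

A per-document strategy (BMW-style) would instead re-evaluate upper bounds at every candidate document: fewer collected documents, but more bookkeeping per document, which is the trade-off described above.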
{quote}Yes. You can download the "Collection" and "Queries" files from 
[https://microsoft.github.io/msmarco/#ranking] (make sure to accept terms at 
the top first so that download links are active).
{quote}
Thanks! I was able to download them. Will explore a bit more to see how they 
can be improved further.

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, 
> wikimedium.10M.nostopwords.tasks.5OrMeds
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more. I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.
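To illustrate the idea (a toy sketch, not Lucene's actual Scorer/BulkScorer API; the types below are invented): a bulk scorer is handed a whole doc-id window and can run one tight loop over it, amortizing decisions that a doc-at-a-time scorer would repeat for every document.

```java
import java.util.ArrayList;
import java.util.List;

class BulkScorerSketch {
  interface LeafCollector { void collect(int doc, float score); }

  /** Toy postings: parallel sorted doc-id / score arrays. */
  static final class Postings {
    final int[] docs;
    final float[] scores;
    Postings(int[] docs, float[] scores) { this.docs = docs; this.scores = scores; }
  }

  /** Bulk scoring: one call per window [min, max), one tight loop inside,
   *  instead of a nextDoc()/score() call pair per document. */
  static void scoreWindow(Postings p, LeafCollector c, int min, int max) {
    for (int i = 0; i < p.docs.length; i++) {
      int d = p.docs[i];
      if (d >= max) break;               // past the window: stop early
      if (d >= min) c.collect(d, p.scores[i]);
    }
  }
}
```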



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-9966) SynonymGraphFilter cannot consume a graph token stream

2021-05-19 Thread Geoffrey Lawson (Jira)
Geoffrey Lawson created LUCENE-9966:
---

 Summary: SynonymGraphFilter cannot consume a graph token stream
 Key: LUCENE-9966
 URL: https://issues.apache.org/jira/browse/LUCENE-9966
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/analysis
Reporter: Geoffrey Lawson


SynonymGraphFilter cannot take a graph token stream as input; this limitation 
is documented in the SynonymGraphFilter javadocs. Multiple tickets report the 
issue:
 LUCENE-8985
 LUCENE-9123
 LUCENE-9173

Some tickets have proposed fixes.
 LUCENE-5012
 LUCENE-8985

From this [email 
chain|https://lists.apache.org/thread.html/rcef29279940bd8fcc0b6f2165cd25fa8c1230dc4252f6dafbe9bbc6a%40%3Cjava-user.lucene.apache.org%3E] 
we would like to get SynonymGraphFilter correctly consuming graph token 
streams.






[GitHub] [lucene] jtibshirani commented on a change in pull request #144: LUCENE-9965: Add tooling to introspect query execution time

2021-05-19 Thread GitBox


jtibshirani commented on a change in pull request #144:
URL: https://github.com/apache/lucene/pull/144#discussion_r635628013



##
File path: 
lucene/sandbox/src/test/org/apache/lucene/sandbox/queries/profile/TestProfileQuery.java
##
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.queries.profile;
+
+import static org.hamcrest.Matchers.equalTo;
+import static org.hamcrest.Matchers.greaterThan;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field.Store;
+import org.apache.lucene.document.StringField;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.RandomIndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.LRUQueryCache;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.RandomApproximationQuery;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.TotalHitCountCollector;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util.TestUtil;
+import org.hamcrest.MatcherAssert;
+import org.junit.After;
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+
+public class TestProfileQuery extends LuceneTestCase {

Review comment:
   Small comment, `TestProfileIndexSearcher` could be a clearer name.

##
File path: 
lucene/sandbox/src/test/org/apache/lucene/sandbox/queries/profile/TestProfileQuery.java
##
@@ -0,0 +1,217 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.queries.profile;
+
+import static org.hamcrest.Matchers.equalTo;
+import static org.hamcrest.Matchers.greaterThan;
+
+import java.io.IOException;
+import java.util.List;
+import java.util.Map;
+import org.apache.lucene.document.Document;
+import org.apache.lucene.document.Field.Store;
+import org.apache.lucene.document.StringField;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.index.RandomIndexWriter;
+import org.apache.lucene.index.Term;
+import org.apache.lucene.search.LRUQueryCache;
+import org.apache.lucene.search.LeafCollector;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.RandomApproximationQuery;
+import org.apache.lucene.search.Sort;
+import org.apache.lucene.search.TermQuery;
+import org.apache.lucene.search.TotalHitCountCollector;
+import org.apache.lucene.store.Directory;
+import org.apache.lucene.util.IOUtils;
+import org.apache.lucene.util.LuceneTestCase;
+import org.apache.lucene.util.TestUtil;
+import org.hamcrest.MatcherAssert;
+import org.junit.After;
+import org.junit.AfterClass;
+import org.junit.BeforeClass;
+
+public class TestProfileQuery extends LuceneTestCase {
+
+  private static Directory dir;
+  private static IndexReader reader;
+  private static ProfileIndexSearcher searcher;
+
+  @BeforeClass
+  public static void setup() throws IOException {
+dir = newDirectory();
+RandomIndexWriter w = new RandomIndexWriter(random(), dir);
+final int numDocs = TestUtil.nextInt(random(), 1, 20);
+for (int i = 0; i < numDocs; ++i) {
+  final int numHoles = random().nextInt(5);
+  for (int j = 0; j < numHoles; ++j) {
+ 

[GitHub] [lucene] dnhatn commented on a change in pull request #147: LUCENE-9827: Update backward codec in Lucene 9.0

2021-05-19 Thread GitBox


dnhatn commented on a change in pull request #147:
URL: https://github.com/apache/lucene/pull/147#discussion_r635554782



##
File path: 
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene50/compressing/Lucene50CompressingStoredFieldsReader.java
##
@@ -231,14 +238,45 @@ public Lucene50CompressingStoredFieldsReader(
   this.maxPointer = maxPointer;
   this.indexReader = indexReader;
 
-  if (version >= VERSION_META) {
+  if (version >= VERSION_NUM_CHUNKS) {
+numChunks = metaIn.readVLong();
 numDirtyChunks = metaIn.readVLong();
 numDirtyDocs = metaIn.readVLong();

Review comment:
   We are still using bulk merges in StoredField/TermVectors 
[writers](https://github.com/apache/lucene/pull/147/files#diff-6effde18170f5ffa4be913fc312b5852cb86c000a26aaac1e7076eef4e729f54L672-L673)
 in tests. I have removed the optimized merges from these writers in 
https://github.com/apache/lucene/pull/147/commits/150c1c16bfbd7d26043e523047c290f224aafd14.
 I think we just need to keep bare writers to verify the corresponding readers.




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] HoustonPutman closed pull request #2292: SOLR-15129: Use Solr distribution TGZ as docker context

2021-05-19 Thread GitBox


HoustonPutman closed pull request #2292:
URL: https://github.com/apache/lucene-solr/pull/2292


   





[GitHub] [lucene] jpountz commented on a change in pull request #144: LUCENE-9965: Add tooling to introspect query execution time

2021-05-19 Thread GitBox


jpountz commented on a change in pull request #144:
URL: https://github.com/apache/lucene/pull/144#discussion_r635308121



##
File path: 
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/profile/AbstractInternalProfileTree.java
##
@@ -0,0 +1,185 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.queries.profile;

Review comment:
   We use the `queries` package for Query implementations in other modules, 
but this functionality is more something that introspects query execution, so 
I'd rather put it in a `search` package, e.g.
   ```suggestion
   package org.apache.lucene.sandbox.search;
   ```

##
File path: 
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/profile/AbstractProfiler.java
##
@@ -0,0 +1,51 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.queries.profile;
+
+import java.util.List;
+
+/**
+ * This class acts as storage for a profile tree. This class may be extended
+ * to define what the profile contains and how it is broken into different pieces.
+ */
+public class AbstractProfiler, E> {

Review comment:
   likewise here, this class has a single implementation so let's merge it 
with its only implementation?

##
File path: 
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/profile/ProfileIndexSearcher.java
##
@@ -0,0 +1,79 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+
+package org.apache.lucene.sandbox.queries.profile;
+
+import java.io.IOException;
+import org.apache.lucene.index.IndexReader;
+import org.apache.lucene.search.IndexSearcher;
+import org.apache.lucene.search.Query;
+import org.apache.lucene.search.ScoreMode;
+import org.apache.lucene.search.Weight;
+
+/**
+ * A simple extension of {@link IndexSearcher} to add a {@link QueryProfiler} 
that can be set to
+ * test query timings.
+ */

Review comment:
   Add an example of how it may be used in the javadocs?

##
File path: 
lucene/sandbox/src/java/org/apache/lucene/sandbox/queries/profile/AbstractInternalProfileTree.java
##
@@ -0,0 +1,185 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements.  See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License.  You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, 

[GitHub] [lucene] jpountz commented on a change in pull request #147: LUCENE-9827: Update backward codec in Lucene 9.0

2021-05-19 Thread GitBox


jpountz commented on a change in pull request #147:
URL: https://github.com/apache/lucene/pull/147#discussion_r635443199



##
File path: 
lucene/backward-codecs/src/java/org/apache/lucene/backward_codecs/lucene50/compressing/Lucene50CompressingStoredFieldsReader.java
##
@@ -231,14 +238,45 @@ public Lucene50CompressingStoredFieldsReader(
   this.maxPointer = maxPointer;
   this.indexReader = indexReader;
 
-  if (version >= VERSION_META) {
+  if (version >= VERSION_NUM_CHUNKS) {
+numChunks = metaIn.readVLong();
 numDirtyChunks = metaIn.readVLong();
 numDirtyDocs = metaIn.readVLong();

Review comment:
   since we won't do bulk merges anyway, maybe we could do like in the 
`else` block and just consume `metaIn` without setting `numXXX`?







[GitHub] [lucene] dnhatn opened a new pull request #147: LUCENE-9827: Update backward codec in Lucene 9.0

2021-05-19 Thread GitBox


dnhatn opened a new pull request #147:
URL: https://github.com/apache/lucene/pull/147


   We need to update the reading logic of the backward codec in Lucene 9 for 
[LUCENE-9827](https://github.com/apache/lucene-solr/pull/2495) and 
[LUCENE-9935](https://github.com/apache/lucene-solr/pull/2494) as we have 
backported them to Lucene 8.
   
   Relates https://github.com/apache/lucene-solr/pull/2495
   Relates https://github.com/apache/lucene-solr/pull/2494





[jira] [Commented] (LUCENE-9936) update gradle build to support gpg signing of tgz/zip distributions

2021-05-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9936?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347744#comment-17347744
 ] 

ASF subversion and git services commented on LUCENE-9936:
-

Commit f919672647341a9dbe66f0e35a4b930234ee96e9 in lucene's branch 
refs/heads/main from Houston Putman
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f919672 ]

LUCENE-9936: Add Gpg Signing help info to gradle help command


> update gradle build to support gpg signing of tgz/zip distributions
> ---
>
> Key: LUCENE-9936
> URL: https://issues.apache.org/jira/browse/LUCENE-9936
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Chris M. Hostetter
>Assignee: Chris M. Hostetter
>Priority: Major
> Fix For: main (9.0)
>
> Attachments: LUCENE-9936.patch, LUCENE-9936.patch
>
>
> The gradle build does not currently have any support for gpg signing the 
> distributions we produce.
> This is necessary for releases.






[GitHub] [lucene] gsmiller commented on pull request #143: LUCENE-9962: Allow DrillSideways sub-classes to provide their own "drill down" facet counting implementation (or null).

2021-05-19 Thread GitBox


gsmiller commented on pull request #143:
URL: https://github.com/apache/lucene/pull/143#issuecomment-844179797


   @mikemccand as for the gradle check, I think something was busted 
temporarily. I noticed a number of failing workflows during that time. It's now 
passed for this.





[GitHub] [lucene] msokolov commented on a change in pull request #146: LUCENE-9963 Add tests for alternate path failures in FlattenGraphFilter

2021-05-19 Thread GitBox


msokolov commented on a change in pull request #146:
URL: https://github.com/apache/lucene/pull/146#discussion_r635290159



##
File path: 
lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestFlattenGraphFilter.java
##
@@ -314,5 +314,116 @@ public void testTwoLongParallelPaths() throws Exception {
 11);
   }
 
+  // The end node the long path is supposed to flatten over doesn't exist
+@AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/LUCENE-9963")
+  public void testAltPathFirstStepHole() throws Exception {
+TokenStream in =
+new CannedTokenStream(
+0,
+3,
+new Token[] {token("abc", 1, 3, 0, 3), token("b", 1, 1, 1, 2), token("c", 1, 1, 2, 3)});
+
+TokenStream out = new FlattenGraphFilter(in);
+
+assertTokenStreamContents(

Review comment:
   I'm curious to know what happens currently - how does the test fail? 
Could you maybe add a comment saying what the bad results are in each of these 
cases? Hmm I see you did in some cases below.
   
   This stuff is crazy? TBH I am not all that familiar with how 
FlattenGraphFilter works, but I wonder if we could express some invariants 
cleanly and write a randomized test? There seem to be a lot of different edge 
cases.

##
File path: 
lucene/analysis/common/src/test/org/apache/lucene/analysis/core/TestFlattenGraphFilter.java
##
@@ -314,5 +314,116 @@ public void testTwoLongParallelPaths() throws Exception {
 11);
   }
 
+  // The end node the long path is supposed to flatten over doesn't exist
+@AwaitsFix(bugUrl = "https://issues.apache.org/jira/browse/LUCENE-9963")
+  public void testAltPathFirstStepHole() throws Exception {
+TokenStream in =
+new CannedTokenStream(
+0,
+3,
+new Token[] {token("abc", 1, 3, 0, 3), token("b", 1, 1, 1, 2), token("c", 1, 1, 2, 3)});
+
+TokenStream out = new FlattenGraphFilter(in);
+
+assertTokenStreamContents(
+out,
+new String[] {"abc", "b", "c"},
+new int[] {0, 1, 2},
+new int[] {3, 2, 3},
+new int[] {1, 1, 1},
+new int[] {3, 1, 1},
+3);
+  }
+  // The last node in an alt path releases the long path, but it doesn't exist in this graph

Review comment:
   nit: add a blank line







[jira] [Updated] (LUCENE-9957) Use DirectMonotonicWriter to store sorted Values in NumericDocValues/SortedNumericDocValues

2021-05-19 Thread Lu Xugang (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-9957?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Lu Xugang updated LUCENE-9957:
--
Affects Version/s: (was: 8.8.2)
   main (9.0)

> Use DirectMonotonicWriter to store sorted Values in 
> NumericDocValues/SortedNumericDocValues
> ---
>
> Key: LUCENE-9957
> URL: https://issues.apache.org/jira/browse/LUCENE-9957
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Affects Versions: main (9.0)
>Reporter: Lu Xugang
>Priority: Major
> Attachments: image-2021-05-15-02-03-06-167.png, 
> image-2021-05-15-02-04-09-085.png, image-2021-05-15-02-04-43-405.png
>
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> When all values are sorted, using DirectMonotonicWriter to store them can 
> achieve relatively impressive compression
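The core trick behind DirectMonotonicWriter can be sketched independently of Lucene's codec API (the class below is a standalone toy, not the real writer): fit a straight line through the sorted values and store only the small residuals, which need far fewer bits than the raw values.

```java
/** Toy monotonic encoding: sorted values are stored as residuals from a
 *  linear model (min + i * slope), the idea behind DirectMonotonicWriter. */
class MonotonicSketch {
  final long min;
  final float slope;
  final long[] residuals;

  MonotonicSketch(long[] sorted) {
    int n = sorted.length;
    min = sorted[0];
    slope = n > 1 ? (float) (sorted[n - 1] - min) / (n - 1) : 0f;
    residuals = new long[n];
    for (int i = 0; i < n; i++) {
      // residuals stay small when the values grow roughly linearly
      residuals[i] = sorted[i] - min - (long) (slope * i);
    }
  }

  /** Exact reconstruction of the i-th original value (the same (long) cast
   *  is applied on both sides, so rounding cancels out). */
  long get(int i) {
    return min + (long) (slope * i) + residuals[i];
  }
}
```

The real writer additionally splits values into blocks and bit-packs the residuals, but the residual-from-a-line step is where the compression comes from.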






[jira] [Commented] (LUCENE-9963) Flatten graph filter has errors when there are holes at beginning or end of alternate paths

2021-05-19 Thread Geoffrey Lawson (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9963?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17347615#comment-17347615
 ] 

Geoffrey Lawson commented on LUCENE-9963:
-

I see three issues that need resolution.

1) When there is a hole at the beginning of an alternate path, the long path 
doesn't have a node set up to end on after flattening. There already has to be 
some hole recovery during the alternate path, so we should be able to address 
the recovered output node correctly so the long path can find it when it flattens.

2) The last node in an alternate path is what triggers the long path to give up 
its pointer to the input from the output. If it's not there, tokens that start 
from the long path's output node in the input will try to start at its output 
node in the output. This can result in out-of-order tokens and errors. When the 
token after both paths gets added, I think it should start at the frontier. If 
it doesn't, it should release the edge that brought it to the current node. This 
one seems the trickiest to fix.

3) Similar to issue 2, but instead of another token coming in to trigger the 
hole resolution, the token stream ends. The output graph is mostly correct, but 
while releasing tokens the filter will expect tokens that don't exist and 
error. We can identify these as holes and not output any tokens.

I've got a change that addresses these problems. I'm not thrilled with the fix 
for issue 2 and I want to add more unit tests to verify it's working as 
intended. I'll post a separate PR for the fix so we can get these tests in 
first.

> Flatten graph filter has errors when there are holes at beginning or end of 
> alternate paths
> ---
>
> Key: LUCENE-9963
> URL: https://issues.apache.org/jira/browse/LUCENE-9963
> Project: Lucene - Core
>  Issue Type: Bug
>  Components: modules/analysis
>Affects Versions: 8.8
>Reporter: Geoffrey Lawson
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> If asserts are enabled, having gaps at the beginning or end of an alternate 
> path can result in assertion errors
> ex: 
>  
> {code:java}
> java.lang.AssertionError: 2
> at  
> org.apache.lucene.analysis.core.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:195)
> {code}
>  
> Or
>  
> {code:java}
> java.lang.AssertionError
> at 
> org.apache.lucene.analysis.core.FlattenGraphFilter.releaseBufferedToken(FlattenGraphFilter.java:191)
> {code}
>  
>  
> If asserts are not enabled, the same conditions will result in either 
> IndexOutOfBoundsExceptions or dropped tokens.
>  
> {code:java}
> java.lang.ArrayIndexOutOfBoundsException: Index -2 out of bounds for length 8
> at org.apache.lucene.util.RollingBuffer.get(RollingBuffer.java:109)
> at 
> org.apache.lucene.analysis.core.FlattenGraphFilter.incrementToken(FlattenGraphFilter.java:325)
> {code}
>  
> These issues can be recreated with the following unit tests
> {code:java}
> public void testAltPathFirstStepHole() throws IOException {
>  TokenStream in = new CannedTokenStream(0, 3, new Token[]{
>  token("abc",1, 3, 0, 3),
>  token("b",1, 1, 1, 2),
>  token("c",1, 1, 2, 3)
>  });
>  TokenStream out = new FlattenGraphFilter(in);
>  assertTokenStreamContents(out,
>  new String[]{"abc", "b", "c"},
>  new int[] {0, 1, 2},
>  new int[] {3, 2, 3}, 
>  new int[] {1, 1, 1},
>  new int[] {3, 1, 1}, //token 0 may need to be len 1 after flattening
>  3);
> }{code}
> {code:java}
> public void testAltPathLastStepHole() throws IOException {
>  TokenStream in = new CannedTokenStream(0, 4, new Token[]{
>  token("abc",1, 3, 0, 3),
>  token("a",0, 1, 0, 1),
>  token("b",1, 1, 1, 2),
>  token("d",2, 1, 3, 4)
>  });
>  TokenStream out = new FlattenGraphFilter(in);
>  assertTokenStreamContents(out,
>  new String[]{"abc", "a", "b", "d"},
>  new int[] {0, 0, 1, 3},
>  new int[] {1, 1, 2, 4},
>  new int[] {1, 0, 1, 2},
>  new int[] {3, 1, 1, 1},
>  4);
> }{code}
> {code:java}
> public void testAltPathLastStepHoleWithoutEndToken() throws IOException {
>  TokenStream in = new CannedTokenStream(0, 2, new Token[]{
>  token("abc",1, 3, 0, 3),
>  token("a",0, 1, 0, 1),
>  token("b",1, 1, 1, 2)
>  });
>  TokenStream out = new FlattenGraphFilter(in);
>  assertTokenStreamContents(out,
>  new String[]{"abc", "a", "b"},
>  new int[] {0, 0, 1},
>  new int[] {1, 1, 2},
>  new int[] {1, 0, 1},
>  new int[] {1, 1, 1},
>  2);
> }{code}
> I believe LUCENE-8723 is a related issue, as it looks like the last token in 
> an alternate path is being deleted.






[GitHub] [lucene] glawson0 opened a new pull request #146: LUCENE-9963 Add tests for alternate path failures in FlattenGraphFilter

2021-05-19 Thread GitBox


glawson0 opened a new pull request #146:
URL: https://github.com/apache/lucene/pull/146


   
   
   
   # Description
   
   Add tests for alternate path failures in FlattenGraphFilter.
   
   Currently failing tests marked AwaitingFix so they will be ignored. Adding 
these tests before fix to allow others to also develop towards a fix.
   
   
   # Solution
   
   None
   
   # Tests
   
   Tests cover 3 problems that can arise for holes in alternate paths.
   1) A hole at the start of a path causes the long path to not flatten correctly.
   2) A hole at the end of a path causes the long path to not move its output 
point correctly.
   3) The token stream ends before the last token shows up to end the alternate 
path.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   





[GitHub] [lucene] gsmiller commented on a change in pull request #138: LUCENE-9956: Make getBaseQuery, getDrillDownQueries API from DrillDownQuery public

2021-05-19 Thread GitBox


gsmiller commented on a change in pull request #138:
URL: https://github.com/apache/lucene/pull/138#discussion_r635152354



##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java
##
@@ -170,15 +174,32 @@ private BooleanQuery getBooleanQuery() {
     return bq.build();
   }
 
-  Query getBaseQuery() {
+  /**
+   * Returns the internal baseQuery of the DrillDownQuery
+   *
+   * @return The baseQuery used on initialization of DrillDownQuery
+   */
+  public Query getBaseQuery() {
     return baseQuery;
   }
 
-  Query[] getDrillDownQueries() {
+  /**
+   * Returns the dimension queries added either via {@link #add(String, Query)} or {@link
+   * #add(String, String...)}
+   *
+   * @return The array of dimQueries
+   */
+  public Query[] getDrillDownQueries() {
+    if (isDimQueriesDirty == false) {
+      // returns previously built dimQueries
+      return builtDimQueries;
+    }
     Query[] dimQueries = new Query[this.dimQueries.size()];

Review comment:
   You shouldn't necessarily need to allocate a new array here, right? If 
you've previously built the queries (i.e., it's non-null), you should be able 
to use `ArrayUtil` to grow the array (if necessary) and then repopulate it 
directly. I suppose it depends a little on whether we want to guarantee the 
caller that the contents of the returned array won't change out from 
underneath them, but in this case I don't think that matters too much (though 
I would document it in the javadoc).
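   The reuse pattern suggested here can be sketched outside Lucene. Below is a hypothetical, dependency-free sketch (class and field names invented; `String` stands in for `Query`, and a plain size check stands in for `ArrayUtil`): reallocate only when the size changed, otherwise repopulate the cached array in place.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of the array-reuse pattern: keep the previously
// built array and only allocate a new one when the size has changed.
class DimQueryCache {
  private final List<String> dimQueries = new ArrayList<>(); // stand-in for List<Query>
  private String[] builtDimQueries; // cached result of the last build
  private boolean dirty = true;

  void add(String q) {
    dimQueries.add(q);
    dirty = true; // cached array is now stale
  }

  String[] getDrillDownQueries() {
    if (!dirty) {
      return builtDimQueries; // note: callers share this array
    }
    int size = dimQueries.size();
    if (builtDimQueries == null || builtDimQueries.length != size) {
      builtDimQueries = new String[size]; // reallocate only on size change
    }
    for (int i = 0; i < size; i++) {
      builtDimQueries[i] = dimQueries.get(i); // repopulate in place
    }
    dirty = false;
    return builtDimQueries;
  }
}
```

   As the review notes, returning the cached array means its contents can change under the caller on later mutations, which is worth documenting in the javadoc.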

##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java
##
@@ -170,11 +170,22 @@ private BooleanQuery getBooleanQuery() {
     return bq.build();
   }
 
-  Query getBaseQuery() {
+  /**
+   * Returns the internal baseQuery of the DrillDownQuery
+   *
+   * @return The baseQuery used on initialization of DrillDownQuery
+   */
+  public Query getBaseQuery() {
     return baseQuery;
   }
 
-  Query[] getDrillDownQueries() {
+  /**
+   * Returns the dimension queries added either via {@link #add(String, Query)} or {@link
+   * #add(String, String...)}
+   *
+   * @return The array of dimQueries
+   */
+  public Query[] getDrillDownQueries() {
     Query[] dimQueries = new Query[this.dimQueries.size()];

Review comment:
   Thanks for giving this optimization a shot @gautamworah96. Generally 
looks great! Left a couple comments on the approach.

##
File path: lucene/facet/src/java/org/apache/lucene/facet/DrillDownQuery.java
##
@@ -53,6 +53,8 @@ public static Term term(String field, String dim, String... path) {
   private final Query baseQuery;
   private final List<Query> dimQueries = new ArrayList<>();
   private final Map<String, Integer> drillDownDims = new LinkedHashMap<>();
+  private boolean isDimQueriesDirty = true;

Review comment:
   What about tracking the need to rebuild per dimension? If a user adds a 
large number of drill-downs but keeps modifying only a single one (e.g., 
adding additional terms) in between calls to `getDrillDowns`, we could do a 
fair amount of wasteful rebuilding. You could keep a `List<Boolean>` to track 
the status of each dimension and then only rebuild those that have changed.
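   A dependency-free sketch of this per-dimension dirty tracking (all names hypothetical; `String.join` stands in for real per-dimension query building):

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of per-dimension rebuild tracking: only dimensions
// flagged dirty are rebuilt on the next getDrillDowns() call.
class PerDimCache {
  private final List<List<String>> dimTerms = new ArrayList<>(); // terms per dimension
  private final List<Boolean> dirty = new ArrayList<>();
  private final List<String> built = new ArrayList<>(); // cached per-dim result
  int rebuilds = 0; // counter for illustration only

  void addTerm(int dim, String term) {
    while (dimTerms.size() <= dim) { // grow tracking lists to cover this dim
      dimTerms.add(new ArrayList<>());
      dirty.add(true);
      built.add(null);
    }
    dimTerms.get(dim).add(term);
    dirty.set(dim, true); // only this dimension needs a rebuild
  }

  List<String> getDrillDowns() {
    for (int i = 0; i < dimTerms.size(); i++) {
      if (dirty.get(i)) {
        built.set(i, String.join(" OR ", dimTerms.get(i))); // stand-in for query building
        dirty.set(i, false);
        rebuilds++;
      }
    }
    return built;
  }
}
```

   With N dimensions and only one modified between calls, this rebuilds one entry instead of N.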







[jira] [Commented] (LUCENE-9950) Support both single- and multi-value string fields in facet counting (non-taxonomy based approaches)

2021-05-19 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9950?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347596#comment-17347596
 ] 

ASF subversion and git services commented on LUCENE-9950:
-

Commit 5e99b8b4742f058e9ebee949ab16f8e6f1495fca in lucene-solr's branch 
refs/heads/branch_8x from Greg Miller
[ https://gitbox.apache.org/repos/asf?p=lucene-solr.git;h=5e99b8b ]

LUCENE-9950: New facet counting implementation for general string doc value 
fields (#2497)

Co-authored-by: Greg Miller 

> Support both single- and multi-value string fields in facet counting 
> (non-taxonomy based approaches)
> 
>
> Key: LUCENE-9950
> URL: https://issues.apache.org/jira/browse/LUCENE-9950
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (9.0)
>Reporter: Greg Miller
>Priority: Minor
>  Time Spent: 3h
>  Remaining Estimate: 0h
>
> Users wanting to facet count string-based fields using a non-taxonomy-based 
> approach can use {{SortedSetDocValueFacetCounts}}, which accumulates facet 
> counts based on a {{SortedSetDocValues}} field. This requires the stored doc 
> values to be multi-valued (i.e., {{SORTED_SET}}), and doesn't work on 
> single-valued fields (i.e., SORTED). In contrast, if a user wants to facet 
> count on a stored numeric field, they can use {{LongValueFacetCounts}}, which 
> supports both single- and multi-valued fields (and in LUCENE-9948, we now 
> auto-detect instead of asking the user to specify).
> Let's update {{SortedSetDocValueFacetCounts}} to also support, and 
> automatically detect single- and multi-value fields. Note that this is a 
> spin-off issue from LUCENE-9946, where [~rcmuir] points out that this can 
> essentially be a one-line change, but we may want to do some class renaming 
> at the same time. Also note that we should do this in 
> {{ConcurrentSortedSetDocValuesFacetCounts}} while we're at it.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[GitHub] [lucene-solr] mikemccand merged pull request #2497: LUCENE-9950: New facet counting implementation for general string doc value fields

2021-05-19 Thread GitBox


mikemccand merged pull request #2497:
URL: https://github.com/apache/lucene-solr/pull/2497


   





[GitHub] [lucene-solr] mikemccand commented on pull request #2497: LUCENE-9950: New facet counting implementation for general string doc value fields

2021-05-19 Thread GitBox


mikemccand commented on pull request #2497:
URL: https://github.com/apache/lucene-solr/pull/2497#issuecomment-844005185


   Thanks @gsmiller!





[GitHub] [lucene] mikemccand commented on pull request #145: LUCENE-9950: move changes entry for backport to 8.9

2021-05-19 Thread GitBox


mikemccand commented on pull request #145:
URL: https://github.com/apache/lucene/pull/145#issuecomment-843994587


   Thanks @gmiller!





[GitHub] [lucene] mikemccand merged pull request #145: LUCENE-9950: move changes entry for backport to 8.9

2021-05-19 Thread GitBox


mikemccand merged pull request #145:
URL: https://github.com/apache/lucene/pull/145


   





[GitHub] [lucene] karlwettin edited a comment on pull request #136: LUCENE-9589 Swedish Minimal Stemmer

2021-05-19 Thread GitBox


karlwettin edited a comment on pull request #136:
URL: https://github.com/apache/lucene/pull/136#issuecomment-843614395


   I gave the stemmer a spin on 
[SAOL](https://en.wikipedia.org/wiki/Svenska_Akademiens_ordlista) 13 (2006). I 
have to stay within the bounds of fair use and can't publish the complete 
results.
   
   Generally speaking I think it does a remarkable job with such a small 
decision tree. Given what it's meant to do, I would merge it.
   
   A few notes that are more applicable on a not so minimal implementation:
   
   The suffix-s pluralis rule has ~5300 exceptions where a word ending in s is 
nominative case singularis.
   
   It's also missing the rules defined in LUCENE-1515, especially the 'an' and 
'ans' suffixes. Back then I came to the conclusion that 8% of the Swedish 
language can be inflected that way, but there is a list of ~200 words that 
need to be set up as exceptions to those rules.
   
   Two standard an/ans-suffixes:
   
   | Stemmed| Original   |
   | - |:-:|
   ättiksgurk | ättiksgurka
   ättiksgurka | ättiksgurkan
   ättiksgurka | ättiksgurkans
   ättiksgurk | ättiksgurkas
   ättiksgurk | ättiksgurkor
   ättiksgurk | ättiksgurkorna
   ättiksgurk | ättiksgurkornas
   ättiksgurk | ättiksgurkors
   
   | Stemmed| Original   |
   | - |:-:|
   ättestup | ättestupa
   ättestupa | ättestupan
   ättestupa | ättestupans
   ättestup | ättestupas
   ättestup | ättestupor
   ättestup | ättestuporna
   ättestup | ättestupornas
   ättestup | ättestupors
   
   There are probably more complete and better examples of this in LUCENE-1515.
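   For illustration, the an/ans handling implied by the tables above could be sketched as a longest-suffix-first rule set. This is a hypothetical sketch derived only from the stemmed forms in the tables, not the actual Lucene stemmer:

```java
// Hypothetical sketch of the an/ans suffix rules implied by the tables:
// longest suffix wins; "an"/"ans" keep the trailing 'a' as the stem.
class SwedishAnSuffixSketch {
  static String stem(String w) {
    if (w.endsWith("ornas")) return w.substring(0, w.length() - 5);
    if (w.endsWith("orna"))  return w.substring(0, w.length() - 4);
    if (w.endsWith("ors"))   return w.substring(0, w.length() - 3);
    if (w.endsWith("ans"))   return w.substring(0, w.length() - 2); // keep 'a'
    if (w.endsWith("or"))    return w.substring(0, w.length() - 2);
    if (w.endsWith("an"))    return w.substring(0, w.length() - 1); // keep 'a'
    if (w.endsWith("as"))    return w.substring(0, w.length() - 2);
    if (w.endsWith("a"))     return w.substring(0, w.length() - 1);
    return w;
  }
}
```

   A real implementation would additionally consult the ~200-word exception list mentioned above before applying these rules.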
   
   And if I have to go looking for problems, I see these:
   
   | Stemmed| Original   |
   | - |:-:|
   höstmörk | höstmörker
   höstmörk | höstmörkers
   höstmörkr | höstmörkret
   höstmörkr | höstmörkrets
   
   | Stemmed| Original   |
   | - |:-:|
   höstkollektio | höstkollektion
   höstkollektion | höstkollektionen
   höstkollektion | höstkollektionens
   höstkollektion | höstkollektioner
   höstkollektion | höstkollektionerna
   höstkollektion | höstkollektionernas
   höstkollektion | höstkollektioners
   höstkollektio | höstkollektions
   
   This one is a number of different words with very different meanings that 
end up completely mixed together, though not all of them are nouns:
   
   | Stemmed| Original   |
   | - |:-:|
   hölj | hölj
   hölj | hölja
   hölja | höljan
   höljand | höljande
   hölja | höljans
   hölj | höljas
   höljd | höljd
   höljd | höljda
   höljd | höljde
   höljd | höljdes
   hölj | hölje
   hölj | höljen
   höljen | höljena
   höljen | höljenas
   hölj | höljens
   hölj | höljer
   hölj | höljes
   hölj | höljet
   hölj | höljets
   hölj | höljor
   hölj | höljorna
   hölj | höljornas
   hölj | höljors
   hölj | höljs
   höljt | höljt
   höljt | höljts
   
   I'm afraid it isn't possible to extract stemmer rules and exception lists 
from SAOL due to copyright issues (unless we find a digital copy that's at 
least 20 years old), but perhaps an alternative and more global route would be 
to mine [Wikidata:Lexicographical 
data](https://www.wikidata.org/wiki/Wikidata:Lexicographical_data)?
   
   https://www.wikidata.org/wiki/Lexeme:L38829





[jira] [Commented] (LUCENE-9958) Performance regression when a minimum number of matching SHOULD clauses is required

2021-05-19 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347412#comment-17347412
 ] 

Adrien Grand commented on LUCENE-9958:
--

To set expectations, some queries might still be slower than they were in older 
versions after this change. This is because BMW adds some overhead and might 
not always help skip enough documents to counterbalance that overhead. For 
instance, here is a benchmark of a baseline that doesn't do BMW (obtained by 
reverting LUCENE-9346) vs. main. Queries with a high number of required SHOULD 
clauses may still be slower.

{noformat}
Task        QPS baseline      StdDev   QPS patch      StdDev            Pct diff p-value
MSM6             85.71      (8.7%)        50.79      (2.3%)  -40.7% ( -47% -  -32%) 0.000
MSM5             28.38      (6.9%)        23.18      (2.3%)  -18.3% ( -25% -   -9%) 0.000
MSM7            200.58      (3.9%)       199.28      (3.6%)   -0.7% (  -7% -    7%) 0.580
MSM1             20.38      (2.7%)        20.55      (2.7%)    0.8% (  -4% -    6%) 0.351
PKLookup        231.96      (3.6%)       234.75      (3.6%)    1.2% (  -5% -    8%) 0.292
MSM4              8.48      (6.7%)        20.54      (6.5%)  142.1% ( 120% -  166%) 0.000
MSM3              2.95      (6.0%)        20.52     (19.9%)  595.8% ( 537% -  661%) 0.000
MSM2              1.92      (3.6%)        20.59     (27.4%)  970.5% ( 907% - 1038%) 0.000
{noformat}
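For context on what the MSM tasks measure: a minimum-should-match disjunction only matches a document when at least N of its SHOULD clauses match. A dependency-free sketch of that matching rule (names hypothetical; this is the semantics being benchmarked, not Lucene's WAND/BMW implementation):

```java
import java.util.List;
import java.util.Set;

// Dependency-free sketch of minimum-should-match semantics: a document
// matches when at least minShouldMatch of the query terms occur in it.
class MinShouldMatchSketch {
  static boolean matches(Set<String> docTerms, List<String> queryTerms, int minShouldMatch) {
    int hits = 0;
    for (String t : queryTerms) {
      if (docTerms.contains(t)) {
        hits++;
        if (hits >= minShouldMatch) {
          return true; // early exit once the threshold is reached
        }
      }
    }
    return false;
  }
}
```

In the benchmark above, MSM2 through MSM7 vary this threshold across the same disjunctions; dynamic pruning (BMW) can skip documents that cannot possibly reach the threshold, which is where both the speedups and the overhead come from.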

> Performance regression when a minimum number of matching SHOULD clauses is 
> required
> ---
>
> Key: LUCENE-9958
> URL: https://issues.apache.org/jira/browse/LUCENE-9958
> Project: Lucene - Core
>  Issue Type: Bug
>Reporter: Adrien Grand
>Priority: Minor
> Fix For: 8.9
>
>
> Opening this issue on behalf of [~mattweber], who reported this at 
> https://discuss.elastic.co/t/es-7-7-1-es-7-12-0-wand-performance-issue/272854.
> It looks like the fact that we introduced dynamic pruning for queries that 
> already have a minimum number of SHOULD clauses configured makes things 
> _slower_, at least in some cases.






[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-19 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347403#comment-17347403
 ] 

Adrien Grand commented on LUCENE-9335:
--

bq. I also notice that BMM BulkScorer collects roughly 10X the amount of docs 
compared with BMM scorer, which in turn also collects > 10X the amount of docs 
compared with BMW. I feel this may also explain the unexpected slow down? In 
general I would assume these scorers to all collect the same amount of top docs.

Actually this matches my expectation. BMM and BMW differ in that BMM only makes 
a decision about which scorers lead iteration once per block, while BMW needs 
to make decisions on every document. So BMM collects more documents than BMW 
but BMW takes the risk that trying to be too smart makes things slower than a 
simpler approach.
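The decision-frequency difference can be made concrete with a toy count (illustrative only; real scorers do far more work per decision than this sketch shows):

```java
// Toy illustration of the decision-frequency difference: a per-block
// strategy (BMM-like) re-picks lead scorers once per block of documents,
// while a per-document strategy (BMW-like) decides on every document.
class DecisionCountSketch {
  static int perBlockDecisions(int numDocs, int blockSize) {
    return (numDocs + blockSize - 1) / blockSize; // one decision per block (ceiling division)
  }

  static int perDocDecisions(int numDocs) {
    return numDocs; // one decision per document considered
  }
}
```

With 10,000 docs and a block size of 128, the per-block strategy makes ~79 decisions against 10,000 for the per-document one; the trade-off is that each per-block decision is coarser, so more documents get collected.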

bq.  Are these passages data set and queries used available for download 
somewhere

Yes. You can download the "Collection" and "Queries" files from 
https://microsoft.github.io/msmarco/#ranking (make sure to accept terms at the 
top first so that download links are active).

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, 
> wikimedium.10M.nostopwords.tasks.5OrMeds
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.






[GitHub] [lucene] balmukundblr commented on a change in pull request #132: Parallel processing

2021-05-19 Thread GitBox


balmukundblr commented on a change in pull request #132:
URL: https://github.com/apache/lucene/pull/132#discussion_r634992701



##
File path: 
lucene/benchmark/src/java/org/apache/lucene/benchmark/byTask/feeds/ReutersContentSource.java
##
@@ -102,19 +103,33 @@ public void close() throws IOException {
   public DocData getNextDocData(DocData docData) throws NoMoreDataException, IOException {
     Path f = null;
     String name = null;
-    synchronized (this) {
-      if (nextFile >= inputFiles.size()) {
-        // exhausted files, start a new round, unless forever set to false.
-        if (!forever) {
-          throw new NoMoreDataException();
-        }
-        nextFile = 0;
-        iteration++;
-      }
-      f = inputFiles.get(nextFile++);
-      name = f.toRealPath() + "_" + iteration;
+    int inputFilesSize = inputFiles.size();
+
+    if (threadIndexCreated == false) {
+      createThreadIndex();
+    }
+
+    // Getting file index value which is set for each thread
+    int index = Integer.parseInt(Thread.currentThread().getName().substring(12));

Review comment:
   - Yes, TaskSequence.java is the only place where new Index threads are 
created.
   
   - We want to ensure that the names of Index threads follow a guaranteed 
sequence, so we explicitly set the thread names in TaskSequence.java. Each 
thread name follows the pattern "IndexThread-<index>", where <index> is an 
integer, so it is safe to parse the thread name to an int.
   We'll also add the necessary comments in ReutersContentSource.java.
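   The naming convention described above can be sketched as follows; note that `"IndexThread-".length()` is 12, which is where the `substring(12)` in the patch comes from (class name hypothetical):

```java
// Sketch of the thread-name convention: worker threads are named
// "IndexThread-<n>" and the numeric suffix is parsed back as the index.
class ThreadIndexSketch {
  static final String PREFIX = "IndexThread-"; // PREFIX.length() == 12

  static int parseIndex(String threadName) {
    // strip the fixed prefix and parse the remaining digits
    return Integer.parseInt(threadName.substring(PREFIX.length()));
  }
}
```

   Using `PREFIX.length()` instead of a bare `12` would keep the parsing robust if the prefix ever changes.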







[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: [WIP] CheckIndex should be concurrent

2021-05-19 Thread GitBox


zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-843836186


   > > ./gradlew check -Ptests.nightly=true -Pvalidation.git.failOnModified=false
   >
   > Hi Zach. You can also "git add -A ." (stage your changes for commit); or 
just commit them in. Then there's no need for the fail-on-modified flag to be 
turned off. :)
   
   Ha, yes, I came to realize that some time ago too, but I've kind of formed 
the habit of passing it in by default (mostly from past command search), so I 
don't need to worry about where my changes are. But yeah, that's a good tip. 
Thanks Dawid!





[jira] [Commented] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-19 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347362#comment-17347362
 ] 

Zach Chen commented on LUCENE-9335:
---

{quote}The speedup for some of the slower queries looks great. I know Fuzzy1 
and Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe 
your change makes them faster?
{quote}
Ah not sure why I didn't think of running them through BMM earlier! I just gave 
them a run, and got the following results:

*BMM Scorer*
{code:java}
Task        QPS baseline      StdDev   QPS my_modified_version      StdDev     Pct diff p-value
Fuzzy1           30.46     (24.7%)        17.63     (11.6%)  -42.1% ( -62% -   -7%) 0.000
Fuzzy2           21.61     (16.4%)        16.28     (12.0%)  -24.7% ( -45% -    4%) 0.000
PKLookup        216.72      (4.1%)       215.63      (3.0%)   -0.5% (  -7% -    6%) 0.654
{code}
{code:java}
Task        QPS baseline      StdDev   QPS my_modified_version      StdDev     Pct diff p-value
Fuzzy1           30.58      (9.1%)        22.12      (6.4%)  -27.7% ( -39% -  -13%) 0.000
Fuzzy2           36.07     (12.7%)        27.05     (10.8%)  -25.0% ( -42% -   -1%) 0.000
PKLookup        215.26      (3.4%)       213.99      (2.5%)   -0.6% (  -6% -    5%) 0.530
{code}

*BMMBulkScorer without window (with the above scorer implementation)*
{code:java}
Task        QPS baseline      StdDev   QPS my_modified_version      StdDev     Pct diff p-value
Fuzzy2           16.32     (22.6%)        15.68     (16.3%)   -3.9% ( -34% -   45%) 0.527
Fuzzy1           48.11     (17.6%)        47.48     (13.6%)   -1.3% ( -27% -   36%) 0.791
PKLookup        213.67      (3.2%)       212.52      (4.0%)   -0.5% (  -7% -    6%) 0.640
{code}
{code:java}
Task        QPS baseline      StdDev   QPS my_modified_version      StdDev     Pct diff p-value
Fuzzy2           26.99     (23.2%)        24.75     (13.6%)   -8.3% ( -36% -   37%) 0.169
PKLookup        216.27      (4.3%)       216.43      (3.4%)    0.1% (  -7% -    8%) 0.951
Fuzzy1           19.01     (24.2%)        20.01     (14.2%)    5.3% ( -26% -   57%) 0.400
{code}

*BMMBulkScorer with window size 1024*
{code:java}
Task        QPS baseline      StdDev   QPS my_modified_version      StdDev     Pct diff p-value
Fuzzy2           23.56     (26.0%)        19.08     (13.9%)  -19.0% ( -46% -   28%) 0.004
Fuzzy1           30.97     (31.6%)        25.82     (16.9%)  -16.6% ( -49% -   46%) 0.038
PKLookup        213.23      (2.5%)       211.63      (1.8%)   -0.7% (  -5% -    3%) 0.289
{code}
{code:java}
Task        QPS baseline      StdDev   QPS my_modified_version      StdDev     Pct diff p-value
Fuzzy1           20.59     (12.1%)        20.59     (10.5%)   -0.0% ( -20% -   25%) 0.994
PKLookup        205.21      (3.1%)       206.99      (3.7%)    0.9% (  -5% -    7%) 0.422
Fuzzy2           30.74     (22.7%)        32.71     (17.0%)    6.4% ( -27% -   59%) 0.311
{code}
 

These results look strange to me, actually, as I would expect the BulkScorer 
without a window to perform similarly to the scorer, since it just uses the 
scorer implementation under the hood. I'll need to dive into it more to 
understand what contributed to these differences (their JFR CPU recordings 
look similar too).

From the results I have so far, it seems BMM may not be ideal for handling 
queries with many terms. My high-level guess is that for these queries, which 
can be rewritten into boolean queries with ~50 terms, BMM may spend a lot of 
time computing upTo and updating maxScore, since the minimum of all scorers' 
block boundaries is used to update upTo each time. That would explain why the 
bulkScorer implementation with a fixed window size performs better than the 
scorer one, but it doesn't explain the difference above.

 
{quote}I wanted to do some more tests so I played with the MSMARCO passages 
dataset, which has the interesting property of having queries that have several 
terms (often around 8-10). See the attached benchmark if you are interested, 
here are the outputs I'm getting for various scorers:

Contrary to my intuition, WAND seems to perform better despite the high number 
of terms. I wonder if there are some improvements we can still make to BMM?
{quote}
Thanks for running these additional tests! The results indeed look interesting. 
I took a look at the MSMarcoPassages.java code you attached, and wonder if it's 
also possible that, since the percentile numbers were computed after sort, for 
some low percentile (P10 for example) 

[GitHub] [lucene] dweiss commented on pull request #128: LUCENE-9662: [WIP] CheckIndex should be concurrent

2021-05-19 Thread GitBox


dweiss commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-843813330


   > ./gradlew check -Ptests.nightly=true -Pvalidation.git.failOnModified=false
   
   Hi Zach. You can also "git add -A ." (stage your changes for commit); or 
just commit them in. Then there's no need for the fail-on-modified flag to be 
turned off. :)


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: [WIP] CheckIndex should be concurrent

2021-05-19 Thread GitBox


zacharymorn commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r634957781



##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -2795,12 +2972,14 @@ public Relation compare(byte[] minPackedValue, byte[] 
maxPackedValue) {
* @lucene.experimental
*/
   public static Status.DocValuesStatus testDocValues(
-  CodecReader reader, PrintStream infoStream, boolean failFast) throws 
IOException {

Review comment:
   I also think it should be ok to backport. The only thing I would like to 
mention is that, in addition to the API change, the more subtle change is that 
these methods no longer throw an unchecked RuntimeException when the check 
finds an index integrity error and `failFast` is set to `true`. Any 
application that (should not have) relied on this behavior may now find that 
the check continues processing instead of aborting with an exception when an 
error is found. 
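
The behavioral difference being described might be illustrated like this (an illustrative sketch only; `PartStatus` and the runner methods are made-up names, not the actual CheckIndex internals):

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

class FailFastSketch {
  static final class PartStatus {
    final String name;
    final Throwable error; // null on success

    PartStatus(String name, Throwable error) {
      this.name = name;
      this.error = error;
    }
  }

  // Old-style failFast: the first failing part aborts the remaining checks
  // by propagating an unchecked exception.
  static List<PartStatus> runAborting(List<Supplier<PartStatus>> checks) {
    List<PartStatus> results = new ArrayList<>();
    for (Supplier<PartStatus> check : checks) {
      PartStatus status = check.get();
      results.add(status);
      if (status.error != null) {
        throw new RuntimeException(status.name + " failed", status.error);
      }
    }
    return results;
  }

  // New-style: every part check runs to completion and errors are recorded
  // in the per-part status objects for later reporting.
  static List<PartStatus> runCollecting(List<Supplier<PartStatus>> checks) {
    List<PartStatus> results = new ArrayList<>();
    for (Supplier<PartStatus> check : checks) {
      results.add(check.get());
    }
    return results;
  }
}
```

An application depending on the aborting behavior would observe the collecting behavior instead after the change.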







[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: [WIP] CheckIndex should be concurrent

2021-05-19 Thread GitBox


zacharymorn commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r634957136



##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -926,17 +1100,19 @@ public Status checkIndex(List onlySegments) 
throws IOException {
* @lucene.experimental
*/
   public static Status.LiveDocStatus testLiveDocs(
-  CodecReader reader, PrintStream infoStream, boolean failFast) throws 
IOException {
+  CodecReader reader, PrintStream infoStream, String segmentId) {
 long startNS = System.nanoTime();
+String segmentPartId = segmentId + "[LiveDocs]";
 final Status.LiveDocStatus status = new Status.LiveDocStatus();
 
 try {
-  if (infoStream != null) infoStream.print("test: check live 
docs.");
+  if (infoStream != null) infoStream.print(segmentPartId + "test: 
check live docs.");

Review comment:
   > Sorry about not answering the // nocommit question before.
   
   No problem, and thanks again for the review and feedback!
   
   > Ideally, all infoStream.print for a given "part" of the index checking 
would first append to a per-part log, and then (under lock) print to 
console/main infoStream as a single "block" of output? (So that we don't see 
confusing interleaved across segments/parts checks)?
   
   Oh I see, I hadn't thought about this approach before, and it sounds 
interesting! I assume by "per-part log" you mean an array of in-memory, 
per-part buffers that accumulate messages during the concurrent checks, right? 
If we combine these buffers at the end of / after the concurrent index check, 
we should be able to just print them out to the main infoStream without 
locking?
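
   The buffering idea under discussion could look roughly like this (names are illustrative, not Lucene's actual API): each concurrent part check writes to its own in-memory buffer, and the buffers are dumped to the main infoStream as whole blocks once all checks finish, so output never interleaves.

```java
import java.io.ByteArrayOutputStream;
import java.io.PrintStream;
import java.nio.charset.StandardCharsets;
import java.util.List;

class PerPartLogSketch {
  // Wrap a per-part byte buffer in a PrintStream so existing
  // infoStream.print(...) call sites can write to it unchanged.
  static PrintStream newPartBuffer(ByteArrayOutputStream sink) {
    return new PrintStream(sink, true, StandardCharsets.UTF_8);
  }

  // After all concurrent part checks have completed, print each buffer as
  // one contiguous block; no locking is needed since nothing writes anymore.
  static void flushAll(List<ByteArrayOutputStream> buffers, PrintStream mainInfoStream) {
    for (ByteArrayOutputStream buffer : buffers) {
      mainInfoStream.print(buffer.toString(StandardCharsets.UTF_8));
    }
  }
}
```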

##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -926,17 +1100,19 @@ public Status checkIndex(List onlySegments) 
throws IOException {
* @lucene.experimental
*/
   public static Status.LiveDocStatus testLiveDocs(
-  CodecReader reader, PrintStream infoStream, boolean failFast) throws 
IOException {
+  CodecReader reader, PrintStream infoStream, String segmentId) {
 long startNS = System.nanoTime();
+String segmentPartId = segmentId + "[LiveDocs]";
 final Status.LiveDocStatus status = new Status.LiveDocStatus();
 
 try {
-  if (infoStream != null) infoStream.print("test: check live 
docs.");
+  if (infoStream != null) infoStream.print(segmentPartId + "test: 
check live docs.");
   final int numDocs = reader.numDocs();
   if (reader.hasDeletions()) {
 Bits liveDocs = reader.getLiveDocs();
 if (liveDocs == null) {
-  throw new RuntimeException("segment should have deletions, but 
liveDocs is null");
+  throw new RuntimeException(

Review comment:
   Done.

##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -2106,16 +2286,6 @@ static void checkImpacts(Impacts impacts, int 
lastTarget) {
 }
   }
 
-  /**
-   * Test the term index.
-   *
-   * @lucene.experimental
-   */
-  public static Status.TermIndexStatus testPostings(CodecReader reader, 
PrintStream infoStream)

Review comment:
   I think I accidentally removed it...I've restored it as well as another 
one.

##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -2737,13 +2910,14 @@ public Relation compare(byte[] minPackedValue, byte[] 
maxPackedValue) {
* @lucene.experimental
*/
   public static Status.StoredFieldStatus testStoredFields(
-  CodecReader reader, PrintStream infoStream, boolean failFast) throws 
IOException {
+  CodecReader reader, PrintStream infoStream, String segmentId) {
 long startNS = System.nanoTime();
+String segmentPartId = segmentId + "[StoredFields]";

Review comment:
   Done (I used `CheckIndexException` instead of `CheckIndexFailure` for 
naming consistency). I also replaced all `RuntimeException` in `CheckIndex` 
with this new exception class.
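
   The shape of that swap might look like the following (a minimal standalone sketch; the real class is a nested type inside `CheckIndex`, and the segment-part-id prefixing here is an assumption based on the discussion above):

```java
// An unchecked exception dedicated to index-check failures, carrying the
// segment part id so concurrent failures can be attributed to the specific
// check that threw them.
class CheckIndexExceptionSketch extends RuntimeException {
  CheckIndexExceptionSketch(String segmentPartId, String message) {
    super(segmentPartId + " " + message);
  }

  CheckIndexExceptionSketch(String segmentPartId, String message, Throwable cause) {
    super(segmentPartId + " " + message, cause);
  }
}
```

Because it still extends RuntimeException, existing catch blocks keep working while callers gain a more specific type to catch.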







[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: [WIP] CheckIndex should be concurrent

2021-05-19 Thread GitBox


zacharymorn commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r634956303



##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -216,6 +225,9 @@
 
   /** Status of vectors */
   public VectorValuesStatus vectorValuesStatus;
+
+  /** Status of soft deletes */
+  public SoftDeletsStatus softDeletesStatus;

Review comment:
   It was checked before, but in a way that differs from the rest (not using a 
status class, for example): 
https://github.com/apache/lucene/blob/65820e5170ed15e91cc3349e6dd4da90689ecd5d/lucene/core/src/java/org/apache/lucene/index/CheckIndex.java#L786-L789,
 so I went ahead and updated it to follow the same convention. 







[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: [WIP] CheckIndex should be concurrent

2021-05-19 Thread GitBox


zacharymorn commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r634956106



##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -468,6 +495,10 @@ private static void msg(PrintStream out, String msg) {
 if (out != null) out.println(msg);
   }
 
+  private static void msg(PrintStream out, String id, String msg) {
+if (out != null) out.println(id + " " + msg);

Review comment:
   Done. 

##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -372,6 +384,14 @@ private FieldNormStatus() {}
   /** Exception thrown during term index test (null on success) */
   public Throwable error = null;
 }
+
+/** Status from testing soft deletes */
+public static final class SoftDeletsStatus {
+  SoftDeletsStatus() {}
+
+  /** Exception thrown during soft deletes test (null on success) */
+  public Throwable error = null;

Review comment:
   Removed, as well as the same ones used in other status classes. 







[GitHub] [lucene] zacharymorn commented on pull request #128: LUCENE-9662: [WIP] CheckIndex should be concurrent

2021-05-19 Thread GitBox


zacharymorn commented on pull request #128:
URL: https://github.com/apache/lucene/pull/128#issuecomment-843795302


   > I love this change -- `CheckIndex` is often run in the "heat of the 
moment" so time is of the essence to recovering your Lucene index and getting 
things back online. Yet it is [very slow today, even on highly concurrent 
boxes](https://home.apache.org/~mikemccand/lucenebench/checkIndexTime.html).
   > 
   > I left some small comments that I think are fine to do in a followon PR. 
This change is already massive enough (GitHub was at first not willing to even 
render the `CheckIndex` diffs!) and impactful enough that we can do the 
improvements after.
   
   Thanks Michael for the review and feedback. As far as the original scope of 
the jira ticket goes, there's also parallelization across segments that has 
not been implemented yet. But I agree that this PR is already big and should 
already provide a good speedup on powerful concurrent boxes (up to 11 
concurrent checks per segment), so we can probably let it run for a while and 
see if parallelization across segments is still needed; from my quick in-mind 
coding, that will definitely require many more changes to get the concurrency 
control right.
   
   One thing I'm still researching is that there seems to be limited direct 
test coverage for the `CheckIndex` class. I see there's `TestCheckIndex`, but 
it only has 4 tests, and the majority of the class's functionality seems to be 
put under test by other index testing utilities and test cases. Shall I still 
add a few more tests for these changes (and should I put them in 
`TestCheckIndex`)? On the other hand, I've been running `./gradlew check 
-Ptests.nightly=true -Pvalidation.git.failOnModified=false` the whole time and 
all tests have been passing.





[jira] [Comment Edited] (LUCENE-9335) Add a bulk scorer for disjunctions that does dynamic pruning

2021-05-19 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9335?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel=17347104#comment-17347104
 ] 

Adrien Grand edited comment on LUCENE-9335 at 5/19/21, 6:44 AM:


The speedup for some of the slower queries looks great. I know Fuzzy1 and 
Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe your 
change makes them faster?

I wanted to do some more tests so I played with the MSMARCO passages dataset, 
which has the interesting property of having queries that have several terms 
(often around 8-10). See the attached benchmark if you are interested, here are 
the outputs I'm getting for various scorers:

BMW
{noformat}
AVG: 1.0851470951E7
Median: 5552285
P75: 12087216
P90: 26834970
P95: 40460199
P99: 77821369
Collected AVG: 8168.523
Collected Median: 2259
Collected P75: 3735
Collected P90: 6228
Collected P95: 13063
Collected P99: 221894
{noformat}

BMM - scorer
{noformat}
AVG: 3.0175025635E7
Median: 18845216
P75: 37770072
P90: 77785996
P95: 98711520
P99: 181530180
Collected AVG: 50089.09
Collected Median: 20322
Collected P75: 57955
Collected P90: 125052
Collected P95: 194696
Collected P99: 442811
{noformat}

BMM - bulk scorer
{noformat}
AVG: 5.3372459518E7
Median: 18658182
P75: 60750919
P90: 143040509
P95: 227538646
P99: 461590829
Collected AVG: 525419.23
Collected Median: 109750
Collected P75: 563404
Collected P90: 1651320
Collected P95: 2597310
Collected P99: 4508467
{noformat}

Contrary to my intuition, WAND seems to perform better despite the high number 
of terms. I wonder if there are some improvements we can still make to BMM?

EDIT: I had first got wrong results for BMM scorer because I had run with an 
old commit of the PR, this is now fixed.


was (Author: jpountz):
The speedup for some of the slower queries looks great. I know Fuzzy1 and 
Fuzzy2 are quite noisy, but have you tried running them using BMM? Maybe your 
change makes them faster?

I wanted to do some more tests so I played with the MSMARCO passages dataset, 
which has the interesting property of having queries that have several terms 
(often around 8-10). See the attached benchmark if you are interested, here are 
the outputs I'm getting for various scorers:

BMW
{noformat}
AVG: 1.0851470951E7
Median: 5552285
P75: 12087216
P90: 26834970
P95: 40460199
P99: 77821369
Collected AVG: 8168.523
Collected Median: 2259
Collected P75: 3735
Collected P90: 6228
Collected P95: 13063
Collected P99: 221894
{noformat}

BMM - scorer
{noformat}
AVG: 4.1779829712E7
Median: 28701530
P75: 57780117
P90: 103794862
P95: 130582282
P99: 215559175
Collected AVG: 460.482
Collected Median: 143
Collected P75: 158
Collected P90: 180
Collected P95: 2316
Collected P99: 7277
{noformat}

BMM - bulk scorer
{noformat}
AVG: 5.3372459518E7
Median: 18658182
P75: 60750919
P90: 143040509
P95: 227538646
P99: 461590829
Collected AVG: 525419.23
Collected Median: 109750
Collected P75: 563404
Collected P90: 1651320
Collected P95: 2597310
Collected P99: 4508467
{noformat}

Contrary to my intuition, WAND seems to perform better despite the high number 
of terms. I wonder if there are some improvements we can still make to BMM?

> Add a bulk scorer for disjunctions that does dynamic pruning
> 
>
> Key: LUCENE-9335
> URL: https://issues.apache.org/jira/browse/LUCENE-9335
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
> Attachments: MSMarcoPassages.java, wikimedium.10M.nostopwords.tasks, 
> wikimedium.10M.nostopwords.tasks.5OrMeds
>
>  Time Spent: 6h 50m
>  Remaining Estimate: 0h
>
> Lucene often gets benchmarked against other engines, e.g. against Tantivy and 
> PISA at [https://tantivy-search.github.io/bench/] or against research 
> prototypes in Table 1 of 
> [https://cs.uwaterloo.ca/~jimmylin/publications/Grand_etal_ECIR2020_preprint.pdf].
>  Given that top-level disjunctions of term queries are commonly used for 
> benchmarking, it would be nice to optimize this case a bit more, I suspect 
> that we could make fewer per-document decisions by implementing a BulkScorer 
> instead of a Scorer.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)




[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: [WIP] CheckIndex should be concurrent

2021-05-19 Thread GitBox


zacharymorn commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r634953939



##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -701,104 +765,196 @@ public Status checkIndex(List onlySegments) 
throws IOException {
 if (reader.hasDeletions()) {
   if (reader.numDocs() != info.info.maxDoc() - info.getDelCount()) {
 throw new RuntimeException(
-"delete count mismatch: info="
+segmentId
++ "delete count mismatch: info="
 + (info.info.maxDoc() - info.getDelCount())
 + " vs reader="
 + reader.numDocs());
   }
   if ((info.info.maxDoc() - reader.numDocs()) > reader.maxDoc()) {
 throw new RuntimeException(
-"too many deleted docs: maxDoc()="
+segmentId
++ "too many deleted docs: maxDoc()="
 + reader.maxDoc()
 + " vs del count="
 + (info.info.maxDoc() - reader.numDocs()));
   }
   if (info.info.maxDoc() - reader.numDocs() != info.getDelCount()) {
 throw new RuntimeException(
-"delete count mismatch: info="
+segmentId
++ "delete count mismatch: info="
 + info.getDelCount()
 + " vs reader="
 + (info.info.maxDoc() - reader.numDocs()));
   }
 } else {
   if (info.getDelCount() != 0) {
 throw new RuntimeException(
-"delete count mismatch: info="
+segmentId
++ "delete count mismatch: info="
 + info.getDelCount()
 + " vs reader="
 + (info.info.maxDoc() - reader.numDocs()));
   }
 }
 
 if (checksumsOnly == false) {
+  // This redundant assignment is done to make compiler happy
+  SegmentReader finalReader = reader;
+
   // Test Livedocs
-  segInfoStat.liveDocStatus = testLiveDocs(reader, infoStream, 
failFast);
+  CompletableFuture testliveDocs =
+  runAysncSegmentPartCheck(
+  executorService,
+  () -> testLiveDocs(finalReader, infoStream, segmentId),
+  liveDocStatus -> segInfoStat.liveDocStatus = liveDocStatus);
 
   // Test Fieldinfos
-  segInfoStat.fieldInfoStatus = testFieldInfos(reader, infoStream, 
failFast);
+  CompletableFuture testFieldInfos =
+  runAysncSegmentPartCheck(
+  executorService,
+  () -> testFieldInfos(finalReader, infoStream, segmentId),
+  fieldInfoStatus -> segInfoStat.fieldInfoStatus = 
fieldInfoStatus);
 
   // Test Field Norms
-  segInfoStat.fieldNormStatus = testFieldNorms(reader, infoStream, 
failFast);
+  CompletableFuture testFieldNorms =
+  runAysncSegmentPartCheck(
+  executorService,
+  () -> testFieldNorms(finalReader, infoStream, segmentId),
+  fieldNormStatus -> segInfoStat.fieldNormStatus = 
fieldNormStatus);
 
   // Test the Term Index
-  segInfoStat.termIndexStatus =
-  testPostings(reader, infoStream, verbose, doSlowChecks, 
failFast);
+  CompletableFuture testTermIndex =
+  runAysncSegmentPartCheck(
+  executorService,
+  () -> testPostings(finalReader, infoStream, segmentId, 
verbose, doSlowChecks),
+  termIndexStatus -> segInfoStat.termIndexStatus = 
termIndexStatus);
 
   // Test Stored Fields
-  segInfoStat.storedFieldStatus = testStoredFields(reader, infoStream, 
failFast);
+  CompletableFuture testStoredFields =
+  runAysncSegmentPartCheck(
+  executorService,
+  () -> testStoredFields(finalReader, infoStream, segmentId),
+  storedFieldStatus -> segInfoStat.storedFieldStatus = 
storedFieldStatus);
 
   // Test Term Vectors
-  segInfoStat.termVectorStatus =
-  testTermVectors(reader, infoStream, verbose, doSlowChecks, 
failFast);
+  CompletableFuture testTermVectors =
+  runAysncSegmentPartCheck(
+  executorService,
+  () -> testTermVectors(finalReader, infoStream, segmentId, 
verbose, doSlowChecks),
+  termVectorStatus -> segInfoStat.termVectorStatus = 
termVectorStatus);
 
   // Test Docvalues
-  segInfoStat.docValuesStatus = testDocValues(reader, infoStream, 
failFast);
+  CompletableFuture testDocValues =
+  runAysncSegmentPartCheck(
+  executorService,
+  () -> testDocValues(finalReader, 

[GitHub] [lucene] zacharymorn commented on a change in pull request #128: LUCENE-9662: [WIP] CheckIndex should be concurrent

2021-05-19 Thread GitBox


zacharymorn commented on a change in pull request #128:
URL: https://github.com/apache/lucene/pull/128#discussion_r634953443



##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -488,8 +519,35 @@ public Status checkIndex() throws IOException {
* quite a long time to run.
*/
   public Status checkIndex(List onlySegments) throws IOException {
+ExecutorService executorService =
+Executors.newFixedThreadPool(threadCount, new 
NamedThreadFactory("async-check-index"));
+try {
+  return checkIndex(onlySegments, executorService);
+} finally {
+  executorService.shutdown();
+  try {
+executorService.awaitTermination(5, TimeUnit.SECONDS);
+  } catch (
+  @SuppressWarnings("unused")

Review comment:
   Yeah agreed. Updated. 
   

##
File path: lucene/core/src/java/org/apache/lucene/index/CheckIndex.java
##
@@ -701,104 +765,196 @@ public Status checkIndex(List onlySegments) 
throws IOException {
 if (reader.hasDeletions()) {
   if (reader.numDocs() != info.info.maxDoc() - info.getDelCount()) {
 throw new RuntimeException(
-"delete count mismatch: info="
+segmentId
++ "delete count mismatch: info="
 + (info.info.maxDoc() - info.getDelCount())
 + " vs reader="
 + reader.numDocs());
   }
   if ((info.info.maxDoc() - reader.numDocs()) > reader.maxDoc()) {
 throw new RuntimeException(
-"too many deleted docs: maxDoc()="
+segmentId
++ "too many deleted docs: maxDoc()="
 + reader.maxDoc()
 + " vs del count="
 + (info.info.maxDoc() - reader.numDocs()));
   }
   if (info.info.maxDoc() - reader.numDocs() != info.getDelCount()) {
 throw new RuntimeException(
-"delete count mismatch: info="
+segmentId
++ "delete count mismatch: info="
 + info.getDelCount()
 + " vs reader="
 + (info.info.maxDoc() - reader.numDocs()));
   }
 } else {
   if (info.getDelCount() != 0) {
 throw new RuntimeException(
-"delete count mismatch: info="
+segmentId
++ "delete count mismatch: info="
 + info.getDelCount()
 + " vs reader="
 + (info.info.maxDoc() - reader.numDocs()));
   }
 }
 
 if (checksumsOnly == false) {
+  // This redundant assignment is done to make compiler happy
+  SegmentReader finalReader = reader;
+
   // Test Livedocs
-  segInfoStat.liveDocStatus = testLiveDocs(reader, infoStream, 
failFast);
+  CompletableFuture testliveDocs =
+  runAysncSegmentPartCheck(

Review comment:
   Oops. Fixed. 



