[GitHub] [lucene] jimczi commented on a change in pull request #444: LUCENE-10236: Updated field-weight used in CombinedFieldQuery scoring calculation, and added a test

2021-11-15 Thread GitBox


jimczi commented on a change in pull request #444:
URL: https://github.com/apache/lucene/pull/444#discussion_r749986210



##
File path: 
lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java
##
@@ -61,7 +63,14 @@
     if (needsScores) {
       final List<NumericDocValues> normsList = new ArrayList<>();
       final List<Float> weightList = new ArrayList<>();
+      final Set<String> duplicateCheckingSet = new HashSet<>();
       for (FieldAndWeight field : normFields) {
+        assert duplicateCheckingSet.contains(field.field) == false
+            : "There is duplicated field ["
+                + field.field
+                + "] used to construct MultiNormsLeafSimScorer";
+        duplicateCheckingSet.add(field.field);

Review comment:
   Could this be simplified to `assert duplicateCheckingSet.add(field.field)`? 
(`Set#add` returns `false` when the element is already present, so the check 
and the insertion collapse into a single call.)
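
   For illustration, a minimal self-contained demo of that `Set#add` idiom 
(hypothetical field names, not the PR's code):

```java
import java.util.HashSet;
import java.util.Set;

public class DedupAssertDemo {
  public static void main(String[] args) {
    Set<String> seen = new HashSet<>();
    for (String field : new String[] {"title", "body", "title"}) {
      // Set#add returns false for a duplicate, so with -ea enabled the
      // assert fires on the second "title".
      assert seen.add(field) : "There is duplicated field [" + field + "]";
    }
  }
}
```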







[jira] [Comment Edited] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-15 Thread Feng Guo (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444070#comment-17444070
 ] 

Feng Guo edited comment on LUCENE-10233 at 11/16/21, 7:43 AM:
--

[~jpountz] Thanks! +1 to remove the OffsetBitSet class, but I find that there 
is not yet a faster implementation for FixedBitSet.or(SparseFixedBitSet); do 
you mean we should implement one?

In this case, the bitset is clearly in this pattern: [nothing][dense]. So 
compared with SparseFixedBitSet, I tend to use a more customized way to store 
it if possible. Another way to replace the OffsetBitSet class I can think of 
is to support a docBase in BitSetIterator; I implemented this in the newest 
commit: https://github.com/apache/lucene/pull/438. I wonder if this approach 
makes sense to you.
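
For illustration, a minimal sketch of what such a docBase-aware iterator could 
look like (hypothetical code, not the PR's actual implementation):
{code:java}
import java.io.IOException;
import org.apache.lucene.search.DocIdSetIterator;

/**
 * Delegating iterator that shifts every doc id by a fixed base, so the
 * underlying bitset only needs to cover [0, max(docId) - docBase].
 */
final class OffsetDocIdSetIterator extends DocIdSetIterator {
  private final DocIdSetIterator in;
  private final int docBase;

  OffsetDocIdSetIterator(DocIdSetIterator in, int docBase) {
    this.in = in;
    this.docBase = docBase;
  }

  // -1 (unpositioned) and NO_MORE_DOCS pass through unshifted.
  private int shift(int doc) {
    return (doc == -1 || doc == NO_MORE_DOCS) ? doc : doc + docBase;
  }

  @Override
  public int docID() {
    return shift(in.docID());
  }

  @Override
  public int nextDoc() throws IOException {
    return shift(in.nextDoc());
  }

  @Override
  public int advance(int target) throws IOException {
    return shift(in.advance(Math.max(0, target - docBase)));
  }

  @Override
  public long cost() {
    return in.cost();
  }
}
{code}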


was (Author: gf2121):
[~jpountz] Thanks! +1 to remove the OffsetBitSet class, but I find that there 
is not yet a faster implementation for FixedBitSet.or(SparseFixedBitSet); do 
you mean we should implement one?

Another way to replace the OffsetBitSet class I can think of is to support a 
docBase in BitSetIterator; I implemented this in the newest commit: 
https://github.com/apache/lucene/pull/438. I wonder if this approach makes 
sense to you.

> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and {{intersect}} will get into {{addAll}} logic. 
> If we store ids as bitset, and give the IntersectVisitor bulk visiting 
> ability, we can speed up addAll because we can just execute the 'or' logic 
> between the result and the block ids.
> Optimization will be triggered when the following conditions are met at the 
> same time:
>  # leafCardinality = 1
>  # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding 
> too much storage)
>  # no duplicate doc id
> I mocked a field that has 10,000,000 docs per value and searched it with a 
> 1-term PointInSetQuery; the build scorer time decreased from 71ms to 8ms.
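
An illustrative sketch of the bulk "or" idea (assuming the query's matches are 
collected into a FixedBitSet; this is a sketch, not the PR's code):
{code:java}
import java.io.IOException;
import org.apache.lucene.util.BitSetIterator;
import org.apache.lucene.util.FixedBitSet;

final class BulkAddAllSketch {
  /**
   * ORs a leaf block whose doc ids were stored as a bitset into the result in
   * one bulk call, instead of visiting doc ids one at a time.
   */
  static void addAll(FixedBitSet result, FixedBitSet blockDocs, long pointCount)
      throws IOException {
    // FixedBitSet#or(DocIdSetIterator) takes a word-by-word fast path when
    // the iterator wraps another FixedBitSet.
    result.or(new BitSetIterator(blockDocs, pointCount));
  }
}
{code}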






[GitHub] [lucene] zacharymorn commented on pull request #418: LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery

2021-11-15 Thread GitBox


zacharymorn commented on pull request #418:
URL: https://github.com/apache/lucene/pull/418#issuecomment-969846412


   > Hi @jpountz @jimczi, when I was doing some more deep diving to see how this 
PR can be improved further, I noticed a difference in the field-weight parameter 
passed to create `MultiNormsLeafSimScorer` in `CombinedFieldQuery`:
   > 
   > `CombinedFieldQuery#scorer`:
   > 
   > 
https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421
   > 
   > `CombinedFieldQuery#explain`:
   > 
   > 
https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389
   > 
   > For the `CombinedFieldQuery#scorer` one, `fields` may contain duplicated 
fields:
   > 
   > 
https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414
   > 
   > whereas `fieldAndWeights.values()` should not contain duplicated fields.
   > 
   > I feel `CombinedFieldQuery#scorer` should be updated to use 
`fieldAndWeights.values()` as well. What do you think?
   
   For now I have created a spin-off issue 
https://issues.apache.org/jira/browse/LUCENE-10236 and a separate PR 
https://github.com/apache/lucene/pull/444 for this. Please let me know if they 
look good to you.





[GitHub] [lucene] zacharymorn opened a new pull request #444: LUCENE-10236: Updated field-weight used in CombinedFieldQuery scoring calculation, and added a test

2021-11-15 Thread GitBox


zacharymorn opened a new pull request #444:
URL: https://github.com/apache/lucene/pull/444


   # Description
   
   Updated field-weight used in CombinedFieldQuery scoring calculation
   
   # Tests
   
   1. Added a new test from https://github.com/apache/lucene/pull/418
   2. Ran `./gradlew clean; ./gradlew check 
-Pvalidation.git.failOnModified=false`
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [x] I have added tests for my changes.
   





[jira] [Updated] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring

2021-11-15 Thread Zach Chen (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Zach Chen updated LUCENE-10236:
---
Description: 
This is a spin-off issue from discussion in 
[https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick 
fix in CombinedFieldsQuery scoring.

Currently CombinedFieldsQuery uses a constructed 
[fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421]
 object to create a MultiNormsLeafSimScorer for scoring, but the fields object 
may contain duplicated field-weight pairs, as it is [built from looping over 
fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414],
 resulting in duplicated norms being added during the scoring calculation in 
MultiNormsLeafSimScorer. 

E.g. for a CombinedFieldsQuery with two fields and two terms matching a 
particular doc:
{code:java}
CombinedFieldQuery query =
    new CombinedFieldQuery.Builder()
        .addField("field1", (float) 1.0)
        .addField("field2", (float) 1.0)
        .addTerm(new BytesRef("foo"))
        .addTerm(new BytesRef("zoo"))
        .build();
{code}
I would imagine the scoring to be based on the following:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2)

but the current logic would use the following for scoring:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + 
norm(field2)

 

In addition, this differs from how MultiNormsLeafSimScorer is constructed in 
CombinedFieldsQuery's explain function, which [uses 
fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389]
 and does not contain duplicated field-weight pairs. 
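
For illustration, a minimal sketch of the fix direction (the actual change is 
in https://github.com/apache/lucene/pull/444; variable names here follow the 
surrounding code but may not be exact):
{code:java}
// In CombinedFieldQuery#scorer: build the scorer from the deduplicated
// field-weight collection (as #explain already does), instead of from the
// per-term `fields` list, which can repeat a field once per query term.
MultiNormsLeafSimScorer scoringSimScorer =
    new MultiNormsLeafSimScorer(simWeight, context.reader(), fieldAndWeights.values(), true);
{code}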


> CombinedFieldsQuery to use fieldAndWeights.values() when constructing 
> MultiNormsLeafSimScorer for scoring
> -
>
> Key: LUCENE-10236
> URL: https://issues.apache.org/jira/browse/LUCENE-10236
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/sandbox
>Reporter: Zach Chen
>Assignee: Zach Chen
>Priority: Minor
>
> This is a spin-off issue from discussion in 
> [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a 
> quick fix in CombinedFieldsQuery scoring.
> Currently 

[jira] [Created] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring

2021-11-15 Thread Zach Chen (Jira)
Zach Chen created LUCENE-10236:
--

 Summary: CombinedFieldsQuery to use fieldAndWeights.values() when 
constructing MultiNormsLeafSimScorer for scoring
 Key: LUCENE-10236
 URL: https://issues.apache.org/jira/browse/LUCENE-10236
 Project: Lucene - Core
  Issue Type: Improvement
  Components: modules/sandbox
Reporter: Zach Chen
Assignee: Zach Chen


This is a spin-off issue from discussion in 
[https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick 
fix in CombinedFieldsQuery scoring.

Currently CombinedFieldsQuery uses a constructed 
[fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421]
 object to create a MultiNormsLeafSimScorer for scoring, but the fields object 
may contain duplicated field-weight pairs, as it is [built from looping over 
fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414],
 resulting in duplicated norms being added during the scoring calculation in 
MultiNormsLeafSimScorer. 

E.g. for a CombinedFieldsQuery with two fields and two terms matching a 
particular doc:
{code:java}
CombinedFieldQuery query =
    new CombinedFieldQuery.Builder()
        .addField("field1", (float) 1.0)
        .addField("field2", (float) 1.0)
        .addTerm(new BytesRef("foo"))
        .addTerm(new BytesRef("zoo"))
        .build();
{code}

I would imagine the scoring to be based on the following:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2)

but the current logic would use the following for scoring:
 # Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) 
+ freq(field2:zoo)
 # Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + 
norm(field2)

In addition, this differs from how MultiNormsLeafSimScorer is constructed in 
CombinedFieldsQuery's explain function, which [uses 
fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389]
 and does not contain duplicated field-weight pairs. 






[jira] [Commented] (LUCENE-10212) Add luceneutil benchmark task for CombinedFieldsQuery

2021-11-15 Thread Zach Chen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444254#comment-17444254
 ] 

Zach Chen commented on LUCENE-10212:


No problem [~julietibs] ! Glad to be able to contribute! 

> Add luceneutil benchmark task for CombinedFieldsQuery
> -
>
> Key: LUCENE-10212
> URL: https://issues.apache.org/jira/browse/LUCENE-10212
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Zach Chen
>Assignee: Zach Chen
>Priority: Minor
>
> This is a spin-off task from 
> https://issues.apache.org/jira/browse/LUCENE-10061. In order to objectively 
> evaluate performance changes for CombinedFieldsQuery, we would like to add a 
> benchmark task and parsing support for CombinedFieldsQuery.
> One proposed query syntax to enable CombinedFieldsQuery benchmarking would be 
> the following:
> {code:java}
> taskName: term1 term2 term3 term4 
> +combinedFields=field1^boost1,field2^boost2,field3^boost3
> {code}






[jira] [Commented] (LUCENE-10062) Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting

2021-11-15 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444178#comment-17444178
 ] 

Greg Miller commented on LUCENE-10062:
--

Here's the PR targeting 9.0: [https://github.com/apache/lucene/pull/443]

Let's iterate there. Once it's merged, I'll pull the change over to 9x as well 
and then update my other PR against main with a much simpler version that 
removes the back-compat concerns with 8.x indices.

> Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for 
> faceting
> 
>
> Key: LUCENE-10062
> URL: https://issues.apache.org/jira/browse/LUCENE-10062
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>  Time Spent: 3h 40m
>  Remaining Estimate: 0h
>
> We currently encode taxonomy ordinals using varint style packing in a binary 
> doc values field. I suspect there have been a number of improvements to 
> SortedNumericDocValues since taxonomy faceting was first introduced, and I 
> plan to explore replacing the custom binary format we have today with a 
> SORTED_NUMERIC type dv field instead.
> I'll report benchmark results and index size impact here.






[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals

2021-11-15 Thread GitBox


gsmiller commented on a change in pull request #443:
URL: https://github.com/apache/lucene/pull/443#discussion_r749715273



##
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java
##
@@ -409,9 +410,26 @@ private void processFacetFields(
         indexDrillDownTerms(doc, indexFieldName, dimConfig, facetLabel);
       }
 
-      // Facet counts:
-      // DocValues are considered stored fields:
-      doc.add(new BinaryDocValuesField(indexFieldName, dedupAndEncode(ordinals.get())));
+      // Store the taxonomy ordinals associated with each doc. Prefer to use SortedNumericDocValues
+      // but "fall back" to a custom binary format to maintain backwards compatibility with Lucene 8
+      // indexes.
+      if (taxoWriter.useNumericDocValuesForOrdinals()) {

Review comment:
   I'm not entirely sure this is correct. This tells us that the taxonomy 
index was created with an 8.x or newer version, but doesn't tell us anything 
about the version of the main index. It's assuming that the same version is 
used to create both the taxonomy index and main index. Is it possible for these 
versions to not match? I wonder if we need to account for that situation? If 
so, I'm not entirely sure how to go about doing so, as we don't have any 
information about the main index here in this class. So that would need some 
more thought.







[GitHub] [lucene] gsmiller commented on pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals

2021-11-15 Thread GitBox


gsmiller commented on pull request #443:
URL: https://github.com/apache/lucene/pull/443#issuecomment-969362319


   **NOTE**: I'm working on additional testing but hoping to get some early 
feedback on this approach, particularly where backwards compatibility is 
concerned. I'll update with more tests and a CHANGES entry.





[GitHub] [lucene] gsmiller opened a new pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals

2021-11-15 Thread GitBox


gsmiller opened a new pull request #443:
URL: https://github.com/apache/lucene/pull/443


   # Description
   
   In benchmarks, using numeric doc values to store taxonomy facet ordinals 
shows almost a 400% QPS improvement in browse-related taxonomy-based tasks, 
compared with the current custom delta-encoding into a binary doc values field. 
This PR changes the encoding of facet ordinals while maintaining backwards 
compatibility with 8.x indexes.
   
   # Solution
   
   This change moves to standard numeric doc values for storing taxonomy 
ordinals.
   
   # Tests
   
   No new tests added. Lots of existing test coverage for taxonomy faceting 
functionality.
   
   **NOTE**: I will add new testing that ensures backwards compatibility 
support remains with 8.x. I'm putting this PR out for feedback while doing so.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes.
   





[jira] [Commented] (LUCENE-10062) Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting

2021-11-15 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444124#comment-17444124
 ] 

Greg Miller commented on LUCENE-10062:
--

I'm posting a new PR now for adding this format change to the 9.0 release. The 
intention is to maintain backwards compatibility with 8.x indices, and then 
drop support for the older binary doc values format in 10 (which allows us to 
avoid some of the back-compat complexity on the main branch). The PR I'm 
posting takes a fairly aggressive approach to deprecation, and I'm curious what 
folks will think of this. I'll outline the deprecation approach here, starting 
with the less-controversial followed by the potentially more-controversial.

*Less Controversial*

Starting with Lucene 9.0, taxonomy ordinals will be stored as a 
{{SortedNumericDocValues}} field. For backwards compatibility, if Lucene 9.x 
code is writing to an index created with 8.x, it will revert back to using a 
{{BinaryDocValues}} format. In Lucene 8.x, we allow users to plug in their own 
custom binary format if they don't want the default. This will continue to work 
in 9.x, but only if writing to an 8.x index. Users will not be able to plug in 
any sort of custom format for indexes created with 9.0 onward (they'll get a 
{{SortedNumericDocValues}} field as-is).

When merging segments, we will honor the present format for backwards 
compatibility. So if the segments being merged were written with 9.x, we'll 
merge the {{SortedNumericDocValues}} fields. If we're merging 8.x segments, 
we'll maintain the older binary format (including any customization plugged in 
by the user). Again, no custom format support will be provided for 9.0 onwards.

When reading the ordinals, we'll be backwards compatible with 8.x indexes 
(using the binary format).
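
An illustrative sketch of the write-time branching described above (method 
names follow the PR and may not be exact):
{code:java}
// In FacetsConfig#processFacetFields (sketch): pick the ordinal encoding based
// on how the taxonomy index was created.
IntsRef ords = ordinals.get();
if (taxoWriter.useNumericDocValuesForOrdinals()) {
  // Index created with 9.0+: one SORTED_NUMERIC entry per taxonomy ordinal.
  for (int i = 0; i < ords.length; i++) {
    doc.add(new SortedNumericDocValuesField(indexFieldName, ords.ints[ords.offset + i]));
  }
} else {
  // Index created with 8.x: keep the legacy custom binary encoding.
  doc.add(new BinaryDocValuesField(indexFieldName, dedupAndEncode(ords)));
}
{code}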

*Potentially Controversial*

Users currently have the ability to provide ordinals for a given document 
through the concept of an {{OrdinalsReader}} when using 
{{TaxonomyFacetCounts}}, {{TaxonomyFacetSumValueSource}} and 
{{TaxonomyFacetLabels}}. This seems like it's available mainly to support 
users that have created a custom binary format for their taxonomy ordinals. 
But, in theory, it could be useful more generally if users have some need to 
provide ordinals in some other, custom way. I propose deprecating this concept 
entirely. While it's not terribly hard to keep it around, I struggle to think 
of a real use-case for users needing to provide ordinals in a custom way if we 
no longer support the ability to plug in a custom binary format. Note that the 
other facet implementations (including things like 
{{FastTaxonomyFacetCounts}}) assume the default encoding, so they'll 
seamlessly switch from the binary format to the numeric format under the hood 
in a backwards-compatible fashion. If users really have some custom need, 
there's nothing preventing them from implementing their own {{Facets}} 
sub-class, etc.

If anyone knows of real-world use-cases for maintaining the support for 
{{OrdinalsReader}}, I'm happy to keep it in. I have a version of the change 
that does so already, so it's not really any extra work, it just seems a good 
opportunity to remove some code complexity.

> Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for 
> faceting
> 
>
> Key: LUCENE-10062
> URL: https://issues.apache.org/jira/browse/LUCENE-10062
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Reporter: Greg Miller
>Assignee: Greg Miller
>Priority: Minor
>  Time Spent: 3h 10m
>  Remaining Estimate: 0h
>
> We currently encode taxonomy ordinals using varint style packing in a binary 
> doc values field. I suspect there have been a number of improvements to 
> SortedNumericDocValues since taxonomy faceting was first introduced, and I 
> plan to explore replacing the custom binary format we have today with a 
> SORTED_NUMERIC type dv field instead.
> I'll report benchmark results and index size impact here.






[GitHub] [lucene] bruno-roustant commented on pull request #430: LUCENE-10225: Improve IntroSelector.

2021-11-15 Thread GitBox


bruno-roustant commented on pull request #430:
URL: https://github.com/apache/lucene/pull/430#issuecomment-969337494


   Woah, @jpountz your idea has a clear effect: it brings +10-12% perf compared 
with the baseline without the top-k optimization, and the top-k optimization 
adds another +3% on top (plus it is still very effective for k very close to 
the boundaries).
   So I'd like to keep both. I hope it does not make the code too complex.





[GitHub] [lucene-solr] mkhludnev commented on pull request #2611: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool

2021-11-15 Thread GitBox


mkhludnev commented on pull request #2611:
URL: https://github.com/apache/lucene-solr/pull/2611#issuecomment-969325618


   @jpountz I'm still not sure if I did it right.





[GitHub] [lucene-solr] mkhludnev closed pull request #2610: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool

2021-11-15 Thread GitBox


mkhludnev closed pull request #2610:
URL: https://github.com/apache/lucene-solr/pull/2610


   





[GitHub] [lucene] rmuir commented on a change in pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)

2021-11-15 Thread GitBox


rmuir commented on a change in pull request #264:
URL: https://github.com/apache/lucene/pull/264#discussion_r749671027



##
File path: lucene/facet/src/java/org/apache/lucene/facet/FacetUtils.java
##
@@ -81,4 +84,19 @@ public long cost() {
   }
 };
   }
+
+  /**
+   * Determine whether an index segment is using the older-style binary format or the newer
+   * NumericDocValues format for storing taxonomy faceting ordinals (for the specified field).
+   *
+   * @deprecated Please do not rely on this method. It is added as a temporary measure for
+   *     providing index backwards-compatibility with Lucene 9 and earlier indexes, and will be
+   *     removed in Lucene 11.
+   */
+  // TODO: Remove in Lucene 11
+  @Deprecated
+  public static boolean usesOlderBinaryOrdinals(LeafReader reader, String field) {

Review comment:
   Why this check? Can't we simply do 
`reader.getMetaData().getCreatedVersionMajor() < 9` or whatever, matching the 
exact logic at write-time in DirectoryTaxonomyWriter?
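
   For reference, a minimal sketch of that version-based check (illustrative; 
assumes the segment's `LeafReader` is in scope):

```java
// Each segment records the major Lucene version that created the index, so
// the write-time decision can be reproduced at read time without extra markers.
boolean usesOlderBinaryOrdinals =
    reader.getMetaData().getCreatedVersionMajor() < 9;
```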







[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array

2021-11-15 Thread Haoyu Zhai (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444098#comment-17444098
 ] 

Haoyu Zhai commented on LUCENE-10122:
-

OK, here's the new PR (with the back-compatibility): 
[https://github.com/apache/lucene/pull/442]

[~jpountz] I set that PR to target the 9.0 branch based on the previous email 
thread, but since we're already in the process of releasing, please let me know 
if you want it to target the main branch instead.

> Explore using NumericDocValue to store taxonomy parent array
> 
>
> Key: LUCENE-10122
> URL: https://issues.apache.org/jira/browse/LUCENE-10122
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (10.0)
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 2h 10m
>  Remaining Estimate: 0h
>
> We currently use term position of a hardcoded term in a hardcoded field to 
> represent the parent ordinal of each taxonomy label. That is an old way and 
> perhaps could be dated back to the time where doc values didn't exist.
> We probably would want to use NumericDocValues instead given we have spent 
> quite a lot of effort optimizing them.






[GitHub] [lucene] zhaih opened a new pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array

2021-11-15 Thread GitBox


zhaih opened a new pull request #442:
URL: https://github.com/apache/lucene/pull/442


   
   
   
   # Description
   
   As mentioned in the issue, use NumericDocValues to store the parent array 
instead of term positions.
   
   Benchmark results are available in JIRA; in short, we're seeing a slightly 
larger taxonomy index and roughly unchanged performance.
   
   Added some if-branches for backward compatibility.
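
   For illustration, a minimal sketch of the encoding idea (the field name and 
wiring here are hypothetical, not the PR's actual code):

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.NumericDocValuesField;

// Instead of encoding a taxonomy node's parent ordinal as the position of a
// hardcoded term, store it directly as a numeric doc value on the node's doc.
Document newCategoryDocument(int parentOrdinal) {
  Document doc = new Document();
  doc.add(new NumericDocValuesField("$parent", parentOrdinal));
  return doc;
}
```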
   
   ## Misc
   I've unified my name in "CHANGES.txt" to be "Patrick Zhai" (previously I 
sometimes used "Haoyu Zhai")
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [x] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [x] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [x] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [x] I have developed this patch against the `main` branch.
   - [x] I have run `./gradlew check`.
   - [ ] I have added tests for my changes. (Old tests should be sufficient)





[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array

2021-11-15 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444091#comment-17444091
 ] 

Michael McCandless commented on LUCENE-10122:
-

OK I am convinced too – let's move forward!  It is really weird to abuse 
positions like this.

[~zhai7631] what do you think?  I think the only issue on the PR is back-compat 
from 8.x indices?

Later we can also try to eliminate the separate large heap-resident {{int[]}} 
as [~rmuir] suggested on the dev list today.

> Explore using NumericDocValue to store taxonomy parent array
> 
>
> Key: LUCENE-10122
> URL: https://issues.apache.org/jira/browse/LUCENE-10122
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (10.0)
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We currently use term position of a hardcoded term in a hardcoded field to 
> represent the parent ordinal of each taxonomy label. That is an old way and 
> perhaps could be dated back to the time where doc values didn't exist.
> We probably would want to use NumericDocValues instead given we have spent 
> quite a lot of effort optimizing them.






[GitHub] [lucene] bruno-roustant commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.

2021-11-15 Thread GitBox


bruno-roustant commented on a change in pull request #430:
URL: https://github.com/apache/lucene/pull/430#discussion_r749654605



##
File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java
##
@@ -185,6 +204,115 @@ private void shuffle(int from, int to) {
 }
   }
 
+  /** Selects the k-th entry with a bottom-k algorithm, given that k is close to {@code from}. */

Review comment:
   Thanks for this idea Adrien. I'll try it.







[GitHub] [lucene-solr] mkhludnev commented on pull request #2609: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool

2021-11-15 Thread GitBox


mkhludnev commented on pull request #2609:
URL: https://github.com/apache/lucene-solr/pull/2609#issuecomment-969271708


   Uhgg... Thanks, Adrien. I'm on it. 





[jira] [Comment Edited] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-15 Thread Feng Guo (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444070#comment-17444070
 ] 

Feng Guo edited comment on LUCENE-10233 at 11/15/21, 7:51 PM:
--

[~jpountz] Thanks! +1 to remove the OffsetBitSet class, but I find that there 
is not yet a faster implementation for FixedBitSet.or(SparseFixedBitSet); do 
you mean we should implement one?

Another way to replace the OffsetBitSet class I can think of is to support a 
docBase in BitSetIterator; I implemented this in the newest commit: 
https://github.com/apache/lucene/pull/438. I wonder if this approach makes 
sense to you.


was (Author: gf2121):
[~jpountz]  Thanks! +1 to remove the OffsetBitSet class, but i find that there 
has not been a faster implementation for FixedBitSet.or(SparseFixedBitSet), do 
you mean we should implement it?

Another way to replace the OffsetBitSet class i can think of is to support a 
docBase in BitSetIterator, i implement this in the newest commit: 
https://github.com/apache/lucene/pull/438, i want to know if this approach 
makes sense to you.

> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and {{intersect}} will get into {{addAll}} logic. 
> If we store ids as bitset, and give the IntersectVisitor bulk visiting 
> ability, we can speed up addAll because we can just execute the 'or' logic 
> between the result and the block ids.
> Optimization will be triggered when the following conditions are met at the 
> same time:
>  # leafCardinality = 1
>  # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding 
> too much storage)
>  # no duplicate doc id
> I mocked a field that has 10,000,000 docs per value and searched it with a 
> 1-term PointInSetQuery; the build scorer time decreased from 71ms to 8ms.






[jira] [Commented] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-15 Thread Feng Guo (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444070#comment-17444070
 ] 

Feng Guo commented on LUCENE-10233:
---

[~jpountz] Thanks! +1 to remove the OffsetBitSet class, but I find that there 
is not yet a faster implementation for FixedBitSet.or(SparseFixedBitSet); do 
you mean we should implement one?

Another way to replace the OffsetBitSet class I can think of is to support a 
docBase in BitSetIterator; I implemented this in the newest commit: 
https://github.com/apache/lucene/pull/438. I want to know if this approach 
makes sense to you.

> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and {{intersect}} will get into {{addAll}} logic. 
> If we store ids as bitset, and give the IntersectVisitor bulk visiting 
> ability, we can speed up addAll because we can just execute the 'or' logic 
> between the result and the block ids.
> Optimization will be triggered when the following conditions are met at the 
> same time:
>  # leafCardinality = 1
>  # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding 
> too much storage)
>  # no duplicate doc id
> I mocked a field that has 10,000,000 docs per value and searched it with a 
> 1-term PointInSetQuery; the build scorer time decreased from 71ms to 8ms.






[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-11-15 Thread Arjen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444029#comment-17444029
 ] 

Arjen commented on LUCENE-9921:
---

Thanks, although this kind of issue can happen with any library. Antlr used an 
even older version, but luckily only with usages that didn't require such a 
hard version check.

> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> ICU 69 was released, so i was playing with the upgrade just to test it out 
> and test out our regeneration.
> Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks 
> were SKIPPED by the build.
> So I'm curious if the ICU version can be treated as an "input" to these 
> tasks, such that if it changes, tasks know the generated output is stale?






[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-11-15 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444025#comment-17444025
 ] 

Robert Muir commented on LUCENE-9921:
-

Sorry you got jar hell because lucene has the outdated dependency. I mean we 
really should be using the latest stuff. We can address it for 9.1 for sure.

> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> ICU 69 was released, so i was playing with the upgrade just to test it out 
> and test out our regeneration.
> Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks 
> were SKIPPED by the build.
> So I'm curious if the ICU version can be treated as an "input" to these 
> tasks, such that if it changes, tasks know the generated output is stale?






[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-11-15 Thread Arjen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444024#comment-17444024
 ] 

Arjen commented on LUCENE-9921:
---

I know that.

I even pinned it in my build.gradle at 62.2 to prevent other libraries from 
picking a different version... but if another library needs a higher version, 
that higher version is apparently selected by Gradle (rather than stopping with 
a version conflict), which I discovered with Antlr; they upgraded from 
something below 62.2 to 70.1 in their version 4.9.3 :(

I can stick to 4.9.2 for now, but it would be nice to be able to upgrade in the 
semi-near future (i.e. waiting for 9.1 should do).

> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> ICU 69 was released, so i was playing with the upgrade just to test it out 
> and test out our regeneration.
> Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks 
> were SKIPPED by the build.
> So I'm curious if the ICU version can be treated as an "input" to these 
> tasks, such that if it changes, tasks know the generated output is stale?






[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-11-15 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444016#comment-17444016
 ] 

Robert Muir commented on LUCENE-9921:
-

You can't just override the advertised dependency version and upgrade the 
library yourself to an arbitrary version. It won't work.

> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> ICU 69 was released, so i was playing with the upgrade just to test it out 
> and test out our regeneration.
> Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks 
> were SKIPPED by the build.
> So I'm curious if the ICU version can be treated as an "input" to these 
> tasks, such that if it changes, tasks know the generated output is stale?






[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-11-15 Thread Arjen (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444011#comment-17444011
 ] 

Arjen commented on LUCENE-9921:
---

If the window is closing on 9, I agree it would be too risky.

My reason for looking for a ticket like this was the {{utr30.nrm}} file used 
in {{ICUFoldingFilter}}. That seems to trigger a version check with ICU 70.1:
{code:java}
Caused by: com.ibm.icu.util.ICUUncheckedIOException: java.io.IOException: ICU 
data file error: Header authentication failed, please check if you have a valid 
ICU data file; data format 4e726d32, format version 3.0.0.0
    at app//com.ibm.icu.impl.Normalizer2Impl.load(Normalizer2Impl.java:506)
    at 
app//com.ibm.icu.impl.Norm2AllModes$1.createInstance(Norm2AllModes.java:351)
    at 
app//com.ibm.icu.impl.Norm2AllModes$1.createInstance(Norm2AllModes.java:344)
    at app//com.ibm.icu.impl.SoftCache.getInstance(SoftCache.java:69)
    at app//com.ibm.icu.impl.Norm2AllModes.getInstance(Norm2AllModes.java:341)
    at app//com.ibm.icu.text.Normalizer2.getInstance(Normalizer2.java:202)
    at 
app//org.apache.lucene.analysis.icu.ICUFoldingFilter.<init>(ICUFoldingFilter.java:72)
    ... 6 more
Caused by: java.io.IOException: ICU data file error: Header authentication 
failed, please check if you have a valid ICU data file; data format 4e726d32, 
format version 3.0.0.0
    at com.ibm.icu.impl.ICUBinary.readHeader(ICUBinary.java:606)
    at com.ibm.icu.impl.ICUBinary.readHeaderAndDataVersion(ICUBinary.java:557)
    at com.ibm.icu.impl.Normalizer2Impl.load(Normalizer2Impl.java:453)
{code}
The line causing it is a custom analyzer that does this: {{new 
ICUFoldingFilter(tokenStream)}}. So if someone upgrades the ICU dependency 
manually (or uses another library that requires > 68.2), they end up with the 
above exception.

That may actually be yet another reason for a new issue, but I don't know how 
to test this particular code with version 70.1 to see if it's simply a matter 
of regenerating the particular file or not :)

> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> ICU 69 was released, so i was playing with the upgrade just to test it out 
> and test out our regeneration.
> Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks 
> were SKIPPED by the build.
> So I'm curious if the ICU version can be treated as an "input" to these 
> tasks, such that if it changes, tasks know the generated output is stale?






[jira] [Commented] (LUCENE-8739) ZSTD Compressor support in Lucene

2021-11-15 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443976#comment-17443976
 ] 

Adrien Grand commented on LUCENE-8739:
--

Side thought: it would be nice to use Project Panama's Foreign linker when it 
gets released instead of depending on this JNI library.

> ZSTD Compressor support in Lucene
> -
>
> Key: LUCENE-8739
> URL: https://issues.apache.org/jira/browse/LUCENE-8739
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/codecs
>Reporter: Sean Torres
>Priority: Minor
>  Labels: features
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> ZStandard has a great speed and compression ratio tradeoff. 
> ZStandard is open source compression from Facebook.
> More about ZSTD
> [https://github.com/facebook/zstd]
> [https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/]






[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-11-15 Thread Robert Muir (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443972#comment-17443972
 ] 

Robert Muir commented on LUCENE-9921:
-

I agree, let's not rush this in close to a release. Note that we don't need to 
upgrade this jar for the 37 new Unicode 14 emoji to work; they will already be 
tokenized/tagged correctly due to the way preallocation happens in Unicode:

See test:
https://github.com/apache/lucene/blob/main/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizer.java#L506-L519

See list:
https://s.apache.org/pqnnc



> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> ICU 69 was released, so i was playing with the upgrade just to test it out 
> and test out our regeneration.
> Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks 
> were SKIPPED by the build.
> So I'm curious if the ICU version can be treated as an "input" to these 
> tasks, such that if it changes, tasks know the generated output is stale?






[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array

2021-11-15 Thread Greg Miller (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443965#comment-17443965
 ] 

Greg Miller commented on LUCENE-10122:
--

Yeah, +1 to moving to doc values. Even if we see minor taxonomy size growth, 
it's a more sensible data structure for this use-case. Taxonomy indices are 
generally quite small anyway (compared to the main index), so I'd rather align 
the use-case with an appropriate data structure and then see if we can optimize 
it over time.

> Explore using NumericDocValue to store taxonomy parent array
> 
>
> Key: LUCENE-10122
> URL: https://issues.apache.org/jira/browse/LUCENE-10122
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: modules/facet
>Affects Versions: main (10.0)
>Reporter: Haoyu Zhai
>Priority: Minor
>  Time Spent: 2h
>  Remaining Estimate: 0h
>
> We currently use term position of a hardcoded term in a hardcoded field to 
> represent the parent ordinal of each taxonomy label. That is an old way and 
> perhaps could be dated back to the time where doc values didn't exist.
> We probably would want to use NumericDocValues instead given we have spent 
> quite a lot of effort optimizing them.






[jira] [Commented] (LUCENE-8739) ZSTD Compressor support in Lucene

2021-11-15 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443964#comment-17443964
 ] 

Adrien Grand commented on LUCENE-8739:
--

I ran your PR with the new stored fields benchmark to see how codecs compare:

||Codec ||Indexing time (ms) ||Disk usage (MB) || Retrieval time per 10k docs 
(ms) ||
| BEST_SPEED | 35383 | 90.175 | 190.17524 |
| BEST_COMPRESSION (vanilla zlib) | 76671 | 58.682 | 1910.42106 |
| BEST_COMPRESSION (Cloudflare zlib) | 54791 | 58.601 | 1395.53593 |
| ZSTD (level=1) | 42433 | 70.527 | 240.04036 |
| ZSTD (level=3) | 53426 | 68.737 | 259.61897 |
| ZSTD (level=6) | 100697 | 66.283 | 251.91177 |

From a quick look at your PR, it looks like you are not using dictionaries, 
which would explain why we're seeing a worse compression ratio?
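
For reference, here is a minimal sketch of dictionary-based compression with 
zstd-jni, assuming its ZstdDictTrainer/ZstdCompressCtx APIs (buffer and 
dictionary sizes below are arbitrary):

    import com.github.luben.zstd.ZstdCompressCtx;
    import com.github.luben.zstd.ZstdDictTrainer;
    import java.util.List;

    // Sketch only: train a shared dictionary on sample chunks, then compress with it.
    static byte[] compressWithDict(List<byte[]> samples, byte[] input) {
      ZstdDictTrainer trainer = new ZstdDictTrainer(1024 * 1024, 16 * 1024); // sample buffer, dict size
      for (byte[] sample : samples) {
        trainer.addSample(sample);
      }
      byte[] dict = trainer.trainSamples(); // may fail if samples are insufficient
      try (ZstdCompressCtx ctx = new ZstdCompressCtx()) {
        ctx.setLevel(3);      // zstd's default-ish level, per the table above
        ctx.loadDict(dict);   // reuse the trained dictionary across blocks
        return ctx.compress(input);
      }
    }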

> ZSTD Compressor support in Lucene
> -
>
> Key: LUCENE-8739
> URL: https://issues.apache.org/jira/browse/LUCENE-8739
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/codecs
>Reporter: Sean Torres
>Priority: Minor
>  Labels: features
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> ZStandard has a great speed and compression ratio tradeoff. 
> ZStandard is open source compression from Facebook.
> More about ZSTD
> [https://github.com/facebook/zstd]
> [https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10235) LRUQueryCache should not count never-cacheable queries as a miss

2021-11-15 Thread Yannick Welsch (Jira)
Yannick Welsch created LUCENE-10235:
---

 Summary: LRUQueryCache should not count never-cacheable queries as 
a miss
 Key: LUCENE-10235
 URL: https://issues.apache.org/jira/browse/LUCENE-10235
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Yannick Welsch


Hit and miss counts of a cache are typically used to check how effective a 
caching layer is. While looking at a system that exhibited a very high miss to 
hit ratio, I took a closer look at Lucene's LRUQueryCache and noticed that it 
counts as a miss even queries that it would never consider caching in the first 
place (e.g. TermQuery and others mentioned in 
UsageTrackingQueryCachingPolicy.shouldNeverCache).

The reason these are counted as a miss is that LRUQueryCache (scorerSupplier 
and bulkScorer methods) first does a lookup on the cache, incrementing hit or 
miss counters, and upon miss, only then checks QueryCachingPolicy.shouldCache 
to decide whether that query should be put into the cache.

This issue is made more complex by the fact that QueryCachingPolicy.shouldCache 
is a stateful method, and cacheability of a query can change over time (e.g. 
after appearing N times).

I'm opening this issue to discuss whether others also feel that the current way 
of accounting misses is unintuitive / confusing. I would also like to put 
forward a proposal (sketched in code after this list) to:
 * generalize the boolean QueryCachingPolicy.shouldCache method to return an 
enum instead (one of YES, NOT_RIGHT_NOW, NEVER), and only account queries that 
are (eventually) cacheable and not in the cache as a miss,
 * optionally introduce another metric for queries that are never cacheable, 
e.g. "ignored", and
 * optionally refine miss count into a count for items that are cacheable right 
away, and those that will eventually be cacheable.
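
For illustration, the generalized policy could look roughly like this (a 
minimal sketch; CacheDecision and ProposedCachingPolicy are made-up names, not 
existing Lucene API):

    import org.apache.lucene.search.Query;

    // Sketch only: a three-way answer instead of boolean shouldCache(Query).
    enum CacheDecision { YES, NOT_RIGHT_NOW, NEVER }

    interface ProposedCachingPolicy {
      CacheDecision shouldCache(Query query);
    }

    // Inside LRUQueryCache, accounting would then distinguish the three cases:
    //   NEVER         -> ignoredCount++ (never counted as a miss)
    //   NOT_RIGHT_NOW -> missCount++   (eventually cacheable, not cached yet)
    //   YES           -> missCount++ and insert into the cache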



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API

2021-11-15 Thread Michael McCandless (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443953#comment-17443953
 ] 

Michael McCandless commented on LUCENE-10216:
-

I like this plan (extending {{MergePolicy}} so it also has purview over how 
merging is done in {{addIndexes(CodecReader[])}}).

Reference counting might get tricky if {{OneMerge}} or {{IndexWriter}} holding 
completed {{OneMerge}} instances tries to {{decRef}} readers.

To improve testing we could create a new {{LuceneTestCase}} method to 
{{addIndexes}} from {{Directory[]}} that randomly does so with both impls and 
fix tests to sometimes use that for adding indices.
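
For context, the caller-side pattern described in the issue below could be 
sketched like this (error handling elided; the pool size is arbitrary, and 
this is not a proposed Lucene API):

    import java.util.ArrayList;
    import java.util.List;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;
    import org.apache.lucene.index.CodecReader;
    import org.apache.lucene.index.IndexWriter;

    // Sketch only: run one addIndexes(CodecReader) call per reader concurrently,
    // so each incoming segment is rewritten (and becomes copyable) independently.
    static void concurrentAddIndexes(IndexWriter writer, List<CodecReader> readers)
        throws Exception {
      ExecutorService pool = Executors.newFixedThreadPool(4); // pool size is arbitrary
      List<Future<?>> futures = new ArrayList<>();
      for (CodecReader reader : readers) {
        futures.add(pool.submit(() -> {
          writer.addIndexes(reader); // blocking 1:1 rewrite of this reader
          return null;
        }));
      }
      for (Future<?> f : futures) {
        f.get(); // surface failures and wait for all rewrites to finish
      }
      pool.shutdown();
    }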

> Add concurrency to addIndexes(CodecReader…) API
> ---
>
> Key: LUCENE-10216
> URL: https://issues.apache.org/jira/browse/LUCENE-10216
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/index
>Reporter: Vigya Sharma
>Priority: Major
>
> I work at Amazon Product Search, and we use Lucene to power search for the 
> e-commerce platform. I’m working on a project that involves applying 
> metadata+ETL transforms and indexing documents on n different _indexing_ 
> boxes, combining them into a single index on a separate _reducer_ box, and 
> making it available for queries on m different _search_ boxes (replicas). 
> Segments are asynchronously copied from indexers to reducers to searchers as 
> they become available for the next layer to consume.
> I am using the addIndexes API to combine multiple indexes into one on the 
> reducer boxes. Since we also have taxonomy data, we need to remap facet field 
> ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version 
> of this API. The API leverages {{SegmentMerger.merge()}} to create segments 
> with new ordinal values while also merging all provided segments in the 
> process.
> _This is however a blocking call that runs in a single thread._ Until we have 
> written segments with new ordinal values, we cannot copy them to searcher 
> boxes, which increases the time to make documents available for search.
> I was playing around with the API by creating multiple concurrent merges, 
> each with only a single reader, creating a concurrently running 1:1 
> conversion from old segments to new ones (with new ordinal values). We follow 
> this up with non-blocking background merges. This lets us copy the segments 
> to searchers and replicas as soon as they are available, and later replace 
> them with merged segments as background jobs complete. On the Amazon dataset 
> I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. 
> Each call was given about 5 readers to add on average.
> This might be a useful addition to Lucene. We could create another {{addIndexes()}} 
> API with a {{boolean}} flag for concurrency, that internally submits multiple 
> merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, 
> and waits for them to complete before returning.
> While this is doable from outside Lucene by using your thread pool, starting 
> multiple addIndexes() calls and waiting for them to complete, I felt it needs 
> some understanding of what addIndexes does, why you need to wait on the merge 
> and why it makes sense to pass a single reader in the addIndexes API.
> Out-of-the-box support in Lucene could simplify this for folks with a similar use case.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?

2021-11-15 Thread Dawid Weiss (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443934#comment-17443934
 ] 

Dawid Weiss commented on LUCENE-9921:
-

I think it's too late for 9.0 - too little testing would be run before it's 
shipped to risk the upgrade (?). But for 9.1 - certainly.

> Can ICU regeneration tasks treat icu version as input?
> --
>
> Key: LUCENE-9921
> URL: https://issues.apache.org/jira/browse/LUCENE-9921
> Project: Lucene - Core
>  Issue Type: Task
>Reporter: Robert Muir
>Priority: Major
>
> ICU 69 was released, so I was playing with the upgrade just to test it out 
> and test out our regeneration.
> Running {{gradlew regenerate}} naively wasn't helpful; regeneration tasks 
> were SKIPPED by the build.
> So I'm curious if the ICU version can be treated as an "input" to these 
> tasks, such that if it changes, tasks know the generated output is stale?



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Resolved] (LUCENE-10234) Add automatic module name to JAR manifests.

2021-11-15 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss resolved LUCENE-10234.
--
Fix Version/s: 9.0
   Resolution: Fixed

> Add automatic module name to JAR manifests.
> ---
>
> Key: LUCENE-10234
> URL: https://issues.apache.org/jira/browse/LUCENE-10234
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
> Fix For: 9.0
>
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is the first step to make Lucene a proper fit for the Java module 
> system. I chose a shorthand "lucene.[x]" module name convention, without the 
> "org.apache" prefix.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10234) Add automatic module name to JAR manifests.

2021-11-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443919#comment-17443919
 ] 

ASF subversion and git services commented on LUCENE-10234:
--

Commit 9d0eb88d2cd5339d4988417b5e46789a23d60d6f in lucene's branch 
refs/heads/branch_9x from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9d0eb88 ]

LUCENE-10234: Add automatic module name to JAR manifests. (#440)



> Add automatic module name to JAR manifests.
> ---
>
> Key: LUCENE-10234
> URL: https://issues.apache.org/jira/browse/LUCENE-10234
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Trivial
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is the first step to make Lucene a proper fit for the Java module 
> system. I chose a shorthand "lucene.[x]" module name convention, without the 
> "org.apache" prefix.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Assigned] (LUCENE-10234) Add automatic module name to JAR manifests.

2021-11-15 Thread Dawid Weiss (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Dawid Weiss reassigned LUCENE-10234:


Assignee: Dawid Weiss

> Add automatic module name to JAR manifests.
> ---
>
> Key: LUCENE-10234
> URL: https://issues.apache.org/jira/browse/LUCENE-10234
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Assignee: Dawid Weiss
>Priority: Trivial
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is the first step to make Lucene a proper fit for the Java module 
> system. I chose a shorthand "lucene.[x]" module name convention, without the 
> "org.apache" prefix.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10234) Add automatic module name to JAR manifests.

2021-11-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443918#comment-17443918
 ] 

ASF subversion and git services commented on LUCENE-10234:
--

Commit bf8072c1e9465ab491aa290baf8c1ac53e22d41c in lucene's branch 
refs/heads/branch_9_0 from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bf8072c ]

LUCENE-10234: Add automatic module name to JAR manifests. (#440)



> Add automatic module name to JAR manifests.
> ---
>
> Key: LUCENE-10234
> URL: https://issues.apache.org/jira/browse/LUCENE-10234
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Trivial
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is the first step to make Lucene a proper fit for the Java module 
> system. I chose a shorthand "lucene.[x]" module name convention, without the 
> "org.apache" prefix.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10234) Add automatic module name to JAR manifests.

2021-11-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443917#comment-17443917
 ] 

ASF subversion and git services commented on LUCENE-10234:
--

Commit f5e5cf008aa791e0add8e1cc71ef2447b6b54c46 in lucene's branch 
refs/heads/main from Dawid Weiss
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f5e5cf0 ]

LUCENE-10234: Add automatic module name to JAR manifests. (#440)



> Add automatic module name to JAR manifests.
> ---
>
> Key: LUCENE-10234
> URL: https://issues.apache.org/jira/browse/LUCENE-10234
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Dawid Weiss
>Priority: Trivial
>  Time Spent: 0.5h
>  Remaining Estimate: 0h
>
> This is the first step to make Lucene a proper fit for the Java module 
> system. I chose a shorthand "lucene.[x]" module name convention, without the 
> "org.apache" prefix.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss merged pull request #440: LUCENE-10234: Add automatic module name to JAR manifests.

2021-11-15 Thread GitBox


dweiss merged pull request #440:
URL: https://github.com/apache/lucene/pull/440


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.

2021-11-15 Thread GitBox


jpountz commented on a change in pull request #430:
URL: https://github.com/apache/lucene/pull/430#discussion_r749462492



##
File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java
##
@@ -185,6 +204,115 @@ private void shuffle(int from, int to) {
 }
   }
 
+  /** Selects the k-th entry with a bottom-k algorithm, given that k is close 
to {@code from}. */

Review comment:
   Interesting. When I raised this question, I had in mind borrowing 
ideas from interpolation search 
(https://en.wikipedia.org/wiki/Binary_search_algorithm#Interpolation_search) 
and e.g. seeing if picking the lowest of the 3 medians when `k-from < (to - from) 
/ 8` and the highest of the 3 medians when `to-k < (to - from) / 8` makes 
things any faster.
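
   Something like the following perhaps, where m1, m2 and m3 are the three 
group medians the ninther already computes (a sketch of the idea, not 
IntroSelector's actual code):

    // Sketch only: bias the ninther's final choice toward a low (or high)
    // pivot when k sits very close to one end of [from, to).
    static int choosePivot(int m1, int m2, int m3, int from, int to, int k) {
      int lo = Math.min(m1, Math.min(m2, m3));
      int hi = Math.max(m1, Math.max(m2, m3));
      // middle of the three medians, computed without summing (avoids overflow)
      int mid = Math.max(Math.min(m1, m2), Math.min(Math.max(m1, m2), m3));
      int len = to - from;
      if (k - from < len / 8) {
        return lo; // k near 'from': a low pivot keeps the left partition small,
                   // which is where the k-th entry most likely lives
      } else if (to - k < len / 8) {
        return hi; // k near 'to': symmetric reasoning for the right partition
      }
      return mid;  // otherwise, the classic median-of-medians choice
    }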




-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-8739) ZSTD Compressor support in Lucene

2021-11-15 Thread Praveen Nishchal (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443912#comment-17443912
 ] 

Praveen Nishchal commented on LUCENE-8739:
--

I have created a pull request - [https://github.com/apache/lucene/pull/439]

I am using Zstd-JNI [https://github.com/luben/zstd-jni] in a new custom codec 
which integrates Zstd compression and decompression in StoredFieldsFormat.

> ZSTD Compressor support in Lucene
> -
>
> Key: LUCENE-8739
> URL: https://issues.apache.org/jira/browse/LUCENE-8739
> Project: Lucene - Core
>  Issue Type: New Feature
>  Components: core/codecs
>Reporter: Sean Torres
>Priority: Minor
>  Labels: features
>  Time Spent: 1h 10m
>  Remaining Estimate: 0h
>
> ZStandard has a great speed and compression ratio tradeoff. 
> ZStandard is open source compression from Facebook.
> More about ZSTD
> [https://github.com/facebook/zstd]
> [https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/]



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on pull request #2607: SOLR-15794: Switching a PRS collection from true -> false -> true results in INACTIVE replicas

2021-11-15 Thread GitBox


jpountz commented on pull request #2607:
URL: https://github.com/apache/lucene-solr/pull/2607#issuecomment-969035659


   @noblepaul Note that it should be backported to `branch_8_11` if you want 
this change to ever be released.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on pull request #2609: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool

2021-11-15 Thread GitBox


jpountz commented on pull request #2609:
URL: https://github.com/apache/lucene-solr/pull/2609#issuecomment-969034069


   Also it looks like you missed backporting to `branch_8_11`. `branch_8x` is 
essentially dead at this point since 8.11 is going to be the last 8.x minor 
release.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] jpountz commented on pull request #2609: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool

2021-11-15 Thread GitBox


jpountz commented on pull request #2609:
URL: https://github.com/apache/lucene-solr/pull/2609#issuecomment-969033135


   You need to move the CHANGES entry to 8.11.1 since the 8.11.0 release won't 
have this change.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-15 Thread Adrien Grand (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443902#comment-17443902
 ] 

Adrien Grand commented on LUCENE-10233:
---

This is an interesting idea!

One drawback of this approach is that we're trying to keep the number of 
classes that implement oal.util.BitSet at 2, and this would be a 3rd one. I 
wonder if we could use SparseFixedBitSet instead of this new OffsetBitSet class?
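
For reference, the encoding idea itself is small (a sketch, not the PR's 
actual code): anchor a FixedBitSet at min(docId) so it stays dense:

    import org.apache.lucene.util.FixedBitSet;

    // Sketch only: when every doc in the leaf shares one point value, store
    // the ids as a bitset relative to the smallest doc id.
    static FixedBitSet toOffsetBitSet(int[] docIds, int count) {
      int min = docIds[0], max = docIds[0];
      for (int i = 1; i < count; i++) {
        min = Math.min(min, docIds[i]);
        max = Math.max(max, docIds[i]);
      }
      FixedBitSet bits = new FixedBitSet(max - min + 1);
      for (int i = 0; i < count; i++) {
        bits.set(docIds[i] - min); // dense as long as max - min <= 16 * count
      }
      return bits; // addAll() can then OR these bits (shifted by 'min') into the result
    }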

> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and {{intersect}} will get into {{addAll}} logic. 
> If we store ids as bitset, and give the IntersectVisitor bulk visiting 
> ability, we can speed up addAll because we can just execute the 'or' logic 
> between the result and the block ids.
> Optimization will be triggered when the following conditions are met at the 
> same time:
>  # leafCardinality = 1
>  # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding 
> too much storage)
>  # no duplicate doc id
> I mocked a field that has 10,000,000 docs per value and search it with a 1 
> term PointInSetQuery, the build scorer time decreased from 71ms to 8ms.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10085) Implement Weight#count on DocValuesFieldExistsQuery

2021-11-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443897#comment-17443897
 ] 

ASF subversion and git services commented on LUCENE-10085:
--

Commit e034a2d6e2913ae10ac38434240ed8e9b9aa19ad in lucene's branch 
refs/heads/branch_9x from Quentin Pradet
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e034a2d ]

LUCENE-10085: Rename DocValuesFieldExistsQuery test (#441)

FieldValueQuery got renamed to DocValuesFieldExistsQuery but the test
wasn't renamed.

> Implement Weight#count on DocValuesFieldExistsQuery
> ---
>
> Key: LUCENE-10085
> URL: https://issues.apache.org/jira/browse/LUCENE-10085
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 20m
>  Remaining Estimate: 0h
>
> Now that we require all documents to use the same features (LUCENE-9334) we 
> could implement {{Weight#count}} to return docCount if either terms or points 
> are indexed.
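
For illustration, the idea from the description amounts to roughly this (a 
sketch, not the committed patch; countDocsWithField is a made-up helper name):

    import java.io.IOException;
    import org.apache.lucene.index.LeafReader;
    import org.apache.lucene.index.LeafReaderContext;
    import org.apache.lucene.index.PointValues;
    import org.apache.lucene.index.Terms;

    // Sketch only: the value a Weight#count implementation could return,
    // or -1 when the count is unknown.
    static int countDocsWithField(String field, LeafReaderContext context)
        throws IOException {
      LeafReader reader = context.reader();
      Terms terms = reader.terms(field);
      if (terms != null) {
        // since LUCENE-9334 all docs in a segment use the same features for a
        // field, so the terms docCount equals the doc-values docCount
        return terms.getDocCount();
      }
      PointValues points = reader.getPointValues(field);
      if (points != null) {
        return points.getDocCount();
      }
      return -1; // unknown; the caller falls back to counting matches
    }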



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Commented] (LUCENE-10085) Implement Weight#count on DocValuesFieldExistsQuery

2021-11-15 Thread ASF subversion and git services (Jira)


[ 
https://issues.apache.org/jira/browse/LUCENE-10085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443895#comment-17443895
 ] 

ASF subversion and git services commented on LUCENE-10085:
--

Commit 1e5e997880ee30a94f0fc2fd15cc071f9ef43c29 in lucene's branch 
refs/heads/main from Quentin Pradet
[ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1e5e997 ]

LUCENE-10085: Rename DocValuesFieldExistsQuery test (#441)

FieldValueQuery got renamed to DocValuesFieldExistsQuery but the test
wasn't renamed.

> Implement Weight#count on DocValuesFieldExistsQuery
> ---
>
> Key: LUCENE-10085
> URL: https://issues.apache.org/jira/browse/LUCENE-10085
> Project: Lucene - Core
>  Issue Type: Improvement
>Reporter: Adrien Grand
>Priority: Minor
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> Now that we require all documents to use the same features (LUCENE-9334) we 
> could implement {{Weight#count}} to return docCount if either terms or points 
> are indexed.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz merged pull request #441: LUCENE-10085: Rename DocValuesFieldExistsQuery test

2021-11-15 Thread GitBox


jpountz merged pull request #441:
URL: https://github.com/apache/lucene/pull/441


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] jpountz commented on pull request #440: LUCENE-10234: Add automatic module name to JAR manifests.

2021-11-15 Thread GitBox


jpountz commented on pull request #440:
URL: https://github.com/apache/lucene/pull/440#issuecomment-968952307


   Please do!


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] pquentin opened a new pull request #441: LUCENE-10085: Rename DocValuesFieldExistsQuery test

2021-11-15 Thread GitBox


pquentin opened a new pull request #441:
URL: https://github.com/apache/lucene/pull/441


   FieldValueQuery got renamed to DocValuesFieldExistsQuery but the test
   wasn't renamed.


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] dweiss commented on pull request #440: LUCENE-10234: Add automatic module name to JAR manifests.

2021-11-15 Thread GitBox


dweiss commented on pull request #440:
URL: https://github.com/apache/lucene/pull/440#issuecomment-968862647


   This is what I'd like to add to branch_9_0 (and branch_9x), @jpountz. Can I?


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Created] (LUCENE-10234) Add automatic module name to JAR manifests.

2021-11-15 Thread Dawid Weiss (Jira)
Dawid Weiss created LUCENE-10234:


 Summary: Add automatic module name to JAR manifests.
 Key: LUCENE-10234
 URL: https://issues.apache.org/jira/browse/LUCENE-10234
 Project: Lucene - Core
  Issue Type: Improvement
Reporter: Dawid Weiss


This is the first step to make Lucene a proper fit for the Java module system. 
I chose a shorthand "lucene.[x]" module name convention, without the 
"org.apache" prefix.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-15 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and {{intersect}} will get into {{addAll}} logic. If we 
store ids as bitset, and give the IntersectVisitor bulk visiting ability, we 
can speed up addAll because we can just execute the 'or' logic between the 
result and the block ids.

Optimization will be triggered when the following conditions are met at the 
same time:
 # leafCardinality = 1
 # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding too 
much storage)
 # no duplicate doc id

I mocked a field that has 10,000,000 docs per value and search it with a 1 term 
PointInSetQuery, the build scorer time decreased from 71ms to 8ms.

(WIP, Just post this first to see whether you think this optimization makes 
sense)

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk visit 
ability, we can speed up addAll because we can just execute the 'or' logic 
between the result and the block ids.

I mocked a field that has 10,000,000 docs per value and search it with a 
PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can make this optimization 
apply only when a block's (max-min) <= n * count?)
2. MergeReader will become a bit slower because it needs to iterate docIds one 
by one. 


> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and {{intersect}} will get into {{addAll}} logic. 
> If we store ids as bitset, and give the IntersectVisitor bulk visiting 
> ability, we can speed up addAll because we can just execute the 'or' logic 
> between the result and the block ids.
> Optimization will be triggered when the following conditions are met at the 
> same time:
>  # leafCardinality = 1
>  # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding 
> too much storage)
>  # no duplicate doc id
> I mocked a field that has 10,000,000 docs per value and search it with a 1 
> term PointInSetQuery, the build scorer time decreased from 71ms to 8ms.
> (WIP, Just post this first to see whether you think this optimization makes 
> sense)



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-15 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and {{intersect}} will get into {{addAll}} logic. If we 
store ids as bitset, and give the IntersectVisitor bulk visiting ability, we 
can speed up addAll because we can just execute the 'or' logic between the 
result and the block ids.

Optimization will be triggered when the following conditions are met at the 
same time:
 # leafCardinality = 1
 # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding too 
much storage)
 # no duplicate doc id

I mocked a field that has 10,000,000 docs per value and search it with a 1 term 
PointInSetQuery, the build scorer time decreased from 71ms to 8ms.

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and {{intersect}} will get into {{addAll}} logic. If we 
store ids as bitset, and give the IntersectVisitor bulk visiting ability, we 
can speed up addAll because we can just execute the 'or' logic between the 
result and the block ids.

Optimization will be triggered when the following conditions are met at the 
same time:
 # leafCardinality = 1
 # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding too 
much storage)
 # no duplicate doc id

I mocked a field that has 10,000,000 docs per value and search it with a 1 term 
PointInSetQuery, the build scorer time decreased from 71ms to 8ms.

(WIP, Just post this first to see whether you think this optimization makes 
sense)


> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and {{intersect}} will get into {{addAll}} logic. 
> If we store ids as bitset, and give the IntersectVisitor bulk visiting 
> ability, we can speed up addAll because we can just execute the 'or' logic 
> between the result and the block ids.
> Optimization will be triggered when the following conditions are met at the 
> same time:
>  # leafCardinality = 1
>  # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding 
> too much storage)
>  # no duplicate doc id
> I mocked a field that has 10,000,000 docs per value and search it with a 1 
> term PointInSetQuery, the build scorer time decreased from 71ms to 8ms.



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene-solr] mkhludnev merged pull request #2609: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool

2021-11-15 Thread GitBox


mkhludnev merged pull request #2609:
URL: https://github.com/apache/lucene-solr/pull/2609


   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-15 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk visit 
ability, we can speed up addAll because we can just execute the 'or' logic 
between the result and the block ids.

I mocked a field that has 10,000,000 docs per value and search it with a 
PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can make this optimization 
apply only when a block's (max-min) <= n * count?)
2. MergeReader will become a bit slower because it needs to iterate docIds one 
by one. 

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk visit 
ability, we can speed up addAll because we can just execute the 'or' logic 
between the result and the block ids.

I mocked a field that has 10,000,000 docs per value and search it with a 
PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can make this optimization 
apply only when a block's (max-min) <= n * count?)
2. MergeReader will become slower because it needs to iterate docIds one by 
one. 


> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when the leafCardinality = 1, and give the 
> IntersectVisitor bulk visit ability, we can speed up addAll because we can 
> just execute the 'or' logic between the result and the block ids.
> I mocked a field that has 10,000,000 docs per value and search it with a 
> PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms.
> Concerns:
> 1. Bitset could occupy more disk space. (Maybe we can make this optimization 
> apply only when a block's (max-min) <= n * count?)
> 2. MergeReader will become a bit slower because it needs to iterate docIds 
> one by one. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll

2021-11-15 Thread Feng Guo (Jira)


 [ 
https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Feng Guo updated LUCENE-10233:
--
Description: 
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk visit 
ability, we can speed up addAll because we can just execute the 'or' logic 
between the result and the block ids.

I mocked a field that has 10,000,000 docs per value and search it with a 
PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can make this optimization 
apply only when a block's (max-min) <= n * count?)
2. MergeReader will become slower because it needs to iterate docIds one by 
one. 

  was:
In low cardinality points cases, id blocks will usually store doc ids that have 
the same point value, and intersect will get into addAll logic. If we store ids 
as bitset when the leafCardinality = 1, and give the IntersectVisitor bulk 
visiting ability (something like visit(DocIdSetIterator iterator)), we can speed 
up addAll because we can just execute the 'or' logic between the result and the 
block ids.

Concerns:
1. Bitset could occupy more disk space. (Maybe we can make this optimization 
apply only when a block's (max-min) <= n * count?)
2. MergeReader will become slower because it needs to iterate docIds one by 
one. 


> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> --
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
>  Issue Type: Improvement
>  Components: core/codecs
>Reporter: Feng Guo
>Priority: Major
>  Time Spent: 10m
>  Remaining Estimate: 0h
>
> In low cardinality points cases, id blocks will usually store doc ids that 
> have the same point value, and intersect will get into addAll logic. If we 
> store ids as bitset when the leafCardinality = 1, and give the 
> IntersectVisitor bulk visit ability, we can speed up addAll because we can 
> just execute the 'or' logic between the result and the block ids.
> I mocked a field that has 10,000,000 docs per value and search it with a 
> PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms.
> Concerns:
> 1. Bitset could occupy more disk space. (Maybe we can make this optimization 
> apply only when a block's (max-min) <= n * count?)
> 2. MergeReader will become slower because it needs to iterate docIds one by 
> one. 



--
This message was sent by Atlassian Jira
(v8.20.1#820001)

-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org



[GitHub] [lucene] praveennish opened a new pull request #439: LUCENE-8739: custom codec providing Zstandard compression/decompression

2021-11-15 Thread GitBox


praveennish opened a new pull request #439:
URL: https://github.com/apache/lucene/pull/439


   
   
   
   # Description
   
   Lucene currently supports LZ4 and Zlib compression/decompression for 
StoredFieldsFormat. We propose Zstandard (https://facebook.github.io/zstd/) 
compression/decompression for StoredFieldsFormat for the following reasons:
   
   * ZStandard is being used in some of the most popular open source projects 
like Apache Cassandra, Hadoop, and Kafka.
   * Zstandard, at the default setting of 3, is expected to show substantial 
improvements in both compression and decompression speed while compressing at 
the same ratio as Zlib, as per the study by Yann Collet at Facebook
   
(https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/).
   * Zstandard currently offers 22 different compression levels, which enable 
flexible, granular trade-offs between compression speed and ratios for future 
data. This solution also provides the flexibility of choosing compression 
levels between 1 and 22. For example, we can use level 1 if speed is most 
important and level 22 if the size is most important.
   * Zstandard is designed to scale with modern hardware 
(https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/).
   
   # Solution
   
   * Developed a custom codec to enable Zstandard compression/decompression 
support.
   * This custom codec also provides the flexibility to add other compression 
algorithms.
   
   # Tests
   
   * Added the required test case for the new custom codec. Ran it with the 
-Dtests.codec option for the newly developed custom codec.
   
   # Checklist
   
   Please review the following and check all that apply:
   
   - [X] I have reviewed the guidelines for [How to 
Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code 
conforms to the standards described there to the best of my ability.
   - [X] I have created a Jira issue and added the issue ID to my pull request 
title.
   - [X] I have given Lucene maintainers 
[access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork)
 to contribute to my PR branch. (optional but recommended)
   - [X] I have developed this patch against the `main` branch.
   - [X] I have run `./gradlew check`.
   - [X] I have added tests for my changes.
   


-- 
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org

For queries about this service, please contact Infrastructure at:
us...@infra.apache.org



-
To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org