[GitHub] [lucene] jimczi commented on a change in pull request #444: LUCENE-10236: Updated field-weight used in CombinedFieldQuery scoring calculation, and added a test
jimczi commented on a change in pull request #444: URL: https://github.com/apache/lucene/pull/444#discussion_r749986210

## File path: lucene/sandbox/src/java/org/apache/lucene/sandbox/search/MultiNormsLeafSimScorer.java

@@ -61,7 +63,14 @@
     if (needsScores) {
       final List<NumericDocValues> normsList = new ArrayList<>();
       final List<Float> weightList = new ArrayList<>();
+      final Set<String> duplicateCheckingSet = new HashSet<>();
       for (FieldAndWeight field : normFields) {
+        assert duplicateCheckingSet.contains(field.field) == false
+            : "There is duplicated field ["
+            + field.field
+            + "] used to construct MultiNormsLeafSimScorer";
+        duplicateCheckingSet.add(field.field);

Review comment: Could be `assert duplicateCheckingSet.add(field.field) == false` ?

--
This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org
For queries about this service, please contact Infrastructure at: us...@infra.apache.org
For additional commands, e-mail: issues-h...@lucene.apache.org
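The suggestion leans on the fact that `java.util.Set.add` returns `false` when the element is already present, so the `contains` check and the insertion can be folded into one call (note the assertion then wants the call to return `true` for a non-duplicate field). A standalone sketch of that idiom, with `FieldAndWeight` reduced to a plain string and field names made up:

```java
import java.util.HashSet;
import java.util.Set;

public class DuplicateFieldCheck {

  // Folds the contains() + add() pair from the patch into a single call:
  // Set.add returns true only when the element was newly inserted, so a
  // duplicate field trips the assertion (run with -ea to enable asserts).
  static void assertNoDuplicates(String[] normFields) {
    Set<String> duplicateCheckingSet = new HashSet<>();
    for (String field : normFields) {
      assert duplicateCheckingSet.add(field)
          : "There is duplicated field [" + field + "] used to construct MultiNormsLeafSimScorer";
    }
  }

  public static void main(String[] args) {
    assertNoDuplicates(new String[] {"title", "body"}); // distinct fields: passes
    Set<String> seen = new HashSet<>();
    System.out.println(seen.add("title")); // prints true: newly inserted
    System.out.println(seen.add("title")); // prints false: duplicate detected
  }
}
```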
[jira] [Comment Edited] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444070#comment-17444070 ] Feng Guo edited comment on LUCENE-10233 at 11/16/21, 7:43 AM:

[~jpountz] Thanks! +1 to remove the OffsetBitSet class, but I find there is currently no faster implementation of FixedBitSet.or(SparseFixedBitSet); do you mean we should implement one? In this case the bitset clearly follows the pattern [nothing][dense], so compared with SparseFixedBitSet I tend to use a more customized representation if possible. Another way to replace the OffsetBitSet class I can think of is to support a docBase in BitSetIterator; I implemented this in the newest commit: https://github.com/apache/lucene/pull/438. I wonder if this approach makes sense to you.

> Store docIds as bitset when leafCardinality = 1 to speed up addAll
> ------------------------------------------------------------------
>
> Key: LUCENE-10233
> URL: https://issues.apache.org/jira/browse/LUCENE-10233
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/codecs
> Reporter: Feng Guo
> Priority: Major
> Time Spent: 10m
> Remaining Estimate: 0h
>
> In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and {{intersect}} will get into {{addAll}} logic. If we store ids as a bitset and give the IntersectVisitor bulk visiting ability, we can speed up addAll because we can just execute the 'or' logic between the result and the block ids.
> Optimization will be triggered when the following conditions are met at the same time:
> # leafCardinality = 1
> # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding too much storage)
> # no duplicate doc id
>
> I mocked a field that has 10,000,000 docs per value and searched it with a 1-term PointInSetQuery; the build scorer time decreased from 71ms to 8ms.

--
This message was sent by Atlassian Jira (v8.20.1#820001)
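The [nothing][dense] layout and the docBase idea from the comment above can be mimicked with `java.util.BitSet`: the block's doc ids are kept as a dense bitset relative to a base doc id, and addAll becomes one pass over set bits instead of a per-id visit() callback. This is a plain-Java sketch, not the actual FixedBitSet/BKD code (which ORs whole 64-bit words at a time):

```java
import java.util.BitSet;

public class OffsetBitSetSketch {
  // Doc ids are stored relative to a base doc id, so the leading
  // [nothing] region of the pattern costs no bits at all.
  final int docBase;
  final BitSet bits;

  OffsetBitSetSketch(int docBase, int[] docIds) {
    this.docBase = docBase;
    this.bits = new BitSet();
    for (int doc : docIds) {
      bits.set(doc - docBase);
    }
  }

  // addAll analogue: merge every stored id into the result in one sweep.
  void orInto(BitSet result) {
    for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
      result.set(docBase + i);
    }
  }

  public static void main(String[] args) {
    BitSet result = new BitSet();
    new OffsetBitSetSketch(1_000_000, new int[] {1_000_001, 1_000_003}).orInto(result);
    System.out.println(result.cardinality()); // prints 2
  }
}
```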
[GitHub] [lucene] zacharymorn commented on pull request #418: LUCENE-10061: Implements dynamic pruning support for CombinedFieldsQuery
zacharymorn commented on pull request #418: URL: https://github.com/apache/lucene/pull/418#issuecomment-969846412

> Hi @jpountz @jimczi, when I was doing some more deep dive to see how this PR can be improved further, I noticed a difference in the field-weight parameter passed to create `MultiNormsLeafSimScorer` in `CombinedFieldQuery`:
>
> `CombinedFieldQuery#scorer`:
> https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421
>
> `CombinedFieldQuery#explain`:
> https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389
>
> For the `CombinedFieldQuery#scorer` one, `fields` may contain duplicated fields:
> https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414
> whereas `fieldAndWeights.values()` should not contain duplicated fields.
>
> I feel `CombinedFieldQuery#scorer` should be updated to use `fieldAndWeights.values()` as well? What do you think?

For now I have created a spin-off issue https://issues.apache.org/jira/browse/LUCENE-10236 and a separate PR https://github.com/apache/lucene/pull/444 for this. Please let me know if they look good to you.
[GitHub] [lucene] zacharymorn opened a new pull request #444: LUCENE-10236: Updated field-weight used in CombinedFieldQuery scoring calculation, and added a test
zacharymorn opened a new pull request #444: URL: https://github.com/apache/lucene/pull/444

# Description
Updated field-weight used in CombinedFieldQuery scoring calculation

# Tests
1. Added a new test from https://github.com/apache/lucene/pull/418
2. Ran `./gradlew clean; ./gradlew check -Pvalidation.git.failOnModified=false`

# Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [x] I have added tests for my changes.
[jira] [Updated] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
[ https://issues.apache.org/jira/browse/LUCENE-10236?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Zach Chen updated LUCENE-10236:
---
Description:

This is a spin-off issue from discussion in [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick fix in CombinedFieldsQuery scoring.

Currently CombinedFieldsQuery uses a constructed [fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421] object to create a MultiNormsLeafSimScorer for scoring, but the fields object may contain duplicated field-weight pairs as it is [built from looping over fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414], resulting in duplicated norms being added during the scoring calculation in MultiNormsLeafSimScorer. E.g. for a CombinedFieldsQuery with two fields and two values matching a particular doc:

{code:java}
CombinedFieldQuery query =
    new CombinedFieldQuery.Builder()
        .addField("field1", (float) 1.0)
        .addField("field2", (float) 1.0)
        .addTerm(new BytesRef("foo"))
        .addTerm(new BytesRef("zoo"))
        .build();
{code}

I would imagine the scoring to be based on the following:
# Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
# Sum of norms on doc = norm(field1) + norm(field2)

but the current logic uses the following for scoring:
# Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
# Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + norm(field2)

In addition, this differs from how MultiNormsLeafSimScorer is constructed in the CombinedFieldsQuery explain function, which [uses fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389] and does not contain duplicated field-weight pairs.

> CombinedFieldsQuery to use fieldAndWeights.values() when constructing
> MultiNormsLeafSimScorer for scoring
> ---------------------------------------------------------------------
>
> Key: LUCENE-10236
> URL: https://issues.apache.org/jira/browse/LUCENE-10236
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/sandbox
> Reporter: Zach Chen
> Assignee: Zach Chen
> Priority: Minor
>
> This is a spin-off issue from discussion in
> [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a
> quick fix in CombinedFieldsQuery scoring.
> Currently
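The difference between the two norm sums described above can be reproduced with plain collections: iterating per (term, field) pair counts each field's norm once per term, while deduplicating by field name (as `fieldAndWeights.values()` would) counts it once overall. Field names and norm values here are illustrative only:

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

public class NormSumSketch {
  public static void main(String[] args) {
    // One entry per matched (term, field) pair, as produced by looping
    // over fieldTerms: two terms x two fields = four entries.
    List<String> perTermFields = List.of("field1", "field2", "field1", "field2");
    Map<String, Long> norms = Map.of("field1", 7L, "field2", 5L);

    // Current scorer path: each field's norm is added once per term.
    long duplicated = 0;
    for (String f : perTermFields) {
      duplicated += norms.get(f); // 7 + 5 + 7 + 5
    }

    // Deduplicated path: each distinct field contributes its norm once.
    Map<String, Long> distinct = new LinkedHashMap<>();
    for (String f : perTermFields) {
      distinct.putIfAbsent(f, norms.get(f));
    }
    long deduplicated = distinct.values().stream().mapToLong(Long::longValue).sum();

    System.out.println(duplicated + " vs " + deduplicated); // prints 24 vs 12
  }
}
```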
[jira] [Created] (LUCENE-10236) CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
Zach Chen created LUCENE-10236:
---
Summary: CombinedFieldsQuery to use fieldAndWeights.values() when constructing MultiNormsLeafSimScorer for scoring
Key: LUCENE-10236
URL: https://issues.apache.org/jira/browse/LUCENE-10236
Project: Lucene - Core
Issue Type: Improvement
Components: modules/sandbox
Reporter: Zach Chen
Assignee: Zach Chen

This is a spin-off issue from discussion in [https://github.com/apache/lucene/pull/418#issuecomment-967790816], for a quick fix in CombinedFieldsQuery scoring.

Currently CombinedFieldsQuery uses a constructed [fields|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L420-L421] object to create a MultiNormsLeafSimScorer for scoring, but the fields object may contain duplicated field-weight pairs as it is [built from looping over fieldTerms|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L404-L414], resulting in duplicated norms being added during the scoring calculation in MultiNormsLeafSimScorer. E.g. for a CombinedFieldsQuery with two fields and two values matching a particular doc:

{code:java}
CombinedFieldQuery query =
    new CombinedFieldQuery.Builder()
        .addField("field1", (float) 1.0)
        .addField("field2", (float) 1.0)
        .addTerm(new BytesRef("foo"))
        .addTerm(new BytesRef("zoo"))
        .build();
{code}

I would imagine the scoring to be based on the following:
# Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
# Sum of norms on doc = norm(field1) + norm(field2)

but the current logic uses the following for scoring:
# Sum of freqs on doc = freq(field1:foo) + freq(field2:foo) + freq(field1:zoo) + freq(field2:zoo)
# Sum of norms on doc = norm(field1) + norm(field2) + norm(field1) + norm(field2)

In addition, this differs from how MultiNormsLeafSimScorer is constructed in the CombinedFieldsQuery explain function, which [uses fieldAndWeights.values()|https://github.com/apache/lucene/blob/3b914a4d73eea8923f823cbdb869de39213411dd/lucene/sandbox/src/java/org/apache/lucene/sandbox/search/CombinedFieldQuery.java#L387-L389] and does not contain duplicated field-weight pairs.
[jira] [Commented] (LUCENE-10212) Add luceneutil benchmark task for CombinedFieldsQuery
[ https://issues.apache.org/jira/browse/LUCENE-10212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444254#comment-17444254 ] Zach Chen commented on LUCENE-10212:

No problem [~julietibs]! Glad to be able to contribute!

> Add luceneutil benchmark task for CombinedFieldsQuery
> -----------------------------------------------------
>
> Key: LUCENE-10212
> URL: https://issues.apache.org/jira/browse/LUCENE-10212
> Project: Lucene - Core
> Issue Type: Task
> Reporter: Zach Chen
> Assignee: Zach Chen
> Priority: Minor
>
> This is a spin-off task from https://issues.apache.org/jira/browse/LUCENE-10061. In order to objectively evaluate performance changes for CombinedFieldsQuery, we would like to add a benchmark task and parsing for CombinedFieldsQuery.
> One proposal for the query syntax to enable CombinedFieldsQuery benchmarking would be the following:
> {code:java}
> taskName: term1 term2 term3 term4 +combinedFields=field1^boost1,field2^boost2,field3^boost3
> {code}
[jira] [Commented] (LUCENE-10062) Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting
[ https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444178#comment-17444178 ] Greg Miller commented on LUCENE-10062:

Here's the PR targeting 9.0: [https://github.com/apache/lucene/pull/443] Let's iterate there. Once it's merged, I'll pull the change over to 9x as well and then update my other PR against main with a much simpler version that removes the back-compat concerns with 8.x indices.

> Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting
> --------------------------------------------------------------------------------
>
> Key: LUCENE-10062
> URL: https://issues.apache.org/jira/browse/LUCENE-10062
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Greg Miller
> Assignee: Greg Miller
> Priority: Minor
> Time Spent: 3h 40m
> Remaining Estimate: 0h
>
> We currently encode taxonomy ordinals using varint style packing in a binary doc values field. I suspect there have been a number of improvements to SortedNumericDocValues since taxonomy faceting was first introduced, and I plan to explore replacing the custom binary format we have today with a SORTED_NUMERIC type dv field instead.
> I'll report benchmark results and index size impact here.
[GitHub] [lucene] gsmiller commented on a change in pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on a change in pull request #443: URL: https://github.com/apache/lucene/pull/443#discussion_r749715273

## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetsConfig.java

@@ -409,9 +410,26 @@ private void processFacetFields(
       indexDrillDownTerms(doc, indexFieldName, dimConfig, facetLabel);
     }
-    // Facet counts:
-    // DocValues are considered stored fields:
-    doc.add(new BinaryDocValuesField(indexFieldName, dedupAndEncode(ordinals.get())));
+    // Store the taxonomy ordinals associated with each doc. Prefer to use SortedNumericDocValues
+    // but "fall back" to a custom binary format to maintain backwards compatibility with Lucene 8
+    // indexes.
+    if (taxoWriter.useNumericDocValuesForOrdinals()) {

Review comment: I'm not entirely sure this is correct. This tells us that the taxonomy index was created with an 8.x or newer version, but doesn't tell us anything about the version of the main index. It assumes that the same version is used to create both the taxonomy index and the main index. Is it possible for these versions to not match? I wonder if we need to account for that situation? If so, I'm not entirely sure how to go about doing so, as we don't have any information about the main index here in this class. So that would need some more thought.
[GitHub] [lucene] gsmiller commented on pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller commented on pull request #443: URL: https://github.com/apache/lucene/pull/443#issuecomment-969362319

**NOTE**: I'm working on additional testing but hoping to get some early feedback on this approach, particularly as backwards-compatibility is concerned. I'll update with more tests and a CHANGES entry.
[GitHub] [lucene] gsmiller opened a new pull request #443: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals
gsmiller opened a new pull request #443: URL: https://github.com/apache/lucene/pull/443

# Description
In benchmarks, using numeric doc values to store taxonomy facet ordinals shows almost a 400% qps improvement in browse-related taxonomy-based tasks (instead of custom delta-encoding into a binary doc values field). This PR changes the encoding of facet ordinals, while maintaining backwards compatibility with 8.x indexes.

# Solution
This change moves to standard numeric doc values for storing taxonomy ordinals.

# Tests
No new tests added. Lots of existing test coverage for taxonomy faceting functionality.

**NOTE**: I will add new testing that ensures backwards compatibility support remains with 8.x. I'm putting this PR out for feedback while doing so.

# Checklist
Please review the following and check all that apply:
- [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability.
- [x] I have created a Jira issue and added the issue ID to my pull request title.
- [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended)
- [x] I have developed this patch against the `main` branch.
- [x] I have run `./gradlew check`.
- [ ] I have added tests for my changes.
[jira] [Commented] (LUCENE-10062) Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting
[ https://issues.apache.org/jira/browse/LUCENE-10062?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444124#comment-17444124 ] Greg Miller commented on LUCENE-10062:

I'm posting a new PR now for adding this format change to the 9.0 release. The intention is to maintain backwards compatibility with 8.x indices, and then drop support for the older binary doc values format in 10 (which allows us to avoid some of the back-compat complexity on the main branch). The PR I'm posting takes a fairly aggressive approach to deprecation, and I'm curious what folks will think of this. I'll outline the deprecation approach here, starting with the less controversial followed by the potentially more controversial.

*Less Controversial*

Starting with Lucene 9.0, taxonomy ordinals will be stored as a {{SortedNumericDocValues}} field. For backwards compatibility, if Lucene 9.x code is writing to an index created with 8.x, it will revert to using a {{BinaryDocValues}} format.

In Lucene 8.x, we allow users to plug in their own custom binary format if they don't want the default. This will continue to work in 9.x, but only if writing to an 8.x index. Users will not be able to plug in any sort of custom format for indexes created with 9.0 onward (they'll get a {{SortedNumericDocValues}} field as-is).

When merging segments, we will honor the present format for backwards compatibility. So if the segments being merged were written with 9.x, we'll merge the {{SortedNumericDocValues}} fields. If we're merging 8.x segments, we'll maintain the older binary format (including any customization plugged in by the user). Again, no custom format support will be provided for 9.0 onwards.

When reading the ordinals, we'll be backwards compatible with 8.x indexes (using the binary format).
*Potentially Controversial*

Users currently have the ability to provide ordinals for a given document through the concept of an {{OrdinalsReader}} when using {{TaxonomyFacetCounts}}, {{TaxonomyFacetSumValueSource}} and {{TaxonomyFacetLabels}}. This seems to be available mainly to support users that have created a custom binary format for their taxonomy ordinals. But, in theory, it could be useful more generally if users have some need to provide ordinals in some other, custom way.

I propose deprecating this concept entirely. While it's not terribly hard to keep it around, I struggle to think of a real use-case for users needing to provide ordinals in a custom way if we no longer support the ability to plug in a custom binary format. Note that the other facet implementations (including things like {{FastTaxonomyFacetCounts}}) assume the default encoding, so they'll seamlessly switch from the binary format to the numeric format under the hood in a backwards-compatible fashion. If users really have some custom need, there's nothing preventing them from implementing their own {{Facets}} sub-class, etc.

If anyone knows of real-world use-cases for maintaining the support for {{OrdinalsReader}}, I'm happy to keep it in. I have a version of the change that does so already, so it's not really any extra work; it just seems a good opportunity to remove some code complexity.

> Explore using SORTED_NUMERIC doc values to encode taxonomy ordinals for faceting
> --------------------------------------------------------------------------------
>
> Key: LUCENE-10062
> URL: https://issues.apache.org/jira/browse/LUCENE-10062
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Reporter: Greg Miller
> Assignee: Greg Miller
> Priority: Minor
> Time Spent: 3h 10m
> Remaining Estimate: 0h
>
> We currently encode taxonomy ordinals using varint style packing in a binary doc values field. I suspect there have been a number of improvements to SortedNumericDocValues since taxonomy faceting was first introduced, and I plan to explore replacing the custom binary format we have today with a SORTED_NUMERIC type dv field instead.
> I'll report benchmark results and index size impact here.
[GitHub] [lucene] bruno-roustant commented on pull request #430: LUCENE-10225: Improve IntroSelector.
bruno-roustant commented on pull request #430: URL: https://github.com/apache/lucene/pull/430#issuecomment-969337494

Woah, @jpountz your idea has a clear effect. It brings +(10-12)% perf compared to without the top-k optim, and the top-k optim adds another +3% on top (plus it is still very effective for k very close to boundaries). So I'd like to keep both. I hope it does not make the code too complex.
[GitHub] [lucene-solr] mkhludnev commented on pull request #2611: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool
mkhludnev commented on pull request #2611: URL: https://github.com/apache/lucene-solr/pull/2611#issuecomment-969325618

@jpountz I'm still not sure whether I did it right.
[GitHub] [lucene-solr] mkhludnev closed pull request #2610: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool
mkhludnev closed pull request #2610: URL: https://github.com/apache/lucene-solr/pull/2610
[GitHub] [lucene] rmuir commented on a change in pull request #264: LUCENE-10062: Switch to numeric doc values for encoding taxonomy ordinals (instead of custom binary format)
rmuir commented on a change in pull request #264: URL: https://github.com/apache/lucene/pull/264#discussion_r749671027

## File path: lucene/facet/src/java/org/apache/lucene/facet/FacetUtils.java

@@ -81,4 +84,19 @@ public long cost() {
       }
     };
   }
+
+  /**
+   * Determine whether-or-not an index segment is using the older-style binary format or the newer
+   * NumericDocValues format for storing taxonomy faceting ordinals (for the specified field).
+   *
+   * @deprecated Please do not rely on this method. It is added as a temporary measure for providing
+   *     index backwards-compatibility with Lucene 9 and earlier indexes, and will be removed in
+   *     Lucene 11.
+   */
+  // TODO: Remove in Lucene 11
+  @Deprecated
+  public static boolean usesOlderBinaryOrdinals(LeafReader reader, String field) {

Review comment: Why this check? Can't we simply do `reader.getMetaData().getCreatedVersionMajor() < 9` or whatever, matching the exact logic at write-time in DirectoryTaxonomyWriter?
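rmuir's alternative boils down to a pure function of the segment's creation version. Stripped of the Lucene types, the decision he is proposing looks like the sketch below (the threshold of 9 is taken from the discussion in this thread, not verified against the final patch):

```java
public class OrdinalFormatCheck {
  // Pre-9 indexes store taxonomy ordinals in the custom binary doc-values
  // format; indexes created with Lucene 9 or later use SortedNumericDocValues.
  // In real code the argument would come from
  // reader.getMetaData().getCreatedVersionMajor().
  static boolean usesOlderBinaryOrdinals(int createdVersionMajor) {
    return createdVersionMajor < 9;
  }

  public static void main(String[] args) {
    System.out.println(usesOlderBinaryOrdinals(8)); // prints true
    System.out.println(usesOlderBinaryOrdinals(9)); // prints false
  }
}
```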
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444098#comment-17444098 ] Haoyu Zhai commented on LUCENE-10122:

OK, here's the new PR (with the back-compatibility): [https://github.com/apache/lucene/pull/442] [~jpountz] I set that PR to target the 9.0 branch based on the previous email thread, but since we're already in the process of releasing, please let me know if you want this to target the main branch instead.

> Explore using NumericDocValue to store taxonomy parent array
> ------------------------------------------------------------
>
> Key: LUCENE-10122
> URL: https://issues.apache.org/jira/browse/LUCENE-10122
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/facet
> Affects Versions: main (10.0)
> Reporter: Haoyu Zhai
> Priority: Minor
> Time Spent: 2h 10m
> Remaining Estimate: 0h
>
> We currently use the term position of a hardcoded term in a hardcoded field to represent the parent ordinal of each taxonomy label. That is an old way and perhaps could be dated back to the time when doc values didn't exist.
> We probably would want to use NumericDocValues instead, given we have spent quite a lot of effort optimizing them.
[GitHub] [lucene] zhaih opened a new pull request #442: LUCENE-10122 Use NumericDocValue to store taxonomy parent array
zhaih opened a new pull request #442: URL: https://github.com/apache/lucene/pull/442 # Description As mentioned in the issue, use NumericDocValues to store the parent array instead of term positions. Benchmark results are available in JIRA; in short, we're seeing a slightly larger taxonomy index and flat performance. Added some if-branches for backward compatibility. ## Misc I've unified my name in "CHANGES.txt" to be "Patrick Zhai" (previously I sometimes used "Haoyu Zhai") # Checklist Please review the following and check all that apply: - [x] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability. - [x] I have created a Jira issue and added the issue ID to my pull request title. - [x] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [x] I have developed this patch against the `main` branch. - [x] I have run `./gradlew check`. - [ ] I have added tests for my changes. (Old tests should be sufficient) -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444091#comment-17444091 ] Michael McCandless commented on LUCENE-10122: - OK I am convinced too – let's move forward! It is really weird to abuse positions like this. [~zhai7631] what do you think? I think the only issue on the PR is back-compat from 8.x indices? Later we can also try to eliminate the separate large heap-resident {{int[]}} as [~rmuir] suggested on the dev list today. > Explore using NumericDocValue to store taxonomy parent array > > > Key: LUCENE-10122 > URL: https://issues.apache.org/jira/browse/LUCENE-10122 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (10.0) >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 2h > Remaining Estimate: 0h > > We currently use term position of a hardcoded term in a hardcoded field to > represent the parent ordinal of each taxonomy label. That is an old way and > perhaps could be dated back to the time where doc values didn't exist. > We probably would want to use NumericDocValues instead given we have spent > quite a lot of effort optimizing them. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] bruno-roustant commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.
bruno-roustant commented on a change in pull request #430: URL: https://github.com/apache/lucene/pull/430#discussion_r749654605 ## File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java ## @@ -185,6 +204,115 @@ private void shuffle(int from, int to) { } } + /** Selects the k-th entry with a bottom-k algorithm, given that k is close to {@code from}. */ Review comment: Thanks for this idea Adrien. I'll try it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] mkhludnev commented on pull request #2609: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool
mkhludnev commented on pull request #2609: URL: https://github.com/apache/lucene-solr/pull/2609#issuecomment-969271708 Uhgg... Thanks, Adrien. I'm on it. -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Comment Edited] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444070#comment-17444070 ] Feng Guo edited comment on LUCENE-10233 at 11/15/21, 7:51 PM: -- [~jpountz] Thanks! +1 to remove the OffsetBitSet class, but I find that there is no faster implementation for FixedBitSet.or(SparseFixedBitSet); do you mean we should implement it? Another way to replace the OffsetBitSet class I can think of is to support a docBase in BitSetIterator; I implemented this in the newest commit: https://github.com/apache/lucene/pull/438, and I wonder if this approach makes sense to you. was (Author: gf2121): [~jpountz] Thanks! +1 to remove the OffsetBitSet class, but i find that there has not been a faster implementation for FixedBitSet.or(SparseFixedBitSet), do you mean we should implement it? Another way to replace the OffsetBitSet class i can think of is to support a docBase in BitSetIterator, i implement this in the newest commit: https://github.com/apache/lucene/pull/438, i want to know if this approach makes sense to you. > Store docIds as bitset when leafCardinality = 1 to speed up addAll > -- > > Key: LUCENE-10233 > URL: https://issues.apache.org/jira/browse/LUCENE-10233 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In low cardinality points cases, id blocks will usually store doc ids that > have the same point value, and {{intersect}} will get into {{addAll}} logic. > If we store ids as bitset, and give the IntersectVisitor bulk visiting > ability, we can speed up addAll because we can just execute the 'or' logic > between the result and the block ids. 
> Optimization will be triggered when the following conditions are met at the > same time: > # leafCardinality = 1 > # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding > too much storage) > # no duplicate doc id > I mocked a field that has 10,000,000 docs per value and search it with a 1 > term PointInSetQuery, the build scorer time decreased from 71ms to 8ms. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
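The docBase idea from the comment above can be illustrated without Lucene. Below is a minimal, self-contained sketch using `java.util.BitSet` rather than Lucene's `FixedBitSet`/`BitSetIterator`, and `orWithOffset` is a hypothetical helper, not an actual Lucene method: because the block's doc IDs are stored relative to a base, OR-ing them into the overall result is just a copy shifted by `docBase`.

```java
import java.util.BitSet;

// Self-contained illustration of iterating a block-local bitset with a docBase
// offset, the idea behind supporting a docBase in BitSetIterator.
public class OffsetOrSketch {

  // Hypothetical helper: OR a block-local dense bitset into a global result,
  // shifting every bit by the block's first (minimum) doc ID.
  public static void orWithOffset(BitSet result, BitSet dense, int docBase) {
    for (int i = dense.nextSetBit(0); i >= 0; i = dense.nextSetBit(i + 1)) {
      result.set(docBase + i);
    }
  }

  public static void main(String[] args) {
    BitSet dense = new BitSet();
    dense.set(0, 5);                  // block-local doc IDs 0..4 are set
    BitSet result = new BitSet();
    orWithOffset(result, dense, 100); // block starts at global doc ID 100
    System.out.println(result);       // prints {100, 101, 102, 103, 104}
  }
}
```

In Lucene itself the shift could of course be applied word-at-a-time rather than bit-at-a-time; this sketch only shows the accounting, not the optimized copy.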
[jira] [Commented] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444070#comment-17444070 ] Feng Guo commented on LUCENE-10233: --- [~jpountz] Thanks! +1 to remove the OffsetBitSet class, but I find that there is no faster implementation for FixedBitSet.or(SparseFixedBitSet); do you mean we should implement it? Another way to replace the OffsetBitSet class I can think of is to support a docBase in BitSetIterator; I implemented this in the newest commit: https://github.com/apache/lucene/pull/438, and I'd like to know if this approach makes sense to you. > Store docIds as bitset when leafCardinality = 1 to speed up addAll > -- > > Key: LUCENE-10233 > URL: https://issues.apache.org/jira/browse/LUCENE-10233 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In low cardinality points cases, id blocks will usually store doc ids that > have the same point value, and {{intersect}} will get into {{addAll}} logic. > If we store ids as bitset, and give the IntersectVisitor bulk visiting > ability, we can speed up addAll because we can just execute the 'or' logic > between the result and the block ids. > Optimization will be triggered when the following conditions are met at the > same time: > # leafCardinality = 1 > # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding > too much storage) > # no duplicate doc id > I mocked a field that has 10,000,000 docs per value and search it with a 1 > term PointInSetQuery, the build scorer time decreased from 71ms to 8ms. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?
[ https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444029#comment-17444029 ] Arjen commented on LUCENE-9921: --- Thanks, although this kind of issue can happen with any library. Antlr used an even older version, but luckily only with usages that didn't require such a hard version check. > Can ICU regeneration tasks treat icu version as input? > -- > > Key: LUCENE-9921 > URL: https://issues.apache.org/jira/browse/LUCENE-9921 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > ICU 69 was released, so i was playing with the upgrade just to test it out > and test out our regeneration. > Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks > were SKIPPED by the build. > So I'm curious if the ICU version can be treated as an "input" to these > tasks, such that if it changes, tasks know the generated output is stale? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?
[ https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444025#comment-17444025 ] Robert Muir commented on LUCENE-9921: - Sorry you got jar hell because lucene has the outdated dependency. I mean we really should be using the latest stuff. We can address it for 9.1 for sure. > Can ICU regeneration tasks treat icu version as input? > -- > > Key: LUCENE-9921 > URL: https://issues.apache.org/jira/browse/LUCENE-9921 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > ICU 69 was released, so i was playing with the upgrade just to test it out > and test out our regeneration. > Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks > were SKIPPED by the build. > So I'm curious if the ICU version can be treated as an "input" to these > tasks, such that if it changes, tasks know the generated output is stale? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?
[ https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444024#comment-17444024 ] Arjen commented on LUCENE-9921: --- I know that. I even pinned it to 62.2 in my build.gradle to prevent other libraries from picking a different version... but if another library needs a higher version, that higher version is apparently selected by Gradle (rather than stopping with a version conflict). I discovered this with Antlr: they upgraded from something below 62.2 to 70.1 in their version 4.9.3 :( I can stick to 4.9.2 for now, but it would be nice to be able to upgrade in the semi-near future (i.e. waiting for 9.1 should do). > Can ICU regeneration tasks treat icu version as input? > -- > > Key: LUCENE-9921 > URL: https://issues.apache.org/jira/browse/LUCENE-9921 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > ICU 69 was released, so i was playing with the upgrade just to test it out > and test out our regeneration. > Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks > were SKIPPED by the build. > So I'm curious if the ICU version can be treated as an "input" to these > tasks, such that if it changes, tasks know the generated output is stale? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?
[ https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444016#comment-17444016 ] Robert Muir commented on LUCENE-9921: - You can't just override the advertised dependency version and upgrade the library yourself to an arbitrary version. It won't work. > Can ICU regeneration tasks treat icu version as input? > -- > > Key: LUCENE-9921 > URL: https://issues.apache.org/jira/browse/LUCENE-9921 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > ICU 69 was released, so i was playing with the upgrade just to test it out > and test out our regeneration. > Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks > were SKIPPED by the build. > So I'm curious if the ICU version can be treated as an "input" to these > tasks, such that if it changes, tasks know the generated output is stale? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?
[ https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17444011#comment-17444011 ] Arjen commented on LUCENE-9921: --- If the window is closing on 9, I agree it would be too risky. My reason for looking for a ticket like this was the {{utr30.nrm}} file used in {{ICUFoldingFilter}}. That seems to trigger a version check with icu 70.1: {code:java} Caused by: com.ibm.icu.util.ICUUncheckedIOException: java.io.IOException: ICU data file error: Header authentication failed, please check if you have a valid ICU data file; data format 4e726d32, format version 3.0.0.0 at app//com.ibm.icu.impl.Normalizer2Impl.load(Normalizer2Impl.java:506) at app//com.ibm.icu.impl.Norm2AllModes$1.createInstance(Norm2AllModes.java:351) at app//com.ibm.icu.impl.Norm2AllModes$1.createInstance(Norm2AllModes.java:344) at app//com.ibm.icu.impl.SoftCache.getInstance(SoftCache.java:69) at app//com.ibm.icu.impl.Norm2AllModes.getInstance(Norm2AllModes.java:341) at app//com.ibm.icu.text.Normalizer2.getInstance(Normalizer2.java:202) at app//org.apache.lucene.analysis.icu.ICUFoldingFilter.<init>(ICUFoldingFilter.java:72) ... 6 more Caused by: java.io.IOException: ICU data file error: Header authentication failed, please check if you have a valid ICU data file; data format 4e726d32, format version 3.0.0.0 at com.ibm.icu.impl.ICUBinary.readHeader(ICUBinary.java:606) at com.ibm.icu.impl.ICUBinary.readHeaderAndDataVersion(ICUBinary.java:557) at com.ibm.icu.impl.Normalizer2Impl.load(Normalizer2Impl.java:453) {code} The line causing it is a custom analyzer that does this: {{new ICUFoldingFilter(tokenStream)}}. So if someone upgrades the icu dependency manually (or uses another library that requires > 68.2), they end up with the above exception. 
That may actually be yet another reason for a new issue, but I don't know how to test this particular code with version 70.1 to see if it's simply a matter of regenerating the particular file or not :) > Can ICU regeneration tasks treat icu version as input? > -- > > Key: LUCENE-9921 > URL: https://issues.apache.org/jira/browse/LUCENE-9921 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > ICU 69 was released, so i was playing with the upgrade just to test it out > and test out our regeneration. > Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks > were SKIPPED by the build. > So I'm curious if the ICU version can be treated as an "input" to these > tasks, such that if it changes, tasks know the generated output is stale? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8739) ZSTD Compressor support in Lucene
[ https://issues.apache.org/jira/browse/LUCENE-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443976#comment-17443976 ] Adrien Grand commented on LUCENE-8739: -- Side thought: it would be nice to use Project Panama's Foreign linker when it gets released instead of depending on this JNI library. > ZSTD Compressor support in Lucene > - > > Key: LUCENE-8739 > URL: https://issues.apache.org/jira/browse/LUCENE-8739 > Project: Lucene - Core > Issue Type: New Feature > Components: core/codecs >Reporter: Sean Torres >Priority: Minor > Labels: features > Time Spent: 1h 10m > Remaining Estimate: 0h > > ZStandard has a great speed and compression ratio tradeoff. > ZStandard is open source compression from Facebook. > More about ZSTD > [https://github.com/facebook/zstd] > [https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?
[ https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443972#comment-17443972 ] Robert Muir commented on LUCENE-9921: - I agree, let's not rush this in close to a release. Note that we don't need to upgrade this jar for the 37 new unicode 14 emoji to work, they will already be tokenized/tagged correctly, due to the way the preallocation happens in unicode: See test: https://github.com/apache/lucene/blob/main/lucene/analysis/icu/src/test/org/apache/lucene/analysis/icu/segmentation/TestICUTokenizer.java#L506-L519 See list: https://s.apache.org/pqnnc > Can ICU regeneration tasks treat icu version as input? > -- > > Key: LUCENE-9921 > URL: https://issues.apache.org/jira/browse/LUCENE-9921 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > ICU 69 was released, so i was playing with the upgrade just to test it out > and test out our regeneration. > Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks > were SKIPPED by the build. > So I'm curious if the ICU version can be treated as an "input" to these > tasks, such that if it changes, tasks know the generated output is stale? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10122) Explore using NumericDocValue to store taxonomy parent array
[ https://issues.apache.org/jira/browse/LUCENE-10122?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443965#comment-17443965 ] Greg Miller commented on LUCENE-10122: -- Yeah, +1 to moving to doc values. Even if we see a minor taxonomy size growth, it's a more sensible data structure for this use-case. Taxonomy indices are generally quite small anyway (compared to the main index), so I'd rather align the use-case with an appropriate data structure then see if we can optimize it over time. > Explore using NumericDocValue to store taxonomy parent array > > > Key: LUCENE-10122 > URL: https://issues.apache.org/jira/browse/LUCENE-10122 > Project: Lucene - Core > Issue Type: Improvement > Components: modules/facet >Affects Versions: main (10.0) >Reporter: Haoyu Zhai >Priority: Minor > Time Spent: 2h > Remaining Estimate: 0h > > We currently use term position of a hardcoded term in a hardcoded field to > represent the parent ordinal of each taxonomy label. That is an old way and > perhaps could be dated back to the time where doc values didn't exist. > We probably would want to use NumericDocValues instead given we have spent > quite a lot of effort optimizing them. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-8739) ZSTD Compressor support in Lucene
[ https://issues.apache.org/jira/browse/LUCENE-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443964#comment-17443964 ] Adrien Grand commented on LUCENE-8739: -- I ran your PR with the new stored fields benchmark to see how codecs compare:

||Codec||Indexing time (ms)||Disk usage (MB)||Retrieval time per 10k docs (ms)||
|BEST_SPEED|35383|90.175|190.17524|
|BEST_COMPRESSION (vanilla zlib)|76671|58.682|1910.42106|
|BEST_COMPRESSION (Cloudflare zlib)|54791|58.601|1395.53593|
|ZSTD (level=1)|42433|70.527|240.04036|
|ZSTD (level=3)|53426|68.737|259.61897|
|ZSTD (level=6)|100697|66.283|251.91177|

From a quick look at your PR, it looks like you are not using dictionaries, which would explain why we're seeing a worse compression ratio? > ZSTD Compressor support in Lucene > - > > Key: LUCENE-8739 > URL: https://issues.apache.org/jira/browse/LUCENE-8739 > Project: Lucene - Core > Issue Type: New Feature > Components: core/codecs >Reporter: Sean Torres >Priority: Minor > Labels: features > Time Spent: 1h 10m > Remaining Estimate: 0h > > ZStandard has a great speed and compression ratio tradeoff. > ZStandard is open source compression from Facebook. > More about ZSTD > [https://github.com/facebook/zstd] > [https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
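For readers unfamiliar with the dictionary point above: a preset dictionary primes the compressor with bytes likely to recur in the input, which matters a lot for the small per-document blocks stored fields use. Here is a hedged, self-contained illustration using `java.util.zip`'s `Deflater.setDictionary` (zstd's dictionary mechanism is analogous; this is not the PR's code, and in practice the dictionary is trained on sample data rather than being the document itself):

```java
import java.util.zip.Deflater;

// Illustrates why preset dictionaries help: priming the compressor with bytes
// that resemble the input lets short inputs compress to back-references.
public class DictCompressionDemo {

  public static int compressedSize(byte[] input, byte[] dictionary) {
    Deflater deflater = new Deflater();
    if (dictionary != null) {
      deflater.setDictionary(dictionary); // must be set before compressing
    }
    deflater.setInput(input);
    deflater.finish();
    byte[] out = new byte[1024];
    int n = deflater.deflate(out); // number of compressed bytes produced
    deflater.end();
    return n;
  }

  public static void main(String[] args) {
    // Best case for the demo: the dictionary is the document itself.
    byte[] doc = ("Lucene stores small documents in shared blocks; a preset "
        + "dictionary primes the compressor with common field names and values.")
        .getBytes();
    int plain = compressedSize(doc, null);
    int primed = compressedSize(doc, doc);
    System.out.println(primed < plain); // dictionary-primed output is smaller
  }
}
```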
[jira] [Created] (LUCENE-10235) LRUQueryCache should not count never-cacheable queries as a miss
Yannick Welsch created LUCENE-10235: --- Summary: LRUQueryCache should not count never-cacheable queries as a miss Key: LUCENE-10235 URL: https://issues.apache.org/jira/browse/LUCENE-10235 Project: Lucene - Core Issue Type: Improvement Reporter: Yannick Welsch Hit and miss counts of a cache are typically used to check how effective a caching layer is. While looking at a system that exhibited a very high miss-to-hit ratio, I took a closer look at Lucene's LRUQueryCache and noticed that it counts as misses even queries that it would never consider caching in the first place (e.g. TermQuery and others mentioned in UsageTrackingQueryCachingPolicy.shouldNeverCache). The reason these are counted as a miss is that LRUQueryCache (scorerSupplier and bulkScorer methods) first does a lookup on the cache, incrementing hit or miss counters, and upon miss, only then checks QueryCachingPolicy.shouldCache to decide whether that query should be put into the cache. This issue is made more complex by the fact that QueryCachingPolicy.shouldCache is a stateful method, and cacheability of a query can change over time (e.g. after appearing N times). I'm opening this issue to discuss whether others also feel that the current way of accounting misses is unintuitive / confusing. I would also like to put forward a proposal to: * generalize the boolean QueryCachingPolicy.shouldCache method to return an enum instead (one of YES, NOT_RIGHT_NOW, NEVER), and only account queries that are (eventually) cacheable and not in the cache as a miss, * optionally introduce another metric for queries that are never cacheable, e.g. "ignored", and * optionally refine miss count into a count for items that are cacheable right away, and those that will eventually be cacheable. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
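A rough sketch of the accounting change proposed above. All names here (`CachingDecision`, `ignoredCount`, `onCacheLookupFailure`) are hypothetical illustrations, not the actual Lucene API, which today exposes only `boolean shouldCache(Query)`:

```java
// Hypothetical sketch: a tri-state caching decision so that never-cacheable
// queries are counted separately ("ignored") instead of inflating miss counts.
public class CachePolicySketch {

  public enum CachingDecision {
    YES,            // cacheable now
    NOT_RIGHT_NOW,  // may become cacheable later (e.g. after appearing N times)
    NEVER           // never cacheable (e.g. TermQuery)
  }

  static long missCount = 0;
  static long ignoredCount = 0;

  // Stand-in for the lookup path in LRUQueryCache.scorerSupplier/bulkScorer:
  // only account a real miss when the query could eventually be cached.
  public static void onCacheLookupFailure(CachingDecision decision) {
    if (decision == CachingDecision.NEVER) {
      ignoredCount++;  // proposed new metric, not a miss
    } else {
      missCount++;     // YES or NOT_RIGHT_NOW: a genuine miss
    }
  }

  public static void main(String[] args) {
    onCacheLookupFailure(CachingDecision.NEVER);         // e.g. a TermQuery
    onCacheLookupFailure(CachingDecision.NOT_RIGHT_NOW); // seen too few times yet
    onCacheLookupFailure(CachingDecision.YES);
    System.out.println(missCount + " " + ignoredCount);  // prints "2 1"
  }
}
```

With the boolean API, all three lookups above would count as misses; the tri-state version reports two misses and one ignored query, which matches the intuition behind a miss-to-hit ratio.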
[jira] [Commented] (LUCENE-10216) Add concurrency to addIndexes(CodecReader…) API
[ https://issues.apache.org/jira/browse/LUCENE-10216?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443953#comment-17443953 ] Michael McCandless commented on LUCENE-10216: - I like this plan (extending {{MergePolicy}} so it also has purview over how merging is done in {{addIndexes(CodecReader[])}}). Reference counting might get tricky, if {{OneMerge}} or {{IndexWriter}} holding completed {{OneMerge}} instances try to {{decRef}} readers. To improve testing we could create a new {{LuceneTestCase}} method to {{addIndexes}} from {{Directory[]}} that randomly does so with both impls and fix tests to sometimes use that for adding indices. > Add concurrency to addIndexes(CodecReader…) API > --- > > Key: LUCENE-10216 > URL: https://issues.apache.org/jira/browse/LUCENE-10216 > Project: Lucene - Core > Issue Type: Improvement > Components: core/index >Reporter: Vigya Sharma >Priority: Major > > I work at Amazon Product Search, and we use Lucene to power search for the > e-commerce platform. I’m working on a project that involves applying > metadata+ETL transforms and indexing documents on n different _indexing_ > boxes, combining them into a single index on a separate _reducer_ box, and > making it available for queries on m different _search_ boxes (replicas). > Segments are asynchronously copied from indexers to reducers to searchers as > they become available for the next layer to consume. > I am using the addIndexes API to combine multiple indexes into one on the > reducer boxes. Since we also have taxonomy data, we need to remap facet field > ordinals, which means I need to use the {{addIndexes(CodecReader…)}} version > of this API. The API leverages {{SegmentMerger.merge()}} to create segments > with new ordinal values while also merging all provided segments in the > process. 
> _This is however a blocking call that runs in a single thread._ Until we have > written segments with new ordinal values, we cannot copy them to searcher > boxes, which increases the time to make documents available for search. > I was playing around with the API by creating multiple concurrent merges, > each with only a single reader, creating a concurrently running 1:1 > conversion from old segments to new ones (with new ordinal values). We follow > this up with non-blocking background merges. This lets us copy the segments > to searchers and replicas as soon as they are available, and later replace > them with merged segments as background jobs complete. On the Amazon dataset > I profiled, this gave us around 2.5 to 3x improvement in addIndexes() time. > Each call was given about 5 readers to add on average. > This might be a useful addition to Lucene. We could create another {{addIndexes()}} > API with a {{boolean}} flag for concurrency, that internally submits multiple > merge jobs (each with a single reader) to the {{ConcurrentMergeScheduler}}, > and waits for them to complete before returning. > While this is doable from outside Lucene by using your own thread pool, starting > multiple addIndexes() calls and waiting for them to complete, I felt it needs > some understanding of what addIndexes does, why you need to wait on the merge > and why it makes sense to pass a single reader in the addIndexes API. > Out-of-the-box support in Lucene could simplify this for folks with a similar use case. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
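The "thread pool from outside Lucene" workaround described above can be sketched as follows. To keep the sketch self-contained and runnable, a counter stands in for the per-reader merge; in the real version each submitted task would call `IndexWriter.addIndexes` with a single `CodecReader` (the varargs API), and `IndexWriter` is thread-safe, so the calls may run concurrently:

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;
import java.util.concurrent.atomic.AtomicInteger;

// Sketch of fanning out one addIndexes(reader) call per incoming reader and
// joining on all of them, turning one blocking N-way merge into N concurrent
// 1:1 segment rewrites.
public class ConcurrentAddIndexesSketch {

  static final AtomicInteger segmentsWritten = new AtomicInteger();

  // Stand-in for writer.addIndexes(reader): a 1:1 rewrite of one reader
  // into one new segment (with remapped ordinals, in the real use case).
  static void addSingleReader(int readerId) {
    segmentsWritten.incrementAndGet();
  }

  public static void addAllConcurrently(int readerCount, int threads) {
    ExecutorService pool = Executors.newFixedThreadPool(threads);
    List<Future<?>> futures = new ArrayList<>();
    for (int r = 0; r < readerCount; r++) {
      final int id = r;
      futures.add(pool.submit(() -> addSingleReader(id)));
    }
    for (Future<?> f : futures) {
      try {
        f.get(); // wait for every 1:1 rewrite, as a blocking addIndexes() must
      } catch (InterruptedException | ExecutionException e) {
        throw new RuntimeException(e);
      }
    }
    pool.shutdown();
  }

  public static void main(String[] args) {
    addAllConcurrently(5, 3); // ~5 readers per call, as described above
    System.out.println(segmentsWritten.get()); // prints 5
  }
}
```

The join step is why the comment stresses "waiting for them to complete": only once every single-reader rewrite has finished are the new segments safe to publish to searchers.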
[jira] [Commented] (LUCENE-9921) Can ICU regeneration tasks treat icu version as input?
[ https://issues.apache.org/jira/browse/LUCENE-9921?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443934#comment-17443934 ] Dawid Weiss commented on LUCENE-9921: - I think it's too late for 9.0 - too little testing would be run before it's shipped to risk the upgrade (?). But for 9.1 - certainly. > Can ICU regeneration tasks treat icu version as input? > -- > > Key: LUCENE-9921 > URL: https://issues.apache.org/jira/browse/LUCENE-9921 > Project: Lucene - Core > Issue Type: Task >Reporter: Robert Muir >Priority: Major > > ICU 69 was released, so i was playing with the upgrade just to test it out > and test out our regeneration. > Running {{gradlew regenerate}} naively wasn't helpful, regeneration tasks > were SKIPPED by the build. > So I'm curious if the ICU version can be treated as an "input" to these > tasks, such that if it changes, tasks know the generated output is stale? -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Resolved] (LUCENE-10234) Add automatic module name to JAR manifests.
[ https://issues.apache.org/jira/browse/LUCENE-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss resolved LUCENE-10234. -- Fix Version/s: 9.0 Resolution: Fixed > Add automatic module name to JAR manifests. > --- > > Key: LUCENE-10234 > URL: https://issues.apache.org/jira/browse/LUCENE-10234 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Trivial > Fix For: 9.0 > > Time Spent: 0.5h > Remaining Estimate: 0h > > This is the first step to make Lucene a proper fit for the java module > system. I chose a shorthand "lucene.[x]" module name convention, without the > "org.apache" prefix. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10234) Add automatic module name to JAR manifests.
[ https://issues.apache.org/jira/browse/LUCENE-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443919#comment-17443919 ] ASF subversion and git services commented on LUCENE-10234: -- Commit 9d0eb88d2cd5339d4988417b5e46789a23d60d6f in lucene's branch refs/heads/branch_9x from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=9d0eb88 ] LUCENE-10234: Add automatic module name to JAR manifests. (#440) > Add automatic module name to JAR manifests. > --- > > Key: LUCENE-10234 > URL: https://issues.apache.org/jira/browse/LUCENE-10234 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Trivial > Time Spent: 0.5h > Remaining Estimate: 0h > > This is the first step to make Lucene a proper fit for the java module > system. I chose a shorthand "lucene.[x]" module name convention, without the > "org.apache" prefix. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Assigned] (LUCENE-10234) Add automatic module name to JAR manifests.
[ https://issues.apache.org/jira/browse/LUCENE-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Dawid Weiss reassigned LUCENE-10234: Assignee: Dawid Weiss > Add automatic module name to JAR manifests. > --- > > Key: LUCENE-10234 > URL: https://issues.apache.org/jira/browse/LUCENE-10234 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Assignee: Dawid Weiss >Priority: Trivial > Time Spent: 0.5h > Remaining Estimate: 0h > > This is the first step to make Lucene a proper fit for the java module > system. I chose a shorthand "lucene.[x]" module name convention, without the > "org.apache" prefix. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10234) Add automatic module name to JAR manifests.
[ https://issues.apache.org/jira/browse/LUCENE-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443918#comment-17443918 ] ASF subversion and git services commented on LUCENE-10234: -- Commit bf8072c1e9465ab491aa290baf8c1ac53e22d41c in lucene's branch refs/heads/branch_9_0 from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=bf8072c ] LUCENE-10234: Add automatic module name to JAR manifests. (#440) > Add automatic module name to JAR manifests. > --- > > Key: LUCENE-10234 > URL: https://issues.apache.org/jira/browse/LUCENE-10234 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Trivial > Time Spent: 0.5h > Remaining Estimate: 0h > > This is the first step to make Lucene a proper fit for the java module > system. I chose a shorthand "lucene.[x]" module name convention, without the > "org.apache" prefix. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10234) Add automatic module name to JAR manifests.
[ https://issues.apache.org/jira/browse/LUCENE-10234?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443917#comment-17443917 ] ASF subversion and git services commented on LUCENE-10234: -- Commit f5e5cf008aa791e0add8e1cc71ef2447b6b54c46 in lucene's branch refs/heads/main from Dawid Weiss [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=f5e5cf0 ] LUCENE-10234: Add automatic module name to JAR manifests. (#440) > Add automatic module name to JAR manifests. > --- > > Key: LUCENE-10234 > URL: https://issues.apache.org/jira/browse/LUCENE-10234 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Dawid Weiss >Priority: Trivial > Time Spent: 0.5h > Remaining Estimate: 0h > > This is the first step to make Lucene a proper fit for the java module > system. I chose a shorthand "lucene.[x]" module name convention, without the > "org.apache" prefix. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] dweiss merged pull request #440: LUCENE-10234: Add automatic module name to JAR manifests.
dweiss merged pull request #440: URL: https://github.com/apache/lucene/pull/440 -- This is an automated message from the Apache Git Service. To respond to the message, please log on to GitHub and use the URL above to go to the specific comment. To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For queries about this service, please contact Infrastructure at: us...@infra.apache.org - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz commented on a change in pull request #430: LUCENE-10225: Improve IntroSelector.
jpountz commented on a change in pull request #430: URL: https://github.com/apache/lucene/pull/430#discussion_r749462492 ## File path: lucene/core/src/java/org/apache/lucene/util/IntroSelector.java ## @@ -185,6 +204,115 @@ private void shuffle(int from, int to) { } } + /** Selects the k-th entry with a bottom-k algorithm, given that k is close to {@code from}. */ Review comment: Interesting. When I raised this question, I had more in mind borrowing ideas from interpolation search (https://en.wikipedia.org/wiki/Binary_search_algorithm#Interpolation_search) and e.g. seeing if picking the lowest of the 3 medians when `k - from < (to - from) / 8` and the highest of the 3 medians when `to - k < (to - from) / 8` makes things any faster.
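[Editor's note] The pivot-biasing idea in the comment above can be sketched as follows. This is a hypothetical standalone quickselect, not Lucene's actual IntroSelector: it samples three candidate pivots and, when k sits within the outer eighth of [from, to), picks the lowest or highest sample instead of the median, so the first partition is likely to discard most of the range. All names are made up for illustration.

```java
import java.util.Arrays;

class BiasedPivot {

  private static void swap(int[] a, int i, int j) {
    int t = a[i];
    a[i] = a[j];
    a[j] = t;
  }

  /** Pick a pivot value for selecting the k-th smallest of a[from..to). */
  static int choosePivot(int[] a, int from, int to, int k) {
    int len = to - from;
    int x = a[from], y = a[from + len / 2], z = a[to - 1];
    int t;
    // Sort the three samples so that x <= y <= z.
    if (x > y) { t = x; x = y; y = t; }
    if (y > z) { t = y; y = z; z = t; }
    if (x > y) { t = x; x = y; y = t; }
    if (k - from < len / 8) return x; // k near the start: bias the pivot low
    if (to - k < len / 8) return z;   // k near the end: bias the pivot high
    return y;                         // otherwise behave like median-of-3
  }

  /** Quickselect: returns the k-th smallest element of a[from..to), 0-based. */
  static int select(int[] a, int from, int to, int k) {
    while (true) {
      if (to - from <= 8) { // small range: just sort
        Arrays.sort(a, from, to);
        return a[k];
      }
      int pivot = choosePivot(a, from, to, k);
      // Three-way partition: [from, lt) < pivot, [lt, gt] == pivot, (gt, to) > pivot.
      int lt = from, gt = to - 1, i = from;
      while (i <= gt) {
        if (a[i] < pivot) swap(a, lt++, i++);
        else if (a[i] > pivot) swap(a, i, gt--);
        else i++;
      }
      if (k < lt) to = lt;
      else if (k > gt) from = gt + 1;
      else return pivot;
    }
  }
}
```

The real IntroSelector additionally falls back to a guaranteed-linear strategy when recursion gets too deep; this sketch only shows the biased pivot choice.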
[jira] [Commented] (LUCENE-8739) ZSTD Compressor support in Lucene
[ https://issues.apache.org/jira/browse/LUCENE-8739?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443912#comment-17443912 ] Praveen Nishchal commented on LUCENE-8739: -- I have created a pull request - [https://github.com/apache/lucene/pull/439] I am using Zstd-JNI [https://github.com/luben/zstd-jni] in a new custom codec which integrates Zstd compression and decompression in StoredFieldFormat. > ZSTD Compressor support in Lucene > - > > Key: LUCENE-8739 > URL: https://issues.apache.org/jira/browse/LUCENE-8739 > Project: Lucene - Core > Issue Type: New Feature > Components: core/codecs >Reporter: Sean Torres >Priority: Minor > Labels: features > Time Spent: 1h 10m > Remaining Estimate: 0h > > ZStandard has a great speed and compression ratio tradeoff. > ZStandard is open source compression from Facebook. > More about ZSTD > [https://github.com/facebook/zstd] > [https://code.facebook.com/posts/1658392934479273/smaller-and-faster-data-compression-with-zstandard/] -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] jpountz commented on pull request #2607: SOLR-15794: Switching a PRS collection from true -> false -> true results in INACTIVE replicas
jpountz commented on pull request #2607: URL: https://github.com/apache/lucene-solr/pull/2607#issuecomment-969035659 @noblepaul Note that it should be backported to `branch_8_11` if you want this change to ever be released.
[GitHub] [lucene-solr] jpountz commented on pull request #2609: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool
jpountz commented on pull request #2609: URL: https://github.com/apache/lucene-solr/pull/2609#issuecomment-969034069 Also it looks like you missed backporting to `branch_8_11`. `branch_8x` is essentially dead at this point since 8.11 is going to be the last 8.x minor release.
[GitHub] [lucene-solr] jpountz commented on pull request #2609: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool
jpountz commented on pull request #2609: URL: https://github.com/apache/lucene-solr/pull/2609#issuecomment-969033135 You need to move the CHANGES entry to 8.11.1 since 8.11 won't have this change.
[jira] [Commented] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443902#comment-17443902 ] Adrien Grand commented on LUCENE-10233: --- This is an interesting idea! One drawback of this approach is that we're trying to keep the number of classes that implement oal.util.BitSet at 2, and this would be a 3rd one. I wonder if we could use SparseFixedBitSet instead of this new OffsetBitSet class? > Store docIds as bitset when leafCardinality = 1 to speed up addAll > -- > > Key: LUCENE-10233 > URL: https://issues.apache.org/jira/browse/LUCENE-10233 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In low cardinality points cases, id blocks will usually store doc ids that > have the same point value, and {{intersect}} will get into {{addAll}} logic. > If we store ids as bitset, and give the IntersectVisitor bulk visiting > ability, we can speed up addAll because we can just execute the 'or' logic > between the result and the block ids. > Optimization will be triggered when the following conditions are met at the > same time: > # leafCardinality = 1 > # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding > too much storage) > # no duplicate doc id > I mocked a field that has 10,000,000 docs per value and search it with a 1 > term PointInSetQuery, the build scorer time decreased from 71ms to 8ms. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
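[Editor's note] A minimal sketch of what the proposed OffsetBitSet-style layout in this thread boils down to: since the block's ids form a [nothing][dense] pattern, store only the dense tail plus the offset (docBase) of its first bit, rather than a full-length bitset or a third oal.util.BitSet implementation. The JDK's BitSet stands in for Lucene's FixedBitSet here, and all names are hypothetical.

```java
import java.util.BitSet;

class OffsetDocIdSet {
  final int docBase; // first doc id covered by the dense bits
  final BitSet bits; // bit i is set iff doc (docBase + i) is present

  OffsetDocIdSet(int docBase, int maxDoc) {
    this.docBase = docBase;
    this.bits = new BitSet(maxDoc - docBase);
  }

  void add(int docId) {
    bits.set(docId - docBase);
  }

  // Bulk "or" of this block's ids into an accumulator indexed from doc 0.
  // JDK BitSet has no offset-aware or(), so this shifts bit by bit; a
  // FixedBitSet-based version could or whole 64-bit words with a shift.
  void orInto(BitSet acc) {
    for (int i = bits.nextSetBit(0); i >= 0; i = bits.nextSetBit(i + 1)) {
      acc.set(docBase + i);
    }
  }
}
```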
[jira] [Commented] (LUCENE-10085) Implement Weight#count on DocValuesFieldExistsQuery
[ https://issues.apache.org/jira/browse/LUCENE-10085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443897#comment-17443897 ] ASF subversion and git services commented on LUCENE-10085: -- Commit e034a2d6e2913ae10ac38434240ed8e9b9aa19ad in lucene's branch refs/heads/branch_9x from Quentin Pradet [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=e034a2d ] LUCENE-10085: Rename DocValuesFieldExistsQuery test (#441) FieldValueQuery got renamed to DocValuesFieldExistsQuery but the test wasn't renamed. > Implement Weight#count on DocValuesFieldExistsQuery > --- > > Key: LUCENE-10085 > URL: https://issues.apache.org/jira/browse/LUCENE-10085 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 20m > Remaining Estimate: 0h > > Now that we require all documents to use the same features (LUCENE-9334) we > could implement {{Weight#count}} to return docCount if either terms or points > are indexed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Commented] (LUCENE-10085) Implement Weight#count on DocValuesFieldExistsQuery
[ https://issues.apache.org/jira/browse/LUCENE-10085?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17443895#comment-17443895 ] ASF subversion and git services commented on LUCENE-10085: -- Commit 1e5e997880ee30a94f0fc2fd15cc071f9ef43c29 in lucene's branch refs/heads/main from Quentin Pradet [ https://gitbox.apache.org/repos/asf?p=lucene.git;h=1e5e997 ] LUCENE-10085: Rename DocValuesFieldExistsQuery test (#441) FieldValueQuery got renamed to DocValuesFieldExistsQuery but the test wasn't renamed. > Implement Weight#count on DocValuesFieldExistsQuery > --- > > Key: LUCENE-10085 > URL: https://issues.apache.org/jira/browse/LUCENE-10085 > Project: Lucene - Core > Issue Type: Improvement >Reporter: Adrien Grand >Priority: Minor > Time Spent: 10m > Remaining Estimate: 0h > > Now that we require all documents to use the same features (LUCENE-9334) we > could implement {{Weight#count}} to return docCount if either terms or points > are indexed. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] jpountz merged pull request #441: LUCENE-10085: Rename DocValuesFieldExistsQuery test
jpountz merged pull request #441: URL: https://github.com/apache/lucene/pull/441
[GitHub] [lucene] jpountz commented on pull request #440: LUCENE-10234: Add automatic module name to JAR manifests.
jpountz commented on pull request #440: URL: https://github.com/apache/lucene/pull/440#issuecomment-968952307 Please do!
[GitHub] [lucene] pquentin opened a new pull request #441: LUCENE-10085: Rename DocValuesFieldExistsQuery test
pquentin opened a new pull request #441: URL: https://github.com/apache/lucene/pull/441 FieldValueQuery got renamed to DocValuesFieldExistsQuery but the test wasn't renamed.
[GitHub] [lucene] dweiss commented on pull request #440: LUCENE-10234: Add automatic module name to JAR manifests.
dweiss commented on pull request #440: URL: https://github.com/apache/lucene/pull/440#issuecomment-968862647 This is what I'd like to add to branch_9_0 (and branch_9x), @jpountz . Can I?
[jira] [Created] (LUCENE-10234) Add automatic module name to JAR manifests.
Dawid Weiss created LUCENE-10234: Summary: Add automatic module name to JAR manifests. Key: LUCENE-10234 URL: https://issues.apache.org/jira/browse/LUCENE-10234 Project: Lucene - Core Issue Type: Improvement Reporter: Dawid Weiss This is the first step to make Lucene a proper fit for the java module system. I chose a shorthand "lucene.[x]" module name convention, without the "org.apache" prefix. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and {{intersect}} will get into {{addAll}} logic. If we store ids as bitset, and give the IntersectVisitor bulk visiting ability, we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. Optimization will be triggered when the following conditions are met at the same time: # leafCardinality = 1 # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding too much storage) # no duplicate doc id I mocked a field that has 10,000,000 docs per value and search it with a 1 term PointInSetQuery, the build scorer time decreased from 71ms to 8ms. (WIP, Just post this first to see whether you think this optimization makes sense) was: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when the leafCadinality = 1, and give the IntersectVisitor bulk visit ability, we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. I mocked a field that has 10,000,000 docs per value and search it with a PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms. Concerns: 1. Bitset could occupy more disk space.(Maybe we can force this optimization only works when block's (max-min) <= n * count?) 2. MergeReader will become a bit slower because it needs to iterate docIds one by one. 
> Store docIds as bitset when leafCardinality = 1 to speed up addAll > -- > > Key: LUCENE-10233 > URL: https://issues.apache.org/jira/browse/LUCENE-10233 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In low cardinality points cases, id blocks will usually store doc ids that > have the same point value, and {{intersect}} will get into {{addAll}} logic. > If we store ids as bitset, and give the IntersectVisitor bulk visiting > ability, we can speed up addAll because we can just execute the 'or' logic > between the result and the block ids. > Optimization will be triggered when the following conditions are met at the > same time: > # leafCardinality = 1 > # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding > too much storage) > # no duplicate doc id > I mocked a field that has 10,000,000 docs per value and search it with a 1 > term PointInSetQuery, the build scorer time decreased from 71ms to 8ms. > (WIP, Just post this first to see whether you think this optimization makes > sense) -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
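[Editor's note] The three trigger conditions listed in the description above can be sketched as one check. This is illustrative only, not the actual patch in https://github.com/apache/lucene/pull/438, and the method name is hypothetical.

```java
import java.util.HashSet;
import java.util.Set;

class BitsetTrigger {
  static boolean shouldStoreAsBitSet(int leafCardinality, int[] docIds) {
    if (leafCardinality != 1 || docIds.length == 0) {
      return false; // condition 1: all points in the leaf share one value
    }
    int min = Integer.MAX_VALUE, max = Integer.MIN_VALUE;
    Set<Integer> seen = new HashSet<>();
    for (int doc : docIds) {
      min = Math.min(min, doc);
      max = Math.max(max, doc);
      if (!seen.add(doc)) {
        return false; // condition 3: no duplicate doc ids
      }
    }
    // condition 2: the id range must stay within 16x the id count,
    // otherwise the bitset would expand storage too much
    return (long) max - min <= 16L * docIds.length;
  }
}
```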
[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and {{intersect}} will get into {{addAll}} logic. If we store ids as bitset, and give the IntersectVisitor bulk visiting ability, we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. Optimization will be triggered when the following conditions are met at the same time: # leafCardinality = 1 # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding too much storage) # no duplicate doc id I mocked a field that has 10,000,000 docs per value and search it with a 1 term PointInSetQuery, the build scorer time decreased from 71ms to 8ms. was: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and {{intersect}} will get into {{addAll}} logic. If we store ids as bitset, and give the IntersectVisitor bulk visiting ability, we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. Optimization will be triggered when the following conditions are met at the same time: # leafCardinality = 1 # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding too much storage) # no duplicate doc id I mocked a field that has 10,000,000 docs per value and search it with a 1 term PointInSetQuery, the build scorer time decreased from 71ms to 8ms. 
(WIP, Just post this first to see whether you think this optimization makes sense) > Store docIds as bitset when leafCardinality = 1 to speed up addAll > -- > > Key: LUCENE-10233 > URL: https://issues.apache.org/jira/browse/LUCENE-10233 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In low cardinality points cases, id blocks will usually store doc ids that > have the same point value, and {{intersect}} will get into {{addAll}} logic. > If we store ids as bitset, and give the IntersectVisitor bulk visiting > ability, we can speed up addAll because we can just execute the 'or' logic > between the result and the block ids. > Optimization will be triggered when the following conditions are met at the > same time: > # leafCardinality = 1 > # max(docId) - min(docId) <= 16 * pointCount (in order to avoid expanding > too much storage) > # no duplicate doc id > I mocked a field that has 10,000,000 docs per value and search it with a 1 > term PointInSetQuery, the build scorer time decreased from 71ms to 8ms. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene-solr] mkhludnev merged pull request #2609: SOLR-15635: avoid redundant closeHooks invocation by MDCThreadPool
mkhludnev merged pull request #2609: URL: https://github.com/apache/lucene-solr/pull/2609
[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when the leafCadinality = 1, and give the IntersectVisitor bulk visit ability, we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. I mocked a field that has 10,000,000 docs per value and search it with a PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms. Concerns: 1. Bitset could occupy more disk space.(Maybe we can force this optimization only works when block's (max-min) <= n * count?) 2. MergeReader will become a bit slower because it needs to iterate docIds one by one. was: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when the leafCadinality = 1, and give the IntersectVisitor bulk visit ability, we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. I mocked a field that has 10,000,000 docs per value and search it with a PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms. Concerns: 1. Bitset could occupy more disk space.(Maybe we can force this optimization only works when block's (max-min) <= n * count?) 2. MergeReader will become slower because it needs to iterate docIds one by one. 
> Store docIds as bitset when leafCardinality = 1 to speed up addAll > -- > > Key: LUCENE-10233 > URL: https://issues.apache.org/jira/browse/LUCENE-10233 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In low cardinality points cases, id blocks will usually store doc ids that > have the same point value, and intersect will get into addAll logic. If we > store ids as bitset when the leafCadinality = 1, and give the > IntersectVisitor bulk visit ability, we can speed up addAll because we can > just execute the 'or' logic between the result and the block ids. > I mocked a field that has 10,000,000 docs per value and search it with a > PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms. > Concerns: > 1. Bitset could occupy more disk space.(Maybe we can force this optimization > only works when block's (max-min) <= n * count?) > 2. MergeReader will become a bit slower because it needs to iterate docIds > one by one. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[jira] [Updated] (LUCENE-10233) Store docIds as bitset when leafCardinality = 1 to speed up addAll
[ https://issues.apache.org/jira/browse/LUCENE-10233?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Feng Guo updated LUCENE-10233: -- Description: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when the leafCadinality = 1, and give the IntersectVisitor bulk visit ability, we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. I mocked a field that has 10,000,000 docs per value and search it with a PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms. Concerns: 1. Bitset could occupy more disk space.(Maybe we can force this optimization only works when block's (max-min) <= n * count?) 2. MergeReader will become slower because it needs to iterate docIds one by one. was: In low cardinality points cases, id blocks will usually store doc ids that have the same point value, and intersect will get into addAll logic. If we store ids as bitset when the leafCadinality = 1, and give the IntersectVisitor bulk visiting ability (something like visit(DocIdSetIterator iterator), we can speed up addAll because we can just execute the 'or' logic between the result and the block ids. Concerns: 1. Bitset could occupy more disk space.(Maybe we can force this optimization only works when block's (max-min) <= n * count?) 2. MergeReader will become slower because it needs to iterate docIds one by one. > Store docIds as bitset when leafCardinality = 1 to speed up addAll > -- > > Key: LUCENE-10233 > URL: https://issues.apache.org/jira/browse/LUCENE-10233 > Project: Lucene - Core > Issue Type: Improvement > Components: core/codecs >Reporter: Feng Guo >Priority: Major > Time Spent: 10m > Remaining Estimate: 0h > > In low cardinality points cases, id blocks will usually store doc ids that > have the same point value, and intersect will get into addAll logic. 
If we > store ids as bitset when the leafCadinality = 1, and give the > IntersectVisitor bulk visit ability, we can speed up addAll because we can > just execute the 'or' logic between the result and the block ids. > I mocked a field that has 10,000,000 docs per value and search it with a > PointInSetQuery with 1 term, the build scorer time decreased from 71ms to 8ms. > Concerns: > 1. Bitset could occupy more disk space.(Maybe we can force this optimization > only works when block's (max-min) <= n * count?) > 2. MergeReader will become slower because it needs to iterate docIds one by > one. -- This message was sent by Atlassian Jira (v8.20.1#820001) - To unsubscribe, e-mail: issues-unsubscr...@lucene.apache.org For additional commands, e-mail: issues-h...@lucene.apache.org
[GitHub] [lucene] praveennish opened a new pull request #439: LUCENE-8739: custom codec providing Zstandard compression/decompression
praveennish opened a new pull request #439: URL: https://github.com/apache/lucene/pull/439 # Description Lucene currently supports LZ4 and Zlib compression/decompression for StoredFieldsFormat. We propose Zstandard (https://facebook.github.io/zstd/) compression/decompression for StoredFieldsFormat for the following reasons: * ZStandard is being used in some of the most popular open source projects like Apache Cassandra, Hadoop, and Kafka. * Zstandard, at the default setting of 3, is expected to show substantial improvements in both compression and decompression speed while compressing at the same ratio as Zlib, as per a study by Yann Collet at Facebook (https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/). * Zstandard currently offers 22 compression levels, which enable flexible, granular trade-offs between compression speed and ratio. This solution also provides the flexibility of choosing compression levels between 1 and 22. For example, we can use level 1 if speed is most important and level 22 if size is most important. * Zstandard is designed to scale with modern hardware (https://engineering.fb.com/2016/08/31/core-data/smaller-and-faster-data-compression-with-zstandard/). # Solution * Developed a custom codec to enable Zstandard compression/decompression support. * This custom codec also provides the flexibility to add other compression algorithms. # Tests * Added the required test case for the new custom codec and ran it with the -Dtests.codec option. # Checklist Please review the following and check all that apply: - [X] I have reviewed the guidelines for [How to Contribute](https://wiki.apache.org/lucene/HowToContribute) and my code conforms to the standards described there to the best of my ability. - [X] I have created a Jira issue and added the issue ID to my pull request title.
- [X] I have given Lucene maintainers [access](https://help.github.com/en/articles/allowing-changes-to-a-pull-request-branch-created-from-a-fork) to contribute to my PR branch. (optional but recommended) - [X] I have developed this patch against the `main` branch. - [X] I have run `./gradlew check`. - [X] I have added tests for my changes.
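[Editor's note] The core contract the proposed codec has to satisfy is a lossless round trip, decompress(compress(bytes, level)) == bytes, with a tunable level. zstd-jni exposes this shape through its Zstd class; since that is a third-party dependency, the sketch below uses the JDK's Deflater/Inflater as a stand-in to show the same structure. It is illustrative, not the PR's actual Compressor/Decompressor code.

```java
import java.io.ByteArrayOutputStream;
import java.util.zip.DataFormatException;
import java.util.zip.Deflater;
import java.util.zip.Inflater;

class RoundTrip {

  /** Compress src at the given level, returning the packed bytes. */
  static byte[] compress(byte[] src, int level) {
    Deflater deflater = new Deflater(level);
    deflater.setInput(src);
    deflater.finish();
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    while (!deflater.finished()) {
      out.write(buf, 0, deflater.deflate(buf));
    }
    deflater.end();
    return out.toByteArray();
  }

  /** Restore the original bytes from the packed form. */
  static byte[] decompress(byte[] src) {
    Inflater inflater = new Inflater();
    inflater.setInput(src);
    ByteArrayOutputStream out = new ByteArrayOutputStream();
    byte[] buf = new byte[4096];
    try {
      while (!inflater.finished()) {
        out.write(buf, 0, inflater.inflate(buf));
      }
    } catch (DataFormatException e) {
      throw new IllegalStateException("corrupted block", e);
    }
    inflater.end();
    return out.toByteArray();
  }
}
```

Stored fields are highly redundant, which is why the level knob matters: a stored-fields format trades index size against compression/decompression time per block.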