[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user Timothy055 closed the pull request at: https://github.com/apache/lucene-solr/pull/105 --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user Timothy055 commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85619786 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldOffsetStrategy.java --- @@ -65,58 +65,88 @@ public String getField() { */ public abstract List getOffsetsEnums(IndexReader reader, int docId, String content) throws IOException; - protected List createOffsetsEnums(LeafReader leafReader, int doc, TokenStream tokenStream) throws IOException { -List offsetsEnums = createOffsetsEnumsFromReader(leafReader, doc); -if (automata.length > 0) { - offsetsEnums.add(createOffsetsEnumFromTokenStream(doc, tokenStream)); + protected List createOffsetsEnumsFromReader(LeafReader leafReader, int doc) throws IOException { +final Terms termsIndex = leafReader.terms(field); +if (termsIndex == null) { + return Collections.emptyList(); } -return offsetsEnums; - } - protected List createOffsetsEnumsFromReader(LeafReader atomicReader, int doc) throws IOException { // For strict positions, get a Map of term to Spans: //note: ScriptPhraseHelper.NONE does the right thing for these method calls final MapstrictPhrasesTermToSpans = -strictPhrases.getTermToSpans(atomicReader, doc); +phraseHelper.getTermToSpans(leafReader, doc); // Usually simply wraps terms in a List; but if willRewrite() then can be expanded final List sourceTerms = -strictPhrases.expandTermsIfRewrite(terms, strictPhrasesTermToSpans); +phraseHelper.expandTermsIfRewrite(terms, strictPhrasesTermToSpans); -final List offsetsEnums = new ArrayList<>(sourceTerms.size() + 1); +final List offsetsEnums = new ArrayList<>(sourceTerms.size() + automata.length); -Terms termsIndex = atomicReader == null || sourceTerms.isEmpty() ? null : atomicReader.terms(field); -if (termsIndex != null) { +// Handle sourceTerms: +if (!sourceTerms.isEmpty()) { TermsEnum termsEnum = termsIndex.iterator();//does not return null for (BytesRef term : sourceTerms) { -if (!termsEnum.seekExact(term)) { - continue; // term not found -} -PostingsEnum postingsEnum = termsEnum.postings(null, PostingsEnum.OFFSETS); -if (postingsEnum == null) { - // no offsets or positions available - throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight"); -} -if (doc != postingsEnum.advance(doc)) { // now it's positioned, although may be exhausted - continue; +if (termsEnum.seekExact(term)) { + PostingsEnum postingsEnum = termsEnum.postings(null, PostingsEnum.OFFSETS); + + if (postingsEnum == null) { +// no offsets or positions available +throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight"); + } + + if (doc == postingsEnum.advance(doc)) { // now it's positioned, although may be exhausted +postingsEnum = phraseHelper.filterPostings(term, postingsEnum, strictPhrasesTermToSpans.get(term)); +if (postingsEnum != null) { + offsetsEnums.add(new OffsetsEnum(term, postingsEnum)); +} + } } -postingsEnum = strictPhrases.filterPostings(term, postingsEnum, strictPhrasesTermToSpans.get(term)); -if (postingsEnum == null) { - continue;// completely filtered out + } +} + +// Handle automata +if (automata.length > 0) { + offsetsEnums.addAll(createAutomataOffsetsFromTerms(termsIndex, doc)); +} + +return offsetsEnums; + } + + protected List createAutomataOffsetsFromTerms(Terms termsIndex, int doc) throws IOException { +Map automataPostings = new IdentityHashMap<>(automata.length); --- End diff -- Pushed that for now to see what you think. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user Timothy055 commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85618489 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.uhighlight; + +import java.io.IOException; +import java.util.List; + +import org.apache.lucene.index.PostingsEnum; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.PriorityQueue; + + +final class CompositePostingsEnum extends PostingsEnum { + + private static final int NO_MORE_POSITIONS = -2; + private final BytesRef term; + private final int freq; + private final PriorityQueue queue; + + + /** + * This class is used to ensure we don't over iterate the underlying + * postings enum by keeping track of the position relative to the + * frequency. + * Ideally this would've been an implementation of a PostingsEnum + * but it would have to delegate most methods and it seemed easier + * to just wrap the tweaked method. + */ + private static final class BoundsCheckingPostingsEnum { + + +private final PostingsEnum postingsEnum; +private final int freq; --- End diff -- Hmm, did you mean to calculate freq dynamically in the method and just use a remainingPositions count as walking over the postings? Instead of re-computing it's cached in the constructor, but if we modified it, the freq count would change as the postings were iterated. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user Timothy055 commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85616894 --- Diff: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/visibility/TestUnifiedHighlighterExtensibility.java --- @@ -79,7 +90,7 @@ public void testFieldOffsetStrategyExtensibility() { @Test public void testUnifiedHighlighterExtensibility() { final int maxLength = 1000; -UnifiedHighlighter uh = new UnifiedHighlighter(null, new MockAnalyzer(random())){ +UnifiedHighlighter uh = new UnifiedHighlighter(null, new MockAnalyzer(new Random())){ --- End diff -- I'm a bit confused. What I should I change? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user Timothy055 commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85616651 --- Diff: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/visibility/TestUnifiedHighlighterExtensibility.java --- @@ -79,7 +90,7 @@ public void testFieldOffsetStrategyExtensibility() { @Test public void testUnifiedHighlighterExtensibility() { final int maxLength = 1000; -UnifiedHighlighter uh = new UnifiedHighlighter(null, new MockAnalyzer(random())){ +UnifiedHighlighter uh = new UnifiedHighlighter(null, new MockAnalyzer(new Random())){ --- End diff -- I recall now why I changed it. I started getting this error and figured it was a change elsewhere: java.lang.IllegalStateException: No context information for thread: Thread[id=1, name=main, state=RUNNABLE, group=main]. Is this thread running under a class com.carrotsearch.randomizedtesting.RandomizedRunner runner context? Add @RunWith(class com.carrotsearch.randomizedtesting.RandomizedRunner.class) to your test class. Make sure your code accesses random contexts within @BeforeClass and @AfterClass boundary (for example, static test class initializers are not permitted to access random contexts). --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user Timothy055 commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85614277 --- Diff: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/visibility/TestUnifiedHighlighterExtensibility.java --- @@ -79,7 +90,7 @@ public void testFieldOffsetStrategyExtensibility() { @Test public void testUnifiedHighlighterExtensibility() { final int maxLength = 1000; -UnifiedHighlighter uh = new UnifiedHighlighter(null, new MockAnalyzer(random())){ +UnifiedHighlighter uh = new UnifiedHighlighter(null, new MockAnalyzer(new Random())){ --- End diff -- fixed --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user Timothy055 commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85613814 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldOffsetStrategy.java --- @@ -65,58 +65,88 @@ public String getField() { */ public abstract List getOffsetsEnums(IndexReader reader, int docId, String content) throws IOException; - protected List createOffsetsEnums(LeafReader leafReader, int doc, TokenStream tokenStream) throws IOException { -List offsetsEnums = createOffsetsEnumsFromReader(leafReader, doc); -if (automata.length > 0) { - offsetsEnums.add(createOffsetsEnumFromTokenStream(doc, tokenStream)); + protected List createOffsetsEnumsFromReader(LeafReader leafReader, int doc) throws IOException { +final Terms termsIndex = leafReader.terms(field); +if (termsIndex == null) { + return Collections.emptyList(); } -return offsetsEnums; - } - protected List createOffsetsEnumsFromReader(LeafReader atomicReader, int doc) throws IOException { // For strict positions, get a Map of term to Spans: //note: ScriptPhraseHelper.NONE does the right thing for these method calls final MapstrictPhrasesTermToSpans = -strictPhrases.getTermToSpans(atomicReader, doc); +phraseHelper.getTermToSpans(leafReader, doc); // Usually simply wraps terms in a List; but if willRewrite() then can be expanded final List sourceTerms = -strictPhrases.expandTermsIfRewrite(terms, strictPhrasesTermToSpans); +phraseHelper.expandTermsIfRewrite(terms, strictPhrasesTermToSpans); -final List offsetsEnums = new ArrayList<>(sourceTerms.size() + 1); +final List offsetsEnums = new ArrayList<>(sourceTerms.size() + automata.length); -Terms termsIndex = atomicReader == null || sourceTerms.isEmpty() ? null : atomicReader.terms(field); -if (termsIndex != null) { +// Handle sourceTerms: +if (!sourceTerms.isEmpty()) { TermsEnum termsEnum = termsIndex.iterator();//does not return null for (BytesRef term : sourceTerms) { -if (!termsEnum.seekExact(term)) { - continue; // term not found -} -PostingsEnum postingsEnum = termsEnum.postings(null, PostingsEnum.OFFSETS); -if (postingsEnum == null) { - // no offsets or positions available - throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight"); -} -if (doc != postingsEnum.advance(doc)) { // now it's positioned, although may be exhausted - continue; +if (termsEnum.seekExact(term)) { + PostingsEnum postingsEnum = termsEnum.postings(null, PostingsEnum.OFFSETS); + + if (postingsEnum == null) { +// no offsets or positions available +throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight"); + } + + if (doc == postingsEnum.advance(doc)) { // now it's positioned, although may be exhausted +postingsEnum = phraseHelper.filterPostings(term, postingsEnum, strictPhrasesTermToSpans.get(term)); +if (postingsEnum != null) { + offsetsEnums.add(new OffsetsEnum(term, postingsEnum)); +} + } } -postingsEnum = strictPhrases.filterPostings(term, postingsEnum, strictPhrasesTermToSpans.get(term)); -if (postingsEnum == null) { - continue;// completely filtered out + } +} + +// Handle automata +if (automata.length > 0) { + offsetsEnums.addAll(createAutomataOffsetsFromTerms(termsIndex, doc)); +} + +return offsetsEnums; + } + + protected List createAutomataOffsetsFromTerms(Terms termsIndex, int doc) throws IOException { +Map automataPostings = new IdentityHashMap<>(automata.length); --- End diff -- How about List that should give better locality without lose of type safety? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user Timothy055 commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85613130 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldOffsetStrategy.java --- @@ -65,58 +65,88 @@ public String getField() { */ public abstract List getOffsetsEnums(IndexReader reader, int docId, String content) throws IOException; - protected List createOffsetsEnums(LeafReader leafReader, int doc, TokenStream tokenStream) throws IOException { -List offsetsEnums = createOffsetsEnumsFromReader(leafReader, doc); -if (automata.length > 0) { - offsetsEnums.add(createOffsetsEnumFromTokenStream(doc, tokenStream)); + protected List createOffsetsEnumsFromReader(LeafReader leafReader, int doc) throws IOException { +final Terms termsIndex = leafReader.terms(field); +if (termsIndex == null) { + return Collections.emptyList(); } -return offsetsEnums; - } - protected List createOffsetsEnumsFromReader(LeafReader atomicReader, int doc) throws IOException { // For strict positions, get a Map of term to Spans: //note: ScriptPhraseHelper.NONE does the right thing for these method calls final MapstrictPhrasesTermToSpans = -strictPhrases.getTermToSpans(atomicReader, doc); +phraseHelper.getTermToSpans(leafReader, doc); // Usually simply wraps terms in a List; but if willRewrite() then can be expanded final List sourceTerms = -strictPhrases.expandTermsIfRewrite(terms, strictPhrasesTermToSpans); +phraseHelper.expandTermsIfRewrite(terms, strictPhrasesTermToSpans); -final List offsetsEnums = new ArrayList<>(sourceTerms.size() + 1); +final List offsetsEnums = new ArrayList<>(sourceTerms.size() + automata.length); -Terms termsIndex = atomicReader == null || sourceTerms.isEmpty() ? null : atomicReader.terms(field); -if (termsIndex != null) { +// Handle sourceTerms: +if (!sourceTerms.isEmpty()) { TermsEnum termsEnum = termsIndex.iterator();//does not return null for (BytesRef term : sourceTerms) { -if (!termsEnum.seekExact(term)) { - continue; // term not found -} -PostingsEnum postingsEnum = termsEnum.postings(null, PostingsEnum.OFFSETS); -if (postingsEnum == null) { - // no offsets or positions available - throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight"); -} -if (doc != postingsEnum.advance(doc)) { // now it's positioned, although may be exhausted - continue; +if (termsEnum.seekExact(term)) { + PostingsEnum postingsEnum = termsEnum.postings(null, PostingsEnum.OFFSETS); + + if (postingsEnum == null) { +// no offsets or positions available +throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight"); + } + + if (doc == postingsEnum.advance(doc)) { // now it's positioned, although may be exhausted +postingsEnum = phraseHelper.filterPostings(term, postingsEnum, strictPhrasesTermToSpans.get(term)); +if (postingsEnum != null) { + offsetsEnums.add(new OffsetsEnum(term, postingsEnum)); +} + } } -postingsEnum = strictPhrases.filterPostings(term, postingsEnum, strictPhrasesTermToSpans.get(term)); -if (postingsEnum == null) { - continue;// completely filtered out + } +} + +// Handle automata +if (automata.length > 0) { + offsetsEnums.addAll(createAutomataOffsetsFromTerms(termsIndex, doc)); +} + +return offsetsEnums; + } + + protected List createAutomataOffsetsFromTerms(Terms termsIndex, int doc) throws IOException { +Map automataPostings = new IdentityHashMap<>(automata.length); --- End diff -- One minor problem with that was that the code would not longer by type-safe because of the lack of generic arrays in java. I wouldn't be able to do `List[]` = new ArrayList[automata.length];` but could do `List[]` = new ArrayList[automata.length];` with unchecked casts. Seem worth it? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- -
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user Timothy055 commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85612011 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/AnalysisOffsetStrategy.java --- @@ -17,174 +17,28 @@ package org.apache.lucene.search.uhighlight; import java.io.IOException; -import java.util.ArrayList; -import java.util.Arrays; -import java.util.Collections; -import java.util.List; import org.apache.lucene.analysis.Analyzer; -import org.apache.lucene.analysis.FilteringTokenFilter; import org.apache.lucene.analysis.TokenStream; -import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.memory.MemoryIndex; -import org.apache.lucene.search.spans.SpanQuery; import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.automaton.Automata; import org.apache.lucene.util.automaton.CharacterRunAutomaton; +public abstract class AnalysisOffsetStrategy extends FieldOffsetStrategy { --- End diff -- Thanks --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user Timothy055 commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85611978 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.uhighlight; + +import java.io.IOException; +import java.util.List; + +import org.apache.lucene.index.PostingsEnum; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.PriorityQueue; + + +final class CompositePostingsEnum extends PostingsEnum { + + private static final int NO_MORE_POSITIONS = -2; + private final BytesRef term; + private final int freq; + private final PriorityQueue queue; + + + /** + * This class is used to ensure we don't over iterate the underlying + * postings enum by keeping track of the position relative to the + * frequency. + * Ideally this would've been an implementation of a PostingsEnum + * but it would have to delegate most methods and it seemed easier + * to just wrap the tweaked method. + */ + private static final class BoundsCheckingPostingsEnum { + + +private final PostingsEnum postingsEnum; +private final int freq; +private int position; +private int nextPosition; +private int positionInc = 1; + +private int startOffset; +private int endOffset; + +BoundsCheckingPostingsEnum(PostingsEnum postingsEnum) throws IOException { + this.postingsEnum = postingsEnum; + this.freq = postingsEnum.freq(); + nextPosition = postingsEnum.nextPosition(); + position = nextPosition; + startOffset = postingsEnum.startOffset(); + endOffset = postingsEnum.endOffset(); +} + +private boolean hasMorePositions() throws IOException { + return positionInc < freq; +} + +/** + * Returns the next position of the underlying postings enum unless + * it cannot iterate further and returns NO_MORE_POSITIONS; + * @return + * @throws IOException + */ +private int nextPosition() throws IOException { + position = nextPosition; + startOffset = postingsEnum.startOffset(); + endOffset = postingsEnum.endOffset(); + if (hasMorePositions()) { +positionInc++; +nextPosition = postingsEnum.nextPosition(); + } else { +nextPosition = NO_MORE_POSITIONS; + } + return position; +} + + } + + CompositePostingsEnum(BytesRef term, List postingsEnums) throws IOException { +this.term = term; +queue = new PriorityQueue(postingsEnums.size()) { + @Override + protected boolean lessThan(BoundsCheckingPostingsEnum a, BoundsCheckingPostingsEnum b) { +return a.position < b.position; + } +}; + +int freqAdd = 0; +for (PostingsEnum postingsEnum : postingsEnums) { + queue.add(new BoundsCheckingPostingsEnum(postingsEnum)); + freqAdd += postingsEnum.freq(); +} +freq = freqAdd; + } + + @Override + public int freq() throws IOException { +return freq; + } + + @Override + public int nextPosition() throws IOException { +int position = NO_MORE_POSITIONS; +while (queue.size() >= 1) { + queue.top().nextPosition(); + queue.updateTop(); //the new position may be behind another postingsEnum in the queue + position = queue.top().position; + + if (position == NO_MORE_POSITIONS) { +queue.pop(); //this postingsEnum is consumed, let's get rid of it + } else { +break; //we got a new position + } + +} +return position; + } + + @Override + public int startOffset() throws IOException { +return
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85606885 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/Passage.java --- @@ -40,7 +40,7 @@ BytesRef matchTerms[] = new BytesRef[8]; int numMatches = 0; -void addMatch(int startOffset, int endOffset, BytesRef term) { +public void addMatch(int startOffset, int endOffset, BytesRef term) { --- End diff -- Excellent; now it's possible for someone to override `FieldHighlighter.highlightOffsetsEnums()` to make Passages. But do add @lucene.experimental. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85607859 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/UnifiedHighlighter.java --- @@ -116,6 +116,8 @@ private boolean defaultHighlightPhrasesStrictly = true; // AKA "accuracy" or "query debugging" + private boolean defaultPassageRelevancyOverSpeed = true; //Prefer using a memory index --- End diff -- Suggest the comment be: For analysis, prefer MemoryIndex approach --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85603812 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.uhighlight; + +import java.io.IOException; +import java.util.List; + +import org.apache.lucene.index.PostingsEnum; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.PriorityQueue; + + +final class CompositePostingsEnum extends PostingsEnum { + + private static final int NO_MORE_POSITIONS = -2; + private final BytesRef term; + private final int freq; + private final PriorityQueue queue; + + + /** + * This class is used to ensure we don't over iterate the underlying + * postings enum by keeping track of the position relative to the + * frequency. + * Ideally this would've been an implementation of a PostingsEnum + * but it would have to delegate most methods and it seemed easier + * to just wrap the tweaked method. + */ + private static final class BoundsCheckingPostingsEnum { + + +private final PostingsEnum postingsEnum; +private final int freq; +private int position; +private int nextPosition; +private int positionInc = 1; + +private int startOffset; +private int endOffset; + +BoundsCheckingPostingsEnum(PostingsEnum postingsEnum) throws IOException { + this.postingsEnum = postingsEnum; + this.freq = postingsEnum.freq(); + nextPosition = postingsEnum.nextPosition(); + position = nextPosition; + startOffset = postingsEnum.startOffset(); + endOffset = postingsEnum.endOffset(); +} + +private boolean hasMorePositions() throws IOException { + return positionInc < freq; +} + +/** + * Returns the next position of the underlying postings enum unless + * it cannot iterate further and returns NO_MORE_POSITIONS; + * @return + * @throws IOException + */ +private int nextPosition() throws IOException { + position = nextPosition; + startOffset = postingsEnum.startOffset(); + endOffset = postingsEnum.endOffset(); + if (hasMorePositions()) { +positionInc++; +nextPosition = postingsEnum.nextPosition(); + } else { +nextPosition = NO_MORE_POSITIONS; + } + return position; +} + + } + + CompositePostingsEnum(BytesRef term, List postingsEnums) throws IOException { +this.term = term; +queue = new PriorityQueue(postingsEnums.size()) { + @Override + protected boolean lessThan(BoundsCheckingPostingsEnum a, BoundsCheckingPostingsEnum b) { +return a.position < b.position; --- End diff -- In the event the positions are equal (e.g. two terms a the same position in which the wildcard matches both), we might want to fall-back on startOffset then endOffset? Or maybe simply ignore position altogether and just do offsets, so then you needn't even track the position? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85608645 --- Diff: lucene/highlighter/src/test/org/apache/lucene/search/uhighlight/visibility/TestUnifiedHighlighterExtensibility.java --- @@ -79,7 +90,7 @@ public void testFieldOffsetStrategyExtensibility() { @Test public void testUnifiedHighlighterExtensibility() { final int maxLength = 1000; -UnifiedHighlighter uh = new UnifiedHighlighter(null, new MockAnalyzer(random())){ +UnifiedHighlighter uh = new UnifiedHighlighter(null, new MockAnalyzer(new Random())){ --- End diff -- ? why not `random()` ? This will likely fail precommit. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85607262 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/TokenStreamOffsetStrategy.java --- @@ -0,0 +1,60 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.uhighlight; + +import java.io.Closeable; +import java.io.IOException; +import java.util.Collections; +import java.util.List; + +import org.apache.lucene.analysis.Analyzer; +import org.apache.lucene.analysis.TokenStream; +import org.apache.lucene.index.IndexReader; +import org.apache.lucene.index.PostingsEnum; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.automaton.CharacterRunAutomaton; + +public class TokenStreamOffsetStrategy extends AnalysisOffsetStrategy { + + private static final BytesRef[] ZERO_LEN_BYTES_REF_ARRAY = new BytesRef[0]; + + public TokenStreamOffsetStrategy(String field, BytesRef[] terms, PhraseHelper phraseHelper, CharacterRunAutomaton[] automata, Analyzer indexAnalyzer) { +super(field, terms, phraseHelper, automata, indexAnalyzer); +this.automata = convertTermsToAutomata(terms, automata); +this.terms = ZERO_LEN_BYTES_REF_ARRAY; + } + + @Override + public List getOffsetsEnums(IndexReader reader, int docId, String content) throws IOException { +TokenStream tokenStream = tokenStream(content); +PostingsEnum mtqPostingsEnum = MultiTermHighlighting.getDocsEnum(tokenStream, automata); --- End diff -- I think there's a case to be made in moving ` MultiTermHighlighting.getDocsEnum` into this class, to thus keep the TokenStream aspect more isolated? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85602404 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.uhighlight; + +import java.io.IOException; +import java.util.List; + +import org.apache.lucene.index.PostingsEnum; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.PriorityQueue; + + +final class CompositePostingsEnum extends PostingsEnum { --- End diff -- A comment would be helpful to explain the scope/purpose. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85602867 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.uhighlight; + +import java.io.IOException; +import java.util.List; + +import org.apache.lucene.index.PostingsEnum; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.PriorityQueue; + + +final class CompositePostingsEnum extends PostingsEnum { + + private static final int NO_MORE_POSITIONS = -2; + private final BytesRef term; + private final int freq; + private final PriorityQueue queue; + + + /** + * This class is used to ensure we don't over iterate the underlying + * postings enum by keeping track of the position relative to the + * frequency. + * Ideally this would've been an implementation of a PostingsEnum + * but it would have to delegate most methods and it seemed easier + * to just wrap the tweaked method. + */ + private static final class BoundsCheckingPostingsEnum { + + +private final PostingsEnum postingsEnum; +private final int freq; --- End diff -- Instead of holding `freq` and `nextPosition`, why not just `remainingPositions`? --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85606333 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldOffsetStrategy.java --- @@ -65,58 +65,88 @@ public String getField() { */ public abstract List getOffsetsEnums(IndexReader reader, int docId, String content) throws IOException; - protected List createOffsetsEnums(LeafReader leafReader, int doc, TokenStream tokenStream) throws IOException { -List offsetsEnums = createOffsetsEnumsFromReader(leafReader, doc); -if (automata.length > 0) { - offsetsEnums.add(createOffsetsEnumFromTokenStream(doc, tokenStream)); + protected List createOffsetsEnumsFromReader(LeafReader leafReader, int doc) throws IOException { +final Terms termsIndex = leafReader.terms(field); +if (termsIndex == null) { + return Collections.emptyList(); } -return offsetsEnums; - } - protected List createOffsetsEnumsFromReader(LeafReader atomicReader, int doc) throws IOException { // For strict positions, get a Map of term to Spans: //note: ScriptPhraseHelper.NONE does the right thing for these method calls final MapstrictPhrasesTermToSpans = -strictPhrases.getTermToSpans(atomicReader, doc); +phraseHelper.getTermToSpans(leafReader, doc); // Usually simply wraps terms in a List; but if willRewrite() then can be expanded final List sourceTerms = -strictPhrases.expandTermsIfRewrite(terms, strictPhrasesTermToSpans); +phraseHelper.expandTermsIfRewrite(terms, strictPhrasesTermToSpans); -final List offsetsEnums = new ArrayList<>(sourceTerms.size() + 1); +final List offsetsEnums = new ArrayList<>(sourceTerms.size() + automata.length); -Terms termsIndex = atomicReader == null || sourceTerms.isEmpty() ? null : atomicReader.terms(field); -if (termsIndex != null) { +// Handle sourceTerms: +if (!sourceTerms.isEmpty()) { TermsEnum termsEnum = termsIndex.iterator();//does not return null for (BytesRef term : sourceTerms) { -if (!termsEnum.seekExact(term)) { - continue; // term not found -} -PostingsEnum postingsEnum = termsEnum.postings(null, PostingsEnum.OFFSETS); -if (postingsEnum == null) { - // no offsets or positions available - throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight"); -} -if (doc != postingsEnum.advance(doc)) { // now it's positioned, although may be exhausted - continue; +if (termsEnum.seekExact(term)) { + PostingsEnum postingsEnum = termsEnum.postings(null, PostingsEnum.OFFSETS); + + if (postingsEnum == null) { +// no offsets or positions available +throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight"); + } + + if (doc == postingsEnum.advance(doc)) { // now it's positioned, although may be exhausted +postingsEnum = phraseHelper.filterPostings(term, postingsEnum, strictPhrasesTermToSpans.get(term)); +if (postingsEnum != null) { + offsetsEnums.add(new OffsetsEnum(term, postingsEnum)); +} + } } -postingsEnum = strictPhrases.filterPostings(term, postingsEnum, strictPhrasesTermToSpans.get(term)); -if (postingsEnum == null) { - continue;// completely filtered out + } +} + +// Handle automata +if (automata.length > 0) { + offsetsEnums.addAll(createAutomataOffsetsFromTerms(termsIndex, doc)); +} + +return offsetsEnums; + } + + protected List createAutomataOffsetsFromTerms(Terms termsIndex, int doc) throws IOException { +Map automataPostings = new IdentityHashMap<>(automata.length); --- End diff -- I suggest a parallel array to automata, so that later you can avoid a map lookup on each matching term. Also, I suggest lazy-initializing the array later... perhaps some wildcards in a disjunction might never match. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85602210 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/AnalysisOffsetStrategy.java --- @@ -17,174 +17,28 @@ package org.apache.lucene.search.uhighlight; import java.io.IOException; -import java.util.ArrayList; -import java.util.Arrays; -import java.util.Collections; -import java.util.List; import org.apache.lucene.analysis.Analyzer; -import org.apache.lucene.analysis.FilteringTokenFilter; import org.apache.lucene.analysis.TokenStream; -import org.apache.lucene.analysis.tokenattributes.CharTermAttribute; -import org.apache.lucene.index.IndexReader; -import org.apache.lucene.index.LeafReader; -import org.apache.lucene.index.Terms; -import org.apache.lucene.index.memory.MemoryIndex; -import org.apache.lucene.search.spans.SpanQuery; import org.apache.lucene.util.BytesRef; -import org.apache.lucene.util.automaton.Automata; import org.apache.lucene.util.automaton.CharacterRunAutomaton; +public abstract class AnalysisOffsetStrategy extends FieldOffsetStrategy { --- End diff -- All public classes need a javadoc comment. Remember lucene.internal for this one. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85603003 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.uhighlight; + +import java.io.IOException; +import java.util.List; + +import org.apache.lucene.index.PostingsEnum; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.PriorityQueue; + + +final class CompositePostingsEnum extends PostingsEnum { + + private static final int NO_MORE_POSITIONS = -2; + private final BytesRef term; + private final int freq; + private final PriorityQueue queue; + + + /** + * This class is used to ensure we don't over iterate the underlying + * postings enum by keeping track of the position relative to the + * frequency. + * Ideally this would've been an implementation of a PostingsEnum + * but it would have to delegate most methods and it seemed easier + * to just wrap the tweaked method. + */ + private static final class BoundsCheckingPostingsEnum { + + +private final PostingsEnum postingsEnum; +private final int freq; +private int position; +private int nextPosition; +private int positionInc = 1; + +private int startOffset; --- End diff -- Don't need these. They can be fetched on-demand from the head of the queue, easily & cheaply enough. --- If your project is set up for it, you can reply to this email and have your reply appear on GitHub as well. If your project does not have this feature enabled and wishes so, or if the feature is enabled but not working, please contact infrastructure at infrastruct...@apache.org or file a JIRA ticket with INFRA. --- - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
Github user dsmiley commented on a diff in the pull request: https://github.com/apache/lucene-solr/pull/105#discussion_r85604667 --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java --- @@ -0,0 +1,165 @@ +/* + * Licensed to the Apache Software Foundation (ASF) under one or more + * contributor license agreements. See the NOTICE file distributed with + * this work for additional information regarding copyright ownership. + * The ASF licenses this file to You under the Apache License, Version 2.0 + * (the "License"); you may not use this file except in compliance with + * the License. You may obtain a copy of the License at + * + * http://www.apache.org/licenses/LICENSE-2.0 + * + * Unless required by applicable law or agreed to in writing, software + * distributed under the License is distributed on an "AS IS" BASIS, + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied. + * See the License for the specific language governing permissions and + * limitations under the License. + */ +package org.apache.lucene.search.uhighlight; + +import java.io.IOException; +import java.util.List; + +import org.apache.lucene.index.PostingsEnum; +import org.apache.lucene.util.BytesRef; +import org.apache.lucene.util.PriorityQueue; + + +final class CompositePostingsEnum extends PostingsEnum { + + private static final int NO_MORE_POSITIONS = -2; + private final BytesRef term; + private final int freq; + private final PriorityQueue queue; + + + /** + * This class is used to ensure we don't over iterate the underlying + * postings enum by keeping track of the position relative to the + * frequency. + * Ideally this would've been an implementation of a PostingsEnum + * but it would have to delegate most methods and it seemed easier + * to just wrap the tweaked method. + */ + private static final class BoundsCheckingPostingsEnum { + + +private final PostingsEnum postingsEnum; +private final int freq; +private int position; +private int nextPosition; +private int positionInc = 1; + +private int startOffset; +private int endOffset; + +BoundsCheckingPostingsEnum(PostingsEnum postingsEnum) throws IOException { + this.postingsEnum = postingsEnum; + this.freq = postingsEnum.freq(); + nextPosition = postingsEnum.nextPosition(); + position = nextPosition; + startOffset = postingsEnum.startOffset(); + endOffset = postingsEnum.endOffset(); +} + +private boolean hasMorePositions() throws IOException { + return positionInc < freq; +} + +/** + * Returns the next position of the underlying postings enum unless + * it cannot iterate further and returns NO_MORE_POSITIONS; + * @return + * @throws IOException + */ +private int nextPosition() throws IOException { + position = nextPosition; + startOffset = postingsEnum.startOffset(); + endOffset = postingsEnum.endOffset(); + if (hasMorePositions()) { +positionInc++; +nextPosition = postingsEnum.nextPosition(); + } else { +nextPosition = NO_MORE_POSITIONS; + } + return position; +} + + } + + CompositePostingsEnum(BytesRef term, List postingsEnums) throws IOException { +this.term = term; +queue = new PriorityQueue(postingsEnums.size()) { + @Override + protected boolean lessThan(BoundsCheckingPostingsEnum a, BoundsCheckingPostingsEnum b) { +return a.position < b.position; + } +}; + +int freqAdd = 0; +for (PostingsEnum postingsEnum : postingsEnums) { + queue.add(new BoundsCheckingPostingsEnum(postingsEnum)); + freqAdd += postingsEnum.freq(); +} +freq = freqAdd; + } + + @Override + public int freq() throws IOException { +return freq; + } + + @Override + public int nextPosition() throws IOException { +int position = NO_MORE_POSITIONS; +while (queue.size() >= 1) { + queue.top().nextPosition(); + queue.updateTop(); //the new position may be behind another postingsEnum in the queue + position = queue.top().position; + + if (position == NO_MORE_POSITIONS) { +queue.pop(); //this postingsEnum is consumed, let's get rid of it + } else { +break; //we got a new position + } + +} +return position; + } + + @Override + public int startOffset() throws IOException { +return
[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...
GitHub user Timothy055 opened a pull request: https://github.com/apache/lucene-solr/pull/105 LUCENE-7526 Improvements to UnifiedHighlighter OffsetStrategies Pull request for LUCENE-7526 You can merge this pull request into a Git repository by running: $ git pull https://github.com/Timothy055/lucene-solr master Alternatively you can review and apply these changes as the patch at: https://github.com/apache/lucene-solr/pull/105.patch To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message: This closes #105 commit 02e932c4a6146363680b88f4947a693c6697c955 Author: Timothy RodriguezDate: 2016-09-01T19:23:50Z Initial fork of PostingsHighlighter for UnifiedHighlighter commit 9d88411b3985a98851384d78d681431dba710e89 Author: Timothy Rodriguez Date: 2016-09-01T23:17:06Z Initial commit of the UnifiedHighlighter for OSS contribution commit e45e39bc4b07ea33e4423b264c2fefb9aa08777a Author: David Smiley Date: 2016-09-02T12:45:49Z Fix misc issues; "ant test" now works. (#1) commit 046a28ef31acf4cea7d2554b827e6a714e3d Author: Timothy Rodriguez Date: 2016-09-02T20:58:31Z Minor refactoring of the AnalysisFieldHighlighter commit ccd1a2280abd4b48cfef8122696e5d9cfd12920f Author: David Smiley Date: 2016-09-03T12:55:20Z AbstractFieldHighlighter: order methods more sensibly; renamed a couple. commit d4714a04a3e41d5e95bbe942b275c32ed69b9c2e Author: David Smiley Date: 2016-09-04T01:03:29Z Improve javadocs and @lucene.external/internal labeling & scope. "ant precommit" now passes. commit e0659f18a59bf2893076da6d7643ff30f2fa5a52 Author: David Smiley Date: 2016-09-04T01:25:55Z Analysis: remove dubious filter() method commit ccd7ce707bff2c06da89b31853cca9aecea72008 Author: David Smiley Date: 2016-09-04T01:44:01Z getStrictPhraseHelper -> rm "Strict", getHighlightAccuracy -> getFlags, and only call filterExtractedTerms once. commit ffc2a22c700b8abcbf87673d5d05bb3659d177c9 Author: David Smiley Date: 2016-09-04T15:21:08Z UnifiedHighlighter round 2 (#2) * AbstractFieldHighlighter: order methods more sensibly; renamed a couple. * Improve javadocs and @lucene.external/internal labeling & scope. "ant precommit" now passes. * Analysis: remove dubious filter() method * getStrictPhraseHelper -> rm "Strict", getHighlightAccuracy -> getFlags, and only call filterExtractedTerms once. commit 5f95e05595db462d3ab5bffc68c2c92f70875072 Author: David Smiley Date: 2016-09-04T16:12:33Z Refactor: FieldOffsetStrategy commit 86fb6265fbbdb955ead6d4baf944bf708175715e Author: David Smiley Date: 2016-09-04T16:21:32Z stop passing maxPassages into highlightFieldForDoc() commit f6fd80544eae9fab953b94b1e9346c0883f956eb Author: David Smiley Date: 2016-09-04T16:12:33Z Refactor: FieldOffsetStrategy commit b335a673c2ce45904890c1e9af7cbfda2bd27b0f Author: David Smiley Date: 2016-09-04T16:21:32Z stop passing maxPassages into highlightFieldForDoc() commit 478db9437b92214cbf459f82ba2e3a67c966a150 Author: David Smiley Date: 2016-09-04T18:29:44Z Rename subclasses of FieldOffsetStrategy. commit dbf4280755c11420a5032445cd618fadb7444b61 Author: David Smiley Date: 2016-09-04T18:31:34Z Re-order and harmonize params on methods called by UH.getFieldHighlighter() commit f0340e27e61dcda2e11992f08ec07a72fad6c24c Author: David Smiley Date: 2016-09-04T18:53:51Z FieldHighlighter: harmonize field/param order. And don't apply maxNoHighlightPasses twice. commit 817f63c1d48fd523c13b9c40a2ae9b8a4047209a Author: Timothy Rodriguez Date: 2016-09-06T20:43:20Z Merge of renaming changes commit 0f644a4f53c1ed4d41d562848f6fe51a87442a75 Author: Timothy Rodriguez Date: 2016-09-06T20:54:13Z add visibility tests commit 9171f49e117085e7d086267bb73836831ff07f8e Author: Timothy Rodriguez Date: 2016-09-07T14:26:59Z ADd additional extensibility test commit 7ce488147cb811e15cb6e9125a835171157746f2 Author: Timothy Rodriguez Date: 2016-09-28T22:04:15Z Reduce visibility of MultiTermHighlighting to package protected commit 2f08465020448592b0e8750db568ade5a9218267 Author: Timothy M. Rodriguez Date: 2016-10-11T16:44:29Z Initial commit that will use memory index to generate offsets enum if the tokenstream is null commit 357f3dfb9ace4deef20787af19bc2e5a6b4ff61e Author: Timothy M. Rodriguez