[
https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616548#comment-15616548
]
ASF GitHub Bot commented on LUCENE-7526:
----------------------------------------
Github user dsmiley commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/105#discussion_r85603812
--- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/CompositePostingsEnum.java ---
@@ -0,0 +1,165 @@
+/*
+ * Licensed to the Apache Software Foundation (ASF) under one or more
+ * contributor license agreements. See the NOTICE file distributed with
+ * this work for additional information regarding copyright ownership.
+ * The ASF licenses this file to You under the Apache License, Version 2.0
+ * (the "License"); you may not use this file except in compliance with
+ * the License. You may obtain a copy of the License at
+ *
+ * http://www.apache.org/licenses/LICENSE-2.0
+ *
+ * Unless required by applicable law or agreed to in writing, software
+ * distributed under the License is distributed on an "AS IS" BASIS,
+ * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ * See the License for the specific language governing permissions and
+ * limitations under the License.
+ */
+package org.apache.lucene.search.uhighlight;
+
+import java.io.IOException;
+import java.util.List;
+
+import org.apache.lucene.index.PostingsEnum;
+import org.apache.lucene.util.BytesRef;
+import org.apache.lucene.util.PriorityQueue;
+
+
+final class CompositePostingsEnum extends PostingsEnum {
+
+ private static final int NO_MORE_POSITIONS = -2;
+ private final BytesRef term;
+ private final int freq;
+ private final PriorityQueue<BoundsCheckingPostingsEnum> queue;
+
+
+ /**
+ * This class is used to ensure we don't over iterate the underlying
+ * postings enum by keeping track of the position relative to the
+ * frequency.
+ * Ideally this would've been an implementation of a PostingsEnum
+ * but it would have to delegate most methods and it seemed easier
+ * to just wrap the tweaked method.
+ */
+ private static final class BoundsCheckingPostingsEnum {
+
+
+ private final PostingsEnum postingsEnum;
+ private final int freq;
+ private int position;
+ private int nextPosition;
+ private int positionInc = 1;
+
+ private int startOffset;
+ private int endOffset;
+
+ BoundsCheckingPostingsEnum(PostingsEnum postingsEnum) throws IOException {
+ this.postingsEnum = postingsEnum;
+ this.freq = postingsEnum.freq();
+ nextPosition = postingsEnum.nextPosition();
+ position = nextPosition;
+ startOffset = postingsEnum.startOffset();
+ endOffset = postingsEnum.endOffset();
+ }
+
+ private boolean hasMorePositions() throws IOException {
+ return positionInc < freq;
+ }
+
+ /**
+ * Returns the next position of the underlying postings enum unless
+ * it cannot iterate further and returns NO_MORE_POSITIONS;
+ * @return
+ * @throws IOException
+ */
+ private int nextPosition() throws IOException {
+ position = nextPosition;
+ startOffset = postingsEnum.startOffset();
+ endOffset = postingsEnum.endOffset();
+ if (hasMorePositions()) {
+ positionInc++;
+ nextPosition = postingsEnum.nextPosition();
+ } else {
+ nextPosition = NO_MORE_POSITIONS;
+ }
+ return position;
+ }
+
+ }
+
+ CompositePostingsEnum(BytesRef term, List<PostingsEnum> postingsEnums) throws IOException {
+ this.term = term;
+ queue = new PriorityQueue<BoundsCheckingPostingsEnum>(postingsEnums.size()) {
+ @Override
+ protected boolean lessThan(BoundsCheckingPostingsEnum a, BoundsCheckingPostingsEnum b) {
+ return a.position < b.position;
--- End diff --
In the event the positions are equal (e.g. two terms at the same position,
both of which the wildcard matches), we might want to fall back on
startOffset, then endOffset? Or maybe simply ignore position altogether and
compare only offsets, so that you needn't even track the position?
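
To make the two orderings concrete, here is a rough sketch rather than a
proposed patch. It assumes a hypothetical Occurrence holder standing in for
the position/startOffset/endOffset state of BoundsCheckingPostingsEnum, and it
uses java.util.PriorityQueue with a Comparator instead of Lucene's
PriorityQueue so that it runs without Lucene on the classpath. All names in it
are illustrative only.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

public class OffsetOrderingSketch {

  /** Hypothetical stand-in for the state tracked per wrapped PostingsEnum. */
  public static final class Occurrence {
    public final int position;
    public final int startOffset;
    public final int endOffset;

    public Occurrence(int position, int startOffset, int endOffset) {
      this.position = position;
      this.startOffset = startOffset;
      this.endOffset = endOffset;
    }

    @Override
    public String toString() {
      return "pos=" + position + " [" + startOffset + "," + endOffset + ")";
    }
  }

  /** Option 1: order by position, falling back on startOffset, then endOffset. */
  public static final Comparator<Occurrence> BY_POSITION_THEN_OFFSETS =
      Comparator.comparingInt((Occurrence o) -> o.position)
          .thenComparingInt(o -> o.startOffset)
          .thenComparingInt(o -> o.endOffset);

  /** Option 2: ignore position entirely and order by offsets only. */
  public static final Comparator<Occurrence> BY_OFFSETS_ONLY =
      Comparator.comparingInt((Occurrence o) -> o.startOffset)
          .thenComparingInt(o -> o.endOffset);

  /** Pushes everything through a priority queue and returns the pop order. */
  public static List<Occurrence> drain(Comparator<Occurrence> order, List<Occurrence> input) {
    PriorityQueue<Occurrence> queue = new PriorityQueue<>(order);
    queue.addAll(input);
    List<Occurrence> out = new ArrayList<>();
    while (!queue.isEmpty()) {
      out.add(queue.poll());
    }
    return out;
  }

  public static void main(String[] args) {
    List<Occurrence> input = new ArrayList<>();
    input.add(new Occurrence(3, 20, 27)); // two terms at the same position,
    input.add(new Occurrence(3, 20, 24)); // both matched by the wildcard
    input.add(new Occurrence(1, 5, 9));
    System.out.println(drain(BY_POSITION_THEN_OFFSETS, input));
    System.out.println(drain(BY_OFFSETS_ONLY, input));
  }
}
```

In the Lucene PriorityQueue subclass, the equivalent would presumably be
lessThan returning true when the chosen comparison is negative. With plain
a.position < b.position, the relative order of the two position-3 entries is
left to the queue's internals.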
> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>
> Key: LUCENE-7526
> URL: https://issues.apache.org/jira/browse/LUCENE-7526
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: Timothy M. Rodriguez
> Assignee: David Smiley
> Priority: Minor
> Fix For: 6.4
>
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
> ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a
> MemoryIndex for producing Offsets
> ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a
> MemoryIndex. Can only be used if the query distills down to terms and
> automata.
> * TokenStream removal
> ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill
> the memory index and then once consumed a new one was generated by
> uninverting the MemoryIndex back into a TokenStream if there were automata
> (wildcard/mtq queries) involved. Now this is avoided, which should save
> memory and avoid a second pass over the data.
> ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid
> generating a TokenStream if automata are involved.
> ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for
> wildcard/mtq queries. This should improve relevancy by providing unified
> metrics for a wildcard across all its term matches
> * Added a HighlightFlag for enabling the newly separated
> TokenStreamOffsetStrategy since it can adversely affect passage relevancy