[
https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616555#comment-15616555
]
ASF GitHub Bot commented on LUCENE-7526:
----------------------------------------
Github user dsmiley commented on a diff in the pull request:
https://github.com/apache/lucene-solr/pull/105#discussion_r85606333
--- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldOffsetStrategy.java ---
@@ -65,58 +65,88 @@ public String getField() {
    */
   public abstract List<OffsetsEnum> getOffsetsEnums(IndexReader reader, int docId, String content) throws IOException;

-  protected List<OffsetsEnum> createOffsetsEnums(LeafReader leafReader, int doc, TokenStream tokenStream) throws IOException {
-    List<OffsetsEnum> offsetsEnums = createOffsetsEnumsFromReader(leafReader, doc);
-    if (automata.length > 0) {
-      offsetsEnums.add(createOffsetsEnumFromTokenStream(doc, tokenStream));
+  protected List<OffsetsEnum> createOffsetsEnumsFromReader(LeafReader leafReader, int doc) throws IOException {
+    final Terms termsIndex = leafReader.terms(field);
+    if (termsIndex == null) {
+      return Collections.emptyList();
     }
-    return offsetsEnums;
-  }
-  protected List<OffsetsEnum> createOffsetsEnumsFromReader(LeafReader atomicReader, int doc) throws IOException {
     // For strict positions, get a Map of term to Spans:
     //  note: ScriptPhraseHelper.NONE does the right thing for these method calls
     final Map<BytesRef, Spans> strictPhrasesTermToSpans =
-        strictPhrases.getTermToSpans(atomicReader, doc);
+        phraseHelper.getTermToSpans(leafReader, doc);
     // Usually simply wraps terms in a List; but if willRewrite() then can be expanded
     final List<BytesRef> sourceTerms =
-        strictPhrases.expandTermsIfRewrite(terms, strictPhrasesTermToSpans);
+        phraseHelper.expandTermsIfRewrite(terms, strictPhrasesTermToSpans);

-    final List<OffsetsEnum> offsetsEnums = new ArrayList<>(sourceTerms.size() + 1);
+    final List<OffsetsEnum> offsetsEnums = new ArrayList<>(sourceTerms.size() + automata.length);

-    Terms termsIndex = atomicReader == null || sourceTerms.isEmpty() ? null : atomicReader.terms(field);
-    if (termsIndex != null) {
+    // Handle sourceTerms:
+    if (!sourceTerms.isEmpty()) {
       TermsEnum termsEnum = termsIndex.iterator();//does not return null
       for (BytesRef term : sourceTerms) {
-        if (!termsEnum.seekExact(term)) {
-          continue; // term not found
-        }
-        PostingsEnum postingsEnum = termsEnum.postings(null, PostingsEnum.OFFSETS);
-        if (postingsEnum == null) {
-          // no offsets or positions available
-          throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight");
-        }
-        if (doc != postingsEnum.advance(doc)) { // now it's positioned, although may be exhausted
-          continue;
+        if (termsEnum.seekExact(term)) {
+          PostingsEnum postingsEnum = termsEnum.postings(null, PostingsEnum.OFFSETS);
+
+          if (postingsEnum == null) {
+            // no offsets or positions available
+            throw new IllegalArgumentException("field '" + field + "' was indexed without offsets, cannot highlight");
+          }
+
+          if (doc == postingsEnum.advance(doc)) { // now it's positioned, although may be exhausted
+            postingsEnum = phraseHelper.filterPostings(term, postingsEnum, strictPhrasesTermToSpans.get(term));
+            if (postingsEnum != null) {
+              offsetsEnums.add(new OffsetsEnum(term, postingsEnum));
+            }
+          }
         }
-        postingsEnum = strictPhrases.filterPostings(term, postingsEnum, strictPhrasesTermToSpans.get(term));
-        if (postingsEnum == null) {
-          continue;// completely filtered out
+      }
+    }
+
+    // Handle automata
+    if (automata.length > 0) {
+      offsetsEnums.addAll(createAutomataOffsetsFromTerms(termsIndex, doc));
+    }
+
+    return offsetsEnums;
+  }
+
+  protected List<OffsetsEnum> createAutomataOffsetsFromTerms(Terms termsIndex, int doc) throws IOException {
+    Map<CharacterRunAutomaton, List<PostingsEnum>> automataPostings = new IdentityHashMap<>(automata.length);
--- End diff ---
I suggest a parallel array to automata, so that later you can avoid a map
lookup on each matching term. Also, I suggest lazy-initializing the array
later... perhaps some wildcards in a disjunction might never match.
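To make the suggestion concrete, here is a minimal, self-contained sketch of the parallel-array-with-lazy-init idea. The names are illustrative only (a "automaton" is stood in for by a simple prefix test rather than Lucene's CharacterRunAutomaton), not the actual patch code:

```java
import java.util.ArrayList;
import java.util.List;

public class ParallelAutomataSketch {

  // Instead of an IdentityHashMap keyed by automaton, keep an array that is
  // parallel to the automata array and index into it directly. Each sub-list
  // is allocated lazily on the first match, since a wildcard in a disjunction
  // may never match any term at all.
  static List<String>[] collectMatches(String[] prefixAutomata, List<String> terms) {
    @SuppressWarnings("unchecked")
    List<String>[] matchesPerAutomaton = new List[prefixAutomata.length]; // parallel to automata
    for (String term : terms) {
      for (int i = 0; i < prefixAutomata.length; i++) {
        if (term.startsWith(prefixAutomata[i])) {        // stand-in for automaton.run(term)
          if (matchesPerAutomaton[i] == null) {
            matchesPerAutomaton[i] = new ArrayList<>();  // lazy init on first match
          }
          matchesPerAutomaton[i].add(term);              // array index, no map lookup
        }
      }
    }
    return matchesPerAutomaton;
  }
}
```

The win is small per term but the loop runs once per matching term in the segment, so avoiding the per-hit hash lookup (and the never-used allocations) adds up on large term dictionaries.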
> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>
> Key: LUCENE-7526
> URL: https://issues.apache.org/jira/browse/LUCENE-7526
> Project: Lucene - Core
> Issue Type: Improvement
> Components: modules/highlighter
> Reporter: Timothy M. Rodriguez
> Assignee: David Smiley
> Priority: Minor
> Fix For: 6.4
>
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
> ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a
> MemoryIndex for producing Offsets
> ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a
> MemoryIndex. Can only be used if the query distills down to terms and
> automata.
> * TokenStream removal
> ** MemoryIndexOffsetStrategy - previously a TokenStream was created to fill
> the memory index and then once consumed a new one was generated by
> uninverting the MemoryIndex back into a TokenStream if there were automata
> (wildcard/mtq queries) involved. Now this is avoided, which should save
> memory and avoid a second pass over the data.
> ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid
> generating a TokenStream if automata are involved.
> ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for
> wildcard/mtq queries. This should improve relevancy by providing unified
> metrics for a wildcard across all its term matches
> * Added a HighlightFlag for enabling the newly separated
> TokenStreamOffsetStrategy since it can adversely affect passage relevancy
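The CompositePostingsEnum bullet above can be illustrated with a hedged sketch (this is not the actual Lucene class, just the underlying merge idea): a wildcard expands to several terms, each with its own sorted stream of match positions, and presenting them as one merged, still-sorted stream lets the highlighter treat the wildcard as a single unit across all its term matches:

```java
import java.util.Comparator;
import java.util.PriorityQueue;

public class MergedPositionsSketch {

  // Merge several individually sorted position arrays (one per expanded term)
  // into a single sorted array, using a min-heap over the streams' heads.
  static int[] mergePositions(int[][] perTermPositions) {
    // heap entries: {position, termIndex, nextOffsetWithinTerm}
    PriorityQueue<int[]> heap = new PriorityQueue<>(Comparator.comparingInt(e -> e[0]));
    int total = 0;
    for (int i = 0; i < perTermPositions.length; i++) {
      total += perTermPositions[i].length;
      if (perTermPositions[i].length > 0) {
        heap.add(new int[]{perTermPositions[i][0], i, 1});
      }
    }
    int[] merged = new int[total];
    int n = 0;
    while (!heap.isEmpty()) {
      int[] head = heap.poll();
      merged[n++] = head[0];          // emit the smallest pending position
      int term = head[1];
      int next = head[2];
      if (next < perTermPositions[term].length) {
        heap.add(new int[]{perTermPositions[term][next], term, next + 1});
      }
    }
    return merged;
  }
}
```

For example, a wildcard matching two terms with positions {1, 5} and {2, 3, 9} is seen by the passage scorer as one enum with positions {1, 2, 3, 5, 9}, rather than as two competing terms.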
--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]