[ https://issues.apache.org/jira/browse/LUCENE-7526?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15616549#comment-15616549 ]

ASF GitHub Bot commented on LUCENE-7526:
----------------------------------------

Github user dsmiley commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/105#discussion_r85607262
  
    --- Diff: lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/TokenStreamOffsetStrategy.java ---
    @@ -0,0 +1,60 @@
    +/*
    + * Licensed to the Apache Software Foundation (ASF) under one or more
    + * contributor license agreements.  See the NOTICE file distributed with
    + * this work for additional information regarding copyright ownership.
    + * The ASF licenses this file to You under the Apache License, Version 2.0
    + * (the "License"); you may not use this file except in compliance with
    + * the License.  You may obtain a copy of the License at
    + *
    + *     http://www.apache.org/licenses/LICENSE-2.0
    + *
    + * Unless required by applicable law or agreed to in writing, software
    + * distributed under the License is distributed on an "AS IS" BASIS,
    + * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
    + * See the License for the specific language governing permissions and
    + * limitations under the License.
    + */
    +package org.apache.lucene.search.uhighlight;
    +
    +import java.io.Closeable;
    +import java.io.IOException;
    +import java.util.Collections;
    +import java.util.List;
    +
    +import org.apache.lucene.analysis.Analyzer;
    +import org.apache.lucene.analysis.TokenStream;
    +import org.apache.lucene.index.IndexReader;
    +import org.apache.lucene.index.PostingsEnum;
    +import org.apache.lucene.util.BytesRef;
    +import org.apache.lucene.util.automaton.CharacterRunAutomaton;
    +
    +public class TokenStreamOffsetStrategy extends AnalysisOffsetStrategy {
    +
    +  private static final BytesRef[] ZERO_LEN_BYTES_REF_ARRAY = new BytesRef[0];
    +
    +  public TokenStreamOffsetStrategy(String field, BytesRef[] terms, PhraseHelper phraseHelper, CharacterRunAutomaton[] automata, Analyzer indexAnalyzer) {
    +    super(field, terms, phraseHelper, automata, indexAnalyzer);
    +    this.automata = convertTermsToAutomata(terms, automata);
    +    this.terms = ZERO_LEN_BYTES_REF_ARRAY;
    +  }
    +
    +  @Override
    +  public List<OffsetsEnum> getOffsetsEnums(IndexReader reader, int docId, String content) throws IOException {
    +    TokenStream tokenStream = tokenStream(content);
    +    PostingsEnum mtqPostingsEnum = MultiTermHighlighting.getDocsEnum(tokenStream, automata);
    --- End diff --
    
    I think there's a case to be made for moving 
`MultiTermHighlighting.getDocsEnum` into this class, to keep the 
TokenStream aspect more isolated?


> Improvements to UnifiedHighlighter OffsetStrategies
> ---------------------------------------------------
>
>                 Key: LUCENE-7526
>                 URL: https://issues.apache.org/jira/browse/LUCENE-7526
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Timothy M. Rodriguez
>            Assignee: David Smiley
>            Priority: Minor
>             Fix For: 6.4
>
>
> This ticket improves several of the UnifiedHighlighter FieldOffsetStrategies 
> by reducing reliance on creating or re-creating TokenStreams.
> The primary changes are as follows:
> * AnalysisOffsetStrategy - split into two offset strategies
>   ** MemoryIndexOffsetStrategy - the primary analysis mode that utilizes a 
> MemoryIndex for producing Offsets
>   ** TokenStreamOffsetStrategy - an offset strategy that avoids creating a 
> MemoryIndex.  Can only be used if the query distills down to terms and 
> automata.
> * TokenStream removal 
>   ** MemoryIndexOffsetStrategy - previously, a TokenStream was created to fill 
> the memory index and then, once it was consumed, a new one was generated by 
> uninverting the MemoryIndex back into a TokenStream if there were automata 
> (wildcard/mtq queries) involved.  Now this is avoided, which should save 
> memory and avoid a second pass over the data.
>   ** TermVectorOffsetStrategy - this was refactored in a similar way to avoid 
> generating a TokenStream if automata are involved.
>   ** PostingsWithTermVectorsOffsetStrategy - similar refactoring
> * CompositePostingsEnum - aggregates several underlying PostingsEnums for 
> wildcard/mtq queries.  This should improve relevancy by providing unified 
> metrics for a wildcard across all its term matches.
> * Added a HighlightFlag for enabling the newly separated 
> TokenStreamOffsetStrategy, since it can adversely affect passage relevancy.
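
To illustrate the aggregation idea behind CompositePostingsEnum described above, here is a minimal, self-contained sketch. It is not the code from the pull request and does not use Lucene's PostingsEnum API: `Offset`, `CompositeOffsets`, and `merge` are hypothetical names, and each matching term's occurrences are assumed to arrive as a list already sorted by start offset. The sketch merges the per-term lists with a priority queue so that a wildcard's matches can be consumed as one ordered sequence, and the size of the merged list serves as a single frequency for the whole wildcard.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.Iterator;
import java.util.List;
import java.util.PriorityQueue;

/** Hypothetical stand-in for one term occurrence in a document; not a Lucene type. */
final class Offset {
  final int start;
  final int end;

  Offset(int start, int end) {
    this.start = start;
    this.end = end;
  }
}

/** Hypothetical sketch: presents several per-term offset lists as one merged, ordered list. */
final class CompositeOffsets {

  /** Cursor over one term's offsets that remembers its current head element. */
  private static final class Cursor {
    final Iterator<Offset> it;
    Offset head;

    Cursor(Iterator<Offset> it) {
      this.it = it;
      this.head = it.next(); // callers only create cursors for non-empty lists
    }

    /** Moves to the next offset; returns false once this term is exhausted. */
    boolean advance() {
      if (!it.hasNext()) {
        return false;
      }
      head = it.next();
      return true;
    }
  }

  /** Merges per-term offset lists (each sorted by start offset) into one sorted list. */
  static List<Offset> merge(List<List<Offset>> perTerm) {
    PriorityQueue<Cursor> queue =
        new PriorityQueue<>(Comparator.comparingInt((Cursor c) -> c.head.start));
    for (List<Offset> offsets : perTerm) {
      if (!offsets.isEmpty()) {
        queue.add(new Cursor(offsets.iterator()));
      }
    }
    List<Offset> merged = new ArrayList<>();
    while (!queue.isEmpty()) {
      // Take the cursor with the smallest current start offset.
      Cursor top = queue.poll();
      merged.add(top.head);
      // Re-insert only after advancing, so the queue ordering stays valid.
      if (top.advance()) {
        queue.add(top);
      }
    }
    return merged;
  }
}
```

Under these assumptions, a wildcard that expands to three terms yields one ordered list of occurrences and one combined frequency, instead of three separate low-frequency entries that a passage scorer would weigh independently.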



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
