[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

dsmiley Fri, 28 Oct 2016 13:57:59 -0700

Github user dsmiley commented on a diff in the pull request:

    https://github.com/apache/lucene-solr/pull/105#discussion_r85606333
  
    --- Diff: 
lucene/highlighter/src/java/org/apache/lucene/search/uhighlight/FieldOffsetStrategy.java
 ---
    @@ -65,58 +65,88 @@ public String getField() {
        */
       public abstract List<OffsetsEnum> getOffsetsEnums(IndexReader reader, 
int docId, String content) throws IOException;
     
    -  protected List<OffsetsEnum> createOffsetsEnums(LeafReader leafReader, 
int doc, TokenStream tokenStream) throws IOException {
    -    List<OffsetsEnum> offsetsEnums = 
createOffsetsEnumsFromReader(leafReader, doc);
    -    if (automata.length > 0) {
    -      offsetsEnums.add(createOffsetsEnumFromTokenStream(doc, tokenStream));
    +  protected List<OffsetsEnum> createOffsetsEnumsFromReader(LeafReader 
leafReader, int doc) throws IOException {
    +    final Terms termsIndex = leafReader.terms(field);
    +    if (termsIndex == null) {
    +      return Collections.emptyList();
         }
    -    return offsetsEnums;
    -  }
     
    -  protected List<OffsetsEnum> createOffsetsEnumsFromReader(LeafReader 
atomicReader, int doc) throws IOException {
         // For strict positions, get a Map of term to Spans:
         //    note: ScriptPhraseHelper.NONE does the right thing for these 
method calls
         final Map<BytesRef, Spans> strictPhrasesTermToSpans =
    -        strictPhrases.getTermToSpans(atomicReader, doc);
    +        phraseHelper.getTermToSpans(leafReader, doc);
         // Usually simply wraps terms in a List; but if willRewrite() then can 
be expanded
         final List<BytesRef> sourceTerms =
    -        strictPhrases.expandTermsIfRewrite(terms, 
strictPhrasesTermToSpans);
    +        phraseHelper.expandTermsIfRewrite(terms, strictPhrasesTermToSpans);
     
    -    final List<OffsetsEnum> offsetsEnums = new 
ArrayList<>(sourceTerms.size() + 1);
    +    final List<OffsetsEnum> offsetsEnums = new 
ArrayList<>(sourceTerms.size() + automata.length);
     
    -    Terms termsIndex = atomicReader == null || sourceTerms.isEmpty() ? 
null : atomicReader.terms(field);
    -    if (termsIndex != null) {
    +    // Handle sourceTerms:
    +    if (!sourceTerms.isEmpty()) {
           TermsEnum termsEnum = termsIndex.iterator();//does not return null
           for (BytesRef term : sourceTerms) {
    -        if (!termsEnum.seekExact(term)) {
    -          continue; // term not found
    -        }
    -        PostingsEnum postingsEnum = termsEnum.postings(null, 
PostingsEnum.OFFSETS);
    -        if (postingsEnum == null) {
    -          // no offsets or positions available
    -          throw new IllegalArgumentException("field '" + field + "' was 
indexed without offsets, cannot highlight");
    -        }
    -        if (doc != postingsEnum.advance(doc)) { // now it's positioned, 
although may be exhausted
    -          continue;
    +        if (termsEnum.seekExact(term)) {
    +          PostingsEnum postingsEnum = termsEnum.postings(null, 
PostingsEnum.OFFSETS);
    +
    +          if (postingsEnum == null) {
    +            // no offsets or positions available
    +            throw new IllegalArgumentException("field '" + field + "' was 
indexed without offsets, cannot highlight");
    +          }
    +
    +          if (doc == postingsEnum.advance(doc)) { // now it's positioned, 
although may be exhausted
    +            postingsEnum = phraseHelper.filterPostings(term, postingsEnum, 
strictPhrasesTermToSpans.get(term));
    +            if (postingsEnum != null) {
    +              offsetsEnums.add(new OffsetsEnum(term, postingsEnum));
    +            }
    +          }
             }
    -        postingsEnum = strictPhrases.filterPostings(term, postingsEnum, 
strictPhrasesTermToSpans.get(term));
    -        if (postingsEnum == null) {
    -          continue;// completely filtered out
    +      }
    +    }
    +
    +    // Handle automata
    +    if (automata.length > 0) {
    +      offsetsEnums.addAll(createAutomataOffsetsFromTerms(termsIndex, doc));
    +    }
    +
    +    return offsetsEnums;
    +  }
    +
    +  protected List<OffsetsEnum> createAutomataOffsetsFromTerms(Terms 
termsIndex, int doc) throws IOException {
    +    Map<CharacterRunAutomaton, List<PostingsEnum>> automataPostings = new 
IdentityHashMap<>(automata.length);
    --- End diff --
    
    I suggest a parallel array to automata, so that later you can avoid a map 
lookup on each matching term.  Also, I suggest lazy-initializing the array 
later... perhaps some wildcards in a disjunction might never match.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at [email protected] or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[GitHub] lucene-solr pull request #105: LUCENE-7526 Improvements to UnifiedHighlighte...

Reply via email to