[ 
https://issues.apache.org/jira/browse/LUCENE-8145?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16347433#comment-16347433
 ] 

David Smiley commented on LUCENE-8145:
--------------------------------------

*This is a really nice improvement Alan!*  I once thought about replacing 
List<OffsetsEnum> with a PriorityQueue based one as you did here but I didn't 
follow through because I didn't know how the OffsetsEnum.weight would 
accommodate that.  Here you've shifted that tracking to the Passage.

We can go further with your refactoring -- making CompositeOffsetsPostingsEnum 
obsolete.  Delete it and then change 
FieldOffsetStrategy.createOffsetsEnumsForAutomata so that it's final loop looks 
like this:
{code}
    for (int i = 0; i < automata.length; i++) {
      CharacterRunAutomaton automaton = automata[i];
      List<PostingsEnum> postingsEnums = automataPostings.get(i);
      if (postingsEnums.isEmpty()) {
        continue;
      }
      // Build one OffsetsEnum exposing the automata.toString as the term, and 
the sum of freq
      BytesRef wildcardTerm = new BytesRef(automaton.toString());
      int sumFreq = 0;
      for (PostingsEnum postingsEnum : postingsEnums) {
        sumFreq += postingsEnum.freq();
      }
      for (PostingsEnum postingsEnum : postingsEnums) {
        results.add(new OffsetsEnum.OfPostings(wildcardTerm, sumFreq, 
postingsEnum));
      }
    }
{code}
And then overload the OfPostings constructor to take a manual "freq". Or 
Subclass OfPostings in-kind (e.g. OfPostingsWithFreqc ould be inner class even 
(I played with this a little; it was annoying that sumFreq couldn't be final 
but no biggie).

Unfortunately it seems we have no test that ensures the "freq" is correct for a 
highlighted wildcard query whose wildcard matches multiple terms of varying 
frequencies each in the document.  We should add such a test.

I definitely like how FieldHighlighter is simpler.  You can go further and 
remove FieldHighlighter.EMPTY too -- Rob had used that to simplify the queue 
initial state logic that is now obsoleted with your change (you chose a boolean 
"started" flag instead).

It's a shame Passage is now fairly heavy with a BytesRefHash on it.  I want to 
think about that a bit more.

The first place maybeAddPassage is called, it's guarded by an if condition. But 
that if condition can be removed as it is redundant with the same logic that 
maybeAddPassage starts with.  You should copy along the comment that explains 
what -1 means.

Nitpicks:

FieldHighlighter:
* highlightOffsetsEnums:  Add braces to the "if" block that returns early.  The 
"while" of do-while should be on the same line as the close bracket.
* maybeAddPassage: passage ought to be the first parameter IMO.  And add braces 
to the "if" block.

FieldOffsetStrategy: the javadocs on getOffsetsEnum should not say "remember to 
close them *all*" since it just returns one.  so maybe "remember to close it"

MultiOffsetsEnum.close: I see it calls close on all the remaining OffsetsEnums 
on the queue... but at this point it's likely empty.  Based on our 
implementations of OffsetsEnum this won't be an issue but I think it's bad to 
leave it this way.  I think nextPosition could be modified to close the "top" 
item when it reaches the end.  close would then have a comment to mentioned the 
others have been closed already in nextPosition.

TestUnifiedHighlighterExtensibility: you removed calling p.setScore but I think 
we want to ensure all such methods are exposed thus enabling someone to fully 
use this Passage.

> UnifiedHighlighter should use single OffsetEnum rather than List<OffsetEnum>
> ----------------------------------------------------------------------------
>
>                 Key: LUCENE-8145
>                 URL: https://issues.apache.org/jira/browse/LUCENE-8145
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: modules/highlighter
>            Reporter: Alan Woodward
>            Assignee: Alan Woodward
>            Priority: Minor
>         Attachments: LUCENE-8145.patch
>
>
> The UnifiedHighlighter deals with several different aspects of highlighting: 
> finding highlight offsets, breaking content up into snippets, and passage 
> scoring.  It would be nice to split this up so that consumers can use them 
> separately.
> As a first step, I'd like to change the API of FieldOffsetStrategy to return 
> a single unified OffsetsEnum, rather than a collection of them.  This will 
> make it easier to expose the OffsetsEnum of a document directly from the 
> highlighter, bypassing snippet extraction and scoring.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

Reply via email to