UnifiedHighlighter and extraction of exact hit offset ranges

Dawid Weiss Wed, 11 Jan 2017 02:35:54 -0800

Can any of the folks who contributed to UnifiedHighlighter (David?)
clarify my thinking here?


I have a requirement to extract (for a set of search results) a list
of exact "hit" ranges (field offsets, with support for multi-term
queries and span queries). Obviously, I'm only talking about queries
that relate to field content somehow. Extracting such ranges has
always been quite problematic and has required multiple helper classes
(WeightedSpanTermExtractor, MultiTermHighlighting, etc.) and pretty
hairy logic.

So I turned to look at UnifiedHighlighter for help.

Seems like the right way (?) to do it would be to override (and abuse)
UnifiedHighlighter's getFieldHighlighter method and return a field
highlighter with an override of:

  protected Passage[] highlightOffsetsEnums(List<OffsetsEnum> offsetsEnums) throws IOException {

so that I can capture and return a separate Passage for each
OffsetsEnum (I have my own code to deal with overlaps and merging, so
I can skip this entirely). Then, with a custom no-op PassageFormatter
I could simply get a list of those offsets.
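
For illustration only, the overlap handling mentioned above might look
roughly like the sketch below. It is not my actual code and it uses no
Lucene classes: HitRanges, Range and merge are hypothetical names, and
the ranges are assumed to be half-open [start, end) character offsets
of the kind startOffset() and endOffset() return.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Hypothetical helper (not part of Lucene): merges overlapping hit ranges. */
final class HitRanges {

  /** A half-open [start, end) character range within a field value (assumed convention). */
  static final class Range {
    final int start;
    final int end;

    Range(int start, int end) {
      if (start < 0 || end < start) {
        throw new IllegalArgumentException("Invalid range: " + start + "-" + end);
      }
      this.start = start;
      this.end = end;
    }

    @Override
    public String toString() {
      return "[" + start + "," + end + ")";
    }
  }

  private HitRanges() {}

  /** Returns a new list, sorted by start, with overlapping or touching ranges merged. */
  static List<Range> merge(List<Range> hits) {
    // Sort a copy so the caller's list is left untouched.
    List<Range> sorted = new ArrayList<>(hits);
    sorted.sort(Comparator.comparingInt((Range r) -> r.start).thenComparingInt(r -> r.end));

    List<Range> merged = new ArrayList<>();
    for (Range r : sorted) {
      int lastIndex = merged.size() - 1;
      if (lastIndex >= 0 && r.start <= merged.get(lastIndex).end) {
        // Overlaps (or touches) the previous range: extend it instead of adding a new one.
        Range last = merged.remove(lastIndex);
        merged.add(new Range(last.start, Math.max(last.end, r.end)));
      } else {
        merged.add(r);
      }
    }
    return merged;
  }
}
```

With one Range captured per hit, HitRanges.merge(ranges) would give the
final list of non-overlapping offsets. Whether touching ranges should
be merged, as they are here, depends on the use case.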

The problem with this approach is that there is currently no way to
access the offsets of an OffsetsEnum from outside its package: the
highlighter's extension points are protected (so subclassable), but
OffsetsEnum's offset accessors are package-private. Namely these two:

  int startOffset() throws IOException {
    return postingsEnum.startOffset();
  }

  int endOffset() throws IOException {
    return postingsEnum.endOffset();
  }

Should these two be protected to allow such customizations? I agree
it's *very* low-level, but I have a practical use case where this
would be useful.

Am I on the right track here?

Separately from that, I think it'd be nice to have some sort of
generic utility that, for a given document (or a set of documents)
would return such hit ranges... UnifiedHighlighter seems

Dawid
