I was recently asked if/how the UnifiedHighlighter can return a Passage
centered around the highlighted words.  I'm responding to a wider audience
(java-user list, ...).

Each highlighter implementation fragments the content into passages (with
highlights) using a different algorithm.

The UnifiedHighlighter (and the now-defunct PostingsHighlighter from which
it derives) fragments the content into passages based entirely on a
java.text.BreakIterator.  A BreakIterator only sees the content (it's
initialized with it via setText(String)); it doesn't know where the
highlighted words are.  This is why the default UH BreakIterator impl is a
sentence-based one, and most people will probably leave it that way.  Given
how the UH actually uses the BreakIterator, you can create a custom one
designed to work only with this highlighter, making some assumptions about
how it's called, so that the fragmentation isn't so rigidly based on the
content.  The LengthGoalBreakIterator is such a BreakIterator.  But it can
only "see" the first highlighted word of a passage, and it makes its
fragmentation decisions based on that alone.
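
To make the "content only" point concrete, here is a minimal sketch using
nothing but the JDK's java.text.BreakIterator.  The class and method names
(SentenceFragments, fragments) are made up for illustration and are not
part of Lucene.  Note that nothing in the loop has any notion of a query or
a highlighted word; the boundaries come purely from the text.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not a Lucene class.
public class SentenceFragments {

    /** Splits content at sentence boundaries, looking only at the text itself. */
    public static List<String> fragments(String content) {
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(content); // the only input the iterator ever gets
        List<String> out = new ArrayList<>();
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            out.add(content.substring(start, end));
        }
        return out;
    }

    public static void main(String[] args) {
        for (String s : fragments("The quick fox. It jumped over the dog. Done.")) {
            System.out.println("[" + s + "]");
        }
    }
}
```

A highlighter driven this way can pick which fragments to show, but it
cannot move a fragment's edges closer to or further from a match.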

The other two highlighters (the original Highlighter and, I think, the
FastVectorHighlighter) are more flexible in this regard; they have their
own abstractions that allow Passages to be formed with sensitivity to where
exactly the highlighted words are.  Thus, with the original Highlighter,
you could fairly easily achieve a goal of, say, 10 words before the first
highlighted word, then further highlighted words as long as each is within
10 words of the previous one, then 10 more trailing words.  I suspect the
FastVectorHighlighter can do this too, but its API confuses me.  The
FastVectorHighlighter also uses a BreakIterator, in
BreakIteratorBoundaryScanner, but its use is entirely different from how
the UnifiedHighlighter uses one.
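
The "10 words before, chain matches within 10 words, 10 words after" goal
can be sketched independently of any highlighter.  The code below is plain
Java working on word positions; it is not how the original Highlighter
implements fragmenting, and the names (CenteredPassages, passages) are
invented for the example.  It assumes the match positions are sorted in
ascending order.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration, not a Lucene class.
public class CenteredPassages {

    /**
     * Groups sorted match positions (word indexes) into passages.
     * Each passage is returned as {startWord, endWordExclusive}.
     */
    public static List<int[]> passages(int[] matches, int wordCount, int context) {
        List<int[]> out = new ArrayList<>();
        int prevEnd = 0;
        int i = 0;
        while (i < matches.length) {
            int first = matches[i];
            int last = first;
            // Keep absorbing matches that are close enough to the previous one.
            while (i + 1 < matches.length && matches[i + 1] - last <= context) {
                last = matches[++i];
            }
            // Leading context, clipped so passages never overlap or go negative.
            int start = Math.max(Math.max(0, first - context), prevEnd);
            // Trailing context, clipped to the end of the content.
            int end = Math.min(wordCount, last + 1 + context);
            out.add(new int[] {start, end});
            prevEnd = end;
            i++;
        }
        return out;
    }

    public static void main(String[] args) {
        for (int[] p : passages(new int[] {12, 15, 40}, 45, 10)) {
            System.out.println(p[0] + ".." + p[1]);
        }
    }
}
```

The point is that this kind of logic needs the match positions as input,
which is exactly what a plain BreakIterator never receives.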

Perhaps the UnifiedHighlighter should be enhanced to make more flexible
fragmentation algorithms possible.  Today you'd need to override
FieldHighlighter.highlightOffsetsEnums, which is a lot to ask of anyone.
Even setting up the override is annoying, and then re-implementing that
method is onerous since it's so complex; it's really the heart of the UH.
The UH could add an entirely new abstraction apart from BreakIterators
(with a BI-based impl available), or perhaps an optional marker interface
for UH-aware BreakIterators.  The former (a new abstraction) would be
cleaner, and might also remove a wart in the API due to the statefulness of
BreakIterators.  It's also kind of hard to write a BI correctly; I've
implemented a few already, so I know.  It's an old API.
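
To give a feel for what "a new abstraction with a BI-based impl" might look
like, here is a purely hypothetical sketch.  Nothing named here
(PassageBoundaries, BreakIteratorBoundaries, passageStart, passageEnd)
exists in Lucene; the shape is only my guess at one possible design.  The
interface is told where the matches are and keeps no cursor between calls.
The offsets are assumed to fall inside the content.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical abstraction: given match offsets, decide the passage edges.
interface PassageBoundaries {
    int passageStart(String content, int firstMatchOffset);
    int passageEnd(String content, int lastMatchOffset);
}

// Hypothetical BreakIterator-backed implementation of the interface above.
public class BreakIteratorBoundaries implements PassageBoundaries {

    private static BreakIterator iteratorFor(String content) {
        // A fresh iterator per call, so no state leaks between callers.
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.ROOT);
        bi.setText(content);
        return bi;
    }

    @Override
    public int passageStart(String content, int firstMatchOffset) {
        // Last boundary at or before the match.
        int b = iteratorFor(content).preceding(firstMatchOffset + 1);
        return b == BreakIterator.DONE ? 0 : b;
    }

    @Override
    public int passageEnd(String content, int lastMatchOffset) {
        // First boundary strictly after the match offset.
        int b = iteratorFor(content).following(lastMatchOffset);
        return b == BreakIterator.DONE ? content.length() : b;
    }

    public static void main(String[] args) {
        String content = "The quick fox. It jumped over the dog. Done.";
        int match = content.indexOf("jumped");
        PassageBoundaries pb = new BreakIteratorBoundaries();
        System.out.println(content.substring(
            pb.passageStart(content, match), pb.passageEnd(content, match)));
    }
}
```

An implementation that centers a passage on the matches could then be
written against the same interface without touching BreakIterator at all.
Creating an iterator per call is wasteful; it's done here only to keep the
sketch stateless.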

~ David

-- 
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book:
http://www.solrenterprisesearchserver.com
