I was recently asked if/how the UnifiedHighlighter can return a Passage centered around the highlighted words. I'm responding to a wider audience (java-user list, ...).
Each highlighter implementation fragments the content into passages (with highlights) using a different algorithm. The UnifiedHighlighter (and the now defunct PostingsHighlighter from which it derives) fragments the content into passages based entirely on a java.text.BreakIterator. A BreakIterator only sees/knows about the content (it's initialized with it via setText(String)); it doesn't know where the highlighted words are. This is why the default UH BreakIterator impl is a sentence-based one, and most people will probably let it be.

Given how the UH actually uses the BreakIterator, you can create a custom one that is designed to work only with this highlighter and that makes some assumptions about how it's used, resulting in fragmentation that isn't so rigidly based on the content. The LengthGoalBreakIterator is such a BreakIterator. But it can only "see" the first highlighted word of a passage and make fragmentation decisions based on that alone.

The other two highlighters (the original Highlighter and I think the FastVectorHighlighter) are more flexible in this regard; they have their own abstraction that allows Passages to be formed with sensitivity to exactly where the highlighted words are. Thus with the original Highlighter you could fairly easily achieve a goal of, say, 10 words before the first highlighted word, then further highlighted words as long as each is within 10 words of the previous one, then 10 more trailing words once the next is too far away. I suspect the FastVectorHighlighter can do this too, but its API confuses me. The FastVectorHighlighter also uses a BreakIterator, in BreakIteratorBoundaryScanner, but its use is entirely different from how the UnifiedHighlighter uses one.

Perhaps the UnifiedHighlighter should be enhanced to make more flexible fragmentation algorithms possible.
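To make the "10 words before / within 10 words / 10 words after" idea concrete, here is a rough sketch of just the grouping logic in plain Java. It uses no Lucene classes at all: the class name PassageWindows, the Window type, and the group method are all made up for illustration, and it assumes you already have the word positions of the highlighted terms in ascending order. Wiring something like this into a particular highlighter's fragmentation abstraction is a separate exercise that this doesn't attempt.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper; not part of Lucene. */
public class PassageWindows {

    /** A passage expressed as an inclusive range of word positions. */
    public static final class Window {
        public final int firstWord;
        public final int lastWord;

        Window(int firstWord, int lastWord) {
            this.firstWord = firstWord;
            this.lastWord = lastWord;
        }

        @Override
        public String toString() {
            return "[" + firstWord + ".." + lastWord + "]";
        }
    }

    /**
     * Groups highlighted word positions into passages.
     *
     * @param hitPositions word positions of highlighted terms, ascending (assumed)
     * @param wordCount    total number of words in the content
     * @param context      words of context to keep before and after, and the
     *                     maximum gap allowed between hits in one passage
     */
    public static List<Window> group(int[] hitPositions, int wordCount, int context) {
        List<Window> windows = new ArrayList<>();
        int previousEnd = -1; // last word of the previous passage, to avoid overlap
        int i = 0;
        while (i < hitPositions.length) {
            int firstHit = hitPositions[i];
            int lastHit = firstHit;
            // Absorb following hits while each is close enough to the previous one.
            while (i + 1 < hitPositions.length
                    && hitPositions[i + 1] - lastHit <= context) {
                lastHit = hitPositions[++i];
            }
            // Leading context, clamped to the content start and the previous passage.
            int start = Math.max(Math.max(0, firstHit - context), previousEnd + 1);
            // Trailing context, clamped to the content end.
            int end = Math.min(wordCount - 1, lastHit + context);
            windows.add(new Window(start, end));
            previousEnd = end;
            i++;
        }
        return windows;
    }

    public static void main(String[] args) {
        // Hits at words 12, 15, 40 and 90 in a 100-word document.
        System.out.println(group(new int[] {12, 15, 40, 90}, 100, 10));
        // prints [[2..25], [30..50], [80..99]]
    }
}
```

The point is that this decision needs all of the hit positions up front, which is exactly the information a BreakIterator never gets.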
Today you'd need to override FieldHighlighter.highlightOffsetsEnums, which is a lot to ask of anyone; even doing that is annoying, and then re-implementing that method is onerous since it's so complex -- it's really the heart of the UH. The UH could add an entirely new abstraction apart from BreakIterators (with a BI-based impl available), or perhaps an optional marker interface for UH-aware BreakIterators. The former (a new abstraction) would be cleaner, and might also remove a wart in the API due to the statefulness of BreakIterators. It's also kinda hard to write a BI correctly. I've implemented a few already, so I know. It's an old API.

~ David

--
Lucene/Solr Search Committer, Consultant, Developer, Author, Speaker
LinkedIn: http://linkedin.com/in/davidwsmiley | Book: http://www.solrenterprisesearchserver.com