[ https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Koji Sekiguchi updated LUCENE-1824:
-----------------------------------

    Attachment: LUCENE-1824.patch

First draft.

I introduced a BoundaryScanner interface and two implementations of it, SimpleBoundaryScanner and BreakIteratorBoundaryScanner. SimpleBoundaryScanner uses the following default boundary chars:

{code}
public static final Character[] DEFAULT_BOUNDARY_CHARS = {'.', ',', '!', '?', '(', '[', '{', '\t', '\n'};
{code}

SimpleBoundaryScanner uses them to find word/sentence boundaries. BreakIteratorBoundaryScanner can also be used to find character, word, sentence, or line breaks.

I made BaseFragmentsBuilder boundary-aware, rather than creating a new FragmentsBuilder such as BoundaryAwareFragmentsBuilder. As a result, every FragmentsBuilder is now natively boundary-aware, as long as it is given an appropriate BoundaryScanner.

I've not touched the tests yet. Because this patch changes fragment boundaries, the existing tests should fail!

> FastVectorHighlighter truncates words at beginning and end of fragments
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-1824
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1824
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>         Environment: any
>            Reporter: Alex Vigdor
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-1824.patch, LUCENE-1824.patch
>
>
> FastVectorHighlighter does not take word boundaries into consideration when
> building fragments, so that in most cases the first and last word of a
> fragment are truncated. This makes the highlights less legible than they
> should be. I will attach a patch to BaseFragmentBuilder that resolves this
> by expanding the start and end boundaries of the fragment to the first
> whitespace character on either side of the fragment, or the beginning or end
> of the source text, whichever comes first.
> This significantly improves legibility, at the cost of returning a slightly
> larger number of characters than specified for the fragment size.
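
For readers following the thread, here is a minimal, self-contained sketch of the boundary-scanning idea discussed in the comment: widen a fragment's start and end offsets until each sits on a boundary character. It is not the code from LUCENE-1824.patch. The class name BoundarySketch, the method names, and the MAX_SCAN limit are assumptions made for illustration; only the list of boundary characters is taken from the comment.

```java
// Hypothetical sketch of boundary-aware fragment offsets.
// Not the patch's code; names and MAX_SCAN are illustrative assumptions.
public class BoundarySketch {

    // Boundary characters quoted in the comment.
    static final char[] BOUNDARY_CHARS = {'.', ',', '!', '?', '(', '[', '{', '\t', '\n'};

    // Assumed upper bound on how far a scan may move an offset,
    // so that a fragment cannot grow without limit.
    static final int MAX_SCAN = 20;

    static boolean isBoundary(char c) {
        for (char b : BOUNDARY_CHARS) {
            if (b == c) {
                return true;
            }
        }
        return false;
    }

    // Scans left from 'start' and returns the offset just after the nearest
    // boundary character, 0 if the beginning of the text is reached, or the
    // unchanged 'start' if nothing is found within MAX_SCAN characters.
    static int findStartOffset(CharSequence text, int start) {
        if (start <= 0 || start > text.length()) {
            return start;
        }
        int offset = start;
        for (int scanned = 0; offset > 0 && scanned < MAX_SCAN; scanned++, offset--) {
            if (isBoundary(text.charAt(offset - 1))) {
                return offset;
            }
        }
        return offset == 0 ? 0 : start;
    }

    // Scans right from 'end' and returns the offset of the nearest boundary
    // character, the text length if the end of the text is reached, or the
    // unchanged 'end' if nothing is found within MAX_SCAN characters.
    static int findEndOffset(CharSequence text, int end) {
        if (end < 0 || end >= text.length()) {
            return end;
        }
        int offset = end;
        for (int scanned = 0; offset < text.length() && scanned < MAX_SCAN; scanned++, offset++) {
            if (isBoundary(text.charAt(offset))) {
                return offset;
            }
        }
        return offset == text.length() ? offset : end;
    }

    public static void main(String[] args) {
        String text = "First part, second part of the text. Third one.";
        // Raw offsets 15 and 29 cut through "second" and "the".
        int start = findStartOffset(text, 15);
        int end = findEndOffset(text, 29);
        System.out.println("[" + text.substring(start, end) + "]");
    }
}
```

Because the quoted default set contains no space character, this sketch snaps to punctuation rather than to whitespace; adding ' ' to BOUNDARY_CHARS would give the whitespace expansion described in the issue. The scan limit trades legibility against fragment size: when no boundary is close enough, the original offset is kept and the word may still be truncated.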