[
https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Koji Sekiguchi updated LUCENE-1824:
-----------------------------------
Attachment: LUCENE-1824.patch
First draft. I introduced a BoundaryScanner interface and two implementations
of it, SimpleBoundaryScanner and BreakIteratorBoundaryScanner.
SimpleBoundaryScanner uses the following default boundary chars:
{code}
public static final Character[] DEFAULT_BOUNDARY_CHARS = {'.', ',', '!', '?',
'(', '[', '{', '\t', '\n'};
{code}
SimpleBoundaryScanner uses these characters to find word/sentence boundaries.
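To illustrate the idea, here is a minimal sketch of a scan driven by a boundary character set. It is not the code in the attached patch: the class name, the method names findStartOffset/findEndOffset, and the MAX_SCAN limit are assumptions made for this example only.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch; not the implementation in LUCENE-1824.patch.
class SimpleBoundaryScanSketch {
    // Same characters as the default list quoted above.
    static final Character[] DEFAULT_BOUNDARY_CHARS = {'.', ',', '!', '?',
        '(', '[', '{', '\t', '\n'};

    // Assumed limit on how far a scan may move away from the given offset.
    static final int MAX_SCAN = 20;

    private static final Set<Character> BOUNDARY =
        new HashSet<>(Arrays.asList(DEFAULT_BOUNDARY_CHARS));

    // Walks left from start. Returns the offset just after the nearest
    // boundary char, 0 if the beginning of the text is reached, or the
    // original start if nothing is found within MAX_SCAN characters.
    static int findStartOffset(CharSequence text, int start) {
        if (start <= 0 || start > text.length()) return start;
        int offset = start;
        for (int count = MAX_SCAN; offset > 0 && count > 0; count--) {
            if (BOUNDARY.contains(text.charAt(offset - 1))) return offset;
            offset--;
        }
        return offset == 0 ? 0 : start;
    }

    // Walks right from start. Returns the offset of the nearest boundary
    // char, the text length if the end of the text is reached, or the
    // original start if nothing is found within MAX_SCAN characters.
    static int findEndOffset(CharSequence text, int start) {
        if (start < 0 || start >= text.length()) return start;
        int offset = start;
        for (int count = MAX_SCAN; offset < text.length() && count > 0; count--) {
            if (BOUNDARY.contains(text.charAt(offset))) return offset;
            offset++;
        }
        return offset == text.length() ? offset : start;
    }
}
```

Falling back to the original offset when no boundary char is nearby keeps a fragment from growing without bound on text that has no punctuation.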
BreakIteratorBoundaryScanner can also be used to find character, word,
sentence, or line breaks.
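For comparison, a scan based on java.text.BreakIterator could look roughly like the sketch below. The class and method names are assumptions for illustration and are not taken from the patch; only the BreakIterator calls (setText, isBoundary, preceding, following) are standard JDK API.

```java
import java.text.BreakIterator;
import java.util.Locale;

// Hypothetical sketch; not the implementation in LUCENE-1824.patch.
class BreakIteratorScanSketch {

    // Returns start if it already sits on a break, otherwise the last
    // break before start. Out-of-range offsets are returned unchanged.
    static int findStartOffset(String text, int start, BreakIterator bi) {
        if (start <= 0 || start >= text.length()) return start;
        bi.setText(text);
        return bi.isBoundary(start) ? start : bi.preceding(start);
    }

    // Returns start if it already sits on a break, otherwise the first
    // break after start. Out-of-range offsets are returned unchanged.
    static int findEndOffset(String text, int start, BreakIterator bi) {
        if (start < 0 || start >= text.length()) return start;
        bi.setText(text);
        return bi.isBoundary(start) ? start : bi.following(start);
    }

    public static void main(String[] args) {
        String text = "The quick brown fox";
        BreakIterator words = BreakIterator.getWordInstance(Locale.ROOT);
        // Offset 6 falls inside "quick"; both ends snap to its edges.
        int s = findStartOffset(text, 6, words);
        int e = findEndOffset(text, 6, words);
        System.out.println(text.substring(s, e));
    }
}
```

Passing BreakIterator.getCharacterInstance, getSentenceInstance, or getLineInstance instead of getWordInstance switches the kind of break the scan snaps to.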
I made BaseFragmentsBuilder boundary-aware, rather than creating a new
FragmentsBuilder such as BoundaryAwareFragmentsBuilder. As a result, all
FragmentsBuilders are now boundary-aware natively, as long as they use an
appropriate BoundaryScanner.
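As a rough picture of what boundary-aware means here, the sketch below widens a fragment's raw start and end offsets through a scanner before cutting the text. Everything in it is hypothetical: the interface signature, the whitespace-based scanner (modelled on the approach in the original issue description), and the makeFragment helper are assumptions for this example, not the patch's API.

```java
// Hypothetical sketch; names and signatures are assumptions, not the patch's API.
interface BoundaryScannerSketch {
    int findStartOffset(CharSequence text, int start);
    int findEndOffset(CharSequence text, int start);
}

// Expands to the nearest whitespace on either side, or to the beginning
// or end of the text, whichever comes first.
class WhitespaceScannerSketch implements BoundaryScannerSketch {
    private static int clamp(CharSequence text, int offset) {
        return Math.min(Math.max(offset, 0), text.length());
    }

    public int findStartOffset(CharSequence text, int start) {
        int offset = clamp(text, start);
        while (offset > 0 && !Character.isWhitespace(text.charAt(offset - 1))) {
            offset--;
        }
        return offset;
    }

    public int findEndOffset(CharSequence text, int start) {
        int offset = clamp(text, start);
        while (offset < text.length() && !Character.isWhitespace(text.charAt(offset))) {
            offset++;
        }
        return offset;
    }
}

class FragmentSketch {
    // Adjusts the raw [start, end) window with the scanner, then cuts the text.
    static String makeFragment(CharSequence source, int start, int end,
                               BoundaryScannerSketch scanner) {
        int s = scanner.findStartOffset(source, start);
        int e = scanner.findEndOffset(source, end);
        return source.subSequence(s, e).toString();
    }
}
```

With the source text "Lucene highlighter truncates words" and the raw window [10, 22), a plain substring would give "hlighter tru", while makeFragment with the whitespace scanner gives "highlighter truncates". Swapping in a different scanner changes where fragments are cut without touching the builder.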
I haven't touched the tests yet. Because this patch changes fragment
boundaries, the existing tests are expected to fail!
> FastVectorHighlighter truncates words at beginning and end of fragments
> -----------------------------------------------------------------------
>
> Key: LUCENE-1824
> URL: https://issues.apache.org/jira/browse/LUCENE-1824
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/highlighter
> Environment: any
> Reporter: Alex Vigdor
> Assignee: Koji Sekiguchi
> Priority: Minor
> Fix For: 4.0
>
> Attachments: LUCENE-1824.patch, LUCENE-1824.patch
>
>
> FastVectorHighlighter does not take word boundaries into consideration when
> building fragments, so that in most cases the first and last word of a
> fragment are truncated. This makes the highlights less legible than they
> should be. I will attach a patch to BaseFragmentsBuilder that resolves this
> by expanding the start and end boundaries of the fragment to the first
> whitespace character on either side of the fragment, or the beginning or end
> of the source text, whichever comes first. This significantly improves
> legibility, at the cost of returning a slightly larger number of characters
> than specified for the fragment size.