[ 
https://issues.apache.org/jira/browse/LUCENE-1824?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Koji Sekiguchi updated LUCENE-1824:
-----------------------------------

    Attachment: LUCENE-1824.patch

First draft. I introduced a BoundaryScanner interface and two implementations 
of it, SimpleBoundaryScanner and BreakIteratorBoundaryScanner.

SimpleBoundaryScanner uses the following default boundary chars:

{code}
public static final Character[] DEFAULT_BOUNDARY_CHARS = {'.', ',', '!', '?', 
'(', '[', '{', '\t', '\n'};
{code}

It scans for these characters to find word and sentence boundaries.
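
To illustrate the idea, here is a minimal sketch of a character-based scan. It 
is not the code in the patch: the class name SimpleScannerSketch and both 
method signatures are hypothetical, and only the boundary characters are taken 
from the list above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Hypothetical sketch, not the code in LUCENE-1824.patch.
final class SimpleScannerSketch {
    // Same characters as DEFAULT_BOUNDARY_CHARS above.
    static final Set<Character> BOUNDARY_CHARS = new HashSet<>(Arrays.asList(
            '.', ',', '!', '?', '(', '[', '{', '\t', '\n'));

    // Walk left from 'start' and return the offset just after the nearest
    // boundary character, or 0 if the beginning of the text is reached.
    static int findStartOffset(CharSequence text, int start) {
        int offset = Math.min(Math.max(start, 0), text.length());
        while (offset > 0 && !BOUNDARY_CHARS.contains(text.charAt(offset - 1))) {
            offset--;
        }
        return offset;
    }

    // Walk right from 'end' and return the offset of the nearest boundary
    // character (exclusive end), or the text length if none is found.
    static int findEndOffset(CharSequence text, int end) {
        int offset = Math.min(Math.max(end, 0), text.length());
        while (offset < text.length() && !BOUNDARY_CHARS.contains(text.charAt(offset))) {
            offset++;
        }
        return offset;
    }
}
```

Note that the space character is not in the default list, so with these 
defaults the sketch expands a fragment to the nearest punctuation, tab, or 
newline rather than to the nearest space.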

BreakIteratorBoundaryScanner can be used instead to find character, word, 
sentence, or line breaks.
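
As a rough sketch of the BreakIterator approach, the following snaps an offset 
to the nearest break reported by java.text.BreakIterator. The class name 
BreakIteratorScannerSketch and the method signatures are hypothetical and are 
not taken from the patch; only the BreakIterator calls are standard JDK API.

```java
import java.text.BreakIterator;

// Hypothetical sketch, not the code in LUCENE-1824.patch.
final class BreakIteratorScannerSketch {
    // Return 'start' if it is already a break, otherwise the nearest break
    // before it (or 0 when there is none).
    static int findStartOffset(String text, int start, BreakIterator bi) {
        if (start <= 0) return 0;
        if (start >= text.length()) return text.length();
        bi.setText(text);
        if (bi.isBoundary(start)) return start;
        int boundary = bi.preceding(start);
        return boundary == BreakIterator.DONE ? 0 : boundary;
    }

    // Return 'end' if it is already a break, otherwise the nearest break
    // after it (or the text length when there is none).
    static int findEndOffset(String text, int end, BreakIterator bi) {
        if (end >= text.length()) return text.length();
        if (end < 0) end = 0;
        bi.setText(text);
        if (bi.isBoundary(end)) return end;
        int boundary = bi.following(end);
        return boundary == BreakIterator.DONE ? text.length() : boundary;
    }
}
```

The granularity depends on which iterator is passed in: 
BreakIterator.getCharacterInstance(), getWordInstance(), getSentenceInstance(), 
or getLineInstance().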

I made BaseFragmentsBuilder boundary-aware rather than creating a new 
FragmentsBuilder such as BoundaryAwareFragmentsBuilder. As a result, every 
FragmentsBuilder is now natively boundary-aware, as long as it is given an 
appropriate BoundaryScanner.
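
The design choice can be sketched as follows: the builder asks a pluggable 
scanner to adjust the raw fragment offsets before cutting the text. Everything 
here is hypothetical (the interface BoundaryScannerSketch, the class 
FragmentSketch, and the whitespace scanner) and only shows the shape of the 
delegation, not the patch itself.

```java
// Hypothetical sketch, not the code in LUCENE-1824.patch.
interface BoundaryScannerSketch {
    int findStartOffset(CharSequence text, int start);
    int findEndOffset(CharSequence text, int end);
}

final class FragmentSketch {
    // Expands to the nearest whitespace on either side, or to the
    // beginning/end of the text, as proposed in the issue description.
    static final BoundaryScannerSketch WHITESPACE = new BoundaryScannerSketch() {
        @Override
        public int findStartOffset(CharSequence text, int start) {
            int offset = Math.min(Math.max(start, 0), text.length());
            while (offset > 0 && !Character.isWhitespace(text.charAt(offset - 1))) {
                offset--;
            }
            return offset;
        }

        @Override
        public int findEndOffset(CharSequence text, int end) {
            int offset = Math.min(Math.max(end, 0), text.length());
            while (offset < text.length() && !Character.isWhitespace(text.charAt(offset))) {
                offset++;
            }
            return offset;
        }
    };

    // The builder only knows the raw offsets; the scanner decides where
    // the fragment really begins and ends.
    static String makeFragment(CharSequence text, int start, int end,
                               BoundaryScannerSketch scanner) {
        int adjustedStart = scanner.findStartOffset(text, start);
        int adjustedEnd = scanner.findEndOffset(text, end);
        return text.subSequence(adjustedStart, adjustedEnd).toString();
    }
}
```

For example, FragmentSketch.makeFragment("the quick brown fox", 6, 12, 
FragmentSketch.WHITESPACE) returns "quick brown" instead of the truncated 
"ick br".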

I haven't touched the tests yet. Because this patch changes fragment 
boundaries, the existing tests are expected to fail!

> FastVectorHighlighter truncates words at beginning and end of fragments
> -----------------------------------------------------------------------
>
>                 Key: LUCENE-1824
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1824
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: modules/highlighter
>         Environment: any
>            Reporter: Alex Vigdor
>            Assignee: Koji Sekiguchi
>            Priority: Minor
>             Fix For: 4.0
>
>         Attachments: LUCENE-1824.patch, LUCENE-1824.patch
>
>
> FastVectorHighlighter does not take word boundaries into consideration when 
> building fragments, so that in most cases the first and last word of a 
> fragment are truncated.  This makes the highlights less legible than they 
> should be.  I will attach a patch to BaseFragmentBuilder that resolves this 
> by expanding the start and end boundaries of the fragment to the first 
> whitespace character on either side of the fragment, or the beginning or end 
> of the source text, whichever comes first.  This significantly improves 
> legibility, at the cost of returning a slightly larger number of characters 
> than specified for the fragment size.
