[
https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188734#comment-13188734
]
Nick Wellnhofer commented on LUCY-199:
--------------------------------------
Thinking more about a better fix for this problem, it's important to note that
choosing a good excerpt is an operation that can be done without knowledge of
the actual tokenization algorithm used in the indexing process. I think it's
enough to

* find boundaries that are more or less correct in a semantic and visual sense,
  and
* be tolerant enough to find boundaries in long substrings without whitespace
  that might exceed excerpt_length (considering that whitespace is the obvious
  place to break words, as in the current implementation).

If the highlighter finds additional word breaks, it shouldn't be a problem as
long as the result is visually correct.

Such an approach wouldn't depend on the analyzer at all, and it wouldn't
introduce additional coupling between Lucy's components. Of course, it would
mean implementing a separate Unicode-capable word-breaking algorithm for the
highlighter. But this shouldn't be very hard, as we could reuse parts of the
StandardTokenizer.
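
As a rough illustration of the two requirements above, here is a minimal
sketch in plain Perl. It is not Lucy code: find_break is a hypothetical helper
name, and Perl's built-in \b is only a crude stand-in for a proper
Unicode-capable word-breaking algorithm. It prefers whitespace, falls back to
a word/non-word transition when a long substring has no whitespace, and only
then cuts hard at the limit.

```perl
use strict;
use warnings;

# Hypothetical helper, not part of Lucy's API: return a character offset
# <= $limit at which $text can be cut for an excerpt.
sub find_break {
    my ($text, $limit) = @_;
    return length($text) if length($text) <= $limit;

    # Look one character past the limit so that a boundary sitting exactly
    # at $limit is still visible.
    my $window = substr($text, 0, $limit + 1);

    # Preferred: cut in front of the last whitespace inside the window.
    return length($1) if $window =~ /^(.*\S)\s/s;

    # Fallback for long runs without whitespace (URLs, paths): cut at the
    # last word/non-word transition. Perl's \b stands in for a real
    # Unicode word-break algorithm here (assumption of this sketch).
    return length($1) if $window =~ /^(.+)\b(?=.)/s;

    # Last resort: hard cut at the limit.
    return $limit;
}

my $url = 'www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220';
my $cut = find_break($url, 20);
print substr($url, 0, $cut), "...\n";    # prints "www.iol.co.za/..."
```

With a limit of 20 the URL from the report is cut after "www.iol.co.za/"
instead of collapsing to a bare ellipsis, because the "/" counts as a usable
boundary even though no whitespace is present.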
> Highlighting/excerpt on URLs
> -----------------------------
>
> Key: LUCY-199
> URL: https://issues.apache.org/jira/browse/LUCY-199
> Project: Lucy
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.2.2 (incubating)
> Environment: Linux
> Reporter: Henry
> Attachments: LUCY-199-quickfix.patch, hltest1.tgz
>
>
> If I explicitly specify excerpt_length:
>     my $page_highlighter = Lucy::Highlight::Highlighter->new(
>         searcher       => $searcher,
>         query          => $query_compiler,
>         field          => 'site',
>         excerpt_length => 60,
>     );
> ...and the field content is longer than 60, then
> $page_highlighter->create_excerpt($hit);
> returns '...'.
> Content which is shorter than 60 returns the highlighted excerpt as expected.
> If I comment out "excerpt_length => 60," above, then it returns the full
> non-truncated excerpt with highlighting as expected.
> Some >60char samples which return …/"...", searching for [iol.co.za] or
> [news24.com] (brackets are mine):
> [www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220]
> [http://www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSiteHome/0,,,00.html]
> [www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSite_TravelClub/0,,,00.html]
> The following return double-ellipses ("......" - ……), searching
> for [adsl mweb.com]:
> [http://www.mweb.co.za/helpcentre/ADSL/ADSLGeneralIdisagreewithyourusagereport.aspx]
> [http://www.mweb.co.za/helpcentre/FrequentlyAskedQuestions/MWEBHelpCentreFAQsHowdoI/FAQHowdoIHowdoImigratemyADSL/tabid/661/Default.aspx]