[
https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178438#comment-13178438
]
Nick Wellnhofer commented on LUCY-199:
--------------------------------------
Thanks for the excellent test case.
The whole thing is a bug in Highlighter_raw_excerpt. When prepending or
appending an ellipsis, the code tries to make sure this happens on a word
boundary, so it chops off words to make room for the ellipsis. Unfortunately,
it simply looks for whitespace to determine a word boundary. In the case of
URLs, it doesn't find any whitespace and deletes the whole URL from raw_excerpt.
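To illustrate the failure mode, here is a minimal Python sketch of a
whitespace-only truncation. It is not the actual Highlighter_raw_excerpt code;
the name naive_truncate, the three-character ellipsis, and the exact trimming
rule are assumptions made for the illustration.

```python
ELLIPSIS = "..."  # assumption: a plain three-character ellipsis


def naive_truncate(text: str, limit: int) -> str:
    """Hypothetical model: shorten `text` to `limit` chars, cutting at whitespace."""
    if len(text) <= limit:
        return text
    # Reserve room for the ellipsis, then back up to the last whitespace.
    window = text[: limit - len(ELLIPSIS)]
    cut = max(window.rfind(" "), window.rfind("\t"), window.rfind("\n"))
    # With no whitespace at all, rfind() returns -1 and nothing is kept.
    kept = window[:cut].rstrip() if cut >= 0 else ""
    return kept + ELLIPSIS
```

Under these assumptions, ordinary prose is trimmed at the last space, while a
long URL without any whitespace is reduced to the bare ellipsis, which matches
the behaviour described in the report.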
I see two approaches to fix this:
* Since the word boundaries are already computed during analysis, we could try
to reuse this data. AFAICS this would mean looping through all the terms of the
document, extracting all start and end offsets, and finally sorting them. I'm
not sure how expensive this would be.
* We recompute the word boundaries during highlighting like we do now. We could
use the analyzer of the current schema for that. But we could also use any
other kind of algorithm that is better than the current one. This might be very
cheap since we only work on a small subset of the document.
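Both ideas can be sketched roughly in Python. This is only an illustration
under stated assumptions, not Lucy code: all function names are hypothetical,
a "word" is assumed to be a run of \w characters, and the ellipsis is assumed
to be three plain dots. The first helper models the sorting step of the first
approach, given (start, end) offsets that analysis would have produced. The
other two model the second approach by rescanning only the excerpt window.

```python
import re
from bisect import bisect_right

ELLIPSIS = "..."            # assumption: three plain dots
WORD = re.compile(r"\w+")   # assumption: a word is a run of \w characters


def boundaries_from_spans(spans):
    """First approach: flatten (start, end) term offsets into sorted boundaries."""
    return sorted({offset for span in spans for offset in span})


def word_end_offsets(text):
    """Second approach: recompute word end offsets directly from the text."""
    return [match.end() for match in WORD.finditer(text)]


def truncate_on_boundary(text, limit):
    """Cut `text` at the last word end that still leaves room for the ellipsis."""
    if len(text) <= limit:
        return text
    budget = limit - len(ELLIPSIS)
    # Only the first `limit` characters are scanned, never the whole document.
    ends = word_end_offsets(text[:limit])
    index = bisect_right(ends, budget)
    cut = ends[index - 1] if index > 0 else 0
    return text[:cut] + ELLIPSIS
```

With these definitions, a long URL is cut after the last word-like token that
fits within the limit, so most of the URL survives instead of only the
ellipsis. A real fix would still have to decide what counts as a word, for
example by using the schema's analyzer as suggested above.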
> Highlighting/excerpt on URLs
> -----------------------------
>
> Key: LUCY-199
> URL: https://issues.apache.org/jira/browse/LUCY-199
> Project: Lucy
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.2.2 (incubating)
> Environment: Linux
> Reporter: Henry
> Attachments: hltest1.tgz
>
>
> If I explicitly specify excerpt_length:
>     my $hl = Lucy::Highlight::Highlighter->new(
>         searcher       => $searcher,
>         query          => $query_compiler,
>         field          => 'site',
>         excerpt_length => 60,
>     );
> ...and the field content is longer than 60, then
> $page_highlighter->create_excerpt($hit);
> returns '...'.
> Content which is shorter than 60 returns the highlighted excerpt as expected.
> If I comment out "excerpt_length => 60," above, then it returns the full
> non-truncated excerpt with highlighting as expected.
> Some >60char samples which return …/"...", searching for [iol.co.za] or
> [news24.com] (brackets are mine):
> [www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220]
> [http://www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSiteHome/0,,,00.html]
> [www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSite_TravelClub/0,,,00.html]
> The following return double-ellipses ("......" - ……), searching
> for [adsl mweb.com]:
> [http://www.mweb.co.za/helpcentre/ADSL/ADSLGeneralIdisagreewithyourusagereport.aspx]
> [http://www.mweb.co.za/helpcentre/FrequentlyAskedQuestions/MWEBHelpCentreFAQsHowdoI/FAQHowdoIHowdoImigratemyADSL/tabid/661/Default.aspx]