[
https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13178535#comment-13178535
]
Marvin Humphrey commented on LUCY-199:
--------------------------------------
> We recompute the word boundaries during highlighting like we do now. We
> could use the analyzer of the current schema for that. But we could also use
> any other kind of algorithm that is better than the current one. This might be
> very cheap since we only work on a small subset of the document.
That sounds like a really good idea for a lot of reasons! :) I don't quite
understand how it will solve this bug, but that's partly because the boundary
detection code in Highlighter is complex and messy -- and using an Analyzer
would help to clean it up.

One thing to bear in mind is that Highlighter is not only concerned with word
boundaries, but sentence boundaries. Take a look at the excerpts on the SERPs
for Google or any other major web search engine -- they tend to prefer
complete sentences. Lucy's own highlighter favors sentences just because I
had a gut feeling that it was superior to the random word boundaries chosen by
the Lucene highlighter, but I'm sure there are academic papers by now which
explain why it's desirable.

I note that UAX #29 describes an algorithm for sentence boundary detection.
Our StandardTokenizer implements UAX #29 word boundary tokenization; we could
implement a new Analyzer for sentence boundary detection using the same
techniques. (Lucy::Analysis::StandardSentenceTokenizer?) Then we could
leverage Lucy's analysis apparatus for *both* boundary detection phases within
Highlighter, while still utilizing the existing highlighting data generated at
index-time for generating heat maps and scoring excerpt candidates. That
would get a lot of ugly code out of Highlighter and make it much easier to
work on.
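To make the sentence phase concrete, here is a rough standalone sketch in Python, not Lucy code. It assumes a crude regex as a stand-in for the UAX #29 sentence rules (it will missplit abbreviations such as "e.g."), and the names `sentence_spans` and `sentence_excerpt` are made up for illustration. Given the character offsets of a hit, it returns the span of the sentence or sentences that cover it.

```python
import re

# Crude stand-in for UAX #29 sentence rules (an assumption, not the real
# algorithm): a sentence runs from a non-space character to ., ! or ?
# followed by whitespace or end of text, or to the end of the text.
_SENTENCE = re.compile(r"\S.*?(?:[.!?]+(?=\s|$)|$)", re.DOTALL)


def sentence_spans(text):
    """Return (start, end) character offsets for each candidate sentence."""
    return [(m.start(), m.end()) for m in _SENTENCE.finditer(text)]


def sentence_excerpt(text, hit_start, hit_end):
    """Return the span of the sentence(s) overlapping the hit [hit_start, hit_end)."""
    covering = [
        (start, end)
        for start, end in sentence_spans(text)
        if start < hit_end and end > hit_start
    ]
    if not covering:
        # No sentence overlaps the hit; fall back to the hit itself.
        return (hit_start, hit_end)
    return (covering[0][0], covering[-1][1])
```

Note that a URL such as those in this report contains no sentence-ending punctuation followed by whitespace, so this sketch treats it as part of one long "sentence"; the word boundary phase would still have to deal with it.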
If we want a quick fix for this bug, though, I think we could also just wrap
an "if" test around the code which deals with the closing ellipsis: if we
eat the whole string looking for a boundary, fall back to swapping out the
last character for an ellipsis.
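As a rough illustration of that fallback, here is a standalone Python sketch rather than the actual Highlighter code. It assumes whitespace is the only word boundary and that `limit` is at least 1; `close_excerpt` is a hypothetical name.

```python
ELLIPSIS = "\u2026"


def close_excerpt(raw, limit):
    """Trim raw to at most limit characters, closing with an ellipsis if cut."""
    if len(raw) <= limit:
        return raw
    # Reserve one character for the closing ellipsis.
    window = raw[:limit - 1]
    cut = len(window)
    # If the window stops mid-word, back up to the previous whitespace.
    if not raw[cut].isspace():
        while cut > 0 and not window[cut - 1].isspace():
            cut -= 1
    trimmed = window[:cut].rstrip()
    if trimmed:
        return trimmed + ELLIPSIS
    # The "if" test: the boundary search ate the whole string (for example
    # a long URL with no whitespace), so swap the last character of the
    # truncated text for an ellipsis instead of returning a bare ellipsis.
    return raw[:limit - 1] + ELLIPSIS
```

With one of the URLs from the report and a limit of 60, this returns the first 59 characters followed by an ellipsis, where the behavior described in the report is a bare ellipsis.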
> Highlighting/excerpt on URLs
> -----------------------------
>
> Key: LUCY-199
> URL: https://issues.apache.org/jira/browse/LUCY-199
> Project: Lucy
> Issue Type: Bug
> Components: Core
> Affects Versions: 0.2.2 (incubating)
> Environment: Linux
> Reporter: Henry
> Attachments: hltest1.tgz
>
>
> If I explicitly specify excerpt_length:
>     my $hl = Lucy::Highlight::Highlighter->new(
>         searcher       => $searcher,
>         query          => $query_compiler,
>         field          => 'site',
>         excerpt_length => 60,
>     );
> ...and the field content is longer than 60, then
> $page_highlighter->create_excerpt($hit);
> returns '...'.
> Content which is shorter than 60 returns the highlighted excerpt as expected.
> If I comment out "excerpt_length => 60," above, then it returns the full
> non-truncated excerpt with highlighting as expected.
> Some >60char samples which return …/"...", searching for [iol.co.za] or
> [news24.com] (brackets are mine):
> [www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220]
> [http://www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSiteHome/0,,,00.html]
> [www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSite_TravelClub/0,,,00.html]
> The following return double-ellipses ("......" - ……), searching
> for [adsl mweb.com]:
> [http://www.mweb.co.za/helpcentre/ADSL/ADSLGeneralIdisagreewithyourusagereport.aspx]
> [http://www.mweb.co.za/helpcentre/FrequentlyAskedQuestions/MWEBHelpCentreFAQsHowdoI/FAQHowdoIHowdoImigratemyADSL/tabid/661/Default.aspx]