[lucy-issues] [jira] [Commented] (LUCY-199) Highlighting/excerpt on URLs

Nick Wellnhofer (Commented) (JIRA) Wed, 18 Jan 2012 14:07:06 -0800

    [ 
https://issues.apache.org/jira/browse/LUCY-199?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13188734#comment-13188734
 ]


Nick Wellnhofer commented on LUCY-199:
--------------------------------------

Thinking more about a better fix for this problem, I'd note that choosing a 
good excerpt can be done without any knowledge of the actual tokenization 
algorithm used in the indexing process. I think it's enough to

* find boundaries that are more or less correct in a semantic and visual sense, 
and
* be tolerant enough to find boundaries inside long substrings without 
whitespace, which might exceed excerpt_length (whitespace being the obvious 
place to break words, as the current implementation does).

If the highlighter finds additional word breaks, it shouldn't be a problem as 
long as the result is visually correct.

Such an approach wouldn't depend on the analyzer at all, and it wouldn't 
introduce additional coupling between Lucy's components. Of course, it would 
mean implementing a separate Unicode-capable word breaking algorithm for the 
highlighter. But this shouldn't be very hard, as we could reuse parts of the 
StandardTokenizer.
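
To make the idea concrete, here is a rough sketch in Python (not Lucy's C core 
or its Perl bindings). The names break_positions and choose_excerpt are 
hypothetical and not part of Lucy's API, and the \w+ pattern is only a crude 
stand-in for a real Unicode word breaking algorithm like the one in the 
StandardTokenizer. It assumes the hit offsets are already known and that the 
hit itself fits within excerpt_length.

```python
import re

# Crude stand-in for a Unicode word breaker: runs of word characters.
# (Assumption for illustration; a real implementation would follow UAX #29.)
_WORD_RUN = re.compile(r"\w+")
ELLIPSIS = "\u2026"


def break_positions(text):
    """Return sorted offsets where an excerpt may start or end.

    Every start and end of a word-character run is a candidate boundary,
    so a URL can be broken at dots, slashes and hyphens, not only at
    whitespace.
    """
    positions = {0, len(text)}
    for match in _WORD_RUN.finditer(text):
        positions.add(match.start())
        positions.add(match.end())
    return sorted(positions)


def choose_excerpt(text, hit_start, hit_end, excerpt_length):
    """Pick a window of at most excerpt_length characters around the hit.

    Assumes the hit (hit_start..hit_end) is no longer than excerpt_length.
    """
    if len(text) <= excerpt_length:
        return text
    breaks = break_positions(text)
    # Ideally the hit sits in the middle of the window.
    slack = max(excerpt_length - (hit_end - hit_start), 0)
    ideal_start = max(hit_start - slack // 2, 0)
    # Snap the start forward to the next boundary, but never past the hit.
    start = min(b for b in breaks if b >= ideal_start)
    start = min(start, hit_start)
    # Snap the end back to the last boundary inside the length budget,
    # but never cut into the hit.
    limit = start + excerpt_length
    end = max(b for b in breaks if b <= limit)
    end = max(end, hit_end)
    prefix = ELLIPSIS if start > 0 else ""
    suffix = ELLIPSIS if end < len(text) else ""
    return prefix + text[start:end] + suffix


# One of the URLs from the report, with the hit on "iol".
url = "www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220"
hit = url.index("iol")
print(choose_excerpt(url, hit, hit + 3, 60))
```

Because the boundaries come from the text alone, nothing here needs to know 
which analyzer produced the index. Extra break points only give the excerpt 
more places to start and end.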
                
> Highlighting/excerpt on URLs 
> -----------------------------
>
>                 Key: LUCY-199
>                 URL: https://issues.apache.org/jira/browse/LUCY-199
>             Project: Lucy
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 0.2.2 (incubating)
>         Environment: Linux
>            Reporter: Henry
>         Attachments: LUCY-199-quickfix.patch, hltest1.tgz
>
>
> If I explicitly specify excerpt_length:
> my $hl             = Lucy::Highlight::Highlighter->new(
>    searcher       => $searcher,
>    query          => $query_compiler,
>    field          => 'site',
>    excerpt_length => 60,
> );
> ...and the field content is longer than 60, then
> $page_highlighter->create_excerpt($hit);
> returns '...'.
> Content which is shorter than 60 returns the highlighted excerpt as expected.
> If I comment out "excerpt_length => 60," above, then it returns the full
> non-truncated excerpt with highlighting as expected.
> Some >60char samples which return &#8230;/"...", searching for [iol.co.za] or
> [news24.com] (brackets are mine):
> [www.iol.co.za/tonight/books/what-the-dickens-gets-a-statue-1.1130220]
> [http://www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSiteHome/0,,,00.html]
> [www.news24.com/News24v2/Travel/Mini_Site/ContentDisplay/n24TravelMiniSite_TravelClub/0,,,00.html]
> The following return double-ellipses ("......" - &#8230;&#8230;), searching
> for [adsl mweb.com]:
> [http://www.mweb.co.za/helpcentre/ADSL/ADSLGeneralIdisagreewithyourusagereport.aspx]
> [http://www.mweb.co.za/helpcentre/FrequentlyAskedQuestions/MWEBHelpCentreFAQsHowdoI/FAQHowdoIHowdoImigratemyADSL/tabid/661/Default.aspx]

