[jira] Commented: (LUCENE-1489) highlighter problem with n-gram tokens

Mark Harwood (JIRA) Tue, 27 Jan 2009 03:57:24 -0800

    [ 
https://issues.apache.org/jira/browse/LUCENE-1489?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12667654#action_12667654
 ]


Mark Harwood commented on LUCENE-1489:
--------------------------------------

It looks to me like this could be fixed in the "Formatter" classes, at the 
point where they mark up the output string.

Currently the "highlightTerm" method of classes such as SimpleHTMLFormatter 
puts a tag around the whole section of text if it contains a hit, i.e.

{code:title=SimpleHTMLFormatter.java|borderStyle=solid}
        public String highlightTerm(String originalText, TokenGroup tokenGroup)
        {
                StringBuffer returnBuffer;
                if(tokenGroup.getTotalScore()>0)
                {
                        returnBuffer=new StringBuffer();
                        returnBuffer.append(preTag);
                        returnBuffer.append(originalText);
                        returnBuffer.append(postTag);
                        return returnBuffer.toString();
                }
                return originalText;
        }
{code}

The TokenGroup object passed to this method contains all of the tokens and 
their scores, so it should be possible to use this information to deconstruct 
the originalText parameter and inject markup only around the tokens in the 
group that had a match, rather than putting a tag around the whole block. Some 
complexity may lie in handling token streams that produce tokens that "rewind" 
to earlier offsets.
SimpleHTMLFormatter suddenly seems less simple!
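
As a rough illustration of the idea above (not a patch against the highlighter), the following self-contained sketch marks up only the matching tokens of a group. It deliberately avoids the Lucene classes: ScoredSpan, markUp and bigramSpans are hypothetical names made up for this example, and the assumption is that a formatter can obtain each token's start offset, end offset and score from the TokenGroup, plus the offset at which originalText begins. Sorting the hits by start offset is one possible way to cope with tokens that "rewind".

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

class PerTokenMarkupSketch {

    /** Hypothetical stand-in for one token of a group: absolute offsets plus its score. */
    static final class ScoredSpan {
        final int start;   // inclusive offset into the full document text
        final int end;     // exclusive offset into the full document text
        final float score; // > 0 means the token matched the query

        ScoredSpan(int start, int end, float score) {
            this.start = start;
            this.end = end;
            this.score = score;
        }
    }

    /**
     * Wraps only the matching parts of originalText in preTag/postTag.
     * groupStart is the (assumed known) absolute offset at which originalText begins.
     */
    static String markUp(String originalText, int groupStart, List<ScoredSpan> spans,
                         String preTag, String postTag) {
        // Keep matching tokens only, converted to offsets relative to originalText.
        List<int[]> hits = new ArrayList<int[]>();
        for (ScoredSpan s : spans) {
            if (s.score <= 0) continue;
            int from = Math.max(0, s.start - groupStart);
            int to = Math.min(originalText.length(), s.end - groupStart);
            if (from < to) hits.add(new int[] {from, to});
        }
        // Sort by start offset so tokens that "rewind" to earlier offsets are handled.
        hits.sort(Comparator.comparingInt(h -> h[0]));

        StringBuilder out = new StringBuilder();
        int pos = 0; // end of the text already copied to the output
        int i = 0;
        while (i < hits.size()) {
            int from = hits.get(i)[0];
            int to = hits.get(i)[1];
            i++;
            // Merge overlapping or touching hits (e.g. "ca" + "an") into one tagged run.
            while (i < hits.size() && hits.get(i)[0] <= to) {
                to = Math.max(to, hits.get(i)[1]);
                i++;
            }
            out.append(originalText, pos, from);
            out.append(preTag).append(originalText, from, to).append(postTag);
            pos = to;
        }
        out.append(originalText, pos, originalText.length());
        return out.toString();
    }

    /** Demo helper: every 2-character window of text, scored 1 if it equals a query term. */
    static List<ScoredSpan> bigramSpans(String text, String... queryTerms) {
        List<String> terms = Arrays.asList(queryTerms);
        List<ScoredSpan> spans = new ArrayList<ScoredSpan>();
        for (int i = 0; i + 2 <= text.length(); i++) {
            float score = terms.contains(text.substring(i, i + 2)) ? 1f : 0f;
            spans.add(new ScoredSpan(i, i + 2, score));
        }
        return spans;
    }

    public static void main(String[] args) {
        String text = "Lucene can make index. Then Lucene can search.";
        // The query "can" is assumed to be analyzed into the bigrams "ca" and "an".
        List<ScoredSpan> spans = bigramSpans(text, "ca", "an");
        System.out.println(markUp(text, 0, spans, "<B>", "</B>"));
        // prints: Lucene <B>can</B> make index. Then Lucene <B>can</B> search.
    }
}
```

Merging overlapping hits before tagging keeps the tags from nesting when adjacent n-grams both match; a real fix would still need to decide how to treat partially matching phrases.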

TokenStreams that produce entirely overlapping streams of tokens will 
automatically be broken into multiple TokenGroups because TokenGroup has a 
maximum number of linked Tokens it will ever hold in a single group.

I haven't got the time to fix this right now but if someone has a burning need 
to leap in, the above seems like what may be required.

Cheers
Mark






> highlighter problem with n-gram tokens
> --------------------------------------
>
>                 Key: LUCENE-1489
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1489
>             Project: Lucene - Java
>          Issue Type: Bug
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Priority: Minor
>
> I have a problem when using n-gram tokens with the highlighter. I thought it 
> had been solved in LUCENE-627...
> I actually found this problem when using CJKTokenizer on Solr, but here is 
> a Lucene program that reproduces it using NGramTokenizer(min=2,max=2) 
> instead of CJKTokenizer:
> {code:java}
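> // Imports added for completeness; package names assume the Lucene 2.4-era API.
> import java.io.Reader;
> import org.apache.lucene.analysis.Analyzer;
> import org.apache.lucene.analysis.TokenStream;
> import org.apache.lucene.analysis.ngram.NGramTokenizer;
> import org.apache.lucene.queryParser.QueryParser;
> import org.apache.lucene.search.Query;
> import org.apache.lucene.search.highlight.Highlighter;
> import org.apache.lucene.search.highlight.QueryScorer;
>
> public class TestNGramHighlighter {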
>   public static void main(String[] args) throws Exception {
>     Analyzer analyzer = new NGramAnalyzer();
>     final String TEXT = "Lucene can make index. Then Lucene can search.";
>     final String QUERY = "can";
>     QueryParser parser = new QueryParser("f",analyzer);
>     Query query = parser.parse(QUERY);
>     QueryScorer scorer = new QueryScorer(query,"f");
>     Highlighter h = new Highlighter( scorer );
>     System.out.println( h.getBestFragment(analyzer, "f", TEXT) );
>   }
>   static class NGramAnalyzer extends Analyzer {
>     public TokenStream tokenStream(String field, Reader input) {
>       return new NGramTokenizer(input,2,2);
>     }
>   }
> }
> {code}
> expected output is:
> Lucene <B>can</B> make index. Then Lucene <B>can</B> search.
> but the actual output is:
> Lucene <B>can make index. Then Lucene can</B> search.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

