[jira] Commented: (LUCENE-1522) another highlighter

Marvin Humphrey (JIRA) Wed, 18 Mar 2009 09:39:13 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12683064#action_12683064
 ]


Marvin Humphrey commented on LUCENE-1522:
-----------------------------------------

> OK, it sounds like one can simply use different models to score
> fragdocs and it's still an open debate how much each of these criteria
> (IDF, showing surround context, being on sentence boundary, diversity
> of terms) should impact the score. 

With Michael Busch's priority queue approach, the algorithm for choosing the
fragments can be abstracted into the class of object we put in the queue and
its lessThan() method.  The output from the queue just has to be something the
Highlighter can chew.

> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
> LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token stream 
> (general token stream (e.g. WhitespaceTokenizer) also supported. see test 
> code in patch). The idea was inherited from my previous project with my 
> colleague and LUCENE-644. This approach needs highlight fields to be 
> TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
> depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, 
> "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" 
> N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks query boost to score fragments (currently doesn't see idf, but it 
> should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collects performance numbers

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]

[jira] Commented: (LUCENE-1522) another highlighter

Reply via email to