[jira] Issue Comment Edited: (LUCENE-1522) another highlighter

Mark Miller (JIRA) Mon, 23 Mar 2009 14:14:17 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688419#action_12688419
 ]


Mark Miller edited comment on LUCENE-1522 at 3/23/09 2:12 PM:
--------------------------------------------------------------

I think you are reading more into that than I see - that guy is just frustrated 
that PhraseQueries don't highlight correctly. That was/is a common occurrence 
and you can find tons of examples. There are one or two JIRA highlighters that 
address it, and then there is the Span highlighter (more interestingly, there is 
a link to the birth of the Span highlighter idea on that page - thanks M. 
Harwood).

When users see the PhraseQuery look right, I haven't seen any other repeated 
complaints really. While it would be nice to match boolean logic fully, I 
almost don't think it's worth the effort. You likely have an interest in those 
terms anyway - it's not a given that the terms that caused the match 
(non-positional) matter. I have not seen a complaint on that one - mostly just 
positional type stuff. And I think we have positional solved fairly well with 
the current API - it's just too darn slow. Not that I am against things being 
sweet and perfect, and getting exact matches, but there has been lots of talk 
in the past about integrating the highlighter into core and making things 
really fast and efficient - and generally it comes down to what work actually 
gets done (and all this stuff ends up at the hard end of the pool).

When I wrote the SpanScorer, many times it was discussed how things should 
*really* be done. Most methods involved working with core - but what has been 
there for a couple years now is the SpanScorer that plugs into the current 
highlighter API and nothing else has made any progress. Not really an argument, 
just kind of thinking out loud at this point...

I'm all for improving the speed and accuracy of the highlighter at the end of 
the day, but it's a tall order considering how much attention the Highlighter 
has managed to receive in the past. It's large on ideas and low on sweat.

*edit*
A lot of the sweat that is given has been fragmented by the 3 or 4 alternate 
highlighters.

> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, 
> LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token streams 
> (general token streams, e.g. WhitespaceTokenizer, are also supported; see the 
> test code in the patch). The idea was inherited from my previous project with 
> my colleague and LUCENE-644. This approach needs highlight fields to be 
> TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This 
> depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size" 
> N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks at query boost to score fragments (currently doesn't look at idf, but 
> it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers
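
The phrase-unit highlighting with slop shown in the {noformat} example above can be illustrated with a small standalone sketch. This is not the algorithm from the attached patch and it uses no Lucene classes: the class name PhraseHighlightSketch and the method highlight are hypothetical, tokens are split on whitespace, and only in-order matches are considered (real sloppy phrase matching also allows reordering at a higher slop cost).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not part of Lucene or of the LUCENE-1522 patch.
public class PhraseHighlightSketch {

    /** Wraps tokens that take part in an in-order phrase match in <b> tags. */
    static String highlight(String text, String[] phrase, int slop) {
        String[] tokens = text.trim().split("\\s+");
        boolean[] hit = new boolean[tokens.length];

        for (int start = 0; start < tokens.length; start++) {
            if (!tokens[start].equals(phrase[0])) continue;
            List<Integer> matched = new ArrayList<>();
            matched.add(start);
            int cur = start;
            int budget = slop; // number of extra tokens still allowed between phrase terms
            for (int k = 1; k < phrase.length; k++) {
                int found = -1;
                int limit = Math.min(tokens.length - 1, cur + 1 + budget);
                // Greedy: take the first occurrence of the next term within the budget.
                for (int j = cur + 1; j <= limit; j++) {
                    if (tokens[j].equals(phrase[k])) { found = j; break; }
                }
                if (found < 0) break;
                budget -= found - cur - 1;
                matched.add(found);
                cur = found;
            }
            if (matched.size() == phrase.length) {
                for (int pos : matched) hit[pos] = true;
            }
        }

        // Emit the text, merging adjacent matched tokens into one <b>...</b> run.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) out.append(' ');
            boolean opens = hit[i] && (i == 0 || !hit[i - 1]);
            boolean closes = hit[i] && (i == tokens.length - 1 || !hit[i + 1]);
            if (opens) out.append("<b>");
            out.append(tokens[i]);
            if (closes) out.append("</b>");
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(highlight("w1 w2", new String[] {"w1", "w2"}, 0));
        System.out.println(highlight("w1 w3 w2 w3 w1 w2", new String[] {"w1", "w2"}, 1));
    }
}
```

With slop 1 the second call marks w1 and w2 separately where w3 sits between them, and as a single unit where they are adjacent, which is the behaviour the description's example shows.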

