[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682985#action_12682985 ]
Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
>> ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
>> spans produced by their children.
>
> Hmm - it seems like that loses information. Ie, for ANDQuery, you lose the
> fact that you should try to include a match from each of the sub-clauses'
> spans.

A good idea. ANDQuery's highlightSpans() method could probably be improved by
post-processing the child spans to take this into account. That way we
wouldn't have to gum up the main Highlighter code with a bunch of conditionals
which afford special treatment to certain query types.
{quote}

I think we may need a tree-structured result returned by the Weight/Scorer,
compactly representing the "space" of valid fragdocs for this one doc. And
then somehow we walk that tree, enumerating/scoring the individual "valid"
fragdocs that are created from that tree.

{quote}
> What I meant was: all other things being equal, do you more strongly
> favor a fragment that has all N of the terms in a query vs. another
> fragment that has fewer than N but, say, a higher net number of occurrences?

No, the diversity of the terms in a fragment isn't factored in. The span
objects only tell the Highlighter that a particular range of characters was
important; they don't say why. However, note that IDF would prevent a bunch
of hits on "the" from causing too hot a hotspot in the heat map. So you're
likely to see fragments with high discriminatory value.
{quote}

This still seems subjectively wrong to me. If I search for "president bush",
"bush" is probably the rarer term, and so you would favor showing me a single
fragment where "bush" occurs twice over a fragment with a single occurrence
each of "president" and "bush"?

{quote}
> Google picks more than one fragment; it seems like it picks one or two
> fragments.

I probably overstated my opposition to supplying an excerpt containing more
than one fragment.
It seems OK to me to select more than one, so long as they all scan easily,
and so long as the excerpts don't get long enough to force excessive scrolling
and slow down the time it takes the user to scan the whole results page.

What bothers me is that the excerpts don't scan easily right now. I consider
that a much more important defect than the fact that the fragdoc doesn't hit
every term (which isn't even possible for large queries), and it seemed to me
that pursuing exhaustive term matching was likely to yield even more highly
fragmented, visually chaotic fragdocs.
{quote}

Which excerpts don't scan easily right now? Google's, KS's, Lucene's H1 or H2?

I think with a tree structure representing the search space for all fragdocs,
we could then efficiently enumerate fragdocs with an appropriate scoring model
(favoring sentence starts or surrounding context, breadth of terms, etc.).
This way we can do a real search (on all fragdocs), subject to the preference
for consumability/breadth.

> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch,
>                      LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token
> streams (general token streams (e.g. WhitespaceTokenizer) are also
> supported; see the test code in the patch). The idea was inherited from my
> previous project with my colleague and LUCENE-644. This approach needs
> highlighted fields to be TermVector.WITH_POSITIONS_OFFSETS, but it is fast
> and can support N-grams. This depends on LUCENE-1448 to get refined term
> offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
>       "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token streams, but also "fixed size"
>   N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlighted fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply the patch due to its independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks at query boosts to score fragments (currently doesn't use idf, but
>   it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
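[Editor's note] The disagreement in the comment above — whether a fragment
containing "bush" twice should outrank one containing both "president" and
"bush" — comes down to the fragment scoring model. The following is a minimal
numeric sketch of that trade-off, not Lucene's highlighter API: the class,
the IDF weights, and the diversityBonus knob are all hypothetical, invented
only to make the arithmetic of the debate concrete.

```java
import java.util.*;

// Hypothetical sketch: score a candidate fragment from the query terms that
// hit inside it. Each hit contributes its term weight (e.g. IDF), and an
// optional diversity bonus rewards covering more distinct query terms.
public class FragmentScoreSketch {
    static double score(List<String> hits, Map<String, Double> weight,
                        double diversityBonus) {
        double sum = 0;
        Set<String> distinct = new HashSet<>(hits);
        for (String term : hits) {
            sum += weight.getOrDefault(term, 0.0);
        }
        return sum + diversityBonus * distinct.size();
    }

    public static void main(String[] args) {
        // Invented weights: "bush" is the rarer (higher-IDF) term.
        Map<String, Double> idf = Map.of("president", 1.0, "bush", 2.0);
        List<String> bushTwice = List.of("bush", "bush");
        List<String> both = List.of("president", "bush");

        // Pure IDF weighting favors the repeated rare term ...
        System.out.println(score(bushTwice, idf, 0.0)); // 4.0
        System.out.println(score(both, idf, 0.0));      // 3.0
        // ... while a diversity bonus flips the preference:
        System.out.println(score(bushTwice, idf, 1.5)); // 5.5
        System.out.println(score(both, idf, 1.5));      // 6.0
    }
}
```

With no bonus the repeated rare term wins (4.0 vs. 3.0); a modest bonus flips
it (6.0 vs. 5.5), matching the intuition that breadth of terms should be able
to outweigh repetition.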