[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682985#action_12682985 ]
Michael McCandless commented on LUCENE-1522:
--------------------------------------------

{quote}
>> ANDQuery, ORQuery, and RequiredOptionalQuery just return the union of the
>> spans produced by their children.
>
> Hmm - it seems like that loses information. Ie, for ANDQuery, you lose the
> fact that you should try to include a match from each of the sub-clauses'
> spans.

A good idea. ANDQuery's highlightSpans() method could probably be improved by
post-processing the child spans to take this into account. That way we
wouldn't have to gum up the main Highlighter code with a bunch of conditionals
which afford special treatment to certain query types.
{quote}

I think we may need a tree-structured result returned by the Weight/Scorer,
compactly representing the "space" of valid fragdocs for this one doc. And
then somehow we walk that tree, enumerating/scoring the individual "valid"
fragdocs that are created from that tree.

{quote}
> What I meant was: all other things being equal, do you more strongly
> favor a fragment that has all N of the terms in a query vs. another
> fragment that has fewer than N but, say, a higher net number of occurrences?

No, the diversity of the terms in a fragment isn't factored in. The span
objects only tell the Highlighter that a particular range of characters was
important; they don't say why. However, note that IDF would prevent a bunch
of hits on "the" from causing too hot a hotspot in the heat map. So you're
likely to see fragments with high discriminatory value.
{quote}

This still seems subjectively wrong to me. If I search for "president bush",
"bush" is probably the rarer term, and so you would favor showing me a single
fragment where "bush" occurs twice over a fragment with a single occurrence
each of "president" and "bush"?

{quote}
> Google picks more than one fragment; it seems like it picks one or two
> fragments.

I probably overstated my opposition to supplying an excerpt containing more
than one fragment.
It seems OK to me to select more than one, so long as they all scan easily,
and so long as the excerpts don't get long enough to force excessive scrolling
and slow down the time it takes the user to scan the whole results page.

What bothers me is that the excerpts don't scan easily right now. I consider
that a much more important defect than the fact that the fragdoc doesn't hit
every term (which isn't even possible for large queries), and it seemed to me
that pursuing exhaustive term matching was likely to yield even more highly
fragmented, visually chaotic fragdocs.
{quote}

Which excerpts don't scan easily right now? Google's, KS's, Lucene's H1 or H2?

I think with a tree structure representing the search space for all fragdocs,
we could then efficiently enumerate fragdocs with an appropriate scoring model
(favoring sentence starts or surrounding context, breadth of terms, etc.).
This way we can do a real search (on all fragdocs), subject to the preference
for consumability/breadth.

> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch,
>                      LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token
> streams (general token streams (e.g. WhitespaceTokenizer) are also
> supported; see the test code in the patch). The idea was inherited from my
> previous project with my colleague and LUCENE-644. This approach needs
> highlighted fields to be TermVector.WITH_POSITIONS_OFFSETS, but it is fast
> and can support N-grams. This depends on LUCENE-1448 to get refined term
> offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
>       "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token streams, but also "fixed size"
>   N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlighted fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply the patch due to its independent package (contrib/highlighter2)
> - uses Java 1.5
> - looks at query boosts to score fragments (currently doesn't use idf, but
>   it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
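[Editor's note] The disagreement in the comment above — whether a fragment
containing "bush" twice should outrank one containing both "president" and
"bush" — comes down to the fragment scoring model. The following is a minimal
numeric sketch of that trade-off, not Lucene's highlighter API: the class,
the IDF weights, and the diversityBonus knob are all hypothetical, invented
only to make the arithmetic of the debate concrete.

```java
import java.util.*;

// Hypothetical sketch: score a candidate fragment from the query terms that
// hit inside it. Each hit contributes its term weight (e.g. IDF), and an
// optional diversity bonus rewards covering more distinct query terms.
public class FragmentScoreSketch {
    static double score(List<String> hits, Map<String, Double> weight,
                        double diversityBonus) {
        double sum = 0;
        Set<String> distinct = new HashSet<>(hits);
        for (String term : hits) {
            sum += weight.getOrDefault(term, 0.0);
        }
        return sum + diversityBonus * distinct.size();
    }

    public static void main(String[] args) {
        // Invented weights: "bush" is the rarer (higher-IDF) term.
        Map<String, Double> idf = Map.of("president", 1.0, "bush", 2.0);
        List<String> bushTwice = List.of("bush", "bush");
        List<String> both = List.of("president", "bush");

        // Pure IDF weighting favors the repeated rare term ...
        System.out.println(score(bushTwice, idf, 0.0)); // 4.0
        System.out.println(score(both, idf, 0.0));      // 3.0
        // ... while a diversity bonus flips the preference:
        System.out.println(score(bushTwice, idf, 1.5)); // 5.5
        System.out.println(score(both, idf, 1.5));      // 6.0
    }
}
```

With no bonus the repeated rare term wins (4.0 vs. 3.0); a modest bonus flips
it (6.0 vs. 5.5), matching the intuition that breadth of terms should be able
to outweigh repetition.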