[ https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682987#action_12682987 ]
Michael McCandless commented on LUCENE-1522:
--------------------------------------------

OK to sum up here with observations / wish list / ideas / controversies / etc. for Lucene's future merged highlighter:

  * Fragmenter should aim for fast "eye + brain scanning consumability" (eg, try hard to start on sentence boundaries, include context)

  * Let's try for single source -- each Query/Weight/Scorer should be able to enumerate the set of term positions/spans that caused it to match a specific doc (like explain(), but provides positions/spans detailing the match). Trying to "reverse engineer" the matching is brittle

  * Sliding window is better than static "top down" fragmentation

  * To scale, we should make a simple IndexReader impl on top of term vectors, but still allow the "re-index single doc on the fly" option

  * Favoring breadth (more unique terms instead of many occurrences of certain terms) seems important, except for too-many-term queries where this gets unwieldy

  * Prefer a single fragment if it scores well enough, but fall back to several, if necessary, to show "breadth"

  * Produce structured output so non-HTML front ends (eg Flex) can render

  * Try to include "context around the hits", when possible (eg the "favor middle of the sentence" that Michael described)

  * Maybe let IDF affect fragment scoring, maybe not

  * Performance is important -- use TermVectors if present, add early termination if you've already found a good enough fragdoc, etc.

  * Maybe a tree-based fragdoc enumeration / searching model; I think this'd be even more efficient than sliding window, especially for large docs

  * Multi-color, HeatMap default ootb HTML UIs are nice

  * It's all very subjective and quite a good challenge!!

In the meantime, it seems like we should commit this H2 and give users the choice? We can then iterate over time on our wish list....
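As a rough illustration of the "sliding window" and "favoring breadth" points above, here is a minimal sketch. It is not code from the LUCENE-1522 patch and uses no Lucene API: the class SlidingWindowSketch, the types Hit and Window, and the method bestWindow are all hypothetical names made up for this example. It assumes the match offsets are already available (for instance from term vectors), slides a fixed-size character window across them, and keeps the window with the most unique terms, using the total hit count only to break ties.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch; not part of Lucene or of the LUCENE-1522 patch. */
class SlidingWindowSketch {

    /** A matched query term and the character offset where it starts. */
    static final class Hit {
        final String term;
        final int startOffset;
        Hit(String term, int startOffset) {
            this.term = term;
            this.startOffset = startOffset;
        }
    }

    /** A candidate fragment starting at 'start', with its two score components. */
    static final class Window {
        final int start;
        final int uniqueTerms;
        final int totalHits;
        Window(int start, int uniqueTerms, int totalHits) {
            this.start = start;
            this.uniqueTerms = uniqueTerms;
            this.totalHits = totalHits;
        }
        /** Breadth first: more unique terms wins; total hits only breaks ties. */
        boolean beats(Window other) {
            if (other == null) return true;
            if (uniqueTerms != other.uniqueTerms) return uniqueTerms > other.uniqueTerms;
            return totalHits > other.totalHits;
        }
    }

    /**
     * Assumes 'hits' is sorted by startOffset. Each window starts at a hit and
     * covers every hit that starts within windowSize characters of it.
     * Returns null when there are no hits.
     */
    static Window bestWindow(List<Hit> hits, int windowSize) {
        if (windowSize <= 0) throw new IllegalArgumentException("windowSize must be positive");
        Map<String, Integer> counts = new HashMap<>();
        Window best = null;
        int right = 0;
        for (int left = 0; left < hits.size(); left++) {
            int start = hits.get(left).startOffset;
            // Grow the window to take in every hit that starts before its right edge.
            while (right < hits.size() && hits.get(right).startOffset < start + windowSize) {
                counts.merge(hits.get(right).term, 1, Integer::sum);
                right++;
            }
            Window candidate = new Window(start, counts.size(), right - left);
            if (candidate.beats(best)) best = candidate;
            // Drop the leftmost hit before the window slides to the next one.
            counts.computeIfPresent(hits.get(left).term, (t, c) -> c == 1 ? null : c - 1);
        }
        return best;
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("lucene", 0), new Hit("lucene", 10), new Hit("lucene", 20),
            new Hit("highlighter", 200), new Hit("lucene", 230), new Hit("fragment", 260));
        Window w = bestWindow(hits, 100);
        System.out.println("start=" + w.start + " unique=" + w.uniqueTerms + " hits=" + w.totalHits);
    }
}
```

With the sample hits in main, the window starting at offset 200 (three different terms) is chosen over the one at offset 0 (the same term three times), which is the breadth preference described in the list. A real fragmenter would also need sentence-boundary handling, surrounding context, and a cap for queries with very many terms; none of that is attempted here.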

> another highlighter
> -------------------
>
>                 Key: LUCENE-1522
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1522
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: contrib/highlighter
>            Reporter: Koji Sekiguchi
>            Assignee: Michael McCandless
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: colored-tag-sample.png, LUCENE-1522.patch, LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token streams
> (a general token stream, e.g. WhitespaceTokenizer, is also supported; see the
> test code in the patch). The idea was inherited from my previous project with my
> colleague and LUCENE-644. This approach needs highlight fields to be
> TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This
> depends on LUCENE-1448 to get refined term offsets.
>
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc, "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
>
> features:
> - fast for large docs
> - supports not only whitespace-based token streams, but also "fixed size"
>   N-grams (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses query boost to score fragments (currently doesn't look at idf, but it
>   should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
>
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org