[
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12682987#action_12682987
]
Michael McCandless commented on LUCENE-1522:
--------------------------------------------
OK to sum up here with observations / wish list / ideas /
controversies / etc. for Lucene's future merged highlighter:
* Fragmenter should aim for fast "eye + brain scanning
consumability" (eg, try hard to start on sentence boundaries,
include context)
* Let's try for single source -- each Query/Weight/Scorer should be
able to enumerate the set of term positions/spans that caused it
to match a specific doc (like explain(), but provides
positions/spans detailing the match). Trying to "reverse
engineer" the matching is brittle
* Sliding window is better than static "top down" fragmentation
* To scale, we should make a simple IndexReader impl on top of term
vectors, but still allow the "re-index single doc on the fly"
option
* Favoring breadth (more unique terms instead of many occurrences of
certain terms) seems important, except for too-many-term queries
where this gets unwieldy
* Prefer a single fragment if it scores well enough, but fall back
to several, if necessary, to show "breadth"
* Produce structured output so non-HTML front ends (eg Flex) can
render
* Try to include "context around the hits", when possible (eg the
"favor middle of the sentence" that Michael described)
* Maybe let IDF affect fragment scoring, maybe not (still open)
* Performance is important -- use TermVectors if present, add early
termination if you've already found a good enough fragdoc, etc.
* Maybe a tree-based fragdoc enumeration / searching model; I think
this'd be even more efficient than sliding window, especially for
large docs
* Multi-color, HeatMap default ootb HTML UIs are nice
* It's all very subjective and quite a good challenge!!
In the meantime, it seems like we should commit this H2 and give users
the choice? We can then iterate over time on our wish list....
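
The sliding-window and breadth items in the wish list above can be
sketched roughly as follows. This is only an illustration under stated
assumptions, not the H2 patch and not an existing Lucene API: the names
Hit, Fragment and bestFragment are hypothetical, hits are assumed to
arrive sorted by start offset (as term vectors with offsets could
provide them), and the 1.0 / 0.1 weights that favor unique terms over
repeated ones are made up for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class SlidingWindowSketch {

    /** Hypothetical hit: one query-term occurrence with character offsets. */
    static final class Hit {
        final String term;
        final int start;
        final int end;
        Hit(String term, int start, int end) {
            this.term = term;
            this.start = start;
            this.end = end;
        }
    }

    /** Hypothetical result: fragment bounds plus the counts used for scoring. */
    static final class Fragment {
        final int start;
        final int end;
        final int distinctTerms;
        final int hitCount;
        Fragment(int start, int end, int distinctTerms, int hitCount) {
            this.start = start;
            this.end = end;
            this.distinctTerms = distinctTerms;
            this.hitCount = hitCount;
        }
        // Assumed weighting: each unique term counts 1.0, each occurrence 0.1,
        // so breadth dominates repetition.
        double score() {
            return distinctTerms + 0.1 * hitCount;
        }
    }

    /** Slides a window of at most fragCharSize chars over hits sorted by start. */
    static Fragment bestFragment(List<Hit> hits, int textLength, int fragCharSize) {
        Fragment best = null;
        Map<String, Integer> counts = new HashMap<>();
        int left = 0;
        for (int right = 0; right < hits.size(); right++) {
            Hit last = hits.get(right);
            counts.merge(last.term, 1, Integer::sum);
            // Drop hits from the left until the window fits again.
            while (left < right && last.end - hits.get(left).start > fragCharSize) {
                String dropped = hits.get(left).term;
                if (counts.merge(dropped, -1, Integer::sum) == 0) {
                    counts.remove(dropped);
                }
                left++;
            }
            Hit first = hits.get(left);
            // Center the hits so the fragment carries context on both sides.
            int span = last.end - first.start;
            int pad = Math.max(0, (fragCharSize - span) / 2);
            int start = Math.max(0, first.start - pad);
            int end = Math.min(textLength, Math.max(start + fragCharSize, last.end));
            Fragment candidate = new Fragment(start, end, counts.size(), right - left + 1);
            // Strict comparison keeps the earliest fragment on ties.
            if (best == null || candidate.score() > best.score()) {
                best = candidate;
            }
        }
        return best; // null when there are no hits
    }

    public static void main(String[] args) {
        List<Hit> hits = Arrays.asList(
            new Hit("lucene", 0, 6), new Hit("lucene", 10, 16),
            new Hit("lucene", 20, 26), new Hit("lucene", 100, 106),
            new Hit("highlighter", 110, 121));
        Fragment f = bestFragment(hits, 200, 40);
        System.out.println(f.start + "-" + f.end + " distinct=" + f.distinctTerms);
    }
}
```

In the sample run, the window holding one "lucene" and one
"highlighter" hit beats the earlier window holding three "lucene" hits,
which is the breadth preference described above. Sentence-boundary
snapping, early termination and IDF are deliberately left out.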
> another highlighter
> -------------------
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/highlighter
> Reporter: Koji Sekiguchi
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch,
> LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token streams
> (general token streams, e.g. WhitespaceTokenizer, are also supported; see the
> test code in the patch). The idea was inherited from my previous project with my
> colleague and LUCENE-644. This approach needs highlight fields to be
> TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This
> depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
>       "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size"
> N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - uses query boost to score fragments (currently ignores idf, but
> supporting it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers