[
https://issues.apache.org/jira/browse/LUCENE-1522?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12688419#action_12688419
]
Mark Miller edited comment on LUCENE-1522 at 3/23/09 2:12 PM:
--------------------------------------------------------------
I think you are reading more into that than I see - that guy is just frustrated
that PhraseQueries don't highlight correctly. That was/is a common occurrence
and you can find tons of examples. There are one or two JIRA highlighters that
address it, and then there is the Span highlighter (more interestingly, there is
a link to the birth of the Span highlighter idea on that page - thanks M.
Harwood).
When users see the PhraseQuery look right, I haven't seen any other repeated
complaints really. While it would be nice to match boolean logic fully, I
almost don't think it's worth the effort. You likely have an interest in those
terms anyway - it's not a given that the terms that caused the match (non
positional) matter. I have not seen a complaint on that one - mostly just
positional type stuff. And I think we have positional solved fairly well with
the current API - it's just too darn slow. Not that I am against things being
sweet and perfect, and getting exact matches, but there has been lots of talk
in the past about integrating the highlighter into core and making things
really fast and efficient - and generally it comes down to what work actually
gets done (and all this stuff ends up at the hard end of the pool).
When I wrote the SpanScorer, many times it was discussed how things should
*really* be done. Most methods involved working with core - but what has been
there for a couple years now is the SpanScorer that plugs into the current
highlighter API and nothing else has made any progress. Not really an argument,
just kind of thinking out loud at this point...
I'm all for improving the speed and accuracy of the highlighter at the end of
the day, but it's a tall order considering how little attention the Highlighter
has managed to receive in the past. It's large on ideas and low on sweat.
*edit*
A lot of the sweat that is given has been fragmented by the 3 or 4 alternate
highlighters.
> another highlighter
> -------------------
>
> Key: LUCENE-1522
> URL: https://issues.apache.org/jira/browse/LUCENE-1522
> Project: Lucene - Java
> Issue Type: Improvement
> Components: contrib/highlighter
> Reporter: Koji Sekiguchi
> Assignee: Michael McCandless
> Priority: Minor
> Fix For: 2.9
>
> Attachments: colored-tag-sample.png, LUCENE-1522.patch,
> LUCENE-1522.patch
>
>
> I've written this highlighter for my project to support bi-gram token streams
> (general token streams, e.g. WhitespaceTokenizer, are also supported; see the
> test code in the patch). The idea was inherited from my previous project with my
> colleague and LUCENE-644. This approach needs highlight fields to be
> TermVector.WITH_POSITIONS_OFFSETS, but is fast and can support N-grams. This
> depends on LUCENE-1448 to get refined term offsets.
> usage:
> {code:java}
> TopDocs docs = searcher.search( query, 10 );
> Highlighter h = new Highlighter();
> FieldQuery fq = h.getFieldQuery( query );
> for( ScoreDoc scoreDoc : docs.scoreDocs ){
>   // fieldName="content", fragCharSize=100, numFragments=3
>   String[] fragments = h.getBestFragments( fq, reader, scoreDoc.doc,
>       "content", 100, 3 );
>   if( fragments != null ){
>     for( String fragment : fragments )
>       System.out.println( fragment );
>   }
> }
> {code}
> features:
> - fast for large docs
> - supports not only whitespace-based token stream, but also "fixed size"
> N-gram (e.g. (2,2), not (1,3)) (can solve LUCENE-1489)
> - supports PhraseQuery, phrase-unit highlighting with slops
> {noformat}
> q="w1 w2"
> <b>w1 w2</b>
> ---------------
> q="w1 w2"~1
> <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
> {noformat}
> - highlight fields need to be TermVector.WITH_POSITIONS_OFFSETS
> - easy to apply patch due to independent package (contrib/highlighter2)
> - uses Java 1.5
> - takes query boost into account to score fragments (currently ignores idf,
> but using it should be possible)
> - pluggable FragListBuilder
> - pluggable FragmentsBuilder
> to do:
> - term positions can be unnecessary when phraseHighlight==false
> - collect performance numbers
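
For anyone trying to picture the slop example in the description above
(q="w1 w2"~1), here is a rough, standalone sketch of position-based phrase
highlighting. It is not code from the patch and does not use any Lucene API:
the class SlopPhraseSketch and its highlight() method are made up for
illustration, tokens are split on whitespace, only ordered two-term phrases are
handled, and "slop" is simplified to the number of tokens allowed between the
two terms (Lucene's real slop is an edit distance over positions).

```java
// Hypothetical illustration only; not part of LUCENE-1522 or any Lucene API.
final class SlopPhraseSketch {

    // Highlights an ordered two-term phrase. Simplifying assumption: "slop" is
    // the maximum number of tokens allowed between first and second.
    static String highlight(String text, String first, String second, int slop) {
        String[] tokens = text.trim().split("\\s+");
        boolean[] hit = new boolean[tokens.length];

        // For each occurrence of the first term, look for the second term
        // within the allowed window of positions and mark both as hits.
        for (int i = 0; i < tokens.length; i++) {
            if (!tokens[i].equals(first)) {
                continue;
            }
            int last = Math.min(tokens.length - 1, i + 1 + slop);
            for (int j = i + 1; j <= last; j++) {
                if (tokens[j].equals(second)) {
                    hit[i] = true;
                    hit[j] = true;
                    break;
                }
            }
        }

        // Render the text, merging runs of adjacent hits into one <b>...</b>.
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < tokens.length; i++) {
            if (i > 0) {
                out.append(' ');
            }
            boolean opens = hit[i] && (i == 0 || !hit[i - 1]);
            boolean closes = hit[i] && (i == tokens.length - 1 || !hit[i + 1]);
            if (opens) {
                out.append("<b>");
            }
            out.append(tokens[i]);
            if (closes) {
                out.append("</b>");
            }
        }
        return out.toString();
    }

    public static void main(String[] args) {
        // prints: <b>w1 w2</b>
        System.out.println(highlight("w1 w2", "w1", "w2", 0));
        // prints: <b>w1</b> w3 <b>w2</b> w3 <b>w1 w2</b>
        System.out.println(highlight("w1 w3 w2 w3 w1 w2", "w1", "w2", 1));
    }
}
```

The point of the sketch is only that highlighting is decided per phrase match
by position, so a lone w1 or w2 outside the slop window stays unhighlighted.
The patch itself reads positions and offsets from term vectors rather than
re-tokenizing the text.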