Sebastian Lutze created LUCENE-4133:
---------------------------------------
Summary: FastVectorHighlighter: A weighted approach for ordered fragments
Key: LUCENE-4133
URL: https://issues.apache.org/jira/browse/LUCENE-4133
Project: Lucene - Java
Issue Type: Improvement
Components: modules/highlighter
Affects Versions: 4.0, 5.0
Reporter: Sebastian Lutze
Priority: Minor
Fix For: 4.0
Attachments: LUCENE-4133.patch
The FastVectorHighlighter currently disregards IDF weights for matching terms
within generated fragments. In the worst case, a fragment that contains a high
number of very common words is scored higher than a fragment that contains
*all* of the terms used in the original query.
This patch provides ordered fragments with IDF-weighted terms:
*For each distinct matching term per fragment:*
_weight = weight + IDF * boost_
*For each fragment:*
_weight = weight * numTerms * 1 / sqrt( numTerms )_
|weight|total weight of fragment|
|IDF|inverse document frequency for each distinct matching term|
|boost|query boost as provided, for example _term^2_|
|numTerms|total number of matching terms per fragment|
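To make the effect of the two formulas concrete, here is a minimal, self-contained sketch of the weighting. It is not the patch code: the class FragmentWeightSketch, the Match type, and all IDF values are hypothetical and chosen only for illustration, and the guard for an empty fragment is an addition of the sketch.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Illustration only: names and numbers are assumptions, not Lucene API.
class FragmentWeightSketch {

    /** Hypothetical stand-in for one matching term occurrence in a fragment. */
    static final class Match {
        final String term;
        final double idf;
        final double boost;

        Match(String term, double idf, double boost) {
            this.term = term;
            this.idf = idf;
            this.boost = boost;
        }
    }

    /** Applies the two formulas above to the matches of a single fragment. */
    static double weight(List<Match> matches) {
        Set<String> distinctTerms = new HashSet<>();
        double weight = 0;
        int numTerms = 0;
        for (Match m : matches) {
            // IDF * boost is added once per distinct term
            if (distinctTerms.add(m.term)) {
                weight += m.idf * m.boost;
            }
            // every occurrence counts towards numTerms
            numTerms++;
        }
        if (numTerms == 0) {
            return 0; // sketch-only guard: 0 * (1 / sqrt(0)) would be NaN
        }
        return weight * numTerms * (1 / Math.sqrt(numTerms));
    }

    public static void main(String[] args) {
        // Fragment A: four occurrences of one very common term (low IDF).
        List<Match> commonOnly = Arrays.asList(
                new Match("the", 0.5, 1.0), new Match("the", 0.5, 1.0),
                new Match("the", 0.5, 1.0), new Match("the", 0.5, 1.0));
        // Fragment B: two distinct rare terms (high IDF).
        List<Match> rareTerms = Arrays.asList(
                new Match("lucene", 3.0, 1.0), new Match("highlighter", 3.0, 1.0));

        System.out.println("common-word fragment: " + weight(commonOnly));
        System.out.println("rare-term fragment:   " + weight(rareTerms));
    }
}
```

With these made-up values the common-word fragment gets weight 1.0 and the rare-term fragment about 8.49, so the fragment covering the rarer query terms is ranked first even though it has fewer matches. Note that _numTerms * 1 / sqrt( numTerms )_ equals _sqrt( numTerms )_, so additional matches still raise the weight, but only sublinearly.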
*Method:*
{code:java}
public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {
  float totalBoost = 0;
  List<SubInfo> subInfos = new ArrayList<SubInfo>();
  HashSet<String> distinctTerms = new HashSet<String>();
  int length = 0;

  for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
    subInfos.add( new SubInfo( phraseInfo.getText(),
        phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
    for ( TermInfo ti : phraseInfo.getTermsInfos()) {
      // weight = weight + IDF * boost, once per distinct term
      if ( distinctTerms.add( ti.getText() ) )
        totalBoost += ti.getWeight() * phraseInfo.getBoost();
      // numTerms counts every matching term, distinct or not
      length++;
    }
  }

  // weight = weight * numTerms * 1 / sqrt( numTerms )
  totalBoost *= length * ( 1 / Math.sqrt( length ) );
  getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos,
      totalBoost ) );
}
{code}
The ranking formula should be the same as, or at least similar to, the one used
in QueryTermScorer.
*This patch contains:*
* a changed class-member in FieldPhraseList (termInfos to termsInfos)
* a changed local variable in SimpleFieldFragList (score to totalBoost)
* a missing @Override added in SimpleFragListBuilder
* class WeightedFieldFragList, an implementation of FieldFragList
* class WeightedFragListBuilder, an implementation of BaseFragListBuilder
* class WeightedFragListBuilderTest, a simple test-case
* updated docs for FVH
Last part (see also LUCENE-4091, LUCENE-4107, LUCENE-4113) of LUCENE-3440.