Sebastian Lutze created LUCENE-4133:
---------------------------------------

             Summary: FastVectorHighlighter: A weighted approach for ordered fragments
                 Key: LUCENE-4133
                 URL: https://issues.apache.org/jira/browse/LUCENE-4133
             Project: Lucene - Java
          Issue Type: Improvement
          Components: modules/highlighter
    Affects Versions: 4.0, 5.0
            Reporter: Sebastian Lutze
            Priority: Minor
             Fix For: 4.0
         Attachments: LUCENE-4133.patch

The FastVectorHighlighter currently disregards IDF weights for matching terms within generated fragments. In the worst case, a fragment that contains a high number of very common words is scored higher than a fragment that contains *all* of the terms used in the original query.

This patch provides ordered fragments with IDF-weighted terms:

*For each distinct matching term per fragment:*
_weight = weight + IDF * boost_

*For each fragment:*
_weight = weight * numTerms * 1 / sqrt( numTerms )_

|weight|total weight of fragment|
|IDF|inverse document frequency for each distinct matching term|
|boost|query boost as provided, for example _term^2_|
|numTerms|total number of matching terms per fragment|

*Method:*

{code:java}
public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {
  float totalBoost = 0;
  List<SubInfo> subInfos = new ArrayList<SubInfo>();
  HashSet<String> distinctTerms = new HashSet<String>();
  int length = 0;

  for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
    subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
    for ( TermInfo ti : phraseInfo.getTermsInfos()) {
      // each distinct term contributes its weight (IDF) times the phrase boost once
      if ( distinctTerms.add( ti.getText() ) )
        totalBoost += ti.getWeight() * phraseInfo.getBoost();
      // every matching term is counted, distinct or not
      length++;
    }
  }
  // normalize by the number of matching terms
  totalBoost *= length * ( 1 / Math.sqrt( length ) );
  getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
}
{code}

The ranking formula should be the same as, or at least similar to, the one used in QueryTermScorer.

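To make the arithmetic of the two formulas easier to follow, here is a minimal standalone sketch that does not depend on any Lucene classes. The names FragmentWeightSketch, MatchedTerm and fragmentWeight are hypothetical and not part of the patch; the IDF values are assumed to be supplied by the caller, the boost is attached to each term rather than to a phrase, and the guard for an empty fragment is an addition of the sketch.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

/** Standalone illustration of the proposed fragment weighting; not part of the patch. */
final class FragmentWeightSketch {

    /** Hypothetical stand-in for one matching term occurrence inside a fragment. */
    static final class MatchedTerm {
        final String text;
        final float idf;    // assumed to be precomputed elsewhere
        final float boost;  // query boost, e.g. 2.0f for term^2

        MatchedTerm(String text, float idf, float boost) {
            this.text = text;
            this.idf = idf;
            this.boost = boost;
        }
    }

    /** Sums IDF * boost over distinct terms, then scales by numTerms * 1 / sqrt(numTerms). */
    static float fragmentWeight(List<MatchedTerm> terms) {
        Set<String> distinctTerms = new HashSet<>();
        float weight = 0f;
        int numTerms = 0;

        for (MatchedTerm term : terms) {
            // Only the first occurrence of a term adds to the weight.
            if (distinctTerms.add(term.text)) {
                weight += term.idf * term.boost;
            }
            // Every occurrence counts towards numTerms.
            numTerms++;
        }

        if (numTerms == 0) {
            // Guard added for this sketch: 0 * (1 / sqrt(0)) would otherwise yield NaN.
            return 0f;
        }

        weight *= numTerms * (1 / Math.sqrt(numTerms));
        return weight;
    }
}
```

For a fragment with the occurrences lucene (IDF 2.0, boost 1.0) twice, highlighter (IDF 3.0, boost 2.0) once and fast (IDF 1.0, boost 1.0) once, the distinct-term sum is 2 + 6 + 1 = 9 and numTerms is 4, so the fragment weight is 9 * 4 * 1 / sqrt(4) = 18. Note that numTerms * 1 / sqrt( numTerms ) is simply sqrt( numTerms ), so repeated matches raise the weight sublinearly.
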
*This patch contains:*
* a renamed class member in FieldPhraseList (termInfos to termsInfos)
* a renamed local variable in SimpleFieldFragList (score to totalBoost)
* a missing @Override added in SimpleFragListBuilder
* class WeightedFieldFragList, an implementation of FieldFragList
* class WeightedFragListBuilder, an implementation of BaseFragListBuilder
* class WeightedFragListBuilderTest, a simple test case
* updated docs for FVH

Last part (see also LUCENE-4091, LUCENE-4107, LUCENE-4113) of LUCENE-3440.