[ https://issues.apache.org/jira/browse/LUCENE-4133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13293461#comment-13293461 ]
Sebastian Lutze commented on LUCENE-4133:
-----------------------------------------

Hi Koji,

bq. I changed 0.26632088 to 0.86791086 in WeightedFragListBuilderTest and use prettify in package.html in the patch.

Oops, yes, sure, that's perfectly fine with me. I forgot to update the test when I removed the Math.pow from the formula. Sloppy me!

> FastVectorHighlighter: A weighted approach for ordered fragments
> ----------------------------------------------------------------
>
> Key: LUCENE-4133
> URL: https://issues.apache.org/jira/browse/LUCENE-4133
> Project: Lucene - Java
> Issue Type: Improvement
> Components: modules/highlighter
> Affects Versions: 4.0, 5.0
> Reporter: Sebastian Lutze
> Assignee: Koji Sekiguchi
> Priority: Minor
> Labels: FastVectorHighlighter
> Fix For: 4.0
>
> Attachments: LUCENE-4133.patch, LUCENE-4133.patch
>
>
> The FastVectorHighlighter currently disregards IDF weights for matching terms
> within generated fragments. In the worst case, a fragment that contains a
> high number of very common words is scored higher than a fragment that
> contains *all* of the terms used in the original query.
> This patch provides ordered fragments with IDF-weighted terms:
>
> *For each distinct matching term per fragment:*
> _weight = weight + IDF * boost_
>
> *For each fragment:*
> _weight = weight * length * 1 / sqrt( length )_
>
> |weight| total weight of fragment
> |IDF| inverse document frequency for each distinct matching term
> |boost| query boost as provided, for example _term^2_
> |length| total number of non-distinct matching terms per fragment
>
> *Method:*
> {code:java}
> public void add( int startOffset, int endOffset, List<WeightedPhraseInfo> phraseInfoList ) {
>
>   float totalBoost = 0;
>
>   List<SubInfo> subInfos = new ArrayList<SubInfo>();
>   HashSet<String> distinctTerms = new HashSet<String>();
>
>   int length = 0;
>   for( WeightedPhraseInfo phraseInfo : phraseInfoList ){
>     subInfos.add( new SubInfo( phraseInfo.getText(), phraseInfo.getTermsOffsets(), phraseInfo.getSeqnum() ) );
>     for ( TermInfo ti : phraseInfo.getTermsInfos()) {
>       if ( distinctTerms.add( ti.getText() ) )
>         totalBoost += ti.getWeight() * phraseInfo.getBoost();
>       length++;
>     }
>   }
>   totalBoost *= length * ( 1 / Math.sqrt( length ) );
>
>   getFragInfos().add( new WeightedFragInfo( startOffset, endOffset, subInfos, totalBoost ) );
> }
> {code}
>
> The ranking formula should be the same as, or at least similar to, the one
> used in QueryTermScorer.
>
> *This patch contains:*
> * a changed class member in FieldPhraseList (termInfos to termsInfos)
> * a changed local variable in SimpleFieldFragList (score to totalBoost)
> * a missing @Override added in SimpleFragListBuilder
> * class WeightedFieldFragList, an implementation of FieldFragList
> * class WeightedFragListBuilder, an implementation of BaseFragListBuilder
> * class WeightedFragListBuilderTest, a simple test case
> * updated docs for FVH
>
> Last part (see also LUCENE-4091, LUCENE-4107, LUCENE-4113) of LUCENE-3440.

--
This message is automatically generated by JIRA.
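
For readers who want to try the arithmetic of the quoted formula without a Lucene checkout, here is a minimal sketch. It is not the patch itself: the class FragmentWeightSketch, the method fragmentWeight, the single per-fragment boost, and the IDF values in main are all hypothetical stand-ins chosen for illustration. Note that length * 1 / sqrt( length ) is simply sqrt( length ).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class FragmentWeightSketch {

    // Hypothetical helper, not part of Lucene. It mirrors the arithmetic of
    // the quoted add() method using plain collections: IDF * boost is summed
    // once per distinct term, while length counts every match.
    public static float fragmentWeight(List<String> matchedTerms,
                                       Map<String, Float> idf,
                                       float boost) {
        Set<String> distinctTerms = new HashSet<>();
        float weight = 0f;
        int length = 0;
        for (String term : matchedTerms) {
            if (distinctTerms.add(term)) {
                // Assumption: terms without a known IDF contribute nothing.
                weight += idf.getOrDefault(term, 0f) * boost;
            }
            length++;
        }
        if (length == 0) {
            return 0f; // avoid dividing by sqrt(0) for an empty fragment
        }
        // Same expression as in the patch; equivalent to weight * sqrt(length).
        weight *= length * (1 / Math.sqrt(length));
        return weight;
    }

    public static void main(String[] args) {
        // Made-up IDF values: one very common word and two rarer query terms.
        Map<String, Float> idf = new HashMap<>();
        idf.put("the", 0.1f);
        idf.put("lucene", 2.0f);
        idf.put("highlighter", 3.0f);

        float common = fragmentWeight(Arrays.asList("the", "the", "the", "the"), idf, 1f);
        float complete = fragmentWeight(Arrays.asList("lucene", "highlighter"), idf, 1f);

        System.out.println("four matches of a common word: " + common);
        System.out.println("both rarer query terms:        " + complete);
    }
}
```

With these made-up values, the fragment containing both rarer terms outranks the fragment that repeats a common word four times, which is the behaviour the issue description asks for.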