[ https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060680#comment-13060680 ]
Robert Muir commented on LUCENE-3220:
-------------------------------------

Hi David: I had some ideas on stats to simplify some of these sims:
# I think we can use an easier way to compute average document length: sumTotalTermFreq() / maxDoc(). This way the average is 'exact' and not skewed by index-time boosts, smallfloat quantization, or anything like that.
# To support pivoted unique normalization like lnu.ltc, I think we can solve this in a similar way: add sumDocFreq(), which is just a single long, and divide it by maxDoc(). This gives us the average number of unique terms per document. I think Terrier might have a similar stat (#postings or #pointers or something)?

So I think this could make for nice simplifications, especially for switching norms completely over to docvalues: we should be able to do #1 right now and change the way we compute avgdoclen for e.g. BM25 and DFR. Then in a separate issue I could revert this norm summation stuff to make the docvalues integration simpler, and open a new issue for sumDocFreq().

> Implement various ranking models as Similarities
> ------------------------------------------------
>
>                 Key: LUCENE-3220
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3220
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/search
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>              Labels: gsoc
>         Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch,
> LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch,
> LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we
> can finally work on implementing the standard ranking models. Currently DFR,
> BM25 and LM are on the menu.
> TODO:
> * {{EasyStats}}: contains all statistics that might be relevant for a ranking algorithm
> * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the DocScorers and as much implementation detail as possible
> * _BM25_: the current "mock" implementation might be OK
> * _LM_
> * _DFR_
> Done:
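
For illustration only, the two averages proposed in the comment can be sketched as plain arithmetic. The class {{CollectionAverages}} and its methods are hypothetical and not part of Lucene or any patch on this issue; the parameters simply stand in for the values that sumTotalTermFreq(), sumDocFreq() and maxDoc() would supply. The argument checks are an assumption of this sketch, not behaviour taken from the issue.

```java
// Hypothetical helper, NOT part of Lucene: it only shows the arithmetic
// behind the two collection-level averages proposed in the comment.
final class CollectionAverages {
    private CollectionAverages() {}

    /** Average document length in tokens: sumTotalTermFreq / maxDoc. */
    static double avgDocLength(long sumTotalTermFreq, int maxDoc) {
        return ratio(sumTotalTermFreq, maxDoc);
    }

    /** Average number of unique terms per document: sumDocFreq / maxDoc. */
    static double avgUniqueTerms(long sumDocFreq, int maxDoc) {
        return ratio(sumDocFreq, maxDoc);
    }

    // Shared division; rejects inputs for which the average is undefined
    // (an assumption of this sketch).
    private static double ratio(long sum, int maxDoc) {
        if (maxDoc <= 0) {
            throw new IllegalArgumentException("maxDoc must be positive: " + maxDoc);
        }
        if (sum < 0) {
            throw new IllegalArgumentException("sum must be non-negative: " + sum);
        }
        // Cast before dividing so the result is not truncated to a long.
        return (double) sum / maxDoc;
    }
}
```

With 1200 term occurrences over 100 documents, avgDocLength(1200L, 100) is 12.0; with 750 postings over the same 100 documents, avgUniqueTerms(750L, 100) is 7.5. Both are computed from exact counters, so neither goes through the quantized norm values.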