[ https://issues.apache.org/jira/browse/LUCENE-2392?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2392:
--------------------------------

    Attachment: LUCENE-2392.patch

Updated patch: I brought the patch to trunk, cleaned it up, and enabled some more of the stats in scoring (e.g. totalTermFreq/sumOfTotalTermFreq).

In src/test I added a MockLMSimilarity that implements "Bayesian smoothing using Dirichlet priors" from http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.136.8113

This one is interesting, as it's faster than Lucene's scoring formula today :)

I want to get some of this stuff in shape for David (or any other GSOC students) to be able to implement their algorithms, but there is a lot of refactoring (e.g. explains) to do.

I'll create a branch under https://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring with this infrastructure in a bit.

Tonight I'll see if I can get the avg doc length stuff in the branch too.

> Enable flexible scoring
> -----------------------
>
>                 Key: LUCENE-2392
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2392
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Michael McCandless
>            Assignee: Michael McCandless
>             Fix For: 4.0
>
>         Attachments: LUCENE-2392.patch, LUCENE-2392.patch, LUCENE-2392.patch, LUCENE-2392_take2.patch
>
>
> This is a first step (nowhere near committable!), implementing the
> design iterated to in the recent "Baby steps towards making Lucene's
> scoring more flexible" java-dev thread.
> The idea is (if you turn it on for your Field; it's off by default) to
> store full stats in the index, into a new _X.sts file, per doc (X
> field) in the index.
> And then have FieldSimilarityProvider impls that compute doc's boost
> bytes (norms) from these stats.
> The patch is able to index the stats, merge them when segments are
> merged, and provides an iterator-only API. It also has a starting point
> for per-field Sims that use the stats iterator API to compute boost
> bytes. But it's not at all tied into actual searching! There's still
> tons left to do, eg, how does one configure via Field/FieldType which
> stats one wants indexed.
> All tests pass, and I added one new TestStats unit test.
> The stats I record now are:
>   - field's boost
>   - field's unique term count (a b c a a b --> 3)
>   - field's total term count (a b c a a b --> 6)
>   - total term count per-term (sum of total term count for all docs
>     that have this term)
> Still need at least the total term count for each field.
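
For readers following along, here is a minimal sketch of how stats like the ones listed in the issue could feed the Dirichlet-prior smoothing mentioned in the comment. It is not the MockLMSimilarity from the patch and uses no Lucene APIs. The class name, the method names, and the value of mu are all hypothetical. It also assumes that totalTermFreq means the number of occurrences of a term across all docs, and that sumOfTotalTermFreq is the sum of that value over all terms in the field. The formula used is the usual one, p(w|d) = (tf + mu * p(w|C)) / (|d| + mu).

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;

/** Toy illustration only: plain Java, no Lucene classes, names are hypothetical. */
class DirichletSketch {

    /** Distinct terms in one doc's field, e.g. "a b c a a b" has 3. */
    static int uniqueTermCount(String[] terms) {
        return new HashSet<>(Arrays.asList(terms)).size();
    }

    /** Term occurrences in one doc's field, e.g. "a b c a a b" has 6. */
    static int totalTermCount(String[] terms) {
        return terms.length;
    }

    /** Occurrences of one term inside one doc. */
    static long termFreq(String[] terms, String term) {
        long tf = 0;
        for (String t : terms) {
            if (t.equals(term)) {
                tf++;
            }
        }
        return tf;
    }

    /** Occurrences of each term across all docs (assumed meaning of totalTermFreq). */
    static Map<String, Long> totalTermFreq(List<String[]> docs) {
        Map<String, Long> freq = new HashMap<>();
        for (String[] doc : docs) {
            for (String t : doc) {
                freq.merge(t, 1L, Long::sum);
            }
        }
        return freq;
    }

    /** log p(term | doc) under Dirichlet smoothing with prior weight mu. */
    static double logProb(long tf, long docLen, long termTotalFreq,
                          long sumTotalTermFreq, double mu) {
        // Collection language model: how likely the term is in the whole field.
        double collectionProb = (double) termTotalFreq / sumTotalTermFreq;
        return Math.log((tf + mu * collectionProb) / (docLen + mu));
    }

    public static void main(String[] args) {
        List<String[]> docs = Arrays.asList(
                "a b c a a b".split(" "),
                "b c d d".split(" "));

        Map<String, Long> ttf = totalTermFreq(docs);
        long sumTtf = 0;
        for (long v : ttf.values()) {
            sumTtf += v;
        }

        double mu = 2000.0; // assumption: a commonly used value, not taken from the patch
        String queryTerm = "a";
        for (int i = 0; i < docs.size(); i++) {
            String[] doc = docs.get(i);
            double score = logProb(termFreq(doc, queryTerm), totalTermCount(doc),
                    ttf.get(queryTerm), sumTtf, mu);
            System.out.println("doc " + i + ": unique=" + uniqueTermCount(doc)
                    + " total=" + totalTermCount(doc) + " score=" + score);
        }
    }
}
```

Note that the per-document part of the score only needs tf and the doc's total term count, while the collection part only needs the two index-wide totals, so everything can be computed from stored stats without rescanning documents.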