[ https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786135#action_12786135 ]
Michael McCandless commented on LUCENE-2091:
--------------------------------------------

bq. I think that should be possible to implement BM25F for it

Ahh OK I just misunderstood -- BM25 can score PhraseQuery; it's just that the current patch doesn't implement that.

bq. 1. docFreq at document level, something like "int docFreq(term, doc_id)" and return the number of documents where term occurs, but if it is not possible a catch-all field will be enough.

OK, catch all seems like an OK starting point.

I wonder if we could enable storing terms dict but not postings... then we could store catch all just for the terms stats, so we wouldn't waste disk space. Though merging gets tricky, since we'd have to walk postings for all fields (or at least all involved in BM25F) in parallel, re-computing the catch-all stats.

bq. 2. The Collection Average Document Length and Collection Average Field Length (per each field).

Lucene doesn't store/compute this today... we can easily compute these stats for newly created segments, and record in the segments file, but then recomputing them during segment merging with deletions gets tricky. We could just take the linear approximate avg with deletions, but that may end up being too approximate, so we could instead make a dedicated posting list, which would be properly merged, but we'd then have to re-walk to compute the stats for the newly merged segment.
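As a rough illustration of the "linear approximate avg with deletions" idea, here is a minimal sketch. It assumes each segment records only the summed field length and doc counts; the names SegmentStats and mergeApprox are hypothetical and are not part of Lucene or of the attached patch. The merge assumes deleted docs had the segment's average length, which is exactly where the approximation can drift.

```java
import java.util.Arrays;
import java.util.List;

final class AvgFieldLengthSketch {

    /** Hypothetical per-segment stats for one field; not a Lucene class. */
    static final class SegmentStats {
        final long totalFieldLength; // summed field length over all docs, deleted ones included
        final int maxDoc;            // number of docs in the segment, deleted ones included
        final int deletedDocs;       // number of docs marked deleted

        SegmentStats(long totalFieldLength, int maxDoc, int deletedDocs) {
            this.totalFieldLength = totalFieldLength;
            this.maxDoc = maxDoc;
            this.deletedDocs = deletedDocs;
        }

        int liveDocs() {
            return maxDoc - deletedDocs;
        }

        double avgFieldLength() {
            return maxDoc == 0 ? 0.0 : (double) totalFieldLength / maxDoc;
        }
    }

    /**
     * Linear approximation: we do not know the true lengths of the deleted
     * docs, so each segment contributes avgFieldLength * liveDocs to the
     * merged total. The merged segment has no deletions.
     */
    static SegmentStats mergeApprox(List<SegmentStats> segments) {
        double estimatedTotal = 0.0;
        int live = 0;
        for (SegmentStats s : segments) {
            estimatedTotal += s.avgFieldLength() * s.liveDocs();
            live += s.liveDocs();
        }
        return new SegmentStats(Math.round(estimatedTotal), live, 0);
    }

    public static void main(String[] args) {
        SegmentStats a = new SegmentStats(1000L, 10, 2); // avg 100, 8 live docs
        SegmentStats b = new SegmentStats(300L, 6, 0);   // avg 50, 6 live docs
        SegmentStats merged = mergeApprox(Arrays.asList(a, b));
        System.out.println(merged.maxDoc + " docs, avg length " + merged.avgFieldLength());
    }
}
```

If the deleted docs were unusually long or short, the estimate is off, and the error compounds over repeated merges; the dedicated posting list mentioned above would avoid that at the cost of re-walking it during each merge.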
> Add BM25 Scoring to Lucene
> --------------------------
>
>                 Key: LUCENE-2091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2091
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Yuval Feinstein
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime somewhat.
> I would like to contribute the code to Lucene under contrib.