[ 
https://issues.apache.org/jira/browse/LUCENE-2091?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786135#action_12786135
 ] 

Michael McCandless commented on LUCENE-2091:
--------------------------------------------

bq. I think that should be possible to implement BM25F for it 

Ahh OK I just misunderstood -- BM25 can score PhraseQuery; it's just that the 
current patch doesn't implement that.

bq. 1. docFreq at document level, something like "int docFreq(term, doc_id)" 
and return the number of documents where term occurs, but if it is not possible 
a catch-all field will be enough.

OK, a catch-all field seems like an OK starting point.  I wonder if we could 
allow storing the terms dict but not the postings... then we could store the 
catch-all field just for its term stats, so we wouldn't waste disk space.  
Merging gets tricky though, since we'd have to walk the postings for all fields 
(or at least all those involved in BM25F) in parallel, re-computing the 
catch-all stats.
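
To make the "walk postings in parallel" step concrete, here is a rough sketch 
in plain Java.  It is not Lucene API: the class and method names are made up, 
and postings are modelled as sorted int[] doc ID arrays, one per field, for a 
single term.  The document-level docFreq is the size of the union of those 
lists, which is why it can't be derived by just adding up per-field docFreqs.

```java
import java.util.List;

/** Hypothetical helper for illustration only; not part of Lucene. */
final class CatchAllDocFreq {

    /**
     * Counts the distinct doc IDs across the per-field posting lists of one term.
     * Assumes each array is sorted ascending and contains no duplicates.
     */
    static int docFreqAcrossFields(List<int[]> postingsPerField) {
        int[] pos = new int[postingsPerField.size()]; // one cursor per field
        int count = 0;
        while (true) {
            // Find the smallest doc ID that any field is currently positioned on.
            int min = Integer.MAX_VALUE;
            boolean anyLeft = false;
            for (int f = 0; f < pos.length; f++) {
                int[] postings = postingsPerField.get(f);
                if (pos[f] < postings.length) {
                    anyLeft = true;
                    min = Math.min(min, postings[pos[f]]);
                }
            }
            if (!anyLeft) {
                return count; // every posting list is exhausted
            }
            count++; // this doc contains the term in at least one field
            // Advance every field that is positioned on that doc.
            for (int f = 0; f < pos.length; f++) {
                int[] postings = postingsPerField.get(f);
                if (pos[f] < postings.length && postings[pos[f]] == min) {
                    pos[f]++;
                }
            }
        }
    }
}
```

For example, if the term occurs in docs {1, 3, 5} of one field and docs {3, 4} 
of another, the per-field docFreqs sum to 5 but the catch-all docFreq is 4.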

bq. 2. The Collection Average Document Length and Collection Average Field 
Length (per each field).

Lucene doesn't store or compute these today.  We can easily compute the stats 
for newly created segments and record them in the segments file, but 
recomputing them during segment merging with deletions gets tricky.  We could 
just take a linear approximation of the average when there are deletions, but 
that may end up being too approximate.  Alternatively, we could make a 
dedicated posting list, which would be properly merged, but we'd then have to 
re-walk to compute the stats for the newly merged segment.
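
As a rough illustration of the approximation option, here is one possible 
reading of it in plain Java.  Everything here is hypothetical (the class, its 
fields, and the idea that a segment records the total field length, its doc 
count and its deleted doc count); none of it is existing Lucene code.  The 
sketch assumes deleted docs have the same average field length as the segment 
as a whole, which is exactly where the error would come from.

```java
/** Hypothetical per-segment, per-field length stats; not part of Lucene. */
final class FieldLengthStats {
    final long sumLength;  // total field length over all docs, deleted ones included
    final int maxDoc;      // number of docs in the segment, deleted ones included
    final int deletedDocs; // number of deleted docs in the segment

    FieldLengthStats(long sumLength, int maxDoc, int deletedDocs) {
        this.sumLength = sumLength;
        this.maxDoc = maxDoc;
        this.deletedDocs = deletedDocs;
    }

    /** Average field length as recorded when the segment was written. */
    double average() {
        return maxDoc == 0 ? 0.0 : (double) sumLength / maxDoc;
    }

    /**
     * Approximate stats for a merged segment without re-walking any postings.
     * Assumption: deleted docs are as long as the segment average, so each
     * segment contributes average() * liveDocs to the merged total.
     */
    static FieldLengthStats mergeApprox(FieldLengthStats... segments) {
        double sum = 0.0;
        int liveDocs = 0;
        for (FieldLengthStats s : segments) {
            int live = s.maxDoc - s.deletedDocs;
            sum += s.average() * live;
            liveDocs += live;
        }
        // Deletions are dropped by the merge, so the result has none.
        return new FieldLengthStats(Math.round(sum), liveDocs, 0);
    }
}
```

If the deleted docs were in fact much longer or shorter than average, the 
merged value drifts from the true average, and only re-walking the per-doc 
lengths would correct it.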

> Add BM25 Scoring to Lucene
> --------------------------
>
>                 Key: LUCENE-2091
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2091
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: contrib/*
>            Reporter: Yuval Feinstein
>            Priority: Minor
>             Fix For: 3.1
>
>         Attachments: LUCENE-2091.patch, persianlucene.jpg
>
>   Original Estimate: 48h
>  Remaining Estimate: 48h
>
> http://nlp.uned.es/~jperezi/Lucene-BM25/ describes an implementation of 
> Okapi-BM25 scoring in the Lucene framework,
> as an alternative to the standard Lucene scoring (which is a version of mixed 
> boolean/TFIDF).
> I have refactored this a bit, added unit tests and improved the runtime 
> somewhat.
> I would like to contribute the code to Lucene under contrib. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org
