[ https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13543985#comment-13543985 ]

Tom Burton-West commented on LUCENE-2187:
-----------------------------------------

Hi Robert,

Is this implementation made moot by the new GSoC work, or would it still be
worth testing alongside BM25, DFR, and INF?

I can't seem to find a link to the ORP collections. Can you point me to them?
(I plan to test with our long docs, but thought I would try out some of the ORP
collections as well.)


Tom
                
> improve lucene's similarity algorithm defaults
> ----------------------------------------------
>
>                 Key: LUCENE-2187
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2187
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/query/scoring
>            Reporter: Robert Muir
>             Fix For: 4.2, 5.0
>
>         Attachments: LUCENE-2187.patch, scoring.pdf, scoring.pdf, scoring.pdf
>
>
> First things first: I am not an IR guy. The goal of this issue is to make 
> 'surgical' tweaks to lucene's formula to bring its performance up to that of 
> more modern algorithms such as BM25.
> In my opinion, the concept of having some 'flexible' scoring with good speed 
> across the board is an interesting goal, but not practical in the short term.
> Instead, I propose here incorporating work similar to lnu.ltc and friends, 
> but slightly different. This seems to be in line with the paper published 
> earlier about the TREC Million Query track... 
> Here is what I propose in pseudocode (overriding DefaultSimilarity):
> {code}
>   @Override
>   public float tf(float freq) {
>     return 1 + (float) Math.log(freq);
>   }
>   
>   @Override
>   public float lengthNorm(String fieldName, int numTerms) {
>     return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
>   }
> {code}
> Here slope is a constant (I used 0.25 for all relevance evaluations: the 
> goal is to have a better default), and pivot is the average field length. 
> Obviously we shouldn't make the user provide these values; the system should 
> provide them.
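> As a rough sketch of how this might look as a concrete class (assuming the 
> current Similarity API where lengthNorm(String, int) is overridable; the 
> class name and constructor are only for illustration, since ideally the 
> system would compute the pivot itself):
> {code}
> import org.apache.lucene.search.DefaultSimilarity;
> 
> public class PivotedLengthNormSimilarity extends DefaultSimilarity {
> 
>   /** Slope constant; 0.25 was used for the relevance evaluations here. */
>   private final float slope;
> 
>   /** Pivot = average field length. Passed in for now; ideally computed by the system. */
>   private final float pivot;
> 
>   public PivotedLengthNormSimilarity(float slope, float pivot) {
>     this.slope = slope;
>     this.pivot = pivot;
>   }
> 
>   @Override
>   public float tf(float freq) {
>     // Sublinear tf: 1 + ln(freq) instead of the default sqrt(freq).
>     return 1 + (float) Math.log(freq);
>   }
> 
>   @Override
>   public float lengthNorm(String fieldName, int numTerms) {
>     // Pivoted length normalization: a field at exactly the average length
>     // gets 1/pivot; longer fields are penalized and shorter ones boosted,
>     // with the strength of the adjustment controlled by slope.
>     return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
>   }
> }
> {code}
> Usage would be the usual setSimilarity on the searcher (and at index time, so 
> the norm is encoded consistently), e.g. new PivotedLengthNormSimilarity(0.25f, avgFieldLength).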
> These two pieces do not improve lucene much independently, but together they 
> are competitive with BM25 scoring on the test collections I have run so far. 
> The idea here is that this logarithmic tf normalization is independent of the 
> tf / mean TF component you see in some of these algorithms. In fact, I 
> implemented lnu.ltc with cosine pivoted length normalization and the 
> log(tf)/log(mean TF) weighting, and it did not fare as well as this method. 
> This approach is also simpler: we do not need to calculate the mean TF at all.
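> Just to make the dampening concrete (plain arithmetic, nothing Lucene-specific): 
> the proposed 1 + ln(tf) grows much more slowly than raw tf or the current 
> sqrt(tf) default, and a term's weight depends only on its own frequency:
> {code}
> // Compare raw tf, the current sqrt(tf) default, and the proposed 1 + ln(tf).
> public class TfWeightDemo {
>   public static void main(String[] args) {
>     for (int tf : new int[] {1, 2, 5, 10, 50, 100}) {
>       System.out.printf("tf=%3d  sqrt=%5.2f  1+ln=%5.2f%n",
>           tf, Math.sqrt(tf), 1 + Math.log(tf));
>     }
>     // e.g. tf=100 contributes about 5.6 rather than 10.0 (sqrt) or 100 (raw),
>     // and no document-level mean TF is needed anywhere.
>   }
> }
> {code}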
> The BM25-like "binary" pivot here works better on the test collections I have 
> run, but of course only with the tf modification.
> I am uploading a document with results from 3 test collections (Persian, 
> Hindi, and Indonesian). I will test at least 3 more languages... yes, 
> including English... across more collections and upload those results as 
> well, but I need to process these corpora to run the tests with the benchmark 
> package, so this will take some time (maybe weeks).
> So, please rip it apart with scoring theory etc., but keep in mind that 2 of 
> these 3 test collections are in the openrelevance svn. If you think you have 
> a great idea, don't hesitate to test it and upload results; that is what the 
> collections are for. 
> Also keep in mind, again, that I am not a scoring or IR guy; the only thing I 
> can really bring to the table here is the willingness to do a lot of relevance 
> testing!

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
