[jira] Updated: (LUCENE-2187) improve lucene's similarity algorithm defaults

Robert Muir (JIRA) Sat, 02 Jan 2010 12:18:19 -0800

     [ https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]


Robert Muir updated LUCENE-2187:
--------------------------------

    Attachment: scoring.pdf

Attached a document with some simple results from the 3 collections I have 
tested thus far.

I chose to display simple graphs with descriptions of the collections and some 
of their peculiarities. 

If you want submission.txt dumps or verbose output from trec_eval, I can do 
that too, but I think it's less useful to start with.


> improve lucene's similarity algorithm defaults
> ----------------------------------------------
>
>                 Key: LUCENE-2187
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2187
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Robert Muir
>             Fix For: Flex Branch
>
>         Attachments: scoring.pdf
>
>
> First things first: I am not an IR guy. The goal of this issue is to make 
> 'surgical' tweaks to Lucene's formula to bring its relevance up to that of 
> more modern algorithms such as BM25.
> In my opinion, the concept of having some 'flexible' scoring with good speed 
> across the board is an interesting goal, but not practical in the short term.
> Instead, here I propose incorporating some work similar to lnu.ltc and 
> friends, but slightly different. I noticed this seems to be in line with that 
> paper published before about the TREC Million Query track... 
> Here is what I propose in pseudocode (overriding DefaultSimilarity):
> {code}
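>   // slope: a constant (0.25 in the evaluations); pivot: average field length
>   @Override
>   public float tf(float freq) {
>     // Math.log(0) is -Infinity, so treat a non-positive frequency as 0
>     return freq > 0 ? 1 + (float) Math.log(freq) : 0.0f;
>   }
>   
>   @Override
>   public float lengthNorm(String fieldName, int numTerms) {
>     return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
>   }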
> {code}
> Where slope is a constant (I used 0.25 for all relevance evaluations: the 
> goal is to have a better default), and pivot is the average field length. 
> Obviously we shouldn't make the user provide the pivot; the system should 
> provide it instead.
> These two pieces do not improve Lucene much independently, but together they 
> are competitive with BM25 scoring on the test collections I have run so 
> far. 
> The idea here is that this logarithmic tf normalization is independent of the 
> tf / mean TF that you see in some of these algorithms. In fact, I implemented 
> lnu.ltc with cosine pivoted length normalization and the log(tf)/log(mean TF) 
> weighting, and it did not fare as well as this method. This method is also 
> simpler: we do not need to calculate the mean TF at all.
> The BM25-like "binary" pivot here works better on the test collections I have 
> run, but of course only with the tf modification.
> I am uploading a document with results from 3 test collections (Persian, 
> Hindi, and Indonesian). I will test at least 3 more languages (yes, 
> including English) across more collections and upload those results as well, 
> but I need to process these corpora to run the tests with the benchmark 
> package, so this will take some time (maybe weeks).
> So please rip it apart with scoring theory etc., but keep in mind that 2 of 
> these 3 test collections are in the openrelevance svn, so if you think you 
> have a great idea, don't hesitate to test it and upload results; this is what 
> it is for. 
> Also keep in mind, again, that I am not a scoring or IR guy; the only thing I 
> can really bring to the table here is the willingness to do a lot of relevance 
> testing!
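
For anyone who wants to try the numbers without a Lucene build, the two pieces in the quoted pseudocode can be sketched as a standalone class. This is only a rough illustration, not the patch itself: the class name PivotedSimilaritySketch, the constructor, and the pivot of 100 terms are made up for the example, lengthNorm drops the fieldName argument, and the guard for a non-positive frequency is my assumption. Only the two formulas and the slope of 0.25 come from the description above.

```java
// Standalone sketch of the proposed tf and length normalization.
// Hypothetical names throughout; this does not extend Lucene's DefaultSimilarity.
final class PivotedSimilaritySketch {
    // Slope constant; the issue description reports using 0.25.
    private final float slope;
    // Pivot: the average field length in terms. Assumed to be supplied
    // by the caller here; the issue wants the system to provide it.
    private final float pivot;

    PivotedSimilaritySketch(float slope, float pivot) {
        this.slope = slope;
        this.pivot = pivot;
    }

    // Logarithmic term frequency: 1 + ln(freq).
    // Assumption: a non-positive frequency scores 0 instead of -Infinity/NaN.
    float tf(float freq) {
        return freq > 0 ? 1 + (float) Math.log(freq) : 0.0f;
    }

    // Pivoted length norm: 1 / ((1 - slope) * pivot + slope * numTerms).
    float lengthNorm(int numTerms) {
        return (float) (1.0 / ((1 - slope) * pivot + slope * numTerms));
    }

    public static void main(String[] args) {
        // The pivot of 100 terms is an arbitrary illustrative value.
        PivotedSimilaritySketch sim = new PivotedSimilaritySketch(0.25f, 100f);
        for (int len : new int[] {50, 100, 400}) {
            System.out.println("lengthNorm(" + len + ") = " + sim.lengthNorm(len));
        }
        for (float freq : new float[] {1f, 2f, 10f}) {
            System.out.println("tf(" + freq + ") = " + sim.tf(freq));
        }
    }
}
```

Two things are easy to check by hand: a field whose length equals the pivot gets a norm of exactly 1 / pivot whatever the slope is, and a slope of 0.25 means a field four times longer than average is penalized far less than it would be under a plain 1 / numTerms norm.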

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

