[jira] Created: (LUCENE-2187) improve lucene's similarity algorithm defaults

Robert Muir (JIRA) Sat, 02 Jan 2010 12:16:19 -0800

improve lucene's similarity algorithm defaults
----------------------------------------------


                 Key: LUCENE-2187
                 URL: https://issues.apache.org/jira/browse/LUCENE-2187
             Project: Lucene - Java
          Issue Type: Improvement
          Components: Query/Scoring
            Reporter: Robert Muir
             Fix For: Flex Branch


First things first: I am not an IR guy. The goal of this issue is to make 
'surgical' tweaks to Lucene's formula to bring its relevance up to that of 
more modern algorithms such as BM25.

In my opinion, the concept of having some 'flexible' scoring with good speed 
across the board is an interesting goal, but not practical in the short term.

Instead, I propose incorporating some work similar to lnu.ltc and friends, 
but slightly different. This seems to be in line with the paper published 
earlier about the TREC Million Query Track... 

Here is what I propose in pseudocode (overriding DefaultSimilarity):

{code}
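  // slope: a constant; pivot: the average field length (both described below).
  @Override
  public float tf(float freq) {
    // Logarithmic tf; guard freq <= 0 so Math.log never yields -Infinity.
    return freq > 0 ? 1 + (float) Math.log(freq) : 0;
  }
  
  @Override
  public float lengthNorm(String fieldName, int numTerms) {
    // Pivoted length normalization: equals 1 / pivot when numTerms == pivot.
    return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
  }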
{code}

Here slope is a constant (I used 0.25 for all relevance evaluations: the goal 
is to have a better default), and pivot is the average field length. Obviously 
we shouldn't make the user provide the pivot; the system should compute it.
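
A minimal, self-contained sketch of the two pieces, assuming the pivot is 
computed as the plain mean of per-document field lengths (the issue leaves open 
how the system would supply it). The class and helper names 
(PivotedNormSketch, averageLength) are hypothetical and not part of Lucene:

```java
public final class PivotedNormSketch {
  // Slope value used for the relevance evaluations described in this issue.
  static final float SLOPE = 0.25f;

  // Hypothetical helper: pivot = average field length over the collection.
  static float averageLength(int[] fieldLengths) {
    if (fieldLengths.length == 0) {
      throw new IllegalArgumentException("need at least one field length");
    }
    long total = 0;
    for (int len : fieldLengths) {
      total += len;
    }
    return (float) total / fieldLengths.length;
  }

  // Logarithmic tf; returns 0 for absent terms so log(0) is never taken.
  static float tf(float freq) {
    return freq > 0 ? 1 + (float) Math.log(freq) : 0;
  }

  // Pivoted length norm: fields longer than the pivot get a smaller norm,
  // shorter fields a larger one, and the slope controls how steeply.
  static float lengthNorm(int numTerms, float pivot) {
    return (float) (1 / ((1 - SLOPE) * pivot + SLOPE * numTerms));
  }

  public static void main(String[] args) {
    int[] lengths = {50, 100, 150};
    float pivot = averageLength(lengths);
    for (int len : lengths) {
      System.out.println(len + " terms -> norm " + lengthNorm(len, pivot));
    }
  }
}
```

With a slope of 0.25 the norm changes only gently with length: a field three 
times longer than another is penalized by much less than a factor of three.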

These two pieces do not improve Lucene much independently, but together they 
are competitive with BM25 scoring on the test collections I have run so far. 

The idea here is that this logarithmic tf normalization is independent of the 
tf / mean TF that you see in some of these algorithms. In fact, I implemented 
lnu.ltc with cosine pivoted length normalization and the log(tf)/log(mean TF) 
stuff, and it did not fare as well as this method. This method is also simpler: 
we do not need to calculate the mean TF at all.
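
To illustrate the difference, here is a rough sketch of the two tf variants. 
The mean-TF form below, (1 + log tf) / (1 + log meanTf), is my assumption of 
what the "log(tf)/log(mean TF) stuff" refers to (the usual lnu-style weighting), 
and the class and method names are hypothetical:

```java
public final class TfVariantsSketch {
  // tf as proposed in this issue: depends only on the term's own frequency.
  static double logTf(double freq) {
    return freq > 0 ? 1 + Math.log(freq) : 0;
  }

  // Assumed lnu-style variant: also needs the mean term frequency of the
  // document (meanTf >= 1), which must be computed and stored per document.
  static double meanNormalizedTf(double freq, double meanTf) {
    return freq > 0 ? (1 + Math.log(freq)) / (1 + Math.log(meanTf)) : 0;
  }

  public static void main(String[] args) {
    double meanTf = 2.0; // illustrative value only
    for (double freq : new double[] {1, 2, 4, 8}) {
      System.out.println("freq=" + freq
          + " logTf=" + logTf(freq)
          + " meanNormalizedTf=" + meanNormalizedTf(freq, meanTf));
    }
  }
}
```

The first variant needs no per-document statistic beyond the term frequency 
itself, which is the simplification described above.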

The BM25-like "binary" pivot here works better on the test collections I have 
run, but of course only with the tf modification.
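
To make the "BM25-like" remark concrete: factoring pivot out of the denominator 
gives pivot * ((1 - slope) + slope * numTerms / pivot), which is the length 
factor of textbook BM25 with slope in the role of BM25's b parameter. The 
sketch below checks that identity numerically; the names are illustrative, and 
the BM25 form is the standard one rather than anything defined in this issue:

```java
public final class PivotVsBm25Sketch {
  // Denominator of the lengthNorm proposed in this issue.
  static double pivotedDenominator(double slope, double pivot, int numTerms) {
    return (1 - slope) * pivot + slope * numTerms;
  }

  // Standard BM25 length factor: (1 - b) + b * dl / avgdl.
  static double bm25LengthFactor(double b, double avgdl, int dl) {
    return (1 - b) + b * dl / avgdl;
  }

  public static void main(String[] args) {
    double slope = 0.25;
    double pivot = 120.0; // illustrative average field length
    for (int numTerms : new int[] {30, 120, 480}) {
      double lhs = pivotedDenominator(slope, pivot, numTerms);
      double rhs = pivot * bm25LengthFactor(slope, pivot, numTerms);
      System.out.println(numTerms + " terms: " + lhs + " vs " + rhs);
    }
  }
}
```

Since pivot is the same for every document in a field, the two length factors 
differ only by a constant and rank documents identically.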

I am uploading a document with results from 3 test collections (Persian, Hindi, 
and Indonesian). I will test at least 3 more languages (yes, including 
English) across more collections and upload those results too, but I need to 
process these corpora to run the tests with the benchmark package, so this will 
take some time (maybe weeks).

So, please rip it apart with scoring theory etc., but keep in mind that 2 of 
these 3 test collections are in the openrelevance svn. If you think you have a 
great idea, don't hesitate to test it and upload results; this is what it is 
for. 

Also keep in mind, again, that I am not a scoring or IR guy. The only thing I 
can really bring to the table here is the willingness to do a lot of relevance 
testing!


