improve lucene's similarity algorithm defaults
----------------------------------------------
Key: LUCENE-2187
URL: https://issues.apache.org/jira/browse/LUCENE-2187
Project: Lucene - Java
Issue Type: Improvement
Components: Query/Scoring
Reporter: Robert Muir
Fix For: Flex Branch
First things first: I am not an IR guy. The goal of this issue is to make
'surgical' tweaks to Lucene's scoring formula to bring its relevance up to that
of more modern algorithms such as BM25.
In my opinion, the concept of having some 'flexible' scoring with good speed
across the board is an interesting goal, but not practical in the short term.
Instead here I propose incorporating some work similar to lnu.ltc and friends,
but slightly different. I noticed this seems to be in line with that paper
published before about the TREC million queries track...
Here is what I propose in pseudocode (overriding DefaultSimilarity):
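{code}
// slope: constant (0.25 in these evaluations); pivot: average field length
@Override
public float tf(float freq) {
  // logarithmic tf; guard freq <= 0, where Math.log gives -Infinity or NaN
  return freq > 0 ? 1 + (float) Math.log(freq) : 0f;
}

@Override
public float lengthNorm(String fieldName, int numTerms) {
  // pivoted length normalization: fields longer than the pivot are penalized
  return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
}
{code}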
Where slope is a constant (I used 0.25 for all relevance evaluations: the goal
is to have a better default), and pivot is the average field length. Obviously
we shouldn't make the user provide the pivot; the system should compute it.
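To make the effect of slope and pivot concrete, here is a minimal standalone sketch (no Lucene dependency) of the two functions above. The class name PivotedSimilaritySketch, the helper averageLength, and the sample field lengths are hypothetical and only for illustration; how the system would actually collect the average field length is not decided in this issue.

```java
public class PivotedSimilaritySketch {
    // Slope used in the relevance evaluations described in this issue.
    static final double SLOPE = 0.25;

    // Logarithmic tf: 1 + ln(freq), with 0 for non-positive frequencies.
    static float tf(float freq) {
        return freq > 0 ? 1 + (float) Math.log(freq) : 0f;
    }

    // Pivoted length norm: equals 1/pivot when numTerms == pivot,
    // is larger for shorter fields and smaller for longer ones.
    static float lengthNorm(int numTerms, double pivot, double slope) {
        return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
    }

    // Hypothetical helper: the pivot is simply the mean field length.
    static double averageLength(int[] fieldLengths) {
        if (fieldLengths.length == 0) {
            return 0;
        }
        long total = 0;
        for (int len : fieldLengths) {
            total += len;
        }
        return (double) total / fieldLengths.length;
    }

    public static void main(String[] args) {
        int[] lengths = {50, 100, 150}; // made-up field lengths
        double pivot = averageLength(lengths);
        for (int len : lengths) {
            System.out.printf("numTerms=%d norm=%.6f%n", len, lengthNorm(len, pivot, SLOPE));
        }
        System.out.printf("tf(1)=%.3f tf(10)=%.3f%n", tf(1f), tf(10f));
    }
}
```

With a slope of 0.25, a field twice as long as the pivot is penalized far less than it would be with a plain 1/numTerms norm, which is the point of pivoting.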
These two pieces do not improve Lucene much independently, but together they
are competitive with BM25 scoring on the test collections I have run so far.
The idea here is that this logarithmic tf normalization is independent of the
tf / mean TF that you see in some of these algorithms. In fact, I implemented
lnu.ltc with cosine pivoted length normalization and the log(tf)/log(mean TF)
stuff, and it did not fare as well as this method. This method is also simpler:
we do not need to calculate the mean TF at all.
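As a rough illustration of the difference, the sketch below contrasts the plain logarithmic tf with a mean-TF-normalized variant. It assumes the usual Lnu form (1 + ln tf) / (1 + ln mean tf), which may differ in detail from what I actually ran; the class and method names are hypothetical.

```java
public class TfVariantsSketch {
    // Plain logarithmic tf: needs only the term's own frequency.
    static double logTf(double freq) {
        return freq > 0 ? 1 + Math.log(freq) : 0;
    }

    // Mean-TF-normalized variant (assumed Lnu-style form): also needs the
    // document's average term frequency, which must be computed and stored.
    // meanTf is expected to be >= 1 for any non-empty document.
    static double meanNormalizedTf(double freq, double meanTf) {
        return freq > 0 ? (1 + Math.log(freq)) / (1 + Math.log(meanTf)) : 0;
    }

    public static void main(String[] args) {
        double freq = 5;
        double meanTf = 2; // made-up per-document average term frequency
        System.out.println("log tf:             " + logTf(freq));
        System.out.println("mean-normalized tf: " + meanNormalizedTf(freq, meanTf));
    }
}
```

The second function needs an extra per-document statistic, which is exactly the bookkeeping the proposed method avoids.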
The BM25-like "binary" pivot here works better on the test collections I have
run, but of course only with the tf modification.
I am uploading a document with results from 3 test collections (Persian, Hindi,
and Indonesian). I will test at least 3 more languages (yes, including
English) across more collections and upload those results also, but I need to
process these corpora to run the tests with the benchmark package, so this will
take some time (maybe weeks).
So, please rip it apart with scoring theory etc., but keep in mind 2 of these 3
test collections are in the openrelevance svn, so if you think you have a great
idea, don't hesitate to test it and upload results; this is what it is for.
Also keep in mind, again, that I am not a scoring or IR guy. The only thing I
can really bring to the table here is the willingness to do a lot of relevance
testing!