[
https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]
Robert Muir updated LUCENE-2187:
--------------------------------
Attachment: scoring.pdf
Document with some simple results from the 3 collections I have tested thus far.
I chose to display simple graphs with descriptions of the collections and some
of their peculiarities.
If you want submission.txt dumps or verbose output from trec_eval, I can do
that too, but I think it's less useful to start with.
> improve lucene's similarity algorithm defaults
> ----------------------------------------------
>
> Key: LUCENE-2187
> URL: https://issues.apache.org/jira/browse/LUCENE-2187
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Query/Scoring
> Reporter: Robert Muir
> Fix For: Flex Branch
>
> Attachments: scoring.pdf
>
>
> First things first: I am not an IR guy. The goal of this issue is to make
> 'surgical' tweaks to Lucene's formula to bring its performance up to that of
> more modern algorithms such as BM25.
> In my opinion, the concept of having some 'flexible' scoring with good speed
> across the board is an interesting goal, but not practical in the short term.
> Instead, here I propose incorporating some work similar to lnu.ltc and
> friends, but slightly different. I noticed this seems to be in line with that
> paper published before about the TREC Million Query Track...
> Here is what I propose in pseudocode (overriding DefaultSimilarity):
> {code}
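> // slope and pivot are assumed to be fields of the Similarity subclass.
> @Override
> public float tf(float freq) {
>   // Logarithmic tf; guard against log(0) for a non-positive frequency.
>   return freq > 0 ? 1 + (float) Math.log(freq) : 0;
> }
>
> @Override
> public float lengthNorm(String fieldName, int numTerms) {
>   // Pivoted length normalization around the average field length.
>   return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
> }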
> {code}
> Where slope is a constant (I used 0.25 for all relevance evaluations: the
> goal is to have a better default), and pivot is the average field length.
> Obviously we shouldn't make the user provide the pivot; the system should
> compute it instead.
> These two pieces do not improve Lucene much independently, but together they
> are competitive with BM25 scoring on the test collections I have run so
> far.
> The idea here is that this logarithmic tf normalization is independent of the
> tf / mean TF that you see in some of these algorithms. In fact, I implemented
> lnu.ltc with cosine pivoted length normalization and the log(tf)/log(mean TF)
> approach, and it did not fare as well as this method. This method is also
> simpler: we do not need to calculate the mean TF at all.
> The BM25-like "binary" pivot here works better on the test collections I have
> run, but of course only with the tf modification.
> I am uploading a document with results from 3 test collections (Persian,
> Hindi, and Indonesian). I will test at least 3 more languages (yes,
> including English) across more collections and upload those results also,
> but I need to process these corpora to run the tests with the benchmark
> package, so this will take some time (maybe weeks).
> So, please rip it apart with scoring theory etc., but keep in mind that 2 of
> these 3 test collections are in the openrelevance svn, so if you think you
> have a great idea, don't hesitate to test it and upload results; this is what
> it is for.
> Also keep in mind, again, that I am not a scoring or IR guy; the only thing I
> can really bring to the table here is the willingness to do a lot of
> relevance testing!
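
To make the quoted proposal easier to experiment with, here is a minimal standalone sketch of the two formulas. It does not use any Lucene API: the class name PivotedNormSketch, the helper averageLength, and the sample field lengths are all hypothetical, and the guard for a non-positive frequency is an assumption that is not part of the issue description. Only the slope of 0.25 and the two formulas come from the description.

```java
import java.util.Arrays;

/** Standalone sketch of the proposed tf and pivoted length norm (not Lucene API). */
public class PivotedNormSketch {
    // Slope reported in the issue description for all relevance evaluations.
    public static final double SLOPE = 0.25;

    /** Logarithmic tf: 1 + ln(freq). Returning 0 for freq <= 0 is an assumption. */
    public static float tf(float freq) {
        return freq > 0 ? 1 + (float) Math.log(freq) : 0f;
    }

    /** Hypothetical helper: the pivot as the mean field length of a collection. */
    public static double averageLength(int[] fieldLengths) {
        return Arrays.stream(fieldLengths).average().orElse(1.0);
    }

    /** Pivoted length norm: 1 / ((1 - slope) * pivot + slope * numTerms). */
    public static float lengthNorm(double pivot, int numTerms) {
        return (float) (1 / ((1 - SLOPE) * pivot + SLOPE * numTerms));
    }

    public static void main(String[] args) {
        int[] lengths = {50, 100, 150};        // made-up field lengths
        double pivot = averageLength(lengths); // mean of the sample lengths
        for (int len : lengths) {
            System.out.printf("len=%d norm=%.6f%n", len, lengthNorm(pivot, len));
        }
        System.out.println("tf(1)=" + tf(1f) + " tf(10)=" + tf(10f));
    }
}
```

When numTerms equals the pivot, the denominator reduces to the pivot itself, so the norm is 1/pivot whatever the slope. Fields shorter than the pivot get a larger norm and longer fields a smaller one, and the slope controls how strongly the length matters.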