[ https://issues.apache.org/jira/browse/LUCENE-2187?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
Robert Muir updated LUCENE-2187:
--------------------------------

    Attachment: scoring.pdf

Sorry, this corrects some transposed axis labels and some grammatical mistakes :)

> improve lucene's similarity algorithm defaults
> ----------------------------------------------
>
>                 Key: LUCENE-2187
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2187
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Query/Scoring
>            Reporter: Robert Muir
>             Fix For: Flex Branch
>
>         Attachments: scoring.pdf, scoring.pdf, scoring.pdf
>
>
> First things first: I am not an IR guy. The goal of this issue is to make
> 'surgical' tweaks to Lucene's formula to bring its performance up to that of
> more modern algorithms such as BM25.
> In my opinion, the concept of having some 'flexible' scoring with good speed
> across the board is an interesting goal, but not practical in the short term.
> Instead here I propose incorporating some work similar to lnu.ltc and
> friends, but slightly different. I noticed this seems to be in line with that
> paper published before about the TREC million queries track...
> Here is what I propose in pseudocode (overriding DefaultSimilarity):
> {code}
>   // slope: a constant; pivot: the average field length (both explained below)
>   @Override
>   public float tf(float freq) {
>     // logarithmic tf; guard freq <= 0, where Math.log yields -Infinity or NaN
>     return freq > 0 ? 1 + (float) Math.log(freq) : 0;
>   }
>
>   @Override
>   public float lengthNorm(String fieldName, int numTerms) {
>     // pivoted normalization: the norm equals 1 / pivot when numTerms == pivot
>     return (float) (1 / ((1 - slope) * pivot + slope * numTerms));
>   }
> {code}
> Where slope is a constant (I used 0.25 for all relevance evaluations: the
> goal is to have a better default), and pivot is the average field length.
> Obviously we shouldn't make the user provide this but instead have the system
> provide it.
> These two pieces do not improve Lucene much independently, but together they
> are competitive with BM25 scoring with the test collections I have run so
> far.
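The two overrides above can be sketched as a standalone class to see how slope and pivot interact. This is only an illustration under stated assumptions: `PivotedNormSketch` and `averageLength` are hypothetical names (not Lucene API), the pivot is assumed to be the plain arithmetic mean of per-document term counts, and the guard for freq <= 0 is added here rather than taken from the proposal.

```java
import java.util.List;

/**
 * Standalone sketch of the two proposed pieces: logarithmic tf and a
 * pivoted length norm. Hypothetical class; it does not extend any
 * Lucene class and ignores how Lucene encodes norms at index time.
 */
final class PivotedNormSketch {
    private final float slope; // e.g. 0.25, the value used in the evaluations
    private final float pivot; // average field length, in terms

    PivotedNormSketch(float slope, float pivot) {
        this.slope = slope;
        this.pivot = pivot;
    }

    /** Assumption: the pivot is the arithmetic mean of per-document term counts. */
    static float averageLength(List<Integer> fieldLengths) {
        if (fieldLengths.isEmpty()) {
            throw new IllegalArgumentException("need at least one field length");
        }
        long total = 0;
        for (int length : fieldLengths) {
            total += length;
        }
        return (float) total / fieldLengths.size();
    }

    /** 1 + ln(freq); returns 0 for freq <= 0 (guard added in this sketch). */
    float tf(float freq) {
        return freq > 0 ? 1 + (float) Math.log(freq) : 0f;
    }

    /** 1 / ((1 - slope) * pivot + slope * numTerms), as in the pseudocode. */
    float lengthNorm(int numTerms) {
        return 1f / ((1 - slope) * pivot + slope * numTerms);
    }
}
```

For example, with slope = 0.25 and field lengths of 50, 100, and 150 terms, the pivot is 100. A 100-term field then gets a norm of 1/100, while a 400-term field gets 1/175 (0.75 * 100 + 0.25 * 400 = 175), so fields longer than the pivot are penalized more gently than a plain 1/numTerms norm would penalize them.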
> The idea here is that this logarithmic tf normalization is independent of the
> tf / mean TF that you see in some of these algorithms. In fact, I implemented
> lnu.ltc with cosine pivoted length normalization and the log(tf)/log(mean TF)
> stuff, and it did not fare as well as this method. This is also simpler: we
> do not need to calculate the mean TF at all.
> The BM25-like "binary" pivot here works better on the test collections I have
> run, but of course only with the tf modification.
> I am uploading a document with results from 3 test collections (Persian,
> Hindi, and Indonesian). I will test at least 3 more languages... yes,
> including English... across more collections and upload those results also,
> but I need to process these corpora to run the tests with the benchmark
> package, so this will take some time (maybe weeks).
> So please rip it apart with scoring theory etc., but keep in mind that 2 of
> these 3 test collections are in the openrelevance svn. If you think you have
> a great idea, don't hesitate to test it and upload results; this is what it
> is for.
> Also keep in mind, again, that I am not a scoring or IR guy; the only thing I
> can really bring to the table here is the willingness to do a lot of
> relevance testing!