I applied the Lucene patch mentioned in
https://issues.apache.org/jira/browse/LUCENE-2091 and ran the MAP numbers on
TREC-3 collection using topics 151-200. I am not getting worse results
comparing to Lucene DefaultSimilarity. I suspect, I am not using it correctly.
I have single field documents. This is the process I use:
1. During the indexing, I am setting the similarity to BM25 as such:
IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(
Version.LUCENE_CURRENT), true,
IndexWriter.MaxFieldLength.UNLIMITED);
writer.setSimilarity(new BM25Similarity());
2. During the Precision/Recall measurements, I am using a SimpleBM25QQParser
extension I added to the benchmark:
QualityQueryParser qqParser = new SimpleBM25QQParser("title", "TEXT");
3. Here is the parser code (I set an avg doc length here):
public Query parse(QualityQuery qq) throws ParseException {
BM25Parameters.setAverageLength(indexField, 798.30f);//avg doc length
BM25Parameters.setB(0.5f);//tried default values
BM25Parameters.setK1(2f);
return query = new BM25BooleanQuery(qq.getValue(qqName), indexField, new
StandardAnalyzer(Version.LUCENE_CURRENT));
}
4. The searcher is using BM25 similarity:
Searcher searcher = new IndexSearcher(dir, true);
searcher.setSimilarity(sim);
Am I missing some steps? Does anyone have experience with this code?
Thanks,
Ivan
---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]