Hi all, I have found, much to my dismay, that the documentation on Lucene’s default similarity formula is dangerously misleading. See it here: http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#formula_tf
Term Frequency (TF) counts are expected to be per-document in the IR literature, and this documentation doesn’t say otherwise. However, it turns out that in Lucene, TF scores are in fact PER-FIELD. The same applies to the /coord/ component. I realise that /coord/ is the ratio of query terms matched to total query terms, but I believe an effort could be made to make clear that field1:term1 and field2:term1 count as 2 different query terms.

As an example, take 2 documents with fields field1 and field2, where

query1 = "field1:term1"
query2 = "field1:term1 OR field2:term1"
document1 = {field1:"term1 term1", field2:""}
document2 = {field1:"term1", field2:"term1"}

Coord(query1,document1) = 1/1 = 1
Coord(query2,document1) = 1/2 = 0.5
Coord(query1,document2) = 1/1 = 1
Coord(query2,document2) = 2/2 = 1

Now, the TF scores will be normalized with the fieldNorm component, which is computed from the field length at indexing time and stored in a single byte, with a significant loss of precision. Together, these things make it impossible to run Lucene retrieval in such a way that *similarity(query2,document1) == similarity(query2,document2)*, which is precisely what I need in my use case.

Here are my questions:
1. I think the documentation should be updated to make this clear! Can I do this myself?
2. Has anyone encountered this problem before? Is there an easy fix?

Cheers,
Daniel

--
View this message in context: http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.
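For reference, the coord arithmetic in the example can be reproduced with a small standalone sketch. This is not Lucene code and does not call any Lucene API: the class CoordSketch and its helpers are hypothetical names, tokenization is a plain whitespace split, and document2 is assumed to hold term1 in both field1 and field2. It only models the behaviour described above, where each field:term pair counts as its own query clause.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

// Hypothetical model of the coord factor; NOT Lucene's implementation.
public class CoordSketch {

    // True if the whitespace-tokenized text of the given field contains the term.
    public static boolean fieldContains(Map<String, String> doc, String field, String term) {
        String text = doc.getOrDefault(field, "");
        return Arrays.asList(text.trim().split("\\s+")).contains(term);
    }

    // coord = matched clauses / total clauses; each clause is a {field, term} pair,
    // so field1:term1 and field2:term1 are counted as two separate clauses.
    public static double coord(String[][] clauses, Map<String, String> doc) {
        int matched = 0;
        for (String[] clause : clauses) {
            if (fieldContains(doc, clause[0], clause[1])) {
                matched++;
            }
        }
        return (double) matched / clauses.length;
    }

    public static void main(String[] args) {
        String[][] query1 = {{"field1", "term1"}};
        String[][] query2 = {{"field1", "term1"}, {"field2", "term1"}};

        Map<String, String> document1 = new HashMap<>();
        document1.put("field1", "term1 term1");
        document1.put("field2", "");

        // Assumption: document2 has term1 once in each field.
        Map<String, String> document2 = new HashMap<>();
        document2.put("field1", "term1");
        document2.put("field2", "term1");

        System.out.println(coord(query1, document1)); // 1.0
        System.out.println(coord(query2, document1)); // 0.5
        System.out.println(coord(query1, document2)); // 1.0
        System.out.println(coord(query2, document2)); // 1.0
    }
}
```

Under this model a single-clause query can only yield a coord of 0 or 1, and query2 scores document1 lower than document2 purely because the same term sits in one field instead of two, which is the asymmetry I am trying to avoid.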