Re: Similarity formula documentation is misleading + how to make field-agnostic queries?

Michael Sokolov Wed, 14 Jan 2015 19:08:51 -0800

In practice, normalization by field length proves to be more useful than normalization by the sum of the lengths of all fields (document length), which I think is what you seem to be after. Think of a book chapter document with two fields: title and full text. It makes little sense to weight the terms in the title differently for longer and shorter texts.

To get the behavior (I think) you want, you could index your documents like this:


document1={field:"field1:term1 field1:term1"}
document2={field:"field1:term1 field2:term1"}

and form queries like:

query1="field:field1\:term1"
query2="field:(field1\:term1 OR field2\:term1)"
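
A minimal sketch of the flattening step described above, written in plain Python for illustration only. The helper names (flatten, escape_colons, build_query) are hypothetical and not part of Lucene. It also assumes the index uses an analyzer that keeps each prefixed token such as field1:term1 intact (for example, whitespace tokenization).

```python
def flatten(doc):
    """Turn {"field1": "term1 term1"} into "field1:term1 field1:term1".

    Each term is prefixed with the name of the field it came from, so
    everything can be indexed into one catch-all field.
    """
    tokens = []
    for field_name, text in doc.items():
        for term in text.split():
            tokens.append(f"{field_name}:{term}")
    return " ".join(tokens)


def escape_colons(token):
    """Escape ':' so a query parser does not read it as a field separator."""
    return token.replace(":", "\\:")


def build_query(catch_all_field, tokens):
    """Build a query string against the single catch-all field.

    Lucene's classic query syntax expects boolean operators in upper case,
    hence "OR" rather than "or".
    """
    escaped = [escape_colons(t) for t in tokens]
    if len(escaped) == 1:
        return f"{catch_all_field}:{escaped[0]}"
    return f"{catch_all_field}:({' OR '.join(escaped)})"


if __name__ == "__main__":
    print(flatten({"field1": "term1 term1"}))
    print(build_query("field", ["field1:term1", "field2:term1"]))
```

Because all terms now live in one field, length normalization applies to that field as a whole, which approximates normalizing by document length.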

-Mike

On 1/13/15 2:24 PM, danield wrote:

Hi all,

I have found, much to my dismay, that the documentation on Lucene’s default
similarity formula is very dangerously misleading. See it here:
http://lucene.apache.org/core/4_9_0/core/org/apache/lucene/search/similarities/TFIDFSimilarity.html#formula_tf

Term Frequency (TF) counts are expected to be per-document in the IR
literature, and this documentation doesn’t say any differently. However, it
turns out that for Lucene, TF scores are in fact PER-FIELD.

This furthermore applies to the /coord/ component. I realise that /coord/ is
a ratio of query terms matched over total query terms, but I believe an
effort could be made to make clear that field1:term1 and field2:term1 count
as 2 different query terms.

As an example, for 2 documents with fields field1 and field2, where
query1="field1:term1"
query2="field1:term1 OR field2:term1"

document1={field1:"term1 term1", field2:""}
document2={field1:"term1", field2:"term1"}

Coord(query1,document1)= 1/1 = 1
Coord(query2,document1)= 1/2 = 0.5
Coord(query1,document2)= 1/1 = 1
Coord(query2,document2)= 2/2 = 1
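
The coord arithmetic for query2 can be reproduced with a small Python sketch. This is an illustration of the ratio described above (query terms matched over total query terms), not Lucene code. It assumes document2 holds term1 once in field1 and once in field2, which is the reading under which Coord(query2,document2) = 2/2. The function name coord and the (field, term) tuple representation are made up for the example.

```python
def coord(query_terms, doc):
    """Fraction of query terms that match the document.

    A query term is a (field, term) pair, so field1:term1 and
    field2:term1 count as two different query terms.
    """
    matched = sum(
        1 for field, term in query_terms
        if term in doc.get(field, "").split()
    )
    return matched / len(query_terms)


query2 = [("field1", "term1"), ("field2", "term1")]
document1 = {"field1": "term1 term1", "field2": ""}
document2 = {"field1": "term1", "field2": "term1"}

if __name__ == "__main__":
    print(coord(query2, document1))  # 0.5: only field1:term1 matches
    print(coord(query2, document2))  # 1.0: both query terms match
```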

Now, the TF scores will be normalized with the fieldNorm component which is
computed based on field length at indexing time and stored in a single byte,
with a significant loss of precision. These things together make it
impossible to run Lucene retrieval in such a way that

*similarity(query2,document1) == similarity(query2,document2)*

which is precisely what I need in my use case.
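
To see why the two scores diverge, here is a simplified Python sketch of the per-field part of the scoring. It assumes the default TF-IDF choices of tf = sqrt(freq) and a length norm of 1/sqrt(number of terms in the field), and it deliberately ignores idf, queryNorm, boosts, and the single-byte quantization of the norm. The function names are hypothetical, and document2 is again assumed to hold term1 once in each field.

```python
import math


def tf(freq):
    """Term-frequency factor, computed per field."""
    return math.sqrt(freq)


def length_norm(num_terms):
    """Length normalization, computed from the length of one field."""
    return 1.0 / math.sqrt(num_terms) if num_terms else 0.0


def field_contribution(doc, field, term):
    """tf * lengthNorm for one (field, term) query term."""
    tokens = doc.get(field, "").split()
    freq = tokens.count(term)
    if freq == 0:
        return 0.0
    return tf(freq) * length_norm(len(tokens))


def simplified_score(query_terms, doc):
    """coord * sum of per-field contributions (idf and queryNorm omitted)."""
    contributions = [field_contribution(doc, f, t) for f, t in query_terms]
    matched = sum(1 for c in contributions if c > 0)
    return (matched / len(query_terms)) * sum(contributions)


query2 = [("field1", "term1"), ("field2", "term1")]
document1 = {"field1": "term1 term1", "field2": ""}
document2 = {"field1": "term1", "field2": "term1"}

if __name__ == "__main__":
    print(simplified_score(query2, document1))  # roughly 0.5
    print(simplified_score(query2, document2))  # roughly 2.0
```

Under these simplifications both documents contain term1 twice, yet document2 scores about four times higher, because both tf and coord are evaluated per field rather than per document.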

Here are my questions:
1. I think the documentation should be updated to make this clear! Can I do
this myself?
2. Has anyone encountered this problem before? Is there an easy fix?

Cheers,
Daniel



--
View this message in context: 
http://lucene.472066.n3.nabble.com/Similarity-formula-documentation-is-misleading-how-to-make-field-agnostic-queries-tp4179307.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]


