[
https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754716#action_12754716
]
Doron Cohen commented on LUCENE-1908:
-------------------------------------
{quote}
The intro to IR book appears to break it down so that you can explain it with
the math (why going into the unit vector space favors longer docs) - but other
work I am seeing says the math tells you no such thing, and it's just comparing
it to the computed relevancy curve that tells you it's not great.
{quote}
To my (current) understanding it goes like this: normalizing every V(d) to a
unit vector loses all information about document length. For a long document
made by duplicating a shorter one this is probably fine. For a long document
that simply contains lots of "unique" text it is probably wrong. To solve this,
a different normalization is sometimes preferred, one that does not normalize
V(d) to the unit vector. (Very much in line with what you (Mark) wrote above -
finally I understand this...)
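To make the "losing all information about lengths" point concrete, here is a
minimal sketch (the toy vocabulary and term frequencies are illustrative):
a document and the same document concatenated with itself map to the same unit
vector.

```python
import math

def unit_normalize(tf):
    """Scale a term-frequency vector to Euclidean (L2) length 1."""
    norm = math.sqrt(sum(f * f for f in tf.values()))
    return {t: f / norm for t, f in tf.items()}

doc = {"lucene": 1, "search": 2}
doubled = {t: 2 * f for t, f in doc.items()}  # the doc concatenated with itself

u1, u2 = unit_normalize(doc), unit_normalize(doubled)
same = all(math.isclose(u1[t], u2[t]) for t in doc)
print(same)  # True: after unit normalization the two docs are indistinguishable
```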
The pivoted length normalization which you mentioned is one such alternative
normalization. Juru in fact uses this document length normalization. In our
TREC experiments with Lucene we tried this approach (we modified Lucene
indexing so that all required components were indexed as stored/cached fields,
and at search time we could try various scoring algorithms). Interestingly, in
our experiments pivoted length normalization did not work well with Lucene for
TREC.
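For reference, a minimal sketch of the pivoted normalization factor (after
Singhal et al.); the slope value and the choice of average document length as
the pivot are illustrative assumptions, not Juru's actual parameters:

```python
def pivoted_norm(doc_len, avg_len, slope=0.25):
    # Pivoted length normalization: interpolate between no length
    # normalization (slope = 0) and fully proportional normalization
    # (slope = 1). Using the average document length as the pivot means
    # a document of average length gets a factor of exactly 1.
    return (1.0 - slope) + slope * (doc_len / avg_len)

print(pivoted_norm(50, 100))   # 0.875: short docs are normalized less harshly
print(pivoted_norm(100, 100))  # 1.0:   the pivot point
print(pivoted_norm(200, 100))  # 1.25:  long docs are normalized more
```

The score contribution of a document is then divided by this factor instead of
by the Euclidean vector length.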
The document length normalization of Lucene's DefaultSimilarity (DS) now seems
to me - intuitively - not so good, at least for the two previously mentioned
edge cases: doc1 is made of N distinct terms, and doc2 is made of the same N
distinct terms but has length 2N because each term appears twice. For doc1, DS
normalizes to the unit vector, same as Euclidean normalization (EN); for doc2,
DS normalizes to a vector larger than the unit vector. However, I think the
desired behavior is the other way around: doc2 should come out the same as
under EN, and doc1 should be normalized to a vector larger than the unit
vector.
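The two edge cases can be checked numerically. A sketch, treating the document
vector components as raw term frequencies (ignoring DS's sqrt(freq) tf factor
for simplicity) and using DS's lengthNorm of 1/sqrt(numTerms):

```python
import math

N = 16  # any number of distinct terms works

tf1 = [1] * N  # doc1: N distinct terms, each appearing once (length N)
tf2 = [2] * N  # doc2: same N terms, each appearing twice (length 2N)

def euclidean_len(tf):
    """Euclidean (L2) length of a term-frequency vector."""
    return math.sqrt(sum(f * f for f in tf))

def ds_length_norm(num_terms):
    # DefaultSimilarity.lengthNorm: 1 / sqrt(number of terms in the field)
    return 1.0 / math.sqrt(num_terms)

# Vector length of each doc after applying DS's norm factor:
len1 = euclidean_len(tf1) * ds_length_norm(N)      # sqrt(N)/sqrt(N)    = 1
len2 = euclidean_len(tf2) * ds_length_norm(2 * N)  # 2*sqrt(N)/sqrt(2N) = sqrt(2)

print(len1)  # 1.0:    doc1 lands exactly on the unit vector, same as EN
print(len2)  # ~1.414: doc2 ends up larger than the unit vector
```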
Back to the documentation patch: again it seems wrong to present it as if both
EN and some additional doc length normalization are required - fixed patch to
follow...
> Similarity javadocs for scoring function to relate more tightly to scoring models in effect
> -------------------------------------------------------------------------------------------
>
> Key: LUCENE-1908
> URL: https://issues.apache.org/jira/browse/LUCENE-1908
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Doron Cohen
> Assignee: Doron Cohen
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1908.patch, LUCENE-1908.patch
>
>
> See discussion in the related issue.