[ https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754716#action_12754716 ]

Doron Cohen commented on LUCENE-1908:
-------------------------------------

{quote}
The intro to IR book appears to break it down so that you can explain it with 
the math (why going into the unit vector space favors longer docs) - but other 
work I am seeing says the math tells you no such thing, and it's just comparing 
it to the computed relevancy curve that tells you it's not great.
{quote}

To my (current) understanding it goes like this: normalizing every V(d) to a 
unit vector loses all information about document lengths. For a long document 
made by duplicating a shorter one this is probably fine; for a long document 
that simply contains lots of "unique" text it is probably wrong. To solve this, 
a different normalization is sometimes preferred, one that does not scale V(d) 
to the unit vector. (Very much in line with what you (Mark) wrote above - 
finally I understand this...)
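To make the loss of length information concrete, here is a small sketch (the term names and counts are made up for illustration): a document and the same document concatenated with itself map to the identical unit vector under Euclidean normalization.

```python
import math
from collections import Counter

def unit_vector(doc_terms):
    """Map a bag of terms to its Euclidean-normalized (unit-length) tf vector."""
    tf = Counter(doc_terms)
    length = math.sqrt(sum(f * f for f in tf.values()))
    return {term: f / length for term, f in tf.items()}

doc = ["lucene", "search", "index"]
duplicated = doc * 2  # the same document repeated: twice as long, same content

v1, v2 = unit_vector(doc), unit_vector(duplicated)
# Both documents collapse to the same unit vector - the length signal is gone.
print(all(math.isclose(v1[t], v2[t]) for t in v1))  # True
```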

The pivoted length normalization you mentioned is one such alternative 
normalization, and it is in fact the document length normalization that Juru 
uses. In our TREC experiments with Lucene we tried this approach (we modified 
Lucene indexing so that all required components were indexed as stored/cached 
fields, and at search time we could try various scoring algorithms). 
Interestingly, in our experiments pivoted length normalization did not work 
well with Lucene for TREC.
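As a sketch of the pivoted idea (the slope and pivot values below are illustrative, not the ones used by Juru or in our runs): instead of dividing scores by the Euclidean length, one divides by a denominator that is interpolated between the document's length and a fixed pivot, typically the average document length.

```python
def pivoted_norm(doc_length, pivot, slope=0.25):
    """Pivoted length normalization denominator.

    Scores are divided by this instead of the Euclidean length |V(d)|.
    At doc_length == pivot the correction is neutral; beyond the pivot,
    long documents are penalized less than plain cosine normalization
    would penalize them.
    """
    return (1.0 - slope) * pivot + slope * doc_length

# At the pivot (e.g. the average document length), the denominator
# equals the length itself, so the normalization is neutral there:
print(pivoted_norm(100.0, pivot=100.0))  # 100.0
```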

The document length normalization of Lucene's DefaultSimilarity (DS) now seems 
to me - intuitively - not so good, at least for the two previously mentioned 
edge cases: doc1 is made of N distinct terms, and doc2 is made of the same N 
distinct terms but has length 2N because each term appears twice. For doc1, DS 
normalizes to the unit vector, the same as Euclidean unit-vector normalization 
(EU); for doc2, DS normalizes to a vector larger than the unit vector. However, 
I think the desired behavior is the other way around: doc2 should come out the 
same as under EU, and doc1 should be normalized to a vector larger than the 
unit vector.
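The two edge cases can be checked numerically. This sketch assumes DS's lengthNorm is 1/sqrt(numTerms) (the DefaultSimilarity formula in current Lucene), and uses a made-up N=10: for doc1 the DS factor equals the Euclidean one, while for doc2 the DS factor is larger, i.e. the normalized vector ends up longer than the unit vector.

```python
import math

def ds_length_norm(num_terms):
    # DefaultSimilarity: lengthNorm(d) = 1 / sqrt(number of terms in d)
    return 1.0 / math.sqrt(num_terms)

def euclidean_norm_factor(tfs):
    # Euclidean (unit-vector) normalization factor: 1 / |V(d)|
    return 1.0 / math.sqrt(sum(tf * tf for tf in tfs))

N = 10
doc1 = [1] * N   # N distinct terms, each appearing once: length N
doc2 = [2] * N   # the same N terms, each appearing twice: length 2N

# doc1: DS divides by sqrt(N) = |V(d1)|, so DS yields the unit vector.
print(math.isclose(ds_length_norm(N), euclidean_norm_factor(doc1)))  # True
# doc2: DS divides by sqrt(2N), but |V(d2)| = 2*sqrt(N) is bigger,
# so the DS-normalized vector is larger than the unit vector.
print(ds_length_norm(2 * N) > euclidean_norm_factor(doc2))  # True
```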

Back to the documentation patch: again, it seems wrong to present it as if both 
EU and some additional document length normalization are required - fixed patch 
to follow...

> Similarity javadocs for scoring function to relate more tightly to scoring 
> models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch, LUCENE-1908.patch
>
>
> See discussion in the related issue.

