[jira] Commented: (LUCENE-1908) Similarity javadocs for scoring function to relate more tightly to scoring models in effect

Doron Cohen (JIRA) Sun, 13 Sep 2009 11:48:21 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754743#action_12754743
 ]


Doron Cohen commented on LUCENE-1908:
-------------------------------------

Thanks for reviewing this, Ted.

{quote}
the new text seems to say things like "the scoring function is like this 
(formula) except that it isn't because it is really like this (other-formula) 
but that isn't really right either because it is like this 
(still-another-formula) which actually isn't right because of fields and 
<mumble>".
{quote}

I see what you mean. 

I tried to take the reader from the vector space model (VSM) to the actual 
elements computed and aggregated in the Lucene scoring code. This would also 
answer a question asked several times on the lists, "but what is the scoring 
model of Lucene?", since it is not that straightforward to tell why a certain 
method is called during scoring. 

But I think you have a good point - the reader is told "this is the scoring 
formula" only to discover 20 lines later that in fact "that is the formula", 
and then the same thing happens again in another paragraph. 

I think all three formulas are required; it is the text gluing them together 
that should improve. Better English than mine might have helped here, but I'll 
give it a try. I think I know how to write it better in this respect.

{quote}
There are also many small errors such as claiming that tf is proportional to 
term frequency and idf is proportional to inverse of document frequency. 
Proportional means that there is a linear relationship which is definitely not 
the case here. It would be better to say tf usually increases with increasing 
term frequency, although occasionally a constant might be used. IDF, on the 
other hand, decreases with increasing document frequency.
{quote}

I agree. "Proportional" is wrong. Thanks for catching this! In fact it is also 
used wrongly in two other places in Similarity - in idf() and in idfExplain(). 
In these two other places I think replacing it with "related" would be correct, 
i.e. like this:

{noformat}
Note that Searcher.maxDoc() is used instead of
org.apache.lucene.index.IndexReader.numDocs() 
because it is related to Searcher.docFreq(Term) , 
i.e., when one is inaccurate, so is the other, and 
in the same direction.
{noformat}

For tf and idf I think this will do (?):

{noformat}
Tf and idf are described in more detail below, 
but for now, for completeness, let's just say that 
for a given term t and document (or query) x, 
tf(t,x) is related to the number of occurrences of 
term t in x - when one increases so does 
the other - and idf(t) is similarly related to the 
inverse of the number of index documents 
containing term t. 
{noformat}
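
To make the "related, not proportional" point concrete, here is a minimal sketch in plain Java with no Lucene dependency. It assumes the usual DefaultSimilarity formulas, tf = sqrt(freq) and idf = log(numDocs / (docFreq + 1)) + 1. The class name TfIdfShape is hypothetical, and the formulas should be checked against the Similarity implementation actually in use.

```java
public class TfIdfShape {

    // Assumption: mirrors DefaultSimilarity.tf(float), the square root of the term frequency.
    public static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Assumption: mirrors DefaultSimilarity.idf(int, int), log(numDocs / (docFreq + 1)) + 1.
    public static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        // tf grows with the term frequency, but doubling freq does not double tf.
        for (int freq : new int[] {1, 2, 4, 8}) {
            System.out.printf("freq=%d tf=%.3f%n", freq, tf(freq));
        }

        // idf shrinks as more documents contain the term, again not linearly.
        int numDocs = 1000;
        for (int docFreq : new int[] {1, 10, 100, 999}) {
            System.out.printf("docFreq=%d idf=%.3f%n", docFreq, idf(docFreq, numDocs));
        }
    }
}
```

Under these assumed formulas both functions are monotonic but not linear in their inputs, which is why "related" or "increases with" describes them better than "proportional".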


> Similarity javadocs for scoring function to relate more tightly to scoring 
> models in effect
> -------------------------------------------------------------------------------------------
>
>                 Key: LUCENE-1908
>                 URL: https://issues.apache.org/jira/browse/LUCENE-1908
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>            Reporter: Doron Cohen
>            Assignee: Doron Cohen
>            Priority: Minor
>             Fix For: 2.9
>
>         Attachments: LUCENE-1908.patch, LUCENE-1908.patch, LUCENE-1908.patch
>
>
> See discussion in the related issue.
