[
https://issues.apache.org/jira/browse/LUCENE-1908?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12754743#action_12754743
]
Doron Cohen commented on LUCENE-1908:
-------------------------------------
Thanks for reviewing this, Ted.
{quote}
the new text seems to say things like "the scoring function is like this
(formula) except that it isn't because it is really like this (other-formula)
but that isn't really right either because it is like this
(still-another-formula) which actually isn't right because of fields and
<mumble>".
{quote}
I see what you mean.
I tried to take the reader from the vector space model (VSM) to the actual
elements computed and aggregated in the Lucene scoring code. This would also
answer a question asked several times on the lists: "but what is the scoring
model of Lucene?" It is not that straightforward to tell why a certain method
is called during scoring.
But I think you have a good point: the reader is told "this is the scoring
formula", only to discover 20 lines later that in fact "that is the formula",
and then the same thing happens again in another paragraph.
I think all 3 formulas are required; it is the gluing text that should improve.
Better English than mine might have helped here, but I'll give it a try. I
think I know how to write it better in this sense.
{quote}
There are also many small errors such as claiming that tf is proportional to
term frequency and idf is proportional to inverse of document frequency.
Proportional means that there is a linear relationship which is definitely not
the case here. It would be better to say tf usually increases with increasing
term frequency, although occasionally a constant might be used. IDF, on the
other hand, decreases with increasing document frequency.
{quote}
I agree. "Proportional" is wrong. Thanks for catching this! In fact it is also
used wrongly in two other places in Similarity: idf() and idfExplain(). In
these two places I think replacing it with "related" would be correct, i.e.
like this:
{noformat}
Note that Searcher.maxDoc() is used instead of
org.apache.lucene.index.IndexReader.numDocs()
because it is related to Searcher.docFreq(Term) ,
i.e., when one is inaccurate, so is the other, and
in the same direction.
{noformat}
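To illustrate why the two counts should err in the same direction, here is a minimal sketch. It assumes an idf of the form log(docCount/(docFreq+1)) + 1, as in DefaultSimilarity, and it assumes that both maxDoc and docFreq still count deleted documents that have not been merged away. The class name and all numbers are hypothetical.

```java
public class MaxDocVsNumDocsSketch {

    // Assumed idf form: log(docCount / (docFreq + 1)) + 1.
    static float idf(int docFreq, int docCount) {
        return (float) (Math.log(docCount / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        int maxDoc = 10;                 // hypothetical: includes deleted documents
        int deleted = 4;                 // hypothetical: deleted but not yet merged away
        int numDocs = maxDoc - deleted;  // live documents only
        int docFreq = 8;                 // hypothetical: also counts deleted documents

        // Both counts are inflated by deletions, so their ratio stays plausible.
        System.out.println("idf with maxDoc  = " + idf(docFreq, maxDoc));

        // Inflated docFreq against an exact numDocs: the term appears to occur
        // in more documents than exist, and idf drops below 1.
        System.out.println("idf with numDocs = " + idf(docFreq, numDocs));
    }
}
```

With these made-up numbers, docFreq exceeds numDocs, which is the kind of inconsistency that pairing docFreq with maxDoc avoids.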
For tf and idf I think this will do: (?)
{noformat}
Tf and Idf are described in more detail below,
but for now, for completeness, let's just say that
for given term t and document (or query) x,
Tf(t,x) is related to the number of occurrences of
term t in x - when one increases so does
the other - and idf(t) is similarly related to the
inverse of the number of index documents
containing term t.
{noformat}
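As a small illustration of "related but not proportional", the sketch below assumes the formulas used by DefaultSimilarity: tf as the square root of the term frequency and idf as log(numDocs/(docFreq+1)) + 1. Other Similarity implementations may differ. The class name and the sample values are hypothetical.

```java
public class TfIdfSketch {

    // Assumed tf: square root of the raw term frequency.
    static float tf(float freq) {
        return (float) Math.sqrt(freq);
    }

    // Assumed idf: log(numDocs / (docFreq + 1)) + 1.
    static float idf(int docFreq, int numDocs) {
        return (float) (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);
    }

    public static void main(String[] args) {
        int numDocs = 1000; // hypothetical index size

        // Doubling the input each step: tf grows and idf shrinks,
        // but neither changes by a constant factor of the input.
        for (int n : new int[] {1, 2, 4, 8, 16}) {
            System.out.printf("freq=%d tf=%.3f | docFreq=%d idf=%.3f%n",
                    n, tf(n), n, idf(n, numDocs));
        }
    }
}
```

Under these assumptions tf increases with the term frequency and idf decreases with the document frequency, yet doubling the input neither doubles tf nor halves idf.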
> Similarity javadocs for scoring function to relate more tightly to scoring
> models in effect
> -------------------------------------------------------------------------------------------
>
> Key: LUCENE-1908
> URL: https://issues.apache.org/jira/browse/LUCENE-1908
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Reporter: Doron Cohen
> Assignee: Doron Cohen
> Priority: Minor
> Fix For: 2.9
>
> Attachments: LUCENE-1908.patch, LUCENE-1908.patch, LUCENE-1908.patch
>
>
> See discussion in the related issue.