[
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516547
]
Doron Cohen commented on LUCENE-965:
------------------------------------
> Is there a way to plug in a patch into my local source repository, so I can
> diff with my favorite diff tool?
patch -p 0 < foo.patch
Try with --dry-run first.
Another convenient way in case you are using Eclipse is the Subclipse plugin
that lets you visually diff patches just before applying them.
> But may I suggest the alternative?
I think you have a valid point here. I too don't understand the proposed
"Axiomatic Retrieval Function" (ARF) in this regard: in Lucene, the norm value
stored for a document (assuming all boosts are 1) is
norm(D) = 1 / sqrt(numTerms(D))
This value is ready to use at scoring time, multiplying it with
tf(t in d) * idf(t)^2
as described in
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html
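To make the two formulas above concrete, here is a minimal sketch of how the stored norm combines with the tf and idf factors for a single term. The class and method names (NormSketch, norm, termScore) are hypothetical and are not Lucene APIs; tf and idf are taken as already-computed inputs, all boosts are assumed to be 1, and the lossy one-byte encoding Lucene applies to stored norms is ignored.

```java
// Sketch only: mirrors the formulas quoted in this comment, not Lucene's classes.
final class NormSketch {

    // norm(D) = 1 / sqrt(numTerms(D)), assuming all boosts are 1.
    static double norm(int numTerms) {
        if (numTerms <= 0) {
            throw new IllegalArgumentException("numTerms must be positive");
        }
        return 1.0 / Math.sqrt(numTerms);
    }

    // Per-term contribution: tf(t in d) * idf(t)^2 * norm(D).
    // tf and idf are passed in precomputed; how they are derived is out of scope.
    static double termScore(double tf, double idf, int numTerms) {
        return tf * idf * idf * norm(numTerms);
    }

    public static void main(String[] args) {
        System.out.println(norm(4));                 // prints 0.5
        System.out.println(termScore(1.0, 2.0, 4));  // prints 2.0
    }
}
```

Note that norm(D) gets smaller as the document gets longer, which matters for the |D| discussion that follows.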
Now, the ARF paper at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf
describes Lucene scoring with |D| in place of norm(D) above, and describes ARF
scoring with |D| as well, which seems to be how this patch implements it, e.g.
in TermScorer. However, the paper defines |D| as "the length of D". I find
this confusing: norm(D) decreases as the document grows, so it is not a length.
Usually "|D|" means the number of words in a document, and "avgdl" means the
average of all |D|'s in the collection (see for instance "Okapi BM25" in
Wikipedia).
Now, your proposed change is something I can understand: it first translates
norm(D) back into Length(D) (ignoring boosts), and only then averages it.
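As an illustration of that order of operations, the sketch below inverts norm(D) = 1 / sqrt(numTerms(D)) to recover a length and only then averages. The names (AvgLengthSketch, lengthFromNorm, averageLength) are hypothetical and do not come from the patch; boosts are assumed to be 1 and the norms are treated as exact doubles rather than the encoded bytes Lucene stores.

```java
// Sketch only: recover document lengths from norms, then average the lengths.
final class AvgLengthSketch {

    // Inverse of norm(D) = 1 / sqrt(numTerms(D)):  Length(D) = 1 / norm(D)^2.
    static double lengthFromNorm(double norm) {
        if (norm <= 0.0) {
            throw new IllegalArgumentException("norm must be positive");
        }
        return 1.0 / (norm * norm);
    }

    // avgdl in the usual sense: the mean of the recovered lengths.
    static double averageLength(double[] norms) {
        if (norms.length == 0) {
            throw new IllegalArgumentException("need at least one norm");
        }
        double sum = 0.0;
        for (double n : norms) {
            sum += lengthFromNorm(n);
        }
        return sum / norms.length;
    }

    public static void main(String[] args) {
        // Norms 0.5 and 0.25 correspond to lengths 4 and 16.
        double[] norms = {0.5, 0.25};
        System.out.println(averageLength(norms)); // prints 10.0
    }
}
```

Averaging the norms first and converting afterwards gives a different number (about 7.1 for the same two documents), which is why the order matters.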
In any case, whether this gets fixed or I am wrong and an explanation shows
why no fix is needed, I have to admit I still don't intuitively understand the
logic behind ARF: why would it be better? I guess demonstrable search quality
results would help persuade... (LUCENE-836 is resolved, btw.)
> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
> Key: LUCENE-965
> URL: https://issues.apache.org/jira/browse/LUCENE-965
> Project: Lucene - Java
> Issue Type: Improvement
> Components: Search
> Affects Versions: 2.2
> Reporter: Hui Fang
> Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art
> retrieval function, to replace the default similarity function in Lucene. We
> compared the performance of these two functions and reported the results at
> http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf.
> The report shows that the performance of the axiomatic retrieval function is
> much better than the default function. The axiomatic retrieval function is
> able to find more relevant documents and users can see more relevant
> documents in the top-ranked documents. Incorporating such a state-of-the-art
> retrieval function could improve the search performance of all the
> applications which were built upon Lucene.
> Most changes related to the implementation are made in AXSimilarity,
> TermScorer and TermQuery.java. However, many test cases are hand coded to
> test whether the implementation of the default function is correct. Thus, I
> also made the modification to many test files to make the new retrieval
> function pass those cases. In fact, we found that some old test cases are not
> reasonable. For example, in the testQueries02 of TestBoolean2.java,
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3
> xx w2 yy w3".
> The second document should be more relevant than the first one, because it
> has more occurrences of the query term "w3". But the original test case would
> require us to rank the first document higher than the second one, which is
> not reasonable.