[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

Doron Cohen (JIRA) Mon, 30 Jul 2007 14:43:22 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516547
 ]


Doron Cohen commented on LUCENE-965:
------------------------------------

> Is there a way to plug in a patch into my local source repository, so I can 
> diff with my favorite diff tool?
: patch -p 0 < foo.patch  

Try with --dry-run first.
Another convenient option, if you are using Eclipse, is the Subclipse plugin, 
which lets you visually diff patches just before applying them.

> But may I suggest the alternative? 

I think you have a valid point here. I too don't understand the proposed 
"Axiomatic Retrieval Function" (ARF) in this regard: in Lucene, the norm value 
stored for a document (assuming all boosts are 1) is
    norm(D) = 1 / sqrt(numTerms(D))
This value is ready to use at scoring time, where it is multiplied by
    tf(t in d) * idf(t)^2
as described in 
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html
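
To make the formula concrete, here is a minimal sketch in plain Java. It does not 
use Lucene's classes; NormSketch and termWeight are hypothetical names. It assumes 
the DefaultSimilarity definitions tf = sqrt(freq) and 
idf = ln(numDocs / (docFreq + 1)) + 1, and it ignores boosts, coord, queryNorm and 
the lossy one-byte encoding of the stored norm.

```java
// Sketch only: plain Java, not Lucene's actual classes.
final class NormSketch {

    // norm(D) = 1 / sqrt(numTerms(D)), assuming all boosts are 1.
    static double norm(int numTerms) {
        return 1.0 / Math.sqrt(numTerms);
    }

    // Assumed term-frequency factor: sqrt(freq).
    static double tf(int freq) {
        return Math.sqrt(freq);
    }

    // Assumed inverse document frequency: ln(numDocs / (docFreq + 1)) + 1.
    static double idf(int docFreq, int numDocs) {
        return Math.log(numDocs / (double) (docFreq + 1)) + 1.0;
    }

    // Per-term contribution: tf(t in d) * idf(t)^2 * norm(D).
    static double termWeight(int freq, int docFreq, int numDocs, int numTerms) {
        double idf = idf(docFreq, numDocs);
        return tf(freq) * idf * idf * norm(numTerms);
    }

    public static void main(String[] args) {
        // Term occurs 4 times in a 100-term document; 9 of 10 documents contain it.
        System.out.println(termWeight(4, 9, 10, 100));
    }
}
```

Note that norm(D) shrinks as the document grows, so it is not interchangeable 
with a document length.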

Now, the ARF paper in http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf 
describes Lucene scoring using |D| in place of norm(D) above, and describes ARF 
scoring using |D| again, which seems to be how this patch implements it, 
e.g. in TermScorer. However, the paper defines |D| as "the length of D". I find 
this confusing, since norm(D) is 1 / sqrt(numTerms(D)) rather than a length. 
Usually "|D|" really means the number of words in a document, and "avgdl" would 
mean the average of all |D|'s in the collection (see for instance "Okapi BM25" 
in Wikipedia). 

Now, your proposed change is something I can understand - it first translates 
norm(D) back into Length(D) (ignoring boosts), and only then averages it. 
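
As I read it, that order of operations could be sketched as follows. This is 
plain Java with hypothetical names (AvgLengthSketch, lengthFromNorm, 
averageLength), not code from the patch. It assumes all boosts are 1, so that 
Length(D) = 1 / norm(D)^2, and it ignores the precision lost when Lucene stores 
the norm in a single byte. The lengthOfAverageNorm method is there only for 
comparison, to show that averaging the norms first gives a different number; it 
is not a claim about what the patch does.

```java
// Sketch only: plain Java, not code from the patch or from Lucene.
final class AvgLengthSketch {

    // Invert norm(D) = 1 / sqrt(numTerms(D)) to recover the length (boosts assumed 1).
    static double lengthFromNorm(double norm) {
        return 1.0 / (norm * norm);
    }

    // Translate each norm back into a length first, then average the lengths.
    static double averageLength(double[] norms) {
        if (norms.length == 0) {
            throw new IllegalArgumentException("need at least one document");
        }
        double sum = 0.0;
        for (double n : norms) {
            sum += lengthFromNorm(n);
        }
        return sum / norms.length;
    }

    // For comparison only: average the norms first, then invert the average.
    static double lengthOfAverageNorm(double[] norms) {
        if (norms.length == 0) {
            throw new IllegalArgumentException("need at least one document");
        }
        double sum = 0.0;
        for (double n : norms) {
            sum += n;
        }
        return lengthFromNorm(sum / norms.length);
    }

    public static void main(String[] args) {
        double[] norms = {0.5, 0.1}; // documents of 4 and 100 terms
        System.out.println(averageLength(norms));       // roughly 52
        System.out.println(lengthOfAverageNorm(norms)); // roughly 11.1
    }
}
```

With documents of 4 and 100 terms the two orders disagree by a wide margin, 
which is why the order matters for anything used as an "avgdl".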

In any case - whether this gets fixed, or I am wrong and an explanation shows 
why no fix is needed - I have to admit I still don't understand the intuition 
behind ARF: why would it be better? I guess demonstrable search quality 
results would help in persuading...  (LUCENE-836 is resolved, btw.)

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art 
> retrieval function, to replace the default similarity function in Lucene. We 
> compared the performance of these two functions and reported the results at 
> http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is 
> much better than the default function. The axiomatic retrieval function is 
> able to find more relevant documents and users can see more relevant 
> documents in the top-ranked documents. Incorporating such a state-of-the-art 
> retrieval function could improve the search performance of all the 
> applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, 
> TermScorer and TermQuery.java.  However, many test cases are hand coded to 
> test whether the implementation of the default function is correct. Thus, I 
> also made the modification to many test files to make the new retrieval 
> function pass those cases. In fact, we found that some old test cases are not 
> reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 
> xx w2 yy w3". 
> The second document should be more relevant than the first one, because it 
> has more occurrences of the query term "w3". But the original test case would 
> require us to rank the first document higher than the second one, which is 
> not reasonable. 

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

