[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2009-11-20 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12780607#action_12780607
 ] 

Grant Ingersoll commented on LUCENE-965:


Hi Hui,

I see you have updated your paper on this. Have you looked at how this might be 
implemented, given the flexible indexing work under way?  

> Implement a state-of-the-art retrieval function in Lucene
> ---------------------------------------------------------
>
>                 Key: LUCENE-965
>                 URL: https://issues.apache.org/jira/browse/LUCENE-965
>             Project: Lucene - Java
>          Issue Type: Improvement
>          Components: Search
>    Affects Versions: 2.2
>            Reporter: Hui Fang
>             Fix For: 3.1
>
>         Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, a state-of-the-art
> retrieval function, to replace the default similarity function in Lucene.
> We compared the performance of the two functions and reported the results at
> http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf.
> The report shows that the axiomatic retrieval function performs much better
> than the default function: it finds more relevant documents, and users see
> more relevant documents among the top-ranked results. Incorporating such a
> state-of-the-art retrieval function could improve the search performance of
> all applications built on Lucene.
> Most of the implementation changes are in AXSimilarity, TermScorer and
> TermQuery.java. However, many test cases are hand-coded to check that the
> default function is implemented correctly, so I also modified many test
> files so that the new retrieval function passes them. In fact, we found that
> some of the old test cases are not reasonable. For example, in testQueries02
> of TestBoolean2.java, the query is "+w3 xx" and there are two documents,
> "w1 xx w2 yy w3" and "w1 w3 xx w2 yy w3".
> The second document should be more relevant than the first because it has
> more occurrences of the query term "w3", but the original test case requires
> the first document to be ranked higher than the second, which is not
> reasonable.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


-
To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: java-dev-h...@lucene.apache.org



[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2009-08-21 Thread Hui Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745995#action_12745995
 ] 

Hui Fang commented on LUCENE-965:
-

Jason,  the problem has been fixed.  Please try again.  Thanks. 






[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2009-08-20 Thread Jason Rutherglen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745617#action_12745617
 ] 

Jason Rutherglen commented on LUCENE-965:
-

The link http://www.ece.udel.edu/~hfang/lucene/lucene-2.9-dev-AX-contrib.tar.gz 
doesn't work?




[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2009-08-20 Thread Hui Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12745608#action_12745608
 ] 

Hui Fang commented on LUCENE-965:
-

Hello everyone, 

We have re-implemented the retrieval functions in a very different way. The 
main differences are: (1) the average document length is no longer computed 
during the retrieval process, as it was in the previous implementation, which 
should make retrieval more efficient; and (2) instead of modifying the 
existing search-related classes, we integrate the new retrieval functions 
through two new classes, AXTermQuery and AXTermScorer, which extend the 
TermQuery and TermScorer classes. I think the current implementation 
addresses most of the concerns raised in this discussion thread. 

The source code and the updated reports for our implementation are now 
available at http://www.ece.udel.edu/~hfang/LuceneAX.html. We have implemented 
two slightly different versions, for lucene-2.4.1 and lucene-2.9-dev. We hope 
that the implementation of the axiomatic retrieval function can be integrated 
into a Lucene release. Please feel free to let me know if you have any 
questions or comments. 

Thanks,
-Hui 




Re: [jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2008-12-11 Thread Grant Ingersoll
I don't think the original authors have followed up on this patch at  
all since first posting.


On Nov 27, 2008, at 6:44 AM, Ian Holsman (JIRA) wrote:

> [ https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651332#action_12651332 ]
>
> Ian Holsman commented on LUCENE-965:
>
> It's a bit late over here, but when I try to apply the patch it
> doesn't seem to have the AXSimilarity class in it.
> Is there a file missing here, or should I not be looking at applying
> patches late at night?

--
Grant Ingersoll

Lucene Helpful Hints:
http://wiki.apache.org/lucene-java/BasicsOfPerformance
http://wiki.apache.org/lucene-java/LuceneFAQ



[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2008-11-27 Thread Ian Holsman (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12651332#action_12651332
 ] 

Ian Holsman commented on LUCENE-965:


It's a bit late over here, but when I try to apply the patch it doesn't seem to 
have the AXSimilarity class in it.
Is there a file missing here, or should I not be looking at applying patches 
late at night?




[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2008-09-13 Thread Mark Miller (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12630820#action_12630820
 ] 

Mark Miller commented on LUCENE-965:


Let's not forget about trying to get avg doc length in by 3.0. If it can be 
done with little to no impact on those who don't use it, it would be really 
cool to have.




[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2008-01-08 Thread Otis Gospodnetic (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12557139#action_12557139
 ] 

Otis Gospodnetic commented on LUCENE-965:
-

Hui - would it be possible to bring this patch up to date, so it's in sync with 
Lucene 2.3?

Mike McCandless & Co. have made so many changes to the Lucene index format, I 
get a feeling that avg. doc. length could also make it into the index format at 
the segment/index level if this patch is revived.





[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-09-22 Thread Hui Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12529725
 ] 

Hui Fang commented on LUCENE-965:
-

Hi Charlie, 

I am sorry for the late reply. I just saw your message. I am not sure why your 
results are different from mine, since your problem setting is the same as 
mine. Did you use any document preprocessing? 

 




[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-30 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516547
 ] 

Doron Cohen commented on LUCENE-965:


> Is there a way to plug in a patch into my local source repository, so I can 
> diff with my favorite diff tool?
: patch -p 0 < foo.patch  

Try with --dry-run first.
Another convenient way in case you are using Eclipse is the Subclipse plugin 
that lets you visually diff patches just before applying them.

> But may I suggest the alternative? 

I think you have a valid point here. I too don't understand the proposed 
"Axiomatic Retrieval Function" (ARF) in this regard: in Lucene, the norm value 
stored for a document (assuming all boosts are 1) is
norm(D) = 1 / sqrt(numTerms(D))
This value is ready to use at scoring time, multiplying it with
tf(t in d) * idf(t)^2
as described in 
http://lucene.zones.apache.org:8080/hudson/job/Lucene-Nightly/javadoc/org/apache/lucene/search/Similarity.html

Now, the ARF paper in http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf 
describes Lucene scoring using |D| in place of norm(D) above, and describes ARF 
scoring using |D| again, the same as it seems to be implemented in this patch 
e.g. in TermScorer. However, the paper defines |D| as "the length of D". I find 
this confusing. Usually "|D|" really means the number of words in a document, 
and "avgdl" would mean the average of all |D|'s in the collection (see for 
instance "Okapi BM25" in Wikipedia). 

Now, your proposed change is something I can understand: it first translates 
norm(D) back into Length(D) (ignoring boosts), and only then averages it. 

In any case, whether this is fixed or I am wrong and an explanation shows why 
no fix is needed, I have to admit I still don't understand the logic behind 
ARF. Intuitively, why would it be better? I guess provable search quality 
results can help in persuading...  (LUCENE-836 is resolved, btw.)




[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-30 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516507
 ] 

Doug Cutting commented on LUCENE-965:
-

> Is there a way to plug in a patch into my local source repository, so I can 
> diff with my favorite diff tool?

patch -p 0 < foo.patch





[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-28 Thread Charlie Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12516185
 ] 

Charlie Zhao commented on LUCENE-965:
-


Regarding the approach to compute avgDL, this patch goes like this: 

+float avgDL=0.0f;
+for (int i=0; i<[...]>D2

|D|/avgD is what we got from the suggested alternative code and D1 < D2

They have totally flipped the relationship between D1 and D2. 

My impression of the Axiomatic Retrieval Function is that it still tries to 
penalize longer documents. So maybe the alternative code is what we need? 

By the same token, |D| != Similarity.decodeNorm(fieldNorms[doc]). 

Note: since we are recovering lengths from the norm, avgDL and DL are not 
equal to their original absolute values, but they suffice for scoring purposes. 

Based on Doug's previous comment, I totally agree that avgDL should be 
pre-computed and cached in the searcher before the rubber meets the road, and 
the cost might be invisible if we warm up the searcher first. Thanks for 
explaining. 

I am not sure where Doron implemented 1 / sqrt((1 - Slope) * Pivot + (Slope) * 
Doclen). Since LUCENE-836 looks like it will be committed soon, I am really 
excited to see which similarity function will prevail in this era.  
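
For reference, the pivoted length normalization quoted above can be written out 
directly. This is only a sketch of the formula as it appears in this comment; 
the class and method names are made up for illustration and are not taken from 
Doron's code or from any Lucene API.

```java
// Hypothetical helper, not part of Lucene: evaluates the pivoted norm
// 1 / sqrt((1 - slope) * pivot + slope * docLen) quoted in the comment.
class PivotedNorm {

    // slope in [0, 1]; pivot is typically the average document length.
    static float pivotedNorm(float slope, float pivot, float docLen) {
        double denominator = (1.0 - slope) * pivot + slope * docLen;
        return (float) (1.0 / Math.sqrt(denominator));
    }
}
```

With slope = 1 this reduces to 1/sqrt(docLen), and with slope = 0 every document 
gets the same norm 1/sqrt(pivot), so the slope controls how strongly longer 
documents are penalized.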

BTW, would anyone like to share how to read Lucene patches more efficiently? I 
have a hard time making sense of the +s and -s in isolation from their source 
files. Is there a way to apply a patch to my local source tree, so I can diff 
with my favorite diff tool? Thanks in advance. 




[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-26 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515854
 ] 

Michael Busch commented on LUCENE-965:
--

> I guess I would not be in favor of a special term, I would rather see it 
> integrated into the file format somehow. Special terms get deleted, misused, 
> etc. 

Well yes, I would also prefer to have real per-doc payloads in the file format, 
but until we have that we can use this workaround to try things out, as the 
performance should be comparable to real per-doc payloads.




[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-26 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515850
 ] 

Doug Cutting commented on LUCENE-965:
-

> Did I miss something?

What I meant is that the loops added by this patch to compute average document 
length per query term could be more efficiently computed once per field in a 
searcher.  They could thus be cached in, e.g., a 
WeakHashMap>.

The cost of computing these is proportional to the size of the norms, which 
means that it is proportional to the cost of reading the norms.  Computing them 
on demand when a searcher is opened would not be as fast as pre-computing them, 
but it might not be prohibitively slow either, and would be simple to implement 
without other changes to Lucene.

By "average norm" I guess I really meant "easily computable from norms".  This 
may not always be possible, since, e.g., with boosting, document lengths may 
not be recoverable from the norms.  But, in many cases, it might suffice.

Does that help?
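
A minimal sketch of the caching idea described above, assuming a WeakHashMap 
keyed by the reader with one cached average per field. The type parameters, the 
class name AvgLengthCache and the LengthSource callback are all assumptions 
made for illustration; none of them come from the patch or from Lucene itself.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.WeakHashMap;

// Hypothetical per-reader cache of average field lengths (not Lucene code).
class AvgLengthCache {

    // Hypothetical callback that does the expensive pass over the norms.
    interface LengthSource {
        float computeAverageLength(String field);
    }

    // Weak keys let an entry disappear once its reader is garbage collected.
    private final Map<Object, Map<String, Float>> cache =
            new WeakHashMap<Object, Map<String, Float>>();

    // Returns the cached average for (reader, field), computing it at most once.
    synchronized float get(Object reader, String field, LengthSource source) {
        Map<String, Float> perField = cache.get(reader);
        if (perField == null) {
            perField = new HashMap<String, Float>();
            cache.put(reader, perField);
        }
        Float avg = perField.get(field);
        if (avg == null) {
            avg = Float.valueOf(source.computeAverageLength(field));
            perField.put(field, avg);
        }
        return avg.floatValue();
    }
}
```

The first query against a field pays for one pass over the norms; later queries 
on the same reader only do two map lookups, which is why warming the searcher 
would hide most of the cost.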




[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-26 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515846
 ] 

Grant Ingersoll commented on LUCENE-965:


I guess I would not be in favor of a special term, I would rather see it 
integrated into the file format somehow.  Special terms get deleted, misused, 
etc.  Plus the avg. doc length is going to be something that is going to need 
to be updated frequently, right? 

Since we are talking 3.x of Lucene fairly soon anyway (assuming the JDK 1.5 
vote passes), this would allow us to make the file format change as well, as 
long as we can still read prior versions.

Charlie, as for your question about what users value in Lucene, speed or recall 
and precision, the answer is both.  :-)  Some people care more about speed 
while others care about p/r.  I think most people that use Lucene have the 
feeling that the results are good enough in production environments and that we 
don't always worry about getting every last bit out of TREC (especially since 
we can't, as a group, test against TREC).  That being said, I would bet most 
users would be willing to trade off a few percentage points of speed in 
exchange for the kind of MAP improvements we are talking about here.  Especially 
since we can probably eventually figure out a way to make it as fast anyway, or 
at least find other things we can speed up.

Correct me if I am wrong, but there are other IR strategies that can use the 
avg. doc. length, too, right?  So, not to sidetrack too much, but if we do this 
right, maybe we can also open up the door to other scoring strategies as well 
without much downside.  Just something to consider.






[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-26 Thread Michael Busch (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515840
 ] 

Michael Busch commented on LUCENE-965:
--

> Can we store the "document length" (with multiple fields) and "average 
> document length" as the payload data at document level and index level 
> respectively? The current payload is designed at term level, is that right? 
> If we want to store something at document and index level, do we necessarily 
> have to change the Lucene file format? 

You are right, currently we can only store payloads per term occurrence, not at 
the doc level. However, it is possible to simply add a special term to every 
document that has only one occurrence with a payload; then you have one payload 
per doc.

Coincidentally, I am currently testing how much search performance would suffer 
if we stored the norms as payloads in the posting lists. At search time we would 
then not buffer the norms but read them on demand from the prx file. This is 
probably somewhat slower than buffering the norms, but has a lot of advantages, 
such as much simpler code and less memory consumption by the IndexReader. Since 
all norms are then stored in a single posting list, I'm hoping that the FS cache 
will keep search performance from suffering too much, and multi-level skipping 
should help too. Well, let's see: I'm currently building an index with norms as 
payloads, and I should have some numbers tonight or tomorrow.




[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-26 Thread Charlie Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515833
 ] 

Charlie Zhao commented on LUCENE-965:
-



Document length and average document length are something of a speed bottleneck 
in Lucene implementations of some IR models, such as the axiomatic retrieval 
function we just saw and a language model I have implemented in Lucene. I say 
speed rather than performance because, in my experiments, Lucene's performance 
in the sense of recall and precision is relatively low compared with other IR 
models. And since early Lucene, the kernel of the similarity measure has never 
been updated. Do general users value recall and precision more than speed? 

How can we conveniently store and retrieve "field length", "document length", 
"average document length", etc.? Could they be payload data at the document 
level and index level, so that we can say goodbye to the corresponding overhead 
at query time? 

I used to rely on TermFreqVector's getTermFrequencies() to obtain the field 
length (size() only returns the number of unique terms). But should I instead 
just convert the field's norm value back to its length as (1/norm)^2? That 
might be faster. Can someone confirm this?
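
The (1/norm)^2 idea can be sketched as follows, under the assumption that the 
norm was produced by the default length normalization, norm = 1/sqrt(numTerms), 
with no field or document boost applied. Stored norms are also compressed to a 
single byte, so a recovered length is only approximate in practice. The class 
and method names are hypothetical and nothing here calls Lucene.

```java
// Hypothetical helper (not Lucene code): inverts norm = 1 / sqrt(numTerms).
class NormLength {

    // Forward direction, assuming the default length norm and no boosts.
    static float normFromLength(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Inverse direction: approximate field length recovered from a norm.
    static float lengthFromNorm(float norm) {
        return 1.0f / (norm * norm);
    }

    // Averages the recovered lengths; note this is not the same as
    // inverting the average norm, because the mapping is non-linear.
    static float averageLength(float[] norms) {
        if (norms.length == 0) {
            return 0.0f;
        }
        double sum = 0.0;
        for (float norm : norms) {
            sum += lengthFromNorm(norm);
        }
        return (float) (sum / norms.length);
    }
}
```

For example, norms of 0.5 and 0.25 correspond to lengths 4 and 16, whose average 
is 10, whereas inverting the average norm (0.375) would give about 7.1. That gap 
is why each length has to be recovered before averaging.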

BTW, I need help understanding the claim of "a small constant factor to the 
cost of reading them." in Doug's comment. The average norm does not give us the 
average field length. We need to recover the individual field lengths to get 
the average field length, which involves a great deal of floating point 
operations. Did I miss something?

Can we store the "document length" (with multiple fields) and "average document 
length" as the payload data at document level and index level respectively? The 
current payload is designed at term level, is that right? If we want to store 
something at document and index level, do we necessarily have to change the 
Lucene file format? 








[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-26 Thread Doug Cutting (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515803
 ] 

Doug Cutting commented on LUCENE-965:
-

> It does seem like calculating the average field length at index time should 
> be relatively cheap. 

Yes, it should.  But if average norm suffices, that can be computed on demand 
and cached in the Searcher without significantly impacting performance.  The 
norms need to be read anyway, and averaging them adds only a small constant 
factor to the cost of reading them.





[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-26 Thread Charlie Zhao (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515783
 ] 

Charlie Zhao commented on LUCENE-965:
-

Hello Hui:

Thank you for contributing your axiomatic retrieval function to Lucene. I cannot 
wait for the test drive.

Would you please report the settings for your experiment on

Collection  Function        MAP    P5    P10   P20   P100  NumRR
ROBUST04    Lucene Default  0.048  0.12  0.09  0.08  0.05  21

since these numbers differ considerably from mine:

num_q 249
num_ret 239436
num_rel 17412
num_rel_ret 9780
map 0.2076
gm_ap 0.1049
R-prec 0.2551
bpref 0.2189
recip_rank 0.5684
ircl_prn.0.00 0.6288
ircl_prn.0.10 0.4459
ircl_prn.0.20 0.3562
ircl_prn.0.30 0.2864
ircl_prn.0.40 0.2289
ircl_prn.0.50 0.1925
ircl_prn.0.60 0.145
ircl_prn.0.70 0.1062
ircl_prn.0.80 0.0702
ircl_prn.0.90 0.0461
ircl_prn.1.00 0.0261
P5 0.3944
P10 0.3598
P15 0.3307
P20 0.307
P30 0.2657
P100 0.1618
P200 0.1117
P500 0.0635
P1000 0.0393

Before we go further, let us make sure we are on the same page.

Here is my setting:

Data: TREC Disk 4 & 5; 528,155 documents; 1,904 MB of text.

Query Number: TREC Query Number 301-700

Query Field: title only

IR Engine: Lucene 2.0 (needs double-checking, but close :)

Note: default Lucene similarity function, using title words only.

If we are on the same page, then a MAP score of 0.048 is terribly low for 
301-700, compared with 0.2076 in my run.

Still, your axiomatic retrieval function outperformed the default in many other 
aspects. So if you would like to check your experimental setting, and if my 
result is closer to the real default, then we might discover a further 
improvement with the axiomatic retrieval function. That is my hope. 

Charlie Zhao





[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-26 Thread Yonik Seeley (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515773
 ] 

Yonik Seeley commented on LUCENE-965:
-

It does seem like calculating the average field length at index time should be 
relatively cheap.
I haven't seen the Similarity implementation, but the axiomatic 
TermScorer.score() will be somewhat slower than Lucene's due to the necessary 
division (all but one can be turned into a multiply I think).
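
To illustrate the point about divisions, here is a sketch that assumes the 
per-term score has the F2-EXP shape from the axiomatic retrieval literature, 
idfPart * tf / (tf + s + s * docLen / avgDocLen). The patch itself is not 
reproduced in this thread, so the formula, the parameter names and the class 
name are assumptions rather than the patch's actual TermScorer code.

```java
// Hypothetical scoring helper (not the patch's TermScorer).
class AxiomaticTermScore {

    // Straightforward form: two divisions per scored document.
    static float scoreWithDivisions(float tf, float docLen, float avgDocLen,
                                    float s, float idfPart) {
        return idfPart * tf / (tf + s + s * docLen / avgDocLen);
    }

    // Same score with s / avgDocLen precomputed once per query term,
    // leaving a single division in the per-document hot path.
    static float scoreOneDivision(float tf, float docLen, float sOverAvgDocLen,
                                  float s, float idfPart) {
        return idfPart * tf / (tf + s + docLen * sOverAvgDocLen);
    }
}
```

A caller would compute sOverAvgDocLen = s / avgDocLen when the scorer is created 
and reuse it for every document. The two methods agree up to floating point 
rounding.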






Re: [jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-26 Thread crspan

Hello Hui:

Thank you for contributing your axiomatic retrieval function to Lucene. 
I cannot wait for the test drive :)


Would you please report the setting for your experiment on

Collection  Function        MAP    P@5   P@10  P@20  P@100  NumRR
ROBUST04    Lucene Default  0.048  0.12  0.09  0.08  0.05   21

since these numbers differ considerably from mine:

num_q   249
num_ret 239436
num_rel 17412
num_rel_ret 9780
map 0.2076
gm_ap   0.1049
R-prec  0.2551
bpref   0.2189
recip_rank  0.5684
ircl_prn.0.00   0.6288
ircl_prn.0.10   0.4459
ircl_prn.0.20   0.3562
ircl_prn.0.30   0.2864
ircl_prn.0.40   0.2289
ircl_prn.0.50   0.1925
ircl_prn.0.60   0.145
ircl_prn.0.70   0.1062
ircl_prn.0.80   0.0702
ircl_prn.0.90   0.0461
ircl_prn.1.00   0.0261
P5  0.3944
P10 0.3598
P15 0.3307
P20 0.307
P30 0.2657
P100    0.1618
P200    0.1117
P500    0.0635
P1000   0.0393



Before I go further, let us make sure we are on the same page.

Here is my setting:

Data: TREC Disk 4 & 5; 528,155 documents; 1,904 MB of text.
Query Number: TREC Query Number 301-700
Query Field: title only
IR Engine: Lucene 2.0 (needs double-checking, but close :)
Note: default Lucene similarity function, using title words only.

If we are on the same page, then a MAP score of 0.048 is terribly low for 
301-700, compared with 0.2076 in my run.


Still, your axiomatic retrieval function outperformed the default in many 
other aspects. So if you would like to check your experimental setting, 
and if my result is closer to the real default, then we might 
discover a further improvement with the axiomatic retrieval function. 
That is my hope.


Charlie Zhao




  

Implement a state-of-the-art retrieval function in Lucene
-

Key: LUCENE-965
URL: https://issues.apache.org/jira/browse/LUCENE-965
Project: Lucene - Java
 Issue Type: Improvement
 Components: Search
   Affects Versions: 2.2
   Reporter: Hui Fang
Attachments: axiomaticFunction.patch


We implemented the axiomatic retrieval function, which is a state-of-the-art retrieval function, to 
replace the default similarity function in Lucene. We compared the performance of these two functions and reported the results at http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
The report shows that the performance of the axiomatic retrieval function is much better than the default function. The axiomatic retrieval function is able to find more relevant documents and users can see more relevant documents in the top-ranked documents. Incorporating such a state-of-the-art retrieval function could improve the search performance of all the applications which were built upon Lucene. 








[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-25 Thread Hui Fang (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515514
 ] 

Hui Fang commented on LUCENE-965:
-

Hi Grant and Doron, 

Thank you very much for your comments! They are very useful. I agree that it 
would be interesting to evaluate this in the context of LUCENE-836, which is a 
very nice idea. Actually, my advisor and I also discussed putting 
some evaluation scripts in Lucene so that others could easily evaluate 
retrieval performance. I hope that LUCENE-836 will be finalized soon; please 
let me know if there is anything I can help with. 

Regarding speed, the axiomatic retrieval function should have the same 
computational complexity as the default function if we compute the 
average document length at indexing time instead of search time.  As Doron 
pointed out, my current implementation is not optimal. I will fix this problem 
and the other svn-related problems as soon as possible, and submit a new patch. 
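
The indexing-time idea can be sketched as follows. This is an illustration only, not code from the patch: the class and function names are hypothetical, and Python stands in for Lucene's Java. The point is that two running totals updated per indexed document make the average length a constant-time lookup at search time, whereas recomputing it on demand rescans every document.

```python
class LengthStats:
    """Running totals maintained while documents are indexed (hypothetical helper)."""

    def __init__(self):
        self.num_docs = 0
        self.total_terms = 0

    def add_document(self, tokens):
        # Called once per document at indexing time: O(1) bookkeeping.
        self.num_docs += 1
        self.total_terms += len(tokens)

    def average_length(self):
        # Called at search time: a single division, no scan over the index.
        return self.total_terms / self.num_docs if self.num_docs else 0.0


def average_length_at_search_time(docs):
    """The costly alternative: walk over every document each time it is needed."""
    return sum(len(d) for d in docs) / len(docs) if docs else 0.0


# Both routes give the same value; only the cost per query differs.
corpus = [["a", "b", "c"], ["a"], ["b", "c"]]
stats = LengthStats()
for doc in corpus:
    stats.add_document(doc)
```

A real index would also have to adjust the totals when documents are deleted or segments are merged, which this sketch ignores.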

Thanks,
-Hui

 

> Implement a state-of-the-art retrieval function in Lucene
> -
>
> Key: LUCENE-965
> URL: https://issues.apache.org/jira/browse/LUCENE-965
> Project: Lucene - Java
>  Issue Type: Improvement
>  Components: Search
>Affects Versions: 2.2
>Reporter: Hui Fang
> Attachments: axiomaticFunction.patch
>
>
> We implemented the axiomatic retrieval function, which is a state-of-the-art 
> retrieval function, to 
> replace the default similarity function in Lucene. We compared the 
> performance of these two functions and reported the results at 
> http://sifaka.cs.uiuc.edu/hfang/lucene/Lucene_exp.pdf. 
> The report shows that the performance of the axiomatic retrieval function is 
> much better than the default function. The axiomatic retrieval function is 
> able to find more relevant documents and users can see more relevant 
> documents in the top-ranked documents. Incorporating such a state-of-the-art 
> retrieval function could improve the search performance of all the 
> applications which were built upon Lucene. 
> Most changes related to the implementation are made in AXSimilarity, 
> TermScorer and TermQuery.java.  However, many test cases are hand coded to 
> test whether the implementation of the default function is correct. Thus, I 
> also made the modification to many test files to make the new retrieval 
> function pass those cases. In fact, we found that some old test cases are not 
> reasonable. For example, in the testQueries02 of TestBoolean2.java, 
> the query is "+w3 xx", and we have two documents "w1 xx w2 yy w3" and "w1 w3 
> xx w2 yy w3". 
> The second document should be more relevant than the first one, because it 
> has more 
> occurrences of the query term "w3". But the original test case would require 
> us to rank 
> the first document higher than the second one, which is not reasonable. 
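
The testQueries02 example quoted above can be checked by counting. The sketch below tallies the query terms in the two documents and, as an assumption about the default similarity, also computes the square root of the term frequency and the reciprocal square root of the document length. It deliberately leaves out idf, the coordination factor, query normalization and the lossy encoding of norms, so it shows the two opposing pressures rather than reproducing Lucene's actual ranking. All names are illustrative.

```python
import math
from collections import Counter

QUERY_TERMS = ["w3", "xx"]
DOCS = {
    "doc1": "w1 xx w2 yy w3".split(),
    "doc2": "w1 w3 xx w2 yy w3".split(),
}


def term_stats(tokens, terms):
    """Return ({term: frequency}, document length in tokens)."""
    counts = Counter(tokens)
    return {t: counts[t] for t in terms}, len(tokens)


def tf_and_length_factors(tokens, terms):
    """Assumed simplified factors: sqrt(tf) per term and 1/sqrt(length)."""
    freqs, length = term_stats(tokens, terms)
    tf = {t: math.sqrt(f) for t, f in freqs.items()}
    return tf, 1.0 / math.sqrt(length)


for name, tokens in DOCS.items():
    freqs, length = term_stats(tokens, QUERY_TERMS)
    tf, norm = tf_and_length_factors(tokens, QUERY_TERMS)
    print(name, freqs, "length =", length, "tf =", tf, "length factor =", round(norm, 3))
```

The second document has two occurrences of "w3" but is one token longer, so its term-frequency factor goes up while its length factor goes down. Which effect dominates depends on the parts of the scoring formula omitted here.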




[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-25 Thread Doron Cohen (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515485
 ] 

Doron Cohen commented on LUCENE-965:


Thanks for contributing this, Hui Fang! Very interesting.
I agree with Grant that we should be able to assess this in the context of 
LUCENE-836 - I hope to finalize it pretty soon. 
I looked into the patch and read the short paper referenced, and I have a few 
comments:

1) Interestingly, this too makes use of the average document length, as 
discussed in http://www.gossamer-threads.com/lists/lucene/java-dev/50537
2) The current patch seems outdated compared to trunk and also contains many 
changes that are not part of the proposed improvement. You need to run "svn 
update" to update with trunk (but do "svn stat -u" beforehand to see what is 
going to be updated and that there are no conflicts, and back up your code 
before that just in case...)
3) The AXSimilarity class itself is not included in the patch (note that 
you need to "svn add" the new files in order for "svn diff" to include these 
new files in the patch).
4) On first reading of the patch, it seems that the average length is computed 
at search time for each scored term... right? This is good enough for the 
evaluation of this Similarity function, but for a running solution a 
better-performing method would be required, like the one Hoss suggested in 
http://www.gossamer-threads.com/lists/lucene/java-dev/5053






[jira] Commented: (LUCENE-965) Implement a state-of-the-art retrieval function in Lucene

2007-07-25 Thread Grant Ingersoll (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-965?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12515482
 ] 

Grant Ingersoll commented on LUCENE-965:


What do people make of this?  Interesting claims.  I haven't looked at the 
patch yet or read up on the Axiomatic retrieval model, but the precision 
numbers in the report are impressive.  I think it dovetails nicely with Doron 
and Chris' discussions on retrieval performance and making better efforts to 
gauge Lucene's retrieval effectiveness.  These numbers are for TREC and that 
doesn't always correlate to the real world, but still, not to be discounted, 
either.

I think it would be cool to see a couple things out of this (at least):
1. contrib/benchmark algorithms to be applied for before and after, including 
LUCENE-836.  This would give everyone a way of easily evaluating (assuming they 
have TREC data).  I would wait for 836 to be committed, though, as it is not 
final yet.
2. Search speed numbers comparing the two approaches.  That is, if it is 
significantly slower, then it probably isn't going to be the default way of 
doing things.

My gut reaction, if everything checks out of course, is to see how to 
factor it in as a separate querying mechanism, if possible like the Spans 
functionality, to give people the option of using this. If the claims hold 
up in the wild and feedback is positive, then we could look at making it the 
default approach.  Not sure how feasible this is, though.

In the meantime, looks like I've got some reading to do...

Cheers,
Grant

