[ 
https://issues.apache.org/jira/browse/LUCENE-5847?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Hadas Raviv updated LUCENE-5847:
--------------------------------

    Description: 
The current implementation of language models in Lucene is based on the paper
"A Study of Smoothing Methods for Language Models Applied to Ad Hoc Information
Retrieval" by Zhai and Lafferty ('01). Specifically, LMDirichletSimilarity and
LMJelinekMercerSimilarity use a normalized smoothed score for a matching term
in a document, as suggested in that paper.

However, Lucene doesn't assign a score to query terms that do not appear in a
matched document. According to the "pure" LM approach, these terms should be
assigned a "background score" derived from their collection probability. With
Jelinek-Mercer smoothing, the final result list produced by Lucene is
rank-equivalent to the one a full LM implementation would produce. This is not
the case for Dirichlet smoothing, because there the background score is
document-dependent. Documents in which not all query terms appear are missing
the document-dependent background score for the missing terms, and this
component affects the final ranking of documents in the list.
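
To make the difference concrete, here is a minimal sketch, assuming the
rank-equivalent Dirichlet form from the paper, in which every query term
contributes log(1 + tf / (mu * p(t|C))) + log(mu / (|d| + mu)). For a term with
tf = 0 the first part vanishes, but the second part still depends on the
document length |d|. The class name, mu = 2000, the collection probabilities
and the two documents are illustrative assumptions, not taken from the patch or
from Lucene's code. The document-independent sum of log p(t|C) is left out
because it cannot change the ranking, and any clamping or boosting a real
similarity applies is ignored.

```java
final class DirichletBackgroundSketch {
    // Smoothing parameter; 2000 is an assumed value for this sketch.
    static final double MU = 2000.0;

    // Sums the per-term score over query terms that occur in the document only,
    // i.e. terms with tf > 0. Missing terms contribute nothing.
    static double matchedTermsOnly(int[] tf, double[] collectionProb, long docLen) {
        double score = 0.0;
        for (int i = 0; i < tf.length; i++) {
            if (tf[i] > 0) {
                score += Math.log(1.0 + tf[i] / (MU * collectionProb[i]))
                       + Math.log(MU / (docLen + MU));
            }
        }
        return score;
    }

    // Sums the per-term score over ALL query terms. For a missing term the
    // first logarithm is log(1) = 0, so only the length-dependent part remains.
    static double fullLanguageModel(int[] tf, double[] collectionProb, long docLen) {
        double score = 0.0;
        for (int i = 0; i < tf.length; i++) {
            score += Math.log(1.0 + tf[i] / (MU * collectionProb[i]))
                   + Math.log(MU / (docLen + MU));
        }
        return score;
    }

    public static void main(String[] args) {
        // Two-term query; both documents contain only the first term.
        double[] p = {0.001, 0.001};
        int[] tfA = {6, 0};   // document A: length 2000
        int[] tfB = {2, 0};   // document B: length 500

        System.out.printf("matched only: A=%.4f B=%.4f%n",
                matchedTermsOnly(tfA, p, 2000), matchedTermsOnly(tfB, p, 500));
        System.out.printf("full LM:      A=%.4f B=%.4f%n",
                fullLanguageModel(tfA, p, 2000), fullLanguageModel(tfB, p, 500));
    }
}
```

With these numbers the matched-terms-only score ranks A above B, while the full
score ranks B above A, because the longer document A pays a larger length
penalty for the missing second term. Under Jelinek-Mercer smoothing the
corresponding missing-term contribution is log(lambda * p(t|C)), which is the
same for every document, so the order would not change.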

Since LM is a baseline method in many works in the IR research field, I attach
a patch that implements a full LM in Lucene. The basic issue to address is
assigning each document a score that depends on *all* the query terms, the
collection statistics, and the document length. The general idea is to add a
new getBackGroundScore(int docID) method to similarity, scorer and bulkScorer.
Then, where a collector assigns a score to a document
(score = scorer.score()), I add the background score
(score = scorer.score() + scorer.background(doc)), which is assigned by the
similarity class used for ranking.
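
The collector-side change described above could look roughly like the
following sketch. The SketchScorer interface and SketchCollector class are
hypothetical stand-ins written for illustration. They are not Lucene's Scorer
or Collector classes and not the code in the attached patch. Only the
expression score() + background(doc) comes from the description.

```java
import java.util.ArrayList;
import java.util.List;

final class BackgroundCollectorSketch {
    // Hypothetical scorer: score() covers the matching terms of the current
    // document, background(docId) covers the query terms the document lacks.
    interface SketchScorer {
        float score();
        float background(int docId);
    }

    // A collected hit: document id plus its combined score.
    static final class ScoredDoc {
        final int docId;
        final float score;

        ScoredDoc(int docId, float score) {
            this.docId = docId;
            this.score = score;
        }
    }

    // Hypothetical collector that adds the background score to every hit.
    static final class SketchCollector {
        private final List<ScoredDoc> hits = new ArrayList<>();
        private SketchScorer scorer;

        void setScorer(SketchScorer scorer) {
            this.scorer = scorer;
        }

        void collect(int docId) {
            // Previously: score = scorer.score()
            float score = scorer.score() + scorer.background(docId);
            hits.add(new ScoredDoc(docId, score));
        }

        List<ScoredDoc> hits() {
            return hits;
        }
    }
}
```

Keeping the background term in a separate method, rather than folding it into
score(), leaves the existing per-term scoring path untouched and lets
similarities that have no document-dependent background simply return zero.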

The patch also corrects the document length so that it is the real document
length and not the encoded one. This is required for the full LM
implementation.



> Improved implementation of language models in lucene 
> -----------------------------------------------------
>
>                 Key: LUCENE-5847
>                 URL: https://issues.apache.org/jira/browse/LUCENE-5847
>             Project: Lucene - Core
>          Issue Type: Improvement
>          Components: core/search
>            Reporter: Hadas Raviv
>            Priority: Minor
>             Fix For: 5.0
>
>         Attachments: LUCENE-2507.patch


