[
https://issues.apache.org/jira/browse/LUCENE-9537?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17201089#comment-17201089
]
Cameron VandenBerg commented on LUCENE-9537:
--------------------------------------------
Hi Adrien,
Unfortunately, the smoothing score that we use is document specific, so I am
not sure if I could make it "transferable". I am definitely interested in
brainstorming ways that we can make Indri fit into the Lucene architecture
better though. Perhaps an example of how Indri smoothing scores work would be
helpful.
Suppose we have an index with 4 documents (so sorry for the political nature
of the documents... it's just what I can easily think of at the moment):
1) Donald Trump is the president of the United States.
2) There are three branches of government. The president is the head of the
executive branch.
3) Jane Doe is president of the PTO.
4) Trump was elected in the 2016 election.
Say that the query is: President Trump.
In this index, the term president occurs more often than the term Trump. The
smoothing score acts like an idf for the query terms, so that documents with
just the term Trump will be ranked higher than documents with just the term
president.
Consider Documents 3 and 4, which have the same length and each contain one
query term, but Document 4 contains the rarer one. Document 3 lacks the term
Trump and Document 4 lacks the term president, so each gets a smoothing score
for its missing term. Because Trump is rarer in the index, the smoothing score
for Trump in Document 3 will be lower than the smoothing score for president
in Document 4. Adding the smoothing scores for the missing terms therefore
allows Document 4 to get a higher total score and be ranked above Document 3.
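To make the comparison concrete, here is a minimal Python sketch of the example
above. It assumes the textbook Dirichlet-smoothed query-likelihood formula,
log((tf + mu * P(t|C)) / (|d| + mu)), a simple lowercase tokenizer, and an
arbitrary mu of 2000. The function names are hypothetical and this is not the
code in the attached patch, only an illustration of the idea.

```python
import math
import re
from collections import Counter

# The four example documents from this comment.
DOCS = {
    1: "Donald Trump is the president of the United States.",
    2: "There are three branches of government. The president is the head of the executive branch.",
    3: "Jane Doe is president of the PTO.",
    4: "Trump was elected in the 2016 election.",
}
MU = 2000.0  # assumed smoothing parameter, chosen only for illustration


def tokenize(text):
    # Assumed analysis chain: lowercase, split on non-alphanumerics.
    return re.findall(r"[a-z0-9]+", text.lower())


TOKENS = {doc_id: tokenize(text) for doc_id, text in DOCS.items()}
COLLECTION = Counter(t for toks in TOKENS.values() for t in toks)
COLLECTION_LEN = sum(COLLECTION.values())


def term_score(term, doc_id, mu=MU):
    # Dirichlet-smoothed log probability of `term` in the document.
    # When tf == 0 this is the "smoothing score": it depends on the
    # document length and on how common the term is in the collection.
    toks = TOKENS[doc_id]
    tf = toks.count(term)
    p_collection = COLLECTION[term] / COLLECTION_LEN
    return math.log((tf + mu * p_collection) / (len(toks) + mu))


def doc_score(query, doc_id, mu=MU):
    # Sum over all query terms, including the ones the document lacks.
    return sum(term_score(t, doc_id, mu) for t in tokenize(query))


def rank(query, mu=MU):
    # Document ids ordered from highest to lowest score.
    return sorted(DOCS, key=lambda d: doc_score(query, d, mu), reverse=True)
```

Under these assumptions, term_score("trump", 3) is lower than
term_score("president", 4), and rank("President Trump") places Document 4
above Document 3, since the match on the rarer term outweighs the difference
in smoothing scores.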
Let me know whether this example makes sense. Can you see a way that I can
refactor the smoothing score so that it better fits into Lucene's existing
architecture? Or let me know if I misunderstood your comment and you still
feel that what you suggested will work.
Thank you!
> Add Indri Search Engine Functionality to Lucene
> -----------------------------------------------
>
> Key: LUCENE-9537
> URL: https://issues.apache.org/jira/browse/LUCENE-9537
> Project: Lucene - Core
> Issue Type: Improvement
> Components: core/search
> Reporter: Cameron VandenBerg
> Priority: Major
> Labels: patch
> Attachments: LUCENE-INDRI.patch
>
>
> Indri ([http://lemurproject.org/indri.php]) is an academic search engine
> developed by The University of Massachusetts and Carnegie Mellon University.
> The major difference between Lucene and Indri is that Indri will give a
> "smoothing score" to a document that does not contain the search
> term, which has improved the search ranking accuracy in our experiments. I
> have created an Indri patch, which adds the search code needed to implement
> the Indri AND logic as well as Indri's implementation of Dirichlet Smoothing.