[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013944#comment-13013944 ]
Robert Muir commented on LUCENE-2959:
-------------------------------------
{quote}
I think the main point would be to make the addition of a new ranking function
as easy as possible. At least a prototype implementation should be very
straightforward, even at the expense of performance. Then, if the new method
provides good results, the developer can go on to the lower level to squeeze
more juice out of it. It's hard for me to discuss this without knowing the
code, of course, but do you think it is possible?
{quote}
This sounds great! For example, you could extend the low-level API, gather
every statistic that Lucene has, and present a high-level API that looks more
like Terrier's scoring API (which I'm guessing is what researchers would
prefer?), where they basically implement the scoring in one method with all
the stats available.
Someone could then extend this high-level API for prototyping, which would
make it easier to experiment.
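To make the idea concrete, here is a rough sketch of what such a high-level
API might look like. It is written in Python only for brevity and is not
Lucene or Terrier code: TermStats, Similarity, and all field names are
hypothetical, and the BM25 subclass uses one common non-negative IDF variant
with the usual default parameters (k1 = 1.2, b = 0.75) purely as an example
of a ranking function implemented in a single method.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class TermStats:
    """Statistics handed to the scorer for one term in one document's field.

    All names here are hypothetical; they only illustrate the idea of
    gathering every available statistic in one place.
    """
    tf: float                # term frequency within the document's field
    doc_freq: int            # number of documents whose field contains the term
    num_docs: int            # number of documents in the index
    field_length: int        # length of this document's field, in tokens
    avg_field_length: float  # average length of this field across the index


class Similarity:
    """Hypothetical high-level API: a ranking function is a single method."""

    def score(self, stats: TermStats) -> float:
        raise NotImplementedError


class BM25(Similarity):
    """Prototype BM25, written for clarity rather than speed."""

    def __init__(self, k1: float = 1.2, b: float = 0.75):
        self.k1 = k1
        self.b = b

    def score(self, stats: TermStats) -> float:
        # One common IDF variant that never goes negative.
        idf = math.log(
            1.0 + (stats.num_docs - stats.doc_freq + 0.5) / (stats.doc_freq + 0.5)
        )
        # Length normalization against the average field length.
        norm = 1.0 - self.b + self.b * stats.field_length / stats.avg_field_length
        # Saturating term-frequency component.
        return idf * stats.tf * (self.k1 + 1.0) / (stats.tf + self.k1 * norm)
```

A researcher would then only subclass Similarity and override score(); once
the results look good, the same formula could be moved down to the low-level
API for speed.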
{quote}
I think I will follow your advice and concentrate on how to make BM25F fast.
{quote}
Actually, BM25F presents a few challenges (some already discussed on
LUCENE-2091).
To summarize:
* For any field, Lucene has a per-field terms dictionary that contains that
term's docFreq. Computing BM25F's IDF would be challenging, because it wants a
docFreq "across all the fields". It's not clear to me at a glance from the
original paper whether this should be across only the fields in the query or
across all the fields in the document, or whether a "static" schema is implied
in this scoring system (in Lucene, document 1 can have 3 fields and document 2
can have 40 different ones, even with different properties).
* The same issue applies to length normalization: Lucene has a "field length"
but really no concept of document length.
So I just wanted to mention that while it's possible here to apply a per-field
TF boost before the non-linear TF saturation, it's not immediately clear how to
adapt the BM25F formula to Lucene: how to combine these scores without using a
(wasteful) "catch-all field" and some lying behind the scenes to force this
catch-all field's length normalization and docFreq to be used.
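As a rough illustration of "boost before saturation", here is a sketch of one
commonly cited BM25F formulation with per-field length normalization. It is
Python pseudocode for the formula, not a proposal for Lucene's API: the
function name and all parameters are hypothetical. Note that doc_freq is
simply passed in, because where a cross-field docFreq would come from is
exactly the open question.

```python
import math


def bm25f_term_score(field_tfs, field_lengths, avg_field_lengths, weights, b,
                     doc_freq, num_docs, k1=1.2):
    """Score one term in one document under a BM25F-style formula.

    field_tfs, field_lengths, avg_field_lengths, weights, and b are dicts
    keyed by field name. doc_freq is ASSUMED to be some cross-field document
    frequency; this sketch does not say how to obtain it.
    """
    # Step 1: combine the per-field term frequencies linearly, each one
    # boosted by its field weight and normalized by its own field length.
    pseudo_tf = 0.0
    for field, tf in field_tfs.items():
        norm = 1.0 - b[field] + b[field] * field_lengths[field] / avg_field_lengths[field]
        pseudo_tf += weights[field] * tf / norm

    # Step 2: apply the non-linear saturation once, to the combined value.
    saturated = pseudo_tf / (k1 + pseudo_tf)

    # Step 3: a single IDF for the term (one common non-negative variant).
    idf = math.log(1.0 + (num_docs - doc_freq + 0.5) / (doc_freq + 0.5))
    return idf * saturated
```

The point of combining first is that a match in a second field adds to the
same saturating curve instead of starting a new one, which is what summing
independent per-field BM25 scores would do.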
Too many questions arise for BM25F and how it would "fit" with Lucene. For
example, "multiple fields" can really mean anything, and having a field in
Lucene doesn't mean at all that it was in your original document! Solr users,
for instance, frequently use a "copyField" to duplicate the content of one
field into a different field (and perhaps apply some processing).
In terms of things like length normalization, it seems that a "document length"
calculated as the sum across the fields would be wrong for many use cases.
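A tiny made-up example of the copyField problem, in Python: suppose a
document has a 5-token title and a 100-token body, and a hypothetical
catch-all field "text" that is a copy of both. The field names and lengths
are invented for illustration.

```python
def summed_document_length(field_lengths, skip=()):
    """Sum the field lengths, optionally skipping fields known to be copies."""
    return sum(n for name, n in field_lengths.items() if name not in skip)


# "text" duplicates title + body, so its tokens are counted twice below.
field_lengths = {"title": 5, "body": 100, "text": 105}

naive = summed_document_length(field_lengths)                      # 210
adjusted = summed_document_length(field_lengths, skip={"text"})    # 105
print(naive, adjusted)
```

The naive sum reports twice the length of the original document, and the
index itself has no way of knowing which fields should be skipped.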
I only wanted to recommend against this one because of this rather serious
challenge. It seems like something we might want to table for the moment: Lucene
is changing fast, and as new capabilities arise we might realize there is a
more elegant way to address this. For now, I would recommend starting with
BM25.
> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
> Key: LUCENE-2959
> URL: https://issues.apache.org/jira/browse/LUCENE-2959
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Examples, Javadocs, Query/Scoring
> Reporter: David Mark Nemeskey
> Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf,
> proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state of the art algorithms, such as BM25. Moreover, the
> architecture is tailored specifically to VSM, which makes the addition of new
> ranking functions a non-trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to
> implement a query architecture with pluggable ranking functions.