[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

Robert Muir (JIRA) Thu, 31 Mar 2011 05:23:47 -0700

    [ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13013944#comment-13013944
 ]


Robert Muir commented on LUCENE-2959:
-------------------------------------

{quote}
I think the main point would be to make the addition of a new ranking function 
as easy as possible. At least a prototype implementation should be very 
straightforward, even at the expense of performance. Then, if the new method 
provides good results, the developer can go on to the lower level to squeeze 
more juice out of it. It's hard for me to discuss this without knowing the 
code, of course, but do you think it is possible?
{quote}

This sounds great! For example, you could extend the low-level API, gather 
every statistic that Lucene has, and present a high-level API that looks more 
like Terrier's scoring API (which I'm guessing is what researchers would 
prefer?), where they basically implement the scoring in one method with all 
the stats available.

Someone could then extend this high-level API for prototyping, which would 
make it easier to experiment.
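
To make the "one method with all the stats" idea concrete, here is a minimal 
sketch. Everything in it is hypothetical: TermStats, PrototypeScorer and 
SimpleTfIdf are names made up for illustration, not existing Lucene or Terrier 
classes, and the formula is just a placeholder that a researcher would replace.

```java
/** Hypothetical bundle of index statistics; not an actual Lucene class. */
final class TermStats {
    final long numDocs;          // number of documents in the index
    final long docFreq;          // documents containing the term (in this field)
    final double avgFieldLength; // average field length, in tokens

    TermStats(long numDocs, long docFreq, double avgFieldLength) {
        this.numDocs = numDocs;
        this.docFreq = docFreq;
        this.avgFieldLength = avgFieldLength;
    }
}

/** Hypothetical high-level hook: one method, all statistics handed in. */
abstract class PrototypeScorer {
    abstract double score(TermStats stats, double tf, double fieldLength);
}

/** Placeholder ranking function, used only to show the shape of a prototype. */
final class SimpleTfIdf extends PrototypeScorer {
    @Override
    double score(TermStats stats, double tf, double fieldLength) {
        // Rarer terms get a larger weight; longer fields are penalized.
        double idf = Math.log(1.0 + (double) stats.numDocs / (stats.docFreq + 1.0));
        return Math.sqrt(tf) * idf / Math.sqrt(fieldLength);
    }
}
```

A prototype built this way trades performance for convenience; once a method 
looks promising, it could be ported to the low-level API.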

{quote}
I think I will follow your advice and concentrate on how to make BM25F fast.
{quote}

As for BM25F, it presents a few challenges (some already discussed on 
LUCENE-2091). 

To summarize:
* For any field, Lucene has a per-field terms dictionary that contains each 
term's docFreq. Computing BM25F's IDF would be challenging, because it wants a 
docFreq "across all the fields". It's not clear to me at a glance from the 
original paper either whether this should be across only the fields in the 
query or across all the fields in the document, or whether a "static" schema 
is implied in this scoring system (in Lucene, document 1 can have 3 fields and 
document 2 can have 40 different ones, even with different properties).
* The same issue applies to length normalization: Lucene has a "field length" 
but really no concept of document length. 
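
A toy illustration of the first point, assuming an in-memory table instead of 
a real index: termInField[d][f] is a made-up representation saying whether 
field f of document d contains the term. It shows that a document-level 
docFreq is not recoverable from the per-field docFreqs alone, since neither 
their sum nor their maximum equals it in general.

```java
/** Toy model only; a real index does not store a table like this. */
final class CrossFieldDocFreq {

    /** Per-field docFreq: documents whose given field contains the term. */
    static int docFreqInField(boolean[][] termInField, int field) {
        int count = 0;
        for (boolean[] doc : termInField) {
            // Documents may have different numbers of fields.
            if (field < doc.length && doc[field]) {
                count++;
            }
        }
        return count;
    }

    /** Document-level docFreq: documents where any field contains the term. */
    static int docFreqAcrossFields(boolean[][] termInField) {
        int count = 0;
        for (boolean[] doc : termInField) {
            for (boolean hit : doc) {
                if (hit) {
                    count++;
                    break; // count each document once
                }
            }
        }
        return count;
    }
}
```

With four documents where the term occurs in both fields of the first, only 
field 0 of the second, nowhere in the third and only field 1 of the fourth, 
each per-field docFreq is 2 while the document-level docFreq is 3.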

So I just wanted to mention that while it's possible here to apply a per-field 
TF boost before the non-linear TF saturation, it's not immediately clear how to 
adapt the BM25F formula to Lucene: how do you combine these scores without 
using a (wasteful) "catch-all" field and some lying behind the scenes to force 
that field's length normalization and docFreq to be used?
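
For reference, the "boost before saturation" part can be sketched as below. 
This follows the commonly cited simple form of BM25F, not any existing Lucene 
code; Bm25fSketch and all parameter names are made up, a single b is assumed 
for all fields, and the idf is simply passed in because, as noted above, it is 
unclear where a cross-field docFreq would come from.

```java
/** Sketch of BM25F-style term-frequency combination; hypothetical names. */
final class Bm25fSketch {

    /** Sum of boosted, length-normalized per-field term frequencies. */
    static double combinedTf(double[] tf, double[] boost,
                             double[] fieldLength, double[] avgFieldLength,
                             double b) {
        double sum = 0.0;
        for (int f = 0; f < tf.length; f++) {
            // Per-field length normalization, applied before combining.
            double norm = 1.0 - b + b * (fieldLength[f] / avgFieldLength[f]);
            sum += boost[f] * tf[f] / norm;
        }
        return sum;
    }

    /** Non-linear saturation applied once, to the combined frequency. */
    static double score(double combinedTf, double idf, double k1) {
        return idf * combinedTf / (k1 + combinedTf);
    }
}
```

The point of the ordering is that a term matching in two fields is saturated 
once, rather than being scored per field and summed afterwards.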

Too many questions arise about how BM25F would "fit" with Lucene. For 
example, "multiple fields" can really mean anything, and having a field in 
Lucene doesn't at all mean that it was in your original document! Solr users 
frequently use a "copyField" to take the content of one field and duplicate it 
into a different field (perhaps applying some processing). For things like 
length normalization, it seems that a "document length" calculated as the sum 
across the fields would be wrong for many use cases.

I only wanted to recommend against this one because of this rather serious 
challenge. It seems like something we might want to table for the moment: 
Lucene is changing fast, and as new capabilities arise we might realize there 
is a more elegant way to address this. For now, I would recommend starting 
with BM25.
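
By contrast, single-field BM25 only needs statistics that exist per field 
(docFreq, field length, average field length). A minimal sketch, with made-up 
names and assuming the non-negative IDF variant log(1 + (N - df + 0.5) / 
(df + 0.5)) rather than the classic Robertson/Sparck Jones form:

```java
/** Sketch of per-field BM25; Bm25Sketch is a hypothetical class. */
final class Bm25Sketch {

    /** Non-negative IDF variant (an assumption; other variants exist). */
    static double idf(long numDocs, long docFreq) {
        return Math.log(1.0 + (numDocs - docFreq + 0.5) / (docFreq + 0.5));
    }

    /** BM25 score of one term in one field of one document. */
    static double score(double tf, long docFreq, long numDocs,
                        double fieldLength, double avgFieldLength,
                        double k1, double b) {
        // Length normalization uses the field length, not a document length.
        double norm = 1.0 - b + b * (fieldLength / avgFieldLength);
        return idf(numDocs, docFreq) * tf * (k1 + 1.0) / (tf + k1 * norm);
    }
}
```

Typical starting values in the literature are k1 around 1.2 and b around 0.75; 
both would need tuning per collection.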




> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-2959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2959
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Examples, Javadocs, Query/Scoring
>            Reporter: David Mark Nemeskey
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf, 
> proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state of the art algorithms, such as BM25. Moreover, the
> architecture is tailored specifically to VSM, which makes the addition of new
> ranking functions a non-trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to
> implement a query architecture with pluggable ranking functions.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

---------------------------------------------------------------------
To unsubscribe, e-mail: [email protected]
For additional commands, e-mail: [email protected]
