[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

Robert Muir (JIRA) Fri, 01 Apr 2011 06:17:47 -0700

    [ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014547#comment-13014547 ]


Robert Muir commented on LUCENE-2959:
-------------------------------------

{quote}
One thing that is not clear to me is why these limitations would not be a 
problem for BM25. As I see it, the difference between the two methods is that 
BM25 simply computes tfs, idfs and document length from the whole document, 
which, according to what you said, is not available in Lucene. That's why I 
figured that a variant of BM25F would actually be more straightforward to 
implement.
{quote}

A variant sounds really interesting. I think you know better than I do here; I 
just looked at the original paper and thought to myself that implementing it 
"by the book" might not be feasible for a while.

{quote}
Robert, would you be so kind as to have a look at my proposal? It can be found at 
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1 
and is basically the same as what I sent to the mailing list. I wrote that I 
want to implement BM25, BM25F and DFR ("the framework", I meant with one or two 
smoothing models), as well as to convert the original scoring to the new 
framework. In light of the thread here, I guess it would be better to modify 
these goals, perhaps by:

* deleting the conversion part?
* committing myself to BM25/BM25F only?
* explicitly stating that I want a higher level API based on the low-level one?
{quote}

I think you can decide what you want to do? Obviously I would love to see all 
of it done :)

But it's your choice. I could see you going a couple of different ways:
* Closer to your original proposal, you could still develop a flexible scoring 
API on top of Similarity. Hey, all I really did was move stuff from Scorer to 
Similarity, which does give flexibility, but it's probably not what an IR 
researcher would want (it's low-level and confusing). So you could make a 
"SimpleSimilarity" or "EasySimilarity" or something that presents a much 
simpler API (something closer to what Terrier/Indri present) on top of this, 
for easily implementing ranking functions. I think this would be extremely 
valuable long-term: who cares if we have a low-level flexible scoring API that 
only speed demons like, but IR practitioners find confusing and hideous? 
Someone who is trying to experiment with an enhancement to relevance likely 
doesn't care if their TREC run takes 30 seconds instead of 20 seconds, as long 
as the API is really easy and they aren't wasting time fighting with Lucene. 
If you go this route, you could implement BM25, DFR, etc. as you suggested, as 
examples of how to use this API, and there would be more of a focus on API 
quality and simplicity than on performance.
* Alternatively, you could refine your proposal to implement a really 
"production strength" version of one of these scoring systems on top of the 
low-level API, one that would ideally be competitive with Lucene's default 
scoring today in performance, documentation, etc. If you decide to do this, 
then yes, I would definitely suggest picking only one, because I think it's a 
ton of work, as I listed above, and there would be more focus on practical 
things (some probably being nuances of Lucene) and performance.
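
To make the first option a bit more concrete, here is a rough sketch of what 
such a simplified API might look like. Everything in it is hypothetical: 
TermStats, SimpleRankingFunction and Bm25 are names made up for illustration 
and are not Lucene classes, and the sketch assumes that tf, document length and 
collection statistics are simply handed to the ranking function, which glosses 
over how they would be obtained from the index. The IDF used is a common 
variant that stays non-negative, not necessarily the one from the original 
paper.

```java
/** Hypothetical bundle of statistics a simplified API could pass to a ranking function. */
final class TermStats {
    final double tf;        // frequency of the term in the document
    final double docLen;    // length of the document, in tokens
    final double avgDocLen; // average document length in the collection
    final long docFreq;     // number of documents containing the term
    final long numDocs;     // total number of documents in the collection

    TermStats(double tf, double docLen, double avgDocLen, long docFreq, long numDocs) {
        this.tf = tf;
        this.docLen = docLen;
        this.avgDocLen = avgDocLen;
        this.docFreq = docFreq;
        this.numDocs = numDocs;
    }
}

/** Hypothetical high-level hook: one method, score a single term in a single document. */
interface SimpleRankingFunction {
    double score(TermStats stats);
}

/** BM25 written against the hypothetical interface above. */
final class Bm25 implements SimpleRankingFunction {
    private final double k1; // controls term-frequency saturation
    private final double b;  // controls document-length normalization

    Bm25(double k1, double b) {
        this.k1 = k1;
        this.b = b;
    }

    @Override
    public double score(TermStats s) {
        // Non-negative IDF variant: log(1 + (N - n + 0.5) / (n + 0.5)).
        double idf = Math.log(1.0 + (s.numDocs - s.docFreq + 0.5) / (s.docFreq + 0.5));
        // Length-normalized saturation term in the denominator.
        double norm = s.tf + k1 * (1.0 - b + b * s.docLen / s.avgDocLen);
        return idf * s.tf * (k1 + 1.0) / norm;
    }
}

/** Tiny usage example with made-up statistics. */
final class Bm25Demo {
    public static void main(String[] args) {
        SimpleRankingFunction bm25 = new Bm25(1.2, 0.75);
        TermStats stats = new TermStats(2, 100, 100, 10, 1000);
        System.out.println("BM25 score: " + bm25.score(stats));
    }
}
```

The point of the sketch is only that a ranking function becomes a few lines of 
arithmetic once the statistics are presented this way; a real implementation 
on top of Similarity would still have to deal with how those statistics are 
stored and retrieved.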


> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-2959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2959
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Examples, Javadocs, Query/Scoring
>            Reporter: David Mark Nemeskey
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf, 
> proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state-of-the-art algorithms such as BM25. Moreover, the
> architecture is tailored specifically to VSM, which makes the addition of new
> ranking functions a non-trivial task.
> This project aims to bring state-of-the-art ranking methods to Lucene and to
> implement a query architecture with pluggable ranking functions.

