[
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13014547#comment-13014547
]
Robert Muir commented on LUCENE-2959:
-------------------------------------
{quote}
One thing that is not clear for me is why these limitations would not be a
problem for BM25. As I see it, the difference between the two methods is that
BM25 simply computes tfs, idfs and document length from the whole document –
which, according to what you said, is not available in Lucene. That's why I
figured that a variant of BM25F would actually be more straightforward to
implement.
{quote}
A variant sounds really interesting. I think you know better than me here; I
just looked at the original paper and thought to myself that implementing this
"by the book" might not be feasible for a while.
{quote}
Robert, would you be so kind as to have a look at my proposal? It can be found at
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1.
It's basically the same as what I sent to the mailing list. I wrote that I
want to implement BM25, BM25F and DFR ("the framework", I meant with one or two
smoothing models), as well as to convert the original scoring to the new
framework. In light of the thread here, I guess it would be better to modify
these goals, perhaps by:
* deleting the conversion part?
* committing myself to BM25/BM25F only?
* explicitly stating that I want a higher level API based on the low-level one?
{quote}
I think you can decide what you want to do. Obviously I would love to see all
of it done :)
But it's your choice; I could see you going a couple of different ways:
* closer to your original proposal, you could still develop a flexible scoring
API on top of Similarity. Hey, all I did was move stuff from Scorer to
Similarity really, which does give flexibility, but it's probably not what an IR
researcher would want (it's low-level and confusing). So you could make a
"SimpleSimilarity" or "EasySimilarity" or something that presents a much
simpler API (something closer to what Terrier/Indri present) on top of this,
for easily implementing ranking functions. I think this would be extremely
valuable long-term: who cares if we have a low-level flexible scoring API that
only speed demons like, but IR practitioners find confusing and hideous?
Someone who is trying to experiment with an enhancement to relevance likely
doesn't care if their TREC run takes 30 seconds instead of 20 seconds if the
API is really easy and they aren't wasting time fighting with Lucene. If you go
this route, you could implement BM25, DFR, etc. as you suggested as examples of
how to use this API, and there would be more of a focus on API quality and
simplicity instead of performance.
* or alternatively, you could refine your proposal to implement a really
"production strength" version of one of these scoring systems on top of the
low-level API, that would ideally have competitive
performance/documentation/etc. with Lucene's default scoring today. If you
decide to do this, then yes, I would definitely suggest picking only one,
because I think it's a ton of work as I listed above, and I think there would be
more focus on practical things (some probably being nuances of Lucene) and
performance.
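
To make the first option a bit more concrete, here is a rough sketch of what
such a simplified layer could look like. Everything in it is an assumption for
illustration only: the names EasySimilarity, Stats and BM25 are hypothetical
and are not existing Lucene classes, the statistics are passed in by hand
rather than read from an index, and the BM25 formula is the common textbook
form with the conventional defaults k1 = 1.2 and b = 0.75.

```java
public class EasySimilaritySketch {

    /** Hypothetical holder for the statistics a ranking function needs. */
    static final class Stats {
        final long numDocs;        // number of documents in the collection
        final long docFreq;        // number of documents containing the term
        final double avgDocLength; // average document length, in tokens

        Stats(long numDocs, long docFreq, double avgDocLength) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
            this.avgDocLength = avgDocLength;
        }
    }

    /** Hypothetical high-level API: a ranking function is a single method. */
    interface EasySimilarity {
        /** Score of one term in one document, given its tf and the document length. */
        double score(Stats stats, double tf, double docLength);
    }

    /** Textbook BM25 term score, written against the sketch API above. */
    static final class BM25 implements EasySimilarity {
        private final double k1; // term frequency saturation
        private final double b;  // strength of length normalization

        BM25(double k1, double b) {
            this.k1 = k1;
            this.b = b;
        }

        @Override
        public double score(Stats s, double tf, double docLength) {
            // Smoothed idf; the "1 +" keeps it positive for very common terms.
            double idf = Math.log(1.0 + (s.numDocs - s.docFreq + 0.5) / (s.docFreq + 0.5));
            // Length normalization relative to the average document length.
            double norm = k1 * (1.0 - b + b * docLength / s.avgDocLength);
            return idf * tf * (k1 + 1.0) / (tf + norm);
        }
    }

    public static void main(String[] args) {
        Stats stats = new Stats(1000, 10, 100.0);
        EasySimilarity bm25 = new BM25(1.2, 0.75);
        System.out.println("tf=1, average length: " + bm25.score(stats, 1.0, 100.0));
        System.out.println("tf=1, double length:  " + bm25.score(stats, 1.0, 200.0));
    }
}
```

The point of the sketch is only that someone experimenting with relevance
would write the score() method and nothing else; how the statistics are
collected and fed in efficiently would stay hidden in the low-level API.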
> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
> Key: LUCENE-2959
> URL: https://issues.apache.org/jira/browse/LUCENE-2959
> Project: Lucene - Java
> Issue Type: New Feature
> Components: Examples, Javadocs, Query/Scoring
> Reporter: David Mark Nemeskey
> Labels: gsoc2011, lucene-gsoc-11, mentor
> Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf,
> proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state of the art algorithms, such as BM25. Moreover, the
> architecture is tailored specifically to VSM, which makes the addition of new
> ranking functions a non-trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to
> implement a query architecture with pluggable ranking functions.