[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
 ]

Robert Muir updated LUCENE-2959:
--------------------------------

    Attachment: LUCENE-2959_mockdfr.patch

David, for your perusal, here is another sim I tried to write: DFR I(F)L2.

It's probably got bugs, but it demonstrates again the challenges here.

If we want to support ranking systems like this, how can they be made fast?

The one I wrote has no score caching, so it does a lot of per-document 
divisions, multiplications, etc., and this is no good.
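
To make the per-document cost concrete, here is a deliberately naive sketch of 
I(F)L2 scoring. It assumes the usual DFR definitions (basic model I(F), 
after-effect L, length normalization 2 with a free parameter c); the class, 
method and parameter names are made up for illustration, and this is not the 
code in the attached patch, which may differ in details.

```java
// Naive DFR I(F)L2 sketch. Assumption: standard DFR component definitions.
// All names here are hypothetical and not part of any Lucene API.
final class NaiveDfrIFL2 {

    static double log2(double x) {
        return Math.log(x) / Math.log(2.0);
    }

    // tf            : term frequency in the document
    // docLen        : length of the document
    // avgDocLen     : average document length in the collection
    // numDocs       : number of documents in the collection (N)
    // totalTermFreq : occurrences of the term in the whole collection (F)
    // c             : free parameter of normalization 2
    static double score(double tf, double docLen, double avgDocLen,
                        long numDocs, long totalTermFreq, double c) {
        // Normalization 2: a division and a log for every scored document.
        double tfn = tf * log2(1.0 + c * avgDocLen / docLen);
        // Basic model I(F): the log factor depends only on the term,
        // so it could be hoisted out, but it is recomputed here.
        double basic = tfn * log2((numDocs + 1.0) / (totalTermFreq + 0.5));
        // After-effect L: one more division per document.
        return basic / (tfn + 1.0);
    }
}
```

Nothing here is cached, so every scored document pays for two logs and 
several divisions.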

So it's going to be hard to make these competitive in performance with Lucene's 
current scoring, which for TF < 32 is an array lookup and a single 
multiplication.
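
For comparison, a minimal sketch of that kind of score cache. It assumes a 
square-root tf factor and a single per-term weight; the class and field names 
are illustrative only and are not Lucene's actual scorer code.

```java
// Score-cache sketch. Assumption: tf factor is sqrt(tf) and the per-term
// weight is fixed for the query. Names are hypothetical, not Lucene's API.
final class CachedTfScorer {
    private static final int CACHE_SIZE = 32;  // term frequencies below this are precomputed
    private final float[] scoreCache = new float[CACHE_SIZE];
    private final float weight;                // per-term weight, constant per query

    CachedTfScorer(float weight) {
        this.weight = weight;
        // Pay for the square roots once per term, not once per document.
        for (int tf = 0; tf < CACHE_SIZE; tf++) {
            scoreCache[tf] = tfFactor(tf) * weight;
        }
    }

    static float tfFactor(int tf) {
        return (float) Math.sqrt(tf);
    }

    // Per-document work for tf < 32: one array lookup and one multiplication.
    float score(int tf, float norm) {
        float raw = tf < CACHE_SIZE ? scoreCache[tf] : tfFactor(tf) * weight;
        return raw * norm;
    }
}
```

Only the rare documents with tf >= 32 fall back to computing the tf factor 
on the fly.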

It's more obvious to me how to eke good performance out of the language 
modelling formula, because you can rearrange the log and boil it down to some 
addition. But we need to get creative about how to make some of these other 
models fast, and it's more complicated if you want to make, say, a DFR 
"framework" that lets you pick the basic model and the 2 normalizations, 
versus specializing the code for each possibility (and there are many).
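
A sketch of that rearrangement, assuming Dirichlet-smoothed query likelihood 
(the message does not say which language model is meant, so this choice is an 
assumption). The log of the smoothed probability splits into a per-term 
constant, a term that depends only on tf, and a term that depends only on 
document length, so the last two can be tabulated. All names are hypothetical.

```java
// Dirichlet LM sketch. Assumption: score = log((tf + mu*p) / (docLen + mu)),
// where p is the collection probability of the term. Names are hypothetical.
final class DirichletLmSketch {

    // Direct form: a division and a log for every scored document.
    static double direct(int tf, int docLen, double p, double mu) {
        return Math.log((tf + mu * p) / (docLen + mu));
    }

    // log(p): depends only on the term, computed once per query term.
    static double termConstant(double p) {
        return Math.log(p);
    }

    // log(1 + tf / (mu*p)): depends only on tf, tabulated per term.
    static double[] tfTable(int maxTf, double p, double mu) {
        double[] table = new double[maxTf + 1];
        for (int tf = 0; tf <= maxTf; tf++) {
            table[tf] = Math.log(1.0 + tf / (mu * p));
        }
        return table;
    }

    // log(mu / (docLen + mu)): depends only on document length. A real
    // implementation would more likely index this by an encoded norm value.
    static double[] lengthTable(int maxLen, double mu) {
        double[] table = new double[maxLen + 1];
        for (int len = 0; len <= maxLen; len++) {
            table[len] = Math.log(mu / (len + mu));
        }
        return table;
    }

    // Per-document work: two array lookups and two additions.
    static double rearranged(int tf, int docLen, double termConst,
                             double[] tfTable, double[] lenTable) {
        return termConst + tfTable[tf] + lenTable[docLen];
    }
}
```

The per-term constant does not affect the relative order of documents for a 
single term, but it is kept here so the two forms return the same value.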

My advice to you for GSoC would be to just pick one of these (e.g. BM25) and 
figure out how to do it really well: good performance, a good API and 
documentation, and good relevance testing to ensure its quality.

I'm more than happy to help with the boring parts, like refactoring Lucene's 
Explanations API :)


> [GSoC] Implementing State of the Art Ranking for Lucene
> -------------------------------------------------------
>
>                 Key: LUCENE-2959
>                 URL: https://issues.apache.org/jira/browse/LUCENE-2959
>             Project: Lucene - Java
>          Issue Type: New Feature
>          Components: Examples, Javadocs, Query/Scoring
>            Reporter: David Mark Nemeskey
>              Labels: gsoc2011, lucene-gsoc-11, mentor
>         Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf, 
> proposal.pdf
>
>
> Lucene employs the Vector Space Model (VSM) to rank documents, which compares
> unfavorably to state of the art algorithms, such as BM25. Moreover, the
> architecture is tailored specifically to VSM, which makes the addition of new
> ranking functions a non-trivial task.
> This project aims to bring state of the art ranking methods to Lucene and to
> implement a query architecture with pluggable ranking functions.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
