[ 
https://issues.apache.org/jira/browse/LUCENE-3220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065896#comment-13065896
 ] 

Robert Muir commented on LUCENE-3220:
-------------------------------------

Hi David, this is looking really good! The patch is quite large, so what I did 
was:
# re-sync flexscoring branch to trunk
# commit your patch as is (I did a tiny tweak for LUCENE-3299)

I saw a couple of things we should address (a full review will take quite a bit 
of time for each model!), but we can take care of some of the easy stuff first:

* numberOfFieldTokens seems to be the same as sumOfTotalTF? I think we should 
only have one name for this stat.
* I like the idea of NoAfterAffect/NoNormalization in DFR. Maybe we should make 
these ordinary classes, and in DFR just not allow null for any of the 
components? Just thought it might look cleaner.
* some of the files in .similarities need the Apache license header.
* because we don't need the norm for averaging, maybe we should use Lucene's 
norm encoding? We can pre-build the decode table like the TF-IDF similarity 
does, except that our decode table is basically 1 / decode(float)^2, which 
gives us the quantized doc length. From a practical perspective, this would 
mean someone could use this stuff with existing Lucene indexes (once they 
upgrade their segments to 4.0's format) and easily switch between similarities 
without reindexing.
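
To make the NoAfterAffect/NoNormalization suggestion concrete, here is a minimal sketch of the "ordinary classes, no null" idea. Everything in it is an assumption for illustration: the interfaces, method signatures, and the `DfrSketch` class are simplified and hypothetical, not the classes from the patch, and the after-effect class is spelled `NoAfterEffect` here. The "basic model" is a placeholder that just returns the normalized tf.

```java
import java.util.Objects;

public class DfrComponentsSketch {

    /** Hypothetical, simplified stand-in for a DFR after-effect component. */
    public interface AfterEffect {
        float score(float tfn);
    }

    /** Hypothetical, simplified stand-in for a DFR length normalization. */
    public interface Normalization {
        float tfn(float tf, float docLen, float avgDocLen);
    }

    /** "No after-effect" as an ordinary class: a neutral multiplier of 1. */
    public static final class NoAfterEffect implements AfterEffect {
        @Override
        public float score(float tfn) {
            return 1f;
        }
    }

    /** "No normalization" as an ordinary class: tf passes through unchanged. */
    public static final class NoNormalization implements Normalization {
        @Override
        public float tfn(float tf, float docLen, float avgDocLen) {
            return tf;
        }
    }

    /** Illustrative composition only; the real scoring formula is omitted. */
    public static final class DfrSketch {
        private final AfterEffect afterEffect;
        private final Normalization normalization;

        public DfrSketch(AfterEffect afterEffect, Normalization normalization) {
            // Reject null up front, so scoring never needs a null check.
            this.afterEffect = Objects.requireNonNull(afterEffect, "afterEffect");
            this.normalization = Objects.requireNonNull(normalization, "normalization");
        }

        public float score(float tf, float docLen, float avgDocLen) {
            float tfn = normalization.tfn(tf, docLen, avgDocLen);
            // Placeholder basic model: uses tfn itself as the "information" value.
            return tfn * afterEffect.score(tfn);
        }
    }
}
```

With this shape, "no normalization" is requested by passing `new NoNormalization()` rather than `null`, and the scoring path has no special cases.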
 
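
A rough sketch of the decode-table idea from the last bullet, under stated assumptions: the norm byte is taken to hold roughly 1/sqrt(docLength) in Lucene's one-byte float encoding (3 mantissa bits, zero exponent 15), and field boosts are ignored. The decoder is re-implemented inline so the example is self-contained; in Lucene the equivalent is `SmallFloat.byte315ToFloat`. The class and method names are hypothetical.

```java
public class NormDecodeSketch {

    /**
     * Decodes a one-byte float with 3 mantissa bits and zero exponent 15.
     * Inlined here for self-containment; mirrors Lucene's SmallFloat decoder.
     */
    static float byte315ToFloat(byte b) {
        if (b == 0) {
            return 0.0f;
        }
        int bits = (b & 0xff) << (24 - 3);
        bits += (63 - 15) << 24;
        return Float.intBitsToFloat(bits);
    }

    /**
     * Pre-builds a 256-entry table mapping each norm byte to a quantized
     * document length, i.e. 1 / decode(b)^2.
     * Entry 0 is Infinity because byte 0 decodes to 0.0f.
     */
    public static float[] buildLengthTable() {
        float[] table = new float[256];
        for (int i = 0; i < 256; i++) {
            float f = byte315ToFloat((byte) i);
            table[i] = 1.0f / (f * f);
        }
        return table;
    }

    public static void main(String[] args) {
        float[] lengths = buildLengthTable();
        // Byte 124 decodes to a norm of 1.0, byte 120 to a norm of 0.5.
        System.out.println(lengths[124]);
        System.out.println(lengths[120]);
    }
}
```

Because only 256 norm values exist, the table is built once and a per-document length lookup at scoring time is a single array access. The lengths are quantized: a norm of 0.1 (length 100) is stored as byte 110, which decodes to 0.09375, so the table reports a length of about 113.8.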
If you want, you can do these things on this issue or open separate issues, 
whichever is easiest. But I think looking at smaller patches at this point will 
make iteration easier!

> Implement various ranking models as Similarities
> ------------------------------------------------
>
>                 Key: LUCENE-3220
>                 URL: https://issues.apache.org/jira/browse/LUCENE-3220
>             Project: Lucene - Java
>          Issue Type: Sub-task
>          Components: core/search
>    Affects Versions: flexscoring branch
>            Reporter: David Mark Nemeskey
>            Assignee: David Mark Nemeskey
>              Labels: gsoc
>         Attachments: LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
> LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
> LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, 
> LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch, LUCENE-3220.patch
>
>   Original Estimate: 336h
>  Remaining Estimate: 336h
>
> With [LUCENE-3174|https://issues.apache.org/jira/browse/LUCENE-3174] done, we 
> can finally work on implementing the standard ranking models. Currently DFR, 
> BM25 and LM are on the menu.
> Done:
>  * {{EasyStats}}: contains all statistics that might be relevant for a 
> ranking algorithm
>  * {{EasySimilarity}}: the ancestor of all the other similarities. Hides the 
> DocScorers and as much implementation detail as possible
>  * _BM25_: the current "mock" implementation might be OK
>  * _LM_
>  * _DFR_
>  * The so-called _Information-Based Models_
