[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-09-24 Thread hadas raviv (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113958#comment-13113958
 ] 

hadas raviv commented on LUCENE-2959:
-

Hi,

First of all, I would like to thank you for the great contribution you made by 
adding the state of the art ranking methods to Lucene. I have been waiting for 
these features for a long time, since they enable an IR researcher like me to 
use Lucene, which is a powerful tool, for research purposes.

I downloaded the latest version of Lucene trunk and played a little with the 
models you implemented. There is a question I have, and I would really 
appreciate your answer (my apologies in advance: I'm new to Lucene, so maybe 
this question is trivial for you):

I saw that you didn't change Lucene's default encoding of the document length, 
which is used for ranking in language models (one byte encodes the document 
length together with boosting). Why did you decide that? Is it possible to 
store the real document length encoded in some other way (maybe with the new 
flexible index)? Is there any example of such an implementation? It is just 
that I'm concerned about the effect of using an inaccurate document length on 
result quality. Did you check this issue?
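
To make the concern concrete, here is a minimal sketch of how a one-byte 
encoding loses document-length precision. It is modelled on my understanding 
of Lucene's SmallFloat byte encoding (keep the float's exponent and only its 
top two mantissa bits), applied to 1/sqrt(length) with no boost. The function 
names are hypothetical and the exact bit layout is an assumption, not something 
stated in this thread.

```python
import math
import struct

LOWEST = (63 - 15) << 3  # offset of the smallest representable exponent


def encode_norm(value: float) -> int:
    """Squeeze a float into one byte, keeping only 2 explicit mantissa bits."""
    bits = struct.unpack(">i", struct.pack(">f", value))[0]
    small = bits >> 21  # drop the 21 low mantissa bits
    if small <= LOWEST:
        return 0 if bits <= 0 else 1  # zero or negative -> 0, underflow -> 1
    if small >= LOWEST + 0x100:
        return 255  # overflow -> largest byte
    return small - LOWEST


def decode_norm(b: int) -> float:
    """Expand the byte back into a float (the dropped bits come back as zero)."""
    if b == 0:
        return 0.0
    bits = (b << 21) + ((63 - 15) << 24)
    return struct.unpack(">f", struct.pack(">i", bits))[0]


def stored_length(num_terms: int) -> float:
    """Document length recovered after a round trip through one byte."""
    norm = decode_norm(encode_norm(1.0 / math.sqrt(num_terms)))
    return 1.0 / (norm * norm)


for n in (84, 100, 113, 114, 200):
    print(n, encode_norm(1.0 / math.sqrt(n)), round(stored_length(n), 1))
```

Under this sketch, every length from 84 to 113 terms maps to the same byte, so 
a ranking function sees those documents as equally long.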

In addition - do you know about intentions to implement some more advanced 
ranking models (such as relevance models, mrf) in the near future?

Thanks in advance,
Hadas

 [GSoC] Implementing State of the Art Ranking for Lucene
 ---

 Key: LUCENE-2959
 URL: https://issues.apache.org/jira/browse/LUCENE-2959
 Project: Lucene - Java
  Issue Type: New Feature
  Components: core/query/scoring, general/javadocs, modules/examples
Reporter: David Mark Nemeskey
Assignee: Robert Muir
  Labels: gsoc2011, lucene-gsoc-11, mentor
 Fix For: flexscoring branch, 4.0

 Attachments: LUCENE-2959.patch, LUCENE-2959.patch, 
 LUCENE-2959_mockdfr.patch, LUCENE-2959_nocommits.patch, 
 implementation_plan.pdf, proposal.pdf


 Lucene employs the Vector Space Model (VSM) to rank documents, which compares
 unfavorably to state of the art algorithms, such as BM25. Moreover, the
 architecture is tailored specifically to VSM, which makes the addition of new
 ranking functions a non-trivial task.
 This project aims to bring state of the art ranking methods to Lucene and to
 implement a query architecture with pluggable ranking functions.
 The wiki page for the project can be found at
 http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRanking.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira



-
To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org



[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-09-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113977#comment-13113977
 ] 

Robert Muir commented on LUCENE-2959:
-

{quote}
I saw that you didn't change the default implementation of lucene for coding 
the document length which is used for ranking in language models (one byte for 
coding the document length together with boosting). Why did you decide that?
{quote}

So that you can switch between ranking models without re-indexing.

{quote}
It is just that I'm concerned with the effect of using an inaccurate document 
length on results quality. Did you check this issue?
{quote}

I ran experiments on this a long time ago; the changes were not statistically 
significant.
However, there is still an open issue to switch norms to docvalues fields, for 
other reasons: LUCENE-3221

{quote}
In addition - do you know about intentions to implement some more advanced 
ranking models (such as relevance models, mrf) in the near future?
{quote}

No, there won't be any additional work on this issue; GSoC is over. 









[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-09-11 Thread Michael McCandless (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13102371#comment-13102371
 ] 

Michael McCandless commented on LUCENE-2959:


Thanks David and Robert!

What an incredible step forward: now you can easily try out all sorts of 
pre-existing scoring models, or make your own.  Yay :)




[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-08-25 Thread David Mark Nemeskey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090872#comment-13090872
 ] 

David Mark Nemeskey commented on LUCENE-2959:
-

Hi Robert,

I would very much like to run this test on the other sims as well. How do I do 
that?

David






[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-08-25 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090891#comment-13090891
 ] 

Robert Muir commented on LUCENE-2959:
-

There is a project here that lets you benchmark using wikipedia: 
http://code.google.com/a/apache-extras.org/p/luceneutil/

You need this patch to benchmark Similarities: 
http://code.google.com/a/apache-extras.org/p/luceneutil/issues/detail?id=6

(More information on how to get started here: 
http://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/README.txt)





[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-08-24 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090685#comment-13090685
 ] 

Robert Muir commented on LUCENE-2959:
-

I rearranged the BM25 in the branch a little bit; it's now as fast as Lucene's 
ranking formula:
{noformat}
Task              QPS tfidf  StdDev tfidf  QPS bm25  StdDev bm25      Pct diff
SpanNear               4.29          0.52      4.14         0.49   -24% -  22%
Phrase                 3.97          0.25      3.89         0.25   -13% -  11%
Term                  82.18          4.78     81.00         2.56    -9% -   7%
TermBGroup1M1P        83.30          2.41     82.12         2.20    -6% -   4%
SloppyPhrase           8.03          0.31      7.93         0.43   -10% -   8%
AndHighHigh           19.38          0.59     19.16         0.71    -7% -   5%
PKLookup             175.49          4.33    173.67         4.20    -5% -   3%
AndHighMed            40.99          1.12     40.71         1.07    -5% -   4%
TermGroup1M           25.69          0.39     25.69         0.44    -3% -   3%
Fuzzy2                42.62          1.83     42.65         1.80    -8% -   8%
Fuzzy1                91.74          3.48     91.86         3.44    -7% -   7%
Respell               73.96          3.30     74.18         3.29    -8% -   9%
Wildcard              56.33          0.97     56.60         1.08    -3% -   4%
Prefix3               33.36          0.83     33.59         0.97    -4% -   6%
TermBGroup1M          55.58          1.03     56.17         0.88    -2% -   4%
IntNRQ                13.38          0.74     13.58         0.94   -10% -  14%
OrHighMed             11.71          1.18     11.94         0.97   -14% -  22%
OrHighHigh             8.91          0.74      9.13         0.63   -11% -  19%
{noformat}
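
The "Pct diff" ranges in the table can be reproduced from the other four 
columns. The sketch below uses a formula inferred from the rows themselves 
(worst case and best case one standard deviation apart, truncated to whole 
percent); it is not stated in this thread, and the benchmark tool may compute 
the range differently. The function name is hypothetical.

```python
def pct_diff_range(qps_base, sd_base, qps_cmp, sd_cmp):
    """Return (worst, best) percentage change of cmp relative to base."""
    # Worst case: the comparison run is one stddev slow, the baseline one stddev fast.
    worst = (qps_cmp - sd_cmp) / (qps_base + sd_base) - 1.0
    # Best case: the comparison run is one stddev fast, the baseline one stddev slow.
    best = (qps_cmp + sd_cmp) / (qps_base - sd_base) - 1.0
    # int() truncates toward zero, which matches the whole-percent values shown.
    return int(worst * 100), int(best * 100)


rows = {
    "SpanNear": (4.29, 0.52, 4.14, 0.49),
    "Term": (82.18, 4.78, 81.00, 2.56),
    "PKLookup": (175.49, 4.33, 173.67, 4.20),
}
for task, values in rows.items():
    print(task, pct_diff_range(*values))
```

Every range in the table straddles zero, which is why the two formulas can be 
called equally fast within measurement noise.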




[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-08-23 Thread David Mark Nemeskey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089449#comment-13089449
 ] 

David Mark Nemeskey commented on LUCENE-2959:
-

Robert: maybe we could resolve this issue as well? Once we decide what to do 
with 3173 -- perhaps a won'tfix?




[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-08-23 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089453#comment-13089453
 ] 

Robert Muir commented on LUCENE-2959:
-

I think we can defer that one or just leave it open for consideration later.

As for this issue, let's keep it open until we merge the branch to trunk!




[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-04-07 Thread David Mark Nemeskey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016741#comment-13016741
 ] 

David Mark Nemeskey commented on LUCENE-2959:
-

Thanks Robert, that would be terrific.




[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-04-01 Thread David Mark Nemeskey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13014472#comment-13014472
 ] 

David Mark Nemeskey commented on LUCENE-2959:
-

Robert,

As for the problems with BM25F

{quote}
* for any field, Lucene has a per-field terms dictionary that contains that 
term's docFreq. To compute BM25f's IDF method would be challenging, because it 
wants a docFreq across all the fields.
* the same issue applies to length normalization, lucene has a field 
length but really no concept of document length.
{quote}

One thing that is not clear to me is why these limitations would not be a 
problem for BM25. As I see it, the difference between the two methods is that 
BM25 simply computes tfs, idfs and document length from the whole document -- 
which, according to what you said, is not available in Lucene. That's why I 
figured that a variant of BM25F would actually be more straightforward to 
implement.

{quote}
(its not clear to me at a glance either from the original paper, if this should 
be across only the fields in the query, across all the fields in the document, 
and if a static schema is implied in this scoring system (in lucene document 
1 can have 3 fields and document 2 can have 40 different ones, even with 
different properties).
{quote}

Actually, I am not sure there is a consensus on what BM25F actually is. :) For 
example, the BM25 formula can be applied to the weighted sum of field tfs, or 
alternatively, the per-field BM25 scores can be summed after normalization. 
I've seen both called (maybe incorrectly) BM25F.
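
The two readings can be sketched side by side. This is only an illustration 
under simplifying assumptions: length normalization is left out (as if b = 0), 
the field weights and idf values are made-up inputs, and the function names are 
hypothetical.

```python
def saturate(tf, k1=1.2):
    """BM25 term-frequency saturation without length normalization."""
    return tf * (k1 + 1.0) / (tf + k1)


def bm25f_combined_tf(field_tfs, weights, idf, k1=1.2):
    """Variant 1: weight and sum the field tfs first, then saturate once."""
    pseudo_tf = sum(weights[f] * tf for f, tf in field_tfs.items())
    return idf * saturate(pseudo_tf, k1)


def bm25_summed_fields(field_tfs, weights, field_idfs, k1=1.2):
    """Variant 2: score each field with BM25 separately, then add the scores."""
    return sum(
        weights[f] * field_idfs[f] * saturate(tf, k1)
        for f, tf in field_tfs.items()
    )


tfs = {"title": 1, "body": 3}
weights = {"title": 2.0, "body": 1.0}
print(bm25f_combined_tf(tfs, weights, idf=1.0))
print(bm25_summed_fields(tfs, weights, {"title": 1.0, "body": 1.0}))
```

One practical difference: the first variant can never exceed idf * (k1 + 1) for 
a term, while the second grows with the number of matching fields.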

If I understand correctly, the current scoring algorithm takes into account 
only the fields explicitly specified in the query. Is that right? If so, I see 
no reason why BM25 should behave otherwise. Which of course also means that we 
probably won't be able to save the summarized doc length and idf.

Robert, would you be so kind as to have a look at my proposal? It can be found 
at 
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1.
 It's basically the same as what I sent to the mailing list. I wrote that I 
want to implement BM25, BM25F and DFR (the framework, I meant with one or two 
smoothing models), as well as to convert the original scoring to the new 
framework. In light of the thread here, I guess it would be better to modify 
these goals, perhaps by:
* deleting the conversion part?
* committing myself to BM25/BM25F only?
* explicitly stating that I want a higher level API based on the low-level one?

As for the last item, it is only if I continue / join the work in 2392. Since I 
guess nobody wants two ranking frameworks, of course I will, but then in this 
part of the proposal should I just concentrate on the higher level API?

Thanks!




[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-04-01 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13014547#comment-13014547
 ] 

Robert Muir commented on LUCENE-2959:
-

{quote}
One thing that is not clear for me is why these limitations would not be a 
problem for BM25. As I see it, the difference between the two methods is that 
BM25 simply computes tfs, idfs and document length from the whole document – 
which, according to what you said, is not available Lucene. That's why I 
figured that a variant of BM25F would actually be more straightforward to 
implement.
{quote}

A variant sounds really interesting? I think you know better than me here; I 
just looked at the original paper and thought to myself that implementing this 
by the book might not be feasible for a while.

{quote}
Robert, would you be so kind to have a look at my proposal? It can be found at 
http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1.
 It's basically the same as what I sent to the mailing list. I wrote that I 
want to implement BM25, BM25F and DFR (the framework, I meant with one or two 
smoothing models), as well as to convert the original scoring to the new 
framework. In light of the thread here, I guess it would be better to modify 
these goals, perhaps by:

deleting the conversion part?
committing myself to BM25/BM25F only?
explicitly stating that I want a higher level API based on the low-level one?
{quote}

I think you can decide what you want to do? Obviously I would love to see all 
of it done :)

But it's your choice. I could see you going a couple of different ways:
* closer to your original proposal, you could still develop a flexible scoring 
API on top of Similarity. Hey, all I did was move stuff from Scorer to 
Similarity really, which does give flexibility, but it's probably not what an 
IR researcher would want (it's low-level and confusing). So you could make a 
SimpleSimilarity or EasySimilarity or something that presents a much 
simpler API (something closer to what terrier/indri present) on top of this, 
for easily implementing ranking functions? I think this would be extremely 
valuable long-term: who cares if we have a low-level flexible scoring API that 
only speed demons like, but IR practitioners find confusing and hideous? 
Someone who is trying to experiment with an enhancement to relevance likely 
doesn't care if their TREC run takes 30 seconds instead of 20 seconds if the 
API is really easy and they aren't wasting time fighting with Lucene? If you go 
this route, you could implement BM25, DFR, etc. as you suggested, as examples 
of how to use this API, and there would be more of a focus on API quality and 
simplicity instead of performance.
* or alternatively, you could refine your proposal to implement a really 
production-strength version of one of these scoring systems on top of the 
low-level API, that would ideally have competitive 
performance/documentation/etc. with Lucene's default scoring today. If you 
decide to do this, then yes, I would definitely suggest picking only one, 
because I think it's a ton of work as I listed above, and I think there would 
be more focus on practical things (some probably being nuances of Lucene) and 
performance.





[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-04-01 Thread David Mark Nemeskey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13014619#comment-13014619
 ] 

David Mark Nemeskey commented on LUCENE-2959:
-

{quote}
I think you can decide what you want to do?
{quote}
Fair enough. :) I guess I'll stick with my original proposal then, though I 
might change a few things here and there; maybe change the focus from 
flexibility (as it seems to be already underway) to simplicity.




[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-03-31 Thread David Mark Nemeskey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013907#comment-13013907
 ] 

David Mark Nemeskey commented on LUCENE-2959:
-

Robert: thanks for all the info! It's nice to see so much work has already been 
done. I plan to delve into it after the selection, and try to get other things 
out of the way until then, so that I can concentrate on GSoC during the summer.

I think the main point would be to make the addition of a new ranking function 
as easy as possible. At least a prototype implementation should be very 
straightforward, even at the expense of performance. Then, if the new method 
provides good results, the developer can go on to the lower level to squeeze 
more juice out of it. It's hard for me to discuss this without knowing the 
code, of course, but do you think it is possible?

Even though I added a Performance section to my proposal 
(http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1), 
I see now that it's probably more important than I believed it to be at first. 
I think I will follow your advice and concentrate on how to make BM25F fast. It 
may be a bit tougher nut to crack than DFR, as the latter has logarithms 
scattered all over it. However, the first thing that comes to mind is that the 
tf-BM25 curve becomes almost flat very quickly (less so for a high k1 value, 
though). So it may be possible to pre-compute a tf map or array for a query.
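
A minimal sketch of that pre-computation idea, assuming the length 
normalization factor is held fixed while the table is built (norm stands for 
1 - b + b * dl / avgdl). The function names and the cap of 32 are illustrative 
choices, not anything from the code base.

```python
def bm25_tf(tf, k1=1.2, norm=1.0):
    """Saturating tf component of BM25; approaches k1 + 1 as tf grows."""
    return tf * (k1 + 1.0) / (tf + k1 * norm)


def build_tf_table(max_tf, k1=1.2, norm=1.0):
    """Pre-compute the tf component for tf = 0..max_tf."""
    return [bm25_tf(tf, k1, norm) for tf in range(max_tf + 1)]


def lookup(table, tf):
    """Clamp large tfs to the last entry, where the curve is nearly flat."""
    return table[min(tf, len(table) - 1)]


table = build_tf_table(32)
for tf in (1, 2, 5, 10, 32, 1000):
    print(tf, round(lookup(table, tf), 4), round(bm25_tf(tf), 4))

# With a high k1 the curve flattens much later, so a short table is less accurate.
print(round(bm25_tf(32, k1=10.0) / (10.0 + 1.0), 3))
```

With k1 = 1.2 the value at tf = 32 is already above 96% of the limit k1 + 1, 
whereas with k1 = 10 it is still below 80% of its limit.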




[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-03-31 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013944#comment-13013944
 ] 

Robert Muir commented on LUCENE-2959:
-

{quote}
I think the main point would be to make the addition of a new ranking function 
as easy as possible. At least a prototype implementation should be very 
straightforward, even at the expense of performance. Then, if the new method 
provides good results, the developer can go on to the lower level to squeeze 
more juice out of it. It's hard for me to discuss this without knowing the 
code, of course, but do you think it is possible?
{quote}

This sounds great! For example, you could extend the low-level API, gather 
every possible statistic that Lucene has, and present a high-level API that 
looks more like Terrier's scoring API (which I'm guessing is what researchers 
would prefer?), where they basically implement the scoring in one method with 
all the stats there.

So someone would extend this API to do prototyping, which would make it easier 
to experiment.
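
A rough Java sketch of what such a one-method prototyping API could look like. All names here (ScoringStats, PrototypeSimilarity, Bm25Prototype) are hypothetical and are not part of the branch; the BM25 formula uses one common IDF variant and is only there to show that a whole ranking function fits into a single method once the statistics are handed over.

```java
/** Hypothetical bundle of the statistics a ranking function might need. */
final class ScoringStats {
    final long numDocs;        // number of documents in the index
    final long docFreq;        // number of documents containing the term
    final double avgFieldLen;  // average field length in tokens

    ScoringStats(long numDocs, long docFreq, double avgFieldLen) {
        this.numDocs = numDocs;
        this.docFreq = docFreq;
        this.avgFieldLen = avgFieldLen;
    }
}

/** Hypothetical high-level API: the whole ranking function is one method. */
interface PrototypeSimilarity {
    double score(ScoringStats stats, double tf, double fieldLen);
}

/** Plain BM25 written against the hypothetical API above. */
final class Bm25Prototype implements PrototypeSimilarity {
    private final double k1;
    private final double b;

    Bm25Prototype(double k1, double b) {
        this.k1 = k1;
        this.b = b;
    }

    @Override
    public double score(ScoringStats s, double tf, double fieldLen) {
        // One common IDF variant; always positive.
        double idf = Math.log(1 + (s.numDocs - s.docFreq + 0.5) / (s.docFreq + 0.5));
        // Length normalization relative to the average field length.
        double norm = 1 - b + b * fieldLen / s.avgFieldLen;
        return idf * tf * (k1 + 1) / (tf + k1 * norm);
    }
}
```

The point of the design is that the prototype never touches postings or norms directly; whether that can be made fast is a separate question.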

{quote}
I think I will follow your advice and concentrate on how to make BM25F fast.
{quote}

Actually, as far as BM25f goes, this one presents a few challenges (some 
already discussed on LUCENE-2091). 

To summarize:
* for any field, Lucene has a per-field terms dictionary that contains that 
term's docFreq. Computing BM25f's IDF would be challenging, because it wants a 
docFreq across all the fields. (It's not clear to me at a glance from the 
original paper either whether this should be across only the fields in the 
query or across all the fields in the document, or whether a static schema is 
implied in this scoring system: in Lucene, document 1 can have 3 fields and 
document 2 can have 40 different ones, even with different properties.)
* the same issue applies to length normalization: Lucene has a field length 
but really no concept of document length. 

So I just wanted to mention that while it's possible here to apply a per-field 
TF boost before the non-linear TF saturation, it's not immediately clear how to 
adapt the BM25f formula to Lucene: how to combine these scores without using a 
(wasteful) catch-all field and some lying behind the scenes to force this 
catch-all field's length normalization and docFreq to be used.
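
For reference, the "per-field TF boost before the non-linear TF saturation" part can be sketched in a few lines of Java. This is a simplified illustration under stated assumptions: the class and method names are hypothetical, and the per-field length normalization and the cross-field docFreq (the two problematic parts described above) are left out entirely.

```java
/** Hypothetical sketch of the BM25F tf combination; not a complete scorer. */
final class Bm25fSketch {

    /** Weighted sum of the per-field term frequencies (linear step). */
    static double combinedTf(double[] tfPerField, double[] boostPerField) {
        if (tfPerField.length != boostPerField.length) {
            throw new IllegalArgumentException("one boost per field is required");
        }
        double combined = 0.0;
        for (int i = 0; i < tfPerField.length; i++) {
            combined += boostPerField[i] * tfPerField[i];
        }
        return combined;
    }

    /** Saturation applied once, after the fields have been combined. */
    static double saturate(double tf, double k1) {
        return tf / (k1 + tf);
    }
}
```

Saturating once over the combined value is what distinguishes this from scoring each field separately and summing the results, where a term repeated across several fields would be rewarded several times.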

Too many questions arise for BM25f and how it would fit with Lucene, for 
example the fact that multiple fields can really mean anything, and having a 
field in Lucene doesn't mean at all that it was in your original document! For 
example, Solr users frequently use a copyField to take the content of one 
field and duplicate it to a different field (perhaps applying some processing). 
In terms of things like length normalization, it seems that document length 
calculated as the sum across the fields would be wrong for many use cases.

I only wanted to recommend against this one because of this rather serious 
challenge. It seems it's something we might want to table at the moment: Lucene 
is changing fast and as new capabilities arise, we might realize there is a 
more elegant way to address this... but at the moment I think I would recommend 
starting with BM25.







[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-03-30 Thread Robert Muir (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13012996#comment-13012996
 ] 

Robert Muir commented on LUCENE-2959:
-

Hi David, to try to help get things moving, I created a branch: 
https://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring

This is the in-progress work from LUCENE-2392, which separates the scoring 
calculations from the postings-list matching. In short, Similarity becomes very 
low-level, but the idea is that you extend it to present a higher-level API 
(for example TFIDFSimilarity: 
http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java)
 that is user-friendly and allows users to adjust parameters in a way that 
makes sense to that scoring system.

As a start I implemented some very very rough/basic models in src/test:
BM25: 
http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockBM25Similarity.java

Dirichlet LM: 
http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockLMSimilarity.java

But these are in no way correct or extensible or nice. For example, the BM25 
similarity is slow because, as implemented, its average document length is 
live (e.g. if you add more segments it's immediately adjusted for each 
query)... there is no caching at all. 

For example in this case, to speed up BM25, it could be nice for the Similarity 
to pull this up-front and create cached calculations. If a user wants to 
refresh their BM25 stats, then they could do something in SimilarityProvider to 
call it to recalculate the caches.

However, for a user that wants a super-realtime view, it might be better for it 
to stay the way it is now, or alternatively for the Sim to do the 256 
calculations up-front per query (ideally in weight, not per-segment in 
docscorer) to tabulate the length normalizations.
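
A minimal Java sketch of the "256 calculations" idea, under stated assumptions: the class name Bm25NormCache is hypothetical, and decodedLengths stands in for whatever decoding the index applies to the one-byte norm (the real quantization is not reproduced here). Since a norm byte has only 256 possible values, the length-dependent part of the BM25 denominator can be computed once per value and then looked up per document.

```java
/** Hypothetical per-query cache of the BM25 length normalization term. */
final class Bm25NormCache {
    private final float[] cache = new float[256];
    private final float k1;

    /**
     * @param decodedLengths decodedLengths[i] is the field length represented
     *                       by norm byte i (stand-in for the real decoder)
     * @param avgLength      average field length at the time the cache is built
     */
    Bm25NormCache(float[] decodedLengths, float avgLength, float k1, float b) {
        if (decodedLengths.length != 256) {
            throw new IllegalArgumentException("expected one length per byte value");
        }
        this.k1 = k1;
        // The 256 up-front calculations: k1 * (1 - b + b * len / avgLen).
        for (int i = 0; i < 256; i++) {
            cache[i] = k1 * (1 - b + b * decodedLengths[i] / avgLength);
        }
    }

    /** BM25 tf component for one document, using only a table lookup. */
    float tfNorm(float tf, byte normByte) {
        // Mask to treat the byte as an unsigned index in 0..255.
        return tf * (k1 + 1) / (tf + cache[normByte & 0xFF]);
    }
}
```

The cache is only as fresh as the avgLength it was built with, which is exactly the caching versus realtime tradeoff described above.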

So these are the API challenges we need to consider if we want to provide 
actual implementations of these scoring systems: how to make them perform close 
to or as fast as lucene's current scoring model.

Separately on the issue, I want to make Weight completely opaque to the sim; 
really it's just a way for a Similarity to compute things up front (such as IDF, 
but maybe things like these BM25 length norm caches too). Currently it can only 
have a single float value (see my un-sqrt'ing and other hacks in the Mock 
sims), so this should be fixed.

Additionally, another big TODO: just as Scorer was split (maybe we should rename 
it to Matcher now that sim does the calcs?), the process of Explanations needs 
to be split too, where a Sim is completely responsible for explaining itself.

Another TODO I have is to write the norm summation into the norms file as a 
single vlong, rather than computing it across all byte[] in segmentreader like 
I do now... I just implemented it this way so that we could play with scoring 
algorithms easily.
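
For readers unfamiliar with the term, a vlong is a variable-length encoding of a non-negative long: 7 payload bits per byte, low-order group first, with the high bit set on every byte except the last. The Java sketch below shows the idea as a standalone round trip; VLongSketch is a hypothetical name and this is an illustration of the encoding, not the code that writes the norms file.

```java
/** Hypothetical standalone illustration of a variable-length long encoding. */
final class VLongSketch {

    /** Encodes a non-negative long using 7 bits per byte, low-order bits first. */
    static byte[] encode(long value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        java.io.ByteArrayOutputStream out = new java.io.ByteArrayOutputStream();
        while ((value & ~0x7FL) != 0) {
            // More bits follow: emit 7 bits and set the continuation flag.
            out.write((int) ((value & 0x7FL) | 0x80L));
            value >>>= 7;
        }
        out.write((int) value); // last byte, continuation flag clear
        return out.toByteArray();
    }

    /** Decodes a byte sequence produced by encode(). */
    static long decode(byte[] bytes) {
        long value = 0;
        int shift = 0;
        for (byte b : bytes) {
            value |= (long) (b & 0x7F) << shift;
            shift += 7;
        }
        return value;
    }
}
```

A sum of field lengths for a typical segment fits in a handful of bytes this way, and reading it back avoids a pass over every document's norm byte.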

So, the good news would be that scoring is a lot more flexible, but the bad 
news is that in order to support Lucene's features, implementing a new ranking 
system on top of Similarity is really *serious* work, as you need to:
# implement the lower-level API efficiently, yet expose a nice high-level API 
such as TFIDFSimilarity's tf() and idf() hooks for users.
# implement explanations so that users can debug relevance issues.
# think about allowing users to balance the various performance tradeoffs, such 
as the performance gained by caching things versus using realtime statistics 
(some of this could be in my head, maybe computing 256 norm decoder caches 
up-front is really cheap and a non-issue).
# consider how to integrate Lucene's features into the ranking system, for 
example how to estimate a reasonable phrase IDF for phrase/multiphrase/span 
queries, how to integrate index-time boosts (in my example BM25 etc., I just 
made the documents appear shorter to accomplish this), and, depending upon how 
the length normalization is being stored in the index, how to pick the best 
quantization (might not be SmallFloat352), etc. 
# do all the relevance testing to ensure that things are correct (I found lots 
of bugs doing rough testing on my Mock ones, and there are probably more, but 
on the couple of test collections I tried they seemed reasonable).
# add good quality documentation such as what we have today in 
TFIDFSimilarity that explains how the ranking system works and how you can tune 
it.



[jira] Commented: (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-03-11 Thread David Mark Nemeskey (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13005575#comment-13005575
 ] 

David Mark Nemeskey commented on LUCENE-2959:
-

Andrzej: thanks! Indeed, I have read that paper, but have only skimmed 
through the code. I am also aware of at least one BM25 implementation for 
Lucene, which may or may not be what issue LUCENE-2091 is about. I need to have 
a look into it.




[jira] Commented: (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene

2011-03-10 Thread Andrzej Bialecki (JIRA)

[ 
https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13005345#comment-13005345
 ] 

Andrzej Bialecki  commented on LUCENE-2959:
---

You are probably familiar with this paper and the code... just in case I'm 
adding a reference here: http://arxiv.org/abs/0911.5046
