[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113958#comment-13113958 ]

hadas raviv commented on LUCENE-2959:

Hi,

First of all, I would like to thank you for the great contribution you made by adding the state of the art ranking methods to lucene. I have been waiting for these features for a long time, since they enable an IR researcher like me to use lucene, which is a powerful tool, for research purposes.

I downloaded the latest version of lucene trunk and played a little with the models you implemented. There is a question I have and I would really appreciate your answer (my apologies in advance: I'm new to lucene, so maybe this question is trivial for you).

I saw that you didn't change the default implementation of lucene for coding the document length which is used for ranking in language models (one byte for coding the document length together with boosting). Why did you decide that? Is it possible to save the real document length coded in some other way (maybe with the new flexible index)? Is there any example of such an implementation? It is just that I'm concerned with the effect of using an inaccurate document length on results quality. Did you check this issue?

In addition, do you know about intentions to implement some more advanced ranking models (such as relevance models, mrf) in the near future?

Thanks in advance,
Hadas

[GSoC] Implementing State of the Art Ranking for Lucene
---

Key: LUCENE-2959
URL: https://issues.apache.org/jira/browse/LUCENE-2959
Project: Lucene - Java
Issue Type: New Feature
Components: core/query/scoring, general/javadocs, modules/examples
Reporter: David Mark Nemeskey
Assignee: Robert Muir
Labels: gsoc2011, lucene-gsoc-11, mentor
Fix For: flexscoring branch, 4.0
Attachments: LUCENE-2959.patch, LUCENE-2959.patch, LUCENE-2959_mockdfr.patch, LUCENE-2959_nocommits.patch, implementation_plan.pdf, proposal.pdf

Lucene employs the Vector Space Model (VSM) to rank documents, which compares unfavorably to state of the art algorithms, such as BM25. Moreover, the architecture is tailored specifically to VSM, which makes the addition of new ranking functions a non-trivial task. This project aims to bring state of the art ranking methods to Lucene and to implement a query architecture with pluggable ranking functions. The wiki page for the project can be found at http://wiki.apache.org/lucene-java/SummerOfCode2011ProjectRanking.

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13113977#comment-13113977 ]

Robert Muir commented on LUCENE-2959:

{quote}
I saw that you didn't change the default implementation of lucene for coding the document length which is used for ranking in language models (one byte for coding the document length together with boosting). Why did you decide that?
{quote}

So that you can switch between ranking models without re-indexing.

{quote}
It is just that I'm concerned with the effect of using an inaccurate document length on results quality. Did you check this issue?
{quote}

I ran experiments on this a long time ago; the changes were not statistically significant. But there is still an issue open to switch norms to docvalues fields, for other reasons: LUCENE-3221

{quote}
In addition - do you know about intentions to implement some more advanced ranking models (such as relevance models, mrf) in the near future?
{quote}

No, there won't be any additional work on this issue; GSoC is over.
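To give a feel for how coarse a one-byte length norm is, here is a minimal sketch. It assumes the stored value keeps only 3 mantissa bits of a float and that the norm is 1/sqrt(numTerms) with a boost of 1; both are assumptions made for illustration. `NormQuantization` and its methods are hypothetical names, not Lucene API, and the exponent-range limits of a real one-byte encoder are ignored.

```java
// Illustrative only: a simplified stand-in for a one-byte norm encoding.
final class NormQuantization {

    // Keep the sign bit, the 8 exponent bits and the top 3 of 23 mantissa bits
    // (the 3-bit mantissa is an assumption of this sketch).
    private static final int MASK = 0xFFF00000;

    // Truncate a float to the reduced precision.
    static float quantize(float value) {
        return Float.intBitsToFloat(Float.floatToIntBits(value) & MASK);
    }

    // Length norm of the form 1/sqrt(numTerms); boost is assumed to be 1.
    static float lengthNorm(int numTerms) {
        return (float) (1.0 / Math.sqrt(numTerms));
    }

    // Document length implied by a stored (quantized) norm.
    static double impliedLength(float norm) {
        return 1.0 / ((double) norm * norm);
    }

    public static void main(String[] args) {
        for (int len : new int[] {100, 110, 120, 150, 200}) {
            float stored = quantize(lengthNorm(len));
            System.out.printf("true length %d -> stored norm %.5f -> implied length %.1f%n",
                    len, stored, impliedLength(stored));
        }
    }
}
```

Under these assumptions, documents of 100 and 110 terms end up with the same stored norm (0.09375), which decodes to a length of roughly 113.8. Whether an error of that size affects ranking quality is an empirical question; the reply above reports no statistically significant difference.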
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13102371#comment-13102371 ]

Michael McCandless commented on LUCENE-2959:

Thanks David and Robert! What an incredible step forward: now you can easily try out all sorts of pre-existing scoring models, or make your own. Yay :)
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090872#comment-13090872 ]

David Mark Nemeskey commented on LUCENE-2959:

Hi Robert,

I would very much like to run this test on the other sims as well. How do I do that?

David
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090891#comment-13090891 ]

Robert Muir commented on LUCENE-2959:

There is a project here that lets you benchmark using wikipedia: http://code.google.com/a/apache-extras.org/p/luceneutil/

You need this patch to benchmark Similarities: http://code.google.com/a/apache-extras.org/p/luceneutil/issues/detail?id=6

(More information on how to get started here: http://code.google.com/a/apache-extras.org/p/luceneutil/source/browse/README.txt)
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13090685#comment-13090685 ]

Robert Muir commented on LUCENE-2959:

I rearranged the BM25 in the branch a little bit; it's now as fast as lucene's ranking formula:

{noformat}
Task              QPS tfidf  StdDev tfidf  QPS bm25  StdDev bm25  Pct diff
SpanNear               4.29          0.52      4.14         0.49  -24% - 22%
Phrase                 3.97          0.25      3.89         0.25  -13% - 11%
Term                  82.18          4.78     81.00         2.56  -9% - 7%
TermBGroup1M1P        83.30          2.41     82.12         2.20  -6% - 4%
SloppyPhrase           8.03          0.31      7.93         0.43  -10% - 8%
AndHighHigh           19.38          0.59     19.16         0.71  -7% - 5%
PKLookup             175.49          4.33    173.67         4.20  -5% - 3%
AndHighMed            40.99          1.12     40.71         1.07  -5% - 4%
TermGroup1M           25.69          0.39     25.69         0.44  -3% - 3%
Fuzzy2                42.62          1.83     42.65         1.80  -8% - 8%
Fuzzy1                91.74          3.48     91.86         3.44  -7% - 7%
Respell               73.96          3.30     74.18         3.29  -8% - 9%
Wildcard              56.33          0.97     56.60         1.08  -3% - 4%
Prefix3               33.36          0.83     33.59         0.97  -4% - 6%
TermBGroup1M          55.58          1.03     56.17         0.88  -2% - 4%
IntNRQ                13.38          0.74     13.58         0.94  -10% - 14%
OrHighMed             11.71          1.18     11.94         0.97  -14% - 22%
OrHighHigh             8.91          0.74      9.13         0.63  -11% - 19%
{noformat}
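The "Pct diff" column in the benchmark output reads as a range, not a single number. The sketch below shows one way to reproduce it from the other four columns: compare the candidate one standard deviation down against the baseline one standard deviation up (and the reverse), then truncate to a whole percent. This formula is inferred from the numbers in the table, not taken from the benchmark tool's source, and `PctDiffRange` is a hypothetical name.

```java
// Reproduces the "Pct diff" range from QPS and StdDev values.
// The formula is inferred from the table, not confirmed against the tool.
final class PctDiffRange {

    // Pessimistic end: slowest plausible candidate vs. fastest plausible baseline.
    static int low(double baseQps, double baseStd, double candQps, double candStd) {
        double base = baseQps + baseStd;
        return (int) (100.0 * ((candQps - candStd) - base) / base); // truncates toward zero
    }

    // Optimistic end: fastest plausible candidate vs. slowest plausible baseline.
    static int high(double baseQps, double baseStd, double candQps, double candStd) {
        double base = baseQps - baseStd;
        return (int) (100.0 * ((candQps + candStd) - base) / base); // truncates toward zero
    }

    public static void main(String[] args) {
        // SpanNear row: tfidf 4.29 +/- 0.52, bm25 4.14 +/- 0.49
        System.out.printf("SpanNear: %d%% - %d%%%n",
                low(4.29, 0.52, 4.14, 0.49), high(4.29, 0.52, 4.14, 0.49));
        // Term row: tfidf 82.18 +/- 4.78, bm25 81.00 +/- 2.56
        System.out.printf("Term: %d%% - %d%%%n",
                low(82.18, 4.78, 81.00, 2.56), high(82.18, 4.78, 81.00, 2.56));
    }
}
```

Read this way, a range that straddles zero (as every row here does) means the difference between the two similarities is within run-to-run noise, which matches the "as fast as" summary.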
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089449#comment-13089449 ]

David Mark Nemeskey commented on LUCENE-2959:

Robert: maybe we could resolve this issue as well? Once we decide what to do with 3173 -- perhaps a "won't fix"?
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13089453#comment-13089453 ]

Robert Muir commented on LUCENE-2959:

I think we can defer that one or just leave it open for consideration later.

As for this issue, let's keep it open until we merge the branch to trunk!
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13016741#comment-13016741 ]

David Mark Nemeskey commented on LUCENE-2959:

Thanks Robert, that would be terrific.
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13014472#comment-13014472 ]

David Mark Nemeskey commented on LUCENE-2959:

Robert,

As for the problems with BM25F

{quote}
* for any field, Lucene has a per-field terms dictionary that contains that term's docFreq. To compute BM25f's IDF method would be challenging, because it wants a docFreq across all the fields.
* the same issue applies to length normalization, lucene has a field length but really no concept of document length.
{quote}

One thing that is not clear to me is why these limitations would not be a problem for BM25. As I see it, the difference between the two methods is that BM25 simply computes tfs, idfs and document length from the whole document -- which, according to what you said, is not available in Lucene. That's why I figured that a variant of BM25F would actually be more straightforward to implement.

{quote}
(it's not clear to me at a glance either from the original paper, if this should be across only the fields in the query, across all the fields in the document, and if a static schema is implied in this scoring system (in lucene document 1 can have 3 fields and document 2 can have 40 different ones, even with different properties).)
{quote}

Actually I am not sure there is a consensus on what BM25F actually is. :) For example, the BM25 formula can be applied to the weighted sum of field tfs, or alternatively, the per-field BM25 scores can be summed after normalization. I've seen both called (maybe incorrectly) BM25F.

If I understand correctly, the current scoring algorithm takes into account only the fields explicitly specified in the query. Is that right? If so, I see no reason why BM25 should behave otherwise. Which of course also means that we probably won't be able to save the summed doc length and idf.

Robert, would you be so kind as to have a look at my proposal? It can be found at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1. It's basically the same as what I sent to the mailing list. I wrote that I want to implement BM25, BM25F and DFR (the framework, I meant with one or two smoothing models), as well as to convert the original scoring to the new framework. In light of the thread here, I guess it would be better to modify these goals, perhaps by:
* deleting the conversion part?
* committing myself to BM25/BM25F only?
* explicitly stating that I want a higher level API based on the low-level one?

As for the last item, it is only if I continue / join the work in 2392. Since I guess nobody wants two ranking frameworks, of course I will, but then in this part of the proposal should I just concentrate on the higher level API?

Thanks!
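The two readings of "BM25F" mentioned in this comment can give quite different numbers. The following is a minimal sketch of the difference for a single term, assuming k1 = 1.2 and leaving out IDF and length normalization to keep the contrast visible. `Bm25fVariants` and its method names are hypothetical and not part of any Lucene API.

```java
// Two ways to combine per-field term frequencies, for one query term.
// IDF and length normalization are deliberately omitted (an assumption of this sketch).
final class Bm25fVariants {

    static final double K1 = 1.2; // assumed saturation parameter

    // BM25 term-frequency saturation: grows with tf but never exceeds K1 + 1.
    static double saturate(double tf) {
        return tf * (K1 + 1.0) / (tf + K1);
    }

    // Reading 1: weight and add the field tfs first, then saturate once.
    static double saturateWeightedSum(double[] fieldTfs, double[] fieldWeights) {
        double combinedTf = 0.0;
        for (int i = 0; i < fieldTfs.length; i++) {
            combinedTf += fieldWeights[i] * fieldTfs[i];
        }
        return saturate(combinedTf);
    }

    // Reading 2: saturate each field on its own, then add the weighted results.
    static double sumOfSaturated(double[] fieldTfs, double[] fieldWeights) {
        double total = 0.0;
        for (int i = 0; i < fieldTfs.length; i++) {
            total += fieldWeights[i] * saturate(fieldTfs[i]);
        }
        return total;
    }

    public static void main(String[] args) {
        double[] tfs = {1, 1, 1};      // the term occurs once in each of three fields
        double[] weights = {1, 1, 1};  // equal field weights
        System.out.println("saturate(weighted sum): " + saturateWeightedSum(tfs, weights));
        System.out.println("sum of saturated:       " + sumOfSaturated(tfs, weights));
    }
}
```

With one occurrence in each of three equally weighted fields, the first reading stays below the saturation ceiling of K1 + 1 = 2.2, while the second grows linearly with the number of matching fields. That is one reason the choice between them matters for ranking.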
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13014547#comment-13014547 ]

Robert Muir commented on LUCENE-2959:

{quote}
One thing that is not clear to me is why these limitations would not be a problem for BM25. As I see it, the difference between the two methods is that BM25 simply computes tfs, idfs and document length from the whole document -- which, according to what you said, is not available in Lucene. That's why I figured that a variant of BM25F would actually be more straightforward to implement.
{quote}

A variant sounds really interesting? I think you know better than me here, I just looked at the original paper and thought to myself that to implement this by the book might not be feasible for a while.

{quote}
Robert, would you be so kind as to have a look at my proposal? It can be found at http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1. It's basically the same as what I sent to the mailing list. I wrote that I want to implement BM25, BM25F and DFR (the framework, I meant with one or two smoothing models), as well as to convert the original scoring to the new framework. In light of the thread here, I guess it would be better to modify these goals, perhaps by:
* deleting the conversion part?
* committing myself to BM25/BM25F only?
* explicitly stating that I want a higher level API based on the low-level one?
{quote}

I think you can decide what you want to do? Obviously I would love to see all of it done :) But it's your choice, I could see you going a couple of different ways:

* closer to your original proposal, you could still develop a flexible scoring API on top of Similarity. Hey, all I did was move stuff from Scorer to Similarity really, which does give flexibility, but it's probably not what an IR researcher would want (it's low-level and confusing). So you could make a SimpleSimilarity or EasySimilarity or something that presents a much simpler API (something closer to what terrier/indri present) on top of this, for easily implementing ranking functions? I think this would be extremely valuable long-term: who cares if we have a low-level flexible scoring API that only speed demons like, but IR practitioners find confusing and hideous? Someone who is trying to experiment with an enhancement to relevance likely doesn't care if their TREC run takes 30 seconds instead of 20 seconds if the API is really easy and they aren't wasting time fighting with lucene? If you go this route, you could implement BM25, DFR, etc. as you suggested, as examples of how to use this API, and there would be more of a focus on API quality and simplicity instead of performance.
* or alternatively, you could refine your proposal to implement a really production-strength version of one of these scoring systems on top of the low-level API, that would ideally have competitive performance/documentation/etc. with Lucene's default scoring today. If you decide to do this, then yes, I would definitely suggest picking only one, because I think it's a ton of work as I listed above, and I think there would be more focus on practical things (some probably being nuances of lucene) and performance.
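As a rough picture of what a simpler API of the first kind could look like, the sketch below gathers the statistics into one object and asks the researcher to implement a single scoring method. Every name here (`EasyScoringSketch`, `Stats`, `RankingFunction`) is hypothetical and invented for illustration; none of it is the API that was eventually written. The BM25 parameters k1 = 1.2 and b = 0.75 are conventional defaults assumed for the example.

```java
// Hypothetical shape of a "one method with all the stats" scoring API.
final class EasyScoringSketch {

    /** Hypothetical bundle of collection and term statistics. */
    static final class Stats {
        final long numDocs;          // documents in the collection
        final long docFreq;          // documents containing the term
        final double avgFieldLength; // average field length in terms

        Stats(long numDocs, long docFreq, double avgFieldLength) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
            this.avgFieldLength = avgFieldLength;
        }
    }

    /** Hypothetical contract: score one term in one document from plain numbers. */
    interface RankingFunction {
        double score(Stats stats, double tf, double fieldLength);
    }

    /** BM25 written against the contract above (k1 and b are assumed defaults). */
    static final RankingFunction BM25 = (stats, tf, len) -> {
        final double k1 = 1.2;
        final double b = 0.75;
        double idf = Math.log(1.0 + (stats.numDocs - stats.docFreq + 0.5) / (stats.docFreq + 0.5));
        double norm = k1 * (1.0 - b + b * len / stats.avgFieldLength);
        return idf * tf * (k1 + 1.0) / (tf + norm);
    };

    /** A plain tf-idf variant, to show that a second model is only a few lines. */
    static final RankingFunction TF_IDF = (stats, tf, len) ->
            Math.sqrt(tf) * Math.log((double) stats.numDocs / stats.docFreq) / Math.sqrt(len);

    public static void main(String[] args) {
        Stats stats = new Stats(1000, 10, 100.0);
        System.out.println("BM25:   " + BM25.score(stats, 3, 80));
        System.out.println("tf-idf: " + TF_IDF.score(stats, 3, 80));
    }
}
```

The trade-off is the one described in the comment: computing everything from raw statistics on each call is easy to read and experiment with, but leaves no room for the precomputation a production-strength implementation would want.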
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13014619#comment-13014619 ]

David Mark Nemeskey commented on LUCENE-2959:

{quote}
I think you can decide what you want to do?
{quote}

Fair enough. :) I guess I'll stick with my original proposal then, though I might change a few things here and there; maybe change the focus from flexibility (as it seems to be already underway) to simplicity.
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013907#comment-13013907 ]

David Mark Nemeskey commented on LUCENE-2959:

Robert: thanks for all the info! It's nice to see so much work has already been done. I plan to delve into it after the selection, and try to get other things out of the way until then, so that I can concentrate on GSoC during the summer.

I think the main point would be to make the addition of a new ranking function as easy as possible. At least a prototype implementation should be very straightforward, even at the expense of performance. Then, if the new method provides good results, the developer can go on to the lower level to squeeze more juice out of it. It's hard for me to discuss this without knowing the code, of course, but do you think it is possible?

Even though I added a Performance section to my proposal (http://www.google-melange.com/gsoc/proposal/review/google/gsoc2011/davidnemeskey/1), I see now that it's probably more important than I believed it to be at first. I think I will follow your advice and concentrate on how to make BM25F fast. It may be a bit tougher nut to crack than DFR, as the latter has logarithms scattered all over it. However, the first thing that comes to mind is that the tf-BM25 curve becomes almost flat very quickly (less so for a high k1 value, though). So it may be possible to pre-compute a tf map or array for a query.
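The pre-computation idea at the end of this comment can be sketched as a small lookup table. The version below assumes k1 = 1.2 in the usage example and ignores length normalization (equivalent to b = 0), so that the saturated value depends on tf alone; with length normalization the table would need a second dimension. `TfSaturationTable` is a hypothetical name used only for this illustration.

```java
// Precomputed BM25 tf saturation, ignoring length normalization (b = 0 assumed).
final class TfSaturationTable {

    // table[tf] = tf * (k1 + 1) / (tf + k1) for tf = 0..maxTf
    static double[] build(double k1, int maxTf) {
        double[] table = new double[maxTf + 1];
        for (int tf = 0; tf <= maxTf; tf++) {
            table[tf] = tf * (k1 + 1.0) / (tf + k1);
        }
        return table;
    }

    // Frequencies beyond the table reuse the last entry, since the curve is nearly flat there.
    static double lookup(double[] table, int tf) {
        return table[Math.min(tf, table.length - 1)];
    }

    public static void main(String[] args) {
        double[] table = build(1.2, 32);
        for (int tf : new int[] {1, 2, 5, 10, 32, 1000}) {
            System.out.printf("tf=%d -> %.4f%n", tf, lookup(table, tf));
        }
    }
}
```

Clamping large frequencies to the last entry trades a small error for a bounded table; the error shrinks as the table grows because the curve approaches its ceiling of k1 + 1. As the comment notes, a higher k1 flattens the curve more slowly, so the table would need to be longer for the same accuracy.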
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13013944#comment-13013944 ] Robert Muir commented on LUCENE-2959: - {quote} I think the main point would be to make the addition of a new ranking function as easy as possible. At least a prototype implementation should be very straightforward, even at the expense of performance. Then, if the new method provides good results, the developer can go on to the lower level to squeeze more juice out of it. It's hard for me to discuss new this without knowing the code, of course, but do you think it is possible? {quote} This sounds great! For example, you could extend the low-level api, gather every possible statistic that lucene has, and present a high-level api that looks more like terrier's scoring api (which i'm guessing is what researchers would prefer?), where they basically implement the scoring in one method with all the stats there. So someone would extend this API to do prototyping, it would make it easier to experiment. {quote} I think I will follow your advice and concentrate on how to make BM25F fast. {quote} Actually as far as BM25f, this one presents a few challenges (some already discussed on LUCENE-2091). To summarize: * for any field, Lucene has a per-field terms dictionary that contains that term's docFreq. To compute BM25f's IDF method would be challenging, because it wants a docFreq across all the fields. (its not clear to me at a glance either from the original paper, if this should be across only the fields in the query, across all the fields in the document, and if a static schema is implied in this scoring system (in lucene document 1 can have 3 fields and document 2 can have 40 different ones, even with different properties). * the same issue applies to length normalization, lucene has a field length but really no concept of document length. 
So I just wanted to mention that while it's possible here to apply a per-field TF boost before the non-linear TF saturation, it's not immediately clear how to adapt the BM25f formula to Lucene: how to combine these scores without using a (wasteful) catch-all field and some lying behind the scenes to force this catch-all field's length normalization and docFreq to be used.

Too many questions arise for BM25f and how it would fit with Lucene, for example the fact that multiple fields can really mean anything, and having a field in Lucene doesn't mean at all that it was in your original document! For example, Solr users frequently use a copyField to take the content of one field and duplicate it to a different field (perhaps applying some processing). In terms of things like length normalization, it seems that document length calculated as the sum across the fields would be wrong for many use cases.

I only wanted to recommend against this one because of this rather serious challenge; it seems it's something we might want to table for the moment. Lucene is changing fast, and as new capabilities arise we might realize there is a more elegant way to address this... but at the moment I would recommend starting with BM25.

[GSoC] Implementing State of the Art Ranking for Lucene
---
Key: LUCENE-2959
URL: https://issues.apache.org/jira/browse/LUCENE-2959
Project: Lucene - Java
Issue Type: New Feature
Components: Examples, Javadocs, Query/Scoring
Reporter: David Mark Nemeskey
Labels: gsoc2011, lucene-gsoc-11, mentor
Attachments: LUCENE-2959_mockdfr.patch, implementation_plan.pdf, proposal.pdf

Lucene employs the Vector Space Model (VSM) to rank documents, which compares unfavorably to state of the art algorithms, such as BM25. Moreover, the architecture is tailored specifically to VSM, which makes the addition of new ranking functions a non-trivial task.
This project aims to bring state of the art ranking methods to Lucene and to implement a query architecture with pluggable ranking functions. -- This message is automatically generated by JIRA. For more information on JIRA, see: http://www.atlassian.com/software/jira - To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org For additional commands, e-mail: dev-h...@lucene.apache.org
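The prototyping API suggested in the comment above (all statistics handed over, scoring implemented in one method) could look roughly like the following Java sketch. None of these types exist in Lucene or Terrier; Stats, ScoringFunction and Bm25 are hypothetical names chosen for illustration, and the BM25 variant with a smoothed, non-negative IDF is just one reasonable choice.

```java
/** Hypothetical prototyping API; none of these types exist in Lucene. */
final class PrototypeScoring {

    /** All statistics for one (term, document) pair, gathered by a lower-level layer. */
    static final class Stats {
        final long numDocs;          // documents in the index
        final long docFreq;          // documents containing the term in this field
        final double avgFieldLength; // average length of this field
        final double fieldLength;    // length of this field in the current document
        final double tf;             // term frequency in the current document

        Stats(long numDocs, long docFreq, double avgFieldLength, double fieldLength, double tf) {
            this.numDocs = numDocs;
            this.docFreq = docFreq;
            this.avgFieldLength = avgFieldLength;
            this.fieldLength = fieldLength;
            this.tf = tf;
        }
    }

    /** The single method a researcher would implement to try out a model. */
    interface ScoringFunction {
        double score(Stats s);
    }

    /** Plain BM25 written against the hypothetical API above. */
    static final class Bm25 implements ScoringFunction {
        private final double k1;
        private final double b;

        Bm25(double k1, double b) { this.k1 = k1; this.b = b; }

        @Override
        public double score(Stats s) {
            double idf = Math.log(1.0 + (s.numDocs - s.docFreq + 0.5) / (s.docFreq + 0.5));
            double norm = k1 * (1.0 - b + b * s.fieldLength / s.avgFieldLength);
            return idf * s.tf * (k1 + 1.0) / (s.tf + norm);
        }
    }

    public static void main(String[] args) {
        ScoringFunction bm25 = new Bm25(1.2, 0.75);
        // Made-up statistics, only to show how the API would be called.
        System.out.println(bm25.score(new Stats(1000, 10, 100.0, 100.0, 1.0)));
    }
}
```

The design trade-off is the one described in the thread: handing over a fully populated Stats object for every document is convenient for experiments but does nothing for speed, so a model that proves itself would still be ported to the lower-level API.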
[jira] [Commented] (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13012996#comment-13012996 ] Robert Muir commented on LUCENE-2959: -

Hi David, to try to help get things moving, I created a branch: https://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring

This is the in-progress work from LUCENE-2392, which separates the scoring calculations from the postings-list matching. In short, Similarity becomes very low-level, but the idea is that you extend it to present a higher-level API (for example TFIDFSimilarity: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/java/org/apache/lucene/search/TFIDFSimilarity.java) that is user-friendly and allows users to adjust parameters in a way that makes sense for that scoring system.

As a start I implemented some very rough, basic models in src/test:
BM25: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockBM25Similarity.java
Dirichlet LM: http://svn.apache.org/repos/asf/lucene/dev/branches/flexscoring/lucene/src/test/org/apache/lucene/search/MockLMSimilarity.java

But these are in no way correct or extensible or nice. For example, the BM25 similarity is slow because, as implemented, its average document length is live (e.g. if you add more segments it's immediately adjusted for each query)... there is no caching at all. To speed up BM25, it could be nice for the Similarity to pull this up-front and create cached calculations. If a user wants to refresh their BM25 stats, they could do something in SimilarityProvider to tell it to recalculate the caches. However, for a user that wants a super-realtime view, it might be better for it to stay the way it is now, or alternatively for the Sim to do the 256 calculations up-front per query (ideally in Weight, not per-segment in docscorer) to tableize the length normalizations.
So these are the API challenges we need to consider if we want to provide actual implementations of these scoring systems: how to make them perform close to, or as fast as, Lucene's current scoring model.

Separately on the issue, I want to make Weight completely opaque to the sim; really it's just a way for a Similarity to compute things up front (such as IDF, but maybe things like these BM25 length norm caches too). Currently it can only have a single float value (see my un-sqrt'ing and other hacks in the Mock sims), so this should be fixed.

Additionally, another big TODO: just as Scorer was split (maybe we should rename it to Matcher now that the sim does the calcs?), the process of Explanations needs to be split too, so that a Sim is completely responsible for explaining itself.

Another TODO I have is to write the norm summation into the norms file as a single vlong, rather than computing it across all byte[] in segmentreader like I do now... I just implemented it this way so that we could play with scoring algorithms easily.

So, the good news is that scoring is a lot more flexible, but the bad news is that in order to support Lucene's features, implementing a new ranking system on top of Similarity is really *serious* work, as you need to:

# implement the lower-level API efficiently, yet expose a nice high-level API such as TFIDFSimilarity's tf() and idf() hooks for users.
# implement explanations so that users can debug relevance issues.
# think about allowing users to balance the various performance tradeoffs, such as balancing the performance gained by caching things versus using realtime statistics (some of this could be in my head; maybe computing 256 norm decoder caches up-front is really cheap and a non-issue).
# consider how to integrate Lucene's features into the ranking system: for example, how to estimate a reasonable phrase IDF for phrase/multiphrase/span queries; how to integrate index-time boosts (in my example BM25 etc., I just made the documents appear shorter to accomplish this); depending upon how the length normalization is being stored in the index, how to pick the best quantization (might not be SmallFloat352); etc.
# do all the relevance testing to ensure that things are correct (I found lots of bugs doing rough testing on my Mock ones, and there are probably more! But on the couple of test collections I tried they seemed reasonable).
# add good-quality documentation, such as what we have today in TFIDFSimilarity, that explains how the ranking system works and how you can tune it.
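The "256 calculations up-front" idea from the comment above can be sketched as follows: because the length norm is stored in a single byte, the BM25 length-normalization term takes at most 256 distinct values for a given average length, so it can be tabulated once per query and looked up per document. This is a minimal Java sketch with hypothetical names; in particular decodeLength is a stand-in, NOT Lucene's actual norm encoding, and the average length is assumed to be supplied by the caller.

```java
/** Sketch of tableizing BM25 length normalization; all names are hypothetical. */
final class Bm25NormCache {

    /**
     * Stand-in decoder from a one-byte norm to a field length.
     * ASSUMPTION: this is not Lucene's encoding; any fixed byte-to-length
     * mapping would work the same way for the purpose of the cache.
     */
    static float decodeLength(byte norm) {
        int v = norm & 0xFF;      // treat the byte as unsigned, 0..255
        return (float) v * v;     // made-up mapping: length = v squared
    }

    /** Computes k1 * (1 - b + b * len / avgLen) for all 256 possible norm bytes. */
    static float[] buildCache(float k1, float b, float avgLength) {
        float[] cache = new float[256];
        for (int i = 0; i < 256; i++) {
            cache[i] = k1 * (1 - b + b * decodeLength((byte) i) / avgLength);
        }
        return cache;
    }

    /** Per-document TF component of BM25: one array lookup plus a few flops. */
    static float tfPart(float tf, byte norm, float[] cache, float k1) {
        return tf * (k1 + 1) / (tf + cache[norm & 0xFF]);
    }

    public static void main(String[] args) {
        float k1 = 1.2f;
        float[] cache = buildCache(k1, 0.75f, 100f);   // built once per query
        System.out.println(tfPart(3f, (byte) 10, cache, k1));
    }
}
```

Rebuilding the table for every query keeps the average length live at the cost of 256 small computations; building it once and reusing it across queries is the cached alternative, at the price of slightly stale statistics.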
[jira] Commented: (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13005575#comment-13005575 ] David Mark Nemeskey commented on LUCENE-2959: - Andrzej: thanks! Indeed, I have read that paper, but have only skimmed through the code. I am also aware of at least one BM25 implementation for Lucene, which may or may not be what issue LUCENE-2091 is about. I need to have a look into it.
[jira] Commented: (LUCENE-2959) [GSoC] Implementing State of the Art Ranking for Lucene
[ https://issues.apache.org/jira/browse/LUCENE-2959?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=13005345#comment-13005345 ] Andrzej Bialecki commented on LUCENE-2959: --- You are probably familiar with this paper and the code... just in case I'm adding a reference here: http://arxiv.org/abs/0911.5046