Re: [GoSC] I'm interested in LUCENE-3333
On Sun, Mar 9, 2014 at 10:46 AM, Da Huang dhuang...@gmail.com wrote:

> Thanks a lot. That's very helpful. I think you get exactly what I mean about LUCENE-4396. By grouping the MUST clauses, the conjunctive part of the query can be handled on its own in a simple way. Then the original query would have no more than one MUST clause. I think in this situation it's much easier to judge whether to use BooleanScorer or BooleanScorer2. :)

Well, because we now have the DISI.cost() method, we can use this to find the least-cost MUST clause (e.g. the one matching the fewest documents) and then make a call up front on whether BS or BS2 is appropriate.

But these would all be fun ideas to explore under a GSoC project, if we can scope it appropriately.

Mike McCandless
http://blog.mikemccandless.com

To unsubscribe, e-mail: dev-unsubscr...@lucene.apache.org
For additional commands, e-mail: dev-h...@lucene.apache.org
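As a rough illustration of the cost-based decision described above, here is a minimal Java sketch. It does not use Lucene classes: `Clause`, `Pick`, `choose` and the `RATIO` threshold are all hypothetical names, and `cost` simply stands in for the estimate that DISI.cost() would report. The rule follows the reasoning in this thread (a rare MUST clause favours BooleanScorer2); the threshold value is an arbitrary placeholder, not a measured number.

```java
import java.util.List;

final class ScorerChoiceSketch {
    /** Which scorer the heuristic would pick (names only, no Lucene types). */
    enum Pick { BOOLEAN_SCORER, BOOLEAN_SCORER2 }

    /** Hypothetical stand-in for one clause of a boolean query. */
    static final class Clause {
        final boolean must;  // true for a MUST (+) clause
        final long cost;     // estimated matching docs, as DISI.cost() would report

        Clause(boolean must, long cost) {
            this.must = must;
            this.cost = cost;
        }
    }

    /** Assumed threshold: how much cheaper the rarest MUST clause must be. */
    static final long RATIO = 8;

    static Pick choose(List<Clause> clauses) {
        long leastMust = Long.MAX_VALUE;  // cost of the cheapest MUST clause
        long otherCost = 0;               // summed cost of all non-MUST clauses
        for (Clause c : clauses) {
            if (c.must) {
                leastMust = Math.min(leastMust, c.cost);
            } else {
                otherCost += c.cost;
            }
        }
        if (leastMust == Long.MAX_VALUE) {
            // No MUST clause at all: assume bulk scoring is acceptable.
            return Pick.BOOLEAN_SCORER;
        }
        // A rare MUST clause lets the scorer skip most documents, so prefer BS2.
        return leastMust < otherCost / RATIO ? Pick.BOOLEAN_SCORER2 : Pick.BOOLEAN_SCORER;
    }
}
```

With one MUST clause of cost 10 and one optional clause of cost 100000, this sketch picks BOOLEAN_SCORER2; when the MUST clause is nearly as frequent as the optional one, it picks BOOLEAN_SCORER.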
Re: [GoSC] I'm interested in LUCENE-3333
Hi, Da,

On Sun, Mar 9, 2014 at 1:30 AM, Da Huang dhuang...@gmail.com wrote:

> I have spent some time considering the suggestions in your last mail. I find that I'm interested in the suggestion that Filter and Query should be more "combined".

OK, cool, and ambitious; it might be safer to choose a less ambitious/controversial change for a GSoC project. Maybe have a look at LUCENE-1518? There was lots of discussion there.

> In my opinion, to implement this suggestion, a new class FilterQuery, which is a subclass of Query, should be created. If FilterQuery is implemented, then it can be the query element of a BooleanClause, and a BooleanQuery can naturally add a Filter as a BooleanClause. I think one of the most important things is to deal with the scores, as a Filter does not contribute anything to the score.

I feel like it should be the opposite? Like, a Filter has less functionality than a Query, because it does only matching? So I would think a Query would subclass Filter and then add scoring onto it? But there was lots of discussion on the above issue that I don't remember...

Mike McCandless
http://blog.mikemccandless.com
Re: [GoSC] I'm interested in LUCENE-3333
Hi, Mike.

You're right. After having a look at the comments on LUCENE-1518, I find that my idea about it has many flaws. Sorry for that.

So I have checked some of the other suggestions you gave me to see whether relevant comments can be found in Jira.

I think I have an idea on LUCENE-4396: "BooleanScorer should sometimes be used for MUST clauses". Can we adjust the query to make the problem easier? Taking the query +a b c +e +f as an example, maybe we can turn it into (+a +e +f) b c, which has only one MUST clause. Then it would be easier to judge which scorer to use?

Besides, it seems that the suggestion "we should pass a needsScorers boolean up-front to Weight.scorer" is not in Jira. But it sounds like it could be done by adjusting the arguments and return values of some class methods to pass needsScorers? Not sure.

Lastly, I recently found something strange in the heap-related code. A heap has been implemented many times over in trunk, and a PriorityQueue is also implemented in the package org.apache.lucene.util. I remember Java already provides a PriorityQueue. Why not use that?

Thanks,
Da Huang

--
黄达(Da Huang)
Team of Search Engine Web Mining
School of Electronic Engineering Computer Science
Peking University, Beijing, 100871, P.R.China
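The regrouping proposed above can be sketched in a few lines of Java. This is only an illustration under stated assumptions: `Clause`, `groupMust` and `render` are hypothetical helpers, not Lucene's BooleanClause or BooleanQuery API. The nested group is printed with a leading + because the group itself is the single remaining MUST clause.

```java
import java.util.ArrayList;
import java.util.List;

final class MustGroupingSketch {
    /** Hypothetical clause: either a single term or a nested group of clauses. */
    static final class Clause {
        final boolean must;        // true for a MUST (+) clause
        final String term;         // non-null for a leaf term
        final List<Clause> group;  // non-null for a nested group

        private Clause(boolean must, String term, List<Clause> group) {
            this.must = must;
            this.term = term;
            this.group = group;
        }

        static Clause term(boolean must, String term) {
            return new Clause(must, term, null);
        }

        static Clause group(boolean must, List<Clause> group) {
            return new Clause(must, null, group);
        }

        @Override
        public String toString() {
            String body = (term != null) ? term : "(" + render(group) + ")";
            return (must ? "+" : "") + body;
        }
    }

    /** Moves all MUST clauses into one nested MUST group; other clauses stay flat. */
    static List<Clause> groupMust(List<Clause> clauses) {
        List<Clause> musts = new ArrayList<>();
        List<Clause> rest = new ArrayList<>();
        for (Clause c : clauses) {
            (c.must ? musts : rest).add(c);
        }
        if (musts.size() < 2) {
            return new ArrayList<>(clauses);  // zero or one MUST clause: nothing to group
        }
        List<Clause> out = new ArrayList<>();
        out.add(Clause.group(true, musts));
        out.addAll(rest);
        return out;
    }

    /** Renders clauses in query-string style, separated by spaces. */
    static String render(List<Clause> clauses) {
        StringBuilder sb = new StringBuilder();
        for (Clause c : clauses) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(c);
        }
        return sb.toString();
    }
}
```

Applied to the clauses of +a b c +e +f, `groupMust` yields a list that renders as +(+a +e +f) b c.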
Re: [GoSC] I'm interested in LUCENE-3333
On Sun, Mar 9, 2014 at 9:55 AM, Da Huang dhuang...@gmail.com wrote:

> Hi, Mike. You're right. After having a look at the comments on LUCENE-1518, I find that my idea about it has many flaws. Sorry for that.

It's fine, it's a VERY hard fix :) This is why it hasn't been done yet!

> So I have checked some of the other suggestions you gave me to see whether relevant comments can be found in Jira. I think I have an idea on LUCENE-4396: "BooleanScorer should sometimes be used for MUST clauses". Can we adjust the query to make the problem easier? Taking the query +a b c +e +f as an example, maybe we can turn it into (+a +e +f) b c, which has only one MUST clause. Then it would be easier to judge which scorer to use?

You mean create nesting when there wasn't before, by grouping all MUST clauses together? We could explore that ... Or we could pass all the clauses (still flat) to BooleanScorer. I think this would only be faster when the MUST clauses are high cost relative to all other clauses. E.g. a super-rare MUST'd clause would probably be faster with BooleanScorer2. I think this could make a good GSoC project.

> Besides, it seems that the suggestion "we should pass a needsScorers boolean up-front to Weight.scorer" is not in Jira. But it sounds like it could be done by adjusting the arguments and return values of some class methods to pass needsScorers? Not sure.

I think it's this Jira: https://issues.apache.org/jira/browse/LUCENE-3331 (I just searched for "needs scores" on http://jirasearch.mikemccandless.com and it was one of the suggestions). All that should be needed here is to add a boolean needsScores (or something) to the Weight.scorer method, and fix the numerous places where this method is invoked to pass the right value. E.g. ConstantScoreQuery would pass false, and this would mean e.g. if it wraps a TermQuery, we could avoid decoding freq blocks from the postings.

> Lastly, I recently found something strange in the heap-related code. A heap has been implemented many times over in trunk, and a PriorityQueue is also implemented in the package org.apache.lucene.util. I remember Java already provides a PriorityQueue. Why not use that?

Good question! There is a fair amount of duplicated code, and we should fix that over time. Lucene has had its own PQ class forever, and we do strange things like pre-filling the queue with a sentinel value to avoid if (queueIsNotFullYet) checks in collect(int doc), and we can replace the top value and re-heap ... but maybe these do not in fact matter in practice and if so we should stop duplicating code :)

Mike McCandless
http://blog.mikemccandless.com
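To make the sentinel trick mentioned above concrete, here is a minimal, self-contained Java sketch of a top-k collector. It is not Lucene's org.apache.lucene.util.PriorityQueue; the class `SentinelTopK` and its methods are hypothetical. The heap is pre-filled with negative infinity so that `collect` never has to ask whether the queue is full yet, and a better score replaces the top in place followed by a single sift-down.

```java
import java.util.Arrays;

final class SentinelTopK {
    // Binary min-heap of the best scores seen so far; heap[0] is the weakest kept score.
    private final float[] heap;

    SentinelTopK(int k) {
        if (k < 1) {
            throw new IllegalArgumentException("k must be at least 1");
        }
        heap = new float[k];
        // Pre-fill with a sentinel that loses to every real score.
        // An array of equal values is already a valid heap.
        Arrays.fill(heap, Float.NEGATIVE_INFINITY);
    }

    /** Offers one score; there is no "is the queue full yet?" branch. */
    void collect(float score) {
        if (score <= heap[0]) {
            return;  // not competitive with the current weakest entry
        }
        heap[0] = score;  // replace the top in place ...
        siftDown();       // ... and restore heap order
    }

    private void siftDown() {
        int i = 0;
        final int n = heap.length;
        while (true) {
            int left = 2 * i + 1;
            int right = left + 1;
            int smallest = i;
            if (left < n && heap[left] < heap[smallest]) {
                smallest = left;
            }
            if (right < n && heap[right] < heap[smallest]) {
                smallest = right;
            }
            if (smallest == i) {
                return;
            }
            float tmp = heap[i];
            heap[i] = heap[smallest];
            heap[smallest] = tmp;
            i = smallest;
        }
    }

    /** Returns the collected scores, best first, with unused sentinel slots dropped. */
    float[] topScoresDescending() {
        float[] sorted = heap.clone();
        Arrays.sort(sorted);  // ascending, so sentinels come first
        int real = 0;
        for (float f : sorted) {
            if (f != Float.NEGATIVE_INFINITY) {
                real++;
            }
        }
        float[] out = new float[real];
        for (int i = 0; i < real; i++) {
            out[i] = sorted[sorted.length - 1 - i];
        }
        return out;
    }
}
```

By contrast, java.util.PriorityQueue has no replace-top operation, so the equivalent loop needs a size check plus a poll() followed by an add(). Whether that difference matters in practice is exactly the open question raised above.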
Re: [GoSC] I'm interested in LUCENE-3333
Thanks a lot. That's very helpful.

I think you get exactly what I mean about LUCENE-4396. By grouping the MUST clauses, the conjunctive part of the query can be handled on its own in a simple way. Then the original query would have no more than one MUST clause. I think in this situation it's much easier to judge whether to use BooleanScorer or BooleanScorer2. :)

Thanks,
Da Huang

--
黄达(Da Huang)
Team of Search Engine Web Mining
School of Electronic Engineering Computer Science
Peking University, Beijing, 100871, P.R.China
Re: [GoSC] I'm interested in LUCENE-3333
On Fri, Mar 7, 2014 at 9:01 PM, Da Huang dhuang...@gmail.com wrote:

> Hello, everyone,
>
> My name is Da Huang. I'm studying for my master's degree in Computer Science at Peking University. I have been using Lucene for about half a year. It's so elegant that I hope to have a chance to contribute some code to it.

Welcome!

> Therefore, I have been scanning the Jira GSoC 2014 Ideas page for Lucene for several days. I find LUCENE-: "Specialize DisjunctionScorer if all clauses are TermQueries" more suitable for me to do. I have spent some time scanning the relevant code, and the issue LUCENE-3328, which spun off LUCENE-. The following questions confuse me.
>
> 1) I have checked out the code from http://svn.apache.org/repos/asf/lucene/dev/trunk lucene_trunk, but I couldn't find the relevant code for the fixed issue LUCENE-3328. It seems that the patch attached to the page is not on trunk. Why?

Well, some time after LUCENE-3328, we made further changes and discovered that this specialized scorer was not in fact [that much?] faster. I forget which issue removed it... but you could probably find it with some svn archaeology.

Net/net the trend in Lucene has been against adding source code specialization, since this is really code duplication to try to make HotSpot's life easier. Unfortunately, it does sometimes work; e.g. see http://blog.mikemccandless.com/2013/06/screaming-fast-lucene-searches-using-c.html though that's not a fair comparison since it was also a different programming language!

So, while it's a nice tradeoff for performance, it's a poor tradeoff for ongoing code management. See all the specialized collectors we have in TopFieldCollector!

So I'm not sure at this point if we should even pursue LUCENE-.

There are however tons of other things to fix on the search side; maybe we could craft a good GSoC project from something else; e.g.:

- we should pass a needsScorers boolean up-front to Weight.scorer
- disjunctions now score during matching
- BooleanScorer should sometimes be used for MUST clauses
- We sort of duplicate code across BooleanQuery, FilteredQuery, BooleanFilter, TermsFilter
- Somehow, Filter and Query should be more combined; e.g. you should be able to add a Filter as a clause onto a BooleanQuery
- Post filtering is too hard to use today
- ...

> 2) My intuitive idea for solving this issue is to make a class DisjunctionTermScorer to handle the case where all clauses are TermQueries, and then decide whether to use DisjunctionTermScorer in the 'scorer' method of class BooleanQuery. Is this idea right?

Yes this would be the right idea.

> Above are my questions about LUCENE-. Besides, I would like to propose the following issue, which is about the QueryParser. When we use QueryParser to parse a query string like science AND (engineering AND technology), the generated query is +science (+engineering +technology). I think it would be more efficient for searching if the final query were +science +engineering +technology. My idea is to flatten cascaded ANDs and cascaded ORs. Do you agree? I hope I have made my idea clear.

I think this would make tons of sense; the only challenge is that this will change how scores are computed, when coord is enabled. I'm not sure how much that'd matter in practice; if it is important to preserve that, then maybe we could still make a single 3-clause BooleanQuery, but somehow remember the original structure for the sake of coord scoring ... not sure.

Mike McCandless
http://blog.mikemccandless.com
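The flattening of cascaded ANDs discussed above could look roughly like the following Java sketch. It works on a hypothetical `Node` tree rather than on Lucene's BooleanQuery, and it handles only the all-MUST case; `Node`, `flatten` and `render` are illustrative names. It also ignores the coord concern raised in the reply: the original nesting is simply discarded.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class AndFlatteningSketch {
    /** Hypothetical parse-tree node: a leaf term or an AND group whose children are all MUST. */
    static final class Node {
        final String term;          // non-null for a leaf
        final List<Node> children;  // non-null for an AND group

        private Node(String term, List<Node> children) {
            this.term = term;
            this.children = children;
        }

        static Node term(String term) {
            return new Node(term, null);
        }

        static Node and(Node... children) {
            return new Node(null, Arrays.asList(children));
        }
    }

    /** Collects the leaf terms of arbitrarily nested AND groups into one flat list. */
    static List<String> flatten(Node node) {
        List<String> out = new ArrayList<>();
        collect(node, out);
        return out;
    }

    private static void collect(Node node, List<String> out) {
        if (node.term != null) {
            out.add(node.term);
            return;
        }
        for (Node child : node.children) {
            collect(child, out);  // nested AND groups dissolve into the parent
        }
    }

    /** Renders the flat conjunction in query-string style, e.g. "+a +b". */
    static String render(List<String> terms) {
        StringBuilder sb = new StringBuilder();
        for (String t : terms) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append('+').append(t);
        }
        return sb.toString();
    }
}
```

For the tree of science AND (engineering AND technology), `render(flatten(...))` gives +science +engineering +technology. Preserving coord scoring would require keeping a record of the original structure alongside the flat list, which this sketch does not attempt.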
Re: [GoSC] I'm interested in LUCENE-3333
Hello, Mike.

I have spent some time considering the suggestions in your last mail. I find that I'm interested in the suggestion that Filter and Query should be more "combined".

In my opinion, to implement this suggestion, a new class FilterQuery, which is a subclass of Query, should be created. If FilterQuery is implemented, then it can be the query element of a BooleanClause, and a BooleanQuery can naturally add a Filter as a BooleanClause. I think one of the most important things is to deal with the scores, as a Filter does not contribute anything to the score.

Above is my intuitive idea about this suggestion. Do you think it makes sense? I hope I have made my idea clear.

Thanks,
Da Huang

--
黄达(Da Huang)
Team of Search Engine Web Mining
School of Electronic Engineering Computer Science
Peking University, Beijing, 100871, P.R.China
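The FilterQuery idea above can be pictured with a very small Java sketch. Everything here is an assumption made for illustration: `Query` is a hypothetical two-method interface rather than Lucene's abstract Query class, and the filter is modelled as a plain `IntPredicate` over document IDs. The wrapper matches whatever the filter matches and always contributes a score of zero.

```java
import java.util.function.IntPredicate;

final class FilterQuerySketch {
    /** Hypothetical minimal query contract: decide whether a doc matches, and score it. */
    interface Query {
        boolean matches(int doc);

        float score(int doc);
    }

    /** Adapts a match-only filter to the Query contract; it never adds to the score. */
    static final class FilterQuery implements Query {
        private final IntPredicate filter;

        FilterQuery(IntPredicate filter) {
            this.filter = filter;
        }

        @Override
        public boolean matches(int doc) {
            return filter.test(doc);
        }

        @Override
        public float score(int doc) {
            return 0f;  // a filter restricts the result set but does not rank it
        }
    }
}
```

For example, `new FilterQuerySketch.FilterQuery(doc -> doc % 2 == 0)` matches even document IDs and scores each of them 0.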