[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704499#action_12704499 ] Shai Erera commented on LUCENE-1518: I would like to ask why we need to make Filter and Query the same type. After all, they do different things, even though they look similar. Attempting to merge them yields these peculiarities: # If Filter extends Query, it now has to implement all sorts of methods like weight, toString, rewrite, getTerms and scoresDocInOrder (an addition from LUCENE-1593). # If Query extends Filter, it has to implement getDocIdSet. # We introduce instanceof checks in places just to check whether a given Query is actually a Filter or not. Both (1) and (2) are completely redundant for both Query and Filter, i.e. why should Filter implement toString(term) or scoresDocInOrder when it does not score docs? Why should Query implement getDocIdSet when it already implements weight().scorer(), which returns a DocIdSetIterator? I read the different posts on this issue and I don't understand why we think that the API today is not clear or convenient enough: * If I want to just filter the entire index, I have two ways: (1) execute a search with MatchAllDocsQuery and a Filter, or (2) wrap the filter with ConstantScoreQuery. I don't see the difference between the two, and I don't think it forces any major/difficult decision on the user. * If I want to have a BooleanQuery with several clauses and I want one clause to be a complex one with a Filter, I can wrap the Filter with CSQ. * If I want to filter a Query, there is already API today on Searcher which accepts both a Query and a Filter. At least as I understand it, Queries are supposed to score documents, while Filters just filter. 
If there is an API which requires Queries only, then I can wrap my Filter with CSQ, but I'd prefer to check whether we can change that API first (for example, allowing BooleanClause to accept a Filter, and implement a weight(IndexReader) rather than just getQuery()). So if Filters just filter and Queries just score, the API on both is very clear: Filter returns a DISI and Query returns a Scorer (which is also a DISI). I don't see the advantage of having the code unaware of the fact that a certain Query is actually a Filter - I prefer it to be upfront. That way, we can do all sorts of optimizations, like asking the Filter for next() first, if we know it's supposed to filter out most of the documents. At the end of the day, both Filter and Query iterate on documents; the difference lies in the purpose of the iteration. In my code there are several Query implementations today that just filter documents, and I plan to change all of them to implement Filter instead (they were originally Queries because Filter had just bits(), and now it's more efficient with the iterator() version, at least to me). I want to do this for a couple of reasons, clarity being one of the most important. If Filter just filters, I don't see why it should inherit all the methods from Query (or vice versa, BTW), especially when I have this CSQ wrapper. To me, as a Lucene user, I make far more complicated decisions every day than deciding whether I want to use a Filter as a Query or not. If I pass it directly to IndexSearcher, I use it as a filter. If I use a different API which accepts just a Query, I wrap it with CSQ. As simple as that. But that's just my two cents. 
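The split described above (a Filter yields a plain doc-id iterator, while a Query's Scorer is a doc-id iterator that also scores, and CSQ bridges the two) can be sketched with toy types. This is a minimal model, not the real Lucene API: the interfaces below only mimic the shape of DocIdSetIterator, Scorer, and ConstantScoreQuery.

```java
// Toy model of the contract discussed above: a Filter yields a plain
// doc-id iterator, a Query's Scorer is a doc-id iterator that also scores,
// and wrapping a filter as a constant-scoring query is trivial (what CSQ does).
public class FilterVsQuery {
    interface DocIdSetIterator {
        int nextDoc(); // next doc id, or -1 when exhausted
    }

    interface Scorer extends DocIdSetIterator {
        float score(); // only scorers carry scores
    }

    // A "filter": iterates a fixed, sorted list of accepted doc ids.
    static DocIdSetIterator listFilter(int[] docs) {
        return new DocIdSetIterator() {
            int i = -1;
            public int nextDoc() { return ++i < docs.length ? docs[i] : -1; }
        };
    }

    // Wrapping a filter as a constant-scoring query.
    static Scorer constantScore(DocIdSetIterator filter, float score) {
        return new Scorer() {
            public int nextDoc() { return filter.nextDoc(); }
            public float score() { return score; }
        };
    }

    public static void main(String[] args) {
        Scorer s = constantScore(listFilter(new int[]{2, 5, 9}), 1.0f);
        int doc;
        while ((doc = s.nextDoc()) != -1) {
            System.out.println(doc + " score=" + s.score());
        }
    }
}
```

The wrapper is a one-liner either way, which is Shai's point: keeping the two types separate costs the user almost nothing.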
Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1518.patch This issue presents a patch that merges Queries and Filters in such a way that the new Filter class extends Query. This would make it possible to use every filter as a query. The new abstract Filter class would contain all methods of ConstantScoreQuery, and deprecate ConstantScoreQuery. If somebody implements the Filter's getDocIdSet()/bits() methods, he has nothing more to do; he can just use the filter as a normal query. I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way that every Filter can automatically be used in all places where a Query can be used (e.g. also alone as a search query without any other constraint). For that, the abstract Query methods must be implemented and return a default weight for Filters, which is the current ConstantScore logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part can be directly used; the weight is useless (as it is currently, too). The constant score default implementation is
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704522#action_12704522 ] Shai Erera commented on LUCENE-1614: I think I understand what you mean, but please correct me if I'm wrong. You propose this check() so that in case a DISI can save any extra operations it does in next() (such as reading a payload for example) it will do so. Therefore, in the example you give above with CS, next()'s contract forces it to advance all the sub-scorers, but with check() it could stop in the middle. This warrants explicit documentation and implementation by the current DISIs ... I don't think that if you call a DISI today with next(10) and then next(10) again it will not move to 11 in the second call. But calling check(10) and then next(10) MUST not advance the DISI further than 10. If the default impl in DISI just uses nextDoc() and returns true if the return value is the requested, we should be safe back-compat-wise, but this is still dangerous and we need clear documentation. BTW, perhaps a testAndSet-like version can save check(10) followed by a next(10), and will fit nicer? Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean Key: LUCENE-1614 URL: https://issues.apache.org/jira/browse/LUCENE-1614 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants of those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far: # Deprecate those two methods. # Add nextDoc() and skipToDoc(int) that return the doc, with default impls in DISI (calling next() and skipTo() respectively; they will be changed to abstract in 3.0). 
#* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target. # Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1, for improved performance. I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
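The proposed return-the-doc API and the loop idiom from the summary above can be sketched over a plain sorted array. The class name and internals here are toys; only the nextDoc()/advance(int) signatures and the -1 end-of-stream convention follow the proposal.

```java
// Toy DocIdSetIterator over a sorted doc-id array, using the proposed
// return-the-doc API instead of boolean next()/skipTo(int) plus doc().
public class AdvanceLoop {
    static class IntArrayDISI {
        private final int[] docs;
        private int i = -1;

        IntArrayDISI(int[] docs) { this.docs = docs; }

        // Proposed nextDoc(): advance by one, return the doc, or -1 at the end.
        int nextDoc() { return ++i < docs.length ? docs[i] : -1; }

        // Proposed advance(int): move to the first doc >= target, or -1.
        int advance(int target) {
            while (++i < docs.length) if (docs[i] >= target) return docs[i];
            return -1;
        }
    }

    // Count all docs from `target` onward, using the '(doc = advance()) >= 0'
    // idiom from the issue: assign and test in one expression, no doc() call.
    static int countFrom(IntArrayDISI it, int target) {
        int count = 0, doc;
        if ((doc = it.advance(target)) >= 0) {
            count = 1;
            while ((doc = it.nextDoc()) >= 0) count++;
        }
        return count;
    }

    public static void main(String[] args) {
        System.out.println(countFrom(new IntArrayDISI(new int[]{1, 4, 7, 9}), 5)); // docs 7 and 9
    }
}
```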
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704556#action_12704556 ] Michael McCandless commented on LUCENE-1614: bq. Just to clarify for myself, in the example I gave above, suppose that the scorer is on 3 and you call check(8). On check(8), TermScorer would go to 10, stop there, and return false. (It would not rewind to 3). check() can only be called on increasing arguments, so it's not truly random access. It's forward-only random access. bq. You propose this check() so that in case a DISI can save any extra operations it does in next() (such as reading a payload for example) it will do so. Therefore in the example you give above with CS, next()'s contract forces it to advance all the sub-scorers, but with check() it could stop in the middle. Precisely. This is important when you have a super-cheap iterator (say a somewhat sparse (<= 10%?) in-memory filter that's represented as a list of docIDs). It's very fast for such a filter to iterate over its docIDs. But when that iterator is AND'd with a Scorer, as is done today by IndexSearcher, they effectively play leap frog, where first it's the filter's turn to next(), then it's the Scorer's turn, etc. But for the Scorer, next() can be extremely costly, only to find the filter doesn't accept it. So for such situations it's better to let the filter drive the search, calling Scorer.check() on the docs. But... once we switch to filter-as-BooleanClause, it's less clear whether check() is worthwhile, because I think the filter's constraint is more efficiently taken into account. For filters that support random access (if they are less sparse, say >= 25% or so), we should push them all the way down to the TermScorers and factor them in just like deletedDocs. bq. 
If the default impl in DISI just uses nextDoc() and returns true if the return value is the requested, we should be safe back-compat-wise, but this is still dangerous and we need clear documentation. Yes it does have a good default impl, I think. bq. BTW, perhaps a testAndSet-like version can save check(10) followed by a next(10), and will fit nicer? Not sure what you mean by testAndSet-like version?
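Mike's "let the sparse filter drive the search" scenario can be modeled with a toy scorer whose check() is forward-only and never rewinds, as he describes for TermScorer. Everything here is a sketch: check() is the name from this thread, not a shipped Lucene method, and the advance counter only stands in for the real cost of Scorer.next().

```java
import java.util.ArrayList;
import java.util.List;

// Toy model of filter-driven conjunction: the cheap, sparse filter iterates
// its own doc ids and merely asks the scorer to confirm each one via the
// proposed forward-only check(), instead of leap-frogging next() calls.
public class FilterDriven {
    static class CountingScorer {
        final int[] docs;       // sorted "postings"
        int i = -1;
        int advances = 0;       // stands in for the real cost of next()

        CountingScorer(int[] docs) { this.docs = docs; }

        // check(target): forward-only; moves to the first doc >= target and
        // reports whether target itself matched. It never rewinds, so callers
        // must pass increasing targets (the contract from the thread).
        boolean check(int target) {
            while (i < 0 || (i < docs.length && docs[i] < target)) { i++; advances++; }
            return i < docs.length && docs[i] == target;
        }
    }

    static List<Integer> intersect(int[] filterDocs, CountingScorer scorer) {
        List<Integer> hits = new ArrayList<>();
        for (int doc : filterDocs) if (scorer.check(doc)) hits.add(doc);
        return hits;
    }

    public static void main(String[] args) {
        CountingScorer scorer = new CountingScorer(new int[]{1, 2, 3, 5, 8, 13});
        // Only three check() calls for a three-doc filter, however dense the scorer.
        System.out.println(intersect(new int[]{2, 8, 40}, scorer));
    }
}
```

The payoff Mike describes is that the expensive side only does cheap forward positioning, and can skip any per-doc extras (payloads, sub-scorer advances) for docs the filter never proposes.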
What are we allowed to do in 3.0?
Hi, Recently I was involved in several issues that required some runtime changes to be done in 3.0, and it was not so clear what it is that we're actually allowed to do in 3.0. So I'd like to open it for discussion, unless everybody agrees on it already. So far I can tell that 3.0 allows us to: 1) Get rid of all the deprecated methods 2) Move to Java 1.5 But what about changes to runtime behavior, refactoring a whole set of classes etc.? I'd like to relate to them one by one, so that we can comment on each.

Removing deprecated methods - As I was told, 2.9 is the last release in which we are allowed to mark methods as deprecated and remove them in 3.0. I.e., after 2.9 is out, if we feel there is a method that should be renamed, have its signature changed, or be removed altogether, we can't just do it; we'd have to deprecate it and remove it in 4.0 (?). I personally thought that 2.9 allows us to make these changes without letting anyone know about them in advance, which I'm ok with since upgrading to 3.0 is not going to be as 'smooth', but I also understand why 'letting people know in advance' (which means a release prior to the one in which they are removed) gives a better service to our users. I also thought that jar drop-in-ability is not supposed to be supported from 2.9 to 3.0 (but I was corrected previously on that). This means that upon releasing 2.9, the very first issue that's opened and committed should be removing the current deprecated methods, or otherwise we could open an issue that deprecates a method and accidentally remove it later, when we handle the massive deprecation removal. We should also create a 2.9 tag.

Changes to runtime behavior - What exactly is the policy here? If we document in 2.9 that certain features' runtime behavior will change in 3.0 - is it ok to make those changes? And if we don't document them and do it (in the transition from 2.9 to 3.0) then it's not? Why? 
After all, I expect anyone who upgrades to 3.0 to run all of his unit tests to assert that everything still works (I expect that for every release, but for 3.0 in particular). Obviously the runtime behaviors that were changed and documented in 2.9 are ones that he might have already taken care of, but why can't he do the same by reading the CHANGES of 3.0? I just feel that this policy forces us to think really hard and foresee those changes in runtime behavior that we'd like to do in 3.0 so that we can get them into 2.9, but at the end of the day we're not improving anything. Upon upgrading to 2.9 I cannot handle the changes in runtime behavior, as they haven't been made yet. I can only do that after I upgrade to 3.0. So what's the difference, for me, between fixing those that were documented in 2.9 and the new ones that were just released? Going forward, I don't think this community changes runtime behavior every other Monday, and so I'd like to have the ability to make those changes without such a strict policy. Those changes are meant to help our users (and we are amongst them) achieve better performance, usually, so why should we fear making them - or, if fear is too strong a word, why should we refrain from making them, while documenting the changes? If we don't want to do it for every 'dot' release, we can do them in major releases, and I'd also vote for doing them in a mid-major release, like 3.5.

Refactoring - Today we are quite limited with refactoring. We cannot add methods to interfaces or abstract methods to abstract classes, or even make classes abstract. I'm perfectly fine with that, as I don't want to face the need to suddenly refactor my application just because Lucene decided to add a method to an interface. But recently, while working on LUCENE-1593, Mike and I spotted a need to add some methods to Weight, but since it is an interface we can't. So I said something like "let's do it in 3.0", but then we were not sure if this can be done in 3.0. 
So the alternative was: let's deprecate Weight, create an AbstractWeight class and do it there - but we weren't sure if that's even something we can push for in 3.0, unless we do all of it in 2.9. This also messes up the code, introducing new classes with bad names (AbstractWeight, AbstractSearchable) where we could have avoided it if we just changed Weight to an abstract class in 3.0. --- I think it all boils down to whether we MUST support jar drop-in-ability when upgrading from 2.9 to 3.0. I think that we shouldn't, as the whole notion of 3.0 (or any future major version) is a major revision to code, index structure, JDK etc. If we're always expected to support it, then 2.9 really becomes 3.0 in terms of our ability to make changes to the API between 2.9 and 3.0. I'm afraid that if that's the case, we might choose to hold off on 2.9 as long as we can so we can push as many changes as we foresee into it, so that they can be finalized in 3.0. I'm not
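The Weight-as-interface problem mentioned above comes down to a basic API-evolution fact (at least pre-Java-8, the era of this thread): an interface cannot grow a method without breaking every external implementation, while an abstract class can add one with a default body. A minimal illustration, with hypothetical names echoing the mail (AbstractWeight, and a scoresDocsInOrder-style addition):

```java
// Why "change Weight to an abstract class" matters for back-compat:
// an abstract class can gain a method with a default body, and existing
// subclasses keep compiling; an interface (before Java 8 default methods)
// cannot.
public class ApiEvolution {
    abstract static class AbstractWeight {
        abstract float getValue();

        // Method added in a later release. Because a default implementation
        // is supplied, subclasses written against the old API still compile.
        boolean scoresDocsInOrder() { return false; }
    }

    // A pre-existing "user" subclass that knows nothing of the new method.
    static class LegacyWeight extends AbstractWeight {
        float getValue() { return 1.0f; }
    }

    public static void main(String[] args) {
        AbstractWeight w = new LegacyWeight();
        System.out.println(w.getValue() + " " + w.scoresDocsInOrder());
    }
}
```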
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704559#action_12704559 ] Michael McCandless commented on LUCENE-1593: Another thing we should improve about the Scorer API: enrich the Scorer API to optionally provide more details on the positions that caused a match to occur. This would improve highlighting (LUCENE-1522) since we'd know exactly why a match occurred (single source) rather than trying to reverse-engineer the match. It'd also address a number of requests over time by users on "how can I get details on why this doc matched?". I *think* if we did this, the *SpanQuery would be able to share much more w/ their normal counterparts; this was discussed @ http://www.nabble.com/Re%3A-Make-TermScorer-non-final-p22577575.html. I.e. we would have a single TermQuery, just as efficient as the one today, but it would expose a getMatches() (say) that enumerates all positions that matched. Then, if one wanted these details for every hit in the topN, we could make an IndexReader impl that wraps TermVectors for the docs in the topN (since TermVectors are basically a single-doc inverted index), run the query on it, and request the match details per doc. Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs() order. # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. 
# Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check for reusableSD == null. # Also move to changing top and then calling adjustTop(), in case we update the queue. # Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But doing so should not be necessary (since we already break ties by docID), and it is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just storing the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them.
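The sentinel idea from item 3 of the plan can be shown with java.util.PriorityQueue standing in for Lucene's HitQueue (a sketch only; the real patch works on HitQueue's internal array). Pre-filling the queue with -Infinity sentinels means the hot loop never needs a "queue not full yet" or null check: a single comparison against the current worst entry decides everything.

```java
import java.util.PriorityQueue;

// Top-K collection with the queue pre-filled with sentinel scores, so the
// per-hit loop is a single compare against the current worst entry.
public class SentinelTopK {
    static float[] topK(float[] scores, int k) {
        PriorityQueue<Float> pq = new PriorityQueue<>(k); // min-heap
        for (int i = 0; i < k; i++) pq.add(Float.NEGATIVE_INFINITY); // sentinels

        for (float s : scores) {
            // Sentinels guarantee peek() is never null and the queue is
            // always "full", so no capacity check is needed here.
            if (s > pq.peek()) {
                pq.poll();
                pq.add(s);
            }
        }
        float[] top = new float[k];
        for (int i = k - 1; i >= 0; i--) top[i] = pq.poll(); // heap pops ascending
        return top; // descending order
    }

    public static void main(String[] args) {
        float[] top = topK(new float[]{0.3f, 2.5f, 1.1f, 0.9f, 4.0f}, 3);
        System.out.println(java.util.Arrays.toString(top));
    }
}
```

Any real hit scores only beat the sentinels, so leftover sentinels (when fewer than k docs match) must be stripped by the caller; Lucene's real collector tracks totalHits for exactly this reason.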
Re: Build failed in Hudson: Lucene-trunk #812
This was a false failure. There's a timed test in contrib/benchmark that apparently can fail if the machine happens to be swamped at the time. I'll work out a more robust test. Mike

On Wed, Apr 29, 2009 at 10:41 PM, Apache Hudson Server hud...@hudson.zones.apache.org wrote:

See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/812/changes
Changes: [pjaol] Fixed bug caused by multiSegmentIndexReader
-- [...truncated 11293 lines...]
[echo] Building swing...
[junit] Testsuite: org.apache.lucene.swing.models.TestBasicList - Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.555 sec
[junit] Testsuite: org.apache.lucene.swing.models.TestBasicTable - Tests run: 3, Failures: 0, Errors: 0, Time elapsed: 0.584 sec
[junit] Testsuite: org.apache.lucene.swing.models.TestSearchingList - Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.627 sec
[junit] Testsuite: org.apache.lucene.swing.models.TestSearchingTable - Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 0.637 sec
[junit] Testsuite: org.apache.lucene.swing.models.TestUpdatingList - Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 0.73 sec
[junit] Testsuite: org.apache.lucene.swing.models.TestUpdatingTable - Tests run: 4, Failures: 0, Errors: 0, Time elapsed: 1.007 sec
[echo] Building wikipedia...
[junit] Testsuite: org.apache.lucene.wikipedia.analysis.WikipediaTokenizerTest - Tests run: 5, Failures: 0, Errors: 0, Time elapsed: 0.356 sec
[echo] Building wordnet...
[echo] Building xml-query-parser...
[junit] Testsuite: org.apache.lucene.xmlparser.TestParser - Tests run: 18, Failures: 0, Errors: 0, Time elapsed: 2.215 sec
[junit] Testsuite: org.apache.lucene.xmlparser.TestQueryTemplateManager - Tests run: 1, Failures: 0, Errors: 0, Time elapsed: 1.463 sec
BUILD FAILED
http://hudson.zones.apache.org/hudson/job/Lucene-trunk/ws/trunk/build.xml:649: Contrib tests failed!
Total time: 20 minutes 45 seconds
Publishing Javadoc
Recording test results
Publishing Clover coverage report...
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704561#action_12704561 ] Eks Dev commented on LUCENE-1518: - imo, it is really not all that important to make Filter and Query the same (that is just one alternative for achieving the goal). The basic problem we are trying to solve is adding a Filter directly to BooleanQuery, and making optimizations easier after that. Wrapping with CSQ just adds another layer between the Lucene search machinery and the Filter, making these optimizations harder. On the other hand, I must accept that, conceptually, Filter and Query are the same, together supporting the following options: 1. Pure boolean model: you do not care about scores (today we can do it only via CSQ, as Filter does not enter BooleanQuery) 2. Mixed boolean and ranked: you have to define the Filter's contribution to the documents (CSQ) 3. Pure ranked: no filters, all gets scored (the same as 2.) Ideally, as a user, I define only a Query (Filter-based or not) and for each clause in my Query call Query.setScored(true/false) or useConstantScore(double score); also I should be able to say: dear Lucene, please materialize this Query_Filter for me, as I would like to have it cached, and please store only DocIds (a Filter today). Maybe also open the possibility to cache the scores of the documents as well. One thing is concept, and another is optimization. 
From an optimization point of view, we have a couple of decisions to make: - Does the DocIdSet support random access, yes or no (my Materialized Query)? - Decide whether a clause should / should not be scored, or should have a constant score. So, for each Query we need to decide/support: - scoring {yes, no, constant}, and - the option to materialize a Query (that is how we create Filters today) - these Materialized Queries (aka Filters) should be able to tell us whether they support random access, and whether they cache only doc ids or scores as well. Nothing useful in this email, just thinking aloud - sometimes it helps :) Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1518.patch This issue presents a patch that merges Queries and Filters in such a way that the new Filter class extends Query. This would make it possible to use every filter as a query. The new abstract Filter class would contain all methods of ConstantScoreQuery, and deprecate ConstantScoreQuery. If somebody implements the Filter's getDocIdSet()/bits() methods, he has nothing more to do; he can just use the filter as a normal query. I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way that every Filter can automatically be used in all places where a Query can be used (e.g. also alone as a search query without any other constraint). For that, the abstract Query methods must be implemented and return a default weight for Filters, which is the current ConstantScore logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part can be directly used; the weight is useless (as it is currently, too). The constant score default implementation is only used when the Filter is used as a Query (e.g. as a direct parameter to Searcher.search()). 
For the special case of BooleanQueries combining Filters and Queries, the idea is to optimize the BooleanQuery logic in such a way that it detects whether a BooleanClause is a Filter (using instanceof) and then directly uses the Filter API, rather than taking on the burden of the ConstantScoreQuery (see LUCENE-1345). Here are some ideas on how to implement Searcher.search() with Query and Filter: - The user runs Searcher.search() using a Filter as the only parameter. As every Filter is also a ConstantScoreQuery, the query can be executed and returns a score of 1.0 for all matching documents. - The user runs Searcher.search() using a Query as the only parameter: no change, all is the same as before. - The user runs Searcher.search() using a BooleanQuery as the parameter: if the BooleanQuery does not contain a Query that is a subclass of Filter (the new Filter), everything works as usual. If the BooleanQuery contains exactly one Filter and nothing else, the Filter is used as a constant score query. If the BooleanQuery contains clauses with both Queries and Filters, the new algorithm could be used: the queries are executed and the results filtered with the filters. For the user this has the main advantage that he can construct his query using a simplified API without thinking about Filters or Queries; you can just combine clauses
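Uwe's instanceof dispatch can be sketched with toy clause types: filter clauses are separated out up front and applied as cheap accept/reject checks, while only the remaining clauses go through scoring. All class names below are illustrative stand-ins, not the real BooleanQuery machinery.

```java
import java.util.ArrayList;
import java.util.BitSet;
import java.util.List;

// Toy sketch of BooleanQuery detecting filter clauses via instanceof and
// applying them as accept/reject checks instead of constant-score sub-queries.
public class FilterClauseSketch {
    interface Clause {}

    // A scoring clause: matches these docs, each with this score.
    static class QueryClause implements Clause {
        final int[] docs; final float score;
        QueryClause(int[] docs, float score) { this.docs = docs; this.score = score; }
    }

    // A pure filter clause: accepts or rejects docs, contributes no score.
    static class FilterClause implements Clause {
        final BitSet accepted;
        FilterClause(BitSet accepted) { this.accepted = accepted; }
    }

    // AND-semantics search: filters are split out once, then checked first
    // per doc, so rejected docs never reach the (expensive) query clauses.
    static List<Integer> search(int maxDoc, List<Clause> clauses) {
        List<FilterClause> filters = new ArrayList<>();
        List<QueryClause> queries = new ArrayList<>();
        for (Clause c : clauses) {
            if (c instanceof FilterClause) filters.add((FilterClause) c);
            else queries.add((QueryClause) c);
        }
        List<Integer> hits = new ArrayList<>();
        for (int doc = 0; doc < maxDoc; doc++) {
            boolean ok = true;
            for (FilterClause f : filters) if (!f.accepted.get(doc)) { ok = false; break; }
            if (!ok) continue;
            for (QueryClause q : queries) {
                ok = false;
                for (int d : q.docs) if (d == doc) { ok = true; break; }
                if (!ok) break;
            }
            if (ok) hits.add(doc);
        }
        return hits;
    }

    public static void main(String[] args) {
        BitSet even = new BitSet();
        for (int i = 0; i < 10; i += 2) even.set(i);
        List<Clause> clauses = new ArrayList<>();
        clauses.add(new FilterClause(even));
        clauses.add(new QueryClause(new int[]{1, 2, 3, 4}, 1.0f));
        System.out.println(search(10, clauses));
    }
}
```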
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12704566#action_12704566 ] Shai Erera commented on LUCENE-1614: bq. Not sure what you mean by testAndSet-like version? I mean, instead of having the code call check(8), get true and then advance(8), just call checkAndAdvance(8), which returns true if 8 matches and false otherwise, AND moves to 8. I don't propose to replace check() with it, as sometimes you might want to check a couple of DISIs before making a decision about which doc to advance to, but it could save calling advance() in case check() returns true. bq. Yes it does have a good default impl, I think. It _will_ have a good default impl, I can guarantee to try :). What I meant is that we should have clear documentation about check() and nextDoc() and the possibility that check() will be called for doc Id 'X' and later nextDoc() or advance() will be called with 'X'; in that case the impl must ensure 'X' is not skipped, as is done today by TermScorer for example. So should I add this check()?
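The checkAndAdvance() Shai proposes (a hypothetical name from this thread, not a shipped method) combines check() with positioning: forward-only, never rewinds, and leaves the iterator on the first doc >= target. A toy version reproduces Mike's TermScorer example exactly: positioned on 3, checkAndAdvance(8) lands on 10 and returns false.

```java
// Toy model of the proposed checkAndAdvance(): forward-only check that also
// positions the iterator, so a true result needs no follow-up advance().
public class CheckAndAdvance {
    static class DISI {
        private final int[] docs; // sorted "postings", e.g. {3, 10, 15}
        private int i = -1;

        DISI(int[] docs) { this.docs = docs; }

        int doc() { return i >= 0 && i < docs.length ? docs[i] : -1; }

        int nextDoc() { return ++i < docs.length ? docs[i] : -1; }

        // Move to the first doc >= target; report whether target itself
        // matched. Never rewinds, so callers must pass increasing targets,
        // and a later advance(target) cannot skip a doc check() confirmed.
        boolean checkAndAdvance(int target) {
            while (i < 0 || (i < docs.length && docs[i] < target)) i++;
            return i < docs.length && docs[i] == target;
        }
    }

    public static void main(String[] args) {
        DISI s = new DISI(new int[]{3, 10, 15});
        s.nextDoc();                          // positioned on 3
        System.out.println(s.checkAndAdvance(8) + " now at " + s.doc()); // moved to 10, no rewind
        System.out.println(s.checkAndAdvance(10) + " now at " + s.doc());
    }
}
```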
R-trees in Lucene for spatio-textual search
Hi, Has anybody recently used Lucene to implement R-trees for range queries? I came across the GeoLucene project, but I'm not sure how stable/efficient it is for production use. Any pointers in this direction would be great. Thanks -P
[jira] Updated: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Shai Erera updated LUCENE-1593: --- Attachment: LUCENE-1593.patch Patch includes: # New scoresDocsInOrder to Query #* Default to false #* Override in extensions to return true, except in BQ which still returns false until we resolve how BQ is used explicitly (top-score vs. not). In some queries that delegate the work, I used the delegatee or return true if all sub-queries return true. # Changed TopFieldCollector and TopScoreDocCollector to take a docsScoredInOrder parameter and create the appropriate instance (breaking ties by doc Id or not). # Added TestTopScoreDocCollector and a test case to TestSort which test out-of-order collection (they trigger the use of BooleanScorer, though whether document collection happens truly out of order I cannot tell). # Updates to CHANGES All tests pass, including test-tag. BTW, the patch also includes the fix to TestSort in tag, but without the fix for MultiSearcher and ParallelMultiSearcher on tag as I'm not sure if we should back-port the fix as well. Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segements in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null. # Also move to use changing top and then call adjustTop(), in case we update the queue. 
# Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just storing the objects in the array (this can be used to pre-populate the sentinel values)? I will post a patch as well as some perf measurements as soon as I have them. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
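The sentinel idea in steps 3-4 can be sketched outside Lucene. This is a simplified model (a small sorted array stands in for the real HitQueue binary heap, and the names are hypothetical), not the patch's code:

```java
import java.util.Arrays;

// Sketch of pre-populating a top-K queue with sentinel entries
// (score = -Infinity) so the collect loop never checks for null or
// queue size: any real score beats a sentinel, so the queue behaves
// as if it were always full.
public class SentinelQueueSketch {
    public static float[] topScores(float[] scores, int k) {
        // Pre-fill with sentinels instead of starting empty.
        float[] top = new float[k];
        Arrays.fill(top, Float.NEGATIVE_INFINITY);
        for (float score : scores) {
            // top[0] plays the role of pq.top(): the current worst entry.
            if (score > top[0]) {
                top[0] = score;   // overwrite worst in place...
                Arrays.sort(top); // ...then restore order, like adjustTop()
            }
            // No null check, no size check: the sentinels absorb both.
        }
        return top; // ascending; sentinels remain if fewer than k real hits
    }

    public static void main(String[] args) {
        float[] top = topScores(new float[] {0.3f, 0.9f, 0.1f, 0.7f}, 2);
        System.out.println(Arrays.toString(top)); // worst-to-best of the top 2
    }
}
```

The point of the trick is that the branch `if (reusableSD == null)` disappears from the hot loop; only the single score comparison remains.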
[jira] Commented: (LUCENE-1536) if a filter can support random access API, we should use it
[ https://issues.apache.org/jira/browse/LUCENE-1536?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704590#action_12704590 ] Yonik Seeley commented on LUCENE-1536: -- Interesting stuff! Has anyone tested whether this results in a performance degradation on SegmentTermDocs? This is very inner-loop stuff, and it's replacing a non-virtual BitVector.get(), which can be easily inlined, with two dispatches through base classes. Hopefully hotspot could handle it, but it's tough to figure out, especially in a real system where sometimes a user RAF will be used and sometimes not. if a filter can support random access API, we should use it --- Key: LUCENE-1536 URL: https://issues.apache.org/jira/browse/LUCENE-1536 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Michael McCandless Assignee: Michael McCandless Priority: Minor Fix For: 2.9 Attachments: LUCENE-1536.patch, LUCENE-1536.patch, LUCENE-1536.patch I ran some performance tests, comparing applying a filter via the random-access API instead of current trunk's iterator API. This was inspired by LUCENE-1476, where we realized deletions should really be implemented just like a filter, but then in testing found that switching deletions to an iterator was a very sizable performance hit. Some notes on the test: * Index is first 2M docs of Wikipedia. Test machine is Mac OS X 10.5.6, quad core Intel CPU, 6 GB RAM, java 1.6.0_07-b06-153. * I test across multiple queries. 1-X means an OR query, eg 1-4 means 1 OR 2 OR 3 OR 4, whereas +1-4 is an AND query, ie 1 AND 2 AND 3 AND 4. u s means united states (phrase search). * I test with multiple filter densities (0, 1, 2, 5, 10, 25, 75, 90, 95, 98, 99, 99.9 (filter is non-null but all bits are set), 100 (filter=null, control)). * Method high means I use the random-access filter API in IndexSearcher's main loop. Method low means I use the random-access filter API down in SegmentTermDocs (just like deleted docs today). 
* Baseline (QPS) is current trunk, where the filter is applied as an iterator up high (ie in IndexSearcher's search loop).
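The high vs. low distinction being benchmarked can be illustrated with a self-contained sketch; a java.util.BitSet stands in for the random-access DocIdSet, and the class and method names are illustrative, not the patch's:

```java
import java.util.BitSet;

// Two ways to apply a filter to a query's matching docs (assumed to be in
// ascending order): a random-access get(doc) probe per candidate, like the
// deleted-docs BitVector.get() in SegmentTermDocs, vs. advancing a separate
// filter iterator in tandem with the query (trunk's iterator approach).
public class FilterAccessSketch {
    // "Low"/random access: one direct probe per candidate doc.
    public static int countLow(int[] matchingDocs, BitSet filter) {
        int hits = 0;
        for (int doc : matchingDocs) {
            if (filter.get(doc)) hits++;
        }
        return hits;
    }

    // "High"/iterator style: leapfrog the filter's set bits and the
    // query's docs, like DocIdSetIterator.skipTo().
    public static int countHigh(int[] matchingDocs, BitSet filter) {
        int hits = 0, filterDoc = filter.nextSetBit(0);
        for (int doc : matchingDocs) {
            while (filterDoc >= 0 && filterDoc < doc) {
                filterDoc = filter.nextSetBit(filterDoc + 1); // skipTo(doc)
            }
            if (filterDoc == doc) hits++;
        }
        return hits;
    }

    static BitSet bits(int... setBits) {
        BitSet b = new BitSet();
        for (int i : setBits) b.set(i);
        return b;
    }

    public static void main(String[] args) {
        int[] docs = {1, 2, 3, 5, 8};
        BitSet filter = bits(2, 5, 7);
        System.out.println(countLow(docs, filter) + " " + countHigh(docs, filter));
    }
}
```

Both produce the same hits; the benchmark question is which is cheaper at a given filter density, since the probe costs one dispatch per candidate while the iterator costs skip bookkeeping per filter bit.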
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704594#action_12704594 ] Michael McCandless commented on LUCENE-1593: bq. Not in this issue though, right? Right: I'm back into the mode of throwing out all future improvements I know of, to help guide us in picking the right next step. These would all be done in separate issues, and many of them would not be done today, but still we should try not to preclude them for tomorrow. {quote} I like the idea of having Scorer be able to tell why a doc was matched. But I think we should make sure that if a user is not interested in this information, then he should not incur any overhead by it, such as aggregating information in-memory or doing any extra computations. Something like we've done for TopFieldCollector with tracking document scores and maxScore. {quote} Exactly, and I think/hope this'd be achievable.
RE: R-trees in Lucene for spatio-textual search
R-trees are for spatial queries (two dimensions). If you want the same for one-dimensional range queries, use TrieRangeQuery (see 2.9's contrib queries package; this may move to core as a NumericRangeQuery etc.), which is very stable and has been tested for years, and is now being included in Lucene. Uwe - Uwe Schindler H.-H.-Meier-Allee 63, D-28213 Bremen http://www.thetaphi.de eMail: u...@thetaphi.de _ From: Mukherjee, Prasenjit [mailto:p.mukher...@corp.aol.com] Sent: Thursday, April 30, 2009 12:43 PM To: java-dev@lucene.apache.org Subject: R-trees in Lucene for spatio-textual search Hi, Has anybody recently used Lucene to implement R-trees for range queries? I came across the GeoLucene project but I'm not sure how stable/efficient it is for production use. Any pointers in this direction would be great. Thanks -P
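The core idea behind trie-based range queries can be sketched generically. This is not the contrib TrieRangeQuery API; the term encoding below is illustrative only:

```java
import java.util.ArrayList;
import java.util.List;

// Generic sketch of the trie-encoding idea: index each numeric value at
// several precisions by shifting away low-order bits, so a numeric range
// can be matched by a handful of coarse prefix terms plus a few fine terms
// at the range's edges, instead of one term per distinct value.
public class TriePrefixSketch {
    // Precision-stepped terms for one value, encoded as "shift:prefix".
    public static List<String> prefixTerms(long value, int precisionStep) {
        List<String> terms = new ArrayList<>();
        for (int shift = 0; shift < 64; shift += precisionStep) {
            terms.add(shift + ":" + (value >>> shift));
        }
        return terms;
    }

    public static void main(String[] args) {
        // 0x1234 and 0x1237 differ only in the lowest 4 bits, so they share
        // every term above the finest precision; a range covering both can
        // match on the shared coarse prefixes.
        System.out.println(prefixTerms(0x1234L, 4));
        System.out.println(prefixTerms(0x1237L, 4));
    }
}
```

At query time, a range is decomposed into the smallest set of these prefix terms that exactly covers it, which keeps the number of terms logarithmic in the range size.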
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704597#action_12704597 ] Shai Erera commented on LUCENE-1518: bq. Wrapping with CSQ is just adding another layer between Lucene's search machinery and Filter, making these optimizations harder. Right. But making Filter subclass Query and checking in BQ 'if (query instanceof Filter) { Filter f = (Filter) query; }' is not going to improve anything. It adds an instanceof check and a cast, and I'd think those are more expensive than wrapping a Filter with CSQ and returning an appropriate Scorer, which will use the Filter in its next() and skipTo() calls. bq. On the other hand, I must accept, conceptually Filter and Query are the same, supporting together the following options I think that if we allow BooleanClause to implement a weight(IndexReader) method (just like Query), we'll be one step closer to that goal. BQ uses this method to construct BooleanWeight, only today it calls clause.getQuery().createWeight(). Instead it could call clause.getWeight(), and if the BooleanClause holds a Filter it will return a FilterWeight, otherwise delegate that call to the contained Query. Regarding pure ranked, CSQ is really what we need, no? So how about the following: # Add add(Filter, Occur) to BooleanClause. # Add weight(Searcher) to BooleanClause. # Create a FilterWeight which wraps a Filter and provides a Scorer implementation with a constant score. (This does not handle the no-scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f.) # Add isRandomAccess to Filter. # Create a RandomAccessFilter which extends Filter and defines an additional seek(target) method. # Add asRandomAccessFilter() to Filter, which will materialize that Filter into memory, or into another random-access data structure (e.g. keeping it on disk but still providing random access to it, even if not very efficiently) and return a RandomAccessFilter type, which will implement seek(target) and possibly override next() and skipTo(), but still use whatever other methods this Filter declares. #* I think we should default it to throw UOE, provided that we document that isRandomAccess should be called first. I'm thinking out loud just like you, so I hope my stuff makes sense :). Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1518.patch This issue presents a patch that merges Queries and Filters in such a way that the new Filter class extends Query. This would make it possible to use every filter as a query. The new abstract Filter class would contain all methods of ConstantScoreQuery, deprecating ConstantScoreQuery. If somebody implements the Filter's getDocIdSet()/bits() methods, he has nothing more to do; he can just use the filter as a normal query. I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way that every Filter can automatically be used everywhere a Query can be used (e.g. also alone as a search query without any other constraint). For that, the abstract Query methods must be implemented and return a default weight for Filters, which is the current ConstantScore logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part can be used directly; the weight is useless (as it is currently, too). The constant-score default implementation is only used when the Filter is used as a Query (e.g. as a direct parameter to Searcher.search()). For the special case of BooleanQueries combining Filters and Queries, the idea is to optimize the BooleanQuery logic in such a way that it detects whether a BooleanClause is a Filter (using instanceof) and then directly uses the Filter API rather than taking on the burden of ConstantScoreQuery (see LUCENE-1345). Here are some ideas on how to implement Searcher.search() with Query and Filter: - User runs Searcher.search() using a Filter as the only parameter: as every Filter is also a ConstantScoreQuery, the query can be executed and returns score 1.0 for all matching documents. - User runs Searcher.search() using a Query as the only parameter: no change, all is the same as before. - User runs Searcher.search() using a BooleanQuery as parameter: if the BooleanQuery does not contain a Query that is a subclass of Filter (the new Filter), everything is as usual. If the BooleanQuery contains exactly one Filter and nothing else, the Filter is used as a constant-score query. If the BooleanQuery contains clauses with Queries and Filters, the new algorithm could be used: the queries are executed and the results filtered with the filters. For the user this has the main advantage that he can construct his query using a simplified API without thinking about Filters or Queries; you can just combine clauses together. The scorer/weight logic then identifies the cases in which to use the filter or the query weight API, just like the query optimizer of an RDB.
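The FilterWeight proposed in the comment above is hypothetical. A rough standalone sketch of the constant-score wrapping it describes (simplified stand-ins, not Lucene's actual Weight/Scorer classes) might look like:

```java
// Rough shape of the proposed FilterWeight idea: wrap a filter's doc-id
// iterator in a scorer that reports the same constant score for every
// matching doc. All names here are illustrative.
public class FilterWeightSketch {
    interface DocIdSetIterator { int nextDoc(); } // -1 when exhausted

    static class ConstantScorer implements DocIdSetIterator {
        private final DocIdSetIterator filterDocs;
        private final float constantScore;
        ConstantScorer(DocIdSetIterator filterDocs, float constantScore) {
            this.filterDocs = filterDocs;
            this.constantScore = constantScore;
        }
        public int nextDoc() { return filterDocs.nextDoc(); } // delegate matching
        public float score() { return constantScore; }        // same for every doc
    }

    public static void main(String[] args) {
        int[] docs = {3, 7, 12}; // docs the filter accepts, in order
        DocIdSetIterator filter = new DocIdSetIterator() {
            private int i = 0;
            public int nextDoc() { return i < docs.length ? docs[i++] : -1; }
        };
        ConstantScorer scorer = new ConstantScorer(filter, 1.0f); // default 1.0f
        for (int doc = scorer.nextDoc(); doc != -1; doc = scorer.nextDoc()) {
            System.out.println(doc + " scored " + scorer.score());
        }
    }
}
```

This makes the trade-off in the thread concrete: the wrapper adds one delegation per doc, whereas an instanceof check in BQ would instead special-case the clause up front.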
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704605#action_12704605 ] Paul Elschot commented on LUCENE-1518: -- How about materializing the DocIds _and_ the score values?
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704608#action_12704608 ] Shai Erera commented on LUCENE-1593: bq. BTW, I wonder if instead of Query.scoresDocsInOrder we should allow one to ask the Query for either/or? I'm afraid this might mean a larger change. What will TermQuery do? Today it returns true, and does not have any implementation that can return docs out of order. So what should TQ do when outOfOrderScorer is called? Just return what inOrderScorer returns, or throw an exception? That there might be a Collector out there that requires docs in order is not something I think we should handle. The reason is, there wasn't any guarantee until today that docs are returned in order. So how can someone write a Collector which has a hard assumption on that? Maybe only if he used a Query which he knows always scores in order, such as TQ, but then I don't think he will have a problem, since TQ returns true. And if that someone needs docs in order, but the query at hand returns docs out of order, then I'd say tough luck :)? I mean, maybe with BQ we can ensure in/out of order on request, but if there is a query which returns docs in random order, or based on other criteria which cause it to return out of order, what good will forcing it to return docs in order do? I'd say that you should just use a different query in that case. bq. But I'm not sure in practice when one would want to use an out-of-order non-top iterator. I agree. I think that iteration on Scorer is dictated to be in order because it extends DISI with next() and skipTo() methods, which don't imply in any way that they can return something out of order (besides next() maybe, but it would be hard to use such a next() with skipTo()). 
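The collector-side half of the contract being discussed can be modeled minimally. The names below are illustrative stand-ins, not the real Collector API:

```java
// Minimal model of letting a collector declare whether out-of-order doc
// delivery is acceptable, so the searcher can pick a faster scorer (e.g. a
// BooleanScorer-style one) only when both sides allow it.
public class OrderContractSketch {
    interface Collector {
        void collect(int doc);
        boolean acceptsDocsOutOfOrder();
    }

    // The searcher consults both the query's capability and the
    // collector's requirement before choosing an iteration strategy.
    static String chooseScorer(Collector c, boolean queryCanScoreOutOfOrder) {
        if (queryCanScoreOutOfOrder && c.acceptsDocsOutOfOrder()) {
            return "out-of-order scorer";
        }
        return "in-order scorer";
    }

    public static void main(String[] args) {
        Collector strict = new Collector() {
            public void collect(int doc) { /* e.g. relies on increasing ids */ }
            public boolean acceptsDocsOutOfOrder() { return false; }
        };
        // Even though the query could score out of order, the collector's
        // contract forces the in-order scorer.
        System.out.println(chooseScorer(strict, true));
    }
}
```

This matches Shai's point: a collector with a hard in-order assumption must say so explicitly, since in-order delivery was never a guarantee.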
[jira] Issue Comment Edited: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704605#action_12704605 ] Paul Elschot edited comment on LUCENE-1518 at 4/30/09 5:19 AM: --- bq. opening option to materialize Query (that is how we create Filters today) How about materializing the DocIds _and_ the score values?
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704600#action_12704600 ] Michael McCandless commented on LUCENE-1518: Let's not forget we also have "provides scores but NOT filtering" type things as well, eg function queries, MatchAllDocsQuery, the "I want to boost documents by recency" use case (which is sort of a Scorer filter, in that it takes another Scorer and modifies its output per doc), etc. It's just that very often the scoring part is in fact very much intertwined with the filtering part. Eg a TermQuery iterates a SegmentTermDocs, which reads and holds doc/freq pairs.
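The "Scorer that wraps another Scorer and modifies its output per doc" aside, e.g. boosting by recency, could be sketched like this. The classes, the per-doc timestamp source, and the half-life decay are all hypothetical, not Lucene API:

```java
import java.util.ArrayList;
import java.util.List;

// Sketch of a wrapper Scorer that leaves matching untouched but rewrites
// the score per doc, here by decaying with document age.
public class BoostingScorerSketch {
    interface Scorer { int nextDoc(); float score(); } // nextDoc: -1 = done

    static class RecencyBoostScorer implements Scorer {
        private final Scorer inner;
        private final long[] docTimestamps; // assumed per-doc field, by doc id
        private final long now;
        private int currentDoc = -1;
        RecencyBoostScorer(Scorer inner, long[] docTimestamps, long now) {
            this.inner = inner;
            this.docTimestamps = docTimestamps;
            this.now = now;
        }
        public int nextDoc() { return currentDoc = inner.nextDoc(); }
        public float score() {
            // Halve the score for every day of age; purely illustrative.
            double ageDays = (now - docTimestamps[currentDoc]) / 86_400_000.0;
            return (float) (inner.score() * Math.pow(0.5, ageDays));
        }
    }

    static Scorer fixedScorer(int[] docs, float score) {
        return new Scorer() {
            private int i = 0;
            public int nextDoc() { return i < docs.length ? docs[i++] : -1; }
            public float score() { return score; }
        };
    }

    static float[] scoreAll(Scorer inner, long[] timestamps, long now) {
        RecencyBoostScorer boosted = new RecencyBoostScorer(inner, timestamps, now);
        List<Float> scores = new ArrayList<>();
        for (int doc = boosted.nextDoc(); doc != -1; doc = boosted.nextDoc()) {
            scores.add(boosted.score());
        }
        float[] out = new float[scores.size()];
        for (int i = 0; i < out.length; i++) out[i] = scores.get(i);
        return out;
    }

    public static void main(String[] args) {
        long now = 10L * 86_400_000L;                   // "today"
        long[] timestamps = { now, now - 86_400_000L }; // doc 0 fresh, doc 1 a day old
        float[] scores = scoreAll(fixedScorer(new int[]{0, 1}, 2.0f), timestamps, now);
        System.out.println(scores[0] + " " + scores[1]); // fresh doc keeps full score
    }
}
```

Note the wrapper neither filters nor adds docs, which is exactly why it fits neither the Filter nor the plain Query mold in this discussion.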
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704609#action_12704609 ] Paul Elschot commented on LUCENE-1518: -- bq. Create a FilterWeight which wraps a Filter and provide a Scorer implementation with a constant score. (This does not handle the no scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values.
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704603#action_12704603 ] Michael McCandless commented on LUCENE-1593: BTW, I wonder if instead of Query.scoresDocsInOrder we should allow one to ask the Query for either/or? Ie, a BooleanQuery can produce a scorer that scores docs in order; it's just lower performance. Sure, our top doc collectors can accept in-order or out-of-order collection, but perhaps one has a collector out there that must get the docs in order, so shouldn't we be able to ask the Query to give us docs always in order, or say that they don't have to be? Also: I wonder if we would ever want to allow for non-top-scorer usage that does not return docs in order? Ie, next() would be allowed to yield docs out of order. Obviously this is not allowed today... but we are now mixing top vs. not-top with out-of-order vs. in-order, where maybe they should be independent? But I'm not sure in practice when one would want to use an out-of-order non-top iterator.
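The either/or idea could be sketched as a pair of scorer factories with a fallback default, which also answers Shai's "what should TQ do?" question without resorting to an exception. Hypothetical API and class names, not Lucene's:

```java
// Sketch of the "ask the Query for either/or" alternative to a single
// scoresDocsInOrder() flag: the caller requests the order it needs, and a
// query that can only produce docs in order simply answers the
// out-of-order request with its in-order scorer instead of throwing UOE.
public class ScorerRequestSketch {
    interface Scorer { int nextDoc(); } // -1 when exhausted

    static abstract class Query {
        // In-order scoring is the baseline every query must support.
        abstract Scorer inOrderScorer();
        // Out-of-order is an optional optimization; the default just falls
        // back, so TermQuery-like queries need no extra code.
        Scorer outOfOrderScorer() { return inOrderScorer(); }
    }

    // A TermQuery-like query: only one way to iterate, always in order.
    static class SingleListQuery extends Query {
        private final Scorer scorer = () -> -1; // empty posting list here
        Scorer inOrderScorer() { return scorer; }
    }

    static boolean sameScorer(Query q) {
        return q.outOfOrderScorer() == q.inOrderScorer();
    }

    public static void main(String[] args) {
        // The fallback kicks in: the out-of-order request yields the
        // in-order scorer, which is always a valid answer.
        System.out.println(sameScorer(new SingleListQuery()));
    }
}
```

A BooleanQuery-like class would override outOfOrderScorer() to return its faster BooleanScorer-style iterator, keeping the two concerns (top vs. not-top, in-order vs. out-of-order) independent as the comment suggests.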
[jira] Issue Comment Edited: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704609#action_12704609 ] Paul Elschot edited comment on LUCENE-1518 at 4/30/09 5:27 AM: --- bq. Create a FilterWeight which wraps a Filter and provides a Scorer implementation with a constant score. (This does not handle the no scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values. Using score=0.0f for no scoring might not work for BooleanQuery because it also has a coordination factor that depends on the number of matching sub-queries. The patch at 1345 does not change that coordination factor for backward compatibility, even though the coordination factor might also depend on the number of matching filter clauses. was (Author: paul.elsc...@xs4all.nl): bq. Create a FilterWeight which wraps a Filter and provides a Scorer implementation with a constant score. (This does not handle the no scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values. Merge Query and Filter classes -- Key: LUCENE-1518 URL: https://issues.apache.org/jira/browse/LUCENE-1518 Project: Lucene - Java Issue Type: Improvement Components: Search Affects Versions: 2.4 Reporter: Uwe Schindler Fix For: 2.9 Attachments: LUCENE-1518.patch This issue presents a patch that merges Queries and Filters so that the new Filter class extends Query. This would make it possible to use every filter as a query. The new abstract filter class would contain all methods of ConstantScoreQuery and deprecate ConstantScoreQuery.
If somebody implements the Filter's getDocIdSet()/bits() methods, there is nothing more to do; the filter can be used as a normal query. I do not want to completely convert Filters to ConstantScoreQueries. The idea is to combine Queries and Filters in such a way that every Filter can automatically be used at all places where a Query can be used (e.g. standing alone as a search query without any other constraint). For that, the abstract Query methods must be implemented and return a default weight for Filters, which is the current ConstantScore logic. If the filter is used as a real filter (where the API wants a Filter), the getDocIdSet part can be used directly; the weight is useless (as it is currently, too). The constant score default implementation is only used when the Filter is used as a Query (e.g. as a direct parameter to Searcher.search()). For the special case of BooleanQueries combining Filters and Queries, the idea is to optimize the BooleanQuery logic in such a way that it detects whether a BooleanClause is a Filter (using instanceof) and then directly uses the Filter API, avoiding the burden of the ConstantScoreQuery (see LUCENE-1345). Here are some ideas on how to implement Searcher.search() with Query and Filter: - User runs Searcher.search() using a Filter as the only parameter: as every Filter is also a ConstantScoreQuery, the query can be executed and returns score 1.0 for all matching documents. - User runs Searcher.search() using a Query as the only parameter: no change, all is the same as before. - User runs Searcher.search() using a BooleanQuery as parameter: if the BooleanQuery does not contain a Query that is a subclass of Filter (the new Filter), everything is as usual. If the BooleanQuery contains exactly one Filter and nothing else, the Filter is used as a constant score query. If the BooleanQuery contains clauses with Queries and Filters, the new algorithm could be used: the queries are executed and the results filtered with the filters.
For the user this has the main advantage that he can construct his query using a simplified API without thinking about Filters or Queries; he can just combine clauses together. The scorer/weight logic then identifies the cases to use the filter or the query weight API. Just like the query optimizer of an RDBMS.
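As a rough illustration of the proposal, here is a toy model of a Filter that extends Query and inherits a constant-score default. This is deliberately not the real Lucene Query/Weight/Scorer API; all names and signatures here are simplified stand-ins.

```java
public class FilterAsQueryDemo {
    // Toy stand-in for Query: real queries rank their matching documents.
    static abstract class Query {
        abstract float score(int doc);
    }

    // The proposal: Filter extends Query and inherits a default
    // constant-score implementation (the current ConstantScoreQuery logic),
    // so any Filter can stand alone as a query.
    static abstract class Filter extends Query {
        abstract boolean accept(int doc);   // the only method a Filter must implement

        @Override
        float score(int doc) {              // inherited constant-score default
            return accept(doc) ? 1.0f : 0.0f;
        }
    }

    // A concrete filter: accepts even doc ids (purely illustrative).
    static Filter evenDocs() {
        return new Filter() {
            @Override
            boolean accept(int doc) { return doc % 2 == 0; }
        };
    }

    public static void main(String[] args) {
        Query q = evenDocs();               // a Filter used as a plain Query
        System.out.println(q.score(4) + " " + q.score(5));
    }
}
```

When the API wants a real Filter, only accept() (standing in for getDocIdSet()) is called and the scoring default is never exercised, which mirrors the "weight is useless" point in the description above.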
[jira] Issue Comment Edited: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704609#action_12704609 ] Paul Elschot edited comment on LUCENE-1518 at 4/30/09 5:29 AM: --- bq. Create a FilterWeight which wraps a Filter and provides a Scorer implementation with a constant score. (This does not handle the no scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values. Using score=0.0f for no scoring might not work for BooleanQuery because it also has a coordination factor that depends on the number of matching query clauses. The patch at 1345 does not change that coordination factor for backward compatibility, even though the coordination factor might also depend on the number of matching filter clauses. was (Author: paul.elsc...@xs4all.nl): bq. Create a FilterWeight which wraps a Filter and provides a Scorer implementation with a constant score. (This does not handle the no scoring mode, unless no scoring can be achieved with score=0.0f, while constant is any other value, defaulting to 1.0f). The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values. Using score=0.0f for no scoring might not work for BooleanQuery because it also has a coordination factor that depends on the number of matching sub-queries. The patch at 1345 does not change that coordination factor for backward compatibility, even though the coordination factor might also depend on the number of matching filter clauses.
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704613#action_12704613 ] Eks Dev commented on LUCENE-1518: - Shai, Regarding pure ranked, CSQ is really what we need, no? --- Yep, it would work for Filters, but why not make it possible for a normal Query to have a constant score? For these cases, I am just not sure if this approach gets max performance (did not look at this code for quite a while). Imagine you have a Query and you are not interested in Scoring at all; this can be accomplished with only DocID iterator arithmetic, ignoring score() totally. But that is only an optimization (maybe already there?) Paul, How about materializing the DocIds _and_ the score values? exactly, that would open up the full caching possibility (original purpose of Filters). Think Search Results caching ... that is practically another name for the search() method. It is easy to create this, but using it again would require some bigger changes :) Filter_on_Steroids materialize(boolean without_score);
[jira] Commented: (LUCENE-1518) Merge Query and Filter classes
[ https://issues.apache.org/jira/browse/LUCENE-1518?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704618#action_12704618 ] Eks Dev commented on LUCENE-1518: - Paul: ...The current patch at LUCENE-1345 does not need such a FilterWeight; the no scoring case is handled by not asking for score values... Me: ...Imagine you have a Query and you are not interested in Scoring at all; this can be accomplished with only DocID iterator arithmetic, ignoring score() totally. But that is only an optimization (maybe already there?)... I knew Paul would kick in at this point; he said exactly the same thing I did, but, as opposed to me, he made a formulation that executes :) Pfff, I feel bad :)
[jira] Commented: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704628#action_12704628 ] Michael McCandless commented on LUCENE-1313: {quote} Perhaps the best way to make this clean is to keep the ram merge policy and real dir merge policies different? That way merge policy implementations don't need to worry about ram and non-ram dir cases. {quote} OK tentatively this feels like a good approach. Would you re-use MergePolicy, or make a new RAMMergePolicy? Would we use the same MergeScheduler to then execute the selected merges? How would we handle the 'it's time to flush some RAM to disk' case? Would RAMMergePolicy make that decision? bq. Perhaps an IW.updatePendingRamMerges method should be added that handles this separately? Yes? bq. Does the ram dir ever need to worry about things like maxNumSegmentsOptimize and optimize? No? {quote} I think having the ram merge policy should cover the reasons I had for having a separate ram writer. Although the IW.addWriter method I implemented would not have blocked, I don't think it's necessary now if we have a separate ram merge policy. {quote} OK good. Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot.
I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704632#action_12704632 ] Michael McCandless commented on LUCENE-1614: bq. So should I add this check()? Though, in order to run perf tests, we'd need the AND/OR scorers to efficiently implement check(). Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean Key: LUCENE-1614 URL: https://issues.apache.org/jira/browse/LUCENE-1614 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants to those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far: # Deprecate those two methods. # Add nextDoc() and skipToDoc(int) that return doc, with default impl in DISI (calls next() and skipTo() respectively, and will be changed to abstract in 3.0). #* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target. # Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1 for improved performance. I will post a patch shortly
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704631#action_12704631 ] Michael McCandless commented on LUCENE-1614: bq. I mean, instead of having the code call check(8), get true and then advance(8), just call checkAndAdvance(8) which returns true if 8 is supported and false otherwise, AND moves to 8. Oh, sorry: that's in fact what I intended check to do. But by moves to 8 what it really means is you now cannot call check on anything < 8 (maybe < 9). I think after check(N) is called, one cannot call doc() -- the results are not defined. So check(N) logically puts the iterator at N, and you may at that point call next if you want, or call another check(M), but you cannot call doc() right after check. bq. So should I add this check()? I think so? We can then do perf tests of that vs filter-as-BooleanClause?
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704633#action_12704633 ] Shai Erera commented on LUCENE-1614: bq. I think after check(N) is called, one cannot call doc() I think one cannot even call next(). If check(8) returns true, then you know that doc() will return 8 (otherwise it's a bug?). But if it returns false, it might be at 10 already, so calling next() will move it to 11 or something. So to be on the safe side, we should document that doc()'s result is unspecified if check() returns false, and that next() is not recommended in that case, but rather skipTo() or check(M). bq. Though, in order to run perf tests, we'd need the AND/OR scorers to efficiently implement check(). I plan to, as much as I can, efficiently implement nextDoc() and advance() in all Scorers/DISIs. So I can include check() in the list as well. Or... maybe you know something I don't, and you think this deserves its own issue?
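The iterator contract being proposed in this thread can be sketched as follows. This is a toy illustration with made-up class names, not Lucene's DocIdSetIterator: nextDoc()/advance(target) return the doc they land on, or -1 when exhausted, so callers can write the combined '(doc = it.nextDoc()) >= 0' idiom instead of a boolean next() followed by a separate doc() call.

```java
public class AdvanceDemo {
    // Minimal iterator over an ascending list of doc ids (illustrative only).
    static class SortedIntIterator {
        private final int[] docs;
        private int idx = -1;

        SortedIntIterator(int... docs) { this.docs = docs; }

        // Returns the next doc id, or -1 when there are no more docs.
        int nextDoc() {
            return ++idx < docs.length ? docs[idx] : -1;
        }

        // Skips to the first doc >= target; returns it, or -1 if exhausted.
        int advance(int target) {
            int doc;
            do { doc = nextDoc(); } while (doc >= 0 && doc < target);
            return doc;
        }
    }

    // Count the docs at or beyond target, using the combined-return idiom.
    static int countFrom(SortedIntIterator it, int target) {
        int count = 0;
        for (int doc = it.advance(target); doc >= 0; doc = it.nextDoc()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Docs 7 and 9 are at or beyond target 5.
        System.out.println(countFrom(new SortedIntIterator(1, 4, 7, 9), 5));
    }
}
```

The loop condition does the assignment and the end test in one expression, which is the "instead of comparing to -1" saving discussed in the summary above.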
[jira] Updated: (LUCENE-1607) String.intern() faster alternative
[ https://issues.apache.org/jira/browse/LUCENE-1607?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Yonik Seeley updated LUCENE-1607: - Attachment: LUCENE-1607.patch bq. Yonik, the string is being interned twice in your latest patch Thanks - I had actually fixed that... but it didn't make it into the patch apparently :-) String.intern() faster alternative -- Key: LUCENE-1607 URL: https://issues.apache.org/jira/browse/LUCENE-1607 Project: Lucene - Java Issue Type: Improvement Reporter: Earwin Burrfoot Fix For: 2.9 Attachments: intern.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch, LUCENE-1607.patch By using our own interned string pool on top of the default one, String.intern() can be greatly optimized. On my setup (Java 6) this alternative runs ~15.8x faster for already interned strings, and ~2.2x faster for 'new String(interned)'. For Java 5 and 4 the speedup is lower, but still considerable.
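The general idea behind the issue, a custom pool sitting in front of String.intern(), can be sketched roughly as below. This is an assumption-laden simplification, not the attached patch (which, being tuned for speed, is structured differently); the class name CachingInterner is made up for illustration.

```java
import java.util.concurrent.ConcurrentHashMap;

public class CachingInterner {
    // Our own pool of canonical strings, layered over the JVM's intern table.
    private final ConcurrentHashMap<String, String> pool = new ConcurrentHashMap<>();

    public String intern(String s) {
        String cached = pool.get(s);    // fast path: plain hash lookup, no native call
        if (cached != null) return cached;
        String interned = s.intern();   // slow path: canonicalize once via the JVM pool
        pool.put(interned, interned);
        return interned;
    }

    public static void main(String[] args) {
        CachingInterner interner = new CachingInterner();
        String a = interner.intern(new String("field"));
        String b = interner.intern(new String("field"));
        System.out.println(a == b);     // both resolve to the same canonical instance
    }
}
```

The speedup for already-interned strings comes from replacing the relatively expensive native String.intern() call with an ordinary hash lookup on the hot path.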
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704640#action_12704640 ] Yonik Seeley commented on LUCENE-1593: -- docsInOrder() would be an implementation detail (and could actually vary per reader or per segment) and should be on the Scorer/DocIdSetIterator rather than the Query or Weight, right?
Re: What are we allowed to do in 3.0?
I think 3.0 should be a fast turnaround after 2.9? Ie, no new development should take place? We should remove deprecated APIs, change defaults, etc., but that's about it. (I think this is how past major releases worked?) It's a fast switch. Which then means we need to do all the hard work in 2.9... So, I think any API changes we want to make must be present in 2.9 as deprecations. We shouldn't up and remove / rename something in 3.0 with no fore-warning in 2.9. Is there a case where this is too painful? Likewise I think we should give notification of expected changes in runtime behavior with 2.9 (and not suddenly do them in 3.0). Which means upon releasing 2.9, the very first issue that's opened and committed should be removing the current deprecated methods; otherwise we could open an issue that deprecates a method and accidentally remove it later, when we handle the massive deprecation removal. I think we should not target JAR drop-inability, and we should allow changes to runtime behavior, as well as certain minor API changes, in 3.0. EG here are some of the changes already slated for 3.0: * IndexReader.open returns a readOnly reader by default * IndexReader.norms returns null on fields that don't have norms * InterruptedException is thrown by many APIs * IndexWriter.autoCommit is hardwired to false * Things that now return the deprecated IndexCommitPoint (interface) will be changed to return IndexCommit (abstract base class) * Directory.list will be removed; Directory.listAll will become an abstract method * Stop tracking scores by default when sorting by field But recently, while working on LUCENE-1593, Mike and I spotted a need to add some methods to Weight, but since it is an interface we can't. So I said something like 'let's do it in 3.0', but then we were not sure if this can be done in 3.0. I think for this we should probably introduce an abstract base class (implementing Weight) in 2.9, stating that the Weight interface will be removed in 3.0?
(EG, this is what was done for IndexCommitPoint/IndexCommit). Simply changing Weight to be an abstract class in 3.0 is spooky because Java is single inheritance; ie, for existing classes that implement Weight but subclass something else, it would be nicer to give a heads up with 2.9 that they'll need to refactor? Mike On Thu, Apr 30, 2009 at 6:25 AM, Shai Erera ser...@gmail.com wrote: Hi Recently I was involved in several issues that required some runtime changes to be done in 3.0, and it was not so clear what it is that we're actually allowed to do in 3.0. So I'd like to open it for discussion, unless everybody agrees on it already. So far I can tell that 3.0 allows us to: 1) Get rid of all the deprecated methods 2) Move to Java 1.5 But what about changes to runtime behavior, re-factoring a whole set of classes, etc.? I'd like to relate to them one by one, so that we can comment on each. Removing deprecated methods - As I was told, 2.9 is the last release in which we are allowed to mark methods as deprecated and remove them in 3.0. I.e., after 2.9 is out, if we feel there is a method that should be renamed, have its signature changed, or be removed altogether, we can't just do it; we'd have to deprecate it and remove it in 4.0 (?). I personally thought that 2.9 allows us to make these changes without letting anyone know about them in advance, which I'm ok with since upgrading to 3.0 is not going to be as 'smooth', but I also understand why 'letting people know in advance' (which means a release prior to the one when they are removed) gives a better service to our users. I also thought that jar drop-in-ability is not supposed to be supported from 2.9 to 3.0 (but I was corrected previously on that). Which means upon releasing 2.9, the very first issue that's opened and committed should be removing the current deprecated methods; otherwise we could open an issue that deprecates a method and accidentally remove it later, when we handle the massive deprecation removal.
We should also create a 2.9 tag. Changes to runtime behavior - What exactly is the policy here? If we document in 2.9 that certain features' runtime behavior will change in 3.0 - is it ok to make those changes? And if we don't document them and do it (in the transition from 2.9 to 3.0) then it's not? Why? After all, I expect anyone who upgrades to 3.0 to run all of his unit tests to assert that everything still works (I expect that to happen for every release, but for 3.0 in particular). Obviously the runtime behaviors that were changed and documented in 2.9 are ones that he might have already taken care of, but why can't he do the same reading the CHANGES of 3.0? I just feel that this policy forces us to think really hard and foresee those changes in runtime behavior that we'd like to do in 3.0 so that we can get them
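The abstract-base-class migration path raised earlier in this thread (for the Weight interface) follows a general Java pattern. A minimal sketch with hypothetical names — this is not the actual Lucene 2.9 API:

```java
// Existing public interface; it cannot gain methods without breaking implementors.
interface LegacyWeight {          // stands in for the deprecated Weight interface
    float getValue();
}

// Abstract base class introduced alongside the interface in the transition
// release. New methods land here with sensible defaults, so subclasses keep
// compiling; the interface itself can then be removed in the next major release.
abstract class WeightBase implements LegacyWeight {
    // A method added after the interface was frozen:
    public boolean scoresDocsOutOfOrder() {
        return false;             // conservative default
    }
}

// User code migrates from "implements LegacyWeight" to "extends WeightBase".
class TermWeightSketch extends WeightBase {
    public float getValue() { return 1.0f; }
}
```

The single-inheritance concern above is exactly why the 2.9 heads-up matters: a class that already extends something else cannot simply switch to the base class and needs time to refactor.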
[jira] Commented: (LUCENE-1614) Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean
[ https://issues.apache.org/jira/browse/LUCENE-1614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704665#action_12704665 ] Michael McCandless commented on LUCENE-1614: bq. I think one cannot even call next(). Hmm, yeah I think you're right. We could perhaps make this an entirely different interface (abstract class). Ie, one should not mix and match check()ing with next()/advancing. In the case I can think of, at least, it's an up-front decision as to which scorer does next vs check. bq. So I can include check() in the list as well. I think including it in this issue is fine. Add next() and skipTo() variants to DocIdSetIterator that return the current doc, instead of boolean Key: LUCENE-1614 URL: https://issues.apache.org/jira/browse/LUCENE-1614 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 See http://www.nabble.com/Another-possible-optimization---now-in-DocIdSetIterator-p23223319.html for the full discussion. The basic idea is to add variants to those two methods that return the current doc they are at, to save successive calls to doc(). If there are no more docs, return -1. A summary of what was discussed so far: # Deprecate those two methods. # Add nextDoc() and skipToDoc(int) that return the doc, with default impls in DISI (calling next() and skipTo() respectively; they will be changed to abstract in 3.0). #* I actually would like to propose an alternative to the names: advance() and advance(int) - the first advances by one, the second advances to target. # Wherever these are used, do something like '(doc = advance()) >= 0' instead of comparing to -1, for improved performance. I will post a patch shortly -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
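The proposed calling pattern in the summary above can be sketched with hypothetical standalone types (not the actual DocIdSetIterator API): advance() returns the doc it lands on, or -1 when exhausted, so callers avoid a separate doc() call per step.

```java
// Sketch of the LUCENE-1614 proposal: iteration methods return the current
// doc directly, with -1 as the exhaustion sentinel.
abstract class DocIterSketch {
    static final int NO_MORE_DOCS = -1;
    abstract int advance();            // advance by one doc, return it
    abstract int advance(int target);  // advance to first doc >= target
}

class ArrayDocIter extends DocIterSketch {
    private final int[] docs;
    private int i = -1;
    ArrayDocIter(int[] docs) { this.docs = docs; }
    int advance() { return ++i < docs.length ? docs[i] : NO_MORE_DOCS; }
    int advance(int target) {
        while (++i < docs.length) {
            if (docs[i] >= target) return docs[i];
        }
        return NO_MORE_DOCS;
    }
    // Caller loop per the issue: compare >= 0 instead of checking a boolean
    // and then calling doc().
    static int count(DocIterSketch it) {
        int n = 0, doc;
        while ((doc = it.advance()) >= 0) n++;
        return n;
    }
}
```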
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704674#action_12704674 ] Yonik Seeley commented on LUCENE-1593: -- Query objects are relatively abstract. Weights are created only with respect to a Searcher, and Scorers are created only from within that context with respect to an IndexReader. It really seems like we should maintain this separation and avoid putting implementation details into the Query object (or the Weight object, for that matter). bq. A user might want to know what Collector implementation to create before calling search(Query, Collector) Having to create a certain type of collector sounds error-prone. Why not reverse the flow of information and tell the Weight.scorer() method if an out-of-order scorer is acceptable, via some flags or a context object. This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait. Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null. # Also move to use changing top and then call adjustTop(), in case we update the queue. # some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. 
But, doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just store the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them.
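Point #3 of the plan above (sentinel pre-population) removes per-hit null/size checks because any real score beats the sentinel. A toy illustration of the idea — this is not Lucene's actual HitQueue, and the naive re-sort stands in for the real adjustTop():

```java
import java.util.Arrays;

// Toy top-N score keeper: the queue is born full of Float.NEGATIVE_INFINITY,
// so collect() never checks for null or for "queue not yet full" -- a real
// hit always displaces a sentinel.
class SentinelTopN {
    private final float[] q;   // q[0] holds the current worst (smallest) score

    SentinelTopN(int n) {
        q = new float[n];
        Arrays.fill(q, Float.NEGATIVE_INFINITY);  // the sentinels
    }

    void collect(float score) {
        if (score <= q[0]) return;  // doesn't beat the current worst: one cheap test
        q[0] = score;
        Arrays.sort(q);             // naive stand-in for adjustTop()
    }

    float worstScore() { return q[0]; }
}
```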
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704692#action_12704692 ] Michael McCandless commented on LUCENE-1593: bq. What will TermQuery do? Oh: it's fine to return an in-order scorer, always. It's just that if a Query wants to use an out-of-order scorer, it should also implement an in-order one. Ie, there'd be a mating process to match the scorer to the collector. That there might be a Collector out there that requires docs in order is not something I think we should handle. Reason is, there wasn't any guarantee until today that docs are returned in order. So how can someone write a Collector which has a hard assumption on that? Maybe only if he used a Query which he knows always scores in order, such as TQ, but then I don't think this guy will have a problem since TQ returns true. bq. And if that someone needs docs in order, but the query at hand returns docs out of order, then I'd say tough luck? I mean, maybe with BQ we can ensure in/out of order on request, but if there will be a query which returns docs in random order, or based on other criteria which causes it to return out of order, what good will forcing it to return docs in order do? I'd say that you should just use a different query in that case? Well... we have to be careful. EG say we had some great optimization for iterating over matches to PhraseQuery, but it returned docs out of order. In that case, I think we'd preserve the in-order Scorer as well? bq. But I'm not sure in practice when one would want to use an out-of-order non-top iterator. One case might be a random access filter AND'd w/ a BooleanQuery. In that case I could ask for a BooleanScorer to return a DISI whose next is allowed to return docs out of order, because 1) my filter doesn't care and 2) my collector doesn't care. 
Though, we are thinking about pushing random access filters all the way down to the TermScorer, so this example isn't realistic in that future... but it still feels like out-of-order iteration and "am I a top scorer or not" are orthogonal concepts.
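The random-access-filter case above works precisely because order doesn't matter when probing a filter: each delivered doc is checked independently. A minimal sketch with hypothetical types (a boolean[] stands in for a random-access DocIdSet):

```java
// If the filter supports random access and the collector is order-insensitive,
// an out-of-order scorer is acceptable: every doc it delivers, in whatever
// order, gets an O(1) filter probe.
class RandomAccessFilterSketch {
    static int countFiltered(int[] docsInAnyOrder, boolean[] acceptDocs) {
        int hits = 0;
        for (int doc : docsInAnyOrder) {
            if (acceptDocs[doc]) hits++;  // order of delivery is irrelevant
        }
        return hits;
    }
}
```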
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704693#action_12704693 ] Michael McCandless commented on LUCENE-1593: One further optimization can be enabled if we can properly mate out-of-orderness between Scorer and Collector: BooleanScorer could be automatically used when appropriate. Today, one must call BooleanQuery.setAllowDocsOutOfOrder, which is rather silly (it's very much an under-the-hood detail of how the Scorer interacts w/ the Collector). The vast majority of the time it's Lucene that creates the collector, and so now that we can create Collectors that either do or do not care if docs arrive out of order, we should allow BooleanScorer when we can. Though that means we have two ways to score a BooleanQuery: * Use BooleanScorer2 w/ a Collector that doesn't fall back to docID to break ties * Use BooleanScorer w/ a Collector that does fall back We'd need to test which is most performant (I'm guessing the 2nd one). So maybe we should in fact add an acceptsDocsOutOfOrder method to Collector.
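The mating idea can be sketched concretely (assumed names, not Lucene's actual API): the Collector declares whether it tolerates out-of-order docs, and the search side picks the scorer kind accordingly.

```java
// Hypothetical sketch of scorer/collector mating. The Collector advertises
// whether out-of-order delivery is acceptable; boolean-query scoring then
// picks the faster out-of-order BooleanScorer only when it may.
abstract class CollectorSketch {
    abstract boolean acceptsDocsOutOfOrder();
}

class OutOfOrderOkCollector extends CollectorSketch {
    boolean acceptsDocsOutOfOrder() { return true; }   // e.g. no docID tie-break
}

class InOrderOnlyCollector extends CollectorSketch {
    boolean acceptsDocsOutOfOrder() { return false; }  // relies on increasing docIDs
}

class ScorerChooser {
    // Which scorer implementation would be handed out for a boolean query.
    static String choose(CollectorSketch c) {
        return c.acceptsDocsOutOfOrder() ? "BooleanScorer" : "BooleanScorer2";
    }
}
```

The design point is that the decision moves from a user-visible query setting (setAllowDocsOutOfOrder) to an internal negotiation between collector and scorer.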
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704699#action_12704699 ] Michael McCandless commented on LUCENE-1593: bq. This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait. Yonik would you suggest we migrate Weight to be an abstract class instead? (This is also being discussed in a separate thread on java-dev, if you want to respond there...)
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704703#action_12704703 ] Michael McCandless commented on LUCENE-1593: Yonik does Solr have any Scorers that iterate on docs out of order? Or is BooleanScorer the only one we all know about?
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704702#action_12704702 ] Michael McCandless commented on LUCENE-1593: {quote} IndexSearcher creates the Collector before it obtains a Scorer. Therefore all it has at hand is the Weight. Since Weight is an interface, we can't change it, so I added it to Query with a default of false. {quote} In early iterations on LUCENE-1483, we allowed Collector.setNextReader to return a new Collector, on the possibility that a new segment might require a different collector. We could consider going back to that... and allowing the builtin collectors to receive a Scorer on creation, which they could interact with to figure out in/out-of-order types of issues. We could then also enrich setNextReader a bit to also receive a Scorer, so that if somehow the Scorer for the next segment switched to be in-order vs out-of-order, the Collector could properly respond. Or we could require homogeneity for Scorer across all segments (which'd be quite a bit simpler). {quote} Why not reverse the flow of information and tell the Weight.scorer() method if an out-of-order scorer is acceptable via some flags or a context object. This is also not backward compatible because Weight is an interface, so perhaps this optimization will just have to wait. {quote} I tentatively like this approach, ie add an API to Collector for it to declare if it can handle out-of-order collection, and then ask for the right Scorer. But still, internal creation of Collectors could go both ways, and so we should retain the freedom to optimize (the BooleanScorer example above). 
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704705#action_12704705 ] Michael McCandless commented on LUCENE-1593: bq. BooleanScorer could be automatically used when appropriate If we do this (and I think we should -- good perf gains, though I haven't tested just how good, recently), then we should deprecate setAllowDocsOutOfOrder in favor of Weight.scorer(boolean allowDocsOutOfOrder). And make it clear that internally Lucene may ask for either scorer, depending on the collector.
[jira] Commented: (LUCENE-1252) Avoid using positions when not all required terms are present
[ https://issues.apache.org/jira/browse/LUCENE-1252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704711#action_12704711 ] Michael McCandless commented on LUCENE-1252: Here's a simple example that might drive this issue forward: +h1n1 flu +united states Ideally, to score this query, you'd want to first AND all 4 terms together, and only for docs matching that, consult the positions of each pair of terms. But we fail to do this today. It's like somehow Weight.scorer() needs to be able to return a cheap and an expensive scorer (which must be AND'd). I think PhraseQuery would somehow return cheap/expensive scorers that under the hood share the same SegmentTermDocs/Positions iterators, such that after cheap.next() has run, the expensive scorer only needs to check the current doc. So in fact maybe the expensive scorer should not be a Scorer but some other simple 'passes or doesn't' API. Or maybe it returns say a TwoStageScorer, which adds a reallyPasses() (needs a better name) method to the otherwise normal Scorer (DISI) API. Or something else. Avoid using positions when not all required terms are present - Key: LUCENE-1252 URL: https://issues.apache.org/jira/browse/LUCENE-1252 Project: Lucene - Java Issue Type: Wish Components: Search Reporter: Paul Elschot Priority: Minor In the Scorers of queries with (lots of) Phrases and/or (nested) Spans, currently next() and skipTo() will use position information even when other parts of the query cannot match because some required terms are not present. This could be avoided by adding some methods to Scorer that relax the postcondition of next() and skipTo() to something like all required terms are present, but no position info was checked yet, and implementing these methods for Scorers that do conjunctions: BooleanScorer, PhraseScorer, and SpanScorer/NearSpans. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. 
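The two-stage idea above — a cheap conjunction pass first, positions consulted only for survivors — can be modeled minimally. Hypothetical types; a real PhraseScorer would share the same term/position iterators between the two stages:

```java
import java.util.Map;
import java.util.Set;

// Toy two-stage matcher: stage 1 is a cheap "all required terms present"
// check; stage 2 consults positions (here: exact adjacency for a two-word
// phrase) only for docs that survived stage 1.
class TwoStageSketch {
    // Stage 1: cheap conjunction over required terms.
    static boolean allPresent(Set<String> docTerms, String... required) {
        for (String t : required) {
            if (!docTerms.contains(t)) return false;
        }
        return true;
    }

    // Stage 2: the expensive positional check (the "reallyPasses" of the
    // comment above, under its hypothetical name).
    static boolean adjacent(Map<String, Integer> pos, String a, String b) {
        return pos.get(b) == pos.get(a) + 1;
    }

    static boolean phraseMatches(Map<String, Integer> pos, String a, String b) {
        if (!allPresent(pos.keySet(), a, b)) return false;  // skip positions entirely
        return adjacent(pos, a, b);
    }
}
```

The payoff is in phraseMatches(): for docs missing any required term, the position data is never touched.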
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704720#action_12704720 ] Marvin Humphrey commented on LUCENE-1593: - I made Weight a subclass of Query and all of a sudden Searcher method signatures got easier to manage. PS: Is this a good place to discuss why [having rambling conversations in the bug tracker is a bad idea|http://producingoss.com/en/bug-tracker-usage.html], or should I open a new issue?
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704721#action_12704721 ] Yonik Seeley commented on LUCENE-1593: -- bq. Yonik does Solr have any Scorers that iterate on docs out of order? Or is BooleanScorer the only one we all know about? Nope. BooleanScorer is the only one I know about. And it's sort of special too... it's not like BooleanScorer can accept out-of-order scorers as sub-scorers itself - the ids need to be delivered in the range of the current bucket. IMO custom out-of-order scorers aren't supported in Lucene.
dbsight
Hello Everyone, I just started to use lucene recently. Great project BTW. I was wondering if anyone has suggested making an open source version of dbsight (www.dbsight.net/). I've just started using it and I think it would be awesome if it was open source. Does anyone know of a project that's like this that is OS? If not, then how can I propose a project that does a similar thing? Thanks, Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: dbsight
Hi Mike, You may want to ask your question on java-u...@lucene.apache.org -J On Thu, Apr 30, 2009 at 11:59 AM, Michael Masters mmast...@gmail.comwrote: Hello Everyone, I just started to use lucene recently. Great project BTW. I was wondering if anyone has suggested making an open source version of dbsight (www.dbsight.net/). I've just started using it and I think it would be awesome if it was open source. Does anyone know of a project that's like this that is OS? If not, then how can I propose a project that does a similar thing? Thanks, Mike - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
[jira] Resolved: (LUCENE-1611) Do not launch new merges if IndexWriter has hit OOME
[ https://issues.apache.org/jira/browse/LUCENE-1611?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Michael McCandless resolved LUCENE-1611. Resolution: Fixed Fix Version/s: 2.4.2 Thanks Christiaan! Do not launch new merges if IndexWriter has hit OOME Key: LUCENE-1611 URL: https://issues.apache.org/jira/browse/LUCENE-1611 Project: Lucene - Java Issue Type: Bug Components: Index Affects Versions: 2.4.1 Reporter: Michael McCandless Assignee: Michael McCandless Fix For: 2.4.2, 2.9 Attachments: LUCENE-1611-241.patch, LUCENE-1611.patch, LUCENE-1611.patch if IndexWriter has hit OOME, it defends itself by refusing to commit changes to the index, including merges. But this can lead to infinite merge attempts because we fail to prevent starting a merge. Spinoff from http://www.nabble.com/semi-infinite-loop-during-merging-td23036156.html. -- This message is automatically generated by JIRA. - You can reply to this email to add a comment to the issue online. - To unsubscribe, e-mail: java-dev-unsubscr...@lucene.apache.org For additional commands, e-mail: java-dev-h...@lucene.apache.org
Re: What are we allowed to do in 3.0?
So I understand from your responses that you think that 2.9 should include as much as possible, so that a user will have ~0 work upgrading from 2.9 to 3.0, assuming he upgraded fully to 2.9 (moved to use the non-deprecated APIs etc.). If 3.0 is supposed to be released quickly after 2.9 then this makes sense, and leaves very little room for sudden major changes anyway. Also I read that you do support introducing minor changes to the API as well as runtime behavior, but still prefer that we do them in 2.9. So refactoring like we discussed - changing all interfaces to abstract classes - should not happen in 2.9-3.0, which makes sense. I think this type of refactoring should happen one at a time anyway, and not as a complete overhaul of the code. Well ... I guess that if that's the case (releasing 3.0 soon after we release 2.9, and accepting only a few very minor changes to API and runtime behavior that were not documented in 2.9), then we should be fine and this very short discussion (I somehow expected many more responses) can end. Thanks a lot for clarifying that to me. Shai On Thu, Apr 30, 2009 at 6:20 PM, Michael McCandless luc...@mikemccandless.com wrote: I think 3.0 should be a fast turnaround after 2.9? Ie, no new development should take place? We should remove deprecated APIs, change defaults, etc., but that's about it. (I think this is how past major releases worked?). It's a fast switch. Which then means we need to do all the hard work in 2.9... So, I think any API changes we want to make must be present in 2.9 as deprecations. We shouldn't up and remove / rename something in 3.0 with no forewarning in 2.9. Is there a case where this is too painful? Likewise I think we should give notification of expected changes in runtime behavior with 2.9 (and not suddenly do them in 3.0).
Which means upon releasing 2.9, the very first issue that's opened and committed should be removing the current deprecated methods; otherwise we could open an issue that deprecates a method and accidentally remove it later, when we handle the massive deprecation removal. I think we should not target JAR drop-in-ability, and we should allow changes to runtime behavior, as well as certain minor API changes in 3.0. EG here are some of the changes already slated for 3.0: * IndexReader.open returns readOnly reader by default * IndexReader.norms returns null on fields that don't have norms * InterruptedException is thrown by many APIs * IndexWriter.autoCommit is hardwired to false * Things that now return the deprecated IndexCommitPoint (interface) will be changed to return IndexCommit (abstract base class) * Directory.list will be removed; Directory.listAll will become an abstract method * Stop tracking scores by default when sorting by field But recently, while working on LUCENE-1593, Mike and I spotted a need to add some methods to Weight, but since it is an interface we can't. So I said something like 'let's do it in 3.0', but then we were not sure if this can be done in 3.0. I think for this we should probably introduce an abstract base class (implementing Weight) in 2.9, stating that the Weight interface will be removed in 3.0? (EG, this is what was done for IndexCommitPoint/IndexCommit). Simply changing Weight to be an abstract class in 3.0 is spooky because Java is single inheritance, ie, for existing classes that implement Weight but subclass something else, it would be nicer to give a heads up with 2.9 that they'll need to refactor? Mike On Thu, Apr 30, 2009 at 6:25 AM, Shai Erera ser...@gmail.com wrote: Hi Recently I was involved in several issues that required some runtime changes to be done in 3.0 and it was not so clear what it is that we're actually allowed to do in 3.0. So I'd like to open it for discussion, unless everybody agrees on it already.
So far I can tell that 3.0 allows us to: 1) Get rid of all the deprecated methods 2) Move to Java 1.5 But what about changes to runtime behavior, re-factoring a whole set of classes etc? I'd like to relate to them one by one, so that we can comment on each. Removing deprecated methods - As I was told, 2.9 is the last release we are allowed to mark methods as deprecated, and remove them in 3.0. I.e., after 2.9 is out, if we feel there is a method that should be renamed, its signature should change or be removed altogether, we can't just do it and we'd have to deprecate it and remove it in 4.0 (?). I personally thought that 2.9 allows us to make these changes without letting anyone know about them in advance, which I'm ok with since upgrading to 3.0 is not going to be as 'smooth', but I also understand why 'letting people know in advance' (which means a release prior to the one when they are removed) gives a better service to our users. I also thought that jar drop-in-ability is not
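The interface-to-abstract-class migration path discussed in this thread (the route taken for IndexCommitPoint/IndexCommit) can be sketched with illustrative names; this shows the general pattern only, not the actual Weight code:

```java
// Release 2.9: the old interface stays but is deprecated in favor of an
// abstract base class that implements it, giving users one release of warning.
/** @deprecated extend {@link AbstractThing} instead; removed in 3.0. */
@Deprecated
interface Thing {
    float value();
}

// In 3.0, "implements Thing" is deleted and new methods can be added here
// with default implementations, which an interface (pre-Java 8) cannot offer.
abstract class AbstractThing implements Thing {
    // A method added after the interface froze gets a default body,
    // so existing subclasses keep compiling.
    public boolean scoresInOrder() { return true; }
}

// Existing user code migrates by switching "implements" to "extends".
class MyThing extends AbstractThing {
    public float value() { return 1f; }
}
```

The single-inheritance concern raised above is exactly why the heads-up release matters: a class that already extends something else and implements Weight needs a release cycle to restructure.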
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1313: - Attachment: LUCENE-1313.patch {quote} Would you re-use MergePolicy, or make a new RAMMergePolicy? {quote} MergePolicy is used as is, with a special IW method that handles merging ram segments for the real directory (which has an issue around merging contiguous segments; can that be relaxed in this case, as I don't understand why this is?). The patch is not committable; however, I am posting it to show a path that seems to work. It includes test cases for merging in ram and merging to the real directory. * IW.getFlushDirectory is used by internal calls to obtain the directory to flush segments to. This is used in DocumentsWriter related calls. * DocumentsWriter.directory is removed so that methods requiring the directory call IW.getFlushDirectory instead. * IW.setRAMDirectory sets the ram directory to be used. * IW.setRAMMergePolicy sets the merge policy to be used for merging segments on the ram dir. * In IW.updatePendingMerges, totalRamUsed is the size of the ram segments + the ram buffer used. If totalRamUsed exceeds the max ram buffer size then IW.updatePendingRamMergesToRealDir is called. * IW.updatePendingRamMergesToRealDir registers a merge of the ram segments to the real directory (currently causes a non-contiguous segments exception) * MergePolicy.OneMerge has a directory attribute used when building the merge.info in _mergeInit.
* Test case includes testMergeInRam, testMergeToDisk, testMergeRamExceeded There is one error that occurs regularly in testMergeRamExceeded {code} MergePolicy selected non-contiguous segments to merge (_bo:cx83 _bm:cx4 _bn:cx2 _bl:cx1-_bj _bp:cx1-_bp _bq:cx1-_bp _c2:cx1-_c2 _c3:cx1-_c2 _c4:cx1-_c2 vs _5x:c120 _6a:c8 _6t:c11 _bo:cx83** _bm:cx4** _bn:cx2** _bl:cx1-_bj** _bp:cx1-_bp** _bq:cx1-_bp** _c1:c10 _c2:cx1-_c2** _c3:cx1-_c2** _c4:cx1-_c2**), which IndexWriter (currently) cannot handle {code} Realtime Search --- Key: LUCENE-1313 URL: https://issues.apache.org/jira/browse/LUCENE-1313 Project: Lucene - Java Issue Type: New Feature Components: Index Affects Versions: 2.4.1 Reporter: Jason Rutherglen Priority: Minor Fix For: 2.9 Attachments: LUCENE-1313.jar, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, LUCENE-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch, lucene-1313.patch Realtime search with transactional semantics. Possible future directions: * Optimistic concurrency * Replication Encoding each transaction into a set of bytes by writing to a RAMDirectory enables replication. It is difficult to replicate using other methods because while the document may easily be serialized, the analyzer cannot. I think this issue can hold realtime benchmarks which include indexing and searching concurrently.
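The exception quoted above comes from a contiguity requirement: the segments chosen for a merge must form one unbroken run in the segment list. A minimal, illustrative version of that check (not the real IndexWriter/SegmentInfos code):

```java
import java.util.List;

// Sketch of the contiguity rule that the patch trips over; names are illustrative.
class ContiguityCheck {
    // Returns true if 'chosen' appears as one unbroken run inside 'all',
    // which is what merge registration insisted on at the time.
    static boolean isContiguous(List<String> all, List<String> chosen) {
        if (chosen.isEmpty()) return true;
        int start = all.indexOf(chosen.get(0));
        if (start < 0 || start + chosen.size() > all.size()) return false;
        return all.subList(start, start + chosen.size()).equals(chosen);
    }
}
```

A merge of ram segments interleaved with on-disk segments fails this test, which is why the ram-to-real-directory merge raises the non-contiguous exception.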
[jira] Commented: (LUCENE-1593) Optimizations to TopScoreDocCollector and TopFieldCollector
[ https://issues.apache.org/jira/browse/LUCENE-1593?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanelfocusedCommentId=12704785#action_12704785 ] Shai Erera commented on LUCENE-1593: bq. add an API to Collector for it to declare if it can handle out-of-order collection, and then ask for the right Scorer. Maybe instead add docsOrderSupportedMode() which returns IN_ORDER, OUT_OF_ORDER, DONT_CARE? I.e., instead of a boolean, allow a Collector to say I don't really care (like Mike has pointed out, I think, somewhere above) and let the Scorer creation code decide which one to create in case it knows any better. I.e., if we know that BS performs better than BS2, and we get a Collector saying DONT_CARE, we can always return BS. Unless we assume that OUT_OF_ORDER covers DONT_CARE as well, in which case we can leave it as returning boolean and document that if a Collector can support OUT_OF_ORDER, it should always say so, giving the Scorer creator code a chance to decide what is the best Scorer to return. In IndexSearcher we can then: # Where Collector is given as argument, ask it about its orderness and create the appropriate Scorer. # Where we create our own Collector (i.e. TFC and TSDC), decide on our own what is better. Maybe always ask for out-of-order? That way a Query which only supports in-order, without any optimization for out-of-order, can return that in-order Scorer. I didn't think of it initially, but Mike is right - every in-order scorer is also an out-of-order scorer, so this should be fine. I like the approach of deprecating Weight and creating an abstract class, though that requires deprecating Searchable and creating an AbstractSearchable as well. Weight can be wrapped with an AbstractWeightWrapper and passed to the AbstractWeight methods (much like we do with AbstractHitCollector from LUCENE-1575), defaulting its scorer(inOrder) method to call scorer()?
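The boolean variant of the idea above - a Collector declares whether it tolerates out-of-order docIDs, and the caller picks the cheaper scorer only when allowed - can be sketched as follows (illustrative classes, not the actual Collector/IndexSearcher API):

```java
// Sketch of scorer selection driven by the collector's ordering contract.
abstract class SimpleCollector {
    // true = docIDs may arrive out of order (BooleanScorer-style collection).
    abstract boolean acceptsDocsOutOfOrder();
}

class ScorerPicker {
    // If the collector tolerates out-of-order hits, we may hand it the faster
    // out-of-order scorer (BS); otherwise we must use the in-order one (BS2).
    // Every in-order scorer is trivially also a valid out-of-order scorer,
    // which is why saying "out-of-order is OK" is always safe for a collector.
    static String pickScorer(SimpleCollector c, boolean outOfOrderIsFaster) {
        if (c.acceptsDocsOutOfOrder() && outOfOrderIsFaster) {
            return "out-of-order";
        }
        return "in-order";
    }
}
```

This is also why the tri-state DONT_CARE can collapse into the boolean: OUT_OF_ORDER already gives the scorer-creation code full freedom to choose.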
This I guess should be done in the scope of that issue, or I revert the changes done to Query (adding scoresDocsInOrder()), but keep those done to TFC and TSDC, and make that optimization in a different issue, which will handle Weight/Searchable and the rest of the changes proposed here? Optimizations to TopScoreDocCollector and TopFieldCollector --- Key: LUCENE-1593 URL: https://issues.apache.org/jira/browse/LUCENE-1593 Project: Lucene - Java Issue Type: Improvement Components: Search Reporter: Shai Erera Fix For: 2.9 Attachments: LUCENE-1593.patch, LUCENE-1593.patch, PerfTest.java This is a spin-off of LUCENE-1575 and proposes to optimize TSDC and TFC code to remove unnecessary checks. The plan is: # Ensure that IndexSearcher returns segments in increasing doc Id order, instead of numDocs(). # Change TSDC and TFC's code to not use the doc id as a tie breaker. New docs will always have larger ids and therefore cannot compete. # Pre-populate HitQueue with sentinel values in TSDC (score = Float.NEG_INF) and remove the check if reusableSD == null. # Also move to use changing top and then call adjustTop(), in case we update the queue. # Some methods in Sort explicitly add SortField.FIELD_DOC as a tie breaker for the last SortField. But doing so should not be necessary (since we already break ties by docID), and is in fact less efficient (once the above optimization is in). # Investigate PQ - can we deprecate insert() and have only insertWithOverflow()? Add an addDummyObjects method which will populate the queue without arranging it, just store the objects in the array (this can be used to pre-populate sentinel values)? I will post a patch as well as some perf measurements as soon as I have them.
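Item 3 of the plan - pre-populating the hit queue with sentinel values so the hot collect loop needs no null or size checks - in miniature (a plain-Java min-heap of scores, not Lucene's HitQueue):

```java
// Miniature of the sentinel trick: a fixed-size min-heap that is born "full"
// of -Infinity sentinels, so every insertion is a single compare against the
// current minimum, with no null checks or size bookkeeping on the hot path.
class SentinelHitQueue {
    private final float[] heap;

    SentinelHitQueue(int size) {
        heap = new float[size];
        java.util.Arrays.fill(heap, Float.NEGATIVE_INFINITY); // sentinels
    }

    // Competitive scores displace the current minimum; sentinels lose first.
    void insertWithOverflow(float score) {
        if (score <= heap[0]) return;   // not competitive, common fast path
        heap[0] = score;                // replace top, then restore heap order
        siftDown();
    }

    float top() { return heap[0]; }

    private void siftDown() {
        int i = 0;
        while (true) {
            int l = 2 * i + 1, r = l + 1, min = i;
            if (l < heap.length && heap[l] < heap[min]) min = l;
            if (r < heap.length && heap[r] < heap[min]) min = r;
            if (min == i) return;
            float t = heap[i]; heap[i] = heap[min]; heap[min] = t;
            i = min;
        }
    }
}
```

Real hits always beat -Infinity, so the queue behaves as empty until it genuinely fills, exactly the effect of the reusableSD == null check it replaces.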
Score calculation with new by-segment collection
Did I miss something, or when trunk switched to collecting on SegmentReaders did we lose proper scores? I mean, before, score depended on TF calculated across all the index, and now it depends on TF for a given segment (yup, unless I missed something). Per-segment TF can vary wildly, especially in case of smaller segments. -- Kirill Zakharenko/Кирилл Захаренко (ear...@gmail.com) Home / Mobile: +7 (495) 683-567-4 / +7 (903) 5-888-423 ICQ: 104465785
Re: Score calculation with new by-segment collection
On Thu, Apr 30, 2009 at 4:44 PM, Earwin Burrfoot ear...@gmail.com wrote: Did I miss something, or when trunk switched to collecting on SegmentReaders we've lost proper scores? I mean, before score depended on TF calculated across all the index, and now it depends on TF for a given segment (yup, unless I missed something). Per-segment TF can vary wildly, especially in case of smaller segments. tf is per-document, not per index. idf is per index, and is calculated in the creation of Weight at the top-level index reader. -Yonik http://www.lucidimagination.com
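A worked example of Yonik's point: per-segment docFreq varies, but the Weight aggregates it over the whole index once, so idf is index-wide and per-segment collection does not change scores. This sketch uses the classic idf formula as an assumption; it is not Lucene's actual Similarity code path:

```java
// Sketch: idf depends only on index-wide statistics, summed over segments.
class IdfSketch {
    // Classic idf form: 1 + ln(numDocs / (docFreq + 1)).
    static double idf(int docFreq, int numDocs) {
        return 1 + Math.log((double) numDocs / (docFreq + 1));
    }

    // The Weight is created against the top-level reader, so docFreq and
    // numDocs are totals across all segments, however unevenly they split.
    static double indexWideIdf(int[] segmentDocFreqs, int[] segmentNumDocs) {
        int df = 0, n = 0;
        for (int x : segmentDocFreqs) df += x;
        for (int x : segmentNumDocs) n += x;
        return idf(df, n);
    }
}
```

tf, by contrast, is counted inside one document, so it never depended on segment boundaries in the first place.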
Re: Score calculation with new by-segment collection
On Fri, May 1, 2009 at 00:47, Yonik Seeley yo...@lucidimagination.com wrote: On Thu, Apr 30, 2009 at 4:44 PM, Earwin Burrfoot ear...@gmail.com wrote: Did I miss something, or when trunk switched to collecting on SegmentReaders we've lost proper scores? I mean, before score depended on TF calculated across all the index, and now it depends on TF for a given segment (yup, unless I missed something). Per-segment TF can vary wildly, especially in case of smaller segments. tf is per-document, not per index. idf is per index, Yup, my bad. and is calculated in the creation of Weight at the top-level index reader. Aha, thanks a lot.
Re: dbsight
Sorry. My mistake. -Mike
[jira] Updated: (LUCENE-1313) Realtime Search
[ https://issues.apache.org/jira/browse/LUCENE-1313?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Rutherglen updated LUCENE-1313: - Attachment: LUCENE-1313.patch Fixed and cleaned up more. All tests pass. Added entry in CHANGES.txt. I'm going to integrate LUCENE-1618 and test that out as a part of the next patch.
[jira] Updated: (LUCENE-1494) Additional features for searching for value across multiple fields (many-to-one style)
[ https://issues.apache.org/jira/browse/LUCENE-1494?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Hoss Man updated LUCENE-1494: - Attachment: LUCENE-1494-masking.patch Some things looked like they wouldn't work with the masking patch, so I wrote some test cases to convince myself they were broken (and because new code should always have test cases). In particular I was worried about the lack of equals/hashCode methods, and the broken rewrite method. One interesting thing I discovered was that the code worked in many cases even though rewrite was constantly just returning the masked inner query -- digging into it I realized the reason was that none of the other SpanQuery classes pay any attention to what their nested clauses return when they recursively rewrite, so a SpanNearQuery whose constructor freaks out if the fields of all the clauses don't match happily generates spans if one of those clauses returns a completely different SpanQuery on rewrite. I also removed the proxying of getBoost and setBoost ... it was causing problems with some stock testing framework code that expects q1.equals(q1.clone().setBoost(newBoost)) to be false (this was evaluating to true because it's a shallow clone and setBoost was proxying and modifying the original inner query's boost value) ... this means that FieldMaskingSpanQuery is consistent with how other SpanQueries deal with boost (they ignore the boosts of their nested clauses). New patch (with tests) attached ...
I'd like to have some more tests before committing (spans is deep voodoo, we're doing funky stuff with spans, all the more reason to test thoroughly) Additional features for searching for value across multiple fields (many-to-one style) -- Key: LUCENE-1494 URL: https://issues.apache.org/jira/browse/LUCENE-1494 Project: Lucene - Java Issue Type: New Feature Components: Search Affects Versions: 2.4 Reporter: Paul Cowan Priority: Minor Attachments: LUCENE-1494-masking.patch, LUCENE-1494-masking.patch, LUCENE-1494-multifield.patch, LUCENE-1494-positionincrement.patch This issue is to cover the changes required to do a search across multiple fields with the same name in a fashion similar to a many-to-one database. Below is my post on java-dev on the topic, which details the changes we need: --- We have an interesting situation where we are effectively indexing two 'entities' in our system, which share a one-to-many relationship (imagine 'User' and 'Delivery Address' for demonstration purposes). At the moment, we index one Lucene Document per 'many' end, duplicating the 'one' end data, like so: userid: 1 userfirstname: fred addresscountry: au addressphone: 1234 userid: 1 userfirstname: fred addresscountry: nz addressphone: 5678 userid: 2 userfirstname: mary addresscountry: au addressphone: 5678 (note: 2 Documents indexed for user 1). This is somewhat annoying for us, because when we search in Lucene the results we want back (conceptually) are at the 'user' level, so we have to collapse the results by distinct user id, etc. (let alone that it blows out the size of our index enormously). So why do we do it? It would make more sense to use multiple fields: userid: 1 userfirstname: fred addresscountry: au addressphone: 1234 addresscountry: nz addressphone: 5678 userid: 2 userfirstname: mary addresscountry: au addressphone: 5678 But imagine the search +addresscountry:au +addressphone:5678.
We'd like this to match ONLY Mary, but of course it matches Fred also because he matches both those terms (just for different addresses). There are two aspects to the approach we've (more or less) got working, but I'd like to run them past the group and see if they're worth trying to get into Lucene proper (if so, I'll create a JIRA issue for them). 1) Use a modified SpanNearQuery. If we assume that country + phone will always be one token, we can rely on the fact that the positions of 'au' and '5678' in Fred's document will be different.
{code}
SpanQuery q1 = new SpanTermQuery(new Term("addresscountry", "au"));
SpanQuery q2 = new SpanTermQuery(new Term("addressphone", "5678"));
SpanQuery snq = new SpanNearQuery(new SpanQuery[]{q1, q2}, 0, false);
{code}
The slop of 0 means that we'll only return those where the two terms are in the same position in their respective fields. This works brilliantly, BUT requires a change to SpanNearQuery's constructor (which checks that all the clauses are against the same field). Are people amenable to perhaps adding another constructor to SNQ which doesn't do the check, or subclassing it to do the same (give it a
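Conceptually, the slop-0 trick matches a "row" across parallel fields: the query hits a document only if some single token position satisfies both constraints at once. In miniature, independent of the Span classes (plain Java over parallel arrays):

```java
// Sketch of the cross-field same-position match. Each address occupies the
// same token position in both multi-valued fields, so requiring one shared
// position is equivalent to requiring one address "row".
class ParallelFieldMatch {
    static boolean sameRowMatch(String[] countries, String[] phones,
                                String country, String phone) {
        for (int pos = 0; pos < countries.length; pos++) {
            if (countries[pos].equals(country) && phones[pos].equals(phone)) {
                return true;    // both constraints hold at one position
            }
        }
        return false;           // constraints only hold at different rows
    }
}
```

Fred (au/1234 and nz/5678) fails the au+5678 query under this rule, while Mary (au/5678) matches, which is exactly the desired many-to-one behavior.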
Hudson build is back to normal: Lucene-trunk #813
See http://hudson.zones.apache.org/hudson/job/Lucene-trunk/813/changes